Name

doAssembly.pl - Smash driver script for metagenome sequence assembly

Synopsis

        doAssembly.pl --metagenome=<name> --assembler=<program> \
            --genome_size=<size> [options]

Options

--metagenome (required)

name of the metagenome

--assembler (required)

sequence assembly program (Arachne|Celera|Forge)

--genome_size (required)

estimated genome size for assembly

--cluster

name of cluster (required for Forge)

--single_genome

specify a single genome assembly (default: false)

--superiterative

do superiterative assembly, applicable to Arachne for now (default: false)

--finish

finish post-processing of an assembly that was started independently (default: false)

--assembly

assembly id to finish (used only with --finish)

--pkg_dir

location where the assembler <program> is installed

--extra_options

extra options to pass to the assembler. The assembler object must know how to handle this option. See "Extra Options" for more details.

--help

Prints this manual.

Description

doAssembly.pl is a wrapper script to perform sequence assembly on a given metagenome. This script looks up the data and data type for the given metagenome, sets things up for the given assembler and runs the assembly.

Currently it supports three assemblers: Arachne, Celera and Forge. Where this script finishes and what should be done next depends on which assembler is used.

Arachne

You can start assembling a metagenome using Arachne as follows:

        doAssembly.pl --metagenome=MC20.MG1 --assembler=Arachne \
            --genome_size=20000000 --superiterative

The script finishes the assembly and copies all the necessary files from its temporary workspace to the Smash data repository. (See the User Manual for more information on these locations.) The next step is to run loadAssembly.pl, like so:

        loadAssembly.pl --assembly=MC20.MG1.AS1

Please check Smash::Analyses::Assembler::Arachne for information on Smash's Arachne assembly process.
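
The two commands above can be chained so that loading only happens when the assembly step succeeds. The following is a minimal sketch, not part of Smash: the function name assemble_and_load and the DRY_RUN switch are hypothetical, it assumes that doAssembly.pl exits with a non-zero status on failure (which this manual does not state), and it assumes that you already know the assembly id that will be assigned.

```shell
# Hypothetical wrapper, not part of Smash. Set DRY_RUN=1 to print the
# commands instead of running them.
assemble_and_load() {
    # $1 = metagenome, $2 = estimated genome size, $3 = assembly id to load
    ${DRY_RUN:+echo} doAssembly.pl --metagenome="$1" --assembler=Arachne \
        --genome_size="$2" --superiterative &&
    ${DRY_RUN:+echo} loadAssembly.pl --assembly="$3"
}
```

Calling it with DRY_RUN=1 and the arguments MC20.MG1, 20000000 and MC20.MG1.AS1 prints the two commands from the example above without running them.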

Celera

The Celera assembler is more of an assembly pipeline with several steps, so it is normally run on a Sun Grid Engine cluster. To do so, you must have configured a cluster with Smash::Utils::Cluster. You can then run the assembly like so:

        doAssembly.pl --metagenome=MC20.MG2 --assembler=Celera \
            --genome_size=20000000 --cluster=sigma

Make sure to capture the output somewhere, since it contains useful information. The output looks something like this:

        Retrieving Sanger read information ... done
        Retrieving 454 read information ... done
        Job submitted as 264970
        Job submitted as 264971
        Step 1 of Celera assembly of MC20.MG2.AS3 submitted as job 264970 in cluster sigma.
        Step 2 of Celera assembly of MC20.MG2.AS3 submitted as 264971 with a 'hold'.
        1. Wait till all the Step 1 jobs finish. 
           Usually these contain MC20.MG2.AS3 in the job-name.
        2. Check workspace/Assembler/Celera/MC20.MG2.AS3/MC20.MG2.AS3/9-terminator/MC20.MG2.AS3.qc.
        3. If it looks good, then release the hold on Step 2 using
                qrls 264971
        4. Wait till all the Step 2 jobs finish. 
           These also contain MC20.MG2.AS3 in the job-name.
        5. Check workspace/Assembler/Celera/MC20.MG2.AS3/MC20.MG2.AS3/10-toggledAsm/9-terminator/MC20.MG2.AS3.qc.
        6. If it looks good, then run doAssembly.pl as follows to finish up
                doAssembly.pl --assembler=Celera --version=6.1 --finish --assembly=MC20.MG2.AS3

Smash uses a two-step procedure for Celera assembly of metagenomes. The first step is the normal assembly, and the second step is called "repeat toggling". The Celera assembler can run both steps automatically, but Smash uses slightly different parameters in each step, which the Celera assembler does not support. Smash therefore manages the two steps externally.

In this example, the Celera assembler pipeline runs as job 264970, which in turn submits more jobs for its individual steps. All of these job names contain "MC20.MG2.AS3". Once step 1 is done, the only remaining job with "MC20.MG2.AS3" in its name is the held job 264971. At this stage, check the first qc file, from item (2) of the output, to see if everything looks good. The first few lines should look like:

        [Scaffolds]
        TotalScaffolds=1079
        TotalContigsInScaffolds=1079
        MeanContigsPerScaffold=1.00
        MinContigsPerScaffold=1
        MaxContigsPerScaffold=1

If it looks good, release the hold on the second job by typing:

        qrls 264971

This starts step 2 of the assembly, which submits more jobs to the cluster. When no jobs with "MC20.MG2.AS3" in their names remain in the cluster, check the qc file from item (5). If it looks good, you can finish this assembly by running:

        doAssembly.pl --assembler=Celera --version=6.1 --finish --assembly=MC20.MG2.AS3

You can now load the assembly into Smash using:

        loadAssembly.pl --assembly=MC20.MG2.AS3

Please check Smash::Analyses::Assembler::Celera for information on Smash's Celera assembly process.
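
The waiting and checking described above can be partly scripted. The following is a minimal sketch under stated assumptions, not part of Smash: the function names pending_jobs and qc_value are hypothetical, the job listing is assumed to be available as one "<job-id> <job-name>" pair per line (how you obtain that from your scheduler is left open), and the qc file is assumed to follow the "[Section]" and "Key=Value" layout of the excerpt shown above. Deciding whether the values "look good" remains up to you.

```shell
# Hypothetical helpers; neither is part of Smash.

# Print the ids of jobs whose name contains the assembly id, skipping the
# held job. Reads a listing on stdin, one job per line, in the assumed
# form "<job-id> <job-name>".
pending_jobs() {
    # $1 = assembly id, $2 = id of the held job to ignore (may be empty)
    awk -v asm="$1" -v held="$2" \
        'index($2, asm) > 0 && $1 != held { print $1 }'
}

# Print the value of a key from one section of a qc file, assuming the
# "[Section]" / "Key=Value" layout shown above.
qc_value() {
    # $1 = qc file, $2 = section name, $3 = key
    awk -F= -v section="[$2]" -v key="$3" '
        /^\[/                   { in_section = ($0 == section) }
        in_section && $1 == key { print $2; exit }
    ' "$1"
}
```

An empty result from pending_jobs MC20.MG2.AS3 264971 means that only the held job is left, so step 1 is done. For the qc excerpt shown above, qc_value with the section Scaffolds and the key TotalScaffolds prints 1079.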

Forge

Since Forge uses MPI, this script does not wait until the MPI processes finish. Instead, it starts the MPI processes and quits. You must check that the processes are done and then run a finishing step before loading the assembly into the database, like so:

        doAssembly.pl --metagenome=MC20.MG2 --assembler=Forge \
            --genome_size=20000000
        doAssembly.pl --metagenome=MC20.MG2 --assembler=Forge \
            --genome_size=20000000 --finish
        loadAssembly.pl --assembly=MC20.MG2.AS1

The finishing step must be run with exactly the same options as the original run, plus --finish, for this to work.
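
One way to keep the two invocations identical is to generate both from a single list of options. The following is a minimal sketch; the function name forge_commands is hypothetical and not part of Smash, and it only prints the commands rather than running them.

```shell
# Hypothetical helper: print the initial and the finishing command for one
# set of options, so that the two invocations cannot drift apart.
forge_commands() {
    # "$@" = options of the initial run, without --finish
    printf 'doAssembly.pl %s\n' "$*"
    printf 'doAssembly.pl %s --finish\n' "$*"
}
```

Called with --metagenome=MC20.MG2 --assembler=Forge --genome_size=20000000, it prints the first two commands of the example above.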

Extra Options

Smash has built-in default options for each of the assemblers. You can override these options using the --extra_options parameter. Currently only Arachne and Celera understand this option. Please check Smash::Analyses::Assembler::Arachne for the default parameters for Arachne and Smash::Analyses::Assembler::Celera for the default parameters for Celera.
