NAME

Smash::Analyses::Assembler::Arachne - Implementation of Arachne assembly software pipeline

SYNOPSIS

DESCRIPTION

This module performs iterative (or superiterative) assemblies using the Arachne assembler.

Default options

Arachne provides two executables: Assemblez and Assemble. For metagenomic assembly, we compared a few assemblies and chose Assemble as the default, since it assembles more reads, although its longest scaffold and contig N50 are shorter. The default parameters used by Smash for Assemble are:

                            ACE=True
                   one_ace_file=True
          aggressive_correction=False
                    min_overlap=10
                 REINDEX_SUPERS=True
         ignore_version_warning=True
                  FORCE_VERSION=True
                ENLARGE_CONTIGS=True
                 IMPROVE_SUPERS=True
                     PATCH_GAPS=True
                    k_for_merge=12
                   check_plates=True
                       maxcliq1=500
                       maxcliq2=500

Assemblez provides better assemblies of single genomes and is recommended by the developers of Arachne. The defaults used by Smash for Assemblez are:

                            ACE=True
                   one_ace_file=True
          aggressive_correction=False
                    min_overlap=10
                       FAST_RUN=True
                       maxcliq1=500
                       maxcliq2=500
            recycle_bad_contigs=True
                    SW_GAP_STEP=True
                       FAST_RUN=False
                 mc_min_overlap=30

FUNCTIONS

prepare()

Prepares the assembly files in the Smash working directory: specifically, the reads, quals and xml files in fasta, qual, traceinfo directories, respectively. It also copies the reads_config.xml file that is required by Arachne.
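As a rough illustration of the layout described above, the following minimal sketch creates the three input directories and copies the configuration file. The helper name prepare_layout, its arguments, and the choice to place reads_config.xml at the top of the working directory are assumptions made for this example, not Smash's actual code.

```perl
use strict;
use warnings;
use File::Copy qw(copy);
use File::Path qw(make_path);
use File::Spec;

# Hypothetical helper: lay out an Arachne-style working directory.
# $workdir and $config_src are illustrative arguments, not Smash's own.
sub prepare_layout {
    my ($workdir, $config_src) = @_;

    # One subdirectory per input type: reads, quals and xml files.
    for my $sub (qw(fasta qual traceinfo)) {
        make_path(File::Spec->catdir($workdir, $sub));
    }

    # Assumption: reads_config.xml sits at the top of the working directory.
    my $target = File::Spec->catfile($workdir, 'reads_config.xml');
    copy($config_src, $target) or die "Cannot copy $config_src: $!";
    return $target;
}
```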

prepare_genome_size()

Arachne reads the genome size from a file. This function rewrites that file at every iteration so that each iteration uses the correct genome size.
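A minimal sketch of this step, assuming the size is written as a single integer to a file inside the directory of the current iteration. The helper name write_genome_size, the file name genome_size and the one-number layout are assumptions for illustration only.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper: write the genome size for one iteration.
# The file name "genome_size" and the one-number layout are assumptions.
sub write_genome_size {
    my ($run_dir, $genome_size) = @_;
    my $file = File::Spec->catfile($run_dir, 'genome_size');

    open(my $fh, '>', $file) or die "Cannot write $file: $!";
    print {$fh} int($genome_size), "\n";   # truncate to a whole number of bases
    close($fh) or die "Cannot close $file: $!";
    return $file;
}
```

Calling it once per iteration, with the size that iteration should assume, overwrites any value left behind by the previous one.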

get_command_line()

Creates the command to run from all the arguments and options.
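The key=value options listed under "Default options" suggest how such a command could be put together. The sketch below is only an illustration: build_command is a hypothetical helper, and the real sub may order or quote its arguments differently.

```perl
use strict;
use warnings;

# Hypothetical helper: join an executable name and key=value options.
sub build_command {
    my ($executable, $options) = @_;

    # Sort the keys so the command is reproducible from run to run.
    my @pairs = map { "$_=$options->{$_}" } sort keys %$options;
    return join(' ', $executable, @pairs);
}
```

For example, build_command('Assemble', { ACE => 'True', min_overlap => 10, k_for_merge => 12 }) returns the string "Assemble ACE=True k_for_merge=12 min_overlap=10".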

superiterate($max_iterations)

Specific to Arachne. If --superiterate was selected in doAssembly.pl, this sub runs the superiteration. Here is the pseudocode of the superiteration:

        init: genome_size, max=$max_iterations, k=12, k_max=20, m=10, m_max=19

        # Phase 1: halve genome_size after every max assemblies, down to 1Mb
        i=1
        while (genome_size > 1Mb)
                option="aggressive_correction=True k_for_merge=<k> min_overlap=<m>"
                assemble()
                if (i == max)
                        genome_size = genome_size / 2
                        m = min(m+1, m_max)
                        k = min(k+1, k_max)
                i = (i % max) + 1

        # Phase 2: restart at 800Kb without supercontig improvement or merging
        genome_size = 800Kb
        i=1
        while (genome_size > 100Kb)
                option="aggressive_correction=True k_for_merge=<k> min_overlap=<m> IMPROVE_SUPERS=False MERGE_SUPERCONTIGS=False"
                assemble()
                if (i == max)
                        genome_size = genome_size / 2
                        m = min(m+1, m_max)
                        k = min(k+1, k_max)
                i = (i % max) + 1
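The schedule in the pseudocode can be sketched in Perl as follows. This is an interpretation, not the module's code: superiterate_schedule is a hypothetical name, $assemble is a hypothetical callback standing in for assemble(), and the counter is written so that it cycles through 1..max, which is assumed to be the intent of the pseudocode.

```perl
use strict;
use warnings;

# Sketch of the parameter schedule in the pseudocode above.
# $assemble is a hypothetical callback standing in for assemble().
sub superiterate_schedule {
    my ($genome_size, $max, $assemble) = @_;
    die "max must be at least 1" unless $max >= 1;
    my ($k, $k_max, $m, $m_max) = (12, 20, 10, 19);

    # Phase 1 runs above 1 Mb; phase 2 restarts at 800 Kb and stops at 100 Kb.
    my @phases = (
        { start => $genome_size, floor => 1_000_000, extra => '' },
        { start => 800_000,      floor => 100_000,
          extra => ' IMPROVE_SUPERS=False MERGE_SUPERCONTIGS=False' },
    );

    for my $phase (@phases) {
        my $size = $phase->{start};
        my $i    = 1;
        while ($size > $phase->{floor}) {
            my $option = "aggressive_correction=True k_for_merge=$k min_overlap=$m";
            $assemble->($size, $option . $phase->{extra});
            if ($i == $max) {
                # After $max assemblies at this size, halve it and raise k and m.
                $size /= 2;
                $m = $m + 1 > $m_max ? $m_max : $m + 1;
                $k = $k + 1 > $k_max ? $k_max : $k + 1;
                $i = 0;
            }
            $i++;
        }
    }
    return;
}
```

Passing a callback that only records its arguments is a cheap way to inspect the sequence of genome sizes and options before launching real assemblies.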

assemble()

Arachne overrides assemble() from Analyses::Assembler because it uses iterative assembly. Iterative assembly forces the assembler to keep assembling as long as it can: after every assembly, it takes the unassembled reads and reassembles them using the same parameters, until it can no longer assemble anything. Each iteration runs as a new assembly with its own data directory, named along the lines of MC20.MG1.AS1_run1, MC20.MG1.AS1_run2 and so on. The last iteration is stored in the object as $this->{ITERATION} so that the next call to assemble() knows where it left off.
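The loop described above can be sketched as follows. The names iterate_assembly and $run_once are hypothetical: $run_once stands in for one Arachne run and is assumed to return a reference to the list of reads it left unassembled. Whether the real module counts the final, unproductive round as an iteration is not stated, so that detail is also an assumption.

```perl
use strict;
use warnings;

# Sketch of iterative assembly. $run_once is a hypothetical callback that
# assembles the given reads in the directory $run_name and returns a
# reference to the list of reads that stayed unassembled.
sub iterate_assembly {
    my ($this, $label, $reads, $run_once) = @_;
    my @pending = @$reads;

    while (@pending) {
        # Continue numbering from wherever the previous call stopped.
        my $iteration = ($this->{ITERATION} // 0) + 1;
        my $run_name  = sprintf('%s_run%d', $label, $iteration);

        my $left = $run_once->($run_name, \@pending);
        $this->{ITERATION} = $iteration;

        # Nothing was assembled in this round, so another round cannot help.
        last if @$left == @pending;
        @pending = @$left;
    }
    return $this->{ITERATION};
}
```

Because the counter lives in $this, a second call with the same object continues with the next run number instead of overwriting earlier directories.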

local_post_assembly()

This is the post-assembly step that summarizes the assembly and makes the contig-to-read maps and the contig fasta files after each iteration inside assemble().

post_assembly()

This is the global post-assembly step at the end of the assembly after all the iterative/superiterative assemblies are done. It generates:

        1. Contig fasta file
        2. Contig-to-read mapping file in GFF format
        3. Scaffold-to-contig mapping file in GFF format
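For orientation, a GFF record is a single line of nine tab-separated columns (sequence, source, type, start, end, score, strand, phase, attributes). The sketch below formats one contig-to-read record in that shape. The helper name gff_line, the default type "read" and the attribute "read=..." are illustrative assumptions and may not match what Smash writes.

```perl
use strict;
use warnings;

# Hypothetical helper: format one contig-to-read record as a 9-column GFF line.
# The source, type and attribute names are illustrative, not Smash's values.
sub gff_line {
    my (%f) = @_;
    return join("\t",
        $f{seqid},              # contig the read maps to
        $f{source} // '.',      # program that produced the record
        $f{type}   // 'read',   # feature type
        $f{start}, $f{end},     # 1-based coordinates on the contig
        '.',                    # score (not used here)
        $f{strand} // '+',
        '.',                    # phase (not used here)
        "read=$f{read}",        # attributes column
    ) . "\n";
}
```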
