Name

loadExternalAssembly.pl - Wrapper script to load an external assembly into the SMASH repository

Synopsis

        loadExternalAssembly.pl [options]

Options

--metagenome

(required) name of the metagenome this assembly belongs to. This metagenome must be present in the repository, and reads should already have been loaded.

--assembler

(required) name of the assembly software used to generate this assembly outside of SMASH.

--version

(required) version of the assembly software used to generate this assembly outside of SMASH.

--parameters

(optional) special parameters used to generate this assembly outside of SMASH.

--contig_ace

ACE file containing the assembly information

--contig_fasta

FASTA file containing the assembled contig sequences

--contig_gff

tab-delimited GFF file containing the read-to-contig mapping

--scaffold_agp

AGP file containing the contig-to-scaffold mapping

--scaffold_fasta

FASTA file containing the assembled scaffold sequences

--scaffold_gff

tab-delimited GFF file containing the contig-to-scaffold mapping

--gene_gff

GFF file containing gene coordinates in the standard GFF format

--help

Prints this manual.

Description

loadExternalAssembly.pl is a wrapper script to load an external assembly into the SMASH repository and database.

Using ACE assembly files

loadExternalAssembly.pl is typically run after loadMetaGenome.pl. It accepts the ACE file format for contig assembly information and the AGP format for scaffolding information. For example, to load a Newbler assembly into SMASH:

        loadExternalAssembly.pl --metagenome=MC99.MG1 \
            --assembler=Newbler --version=2.3 \
            --contig_ace=454Contigs.ace \
            --scaffold_agp=454Scaffolds.txt \
            --scaffold_fasta=454Scaffolds.fna
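When many assemblies are loaded from a pipeline, the invocation above can be assembled programmatically. The following is a minimal Python sketch, not part of SMASH: the helper name build_command is hypothetical, and it only checks the three options documented as required before producing an argument list in the --name=value form used in this manual. It does not run the script.

```python
# Hypothetical helper (not part of SMASH): build the argument list for
# loadExternalAssembly.pl and check the options documented as required.
REQUIRED = ("metagenome", "assembler", "version")


def build_command(**options):
    """Return an argv list such as ['loadExternalAssembly.pl', '--metagenome=...']."""
    missing = [name for name in REQUIRED if not options.get(name)]
    if missing:
        raise ValueError(
            "missing required option(s): " + ", ".join("--" + m for m in missing)
        )
    argv = ["loadExternalAssembly.pl"]
    # Keyword arguments keep their insertion order, so the options appear
    # in the order the caller wrote them.
    for name, value in options.items():
        argv.append(f"--{name}={value}")
    return argv


if __name__ == "__main__":
    print(" ".join(build_command(
        metagenome="MC99.MG1",
        assembler="Newbler",
        version="2.3",
        contig_ace="454Contigs.ace",
        scaffold_agp="454Scaffolds.txt",
        scaffold_fasta="454Scaffolds.fna",
    )))
```

The returned list can be passed to subprocess.run if the script is on the PATH.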

Without ACE assembly files

If your assembler does not generate ACE files, you must either create an ACE file yourself or create a GFF file that describes the contig assembly in the following format:

        <contig_id>  <assembler>  read  <start>  <end>  <contig_length>  <strand>  .  \
        read "<read_name>"; [mate_pair "<mate_name>"; [contig "<contig_name>"; insert_size "<insert_size>";]]

E.g.,

        contig1 Newbler read    131     259     2023    +       .       \
        read "MC99.MG1.ABC.y"; mate_pair "MC99.MG1.ABC.z"; contig "contig1"; insert_size 1000;
        contig1 Newbler read    198     363     2023    +       .       \
        read "MC99.MG1.XYZ.y";
        contig1 Newbler read    910     1030    2023    +       .       \
        read "MC99.MG1.ABC.z"; mate_pair "MC99.MG1.ABC.y"; contig "contig1"; insert_size 1000;
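If read placements are available in some other form, lines in this format can be generated with a short script. The Python sketch below is illustrative only: the function name format_read_line and its parameters are hypothetical. It assumes that the trailing backslash in the listings above is only a display continuation, so each record is a single tab-delimited line, and it writes insert_size unquoted as in the example lines (the format template shows it quoted).

```python
# Hypothetical helper (not part of SMASH): format one read-to-contig record.
# Assumptions: one record per line, columns separated by tabs, attributes
# separated by single spaces, insert_size unquoted as in the example lines.
def format_read_line(contig_id, assembler, start, end, contig_length, strand,
                     read_name, mate_name=None, mate_contig=None,
                     insert_size=None):
    columns = [
        contig_id, assembler, "read",
        str(start), str(end), str(contig_length),
        strand, ".",
    ]
    attributes = [f'read "{read_name}";']
    if mate_name is not None:
        attributes.append(f'mate_pair "{mate_name}";')
        # The contig and insert_size attributes are nested inside the
        # optional mate_pair group in the format template, so they are
        # written only when a mate is given and both values are known.
        if mate_contig is not None and insert_size is not None:
            attributes.append(f'contig "{mate_contig}";')
            attributes.append(f"insert_size {insert_size};")
    return "\t".join(columns + [" ".join(attributes)])


if __name__ == "__main__":
    print(format_read_line("contig1", "Newbler", 131, 259, 2023, "+",
                           "MC99.MG1.ABC.y", mate_name="MC99.MG1.ABC.z",
                           mate_contig="contig1", insert_size=1000))
    print(format_read_line("contig1", "Newbler", 198, 363, 2023, "+",
                           "MC99.MG1.XYZ.y"))
```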

Once you have this GFF file, you can load the assembly as follows:

        loadExternalAssembly.pl --metagenome=MC99.MG1 \
            --assembler=Unknown --version=0.0 \
            --contig_fasta=assembly.fasta \
            --contig_gff=contigs.gff \
            --scaffold_gff=scaffolds.gff

If you do not have scaffolding information, you can use:

        loadExternalAssembly.pl --metagenome=MC99.MG1 \
            --assembler=Unknown --version=0.0 \
            --contig_fasta=assembly.fasta \
            --contig_gff=contigs.gff

The GFF file can use the contig names generated by the external assembly program; the script will rename the contigs to SMASH format. However, the contig names in the contig_fasta and contig_gff files MUST match, because the script pairs the two files by contig name.
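Because mismatched names are easy to introduce, it can help to compare the two files before loading. The following Python sketch is not part of SMASH, and its function names are hypothetical. It assumes that the contig name is the first whitespace-delimited token after ">" in each FASTA header and the first column of each GFF line, with one GFF record per line.

```python
# Hypothetical pre-load check (not part of SMASH): compare contig names in
# a contig FASTA file with those in the first column of a contig GFF file.
def fasta_ids(lines):
    """Collect the first token of every FASTA header line."""
    ids = set()
    for line in lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                ids.add(tokens[0])
    return ids


def gff_contig_ids(lines):
    """Collect the first column of every non-empty, non-comment GFF line."""
    ids = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        ids.add(line.split()[0])
    return ids


def compare_contig_names(fasta_lines, gff_lines):
    """Return (names only in the GFF, names only in the FASTA), both sorted."""
    in_fasta = fasta_ids(fasta_lines)
    in_gff = gff_contig_ids(gff_lines)
    return sorted(in_gff - in_fasta), sorted(in_fasta - in_gff)


if __name__ == "__main__":
    import sys
    with open(sys.argv[1]) as fasta, open(sys.argv[2]) as gff:
        gff_only, fasta_only = compare_contig_names(fasta, gff)
    for name in gff_only:
        print("in GFF but not in FASTA:", name)
    for name in fasta_only:
        print("in FASTA but not in GFF:", name)
```

Run as a script, it takes the FASTA file and the GFF file as its two arguments, for example assembly.fasta and contigs.gff. A name reported as present only in the GFF file points to a likely mismatch; whether a contig with no reads in the GFF file is acceptable depends on the assembly.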

Scaffolding information

Some assemblies include scaffold information and some do not. For example, Newbler writes scaffold information in AGP format to a file called 454Scaffolds.txt. If you have scaffold information that is not in AGP format, you must create either an AGP file or a GFF file similar to the one above. If you do not have scaffold information, do not create a scaffold_gff file. When neither --scaffold_gff nor --scaffold_agp is given, a "fake" scaffold is created for each contig.

Gene predictions

For external assemblies, we recommend loading the assembly into SMASH first and then using SMASH to predict genes. This will not work if your gene-calling program is not supported by SMASH. In that case, we recommend running gene prediction on the contig FASTA file from the repository. The location of the contig FASTA file can be obtained as follows:

        showLocations.pl --item=MC99.MG1.AS1

Then load the genes using loadExternalGenePrediction.pl. If that is not possible, you can load the assembly and the gene predictions together by specifying --gene_gff. Make sure that the contig names in the gene_gff file match those in contig_ace or contig_fasta.
