Name

loadMetaGenome.pl - Wrapper script to load/unload a metagenome to/from Smash repository and database

Synopsis

        loadMetaGenome.pl [options]

Options

--metagenome (required)

Name of the metagenome to which the data is added.

--sample

If you have multiple samples collected from the same source, but you want to consider them as a single metagenome, you can use the sample parameter to specify that. An example would be samples collected at two timepoints, which you can process together. If you use sample, you can trace a gene or a contig back to the sample it came from. If you pool the samples together and do not use sample, that information is lost.

--library

This is mostly useful for 454 data. If you have constructed multiple libraries and sequenced them independently, you can specify that here. It is advisable to label everything that was processed together in the emulsion-PCR step as coming from the same library, so that artificial replicates from the emulsion-PCR step for each batch can be removed.

--type

Type of metagenome data being added. For raw sequence data, the value must be one of sanger, 454 or external. Use external for preassembled sequences, or for reads without quality values.

--tech

Name of the 454 sequencing technology used to generate the sequences. Required when SFF files containing paired-end 454 sequences are added; ignored otherwise. Must be one of flx or titanium.

--insert_size

Paired-end insert size for this library. Valid only for 454 data; Sanger data takes the insert information from the XML files.

--insert_stdev

Standard deviation of the paired-end insert size for this library. If not specified, 10% of the insert size is used. Valid only for 454 data; Sanger data takes the insert information from the XML files.
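
The default described above is simple arithmetic. The following is a minimal sketch in Python, not the script's own code; the function name effective_insert_stdev is hypothetical, and whether loadMetaGenome.pl rounds the result is not stated in this manual.

```python
def effective_insert_stdev(insert_size, insert_stdev=None):
    """Return the standard deviation to use for a paired-end library.

    Illustrative only: mirrors the documented rule that 10% of the
    insert size is used when --insert_stdev is not given.
    """
    if insert_stdev is not None:
        return insert_stdev
    return insert_size / 10


# A library with a 3000 bp insert and no explicit standard deviation.
print(effective_insert_stdev(3000))
```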

--reads

List of fasta files containing DNA sequence reads, separated by whitespace.

--quals

List of quality files containing quality values for the reads specified through --reads.

--xmls

List of Trace Archive-style XML files that contain ancillary information about the reads. The following fields must be present in the XML file: SEQ_LIB_ID or LIBRARY_ID, PLATE_ID, WELL_ID, TEMPLATE_ID, TRACE_END, TRACE_NAME, INSERT_SIZE, INSERT_STDEV.
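
Before loading, it can help to check that a record carries every required field. The sketch below is only an illustration in Python: the function missing_xml_fields and the two constants are hypothetical names, and it works on a plain collection of field names rather than assuming any particular XML layout.

```python
# Fields that must always be present, as listed for --xmls.
REQUIRED_FIELDS = ("PLATE_ID", "WELL_ID", "TEMPLATE_ID", "TRACE_END",
                   "TRACE_NAME", "INSERT_SIZE", "INSERT_STDEV")
# The library may be named by either of these fields.
LIBRARY_FIELDS = ("SEQ_LIB_ID", "LIBRARY_ID")


def missing_xml_fields(present):
    """Return the required field names that are absent from `present`."""
    present = set(present)
    missing = [name for name in REQUIRED_FIELDS if name not in present]
    if not any(name in present for name in LIBRARY_FIELDS):
        missing.insert(0, "SEQ_LIB_ID or LIBRARY_ID")
    return missing
```

An empty list means the record satisfies the requirement; otherwise the list names what still has to be added to the XML file.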

--sffs

List of Roche 454 flowgram files (SFF files).

--weird_fasta

If set, the fasta headers are (what I call) weird: the actual identifier of the sequence is the last word of the fasta header, not the first word. For example, suppose a fasta file contains an entry like this:

        >i_am_not_the_id but the real id is ME
        agtcgactacagagcatcagcagctagactg

If --weird_fasta is set, ME is the identifier. Otherwise, i_am_not_the_id is the identifier. Most data downloaded from the NCBI Trace Archive has this format.
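
The identifier rule can be sketched in a few lines of Python. This is an illustration of the documented behaviour, not the code used by loadMetaGenome.pl; the function name fasta_identifier is hypothetical.

```python
def fasta_identifier(header, weird_fasta=False):
    """Pick the sequence identifier from a fasta header line.

    With weird_fasta the last word is the identifier,
    otherwise the first word is.
    """
    words = header.lstrip(">").split()
    if not words:
        raise ValueError("empty fasta header")
    return words[-1] if weird_fasta else words[0]


header = ">i_am_not_the_id but the real id is ME"
print(fasta_identifier(header, weird_fasta=True))
print(fasta_identifier(header))
```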

--quality_trim

Specifies whether quality trimming should be performed, and how. Possible values are forge, lucy or xml. forge uses the quality trimming program that is part of the Forge assembler, and lucy uses the lucy quality/vector trimming software. For this to work, the Forge assembler must be installed and available in the Smash software directory, or the lucy software must be installed and available in your path, respectively. The last option, xml, uses the CLIP_LEFT, CLIP_VECTOR_LEFT and CLIP_QUALITY_LEFT fields for the left trim and the CLIP_RIGHT, CLIP_VECTOR_RIGHT and CLIP_QUALITY_RIGHT fields for the right trim. If any of these trimming operations is performed, the six fields mentioned above, if present, are removed, since they are no longer valid after trimming. If no quality trimming is chosen, these fields are left intact.

--cluster

Name of the cluster on which to run the quality trimming procedure (only when using Forge), or name of the cluster on which the Celera assembler has been installed (for handling SFF files).

--unload

Unloads the entries corresponding to this metagenome from the database. The entry in the main metagenome table listing all metagenomes is not modified. This allows you to reload the metagenome without first adding it to the main table again. All the fasta, quality and xml files corresponding to this metagenome are also removed. Use this option if the data are corrupt or did not load properly, or if you are bored and want to remove the data but will eventually load them back again.

--wipeout

Unloads the entries corresponding to this metagenome from the database and removes the entry in the main table listing all metagenomes. All the fasta, quality and xml files corresponding to this metagenome are also removed. Once a metagenome is wiped out, it disappears from Smash. It therefore has to be added to the main metagenome table before it can be loaded into the database again.

--help

Prints this manual.

Interdependence of options

  • --type and --sample are required unless you are unloading with --unload or --wipeout
  • --reads, --quals and --xmls are required for --type=sanger
  • (--reads and --quals) or --sffs is required for --type=454
  • --library is required for --type=454 with --sffs
  • --reads is required for --type=external
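
The rules above can be written down as a small checker. The following Python sketch encodes only the five rules in this list and is not taken from loadMetaGenome.pl; the function name check_options and the dictionary layout (option name without the leading dashes mapped to its value) are assumptions made for illustration.

```python
def check_options(opts):
    """Return a list of violated rules for a dict of option name -> value."""
    def given(name):
        return bool(opts.get(name))

    problems = []
    if given("unload") or given("wipeout"):
        return problems  # the rules below apply only when loading

    for name in ("type", "sample"):
        if not given(name):
            problems.append(f"--{name} is required")

    kind = opts.get("type")
    if kind == "sanger":
        for name in ("reads", "quals", "xmls"):
            if not given(name):
                problems.append(f"--{name} is required for --type=sanger")
    elif kind == "454":
        if not (given("sffs") or (given("reads") and given("quals"))):
            problems.append(
                "(--reads and --quals) or --sffs is required for --type=454")
        if given("sffs") and not given("library"):
            problems.append("--library is required for --type=454 with --sffs")
    elif kind == "external":
        if not given("reads"):
            problems.append("--reads is required for --type=external")
    return problems


# SFF input without --library violates the fourth rule.
print(check_options({"type": "454", "sample": "S1", "sffs": ["run1.sff"]}))
```

Returning every violated rule at once, rather than stopping at the first, lets a caller report all missing options in a single pass.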

Description

loadMetaGenome.pl is a wrapper script that adds the given sequence data to a metagenome in Smash. The script processes the sequence data (see --quality_trim), adds the sequences to the data repository and loads the information into the database.

A typical use of loadMetaGenome.pl is like this:

        loadMetaGenome.pl --metagenome=MC20.MG1 --type=sanger \
              --reads seq1.fasta seq2.fasta --quals seq1.qual seq2.qual \
              --xmls seq1.xml seq2.xml --quality_trim=forge

Multiple runs of loadMetaGenome.pl on the same metagenome add the new sequences to the database and append them to the files in the repository. Therefore, be careful not to run loadMetaGenome.pl on the same data twice by accident. Although this behaviour might seem odd at first, it is quite useful when you have data from multiple sequencing technologies.

For example, a typical metagenome with both Sanger and 454 technology data must be loaded as follows:

        loadMetaGenome.pl --metagenome=MC20.MG1 --type=sanger \
              --reads seq1.fasta seq2.fasta --quals seq1.qual seq2.qual \
              --xmls seq1.xml seq2.xml --quality_trim=forge
        loadMetaGenome.pl --metagenome=MC20.MG1 --type=454 \
              --reads run1.fna run2.fna --quals run1.qual run2.qual

For reasons far beyond the scope of this manual, a sanger+454 hybrid dataset should be added as shown above: first load all the sanger data and then load all the 454 data. This is important for consistent execution of all the scripts that are part of Smash.
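
When the loads are driven from another program, the ordering requirement can be enforced before any command is issued. The Python sketch below is only an illustration; the function name sanger_first and the job dictionaries are hypothetical and not part of Smash.

```python
def sanger_first(jobs):
    """Return the jobs reordered so every sanger load precedes the others.

    sorted() is stable, so the relative order within each group is kept.
    The key is False (sorts first) for sanger and True for anything else.
    """
    return sorted(jobs, key=lambda job: job["type"] != "sanger")


jobs = [
    {"type": "454", "reads": "run1.fna"},
    {"type": "sanger", "reads": "seq1.fasta"},
    {"type": "454", "reads": "run2.fna"},
]
for job in sanger_first(jobs):
    print(job["type"], job["reads"])
```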
