Documentation, examples, tutorials and more

<<

doComparativeAnalysis.pl

Name

doComparativeAnalysis.pl - Comparative analysis of metagenomes

Synopsis

        doComparativeAnalysis.pl [options]

Options

--tag (required)

Label for this analysis. Every file created by this run of this step of the analysis will have the given label as its prefix, to help you perform a series of steps as an analysis unit.

--analysis (required)

Type of analysis you would like to perform. Currently available options are refgenome, 16S and functional.

--list (required)

Name of file containing a list of samples, one per line. For refgenome and 16S analysis, this should contain the metagenome ids. For functional analysis, this list should contain the gene prediction ids, because there can be multiple gene predictions per sample.

By default, this script assumes that the samples to be analyzed are in the database and will try to get additional information from the database. If you want to analyze external data using SMASH, then you have to provide this information through the list file (See "Analysis types") for more details).

--mode (required)

Which step of the analysis do you want to perform now? Available steps are raw, summary, distmat, cluster, bootstrap+distmat, bootstrap+cluster, itol.

--replicate

Replicate to process, if this is part of bootstrap analysis. If this is not specified, the whole dataset will be processed.

--level

Functional or phylogenetic level at which to perform the analysis. For refgenome and 16S, these are phylogenetic levels. Please keep in mind that some organisms do not have some taxonomic levels, so asking for that level will assign the closest rank above the requested level available for that organism. For functional analysis, this must be one of og (for orthologous groups), module (for KEGG modules, if the data contains KEGG annotations) or pathway (for KEGG pathways, if the data contains KEGG annotations).

--refdb

Name of the reference database used. Although it is recommended that you use the real name of databases (e.g., eggnog if the db is the eggNOG proteins, NCBI if the db is the NCBI genomes), it is really used in identifying the file to parse.

For example, if file list.txt contains

        MC20.MG1
        MC20.MG2
        MC30.MG5
and you call the script

        doComparativeAnalysis.pl --analysis=refgenome --list=list.txt \
                --mode=raw --refdb=NCBI --tag=April01 ...
then the script expects MC20.MG1.NCBI.blastn, MC20.MG2.NCBI.blastn and MC30.MG5.NCBI.blastn in the present working directory. This also creates an output file called April01.feature_vector.rawcount.refgenome.NCBI (bold words are from the commandline options).

If file list.txt contains

        MC20.MG1.AS2.GP2
        MC20.MG2.AS3.GP1
        MC30.MG5.AS1.GP3
and you call the script

        doComparativeAnalysis.pl --analysis=functional --list=list.txt \
                --mode=raw --refdb=eggnog --tag=April01 ...
then the script expects MC20.MG1.AS2.GP2.eggnogmapping.txt, MC20.MG2.AS3.GP1.eggnogmapping.txt and MC30.MG5.AS1.GP3.eggnogmapping.txt in the present working directory. This also creates an output file called April01.feature_vector.rawcount.functional.eggnog (bold words are from the commandline options).

For historic reasons, this option can also take a comma-separated list of such refdb's to perform the analysis on multiple reference databases. For example, running the command

        doComparativeAnalysis.pl --analysis=functional --list=list.txt \
                --mode=rawcount --refdb=eggnog,kegg --tag=April01 ...
is equivalent to running the same command with --refdb=eggnog and --refdb=kegg sequentially. This was just an example, and unless you just have one CPU to do all your analysis, we do not recommend running the command above. Instead, run two analysis for --refdb=eggnog and --refdb=kegg simultaneously so that you have your results in roughly half as much time. However, this options comes in very handy when you use --mode=itol. For example, running the command

        doComparativeAnalysis.pl --analysis=refgenome --list=list.txt \
                --level=genus --mode=itol --refdb=NCBI,MyDB --tag=April01 ...
will parse the read mappings at the genus level using the NCBI database and MyDB database, and upload them as two datasets to one underlying iTOL tree which contains the genera hit in either database.

Unfortunately, this won't work for functional analysis using kegg and eggnog, since the orthologous groups have different names! The resulting tree will have all the eggNOG OGs and the KEGG KOs, and you end up with a bizarre iTOL tree of mutually exclusive datasets that shouldn't have uploaded together.

--normalize

Normalization procedure to be used. Available options are: data that will normalize the abundances by the datasize, and hits that will normalize the abundances by the positive hits. For example, if a dataset contains 1000 reads, 500 of which are mapped to some genus, and 50 of them are mapped to Lactobacillus, then using hits will result in an abundance of 10% for Lactobacillus (50 out of 500), whereas using data will result in 5% (50 out of 1000). You get the idea!

--filterlow

filters features (genera, species, orthologous groups) that are below the given threshold in ALL samples. If at least one sample is above the threshold, the feature remains in all samples. Otherwise, it is removed from the feature table.

--help

Prints this manual.

Options specific to --mode=itol

These options tell iTOL which features to display in the figure. --top lets you display all the top N features in each sample. Compare this with --filterlow that lets you filter all the features whose maximum value in any sample is below the threshold.

--top

Applies only to itol mode. Use only the N most abundant features, be it genera, species or orthologous groups. This top N filter is applied to each sample independently, so the final list could be more than N long when multiple samples have different sets of top N features.

Options specific to --analysis=refgenome

--identity

Percent identity cutoff to be used for this analysis. Only applies to refgenome analysis where it is used as the threshold above which a read is assigned to the organism of the BLAST hit at the taxonomic level specified by level. Typical values we use are 65% for phylum, 85% for genus and 95% for species.

--local

Applies only to refgenome analysis. SMASH uses a database that maps each sequence in the reference genome database to its source organism identified by its NCBI taxonomy id. These sequences are also annotated for protein-coding genes, non-coding genes, etc. If you do not have this database installed at your site, e.g. because you are calling this script from SmashCell, you probably have downloaded the information from SMASH website. Then this option is just for you.

--genomesize | --nogenomesize

By default the script performs genome size normalization for the refgenome analysis. If you do not need this, specify --nogenomesize to turn this off.

Options specific to --analysis=16S

--conf_threshold

Minimum bootstrap confidence threshold reported by RDP classifier. (default: 0.6)

--length_threshold

Minimum length of a read to be considered mapped to a rank. (default: 250)

Description

doComparativeAnalysis.pl is a multifunctional script that enables comparative metagenomic analysis. Before I explain the functionalities of the script, I must explain its dependencies on external information and/or resources.

External dependencies


iTOL batch access

Many functionalities of this script generate cluster or trees and these can be uploaded to the iTOL webserver for viewing as well storage. SMASH uses the iTOL batch access website through scripts kindly provided by iTOL. You must have an user account at the iTOL website to use this feature. When you upload a tree through this script, the tree will be uploaded to a project called itol_uploader under your user account. Please create this project under your user account, otherwise the trees might be lost in digital space. Additionally, for iTOL to allow batch access to your account, you must enable it. Please see http://itol.embl.de/help/batch_help.shtml for details (under Enabling batch upload). Once you enable it, you receive an upload ID, which you should enter it in the SMASH config file under the iTOL section as shown below:

        [Smash]
        data_dir          : /some/embl/location/smash/data_repos
        software_dir      : software
        workspace_dir     : workspace
        collection_prefix : MC
        metagenome_prefix : MG
        assembly_prefix   : AS
        genepred_prefix   : GP

        [SmashDB]
        database_engine : mysql
        database_name   : SmashDB
        user : smashdb
        pass : smashdb
        host : server
        port : 9999

        [RefOrganismDB]
        data_dir        : /some/embl/location/smash/data_repos/reference_organisms
        database_engine : sqlite3
        database_name   : /some/embl/location/smash/db/RefOrganismDB.sqlite
        user :
        pass : 
        host : 
        port : 

        [iTOL]
        uploadID : xyzw

NCBI Taxonomy information

Phylogenetic analysis performed by this script uses the taxonomy information from NCBI. It uses a local repository of the NCBI taxonomy dump (normally available at ftp://ftp.ncbi.nih.gov/pub/taxonomy) to get detailed information on each organism. You must specify the remote and local repository location in the configuration file, in the Taxonomy section as shown below:

        [Taxonomy]
        remote_repository : ftp://ftp.ncbi.nih.gov/pub/taxonomy
        local_repository  : /some/embl/location/smash/data_repos/external/taxonomy

When the script needs to parse this taxonomy information, it checks whether local_repository contains the required files. If they do not exist, it downloads them from remote_repository. If you would like to update the files using the latest available from NCBI, you could replace the files in local_repository with the newer ones yourself, or update it automatically by running the script updateNCBITaxonomyFiles.pl without any arguments:

        updateNCBITaxonomyFiles.pl

PHYLIP package

For clustering the samples, we use the neighbor method from the phylip package. For generating consensus clusters from bootstrap analysis, we use the consense program from phylip as well. Please make sure that neighbor and consense are available in your path. You can get phylip package from http://evolution.genetics.washington.edu/phylip.html.


Subroutines

get_analyzer($type)

returns a Smash>Analyses::Comparative::$type> object, if it exists.

init_reporter

inits the reporter for this instance.

finish_reporter

closes the reporter after finalizing the report.

Dependencies

doComparativeAnalysis.pl requires the following PERL modules:

        Config, Pod::Usage 

The iTOL batch access scripts require the following PERL modules:

        Getopt::Regex, Config::File::Simple, HTTP::Request::Common and LWP::UserAgent.

<<