Phylogenetic annotation of samples

Metagenomic samples can be phylogenetically annotated in several different ways. Here we focus on using DNA alignments against a reference sequence database to assign phylogeny to metagenomic reads.

1. Phylogenetic annotation using reference genome mapping

Metagenomic reads can be phylogenetically classified using a BLASTN homology search against a database of reference sequences. We have estimated that 85% sequence identity is a good cut-off for accurately identifying the genus of a read. For example, if a read maps to Bacteroides fragilis with >85% identity, it most likely belongs to the genus Bacteroides, although it is hard to say whether the read actually comes from the species Bacteroides fragilis.

If you specified --enable-refgenome-db to the configure script when you installed SMASH, your installation already contains a reference genome sequence database of 1509 microbial genomes (as of 04.07.2010). We call this dataset reference_genomes.20100704. The complete taxonomic information for each sequence in this database is provided in an SQL database, which is also installed on your system when you specify --enable-refgenome-db. With these two resources (the sequence database and the SQL database), you are ready to perform phylogenetic assignment of metagenomic reads. The first step is to run BLAST of the reads against the reference genome database.
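The 85% identity cut-off described above can be illustrated with a small filter over BLASTN hits. This is only a sketch: it assumes tab-separated hits with the query in column 1, the subject in column 2 and the percent identity in column 3 (as in NCBI BLAST's tabular output), and the reads and subject names are made up. It is not how SMASH itself processes BLAST output.

```shell
# Hypothetical tabular BLASTN hits (first three columns only):
# query, subject, percent identity
hits=$(mktemp)
printf 'read1\tB_fragilis\t97.50\nread2\tB_fragilis\t86.10\nread3\tE_coli\t72.30\n' > "$hits"

# Keep hits with more than 85% identity, i.e. genus-level candidates
genus_hits=$(awk -F'\t' '$3 > 85 { print $1 "\t" $2 }' "$hits")
printf '%s\n' "$genus_hits"

rm -f "$hits"
```

Here read1 and read2 pass the cut-off, while read3 is dropped because its best hit is too divergent to support a genus assignment.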

For the rest of this section, let us use MC20.MG10 as the example metagenome. Please change the value accordingly when you analyze your own samples.

Note:

SMASH supports the use of NCBI BLAST and WU-BLAST for the homology search steps and will process the outputs according to the flavor of BLAST used.

1.1. Easy option: using runBlast.pl

You can use this option if you have:

  1. configured SMASH with --enable-refgenome-db
  2. installed NCBI BLAST using --enable-ncbi-blast, or have WU-BLAST installed on your system

If you satisfy these requirements, then you can run BLAST using the runBlast.pl wrapper script that comes with SMASH. You can choose the flavor of BLASTN you want to run (WU or NCBI). Here's how you can do this:

        runBlast.pl --flavor=NCBI --blast=blastn \
            --database=reference_genomes.20100704 --makedb --query=MC20.MG10 \
            --subjects=50 --evalue=0.1

or

        runBlast.pl --flavor=WU --blast=blastn \
            --database=reference_genomes.20100704 --makedb --query=MC20.MG10 \
            --subjects=50 --evalue=0.1

If you want to speed up the search by using multiple threads, specify that with --cpus:

        runBlast.pl --flavor=NCBI --blast=blastn \
            --database=reference_genomes.20100704 --makedb --query=MC20.MG10 \
            --subjects=50 --evalue=0.1 --cpus=4

Please see runBlast.pl for more information.


Notes:

  1. If your metagenome is huge and contains several thousands or even millions of reads, running BLASTN as mentioned above will take a long time. You are then advised to parallelize BLASTN by splitting your input DNA file and running several BLASTN jobs and then combining them later. Support for this is being added to SMASH right now - please watch "6. Parallelizing BLAST" where we will add further information about running parallelized BLAST.
  2. Running runBlast.pl with the metagenome id in the --query option tells SMASH to find the DNA sequences of the reads from that metagenome, run BLASTN, and place the BLASTN output in a directory reserved for analysis results for that metagenome. If you want to see where runBlast.pl will place the output, run

            showLocations.pl --item=MC20.MG10

     and it will list two locations. The second one, analyses_dir, is where the BLASTN outputs will reside.

  3. Using --makedb creates the BLAST database for reference_genomes.20100704 if it does not exist. Because it does not rebuild a database that already exists, it is safe to always use this option. If the database does not exist and you do not specify --makedb, runBlast.pl will quit with an error message.
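The splitting step mentioned in the first note can be sketched generically as follows. This is not SMASH's own parallelization support, which is not described here. The input file, the chunk size and the chunk file names are all hypothetical, and how the chunks are then passed to BLASTN and recombined is left open.

```shell
# Hypothetical input: a small FASTA file with three reads
workdir=$(mktemp -d)
printf '>r1\nACGT\n>r2\nGGCC\nTTAA\n>r3\nAAAA\n' > "$workdir/reads.fa"

# Write the reads to chunk files with n reads each (chunk_1.fa, chunk_2.fa, ...)
awk -v n=2 -v dir="$workdir" '
    /^>/ {
        # Start a new chunk file every n reads
        if (count % n == 0) {
            if (out) close(out)
            out = sprintf("%s/chunk_%d.fa", dir, count / n + 1)
        }
        count++
    }
    { print > out }
' "$workdir/reads.fa"

ls "$workdir"
```

Each chunk is a valid FASTA file, so the chunks can be searched independently, and concatenating them in order reproduces the original input.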

What next?

Once this is done, your metagenomic reads have been mapped to the reference genome database! If you have performed this step for multiple metagenomes that you would like to compare, you are now ready to proceed to "comparative phylogenetic analysis".
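If you need to repeat the mapping step for several metagenomes, a simple loop is one way to do it. The sketch below reuses only the options shown earlier in this section. MC20.MG11 is a hypothetical second metagenome id, and the loop prints the commands instead of running them.

```shell
# MC20.MG10 is the example used in this section;
# MC20.MG11 is a hypothetical second metagenome id
metagenomes="MC20.MG10 MC20.MG11"

# Print one runBlast.pl command per metagenome (dry run);
# remove "echo" to actually run the searches
commands=$(for mg in $metagenomes; do
    echo runBlast.pl --flavor=NCBI --blast=blastn \
        --database=reference_genomes.20100704 --makedb --query="$mg" \
        --subjects=50 --evalue=0.1
done)
printf '%s\n' "$commands"
```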
