Documentation, examples, tutorials and more

<<

Functional annotation of proteins

Predicted genes are assigned to orthologous groups using BLASTP based sequence homology to known proteins. We use the eggNOG and KEGG orthologous groups to assign functions to predicted genes. First of all, genes need to be blastp'ed against the set of eggNOG/KEGG proteins. Then they can be mapped to orthologous groups using doOrthologMapping.pl or doKeggMapping.pl, wrapper scripts that maps predicted proteins in a metagenome to eggNOG/KEGG orthologous groups.

For the rest of Section 4, we will use MC20.MG10.AS2.GP1 as the example protein set. Please change the value accordingly when you analyze your own samples.

Note:

SMASH supports the use of NCBI BLAST and WU-BLAST for the homology search steps and will process the outputs according to the flavor of BLAST used.

1. Functional annotation using eggNOG database

eggNOG is a resource of orthologous groups developed at EMBL. SMASH derives a lot of functional annotation of metagenomes using the eggNOG database. Here's how you can annotate metagenomic proteins using the eggNOG database.

1.1. Easy option: using runBlast.pl and doOrthologMapping.pl

You can use this option if you have:

  1. configured SMASH with --enable-eggnog-db
  2. installed NCBI BLAST using --enable-ncbi-blast (or) WU-BLAST is installed in your system

If you satisfy these requirements, then you can run BLAST using the runBlast.pl wrapper script that comes with SMASH. You can choose the flavor of BLASTP you want to run (WU or NCBI). Here's how you can do this:

        runBlast.pl --flavor=NCBI --blast=blastp \
            --database=eggnog2 --makedb --query=MC20.MG10.AS2.GP1 \
            --subjects=1000 --evalue=0.1

If you want to make it faster by using multiple threads, you can specify that using --cpus:

        runBlast.pl --flavor=NCBI --blast=blastp \
            --database=eggnog2 --makedb --query=MC20.MG10.AS2.GP1 \
            --subjects=1000 --evalue=0.1 --cpus=4

Please see runBlast.pl for more information.

runBlast.pl places the output file where doOrthologMapping.pl needs it. So to map the proteins to orthologous groups, all you have to do is to run:

        doOrthologMapping.pl --flavor=NCBI --genepred=MC20.MG10.AS2.GP1 --eggnog=eggnog2

Please see doOrthologMapping.pl for more information.


Notes:

  1. If your metagenome is huge and contains several thousands or even millions of proteins, running BLASTP as mentioned above will take a long time. You are then advised to parallelize BLASTP by splitting your input protein file and running several BLASTP jobs and then combining them later. Support for this is being added to SMASH right now - please watch "6. Parallelizing BLAST" where we will add further information about running parallelized BLAST.
  2. Running runBlast.pl with the gene prediction id in the --query option tells SMASH to find the protein sequences from that gene prediction set, run BLASTP and place the output of BLASTP in the directory where all files related to that gene prediction are located. This means that you can immediately run doOrthologMapping.pl for this gene prediction. If you want to see where runBlast.pl will place the output, run
  3.         showLocations.pl --item=MC20.MG10.AS2.GP1

    and genepred_dir is where the BLASTP outputs will reside.

  4. Using --makedb creates the blast database for eggnog2 if it does not exist. Since it does not make the database if it exists already, it is safe to use that option always. If the database does not exist, and you do not specify --makedb, runBatch.pl will quit with an error message.

1.2. Advanced option: do it yourself

If you want to do the functional annotation step yourself, or if you want to learn what is going on behind the scenes when you choose the easy option, you can read on to find out how to perform each step in the orthologous group annotation procedure.

1.2.1. BLASTP search against eggNOG proteins

If you want to run the BLASTP step yourself for any reason, including the need to parallelize, here are the steps involved:

1. Create eggNOG protein database

Generating the BLAST alignments of predicted proteins against eggNOG protein database requires that you have the eggNOG proteins. If you configured SMASH with --enable-eggnog-db, you will find a flat file containing all eggNOG proteins in your repository under the data directory. To see the location, run:

        perl showLocations.pl --external
If you do not have <eggnog2.protein.fa> under that directory, you must download the eggNOG proteins yourself from http://eggnog.embl.de. Check the "Downloads" section for the flat file containing all eggNOG proteins. As of writing this manual, the latest version of eggNOG is 2.0.

You should create a BLASTP database using this file with xdformat for WU-BLAST

        xdformat -p -o eggnog2 eggnog2.protein.fa
or formatdb for NCBI BLAST.

        formatdb -p T -n eggnog2 -i eggnog2.protein.fa
This will be your BLASTP database of eggNOG proteins.

2. Locate the predicted protein sequences

To annotate the proteins from predicted proteins in MC20.MG10.AS2.GP1, you should align them to the eggNOG protein database mentioned above. To find the file containing the protein sequences, run

        perl showLocations.pl --item=MC20.MG10.AS2.GP1
and look for a file named MC20.MG10.AS2.GP1.protein.fa. This is your query file in the BLASTP run.

3. Run BLASTP

WU-BLAST version:

        blastp eggnog2 MC20.MG10.AS2.GP1.protein.fa \
            span1 B=1000 mformat=2 E=0.1 \
            hitdist=40 T=11 kap s2=41 gaps2=62 x=16 gapx=38 \
            q=12 r=1 gapL=.27 gapK=.047 gapH=.23 \
            > MC20.MG10.AS2.GP1.eggnog2.blastp
NCBI-BLAST version:

        blastall -p blastp \
            -d eggnog2 -i MC20.MG10.AS2.GP1.protein.fa \
            -F F -K 1 -e 0.1 -b 1000 -m 8 \
            > MC20.MG10.AS2.GP1.eggnog2.blastp
Note:

The actual command will vary depending on the BLAST flavor and your system settings. What we provide must be used as a template and must be adapted to your local environment. Please do not send us error reports that BLAST did not work properly if you just copied and pasted the commands mentioned below.

Once you have finished the BLASTP search as mentioned above, please move the BLASTP output to the directory corresponding to the gene prediction MC20.MG10.AS2.GP1. To see the location of this directory for MC20.MG10.AS2.GP1 (or for any other gene prediction), run:

        perl showLocations.pl --item=MC20.MG10.AS2.GP1

You should place MC20.MG10.AS2.GP1.eggnog2.blastp in that directory.

1.2.2. Orthologous group annotation of predicted proteins

For the functional annotation of proteins, you need the following files (assuming eggNOG version 2; replace 2 with the correct version):

1. BLAST results

Note: Currently SMASH supports tabular BLAST outputs from WU-BLAST (run using "-mformat=2") and NCBI BLAST (run using "-m 8").

A BLAST output file containing results from BLASTing the predicted proteins against eggNOG proteins is expected to be in the gene prediction directory. For the location of this directory for a given gene prediction (e.g. MC20.MG10.AS2.GP1), run:

        perl showLocations.pl --item=MC20.MG10.AS2.GP1
You should find MC20.MG10.AS2.GP1.eggnog2.blastp in that directory.

2. eggNOG orthologous group information and protein length files

These files are automatically downloaded from SMASH website when you install SMASH. They should be installed in your repository under the data directory. To see where they are, run:

        perl showLocations.pl --external
You should find the following files there: eggnog2_final_orthgroups.txt, eggnog2_protein_lengths.txt

Once you have all these files, you can annotate the predicted proteins by running

        doOrthologMapping.pl --flavor=NCBI --genepred=MC20.MG10.AS2.GP1 --eggnog=eggnog2

This will generate a file called MC20.MG10.AS2.GP1.eggnogmapping.txt in the gene prediction directory. This file contains the following tab-delimited fields in this order:

        Name of predicted gene
        Name of eggNOG protein
        Orthologous group
        Start of alignment in eggNOG protein
        End of alignment in eggNOG protein
        Alignment score
        eggNOG protein length

Here is an example:

        MC37.MG3.AS1.GP1.I1.C1017.G1    394221.Mmar10_0729      COG0617 60      419     515.0   419
        MC37.MG3.AS1.GP1.I1.C1017.G2    394221.Mmar10_0732      NOG78154        147     236     110.0   236
        MC37.MG3.AS1.GP1.I1.C1050.G1    269799.Gmet_1677        COG0578 1       243     322.0   516

Please see doOrthologMapping.pl for more information.

2. Functional annotation using KEGG database

KEGG is a pathway/functional resource maintained at the University of Kyoto. KEGG organizes genes into KEGG orthologous groups (KOs), functional modules, and functional pathways, which are very useful to get higher level functional content of environmental samples.

2.1. Easy option: using runBlast.pl and doKegggMapping.pl

You can use this option if you have:

  1. configured SMASH with --enable-kegg-db
  2. installed NCBI BLAST using --enable-ncbi-blast (or) WU-BLAST is installed in your system

If you satisfy these requirements, then you can run BLAST using the runBlast.pl wrapper script that comes with SMASH. You can choose the flavor of BLASTP you want to run (WU or NCBI). Here's how you can do this:

        runBlast.pl --flavor=NCBI --blast=blastp \
            --database=kegg57 --makedb --query=MC20.MG10.AS2.GP1 \
            --subjects=1000 --evalue=0.1

If you want to make it faster by using multiple threads, you can specify that using --cpus:

        runBlast.pl --flavor=NCBI --blast=blastp \
            --database=kegg57 --makedb --query=MC20.MG10.AS2.GP1 \
            --subjects=1000 --evalue=0.1 --cpus=4

Please see runBlast.pl for more information.

runBlast.pl places the output file where doKeggMapping.pl needs it. So to map the proteins to orthologous groups, all you have to do is to run:

        doKeggMapping.pl --flavor=NCBI --genepred=MC20.MG10.AS2.GP1 --kegg=kegg57

Notes:

  1. If your metagenome is huge and contains several thousands or even millions of proteins, running BLASTP as mentioned above will take a long time. You are then advised to parallelize BLASTP by splitting your input protein file and running several BLASTP jobs and then combining them later. Support for this is being added to SMASH right now - please watch "6. Parallelizing BLAST" where we will add further information about running parallelized BLAST.
  2. Running runBlast.pl with the gene prediction id in the --query option tells SMASH to find the protein sequences from that gene prediction set, run BLASTP and place the output of BLASTP in the directory where all files related to that gene prediction are located. This means that you can immediately run doKegggMapping.pl for this gene prediction. If you want to see where runBlast.pl will place the output, run
  3.         showLocations.pl --item=MC20.MG10.AS2.GP1

    and genepred_dir is where the BLASTP outputs will reside.

  4. Using --makedb creates the blast database for kegg57 if it does not exist. Since it does not make the database if it exists already, it is safe to use that option always. If the database does not exist, and you do not specify --makedb, runBatch.pl will quit with an error message.

2.2. Advanced option: do it yourself

If you want to do the functional annotation step yourself, or if you want to learn what is going on behind the scenes when you choose the easy option, you can read on to find out how to perform each step in the orthologous group annotation procedure.

2.2.1. BLASTP search against KEGG proteins

If you want to run the BLASTP step yourself for any reason, including the need to parallelize, here are the steps involved:

1. Create KEGG protein database

Generating the BLAST alignments of predicted proteins against KEGG protein database requires that you have the KEGG proteins. If you configured SMASH with --enable-eggnog-db, you will find a flat file containing all KEGG proteins in your repository under the data directory. To see the location, run:

        perl showLocations.pl --external
If you do not have <kegg57.protein.fa> under that directory, you must download the KEGG proteins yourself from http://www.genome.jp/kegg/. Check the "Downloads" section for the flat file containing all KEGG proteins. As of writing this manual, the latest version of KEGG that SMASH supports is 57.0.

You should create a BLASTP database using this file with xdformat for WU-BLAST

        xdformat -p -o kegg57 kegg57.protein.fa
or formatdb for NCBI BLAST.

        formatdb -p T -n kegg57 -i kegg57.protein.fa
This will be your BLASTP database of KEGG proteins.

2. Locate the predicted protein sequences

To annotate the proteins from predicted proteins in MC20.MG10.AS2.GP1, you should align them to the KEGG protein database mentioned above. To find the file containing the protein sequences, run

        perl showLocations.pl --item=MC20.MG10.AS2.GP1
and look for a file named MC20.MG10.AS2.GP1.protein.fa. This is your query file in the BLASTP run.

3. Run BLASTP

WU-BLAST version:

        blastp kegg57 MC20.MG10.AS2.GP1.protein.fa \
            span1 B=1000 mformat=2 E=0.1 \
            hitdist=40 T=11 kap s2=41 gaps2=62 x=16 gapx=38 \
            q=12 r=1 gapL=.27 gapK=.047 gapH=.23 \
            > MC20.MG10.AS2.GP1.kegg57.blastp
NCBI-BLAST version:

        blastall -p blastp \
            -d kegg57 -i MC20.MG10.AS2.GP1.protein.fa \
            -F F -K 1 -e 0.1 -b 1000 -m 8 \
            > MC20.MG10.AS2.GP1.kegg57.blastp
Note:

The actual command will vary depending on the BLAST flavor and your system settings. What we provide must be used as a template and must be adapted to your local environment. Please do not send us error reports that BLAST did not work properly if you just copied and pasted the commands mentioned below.

Once you have finished the BLASTP search as mentioned above, please move the BLASTP output to the directory corresponding to the gene prediction MC20.MG10.AS2.GP1. To see the location of this directory for MC20.MG10.AS2.GP1 (or for any other gene prediction), run:

        perl showLocations.pl --item=MC20.MG10.AS2.GP1

You should place MC20.MG10.AS2.GP1.kegg57.blastp in that directory.

2.2.2. Orthologous group annotation of predicted proteins

For the functional annotation of proteins, you need the following files (assuming KEGG version 2; replace 2 with the correct version):

1. BLAST results

Note: Currently SMASH supports tabular BLAST outputs from WU-BLAST (run using "-mformat=2") and NCBI BLAST (run using "-m 8").

A BLAST output file containing results from BLASTing the predicted proteins against KEGG proteins is expected to be in the gene prediction directory. For the location of this directory for a given gene prediction (e.g. MC20.MG10.AS2.GP1), run:

        perl showLocations.pl --item=MC20.MG10.AS2.GP1
You should find MC20.MG10.AS2.GP1.kegg57.blastp in that directory.

2. KEGG protein, orthologous group, module and pathway information

This information is contained in the RefProteinDB database which is configured when you configure with --enable-kegg-db. This database is automatically downloaded from SMASH website when you install SMASH. They should be installed in your repository under the data directory.

Once you have all these files, you can annotate the predicted proteins by running

        doKegggMapping.pl --flavor=NCBI --genepred=MC20.MG10.AS2.GP1 --kegg=kegg57

This will generate a file called MC20.MG10.AS2.GP1.keggmapping.txt in the gene prediction directory. This file contains the following tab-delimited fields in this order:

        Name of predicted gene
        Name of KEGG protein
        Start of alignment in predicted protein
        End of alignment in predicted protein
        Start of alignment in KEGG protein
        End of alignment in KEGG protein

Here is an example:

        MC37.MG3.AS1.GP1.I1.C1017.G1    eco:Mmar10_0729      30      390        60      419
        MC37.MG3.AS1.GP1.I1.C1017.G2    eco:Mmar10_0732      147     236     31      124
        MC37.MG3.AS1.GP1.I1.C1050.G1    eco:Gmet_1677        1       243     1       238

What next?

Once this is done, your proteins have been mapped to the eggNOG orthologous groups or KEGG proteins! If you have performed this step for multiple metagenomes that you would like to compare, you are now ready to proceed to "comparative functional analysis".

<<