1.1. Easy option: using runBlast.pl and doOrthologMapping.pl
1.2. Advanced option: do it yourself
1.2.1. BLASTP search against eggNOG proteins
1.2.2. Orthologous group annotation of predicted proteins

2. Functional annotation using KEGG database

2.1. Easy option: using runBlast.pl and doKegggMapping.pl
2.2. Advanced option: do it yourself
2.2.1. BLASTP search against KEGG proteins
2.2.2. Orthologous group annotation of predicted proteins

What next?

Functional annotation of proteins

Predicted genes are assigned to orthologous groups using BLASTP based sequence homology to known proteins. We use the eggNOG and KEGG orthologous groups to assign functions to predicted genes. First of all, genes need to be blastp'ed against the set of eggNOG/KEGG proteins. Then they can be mapped to orthologous groups using doOrthologMapping.pl or doKeggMapping.pl, wrapper scripts that maps predicted proteins in a metagenome to eggNOG/KEGG orthologous groups.

For the rest of Section 4, we will use MC20.MG10.AS2.GP1 as the example protein set. Please change the value accordingly when you analyze your own samples.

Note:: SMASH supports the use of NCBI BLAST and WU-BLAST for the homology search steps and will process the outputs according to the flavor of BLAST used.

1. Functional annotation using eggNOG database

eggNOG is a resource of orthologous groups developed at EMBL. SMASH derives a lot of functional annotation of metagenomes using the eggNOG database. Here's how you can annotate metagenomic proteins using the eggNOG database.

1.1. Easy option: using `runBlast.pl` and `doOrthologMapping.pl`

You can use this option if you have:

configured SMASH with --enable-eggnog-db

installed NCBI BLAST using --enable-ncbi-blast (or) WU-BLAST is installed in your system

If you satisfy these requirements, then you can run BLAST using the runBlast.pl wrapper script that comes with SMASH. You can choose the flavor of BLASTP you want to run (WU or NCBI). Here's how you can do this:

        runBlast.pl --flavor=NCBI --blast=blastp \
            --database=eggnog2 --makedb --query=MC20.MG10.AS2.GP1 \
            --subjects=1000 --evalue=0.1

If you want to make it faster by using multiple threads, you can specify that using --cpus:

        runBlast.pl --flavor=NCBI --blast=blastp \
            --database=eggnog2 --makedb --query=MC20.MG10.AS2.GP1 \
            --subjects=1000 --evalue=0.1 --cpus=4

Please see runBlast.pl for more information.

runBlast.pl places the output file where doOrthologMapping.pl needs it. So to map the proteins to orthologous groups, all you have to do is to run:

        doOrthologMapping.pl --flavor=NCBI --genepred=MC20.MG10.AS2.GP1 --eggnog=eggnog2

Please see doOrthologMapping.pl for more information.

Notes:

1.2. Advanced option: do it yourself

If you want to do the functional annotation step yourself, or if you want to learn what is going on behind the scenes when you choose the easy option, you can read on to find out how to perform each step in the orthologous group annotation procedure.

1.2.1. BLASTP search against eggNOG proteins

If you want to run the BLASTP step yourself for any reason, including the need to parallelize, here are the steps involved:

1. Create eggNOG protein database

Generating the BLAST alignments of predicted proteins against eggNOG protein database requires that you have the eggNOG proteins. If you configured SMASH with --enable-eggnog-db, you will find a flat file containing all eggNOG proteins in your repository under the data directory. To see the location, run:
        perl showLocations.pl --external
If you do not have <eggnog2.protein.fa> under that directory, you must download the eggNOG proteins yourself from http://eggnog.embl.de. Check the "Downloads" section for the flat file containing all eggNOG proteins. As of writing this manual, the latest version of eggNOG is 2.0.

You should create a BLASTP database using this file with xdformat for WU-BLAST
        xdformat -p -o eggnog2 eggnog2.protein.fa
or formatdb for NCBI BLAST.
        formatdb -p T -n eggnog2 -i eggnog2.protein.fa
This will be your BLASTP database of eggNOG proteins.

2. Locate the predicted protein sequences

To annotate the proteins from predicted proteins in MC20.MG10.AS2.GP1, you should align them to the eggNOG protein database mentioned above. To find the file containing the protein sequences, run
        perl showLocations.pl --item=MC20.MG10.AS2.GP1
and look for a file named MC20.MG10.AS2.GP1.protein.fa. This is your query file in the BLASTP run.

3. Run BLASTP

WU-BLAST version:
        blastp eggnog2 MC20.MG10.AS2.GP1.protein.fa \
            span1 B=1000 mformat=2 E=0.1 \
            hitdist=40 T=11 kap s2=41 gaps2=62 x=16 gapx=38 \
            q=12 r=1 gapL=.27 gapK=.047 gapH=.23 \
            > MC20.MG10.AS2.GP1.eggnog2.blastp
NCBI-BLAST version:
        blastall -p blastp \
            -d eggnog2 -i MC20.MG10.AS2.GP1.protein.fa \
            -F F -K 1 -e 0.1 -b 1000 -m 8 \
            > MC20.MG10.AS2.GP1.eggnog2.blastp
Note:

The actual command will vary depending on the BLAST flavor and your system settings. What we provide must be used as a template and must be adapted to your local environment. Please do not send us error reports that BLAST did not work properly if you just copied and pasted the commands mentioned below.

Once you have finished the BLASTP search as mentioned above, please move the BLASTP output to the directory corresponding to the gene prediction MC20.MG10.AS2.GP1. To see the location of this directory for MC20.MG10.AS2.GP1 (or for any other gene prediction), run:

        perl showLocations.pl --item=MC20.MG10.AS2.GP1

You should place MC20.MG10.AS2.GP1.eggnog2.blastp in that directory.

1.2.2. Orthologous group annotation of predicted proteins

For the functional annotation of proteins, you need the following files (assuming eggNOG version 2; replace 2 with the correct version):

1. BLAST results: Note: Currently SMASH supports tabular BLAST outputs from WU-BLAST (run using "-mformat=2") and NCBI BLAST (run using "-m 8").; A BLAST output file containing results from BLASTing the predicted proteins against eggNOG proteins is expected to be in the gene prediction directory. For the location of this directory for a given gene prediction (e.g. MC20.MG10.AS2.GP1), run:; You should find MC20.MG10.AS2.GP1.eggnog2.blastp in that directory.
2. eggNOG orthologous group information and protein length files: These files are automatically downloaded from SMASH website when you install SMASH. They should be installed in your repository under the data directory. To see where they are, run:; You should find the following files there: eggnog2_final_orthgroups.txt, eggnog2_protein_lengths.txt

Once you have all these files, you can annotate the predicted proteins by running

        doOrthologMapping.pl --flavor=NCBI --genepred=MC20.MG10.AS2.GP1 --eggnog=eggnog2

This will generate a file called MC20.MG10.AS2.GP1.eggnogmapping.txt in the gene prediction directory. This file contains the following tab-delimited fields in this order:

        Name of predicted gene
        Name of eggNOG protein
        Orthologous group
        Start of alignment in eggNOG protein
        End of alignment in eggNOG protein
        Alignment score
        eggNOG protein length

Here is an example:

        MC37.MG3.AS1.GP1.I1.C1017.G1    394221.Mmar10_0729      COG0617 60      419     515.0   419
        MC37.MG3.AS1.GP1.I1.C1017.G2    394221.Mmar10_0732      NOG78154        147     236     110.0   236
        MC37.MG3.AS1.GP1.I1.C1050.G1    269799.Gmet_1677        COG0578 1       243     322.0   516

Please see doOrthologMapping.pl for more information.

2. Functional annotation using KEGG database

KEGG is a pathway/functional resource maintained at the University of Kyoto. KEGG organizes genes into KEGG orthologous groups (KOs), functional modules, and functional pathways, which are very useful to get higher level functional content of environmental samples.

2.1. Easy option: using `runBlast.pl` and `doKegggMapping.pl`

You can use this option if you have:

configured SMASH with --enable-kegg-db

installed NCBI BLAST using --enable-ncbi-blast (or) WU-BLAST is installed in your system

        runBlast.pl --flavor=NCBI --blast=blastp \
            --database=kegg57 --makedb --query=MC20.MG10.AS2.GP1 \
            --subjects=1000 --evalue=0.1

If you want to make it faster by using multiple threads, you can specify that using --cpus:

        runBlast.pl --flavor=NCBI --blast=blastp \
            --database=kegg57 --makedb --query=MC20.MG10.AS2.GP1 \
            --subjects=1000 --evalue=0.1 --cpus=4

Please see runBlast.pl for more information.

runBlast.pl places the output file where doKeggMapping.pl needs it. So to map the proteins to orthologous groups, all you have to do is to run:

        doKeggMapping.pl --flavor=NCBI --genepred=MC20.MG10.AS2.GP1 --kegg=kegg57

Notes:

2.2. Advanced option: do it yourself

2.2.1. BLASTP search against KEGG proteins

If you want to run the BLASTP step yourself for any reason, including the need to parallelize, here are the steps involved:

1. Create KEGG protein database

Generating the BLAST alignments of predicted proteins against KEGG protein database requires that you have the KEGG proteins. If you configured SMASH with --enable-eggnog-db, you will find a flat file containing all KEGG proteins in your repository under the data directory. To see the location, run:
        perl showLocations.pl --external
If you do not have <kegg57.protein.fa> under that directory, you must download the KEGG proteins yourself from http://www.genome.jp/kegg/. Check the "Downloads" section for the flat file containing all KEGG proteins. As of writing this manual, the latest version of KEGG that SMASH supports is 57.0.

You should create a BLASTP database using this file with xdformat for WU-BLAST
        xdformat -p -o kegg57 kegg57.protein.fa
or formatdb for NCBI BLAST.
        formatdb -p T -n kegg57 -i kegg57.protein.fa
This will be your BLASTP database of KEGG proteins.

2. Locate the predicted protein sequences

To annotate the proteins from predicted proteins in MC20.MG10.AS2.GP1, you should align them to the KEGG protein database mentioned above. To find the file containing the protein sequences, run
        perl showLocations.pl --item=MC20.MG10.AS2.GP1
and look for a file named MC20.MG10.AS2.GP1.protein.fa. This is your query file in the BLASTP run.

3. Run BLASTP

WU-BLAST version:
        blastp kegg57 MC20.MG10.AS2.GP1.protein.fa \
            span1 B=1000 mformat=2 E=0.1 \
            hitdist=40 T=11 kap s2=41 gaps2=62 x=16 gapx=38 \
            q=12 r=1 gapL=.27 gapK=.047 gapH=.23 \
            > MC20.MG10.AS2.GP1.kegg57.blastp
NCBI-BLAST version:
        blastall -p blastp \
            -d kegg57 -i MC20.MG10.AS2.GP1.protein.fa \
            -F F -K 1 -e 0.1 -b 1000 -m 8 \
            > MC20.MG10.AS2.GP1.kegg57.blastp
Note:

The actual command will vary depending on the BLAST flavor and your system settings. What we provide must be used as a template and must be adapted to your local environment. Please do not send us error reports that BLAST did not work properly if you just copied and pasted the commands mentioned below.

        perl showLocations.pl --item=MC20.MG10.AS2.GP1

You should place MC20.MG10.AS2.GP1.kegg57.blastp in that directory.

2.2.2. Orthologous group annotation of predicted proteins

For the functional annotation of proteins, you need the following files (assuming KEGG version 2; replace 2 with the correct version):

1. BLAST results: Note: Currently SMASH supports tabular BLAST outputs from WU-BLAST (run using "-mformat=2") and NCBI BLAST (run using "-m 8").; A BLAST output file containing results from BLASTing the predicted proteins against KEGG proteins is expected to be in the gene prediction directory. For the location of this directory for a given gene prediction (e.g. MC20.MG10.AS2.GP1), run:; You should find MC20.MG10.AS2.GP1.kegg57.blastp in that directory.
2. KEGG protein, orthologous group, module and pathway information: This information is contained in the RefProteinDB database which is configured when you configure with --enable-kegg-db. This database is automatically downloaded from SMASH website when you install SMASH. They should be installed in your repository under the data directory.