Special Topics

Notes

1. Advanced installation

1.1. Installation with more options:
1.2. Installing GeneMark or MetaGeneMark gene prediction software

1.2.1. Installing GeneMarkSuite
1.2.2. Installing MetaGeneMark

1.3. Configuring MySQL database
1.4. Installation on a cluster
1.5. Compiling Celera assembler from source
1.6. Installing reference genome database
1.7. Installing functional databases
1.8. Applying a patch

2. Skipping assembly
3. Using external data

3.1. Loading preassembled data
3.2. Loading an assembly from Newbler

4. Phylogenetic annotation of samples
5. Functional annotation of proteins
6. Comparative analysis of metagenomes
7. Parallelizing BLAST
8. Downsampling analysis
9. Batch mode
10. Matrix/Hash manipulation

Special Topics

Notes

30 Oct 2010: The special topics page is still evolving, since I am constantly adding or revising information to make it clearer. Please don't consider this page final. I will remove this note when it is.

1. Advanced installation

1.1. Installation with more options:

Use --prefix to change the location of the SMASH codebase. Use --datarootdir to change the installation location of external software (by default, $prefix/share). Metagenomic data that you add will reside under $prefix/data.

There are several useful options to the configure script that configure and install external software in your SMASH repository. The following options take the form --enable-feature. Unless noted otherwise, the only allowed forms are: --enable-feature=yes, --enable-feature=no and --disable-feature. For some of the packages that can be downloaded as prebuilt binaries or source code, you can specify --enable-feature=source or --enable-feature=binary. Choosing --enable-feature=yes falls back to the default setting for that feature as listed below. The full list of available features can be obtained by running

        ./configure --help

--enable-multi-arch (default: no): enables support for multiple architectures, which is useful when installing for a cluster of heterogeneous machines.
--enable-lucy (default: yes): enables automatic download and configuration of sequence trimming software lucy v1.20p
--enable-rdp-classifier (default: yes): enables automatic download and configuration of RDP Classifier v2.2
--enable-celera (default: binary): enables automatic download and configuration of Celera assembler v6.1
--enable-hmmer (default: binary): enables automatic download and configuration of HMMER 3.0
--enable-metagene (default: yes): enables automatic download and configuration of MetaGeneAnnotator
--enable-ncbi-blast (default: binary): enables automatic download and configuration of NCBI BLAST+ v2.2.23
--enable-meta-rna (default: yes): enables automatic download and configuration of Meta_rna H3
--enable-genemark (default: no): enables the configuration of GeneMarkSuite microbial gene prediction software. See "1.2.1. Installing GeneMarkSuite" for more details.
--enable-metagenemark (default: no): enables the configuration of MetaGeneMark metagenomic gene prediction software. See "1.2.2. Installing MetaGeneMark" for more details.
--enable-dbi (default: sqlite): You do not have to do anything special to use sqlite as your DBI interface. Should you prefer MySQL as the backend for SMASH, you should specify --enable-dbi=mysql. For details on how to configure MySQL properly, see "1.3. Configuring MySQL database".
--enable-eggnog-db (default: yes): enables automatic download and configuration of eggNOG database and related support files, which is useful in functional annotation of predicted metagenomic proteins.
--enable-kegg-db (default: no): enables automatic download and configuration of KEGG database and related support files, which is useful in functional annotation of predicted metagenomic proteins.
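
As an illustration of how these options combine, here is a minimal sketch that wraps one possible invocation in a shell function. The function name configure_smash is hypothetical, the prefix is whatever path you pass in, and the particular mix of options is only an example; every flag used is one of those listed above.

```shell
# Hypothetical helper: run from the extracted SMASH source directory.
# $1 is the installation prefix (for example /home/somebody/smash).
configure_smash() {
    ./configure --prefix="$1" \
        --enable-multi-arch \
        --enable-celera=source \
        --enable-kegg-db
}
```

You would call it as configure_smash /home/somebody/smash. Options that are not mentioned keep the defaults listed above.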

1.2. Installing GeneMark or MetaGeneMark gene prediction software

1.2.1. Installing GeneMarkSuite

Since GeneMark is available only under license, the user must download the software from http://exon.gatech.edu/GeneMark/ and provide the locations of the downloaded program and the downloaded key files. The locations of these files should be in the environment variables GENEMARK_TAR_FILE and GENEMARK_KEY_FILE respectively. Note: Use full paths for the files in these variables, since the files will be searched for from a different directory than where you run ./configure. For example,

        export GENEMARK_KEY_FILE=/home/somebody/downloads/gm_key_32.tar
        export GENEMARK_TAR_FILE=/home/somebody/downloads/genemark_suite_linux.tar.gz
        ./configure --enable-genemark

1.2.2. Installing MetaGeneMark

Since MetaGeneMark is available only under license, the user must download the software from http://exon.gatech.edu/GeneMark/ and provide the locations of the downloaded program and the downloaded key files. The locations of these files should be in the environment variables METAGENEMARK_TAR_FILE and METAGENEMARK_KEY_FILE respectively. Note: Use full paths for the files in these variables, since the files will be searched for from a different directory than where you run ./configure. For example,

        export METAGENEMARK_KEY_FILE=/home/somebody/downloads/gm_key_32.tar
        export METAGENEMARK_TAR_FILE=/home/somebody/downloads/MetaGeneMark_linux32.tar.gz
        ./configure --enable-metagenemark
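
Because relative paths in these variables will not be found, it can help to check them before running ./configure. The following is a small sketch, not part of SMASH; check_abs_file is a hypothetical helper that takes the name of an environment variable and verifies that it holds an absolute path to an existing file.

```shell
# Hypothetical pre-flight check (not shipped with SMASH).
# Usage: check_abs_file VARIABLE_NAME
# Returns 0 only if the variable holds an absolute path to an existing file.
check_abs_file() {
    name=$1
    # Look up the value of the variable whose name was passed in.
    eval "value=\${$name:-}"
    case $value in
        /*) ;;  # absolute path, continue
        *)
            echo "$name must be a full path (got '$value')" >&2
            return 1
            ;;
    esac
    if [ ! -f "$value" ]; then
        echo "$name: no such file: $value" >&2
        return 1
    fi
    return 0
}
```

For example, check_abs_file METAGENEMARK_TAR_FILE && check_abs_file METAGENEMARK_KEY_FILE && ./configure --enable-metagenemark would only configure when both files are in place. The same check applies to GENEMARK_TAR_FILE and GENEMARK_KEY_FILE.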

1.3. Configuring MySQL database

When you configure with --enable-dbi=mysql, the configure script automatically assumes that the MySQL server is running on the localhost on the default port 3306. Smash will later add a new user to this instance of MySQL server with user id smash and password smash. If you would like to change any of these, please use the following environment variables: MYSQL_SERVER, MYSQL_PORT, MYSQL_USER and MYSQL_PASS.

Let us say you have a MySQL instance running on mysql.remote.com on port 9999, and you want the new userid to be smashdb and the password to be smashdb. You can do this by:

        export MYSQL_SERVER=mysql.remote.com
        export MYSQL_PORT=9999
        export MYSQL_USER=smashdb
        export MYSQL_PASS=smashdb
        ./configure --prefix=/home/somebody/smash --enable-dbi=mysql

There is another way to specify these environment variables to the configure script without setting them permanently in the UNIX/Linux shell:

        MYSQL_SERVER=mysql.remote.com MYSQL_PORT=9999 \
        MYSQL_USER=smashdb MYSQL_PASS=smashdb \
        ./configure --prefix=/home/somebody/smash --enable-dbi=mysql

Notice the backslashes at the end of the first two lines: this is one single command, broken up for your convenience. Whichever of the two methods above you use, the results are exactly the same and your config file looks like this:

        ...

        [SmashDB]
        database_engine : mysql
        database_name   : SmashDB
        user : smashdb
        pass : smashdb
        host : mysql.remote.com
        port : 9999
        
        ...
        
        [RefOrganismDB]
        database_engine : mysql
        database_name   : RefOrganismDB.v4
        user : smashdb
        pass : smashdb
        host : mysql.remote.com
        port : 9999
        
        ...

        [RefProteinDB]
        database_engine : mysql
        database_name   : RefProteinDB.v4
        user : smashdb
        pass : smashdb
        host : mysql.remote.com
        port : 9999

As mentioned earlier, this configuration requires a mysql server running on a host called mysql.remote.com on port 9999. But the new user with userid smashdb and password smashdb will be created by SMASH.

./configure creates a file called prepare_mysql.sh to help you set up the necessary databases, add this user account and add the necessary privileges to the user. You should execute the shell commands in that file. For the configuration above, the file looks like:

        #!/bin/sh
        mysql -h mysql.remote.com -P 9999 -u root -p < prepare_mysql.sql
        sleep 5
        if [ -f "/home/somebody/smash/data/RefOrganismDB.v4.mysql.gz" ]; then
          zcat /home/somebody/smash/data/RefOrganismDB.v4.mysql.gz | mysql -h mysql.remote.com -P 9999 -u smashdb -p RefOrganismDB.v4
        fi
        if [ -f "/home/somebody/smash/data/RefProteinDB.v4.mysql.gz" ]; then
          zcat /home/somebody/smash/data/RefProteinDB.v4.mysql.gz  | mysql -h mysql.remote.com -P 9999 -u smashdb -p RefProteinDB.v4
        fi

You don't necessarily need to understand what these steps do, but pay attention to the first command:

        mysql -h mysql.remote.com -P 9999 -u root -p < prepare_mysql.sql

This command will set up the SmashDB database, add the user account and provide the necessary privileges to this user. These steps can only be performed by a MySQL user that has GRANT privileges. By default, this line uses the root account for this. When this command is run it will ask for the password for root. If you do not have root access to the database, but have another user with GRANT privileges, you should replace root with that userid before you run this shell script.
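
If you need to swap root for another privileged account, one way is to let sed rewrite the script rather than editing it by hand. This is only a sketch: use_privileged_user is a hypothetical helper, and it assumes the generated script contains the text "-u root -p" exactly as in the example above.

```shell
# Hypothetical helper: print a copy of prepare_mysql.sh in which the
# privileged account "root" is replaced by another MySQL user.
# $1 = path to the script, $2 = user with GRANT privileges.
# Lines that use the SMASH account (e.g. "-u smashdb -p") are left alone.
use_privileged_user() {
    script=$1
    user=$2
    sed "s/-u root -p/-u $user -p/" "$script"
}
```

For example, use_privileged_user prepare_mysql.sh dbadmin > prepare_mysql.dbadmin.sh writes a modified copy, where dbadmin stands in for whatever privileged account you actually have. The original file is not changed.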

Before you run the prepare_mysql.sh script, you can test if it will run successfully or not. Instead of reading from prepare_mysql.sql, try to ask the MySQL server to list its databases like so:

        echo 'show databases;' | mysql -h mysql.remote.com -P 9999 -u root -p

When it asks you for the password, enter the password. It should now list all the databases in your server. If it gives you an error, then your MySQL configuration should be fixed before you can proceed with the next step.

Once you have made sure that root or another privileged account can run SQL queries from the current host, here is how you would run this shell script. It is better to run it in bash -x mode so that you know which commands are being executed. This matters because mysql runs three times and asks for a password each time: the first is the password of the privileged user, and the last two are the password for the new account created by SMASH.

        bash -x prepare_mysql.sh

These commands

create the SmashDB database and add an account for SMASH in MySQL database,

load the reference genome information into the database, and

load the reference proteome information into the database.

1.4. Installation on a cluster

If you want to install SMASH on a single location so that a cluster of computers can access that single installation, you should install it on a disk that all the computers have access to. This could be through NFS, or through common disks mounted locally by multiple computers. Please do not install on the local hard disk of a cluster node and expect it to work on all nodes. If you are unsure what this note means, please consult with your system administrator.

1.4.1. Installing in a homogeneous cluster: If the cluster contains a homogeneous collection of computers, you can still use the default installation, since any software compiled on one machine should run on the others.
1.4.2. Installing in a heterogeneous cluster: If the cluster contains a heterogeneous collection of computers, meaning software compiled on one machine may not run on the other machines, you should add support for multiple architectures during configuration using --enable-multi-arch. This will configure SMASH specifically for each host CPU type. All the relevant software will be installed under an architecture-specific location. The configuration should be run once on each host CPU type. Data (meaning metagenomes) loaded to SMASH will still reside in a common location if given the same --prefix and/or --datarootdir, and all the different machines will share the data.

1.5. Compiling Celera assembler from source

By default, SMASH downloads the precompiled binaries from the developers of the software. However, there are times when you need to use binaries compiled for your own architecture. Les Dethlefsen identified an issue in Celera assembler that led to a segmentation violation and compiling from the source got rid of that issue (Thanks Les!). No matter what your reasons are, SMASH allows you to compile Celera assembler from source. However, this means that you will have to satisfy additional dependency requirements as mandated by the developers of Celera. As of October 2010, these are:

gmake

For non-GNU flavors of Linux, you should install gmake. In GNU Linux flavors like Ubuntu, the default make program is in fact gmake. However, it is called make and not gmake. Therefore, you should either copy make to gmake or create a symbolic link called gmake that points to make.

python development package
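
The gmake requirement above can be handled with a symbolic link, roughly as sketched below. This is an illustration rather than an official procedure: ensure_gmake is a hypothetical helper, and the default link directory $HOME/bin is an assumption that only works if that directory is on your PATH.

```shell
# Hypothetical helper: make sure a command called gmake exists.
# $1 = directory (on your PATH) where the link should be created;
#      defaults to $HOME/bin, which is an assumed location.
ensure_gmake() {
    bindir=${1:-"$HOME/bin"}
    if command -v gmake >/dev/null 2>&1; then
        echo "gmake already available: $(command -v gmake)"
        return 0
    fi
    make_path=$(command -v make) || { echo "make not found" >&2; return 1; }
    # Only link if the installed make really is GNU make.
    if ! "$make_path" --version 2>/dev/null | grep -q 'GNU Make'; then
        echo "$make_path is not GNU make; install gmake instead" >&2
        return 1
    fi
    mkdir -p "$bindir" || return 1
    ln -s "$make_path" "$bindir/gmake" || return 1
    echo "linked $bindir/gmake -> $make_path"
}
```

After running ensure_gmake, the command gmake --version should report GNU Make, provided the link directory is on your PATH.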

1.6. Installing reference genome database

The default phylogenetic analysis performed by SMASH involves mapping metagenomic reads to a set of microbial reference genomes using BLASTN. This requires a database of sequences and their phylogeny, which can then be transferred to the metagenomic reads. SMASH automatically downloads this database from our website and configures it for you. To turn this off, please configure using

        ./configure --disable-refgenome-db

1.7. Installing functional databases

By default, SMASH downloads the eggNOG protein database and associated files that will help annotate predicted genes from metagenomes. You can turn this off by specifying --disable-eggnog-db.

        ./configure --disable-eggnog-db

Optionally, SMASH can download the KEGG protein database and associated files that will help annotate predicted genes using KEGG orthologous groups, functional modules and functional pathways. You can enable this by specifying

        ./configure --enable-kegg-db

1.8. Applying a patch

Applying a given patch within the same minor version of Smash is a very simple procedure. It is very quick and simple if you have not deleted the old directory where you had extracted the Smash package and had run "configure", "make" and "make install". Assuming that the package file for v1.6p1 was downloaded to /tmp/downloads and you extracted the package contents in the same directory, the package file will be in /tmp/downloads/smashcommunity-v1.6p1.tar.gz and the contents will be in /tmp/downloads/smashcommunity-v1.6. Here are the steps involved in upgrading the installation to v1.6p2:

Check the directory /tmp/downloads/smashcommunity-v1.6 for the file config.log.

        head /tmp/downloads/smashcommunity-v1.6/config.log

The beginning of the file looks like:

        This file contains any messages produced by compilers while
        running configure, to aid debugging if configure makes a mistake.

        It was created by SmashCommunity configure v1.6, which was
        generated by GNU Autoconf 2.63.  Invocation command line was

          $ ./configure --prefix=/home/somebody/smash --enable-multi-arch --enable
        -metagenemark METAGENEMARK_TAR_FILE=/tmp/MetaGeneMark_linux64.tar.gz METAG
        ENEMARK_KEY_FILE=/tmp/gm_key_64.tar

The line starting with a dollar sign tells you exactly how configure was run the previous time. Please note this down.
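
Instead of copying the line by hand, you could pull it out of config.log with sed. The sketch below assumes the usual Autoconf layout, where the invocation is recorded on a single line beginning with a dollar sign (it only appears wrapped in the listing above); previous_configure is a hypothetical helper name.

```shell
# Hypothetical helper: print the configure command line recorded in a
# config.log file. $1 = path to config.log.
previous_configure() {
    # Keep only the first line that starts with "$ " (after optional
    # leading spaces) and strip that prefix.
    sed -n 's/^ *\$ //p' "$1" | head -n 1
}
```

For example, previous_configure /tmp/downloads/smashcommunity-v1.6/config.log prints the earlier ./configure command with all of its options.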

Download the Smash package file for v1.6p2.

        cd /tmp/downloads
        wget http://www.bork.embl.de/software/smash/downloads/smashcommunity-v1.6p2.tar.gz

Extract it into the same directory as before.

        tar xvfz smashcommunity-v1.6p2.tar.gz

The contents of /tmp/downloads/smashcommunity-v1.6 will be updated with the new package.

Go to the directory and reconfigure exactly like last time, using the command line you noted down from config.log.

        cd smashcommunity-v1.6
        ./configure --prefix=/home/somebody/smash --enable-multi-arch \
                --enable-metagenemark \
                METAGENEMARK_TAR_FILE=/tmp/MetaGeneMark_linux64.tar.gz \
                METAGENEMARK_KEY_FILE=/tmp/gm_key_64.tar

Run make.

        make

Install the new patch.

        make install

Voila! You have now upgraded Smash to v1.6p2.

2. Skipping assembly

There are situations where you do not wish to perform an assembly of a metagenomic sample. For example, metagenomic sequences from high-complexity environments do not have enough information and depth to enable assembly. In these cases, you are advised to perform the analysis at the read level. However, the workflow of SMASH involves a mandatory assembly process. To solve this, we introduce a fake assembly of the samples. For a metagenome that you have just added to the repository, you can make a fake assembly by running:

        makeFakeAssembly.pl --metagenome=MC1.MG1

This will make a new assembly for this metagenome (e.g., MC1.MG1.AS1 if it is available), create a contig for each read in the metagenome and pretend as though they were assembled that way. In principle, performing a real assembly on such samples will probably have the same result. See makeFakeAssembly.pl for more details.
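
If several metagenomes need a fake assembly, the command can be repeated in a loop. This is a minimal sketch that assumes makeFakeAssembly.pl is on your PATH; fake_assemble_all is a hypothetical helper, and the only option it uses is the --metagenome option shown above.

```shell
# Hypothetical helper: create a fake assembly for each metagenome ID
# given as an argument, stopping at the first failure.
fake_assemble_all() {
    for mg in "$@"; do
        echo "Making fake assembly for $mg"
        makeFakeAssembly.pl --metagenome="$mg" || return 1
    done
}
```

For example, fake_assemble_all MC1.MG1 MC1.MG2 would process both metagenomes in turn, where MC1.MG2 is just an illustrative second ID.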

3. Using external data

Smash is designed to start from the raw data and generate the results you want. However, if you have processed the raw data by yourself using other programs or scripts for whatever reason, you can still use Smash for the later part of the analysis. This section explains how to do that.

3.1. Loading preassembled data

If you have data that is preassembled, please check if the assembler produced an ACE format file that summarizes the assembly information. If there are scaffolds in the assembly, then there should be an AGP format file explaining the scaffold layout. For example, Newbler generates the ACE file as 454Contigs.ace and the AGP file as 454Scaffolds.txt. Other programs might differ in what they do. They may not generate these files by default, and might require a flag to generate them. Note that if you do not have paired end reads, then there is no scaffolding information. If you do not have ACE or AGP format files, then you could generate contig-to-read and scaffold-to-contig mappings in GFF format and then load this information. Please see loadExternalAssembly.pl (the script that lets you load preassembled data) for more information.

3.2. Loading an assembly from Newbler

If you have assembled your 454 runs using Newbler, you can easily add it to SMASH. This procedure requires the following files that can be obtained from the Roche 454 software suite:

454Reads.fna: reads in fasta format
454Reads.qual: quality values for the bases in reads
454Contigs.ace: assembly information in ACE format
454Scaffolds.txt (optional): scaffolding information (for paired end runs only)
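
Before loading, you may want to confirm that the files listed above are actually present in your Newbler output directory. The following sketch is not part of SMASH; check_newbler_files is a hypothetical helper that reports missing required files and notes when the optional scaffold file is absent.

```shell
# Hypothetical pre-flight check for a Newbler output directory.
# $1 = directory to inspect (defaults to the current directory).
# Returns 0 only if all three required files exist.
check_newbler_files() {
    dir=${1:-.}
    status=0
    for f in 454Reads.fna 454Reads.qual 454Contigs.ace; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing required file: $dir/$f" >&2
            status=1
        fi
    done
    # The scaffold file is optional (paired end runs only).
    if [ ! -f "$dir/454Scaffolds.txt" ]; then
        echo "note: no 454Scaffolds.txt in $dir; skip --scaffold_agp" >&2
    fi
    return $status
}
```

Running check_newbler_files /path/to/assembly prints one message per missing file and returns a non-zero status if any required file is absent.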

Here are the steps involved.

1. Adding sequences
2. Adding assembly: Note: You can skip the --scaffold_agp option if you did not have paired end runs or if there is no file called 454Scaffolds.txt.

Please see loadExternalAssembly.pl (the script that lets you load preassembled data) for more information.

4. Phylogenetic annotation of samples

Reads from metagenomic samples can be assigned to a reference species through BLAST based sequence similarity. We use blastn to align the reads to the reference genome set (requires --enable-refgenome-db during configuration). For more details on how to perform this, and to analyze the phylogenetic composition of the samples, please see "Phylogenetic annotation of samples".

5. Functional annotation of proteins

Predicted genes are assigned to orthologous groups using BLAST based sequence homology to known proteins. We use the eggNOG orthologous groups to assign functions to predicted genes (requires --enable-eggnog-db during configuration). First of all, genes need to be blastp'ed against the set of eggNOG proteins. Then they can be mapped to orthologous groups. Optionally, this can also be performed using the KEGG database (requires --enable-kegg-db during configuration). For more details, see "Functional annotation".

6. Comparative analysis of metagenomes

Once you have phylogenetically or functionally annotated your metagenomes, you can start comparing them based on the phylogenetic or functional composition. For more details, see "Comparative metagenomic analysis".

7. Parallelizing BLAST

8. Downsampling analysis

9. Batch mode

10. Matrix/Hash manipulation

Comparative metagenomic analysis using SMASH involves handling a lot of matrices. For example, the species abundances of multiple samples can be represented as a 2-D matrix. SMASH includes a special module, Smash::Utils::MatrixIO, that performs a lot of matrix/hash manipulation. If you find its functions useful, you can also use them yourself for other purposes.