Install SmashCommunity

Before you can use the awesome scripts that are part of SmashCommunity, you have to install it!

0. Storage space

SMASH's use of storage space is directly proportional to the size of the metagenomics data you want to store and analyze, and to the number and types of analyses you want to perform on that data. A rough estimate would be 1-2 terabytes for 100 metagenomes each containing 100 megabases of Sanger sequencing reads, including space used by all the analysis results and the database. Now that you know the space requirements, proceed to the next steps.
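
As a rough illustration of this proportionality, the following shell sketch scales the quoted 1-2 terabyte figure linearly. The function name estimate_storage_tb is made up for this example, and the strictly linear scaling is an assumption; the reference figure applies to Sanger reads only.

```shell
# Rough SMASH storage estimate, scaled linearly from the reference case:
# 100 metagenomes x 100 megabases of Sanger reads ~ 1-2 terabytes.
# (Hypothetical helper, not part of SMASH.)
estimate_storage_tb() {
    # $1 = number of metagenomes, $2 = megabases of reads per metagenome
    awk -v n="$1" -v mb="$2" 'BEGIN {
        scale = (n * mb) / (100 * 100)   # size relative to the reference case
        printf "%.2f %.2f\n", 1 * scale, 2 * scale
    }'
}

# Prints the lower and upper estimate in terabytes.
estimate_storage_tb 100 100
estimate_storage_tb 250 100
```

Compare the result with the free space reported by df -h for the file system where you plan to install SMASH.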

1. Download SMASH

Download your copy of SMASH from http://www.bork.embl.de/software/smash/ or using wget:

        wget http://www.bork.embl.de/software/smash/downloads/smashcommunity-v1.6.tar.gz

2. Unpack the tar file

        tar xfz smashcommunity-v1.6.tar.gz
        cd smashcommunity-v1.6

3. Check and install prerequisites

It is important to know what you are getting yourself into! Please check this list of prerequisites before you download SMASH and start using it. If these are not met, you may not be able to use SMASH at all. SMASH is designed to be used in a UNIX/Linux environment. A 64-bit environment is recommended but not required. It has been tested on a few flavors of Linux, including a 32-bit Ubuntu OS running on a virtual machine on Mani's laptop that runs Windows XP.
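
If you are not sure what kind of machine you are on, a quick check along the following lines may help. This is only a sketch: the helper arch_bits is hypothetical, and its list of machine names is illustrative rather than exhaustive.

```shell
# Classify a machine name, as printed by `uname -m`, as 32-bit or 64-bit.
# (Hypothetical helper; the list of names is not exhaustive.)
arch_bits() {
    case "$1" in
        x86_64|amd64|aarch64|arm64|ppc64|ppc64le|s390x) echo 64 ;;
        i386|i486|i586|i686|armv6l|armv7l)              echo 32 ;;
        *)                                              echo unknown ;;
    esac
}

echo "Operating system : $(uname -s)"
echo "Machine          : $(uname -m) ($(arch_bits "$(uname -m)")-bit)"
```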

3.1. Preparing an Ubuntu 10.04 server edition

If you use Ubuntu Linux, we give you a ready-made script (install_dependencies.ubuntu.sh) that will install all dependencies. You can find it in the root directory of the distribution, e.g., smashcommunity-v1.6/ in the case above. Running it on a bare-bones Ubuntu Lucid (10.04) server edition installs all dependencies required by SMASH.

        ./install_dependencies.ubuntu.sh

SMASH uses SQLite as the database backend by default. If you would like to use MySQL for the database backend, then you should also run the following script:

        ./install_mysql.ubuntu.sh

You are now ready to proceed to "4. Configure and install SMASH".

3.2. Preparing other flavors of Linux

SMASH also requires the following programming environments:

C compiler

A C compiler to build the maui library written in C

Perl

A reasonably new version of the Perl interpreter (>= 5.8)

Perl dependencies

XML::Parser, DBI, DBD::mysql or DBD::SQLite, Config::File::Simple, Getopt::Regex, Statistics::Descriptive, Math::Round.

The easiest option is to use the CPAN installer to install these. If you have the rights to install Perl packages, you can run this from the command line:

        perl -MCPAN -e 'install("XML::Parser", "DBD::SQLite",
                        "Config::File::Simple", "Getopt::Regex",
                        "Statistics::Descriptive", "Math::Round")'

To make your life easier, we have included a bash script that runs this command for you. We will keep that script up to date so that it lists all the Perl dependencies. You can run this script by typing:

        ./install_perl_dependencies.linux.sh

Alternatively, you can download these libraries from CPAN (http://search.cpan.org/), then build and install them.

Database software

SMASH uses a database backend to store digested data in organized form. It supports MySQL and SQLite3 out of the box. If you decide to use SQLite, you don't have to do anything, since the Perl module comes with an independent implementation of the SQLite database. If you would like to use MySQL, however, you are responsible for installing MySQL somewhere and providing the details of the MySQL instance when you configure SMASH in "4. Configure and install SMASH" (see "Configuring MySQL database" in Special Topics for more details). Adding support for other database engines requires minimal work (see "Adding support for new database engine" in ProgrammerManual).

4. Configure and install SMASH

Configuring and installing SMASH has been made much easier and simpler using the automake/autoconf GNU tools. Before you can install SMASH, you should run the configure script that comes with the package. To see all the options available through configure, run configure --help. The default invocation of configure is equivalent to the following:

        ./configure --prefix=/usr/local --enable-dbi=sqlite \
            --enable-rdp-classifier=yes --enable-hmmer=binary --enable-celera=binary \
            --enable-metagene=yes --enable-ncbi-blast=binary --enable-meta-rna=yes \
            --enable-lucy=yes --enable-refgenome-db --enable-eggnog-db \
            --disable-genemark --disable-metagenemark --disable-kegg-db \
            --disable-multi-arch

The following subsections provide a brief overview of the installation procedure. For advanced installation and details, as well as the meaning of the options specified above, please see "Advanced installation" in Special Topics.

WARNING:

The configure script from SMASH was generated using automake/autoconf and it assumes that you will not install SMASH in the same directory that you extracted the package into. If you downloaded the package to /home/somebody and extracted it from there, the contents will be in /home/somebody/smashcommunity-v1.6 (replace 1.6 with the version you downloaded).

        cd /home/somebody
        wget http://www.bork.embl.de/software/smash/downloads/smashcommunity-v1.6.tar.gz
        tar xfz smashcommunity-v1.6.tar.gz
        cd smashcommunity-v1.6
        ./configure

You should NOT use /home/somebody/smashcommunity-v1.6 or any of its subdirectories as --prefix or --datarootdir when you run configure.
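
If you script your installation, a guard like the one below can catch this mistake before configure runs. It is only a sketch: prefix_is_safe is a hypothetical helper, and it compares absolute paths as plain strings, so it does not resolve symbolic links or relative paths.

```shell
# Succeeds if the candidate prefix ($1) is neither the extracted package
# directory ($2) nor one of its subdirectories. Both must be absolute paths.
# (Hypothetical helper, not part of SMASH.)
prefix_is_safe() {
    case "${1%/}" in
        "${2%/}"|"${2%/}"/*) return 1 ;;
        *)                   return 0 ;;
    esac
}

src=/home/somebody/smashcommunity-v1.6
for prefix in /home/somebody/smash "$src/install"; do
    if prefix_is_safe "$prefix" "$src"; then
        echo "ok      $prefix"
    else
        echo "unsafe  $prefix (inside the extracted package)"
    fi
done
```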

4.1. Simplest installation:

We recommend that you change the --prefix option unless you have root privileges and want to install SMASH for everyone.

        cd smashcommunity-v1.6
        ./configure --prefix=/home/somebody/smash
        make
        make install

This will install the SMASH codebase in /home/somebody/smash. External software will be downloaded and installed under /home/somebody/smash/share. Metagenomic data that you add will reside under /home/somebody/smash/data. By default, this will download, configure and install the following software:

Celera assembler (v6.1)

MetaGeneAnnotator gene predictor

RDP Classifier (v2.2)

NCBI BLAST+ (v2.2.23)

HMMER (v3.0)

Meta_rna (vH3)

lucy (v1.20p)

maui C library

Some of the modules in SMASH use programs in the maui library to get things done quicker than in Perl. These programs are automatically installed in the SMASH repository under the software_dir directory.

libargtable2

This library is required by the maui library, and is automatically downloaded and installed when maui is being installed.

You are now ready to move on to "5. Set up your environment to use SMASH".

4.2. Advanced installation

For advanced installation and details, please see "Advanced installation" in Special Topics, which deals with installation of GeneMark, MetaGeneMark and using MySQL as a database backend.

5. Set up your environment to use SMASH

5.1. Perl environment

Add the installation location to the PERL5LIB environment variable:

        export PERL5LIB=$PERL5LIB:/home/somebody/smash/lib
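
If you put this in a shell startup file, you may prefer a variant that copes with an empty PERL5LIB and does not add the same directory twice. The following is one possible sketch; add_to_perl5lib is a hypothetical helper, not something SMASH provides.

```shell
# Append a directory to PERL5LIB unless it is already listed.
# (Hypothetical helper, not part of SMASH.)
add_to_perl5lib() {
    case ":${PERL5LIB:-}:" in
        *":$1:"*) ;;                                  # already present
        *) PERL5LIB="${PERL5LIB:+$PERL5LIB:}$1" ;;    # no leading ':' when empty
    esac
    export PERL5LIB
}

add_to_perl5lib /home/somebody/smash/lib
echo "PERL5LIB=$PERL5LIB"
```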

5.2. Configuration file

SMASH believes in customization of pipelines for the local environment. A key aspect of that customization is a configuration file used by SMASH named smash.conf. SMASH checks for it first in the current working directory, and then in $HOME/.smash. Ideally, you have site-wide configuration in the config file in $HOME/.smash and project-specific configuration in a directory where you run specially configured steps for that project.

When you configure SMASH (see "4. Configure and install SMASH"), the configure script generates the configuration file based on the options you passed to it and copies it to the installation directory. You should copy it to $HOME/.smash/smash.conf.

        mkdir -p /home/somebody/.smash
        cp /home/somebody/smash/smash.conf /home/somebody/.smash/
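
To see which configuration file would be picked up from a given directory, the lookup order described above (current working directory first, then $HOME/.smash) can be mimicked in the shell. This sketch only mirrors the documented order; find_smash_conf is a hypothetical helper, and SMASH itself performs the lookup in Perl.

```shell
# Print the smash.conf that would be found from the current directory,
# following the documented order. Fails if neither file exists.
# (Hypothetical helper, not part of SMASH.)
find_smash_conf() {
    if [ -f "./smash.conf" ]; then
        printf '%s\n' "$PWD/smash.conf"
    elif [ -f "$HOME/.smash/smash.conf" ]; then
        printf '%s\n' "$HOME/.smash/smash.conf"
    else
        return 1
    fi
}

find_smash_conf || echo "no smash.conf found" >&2
```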

If the installation procedure was successful, the configuration file is guaranteed to work. Unless you are an advanced user, we do not recommend changing the configuration file at all! However, you can edit it if you want to change any of the behavior. The format of this file is:

        # A line that starts with a '#' is a comment.
        # Inline comments as in Perl, where a '#' symbol is followed
        #   by a comment at the end of a line, are not recognized.
        # Empty lines don't matter.

        [Section1]
        key1    : value1
        key2    : value2

        [Section2]
        key1    : value1
        key2    : value2
        key3    : value3
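
If you want to read a value from this file in your own shell scripts, the format is simple enough to parse with awk. The following is a minimal sketch based only on the format shown above; smash_conf_get is a hypothetical helper, and SMASH itself reads the file through its Perl modules. It splits each line at the first colon, so values that contain colons (such as URLs) stay intact.

```shell
# Print the value of a key in a section of a smash.conf-style file.
# $1 = config file, $2 = section name, $3 = key. Fails if the key is absent.
# (Hypothetical helper, not part of SMASH.)
smash_conf_get() {
    awk -v section="$2" -v key="$3" '
        /^[ \t]*#/ || /^[ \t]*$/ { next }          # full-line comments, blank lines
        /^[ \t]*\[.*\][ \t]*$/ {                   # [Section] header
            current = $0
            sub(/^[ \t]*\[/, "", current)
            sub(/\][ \t]*$/, "", current)
            next
        }
        current == section {
            pos = index($0, ":")                   # split at the first colon only
            if (pos == 0) next
            k = substr($0, 1, pos - 1)
            v = substr($0, pos + 1)
            gsub(/^[ \t]+|[ \t]+$/, "", k)
            gsub(/^[ \t]+|[ \t]+$/, "", v)
            if (k == key) { print v; found = 1; exit }
        }
        END { exit !found }
    ' "$1"
}

# Try it on a small sample file.
conf=$(mktemp)
cat > "$conf" <<'EOF'
# sample configuration
[Software]
software_dir      : /home/somebody/smash/share
Celera            : wgs

[Current Version]
Celera            : 6.1

[Taxonomy]
remote_repository : ftp://ftp.ncbi.nih.gov/pub/taxonomy
EOF

smash_conf_get "$conf" Software Celera
smash_conf_get "$conf" "Current Version" Celera
smash_conf_get "$conf" Taxonomy remote_repository
rm -f "$conf"
```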

There are four required sections in any SMASH configuration file: Smash, Software, SmashDB and RefOrganismDB. There are other optional but recommended sections: Current Version, Taxonomy, and iTOL. If you have run configure as follows:

        ./configure --prefix=/home/somebody/smash

your configuration file looks as follows:

        # Smash section

        [Smash]

        data_dir          : /home/somebody/smash/data
        workspace_dir     : workspace
        config_dir        : config
        collection_prefix : MC
        metagenome_prefix : MG
        assembly_prefix   : AS
        genepred_prefix   : GP

        # Software installation section

        [Software]

        software_dir      : /home/somebody/smash/share
        multi_arch        : no
        meta_rna          : meta_rna
        MetaGene          : metagene
        hmmer             : hmmer
        ncbi-blast        : ncbi-blast
        rdp_classifier    : rdp_classifier
        Celera            : wgs


        [Current Version]

        meta_rna          : H3
        MetaGene          : 2008-08-19
        hmmer             : 3.0
        ncbi-blast        : 2.2.23
        rdp_classifier    : 2.2
        Celera            : 6.1


        # Database section

        # type of database used to store the pipeline data. One of 'sqlite3', or 'mysql'.
        # If using sqlite3 <database_name> will be the name of the file under <data_dir> 
        # without the .sqlite extension.
        # The following should be left blank for sqlite3 databases.
        #     user, pass, host, port

        [SmashDB]

        database_engine : sqlite3
        database_name   : SmashDB

        user : 
        pass : 
        host : 
        port : 

        [RefOrganismDB]

        data_dir : /home/somebody/smash/data/reference_organisms

        database_engine : sqlite3
        database_name   : RefOrganismDB

        user : 
        pass : 
        host : 
        port : 

        [iTOL]

        uploadID : Ywx7ay

        [Taxonomy]

        local_repository  : /home/somebody/smash/data/external/taxonomy
        remote_repository : ftp://ftp.ncbi.nih.gov/pub/taxonomy

The following rules, as shown in the above example, must be followed:

  1. database_engine defines the type of database used to store the pipeline data. It must be one of 'sqlite3' or 'mysql'. The configure script fills it in based on your choice for the --enable-dbi option, or uses sqlite3 by default.
  2. database_name is the name of the database. We recommend that you leave it as it is. If you change this value, SMASH expects a database with the new name instead.

    If database_engine is sqlite3, then this is the name of the database file that SMASH will create under data_dir. SQLite database files created by SMASH will have the .sqlite extension.

    If database_engine is mysql, then this is the name of the database that should be available under the connection settings given in the configuration file (see "Configuring MySQL database" in Special Topics).

  3. If database_engine is sqlite3, the connection credentials (user, pass, host, port) should be left empty.
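
If you do edit the database settings by hand, a small sanity check can catch the most common mistakes. The sketch below encodes only the rules on database_engine and the connection credentials listed above; validate_db_settings is a hypothetical helper, and you pass it the values yourself.

```shell
# Check a set of database settings against the rules above.
# $1 = database_engine, $2..$5 = user, pass, host, port
# (Hypothetical helper, not part of SMASH.)
validate_db_settings() {
    case "$1" in
        sqlite3)
            if [ -n "${2:-}${3:-}${4:-}${5:-}" ]; then
                echo "sqlite3: user, pass, host and port should be left empty" >&2
                return 1
            fi ;;
        mysql) ;;
        *)
            echo "database_engine must be 'sqlite3' or 'mysql', got '$1'" >&2
            return 1 ;;
    esac
}

validate_db_settings sqlite3 "" "" "" "" && echo "SmashDB settings look consistent"
```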

For more information on the database structure, see "Database structure".

5.3. Set up the data repository

        bin/initSmash.pl

6. Understand the SMASH Code Repository

SMASH is a modular software package written in Perl (as object-oriented as Mani could make it). It contains several Perl modules and a few scripts that provide the interface to these modules. A normal user who just wants to install SMASH locally and analyze their metagenome sequencing data has little to worry about. They must download the SMASH package, install the prerequisites if necessary, install SMASH locally along with the external software, configure it properly, and can then start using it by running the scripts. SMASH provides support for a chosen few software packages that are state of the art in metagenomic sequence analysis. A user who wants to use an unsupported external software package, or to analyze data that SMASH was not originally designed to analyze, would have to hack the Perl modules and is strongly advised to consult the "Programmer Manual".

6.1. SMASH codebase

The SMASH codebase is organized into three categories. They reside in different locations under the smash_dir.

Scripts

These are the Perl scripts a general user needs. They are located in a directory bin under smash_dir. They provide the interface to the Perl modules located under Smash.

Libraries

These are the Perl modules that do the actual work - the real people behind the scene! They are located in Smash under smash_dir. See "Programmer Manual" for a detailed explanation of these modules.

Programs

When speed is important, Perl kind of drops the ball. For important tasks that need to be done fast, SMASH uses the maui library written in C. The programs that are part of maui are located in lib/maui under smash_dir. maui borrows several ideas and designs from the zoe library developed by the Laboratory of Computational Genomics at Washington University (part of the Twinscan/N-SCAN gene prediction software suite, http://mblab.wustl.edu/) and Ian Korf (part of his SNAP gene prediction program, http://homepage.mac.com/iankorf/). It even uses Ian Korf's implementation of a hashtable (with his kind permission).

6.2. External dependencies distributed with SMASH

SMASH is bundled with some other code not developed as part of SMASH, sometimes not even by the developers of SMASH. This code is distributed under its original license, not that of SMASH. Anything you can do with the SMASH code may or may not apply to the bundled external code. Each bundled package carries its own license details, so please adhere to its guidelines and keep the developers of SMASH out of legal trouble!

FAlite

FAlite is a light-weight Fasta file parser developed by Ian Korf (ikorf@ucdavis.edu).

FQlite

FQlite is a light-weight Quality file (in fasta format) parser developed by Mani Arumugam, largely based on FAlite, and thus distributed under FAlite's license and not SMASH's license.

6.3. Dude, where is my software?

For the brave at heart, here is what happens behind the scenes when you install external software using either of the options above. SMASH uses up to four different parameters to find the right location of the programs. What follows is an explanation of these parameters and how they are used in finding the right location. You don't need to know this, unless you are just curious. If you just need to know where a given software package is located, you can use a script that does this job for you:

        somebody@ubuntu:~/smash$ bin/showLocations.pl --software=Celera
        software_dir  : /home/somebody/smash/share/wgs/6.1
        curr. version : 6.1
        somebody@ubuntu:~/smash$

This says that the current (latest) version of Celera is 6.1, and it is installed in /home/somebody/smash/share/wgs/6.1.

software_dir

This is where all software resides in SMASH. It comes from the software_dir entry in the smash.conf file under the Software section. For the default installation with --prefix=/home/somebody/smash, this is /home/somebody/smash/share, and if you specify a value to configure using --datarootdir, then this will be datarootdir/share.

host-cpu-type

Host cpu type for machine-specific versions of the software. This only matters if you passed --enable-multi-arch when you ran configure. The cpu type is decided by the GNU script that is packaged with autoconf/automake. A copy of that script will be installed in /home/somebody/smash/bin/config.guess. If you run it, you get the host cpu type for the current host. E.g., on a 64-bit Ubuntu 10.04 server, I get this:

        somebody@ubuntu:~/smash$ bin/config.guess
        x86_64-unknown-linux-gnu
        somebody@ubuntu:~/smash$

If multiple architecture support was requested, SMASH looks for a directory named after the cpu type under software_dir. For example, software would be installed under /home/somebody/smash/share/x86_64-unknown-linux-gnu/.

software_name

There are two kinds of software:

  • Software type A: external software that is an integral part of Smash and has a module associated with it (e.g., assemblers, gene predictors)
  • Software type B: external software that is used by some analysis, but it does not have its own module associated with it (e.g., sequence trimming software, BLAST flavors, HMMER, RDP Classifier)

Usually, a software package is installed using the name with which it is distributed. But sometimes SMASH might call it something different. For example, the Celera assembler is installed using its popular alias wgs.

The installation locations for these are listed under the [Software] section as follows:

        [Software]

        software_dir      : /home/somebody/smash/share
        multi_arch        : no
        meta_rna          : meta_rna
        MetaGene          : metagene
        hmmer             : hmmer
        ncbi-blast        : ncbi-blast
        rdp_classifier    : rdp_classifier
        Celera            : wgs

(The name of type A software is usually obtained by calling software_name on the instance of that module as well, and usually it is case sensitive.)

version

Version of the software to use. When you ask for a software package, SMASH finds it using the given name and version (this is how the showLocations.pl script finds it). If you just want to use the current version, you can ask for the software by name only, or ask for current as the version. This information is stored in the configuration file. The current (latest) versions of the software packages are listed under the [Current Version] section as follows:

        [Current Version]

        meta_rna          : H3
        MetaGene          : 2008-08-19
        hmmer             : 3.0
        ncbi-blast        : 2.2.23
        rdp_classifier    : 2.2
        Celera            : 6.1

If you do not have this section in the configuration file, there is no need to worry. You can still find the software by specifying the version:

        somebody@ubuntu:~/smash$ bin/showLocations.pl --software=Celera --version=6.1
        software_dir  : /home/somebody/smash/share/wgs/6.1
        curr. version : 6.1
        somebody@ubuntu:~/smash$

The final path of an external software, when requested through pkg_dir of an instance of itself, is then: software_dir/software_name/version or software_dir/host-cpu-type/software_name/version. (See above for explanation of these.)
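
To make the rule concrete, here is a small shell sketch that assembles the path from the four parameters. The function smash_pkg_dir is hypothetical and only illustrates the layout described above; in SMASH the actual lookup is done by the Perl modules.

```shell
# Assemble the location of an external software package.
# $1 = software_dir, $2 = software_name, $3 = version,
# $4 = host-cpu-type (optional; only relevant with --enable-multi-arch)
# (Hypothetical helper, not part of SMASH.)
smash_pkg_dir() {
    if [ -n "${4:-}" ]; then
        printf '%s/%s/%s/%s\n' "$1" "$4" "$2" "$3"
    else
        printf '%s/%s/%s\n' "$1" "$2" "$3"
    fi
}

# Celera assembler (installed as wgs), version 6.1
smash_pkg_dir /home/somebody/smash/share wgs 6.1
smash_pkg_dir /home/somebody/smash/share wgs 6.1 x86_64-unknown-linux-gnu
```

The first call prints the same directory that showLocations.pl reported for Celera earlier in this section.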

Sounds complicated? Actually it is not THAT complicated in practice. If you think about it, it is a pretty neat way of handling multiple versions of multiple software packages running on different hosts with different architectures. The functions responsible for figuring all this out, provided you set up the SMASH repository the right way, are listed in "Software package related functions" in Smash::Analyses.
