NAME

Smash::Core - Basic interface to the Smash library

SYNOPSIS

        use Smash::Core;
        use FAlite;
        my $smash = Smash::Core->new(GENEPRED => "MC1.MG1.AS1.GP1");
        $smash->init();

        my $collection = $smash->collection; # "MC1"
        my $metagenome = $smash->metagenome; # "MC1.MG1"
        my $assembly   = $smash->assembly;   # "MC1.MG1.AS1"
        my $genepred   = $smash->genepred;   # "MC1.MG1.AS1.GP1"

        my ($dir, $dbh);

        $dir = $smash->read_dir($metagenome);
        $dir = $smash->assembly_dir($assembly);
        $dir = $smash->genepred_dir($genepred);
        
        my $contig_id   = $smash->get_id_by_name("contig", "MC2.MG1.AS1.C34");
        my $lib_id      = $smash->get_id_by_name("library", "MC2.BAAU");

        my $fasta = FAlite->new(\*STDIN);
        if (my $entry = $fasta->nextEntry) {
                # the sequence belongs to the entry, not to the reader object
                my $gc = $smash->get_gc_percent($entry->seq, 0, 100);
                print $smash->pretty_fasta($entry->seq);
        }

        $dbh = $smash->get_db_handle();        # collection specific DB
        $dbh = $smash->get_smashdb_handle();   # general SmashDB
        $smash->finish(); # closes these two handles

DESCRIPTION

Smash::Core is the core module of the Smash Perl codebase. It provides the functionality that most modules in Smash::Analyses, Smash::Utils and Smash::Databases depend on. Almost all of this functionality is object-oriented: you cannot call a Smash::Core function as a plain subroutine like

        Smash::Core::function($arg1, $arg2); # illegal

since it expects to be called as a method:

        Smash::Core->function($arg1, $arg2); # legal, yet incorrect

However, calling it as a class method is still not enough. The object must have been created and initialized properly, like so:

        my $smash = Smash::Core->new(COLLECTION => "MC1");
        $smash->init();
        $smash->function($arg1, $arg2);
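
The three call styles differ in what Perl passes as the first argument. The following minimal sketch shows this with a hypothetical package, Demo::Caller, which is not part of Smash and is used here only for illustration:

```perl
use strict;
use warnings;

# Hypothetical package for illustration only; it is not part of Smash.
package Demo::Caller;

sub new {
    my ($class, %args) = @_;
    return bless {%args}, $class;
}

# Reports what arrived as the first argument (the invocant).
sub first_arg_kind {
    my $first = shift;
    return 'object'     if ref($first) && ref($first) eq __PACKAGE__;
    return 'class name' if defined $first && !ref($first) && $first eq __PACKAGE__;
    return 'plain argument';
}

package main;

my $obj = Demo::Caller->new(COLLECTION => 'MC1');

print Demo::Caller::first_arg_kind('x'), "\n";   # plain argument
print Demo::Caller->first_arg_kind('x'), "\n";   # class name
print $obj->first_arg_kind('x'), "\n";           # object
```

Only the last form hands the method an initialized object, which is what the methods of Smash::Core need.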

When an instance of Smash::Core is created and its init() method is called, it does the following:

1. parses the configuration file and registers all the configuration details

2. makes two database connections (SmashDB and collection-specific)

When you are done with the Smash::Core object, be sure to call the mandatory finish() method, like so:

        $smash->finish();

The finish() method commits any pending changes to the database and closes all open database connections.

Some important roles of Smash::Core are:

  • parsing the configuration file and registering all the configuration details
  • providing data location for all levels of data (collection, metagenome, reads, assembly, gene prediction)
  • providing software package locations
  • making database connections depending on the database engine used
  • querying database for internal ids using external names of sample, library, assembly, gene prediction, contig, gene, etc.

There are groups of functions in Smash::Core that perform the above roles.

OBJECT CREATION AND DESTRUCTION

new

Returns a new Smash::Core object. Must be called as follows:

        my $smash = Smash::Core->new(GENEPRED => "MC1.MG1.AS1.GP1");

Valid keys include (but are not limited to) COLLECTION, METAGENOME, ASSEMBLY and GENEPRED. One of these four must be passed to the constructor to use the collection-specific database. Calling new without any of these keys initializes a connection to the general SmashDB database only, and a subsequent call to get_db_handle results in an error. See "DBI interface with database engine".

init

Parses config file and initializes database connections.

finish

Commits any pending changes to the database and closes all open database connections.

MEMBER VARIABLES

These are member variables of the Smash object, once it is initialized.

collection

collection id

metagenome

metagenome id

assembly

assembly id

genepred

genepred id

config

hash containing the configuration

host

short name of the execution host, set using `hostname -s`

cluster

name to identify a group of hosts with the same architecture, so a program built on one of them can run on all of them

FUNCTIONS

Smash-specific configuration file parsing

parse_config

Parses the config file and sets the relevant variables. The following locations are searched for smash.conf in the given order: current working directory and $HOME/.smash.
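
The search order can be sketched as follows. This is only an illustration: find_smash_conf is a hypothetical helper, not part of Smash::Core, and the sketch assumes that $HOME/.smash is a directory containing smash.conf.

```perl
use strict;
use warnings;
use File::Spec;

# Returns the first existing smash.conf among the given directories,
# or undef if none is found. find_smash_conf is a hypothetical helper.
sub find_smash_conf {
    my @dirs = @_;
    for my $dir (@dirs) {
        my $candidate = File::Spec->catfile($dir, 'smash.conf');
        return $candidate if -f $candidate;
    }
    return;
}

# Search order from the text: current working directory, then $HOME/.smash.
# Treating $HOME/.smash as a directory is an assumption.
my @search_dirs = (File::Spec->curdir);
push @search_dirs, File::Spec->catdir($ENV{HOME}, '.smash') if defined $ENV{HOME};

my $conf = find_smash_conf(@search_dirs);
print defined $conf ? "Using $conf\n" : "No smash.conf found\n";
```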

get_smash_conf_value($key)

Get a value from the config file for any key $key under [Smash] section. For example,

        $smash->get_smash_conf_value("data_dir");

returns the value for data_dir under [Smash] section.

get_conf_value($section, $key)

Get a value from the config file for any key $key under [$section] section. For example,

        $smash->get_conf_value("Taxonomy", "data_dir");

returns the value for data_dir under [Taxonomy] section.
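
To make the section and key lookup concrete, here is a minimal sketch of an INI-style parser. The "key = value" layout, the helper name parse_ini_text and the paths are assumptions made for illustration; the actual smash.conf format and parser may differ.

```perl
use strict;
use warnings;

# Parses INI-style text into a hash of hashes: $config->{$section}{$key}.
# The "key = value" layout is an assumption about the file format.
sub parse_ini_text {
    my ($text) = @_;
    my (%config, $section);
    for my $line (split /\n/, $text) {
        next if $line =~ /^\s*(?:[#;].*)?$/;        # skip blank lines and comments
        if ($line =~ /^\s*\[([^\]]+)\]\s*$/) {      # [Section] header
            $section = $1;
        }
        elsif (defined $section && $line =~ /^\s*([^=\s]+)\s*=\s*(.*?)\s*$/) {
            $config{$section}{$1} = $2;             # key = value
        }
    }
    return \%config;
}

# Hypothetical values, for illustration only.
my $config = parse_ini_text(<<'END');
[Smash]
data_dir = /hypothetical/smash_data

[Taxonomy]
data_dir = /hypothetical/taxonomy
END

print $config->{Smash}{data_dir}, "\n";      # same key, [Smash] section
print $config->{Taxonomy}{data_dir}, "\n";   # same key, [Taxonomy] section
```

The same key can hold different values in different sections, which is why get_conf_value takes the section name as its first argument.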

Smash-specific string parsing

parse_metagenome_id

Parses a metagenome id and returns the metagenome collection id. For example,

        $smash->parse_metagenome_id("MC2.MG1");

returns "MC2".

parse_assembly_id

Parses an assembly id and returns the metagenome collection id and metagenome id. For example,

        $smash->parse_assembly_id("MC2.MG1.AS1");

returns ("MC2", "MC2.MG1").

parse_genepred_id

Parses a genepred id and returns the metagenome collection id, metagenome id and assembly id. For example,

        $smash->parse_genepred_id("MC2.MG1.AS1.GP1");

returns ("MC2", "MC2.MG1", "MC2.MG1.AS1").
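
The ids are hierarchical, so each parse function effectively returns the shorter dot-separated prefixes of its argument. A minimal sketch of that idea, using a hypothetical helper id_prefixes that is not a Smash::Core method:

```perl
use strict;
use warnings;

# Returns every proper prefix of a dot-separated id, shortest first.
# id_prefixes is a hypothetical helper, not a Smash::Core method.
sub id_prefixes {
    my ($id) = @_;
    my @parts = split /\./, $id;
    return map { join '.', @parts[0 .. $_] } 0 .. $#parts - 1;
}

print join(', ', id_prefixes('MC2.MG1.AS1.GP1')), "\n";   # MC2, MC2.MG1, MC2.MG1.AS1
print join(', ', id_prefixes('MC2.MG1')), "\n";           # MC2
```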

Data location

data_dir

Returns the location of Smash data repository.

read_dir

Returns the raw data location for the given metagenome id.

assembly_dir

Returns the assembled data location for the given assembly id.

genepred_dir

Returns the predicted protein/gene data location for the given gene prediction id.

analyses_dir

Returns the analyses data location for the given metagenome id.

get_blastdb

Returns the full path of a given blast database in the repository. For example,

        $smash->get_blastdb("STRING7");

will return the full path like /home/smash/data_repos/databases/STRING7.

maui_dir

Returns the path of the maui program installation for this host or cluster.

DBI interface with database engine

get_db_handle

Returns a database handle to the collection-specific database of this object's metagenome collection.

get_smashdb_handle

Returns the database handle to the SmashDB meta-database.

last_db_insert_id

Returns the most recent value generated for an autoincrement primary key in the collection-specific database. This wraps the database-specific calls, so callers do not depend on the database engine in use.

Methods accessing database

get_metagenomes_for_collection

Returns a list of metagenomes present in the given collection.

get_metagenome_label

Returns the external label for an internal metagenome id.

get_metagenome_description

Returns the description (specified when creating the metagenome) for an internal metagenome id.

get_refsequence_details($seq_id)

Returns (taxonomy_id, definition, length, display_tax_id) as a hash, given a sequence identifier.

TO DO

clean up here

get_metagenome_files($metagenome, $extension)

Returns an array of all files in the read directory with the given extension. This function is called by fasta_files($metagenome), qual_files($metagenome) and xml_files($metagenome) as get_metagenome_files($metagenome, "fasta"), get_metagenome_files($metagenome, "qual") and get_metagenome_files($metagenome, "xml") respectively. Can be extended by the user to retrieve any kind of files.

fasta_files($metagenome)

Returns a list of read fasta files corresponding to this metagenome.

qual_files($metagenome)

Returns a list of read quality files corresponding to this metagenome.

xml_files($metagenome)

Returns a list of read xml files corresponding to this metagenome.

get_read_lengths($metagenome)

Returns a hash containing read length information grouped into templates. For example, if paired end reads for template GFTRSDA are GFTRSDA.b from the forward primer and GFTRSDA.z from the reverse primer, the values in the hash will be:

        {"GFTRSDA" => {"GFTRSDA.b" => 656, "GFTRSDA.z" => 643}}
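
As a usage illustration, the nested structure can be walked template by template. The sketch below hard-codes the example hash instead of calling get_read_lengths, and total_length_per_template is a hypothetical helper:

```perl
use strict;
use warnings;

# Same shape as the example above: template => { read name => length }.
my %read_lengths = ("GFTRSDA" => {"GFTRSDA.b" => 656, "GFTRSDA.z" => 643});

# Sums the lengths of all reads belonging to each template.
# total_length_per_template is a hypothetical helper.
sub total_length_per_template {
    my ($lengths) = @_;
    my %total;
    for my $template (keys %$lengths) {
        $total{$template} += $_ for values %{ $lengths->{$template} };
    }
    return \%total;
}

my $totals = total_length_per_template(\%read_lengths);
for my $template (sort keys %$totals) {
    my $n_reads = keys %{ $read_lengths{$template} };
    print "$template: $totals->{$template} bases in $n_reads reads\n";
}
# prints "GFTRSDA: 1299 bases in 2 reads"
```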

Generic functions to map between internal and external ids

These functions map between internal and external ids in either direction. They work only for element types whose table structure is simple; the supported types are listed under each function.

get_id_by_name

Returns the internal id for a given element from its corresponding table. Supported element types are: assembly, gene_prediction, contig, scaffold, gene, library. For example,

        $smash->get_id_by_name("contig", "MC1.MG1.AS1.C2");

will return the integer primary key in the contig table.

get_name_by_id

Returns the external id for a given element from its corresponding table. Supported element types are: assembly, gene_prediction, contig, scaffold, gene, library. For example,

        $smash->get_name_by_id("contig", 1532);

where 1532 is the integer primary key in the contig table, returns the external string id of that contig, such as "MC1.MG1.AS1.C2".

Specific functions to get internal ids for external information

These functions get the relevant internal ids given external information. They create an entry if it does not exist, and return the newly created internal id.

get_program_id

Returns an internal id for a software given its name, version and parameters. Creates an entry in the database if necessary. For example,

        $smash->get_program_id("GeneMark", "0.96", "heu_11_gc");

checks if there is a record of GeneMark version 0.96 using parameter key heu_11_gc, and makes one if necessary.

get_sample_id

Returns an internal id for a sample given the metagenome id and metadata of this specific sample. Creates an entry in the database if necessary. For example,

        $smash->get_sample_id("MC2.MG1", "Depth:50m, Temp:14F");

checks if there is a record of sample "MC2.MG1" with associated metadata "Depth:50m, Temp:14F", and makes one if necessary.

get_library_id

Returns an internal id for a library given the sample id, the type of reads and, if applicable, the insert length and standard deviation. Creates an entry in the database if necessary. For example,

        $smash->get_library_id("MC2.BAAU", 23, "sanger", 10000, 3000);

checks if there is a record of library "MC2.BAAU" associated with sample 23, containing sanger reads of insert size 10000 and insert standard deviation 3000, and makes one if necessary.

Creating new entries and removing them

make_new_assembly

Makes a new assembly. Any instance of subclasses of Smash::Analyses::Assembler should call this function to get a new assembly id from Smash. It creates an entry in the assembly table with the program_id of this instance.

remove_assembly_by_name

Removes the assembly from the assembly table given the external id.

safe_make_new_entry

Safely adds a new entry into the SmashDB database. It is called with a prefix, a select statement, an insert statement, and one array reference of arguments for each statement. It first selects the existing entries in the table by executing the select statement:

        $select_sth->execute(@$st1_args);

It then finds the maximum value after removing $prefix, increments it by one, makes the new entry name using

        $entry = "$prefix$max";

and inserts it into the table using

        $insert_sth->execute($entry, @$st2_args);

For example, make_new_assembly() calls this function as follows:

        sub make_new_assembly {
                my $this       = shift;
                my $metagenome = shift;
                my $assembler  = shift;
                my $as         = $this->get_smash_conf_value("assembly_prefix");
                my $prefix     = "$metagenome.$as";

                my $st1        = 'SELECT external_id FROM assembly WHERE metagenome_id=?';
                my $st2        = 'INSERT INTO assembly(external_id, metagenome_id, assembler) VALUES(?, ?, ?)';
                my $assembly   = $this->safe_make_new_entry($prefix, $st1, $st2, [$metagenome], [$metagenome, $assembler]);

                return $assembly;
        }
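
The naming step (strip the prefix, take the maximum, add one) can be sketched without a database. next_entry_name is a hypothetical helper that mirrors only this step; starting the numbering at 1 for an empty table is an assumption, and the sketch does nothing to guard against concurrent writers.

```perl
use strict;
use warnings;

# Given a prefix and the existing external ids, returns the next id.
# next_entry_name is a hypothetical helper; it does no database work.
sub next_entry_name {
    my ($prefix, @existing) = @_;
    my $max = 0;   # assumption: numbering starts at 1 when nothing exists yet
    for my $id (@existing) {
        next unless $id =~ /^\Q$prefix\E(\d+)$/;   # strip the prefix, keep the number
        $max = $1 if $1 > $max;
    }
    return $prefix . ($max + 1);
}

print next_entry_name('MC1.MG1.AS', 'MC1.MG1.AS1', 'MC1.MG1.AS2'), "\n";   # MC1.MG1.AS3
print next_entry_name('MC1.MG1.AS'), "\n";                                 # MC1.MG1.AS1
```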

make_new_genepred

Makes a new gene prediction. Any instance of subclasses of Smash::Analyses::GenePredictor should call this function to get a new gene prediction id from Smash. It creates an entry in the gene_prediction table with the program_id of this instance.

remove_genepred_by_name

Removes the gene prediction from the gene_prediction table given the external id.

Other methods

filter_input_file($input_file, $output_file, $field_idx, $filter_hash)

Selects a subset of input file based on field-level filtering. You can filter a file using a defined list of accepted values in a specified field. Reads $input_file and checks the field at field position $field_idx (zero-based white-space delimited). If this field has a key in %$filter_hash, then that line is printed to $output_file.

For example,

        $smash->filter_input_file($in, $out, 2, {seq1=>1, seq2=>1, seq5=>2});

will print each line in $in that has seq1, seq2 or seq5 in the 3rd column (zero-based field index 2 refers to the 3rd column). Only the keys of the hash matter; the values are ignored.
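
The filtering logic can be sketched on in-memory filehandles. filter_lines is a hypothetical stand-in that takes open filehandles instead of file names, so the example runs without touching the file system:

```perl
use strict;
use warnings;

# Copies lines from $in to $out when the whitespace-delimited field at
# zero-based position $field_idx is a key in %$filter.
# filter_lines is a hypothetical helper, not the Smash::Core method.
sub filter_lines {
    my ($in, $out, $field_idx, $filter) = @_;
    while (my $line = <$in>) {
        my @fields = split ' ', $line;
        next unless defined $fields[$field_idx];
        print {$out} $line if exists $filter->{ $fields[$field_idx] };
    }
    return;
}

my $input  = "a b seq1\nc d seq3\ne f seq5\n";
my $output = '';
open(my $in,  '<', \$input)  or die "Cannot read input: $!";
open(my $out, '>', \$output) or die "Cannot write output: $!";
filter_lines($in, $out, 2, {seq1 => 1, seq2 => 1, seq5 => 2});
close $out;
print $output;   # the lines whose third field is seq1 or seq5
```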

get_gc_percent

Returns the GC percentage of the given string, limited to the given bounds. It is called as:

        $smash->get_gc_percent($string, $lower_bound, $upper_bound);

For example,

        $smash->get_gc_percent("ggggcccccgggcgcgcgcgacggcgcgcc");         
                # 29 of the 30 bases are G or C, which is about 96.7 percent
        $smash->get_gc_percent("ggggcccccgggcgcgcgcgacggcgcgcc", 20, 90); 
                # returns 90 (since 90 is the upper bound)
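
A minimal sketch of such a calculation follows. gc_percent is a hypothetical helper; the default bounds of 0 and 100, the clamping behaviour and the unrounded return value are assumptions, not a description of the actual implementation.

```perl
use strict;
use warnings;

# Percentage of G and C characters in $seq, clamped to [$lower, $upper].
# gc_percent is a hypothetical helper; defaults and clamping are assumptions.
sub gc_percent {
    my ($seq, $lower, $upper) = @_;
    $lower = 0   unless defined $lower;
    $upper = 100 unless defined $upper;
    my $len = length($seq);
    my $gc  = ($seq =~ tr/GCgc//);              # tr counts without modifying $seq
    my $pct = $len ? 100 * $gc / $len : 0;      # avoid dividing by zero
    $pct = $lower if $pct < $lower;
    $pct = $upper if $pct > $upper;
    return $pct;
}

print gc_percent("ggat"), "\n";           # 50
print gc_percent("ggcc", 20, 90), "\n";   # 90, limited by the upper bound
print gc_percent("atat", 20, 90), "\n";   # 20, limited by the lower bound
```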

pretty_fasta

Returns a string broken into a fixed number of characters per line. Useful for formatting Fasta sequences. By default it breaks them into 80 characters per line, but this can be specified in the function call.

        $smash->pretty_fasta($string);     
                # returns a string with newline every 80 characters
        $smash->pretty_fasta($string, 50); 
                # returns a string with newline every 50 characters
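
The wrapping itself can be sketched with a single substitution. wrap_sequence is a hypothetical helper; it assumes the input contains no newlines, and whether the real method appends a final newline is not stated in this document.

```perl
use strict;
use warnings;

# Inserts a newline after every $width characters (default 80).
# wrap_sequence is a hypothetical helper; the trailing newline is an assumption.
sub wrap_sequence {
    my ($seq, $width) = @_;
    $width = 80 unless defined $width;
    (my $wrapped = $seq) =~ s/(.{1,$width})/$1\n/g;
    return $wrapped;
}

print wrap_sequence("acgtacgtac", 4);
# prints:
# acgt
# acgt
# ac
```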

pretty_qual

Returns a string broken into a fixed number of quality values per line. Useful for formatting quality sequences. By default it breaks them into 17 values per line, but this can be specified in the function call. The input string is a space-delimited set of integers.

        $smash->pretty_qual($string);
                # returns a string with newline every 17 quality values
        $smash->pretty_qual($string, 10);
                # returns a string with newline every 10 quality values
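
Unlike sequence wrapping, quality wrapping has to count values, not characters, because a quality value may have one or two digits. A minimal sketch with a hypothetical helper, wrap_quality:

```perl
use strict;
use warnings;

# Puts $per_line space-delimited values on each line (default 17).
# wrap_quality is a hypothetical helper, not the Smash::Core method.
sub wrap_quality {
    my ($qual, $per_line) = @_;
    $per_line = 17 unless defined $per_line;
    my @values = split ' ', $qual;
    my @lines;
    while (my @chunk = splice(@values, 0, $per_line)) {
        push @lines, join(' ', @chunk);
    }
    return @lines ? join("\n", @lines) . "\n" : '';
}

print wrap_quality("40 40 38 9 12", 2);
# prints:
# 40 40
# 38 9
# 12
```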

Non-object-oriented functions

get_median(@list)

Returns the median value from the given list.
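
A minimal sketch of a median calculation, called as a plain function in the same way. The name median is hypothetical, and averaging the two middle values for a list of even length is an assumption; this document does not say how get_median handles that case.

```perl
use strict;
use warnings;

# Median of a list of numbers; returns nothing for an empty list.
# Averaging the two middle values of an even-length list is an assumption.
sub median {
    my @sorted = sort { $a <=> $b } @_;
    return unless @sorted;
    my $mid = int(@sorted / 2);
    return @sorted % 2
        ? $sorted[$mid]
        : ($sorted[$mid - 1] + $sorted[$mid]) / 2;
}

print median(3, 1, 2), "\n";      # 2
print median(4, 1, 3, 2), "\n";   # 2.5
```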

DEPENDENCIES

Smash requires certain perl modules to be installed on your local system. The most notable ones are:

FAlite, FQlite, XML::Parser, DBI

Depending on the database you use, DBD::mysql or DBD::SQLite must also be installed.

FAlite (courtesy of Ian Korf, ifkorf@ucdavis.edu) and FQlite are included in the Smash distribution.
