Documentation, examples, tutorials and more

<<

NAME

Smash::Analyses::GenePredictor - extensible gene prediction software pipeline

DESCRIPTION

Smash::Analyses::GenePredictor is the parent class for all gene prediction software wrappers in Smash. It provides a working pipeline for performing gene prediction on a DNA fasta file and generating the following output files:

  • gff file with gene coordinates,
  • gene fasta file,
  • protein fasta file.

Smash::Analyses::GenePredictor, hereafter referred to as GenePredictor, provides an easy interface to Smash to make gene predictions on DNA fasta files. The minimal steps in the pipeline in GenePredictor are:

        new();
        init();
        predict();
        finish();

KNOWN SUBCLASSES

Smash::Analyses::GenePredictor::GeneMark, Smash::Analyses::GenePredictor::MetaGene.

MEMBER VARIABLES

The following member variables are set by GenePredictor at runtime, from the values passed to it during creation of the object. Any subclass can assume that these will be populated.

predictor

name of the software

self_train

numeric value specifying if this instance must self-train (if supported by the program)

organism

the organism where the sequence comes from

translation_table

translation table corresponding to organism

fasta_filename

name of the input fasta file to make gene predictions

gff

gff file containing gene coordinates

output

concatenated raw output file for the predictor

gene

fasta file containing predicted gene transcripts (DNA space)

protein

fasta file containing predicted proteins (aminoacid space)

FUNCTIONS

The following functions are implemented in GenePredictor. Any subclass can implement its own versions of these functions. I recommend that subclasses perform their steps, and also call SUPER::function() to let GenePredictor do what it ought to. If they do not call the function from the superclass, it is upto them to additionally implement the functionalities of the superclass.

new

This is a default constructor for GenePredictor or its subclasses. It blesses the hash argument into an object of GenePredictor class or its subclass.

init

This function initializes the relevant member variables for GenePredictor class or its subclasses. It also initializes the input and output files that were passed to new().

predict (now obsolete)

This is the workhorse in the pipeline. This function creates filehandles for all output files, and writes to them as it processes the input file. It reads each sequence in the input fasta file and runs gene prediction on them individually. (Aside: This was a design decision due to the fact that some gene predictors, like GeneMark, do not recognize multi-fasta files. For scalability of Smash and to maintain its modularity, I chose to process sequences one after the other.) It reads a DNA sequence entry, checks if it is long enough for the program, writes it into a file, passes it to the subclass through get_command_line to get a command-line to execute, and then executes it.

make_genes_and_proteins_wrapper

Makes a fasta file containing gene sequences (see gene above) and another one containing the protein sequences (protein above). It used the input fasta file (see fasta_filename above) and the GFF file containing gene coordinates (see gff above) to do this. Subclasses are recommended to leave this as is and stop at making the GFF file. However, if the program itself generates the transcripts and the protein sequences, and the subclass decides to parse it as well, then these should be written to files whose names can be obtained by calling gene and protein, respectively. If this is the case, then make_genes_and_proteins_wrapper must be overridden using an empty function, so that GenePredictor does not call it again to make transcripts and proteins.

finish

Closes all open handles and connections and calls the finish function of Smash.

abort()

Aborts the current run after removing this genepred id from the database.

run

Runs the instance of this GenePredictor.

FUNCTIONS REQUIRED IN SUBCLASSES

The following functions must be implemented by subclasses, since these will be queried during every instance they are used.

is_multifasta_compatible()

Returns 1 if this program can handle multifasta files correctly (see later).

is_trainable

Returns 1 if this program is trainable.

parameter_settings

Returns a unique representation of the parameter settings used in this instance.

Each gene prediction programs behaves differently when given a multifasta file as input. For example, when given such a file, GeneMark ignores all the fasta header lines and treats the input as a single sequence! (This behavior has changed in MetaGeneMark). But MetaGene processes each sequence in the fasta file as a different sequence, as it should. To accommodate both kinds of programs, Smash::Analyses::GenePredictor provides two different types of wrappers. Every subclass should implement a function is_multifasta_compatible(). Wrappers for programs that correctly handle multifasta files (like MetaGene) should return 1 for this call. Wrappers for programs that cannot, such as GeneMark, should return 0.

Multifasta Compatible Subclasses

Ideally, you need just one system call to make predictions on the whole file. If self-training is involved, that would make it perhaps two steps. Since this is straightforward, we require these subclasses to implement just one function. This function, called predict_multifasta() takes no arguments, but can access the names of input file, output file, output GFF file from it's parameters. It is expected to make gene predictions on the input file, and write these predictions in GFF format in the output GFF file. Additionally, the subclass should check if self_train is set to 1, in which case, it should train parameters using the input file if it has the ability to do so.

predict_multifasta()

Takes no arguments. Makes gene predictions on the input fasta file fasta_filename, and writes the raw output from the program to output and GFF format output to gff.

Multifasta Incompatible Subclasses

Such programs are difficult to handle. You have to break the multifasta file into many fasta files containing one sequence each, and then call the gene predictor on each such file. Therefore, GenePredictor provides these necessary steps, leaving the subclass to concentrate on just running the program and parsing the output to generate GFF files. At a very high level, GenePredictor does the following:

        1. open $INPUT fasta file to read
        2. open $GFF_FH to write
        3. foreach sequence in $INPUT
        4.      decide $output to write the output to
        5.      write sequence to $file
        6.      $command = get_command_line($file, $output, $parameters)
        7.      system($command);
        8.      parse $output and write GFF format lines to $GFF_FH
        9. end

Thus any multifasta incompatible subclass of GenePredictor must implement the following functions:

get_command_line($input_file, $output_file, $parameters)

Command to run given the name of the input file name containing DNA sequence in FASTA format. If the program needs to decide the command line based on features of the DNA sequence (e.g., GeneMark uses a heuristic model depending on GC content of the sequence), the subclass is free to open the file to obtain the required information. It must close the file as well after processing. A suggested implementation could be:

        sub get_command_line {
                my $this       = shift;
                my $filename   = shift;
                my $output     = shift;
                my $parameters = shift;
                my $option1    = process_something_in($filename);
                return "/somewhere/BestGenePredictor --option1=$option1 --option2 $parameters --input $filename > $output";
        }
The last argument ($parameters) is used to specify special parameters to get the command line. For example, one could send different values for self-trained or generic parameter settings.

create_gene_gff

Writes GFF format lines to the given filehandle after parsing the raw output of the gene prediction program. It is usually called as:

                $this->create_gene_gff($gff_handle, $actual_seqname, $raw_prediction_output);
Subclasses must parse the raw output in $raw_prediction_output and write GFF lines containing gene coordinates to the filehandle in $gff_handle. The subclass must not close tha handle. Since some gene predictors do not write the sequence definition in the prediction output, the definition is passed as $actual_seqname.

Conditionally mandatory functions

Multifasta incompatible subclasses of GenePredictor must also implement the following functions.

train_min

Minimum total length of input DNA that is required to train a model under this program.

get_self_trained_settings

This function trains the gene predictor using the input data and returns a token that must be recognized by get_command_line() which then will formulate the right command line to use the parameters that resulted from this training step. A typical call to this function would be followed by a call to get_command_line, like so:

        my $settings = $this->get_self_trained_settings();
        my $command  = $this->get_command_line($fasta_file, $settings);
        system($command);
In this example, get_command_line should make a command that would use the settings returned by the previous call to get_self_trained_settings. It is upto the subclass to implement the finer details of this process. For example, since GeneMark has two steps in the self-training process (making the new model after self-training, then predicting using that model), the wrapper module for GeneMark implements get_self_trained_settings by training a model and making the command line options that will use that model.

get_generic_settings

This is similar to get_self_trained_settings above, but returns the token that must be recognized by get_command_line to generate the correct commandline for non self-train mode.

Suggested functions

Any subclass of Smash::Analyses::GenePredictor is strongly advised to implement the following functions:

new

If the constructor needs to be more specific than the one provided in GenePredictor, the subclass could implement its own constructor. But it should leave the blessing of the object to the parent class' constructor as the last step in its constructor function. It should call the parent object's new method using $this->SUPER->new(@_). If the subclass does not implement new, then the constructor for the GenePredictor is called.

min_length

Minimum DNA sequence length required by the program to make gene predictions. If this function is not implemented, GenePredictor sets it to 0 so that every sequence will be passed to the gene predictor.

translate($dna, $has_start_codon, $name)

returns the protein translation of the given DNA sequence ($name is optional and is only used in reporting errors so that you can debug which sequence had the error).

get_best_frame($transcript)

returns the frame of the longest ORF (if the longest ORF still has a STOP codon, returns negative frame). NOTE: The range it returns is [1,3] so that a negative number is visible. If the range was [0,2] then the in-frame STOP codon diagnosis won't work. If you wanted it in [0,2] just subtract one after taking the absolute value.

<<