Name

doGenePrediction.pl - Make gene predictions on a sequence assembly in Smash

Synopsis

        doGenePrediction.pl [options]

Options

--predictor

name of the gene predictor (GeneMark|MetaGene) (required)

--version

run the specified version of the program, if available

--assembly

assembly id (either --assembly or --genepred must be specified)

--genepred

gene prediction id (either --assembly or --genepred must be specified)

--fasta_file

fasta file

--output_dir

directory where the output files should be stored

--label

label for trained parameters (used with --self_train)

--self_train

train parameters using sequences (default: false)

--parallelize

parallelize gene prediction by breaking the input into smaller files

--cluster

cluster on which to run the parallel prediction jobs

--pkg_dir

location where the gene predictor <program> is installed

--help

Prints this manual.

One of (--assembly) or (--genepred) must be specified.

Description

doGenePrediction.pl is a wrapper script to run gene prediction on a given metagenome assembly.

A normal execution of this script would be:

        doGenePrediction.pl --assembly=MC20.MG1.AS1 --predictor=GeneMark \
            --version=2.6r --self_train

When you parallelize this run with --parallelize, the script generates two shell scripts that must be run separately: the predictor script and the loader script. Run the predictor script first, potentially on a cluster, where each line can go to a different host and all lines can run simultaneously. Once all of them have finished, run the loader script.

For example, if you ran:

        doGenePrediction.pl --assembly=MC20.MG1.AS1 --predictor=GeneMark \
            --version=2.6r --self_train --parallelize

it could generate two shell script files: MC20.MG1.AS1.GP1.pred.sh and MC20.MG1.AS1.GP1.load.sh. If you have a script qsub_lines that submits each line in a file as a separate job to qsub, you would run:

        qsub_lines MC20.MG1.AS1.GP1.pred.sh

and when all the jobs finish, you would run:

        qsub_lines MC20.MG1.AS1.GP1.load.sh
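
The text above assumes that a qsub_lines script already exists. A minimal sketch of such a helper, written as a shell function, could look like the following. It relies on qsub reading the job script from standard input when no script file is given. The SUBMIT_CMD variable and the skipping of blank and comment lines are assumptions made for this illustration and are not part of Smash.

```shell
# Hypothetical helper: submit every command line of a file as its own job.
# SUBMIT_CMD (an assumption of this sketch) defaults to qsub; set it to
# another command, e.g. cat, for a dry run.
qsub_lines() {
    script_file=$1
    submit=${SUBMIT_CMD:-qsub}
    # Read the file line by line, including a last line without a newline.
    while IFS= read -r line || [ -n "$line" ]; do
        # Skip blank lines and shell comments.
        case $line in
            ''|'#'*) continue ;;
        esac
        # Feed the single command to the submit program on standard input,
        # so that each line becomes a separate job.
        printf '%s\n' "$line" | $submit
    done < "$script_file"
}
```

Setting SUBMIT_CMD=cat before calling the function prints the lines that would be submitted without contacting the scheduler, which is a cheap way to inspect a generated script first.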

If you want Smash to manage the whole process, you can specify the name of a cluster to send these jobs to. For example, assuming you have an SGE grid that you can submit jobs to,

        doGenePrediction.pl --assembly=MC20.MG1.AS1 --predictor=GeneMark \
            --version=2.6r --self_train --parallelize --cluster=SGE

will submit the jobs to the default SGE queue for the execution host. Two jobs will be submitted: one for gene prediction, and one for loading the gene predictions. The loader job will only start after the prediction jobs finish.
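
If you would rather submit the generated scripts by hand but still want the loader to wait for the predictor, the same ordering can be expressed with the SGE qsub options -N (job name) and -hold_jid (hold until the named job finishes). The sketch below only prints the two qsub commands. The function name submit_pair and the pred_/load_ job names are hypothetical, and this is not necessarily how Smash itself submits the jobs. Note that it submits the predictor script as a single job, so its lines run one after another inside that job.

```shell
# Hypothetical sketch: print qsub commands for a predictor/loader pair.
# $1 is the common prefix of the generated <prefix>.pred.sh and
# <prefix>.load.sh files.
submit_pair() {
    prefix=$1
    # -cwd runs the job in the current directory; -N gives it a name.
    printf 'qsub -cwd -N pred_%s %s.pred.sh\n' "$prefix" "$prefix"
    # -hold_jid keeps the loader queued until the named job has finished.
    printf 'qsub -cwd -N load_%s -hold_jid pred_%s %s.load.sh\n' \
        "$prefix" "$prefix" "$prefix"
}
```

Calling the function with the prefix of your generated scripts shows the commands for review; piping its output to sh would actually submit them.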
