NAME

        submitJobs.pl -- A wrapper script to submit jobs to batch queueing systems.

SYNOPSIS

        submitJobs.pl [options] file

OPTIONS

--name

name of the job (default: the first 15 characters of the script file name)

--type

type of batch queueing system (SGE|PBS)

--queue

name of the queue in the batch queueing system (default: default queue in the system)

--memory

maximum physical memory, in megabytes (MB), allowed for this job (default: 2000 MB)

--cpus

number of CPUs to request from the queueing system (default: 1)

--single

bundle the contents of a file into a single job (default: no)

--group

bundle jobs by grouping <n> lines into a single job (default: 1)

--extra_args

extra arguments to pass to qsub (these must be understood by the qsub command of the target system)

--help

print this help
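
For example, a hypothetical invocation combining several of these options (the queue name all.q, the job name myjob and the file name job.queue are placeholders):

        submitJobs.pl --type=SGE --queue=all.q --memory=4000 --name=myjob job.queue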

DESCRIPTION

submitJobs.pl parses an input file that contains one independent UNIX/Linux command per line and submits each line as a job to the queueing system. It is designed to handle one-line commands, not multi-line shell scripts.

One-line commands, not shell scripts

For example, running submitJobs.pl on a file containing

        perl parse_file.pl input_file1 > output_file1
        perl parse_file.pl input_file2 > output_file2
        perl parse_file.pl input_file3 > output_file3
        perl parse_file.pl input_file4 > output_file4

will submit four jobs to the queue, each containing one complete UNIX shell command. However, if you submit the following, which is equivalent in the bash shell,

        for i in 1 2 3 4; do
                perl parse_file.pl input_file$i > output_file$i;
        done

it will not work, because submitJobs.pl submits one job per line, producing the following three jobs:

Job 1:

        for i in 1 2 3 4; do

Job 2:

                perl parse_file.pl input_file$i > output_file$i;

Job 3:

        done

This is clearly not what was intended. In such cases, write the shell script on a single line:

        for i in 1 2 3 4; do perl parse_file.pl input_file$i > output_file$i; done

If the script is long, this quickly becomes unreadable; in that case, you can use

        submitJobs.pl --single infile
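
With --single, the entire contents of infile are bundled into one job, so a multi-line script runs intact. A hypothetical invocation, assuming the for loop above has been saved in a file named loop.sh:

        submitJobs.pl --single --type=SGE --name loop loop.sh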

Extra arguments to qsub

Sometimes it is necessary to send special arguments to the scheduler through qsub. These can be passed to qsub with --extra_args. Examples include requesting a specific number of CPUs, requesting a specific host, or submitting the job with a hold that the user may release later.

Submitting a job with a hold

        submitJobs.pl --type=SGE --name hold --extra_args="-h" job.queue

will submit the commands in the file job.queue with a user hold, which must be released with qrls before the job will run. The same argument works on PBS systems as well.
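
To release the hold later, call qrls with the job ID that qsub prints at submission time (the ID 123456 below is illustrative):

        qrls 123456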

Submitting a multithreaded/multicpu job on PBS

        submitJobs.pl --type=PBS --name multicpu --extra_args="-l select=1:ncpus=4" job.queue

will submit the jobs to the PBS system and ask for 4 CPUs on a multiprocessor execution host. This is useful when running BLAST with multiple threads. For example, the following NCBI BLAST command

        blastall -p blastp [...] -a 4

or the following WU-BLAST/AB-BLAST command

        blastp db query.fa [...] cpus=4

will use 4 threads during the BLASTP search. If your job runs on a host with 4 CPUs and you did not tell the scheduler that you need all 4, it will assume the job needs just one CPU and will fill the execution host with three other jobs. Worse yet, if four of your BLAST jobs all start on the same execution host, they will run 16 threads altogether on a host that has only 4 CPUs, which slows your searches down through process interleaving. It is therefore very important to tell the scheduler how many CPUs your job requires, so that it reserves that many CPUs and does not overload the execution host.
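
Putting this together, a minimal sketch of a complete submission (the database name nr and the file names query.fa, query.out and job.queue are placeholders):

        echo "blastall -p blastp -d nr -i query.fa -o query.out -a 4" > job.queue
        submitJobs.pl --type=PBS --name blast4 --extra_args="-l select=1:ncpus=4" job.queue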

Submitting a multithreaded/multicpu job on SGE

On SGE, using multiple CPUs requires a parallel environment to be set up with allocation_rule set to $pe_slots. For example, a cluster of 25 execution hosts with four CPUs each might have an environment named multithread with the following configuration:

        pe_name           multithread
        slots             100
        user_lists        NONE
        xuser_lists       NONE
        start_proc_args   /bin/true
        stop_proc_args    /bin/true
        allocation_rule   $pe_slots
        control_slaves    FALSE
        job_is_first_task TRUE
        urgency_slots     min

This configuration allows users to submit jobs that require 4 CPUs as follows:

        submitJobs.pl --type=SGE --name multicpu --extra_args="-pe multithread 4" job.queue

Note

The entry slots in the above configuration is the total number of CPU slots that can be occupied by jobs using the multithread parallel environment. Since there are 25 hosts with four CPUs each, you could set it to 100. In theory, however, a user may submit a job with -pe multithread 100. That job would never run, since no execution host has 100 CPUs. Users must remember to ask for no more CPUs than the individual execution hosts actually have.

Output and error files

SGE

Standard output and standard error captured by the execution host will be written to files as follows:

        stdout: <name>.o<jobid>.<taskid>
        stderr: <name>.e<jobid>.<taskid>
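
For example, a job submitted with --name multicpu that is assigned job ID 4321 would produce files like these for task 1 (the job and task IDs below are illustrative):

        multicpu.o4321.1
        multicpu.e4321.1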

If you have asked for a parallel environment on SGE, you will also find:

        stdout: <name>.po<jobid>.<taskid>
        stderr: <name>.pe<jobid>.<taskid>