NAME

MPBLAST - multiplex BLAST

SYNOPSIS

 mpblast [hwsnmltb] <blast command line>

DESCRIPTION

MPBLAST improves the performance of BLAST searches by combining short queries into one multiplex query. For example, instead of executing BLASTN 100 times with 500 bp queries, you execute BLAST once on a multiplexed 50,000 bp query. The optimal size of a multiplex query is about 100,000 characters. Therefore, if you execute MPBLAST with 10,000 queries of 500 bp each (5,000,000 bp in total), it breaks them up into 50 separate searches of 100,000 bp each. The size of each multiplex can be changed with a commandline option (see -m below). Each version of BLASTN has its own optimal multiplex size, so you may want to experiment to find the best size for your use.
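The batching arithmetic above can be sketched in a few lines of Python. This is only an illustration of the idea, not MPBLAST's actual code: the function name batch_queries and the max_chars parameter are hypothetical, and the sketch ignores the length of the segment separators.

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (query name, sequence)


def batch_queries(records: Iterable[Record],
                  max_chars: int = 100_000) -> Iterator[List[Record]]:
    """Group queries so that each batch holds at most max_chars of sequence.

    Hypothetical helper: separator lengths are not counted here.
    """
    batch: List[Record] = []
    size = 0
    for name, seq in records:
        # Start a new batch when this query would push the total past the limit.
        if batch and size + len(seq) > max_chars:
            yield batch
            batch, size = [], 0
        batch.append((name, seq))
        size += len(seq)
    if batch:
        yield batch


# 10,000 queries of 500 bp each, as in the example in the text.
queries = [(f"q{i}", "A" * 500) for i in range(10_000)]
batches = list(batch_queries(queries))
print(len(batches), len(batches[0]))  # prints "50 200"
```

Each batch of 200 queries corresponds to one BLAST search on a 100,000 bp multiplex.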

COMMANDLINE OPTIONS

-w -n

Sets the type of BLAST to either WU-BLAST (-w) or NCBI-BLAST (-n). If neither flag is set, MPBLAST will guess based on the program name.

-l -s

The length of the segment separator can be set with the -l option. The separator is used to prevent alignments from crossing the single-sequence boundaries in the multiplex query. The default is to use 100 -'s for WU-BLAST and 100 N's for NCBI-BLAST.

If using WU-BLAST, you can use the -s option to make the multiplex segmentation more efficient. -s makes each segment separator a single '-' instead of a 100-character string.
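To make the role of the separator concrete, here is a minimal Python sketch of how a multiplex might be assembled and how a position in it could be mapped back to the original sequence. It is an assumption about the general technique, not a description of MPBLAST's internals; build_multiplex and locate are hypothetical names.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple

Record = Tuple[str, str]  # (query name, sequence)


def build_multiplex(seqs: List[Record], sep_char: str = "N",
                    sep_len: int = 100) -> Tuple[str, List[int]]:
    """Join sequences with a separator; return the multiplex and the
    0-based start of each segment within it (hypothetical helper)."""
    parts: List[str] = []
    starts: List[int] = []
    pos = 0
    for i, (_name, seq) in enumerate(seqs):
        if i > 0:
            # Separator between neighbouring segments only.
            parts.append(sep_char * sep_len)
            pos += sep_len
        starts.append(pos)
        parts.append(seq)
        pos += len(seq)
    return "".join(parts), starts


def locate(pos: int, starts: List[int],
           seqs: List[Record]) -> Optional[Tuple[str, int]]:
    """Map a 0-based multiplex position to (name, offset in that sequence),
    or None if the position falls inside a separator."""
    i = bisect_right(starts, pos) - 1
    name, seq = seqs[i]
    offset = pos - starts[i]
    return (name, offset) if offset < len(seq) else None


seqs = [("a", "ACGT"), ("b", "GGCC")]
multiplex, starts = build_multiplex(seqs, sep_char="N", sep_len=3)
print(multiplex)                 # prints "ACGTNNNGGCC"
print(locate(8, starts, seqs))   # prints "('b', 1)"
print(locate(5, starts, seqs))   # prints "None"
```

A short separator (as with -s) keeps the multiplex compact; a long one costs more characters per segment but works with either BLAST flavor as described above.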

-m

The default length of multiplexes is 100,000 bp. Empirically, this value works well. The optimal length depends on various factors, though, and there may be applications where smaller or larger values are more suitable. For example, on machines with little RAM, -m should be set smaller (also see -b below). I have noticed on some platforms that the optimal multiplex size for NCBI-BLAST is more than 200,000 bp.

-t -b

The default output format is space-delimited fields (see below). You can change the delimiter with the -t option.

The -b option makes the output look like a BLAST report.

OUTPUT FORMATS

MPBLAST has two output formats. The default is tabular. You may set the field delimiter with the -t option. The default is a single space.

The column definitions are as follows:

  1: query begin
  2: query end
  3: query name
  4: sbjct begin
  5: sbjct end
  6: sbjct name
  7: raw score
  8: bits (normalized score)
  9: E-value
 10: P-value
 11: percent identity
 12: number of matches
 13: number of positive scores (similarities)
 14: length of alignment
 15: length of query
 16: length of sbjct
 17: number of gaps in query alignment
 18: number of gaps in sbjct alignment

Each row of the table corresponds to an alignment. Blank lines separate query sequences.
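As an illustration of reading this table, the following Python sketch parses the tabular output into one group of rows per query. The field names in COLUMNS are hypothetical labels for the 18 columns listed above, the sample line is invented for the example, and the sketch assumes that sequence names contain no delimiter characters.

```python
from typing import Dict, Iterable, Iterator, List, Union

# Hypothetical labels for the 18 documented columns, in order.
COLUMNS = [
    "q_begin", "q_end", "q_name", "s_begin", "s_end", "s_name",
    "raw_score", "bits", "e_value", "p_value", "pct_identity",
    "matches", "positives", "align_len", "q_len", "s_len",
    "q_gaps", "s_gaps",
]
NAME_FIELDS = {"q_name", "s_name"}

Row = Dict[str, Union[str, float]]


def parse_table(lines: Iterable[str],
                delimiter: str = " ") -> Iterator[List[Row]]:
    """Yield one list of rows per blank-line-separated group."""
    group: List[Row] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line closes the current group of alignments.
            if group:
                yield group
                group = []
            continue
        fields = line.split(delimiter)
        if len(fields) != len(COLUMNS):
            raise ValueError(f"expected {len(COLUMNS)} fields, got {len(fields)}")
        # Names stay as text; every other column is treated as a number.
        group.append({
            key: (value if key in NAME_FIELDS else float(value))
            for key, value in zip(COLUMNS, fields)
        })
    if group:
        yield group


# Invented sample data, for illustration only.
sample = (
    "1 500 q1 1001 1500 s1 2500 451.2 1e-120 1e-120 100 500 500 500 500 20000 0 0\n"
    "\n"
    "1 480 q2 3001 3480 s7 2300 415.0 1e-110 1e-110 98 470 470 480 500 9000 0 0\n"
)
groups = list(parse_table(sample.splitlines()))
print(len(groups), groups[0][0]["s_name"], groups[1][0]["pct_identity"])
```

If a different delimiter was chosen with -t, pass the same string as the delimiter argument.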

The other format looks like concatenated WU-BLAST reports. Between each report are tags that identify each segment of the multiplex.

 MPBLAST SEGMENT 0 START
 :
 : (the blast report)
 :
 MPBLAST SEGMENT 0 END
 
 MPBLAST SEGMENT 1 START
 :
 MPBLAST SEGMENT 1 END

The -b switch turns on BLAST-style output. Note that this takes more memory.
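A script that consumes the BLAST-style output might split it on the segment tags. The Python sketch below assumes the tags appear exactly as in the example above ("MPBLAST SEGMENT n START" and "MPBLAST SEGMENT n END", each on its own line); split_segments is a hypothetical helper name and the report lines in the sample are placeholders.

```python
import re
from typing import Dict, Iterable, List

# Tag format assumed from the example in this document.
TAG = re.compile(r"^\s*MPBLAST SEGMENT (\d+) (START|END)\s*$")


def split_segments(lines: Iterable[str]) -> Dict[int, List[str]]:
    """Collect the report lines found between each START/END tag pair."""
    segments: Dict[int, List[str]] = {}
    current = None
    for line in lines:
        match = TAG.match(line)
        if match:
            index, kind = int(match.group(1)), match.group(2)
            if kind == "START":
                current = index
                segments[current] = []
            elif current == index:
                current = None
            continue
        # Lines outside any segment (such as blank lines between tags) are skipped.
        if current is not None:
            segments[current].append(line.rstrip("\n"))
    return segments


# Placeholder report text, for illustration only.
sample_report = [
    " MPBLAST SEGMENT 0 START",
    " (report for segment 0)",
    " MPBLAST SEGMENT 0 END",
    " ",
    " MPBLAST SEGMENT 1 START",
    " (report for segment 1)",
    " MPBLAST SEGMENT 1 END",
]
segments = split_segments(sample_report)
print(sorted(segments), segments[1])
```

Each value in the returned dictionary can then be handed to whatever BLAST report parser you already use.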

PERFORMANCE

The performance improvement is typically around 10x, but this depends on several factors. These include the size of each sequence in the multiplex, the size of the multiplex, the size of the database relative to RAM (caching or thrashing), the similarity between the sequences in the multiplex, and the version of BLAST. In my tests, WU-BLAST is faster than NCBI-BLAST, but NCBI-BLAST benefits more from multiplexing.

LIMITATIONS

Various BLAST commandline options are not supported or do not behave in the expected manner. For example, -V and -B (-v and -b in NCBI-BLAST) are disabled. Combining HSPs with Sum or Poisson statistics is not supported.

SEE ALSO

 WU-BLAST (http://blast.wustl.edu)
 NCBI-BLAST (http://www.ncbi.nlm.nih.gov)

AUTHORS

 Ian Korf (http://sapiens.wustl.edu/~ikorf)

ACKNOWLEDGEMENTS

This software was developed at the Genome Sequencing Center at Washington University, St. Louis, MO.

COPYRIGHT

Copyright (C) 2000 Ian Korf. All Rights Reserved.

DISCLAIMER

This software is provided "as is" without warranty of any kind. This software may not be redistributed without permission of the authors.