WU BLAST 2.0 TOPICS

Copyright (c) 2003 Warren R. Gish, Saint Louis, Missouri 63108 USA.

All Rights Reserved.



Description

Washington University BLAST (WU BLAST) version 2.0 is a powerful software package for gene and protein identification, using sensitive, selective and rapid similarity searches of protein and nucleotide sequence databases. WU BLAST 2.0 builds upon WU BLAST 1.4, which in turn was based on the public domain NCBI BLAST version 1.4 (Gish, unpublished, 1994; Altschul et al., 1990; Gish and States, 1993). While NCBI BLAST and WU BLAST 1.4 are in the public domain, WU BLAST 2.0 contains significant new features and extended capabilities, the development of which began in late 1994, at Washington University in Saint Louis. First released in May 1996, or more than a year ahead of the NCBI, WU BLAST 2.0 is the original gapped BLAST with statistics and is known for setting higher standards for sensitivity, speed, correctness and accuracy, scalability and reliability than competing programs and implementations. WU BLAST is not a re-hash of NCBI BLAST and essentially shares no code with it, except for small portions that both packages derived from ungapped NCBI BLAST 1.4.

WU BLAST has been built to be the most trusted database search tool in your software arsenal. Its unique combination of speed, accuracy, efficiency, flexibility, scalability, reliability and consistency across all supported platforms is achieved through careful software coding, the use of extensive error checks, anticipation of future needs, and superior design.

[Note:   In spite of many similar or identical characteristics of the algorithms employed, WU BLAST 2.0 and NCBI Gapped BLAST are distinctly different software packages that, in ways of varying importance, carry out their work differently. Consequently, the two packages often yield different results, particularly in the areas of the default level of sensitivity, details in how the statistics are employed, and occasionally in the accuracy or completeness of the results.]

The feature list for the licensed version of WU BLAST 2.0 is large and continues to expand. Much of this is outlined below. The primary purpose of the freely available version 2.0a19 is to allow users to demonstrate for themselves the effectiveness of using gapped alignments instead of ungapped, when combined with the evaluation of the joint probability of multiple regions of similarity, using Karlin and Altschul (1993) "Sum" statistics. Not surprisingly, with version 2.0a19 one can obtain markedly improved results over version 1.4, primarily due to the introduction of gapped alignments. WU BLAST 2.0a19 executables for several UNIX platforms can be downloaded from http://blast.wustl.edu/blast/executables. The complete suite of search programs (blastp, blastn, blastx, tblastn, and tblastx) is included, as well as several support programs. Users of the freely available version 2.0a19 should keep in mind that its reliability, features, flexibility, scalability and speed are generally not comparable to the licensed version 2.0.

WU BLAST 2.0 is copyrighted and may not be sold, redistributed or modified in any form or by any means, without the express written consent of the Washington University School of Medicine in St. Louis. Other than the aforementioned restrictions, the version 2.0a19 executables posted here may be freely used for commercial, nonprofit, or academic purposes.

DISCLAIMER: THIS SOFTWARE IS PROVIDED "AS IS" WITHOUT WARRANTY OF ANY KIND.


Key Features

Some of the key features of WU BLAST 2.0 are described below, many of which are only available in the licensed version.

In support of XDF, a new database formatting tool named xdformat is provided in the WU BLAST 2.0 package. Among other distinct capabilities and advantages to using XDF and xdformat are:

A reverse chronological list of changes is available in the HISTORY file; however, this file is outdated with respect to licensed versions of WU BLAST 2.0. The reader might also get the unfortunate impression that WU BLAST 2.0 is unreliable, when in fact the licensed version has shown itself to be very robust. Furthermore, any bugs that have been found have typically been fixed within 24 hours of their being reported. For current HISTORY information, licensed users should consult the HISTORY file that is bundled in licensed software distributions.

Please send bug reports, questions, or suggestions to


Manifest

The licensed BLAST 2.0 package includes the following data analysis and utility programs:


To Fly...

If the gapped alignments are nice, but even more speed or less memory use is desired, read how to make the programs fly.


Examples

Here are some sample WU BLAST 2.0 results produced using generally default parameters, plus the oft-recommended low-complexity filter seg and the -postsw option of WU BLASTP 2.0. Exceptions to the defaults are noted and their corresponding results provided, as well.
Default parameters for NCBI blastall were also used, with the exception of using -G7 -E2 to make the scoring system identical to the WU default (penalty of 9 for the first residue in a gap).
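
The equivalence of the two scoring systems can be checked with a little arithmetic. The sketch below assumes the usual affine conventions: WU BLAST charges Q for the first residue of a gap and R for each additional residue, whereas NCBI blastall is assumed to charge an opening cost G plus an extension cost E for every residue in the gap, including the first. The function names are illustrative only and are not part of either package.

```python
def wu_gap_cost(length, q=9, r=2):
    """Gap cost under the WU convention: Q for the first residue, R for each additional one."""
    if length < 1:
        raise ValueError("gap length must be at least 1")
    return q + r * (length - 1)


def ncbi_gap_cost(length, g=7, e=2):
    """Gap cost under the assumed NCBI convention: G to open, plus E for every residue."""
    if length < 1:
        raise ValueError("gap length must be at least 1")
    return g + e * length


# A gap of length one costs 9 under both conventions, and the costs
# stay equal for longer gaps because both grow by 2 per residue.
first_residue_costs = (wu_gap_cost(1), ncbi_gap_cost(1))  # (9, 9)
```

Under these assumptions, Q=9 R=2 and -G7 -E2 describe the same penalty for every gap length, which is why the results of the two programs can be compared on an equal footing.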


New Command Line Options

Command line options for WU BLAST version 1.4 often apply to version 2.0 without change. (See the version 1.4 manual page in Adobe Acrobat (PDF) format). New command line options to version 2.0 include the following. Some of these options are not available in the freely available alpha releases of WU BLAST 2.0. Terse program usage can also be obtained by entering one of the program names on the command line without any arguments.
Note: parsing of command line options is alphabetic case-independent.

Option Description
Q=<q> set the penalty for a gap of length one to q (default Q=9 for proteins; Q=10 for BLASTN)
R=<r> set the per-residue penalty for extending a gap to r (default R=2 for proteins; R=10 for BLASTN)
H=<h> set the value for the relative entropy to be used in Karlin-Altschul statistics of ungapped alignment scores. In earlier versions of BLAST, the H option was used to invoke the display of a histogram.
postsw perform full Smith-Waterman alignment of sequences and re-rank the database matches accordingly, prior to output (currently supported in BLASTP only)
hitdist=<hitdist> invoke a 2-hit BLAST algorithm that is similar to but more sensitive than that of Altschul et al. (1997), with the maximum distance between word hits of <hitdist>. Altschul et al. (1997) use the equivalent of hitdist=40 in the BLASTP, BLASTX, TBLASTN and TBLASTX search modes. In WU BLASTN, setting hitdist=W and wink=W, where W is the word length, is akin to using double-length words generated on W-mer boundaries.
NOTE: in protein-level comparisons, for best sensitivity (or the best sensitivity for the amount of memory used), 2-hit BLAST should generally be avoided.
This option is only available in the licensed 2.0 software.
wink=<wink> generate word hits at every <wink>-th ("W increment") position along the query, where the default wink=1 produces neighborhood words at every position. For best sensitivity, this option (setting wink greater than 1) should not be used. Wink is best used to find identical or nearly identical sequences rapidly. When used in conjunction with the hitdist option to obtain the highest search speed, care should be taken that desirable alignments are not precluded by these parameters. This option is only available in the licensed 2.0 software.
wordmask=<masker> mask letters in the query sequence without altering the sequence itself, during neighborhood word generation.
lcfilter filter lower case letters in the query sequence, by replacing lower case letters with the appropriate ambiguity code (N for nucleotide sequences, X for protein sequences).
lcmask mask lower case letters in the query sequence without altering the sequence, during neighborhood word generation.
maskextra=<extra> word-mask an additional <extra> letters on each side flanking an already-masked region. This helps avoid the appearance of spurious alignments through low-complexity regions initiated by chance word hits immediately adjacent to masked regions.
nogaps do not create gapped alignments and, in essence, revert to WU BLAST 1.4 behavior
wstrict require that all ungapped alignments found during the initial, ungapped phase of a search contain an identical word hit (in the usual case of BLASTN usage) or a neighborhood word hit (in the case of TBLASTN and TBLASTX), when searching a nucleotide database sequence that contains one or more ambiguous residues. The -wstrict option has no effect whatsoever on BLASTX and has no effect on BLASTP when gapped alignments (the default) are to be produced.
pingpong Perform additional work to help ensure the alignments produced are locally optimal. This option typically adds 3-10% to the execution time, without affecting the results. Only rarely is an alignment and its associated score improved, for all the work involved.
gapall effectively generate a gapped alignment for every ungapped HSP found. This is the default behavior.
See also: gapE.
gapE=<gapE> generate gapped alignments for all HSPs between sequences whose expected frequency of chance occurrence is less than or equal to <gapE>. Default value is gapE=infinity, i.e., gapall is in effect.
gapW=<gapW> set the window width (or band width) within which gapped alignments are generated (default is gapW=32 for protein comparisons, gapW=16 for BLASTN).
noseqs produces greatly abbreviated output that omits sequence alignments and yet may be interpreted correctly by existing parsers.
hspmax=<hspmax> establishes <hspmax> as the maximum number of ungapped HSPs that will be saved per subject sequence or pairwise sequence comparison. Saved HSPs are then fed to the gapped alignment phase of the program or are statistically evaluated if gapped alignments are not to be performed. If more than <hspmax> HSPs are found, only the best-scoring HSPs are retained for subsequent processing.
The default value is 1000; a value of 0 implies no limit.
See also: gspmax and spoutmax.
NOTE: this usage of hspmax is subtly, but importantly, different from the parameter's classical interpretation, wherein all ungapped HSPs that satisfied the S2 score threshold were saved and <hspmax> merely limited the number of HSPs (gapped or ungapped) that would be reported. The new interpretation was instituted to provide vastly improved speed on large problems, while imparting no effect on small problems and many medium-sized problems. The new behavior can help guard against horrendously slow searches resulting from an inadvertent omission of a low-complexity filter. Adverse effects on sensitivity may result, however, if every HSP is sacred. To restore classical behavior, specify hspmax=0. As a compromise between sensitivity and speed, set a higher value than the default.
NOTE: the B and V options limit the number of subject sequences for which any results whatsoever are reported, regardless of the number of HSPs or GSPs found in each case.
gspmax=<gspmax> establishes <gspmax> as the maximum number of GSPs (gapped HSPs) to report per subject sequence or pairwise sequence comparison. If more than <gspmax> GSPs are found, only the best-scoring GSPs are retained for subsequent processing and reporting. The setting of gspmax will have no effect, if the nogaps option is specified or if the setting of hspmax is more restrictive.
The default value for gspmax is 0; a value of 0 implies no limit.
See also: hspmax and spoutmax.
NOTE: the B and V options limit the number of subject sequences for which any results whatsoever are reported, regardless of the number of HSPs or GSPs found.
spoutmax=<spoutmax> establishes <spoutmax> as the maximum number of segment pairs to report in program output per subject sequence or pairwise comparison, independent of the number of HSPs or GSPs actually found and evaluated. If more than <spoutmax> segment pairs are found, the segment pairs are sorted by the sort criteria in effect for the search and only the first <spoutmax> segment pairs will be reported. The setting of spoutmax will have no effect if either <hspmax> or <gspmax> is more restrictive.
The default value for spoutmax is 0; a value of 0 implies no limit.
See also: hspmax and gspmax.
compat1.4 produces BLAST version 1.4-style output (no gaps), but with bug fixes and performance enhancements in place.
kap use Karlin-Altschul (1990) statistics on individual alignment scores (i.e., do not evaluate the joint probability of multiple scores, such as with Poisson or Karlin-Altschul (1993) "Sum" statistics).
restest causes statistical significance estimates to depend upon the size of the database, as determined by the total number of residues it contains. Restest is the default method for determining the database size in the blastn, tblastn, and tblastx search modes.
See seqtest.
seqtest causes statistical significance estimates to depend upon the size of the database, as determined by the number of sequences it contains. Seqtest is the default method for determining the database size in the blastp and blastx search modes. For backward compatibility with legacy BLAST software — in all search modes, including blastp and blastx — if the Z option is specified, Z is expected to be expressed in units of residues, unless the seqtest option is also specified.
See restest.
links display consistent link information for each HSP, indicating which “consistent” alignments were used in joint statistical significance calculations (i.e., Sum or Poisson statistics). NOTE: while all link lists describe mutually consistent HSPs, unless one of the “topcombo” options is used, only lists reported for HSPs in the most significant set for each database sequence are guaranteed to represent the precise set of HSPs for which the joint statistics were computed; all other link lists often correctly describe the set of HSPs involved but could have one or more missing or extraneous HSPs.
topcomboN=<n> report at most n "topcombo" groups of consistent (colinear) local alignments (HSPs). Each local alignment is allowed to be a member of only one group. Use of this option causes the addition of a "Group = #" indicator in the output for each HSP. Groups of HSPs tend to be assembled in decreasing order of statistical significance. Members of the most significant group thus tend to be reported with "Group = 1". See also: topcomboE.
topcomboE=<Eratio> Eratio is the maximum ratio of Ecurrent/Ebest for which the current "topcombo" group of consistent (colinear) local alignments will be reported for a given database sequence. The "best" group is reported in the output as "Group = 1" and tends to be the most statistically significant. The default behavior is to impose no limit on this ratio, in which case all topcombo groups satisfying E are reported (up to a maximum of topcomboN). See also: topcomboN.
olmax=<len> maximum permitted length of overlap (in residues), len, of two ungapped alignments for their joint (Sum or Poisson) probability to be computed. The default is unlimited length, with the maximum extent of overlap being governed only by the olfraction parameter.
golmax=<len> maximum permitted length of overlap (in residues), len, of two gapped alignments for their joint (Sum or Poisson) probability to be computed. The default is unlimited length, with the maximum extent of overlap being governed only by the golfraction parameter.
hspsepQmax=<d> maximum allowed separation along the query sequence between two HSPs (gapped or ungapped) that will be clustered into a "consistent" set. Useful for improving the statistical power of discrimination of clusters that have potential biological interest from random background, when the query is longer than the clustered features of interest. Depending on the specific search to be performed, a reasonable setting for this parameter when used with a contig sequence might be the expected maximum length of an intron in the species of interest. The units for hspsepQmax are measured in residues of the sequence type actually compared, e.g., amino acid residues for a BLASTX query.
hspsepSmax=<d> maximum allowed separation along the subject (database) sequence for two HSPs (gapped or ungapped) to be clustered into a consistent set. Useful for improving the statistical power of discrimination of clusters that have potential biological interest from random background, when the database contains sequences longer than the features of interest. Depending on the specific search to be performed, a reasonable setting for this parameter when used with a genomic sequence might be the expected maximum length of an intron in the species of interest. The units for hspsepSmax are measured in residues of the sequence type actually compared, e.g., nucleotides for a BLASTN search and amino acid residues for a TBLASTN search.
gapK=<k> set the value of the Karlin-Altschul statistics' K parameter to use when evaluating the significance of gapped alignment scores. Useful when precomputed values are unavailable in the internal tables for the chosen scoring matrix and gap penalty combination.
gapL=<l> set the value of the Karlin-Altschul statistics' λ parameter to use when evaluating the significance of gapped alignment scores
gapH=<h> set the value of the Karlin-Altschul statistics' H parameter to use when evaluating the significance of gapped alignment scores
Qoffset=<i> adjust all query sequence coordinates in the output by the fixed quantity i (default i=0).
Soffset=<i> adjust all subject sequence coordinates in the output by the fixed quantity i (default i=0).
dbchunks=<nchunks> establishes the granularity of the database, as it is divided into slices for assignment to individual threads, to make more efficient use of all CPUs when multiple CPUs are employed for a given search. Higher values are appropriate when the database contains relatively few sequences and/or when the sequences vary greatly in length, composition or content (e.g., genomic contigs). Lower values are appropriate when the database contains many sequences of comparable length (e.g., the EST division of GenBank). The minimum assignable value is the number of threads employed, but this setting is ill-advised; the optimal value for any given search type is likely to be a large multiple of the number of threads employed (although it need not be an exact multiple). When searching mammalian genomic contigs, a good value may be 1000. The default value is 500.
dbslice=m/n
dbslice=a-b/n
at run time, logically divide the database into n equivalent-sized slices and search only the m-th slice, where 1 <=m <=n <= 100000. Alternatively, search slices a through b (inclusive), where 1<=a<=b<=n. Slice size is determined merely by the number of sequence records contained within, not the sum total sequence length. In distributed computing environments, when the same database is to be searched repeatedly and database slices are to be assigned to individual compute nodes, overall throughput will likely benefit from consistently assigning the same slice(s) to the same nodes for each search; the improved efficiency results from the file caching activity typically performed by an operating system.
qrecmin=<m> in a multi-sequence query file, start database searches using the query sequence numbered m. (The first record is numbered 1).
qrecmax=<n> in a multi-sequence query file, end database searches with the query sequence numbered n.
putenv="NAME=VALUE" in the local environment to the BLAST search program, set the environment variable named NAME to the value VALUE.
endputenv for security in WWW server installations, where the command line may sometimes be left open to users, ignore any subsequent putenv options found on the command line during left-to-right parsing.
getenv="NAME" display the value of the environment variable named NAME. This may be useful for verifying that the settings of environment variables on a web server or in an analysis pipeline have been propagated all the way to the BLAST search program.
endgetenv ignore any subsequent getenv options found on the command line during left-to-right parsing.
cdb search nucleotide sequence databases in their compressed form. This option is only effective in the BLASTN search mode for word lengths > 6. See ucdb.
ucdb search nucleotide sequence databases in their uncompressed form, with any-and-all ambiguity codes in place. This option may be used to increase sensitivity in the presence of ambiguity codes, at the expense of memory and possibly speed. This is the standard behavior for word lengths < 7, and is not recommended for use with the default or longer word lengths, particularly for longer sequences, due to the increased memory requirements; when comparing long sequences, however, if sufficient memory is available, use of this option can yield a significant increase in speed. This option offers improved sensitivity when searching databases in XDF format that contain ambiguity codes. The option is accepted by the software but offers no improvement in sensitivity for databases in the earlier BLAST 1.4 database format. (BLASTN search mode only).
mmio turn off the use of memory-mapped I/O when reading database files. Use of this option will usually retard the search, particularly when multiple processors are being used, but it serves both to demonstrate the effectiveness of this form of I/O and to validate the I/O routines. Note that no special daemon or support programs (such as the old memfile program) are required to take full advantage of memory-mapped I/O.
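
The interplay of the hspmax, gspmax and spoutmax options listed above can be summarized as "the most restrictive non-zero limit wins". The following is a minimal sketch of that rule only, not of the programs' internals. It assumes that gapped alignment never yields more segment pairs than the number of ungapped HSPs saved, and the function names are hypothetical.

```python
def most_restrictive(*limits):
    """Return the smallest non-zero limit, or 0 if every limit is 0 (0 means "no limit")."""
    finite = [n for n in limits if n > 0]
    return min(finite) if finite else 0


def max_reported_segment_pairs(hspmax=1000, gspmax=0, spoutmax=0, nogaps=False):
    """Upper bound on segment pairs reported per subject sequence (0 = unlimited).

    Defaults follow the option descriptions: hspmax=1000, gspmax=0, spoutmax=0.
    """
    if nogaps:
        # gspmax has no effect when gapped alignments are not produced
        return most_restrictive(hspmax, spoutmax)
    return most_restrictive(hspmax, gspmax, spoutmax)
```

With the defaults, the bound is 1000 per subject sequence; specifying hspmax=0 alone removes the bound entirely, which corresponds to the classical behavior. The B and V options act separately, on the number of subject sequences reported.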

Environment Variables

As described below and elsewhere, WU BLAST 2.0 supports several environment variables to adapt its behavior to different computing environments: BLASTDB, BLASTFILTER and BLASTMAT. To support dual WU/NCBI BLAST installations, WU BLAST also supports the environment variables WUBLASTDB, WUBLASTFILTER and WUBLASTMAT, with the WU versions of these variables taking precedence over the corresponding non-WU versions when both are set.

In WU BLAST 2.0, the BLASTDB (or WUBLASTDB) environment variable can be a list of one or more directory names in which the programs are to look for database files. In UNIX parlance, such an environment variable might be called a path for the database files. Directory names should be delimited from one another by a colon (":") and listed in the order that they should be searched. If the BLASTDB environment variable is not set, the programs use a default path of .:/usr/ncbi/blast/db, such that the programs first look in the current working directory (".") for the requested database and then look in the /usr/ncbi/blast/db directory. For backward compatibility with programs that expect BLASTDB to be a single directory specification and not a path, if the user has set a value for BLASTDB but omitted the current working directory, the version 2 programs will still look for database files in the current working directory as a last resort.
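
The lookup order described above can be sketched as follows. This is an illustration of the documented rules, not the programs' actual code: the function name is hypothetical, an unset variable and an empty one are treated alike, and empty path entries are simply ignored.

```python
import os

DEFAULT_DB_PATH = ".:/usr/ncbi/blast/db"


def database_search_path(environ=None):
    """Return the directories to search for database files, in order."""
    environ = os.environ if environ is None else environ
    # WUBLASTDB takes precedence over BLASTDB when both are set
    raw = environ.get("WUBLASTDB") or environ.get("BLASTDB") or DEFAULT_DB_PATH
    dirs = [d for d in raw.split(":") if d]  # simplification: drop empty entries
    if "." not in dirs:
        # backward compatibility: the current working directory is a last resort
        dirs.append(".")
    return dirs
```

For example, with BLASTDB set to the single directory /data/db (a made-up path), the sketch returns /data/db followed by the current working directory.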

The BLASTFILTER (or WUBLASTFILTER) environment variable can be set to the directory containing the filter programs, such as seg and xnu. The default directory for the filter programs is /usr/ncbi/blast/filter. This usage is unchanged from version 1.4.

The BLASTMAT (or WUBLASTMAT) environment variable can be set to the parent directory for all scoring matrix files. The default directory for these files is /usr/ncbi/blast/matrix, beneath which are nt and aa subdirectories for storing scoring matrix files appropriate for nucleotide and amino acid alphabets. This usage is unchanged from version 1.4.

For more information about environment variables, see the Installation instructions.


Filters and Masks

WU BLAST provides a highly flexible means of applying both "hard" and "soft" masks to a query sequence, supporting alternative, user-defined filter programs, as well as non-standard parameters to the standard filters. The filter (for hard masking) and wordmask (for soft masking) command line options provide the basic interface. Multiple specifications of each type are acceptable on the BLAST command line; and individual filter and wordmask specifications may consist of entire pipelines of commands.

For example, three filters are used in succession by this pipeline:

      filter="myfilter1 | myfilter2 | myfilter3 -x5 -"

The first two filters in this case are expecting to read their input from UN*X standard input (also known as stdin), whereas myfilter3 apparently needs to be told (with the usual "-" or hyphen argument) to read data from stdin. The standard output (stdout) from myfilter1 will be read via stdin by myfilter2, which in turn processes the query before handing its results to myfilter3; finally, myfilter3 reports its results to stdout, which the BLAST program itself reads to obtain the fully masked sequence. The final output from the filter pipeline is expected by the BLAST program to be in FASTA format.

Instead of running all 3 filters in the above example as part of one pipeline, they could instead be specified as separate filter options like this:

    filter=myfilter1  filter=myfilter2  filter="myfilter3 -x5 -"

The same choice of running as a pipeline or running separately is available for wordmasks, too. And of course the two approaches can be combined on the same command line. An advantage to using the pipeline approach is that all 3 filters in the example above may complete a little bit faster, because much of the I/O is avoided. Furthermore, when used in the pipeline, there's no requirement that the output from myfilter1 and myfilter2 actually be in FASTA format. Those two programs could potentially pass any information between themselves and to myfilter3. The only absolute requirement is that myfilter1 must read FASTA data from stdin and myfilter3 must output FASTA data (of the same length as the query!) to stdout.
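
To make the FASTA-in, FASTA-out contract concrete, here is a minimal sketch of what a user-defined filter might look like. The masking rule (replace runs of four or more identical letters) is invented purely for illustration and is not one of the standard filters. The names are hypothetical, and a single-record query is assumed.

```python
import sys


def mask_runs(sequence, min_run=4, mask_char="X"):
    """Replace every run of min_run or more identical letters with mask_char."""
    out = []
    i = 0
    while i < len(sequence):
        j = i
        while j < len(sequence) and sequence[j] == sequence[i]:
            j += 1
        run = sequence[i:j]
        out.append(mask_char * len(run) if len(run) >= min_run else run)
        i = j
    return "".join(out)


def filter_fasta(text, **kwargs):
    """Mask a single-record FASTA string; the header line passes through unchanged."""
    lines = text.splitlines()
    header, body = lines[0], "".join(lines[1:])
    masked = mask_runs(body, **kwargs)
    # the masked sequence must be exactly as long as the query
    assert len(masked) == len(body)
    return header + "\n" + masked + "\n"


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Read FASTA from stdin and write the masked FASTA to stdout."""
    stdout.write(filter_fasta(stdin.read()))
```

Saved as an executable script that calls main(), such a program could take the place of myfilter1 in the examples above. For nucleotide queries, mask_char would be N rather than X.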

It should be noted that with some filter programs, passing the query sequence sequentially through a pipeline of filters may yield a different result than processing the query independently with each filter and OR-ing the results. The script seg+xnu included in the filter/ directory provides an example with which to test this. Specifying filter=seg+xnu on the BLAST command line invokes a seg and xnu pipeline that is built-in to the search programs; whereas specifying filter="seg+xnu -" causes the seg+xnu script to be invoked on the query, which independently executes seg and xnu, then ORs the separate results with pmerge. (The echofilter option can be used to see the results of filtering displayed in search program output). While the built-in seg+xnu pipeline is historically the way these two filters have been implemented, the latter interpretation, as illustrated by the seg+xnu script with pmerge, may be more desirable.
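
The difference between pipelining and OR-ing can be seen with two toy filters. The sketch below does not use seg, xnu or pmerge: the two masking rules and the or_merge function are invented stand-ins, chosen only so that the second filter's result depends on what the first has already masked.

```python
def mask_identical_runs(seq, min_run=3):
    """Toy filter A: mask runs of min_run or more identical letters with X."""
    out = list(seq)
    i = 0
    while i < len(seq):
        j = i
        while j < len(seq) and seq[j] == seq[i]:
            j += 1
        if j - i >= min_run:
            out[i:j] = "X" * (j - i)
        i = j
    return "".join(out)


def mask_period2_runs(seq, min_len=4):
    """Toy filter B: mask stretches of min_len or more letters that repeat with period 2."""
    out = list(seq)
    i = 0
    while i < len(seq):
        j = min(i + 2, len(seq))
        while j < len(seq) and seq[j] == seq[j - 2]:
            j += 1
        if j - i >= min_len:
            out[i:j] = "X" * (j - i)
            i = j
        else:
            i += 1
    return "".join(out)


def or_merge(original, *masked_versions, mask_char="X"):
    """Mask a position if any of the independently masked versions changed it."""
    return "".join(
        mask_char if any(m[i] != c for m in masked_versions) else c
        for i, c in enumerate(original)
    )


query = "MAAABABK"
pipelined = mask_period2_runs(mask_identical_runs(query))  # "MXXXBABK"
merged = or_merge(query, mask_identical_runs(query), mask_period2_runs(query))  # "MXXXXXXK"
```

In the pipelined case, filter A destroys part of the ABAB repeat before filter B sees it, so less of the query ends up masked than when each filter examines the original sequence independently.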


Bugs

The following list describes bugs that are known to exist in the WU BLAST 2.0a19 binaries posted here. These are all fixed in the licensed version 2.0 of WU BLAST, in which there are no known bugs. If you are a user of the licensed version and believe you see a bug, please report it. Even users of the licensed version should read the second set of potential problem areas or pitfalls listed below this initial bug list.

The above mentioned bugs are not applicable to the licensed version of WU BLAST 2.0, but the licensed version does have some characteristics worth mentioning that could trip up or confuse even the most knowledgeable of BLAST users. Any unexpected behavior might be construed as a bug, so the following information is provided to help avoid the unexpected. If you should encounter problems or confusing areas other than those described below, or if you have questions or suggestions, please send them to


Supported Platforms

The computing platforms currently supported by BLAST 2.0 (licensed version only) include the following:

The list of supported platforms is subject to change without notice.
Multiple processors (multithreading or parallel processing) are effectively and efficiently supported by WU BLAST on all of the above platforms. WU BLAST 2.0 also supports large files (files greater than 2 GB in size) when the underlying operating system and file system support large files.

Under Mac OS X, WU BLAST is the only BLAST that runs faster on multiple G4 processors. Other BLAST implementations either don't use multiple G4 processors or — as is the case with Apple's modified version of NCBI BLAST — actually run slower when two processors are used. Unlike other BLASTs, WU BLAST won't crash or hang your system when the use of multiple CPUs is attempted and it yields the most accurate results. WU BLAST also does not utilize any G4-specific instructions for peak performance, so you can even run it on an iBook. You don't have to scrap your old G3 in order to run the fastest BLAST — WU BLAST runs on the same hardware that Mac OS X runs on — and it has done so since even before Mac OS X was publicly released — but it will run faster on a G4 and again up to twice as fast on a dual G4.

Hewlett-Packard HP-UX and IBM AIX operating systems may need to be patched for error-free support of large files over NFS. If a large file/NFS problem does exist with your HP or IBM system, it should immediately reveal itself when an attempt is made to search a large-file database over NFS: the search will simply fail to run and the application will exit non-zero. If necessary, simply contact your vendor for the patch. Both companies promptly addressed this issue over a year ago, in the first half of 2001.

Please note: while WU BLAST version 2.0a19 binaries dated February 1998 are freely available here for some platforms, newer, full-featured binaries for the above platforms are only available upon licensing. Besides containing several bugs, WU BLAST 2.0a19 lacks some of the more prominent features supported by the licensed version, including:


Installation

To install WU-BLAST 2.0, the first step is to download the UNIX tar archive of executables appropriate for your computing platform from here. Scoring matrix files are included in each package.

Unpack the archive in a new, empty directory. For convenience, precompiled versions of the low-complexity sequence filters (e.g., seg, xnu, and dust) are recent additions to the archive (see the filter/ subdirectory created), along with the sequence redundancy remover nrdb.

The executable programs from the tar archive may be moved as necessary into any directory listed in the PATH environment variable, whether this means adding the newly created directory to the PATH or moving the executables into an existing directory already listed in the PATH. (Lots of information about interrogating and setting environment variables, and about the PATH environment variable itself, can be found in Google and other search engines using the keywords “PATH environment variable”). If the software is installed in a directory that was already listed in the PATH, it may be necessary to exit the currently open shell and open a new one in order for the shell to recognize the existence of the newly installed programs.

Note that the files blastp, blastn, blastx, tblastn and tblastx are actually soft links to a single copy of the blasta executable containing the integrated capabilities of all 5 search methods. These links should be retained if the blasta file is moved to another directory. If desired, the links can be renamed, as long as the original names appear as substrings within the new names. Alphabetic case is unimportant. For instance, a link named wu-blastp will still invoke blasta in its blastp operational mode.
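
One way to picture the name-based dispatch is the sketch below. It is a guess at the general idea rather than a description of how blasta actually decides. In particular, checking the longer names first is an assumption made here so that a name containing tblastn is not mistaken for blastn.

```python
import os

# Longer names come first because "tblastn" contains "blastn" and
# "tblastx" contains "blastx".
SEARCH_MODES = ("tblastn", "tblastx", "blastp", "blastn", "blastx")


def search_mode(program_path):
    """Return the search mode implied by a link name, or None if there is none."""
    name = os.path.basename(program_path).lower()  # alphabetic case is unimportant
    for mode in SEARCH_MODES:
        if mode in name:
            return mode
    return None
```

Under this sketch, a link named wu-blastp maps to blastp and one named WU-TBLASTN maps to tblastn, while a name such as blasta that contains none of the five substrings maps to no mode at all.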

A Note to Mac OS X Users
This software is invoked via a CLI (command line interface). Programs will need to be invoked either via the Terminal application (located in the /Applications/Utilities folder) or from within a script or other application provided by a third party. The latter may be clickable to execute, but the programs bundled in this package are not intended to be double-clicked.

A Note About File Permissions and File Copying
The WU-BLAST 2.0 package is copyrighted and only available under license. To help ensure that users of the software do not unintentionally copy or distribute it outside your locale, it is recommended that all copies of the binary files on your computer(s) be maintained with execute-only permissions. As delivered in the software archives from Washington University, execute-only permissions have already been set, but if you copy the binary files, these permissions may be altered and thus allow other users to copy the software in an unrestricted manner. Execute-only permissions can be restored to a binary file by running the command "chmod a-rw,a+x filename", where filename is the name of the binary file.

If you already had BLAST 1.4 installed (with BLAST-able databases), the installation steps for WU-BLAST 2.0 are now complete. If you did not have BLAST 1.4 installed, or if you wish to see some of the flexibility afforded by using the xdformat program, read on...

Unpacking the tar archive creates a matrix/ subdirectory containing scoring matrix files. Wherever this directory ultimately resides, the WUBLASTMAT (BLASTMAT) environment variable should be set to point there. In the absence of this environment variable being set, the programs first look for scoring matrix files in any matrix/ subdirectory of the search program's directory and then in /usr/ncbi/blast/matrix.

Low-complexity sequence filters or masking programs — e.g., seg, xnu and dust — are now included in the tar archives described here. While these filter programs are not required for running the search programs, they can enormously reduce the amount of garbage output produced, memory used, and search time taken. Hence, it is highly recommended that these programs be made available to users. If you wish to build (compile and link) the programs yourself, their source code is posted off the WU BLAST Archives home page. Whatever directory you install the filter programs in, the WUBLASTFILTER (BLASTFILTER) environment variable should be set to point there. In the absence of this environment variable being set, the programs look for masking programs in any filter/ subdirectory of the search program's directory and in /usr/ncbi/blast/filter.
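A minimal sketch of setting the matrix and filter locations in a Bourne-style shell follows. It assumes the archive was unpacked in /usr/local/wublast, which is a hypothetical location; WUBLAST_HOME is merely a helper variable for this example and is not something the search programs themselves are known to read.

```shell
# Hypothetical installation directory; override by setting WUBLAST_HOME first.
WUBLAST_HOME=${WUBLAST_HOME:-/usr/local/wublast}

# Point the search programs at the scoring matrices and the filter programs.
WUBLASTMAT=$WUBLAST_HOME/matrix
WUBLASTFILTER=$WUBLAST_HOME/filter
export WUBLASTMAT WUBLASTFILTER

# Warn early if either directory is missing.
for d in "$WUBLASTMAT" "$WUBLASTFILTER"; do
  [ -d "$d" ] || echo "warning: $d does not exist" >&2
done
```

Placing these lines in a shell startup file makes the settings available to every new session.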

Note: unlike the NCBI BLAST 2.0 search programs, the WU BLAST search programs do not employ sequence filtering by default. This behavior may change, however, as the executable filter programs are now distributed with the search programs. In case the search programs are updated on your system without warning and you wish to guarantee for an automated analysis pipeline that no filtering is performed, specify filter=none on the command line.
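For an automated pipeline, one way to guarantee that filter=none is always present is to route every search through a small wrapper function. The sketch below is an illustration only: the name run_unfiltered is hypothetical, and it assumes an invocation order of program, database name, query file, then any further options, with the caller not passing a filter option of its own.

```shell
# Run a search program with filtering explicitly disabled.
#   $1 = search program (e.g. blastp), $2 = database, $3 = query file,
#   remaining arguments = any other options, passed through unchanged
run_unfiltered() {
  prog=$1
  db=$2
  query=$3
  shift 3
  # filter=none is appended so that the pipeline does not depend on
  # whatever the installed programs do by default.
  "$prog" "$db" "$query" "$@" filter=none
}
```

A call might look like "run_unfiltered blastp mydb query.fa > query.out", where mydb and query.fa are placeholder names.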

The databases themselves are missing from the tar archives, however! Once the source databases have been downloaded from any of many Internet sites, the database files are typically uncompressed and processed into FASTA format. Included in the tar archives are several utility programs for converting textual database files:

The NCBI software Toolbox also contains some relevant parsers. One of these is asn2fast, which converts both nucleotide and peptide sequences in GenBank ASN.1 format into FASTA format files. The asn2ff parser, which converts GenBank ASN.1 data into other flat file formats, may also come in handy, especially if you are inclined to parse GenBank into FASTA using your own routines or to using the gb2fasta and gt2fasta programs mentioned above.

All of the above parsers can read from standard input (sometimes signified by a single dash, "-"), so their input files can be maintained on disk in compressed format and dynamically zcat-ed or gunzip-ed directly into the parsers, thus saving the time and storage required for the uncompressed data. Because a dash is often used to signify the start of each command line option, if a dash is needed to specify standard input for the required input filename argument, some of these programs require that a double-dash (--) be specified on the command line before the single-dash. This double-dash signifies the end of the command line options and the start of the required arguments.
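The streaming arrangement described above can be sketched as follows. The function name stream_to_parser is hypothetical, and the sketch assumes a gzip-compressed input and a parser that accepts the double-dash convention; a parser that does not need the double-dash may or may not tolerate it.

```shell
# Decompress a gzip-ed database file on the fly and feed it to a parser
# that reads standard input when given "-" as its input filename.
#   $1 = parser program, $2 = gzip-compressed input file
stream_to_parser() {
  parser=$1
  input=$2
  # "--" ends option processing, so the lone "-" that follows is taken as
  # the filename argument (standard input) and not as the start of an option.
  gunzip -c "$input" | "$parser" -- -
}
```

For example, "stream_to_parser gb2fasta gbpri.seq.gz > gbpri.fa" would write FASTA output without ever storing the uncompressed flat file; the file names here are placeholders.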

Once the databases are in FASTA format, the xdformat, setdb and pressdb programs are used to convert them into a blastable format. Concise usage instructions for these programs can be obtained by invoking them with no command line arguments. By default, each of these programs produces 3 output files whose names are derived from the name of the FASTA input file. The 3 output files have distinct filename extensions and together comprise the blastable database. When nucleotide sequences contain ambiguity codes (e.g., EST data which often contain many Ns), a database formatted by pressdb will require that the original FASTA input file remain accessible to the search programs (e.g. BLASTA), if they are to retrieve the ambiguity codes for matching sequences that contain them; however, the FASTA input need not be accessible to the search programs when databases are formatted by xdformat, as this program stores full ambiguity information within the output files it produces. More information about the blastable database file formats is available here.

While formatting the database, the xdformat program can optionally (-I option) index the sequence identifiers for later identifier-based retrieval with the xdget program. XDF databases that were formatted without an identifier index can have an index created post hoc by xdformat with its -X option. It may be of interest to note for the purposes of their maintenance that xdformat and xdget are actually one-and-the-same program file, merely invoked under the two different names to obtain the two different behaviors. This helps ensure that the index created with xdformat will be compatible with xdget. See the file "FAQ-Indexing.html" for more details on identifier indexing.
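The two indexing routes might be wrapped as shown below. This is a sketch under stated assumptions: the function names are hypothetical, the -n flag is taken to select nucleotide data as in the title example elsewhere in this document, and -X is assumed to take the name of the existing database as its argument. Invoke xdformat with no arguments to confirm the exact usage on your system.

```shell
# Format a nucleotide FASTA file and build the identifier index in one pass.
#   $1 = FASTA input file
format_nt_with_index() {
  xdformat -n -I "$1"
}

# Add an identifier index to an existing nucleotide XDF database that was
# formatted without one.
#   $1 = name of the existing database (assumed argument form)
index_existing_nt_db() {
  xdformat -n -X "$1"
}
```

Either route should leave the database ready for identifier-based retrieval with xdget.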

For compatibility with previous BLAST installations, the xdformat program can function in a setdb- and pressdb-compatibility mode, wherein its behavior is similar to that of setdb and pressdb. In its compatibility mode, a similar command line structure is used and the output files produced have the same names as those produced by setdb and pressdb. Compatibility mode is invoked when xdformat is renamed or has links pointing to it named setdb and pressdb. While the output files produced in compatibility mode have the same names as those produced by the corresponding setdb/pressdb program, the content of the files produced is always XDF. Versions of the BLASTA search program dated 1999-12-14 and later are able to recognize and interpret these files properly.

Which instance of a database does BLAST search if xdformat, setdb and pressdb have been used to produce multiple databases each having the same base filename? The database having the native filename extensions of xdformat is chosen preferentially.

Using the -t option to xdformat, a descriptive name or title can be assigned to a database that will appear in BLAST search output. The title of an existing database can be changed after its creation, by appending an empty FASTA database and specifying the -t option with the desired new title. For example,


     xdformat -n -a olddb -t "A New Title" /dev/null

The blastable database files can be placed anywhere, but for convenience the BLASTDB environment variable should include their directory location. If the BLASTDB environment variable is not set, the programs look for databases by default in /usr/ncbi/blast/db and in the current working directory. If the old pressdb program (instead of xdformat) is used to create the blastable database, the associated nucleotide sequence FASTA file must be located in the same directory as the three output files from pressdb, if the BLAST search programs are to find the FASTA file. It may sometimes be useful to maintain the FASTA files in a separate directory — even on another disk partition — and provide UNIX soft links in the BLASTDB directory that point to the real location of the FASTA files. In addition, on systems where NCBI BLAST will not be in use, blastable databases can be maintained in multiple directories listed in the BLASTDB environment variable, with each directory name delimited from the next by a colon (:), just as directory names are often delimited in the PATH environment variable.
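As an illustration of the colon-delimited form, the sketch below sets BLASTDB to two directories and lists them one per line. Both directory names and the function name blastdb_dirs are hypothetical.

```shell
# Two hypothetical database directories, separated by a colon as in PATH.
BLASTDB=/data/blast/db:/scratch/blast/db
export BLASTDB

# Print each directory listed in BLASTDB on its own line.
blastdb_dirs() {
  old_ifs=$IFS
  IFS=:
  for d in $BLASTDB; do
    printf '%s\n' "$d"
  done
  IFS=$old_ifs
}

blastdb_dirs
```

A FASTA file kept on another partition can be made visible in a BLASTDB directory with a soft link, for example "ln -s /archive/fasta/est.fa /data/blast/db/est.fa" (again with placeholder paths).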

On multi-processor computer systems, the search programs will employ as many CPUs as are installed; when more than about 4 CPUs are used, this default behavior causes the efficiency of hardware utilization to be quite low compared to running individual single-threaded BLAST jobs on each CPU. Memory use also increases linearly with the number of CPUs or threads employed. One way to govern the number of processors employed is to wrap the search programs in a shell script that sets a lower number of CPUs via the cpus=# command line option. Another, simpler approach to changing the default number of CPUs for all users follows below, for implementation by BLAST system managers possessing "root" or "SuperUser" privileges.
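The wrapper approach mentioned above might look like the following sketch. The function name blast_with_default_cpus and the BLAST_DEFAULT_CPUS variable are hypothetical. The wrapper only supplies a default: if the caller already gives a cpus=# option it is left alone, which avoids any assumption about which of two cpus settings the programs would honor.

```shell
# Run a search program with a modest default CPU count.
#   $1 = search program, remaining arguments = its usual command line
blast_with_default_cpus() {
  prog=$1
  shift
  for arg in "$@"; do
    case $arg in
      cpus=*)
        # The caller chose a CPU count; pass everything through untouched.
        # Only the cpus=# spelling is detected here.
        "$prog" "$@"
        return
        ;;
    esac
  done
  # No cpus option given: append the site default (2 unless overridden).
  "$prog" "$@" cpus="${BLAST_DEFAULT_CPUS:-2}"
}
```

Because users can simply call the search programs directly or pass their own cpus=# value, a wrapper of this kind is a convenience and not an enforced limit.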

Licensed distributions of WU BLAST 2.0 include a sample file named sysblast.sample, that illustrates the system-wide configuration parameters that can be established to govern the execution of BLAST jobs and, thereby, provide a more productive, trouble-free level of service. When the sysblast file is installed under the name /etc/sysblast, all BLAST jobs executed on a given computer system can be made subject to the parameters:

The sysblast file is only effective when installed in the /etc directory. The /etc directory resides locally to any given computer system, so parameter settings can be tailored to each computer, even if the BLAST software is maintained on a shared disk partition. The /etc directory should only be writable by "root". Unlike the shell script wrapper approach described above, the limits set in /etc/sysblast typically can not be circumvented by normal (non-root) users of a computer system. See the comments included in the sample sysblast file for further details.

For further information on BLAST software installation, the outdated manual page for the BLAST version 1.4 (ungapped) search programs is sometimes useful, as it describes procedures and parameters that have not changed in BLAST 2.0.


Licensing

Site licenses for the full-featured BLAST 2.0 are available free for academic and nonprofit use; commercial licenses are available from Washington University for a fee. Academic and nonprofit licenses are typically arranged through the institutions' respective offices of technology transfer. Upon obtaining written permission from Washington University, licensees are welcome to install the software for public BLAST services. Washington University seeks additional licensees for commercial development and marketing and invites interested parties to submit proposals.

Please address all e-mail requests for licensing information and limited evaluation copies to

Be sure to include the name and address of your company or institution and the name and e-mail address of your lab head (if not you). Washington University typically negotiates site licenses for BLAST, so only one license needs to be executed per institution. If your institution already has a license, you will be informed of this upon inquiring, and your lab head will be provided with download instructions. If a license for your institution does not exist, you may be provided with a draft of the license agreement, which will need to be signed by authorized representatives of both institutions. If you do not receive a response within the next business day, please re-send your message and indicate it is a repeat request. Please note that responses during holiday periods may be slower than usual.


Citing BLAST

Citations or acknowledgements of WU BLAST usage are greatly appreciated, as are any personal accounts of how the software is being used that you might wish to share. When URLs are acceptable, please cite with:

   Gish, W. (1996-2003) http://blast.wustl.edu

When URLs are not acceptable, please use:

   Gish, W., personal communication.

The WU BLAST search program may also be referred to by the name BLASTA. I know of no other program (BLAST-related or otherwise) going by this name.

In scientific communications, it is typically important to report the program name, as well as the specific version used. In the case of WU BLAST or BLASTA, the version is a combination of the "2.0" moniker and the release date. The release date can be found on the first line of output, and it is the first date displayed. For example, consider this introductory line of output:

  BLASTN 2.0MP-WashU [02-Apr-2002] [sol8-ultra-ILP32F64 2002-04-03T01:25:46]

The software release date is April 2, 2002, whereas the compilation or build date of the Solaris 8 binary was April 3rd at 1:25 AM.
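When results are processed by scripts, the program name, version and release date can be pulled out of such a line with standard shell tools. The sketch below works on the example line shown above; the variable names are arbitrary, and in practice the line would be captured from the search output and not typed in.

```shell
# The introductory line from the example above.
banner='BLASTN 2.0MP-WashU [02-Apr-2002] [sol8-ultra-ILP32F64 2002-04-03T01:25:46]'

# First word: program name. Second word: version moniker.
program=${banner%% *}
rest=${banner#* }
version=${rest%% *}

# The release date is the text inside the first pair of square brackets.
release=$(printf '%s\n' "$banner" | sed -n 's/^[^[]*\[\([^]]*\)\].*/\1/p')

printf '%s %s (release %s)\n' "$program" "$version" "$release"
# prints: BLASTN 2.0MP-WashU (release 02-Apr-2002)
```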


Historical Notes

Historical notes and additional citation information for some earlier versions of NCBI and WU BLAST include:


References

Altschul, SF, and W Gish (1996). Local alignment statistics. ed. R. Doolittle. Methods in Enzymology 266:460-80.

Altschul, SF, and DJ Lipman (1990). Protein database searches for multiple alignments. Proc. Natl. Acad. Sci. USA 87:5509-13.

Altschul, SF, Gish, W, Miller, W, Myers, EW, and DJ Lipman (1990). Basic local alignment search tool. J. Mol. Biol. 215:403-10.

Altschul, SF, Madden, TL, Schaffer, AA, Zhang, J, Zhang, Z, Miller, W, and DJ Lipman (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25(17):3389-402.

Claverie, JM, and DJ States (1993). Information enhancement methods for large scale sequence analysis. Computers in Chemistry 17:191-201.

Gish, W, and DJ States (1993). Identification of protein coding regions by database similarity search. Nature Genetics 3:266-72.

Hancock, JM, and JS Armstrong (1994). SIMPLE34: an improved and enhanced implementation for VAX and Sun computers of the SIMPLE algorithm for analysis of clustered repetitive motifs in nucleotide sequences. Comput. Appl. Biosci. 10:67-70.

Karlin, S, and SF Altschul (1990). Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc. Natl. Acad. Sci. USA 87:2264-8.

Karlin, S, and SF Altschul (1993). Applications and statistics for multiple high-scoring segments in molecular sequences. Proc. Natl. Acad. Sci. USA 90:5873-7.

Smith, TF, and MS Waterman (1981). Identification of common molecular subsequences. J. Mol. Biol. 147:195-7.

States, DJ, and W Gish (1994). Combined use of sequence similarity and codon bias for coding region identification. J. Comp. Biol. 1:39-50.

Wootton, JC, and S Federhen (1993). Statistics of local complexity in amino acid sequences and sequence databases. Computers in Chemistry 17:149-63.

Wootton, JC, and S Federhen (1996). Analysis of compositionally biased regions in sequence databases. ed. R. Doolittle. Methods in Enzymology 266:554-71.

Zhang, Z, Schaffer, AA, Miller, W, Madden, TL, Lipman, DJ, Koonin, EV, and SF Altschul (1998). Protein sequence similarity searches using patterns as seeds. Nucleic Acids Res. 26:3986-90.

