You are on page 1of 35

Running BLAST through Perl

Introduction to BLAST

BLAST is a common sequence search tool that can be used for either nucleic acid or protein sequences. BLAST = Basic Local Alignment Search Tool The original publication was: Altschul, S.F., Gish, W., Miller, W., Myers, E.W. & Lipman, D.J. (1990) "Basic local alignment search tool." J. Mol. Biol. 215:403-410. Current reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402. We are using a stand-alone version, meaning that the code and databases for BLAST reside on our computer. We obtained the code from the NCBI, and it resides at /home/share/blast.dir. There is also a widely used web-based version at the NCBI: http://www.ncbi.nih.gov/BLAST/ , which is very useful for searches with single sequences against the entire collection of known DNA sequences in GenBank, among other things. Also it has a very good tool for giving a nice visual comparison of two sequences. However, the web version is very difficult to use if you need to do multiple queries.

What BLAST does


Very useful reference: the NCBI BLAST course: http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html Also, a section of the NCBI Handbook concerning BLAST: http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=handbook.chapter.61 0

BLAST is a heuristic search method, meaning that it makes assumptions about the data based on experience. This implies that it is not guaranteed to find the best alignment in all possible circumstances. It sacrifices some accuracy for a great increase in speed. BLAST is based on the Smith-Waterman algorithm, which is slow but guaranteed to get the best possible alignment given certain input parameters. We have a Smith-Waterman implementation on biolinx called swat.

BLAST Algorithm
BLAST is a local search algorithm: it starts with small perfect matches and tries to extend them. First, the query sequence is broken down into a dictionary of words, all possible sequences of a certain size. For proteins, the default word size is 3 amino acids. For DNA, the default word size is 11 nucleotides. The query words are then matched with a similar dictionary for the database. BLAST requires 2 non-overlapping word hits within a certain distance of each other to proceed. (Originally each individual word hits was searched, but using 2 hits proved to be faster with only a small observed loss of accuracy). All regions with 2 hits are then extended using the Smith-Waterman algorithm to find the maximal extent of the alignment, starting at the center of the better of the 2 matched words and extending in both directions.

Substitution Matrices

When searching for imperfect matches, it is necesary to have penalties for base or amino acid mismatches. For nucleic acids:
the simplest approach is to assign the same penalty for any mismatch. a more sophisticated approach is to assign a different (larger) penalty for transversions than for transitions. by default, blastn uses a reward score of 1 for a match and a penalty of -3 for any mismatch.

For proteins the situation is more complicated, because many amino acid substitutions have very little effect on protein function, for example isoleucine and valine, or aspartic acid and glutamic acid. Protein substitution matrices have been developed by comparing the frequencies of different possible substitutions seen in homologous proteins across species. Two approaches have been used:
The PAM series (developed by Dayhoff) uses global alignments of proteins from closely related species. Results are then extrapolated to more distant relationships by matrix multiplication. The numbers describing the matrices reflect evolutionary distance: PAM30 is used for more closely related species and PAM 70 is for more distant relationships. The BLOSUM series (developed by Henikoff) use local, ungapped alignments between proteins from distantly related species. The matrices are directly calculated, not extrapolated. The numbers refer to the minimum percent identity allowed for a pair of sequences to be included in the matrix calculation. Thus, BLOSUM80 required 80% identity and only includes proteins that are more similar than those in the BLOSUM 45 matrix. BLOSUM matrices have been found to work better in practice, and blastp uses BLOSUM62 by default.

Gap Scores

When closely related sequences are aligned, small gaps are commonly observed in addition to base substitutions. The frequency of gaps is about 1/10 the frequency of substitutions. Gapped alignments mess up statistical calculations, but they must be dealt with. Since a gap can have any length, there are separate penalties for opening a gap and for extending a gap. By default, blastn uses a gap opening penalty of -5 and an extension penalty of -2; blastp has an opening penalty of -11 and an extension penalty of -1. Using separate opening and extension penalties is called an affine scoring system. For example, y = 3x is a linear function of x, while y = 3x + 5 is an affine function of x. Using an affine function for gaps is considerably slower than using a linear function (i.e. the gap opening and gap extensions are identical). Megablast, a related program useful for large-scale sequence comparisons, uses a linear gap function.

Smith-Waterman Algorithm
The S-W algorithm is an extension of the Needlemann-Wunsch algorithm, which produces global alignments. Both entire sequences are aligned with N-W, but local alignments are allowed in S-W. These algorithms use the technique called dynamic programming (which is not directly related to computer programming). Start with a 2-dimensional matrix with one sequence along the top and the other sequence down the left side. All possible pairs of nucleotides or amino acids are represented by the cells of the matrix. All possible alignments are represented by the paths through the matrix.
a diagonal step is an alignment between the query and the subject sequences at that position a vertical step is a gap in the query sequence a horizontal step is a gap in the subject sequence.

Have a match reward and penalties for mismatches, gap openings, and gap extensions. For our example, we will use:
match = +3 mismatch = -1 gap opening = -4 gap extension = -4 (to keep things simple)

Calculating Cell Scores


T A
Initialize the edge rows to scores of 0. The cell at row i and column j has a score S(i, j) Starting at top left cell, proceed row-by-row, calculating each cells score S(i, j). S(i, j) is the maximum of:
0 (i.e. set to 0 if the calculated score is less than 0) S(i-1, j-1) + match/mismatch score for cell (i, j) S(i, j-1) + match/mismatch score for cell (i, j) + gap penalty S(i-1, j) + match/mismatch score for cell (i, j) + gap penalty

T 5 7 G 2 ?
For the cell in question, the bases dont match, so it starts with a match/mismatch score of -1. There are 3 possible alignment paths to this cell: 1. diagonal (query/subject alignment). Score = 5 1 = 4. 2. vertical (query gap). Score = 7 4 1=2 3. horizontal (subject gap). Score = 2 4 1 = -3 (set to 0) Since 4 is the maximum, the cells value is set to 4.

Then, start at the highest score in the matrix and trace back the path leading through the highest previous scores to 0. Go left and up only, preferring the diagonal path if a choice needs to be made.

S-W Example: 1
--Initialize top and left side to 0s.

A 0 0

C 0

T 0

T 0

C 0

G 0

C 0

T 0

T
T C C A C G

0
0 0 0 0 0 0

S-W Example: 2
--top row matches = 3; mismatches = -1 (set to 0)

A 0 0

C 0

T 0

T 0

C 0

G 0

C 0

T 0

T
T C C A C G

0
0 0 0 0 0 0

S-W Example: 3
--2 Ts in a row = 6; T-C following gets 3 1 = 2

0
T T C C 0 0 0 0

0
0 0

0
0 0

0
3 3

0
3 6

0
0 2

0
0 0

0
0 0

0
3 3

A
C G

0
0 0

S-W Example: 4
--C-G mismatch gets 9 4 1 = 4 (subject gap), not 2 1 = 1 (aligned)

A 0 T T C 0 0 0 0 0 0 0

C 0 0 0 3

T 0 3 3 0

T 0 3 6 2

C 0 0 2 9

G 0 0 0 4

C 0 0 0 3

T 0 3 3 0

C
A C G

0
0 0 0

S-W Example: 5
--C-C match gets 9 4 + 3 = 8 (query gap), not 2 + 3 = 5 (aligned)

A 0 T T C 0 0 0 0 0 0 0

C 0 0 0 3

T 0 3 3 0

T 0 3 6 2

C 0 0 2 9

G 0 0 0 4

C 0 0 0 3

T 0 3 3 0

C
A C G

0
0 0 0

S-W Example: 6
--etc.

A 0 T T C C A C G 0 0 0 0 0 0 0 0 0 0 0 0 3 0 0

C 0 0 0 3 3 0 6 1

T 0 3 3 0 2 2 1 5

T 0 3 6 2 0 1 1 0

C 0 0 2 9 8 3 4 0

G 0 0 0 4 8 7 2 7

C 0 0 0 3 7 7 10 2

T 0 3 3 0 2 6 6 9

S-W Example: 7
--starting at the highest score (10) trace back the alignment path

A 0 T T C C 0 0 0 0 0 0 0 0 0

C 0 0 0 3 3

T 0 3 3 0 2

T 0 3 6 2 0

C 0 0 2 9 8

G 0 0 0 4 8

C 0 0 0 3 7

T 0 3 3 0 2

A
C G

0
0 0

3
0 0

0
6 1

2
1 5

1
1 0

3
4 0

7
2 7

7
10 2

6
6 9

S-W Example: 8
Final alignment. Note that it doesnt cover the first two bases of the query (top row) sequence, and also doesnt cover the last mismatched pair (T-G). The statistical distribution of scores can be derived mathematically (to a close approximation). The alignment score can then be assigned a statistical expectation, the e-value, based on the lengths of the query and subject sequences as well as on the match, mismatch, and gap penalties. The evalue is the expected number of random sequences of the same length as the aligned sequence that would be found in the subject sequence by chance alone. e-values are related to the probability of finding a random sequence in the database by the Poisson distribution. However, for e-values less than 0.01, probabilities and e-values are virtually identical.

ACTT- CGCT

| |

TTCCACG

BLAST types
A BLAST search involves taking a single sequence (the query sequence) and comparing it to a database of other sequences (subject sequences), then reporting the best matches. In general, protein searches are more capable of detecting more distant and imperfect matches than DNA searches, because synonymous and similar amino acids are more conserved evolutionarily than are nucleotide sequences. A number of different BLAST programs are available. The main ones differ primarily by the type of query and subject sequences used:

blastn: query is DNA, subject is DNA blastp: query is protein, subject is protein blastx: query is nucleic acid that is translated by the program into protein sequences (all 6 reading frames); subject database is protein. Useful for imperfect query sequences such as ESTs tblastn: query is protein; database is DNA translated into protein sequences in all 6 reading frames. Useful for low quality databases. tblastx: query is DNA translated into protein, subject is nucleotide translated into protein. Both are translated into all 6 frames. It is very slow relative to the other BLAST types.

Several other types of BLAST search exist: see the NCBI site FAQs to learn more about them.

Query Sequence
The query sequence must be written in FASTA format. This consists of a single comment line that starts with a >, followed by one or more lines of sequence. Comments usually start with a unique identifier for the sequence, then some other descriptive information. To save trouble later, the unique identifier should contain letters, numbers, undescores, hyphens and periods only, and contain no spaces. The unique identifier ends at the first space. The sequence should contain only sequence characters: no spaces or numbers allowed. Either upper or lower case is acceptable, Some suggest that lines of 80 or fewer characters should be used. I prefer to put the entire sequence on a single line, which can be quite long. Nucleic acid sequences can be either DNA or RNA. In general, A, C, G, T and U are used, along with N for any base and a single (hyphen) for gaps of any length.
Some sequences also contain a variety of other letters, such as R for pyrimidine or M for amino-containing. These are not widely used and may well confuse others. However, the existence of these characters makes it necessary that you check unknown sequences for the presence of letters other than ACGTUN-.

Protein sequence use the standard 20 amino acid single letter codes. In addition, X means any amino acid, * is the translational stop codon at the end of the sequence, and there are a few other oddities. Note that FASTA format is not rigidly defined, and it is always worth checking sequence files for what exactly is being used as a format, and the information for any new program for what is an acceptable format.

FASTA format examples


>gi|282349|pir||A41961 chitinase (EC 3.2.1.14) D - Bacillus circulans LNQAVRFRPVITFALAFILIITWFAPRADAAAQWQAGTAYKQGDLVTYLNK DYECIQPHTALTGWEPSNVPALWKYVGEGTGGGTPTPDTTPPTVPAGLT SSLVTDTSVNLTWASTDNVGVTGYEVYRNGTLVANTSTTTAVVTGLTAGT TYVFTVKAKDAAGNLSAASTSLSVTTSTGSSNPGPSGSKWLIGYWHNFDN GSTNIKLRNVSTAYDVINVSFAEPISPGSGTLAFTPYNATVEEFKSDIAYLQ SQGKKVLISMGGANGRIELTDATKKRQQFEDSLKSIISTYGFNGLDIDLEGS SLSLNAGDTDFRSPTTPKIVNLINGVKALKSHFGANFVLTAAPETAYVQGG YLNYGGPWGAYLPVIHALRNDLTLLHVQHYNTGSMVGLDGRSYAQGTAD FHVAMAQMLLQGFNVGGSSGPFFSPLRPDQIAIGVPASQQAAGGGYTAP AELQKALNYLIKGVSYGGSYTLRQLRAMSVSRAL >U03518 Aspergillus awamori internal transcribed spacer 1 (ITS1) AACCTGCGGAAGGATCATTACCGAGTGCGGGTCCTTTGGGCCCAACC TCCCATCCGTGTCTATTGTACCCTGTTGCTTCGGCGGGCCCGCCGCTT GTCGGCCGCCGGGGGGGCGCCTCTGCCCCCCGGGCCCGTGCCCGC CGGAGACCCCAACACGAACACTGTCTGAAAGCGTGCAGTCTGAGTTG ATTGAATGCAATCAGTTAAAACTTTCAACAATGGATCTCTTGGTTCCGG C

Subject Databases
BLAST uses a special database format to speed up the search operation. Several pre-packaged databases exist, most notably nr which is the nonredundant database consisting of all sequences in GenBank. See the BLAST program selection guide at http://www.ncbi.nlm.nih.gov/blast/producttable.shtml . We mostly make our own databases, which is done with the formatdb program. Useful information can be found in the file /home/share/blast.dir/README.formatdb. Databases are located either in the same directory you are running BLAST from, or in /home/share/blast.dir/data. To use the /home/share/blast.dir/data directory without having to specify the entire path every time, you need to create a file called .ncbirc in your home directory that contains these two lines:
[NCBI] Data=/home/share/blast.dir/data

Using formatdb

I am just going to cover the basic options: a number of other options can be found in the README.formatdb file. By default, format db produces 3 files with the same base name but different extensions. For example, ath.nhr, ath.nsq, and ath.nin. The base name is ath, and these are nucleic acid files because the first letter of the extensions is n (it would be p for protein files). The three necessary parameters for formatdb are:
-i input data file (containing one or more sequences in FASTA format) -n output file base name (if this parameter is not set, the input file name is used as base) -p type of file: T for protein, F for nucleic acid

The o option produces another set of files used for indexing. This of course uses more space. I havent found it very useful.

Basic syntax on the command line:


formatdb i my_input_genes.txt p F n my_genes

This takes a FASTA file containing nucleic acid sequences and creates the database files my_genes.nhr, my_genes.nsq, and my_genes.nin.

Using BLAST
The best source of information is the file /home/share/blast.dir/README.bls. The program we are running is called blastall; the various types of blast are invoked using options to blastall. The program is found in /home/share/blast.dir.

The essential options are:


-p program to run (blastn, blastp, blastx, tblastn, tblastx) -d subject database (created with formatdb) -i input sequence file (FASTA format) -o output file name

basic syntax, on the command line:


blastall p blastn i my_input_file d my_database o my_blast_output.txt

Some Other BLAST options


-e cutoff value The default cutoff is 10.0. No hits with a worse (higher) value than this are reported. This number needs to be in decimal, not exponential notation. I find setting -e .001 eliminates a large number of nuisance hits. -F filter query sequence. By default this option invokes the DUST filter that masks out simple sequence repeats (low complexity sequences) that will match many places in the genome that arent really related to the query sequence. To turn it off, use F T (i.e. set it to TRUE). -m 8 This gives a tabular output that gives summary information for each hit. Very useful for processing many sequences. -M protein substitution matrix. Default is BLOSUM62. Others available are BLOSUM45, BLOSUM80, PAM30 and PAM70. These matrices define the penalties for substituting one amino acid for another. Other options can be found be typing blastall on the command line

Tabular BLAST Output

The m 8 option gives a simple table as output. Often this information is all you need. The table is tab-separated, with each hit on a separate line. You can easily parse these data by splitting each line at tabs: my @data = split /\t/; Starting at 0, the columns are:
0: query name 1: subject name 2: percent identities 3: aligned length 4: number of mismatched positions 5: number of gap positions 6: query sequence start 7: query sequence end 8: subject sequence start 9: subject sequence end 10: e-value 11: bit score

More Tabular BLAST Output


The query and subject names come from the FASTA comment lines: everything up to the first space character. Aligned length can include gaps in both the query and the subject, so it is not necessarily the same length as either of these sequences. Percent identities looks at every position in the aligned sequences and counts the number that have the same base or amino acid. This count is then divided by the aligned length and then multiplied by 100. Mismatches are positions in the aligned sequences where both query and subject have a base or amino acid. Gap positions are not included. The number of gap positions is the number of positions in the aligned sequence with a base or amino acid in one sequence and a hyphen (gap) in the other. (NOT the number of gaps regardless of length). The query sequence start and end positions are always written with the start less than the end. However, the subject sequences can have the start position greater than the end, if the alignment is on the opposite DNA strand from the way the subject sequence was originally put into the database. The e-value is the expected number of random sequences that would have at least this good a match in a random database of the same size as the subject. Thus, an e-value of 1 means that 1 match would be expected to occur by chance alone using this database.
It is worth noting that a probability of 0.05 is often considered significant in biological experiments. This value would almost never be considered a significant hit in BLAST searches. Perhaps a rule-of-thumb worst significant score would be 10-6 (1e-6). Scores less than 1e-180 are given a value of 0.0.

The bit score is a normalized (corrected for the mismatch and gap penalties) version of the score derived from the Smith-Waterman search. It is related to the e-value through an exponential function that also takes into account the length of both the query and the subject database. That is, the e-value already takes into account query and subject lengths, while the bit score doesnt. I prefer to deal with e-values.

BLAST Output
If you dont specify m 8, you get a longer, detailed output file. The output starts with basic info, identifier for query (FASTA comment line up to the first space) and the database you listed on the command line with the d option. Then a summary of each database sequence that had a significant alignment(s). Two numbers appear: the bit score (an integer) and the e-value (a floating point number with a minimum of 0 and a maximum of whatever you set e to be, defaulting to 10.0). The summary lines add scores for all hits within each subject sequence. This is useful and meaningful if the subject database contains discrete sequences such as individual genes or peptides. However, if the subject database is genomic chromosomal sequences, the summary lines can be quite misleading. After the summary lines are detailed views of each hit.

More BLAST Output: Detailed View


At the start of the details for each subject sequence, there is a > in the first column, followed by the subjects FASTA comment line. The next line gives the subject sequences length. Note that there can be multiple hits on each subject, and each will be listed separately following these subject identifier and length lines. Then, some summary information about this hit. For example:
Score = 1594 bits (804), Expect = 0.0 Identities = 804/804 (100%) Strand = Plus / Plus

The Score is the raw bit score followed by the normalized bit score in parentheses. If you use bit scores to compare sequences, you want to use the normalized bit score. Expect is the e-value. Identities is the number of matching positions between the aligned sequences, followed by the percentage of identical bases.
Sometimes the Identities line will also have a Gaps = section, which gives the number of gap positions. For protein alignments, there is also a Positives = section, which gives the number of amino acid matches that are not identical but which count as similar in the scoring matrix (e.g. isolecine and valine). The order on this line is Identities, Positives, Gaps. However, Positives and/or Gaps may be absent.

Strand is the orientation of the query sequence, then the subject sequence. Query is always Plus, but the subject will be Minus if the alignment is on the opposite strand from the original sequence put into the subject database. This line appears in nucleic acid searches but not in protein searches (which only match

More BLAST Output: Detailed View


After the summary lines, the aligned sequence appears, in groups of 3 lines. The query sequence is on top, followed by a line indicating matches, followed by the subject line. For blastn, the match line just consists of vertical lines where the bases are identical and a blank space where they are not identical. For blastp, the match line gives the amino acid code for indentical amino acids, a + for similar (positives) amino acids, and a blank space otherwise. After all hits have been displayed, the end of the output file has some statistical information about the subject database and parameters of the blast search.

Running External Programs with Perl


The most commonly used Perl command for running external programs is system. This command executes the program specified by its arguments, then returns control to the next line in your Perl program.
There is also the exec command, which executes the program given in its arguments, but never returns control to the Perl command that issued the command. Occasionally useful, but not in the context of BLAST searches. You can also enclose the program name in backicks; the programs output to STDOUT is the return value of this: $output_string = ` blastall p blastn i my_input_file d my_database`;

system returns the signal number resulting from the process (program) it executed. If all went well, this is 0. Numbers other than 0 indicate some kind of error.
You should always test the results of interacting with the external world. Thus, reasonable syntax for using the system command: if ( system(whatever see below) != 0) { die error executing system command\n; }

The simplest way to use system is to simply enclose the command line you need in quotes:
system( blastall p blastn i my_input_file d my_database o my_blast_output.txt )

The above line invokes the bash shell to interpret the command line, converting each group of symbols separated by spaces into a separate argument. You can avoid invoking the shell at all (a somewhat more secure method), by separating out the individual space-delimited segments yourself:
system( blastall, p, blastn, i, my_input_file, d, my_database, o, my_blast_output.txt )

Parsing BLAST Output

Once you have run BLAST, its output is in a file. You need to open this file, then read it line-by-line with a while statement. Because you are reading line-by-line, most of the information you gather needs to be stored in global variables: variables declared with my outside the while loop. Each new line needs to be examined for the type of information it contains using regular expressions. It is important to be able to identify the start and end points of various sections, and respond appropriately. Most information will be gathered using regular expression memory parentheses. Variables need to be re-initialized at the end of each section (or the beginning of the next one if the end is not easily discerned). We will store the information in a hash whose keys are the subject identifiers. All hits for that subject will be stored in an array. Assuming that you want the information in the detailed analysis of each hit, note that there are 3 important sections:
1. each new subject sequence starts with a > in the first column, followed by the rest of its FASTA comment line. This can spill over multiple lines in the output. Immediately following this line(s) is the Length = line. 2. Within each subject sequence there are one or more hits. Each hit starts with a Score = line, followed by an Identities = line, followed (for blastn but not blastp) with a Strand = line. 3. Within each hit there are groups of 3 lines representing the aligned sequence, including the coordinates of the start and end position of each line. There can be several such groups, for long hits.

1. Parsing New Subject Sequence Information


Each new subject sequence starts with a line having the entire FASTA comment line on it (i.e. a > in the first column), followed by zero or more additional comment lines, and ends with a Length = line.
my ($subject_ident, $comment_line, $subject_length); my %hits; # to store all data from this file; while (<INFILE>) { # code for processing other sections if (/^>/) { # finds the > in the first column of the line $subject_length = 0; # reset to 0 $comment_line = $_; while (1) { # endless loop to get all of comment line my $next_line = <INFILE>; if ($next_line =~ /Length = (\d+)/ ) { $subject_length = $1; last; # break out of loop } else { # continuation of comment line chomp $next_line; $next_line =~ s/^\s+/ /; # remove leading $next_line =~ s/\s+$/ /; # and trailing multiple spaces $comment_line .= $next_line; } # end else } # end while(1) loop $comment_line =~ />(\S+)\s/; $subject_ident = $1; $hits{$subject_ident}{subject_length} = $subject_length; $hits{$subject_ident}{comment_line} = $comment_line; } # end subject sequence info section # additional processing of BLAST output file } # end while <INFILE> loop

The useful pieces of information here are the comment line itself, the unique identifier for this subject sequence (part of the comment line), and the length.
Because these sections have easily recognized start and end points, you can deal with them as a unit, inputting new lines separately from the main while loop.

2. Parsing Hit Information


Each hit starts with the same 3 lines (2 for blastp), Score, Identities, and Strand. If you detect the Score line, you can get the next two lines without using the overall while loop: my ($e_value, $identities, $gaps, $aligned_length, $orientation);
while (<INFILE>) { if (/Score = /) { my $score_line = $_; my $identities_line = <INFILE>; my $strand_line = <INFILE>; # now process the information in these 3 lines $score_line =~ /Score.*Expect = ([-.e\d]+)/; # note the regex here! $e_value = $1; $identities_line =~ /Identities = (\d+)\/(\d+)/; # note the escaped slash ($identities, $aligned_length) = ($1, $2); $gaps = 0; # set a default if ($identities_line =~ /Gaps = (\d+)/ ) { # Gaps are not always present $gaps = $1; } $orientation = 'Plus'; if ($strand_line =~ /Plus \/ (Plus|Minus)/) { # only for blastn $orientation = $1; } push @{ $hits{$subject_ident}{hit_data} }, [$e_value, $identities, $gaps, $aligned_length, $orientation]; } # end if /Score/ } # end while <INFILE>

3. Parsing the Aligned Sequences


This section is the hardest to deal with, because the end point is not clearly delimited. We solve this problem by noting that there are 3 ways the aligned sequence section can end: it can be followed by another hit (a Score = line), by a new subject (a line with > in the first column) or by the statistical information at the end of the file (identified by Lambda in the first column). So, near the beginning of the overall while loop we will insert a section that recognizes any of these patterns, and writes the aligned sequence information to the array of information from the previous hit. However, note that this should only be done if there is a previous hit, that is, not for the first hit in the first subject sequence. We can detect this because the $subject_ident variable will be undef before the first subject line is processed, and there will be no $hits{$subject_ident}{hit_data}.

my ($query_start, $query_end, $subject_start, $subject_end); while (<INFILE>) { if ($subject_ident && exists $hits{$subject_ident}{hit_data} && (/^>/ || /Score =/ || /^Lambda/ ) ) { # but not for the first hit of the first subject push @{ $hits{$subject_ident}[-1] }, $query_start, $query_end, $subject_start, $subject_end; last if (/^Lambda/); # end of hit processing $query_start = $query_end = $subject_start = $subject_end = 0; # reset } # process the file } # end while <INFILE>

3. Parsing Aligned Sequences: 2


The aligned sequences come in groups of 3 lines, Query, the match line, and Subject. It is easiest to process each group of 3 as a unit. We will gather the start and end positions for both query and subject. It is of course possible to collect more detailed information about the sequences if desired. There will be one or more aligned sequence units. Since the start and end points are set to 0 at the beginning of whatever section follows, the start positions are only modified if their variables are at 0.

while (<INFILE>) { # previous code if (/^Query/) { my $query_line = $_; my $match_line = <INFILE>; my $subject_line = <INFILE>; $query_line =~ /Query:\s*(\d+)\D+(\d+)\s*$/; $query_start = $1 unless $query_start; $query_end = $2; $subject_line =~ /Sbjct:\s*(\d+)\D+(\d+)\s*$/; # note wierd spelling $subject_start = $1 unless $subject_start; $subject_end = $2; } # end /^Query/ # code for other sections } # end while <INFILE>

Printing Out the Information


After the end of the while <INFILE> loop. Very simple format. This whole program could be a subroutine, with a reference to the %hits hash returned.

foreach my $subject_ident (sort keys %hits) { print "$subject_ident:\n"; print " $hits{$subject_ident}{comment_line}\n"; print " Length = $hits{$subject_ident}{subject_length}\n"; foreach my $hit ( @{$hits{$subject_ident}{hit_data}} ) { print "@$hit\n"; } print "\n"; }

You might also like