Comparison of BLAST Variants

COMPARISON OF VARIANTS OF BLAST (Basic Local Alignment Search Tool)
A Thesis
Submitted in partial fulfillment of the requirement for the award of degree of Master of Engineering In Software Engineering
Under the Supervision of Ms. Inderveer Chana Senior Lecturer Computer Science and Engineering Department
Batch 2003-2005 Submitted By
Harpreet Kaur (8033107) Computer Science & Engineering Department Thapar Institute of Engineering & Technology
(Deemed University), Patiala-147004 (India).
May 2005
ABSTRACT
Large quantities of gene sequences from related species of plants, animals and microorganisms show complex patterns of similarity to one another, and many molecular biologists are convinced that understanding sequence evolution is the first step towards understanding evolution itself. This is one of the most fascinating aspects of the study of evolution. The comparison of gene sequences, or biological sequence analysis, is one of the processes used to understand sequence evolution. Just as the ancient Greeks used comparative anatomy to understand the human body and linguists used the Rosetta Stone to decipher Egyptian hieroglyphs, today we can use comparative sequence analysis to understand genomes. A variety of tools is available for sequence analysis, including many DNA sequence alignment tools, and software packages of automated tools have improved the efficiency of much biological research. Fast, economical, flexible, and extensible computing power is making computation increasingly attractive to scientists in many areas of research, including biology. More generally, the open source movement has greatly benefited biological research, and the combination of data availability and free software is revolutionizing this field. BLAST is an efficient tool for biological sequence searches. Several variants of BLAST have been developed to overcome the limitations of the original BLAST tool. I studied six variants of BLAST (BLASTN, BLASTP, BLASTX, TBLASTN, TBLASTX, PSI-BLAST). Each variant has advantages and disadvantages relative to the others. The tools are governed by different parameters, and these parameters affect the performance of the algorithm. I analyzed these variants and compared them on the basis of their algorithms, parameters, and performance.
The thesis describes which variant is most advantageous under which conditions, when each variant should be used, and how the variants can be improved by eliminating their deficiencies and adding new features.
DECLARATION
I hereby certify that the work which is being presented in the thesis entitled "Comparison of Variants of BLAST (Basic Local Alignment Search Tool)", in partial fulfillment of the requirements for the award of the degree of Master of Engineering in Software Engineering at the Computer Science and Engineering Department of Thapar Institute of Engineering and Technology (Deemed University), Patiala, is an authentic record of my own work carried out under the supervision of Ms. Inderveer Chana.
The matter presented in this thesis has not been submitted by me for the award of any other degree of this or any other University.
Harpreet Kaur

This is to certify that the above statement made by the candidate is correct and true to the best of my knowledge.

Ms. Inderveer Chana
Senior Lecturer
Computer Science and Engineering Department
Thapar Institute of Engineering and Technology
Patiala-147004

Countersigned by

Mr. R.S. Salaria
Head
Computer Science and Engineering Department
Thapar Institute of Engineering and Technology
Patiala-147004

Dr. D. S. Bawa
Dean of Academic Affairs
Thapar Institute of Engineering and Technology
Patiala-147004

ACKNOWLEDGEMENT
I wish to express my deep gratitude to Ms. Inderveer Chana, Senior Lecturer, Computer Science and Engineering Department for providing her uncanny guidance and support throughout the Thesis work.

I am also thankful to Mr. R.S. Salaria, Head, Computer Science and Engineering Department, and Mr. Rajesh Bhatia, P.G. Coordinator, for their excellent guidance and encouragement right from the beginning of this course. I would also like to thank all the staff members and my fellow students, who were always there in the hour of need and provided all the help and facilities I required for the completion of the thesis. I wish to express my indebtedness to my parents, who have been a constant source of love and encouragement. Finally, I would like to thank God for not letting me down at the time of crisis and for showing me the silver lining in the dark clouds.
Harpreet Kaur

TABLE OF CONTENTS
Abstract
Declaration
Acknowledgement
List of Figures
List of Tables
Organization of Thesis

CHAPTER 1 DATA MINING
1.1 DATA MINING
1.2 WHY DATA MINING
1.3 STEPS OF KDD PROCESS
1.4 WHAT KIND OF DATA CAN BE MINED?
    1.4.1 Relational Databases
    1.4.2 Data Warehouses
    1.4.3 Transactional Databases
    1.4.4 Multimedia Databases
    1.4.5 Spatial Databases
    1.4.6 World Wide Web
    1.4.7 Advanced DB and Information Repositories
1.5 ARCHITECTURE FOR DATA MINING SYSTEM
    1.5.1 Database, Data Warehouse, or Other Information Repository
    1.5.2 Database or Data Warehouse Server
    1.5.3 Knowledge Base
    1.5.4 Data Mining Engine
    1.5.5 Pattern Evaluation Module
    1.5.6 Graphical User Interface
1.6 DATA MINING APPLICATIONS
1.7 THE SCOPE OF DATA MINING

CHAPTER 2 BIOINFORMATICS
2.1 WHY BIOINFORMATICS
2.2 BIOINFORMATICS
2.3 AIMS OF BIOINFORMATICS
2.4 STEPS OF KDD FOR BIOINFORMATICS
2.5 WHAT KIND OF DATA CAN BE MINED?
    2.5.1 DNA
    2.5.2 RNA
    2.5.3 PROTEIN
2.6 DATA MINING TECHNIQUES IN BIOINFORMATICS
    2.6.1 Clustering
    2.6.2 Classification
    2.6.3 Association
2.7 THE CENTRAL DOGMA
    2.7.1 Transcription
    2.7.2 The Genetic Code
2.8 NEED OF DATA MINING IN BIOINFORMATICS
2.9 BIOINFORMATICS AND ITS SCOPE
2.10 APPLICATIONS OF BIOINFORMATICS

CHAPTER 3 INTRODUCTION TO BLAST
3.1 INTRODUCTION
3.2 DATABASES AVAILABLE FOR BLAST SEARCH INCLUDE
    3.2.1 Protein Sequence Databases
    3.2.2 Nucleotide Sequence Databases
3.3 BLAST ALGORITHM
3.4 BLAST PARAMETERS
3.5 FEATURES OF BLAST
    3.5.1 Heuristic
    3.5.2 Substitution Matrix
    3.5.3 Local Alignments
    3.5.4 Ungapped Alignments
    3.5.5 Explicit Statistical Theory
    3.5.6 Rapid
    3.5.7 Sequence Input
    3.5.8 Results Format
    3.5.9 BLAST Output

CHAPTER 4 VARIANTS OF BLAST
4.1 BLAST VARIANTS
4.2 PSI-BLAST
4.3 BLASTN
4.4 BLASTX
4.5 BLASTP
    4.5.1 BLASTP PARAMETERS
4.6 TBLASTN
4.7 TBLASTX
    4.7.1 Limitations of TBlastX

CHAPTER 5 COMPARISON OF VARIANTS OF BLAST
5.1 INTRODUCTION
    5.1.1 Comparison On The Basis Of Parameters
5.2 COMPARISON ON THE BASIS OF ALGORITHM
    5.2.1 The Two-Hit Algorithm Isn't Used In BLASTN, Because Word Hits Are Generally Rare With Large Identical Words
    5.2.2 Extension in BlastN is different from BlastP and other protein based programs
5.3 COMPARISON ON THE BASIS OF PERFORMANCE
    5.3.1 Comparison On The Basis of Varying Expect Values
    5.3.2 Comparison On The Basis of Word Size
    5.3.3 Comparison on the Basis of Execution Time

CHAPTER 6 CONCLUSION AND FUTURE SCOPE
6.1 CONCLUSION
6.2 FUTURE SCOPE

REFERENCES
LIST OF PUBLICATIONS
GLOSSARY
LIST OF FIGURES
Figure 1.1 The Process of Knowledge Discovery
Figure 1.2 Architecture of a Typical Data Mining System
Figure 2.1 The KDD Process For Bioinformatics
Figure 2.2 DNA Molecule
Figure 2.3 Protein Molecule
Figure 3.1 Protein Database
Figure 3.2 Nucleotide Database
Figure 3.3 List of Words From Query Sequence
Figure 3.4 Exact Matches of Words From Word List
Figure 3.5 Maximal Segment Pairs
Figure 3.6 Figure Shows The Word Size Option
Figure 4.1 Blast Variants
Figure 4.2 Blast Variants
Figure 4.3 PSI-Blast
Figure 4.4 PSI-Blast Step1
Figure 4.5 PSI-Blast Step2
Figure 4.6 PSI-Blast Output
Figure 4.7 PSI-Blast Output
Figure 4.8 PSI Blast
Figure 4.9 BlastN
Figure 4.10 Using BlastN for Comparison
Figure 4.11 BlastN Results
Figure 4.12 BlastX
Figure 4.13 Using BlastX for Comparison
Figure 4.14 BlastX Results
Figure 4.15 BlastP
Figure 4.16 Using BlastP for Comparison
Figure 5.1 Conserved Domain Search For Blastn And Blastp
Figure 5.2 Different Word size for BlastN and BlastP
Figure 5.3 Empirically Estimated Probability That An HSP Is Missed By This Method, As a Function of Its Normalized Score
Figure 5.4 Speeds of The One-Hit And Two-Hit Methods
Figure 5.5 Comparison - Varying Expect Values
Figure 5.6 Comparison - Varying Expect Values
Figure 5.7 Varying Expect Values For Blastn
Figure 5.8 Varying Expect Values Blastn
Figure 5.9 Varying Expect Values For Variants
Figure 5.10 Varying Expect Values For Variants
Figure 5.11 Compares The Performance of BLAST Compiled With 32-Bit And 64-Bit Processor
LIST OF TABLES
Table 2.1 The 20 Amino Acids and their official codes
Table 4.1 Programs Available For Blast
Table 5.1 No of hits for varying expect values
Table 5.2 No of hits for varying expect values BlastN
Table 5.3 No of Hits For Varying Word Size
Table 5.4 Varying Execution Time
ORGANIZATION OF THESIS
The thesis entitled "Comparison of Variants of BLAST (Basic Local Alignment Search Tool)" is concerned with the comparison of variants of BLAST. All tools are compared according to defined criteria.
The first chapter briefly introduces data mining technology and the techniques used in data mining. The process of knowledge discovery in databases is also discussed.
The second chapter covers the field of bioinformatics, the need for bioinformatics, and the kinds of data to which bioinformatics is applied.
The third chapter explains BLAST, the biological tool used for sequence similarity searching, together with the BLAST algorithm and the features of BLAST.
The fourth chapter explores the variants of BLAST (BlastN, BlastX, BlastP, TBlastN, TBlastX, PSI-Blast), covering the algorithm, the parameters, and the performance criteria of each tool.
In the fifth chapter, the variants of BLAST are compared on the basis of parameters, algorithms and performance. Deficiencies in the parameters, and possible improvements to them, are also highlighted.
CHAPTER 1
DATA MINING
1.1 DATA MINING
Data mining is the extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases. It is the process of selecting, exploring, and modeling large amounts of data to uncover previously unknown patterns or information for a business advantage [4]. Data mining can be viewed as an analytical process designed to explore data (usually large amounts of data, typically business or market related) in search of consistent patterns and/or systematic relationships between variables, and then to validate the findings by applying the detected patterns to new subsets of data. Many terms carry a similar or slightly different meaning to data mining, such as knowledge mining from databases, knowledge extraction, data/pattern analysis, data archaeology, and data dredging. It is a young interdisciplinary field, drawing from areas such as database systems, data warehousing, statistics, machine learning, data visualization, information retrieval, and high-performance computing. Other contributing areas include neural networks, pattern recognition, spatial data analysis, image databases, signal processing, and many application fields, such as business, economics, and bioinformatics.
1.2 WHY DATA MINING

We are drowning in data, but starving for knowledge! Necessity is the mother of invention: automated data collection tools and mature database technology have led to tremendous amounts of data being stored in databases, data warehouses and other information repositories. Every day the world creates 52,000 terabytes of data, and only 4% of that data is used for any purpose. The thought that something useful could be done with this data gave birth to the field of data mining. Database technology began with the development of data collection and database creation mechanisms, which led to the development of effective mechanisms for data management, including data storage and retrieval and query and transaction processing. The large number of database systems offering query and transaction processing eventually and naturally led to the need for data analysis and understanding. Data mining developed out of this necessity.
1.3 STEPS OF KDD PROCESS

Knowledge discovery is defined as "the non-trivial extraction of implicit, unknown, and potentially useful information from data". The knowledge discovery process takes the raw results from data mining (the process of extracting trends or patterns from data) and carefully and accurately transforms them into useful and understandable information [6]. The overall process of finding and interpreting patterns from data involves the repeated application of the following steps:

1. Developing an understanding of
   o the application domain
   o the relevant prior knowledge
   o the goals of the end-user
2. Creating a target data set: selecting a data set, or focusing on a subset of variables or data samples, on which discovery is to be performed.
3. Data cleaning and preprocessing.
   o Removal of noise or outliers.
   o Collecting necessary information to model or account for noise.
   o Strategies for handling missing data fields.
   o Accounting for time sequence information and known changes.
4. Data reduction and projection.
   o Finding useful features to represent the data depending on the goal of the task.
   o Using dimensionality reduction or transformation methods to reduce the effective number of variables under consideration or to find invariant representations for the data.
5. Choosing the data mining task.
   o Deciding whether the goal of the KDD process is classification, regression, clustering, etc.
6. Choosing the data mining algorithm(s).
   o Selecting method(s) to be used for searching for patterns in the data.
   o Deciding which models and parameters may be appropriate.
   o Matching a particular data mining method with the overall criteria of the KDD process.
7. Data mining.
   o Searching for patterns of interest in a particular representational form or a set of such representations, such as classification rules or trees, regression, clustering, and so forth.
8. Interpreting mined patterns.
9. Consolidating discovered knowledge.

Figure 1.1 The process of Knowledge Discovery [22]

The terms knowledge discovery and data mining are distinct. KDD refers to the overall process of discovering useful knowledge from data. It involves the evaluation and possibly interpretation of the patterns to decide what qualifies as knowledge. It also includes the choice of encoding schemes, preprocessing, sampling, and projections of the data prior to the data mining step.
Data mining refers to the application of algorithms for extracting patterns from data without the additional steps of the KDD process.
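To make the distinction between the overall KDD process and the data mining step concrete, the steps listed above can be sketched as a small pipeline. This is only an illustrative sketch: the record fields (`id`, `length`, `note`), the function names, and the deliberately trivial "pattern" (a split around the mean) are assumptions made for this example and do not come from any particular KDD system.

```python
from statistics import mean

def select_target(records, fields):
    """Step 2: create the target data set by keeping only chosen variables."""
    return [{f: r.get(f) for f in fields} for r in records]

def clean(records):
    """Step 3: one simple strategy for missing fields is to drop the record."""
    return [r for r in records if all(v is not None for v in r.values())]

def reduce_features(records, field):
    """Step 4: project every record onto a single numeric feature."""
    return [r[field] for r in records]

def mine(values):
    """Step 7: the data mining step proper, here a trivial split around the mean."""
    threshold = mean(values)
    return {
        "threshold": threshold,
        "high": [v for v in values if v > threshold],
        "low": [v for v in values if v <= threshold],
    }

# Hypothetical raw data; record 2 has a missing value.
raw = [
    {"id": 1, "length": 120, "note": "ok"},
    {"id": 2, "length": None, "note": "missing"},
    {"id": 3, "length": 300, "note": "ok"},
    {"id": 4, "length": 180, "note": "ok"},
]

target = select_target(raw, ["id", "length"])
cleaned = clean(target)
values = reduce_features(cleaned, "length")
pattern = mine(values)
print(pattern)
```

Only the call to `mine` corresponds to data mining in the narrow sense; the selection, cleaning and reduction functions belong to the surrounding KDD process.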
1.4 WHAT KIND OF DATA CAN BE MINED?

Data mining is not specific to one type of media or data. Data mining should be applicable to any kind of information repository. However, algorithms and approaches may differ when applied to different types of data. Data mining is being put into use and studied for databases, including relational databases, object-relational databases and object-oriented databases, data warehouses, transactional databases, unstructured and semi-structured repositories such as the World Wide Web, advanced databases such as spatial databases, multimedia databases, time-series databases and textual databases, and even flat files.
1.4.1 Relational Databases

A relational database consists of a set of tables containing either values of entity attributes, or values of attributes from entity relationships. Tables have columns and rows, where columns represent attributes and rows represent tuples. A tuple in a relational table corresponds to either an object or a relationship between objects and is identified by a set of attribute values representing a unique key. The most commonly used query language for relational databases is SQL, which allows retrieval and manipulation of the data stored in the tables, as well as the calculation of aggregate functions such as average, sum, min, max and count.
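The aggregate functions mentioned above can be tried on a small in-memory table. The following is a minimal sketch using Python's standard `sqlite3` module; the `employee` table, its columns and its rows are hypothetical and serve only as an illustration.

```python
import sqlite3

# In-memory database; the table name and columns are made up for this example.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT, dept TEXT, salary REAL)"
)
conn.executemany(
    "INSERT INTO employee (id, name, dept, salary) VALUES (?, ?, ?, ?)",
    [(1, "A", "CS", 100.0), (2, "B", "CS", 200.0), (3, "C", "EE", 150.0)],
)

# Each row is a tuple identified by its key (id); aggregates summarize many tuples.
row = conn.execute(
    "SELECT COUNT(*), SUM(salary), AVG(salary), MIN(salary), MAX(salary) FROM employee"
).fetchone()

# The same aggregate computed per group of tuples.
by_dept = conn.execute(
    "SELECT dept, AVG(salary) FROM employee GROUP BY dept ORDER BY dept"
).fetchall()

conn.close()
print(row)
print(by_dept)
```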
1.4.2 Data Warehouses

A data warehouse is a repository of information collected from multiple sources, stored under a unified schema, and usually residing at a single site. Data warehouses are constructed via a process of data cleaning, data transformation, data integration, data loading and periodic data refreshing.
1.4.3 Transactional Databases

A transactional database consists of a file where each record represents a transaction. A transaction typically includes a unique transaction identity number and a list of items making up the transaction.
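As a sketch of this record structure, each transaction below is a dictionary holding an identifier and a list of items. The transaction identifiers and item names are invented for this example, and counting how often two items occur together is just one simple analysis that such data supports.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions: a unique identifier plus the items in the transaction.
transactions = [
    {"tid": "T100", "items": ["bread", "milk"]},
    {"tid": "T200", "items": ["bread", "butter", "milk"]},
    {"tid": "T300", "items": ["butter", "jam"]},
]

def pair_counts(transactions):
    """Count in how many transactions each unordered pair of items co-occurs."""
    counts = Counter()
    for t in transactions:
        # Sorting gives every pair a canonical order; set() ignores repeated items.
        for pair in combinations(sorted(set(t["items"])), 2):
            counts[pair] += 1
    return counts

counts = pair_counts(transactions)
print(counts.most_common(1))
```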
1.4.4 Multimedia Databases

Multimedia databases include video, images, audio and text media. They can be stored on extended object-relational or object-oriented databases, or simply on a file system. Multimedia is characterized by its high dimensionality, which makes data mining even more challenging. Data mining from multimedia repositories may require computer vision, computer graphics, image interpretation, and natural language processing methodologies.
1.4.5 Spatial Databases

Spatial databases are databases that, in addition to usual data, store geographical information like maps, and global or regional positioning. Such spatial databases present new challenges to data mining algorithms.
1.4.6 World Wide Web

The World Wide Web is the most heterogeneous and dynamic repository available. A very large number of authors and publishers are continuously contributing to its growth and metamorphosis, and a massive number of users are accessing its resources daily. Data in the World Wide Web is organized in inter-connected documents. These documents can be text, audio, video, raw data, and even applications. Conceptually, the World Wide Web is comprised of three major components: The content of the Web, which encompasses documents available; the structure of the Web, which covers the hyperlinks and the relationships between documents; and the usage of the web, describing how and when the resources are accessed.
1.4.7 Advanced DB and Information Repositories

Object-oriented databases
Object-oriented databases are based on the object-oriented programming paradigm, where each entity is considered as an object. Each object has associated with it a set of variables, a set of messages and a set of methods. Objects that share a common set of properties can be grouped into an object class. Each object is an instance of its class. For example, an employee object can contain variables like name, address and birth date.
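The employee example can be sketched as an object class. The class below is illustrative only: the method `age_on` and the sample values are assumptions added to show how variables and methods sit together in one object, and the sketch says nothing about how an object-oriented database would actually store such objects.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Employee:
    # Variables shared by every object of the class.
    name: str
    address: str
    birth_date: date

    def age_on(self, day: date) -> int:
        """A method of the class: age in whole years on the given day."""
        years = day.year - self.birth_date.year
        # Subtract one year if the birthday has not yet occurred by that day.
        if (day.month, day.day) < (self.birth_date.month, self.birth_date.day):
            years -= 1
        return years

# Each object is an instance of its class (the values here are hypothetical).
e = Employee(name="Asha", address="Patiala", birth_date=date(1980, 6, 15))
print(e.age_on(date(2005, 5, 1)))
```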
Object-relational databases
The object-relational model extends the basic relational data model by adding the power to handle complex data types, class hierarchies and object inheritance. These databases are becoming more popular in industry and applications.
Spatial databases
Spatial databases include spatially related information. Such databases include geographical databases, VLSI chip design databases, and medical and satellite image databases.
Temporal databases and time-series databases
Temporal databases usually store relational data that include time-related attributes. A time-series database stores sequences of values that change with time, such as data collected regarding the stock exchange.
Legacy databases
A legacy database is a group of heterogeneous databases that combines different kinds of data systems, such as relational or object-oriented databases, spreadsheets or file systems.
1.5 ARCHITECTURE FOR DATA MINING SYSTEM

The architecture of a typical data mining system has the following components [11]:
1.5.1 Database, Data Warehouse, or Other Information Repository

This is one database or a set of databases, data warehouses, spreadsheets, or other kinds of information repositories. Data cleaning and data integration techniques may be performed on the data.
1.5.2 Database or Data Warehouse Server

The database or data warehouse server is responsible for fetching the relevant data, based on the user's data mining request [14]. A data warehouse is a repository for long-term storage of data from multiple sources, organized so as to facilitate management decision making. The data are stored under a unified schema and are typically summarized. Data warehouse systems provide some data analysis capabilities, collectively referred to as OLAP (On-Line Analytical Processing).
1.5.3 Knowledge Base

This is the domain knowledge that is used to guide the search or to evaluate the interestingness of the resulting patterns. Knowledge such as user beliefs, which can be used to assess a pattern's interestingness based on its unexpectedness, may be included. Other examples of domain knowledge are additional interestingness constraints or thresholds, and metadata.
Figure 1.2 Architecture of a typical data mining system
1.5.4 Data Mining Engine

This is essential to the data mining system and ideally consists of a set of functional modules for tasks such as characterization, association, classification, cluster analysis, and evolution and deviation analysis.
1.5.5 Pattern Evaluation Module

This component typically employs interestingness measures and interacts with the data mining modules so as to focus the search towards interesting patterns. It may use interestingness thresholds to filter out discovered patterns.
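Threshold-based filtering of this kind can be sketched in a few lines. The text above does not name specific interestingness measures, so the use of support and confidence here, the threshold values, and the example rules are all assumptions chosen for illustration.

```python
def filter_patterns(patterns, min_support, min_confidence):
    """Keep only patterns that meet both (assumed) interestingness thresholds."""
    return [
        p for p in patterns
        if p["support"] >= min_support and p["confidence"] >= min_confidence
    ]

# Hypothetical patterns as a data mining engine might report them.
patterns = [
    {"rule": "bread -> milk", "support": 0.40, "confidence": 0.80},
    {"rule": "jam -> butter", "support": 0.05, "confidence": 0.90},
    {"rule": "milk -> bread", "support": 0.40, "confidence": 0.50},
]

kept = filter_patterns(patterns, min_support=0.10, min_confidence=0.60)
print([p["rule"] for p in kept])
```

In this sketch the second rule is dropped for low support and the third for low confidence, so only the first rule survives the filter.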
1.5.6 Graphical User Interface

This module communicates between users and the data mining system, allowing users to interact with the system by specifying a data mining query or task, providing information to help focus the search, and performing exploratory data mining based on intermediate data mining results.
1.6 DATA MINING APPLICATIONS

The Google system uses a mathematical algorithm called PageRank to estimate the relative importance of individual web pages based on link patterns [19]. Financial institutions have reduced incidents of credit-card fraud through the application of neural networks, which feature circuits arranged in a brain-like configuration that can infer patterns from data. The medical sector is also taking advantage of data mining: one application involves a collaboration between IBM and the Mayo Clinic to detect patterns in medical records, while another project uses natural-language processing to map out the "grammar" of amino acid sequences and match them to specific protein shapes and functions. Government organizations such as the Defense Department and the National Security Agency are using AI technology for several efforts related to national security, such as the Echelon telecom monitoring system. The Defense Advanced Research Projects Agency (DARPA) is a leading AI research investor, and the breakthroughs that come out of DARPA-funded projects are more often than not put to civilian rather than military use.
Marketing: In marketing, the primary application is database marketing systems, which analyze customer databases to identify different customer groups and forecast their behavior. Business Week (Berry 1994) estimated that over half of all retailers are using or planning to use database marketing, and those who do use it have good results; for example, American Express reports a 10- to 15-percent increase in credit-card use. Another notable marketing application is market-basket analysis.
Investment: Numerous companies use data mining for investment, but most do not describe their systems. One exception is LBS Capital Management. Its system uses expert systems, neural nets, and genetic algorithms to manage portfolios totaling $600 million; since its start in 1993, the system has outperformed the broad stock market (Hall, Mani, and Barr 1996).
Fraud detection: The HNC Falcon and Nestor PRISM systems are used for monitoring credit card fraud, watching over millions of accounts. The FAIS system (Senator et al. 1995), from the U.S. Treasury Financial Crimes Enforcement Network, is used to identify financial transactions that might indicate money laundering activity.
Telecommunications: The telecommunications alarm-sequence analyzer (TASA) was built in cooperation with a manufacturer of telecommunications equipment and three telephone networks (Mannila, Toivonen, and Verkamo 1995). The system uses a novel framework for locating frequently occurring alarm episodes from the alarm stream and presenting them as rules. Large sets of discovered rules can be explored with flexible information-retrieval tools supporting interactivity and iteration. In this way, TASA offers pruning, grouping, and ordering tools to refine the results of a basic brute-force search for rules.
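The idea behind PageRank, mentioned at the start of this section, can be illustrated with a textbook power iteration over a tiny link graph. This is a sketch under stated assumptions and not a description of Google's actual system: the four-page graph is invented, the damping factor of 0.85 is a commonly quoted conventional value rather than something given in this thesis, and the fixed iteration count stands in for a proper convergence test.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration sketch. `links` maps each page to the pages it links to;
    every link target must itself appear as a key."""
    pages = sorted(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start from a uniform distribution
    for _ in range(iterations):
        # Every page receives a small baseline share regardless of links.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                # A page passes its rank on in equal parts to the pages it links to.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page without outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical link graph: C is linked from A, B and D; nothing links to D.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this example the page with the most incoming links (C) ends up with the highest score and the page with none (D) with the lowest, which is the qualitative behavior the link-pattern description suggests.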
1.7 THE SCOPE OF DATA MINING

Data mining derives its name from the similarities between searching for valuable business information in a large database (for example, finding linked products in gigabytes of store scanner data) and mining a mountain for a vein of valuable ore. Both processes require either sifting through an immense amount of material or intelligently probing it to find exactly where the value resides. Given databases of sufficient size and quality, data mining technology can generate new business opportunities by providing these capabilities [21]:
Automated prediction of trends and behaviors. Data mining automates the process of finding predictive information in large databases. Questions that traditionally required extensive hands-on analysis can now be answered directly from the data quickly. A typical example of a predictive problem is targeted marketing. Data mining uses data on past promotional mailings to identify the targets most likely to maximize return on investment in future mailings. Other predictive problems include forecasting bankruptcy and other forms of default, and identifying segments of a population likely to respond similarly to given events.
Automated discovery of previously unknown patterns. Data mining tools sweep through databases and identify previously hidden patterns in one step. An example of pattern discovery is the analysis of retail sales data to identify seemingly unrelated products that are often purchased together. Other pattern discovery problems include detecting fraudulent credit card transactions and identifying anomalous data that could represent data entry keying errors.
Data mining techniques can yield the benefits of automation on existing software and hardware platforms, and can be implemented on new systems as existing platforms are upgraded and new products developed. When data mining tools are implemented on high performance parallel processing systems, they can analyze massive databases in minutes. Faster processing means that users can automatically experiment with more models to understand complex data. High speed makes it practical for users to analyze huge quantities of data. Larger databases, in turn, yield improved predictions.
CHAPTER 2
BIOINFORMATICS
2.1 WHY BIOINFORMATICS
The information for the set-up of living organisms is stored in the sequences of nucleotides in DNA. DNA serves two purposes: to provide this information during the life cycle of a cell and to pass it on to offspring. The discovery of genes and the genetic code raised the hope of being able to read the information stored in our genes, and today we are able to do so: massive progress in sequencing technology has delivered entire genomes to our fingertips. The era of genomics and proteomics has opened up the opportunity to go beyond the analysis of single genes and proteins, towards understanding the interactions between all components of genomes and proteomes. After trying to comprehend life by cutting it into smaller and smaller pieces, we are beginning to study it in the same way it has been functioning since its beginning: as a whole. Computer scientists are important allies for biologists in the struggle to understand the information in DNA. On the one hand, the massive amount of sequence data requires new tools (computers and programs) to generate, check, store, and access these data. On the other hand, the deciphering of genomes necessitates the development of new hardware and software that can detect genes, determine the relationships between them, and study their expression, in order to understand the basis of development and disease. Bioinformatics provides the tools to understand the information in biological data.
2.2 BIOINFORMATICS
Bioinformatics has evolved into a full-fledged multidisciplinary subject that integrates developments in information and computer technology as applied to biotechnology and the biological sciences. Bioinformatics uses computer software tools for database creation, data management, data warehousing, data mining and global communication networking. Bioinformatics is the recording, annotation, storage, analysis, and searching/retrieval of nucleic acid sequence (genes and RNAs), protein sequence and structural information [2]. This includes databases of the sequences and structural information as well as methods to access, search, visualize and retrieve the information. Bioinformatics concerns the creation and maintenance of databases of biological information whereby researchers can both access existing information and submit new entries. Bioinformatics includes sequence analysis (used by geneticists, cell biologists and molecular biologists), molecular modeling (used by crystallographers, cell biologists and biochemists), molecular phylogeny/evolution, ecology and population studies, and medical informatics. The most pressing tasks in bioinformatics involve the analysis of sequence information. Computational biology is the name given to this process, and it involves the following:

Finding the genes in the DNA sequences of various organisms.
Developing methods to predict the structure and/or function of newly discovered proteins and structural RNA sequences.
Clustering protein sequences into families of related sequences and the development of protein models.
Aligning similar proteins and generating phylogenetic trees to examine evolutionary relationships.
2.3 AIMS OF BIOINFORMATICS

The aims of bioinformatics are basically three-fold. They are:
To organize data in such a way that researchers can access existing information and submit new entries as they are produced. While data creation is an essential task, the information stored in these databases is useless unless analyzed. Thus the purpose of bioinformatics extends well beyond mere volume control.
To develop tools and resources that help in the analysis of data. For example, having sequenced a particular protein, it is of interest to compare it with previously characterized sequences. This requires more than just a straightforward database search. As such, programs such as FASTA and PSI-BLAST must consider what constitutes a biologically significant resemblance. Development of such resources requires extensive knowledge of computational theory, as well as a thorough understanding of biology.
To use these tools to analyze individual systems in detail, and frequently to compare them with a few that are related.
2.4 STEPS OF KDD FOR BIOINFORMATICS

KDD for bioinformatics involves the same steps as KDD in simple databases. The only difference is the data on which the data mining is performed: here the data is biomolecular data rather than simple database records [2]. It may involve, for example, DNA or RNA sequences. KDD for bioinformatics is shown in Figure 2.1.
2.5 WHAT KIND OF DATA CAN BE MINED?

KDD for bioinformatics can be applied to biomolecular data. Biomolecular data consists of the following types:

DNA (deoxyribonucleic acid)
RNA (ribonucleic acid)
Protein sequences (2D & 3D structures)
2.5.1 DNA
In most living organisms (except for viruses), genetic information is stored in the molecule deoxyribonucleic acid, or DNA. DNA is made and resides in the nucleus of living cells. DNA gets its name from the sugar molecule contained in its backbone (deoxyribose); however, it gets its significance from its unique structure. There are four different nucleotide bases that occur in DNA:
A - adenine
T - thymine
C - cytosine
G - guanine
Figure 2.1 The KDD for Bioinformatics
The versatility of DNA comes from the fact that the molecule is actually double-stranded. The nucleotide bases of the DNA molecule form complementary pairs: each nucleotide base forms hydrogen bonds with a nucleotide base in the opposite strand of DNA. This bonding is specific: adenine always bonds to thymine (and vice versa) and guanine always bonds to cytosine (and vice versa). This bonding occurs across the molecule, leading to a double-stranded system as shown in Figure 2.2.
Figure 2.2 DNA Molecule
The fundamental chemical building block of deoxyribonucleic acid (DNA) is the nucleotide. A nucleotide consists of three parts: (1) a nitrogen-containing pyrimidine or purine base, (2) a deoxyribose sugar, and (3) a phosphate group that acts as a bridge between adjacent deoxyribose sugars. The double-stranded DNA molecule has the unique ability that it can make exact copies of itself, or self-replicate. When more DNA is required by an organism (such as during reproduction or cell growth) the hydrogen bonds between the nucleotide bases break and the two single strands of DNA separate. New complementary bases are brought in by the cell and paired up with each of the two separate strands thus forming two new identical, double-stranded DNA molecules.
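The pairing rule described above (A with T, G with C) is enough to derive one strand from the other, which is in essence what happens when a strand is copied during replication. The following is a minimal Python sketch of that rule only; the function names are illustrative, and the convention of reading the opposite strand in reverse order is an assumption added here, not something stated in the text.

```python
# Watson-Crick pairing as described in the text: A pairs with T, G pairs with C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand):
    """Return the base-by-base complement of a DNA strand."""
    return "".join(COMPLEMENT[base] for base in strand.upper())


def reverse_complement(strand):
    """Return the complementary strand read in the opposite direction.

    Assumption: the two strands are read in opposite directions, so the
    partner strand is the complement written backwards.
    """
    return complement(strand)[::-1]


if __name__ == "__main__":
    print(complement("ATGC"))
    print(reverse_complement("ATGC"))
```

Taking the complement twice returns the original strand, which mirrors the statement that each separated strand can be used to rebuild an identical double-stranded molecule.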
2.5.2 RNA
RNA stands for ribonucleic acid. It is a long molecule but is usually single-stranded, except when it folds back on itself. It differs chemically from DNA by containing ribose instead of deoxyribose and uracil (U) instead of thymine (T). So the only important differences between RNA and DNA are that:
RNA differs from DNA by one nucleotide.
RNA comes as a single strand.
The four bases of RNA are:
A - adenine
U - uracil
C - cytosine
G - guanine
Some programs automatically handle the U-instead-of-T conversion, and many do not even distinguish between the two classes of nucleic acids. Don't be surprised if a database entry displays RNA sequences with a T instead of a U. In fact, RNA sequences are encoded in the DNA.
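The U-instead-of-T conversion mentioned above is a simple character substitution. A minimal Python sketch, with illustrative function names that do not come from any particular tool:

```python
def dna_to_rna(dna):
    """Rewrite a DNA-alphabet sequence in the RNA alphabet (T becomes U)."""
    return dna.upper().replace("T", "U")


def rna_to_dna(rna):
    """Rewrite an RNA-alphabet sequence in the DNA alphabet (U becomes T)."""
    return rna.upper().replace("U", "T")


if __name__ == "__main__":
    print(dna_to_rna("ATGCTT"))
    print(rna_to_dna("AUGCUU"))
```

This is why a database can store an RNA sequence with T in place of U without losing information: the two spellings convert into each other exactly.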
#    1-Letter Code   3-Letter Code   Name
1    A               Ala             Alanine
2    R               Arg             Arginine
3    N               Asn             Asparagine
4    D               Asp             Aspartic acid
5    C               Cys             Cysteine
6    Q               Gln             Glutamine
7    E               Glu             Glutamic acid
8    G               Gly             Glycine
9    H               His             Histidine
10   I               Ile             Isoleucine
11   L               Leu             Leucine
12   K               Lys             Lysine
13   M               Met             Methionine
14   F               Phe             Phenylalanine
15   P               Pro             Proline
16   S               Ser             Serine
17   T               Thr             Threonine
18   W               Trp             Tryptophan
19   Y               Tyr             Tyrosine
20   V               Val             Valine
Table 2.1: The 20 Amino Acids and their Official Codes
2.5.3 PROTEIN
A protein is a polymer built from amino acids. The representation most commonly used by biologists to describe a protein is its sequence. A protein sequence is made up of 20 different amino acids, each represented by a letter. These amino acids along with their codes are shown in Table 2.1.
Figure 2.3 Protein Molecule
2.6 DATA MINING TECHNIQUES IN BIOINFORMATICS

There are many data mining techniques available that can be applied to biomolecular data. Clustering, classification and association, which are very useful for biomolecular data, are discussed below. These techniques are able to discover previously hidden patterns in biomolecular data [2].
2.6.1 Clustering
The search for protein structure motifs begins with the knowledge that some proteins with low sequence similarity fold into remarkably similar 3-D conformations. Even globally different structures may share similar or identical substructures. Protein motifs can be divided into four categories:
i. Sequence motifs: linear strings of amino acid residues with a topological ordering.
ii. Sequence-structure motifs.
iii. Structure motifs: 3-D objects that correspond to a protein backbone.
iv. Structure-sequence motifs: structure motifs in which nodes of the graph are annotated with sequence information.
Motifs and the methods used to discover them can be assessed by the following criteria.
Predictability: the degree to which a motif representing one level or facet of protein structure or function may be predicted from knowledge of another. For the local structure motifs designated as secondary structure, predictability is the ability to accurately predict secondary structure classes from amino acid sequence.
Predictive utility: the flip side of the predictability criterion. For example, if one takes the view of secondary structure as an intermediate-level encoding between primary structure and tertiary structure, then predictive utility ought to be some measure of the gain in accuracy in predicting tertiary structure with a particular encoding, as compared with prediction using other possible encodings. Another, more direct measure might be the degree to which a particular set of proposed motifs, corresponding to secondary structure classes, constrains the alpha and gamma angles of the included structure fragments.
Intelligibility: the ease with which researchers and practitioners of protein science can understand a given structure motif and can incorporate its information into their own work. Many factors affect intelligibility. For example, a discovered structure class that contains one-third traditional alpha helix, one-third traditional beta sheet and one-third coil is harder to explain than one that correlates almost perfectly with alpha helix.
Naturalness: the degree to which a motif captures some essential biochemical or evolutionary properties, or some essential class structure in the space of protein sequence or structure fragments under consideration. Some clustering methods are infamous for finding ersatz clusters in uniformly distributed data. Other clustering methods produce results that depend heavily on their starting point. To avoid such results it is important to choose appropriate representations and attributes for classification carefully.
Systematicity: the degree to which a motif discovery method is derived from explicitly stated principles and the degree to which the method can repeatedly be applied to diverse data and produce consistent results.
Ease of discovery: the computational complexity and data complexity of the methods required to discover the motif.
2.6.2 Classification
Pattern discovery is a fundamental operation for finding knowledge. A pattern in a biosequence can help scientists analyze the properties of a sequence or predict the function of a new entity. The pattern may also help to classify an unknown sequence or to assign the sequence to an existing family.
2.6.3 Association
Some qualities or traits in a species do not come alone; they come associated with other fundamental differences. So if one particular characteristic (pattern) occurs in a sequence, whether an associated pattern also occurs depends on the confidence of that particular association. Types of association:
i. Association can be for a pair or set of similarities in the same sequence.
ii. Association can be for a pair or set of similarities in two sequences.
iii. Association can be for a pair or set of similarities in multiple sequences.
2.7 THE CENTRAL DOGMA

The expression of the genetic information stored in DNA involves the translation of a linear sequence of nucleotides into a co-linear sequence of amino acids in proteins. The flow is: DNA -> mRNA -> Protein [2].

2.7.1 Transcription
A segment of DNA is first copied into a complementary strand of RNA. This process, called transcription, is catalyzed by the enzyme RNA polymerase. Near most genes there is a special pattern in the DNA called a promoter, located upstream of the transcription start site, which informs the RNA polymerase where to begin transcription. This is achieved with the assistance of transcription factors that recognize the promoter sequence and bind to it.
Although ribonucleic acid (RNA) is a long chain of nucleotides (as is DNA), it has very different properties. First, RNA is usually single-stranded (denoted ssRNA). Second, RNA has a ribose sugar, rather than deoxyribose. Third, RNA has the pyrimidine base uracil (abbreviated U) instead of thymine. Fourth, unlike DNA, which is located primarily in the nucleus, RNA can also be found in the cellular liquid outside the nucleus, which is called the cytoplasm.
In eukaryotic organisms, to produce a protein the entire length of the gene, including both its introns and its exons, is first transcribed into a very large RNA molecule, the primary transcript. At the end of the gene the transcription stops, and a few dozen adenine (A) nucleotides are added to the end of the RNA molecule for protection (the poly-A tail). The 5' cap plays an important part in the initiation of protein synthesis by protecting the growing RNA transcript from degradation. Before this RNA molecule leaves the nucleus, a complex of RNA processing enzymes removes all the intron sequences, in a process called splicing, thereby producing a much shorter RNA molecule. Typical eukaryotic exons have an average length of 200 bp, while the average length of introns is around 10000 bp (these lengths can vary greatly between different introns and exons). In many cases, the pattern of splicing can vary depending on the tissue in which the transcription occurs. For example, an intron that is cut from the mRNA of a certain gene transcribed in the liver may not be cut from the same mRNA when transcribed in the brain. This variation is called alternative splicing, and it contributes to the overall protein diversity in the organism. After this RNA processing step has been completed, the RNA molecule moves to the cytoplasm as a messenger RNA molecule (mRNA), in order to undergo translation.
2.7.2 The Genetic Code

The rules by which the nucleotide sequence of a gene is translated into the amino acid sequence of the corresponding protein, the so-called genetic code, were deciphered in the early 1960s. The sequence of nucleotides in the mRNA molecule, which acts as an intermediate, was found to be read in serial order in groups of three. Each triplet of nucleotides, called a codon, specifies one amino acid (the basic unit of a protein, analogous to nucleotides in DNA). Since RNA is a linear polymer of four different nucleotides, there are 4^3 = 64 possible codon triplets. However, only 20 different amino acids are commonly found in proteins, so most amino acids are specified by several codons. In addition, 3 codons (of the 64) specify the end of translation, and are called stop codons. The codon specifying the beginning of translation is AUG, which is also the codon for the amino acid methionine. The code has been highly conserved during evolution: with a few minor exceptions, it is the same in organisms as diverse as bacteria, plants, and humans.
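The codon arithmetic above can be checked with a short Python sketch that enumerates all triplets over the four RNA bases and reads an mRNA in groups of three from the start codon. The three stop codons listed (UAA, UAG, UGA) are those of the standard genetic code and are an addition here, since the text gives only their number; the function names are illustrative.

```python
from itertools import product

BASES = "UCAG"

# Every possible triplet over four bases: 4 * 4 * 4 = 64 codons.
CODONS = ["".join(triplet) for triplet in product(BASES, repeat=3)]

START_CODON = "AUG"
# Assumption: stop codons of the standard genetic code (not listed in the text).
STOP_CODONS = {"UAA", "UAG", "UGA"}


def split_codons(mrna):
    """Split an mRNA into successive triplets, ignoring an incomplete tail."""
    return [mrna[i:i + 3] for i in range(0, len(mrna) - 2, 3)]


def open_reading_frame(mrna):
    """Return the codons from the first AUG up to the first in-frame stop codon."""
    start = mrna.find(START_CODON)
    if start == -1:
        return []
    frame = []
    for codon in split_codons(mrna[start:]):
        if codon in STOP_CODONS:
            break
        frame.append(codon)
    return frame


if __name__ == "__main__":
    print(len(CODONS))
    print(open_reading_frame("GGAUGGCUUAAUGA"))
```

With 3 of the 64 codons reserved as stop signals, 61 codons remain to encode 20 amino acids, which is why most amino acids have several codons.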
2.8 NEED OF DATA MINING IN BIOINFORMATICS

Data in biology are very diverse and abundant. They can be catalogued and classified, but often cannot be easily summarized or abstracted using a formula. With the increase in biological knowledge, computer-based databases have become essential for this task. Bioinformatics databases include the following types:
Sequence databases
Structural databases
Motif databases
Genome databases
Proteome databases
RNA expression
Literature
Populations
Mutations
Organisms
Moreover, the data of even a single microorganism is very large. Rickettsia conorii is one of the smaller bacteria whose complete genome sequence is known. Its genome is about 1.3 million bp long, and this size is still on the small side for bacteria. Human genome sequences are several billion bp in length. So with the significant growth in the amount of biomolecular data, it becomes increasingly important to develop new techniques for extracting knowledge from the data. Data mining is a fundamental operation in such a domain. Protein and RNA sequences are ultimately encoded by DNA, so they can be related back to DNA sequences. Data mining can therefore be applied to the DNA sequences, and the results can later be carried over to the other molecular data.
2.9 BIOINFORMATICS AND ITS SCOPE

Bioinformatics has evolved into a full-fledged scientific discipline over the last decade. The definition of bioinformatics is not restricted to computational molecular biology and computational structural biology. It now encompasses fields such as comparative genomics, structural genomics, transcriptomics, proteomics, cellomics and metabolic pathway engineering. Developments in these fields have direct implications for healthcare, medicine, the discovery of next-generation drugs, the development of agricultural products, renewable energy, environmental protection, etc. [23]. Bioinformatics integrates advances in computer science, information science and information technology to solve complex problems in the life sciences. The core data comprises the genomes and proteomes of humans and other organisms, 3-D structures and functions of proteins, microarray data, metabolic pathways, cell lines and hybridomas, biodiversity, etc. The sudden growth in quantitative data in biology has made data capture, data warehousing and data mining major issues for biotechnologists and biologists. The availability of enormous amounts of data has led to the realization of inherent biocomplexity issues, which call for innovative tools for the synthesis of knowledge. Information technology, particularly the internet, is used to collect, distribute and access ever-increasing data, which are later analyzed with mathematics- and statistics-based tools. Bioinformatics has a key role to play in cutting-edge research and development areas such as functional genomics, proteomics, protein engineering, pharmacogenomics, the discovery of new drugs and vaccines, molecular diagnostic kits, agro-biotechnology, etc. This has attracted the attention of several companies and entrepreneurs. As a result, a large number of bioinformatics-based start-ups have been launched and the trend is likely to continue. This has created a need for a large number of formally trained individuals in bioinformatics. A bioinformatician must acquire or possess expertise in the essential multidisciplinary fields that comprise the core of this new science. Quality research and education in bioinformatics are vital not only to meet the existing challenges but also to set and accomplish new goals in the life sciences.
2.10 APPLICATIONS OF BIOINFORMATICS

Molecular medicine
The human genome sequence will have profound effects on the fields of biomedical research and clinical medicine. Every disease has a genetic component. This may be inherited or a result of the body's response to an environmental stress that causes alterations in the genome (e.g. cancers, heart disease, diabetes). The completion of the human genome means that we can search for the genes directly associated with different diseases and begin to understand the molecular basis of these diseases more clearly [27]. This new knowledge of the molecular mechanisms of disease will enable better treatments, cures and even preventative tests to be developed.
Personalized medicine
Clinical medicine will become more personalised with the development of the field of pharmacogenomics. This is the study of how an individual's genetic inheritance affects the body's response to drugs. At present, some drugs fail to make it to the market because a small percentage of the clinical patient population shows adverse effects to a drug due to sequence variants in their DNA. As a result, potentially life-saving drugs never make it to the marketplace. Today, doctors have to use trial and error to find the best drug to treat a particular patient, as those with the same clinical symptoms can show a wide range of responses to the same treatment. In the future, doctors will be able to analyze a patient's genetic profile and prescribe the best available drug therapy and dosage from the beginning.
Preventative medicine
With the specific details of the genetic mechanisms of diseases being unraveled, the development of diagnostic tests to measure a person's susceptibility to different diseases may become a distinct reality. Preventative actions, such as a change of lifestyle or having treatment at the earliest possible stage, when it is more likely to be successful, could result in huge advances in our struggle to conquer disease.
Gene Therapy
In the not too distant future, the potential for using genes themselves to treat disease may become a reality. Gene therapy is the approach used to treat, cure or even prevent disease by changing the expression of a person's genes. Currently, this field is in its infancy, with clinical trials for many different types of cancer and other diseases ongoing.
Drug development
At present all drugs on the market target only about 500 proteins. With an improved understanding of disease mechanisms and using computational tools to identify and validate new drug targets, more specific medicines that act on the cause, not merely the symptoms, of the disease can be developed. These highly specific drugs promise to have fewer side effects than many of today's medicines.
CHAPTER 3
INTRODUCTION TO BLAST
3.1 INTRODUCTION
The discovery of sequence homology to a known protein or family of proteins often provides the first clues about the function of a newly sequenced gene. As the DNA and amino acid sequence databases continue to grow in size, they become increasingly useful in the analysis of newly sequenced genes and proteins because of the greater chance of finding such homologies. There are a number of software tools for searching sequence databases, but all use some measure of similarity between sequences to distinguish biologically significant relationships from chance similarities. Perhaps the best studied measures are those used in conjunction with variations of the dynamic programming algorithm. These methods assign scores to insertions, deletions and replacements, and compute an alignment of two sequences that corresponds to the least costly set of such mutations. Such an alignment may be thought of as minimizing the evolutionary distance or maximizing the similarity between the two sequences compared. In either case, the cost of this alignment is a measure of similarity; the algorithm guarantees it is optimal, based on the given scores. Because of their computational requirements, dynamic programming algorithms are impractical for searching large databases without the use of a supercomputer.
Rapid heuristic algorithms that attempt to approximate the above methods have been developed, allowing large databases to be searched on commonly available computers. In many heuristic methods the measure of similarity is not explicitly defined as a minimal-cost set of mutations, but instead is implicit in the algorithm itself. For example, the FASTP program first finds locally similar regions between two sequences based on identities but not gaps, and then rescores these regions using a measure of similarity between residues, such as a PAM matrix, which allows conservative replacements as well as identities to increment the similarity score.
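The dynamic programming measure described earlier in this section, the least costly set of insertions, deletions and replacements, can be sketched in Python as follows. This is a minimal illustration under assumed unit costs, not the scoring used by any particular program; the function name and default costs are hypothetical.

```python
def alignment_cost(a, b, replace_cost=1, indel_cost=1):
    """Minimum total cost of replacements, insertions and deletions
    needed to turn sequence a into sequence b (global alignment)."""
    rows, cols = len(a) + 1, len(b) + 1
    cost = [[0] * cols for _ in range(rows)]

    # Aligning a prefix against the empty sequence costs one indel per residue.
    for i in range(1, rows):
        cost[i][0] = i * indel_cost
    for j in range(1, cols):
        cost[0][j] = j * indel_cost

    for i in range(1, rows):
        for j in range(1, cols):
            same = a[i - 1] == b[j - 1]
            substitute = cost[i - 1][j - 1] + (0 if same else replace_cost)
            delete = cost[i - 1][j] + indel_cost
            insert = cost[i][j - 1] + indel_cost
            cost[i][j] = min(substitute, delete, insert)

    return cost[-1][-1]


if __name__ == "__main__":
    print(alignment_cost("ACGT", "AGT"))
```

The table has (len(a) + 1) x (len(b) + 1) cells and every cell must be filled, so comparing one query against every sequence in a large database this way takes time proportional to the query length times the total database length. That is the computational burden the heuristic methods try to avoid.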
Despite their rather indirect approximation of minimal evolution measures, heuristic tools such as FASTP have been quite popular and have identified many distant but biologically significant relationships. BLAST (Basic Local Alignment Search Tool) employs a measure based on well-defined mutation scores. It directly approximates the results that would be obtained by a dynamic programming algorithm for optimizing this measure. The method will detect weak but biologically significant sequence similarities, and is more than an order of magnitude faster than existing heuristic algorithms.
BLAST Means:
B (Basic): Despite the adjective "basic" in its name, it is a sophisticated software package that has become the single most important piece of software in the field of bioinformatics.
LA (Local Alignment): Local alignment is one of the two kinds of alignment (the other being global alignment); it finds the best subsequence alignment. This kind of alignment is needed because functional sites (such as catalytic sites) are localized in relatively short regions.
ST (Search Tool): It introduced a number of refinements to database searching that improved overall search speed and put database searching on a firm statistical foundation. It searches using a threshold value.
BLAST (Basic Local Alignment Search Tool) is a set of similarity search programs designed to explore all of the available (DNA and protein) sequence databases regardless of whether the query is protein or DNA. The BLAST programs have been designed for speed, with a minimal sacrifice of sensitivity to distant sequence relationships. BLAST uses the concept of a "segment pair", which is a pair of subsequences of the same length that form an ungapped alignment. The algorithm first looks for short words that are present in both sequences and then extends these at either end to find the longest segments present in both. The statistical significance of these High-scoring Segment Pairs is evaluated to determine whether the matches are random or not. Thus, the scores assigned in a BLAST search have a well-defined statistical interpretation, making real matches easier to distinguish from random background.
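To make the notion of a segment pair concrete, the Python sketch below scores an ungapped alignment of two equal-length segments and then finds the highest-scoring segment pair between two short sequences by exhaustive search. The match and mismatch values (+5 and -4) and the function names are assumptions chosen for illustration. The exhaustive search is shown only to define what BLAST approximates; it is not how BLAST itself proceeds.

```python
def segment_pair_score(seg_a, seg_b, match=5, mismatch=-4):
    """Score an ungapped alignment of two segments of the same length."""
    if len(seg_a) != len(seg_b):
        raise ValueError("a segment pair needs two segments of equal length")
    return sum(match if x == y else mismatch for x, y in zip(seg_a, seg_b))


def maximal_segment_pair(a, b, match=5, mismatch=-4):
    """Try every pair of start positions and every length.

    Returns (score, start_in_a, start_in_b, length) for the best
    segment pair found; (0, 0, 0, 0) means nothing scored above zero.
    """
    best = (0, 0, 0, 0)
    for i in range(len(a)):
        for j in range(len(b)):
            score = 0
            k = 0
            # Grow the segment pair one aligned residue pair at a time.
            while i + k < len(a) and j + k < len(b):
                score += match if a[i + k] == b[j + k] else mismatch
                k += 1
                if score > best[0]:
                    best = (score, i, j, k)
    return best


if __name__ == "__main__":
    print(maximal_segment_pair("CCGATTACACC", "TTGATTACATT"))
```

In this example the shared stretch GATTACA, starting at position 2 in both sequences, is the best segment pair: extending it in either direction adds only mismatches and lowers the score.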
3.2 DATABASES AVAILABLE FOR BLAST SEARCH

3.2.1 Protein Sequence Databases
We can choose a protein database for blastp or blastx.
nr: All non-redundant GenBank CDS translations + PDB + SwissProt + PIR + PRF.
month: All new or revised GenBank CDS translations + PDB + SwissProt + PIR + PRF released in the last 30 days.
swissprot: Last major release of the SWISS-PROT protein sequence database (no updates).
Drosophila genome: Drosophila genome proteins provided by Celera and the Berkeley Drosophila Genome Project (BDGP) (www.fruitfly.org).
yeast: Yeast (Saccharomyces cerevisiae) genomic CDS translations.
ecoli: Escherichia coli genomic CDS translations.
pdb: Sequences derived from the 3-dimensional structures in the Brookhaven Protein Data Bank (www.pdb.org).
kabat: Kabat's database of sequences of immunological interest (http://immuno.bme.nwu.edu).
alu: Translations of select Alu repeats from REPBASE, suitable for masking Alu repeats from query sequences.
3.2.2 Nucleotide Sequence Databases

We can choose a nucleotide database for blastn, tblastn or tblastx.
nr: All GenBank + EMBL + DDBJ + PDB sequences (but no EST, STS, GSS, or phase 0, 1 or 2 HTGS sequences). No longer "non-redundant".
month: All new or revised GenBank + EMBL + DDBJ + PDB sequences released in the last 30 days.
Drosophila genome: Drosophila genome provided by Celera and the Berkeley Drosophila Genome Project.
dbest: Database of GenBank + EMBL + DDBJ sequences from EST divisions.
dbsts: Database of GenBank + EMBL + DDBJ sequences from STS divisions.
gss: Genome Survey Sequence; includes single-pass genomic data, exon-trapped sequences, and Alu PCR sequences.
yeast: Yeast (Saccharomyces cerevisiae) genomic nucleotide sequences.
E. coli: Escherichia coli genomic nucleotide sequences.
pdb: Sequences derived from the 3-dimensional structures in the Brookhaven Protein Data Bank.
kabat: Kabat's database of sequences of immunological interest.
vector: Vector subset of GenBank(R), NCBI, in ftp://ncbi.nlm.nih.gov/blast/db/
mito: Database of mitochondrial sequences.
alu: Select Alu repeats from REPBASE, suitable for masking Alu repeats from query sequences. It is available by anonymous FTP from ncbi.nlm.nih.gov (under the /pub/jmc/alu directory).
epd: Eukaryotic Promoter Database.
BLAST protein databases available through the blastp web interface
Figure 3.1 Protein Databases
BLAST nucleotide databases available through the blastn web interface
Figure 3.2 Nucleotide Databases
3.3 BLAST ALGORITHM

(1) In step 1, BLAST filters low-complexity regions and removes them from the query sequence. Regions of low compositional complexity or short-periodicity repeats can yield extremely large numbers of statistically significant but biologically uninteresting results. The filtering and removal of these can be controlled with the -F flag of the stand-alone version of BLAST and with check boxes in the web version. Next, BLAST generates a list of all of the short sequences, or words, that make up the query (Figure 3.3). The default word lengths are 3 and 11 for amino-acid sequences and nucleotide sequences, respectively, and are adjustable using the -W flag in the stand-alone version. Then, BLAST uses a scoring matrix (BLOSUM62, by default, for amino acids) to determine all high-scoring matching words for each word in the query sequence. No gaps are allowed. The list of matches is reduced by taking only those that score above a given threshold, called the neighborhood word-score threshold. There is a trade-off at this stage between speed and sensitivity: a higher threshold gives greater speed but increases the chance of missing relevant pairs [8].
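The word-list step can be sketched in Python as follows. This is an illustration of the idea only, not the BLAST implementation. The pair scores in FIGURE_SCORES are only those that can be read off the PQG example in Figure 3.3 and Figure 3.4; any pair not listed falls back to a placeholder penalty of -4, which is an assumption and not a value from a real scoring matrix. All names are hypothetical.

```python
from itertools import product

# The 20 amino acid one-letter codes, in the order of Table 2.1.
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# Pair scores implied by the PQG example (e.g. PQG = 7 + 5 + 6 = 18).
FIGURE_SCORES = {
    ("P", "P"): 7, ("Q", "Q"): 5, ("G", "G"): 6,
    ("Q", "E"): 2, ("Q", "R"): 1, ("Q", "K"): 1,
    ("Q", "N"): 0, ("Q", "D"): 0, ("Q", "M"): 0,
    ("G", "N"): 0,
}


def pair_score(x, y, default=-4):
    """Symmetric lookup; unlisted pairs get a placeholder penalty (assumption)."""
    return FIGURE_SCORES.get((x, y), FIGURE_SCORES.get((y, x), default))


def query_words(query, w=3):
    """All overlapping words of length w; a query of length L gives L - w + 1."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]


def word_score(word_a, word_b, score):
    """Ungapped score of two words of equal length."""
    return sum(score(x, y) for x, y in zip(word_a, word_b))


def neighborhood(word, alphabet, score, threshold):
    """Every word over the alphabet scoring at least threshold against word."""
    hits = {}
    for letters in product(alphabet, repeat=len(word)):
        candidate = "".join(letters)
        value = word_score(word, candidate, score)
        if value >= threshold:
            hits[candidate] = value
    return hits


if __name__ == "__main__":
    words = query_words("LNKCKTPQGQRLVNQ")
    print(len(words), words[6])
    print(neighborhood("PQG", AMINO_ACIDS, pair_score, threshold=13))
```

Raising the threshold shrinks the neighborhood, so fewer database hits need to be examined later, at the price of possibly missing a relevant pair. This is the speed and sensitivity trade-off described above.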

For the query, find the list of high-scoring words of length w. For a given word length w (usually 3 for proteins) and a given score matrix, create a list of all words (w-mers) that score at least T when compared to w-mers from the query.
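The word-list construction can be sketched in a few lines of Python. This is a minimal illustration, not the BLAST implementation: the function names are hypothetical, and the three BLOSUM62 rows (for P, Q and G) are transcribed by hand for this example only and should be checked against an authoritative copy of the matrix. The sketch keeps a neighborhood word when its score is at least T.

```python
from itertools import product

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# Assumption: BLOSUM62 rows for P, Q and G only, hand-transcribed for illustration.
ROWS = {
    "P": [-1, -2, -2, -1, -3, -1, -1, -2, -2, -3, -3, -1, -2, -4, 7, -1, -1, -4, -3, -2],
    "Q": [-1, 1, 0, 0, -3, 5, 2, -2, 0, -3, -2, 1, 0, -3, -1, 0, -1, -2, -1, -2],
    "G": [0, -2, 0, -1, -3, -2, -2, 6, -2, -4, -4, -2, -3, -3, -2, 0, -2, -2, -3, -3],
}
SCORE = {residue: dict(zip(AMINO_ACIDS, row)) for residue, row in ROWS.items()}


def query_words(query, w=3):
    """Return the L-w+1 overlapping words of length w in the query."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]


def neighborhood(word, threshold):
    """Return every w-mer scoring at least `threshold` against `word` (no gaps)."""
    hits = {}
    for candidate in product(AMINO_ACIDS, repeat=len(word)):
        score = sum(SCORE[q][c] for q, c in zip(word, candidate))
        if score >= threshold:
            hits["".join(candidate)] = score
    return hits


words = query_words("LNKCKTPQGQRLVNQ")
pqg_neighbors = neighborhood("PQG", threshold=13)
for neighbor, score in sorted(pqg_neighbors.items(), key=lambda item: -item[1]):
    print(neighbor, score)
```

Only the word PQG can be expanded here because only its three matrix rows are included; a full implementation would repeat the expansion for every word in the query.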
Query sequence: LNKCKTPQGQRLVNQ; query word: PQG
Neighborhood words and scores: PQG 18, PEG 15, PRG 14, PKG 14, PNG 13, PMG 13
Below threshold (T = 13): PQN 12, etc.
A query sequence of length L yields a maximum of L-w+1 words (typically w = 3 for proteins).
For each word from the query sequence, find the list of words that score at least T when scored using a pair-score matrix (e.g. PAM 250). For typical parameters there are around 50 words per residue of the query.
Figure 3.3 List of Words From Query Sequence
(2) In the second step, BLAST searches through the target sequence database for exact matches to the word list generated. Because BLAST has already pre-processed and indexed the databases for the occurrence of all words in each sequence in the database, this search is extremely fast. If a match is found, it is used to seed a possible alignment between the query and the database sequences.
Compare the word list to the database and identify exact matches.
Each neighborhood word gives all positions in the database where it is found (hit list).
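As a rough sketch of this lookup, the database can be indexed once by word so that each neighborhood word maps directly to its hit list. The function names and the two toy database sequences below are hypothetical; real BLAST databases use a more compact index.

```python
from collections import defaultdict


def index_database(sequences, w=3):
    """Map every word of length w to the (sequence name, position) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in sequences.items():
        for pos in range(len(seq) - w + 1):
            index[seq[pos:pos + w]].append((name, pos))
    return index


def find_seeds(word_list, index):
    """Return the hit list for every neighborhood word that occurs in the database."""
    return {word: index[word] for word in word_list if word in index}


# Hypothetical toy database, for illustration only.
database = {"seq1": "MKPMGLV", "seq2": "APQGPEG"}
index = index_database(database)
seeds = find_seeds(["PQG", "PEG", "PRG", "PMG"], index)
print(seeds)
```

Each entry in the result is a seed from which an alignment may later be extended.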
The neighborhood words PQG 18, PEG 15, PRG 14, PKG 14, PNG 13, PDG 13 and PMG 13 are matched against the database sequences (for example, PMG).
Figure 3.4 Exact matches of words from word list
(3) In the third step, the original BLAST method tried to extend the alignment from the matching words in both directions as long as the score continued to increase. For each word match, the alignment is extended in both directions to find alignments that score greater than the score threshold S. The program tries to extend matching segments (seeds) out in both directions by adding pairs of residues. Residues are added until the incremental score drops below a threshold. The resulting alignment was called a high-scoring pair, or HSP. Gapped BLAST uses a lower threshold for generating the list of high-scoring matching words; the algorithm uses short matched regions with no insertions or deletions between them and within a certain distance of each other as the starting points for longer ungapped alignments. These joined regions are then extended using the same method as in the original BLAST.
Figure 3.5 Maximal Segment Pairs (MSPs)
Next, BLAST determines whether each score found by one of the above methods is greater in value than a given cutoff score S, determined empirically by examining the range of scores given by comparing random sequences and then choosing a value that is significantly greater. The maximal scoring pairs, or MSPs, from the entire database are identified and listed. Finally, BLAST determines the statistical significance of each score, initially by calculating the probability that two random sequences, one the length of the query sequence and the other the length of the database (the sum of the lengths of all of the database sequences) with the same composition (nucleotide or amino acid) could produce the calculated score.
3.4 BLAST PARAMETERS

Various parameters play a vital role in the output produced by BLAST. Proper values of these parameters can improve the speed and sensitivity of BLAST. We have analyzed all the parameters to see which of them can be tuned to improve the results of BLAST [8]. The parameters of BLAST include:
W, word size
Word size is roughly the minimal length of an identical match that an alignment must contain if it is to be found by the algorithm. It controls the number of word hits. The query sequence and every database sequence are split up into every possible "word" of a selected size. The default word size is 11 bp for DNA and 3 aa for proteins (it must be >=7 for DNA). The task of finding HSPs begins with identifying short words of length W in the query sequence that either match or satisfy some positive-valued threshold score T when aligned with a word of the same length in a database sequence. These initial neighborhood word hits act as seeds for initiating searches to find longer HSPs containing them. The word hits are extended in both directions along each sequence for as far as the cumulative alignment score can be increased. Extension of the word hits in each direction is halted when: the cumulative alignment score falls off by the quantity X from its maximum achieved value; the cumulative score goes to zero or below, due to the accumulation of one or more negative-scoring residue alignments; or the end of either sequence is reached.
If we are interested in longer regions of homology, we should increase the word size. Increasing the word size also speeds up the search, especially with larger query sequences (>5 kb) and large databases. However, high values of W in conjunction with moderate values of T can lead to immense memory requirements. The probability of a hit decreases as the word size increases [15]. Smaller word sizes increase sensitivity and decrease speed. For protein searches, the best word size is four.
T, the threshold parameter.

T is referred to as the neighborhood word score threshold (Altschul et al., 1990). It is the minimum score that a word pair in the segment pair should have. We can adjust the value of T to control the size of the neighborhood and therefore the number of word hits in the search space. A lower value of T increases the chance that a segment pair with a score of at least S will contain a word pair with a score of at least T. Thus, a small value of T increases the number of hits. This in turn increases the execution time of the algorithm, because more words are generated from the query sequence and therefore more hits are found. On the other hand, higher values of T progressively reduce word hits and reduce the search space. The proper value of T therefore depends on the balance between speed and sensitivity. It also depends on the values in the scoring matrix.
Figure 3.6 The word size option
If the value for T is not chosen carefully, though (i.e., if T is set just a little bit too low), a combinatorial explosion in neighborhood words will soon lead to the depletion of all available memory. Even if the neighborhood word list does fit in memory, however, its sheer size may produce an adverse effect on speed, due to the consequent loss of processor cache efficiency.
X, drop off
This value provides a cutoff threshold for the tree exploration performed by the extension algorithm. When the score of a given branch drops below the current best score minus the X drop-off, the exploration of this branch stops. This variable represents the recent alignment history [20]. Specifically, it represents how much the score is allowed to drop off since the last maximum. A very large value of X does not increase the score and requires more computation. It is generally a good idea to use a reasonably large value, which reduces the risk of premature termination and is a better way to increase sensitivity than adjusting the seeding parameters. However, W, T and the 2-hit setting are better for controlling speed than X. X depends not only on the substitution scores, but also on the gap initiation and extension costs. We generally need to adjust this parameter in the following two situations:
If we align sequences that are nearly identical and we want to prevent the extensions from straying into nonidentical sequences, we can set the various X values very low.
If we try to align very distant sequences and have already adjusted W, T and the scoring matrix to allow additional sensitivity, it makes sense to also increase the various X values.
λ, lambda
λ is a matrix-specific constant required to convert a raw score to a normalized score. A raw score can be a misleading quantity because scaling factors are arbitrary. A normalized score, corresponding to the original lod score, is therefore a more useful measure. Lambda is approximately the inverse of the original scaling factor, but its value may be slightly different due to integer rounding errors. When calculating target frequencies from multiple alignments, the sum of all target frequencies naturally equals 1:
$$\sum_{i,j} q_{ij} = 1 \qquad (1)$$
The score of two amino acids is the log-odds ratio of the observed and expected frequencies. The same relation is presented in Equation 2, but the lod score is replaced by the product of lambda and the raw score:
$$\lambda S_{ij} = \log_e \frac{q_{ij}}{p_i p_j} \qquad (2)$$
Equation 3 rearranges Equation 2 to solve for the pair-wise frequency:
$$q_{ij} = p_i p_j e^{\lambda S_{ij}} \qquad (3)$$
From Equation 3, we can see that a pair-wise frequency ($q_{ij}$) is implied by the individual amino acid frequencies ($p_i$ and $p_j$) and a normalized score ($\lambda S_{ij}$). The key to solving for lambda is to provide the individual amino acid frequencies ($p_i$ and $p_j$) and find a value of lambda for which the sum of the implied target frequencies equals one. The formulation is given in Equation 4:
$$\sum_{i,j} q_{ij} = \sum_{i,j} p_i p_j e^{\lambda S_{ij}} = 1 \qquad (4)$$
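Solving for lambda in this way can be illustrated numerically. The sketch below is an assumption-laden example rather than BLAST's own routine: it uses a hypothetical nucleotide scoring scheme (+1 for a match, -3 for a mismatch) with uniform base frequencies, and finds by bisection the positive lambda for which the implied target frequencies sum to one. It assumes the expected score is negative, so that exactly one positive solution exists.

```python
import math


def implied_sum(lam, freqs, score):
    """Sum of implied target frequencies p_i * p_j * exp(lam * S_ij) over all pairs."""
    return sum(freqs[a] * freqs[b] * math.exp(lam * score(a, b))
               for a in freqs for b in freqs)


def solve_lambda(freqs, score, tol=1e-9):
    """Find the positive lambda with implied_sum == 1 by bisection."""
    lo, hi = 1e-6, 1.0
    while implied_sum(hi, freqs, score) < 1.0:
        hi *= 2.0  # widen the bracket until the sum exceeds one
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if implied_sum(mid, freqs, score) < 1.0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def dna_score(a, b):
    """Hypothetical scoring scheme: +1 for a match, -3 for a mismatch."""
    return 1 if a == b else -3


freqs = {base: 0.25 for base in "ACGT"}
lam = solve_lambda(freqs, dna_score)
print(round(lam, 3))
```

With a real protein matrix, the same procedure would be applied to the 20 amino acid frequencies and the 20 x 20 score table.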
Normally, once lambda is estimated, it is used to calculate the Expect of every HSP in the BLAST report. Unfortunately, the residue frequencies of some proteins deviate widely from the residue frequencies used to construct the original scoring matrix. Recently, some versions of PSI-BLAST and BLASTP have therefore begun to use the query and subject sequence amino acid compositions to calculate a composition-based lambda. These hit-specific lambdas have been shown to improve BLAST sensitivity, so this approach may see wider use in the near future. Lambda is also used in calculating the Expect through the equation $E = kmn\,e^{-\lambda S}$. Here lambda may be thought of as the expected increase in reliability of an alignment associated with a unit increase in alignment score. Reliability in this case is expressed in units of information, such as bits or nats, with one nat being equivalent to 1/ln(2) (roughly 1.44) bits.
k, Adjustment
A small adjustment (k) takes into account the fact that optimal local alignment scores for alignments that start at different places in the two sequences may be highly correlated. For example, a high-scoring alignment starting at residues 1,1 implies a pretty high alignment score for an alignment starting at residues 2,2 as well.
m, length of query
At first sight, m seems to be the length of the query that we enter to be matched against the different databases. In BLAST, however, it is the effective length of the query. It may be defined as the actual length minus the expected HSP length, where the expected HSP length is the length of an HSP that has an Expect of 1. The size of the search space is simply the product of the number of letters in the query (m) and the number of letters in the database (n). The relationship between the expected number of alignments (E) and the search space (mn) is linear. If the size of the search space is doubled, the expected number of alignments with a particular score also doubles.
n, length of the database

At first sight, n seems to be the length of the database sequence with which the query is to be matched. In BLAST, however, it is the effective length of the database, which may be defined as the sum of the effective lengths of every sequence within it. As with m, n enters the search space through the product mn, so the expected number of alignments with a particular score grows linearly with n. No effective length of the query or database can ever be less than 1/k. Setting an effective length to 1/k basically amounts to ignoring a short sequence for statistical purposes; in the case when both m and n are less than 1/k, BLAST searches are ill-advised.
H, Relative Entropy
The formal name for the average information per symbol is entropy. But what if all symbols are not equally probable? To compute the entropy, we need to weigh the information of each symbol by its probability of occurring. This formulation, known as Shannon's entropy (named after Claude Shannon), is shown in the following equation:
$$H = -\sum_{i=1}^{n} p_i \log_2 p_i$$
Entropy (H) is the negative sum over all the symbols (n) of the probability of a symbol ($p_i$) multiplied by the log base 2 of the probability of that symbol ($\log_2 p_i$). The relative entropy of a scoring matrix (H) conveniently summarizes the general behavior of a scoring matrix. Its formulation is similar to the expected score but is calculated from normalized scores and weighted by the target frequencies. Its formulation is shown in the following equation:
$$H = \sum_{i,j} q_{ij}\, \lambda S_{ij}$$
H is the average number of bits (or nats) per position in an alignment and is always positive.
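Shannon's entropy is straightforward to compute from symbol counts. The following minimal sketch (the function name is hypothetical) estimates the symbol probabilities from a sequence and returns the entropy in bits per symbol.

```python
import math
from collections import Counter


def shannon_entropy(sequence):
    """Entropy in bits per symbol, with probabilities estimated from symbol counts."""
    counts = Counter(sequence)
    total = len(sequence)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


print(shannon_entropy("ACGT"))      # four equally likely symbols
print(shannon_entropy("AAAAAAAT"))  # strongly biased composition
```

A sequence with equal use of four symbols carries two bits per symbol, while a biased composition carries less, which is one way to see why low-complexity regions are statistically unusual.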
E, Expect
Expect is the number of alignments expected by chance during a sequence database search and can be represented using the Karlin-Altschul equation:
$$E = kmn\,e^{-\lambda S}$$
From the above equation we can see that E is a function of the size of the search space (mn), the normalized score ($\lambda S$), and a minor constant (k). The relationship between the expected number of alignments and the search space (mn) is linear. If the size of the search space is doubled, the expected number of alignments with a particular score also doubles. The relationship between the expected number of alignments and the score is exponential. This means that small changes in score can lead to large differences in E. An E-value tells you how many alignments with a given score are expected by chance; it is not itself a probability, although for small values it is close to the probability that the associated match is due to randomness. The lower the E value, the more specific/significant is the match. Its relation to the P value can be represented as
$$E = -\ln(1 - P)$$
E is also the statistical significance threshold for reporting matches against database sequences; the default value is 10, such that 10 matches are expected to be found merely by chance, according to the stochastic model of Karlin and Altschul (1990). In the BLAST output report the sequences are listed in order of increasing E (expect) value, that is, from most to least significant. If the E value ascribed to a match is greater than the EXPECT threshold, the match will not be reported. Lower EXPECT thresholds are more stringent, leading to fewer chance matches being reported.
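The behavior of the Karlin-Altschul equation can be checked with a short calculation. The values of k and lambda below are illustrative assumptions chosen for this sketch, not parameters taken from a particular BLAST run, and the function names are hypothetical.

```python
import math


def expect(k, m, n, lam, raw_score):
    """Karlin-Altschul Expect: E = k * m * n * exp(-lambda * S)."""
    return k * m * n * math.exp(-lam * raw_score)


def p_value(e):
    """P = 1 - exp(-E), computed with expm1 to stay accurate for very small E."""
    return -math.expm1(-e)


K, LAM = 0.041, 0.267        # assumed, illustrative constants
m, n = 250, 1_000_000        # assumed effective query and database lengths

e_single = expect(K, m, n, LAM, raw_score=100)
e_double = expect(K, 2 * m, n, LAM, raw_score=100)   # search space doubled
e_higher = expect(K, m, n, LAM, raw_score=110)       # score raised by 10

print(e_single, e_double, e_higher, p_value(e_single))
```

Doubling the search space doubles E, whereas raising the score by a fixed amount divides E by a constant factor, which is the linear versus exponential behavior described above.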
S, Score
In the late 1960s and early 1970s, Margaret Dayhoff pioneered quantitative techniques for measuring amino acid similarity. Using sequences that were available at the time, she constructed multiple alignments of related proteins and compared the frequencies of amino acid substitutions. As expected, there is quite a bit of variation in amino acid substitution frequency, and the patterns are generally what one would expect from the chemical properties.
Dayhoff represented the similarity between amino acids as a log-odds ratio, also known as a lod score. To derive the lod score of an amino acid pair, take the log base 2 of the ratio of the pairing's observed frequency to the pairing's random expected frequency. If the observed and expected frequencies are equal, the lod score is zero. A positive score indicates that a pair of letters is common, while a negative score indicates an unlikely pairing. The general formula for any pair of amino acids is shown in the following equation:
$$S_{ij} = \log_2 \frac{q_{ij}}{p_i p_j}$$
The score of two amino acids i and j is $S_{ij}$, their individual probabilities are $p_i$ and $p_j$, and their frequency of pairing is $q_{ij}$. The relationship between the expected number of alignments and score is exponential. This means that small changes in score can lead to large differences in E.
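As a small worked example of the lod formula, the sketch below computes scores for hypothetical frequencies; the numbers are invented for illustration and are not taken from any published matrix.

```python
import math


def lod_score(q_ij, p_i, p_j):
    """Log base 2 of observed pairing frequency over the frequency expected by chance."""
    return math.log2(q_ij / (p_i * p_j))


# Hypothetical frequencies: both residues occur with probability 0.25.
as_expected = lod_score(0.0625, 0.25, 0.25)   # observed equals expected
enriched = lod_score(0.125, 0.25, 0.25)       # observed twice as often as expected
depleted = lod_score(0.03125, 0.25, 0.25)     # observed half as often as expected
print(as_expected, enriched, depleted)
```

Published matrices additionally scale and round these values to integers, which is why lambda is needed to convert raw scores back to normalized ones.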
P-value
A P-value tells you how often you can expect to see such an alignment by chance:
$$P = 1 - e^{-E}$$
For values of less than 0.001, the E-value and P-value are essentially identical. The aggregate pair-wise P-value for a sum score can be approximated using the above equation. Thus, when sum statistics are being employed, BLAST not only uses a different score, it also uses a different formula to convert that score into a probability; the standard Karlin-Altschul equation $E = kmn\,e^{-\lambda S}$ is not used to convert a sum score to an Expect. In the limit of infinite E, P approaches 1; and in the limit as E approaches 0, E and P approach equality. Due to inaccuracy in the statistical methods as they are applied in the BLAST programs, whenever E and P are less than about 0.05, the two values can be practically treated as being equal.
Number of sequences in database

The number of sequences in the database also affects the speed and sensitivity of the BLAST algorithm. If the number of sequences is small, BLAST runs faster, as there are fewer word hits and fewer sequences to be compared with the query.
Percent identity
Percent identity is the percentage of exact matches between the query sequence and the database sequence. The Positives value is more relevant to protein alignments: it is the percentage of exact plus similar (based on properties) amino acid matches.
Number of Alignments
This parameter restricts the number of database sequences for which high-scoring segment pairs (HSPs) are reported; the default limit is 100. If more database sequences than this happen to satisfy the statistical significance threshold for reporting, only the matches ascribed the greatest statistical significance are reported.
Filter
Low-complexity regions, such as proline- or glycine-rich regions or acidic or basic regions, can yield tremendous numbers of spurious matches between sequences that have no other similarity between them. The statistics break down when such decidedly non-random sequences appear; furthermore, search times may be needlessly increased. To avoid spurious matching and make the statistics more robust, low-complexity regions can be filtered from the query sequence. Filtering eliminates statistically significant but biologically uninteresting reports from the BLAST output (e.g., hits against common acidic-, basic- or proline-rich regions), leaving the more biologically interesting regions of the query sequence available for specific matching against database sequences. Filtering is only applied to the query sequence (or its translation products), not to database sequences. Filtering should not be expected to always yield an effect. Furthermore, in some cases, sequences are masked in their entirety, indicating that the statistical significance of any matches reported against the unfiltered query sequence should be suspect. Sometimes we also need to mask human repeats (LINEs and SINEs); this is especially useful for human sequences that may contain these repeats.
3.5 FEATURES OF BLAST

3.5.1 Heuristic
BLAST is not guaranteed to find the best alignment between your query and the database; it may miss matches. This is because it uses a strategy, which is expected to find most matches, but sacrifices complete sensitivity in order to gain speed.
However, in practice few biologically significant matches are missed by BLAST that can be found with other sequence search programs. BLAST searches the database in two phases. First it looks for short subsequences, which are likely to produce significant matches, and then it tries to extend these subsequences [8].
3.5.2 Substitution Matrix

A substitution matrix is used during all phases of protein searches (BLASTP, BLASTX, and TBLASTN). Both phases of the alignment process (scanning and extension) use a substitution matrix to score matches. This is in contrast to FASTA, which uses a substitution matrix only for the extension phase. Substitution matrices greatly improve sensitivity. There are two main types of matrices, PAM and BLOSUM; we can select the preferred matrix.
PAM (Percent Accepted Mutation) matrices: predicted matrices, most sensitive for alignments of sequences with evolutionarily related homologs. The greater the number in the matrix name, the greater the expected evolutionary (mutational) distance; e.g., PAM30 would be used for alignments expected to be more closely related in evolution than an alignment performed using the PAM250 matrix.
BLOSUM (Blocks Substitution Matrix): calculated matrices, most sensitive for local alignment of related sequences, ideal when trying to identify an unknown nucleotide sequence. BLOSUM62 is the default matrix set in the BLAST search tool.
3.5.3 Local Alignments

BLAST uses LOCAL ALIGNMENTS for matching sequences rather than GLOBAL ALIGNMENTS. BLAST tries to find patches of regional similarity, rather than trying to find the best alignment between your entire query and an entire database sequence.
3.5.4 Ungapped Alignments

Alignments generated with the original BLAST do not contain gaps. The speed and statistical model of the original BLAST depend on this, but in theory it reduces sensitivity. However, BLAST will report multiple local alignments between your query and a database sequence.
3.5.5 Explicit Statistical Theory

BLAST is based on an explicit statistical theory developed by Samuel Karlin and Steven Altschul. The original theory was later extended to cover multiple weak matches between query and database entry: the repetitive nature of many biological sequences (particularly naive translations of DNA/RNA) violates assumptions made in the Karlin & Altschul theory. While the P values provided by BLAST are a good rule-of-thumb for initial identification of promising matches, care should be taken to ensure that matches are not due simply to biased amino acid composition. The databases are contaminated with numerous artifacts. The intelligent use of filters can reduce problems from these sources. Remember that the statistical theory only covers the likelihood of finding a match by chance under particular assumptions; it does not guarantee biological importance.
3.5.6 Rapid
BLAST is extremely fast. It does not explore the entire search space between two sequences as it uses the three layers (seeding, extension, and evaluation) of rules to sequentially refine potential HSPs (high scoring pairs). This minimization of search space is the key to its speed but at the cost of a loss in sensitivity. You can either run the program locally or send queries to an E-mail server maintained by NCBI.
3.5.7 Sequence Input

The BLAST web pages accept input sequences in three formats: FASTA sequence format, NCBI Accession numbers, or GIs. The preferred query sequence format for the BLAST program is the FASTA format. Advanced BLAST tolerates both spaces and numbers and is case insensitive.
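The tolerance for spaces, numbers and mixed case can be mimicked when preparing a FASTA query locally. The sketch below is an illustration only; the function name and the sample record are hypothetical, and it is not the parser used by the BLAST server.

```python
def parse_fasta(text):
    """Parse FASTA text into {header: sequence}, dropping spaces and digits and upper-casing."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            # keep letters only, so position numbers and spacing are ignored
            records[header].append("".join(ch for ch in line if ch.isalpha()).upper())
    return {name: "".join(parts) for name, parts in records.items()}


sample = ">q1 test\n  1 acgt acgt\n  9 ttga\n"
parsed = parse_fasta(sample)
print(parsed)
```

Normalizing the query this way makes it easier to compare sequences copied from different sources before submitting them.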
3.5.8 Results Format

Results are returned in either text format (default) or HTML format (for which you must supply an e-mail address and select the HTML results format option). A Request ID number is given so that the results can be obtained at a later time; if we want the results immediately, we can click on the "Format Results" button. Formatting items, such as the results format option and the number of descriptions and alignments in the results output, are needed only for formatting; they may be specified on the BLAST query form or at the time the results are requested. Most results are held for up to 24 hours; very large result files are deleted after 30 minutes.
3.5.9 BLAST Output

All BLAST programs produce a similar output. This consists of a program introduction, a schematic distribution of alignments of the query sequence to those in the databases, a series of one-line descriptions of the database sequences which have significantly aligned to the query sequence, the actual sequence alignments, and a list of statistics specific to the BLAST search. The search method and version number are displayed at the top of the output. The output consists of:
A schematic distribution of the ordered alignments of the query sequence to those in the databases. Colored bars are distributed in a way to reflect the region of alignment onto the query sequence. The color legend represents the significance of the alignment scores. Holding the mouse over a given bar will display a description of that specific alignment sequence in the above window; clicking on a specific bar will cause the browser to jump down to that particular alignment.
Sequence alignments and their corresponding one-line descriptions are listed in order of lowest to highest E value, where the E (expect) value is the number of matches with that score expected by chance; the lower the E value, the more specific/significant the match.
Identifiers for the database sequences appear in the first column and are hyperlinked to the associated GenBank entry.
The Score for each alignment. The score (bits) is a sum value calculated for alignments using the scoring matrix; the higher the score value, the better the alignment.
The percent identity (called "Identities" and given as a percent) is the percentage of exact matches between the query sequence and the database sequence; this value also gives the number of nucleotide bases or amino acid residues that are matched in the database sequence versus the query sequence.
The Gap value is the percentage of the alignment that has been gapped in the particular alignment. Alignments are gapped unless specified otherwise by the user at the BLAST search submission page.
A list of statistics specific to the particular BLAST search is displayed at the bottom of the output; it includes the BLAST version number, the database and the matrices used for the search.
CHAPTER 4
VARIANTS OF BLAST
The best way to identify an unknown sequence is to see if that sequence already exists in a public database. If the database sequence is a well-characterized sequence, then you may have access to a wealth of biological information. BLAST (Basic Local Alignment Search Tool) is a set of similarity search programs designed to explore all of the available sequence databases (DNA or protein) regardless of whether the query is protein or DNA. These programs have been tailored specifically for the purpose of sequence similarity identification. Each BLAST program performs a different task. Different flavors of BLAST are covered in the following sections [7].
Figure 4.1 Blast Variants
4.1 BLAST VARIANTS

Programs Available For The Blast Search Include:
Program | Query type | Database type | Comparison | Application
--- | --- | --- | --- | ---
BlastN | DNA | DNA | Compares a nucleotide query sequence against a nucleotide sequence database | Find DNA sequences that match the query
BlastP | Protein | Protein | Compares an amino acid query sequence against a protein sequence database | Compare an amino acid query sequence against a protein sequence database
BlastX | DNA | Protein | Compares a nucleotide query sequence translated in all reading frames against a protein sequence database | Find what protein the query sequence codes for
TBlastN | Protein | DNA | Compares a protein query sequence against a nucleotide sequence database translated in all reading frames | Find genes in unknown DNA sequences
TBlastX | DNA | DNA | Compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database | Discover gene structure (find the degree of homology between the coding region of the query sequence and known genes in the database)
Table 4.1 Programs Available For The Blast
Types of BLAST Programs:

blastp compares an amino acid query sequence against a protein sequence database.
blastn compares a nucleotide query sequence against a nucleotide sequence database.
blastx compares a nucleotide query sequence translated in all reading frames against a protein sequence database.
tblastn compares a protein query sequence against a nucleotide sequence database dynamically translated in all reading frames.
tblastx compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database. Note that the tblastx program cannot be used with the nr database on the BLAST Web page.
Figure 4.2 Blast Variants
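The six-frame translation used by blastx, tblastn and tblastx can be sketched as follows. This is a minimal illustration with hypothetical function names; it assumes the standard genetic code and marks any codon containing a non-ACGT character as X.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, listed in TCAG order for the first, second and third base.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate complete codons from the start of dna; unknown codons become X."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Return translations of the three forward and three reverse-complement frames."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    return [translate(strand[offset:]) for strand in (dna, reverse) for offset in range(3)]


frames = six_frames("ATGGCCTAA")
for number, protein in enumerate(frames, start=1):
    print(number, protein)
```

In a translated search, each of these six protein strings would be compared against the protein query or database in the same way as in a blastp search.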
4.2 PSI-BLAST
PSI-Blast is the preferred method for searching a protein database with a protein sequence as the key. If used for only one round, it is identical to BlastP. Its algorithm is designed to conduct further iterations of the search and to extend the search to distantly related homologues. PSI stands for Position Specific Iterated. This search method makes use of a profile, which is a position-specific accounting of what amino acid residues are found in a family of aligned homologous proteins. PSI-Blast accepts a protein sequence as input and first conducts a normal BlastP search to identify homologues in the database. A profile is constructed from the spectrum of sequences found in the initially identified homologues. This profile is used as the search key to identify more distant relatives. The process is then iterated, each time refining the profile based on inclusion of the new members. Ideally, the process is expected to converge on a unique set of genes. In practice, the search may at some point begin to include proteins that are related by chance similarity. The user must use judgement to recognize when proteins of known and unrelated functions begin to appear in the list of finds [19].
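The idea of a profile can be illustrated with a simplified position-specific scoring matrix. The sketch below is not the PSI-BLAST procedure: it assumes a uniform background frequency, a fixed pseudocount and no sequence weighting, and the aligned family and function names are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned, pseudocount=1.0):
    """Build log-odds scores per alignment column from equal-length aligned sequences."""
    background = 1.0 / len(ALPHABET)  # assumption: uniform background frequencies
    pssm = []
    for column in zip(*aligned):
        counts = Counter(residue for residue in column if residue in ALPHABET)
        total = sum(counts.values()) + pseudocount * len(ALPHABET)
        pssm.append({aa: math.log2(((counts[aa] + pseudocount) / total) / background)
                     for aa in ALPHABET})
    return pssm


def profile_score(pssm, segment):
    """Score an ungapped segment against the profile, one column per residue."""
    return sum(column[aa] for column, aa in zip(pssm, segment))


# Hypothetical aligned family: position 1 and 3 are conserved, 2 and 4 vary.
family = ["PQGA", "PEGS", "PKGT", "PRGV"]
pssm = build_pssm(family)
print(profile_score(pssm, "PQGA"), profile_score(pssm, "AQPA"))
```

In an iterative search, sequences found with such a profile would be added to the alignment and the matrix rebuilt before the next round.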
PSI-BLAST is an acronym for "Position Specific Iterated" BLAST. It is an iterative form of blastp in which a profile is created from the amino acid query and the nth set of results (meeting the PSI expectation threshold) and resubmitted. PSI-BLAST is a program based on the BLAST 2.0 algorithm that is designed to detect weak relationships between the query and members of the database not necessarily detectable by standard BLAST searches [19]. The added sensitivity of this program over regular BLAST comes from the use of a profile that is constructed (automatically) from a multiple alignment of the highest scoring hits in the initial BLAST search. The profile is generated by calculating position-specific scores for every position in the alignment. A highly conserved position receives a high score and weakly conserved positions receive scores near zero. The profile is then used to perform additional BLAST searches (called iterations), and the results of each iteration are used to refine the profile. PSI-BLAST is designed for more sensitive protein-protein similarity searches. PSI-BLAST is the most sensitive BLAST program, making it useful for finding very distantly related proteins. We should use PSI-BLAST when our standard protein-protein BLAST search either failed to find significant hits, or returned hits with descriptions such as "hypothetical protein" or "similar to...".
Figure 4.3 PSI Blast
When we use PSI-BLAST to search a database, it generates Position Specific Scoring Matrices, which can then be built into a database of patterns. We can then search one of these databases with a new sequence. One of the difficulties in doing this is curating the database. In a regular sequence database, we just keep adding new sequences, whereas with one of these pattern databases, we have to periodically go back, redo the patterns and try to consolidate them. It takes a lot of effort to keep such a database up to date. PSI-BLAST analysis is useful both for identifying the distant members of a protein family, whose relationship is not recognizable by straight sequence comparison, and for deducing the function of hypothetical proteins that are unannotated in the database.
STEPS OF PSI-BLAST ALGORITHM:
STEP 1: The data to be entered must be in one of the allowed formats for a BLAST search. Once the query sequence is entered, the database to be searched must be selected from the appropriate pull-down menus. Options include a number of different sequence databases that can be searched using blastp.
Figure 4.4 PSI Blast-Step1
The default database is nr, which is the collection of all unique sequences. It contains all non-redundant GenBank CDS translations + PDB + SwissProt + PIR + PRF entries.
STEP 2: The E-value is the statistical significance threshold for reporting matches against database sequences. The default expect value for the initial BLAST search is 10. This EXPECT threshold is fairly lenient, allowing all possible related sequences to be reported. Thus, the initial (BLAST) E value is set at 1.0. It is appropriate to filter most queries for low-complexity sequences because they give spuriously high scores that reflect compositional bias rather than significant position-by-position alignment. Thus we have chosen to filter low-complexity regions. The BLOSUM62 (gap existence cost = 11; per residue gap cost = 1; lambda ratio = 0.85) substitution matrix is used by default in BLAST 2.0. A variety of matrices are supported, including PAM30, PAM70, BLOSUM80, BLOSUM62 and BLOSUM45. Adjustments to the matrix may be in order when a search for very distant relatives of the query is being performed. The BLOSUM matrix assigns a probability score for each position in an alignment that is based on the frequency with which that substitution is known to occur among related proteins.
Figure 4.5 PSI Blast-Step2
Then the word size needs to be set; it is 3 by default. Other advanced options can specify gap costs, word size, and other parameters not otherwise selectable on the query form. Here, we have not set any advanced option.
STEP 3: Checking the NCBI-gi designation facilitates additional searches to investigate the significance of a given alignment, whereas checking graphical overview gives a graphical overview of the database sequences aligned to the query sequence. The score of each alignment is represented by bars of different colors. Multiple alignments on the same database sequence are connected by a striped line. Mousing over a hit sequence causes the definition and score to be shown in the window at the top.
The default number of descriptions and alignments to be listed is 500. Although it may seem useful to change the default to something smaller to control the size of the output, these variables affect the search in two important ways. First, if the total number of hits with E below the threshold exceeds the number (x) of descriptions requested, only the top x most significant hits are listed; additional, possibly significant alignments are not shown, though they may carry important information. Second, the number of sequences used in generating the multiple alignment and the position-specific matrix is given by the larger of the two variables (descriptions, alignments). If at any point in the iterative PSI-BLAST process significant sequences are omitted from the profile, all subsequent output will be affected. By selecting a large number of descriptions (e.g. 250-500) it is possible to ensure that the E value, and not the description limit, is the determining factor in generating the profile used for additional iterations. The output can then be reduced, if desired, by limiting the number of alignments to be reported.
A variety of different alignment formats are available. The choice of which to use is based on personal preference. Pairwise alignment gives a good view of the quality of an individual hit. However, a flat query-anchored alignment (with identities) is a format in which identities shared by numerous sequences can be easily spotted.
There is a second E value, which is the threshold for inclusion in the position-specific matrix used for PSI-BLAST iterations. Here the PSI-BLAST E value is left at the default setting of 0.001. Together, the two E values (this one and the one set earlier) allow the user to see (and selectively, based on prior knowledge, include) all of the BLAST hits up to E=1, but to automatically include only those hits that meet the relatively rigorous E value threshold of 0.001. There are some more options to set, including layout, formatting options for the results page, and autoformat. These affect the report format but not the results produced. Finally, we click the search button to initiate the search. In seconds, the query sequence has been compared to all of the entries in the specified database. Each comparison is scored and the top scores are listed in rank order.
PSI-BLAST Output
The output of PSI-BLAST is shown both in graphical format and in detailed format. In the detailed format the hits are divided into two categories. Those that are better than the inclusion E value threshold are listed first. Those with E values worse than the threshold, but still better than 1 (selected on the query page), are listed further down the page.
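The two-threshold behaviour described above can be illustrated with a minimal Python sketch. The Hit type, the function name partition_hits and the example E values are hypothetical and serve only as an illustration; treating a hit exactly at a threshold as passing is also an assumption.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    name: str
    evalue: float


def partition_hits(hits, inclusion_e=0.001, report_e=1.0):
    """Split hits into (included in the profile, reported only).

    Hits with E above report_e are not reported at all.
    """
    ranked = sorted(hits, key=lambda h: h.evalue)  # best (lowest E) first
    included = [h for h in ranked if h.evalue <= inclusion_e]
    reported_only = [h for h in ranked if inclusion_e < h.evalue <= report_e]
    return included, reported_only


# Hypothetical hits, for illustration only.
hits = [Hit("seqA", 2e-40), Hit("seqB", 0.5), Hit("seqC", 3e-4), Hit("seqD", 7.0)]
included, reported_only = partition_hits(hits)
print([h.name for h in included])
print([h.name for h in reported_only])
```

Only the first list would feed the position-specific matrix for the next iteration; the second list corresponds to the hits shown further down the output page.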
Figure 4.6 PSI Blast-Output
PSI-BLAST in summary: Patterns of conservation, such as a PSSM (Position-Specific Score Matrix) identified from the alignment of related sequences, can aid the recognition of distant similarities. This power can be further enhanced through iteration of the search procedure. Position-Specific Iterated BLAST (PSI-BLAST) was developed for this goal and, furthermore, has advantages in speed, simplicity and automatic operation. The PSI-BLAST program runs as follows.
Figure 4.7 PSI Blast-Output
(1) A standard BLAST search is performed against a database using a substitution matrix (e.g. BLOSUM62).
(2) A PSSM (checkpoint) is constructed automatically from a multiple alignment of the hits of the initial BLAST search or of the last round of homology searching. Highly conserved positions receive high scores and weakly conserved positions receive low scores.
(3) The new PSSM replaces the initial matrix (e.g. BLOSUM62) or the PSSM of the last round in the next BLAST search.
(4) Steps 2 and 3 can be repeated, and the newly found sequences are included to build a new PSSM.
(5) PSI-BLAST has converged if no new sequences are included.
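Steps (1) to (5) can be sketched in a few lines of Python. This is a toy illustration and not the blastpgp implementation: it assumes ungapped sequences of equal length, a uniform background distribution, a simple additive pseudocount, and a raw score threshold in place of an E value. Unlike step (1), the first round here also uses a PSSM built from the query alone rather than BLOSUM62. The names build_pssm, score_segment and psi_iterate are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Return one dict per column with log-odds scores against a uniform background."""
    background = 1.0 / len(ALPHABET)
    pssm = []
    for col in range(len(alignment[0])):
        counts = Counter(seq[col] for seq in alignment)
        total = len(alignment) + pseudocount * len(ALPHABET)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in ALPHABET
        })
    return pssm


def score_segment(pssm, segment):
    """Sum the position-specific scores of an ungapped segment."""
    return sum(column[aa] for column, aa in zip(pssm, segment))


def psi_iterate(query, database, threshold, max_rounds=5):
    """Rebuild the PSSM each round until no new sequence passes the threshold."""
    included = [query]
    for round_no in range(1, max_rounds + 1):
        pssm = build_pssm(included)
        found = [s for s in database if score_segment(pssm, s) >= threshold]
        new = [s for s in found if s not in included]
        if not new:
            return included, round_no  # converged: step (5)
        included.extend(new)  # step (4): new sequences join the model
    return included, max_rounds


included, rounds = psi_iterate("ACDE", ["ACDE", "ACDF", "WWWW"], threshold=2.0)
print(included, rounds)
```

In the toy run, the columns where all included sequences agree receive higher scores than the column where they differ, which mirrors the statement that highly conserved positions receive high scores.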
Figure 4.8 PSI Blast
PSI-Blast
The blastpgp program can do an iterative search in which sequences found in one round of searching are used to build a score model for the next round of searching. In this usage, the program is called Position-Specific Iterated BLAST, or PSI-BLAST. As explained in the accompanying paper, the BLAST algorithm is not tied to a specific score matrix. Traditionally, it has been implemented using an AxA substitution matrix, where A is the alphabet size. PSI-BLAST instead uses a QxA matrix, where Q is the length of the query sequence; at each position the cost of a letter depends on the position with respect to the query and on the letter in the subject sequence. The position-specific matrix for round i+1 is built from a constrained multiple alignment among the query and the sequences found with sufficiently low e-value in round i. The top part of the output for each round divides the sequences into those found previously and used in the score model, and those not used in the score model. The output currently includes many diagnostics requested by users at NCBI. To skip quickly from the output of one round to the next, search for the string "producing", which is part of the header for each round and likely does not appear elsewhere in the output. PSI-BLAST "converges" and stops if all sequences found at round i+1 below the e-value threshold were already in the model at the beginning of the round [21].
There are several blastpgp parameters specifically for PSI-BLAST:
-j is the maximum number of rounds (default 1; i.e., regular BLAST)
-e is the e-value threshold for including sequences in the score matrix model (default 0.01)
-c is the "constant" used in the pseudocount formula specified in the paper (default 10)
The -C and -R flags provide a "checkpointing" facility whereby a score model can be stored and later reused:
-C stores the query and frequency count ratio matrix in a file
-R restarts from a file stored previously.
When using -R, the query specified on the command line must match exactly the query in the restart file. Users who develop their own sequence analysis software may wish to develop their own scoring systems. For this purpose the code in posit.c that writes out the checkpoint can easily be adapted to write out scoring systems derived by other algorithms in such a way that PSI-BLAST can read the files in later. The checkpoint structure is general in the sense that it can handle any position-specific matrix that fits in the Karlin-Altschul statistical framework for BLAST scoring.
4.3 BLASTN
Standard nucleotide BLAST compares a nucleotide query sequence against a nucleotide sequence database. It is better at finding sequences that are similar, but not identical, to the query. The BLAST nucleotide algorithm finds similar sequences by generating an indexed table, or dictionary, of short subsequences called words for both the query and the database. The program can then rapidly find initial exact matches to the query words by simply looking up a particular word in the database dictionary. These initial matches serve as starting points for longer alignments that are generated in several steps, ending with a final gapped alignment [8]. One of the important parameters governing the sensitivity of BLAST searches is the length of the initial words (word size). The most important reason that blastn is more sensitive than MEGABLAST is that it uses a shorter default word size. Because of this, blastn is better than MEGABLAST at finding alignments to related nucleotide sequences from other organisms, since the initial exact match can be shorter. The word size is adjustable in blastn and can be reduced from the default value of 11 to a minimum of 7 to increase sensitivity. The word size can also be increased to raise the search speed and limit the number of database hits.
Figure 4.9 BLASTN
Nucleotide-nucleotide searches are not the recommended way to find homologous protein coding regions in other organisms. It is better to perform such searches at the protein level, either with translations of the nucleotide sequences or by direct protein-protein BLAST. This is because of the degeneracy of the genetic code, the greater information available in amino acid sequences, and the more sophisticated algorithm in protein-protein BLAST.
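The dictionary lookup of exact words described above can be sketched as follows. This is a simplified illustration that indexes a single subject sequence rather than a whole database; the function names build_word_index and find_seeds are hypothetical.

```python
from collections import defaultdict


def build_word_index(sequence, word_size=11):
    """Map every word of length word_size to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(sequence) - word_size + 1):
        index[sequence[pos:pos + word_size]].append(pos)
    return index


def find_seeds(query, subject, word_size=11):
    """Return (query_pos, subject_pos) pairs where identical words occur."""
    index = build_word_index(subject, word_size)
    seeds = []
    for qpos in range(len(query) - word_size + 1):
        for spos in index.get(query[qpos:qpos + word_size], []):
            seeds.append((qpos, spos))
    return seeds


# Short toy sequences, so a word size of 7 (the minimum mentioned above) is used.
print(find_seeds("ACGTACGTTT", "GGACGTACGTAA", word_size=7))
```

Each seed would then be the starting point of an extension. With a smaller word size more seeds are found, which is the source of the increased sensitivity and the reduced speed.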
Figure 4.10 Using Blastn For Comparison
Figure 4.11 Blastn Results
4.4 BLASTX
Sequence similarity between a translated nucleotide sequence and a known biological protein can provide strong evidence for the presence of a homologous coding region, and such similarities can often be identified even between distantly related genes. The computer program BLASTX performs conceptual translation of a nucleotide query sequence followed by a protein database search in one programmatic step. The BLAST search algorithm combined with Karlin-Altschul statistics yields a predictable selectivity that has been parameterized. BLASTX is appropriate for use in moderate and large scale sequencing projects at the earliest opportunity, when the data are most prone to containing errors [9]. Most primary sequence data is obtained as nucleic acid, while much of the biological interest lies in the encoded protein. Inference of likely protein coding regions is often based on statistical features, such as codon usage and the locations of putative splice site signals, but significant false positive rates are common. In contrast, similarity between a conceptually translated nucleotide sequence and a known protein sequence may be highly significant statistically, which suggests a more discriminating approach to inferring coding potential. BLASTX is used to probe a nucleotide sequence directly for the presence of protein coding regions by identifying segments that encode significant similarity to members of a protein sequence database. The BLASTX program has been successfully employed to identify likely protein coding sequences in thousands of partial cDNA sequences from human brain tissue. BLASTX allows protein-protein comparisons to be considered when only an uncharacterized nucleotide query sequence is available. The program conceptually translates query sequences in all six reading frames (three on each strand) and compares each of these full-length translation products with a comprehensive protein sequence database in a single pass.
The BLAST algorithm approximates a well defined measure of local sequence similarity based on a matrix of similarity or substitution scores for all possible pairs of residues. The algorithm identifies ungapped, aligned pairs of sequence segments with locally maximum scores which meet or exceed a parameterized cutoff score S. These segments are referred to as high-scoring segment pairs (HSPs), and the highest scoring segment pair derivable from any two sequences is their maximal-scoring segment pair, or MSP.
Figure 4.12 Blastx
A program, BLASTX, based on this rapid, probabilistic algorithm, was used to find statistically significant HSPs between a translated nucleotide query sequence and a target protein sequence database. When an HSP was found, the analysis of Karlin and Altschul was used to estimate the significance of its score. No prior knowledge of the reading frame or direction was assumed by BLASTX; all possible reading frames in both orientations of the query sequence were translated into protein sequence using the standard genetic code. The PAM (point accepted mutation) amino acid substitution model was typically used for scoring similarity between peptide sequences. By default, BLASTX used a PAM120 matrix. The expected number of alignments scoring S or greater in a comparison between two random sequences of lengths m and n is
$E = m n K e^{-\lambda S}$
where K and $\lambda$ are parameters dependent on the amino acid compositions of the sequences. For values less than about 0.1, E is often an acceptable approximation to P, the probability of occurrence of one or more matches scoring S or greater. In a true coding region, one reading frame may have a predicted amino acid composition typical for biologically occurring proteins, while the other reading frames exhibit anomalous compositions. For this reason, BLASTX calculated separate K and $\lambda$ values for each reading frame.
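The expectation formula can be evaluated directly, as in the short sketch below. The values of K and lambda used here are illustrative assumptions and not the values computed by BLASTX for any particular reading frame. The relation P = 1 - exp(-E) is the usual Poisson relation of the Karlin-Altschul framework; it is included to show why E approximates P for small values.

```python
import math


def expected_hits(m, n, K, lam, S):
    """E = m * n * K * exp(-lambda * S) for sequences of lengths m and n."""
    return K * m * n * math.exp(-lam * S)


def p_value(E):
    """Probability of at least one match scoring S or greater (Poisson assumption)."""
    return 1.0 - math.exp(-E)


# Illustrative parameters only (assumed, not taken from a real search).
K, lam = 0.13, 0.32
for score in (40, 60, 80):
    E = expected_hits(m=300, n=1_000_000, K=K, lam=lam, S=score)
    print(score, E, p_value(E))

# For small E the two quantities nearly coincide.
print(0.05, p_value(0.05))
```

The loop shows how quickly E falls as the score S rises, since S enters through the exponent.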
Figure 4.13 Using Blastx For Comparison
The BLAST algorithm operates in two successive stages, neighborhood word generation followed by the actual search, with an implicit trade-off of speed versus sensitivity imparted in the first stage. A list of neighborhood words of length W is generated from consecutive, overlapping words of length W in the query sequence, using a specified scoring matrix. The neighborhood list contains all words which satisfy a threshold scoring parameter, T, when aligned with words in the query sequence. Raising T decreases the size of the neighborhood and, consequently, increases the search speed in the algorithm's second stage, but at the expense of decreased sensitivity. In BLASTX, the neighborhood word list was built from the conceptual translations of all six reading frames on both strands of the query sequence [24]. During the second stage of the BLAST algorithm, the neighborhood words from the first stage are searched for in the database or target sequence; the presence of a neighborhood word match indicates the possible location of an HSP. Individual neighborhood word matches (or word hits) are extended in both directions along the matrix diagonal until the ends are reached or the cumulative alignment score falls from its maximum achieved value by a parameterized quantity X.
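The extension of a word hit with the drop-off parameter X can be sketched as below. This is a simplified illustration: the scoring function is a toy match/mismatch scheme rather than a PAM or BLOSUM matrix, the drop is tracked separately in each direction, and the names extend_hit and toy_score are hypothetical.

```python
def _extend(pairs, score_fn, x_drop):
    """Walk along aligned residue pairs; stop once the score has dropped by x_drop."""
    score = best = best_len = 0
    for length, (a, b) in enumerate(pairs, start=1):
        score += score_fn(a, b)
        if score > best:
            best, best_len = score, length
        elif best - score >= x_drop:
            break
    return best, best_len


def extend_hit(query, subject, qpos, spos, word_len, score_fn, x_drop):
    """Extend a word hit in both directions along its diagonal.

    Returns (score, query_start, query_end) of the ungapped segment pair.
    """
    word_score = sum(
        score_fn(a, b)
        for a, b in zip(query[qpos:qpos + word_len], subject[spos:spos + word_len])
    )
    right = zip(query[qpos + word_len:], subject[spos + word_len:])
    left = zip(reversed(query[:qpos]), reversed(subject[:spos]))
    r_score, r_len = _extend(right, score_fn, x_drop)
    l_score, l_len = _extend(left, score_fn, x_drop)
    return word_score + l_score + r_score, qpos - l_len, qpos + word_len + r_len


def toy_score(a, b):
    """Assumed toy scoring: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1


print(extend_hit("TTACGTACGG", "GGACGTACTT", 2, 2, 3, toy_score, x_drop=2))
```

A larger X lets the extension pass through short mismatched stretches at the cost of more work per hit.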
Figure 4.14 Blastx Results
4.5 BLASTP
The BLASTP program is a search tool for databases of protein sequences that is widely used by biologists as a first step in investigating new genome sequences. BLASTP finds high-scoring local alignments without gaps between a query sequence q and a sequence s in the database. The score of an alignment is the sum of the scores of the individual aligned pairs of amino acids. These individual scores come from a scoring matrix modeling the rate of evolutionary mutation. BLASTP is the most widely used program for determining alignments of protein sequences against databases such as GenBank. It is a three-step algorithm that only needs to scan the database for exact matches [14]. The BLASTP algorithm works in three steps:
1. Neighborhood Construction. A set of words of length W, called the neighborhood N, is computed. Each word scores at least T with some word of equal length in the query sequence Q.
2. Hit Detection. Each subject SB in the database DB is scanned for (exact) matches to a word in N.
3. Hit Extension. The match, or hit H, is extended into a potentially higher scoring alignment.
Figure 4.15 Blastp
The first step is to create a neighbourhood for each (short) segment of length W of the query sequence. The neighbourhood consists of all sequences of W amino acids that match the query segment with a high score. An automaton is built to recognize the union of all neighbourhoods. The second step is to scan the database for exact matches to any neighbour. These matches are called hits. The third step attempts to extend a hit into a high-scoring pair of segments with approximate matches to the left and right of the hit. As each pair of aligned residues is included in the alignment, the score of the aligned pair is looked up in a score matrix and added to a running sum. Extension of a hit continues until the falloff value, X, is reached.
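The first two steps can be sketched as follows. This is a toy illustration: the scoring function is an assumed simplification (identical residues score 4, residues within the hydrophobic group I, L, V score 1, everything else scores -2) and stands in for a real matrix such as BLOSUM62. A plain set replaces the automaton, and the names neighborhood and find_hits are hypothetical.

```python
from itertools import product

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
SIMILAR = set("ILV")  # assumed toy group of similar residues


def toy_score(a, b):
    """Assumed toy scoring scheme, not a real substitution matrix."""
    if a == b:
        return 4
    if a in SIMILAR and b in SIMILAR:
        return 1
    return -2


def neighborhood(query, word_size, threshold, score_fn=toy_score):
    """Step 1: all words scoring at least threshold against some query word."""
    words = set()
    for pos in range(len(query) - word_size + 1):
        qword = query[pos:pos + word_size]
        for cand in product(ALPHABET, repeat=word_size):
            if sum(score_fn(a, b) for a, b in zip(qword, cand)) >= threshold:
                words.add("".join(cand))
    return words


def find_hits(subject, words, word_size):
    """Step 2: positions in the subject that exactly match a neighborhood word."""
    return [
        pos for pos in range(len(subject) - word_size + 1)
        if subject[pos:pos + word_size] in words
    ]


words = neighborhood("LIV", word_size=3, threshold=9)
print(sorted(words))
print(find_hits("AALVVKK", words, word_size=3))
```

Raising the threshold to the score of the query word against itself (12 in this toy scheme) shrinks the neighborhood to the query word alone, which illustrates the trade-off between speed and sensitivity controlled by T.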
Figure 4.16 Using Blastp For Comparison
4.5.1 BlastP Parameters

1. [DATABASE] Valid database name
   Default : nr
2. [EXPECT] The statistical significance expectation value. If the statistical significance ascribed to a match is greater than the E value, the match will not be reported. Lower E values are more stringent, leading to fewer chance matches being reported.
   Default : 10.0
3. [ENTREZ_QUERY] Entrez query to limit Blast search
   Value   : Entrez query format
   Default : Empty
4. [FILTER] Sequence filter identifier
   Value   : L for Low Complexity, R for Human Repeats, m for Mask for Lookup
5. [GAP_OPEN_COSTS] Gap open costs
   Value   : integer values
   Default : 5 for nuc-nuc, 11 for proteins, non-affine for megablast
6. [GAP_EXTEND_COSTS] Gap extend costs
   Value   : space separated float values
   Default : 2 for nuc-nuc, 1 for proteins, non-affine for megablast
7. [MATRIX_NAME] A key element in evaluating the quality of a pairwise sequence alignment is the "substitution matrix", which assigns a score for aligning any possible pair of residues.
   Value   : Valid matrix name
   Default : BLOSUM62
4.6 TBLASTN
TBLASTN compares a protein query sequence against a nucleotide sequence database dynamically translated in all six reading frames (both strands). The "Protein query - Translated db [tblastn]" search is useful for finding protein homologs in unannotated nucleotide data. It can be a very productive way of finding homologous protein coding regions in unannotated nucleotide sequences such as expressed sequence tags (ESTs) and draft genome records (HTG), located in the BLAST databases est and htgs, respectively. ESTs are short, single-read cDNA sequences. They comprise the largest pool of sequence data for many organisms and contain portions of transcripts from many uncharacterized genes. Since ESTs have no annotated coding sequences, there are no corresponding protein translations in the BLAST protein databases. Hence a tblastn search is the only way to search for these potential coding regions at the protein level. The HTG sequences, draft sequences from various genome projects or large genomic clones, are another large source of unannotated coding regions [8]. Like all translating searches, the tblastn search is especially suited to working with error-prone data like ESTs and draft genomic sequences from HTG, because it combines BLAST statistics for hits to multiple reading frames and thus is robust to frame shifts introduced by sequencing errors.
4.7 TBLASTX
Tblastx compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database. The tblastx program cannot be used with the nr database on the BLAST Web page because it is computationally intensive. The "Nucleotide query - Translated db [tblastx]" search is useful for identifying novel genes in error-prone query sequences. Tblastx takes a nucleotide query sequence, translates it in all six frames, and compares those translations to the database sequences dynamically translated in all six frames. This effectively performs a more sensitive blastp search without the manual translation. Tblastx gets around the potential frame shifts and ambiguities that may prevent certain open reading frames from being detected. This is very useful in identifying potential proteins encoded by single-pass read ESTs. In addition, it is a good tool for identifying novel genes [8].
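The six-frame translation used by the translating programs can be sketched as follows, using the standard genetic code. The function names are hypothetical, and incomplete trailing codons are simply dropped in this sketch.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; unknown codons (e.g. containing N) become X."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Translate three frames on each strand; '*' marks a stop codon."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


for name, protein in six_frames("ATGGCCTAA").items():
    print(name, protein)
```

In tblastx both the query and every database sequence go through such a translation, giving 36 frame combinations per sequence pair, which is why the program is computationally intensive.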
4.7.1 Limitations Of Tblastx

1. TblastX is computationally intensive.
2. Until recently, there were not many completely sequenced genomes.
3. When a match is found, there is rarely a description of what was found.
CHAPTER 5
COMPARISON OF VARIANTS OF BLAST
5.1 INTRODUCTION
BLAST is a successful tool for comparing biological sequences. Nowadays a large amount of biological data is available, but standalone BLAST is not sufficient to handle all types of queries related to sequence similarity, so different variants (BlastX, BlastP, BlastN, TBlastN, TBlastX, PSI-Blast) have been developed. Each variant has limitations and advantages, and each tool is made for a different purpose, so the user should know which tool to use in which situation. A comparison between these variants is needed to understand the tools thoroughly [29]. The variants of BLAST are compared on the basis of:
Parameters
Algorithm
Performance
5.1.1 Comparison On The Basis Of Parameters

All variants of BLAST run on the same algorithm as the main BLAST program. Most of the parameters used by the main BLAST program are the same for all variants. However, some parameters are present in some variants only, and their presence or absence gives one tool an advantage over another and makes the functionality differ.
5.1.1.1 Conserved Domain Search Is Not Applied To Blastn, It Is Applicable To Blastp.
Proteins often contain several domains, each with a distinct function (membrane binding, signal peptide, etc.). As species evolve, the functional parts of important proteins remain relatively constant over time, and may even be copied and adapted for use by other proteins. Such domains have evolved as modules that are combined in various arrangements to produce proteins of unique function. Conserved domains are structural modules that have been reused frequently during the process of evolution. NCBI's new Conserved Domain Search (CD-Search) service can be used to identify conserved domains in a protein sequence.
Figure 5.1 Conserved Domain For BlastN and BlastP
Influence of absence of CDD Search: Conserved Domain Search is applicable only to proteins, because it is based on PSSMs (Position-Specific Score Matrices), which are applied only to proteins. By applying PSSMs, specific functional areas within proteins can be searched for, and the functional domains found can be used in further research. Because PSSMs are not applied to nucleotides, no search option is available for specific functional areas that may exist in nucleotide sequences.
5.1.1.2 The Default Word Size Is 11 Characters For Blastn. The Default Word Size Is 3 For BLASTP, due To Which BLASTP Searches Run Slower Than BLASTN.
Word size (seed) strongly affects database searching. The speed of the algorithm increases with the word size: decreasing the word size increases the sensitivity but decreases the speed of the search program. The word size for BlastP is very small compared to BlastN. In BlastP the seed is 3 residues long, and during the second step of the algorithm a large number of hits are found in the database because the seed is so short, so more time is spent on the search. In BlastN the seed is 11 nucleotides long, and it is difficult to find many exact matches for such a long seed. Fewer hits are found and results are displayed in less time than with BlastP, but the sensitivity of BlastN is lower.
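The effect of the word size on the number of chance word hits can be estimated with simple arithmetic. The sketch below assumes random sequences with uniformly distributed letters and counts exact word matches only, so it ignores the neighborhood words that BlastP adds on top. The function name and the sequence lengths are hypothetical.

```python
def expected_exact_word_hits(query_len, db_len, word_size, alphabet_size):
    """Expected number of chance exact word matches between random sequences."""
    query_words = max(query_len - word_size + 1, 0)
    db_words = max(db_len - word_size + 1, 0)
    return query_words * db_words / alphabet_size ** word_size


# Number of distinct words for each default seed.
print(4 ** 11)   # nucleotide words of length 11
print(20 ** 3)   # protein words of length 3

# Same hypothetical lengths for both searches: 300 letters against 1,000,000.
print(expected_exact_word_hits(300, 1_000_000, word_size=11, alphabet_size=4))
print(expected_exact_word_hits(300, 1_000_000, word_size=3, alphabet_size=20))
```

Under these assumptions the 3-residue protein seed produces several hundred times more chance hits than the 11-nucleotide seed, which is consistent with the longer run time of BlastP described above.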
Figure 5.2 Different Word Sizes For BlastN and BlastP
5.1.1.3 Blastn Is Very Different From Other Protein-Based Algorithms. Blastn Seeds Are Identical Words. T Is Never Used In Blastn.
In BlastN a word hit is simply a pair of identical words. T is the neighborhood threshold parameter of the protein-based programs. It matters where no exact match to a given word is found: it enlarges the set of words that count as seeds. The neighborhood of a given word seed contains the word itself and all other words whose score against it, according to the scoring matrix, is at least T. By adjusting T, it is possible to control the size of the neighborhood and therefore the number of word hits in the search space [30]. T is not used in BlastN, because BlastN always looks for identical matches, so no neighborhood is needed.
Influence of absence of T: T is not used in BlastN, and this is a big limitation of the BlastN algorithm. If no identical seeds are found, there will be no match: when no match is found for a given seed, the search stops there, no extension of the seed is performed, and no alignment is found.
Improvement: T should be used in BlastN. By using T, more word hits can be found. When other words are aligned with the word seed, a neighborhood of the word is created and extension is applied to it. During extension in both directions, words are included whose score does not fall below the threshold value T, and a similar sequence is found whose score does not fall below the drop-off score X. Therefore there is no need to stop the search; more sequence matches can be found and there is less chance of missing alignments.
5.1.1.4 Unlike Nucleotide BLAST, There Is No Comparable MEGABLAST For Protein Searches.
MegaBlast is optimized for aligning sequences that differ slightly as a result of sequencing or other similar "errors". When a larger word size is used (see explanation below), it is up to 10 times faster than more common sequence similarity programs. MegaBlast is also able to efficiently handle much longer DNA sequences than the blastn program of the traditional BLAST algorithm.
Influence of absence of MegaBlast: MegaBlast is an improvement on the existing BlastN algorithm, but no such program exists for proteins. No batch queries can be run in protein sequence searching, and long sequence searches cannot be performed as efficiently. To improve the speed of protein searches and to handle long sequences, a MegaBlast-like program should be developed for proteins, one that can run large protein sequences and batches of sequences at a time.
5.1.1.5 Genetic Code Option Is Only Used With Blastx, Genetic Code Option Is Disabled With Tblastn
The genetic code is the relationship between the sequence of bases in DNA and the sequence of amino acids in proteins. Both DNA and proteins are linear polymers, so it seems logical to suppose that the sequence of bases in DNA codes for the sequence of amino acids in proteins. However, there are 20 amino acids found in proteins and only 4 different bases found in DNA, so the coding ratio cannot be 1 to 1, nor can it be 2 bases to 1 amino acid, which would give only 16 different combinations. At least 3 bases in combination, as a triplet, are required to code for each amino acid, and this gives 4 to the power 3 = 64 possible combinations of triplet bases, or codons. We now know that the genetic code is based on these triplet codons.
Different species may use different genetic codes to encode the same amino acid. The appropriate genetic code (translation table) has to be specified for the query sequence based on the organism and source.
BlastX translates the given nucleotide sequence into protein and then compares it with the protein database. The genetic code is used to translate the nucleotides into protein; without it, translation is not possible. Different codes are available for different species, but the standard genetic code is used in most cases.
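The codon arithmetic and the effect of choosing a different translation table can be illustrated with a short sketch. The standard code and the vertebrate mitochondrial code (which differs from the standard code at the codons AGA, AGG, ATA and TGA) are used as examples. The names STANDARD, VERTEBRATE_MITO and translate are hypothetical.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Standard genetic code: 4 ** 3 = 64 codons for 20 amino acids plus stop ('*').
STANDARD = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Vertebrate mitochondrial code: the same table with four codons reassigned.
VERTEBRATE_MITO = {**STANDARD, "AGA": "*", "AGG": "*", "ATA": "M", "TGA": "W"}


def translate(dna, table=STANDARD):
    """Translate complete codons of one reading frame with the given table."""
    return "".join(table[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


print(4 ** 2, 4 ** 3, len(STANDARD))
print(translate("ATATGA"))
print(translate("ATATGA", VERTEBRATE_MITO))
```

The same six bases give a different protein under the two tables, which is why the genetic code option has to match the organism and source of the query.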
5.2 COMPARISON ON THE BASIS OF ALGORITHM

All variants of BLAST run on the same algorithm as the main BLAST program. However, there are some differences in how they work, because of which their performance varies. These differences in the algorithm make each tool suitable for a different purpose. On the basis of the different functionality, the algorithms can be optimized to improve performance.
5.2.1 The Two-Hit Algorithm Isn't Used In BLASTN, Because Word Hits Are Generally Rare With Large Identical Words.
The two-hit algorithm is not used in the original version of BLAST or in BLASTN. The statistical alignments found using the main BLAST algorithm are based on the threshold value T and the drop-off score X. The central idea of the BLAST algorithm is that a statistically significant alignment is likely to contain a high-scoring pair of aligned words. BLAST first scans the database for words (typically of length three for proteins) that score at least T when aligned with some word within the query sequence. Any aligned word pair satisfying this condition is called a hit. The second step of the algorithm checks whether each hit lies within an alignment with a score sufficient to be reported. This is done by extending a hit in both directions, until the running alignment's score has dropped more than X below the maximum score yet attained. This extension step is computationally quite costly; with the T and X parameters necessary to attain reasonable sensitivity to weak alignments, the extension step typically accounts for >90% of BLAST's execution time. It is therefore desirable to reduce the number of extensions performed.
The refined algorithm is based upon the observation that an HSP of interest is much longer than a single word pair, and may therefore entail multiple hits on the same diagonal and within a relatively short distance of one another. Specifically, we choose a window length A, and invoke an extension only when two non-overlapping hits are found within distance A of one another on the same diagonal. Any hit that overlaps the most recent one is ignored. The two-hit method will detect an HSP if it contains two non-overlapping length-W words of score at least T. Comparing the relative speeds of the one-hit and two-hit methods with the parameters studied above, the two-hit method generates on average ~3.2 times as many hits, but only ~0.14 times as many hit extensions.
Influence of absence of two-hit algorithm: The two-hit algorithm is not used for BlastN because the word size for BlastN is large (11 nucleotides) and word hits must be identical words. Word hits are rare and difficult to find with a large word size, whereas it is easy to find identical matches for one or two nucleotides in a given database.
[Figure 5.3: y-axis: Probability of missing an HSP; x-axis: Normalized HSP Score]
Figure 5.3 shows the empirically estimated probability that an HSP is missed by this method, as a function of its normalized score. With a seed of 11 bp, however, it is very rare to find exactly the same nucleotide sequence twice close together on one diagonal. Therefore the two-hit algorithm is not used in BlastN.
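The two-hit rule described in this section can be sketched as follows. The sketch assumes that the hits are given as (query position, subject position) pairs and that a distance exactly equal to the window length A still triggers an extension; the function name two_hit_triggers and the example hits are hypothetical.

```python
def two_hit_triggers(hits, word_size, window):
    """Return the hits that would trigger an extension under the two-hit rule.

    A hit triggers an extension only if an earlier, non-overlapping hit lies
    on the same diagonal within `window` positions.
    """
    last_on_diagonal = {}
    triggers = []
    for qpos, spos in sorted(hits, key=lambda h: h[1]):
        diagonal = spos - qpos
        previous = last_on_diagonal.get(diagonal)
        if previous is not None:
            distance = spos - previous
            if distance < word_size:
                continue  # overlaps the most recent hit: ignored
            if distance <= window:
                triggers.append((qpos, spos))
        last_on_diagonal[diagonal] = spos
    return triggers


# Hypothetical word hits for a word size of 3 and a window of 40.
hits = [(0, 10), (1, 11), (5, 15), (50, 60), (3, 30)]
print(two_hit_triggers(hits, word_size=3, window=40))
```

Of the five hits only one leads to an extension, which illustrates how the method can generate more hits (through a lower T) and still perform far fewer of the costly extensions.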
Figure 5.4 Speeds of the one-hit and two-hit methods
Improvement: If the two-hit algorithm were applied to BlastN, the sensitivity of BlastN would increase and more accurate sequence similarity results would be obtained. This can be done by decreasing the word size of BlastN: with a large word size it is difficult to find exact matches regularly at two positions, but with a short word size it is easy to find exact matches at more than one position.
5.2.2 Extension in BlastN is different from BlastP and other protein based programs.
Extension in BlastN is different from extension in BlastP because one works on nucleotides and the other on proteins. Different scoring matrices are used to score the neighborhood during extension, and these yield separate drop-off (X) scores for BlastN and BlastP. In BlastN the score has to be evaluated for the whole 11-nucleotide word, which takes more time than in BlastP, where the word size is smaller.
5.3 COMPARISON ON THE BASIS OF PERFORMANCE

Every tool is efficient under different conditions and for different input queries. The performance of the variants of BLAST is measured on the basis of the following criteria:
Expect Value
Word Size
Time
5.3.1 Comparison On The Basis of Varying Expect Values

A BlastN search was performed using the mRNA sequence of PRDX1 against the nonredundant (nr) database. To observe the effect of the "expect value" parameter, values of 10, 0.1, and 1e-30 were used, keeping the word size (11) and the filter (low complexity) constant. Table 5.1 shows the results: expect=10 returned 163 hits, expect=0.1 returned 157 hits, and expect=1e-30 returned only 65 hits. The expect value is the number of hits with at least a given score that would be expected to occur by chance. Decreasing this value makes the search more stringent, so fewer results are returned.
Expect value (e)   BlastN   BlastP   BlastX   TBlastN   TBlastX   PSI-Blast
10                 163      100      100      100       101       501
0.1                157      100      100      101       100       501
1e-30              65       80       58       75        98        480
Table 5.1 No of hits for varying expect values
In the same manner, the protein sequence of PRDX1 was searched against the nonredundant protein database with BlastP, BlastX, TBlastX, TBlastN and PSI-Blast. Again, the expect value was varied while the word size (3) was kept constant. Expect values of 10 and 0.1 both returned about 100 hits for each program (501 hits for PSI-Blast), meaning that a 100-fold increase in stringency makes no real difference. However, when an expect value of 1e-30 was used, the number of hits dropped for every variant, for example to 58 for BlastX and 80 for BlastP. The protein sequences in the database aligned so well with the PRDX1 protein sequence that only very low expect values altered the output.
[Chart series: BlastN, BlastP, BlastX, TBlastN, TBlastX, PSI-Blast]
Figure 5.5 Comparison - Varying Expect Values
[Chart: Performance on the basis of expect value; x-axis: Expect Value; y-axis: No. of Hits; series: BlastN, BlastP, BlastX, TBlastN, TBlastX, PSI-Blast]
Figure 5.6 Comparison - Varying Expect Values
Lowering the expect value by a factor of 100 makes little difference to the number of hits in BlastP, BlastX, TBlastN and TBlastX. Variation appears only when the expect value is reduced by a large factor. As can be seen from the graph, even though the same input parameters are given to all the variants, PSI-Blast and BlastN return the most hits.
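The effect of the expect threshold observed in this section can be related to the Karlin-Altschul relation E = K·m·n·e^(−λS), where S is the raw alignment score, m the query length and n the database length. The sketch below is a hedged illustration: the function names are chosen here, and the values of λ and K in the usage lines are illustrative placeholders, since the real values depend on the scoring system in use.

```python
import math


def expect_value(raw_score, query_len, db_len, lam, k):
    """Expected number of chance alignments scoring at least raw_score."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


def passes_threshold(raw_score, query_len, db_len, lam, k, expect):
    """True if a hit would be reported under the given expect threshold."""
    return expect_value(raw_score, query_len, db_len, lam, k) <= expect


# Illustrative placeholder parameters (assumptions, not measured values).
LAM, K = 0.318, 0.134
QUERY_LEN, DB_LEN = 200, 1_000_000_000

for threshold in (10, 0.1, 1e-30):
    print(threshold, passes_threshold(100, QUERY_LEN, DB_LEN, LAM, K, threshold))
```

Because E falls exponentially with the score, a hit that clears a threshold of 10 often clears 0.1 as well, and only a very small threshold such as 1e-30 removes it, which is consistent with the pattern in Table 5.1.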
5.3.2 Comparison On The Basis of Word Size

As in the experiment above, a BlastN search was performed using PRDX1 mRNA. This time, the expect value was held constant at 10 while the word size was changed (7, 11, 15). The other settings, namely the nr database and the low complexity filter, were kept the same. The following results were observed.
Word Size (w) 7 11 15
BlastN 163 163 139
Table 5.2 No of hits for varying word size (BlastN)
The results showed that word sizes of 7 and 11 both returned 163 hits, while a word size of 15 returned only 139 hits. Word size is a measure of how many items, nucleotides in this case, are taken and compared to the database. With a word size of 11, a group of 11 sequential nucleotides is compared with the database. The larger the word size, the more stringent the analysis. That is why a word size of 15 returned fewer results.
[Chart: Performance of BlastN on the basis of varying word size; x-axis: Word Size; y-axis: No. of Hits]
Figure 5.7 Varying Word Size for BlastN
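The relationship between word size and stringency described above can be made concrete with a small sketch. It is only an illustration with toy sequences: a sequence of length L contains L - w + 1 words of size w, and the number of identical words shared by two sequences shrinks as w grows. The helper names are chosen here and are not part of BLAST.

```python
def words(sequence, w):
    """All overlapping words of length w in the sequence (L - w + 1 of them)."""
    return [sequence[i:i + w] for i in range(len(sequence) - w + 1)]


def count_exact_hits(query, subject, w):
    """Count (query word, subject word) pairs that are identical."""
    subject_counts = {}
    for word in words(subject, w):
        subject_counts[word] = subject_counts.get(word, 0) + 1
    return sum(subject_counts.get(word, 0) for word in words(query, w))


query = "ACGTACGTAC"
subject = "ACGTTCGTAC"  # differs from the query at a single position
for w in (3, 7):
    print(w, len(words(query, w)), count_exact_hits(query, subject, w))
```

A single mismatch in the middle of the sequence is enough to destroy every word of size 7 in this toy example, while many words of size 3 still match.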
[Chart: Performance on the basis of word size; x-axis: Word Size (7, 11, 15); y-axis: No. of Hits (125 to 165); series: BlastN]
Figure 5.8 Varying Word Size for BlastN
Word size can also be varied in BlastP, BlastX, TBlastX, TBlastN and PSI-Blast. In the next comparison, the PRDX1 protein was searched against the protein database using a constant expect value (1e-70), database (nr), and filter (low complexity). The word size was varied between 2 and 3.
Word size (w)   2     3
BlastP          58    58
BlastX          100   100
TBlastX         100   57
TBlastN         115   115
PSI-Blast       501   501
Table 5.3 No of Hits For Varying Word Size

[Chart: Performance on the basis of varying word size; y-axis: no. of hits (0 to 600); series: word size = 2, word size = 3; categories: BlastP, BlastX, TBlastX, TBlastN, PSI-Blast]
Figure 5.9 Varying Word Size for variants
[Chart: Performance on the basis of word size; x-axis: Word Size; y-axis: No. of Hits; series: BlastP, BlastX, TBlastX, TBlastN, PSI-Blast]
Figure 5.10 Varying Word Size for variants
In this experiment, varying the word size between 2 and 3 does not affect the number of hits returned by BlastP, BlastX, TBlastN and PSI-Blast. It affects only TBlastX, whose number of hits declines as the word size increases.
5.3.3 Comparison on the Basis of Execution Time

BlastX, BlastN and TBlastX were executed on 32-bit and 64-bit processors, and their execution times in seconds were compared for one and two CPUs, as shown below.
Test      Number of CPUs   32-bit time (in seconds)   64-bit time (in seconds)
blastX    1                1516                       1085
blastX    2                751                        550
blastN    1                297                        252
blastN    2                153                        132
tblastX   1                4999                       3545
tblastX   2                2761                       1940
Table 5.4 Varying Execution Time
From the table and the graph, it is clear that BlastN takes less time to execute than the other variants. TBlastX is the slowest of all, whether it is executed on a 32-bit or a 64-bit processor. The performance of BlastX lies between the two. The observations are represented in the graph below:
[Chart: values from 0 to 6000 for four configurations: Single CPU 32-bit, Dual CPU 32-bit, Single CPU 64-bit, Dual CPU 64-bit]
Figure 5.11 Comparison of the performance of BLAST compiled for 32-bit and 64-bit processors
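The timings in Table 5.4 can be turned into speedup factors with a few lines of Python. The numbers below are copied from the table. The helper names are chosen here for illustration, and the ratios are simple divisions rather than a rigorous benchmark analysis.

```python
# Execution times in seconds from Table 5.4: (program, CPUs) -> (32-bit, 64-bit)
TIMINGS = {
    ("blastX", 1): (1516, 1085),
    ("blastX", 2): (751, 550),
    ("blastN", 1): (297, 252),
    ("blastN", 2): (153, 132),
    ("tblastX", 1): (4999, 3545),
    ("tblastX", 2): (2761, 1940),
}


def speedup_64bit(program, cpus):
    """How many times faster the 64-bit run is than the 32-bit run."""
    t32, t64 = TIMINGS[(program, cpus)]
    return t32 / t64


def speedup_dual_cpu(program, bits):
    """How many times faster two CPUs are than one, for a 32- or 64-bit build."""
    column = 0 if bits == 32 else 1
    return TIMINGS[(program, 1)][column] / TIMINGS[(program, 2)][column]


def fastest_program(cpus, bits):
    """Program with the lowest execution time for the given configuration."""
    column = 0 if bits == 32 else 1
    times = {p: t[column] for (p, c), t in TIMINGS.items() if c == cpus}
    return min(times, key=times.get)


for program in ("blastN", "blastX", "tblastX"):
    print(program,
          round(speedup_64bit(program, 1), 2),
          round(speedup_dual_cpu(program, 32), 2))
print(fastest_program(1, 32))
```

Read this way, the table suggests that adding a second CPU roughly halves the run time for all three programs, while the gain from the 64-bit build is smaller and varies by program.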
Summary: The variants of BLAST (BlastN, BlastP, BlastX, TBlastN, TBlastX and PSI-Blast) use different parameters and different algorithms, and each tool has different performance characteristics. Performance differs with parameters such as word size, expect value, and the databases available, and the efficiency of each tool can be improved by selecting suitable values. In this chapter, performance was examined on the basis of execution time, varying parameters, and a comparison of the algorithms. From these results, one can decide which tool to use in which situation.
CHAPTER 6
CONCLUSION AND FUTURE SCOPE
6.1 CONCLUSION
In the plethora of tools available for data mining in bioinformatics, BLAST was chosen for its speed, sensitivity and accuracy. Although BLAST performs well overall, variants of BLAST exist to suit different conditions. Several parameters relate to areas outside algorithm design and computer science, so the analysis of parameters was limited to the point of view of a computer engineer. Within that limit, improvements to some of the parameters are suggested.
Firstly, to improve the speed of BlastP, its word size should be increased, as in BlastN. Word size strongly affects database searching: search speed increases with word size, while decreasing the word size increases sensitivity but decreases the speed of the search program. The word size for BlastP (3 residues) is very small compared to that of BlastN. During the second step of the BlastP algorithm, a large number of hits are found in the database because of the small seed size, so more time is spent on the search. In BlastN the seed is 11 nucleotides long, and it is difficult to find many exact matches for such a large seed. Fewer hits are found and results are displayed in less time than with BlastP, but sensitivity decreases.
Secondly, an improvement for BlastP, BlastX, TBlastX and PSI-Blast: for nucleotide BLAST there is MegaBlast, and there should be a comparable MegaBlast for protein searches. MegaBlast is optimized for aligning sequences that differ slightly as a result of sequencing or other similar "errors". It efficiently handles much longer DNA sequences than the BlastN program of the traditional BLAST algorithm. When a larger word size is used, it is up to 10 times faster than more common sequence similarity programs.
Lastly, each of the variants has some advantages and disadvantages.
Regular exploration and improvement are needed for better efficiency of these tools. Some features are available only in nucleotide-based tools and are absent in protein-based versions. By continuously evaluating the performance and exploring the features of each tool, improvements are being made in this area.
6.2 FUTURE SCOPE

Over the past decade many biological tools have been developed, but improvements are still needed to increase their speed and accuracy. Research on improving existing tools is ongoing.
Directions for further work include examining the problems arising from the use of biological tools, how the execution of the code affects the performance of a tool, and what modifications can be made to the source code.
By modifying the existing parameters and source code, speed can be increased and the field of bioinformatics will gain a wider and more dynamic scope. Measurement and analysis are the key to development and improvement, so with continuous evaluation of existing versions of biological tools, further improvements will be possible.
REFERENCES
[1] blast-help group, NCBI User Service, BLAST Program Selection Guide, NCBI, NLM, NIH, 8600 Rockville Pike, Bethesda, MD 20894.
[2] Dan E. Krane and Michael L. Raymer, Fundamental Concepts of Bioinformatics, Pearson Education, 2003.
[3] Dr. Joanne Fox, Sequence Similarity Searching: Understanding and Using Web Based BLAST, Wednesday January 26th, 2005, Rm 220 FNS Building, UBC.
[4] Discovery: An Overview. In U.M. Fayyad, et al. (eds.), Advances in Knowledge Discovery and Data Mining, 1-35. AAAI/MIT Press, 1996.
[5] Gat and Tal Kohen, Algorithms for Molecular Biology, Lecture 4: January 1, 1999.
[6] G. Piatetsky-Shapiro, U. Fayyad, and P. Smyth, Data Mining to Knowledge Discovery: An Overview. In U.M. Fayyad, et al. (eds.), Advances in Knowledge Discovery and Data Mining, 1-35. AAAI/MIT Press, 1996.
[7] Ian Korf, Serial BLAST Searching, The Wellcome Trust Sanger Institute.
[8] Ian Korf, Mark Yandell, and Joseph Bedell, BLAST, Shroff Publishers & Distributors Pvt. Ltd.
[9] Jason, Bruce, Dennis, Pattern Discovery in Biomolecular Data, Oxford University Press, New York, 1999.
[10] Jean Michel Claverie and Cedric Notredame, Bioinformatics: A Beginner's Guide, Wiley Publishing, Inc., 2003.
[11] Jiawei Han and Micheline Kamber (Simon Fraser University), Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, USA, 2001.
[12] Jaak Vilo, Pattern Discovery from Biosequences, University of Helsinki, Finland, 2003.
[13] Nick Camp, Haruna Cofer, and Roberto Gomperts, High-Throughput BLAST, September 1998.
[14] Osmar R. Zaïane, Principles of Knowledge Discovery in Databases, 1999.
[15] Paracel Algorithms, The Biologist's Guide to Paracel's Similarity Search Algorithms, October 2, 2001.
[16] Sandra Barth, Sequence Similarity Searches, Session 4, 2002.
[17] Shawn Delaney, Greg Butler, Clement Lam, and Larry Thiel, Three Improvements to the BLASTP Search of Genome Databases, Department of Computer Science, Concordia University, 1455 de Maisonneuve Blvd. West, Montreal, Quebec, Canada, H3G 1M8.
[18] Sir William Dunn, Introduction to Database Searching, Oxford, July 12, 2001.
[19] Stephen F. Altschul, Thomas L. Madden, Alejandro A. Schäffer, Jinghui Zhang, Webb Miller and David J. Lipman, Gapped BLAST and PSI-BLAST: a new generation of protein database search programs, Nucleic Acids Research, 1997, Vol. 25, No. 17, 3389-3402.
[20] Stephen F. Altschul, Warren Gish, Webb Miller, Eugene W. Myers and David J. Lipman, Basic Local Alignment Search Tool, J. Mol. Biol. (1990) 215, 403-410.
[21] Fengkai Zhang, The Use of Vector Seeds to Improve PSI-BLAST Sensitivity, School of Computer Science, University of Waterloo, Waterloo, Ontario, Canada, 2004.
[22] Usama Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth, From Data Mining to Knowledge Discovery in Databases. Articles.
[23] Warren Gish and David J. States, Identification of Protein Coding Regions by Database Similarity Search, Articles.
INTERNET RELATED LINKS

[24] http://www.eas.asu.edu/~mining03/chap1/lesson_2.html
[25] http://www.anderson.ucla.edu/faculty/jason.frand/teacher/technologies/palace/datamining.htm
[26] http://e-comm.webopedia.com/TERM/D/data_mining.html
[27] http://biotech.icmb.utexas.edu/pages/bioinfo.html
[28] http://services.bioasp.nl/blast/cgi-bin/blast.cgi?program=blastx
[29] http://www.ncbi.nlm.nih.gov/blast
[30] www.biotech.ufl.edu/WorkshopsCourses/bioinfoWorkshops/bioinfoTools/BLAST
LIST OF PUBLICATIONS
1. Ms. Inderveer Chana, Harpreet Kaur, Navjot Kaur, "Issues of Software Engineering and Knowledge Engineering in Bioinformatics", in National Conference of Bioinformatics Computing, held at T.I.E.T, Patiala, on 18th-19th March.
2. Mrs. Rinkle Aggarwal, Navjot Kaur, Harpreet Kaur, "Algorithmic and Non-Algorithmic Issues in Database Search of Sequence Databases", in National Conference of Bioinformatics Computing, held at T.I.E.T, Patiala, on 18th-19th March.
GLOSSARY
Algorithm: A fixed procedure embodied in a computer program. The Basic Local Alignment Search Tool or BLAST is a sequence comparison algorithm that NCBI uses to search sequence databases for optimal local alignments with a query sequence. FASTA is another type of algorithm used for database similarity searching.
Alignment: The process of lining up two or more sequences to achieve maximal levels of identity (and conservation, in the case of amino acid sequences) for the purpose of assessing the degree of similarity and the possibility of homology.
Codon: The sequence of nucleotides, coded in triplets (codons) along the mRNA, that determines the sequence of amino acids in protein synthesis. A gene's DNA sequence can be used to predict the mRNA sequence, and the genetic code can in turn be used to predict the amino acid sequence.
EST (expressed sequence tag): A short strand of DNA that is a part of a cDNA molecule and can act as an identifier of a gene. Used in locating and mapping genes.
Exons: DNA segments of a gene that encode the amino acid sequence of a protein.
Gap: A space introduced into an alignment to compensate for insertions and deletions in one sequence relative to another. To prevent the accumulation of too many gaps in an alignment, introduction of a gap causes the deduction of a fixed amount (the gap score) from the alignment score. Extension of the gap to encompass additional nucleotides or amino acid is also penalized in the scoring of an alignment.
Global Alignment: The alignment of two nucleic acid or protein sequences over their entire length.
Homology: Similarity attributed to descent from a common ancestor.
HSP: High-scoring segment pair. Local alignments with no gaps that achieve one of the top alignment scores in a given search.
Identity: The extent to which two (nucleotide or amino acid) sequences are invariant.
Introns: Noncoding DNA sequences that interrupt the sequences containing instructions for making a protein (exons). Introns are not represented in messenger RNA; only the exons are translated into protein. The function of introns is still being.
Local Alignment: The alignment of some portion of two nucleic acid or protein sequences.
Sensitivity: The ability to detect true positives, i.e. correct matches. The most sensitive search finds all true matches, but might have lots of false positives, i.e. erroneous matches detected. Sensitivity can be defined as the probability of finding the matches such that the query and the matched database sequences have at least x% similarity.
Similarity: The extent to which nucleotide or protein sequences are related. The extent of similarity between two sequences can be based on percent sequence identity and/or conservation. In BLAST similarity refers to a positive matrix score.
Specificity: Ability to reject false positives. The most specific search will return only true matches, but might have lots of false negatives i.e. missed correct matches.
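As a hedged illustration of the Sensitivity and Specificity entries, the two quantities are commonly computed from counts of true and false positives and negatives. The function names and the counts in the usage lines are invented for the example and do not come from any experiment in this thesis.

```python
def sensitivity(true_positives, false_negatives):
    """Fraction of the real matches that the search actually found."""
    return true_positives / (true_positives + false_negatives)


def specificity(true_negatives, false_positives):
    """Fraction of the non-matches that the search correctly rejected."""
    return true_negatives / (true_negatives + false_positives)


# Hypothetical counts, for illustration only.
print(sensitivity(true_positives=90, false_negatives=10))
print(specificity(true_negatives=80, false_positives=20))
```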
