
International Journal of Creative Mathematical Sciences & Technology (IJCMST) 1(1): 80-88, 2012

ISSN (P): 2319 7811, ISSN (O): 2319 782X

The Role of Machine Learning tools on Biological Systems


J. K. Meher1, P. Mishra2, M. K. Raval3 and G. N. Dash4

1,2 Vikash College of Engineering for Women, Bargarh, Odisha, India
3 Gangadhar Meher College, Sambalpur, Odisha, India
4 Sambalpur University, Burla, Odisha, India

Abstract: Recent developments in high-throughput, new-generation biological sequencing technologies, and the resulting rapid growth in macromolecular sequence, structure and gene expression data, have transformed biology from a wet-lab discipline into a computational one. Biological data are arriving at an enormous rate and causing the current databases to expand exponentially. Biological research has therefore become a data-driven discipline. As a result, bioinformatics has emerged as an important discipline in the post-genome era that can explore the information hidden in biomolecular databases. Solving problems in molecular biology efficiently and inexpensively is a grand challenge for bioinformatics. Analysis and understanding of these data provide a natural application field for machine learning algorithms. Machine learning is a subset of artificial intelligence that deals with techniques that allow computers to learn, and it may help to achieve this goal. The aim of this survey paper is to introduce machine learning techniques in the context of their possible application to biological systems.

Keywords: Machine learning, Bioinformatics, Genomics, clustering, support vector machines, genetic algorithms, artificial neural networks, hidden Markov models.

INTRODUCTION
Bioinformatics is the application of information technology to molecular biology. It is an integrated, multidisciplinary field. Large-scale genome sequencing efforts have resulted in the availability of hundreds of complete genome sequences [1]. More importantly, the GenBank repository of nucleic acid sequences is doubling in size every 15 months [2]. Similarly, structural genomics efforts have led to a corresponding increase in the number of macromolecular structures [3]. At present, there are over a thousand databases of interest to biologists. The gene notation techniques have made possible system-wide measurements of biological variables. Consequently, discoveries in the biological sciences are increasingly enabled by machine learning. Some representative applications of machine learning in computational and systems biology include: classifying protein sequences and structures into structural classes; predicting the functions of a protein from its amino acid sequence; identifying the protein-coding genes from genomic DNA sequences; identifying functionally important sites, such as protein-protein, protein-DNA and protein-RNA binding sites, from the amino acid sequence; and identifying functional modules and genetic networks from gene expression data.

Corresponding Author: J. K. Meher, Vikash College of Engineering for Women, Bargarh, Odisha, India


Figure 1: The growth of the Protein Data Bank

Figure 2: The growth of GenBank.

Biological data are arriving at an enormous rate and causing the current databases to expand at an exponential rate. Figures 1 and 2 illustrate the growth of the Protein Data Bank (PDB) and GenBank databases. Both figures show a common pattern: exponential growth [4]. This growth is driven by new and efficient experimental techniques for analyzing genome and proteome sequences. Handling and analyzing the accumulated data will become the major challenge for the bioinformatics community [5]. With the availability of the draft human genome sequence, the next challenge for bioinformatics researchers is to learn, discover and predict useful knowledge from these databases. The human genome was predicted to contain about 30,000 to 40,000 protein-coding genes [6]. A great deal of useful information about human development, physiology, medicine and evolution is hidden in the data. An intelligent approach is therefore needed to discover information from these data and to cope with the rapid rate of data deposition.

The applications listed earlier collectively span the entire spectrum of machine learning problems, including supervised learning, unsupervised learning and system identification. For example, protein function prediction can be formulated as a supervised learning problem: given a dataset of protein sequences with experimentally determined function labels, induce a classifier that correctly labels a novel protein sequence. The problem of identifying functional modules from gene expression data can be formulated as an unsupervised learning problem: given expression measurements of a set of genes under different conditions and a distance metric for measuring the similarity or distance between the expression profiles of a pair of genes, identify clusters of genes that are co-expressed [7]. The problem of constructing gene networks from gene expression data can be formulated as a system identification problem: given expression measurements of a set of genes under different conditions and available background knowledge or assumptions, construct a model that explains the observed gene expression measurements and predicts the effects of experimental perturbations.
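To make the unsupervised formulation above more concrete, the following Python sketch groups genes whose expression profiles rise and fall together. It is only an illustration under stated assumptions: the gene names and expression values are invented, the distance metric is taken to be one minus the Pearson correlation, and average-linkage agglomerative clustering is one possible choice of algorithm among many.

```python
from itertools import combinations
from math import sqrt

def pearson_distance(x, y):
    """1 - Pearson correlation: close to 0 for co-expressed profiles.
    Assumes neither profile is constant (non-zero variance)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return 1.0 - cov / (spread_x * spread_y)

def cluster_genes(profiles, n_clusters):
    """Average-linkage agglomerative clustering of expression profiles."""
    clusters = [[gene] for gene in profiles]

    def linkage(c1, c2):
        # Mean pairwise distance between the members of two clusters.
        dists = [pearson_distance(profiles[a], profiles[b])
                 for a in c1 for b in c2]
        return sum(dists) / len(dists)

    while len(clusters) > n_clusters:
        # Merge the two closest clusters (i < j, so deleting j keeps i valid).
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(cluster) for cluster in clusters]

# Hypothetical expression measurements of four genes under four conditions.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.1, 4.0, 6.2, 8.1],
    "geneC": [4.0, 3.1, 2.0, 1.0],
    "geneD": [8.0, 6.0, 4.1, 2.0],
}
clusters = cluster_genes(profiles, n_clusters=2)
print(clusters)
```

With these toy values, geneA and geneB (both increasing) end up in one cluster and geneC and geneD (both decreasing) in the other, even though their absolute expression levels differ, because the correlation-based distance compares the shape of the profiles rather than their magnitude.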


MACHINE LEARNING
Machine learning is a subset of artificial intelligence that deals with techniques that allow computers to learn. It is the capacity of a computer to learn from experience, i.e. to modify its processing on the basis of newly acquired information, so that the system improves with experience [8]. Learning is the essence of intelligence. If a system could learn from experience and improve its performance automatically, it would be an advanced tool for solving complex problems such as those in biological systems. The problem of inducing general functions from specific training examples is the central idea of learning. The learning agent is given a set of training and test examples of a category. The agent learns from the training examples and forms a hypothesis that explains them. The agent must search through the hypothesis space and locate the best hypothesis, which is then evaluated on the test set. There are three main categories of learning: supervised learning, in which both the inputs and the outputs of a component can be observed; reinforcement learning, where the learning agent is given an evaluation of its action but is not told the correct action; and unsupervised learning, where the learning agent has no information about what the correct outputs are [9]. The different machine learning tools used in the analysis of biological systems are discussed briefly below.

1. Artificial Neural Networks (ANNs)

An Artificial Neural Network is a mathematical model inspired by biological neural networks. A neural network consists of an interconnected group of artificial neurons, and it processes information using a connectionist approach to computation. In most cases a neural network is an adaptive system that changes its structure during a learning phase. Neural networks are used to model complex relationships between inputs and outputs or to find patterns in data [10].

The idea of developing ANNs was inspired by the biological neural networks found in the human brain. Since the introduction of simplified neurons, research has been carried out to investigate the role of neurons in the human brain. A neuron is a single unit in the brain that can transfer information to other neurons in complex nerve networks. Using the brain as the model, computer scientists try to design and develop new platforms and networks that can perform computational tasks as neurons do. The earliest research into ANNs aimed to understand and model information processing in the brain. After several promising developments in ANN research, this AI technique has been applied to solve other real-world problems.

Figure 3. A simple artificial neural network
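As a rough illustration of how a network like the one in Figure 3 processes information, the following Python sketch computes the forward pass of a small feed-forward network with one hidden layer. The sigmoid activation and the weight values are assumptions made for this example; the weights are hand-picked so that the network computes the XOR of two binary inputs, whereas in practice they would be adjusted during the learning phase.

```python
from math import exp

def sigmoid(z):
    """Squash a weighted sum into the range (0, 1)."""
    return 1.0 / (1.0 + exp(-z))

def layer(inputs, weights, biases):
    """One fully connected layer: each node sums its weighted inputs,
    adds a bias and applies the sigmoid activation."""
    return [sigmoid(sum(w * x for w, x in zip(node_weights, inputs)) + bias)
            for node_weights, bias in zip(weights, biases)]

def forward(inputs, hidden_w, hidden_b, output_w, output_b):
    """Input layer -> hidden layer -> output layer."""
    hidden = layer(inputs, hidden_w, hidden_b)
    return layer(hidden, output_w, output_b)

# Hypothetical, hand-picked weights (not learned):
# hidden node 1 behaves like OR, hidden node 2 like NAND,
# and the output node like AND of the two hidden nodes, giving XOR.
hidden_w = [[20.0, 20.0], [-20.0, -20.0]]
hidden_b = [-10.0, 30.0]
output_w = [[20.0, 20.0]]
output_b = [-30.0]

for pattern in ([0, 0], [0, 1], [1, 0], [1, 1]):
    output = forward(pattern, hidden_w, hidden_b, output_w, output_b)[0]
    print(pattern, round(output))
```

A single layer of this kind cannot represent XOR, which is why the hidden layer matters: it re-encodes the inputs so that the output node can separate the two classes.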


ANNs are built from multiple layers of nodes linked to each other. Generally, there are three layers in the network: the input layer, the output layer and a hidden layer in between them (Figure 3). There are different types of ANN architectures, such as the feed-forward architecture, the recurrent architecture and the layered architecture [11]. The differences between these architectures depend on the arrangement of the internal nodes in the network layers. The ANN is one of the most widely used machine learning approaches in bioinformatics, and it is also the earliest technique applied to the field of biological analysis [12]. Neural networks have been applied to problems such as disease classification and identification of biomarkers. The advantage of ANNs is that they are capable of learning and solving many real-world problems.

2. Genetic Algorithms (GAs)

A genetic algorithm is a search heuristic that mimics the process of natural evolution. This heuristic is routinely used to generate useful solutions to optimization and search problems. Genetic algorithms belong to the larger class of evolutionary algorithms (EAs), which generate solutions to optimization problems using techniques inspired by natural evolution, such as inheritance, mutation, selection and crossover. The genetic algorithm (GA) technique is a learning approach based on the theory of biological evolution. The main idea of GAs is to maintain a population of data structures that represent candidate solutions to the problem; these evolve through competition and controlled variation to improve the performance of the learning system [13]. The population undergoes recombination and mutation to adapt to the new environment. The ultimate goal of each candidate solution is to become the best solution in the environment [14]. GAs are loosely based on ideas from population genetics. First, consider the environment as the problem of interest that we would like to solve.
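The loop of selection, recombination and mutation described in this section can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: candidate solutions are encoded as bit strings, and the toy fitness function (the number of 1 bits, often called OneMax), tournament selection, single-point crossover, elitism and all parameter values are assumptions chosen for the example.

```python
import random

def fitness(individual):
    """Toy fitness (assumption): the number of 1 bits in the string."""
    return sum(individual)

def tournament(population, rng, k=3):
    """Selection: the fittest of k randomly drawn individuals survives."""
    return max(rng.sample(population, k), key=fitness)

def crossover(parent_a, parent_b, rng):
    """Recombination: single-point crossover of two parents."""
    point = rng.randrange(1, len(parent_a))
    return parent_a[:point] + parent_b[point:]

def mutate(individual, rng, rate):
    """Mutation: flip each bit independently with a small probability."""
    return [1 - bit if rng.random() < rate else bit for bit in individual]

def evolve(n_bits=20, pop_size=30, generations=60, rate=0.02, seed=0):
    rng = random.Random(seed)
    # Random initial population of bit strings.
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Elitism: carry the current best individual over unchanged,
        # so the best fitness never decreases between generations.
        elite = max(population, key=fitness)
        children = [
            mutate(crossover(tournament(population, rng),
                             tournament(population, rng), rng), rng, rate)
            for _ in range(pop_size - 1)
        ]
        population = [elite] + children
    return max(population, key=fitness)

best = evolve()
print(fitness(best), best)
```

In a bioinformatics setting the bit string and the fitness function would be replaced by a problem-specific encoding and score; the evolutionary loop itself stays the same.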
A population of individuals encoded as bit strings is created randomly from this environment. There is variation among the individuals in this population, and therefore some individuals are fitter than others. Evolution occurs in this population during every iteration. The selection of new candidates is based on differences in the survival ability of each individual.

3. Hidden Markov Models (HMMs)

Generally, Hidden Markov Models (HMMs) derive from the first-order Markov chain [15], which concentrates only on the sequence of states. HMMs have been widely applied in speech recognition research since the early 1970s. Haussler et al. [16] introduced this model to the bioinformatics community in the early 1990s. Since then, HMMs have become popular in sequence modeling, multiple alignment, and protein structure prediction and profiling. An HMM is a statistical model that predicts the result based on the probabilities of the model states. The states in the model are associated with meaningful biological properties [17]. Profile HMMs are the most popular variant in bioinformatics research.

4. Clustering

Clustering is a discovery approach that organizes the data and groups them into classes. A derived clustering algorithm can also be used to predict and explain complex data.


Hierarchical clustering and k-clustering are the two major styles of clustering algorithms. In hierarchical clustering, the agent clusters the input data into groups in a hierarchical way. In k-clustering, the agent assigns every input object to exactly one group according to the nature of the data set. Fasulo has analyzed recent clustering algorithms [18]. There are two approaches to clustering observed data. The first approach is based on the physical or chemical theories of the data. The second approach is based on the computational and statistical analysis of the data. Given a set of data, the clustering agent clusters the data into smaller groups following either the first or the second approach.

5. Support Vector Machines (SVMs)

The Support Vector Machine (SVM) is a learning method developed by Vapnik and co-workers [19], based on the Structural Risk Minimization principle from statistical learning theory and VC-dimension theory. SVMs can be applied to regression, classification and density estimation problems. The SVM is another "black-box" algorithm (like artificial neural networks) that is trained on a training data set. SVMs are claimed to outperform most other algorithms [20]. The main idea of an SVM is to separate classes with a surface that maximizes the margin between them. It is a powerful classification approach that applies the following concept: input vectors are mapped non-linearly into a very high-dimensional feature space, and a linear decision surface is computed in this feature space. By dividing the high-dimensional space into different subspaces, the SVM classifies the inputs according to the resulting maximum-margin boundary.

6. Decision Trees

Widely used decision tree learners such as ID3 and C4.5 were developed by Quinlan. The decision tree is also known as a classification tree or regression tree [21]. It is a simple inductive learning system that approximates discrete-valued functions to estimate and classify examples. It is one of the most widely used machine learning methods because of its simplicity and practical approach. Given a set of instances, the learning agent uses a divide-and-conquer strategy to construct the tree. Each instance is described by a set of properties. In the simplest case, the tree returns a yes or no decision when an instance is tested on it. The decision tree is a supervised learning technique. Predictions about the probability that a particular case belongs to a particular class are made from the trees.
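The divide-and-conquer construction described above can be sketched as follows. This is a simplified, ID3-style illustration and not the exact algorithm of any particular package: it handles only categorical properties, chooses at each node the property whose split leaves the lowest weighted entropy, and uses invented property names and training instances.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total)
                for n in Counter(labels).values())

def split(rows, labels, attr):
    """Group instances (and their labels) by their value for one property."""
    groups = {}
    for row, label in zip(rows, labels):
        sub_rows, sub_labels = groups.setdefault(row[attr], ([], []))
        sub_rows.append(row)
        sub_labels.append(label)
    return groups

def build_tree(rows, labels, attributes):
    majority = Counter(labels).most_common(1)[0][0]
    # Stop when the node is pure or no properties are left to test.
    if len(set(labels)) == 1 or not attributes:
        return majority

    def remainder(attr):
        # Weighted entropy left after splitting on this property.
        return sum(len(sub_labels) / len(labels) * entropy(sub_labels)
                   for _, sub_labels in split(rows, labels, attr).values())

    best = min(attributes, key=remainder)
    rest = [a for a in attributes if a != best]
    branches = {value: build_tree(sub_rows, sub_labels, rest)
                for value, (sub_rows, sub_labels)
                in split(rows, labels, best).items()}
    return {"attribute": best, "branches": branches, "default": majority}

def classify(tree, instance):
    """Follow the branches until a leaf (a class label) is reached."""
    while isinstance(tree, dict):
        tree = tree["branches"].get(instance[tree["attribute"]], tree["default"])
    return tree

# Hypothetical training instances: is a protein predicted to be secreted?
rows = [
    {"signal_peptide": "yes", "transmembrane": "no"},
    {"signal_peptide": "yes", "transmembrane": "yes"},
    {"signal_peptide": "no", "transmembrane": "no"},
    {"signal_peptide": "no", "transmembrane": "yes"},
]
labels = ["yes", "no", "no", "no"]
tree = build_tree(rows, labels, ["signal_peptide", "transmembrane"])
print(classify(tree, {"signal_peptide": "yes", "transmembrane": "no"}))
```

The learned tree is a nested dictionary that can be read as a set of rules, which is one reason decision trees are considered easy to interpret.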

MACHINE LEARNING APPROACHES IN BIOINFORMATICS RESEARCH


The idea of machine learning is to design machines that learn as humans do: from experience, discovering information in the available data. This approach is suitable for bioinformatics because the subjects of investigation are highly complex biological systems. Machine learning techniques are popular in the bioinformatics community because they are task-oriented. Humans can understand the theoretical background and the rules generated from these techniques. Many problems in biological systems cannot be defined well except by using examples. Humans can specify the input/output pairs, but the relationship between the inputs and outputs is unknown. In this case, machine learning approaches are desirable because they can automatically adjust their internal structure to generate approximate results for the given problems.


Another advantage of machine learning approaches is that they can easily be adapted to a new environment. This is important in molecular biology research because new data are generated every day, and the newly generated data will probably update the initial concepts or learning hypotheses. A variety of machine learning techniques can handle most of the problems in bioinformatics. Logic programming, rule explanation, finite-state machines, function systems and problem-solving systems are among these machine learning techniques. These techniques, operating individually or in combination, can tackle the various challenges in bioinformatics. Generally, the basic concept of applying machine learning in bioinformatics research is to discover meaningful knowledge from the existing biological databases and present it in a meaningful and understandable form. The tasks of machine learning applications in bioinformatics can be divided into the following categories.

i. Classification - Predicting an item class
ii. Forecasting - Predicting a parameter value
iii. Clustering - Finding groups of items
iv. Description - Describing a group
v. Deviation Detection - Finding changes
vi. Link Analysis - Finding relationships and associations
vii. Visualization - Presenting data visually to facilitate human discovery

Machine learning algorithms are applied in many fields of bioinformatics, such as gene prediction, gene classification, protein structure prediction, helix kink prediction [22] and determining functionally important sites in proteins.

Prediction of protein function: Proteins determine the functional aspects of living organisms. Proteins are the principal catalytic agents, structural elements, signal transmitters, transporters and molecular machines in cells. Understanding protein function is critical to understanding diseases and ultimately to designing new drugs. The primary source of information about protein function has come from biochemical, structural or genetic experiments on individual proteins. However, with the rapid increase in the number of genome sequences, and the corresponding growth in the number of protein sequences, the number of experimentally determined structures and functional annotations has significantly lagged behind the number of protein sequences. With the availability of datasets of protein sequences with experimentally determined functions, there is increasing use of sequence or structural homology-based transfer of annotation from already annotated sequences to new protein sequences. However, the effectiveness of such homology-based methods drops dramatically when the sequence similarity between the target sequence and the reference sequence falls below 30%. In many instances, the function of a protein is determined by conserved local sequence motifs. However, approaches that assign function to a protein based on the presence of a single motif (the so-called characteristic motif) fail to take advantage of multiple sequence motifs that are correlated with critical structural features (e.g., binding pockets) that play a critical role in protein function. Against this background, machine learning methods offer an attractive approach to training classifiers to assign putative functions to protein sequences. Machine learning methods have been applied, with varying degrees of success, to the problem of protein function prediction. Several studies have demonstrated that machine learning methods, used in conjunction with traditional sequence or structural homology-based techniques and sequence motif-based methods, outperform the latter in terms of accuracy of function prediction (based on cross-validation experiments). However, the efficacy of alternative approaches in genome-wide prediction of functions of protein-coding sequences from newly sequenced genomes remains to be established. There is also significant room for improving current methods for protein function prediction.

Detection of functionally important sites in proteins: There are hotspots in the sequence of DNA. These take part in protein-protein, protein-DNA and protein-RNA interactions, which play a pivotal role in protein function. Reliable identification of such interaction sites from protein sequences has broad applications ranging from rational drug design to the analysis of metabolic and signal transduction networks. Experimental detection of interaction sites must come from determination of the structure of protein-protein, protein-DNA and protein-RNA complexes. However, experimental determination of such complexes lags far behind the number of known protein sequences. Hence, there is a need to develop reliable computational methods for identifying functionally important sites from a protein sequence (and, when available, its structure, but not the complex). This problem can be formulated as a sequence (or structure) labeling problem. Several groups have developed and applied, with varying degrees of success, machine learning methods for identification of functionally important sites in proteins [23,24].
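One common way to set up such a sequence labeling problem is to describe each residue by a fixed-length window of its neighbours and to pair that window with the residue's label, so that any standard classifier can be trained on the resulting examples. The Python sketch below shows only this data preparation step. The window width, the padding character, the toy sequence and its labels are all assumptions made for illustration, not data or settings from the cited studies.

```python
def windows(sequence, half_width=2, pad="-"):
    """Return one window per residue, centred on that residue.
    Positions beyond the ends of the sequence are filled with `pad`."""
    padded = pad * half_width + sequence + pad * half_width
    size = 2 * half_width + 1
    return [padded[i:i + size] for i in range(len(sequence))]

def labeled_examples(sequence, labels, half_width=2):
    """Pair each residue's window with its label (e.g. '1' = binding site)."""
    if len(sequence) != len(labels):
        raise ValueError("need exactly one label per residue")
    return list(zip(windows(sequence, half_width), labels))

# Hypothetical amino acid sequence and per-residue labels
# ('1' marks a residue assumed to lie in an interaction site).
sequence = "MKRLAG"
labels = "011000"
examples = labeled_examples(sequence, labels)
for window, label in examples:
    print(window, label)
```

Each (window, label) pair is one training example; predicting the sites of a new protein then amounts to extracting its windows in the same way and classifying each of them.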
Analysis of gene and protein networks: Understanding how the parts of biological systems (e.g., genes, proteins, metabolites) work together to form dynamic functional units, e.g., how genetic interactions and environmental factors orchestrate development, aging and response to disease, is one of the major foci of the rapidly emerging field of systems biology [25]. Some of the key challenges include the following: uncovering the biophysical basis and essential macromolecular sequence and structural features of macromolecular interactions; comprehending how temporal and spatial clusters of genes, proteins and signaling agents correspond to genetic, developmental and regulatory networks; discovering topological and other characteristics of these networks; and explaining the emergence of systems-level properties of networks from the interactions among their parts. Machine learning methods have been developed and applied, with varying degrees of success, in learning predictive models, including Boolean networks and Bayesian networks, from gene expression data. However, there is significant room for improving the accuracy and robustness of such algorithms by taking advantage of multiple types of data and by using active learning.
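To give a feel for what a Boolean network model of gene regulation looks like, the following Python sketch simulates a small network in which every gene is either on or off and is updated from the current state of its regulators. It does not show how such a model is learned from expression data; the three genes, their update rules and the synchronous update scheme are hypothetical choices made for illustration.

```python
def step(state, rules):
    """Synchronous update: every gene's next value is computed
    from the current state of the whole network."""
    return {gene: bool(rule(state)) for gene, rule in rules.items()}

def trajectory(state, rules, n_steps):
    """Return the list of states visited, starting from `state`."""
    states = [state]
    for _ in range(n_steps):
        state = step(state, rules)
        states.append(state)
    return states

# Hypothetical three-gene regulatory network.
rules = {
    "A": lambda s: not s["C"],          # C represses A
    "B": lambda s: s["A"],              # A activates B
    "C": lambda s: s["A"] and s["B"],   # A and B together activate C
}
start = {"A": True, "B": False, "C": False}

for state in trajectory(start, rules, 5):
    print(state)
```

With these rules the network returns to its starting state after five steps, a simple example of the cyclic behaviour such models can express. An experimental perturbation such as a gene knock-out could be imitated by replacing that gene's rule with one that always returns False.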

CONCLUSION
Machine learning tools are essential in helping biologists to analyze genomic and proteomic sequences, interpret and classify patterns, detect useful information in the databases, predict genes and model molecular structures. Attempts to solve problems in molecular biology have created the exciting new field of bioinformatics. Machine learning is an automatic and intelligent learning technique that has been widely used to solve many biological problems. The applications of these approaches in bioinformatics are becoming popular and continue to develop, because machine learning techniques are efficient and inexpensive in solving bioinformatics problems.


REFERENCES
[1]. Backofen, R. and Gilbert, D. 2001. Bioinformatics and constraints. Constraints. 6: 141-156.
[2]. D. A. Benson, I. Karsch-Mizrachi, D. J. Lipman, J. Ostell, and D. L. Wheeler. GenBank. Nucleic Acids Research, 35(Database issue): D21-D25, 2007.
[3]. H. M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T. N. Bhat, H. Weissig, I. N. Shindyalov, and P. E. Bourne. The Protein Data Bank. Nucleic Acids Research, 28: 235-242, 2000.
[4]. F. J. Bruggeman and H. V. Westerhoff. The nature of systems biology. Trends in Microbiology, 15: 45-50, 2007.
[5]. Baxevanis, A.D. 2001. The molecular biology database collection: an updated compilation of biological database resources. Nucleic Acids Research. 29: 1-10.
[6]. Bult, C.J., White, O., Olsen, G.J., Zhou, L., Fleischmann, R.D., Sutton, G.G., Blake, J.A., FitzGerald, L.M., Clayton, R.A., Gocayne, J.D., Kerlavage, A.R., Dougherty, B.A., Tomb, J.F., Adams, M.D., Reich, C.I., Overbeek, R., Kirkness, E.F., Weinstock, K.G., Merrick, J.M., Glodek, A., Scott, J.L., Geoghagen, N.S. and Venter, J.C. 1996. Complete genome sequence of the methanogenic archaeon, Methanococcus jannaschii. Science. 273: 1058-1073.
[7]. Burge, C. and Karlin, S. 1997. Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 268: 78-94.
[8]. Mitchell, T.M. 1997. Machine Learning. McGraw-Hill International, Singapore.
[9]. P. Baldi and S. Brunak. Bioinformatics: the Machine Learning Approach. MIT Press, 2001.
[10]. P. Baldi and S. Brunak. Bioinformatics: the Machine Learning Approach. MIT Press, 2001.
[11]. Mitchell, T.M. 1997. Machine Learning. McGraw-Hill International, Singapore.
[12]. Stormo, G., Schneider, T., Gold, L. and Ehrenfeucht, A. 1982. Use of the perceptron algorithm to distinguish translational initiation in E. coli. Nucleic Acids Research. 10: 2997-3011.
[13]. De Jong, K.A. 1990. Genetic-algorithm-based learning. In: Machine Learning: An Artificial Intelligence Approach (Vol. 3). Y. Kodratoff and R. Michalski eds. San Mateo, CA: Morgan Kaufmann.
[14]. Goldberg, D.E. 1989. Genetic algorithms in search, optimization, and machine learning. Reading, MA: Addison-Wesley.
[15]. Cox, D.R. and Miller, H.D. 1965. The Theory of Stochastic Processes, Chapman and Hall.
[16]. Krogh, A., Mian, I.S. and Haussler, D. 1994. A hidden Markov model that finds genes in E. coli DNA. Nucleic Acids Res. 22: 4768-4778.
[17]. Durbin, R., Eddy, S.R., Krogh, A. and Mitchison, G.J. 1998. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, Cambridge, UK.
[18]. Fasulo, D. 1999. An Analysis of Recent Work on Clustering Algorithms. Technical Report: 01-03-02, Department of Computer Science and Engineering, University of Washington.
[19]. Vapnik, V. 1995. The Nature of Statistical Learning Theory. Springer-Verlag, New York.
[20]. Burges, C. 1998. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery. 2: 121-167.


[21]. Breiman, L., Friedman, J., Olshen, R. and Stone, C. 1984. Classification and regression trees. Wadsworth, Belmont.
[22]. J. K. Meher, N. Mishra, P. K. Mohapatra, M. K. Raval, P. K. Meher and G. N. Dash. Signal Processing Approach for Prediction of Kink in Transmembrane α-Helices. Springer CCIS, ISBN 978-3-642-20572-9 (AIM-2011), pp. 170-177, April 2011.
[23]. M. Terribilini, J.-H. Lee, C. Yan, R. L. Jernigan, V. Honavar, and D. Dobbs. Predicting RNA-binding sites from amino acid sequence. RNA Journal, 12: 1450-1462, 2006.
[24]. C. Yan, M. Terribilini, F. Wu, R. L. Jernigan, D. Dobbs, and V. Honavar. Identifying amino acid residues involved in protein-DNA interactions from sequence. BMC Bioinformatics, doi:10.1186/1471-2105-7-262, 2006.
[25]. F. J. Bruggeman and H. V. Westerhoff. The nature of systems biology. Trends in Microbiology, 15: 45-50, 2007.
