Abstract: Advancement in sequencing technology has resulted in the generation of large amounts of bioinformatics data that need to be analyzed in a short time. Traditional techniques cannot cope with the speed and size of data generation. New platforms, tools and techniques need to be explored for data analysis that can meet the time and space challenge. Big data tools and techniques address the issues of size, scalability and performance. In this paper, we present the potential of machine learning techniques implemented on Hadoop and Spark for analyzing bioinformatics data. First, we summarize the recent work in this area and then discuss the future research directions and opportunities.

Keywords: Big Data, Machine Learning, Hadoop, MapReduce, Spark, Bioinformatics, Microarray, Gene, DNA, RNA, Proteomics

This paper was submitted for review on 23 October 2016.
M. Usman Ali is with the Department of Computer Science, COMSATS Institute of Information Technology, Sahiwal, 57000 Pakistan (e-mail: usman.sani1439@ciitsahiwal.edu.pk)
Shahzad Ahmad is with the Department of Computer Science, COMSATS Institute of Information Technology, Sahiwal, 57000 Pakistan (e-mail: shahzadahmad@ciitsahiwal.edu.pk)
Javed Ferzund is currently working in the Department of Computer Science, COMSATS Institute of Information Technology, Sahiwal, 57000 Pakistan (e-mail: jferzund@ciitsahiwal.edu.pk)

I. INTRODUCTION

The bioinformatics domain includes DNA, RNA, protein and genomics data in heterogeneous networks such as gene-gene interaction, protein-protein interaction and gene-disease interaction. During the last few years, this data has increased enormously in volume and needs to be processed in a well-organized manner to reduce execution time and space requirements. Usually, microarray data analysis for gene selection and classification requires closely related genes to be handled in the proper way, and DNA sequencing data is also very important for analysis. Many tools have been designed for bioinformatics data analysis, such as BLAST, EMBOSS, BioPerl, Babel and Modeller. All these tools work on small datasets and do not demonstrate their performance on large datasets. However, no appropriate benchmark has been designed that fits all of these problems. So, there is a need to design a high-quality benchmark to manage, evaluate, store and analyze the large datasets produced in many experiments of sequencing, proteomics and genomics.

A large amount of data has been produced in engineering, biological and computational fields. More recently, big data has been introduced to analyze, manage and store large datasets such as Google and Yahoo data. In situations where data cannot be handled by ordinary techniques, big data can help.

Big data basically deals with five Vs, i.e. volume, variety, velocity, veracity and potential value. Big data techniques include data mining (association rule learning), crowdsourcing, cloud computing, linear and multiple regression analysis and investigation of social networks. For parallel processing of large datasets, the Hadoop platform has been designed, which includes the modules HDFS (Hadoop Distributed File System), MapReduce, Pig, Hive, HBase, YARN and Spark. Hadoop is an open-source platform written in Java that provides distributed and parallel processing and storage capability for large datasets across clusters of machines. HDFS provides distributed storage across the nodes of a cluster using one NameNode (master) and one or multiple DataNodes (slaves) for all other Hadoop modules [1]. MapReduce is a framework for parallel processing of large datasets in a reliable and fault-tolerant way, using one JobTracker (master) and one TaskTracker per slave node, and operating on key-value pairs in both Map and Reduce tasks [1]. Hive is a data warehouse framework that provides HQL (Hive Query Language), an SQL (Structured Query Language)-like interface for ad-hoc queries and summarization in a batch processing environment [1]. Pig is a scripting language, also for batch processing, and HBase provides random read/write access and real-time processing of big data [1]. Hadoop is a well-suited platform for large-scale biological data processing of DNA, RNA, protein and genomics data and also plays an important role in microarray data analysis and read mapping.
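The key-value model of MapReduce described above can be illustrated without a cluster. The following is a minimal single-process sketch, not Hadoop code: the function names (`map_kmers`, `shuffle`, `reduce_counts`, `run_job`), the choice of k-mer counting as the task, and the toy reads are all assumptions made for illustration only.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_kmers(read: str, k: int) -> Iterator[Tuple[str, int]]:
    """Map task: emit a (k-mer, 1) pair for every window of length k in a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k], 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle step: group all emitted values by key, as the framework would."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce task: sum the values collected for one key."""
    return key, sum(values)


def run_job(reads: List[str], k: int) -> Dict[str, int]:
    """Run map, shuffle and reduce in sequence inside one process."""
    emitted = (pair for read in reads for pair in map_kmers(read, k))
    grouped = shuffle(emitted)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    toy_reads = ["ACGTAC", "GTACGT"]  # hypothetical reads, not real data
    print(run_job(toy_reads, k=3))
```

In a real Hadoop job the map and reduce functions would run on different nodes and the framework would perform the shuffle; here all three stages run in one process so that the flow of key-value pairs is easy to follow.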
668 https://sites.google.com/site/ijcsis/
ISSN 1947-5500
International Journal of Computer Science and Information Security (IJCSIS),
Vol. 14, No. 10, October 2016
Many machine learning techniques and algorithms are used for classification and analysis of datasets. Machine learning classification and regression algorithms include decision tree, naïve Bayes, logistic regression, SVM (Support Vector Machine), gradient boosted tree, random forest, generalized linear model and linear regression. Clustering algorithms include K-means, Fuzzy K-means, power iteration, spectral clustering and CluStream. Machine learning also includes association rule mining and deep learning. All of these machine learning algorithms and techniques are used in traditional computational processes. Machine learning also has great importance in the field of big data. Almost all of these techniques are used in Hadoop big data frameworks such as MapReduce and Spark. MapReduce uses the Mahout library (written in Java) and Spark uses the MLlib library (with APIs in Java, Python, and Scala) for the implementation of machine learning techniques. Some of the machine learning techniques used in Hadoop big data are listed in Table 1.

Table 1: Overview of Machine Learning techniques used in Hadoop Big Data

Machine Learning (Techniques and Algorithms)       MapReduce (Mahout library)   Spark (MLlib library)
NB (Naïve Bayes, Bayesian Algorithm)               Yes                          Yes
GBT (Gradient Boosted Tree, Ensemble Algorithm)    Yes                          Yes
Streaming K-means Clustering                       Yes                          Yes
SVM (Support Vector Machine)                       Yes                          Yes
K-means Clustering                                 Yes                          Yes
Adaptive Model Rules                               No                           No
GLM (Generalized Linear Model)                     No                           Yes
kNN (Instance based Algorithm)                     Yes                          Yes

The objectives of this study are:
• To explore the Machine Learning techniques used in Bioinformatics
• To analyze the capabilities of Machine Learning techniques implemented on Big Data Platform
• To present the future research opportunities for using Machine Learning techniques along with Big Data platform to analyze the bioinformatics data
• To explore the Performance comparison of existing Algorithms

The rest of the paper is structured as follows: Section II describes the related work. Section III presents the work in Bioinformatics along with Machine Learning. Section IV presents the use of ML tools, algorithms and techniques in the field of Bioinformatics using Big Data Hadoop. Section V presents the discussion related to our work. Section VI concludes the paper with opportunities for further research work in Bioinformatics along with Big Data Hadoop.

test is used to find relevant genes, resulting in the best classification accuracy with fewer genes on three bioinformatics datasets from NCBI. Lee et al. [4] have proposed a novel gene selection approach for microarray data, in which GADP (Genetic Algorithm with Dynamic Parameter setting) is used with the χ²-test for gene selection and SVM (Support Vector Machine) is used to verify the efficiency of the selected genes, resulting in the best classification accuracy with fewer genes on six bioinformatics datasets from NCBI. Kumar et al. [5] have proposed a Fuzzy kNN classification algorithm that selects genes with a t-test and classifies them using kNN, reporting 100% accuracy on leukemia and breast cancer datasets.
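The combination of t-test gene selection and kNN classification mentioned in the related work above can be sketched in a few lines. This is only an illustrative sketch under stated assumptions, not the method of [5]: it uses plain kNN rather than Fuzzy kNN, it assumes Welch's variant of the t statistic (the text does not say which t-test is meant), and the function names and the toy expression matrix are hypothetical.

```python
import math
from statistics import mean, variance
from typing import List, Sequence


def welch_t(a: Sequence[float], b: Sequence[float]) -> float:
    """Welch's t statistic for two independent samples (each needs >= 2 values)."""
    diff = mean(a) - mean(b)
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    if se == 0:
        # No spread in either class: infinitely strong if the means differ.
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / se


def select_genes(X: List[List[float]], y: List[int], n_genes: int) -> List[int]:
    """Rank genes (columns) by |t| between class 0 and class 1; return top indices."""
    scores = []
    for g in range(len(X[0])):
        a = [row[g] for row, label in zip(X, y) if label == 0]
        b = [row[g] for row, label in zip(X, y) if label == 1]
        scores.append((abs(welch_t(a, b)), g))
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [g for _, g in scores[:n_genes]]


def knn_predict(X: List[List[float]], y: List[int], query: Sequence[float],
                genes: List[int], k: int) -> int:
    """Majority vote among the k nearest training samples on the selected genes."""
    dists = sorted(
        (math.dist([row[g] for g in genes], [query[g] for g in genes]), label)
        for row, label in zip(X, y)
    )
    votes = [label for _, label in dists[:k]]
    return max(set(votes), key=votes.count)  # ties are broken arbitrarily


if __name__ == "__main__":
    # Hypothetical expression matrix: rows are samples, columns are genes.
    X = [[5.1, 1.0, 3.0], [4.9, 1.2, 3.1], [5.0, 0.9, 2.9],
         [1.0, 1.1, 3.0], [1.2, 1.0, 3.2], [0.9, 1.1, 2.8]]
    y = [0, 0, 0, 1, 1, 1]
    genes = select_genes(X, y, n_genes=1)
    print(genes, knn_predict(X, y, [4.8, 1.0, 3.0], genes, k=3))
```

Selecting a few high-scoring genes before classification shrinks the distance computation to the selected columns, which is the practical reason filter-style selection is often paired with kNN on microarray data.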
Table 2: Overview of Machine Learning techniques used in Bioinformatics using Big Data Hadoop
kNN (Instance based Algorithm)        Yes   Yes   Yes
Random Forest (Ensemble Algorithm)    Yes   Yes   Yes
Many Machine Learning techniques in use today are listed in Table 1. Some are used in Bioinformatics, some in Big Data Hadoop, and some in Bioinformatics with Big Data Hadoop. Naïve Bayes is used for training and testing on data and for building statistical models, with a major focus on the classification of datasets. Gradient Boosting solves regression problems and classifies datasets with a Decision Tree model. Random Forest behaves like GBT (Gradient Boosted Tree), but performance is best with GBT. SVM (Support Vector Machine) combined with GA (Genetic Algorithm) provides higher accuracy for large datasets. The Fuzzy kNN algorithm performs better than kNN at classifying large datasets. Association Rule Mining gives the relationships between different variables.

Many Machine Learning techniques and algorithms are implemented in Bioinformatics using the Big Data Hadoop framework. Some of these are Naïve Bayes (Bayesian Algorithm), SVM (Support Vector Machine), kNN (Instance based Algorithm), Logistic Regression, Random Forest (Ensemble Algorithm) and Linear Regression, which are listed in Table 2. Naïve Bayes and Logistic Regression, used for microarray classification in Apache Spark on large datasets, achieve the best scalability. kNN is implemented on the Hadoop MapReduce platform for microarray feature classification on big datasets for better performance. SVM (Support Vector Machine) also provides the best results for gene selection/prediction in microarray data analysis in Hadoop. Better results were also obtained with Random Forest in Bioinformatics using Big Data Hadoop.

Some Machine Learning techniques are not yet implemented in Bioinformatics using Big Data Hadoop, such as Gradient Boosted Tree (Ensemble Algorithm), Streaming K-means clustering, K-means clustering, Adaptive Model Rules, GLM (Generalized Linear Model), Deep Learning, K-median clustering, Association Rule Mining (Apriori Algorithm) and Decision Tree, which are listed in Table 2. By using these ML techniques and algorithms in Bioinformatics with Big Data Hadoop, we can expect improved Performance, Accuracy, Scalability, Reliability, Speedup and Efficiency by reducing time and storage requirements.

A. Performance Comparison of Existing Algorithms

By implementing the NB (Naïve Bayes) technique in the Spark framework for Bioinformatics (Microarray) data, better Accuracy and Performance have been obtained compared to the Hadoop MapReduce framework and conventional techniques [17]. Naïve Bayes provides better Accuracy and Scalability than Logistic Regression in the Spark framework. Apache Spark offers the best Performance by reducing training time when the kNN algorithm is implemented on Spark. By implementing the SVM (Support Vector Machine) algorithm in the Hadoop MapReduce framework for Bioinformatics (Microarray) data, better Accuracy has been achieved than with kNN. SVM implemented on MapReduce gives better Performance, by decreasing training time, than SVM on traditional tools [17].

The Scalability of SVM is low compared to the kNN algorithm when it is implemented in the MapReduce framework. By implementing the kNN (k Nearest Neighbor) algorithm in the Hadoop MapReduce framework for Bioinformatics (Microarray) data, better Scalability has been achieved compared to SVM and GADP (Genetic Algorithm with Dynamic Parameter setting). Conversely, the Accuracy of kNN is lower than that of SVM. Implementing kNN in MapReduce provides the best Performance, by decreasing communication cost, compared with implementing kNN in traditional tools.

By implementing the LR (Logistic Regression) technique in the Spark framework for Bioinformatics (Microarray) data, better Accuracy can be obtained than with the Hadoop MapReduce framework. LR in Spark provides lower Accuracy and Scalability compared to NB. By implementing the Random Forest algorithm in the Hadoop MapReduce framework for Bioinformatics (Microarray) data, Accuracy can be maintained in the presence of missing data. The Scalability of Random Forest is better than that of the other algorithms. It provides the best Performance by using a many-to-many model. Implementing Linear Regression in the Hadoop MapReduce framework for Bioinformatics (Microarray) data also provides better Accuracy, Scalability and Performance. Linear Regression is the most widely used ML technique and runs fast. Table 3 presents the Performance Comparison of existing algorithms.

V. DISCUSSION

In the recent past, the use of Machine Learning tools and techniques together with Big Data Hadoop has gained popularity in the bioinformatics domain. SVM (Support Vector Machine) and kNN are implemented in Bioinformatics using Big Data Hadoop for genome datasets. We can use SVM for protein datasets, for example to find protein-protein interaction networks. Logistic Regression and Naïve Bayes techniques are also used in Hadoop for gene classification on microarray data. These techniques give better performance when we use them for proteomics. Linear Regression can also be used for protein datasets. By using state-of-the-art machine learning techniques implemented on Hadoop and Spark for DNA, RNA and protein datasets, we can expect the best Performance, Accuracy, Speedup and Scalability. A lot of research potential exists in this area.

The ML techniques and algorithms that are used in Bioinformatics with Hadoop can also be used in DNA and RNA Sequence Alignment to achieve the best performance for some analyses. These techniques and algorithms can also be used in the GATP pipeline for the best read mapping Scalability and Performance. The use of these tools, techniques and algorithms in the domain of Bioinformatics along with Big Data Hadoop demonstrates extreme potential in our work.
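The point made in the performance comparison about kNN on MapReduce and communication cost can be pictured with a small sketch. This is a hypothetical single-process illustration, not the implementation evaluated in the cited work: it assumes each map task scans one data partition and forwards only its k nearest candidates, so the reduce task receives at most k pairs per partition instead of one distance per training sample. All function names and the toy data are assumptions.

```python
import heapq
import math
from typing import List, Sequence, Tuple

Sample = Tuple[Sequence[float], int]  # (feature vector, class label)
Neighbor = Tuple[float, int]          # (distance to the query, class label)


def map_partition(partition: List[Sample], query: Sequence[float],
                  k: int) -> List[Neighbor]:
    """Map task: keep only the k nearest neighbors found in one data partition."""
    candidates = ((math.dist(features, query), label)
                  for features, label in partition)
    return heapq.nsmallest(k, candidates)


def reduce_neighbors(local_results: List[List[Neighbor]], k: int) -> int:
    """Reduce task: merge local candidates and vote among the global k nearest."""
    merged = heapq.nsmallest(
        k, (neighbor for result in local_results for neighbor in result))
    votes = [label for _, label in merged]
    return max(set(votes), key=votes.count)  # ties are broken arbitrarily


def knn_mapreduce(partitions: List[List[Sample]], query: Sequence[float],
                  k: int) -> int:
    """Run the map tasks over all partitions, then a single reduce task."""
    local_results = [map_partition(p, query, k) for p in partitions]
    return reduce_neighbors(local_results, k)


if __name__ == "__main__":
    # Hypothetical two-partition dataset with two well-separated classes.
    partitions = [
        [([0.0, 0.0], 0), ([0.1, 0.2], 0), ([5.0, 5.0], 1)],
        [([0.2, 0.1], 0), ([5.1, 4.9], 1), ([4.8, 5.2], 1)],
    ]
    print(knn_mapreduce(partitions, [0.0, 0.1], k=3))
```

Because the global k nearest neighbors must each be among the k nearest of their own partition, forwarding only the local top-k does not change the prediction, while the data sent to the reducer grows with the number of partitions rather than with the dataset size.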