Harnessing the Potential of Machine Learning for Bioinformatics using Big Data Tools

International Journal of Computer Science and Information Security (IJCSIS), Vol. 14, No. 10, October 2016

M. Usman Ali, Shahzad Ahmad, Javed Ferzund
Department of Computer Science, COMSATS Institute of Information Technology, Sahiwal, Pakistan
usman.sani1439@ciitsahiwal.edu.pk, shahzadahmad@ciitsahiwal.edu.pk, jferzund@ciitsahiwal.edu.pk

Abstract: Advances in sequencing technology have resulted in the generation of large amounts of bioinformatics data that need to be analyzed in a short time. Traditional techniques cannot cope with the speed and size of data generation. New platforms, tools and techniques need to be explored for data analysis that can meet the time and space challenges. Big data tools and techniques address the issues of size, scalability and performance. In this paper, we present the potential of machine learning techniques implemented on Hadoop and Spark for analyzing bioinformatics data. First, we summarize the recent work in this area and then discuss future research directions and opportunities.

Keywords: Big Data, Machine Learning, Hadoop, MapReduce, Spark, Bioinformatics, Microarray, Gene, DNA, RNA, Proteomics

This paper was submitted for review on 23 October 2016. M. Usman Ali (e-mail: usman.sani1439@ciitsahiwal.edu.pk), Shahzad Ahmad (e-mail: shahzadahmad@ciitsahiwal.edu.pk) and Javed Ferzund (e-mail: jferzund@ciitsahiwal.edu.pk) are with the Department of Computer Science, COMSATS Institute of Information Technology, Sahiwal, 57000 Pakistan.
https://sites.google.com/site/ijcsis/
ISSN 1947-5500

I. INTRODUCTION

The Bioinformatics domain includes DNA, RNA, protein and genomics data organized in heterogeneous networks such as gene-gene, protein-protein and gene-disease interactions. During the last few years, this data has increased enormously in volume and needs to be processed in a well-organized manner to reduce execution time and space requirements. Microarray data analysis for gene selection and classification usually requires closely related genes to be identified properly, and DNA sequencing data is also very important for analysis. Many tools have been designed for bioinformatics data analysis, such as BLAST, EMBOSS, BioPerl, Babel and Modeller. All of these tools work on small datasets and have not demonstrated their performance on large datasets. There is no appropriate benchmark that fits all of these problems, so there is a need to design a benchmark to manage, evaluate, store and analyze the large datasets produced in many sequencing, proteomics and genomics experiments.

A large amount of data has been produced in engineering, biological and computational fields. More recently, big data techniques have been introduced to analyze, manage and store large datasets such as those held by Google and Yahoo. Big data can help in situations where data cannot be handled by ordinary techniques.

Big data is characterized by five Vs: volume, variety, velocity, veracity and potential value. Big data techniques include data mining (association rule learning), crowdsourcing, cloud computing, linear and multiple regression analysis and the investigation of social networks. For parallel processing of large datasets, the Hadoop platform has been designed; it includes modules such as HDFS (Hadoop Distributed File System), MapReduce, Pig, Hive, HBase, YARN and Spark. Hadoop is an open-source platform written in Java that provides distributed, parallel processing and storage for large datasets across clusters of machines. HDFS provides distributed storage over the nodes of a cluster using one Name node (master) and one or more Data nodes (slaves), and serves all other Hadoop modules [1]. MapReduce is a framework for reliable, fault-tolerant parallel processing of large datasets; it uses one Job Tracker (master) and a Task Tracker on each slave node, and operates on key-value pairs through Map and Reduce tasks [1]. Hive is a data warehouse framework that provides HQL (Hive Query Language), an interface similar to SQL (Structured Query Language), for ad-hoc queries and summarization in a batch processing environment [1]. Pig is a scripting language, also for batch processing, and HBase provides random read/write access and real-time processing of big data [1]. Hadoop is well suited to processing large biological datasets of DNA, RNA, protein and genomics data and also plays an important role in microarray data analysis and read mapping.

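The key-value flow of Map and Reduce tasks described in the introduction can be sketched in plain Python. This is a minimal single-process illustration and not Hadoop code: the function names (`map_kmers`, `shuffle`, `reduce_counts`, `run_job`), the choice of k-mer counting as the task and the toy reads are all assumptions made for the example.

```python
from collections import defaultdict
from itertools import chain


def map_kmers(read, k=3):
    """Map task: emit a (k-mer, 1) pair for every k-mer in one read."""
    return [(read[i:i + k], 1) for i in range(len(read) - k + 1)]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key, values):
    """Reduce task: sum the counts collected for one key."""
    return key, sum(values)


def run_job(reads, k=3):
    """Run map, shuffle and reduce in sequence on a list of reads."""
    mapped = chain.from_iterable(map_kmers(read, k) for read in reads)
    return dict(reduce_counts(key, vals) for key, vals in shuffle(mapped).items())


reads = ["ACGTAC", "GTACGT"]  # toy reads, made up for illustration
kmer_counts = run_job(reads)
```

On a real cluster the map calls would run on different nodes and the shuffle would move data over the network; here everything runs in one process so that the data flow is easy to follow.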
Many machine learning techniques and algorithms are used for classification and analysis of datasets. Classification and regression algorithms include decision trees, naïve Bayes, logistic regression, SVM (Support Vector Machine), gradient boosted trees, random forest, generalized linear models and linear regression. Clustering algorithms include K-means, Fuzzy K-means, power iteration, spectral clustering and CluStream. Machine learning also covers association rule mining and deep learning. All of these algorithms and techniques are used in traditional computational processes, and they are also of great importance in the field of big data. Almost all of them are available in the Hadoop big data framework through MapReduce and Spark. MapReduce uses the Mahout library (written in Java) and Spark uses the MLlib library (usable from Java, Python and Scala) to implement machine learning techniques. Some of the machine learning techniques used in Hadoop big data are listed in Table 1.

Table 1: Overview of Machine Learning techniques used in Hadoop Big Data

Machine Learning (Techniques and Algorithms) | MapReduce (Mahout library) | Spark (MLlib library)
NB (Naïve Bayes, Bayesian Algorithm) | Yes | Yes
GBT (Gradient Boosted Tree, Ensemble Algorithm) | Yes | Yes
Streaming K-means Clustering | Yes | Yes
SVM (Support Vector Machine) | Yes | Yes
K-means Clustering | Yes | Yes
Adaptive Model Rules | No | No
GLM (Generalized Linear Model) | No | Yes
kNN (Instance based Algorithm) | Yes | Yes
LR (Logistic Regression) | Yes | Yes
Deep Learning | Yes | Yes
Random Forest (Ensemble Algorithm) | Yes | Yes
k-Median Clustering | Yes | Yes
Association Rule Mining (Apriori Algorithm) | Yes | Yes
Decision Tree | Yes | Yes
Linear Regression | Yes | Yes

The objectives of this study are:
- To explore the Machine Learning techniques used in Bioinformatics
- To analyze the capabilities of Machine Learning techniques implemented on Big Data platforms
- To present future research opportunities for using Machine Learning techniques along with Big Data platforms to analyze bioinformatics data
- To compare the performance of existing algorithms

The rest of the paper is structured as follows: Section II describes the related work. Section III presents the work in Bioinformatics along with Machine Learning. Section IV presents the use of ML tools, algorithms and techniques in the field of Bioinformatics using Big Data Hadoop. Section V discusses our work. Section VI concludes the paper with opportunities for further research in Bioinformatics along with Big Data Hadoop.

II. RELATED WORK

A. Machine Learning in Bioinformatics

In the past, machine learning techniques have been used in the bioinformatics domain for microarray data analysis. Hernandez et al. [2] proposed a computational approach for the selection and classification of genes in which genes are classified with an SVM (Support Vector Machine) classifier combined with a genetic algorithm; using leukemia, colon cancer and lymphoma datasets from NCBI, they reported higher accuracy. Leu et al. [3] developed a sampling-based analysis of microarray data in which genes are classified into three groups based on expression level. After the unnecessary groups are removed, subsets are built by sampling. Irrelevant subsets are then removed according to the classification accuracy determined by the kNN algorithm, and the χ²-test is used to find relevant genes, giving the best classification accuracy with fewer genes on three bioinformatics datasets from NCBI. Lee et al. [4] proposed a novel gene selection approach for microarray data in which GADP (Genetic Algorithm with Dynamic Parameter setting) is used with the χ²-test for gene selection and an SVM is used to verify the effectiveness of the selected genes, giving the best classification accuracy with fewer genes on six bioinformatics datasets from NCBI. Kumar et al. [5] proposed classification with the Fuzzy kNN algorithm: genes are selected with a t-test and classified using kNN on leukemia and breast cancer datasets, with a reported accuracy of 100%.
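The pipeline of selecting genes with a t-test and then classifying samples with kNN, as described for the related work in Section II.A, can be sketched as follows. This is an illustrative sketch only and not the implementation from [5]: it uses plain kNN rather than the fuzzy variant, Welch's form of the t statistic is an assumption, and the function names and toy expression values are made up.

```python
import math
from collections import Counter


def welch_t(a, b):
    """Welch's t statistic for two samples (lists of floats)."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    var_a = sum((x - mean_a) ** 2 for x in a) / (len(a) - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (len(b) - 1)
    return (mean_a - mean_b) / math.sqrt(var_a / len(a) + var_b / len(b))


def select_genes(samples, labels, top_n):
    """Rank genes (columns) by |t| between the two classes; keep top_n indices."""
    classes = sorted(set(labels))
    scores = []
    for g in range(len(samples[0])):
        a = [s[g] for s, y in zip(samples, labels) if y == classes[0]]
        b = [s[g] for s, y in zip(samples, labels) if y == classes[1]]
        scores.append((abs(welch_t(a, b)), g))
    return [g for _, g in sorted(scores, reverse=True)[:top_n]]


def knn_predict(train, labels, query, genes, k=3):
    """Majority vote among the k nearest training samples (Euclidean distance)."""
    dists = sorted(
        (math.dist([s[g] for g in genes], [query[g] for g in genes]), y)
        for s, y in zip(train, labels)
    )
    return Counter(y for _, y in dists[:k]).most_common(1)[0][0]


# Toy expression matrix: rows are samples, columns are genes (made-up values).
samples = [
    [5.1, 2.0, 0.9], [4.9, 2.1, 1.1], [5.0, 1.9, 1.0],
    [1.0, 2.0, 1.0], [1.2, 2.2, 0.8], [0.9, 1.8, 1.2],
]
labels = ["tumor", "tumor", "tumor", "normal", "normal", "normal"]

selected = select_genes(samples, labels, top_n=1)
prediction = knn_predict(samples, labels, [4.8, 2.0, 1.0], selected)
```

Only the first gene separates the two classes in this toy matrix, so it is the one selected, and the query sample is then classified using that gene alone.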
B. Machine Learning in Big Data Hadoop Platform

Machine learning techniques have been used on the big data Hadoop platform for the last few years. Ye et al. [6] developed a Stochastic GBDT (Gradient Boosted Decision Trees) learning algorithm. They present two methods for reducing the training time of individual trees while still producing exact stochastic GBDT models; the methods are implemented on MapReduce and on Hadoop with MPI (Message Passing Interface). Dai et al. [7] developed a MapReduce-based implementation of the C4.5 Decision Tree that reduces time and memory requirements and yields an efficient and scalable method for large datasets. Venkataraman et al. [8] developed a Generalized Linear Model in SparkR for large datasets. R is a statistical tool for which packages such as SparkR (for Spark) and RHIPE (for MapReduce) provide access to the Hadoop framework. RHIPE has been used to calculate the correlation matrix on gene expression data [9]. Oancea et al. [10] performed linear regression on large datasets using R and the Hadoop framework through the open-source RHadoop library, expressing the least squares solution of the linear regression problem in terms of the map-reduce framework.

C. Machine Learning in Bioinformatics using Big Data Hadoop Platform

Traditionally, there are many methods of gene selection for microarray data analysis, such as GADP, SVM and supervised clustering. All of these methods work to some extent but do not scale well. Recently, machine learning techniques have been applied in the Bioinformatics domain using Big Data frameworks. A.K.M. Tauhidul Islam et al. [11] developed a microarray data analysis method on the Hadoop MapReduce platform in which the map task computes the BW (Between-groups to Within-groups sum of squares) ratio for every gene. The BW ratio measures the degree of variance between gene expression values. After finding a potential gene subset, the method applies the kNN classifier to the gene list to assess accuracy. By running multiple parallel kNN classifiers, known as MRkNN, to find the top-k genes on 4 real and 3 synthetic bioinformatics datasets, better results are obtained in terms of accuracy and scalability. Kumar et al. [12] also developed a method in which the statistical test ANOVA (Analysis of Variance) is used for gene selection and the kNN algorithm is used to classify the features, resulting in better speedup and scalability on NCBI datasets. Ray et al. [13] proposed a Spark framework for microarray data analysis in which features are selected with sf-ANOVA and genes are classified with machine learning techniques such as Logistic Regression and Naïve Bayes, reporting better accuracy, scalability and speedup than the traditional methods.

III. BIOINFORMATICS

Bioinformatics consists of multiple heterogeneous networks of DNA, RNA, protein and genome data and their interactions. Machine learning techniques play a significant role in Bioinformatics in areas such as gene prediction, microarray data analysis, sequence alignment, pattern identification, protein-protein interaction prediction and SNP (Single Nucleotide Polymorphism) analysis. Multiple Machine Learning techniques and algorithms are used in gene selection and classification for microarray data analysis. For large datasets, some Machine Learning techniques are already used in Bioinformatics with the Big Data Hadoop framework. However, many techniques are still not being used in Bioinformatics with Big Data Hadoop. By using these techniques, we can expect better performance, scalability, efficiency, accuracy and speedup. This is the main focus of this paper.

After gene selection and classification for microarray data analysis using Machine Learning techniques on the Hadoop platform, we can efficiently apply Katz (a link prediction measure), CATAPULT (a positive-unlabeled learning method) and IMC (Inductive Matrix Completion) to gene-disease association, that is, to find how strongly a gene is associated with a specific disease [14]. Sequence alignment and the GATK pipeline can also be performed well after microarray data analysis using ML (Machine Learning) with Hadoop [15]. ML with Hadoop thus plays an important role in Bioinformatics and engineering fields across different scenarios.

Machine Learning techniques used in the Hadoop context are presented in Table 1. Almost all of them can be used in the Bioinformatics domain with Big Data Hadoop to design a benchmark for large datasets. By using these techniques, we can address the problems of time, space, scalability and speedup.

IV. MACHINE LEARNING TOOLS AND TECHNIQUES

There are many Machine Learning tools, such as the Hadoop Mahout library (uses MapReduce), Caffe, the MLlib library (uses Apache Spark), WEKA, Neon, Torch and ConvNetJS. Most Machine Learning tasks on Hadoop are implemented via libraries such as Mahout and MLlib. Mahout offers performance and scalability for the analysis of large datasets, particularly for clustering and classification. MLlib is Spark's Machine Learning library, which includes many algorithms such as Naïve Bayes and Decision Tree; it gives better scalability, speedup and performance than MapReduce. Caffe provides a Deep Learning platform oriented toward speed and is used to build different models for learning tasks. WEKA is a Data Mining and Machine Learning tool. Torch is a platform with GPU support that provides flexibility and speedup.
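Returning to the BW ratio that the map task computes for every gene in the method of [11] (Section II.C), the calculation can be sketched as below. This is a hedged single-machine illustration, not the authors' MapReduce code: it assumes the usual definition of the ratio as the between-group sum of squares divided by the within-group sum of squares, and the function names (`bw_ratio`, `map_gene`, `top_k_genes`) and toy data are hypothetical.

```python
def bw_ratio(values, labels):
    """Between-group to within-group sum-of-squares ratio for one gene."""
    overall = sum(values) / len(values)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(y, []).append(v)
    between = within = 0.0
    for members in groups.values():
        mean = sum(members) / len(members)
        between += len(members) * (mean - overall) ** 2
        within += sum((v - mean) ** 2 for v in members)
    return between / within if within else float("inf")


def map_gene(gene_id, values, labels):
    """Map task: emit (gene_id, BW ratio) for one gene's expression row."""
    return gene_id, bw_ratio(values, labels)


def top_k_genes(expression, labels, k):
    """Reduce step: rank the emitted pairs and keep the k highest ratios."""
    scored = [map_gene(g, row, labels) for g, row in expression.items()]
    return [g for g, _ in sorted(scored, key=lambda p: p[1], reverse=True)[:k]]


# Toy data: one expression row per gene, one class label per sample (made up).
expression = {
    "geneA": [1.0, 1.2, 5.0, 5.2],  # differs strongly between the classes
    "geneB": [3.0, 3.4, 3.1, 3.3],  # similar in both classes
}
labels = ["c1", "c1", "c2", "c2"]

top = top_k_genes(expression, labels, k=1)
```

Because each gene's ratio depends only on that gene's own row, the map calls are independent of one another, which is what makes the per-gene computation easy to distribute.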
Table 2: Overview of Machine Learning techniques used in Bioinformatics using Big Data Hadoop

Machine Learning (Techniques and Algorithms) | MapReduce (Mahout library) | Spark (MLlib library) | Bioinformatics using Big Data (Bioinformatics+Hadoop)
NB (Naïve Bayes, Bayesian Algorithm) | Yes | Yes | Yes
GBT (Gradient Boosted Tree, Ensemble Algorithm) | Yes | Yes | No
Streaming K-means Clustering | Yes | Yes | No
SVM (Support Vector Machine) | Yes | Yes | Yes
K-means Clustering | Yes | Yes | No
Adaptive Model Rules | No | No | No
GLM (Generalized Linear Model) | No | Yes | No
kNN (Instance based Algorithm) | Yes | Yes | Yes
LR (Logistic Regression) | Yes | Yes | Yes
Deep Learning | Yes | Yes | No
Random Forest (Ensemble Algorithm) | Yes | Yes | Yes
k-Median Clustering | Yes | Yes | No
Association Rule Mining (Apriori Algorithm) | Yes | Yes | No
Decision Tree | Yes | Yes | No
Linear Regression | Yes | Yes | Yes

Many of the Machine Learning techniques in use today are listed in Table 1. Some are used in Bioinformatics, some in Big Data Hadoop and some in Bioinformatics with Big Data Hadoop. Naïve Bayes is used for training and testing on data and for building statistical models, with its main focus on the classification of datasets. Gradient Boosting solves regression problems and classifies datasets with a Decision Tree model. Random Forest behaves similarly to GBT (Gradient Boosted Tree), but performance is better with GBT. SVM (Support Vector Machine) combined with GA (Genetic Algorithm) provides higher accuracy for large datasets. The Fuzzy kNN algorithm performs better than kNN when classifying large datasets. Association Rule Mining gives the relationships between different variables.

Several Machine Learning techniques and algorithms have been implemented in Bioinformatics using the Big Data Hadoop framework. These include Naïve Bayes (Bayesian Algorithm), SVM (Support Vector Machine), kNN (Instance based Algorithm), Logistic Regression, Random Forest (Ensemble Algorithm) and Linear Regression, as listed in Table 2. Naïve Bayes and Logistic Regression, used for microarray classification in Apache Spark on large datasets, achieve the best scalability. kNN has been implemented on the Hadoop MapReduce platform for microarray feature classification on big datasets for better performance. SVM also provides good results for gene selection and prediction in microarray data analysis on Hadoop. Better results have also been obtained with Random Forest in Bioinformatics using Big Data Hadoop.

Some Machine Learning techniques have not yet been implemented in Bioinformatics using Big Data Hadoop, such as Gradient Boosted Tree (Ensemble Algorithm), Streaming K-means clustering, K-means clustering, Adaptive Model Rules, GLM (Generalized Linear Model), Deep Learning, K-median clustering, Association Rule Mining (Apriori Algorithm) and Decision Tree, as shown in Table 2. By using these ML techniques and algorithms in Bioinformatics with Big Data Hadoop, we can expect improved Performance, Accuracy, Scalability, Reliability, Speedup and Efficiency through reduced time and storage requirements.

A. Performance Comparison of Existing Algorithms

Implementing the NB (Naïve Bayes) technique in the Spark framework for Bioinformatics (microarray) data gives better Accuracy and Performance than the Hadoop MapReduce framework and conventional techniques [17]. Naïve Bayes provides better Accuracy and Scalability than Logistic Regression in the Spark framework. Apache Spark offers the best Performance by reducing training time when the kNN algorithm is implemented on Spark. Implementing the SVM (Support Vector Machine) algorithm in the Hadoop MapReduce framework for Bioinformatics (microarray) data gives better Accuracy than kNN. SVM implemented on MapReduce gives better Performance, through decreased training time, than SVM on traditional tools [17]. The Scalability of SVM is low compared to the kNN algorithm when it is implemented in the MapReduce framework. Implementing the kNN (k Nearest Neighbor) algorithm in the Hadoop MapReduce framework for Bioinformatics (microarray) data gives better Scalability than SVM and GADP (Genetic Algorithm with Dynamic Parameter setting). Conversely, the Accuracy of kNN is lower than that of SVM. Implementing kNN in MapReduce provides better Performance than kNN in traditional tools by decreasing communication cost.

Implementing the LR (Logistic Regression) technique in the Spark framework for Bioinformatics (microarray) data gives better Accuracy than the Hadoop MapReduce framework. LR in Spark provides lower Accuracy and Scalability than NB. When the Random Forest algorithm is implemented in the Hadoop MapReduce framework for Bioinformatics (microarray) data, Accuracy can be maintained in the presence of missing data. The Scalability of Random Forest is better than that of the other algorithms, and it provides the best Performance by using a many-to-many model. Implementing Linear Regression in the Hadoop MapReduce framework for Bioinformatics (microarray) data also provides better Accuracy, Scalability and Performance. Linear Regression is the most widely used ML technique and runs fast. Table 3 gives the Performance Comparison of the existing algorithms.

V. DISCUSSION

In the recent past, the use of Machine Learning tools and techniques with Big Data Hadoop has gained popularity in the bioinformatics domain. SVM (Support Vector Machine) and kNN have been implemented in Bioinformatics using Big Data Hadoop for genome datasets. SVM can also be used for protein datasets, for example to find protein-protein interaction networks. Logistic Regression and Naïve Bayes techniques are also used on Hadoop for gene classification on microarray data. These techniques can give better performance when used for proteomics. Linear Regression can also be used for protein datasets. By using state-of-the-art machine learning techniques implemented on Hadoop and Spark for DNA, RNA and protein datasets, we can expect improved Performance, Accuracy, Speedup and Scalability. A lot of research potential exists in this area.

The ML techniques and algorithms that are used in Bioinformatics with Hadoop can also be used in DNA and RNA sequence alignment to achieve better performance for some analyses. These techniques and algorithms can also be used in the GATK pipeline for better read mapping Scalability and Performance. Using these tools, techniques and algorithms in the domain of Bioinformatics along with Big Data Hadoop shows considerable potential.
Table 3: Performance Comparison of Existing Algorithms

Machine Learning (Techniques and Algorithms) | Dataset | Implementation Platform (Hadoop) | Accuracy | Scalability | Performance and Comparison with Traditional Algorithms
NB (Naïve Bayes, Bayesian Algorithm) | Bioinformatics | Apache Spark | Using Spark, better than MapReduce and conventional. Better than LR | Same as MapReduce. Better than LR | Using Spark, up to 100 times faster than MapReduce. Training time decreases and performance is better than traditional NB
SVM (Support Vector Machine) | Bioinformatics | MapReduce | Better than kNN | Less than kNN | Training time decreases; better performance and reduced computational complexity compared to traditional SVM
kNN (Instance based Algorithm) | Bioinformatics | MapReduce | Less than SVM | Better than SVM, BPNN and GADP algorithms | Better performance and decreased communication cost compared to traditional kNN
LR (Logistic Regression) | Bioinformatics | Apache Spark | Using Spark, better than MapReduce and conventional. Less than NB | Same as MapReduce. Less than NB | Using Spark, up to 100 times faster than MapReduce. Execution time decreases and performance is better than traditional LR
Random Forest (Ensemble Algorithm) | Bioinformatics | MapReduce | Maintains accuracy when there is missing data | Better than other algorithms | Best using a many-to-many model
Linear Regression | Bioinformatics | MapReduce | Better than traditional techniques | Works well on high-dimensional, sparse datasets | Most widely used ML technique; runs fast

ML tools used with Big Data Hadoop have produced promising results in Bioinformatics. For gene classification, the fuzzy kNN algorithm on the MapReduce platform gives better scalability and accuracy for large datasets than kNN. Deep Learning and Decision Tree techniques could be implemented in Apache Spark for microarray data analysis to improve Accuracy. We can reduce the time and space constraints of Bioinformatics problems by implementing all of these techniques on Hadoop and Spark.

The Scalability of the NB (Naïve Bayes) technique can be improved by implementing it in the Apache Spark framework for large Bioinformatics datasets. There is a great need to improve the Scalability of the SVM (Support Vector Machine) algorithm implemented in MapReduce. The Accuracy, Scalability and Performance of SVM could be improved by implementing it in Spark for large Bioinformatics datasets. There is also a great need to improve the Accuracy of the kNN (k Nearest Neighbor) algorithm implemented in MapReduce. The Accuracy, Scalability and Performance of kNN could be improved by implementing it in Spark for large Bioinformatics datasets. The Accuracy and Scalability of the LR (Logistic Regression) technique in the Apache Spark framework need improvement, because they are lower than those of NB in Spark. The Accuracy, Scalability and Performance of Linear Regression and Random Forest could be improved by implementing these algorithms in the Apache Spark framework. The Performance comparison is given in Table 3.

VI. CONCLUSION

In this paper, we describe the many Machine Learning tools, techniques and algorithms for Bioinformatics using Big Data Hadoop that are expected to give strong results on large datasets by reducing time and space requirements. We identify which ML techniques and algorithms have been implemented in the Bioinformatics domain using Big Data Hadoop, as shown in Table 2. We conclude that ML techniques such as Deep Learning, Streaming K-means clustering, Decision Tree, Adaptive Model Rules, GLM (Generalized Linear Model), Gradient Boosted Tree (Ensemble Algorithm), K-median clustering, Association Rule Mining (Apriori Algorithm) and K-means clustering can be expected to give high Accuracy, Performance, Scalability and Speedup, which points to large ML potential for Bioinformatics using Big Data.

Adaptive model rules are not used on the Hadoop platform, and generalized linear models are not used in MapReduce. Adaptive model rules could be implemented in Hadoop MapReduce and Apache Spark, and GLM could be used on the MapReduce platform for Performance analysis.

ACKNOWLEDGMENT

The authors would like to thank Abbas Rehman and Atif Sarwar, Department of Computer Science, COMSATS Institute of Information Technology Sahiwal, Pakistan, for their visionary suggestions and beneficial contribution in support of this work.

REFERENCES

[1] Ronald C. Taylor, "An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics," in Bioinformatics Open Source Conference (BOSC), Boston, MA, USA, 2010.

[2] Jose Crispin Hernandez Hernandez, Béatrice Duval, Jin-Kao Hao, "A Genetic Embedded Approach for Gene Selection and Classification of Microarray Data," in EvoBio, Springer-Verlag Berlin Heidelberg, 2007.

[3] Yungho Leu, Chien-Pang Lee, Hui-Yi Tsai, "A Gene Selection Method for Microarray Data Based on Sampling," in ICCCI, Springer-Verlag Berlin Heidelberg, 2010.

[4] Chien-Pang Lee, Yungho Leu, "A novel hybrid feature selection method for microarray data analysis," Applied Soft Computing, vol. 11, no. 1, pp. 208-213, January 2011.

[5] Mukesh Kumar, Santanu Kumar Rath, "Microarray Data Classification using Fuzzy K-Nearest Neighbor," 2014.

[6] Jerry Ye, Jyh-Herng Chow, Jiang Chen, Zhaohui Zheng, "Stochastic Gradient Boosted Distributed Decision Trees," in CIKM, Hong Kong, China, 2009.

[7] Wei Dai, Wei Ji, "A MapReduce Implementation of C4.5 Decision Tree Algorithm," International Journal of Database Theory and Application, vol. 7, pp. 49-60, 2014.

[8] Venkataraman et al., "SparkR: Scaling R Programs with Spark," in SIGMOD, San Francisco, CA, USA, 2016.

[9] Wang et al., "Optimising parallel R correlation matrix calculations on gene expression data using MapReduce," BMC Bioinformatics, 2014.

[10] Bogdan Oancea, "Linear Regression with R and Hadoop," Challenges of the Knowledge Society, pp. 1007-1012, 2016.

[11] A.K.M. Tauhidul Islam, Byeong-Soo Jeong, A.T.M. Golam Bari, Chae-Gyun Lim, Seok-Hee Jeon, "MapReduce based parallel gene selection method," Applied Intelligence, vol. 42, no. 2, pp. 147-156, 2015.

[12] Mukesh Kumar, Nitish Kumar Rath, Amitav Swain, Santanu Kumar Rath, "Feature Selection and Classification of Microarray Data using MapReduce based ANOVA and K-Nearest Neighbor," in IMCIP, 2015.

[13] Ransingh Biswajit Ray, Mukesh Kumar, Santanu Kumar Rath, "Fast Computing of Microarray Data Using Resilient Distributed Dataset of Apache Spark," in Recent Advances in Information and Communication Technology, vol. 463, Springer International Publishing, 2016, pp. 171-182.

[14] Nagarajan Natarajan, Inderjit S. Dhillon, "Inductive Matrix Completion for Predicting Gene-Disease Association," Bioinformatics (Oxford), vol. 30, no. 12, 2014.

[15] Hamid Mushtaq, Zaid Al-Ars, "Cluster-Based Apache Spark Implementation of the GATK DNA Analysis Pipeline," in BIBM, Washington, USA, 2015.

[16] Sara Landset, Taghi M. Khoshgoftaar, Aaron N. Richter, Tawfiq Hasanin, "A survey of open source tools for machine learning with big data in the Hadoop ecosystem," Journal of Big Data, 2015.

[17] S. R. Pakize and A. Gandomi, "Comparative Study of Classification Algorithms Based on MapReduce Model," International Journal of Innovative Research in Advanced Engineering (IJIRAE), vol. 1, no. 7, pp. 251-254, August 2014.

M. Usman Ali is a Lab Engineer at the Department of Computer Science, COMSATS Institute of Information Technology Sahiwal, Pakistan. He received his BS (IT) degree from Govt. College University Faisalabad, Pakistan in 2013. Currently, he is a scholar of MS (CS), session 2015-2017, at COMSATS Institute of Information Technology Sahiwal, Pakistan. His main research interests include Big Data Analytics and Machine Learning. In particular, he is interested in applications of Big Data in the Bioinformatics field. Currently, he is working with the Big Data Analytics Research Group at COMSATS Institute Sahiwal.

Shahzad Ahmad is a Research Associate at the Department of Computer Science, COMSATS Institute of Information Technology Sahiwal, Pakistan. He received his BS (CS) degree from COMSATS Institute of Information Technology Sahiwal, Pakistan in 2015. Currently, he is a scholar of MS (CS), session 2015-2017, at COMSATS Institute of Information Technology Sahiwal, Pakistan. His main research interests include Big Data Analytics and Machine Learning. In particular, he is interested in applications of Big Data in the Bioinformatics field. Currently, he is working with the Big Data Analytics Research Group at COMSATS Institute Sahiwal.

Dr. Javed Ferzund is an associate professor at the Department of Computer Science, COMSATS Institute of Information Technology, Sahiwal, where he served as Head of Department from 2013 to 2015. He received his PhD degree from Graz University of Technology, Austria in 2009. His main research interests include Big Data Analytics, Internet of Things and Machine Learning. In particular, he is interested in applications of IoT and Big Data in the Agro-Informatics and Bioinformatics fields. Currently, he is leading the Big Data Analytics Research Group at COMSATS Institute Sahiwal.
