
Big Data Use Cases in Science and
Advanced Data Intensive Computing


KISTI
Daejeon, Korea
June 20th, 2014

Jongwook Woo (PhD)
High-Performance Information Computing Center (HiPIC)
Cloudera Academic Partner and Grants Awardee of Amazon AWS
Computer Information Systems Department
California State University, Los Angeles
Contents

Emerging Big Data Technology
Big Data Use Cases
Hadoop 2.0
Training in Big Data
Me
California State University, Los Angeles
Located in Los Angeles, the Capital City of Entertainment
Since 2002: Computer Information Systems Dept, College of Business and Economics
www.calstatela.edu/faculty/jwoo5
Since 1998: J2EE eBusiness applications
Search engines: FAST, Lucene/Solr, Sphinx
Clients: Warner Bros (Matrix online game), E!, citysearch.com, ARM
Since 2009: Big Data
Experience in Big Data
Grants
Received Microsoft Windows Azure Educator Grant (Oct 2013 - July 2014)
Received Amazon AWS in Education Research Grant (July 2012 - July 2014)
Received Amazon AWS in Education Coursework Grants (July 2012 - July 2013, Jan 2011 - Dec 2011)
Partnership
Academic Education Partnership with Cloudera since June 2012
Linked with Hortonworks since May 2013
Open to further partnerships

Experience in Big Data
Certificate
Cloudera Certified Hadoop Developer / Administrator
Certificate of Achievement in the Big Data University Training Course, Hadoop Fundamentals I (July 8, 2012)
Certificate of 10gen Training Course, M101: MongoDB Development (Dec 24, 2012)
Blog and GitHub for Hadoop and its ecosystem
http://dal-cloudcomputing.blogspot.com/
Hadoop, AWS, Cloudera
https://github.com/hipic
Hadoop, Cloudera, Solr on Cloudera, Hadoop Streaming,
RHadoop
https://github.com/dalgual


Experience in Big Data
Several publications regarding Hadoop and NoSQL
Deeksha Lakshmi, Iksuk Kim, Jongwook Woo, "Analysis of MovieLens Data Set using Hive", ARPN Journal of Science and Technology, Dec 2013, Vol. 3, No. 12, pp. 1194-1198
Chul Sung, Jongwook Woo, Matthew Goodman, Todd Huffman, and Yoonsuck Choe, "Scalable, Incremental Learning with MapReduce Parallelization for Cell Detection in High-Resolution 3D Microscopy Data", in Proceedings of the International Joint Conference on Neural Networks, 2013
Jongwook Woo, "Apriori-Map/Reduce Algorithm", PDPTA 2012, Las Vegas (July 16-19, 2012)
Jongwook Woo, Siddharth Basopia, Yuhang Xu, Seon Ho Kim, "Market Basket Analysis Algorithm with no-SQL DB HBase and Hadoop", EDB 2012, Incheon, Aug. 25-27, 2011
Jongwook Woo and Yuhang Xu, "Market Basket Analysis Algorithm with Map/Reduce of Cloud Computing", PDPTA 2011, Las Vegas (July 18-21, 2011)
Collaboration with universities and companies
USC, Texas A&M, Cloudera, Amazon, Microsoft
What is Big Data, Map/Reduce, Hadoop, NoSQL DB on
Cloud Computing
Data
Google:
"We don't have a better algorithm than others, but we have more data than others."
New Data Trend
Sparsity
Unstructured
Schema-free data with sparse attributes
Semantic or social relations
No relational properties nor complex join queries
Log data
Immutable
No need to update or delete data
Data Issues
Large-scale data
Terabytes (10^12 bytes), Petabytes (10^15 bytes)
Because of the web
Sensor data, bioinformatics, social computing, smart phones, online games
Cannot be handled with the legacy approach
Too big
Un-/semi-structured data
Too expensive
Need new systems
Inexpensive ones
Two Cores in Big Data
How to store Big Data
How to compute Big Data
Google
How to store Big Data: GFS
On inexpensive commodity computers
How to compute Big Data: MapReduce
Parallel computing with multiple inexpensive computers
Equivalent to owning a supercomputer
Hadoop 1.0
Hadoop
Created by Doug Cutting
Also the creator of Lucene, Nutch, and Avro
Chief Architect at Cloudera
Two cores: MapReduce and HDFS
Restricted parallel programming
Not for iterative algorithms
Not for graph processing
Emerging Big Data Technology
Giraph
Spark and Shark
Use Cases
Use Cases experienced


Spark and Shark
High-speed in-memory analytics over Hadoop and Hive data
http://www.slideshare.net/Hadoop_Summit/spark-and-shark
Fast data sharing
Iterative graph algorithms
Data mining (classification/clustering)
Interactive query
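As a rough sketch of this in-memory style (illustrative PySpark code, not from the original talk; the HDFS path is an assumption):

from pyspark import SparkContext

sc = SparkContext(appName="SparkSketch")

# Load text from HDFS once and cache it in cluster memory; later
# actions reuse the cached data instead of re-reading disk, which
# is what makes iterative and interactive workloads fast.
lines = sc.textFile("hdfs:///user/jwoo/shakespeare").cache()  # hypothetical path

# First pass: a word count, the same work a MapReduce job would do.
counts = (lines.flatMap(lambda l: l.split())
               .map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b))
print(counts.take(10))

# Second pass over the SAME cached data: an interactive-style query.
print(lines.filter(lambda l: "king" in l).count())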
Giraph
Based on the Bulk Synchronous Parallel (BSP) model
Used at Facebook
http://www.slideshare.net/aladagemre/a-talk-on-apache-giraph
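To make the BSP model concrete, here is a minimal single-machine Python sketch of vertex-centric PageRank in supersteps (an illustration of the programming model only; Giraph itself is a distributed Java framework):

# Each superstep: every vertex consumes incoming messages, updates its
# value, and sends new messages; a barrier separates supersteps.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}   # toy adjacency list
rank = {v: 1.0 / len(graph) for v in graph}

for superstep in range(30):
    inbox = {v: 0.0 for v in graph}
    # "Compute" phase: each vertex sends rank/out_degree to its neighbors.
    for v, neighbors in graph.items():
        share = rank[v] / len(neighbors)
        for n in neighbors:
            inbox[n] += share
    # Barrier: all messages are delivered before any value updates.
    rank = {v: 0.15 / len(graph) + 0.85 * inbox[v] for v in graph}

print(rank)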



Josh Wills (Cloudera)
"I have found that many kinds of scientists such as astronomers, geneticists, and geophysicists are working with very large data sets in order to build models that do not involve statistics or machine learning, and that these scientists encounter data challenges that would be familiar to data scientists at Facebook, Twitter, and LinkedIn."
"Data science is a set of techniques used by many scientists to solve problems across a wide array of scientific fields."
Use Cases experienced
Log Analysis
Log files from IPS and IDS
1.5 GB per day for each system
Extracting unusual cases using Hadoop, Solr, Flume on Cloudera
Customer Behavior Analysis
Market Basket Analysis algorithm
Machine Learning for Image Processing
With Texas A&M
Hadoop Streaming API
Movie Data Analysis
Hive, Impala





Scalable, Incremental Learning with MapReduce Parallelization for Cell Detection in High-Resolution 3D Microscopy Data (IJCNN 2013)
Chul Sung, Yoonsuck Choe
Brain Networks Laboratory
Computer Science and Engineering, TAMU
Jongwook Woo
Computer Information Systems
CALSTATE-LA
Matthew Goodman, Todd Huffman
3SCAN
Motivation
Analysis of neuronal distribution in the brain plays an important role in the diagnosis of disorders of the brain.
E.g., Purkinje cell reduction in autism [3]
Figure: (A) normal cerebellum in a normal human brain; (B) reduction of neurons in the Purkinje cell layer in an autistic human brain
Approach
Use a machine learning approach to detect neurons.
Learn a binary classifier f : R^3 → {0,1}
Input: local volume data
Output: cell center (1) or off-center (0)
Requirement: Effective Incremental Learning
Several properties are desired:
Low computational cost
Non-iterative
No accumulation of data points
No retraining
Yet, sufficient accuracy
Proposed Algorithm
Principal Components Analysis (PCA)-based supervised learning
No need for retraining
Highly scalable, since only the eigenvector matrices are stored
Highly parallelizable due to its incremental nature
We keep the eigenvectors as new training samples become available and additionally use them in the testing process.
MapReduce Parallelization
A highly effective and popular framework for big data analytics
Parallel data processing tasks
Map phase: tasks are divided and results are emitted
Reduce phase: the emitted results are sorted and consolidated
Apache Hadoop
Open-source project of the Apache Foundation
Storage: Hadoop Distributed File System (HDFS)
Processing: Map/Reduce (fault-tolerant distributed processing)
Slide from Dr. Jongwook Woo's SWRC 2013 Presentation
Hadoop Streaming
Hadoop MapReduce for non-Java code: Python, Ruby
Requirement
Running Hadoop
Needs the Hadoop Streaming API
hadoop-streaming.jar
Needs mapper and reducer codes to be built
Simple conversion from sequential codes
STDIN > mapper > reducer > STDOUT
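A minimal word-count mapper/reducer pair in Python, as a sketch of this conversion (illustrative code, not from the original slides; Hadoop sorts the mapper output by key before it reaches the reducer):

#!/usr/bin/env python
# mapper.py -- reads raw text from STDIN, emits one "word<TAB>1" line per word.
import sys

for line in sys.stdin:
    for word in line.split():
        print("%s\t%d" % (word, 1))

#!/usr/bin/env python
# reducer.py -- receives lines sorted by key; sums the counts for each word.
import sys

current, total = None, 0
for line in sys.stdin:
    word, count = line.rsplit("\t", 1)
    if word != current:
        if current is not None:
            print("%s\t%d" % (current, total))
        current, total = word, 0
    total += int(count)
if current is not None:
    print("%s\t%d" % (current, total))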
Hadoop Streaming
MapReduce Python execution
http://wiki.apache.org/hadoop/HadoopStreaming
Syntax
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/mapred/contrib/streaming/hadoop-streaming.jar [options]
Options:
-input <path> DFS input file(s) for the Map step
-output <path> DFS output directory for the Reduce step
-mapper <cmd|JavaClassName> The streaming command to run
-reducer <cmd|JavaClassName> The streaming command to run
-file <file> File/dir to be shipped in the Job jar file
Example
$ bin/hadoop jar contrib/streaming/hadoop-streaming.jar \
-file /home/jwoo/mapper.py -mapper /home/jwoo/mapper.py \
-file /home/jwoo/reducer.py -reducer /home/jwoo/reducer.py \
-input /user/jwoo/shakespeare/* -output /user/jwoo/shakespeare-output

Training
PCA is run separately on these class-specific subsets, resulting in class-specific eigenvector matrices.
Figure: the training set is split into Class 1 (cell center, +) and Class 2 (off-center, -); PCA on each subset yields Eigenvectors 1 and Eigenvectors 2; each input vector consists of XY, XZ, and YZ planes.
Testing
Each data vector x is projected using the two class-specific PCA eigenvector matrices.
The class associated with the more accurate reconstruction determines the label for the new data vector.
Figure: a novel input x (its xy, xz, and yz planes) is projected onto Eigenvectors 1 and Eigenvectors 2 and reconstructed with each; if the reconstruction error against Class 1 is the smaller of the two, x is labeled Class 1, otherwise Class 2.
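The whole train/test procedure can be sketched compactly with NumPy (illustrative only, not the paper's code; the data arrays, dimensions, and the number of retained components k are assumptions):

import numpy as np

def fit_pca(X, k):
    # X: (n_samples, d) vectors of ONE class. Only the class mean and the
    # top-k eigenvectors need to be stored -- hence the scalability.
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:k]

def recon_error(x, mean, V):
    # Project x into the class subspace, reconstruct, and measure the error.
    recon = mean + V.T @ (V @ (x - mean))
    return np.linalg.norm(x - recon)

# Hypothetical training data: cell-center (class 1) vs. off-center (class 0).
X1 = np.random.randn(100, 27) + 1.0
X0 = np.random.randn(100, 27)
mean1, V1 = fit_pca(X1, k=10)
mean0, V0 = fit_pca(X0, k=10)

# Testing: the class with the more accurate reconstruction wins.
x = np.random.randn(27) + 1.0
label = 1 if recon_error(x, mean1, V1) < recon_error(x, mean0, V0) else 0
print(label)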
Reconstruction Examples
Reconstruction of cell-center and off-center data using matching vs. non-matching eigenvector matrices
Reconstruction is accurate only with the matching eigenvector matrix
Proximity: cell-center proximity value (e.g., 1.0 is cell center and 0.1 off-center)
MapReduce Parallelization
Our algorithm is highly
parallelizable.
To exploit this property, we
developed a MapReduce-based
implementation of the algorithm.
MapReduce Parallelization (Training)
Parallel PCA computations of the class-specific subsets from the training sets, generating two eigenvector matrices per training set
Figure: training sets 1 through k, each split into Class 1 and Class 2 subsets, are read as input files; map-phase workers perform the eigendecomposition and write one set of eigenvectors per subset to the output files.
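One hedged way to express this phase with Hadoop Streaming and Python (the input format "setID:class<TAB>v1,v2,...,vd" is an assumption for illustration): an identity mapper keys each training vector by its (training set, class) pair, and the reducer below runs one eigendecomposition per group:

#!/usr/bin/env python
# pca_reducer.py -- illustrative sketch: computes the eigenvector matrix
# for each (training set, class) subset. Input lines arrive sorted by key.
import sys
import numpy as np

def emit(key, rows):
    X = np.array(rows)
    _, _, Vt = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
    print("%s\t%s" % (key, ",".join("%.6f" % v for v in Vt.ravel())))

current, rows = None, []
for line in sys.stdin:
    key, vec = line.strip().split("\t")
    if key != current and current is not None:
        emit(current, rows)
        rows = []
    current = key
    rows.append([float(v) for v in vec.split(",")])
if current is not None:
    emit(current, rows)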
MapReduce Parallelization (Testing) - Map
1. We need to prepare data vectors from all voxels in the data volume, to test whether each data vector is in the cell-center class.
Figure: map-phase workers read the novel input splits (1 through m) together with the class-specific eigenvector sets, perform projection and reconstruction, and emit the reconstruction errors to intermediate files; reduce-phase workers average the reconstruction errors and write the averages to the output files.
Results: MapReduce Performance
Performance comparison during testing
35 map tasks and 10 reduce tasks per job (except for the A case)
Performance was greatly improved (nearly 10 times)
Not much gain during training
Cluster Configuration
A: Single Node
B: One Master, One Slave
C: One Master, Five Slaves
D: One Master, Ten Slaves
Figure: bar chart of average runtime for configurations A through D
Each computing node has a quad-core 2x Intel Xeon X5570 CPU and 23.00 GB of memory.
Conclusion
Developed a novel, scalable, incremental learning algorithm for fast quantitative analysis of massive, growing, sparsely labeled data.
Our algorithm showed high accuracy (AUC of 0.9614).
10-times speedup using MapReduce.
Expected to be broadly applicable to the analysis of high-throughput medical imaging data.
Use Cases in Science
Seismology
HEP






Reflection Seismology
Marine Seismic Survey
Sears (Retail)
Gravity (Online Publishing, Personalized Content)
Reflection Seismology
A set of techniques for solving a classic inverse problem:
given a collection of seismograms and associated metadata,
generate an image of the subsurface of the Earth that generated those seismograms.
Big Data
A modern seismic survey involves tens of thousands of shots and multiple terabytes of trace data.
Uses
To locate oil and natural gas deposits.
To identify the location of the Chicxulub Crater, which has been linked to the extinction of the dinosaurs.
Marine Seismic Survey
Common Depth Point (CDP) Gather
By comparing the time it took for the seismic waves to travel from the different source and receiver locations, and by experimenting with different velocity models for the waves moving through the rock, we can estimate the depth of the common subsurface point that the waves reflected off of.
Reflection Seismology and Hadoop
By aggregating a large number of these estimates, we can construct a complete image of the subsurface.
As we increase the density and the number of traces, we can create higher-quality images that improve our understanding of the subsurface geology.
Figure: a 3D seismic image of Japan's southeastern margin
Reflection Seismology and Hadoop (Legacy Seismic Data Processing)
Geophysicists used the first Cray supercomputers as well as the massively parallel Connection Machine.
Parallel computing:
One must file a request to move the data into active storage,
then consume precious cluster resources in order to process the data.
Reflection Seismology and Hadoop (Legacy Seismic Data Processing)
Open-source software tools in seismic data processing:
The Seismic Unix project
from the Colorado School of Mines
SEPlib
from Stanford University
SeisSpace
a commercial toolkit for seismic data processing,
built on top of an open-source foundation, the JavaSeis project.
Emergence of Apache Hadoop for Seismology
Seismic Hadoop by Cloudera
Data-intensive computing
Store and process seismic data in a Hadoop cluster.
Enables exporting many of the most I/O-intensive steps of seismic data processing into the Hadoop cluster.
Combines Seismic Unix with Crunch,
the Java library for creating MapReduce pipelines.
Seismic Unix
makes extensive use of Unix pipes in order to construct complex data processing tasks from a set of simple procedures:

sufilter f=10,20,30,40 | suchw key1=gx,cdp key2=offset,gx key3=sx,sx b=1,1 c=1,1 d=1,2 | susort cdp gx

This pipeline in Seismic Unix first applies a filter to the trace data, then edits some metadata associated with each trace, and finally sorts the traces by the metadata just edited.
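The same three-stage pipe can also be driven from a script. Below is a hedged Python sketch of the identical pipeline (it assumes the Seismic Unix tools are installed and on PATH; the file names traces.su and sorted.su are hypothetical):

import subprocess

# Chain the three Seismic Unix stages exactly as the shell pipe does:
# filter -> edit trace headers -> sort by cdp and gx. Each stage's
# stdout feeds the next stage's stdin.
p1 = subprocess.Popen(["sufilter", "f=10,20,30,40"],
                      stdin=open("traces.su", "rb"),    # hypothetical input file
                      stdout=subprocess.PIPE)
p2 = subprocess.Popen(["suchw", "key1=gx,cdp", "key2=offset,gx",
                       "key3=sx,sx", "b=1,1", "c=1,1", "d=1,2"],
                      stdin=p1.stdout, stdout=subprocess.PIPE)
p1.stdout.close()  # let p1 receive SIGPIPE if a later stage exits early
p3 = subprocess.Popen(["susort", "cdp", "gx"],
                      stdin=p2.stdout,
                      stdout=open("sorted.su", "wb"))   # hypothetical output file
p2.stdout.close()
p3.wait()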
What is HEP?
High Energy Physics
Definition:
Involves colliding highly energetic, common particles together
in order to create small, exotic, and incredibly short-lived
particles.
Large Hadron Collider
Collides protons together at an energy of 7 TeV per particle.
Protons travel around the rings and are collided inside particle detectors.
Collisions occur every 25 nanoseconds.
Compact Muon Solenoid
Big Data
Collisions at a rate of 40 MHz
Each collision produces about 1 MB worth of data.
40 MHz x 1 MB = 320 terabits per second (an unmanageable amount)
A complex custom compute system (called the trigger) cuts the collision rate down to about 300 Hz, which means that the statistically significant data are selected.
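A quick worked check of these figures in plain Python arithmetic:

# Worked check of the data rates quoted above.
collision_rate = 40e6            # 40 MHz
event_size_bytes = 1e6           # ~1 MB per collision

raw = collision_rate * event_size_bytes      # bytes per second
print(raw / 1e12)       # 40.0  -> 40 TB/s
print(raw * 8 / 1e12)   # 320.0 -> 320 Tbps, the unmanageable raw rate

after_trigger = 300 * event_size_bytes       # ~300 Hz after the trigger
print(after_trigger / 1e6)   # 300.0 -> 300 MB/s, a manageable rate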
From Raw Data to Significant Data
Raw Sensor Data → Reconstructed Data → Analysis-oriented Data → Physicist-specific N-tuples
Per-event size shrinks along the chain: 1 MB → 110 KB → 1 KB
Tier 1 - at CERN; Tier 2 - Big Data
Characteristics of Tier 2
Need Hadoop
Large amount of data (400 TB)
Large data rate (in the range of 10Gbps) to analyze
Need for reliability, but not archival storage
Proper use of resources
Need for interoperability
HDFS Structure
HDFS mounted with FUSE on the worker nodes
SRM: a generic web-services interface
Globus GridFTP: the standard grid protocol for WAN transfers
FUSE allows physicists' C++ applications to access HDFS without modification.
The two grid components allow interoperation with non-Hadoop sites.
MapReduce 1.0 Cons and Future
Bad for:
Fast response time
Large amounts of shared data
Fine-grained synchronization needs
CPU-intensive, not data-intensive, work
Continuous input streams
Hadoop 2.0: YARN is the product addressing these limits
Hadoop 2.0: YARN
Data processing applications and services
Online serving: HOYA (HBase on YARN)
Real-time event processing: Storm, S4, other commercial platforms
Tez: generic framework to run a complex DAG of tasks
MPI: OpenMPI, MPICH2
Master-Worker
Machine learning: Spark
Graph processing: Giraph
Enabled by allowing the use of paradigm-specific application masters
[http://www.slideshare.net/hortonworks/apache-hadoop-yarn-enabling-nex]
Training in Big Data
Learn by yourself?
You may miss many important topics
Cloudera training
With hands-on exercises
Conclusion
Era of Big Data
Need to store and compute Big Data
Many solutions, but Hadoop leads
Hadoop is a supercomputer that you can own
Hadoop 2.0
Training is important
Questions?
