Introduction
Big Data
Four parameters:
Velocity: Streaming data and large volume data movement.
Volume: Scale from terabytes to zettabytes.
Variety: Manage the complexity of multiple relational and
non-relational data types and schemas.
Voracity: Produced data must be consumed quickly, before it becomes meaningless.
Government: By collecting and analyzing data across agencies, locations, and employee groups, the government can reduce redundancies, identify productivity gaps, pinpoint consumer issues, and find procurement savings across agencies.
Healthcare: Big data in healthcare could be used to help improve hospital operations, provide better tracking of procedural outcomes, help accelerate research, and grow knowledge bases more rapidly. According to a 2011 Cleveland Clinic study, leveraging big data is estimated to drive $300B in annual value through an 8% systematic cost savings.
Processing Granularity
As data size grows from small to very large, the processing granularity and the hardware scale up together:
Pipelined (instruction level): single core
Concurrent (thread level): multi-core
Service (object level): cluster
Beyond that: a grid of clusters
Examples
Web logs
RFID
Sensor networks
Social networks
Social data (due to the social data revolution)
Internet text and documents
Internet search indexing
Call detail records
Astronomy, atmospheric science, genomics, biogeochemical, biological, and other complex and/or interdisciplinary scientific research
Military surveillance
Medical records
Photography archives
Video archives
Large-scale e-commerce
Not so easy
Moving data from a storage cluster to a computation cluster is not feasible
In large clusters
Failure is expected, rather than exceptional.
In large clusters, computers fail every day
Data is corrupted or lost
Computations are disrupted
The number of nodes in a cluster may not be constant.
Nodes can be heterogeneous.
Very expensive to build reliability into each application
A programmer would otherwise have to worry about errors, data motion, and communication
Traditional debugging and performance tools don't apply
Need a common infrastructure and standard set of tools to handle this complexity
Efficient, scalable, fault-tolerant and easy to use
Seeking is the process of moving the disk's head to a particular place on the disk to read or write data.
Seek time characterizes the latency of a disk operation, whereas the transfer rate corresponds to a disk's bandwidth.
If the data access pattern is dominated by seeks, it will take longer to read or write large portions of the dataset than streaming through it, which operates at the transfer rate.
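A back-of-the-envelope illustration (the disk numbers below are assumptions for the sketch, not measurements): take a 1 TB dataset of 100-byte records, a 10 ms average seek, and a 100 MB/s transfer rate. Seeking to just 1% of the records already takes orders of magnitude longer than streaming the entire dataset:

// Hypothetical disk parameters, for illustration only.
public class SeekVsStream {
    public static void main(String[] args) {
        double seekSec = 0.010;             // assumed average seek time (10 ms)
        double transferBytesPerSec = 100e6; // assumed transfer rate (100 MB/s)
        double datasetBytes = 1e12;         // 1 TB dataset
        double recordBytes = 100;           // 100-byte records

        // Stream the whole dataset sequentially at the transfer rate.
        double streamSec = datasetBytes / transferBytesPerSec;

        // Read just 1% of the records, paying one seek per record.
        double recordsRead = 0.01 * (datasetBytes / recordBytes);
        double seekOnlySec = recordsRead * seekSec;

        System.out.printf("stream 100%% of data: %.1f hours%n", streamSec / 3600);
        System.out.printf("seek to 1%% of records: %.1f days%n", seekOnlySec / 86400);
    }
}

With these numbers, streaming the whole terabyte takes under 3 hours, while seek-based access to only 1% of the records takes over 11 days.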
Hadoop distributions
Apache Hadoop
Apache Hadoop-based Services for Windows Azure
Cloudera's Distribution Including Apache Hadoop (CDH)
Hortonworks Data Platform
IBM InfoSphere BigInsights
Platform Symphony MapReduce
MapR Hadoop Distribution
EMC Greenplum MR (using MapR's M5 Distribution)
Zettaset Data Platform
SGI Hadoop Clusters (uses Cloudera distribution)
Grand Logic JobServer
OceanSync Hadoop Management Software
Oracle Big Data Appliance (uses Cloudera distribution)
Hadoop features
Distributed framework for processing and storing data, generally on commodity hardware.
Completely open source.
Written in Java.
Runs on Linux, Mac OS/X, Windows, and Solaris.
Client apps can be written in various languages.
Scalable: store and process petabytes; scale out by adding hardware.
Primarily used for batch data processing, not real-time / user-facing applications.
Components of Hadoop
HDFS (Hadoop Distributed File System)
Modeled on GFS (the Google File System)
A reliable, high-bandwidth file system that can store terabytes and petabytes of data
Map-Reduce
Uses the map/reduce metaphor from the Lisp language
A distributed processing framework that processes the data stored on HDFS as key-value pairs
[Diagram: Client 1 and Client 2 submit input data; the processing framework runs Map tasks over the input, Reduce tasks aggregate the map output, and the output data is written back, with a DFS underneath.]
HDFS
Very Large Distributed File System
10K nodes, 100 million files, 10 PB
Linearly scalable
Supports large files (in GBs or TBs)
Economical
Uses Commodity Hardware
Nodes fail every day. Failure is expected, rather than exceptional.
The number of nodes in a cluster is not constant.
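HDFS absorbs these daily failures mainly through block replication. A minimal sketch using Hadoop's Java Configuration and FileSystem APIs (the property names are HDFS's own; the values and the file path are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("dfs.replication", 3);                // keep 3 copies of every block
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024); // 128 MB blocks

        FileSystem fs = FileSystem.get(conf);
        // Raise the replication factor of a frequently read file
        // (the path here is hypothetical).
        fs.setReplication(new Path("/data/hot.log"), (short) 5);
    }
}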
HDFS Goals
Highly fault-tolerant
runs on commodity HW, which can fail frequently
Large files
Typical file is gigabytes to terabytes in size
Support for tens of millions of files
Simple coherency
Write-once-read-many access model
Data Organization
Fault Tolerance
High Throughput
HDFS Components
NameNode: stores the file system namespace and metadata
DataNode: stores the actual data blocks
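A minimal sketch of how a client exercises both components through Hadoop's Java FileSystem API (the cluster address and file path are made up for illustration): opening a file consults the NameNode for block locations, and the returned stream pulls the bytes from DataNodes.

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical cluster address; normally picked up from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");

        FileSystem fs = FileSystem.get(conf);
        // open() asks the NameNode where the blocks live;
        // the stream then reads the bytes from DataNodes.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(new Path("/data/sample.txt"))))) {
            System.out.println(reader.readLine());
        }
    }
}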
HDFS Architecture
[Diagram: a client performing I/O against HDFS, with numbered arrows showing the steps of a request.]
Map-Reduce
MapReduce programs are executed in two main phases, called mapping and reducing.
In the mapping phase, MapReduce takes the input data and feeds each data element to the mapper.
In the reducing phase, the reducer processes all the outputs from the mapper and arrives at a final result.
The mapper is meant to filter and transform the input into something that the reducer can aggregate over.
MapReduce uses lists and (key/value) pairs as its main data primitives.
Map-Reduce Program
Based on two functions: Map and Reduce
Every Map/Reduce program must specify a Mapper and optionally a Reducer
Operate on key and value pairs
A shell analogy, where grep and cut act as the map, sort as the shuffle, and uniq -c as the reduce:
cat /var/log/auth.log* | grep "session opened" | cut -d' ' -f10 | sort | uniq -c > ~/userlist
Map function: takes a key/value pair and generates a set of intermediate key/value pairs
map(k1, v1) -> list(k2, v2)
Reduce function: takes all the intermediate values associated with the same intermediate key and merges them
reduce(k2, list(v2)) -> list(k3, v3)
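Making the two signatures concrete, the classic word-count program in Hadoop's Java API (essentially the example from the Hadoop tutorial; class names are illustrative):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // map(k1, v1) -> list(k2, v2): emit (word, 1) for every word in the line.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // reduce(k2, list(v2)) -> list(k3, v3): sum the counts for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

A typical invocation would be along the lines of: hadoop jar wordcount.jar WordCount /input /output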
Map-Reduce on Hadoop
[Diagram: files in HDFS are divided into input splits; on each machine (Machine 1 through Machine M) a RecordReader turns its split into (key, value) pairs for a Map task; map output passes through a Combiner and then a Partitioner; each partition is shuffled to one of the Reducers (Reducer 1 through Reducer R), which write their output files back to HDFS.]
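The Partitioner in the diagram decides which Reducer receives each intermediate (key, value) pair; Hadoop's default hashes the key. A custom one is a small class. This variant, which routes all words starting with the same letter to one reducer, is purely illustrative:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Illustrative custom partitioner: all words with the same first letter
// end up on the same reducer.
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String s = key.toString();
        char first = s.isEmpty() ? '#' : Character.toLowerCase(s.charAt(0));
        return (first & Integer.MAX_VALUE) % numPartitions;
    }
}

It would be registered on the job with job.setPartitionerClass(FirstLetterPartitioner.class).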
Hadoop Eco-system
Hadoop Common: The common utilities that support the other Hadoop subprojects.
Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data.
Hadoop MapReduce: A software framework for distributed processing of large data sets
on compute clusters.
Other Hadoop-related projects at Apache include:
Avro: A data serialization system.
Cassandra: A scalable multi-master database with no single points of failure.
Chukwa: A data collection system for managing large distributed systems.
HBase: A scalable, distributed database that supports structured data storage for large tables.
Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying.
Mahout: A scalable machine learning and data mining library.
Pig: A high-level data-flow language and execution framework for parallel computation.
ZooKeeper: A high-performance coordination service for distributed applications.
Exercise task
You have time-series data (timestamp, ID, value) collected from 10,000 sensors every millisecond. Your central system stores this data and allows more than 500 users to concurrently access it and execute queries on it. While the last month of data is accessed most frequently, some analytics algorithms build models using historical data as well.
Task:
Provide an architecture for such a system that meets the following goals:
Fast
Available
Fair
Or, provide analytics algorithm and data-structure design considerations (e.g., k-means clustering or regression) for three months' worth of this data set.
End of session
Day 1: Introduction