
Training (Day 1)

Introduction

Big-data

Four parameters:
Velocity: Streaming data and large-volume data movement.
Volume: Scale from terabytes to zettabytes.
Variety: Manage the complexity of multiple relational and non-relational data types and schemas.
Voracity: Produced data has to be consumed quickly before it becomes meaningless.

Not just internet companies


Big Data Shouldn't Be a Silo
It must be an integrated part of the enterprise information architecture

Data >> Information >> Business Value


Retail: By combining data feeds on inventory, transactional records, social media and online trends, retailers can make real-time decisions around product and inventory mix adjustments, marketing and promotions, pricing or product quality issues.
Financial Services: By combining data across various groups and services such as financial markets, money management and lending, financial services companies can gain a comprehensive view of their individual customers and markets.
Government: By collecting and analyzing data across agencies, locations and employee groups, the government can reduce redundancies, identify productivity gaps, pinpoint consumer issues and find procurement savings across agencies.
Healthcare: Big data in healthcare could be used to help improve hospital operations, provide better tracking of procedural outcomes, help accelerate research and grow knowledge bases more rapidly. According to a 2011 Cleveland Clinic study, leveraging big data is estimated to drive $300B in annual value through an 8% systematic cost savings.

Processing Granularity

(Diagram: data size grows from small to large down the list; each processing granularity level maps to a platform.)

Pipelined / Instruction level -> Single-core
Concurrent / Thread level -> Multi-core
Service / Object level -> Cluster
Indexed / File level -> Grid of clusters
Mega / Block level -> Embarrassingly parallel processing (MapReduce, distributed file system)
Virtual / System level -> Cloud computing

Reference: Bina Ramamurthy, 2011

Single-core, single processor
Single-core, multi-processor
Multi-core, single processor
Multi-core, multi-processor
Cluster of processors (single or multi-core) with shared memory
Cluster of processors with distributed memory

Broadly, the approach in HPC is to distribute the work across a cluster of machines, which access a shared file system hosted by a SAN.

How to Process Big Data?

Need to process large datasets (>100 TB)
Just reading 100 TB of data can be overwhelming:
Takes ~11 days to read on a standard computer
Takes about a day across a 10 Gbit link (a very high-end storage solution)
On a single node (@50 MB/s): ~23 days
On a 1000-node cluster: ~33 minutes
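
These figures follow from simply dividing the data size by the available throughput; a minimal back-of-the-envelope sketch in Java (the throughput figures are illustrative assumptions, not measurements):

// Back-of-the-envelope read-time estimates for 100 TB; all throughput figures are assumptions.
public class ReadTimeEstimate {
    public static void main(String[] args) {
        double dataBytes = 100e12;       // 100 TB
        double standardDiskBps = 100e6;  // ~100 MB/s "standard computer" (assumed)
        double singleNodeBps = 50e6;     // 50 MB/s per cluster node
        double tenGbitBps = 10e9 / 8;    // 10 Gbit/s link = 1.25 GB/s
        int clusterNodes = 1000;

        System.out.printf("Standard computer:   %.1f days%n", dataBytes / standardDiskBps / 86400);
        System.out.printf("10 Gbit link:        %.1f days%n", dataBytes / tenGbitBps / 86400);
        System.out.printf("Single node @50MB/s: %.1f days%n", dataBytes / singleNodeBps / 86400);
        System.out.printf("1000-node cluster:   %.0f minutes%n", dataBytes / (singleNodeBps * clusterNodes) / 60);
    }
}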

Examples
Web logs;
RFID;
sensor networks;
social networks;
social data (due to the social data revolution),
Internet text and documents;
Internet search indexing;
call detail records;
astronomy,
atmospheric science,
genomics,
biogeochemical,
biological, and
other complex and/or interdisciplinary scientific research;
military surveillance;
medical records;
photography archives;
video archives; and
large-scale e-commerce.

Not so easy
Moving data from storage cluster to computation cluster is not feasible
In large clusters
Failure is expected, rather than exceptional.
In large clusters, computers fail every day
Data is corrupted or lost
Computations are disrupted
The number of nodes in a cluster may not be constant.
Nodes can be heterogeneous.
Very expensive to build reliability into each application
A programmer worries about errors, data motion, communication
Traditional debugging and performance tools don't apply
Need a common infrastructure and standard set of tools to handle this complexity
Efficient, scalable, fault-tolerant and easy to use

Why are Hadoop and MapReduce needed?


The answer to this question comes from another trend in disk drives:
seek time is improving more slowly than transfer rate.

Seeking is the process of moving the disk's head to a particular place on the disk to read or write data.
It characterizes the latency of a disk operation, whereas the transfer rate corresponds to a disk's bandwidth.
If the data access pattern is dominated by seeks, it will take longer to read or write large portions of the dataset than streaming through it, which operates at the transfer rate.
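
To make the seek-versus-streaming trade-off concrete, here is a minimal sketch; the disk parameters (1 TB dataset, 100-byte records, 10 ms average seek, 100 MB/s transfer rate) are illustrative assumptions, not figures from the slide:

// Seek-dominated updates vs. streaming the whole dataset; all figures are assumptions.
public class SeekVsStream {
    public static void main(String[] args) {
        double datasetBytes = 1e12;     // 1 TB dataset (assumed)
        double recordBytes = 100;       // 100-byte records (assumed)
        double seekSeconds = 0.010;     // 10 ms average seek (assumed)
        double transferBps = 100e6;     // 100 MB/s transfer rate (assumed)
        double updateFraction = 0.01;   // update 1% of the records

        double records = datasetBytes / recordBytes;
        double seekTime = records * updateFraction * seekSeconds;  // one seek per updated record
        double streamTime = datasetBytes / transferBps;            // read/rewrite everything sequentially

        System.out.printf("Seek-based updates of 1%% of records: %.1f days%n", seekTime / 86400);
        System.out.printf("Streaming the entire dataset:         %.1f hours%n", streamTime / 3600);
    }
}

Even touching only 1% of the records record-by-record is far slower than streaming and rewriting the entire dataset sequentially, which is the access pattern MapReduce is built around.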

Why are Hadoop and MapReduce needed?


On the other hand, for updating a small proportion of records in a database,
a traditional B-Tree (the data structure used in relational databases, which is
limited by the rate it can perform seeks) works well.
For updating the majority of a database, a B-Tree is less efficient than
MapReduce, which uses Sort/Merge to rebuild the database.

MapReduce can be seen as a complement to an RDBMS.


MapReduce is a good fit for problems that need to analyze the whole
dataset, in a batch fashion, particularly for ad hoc analysis.


Hadoop distributions
Apache Hadoop
Apache Hadoop-based Services for Windows Azure
Cloudera's Distribution Including Apache Hadoop (CDH)
Hortonworks Data Platform
IBM InfoSphere BigInsights
Platform Symphony MapReduce
MapR Hadoop Distribution
EMC Greenplum MR (using MapR's M5 Distribution)
Zettaset Data Platform
SGI Hadoop Clusters (uses Cloudera distribution)
Grand Logic JobServer
OceanSync Hadoop Management Software
Oracle Big Data Appliance (uses Cloudera distribution)

What's up with the names?

When naming software projects, Doug Cutting seems to have been inspired by his family.
Lucene is his wife's middle name, and her maternal grandmother's first name.
His son, as a toddler, used "Nutch" as the all-purpose word for meal, and later named a yellow stuffed elephant Hadoop.
Doug said he was looking for a name that wasn't already a web domain and wasn't trademarked, so "I tried various words that were in my life but not used by anybody else. Kids are pretty good at making up words."

Hadoop features
Distributed framework for processing and storing data, generally on commodity hardware.
Completely open source.
Written in Java
Runs on Linux, Mac OS X, Windows and Solaris.
Client apps can be written in various languages.
Scalable: store and process petabytes; scale by adding hardware
Economical: 1000s of commodity machines
Efficient: run tasks where the data is located
Reliable: data is replicated, failed tasks are rerun
Primarily used for batch data processing, not real-time / user-facing applications

Components of Hadoop
HDFS (Hadoop Distributed File System)
Modeled on GFS
Reliable, high-bandwidth file system that can store terabytes and petabytes of data.

Map-Reduce
Uses the map/reduce metaphor from the Lisp language
A distributed processing framework that processes data stored on HDFS as key-value pairs.

(Diagram: the MapReduce dataflow.) Input -> Map -> Shuffle & Sort -> Reduce -> Output. Clients (Client 1, Client 2) submit jobs to the processing framework; input data is read from the DFS, processed by the Map tasks, shuffled and sorted, aggregated by the Reduce tasks, and the output data is written back to the DFS.

HDFS
Very Large Distributed File System
10K nodes, 100 million files, 10 PB
Linearly scalable
Supports Large files (in GBs or TBs)

Economical
Uses Commodity Hardware
Nodes fail every day. Failure is expected, rather than exceptional.
The number of nodes in a cluster is not constant.

Optimized for Batch Processing

HDFS Goals
Highly fault-tolerant
runs on commodity HW, which can fail frequently

High throughput of data access


Streaming access to data

Large files
Typical file is gigabytes to terabytes in size
Support for tens of millions of files

Simple coherency
Write-once-read-many access model

HDFS: Files and Blocks

Data Organization
Data is organized into files and directories
Files are divided into uniform-sized large blocks
Typically 128 MB
Blocks are distributed across cluster nodes

Fault Tolerance
Blocks are replicated (default 3) to handle hardware failure
Replication is based on rack-awareness for performance and fault tolerance
Checksums of data are kept for corruption detection and recovery
The client reads both checksum and data from the DataNode; if the checksum fails, it tries other replicas

HDFS: Files and Blocks

High Throughput:
The client talks to both the NameNode and the DataNodes
Data is not sent through the NameNode
Throughput of the file system scales nearly linearly with the number of nodes
HDFS exposes block placement so that computation can be migrated to data
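
As an illustration of that last point, a minimal client-side sketch that uses the Hadoop FileSystem API to ask the NameNode where the blocks of a file live (the path /user/demo/data.txt is a hypothetical example):

// Lists which DataNodes host each block of a file, so work can be scheduled near the data.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/data.txt");   // hypothetical path
        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset " + block.getOffset()
                    + " length " + block.getLength()
                    + " hosts " + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}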

HDFS Components
NameNode
Manages namespace operations such as opening, creating and renaming files
Maps each file name to its list of blocks and their locations
File metadata
Authorization and authentication
Collects block reports from DataNodes on block locations
Replicates missing blocks
Keeps ALL namespace in memory, plus checkpoints & journal

DataNode
Handles block storage on multiple volumes and data integrity
Clients access blocks directly from DataNodes for read and write
DataNodes periodically send block reports to the NameNode
Block creation, deletion and replication upon instruction from the NameNode

HDFS Architecture

(Diagram.) The NameNode (the master) holds the metadata, mapping each file name to its list of blocks, e.g.:
name: /users/joeYahoo/myFile -> blocks: {1, 3}
name: /users/bobYahoo/someData.gzip -> blocks: {2, 4, 5}
The client fetches this metadata from the NameNode and then performs I/O directly against the DataNodes (the slaves), which store the replicated blocks.

Hadoop DFS Interface

Simple commands
hdfs dfs -ls, -du, -rm, -rmr
Uploading files
hdfs dfs -copyFromLocal foo mydata/foo
Downloading files
hdfs dfs -moveToLocal mydata/foo foo
hdfs dfs -cat mydata/foo
Admin
hdfs dfsadmin -report
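
The same operations are also available programmatically through the FileSystem API; a minimal sketch mirroring the commands above (foo and mydata/ are the same hypothetical names):

// Programmatic equivalents of -copyFromLocal, -ls and -cat using the FileSystem API.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class DfsOps {
    public static void main(String[] args) throws IOException {
        FileSystem fs = FileSystem.get(new Configuration());

        // hdfs dfs -copyFromLocal foo mydata/foo
        fs.copyFromLocalFile(new Path("foo"), new Path("mydata/foo"));

        // hdfs dfs -ls mydata
        for (FileStatus status : fs.listStatus(new Path("mydata"))) {
            System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }

        // hdfs dfs -cat mydata/foo
        try (FSDataInputStream in = fs.open(new Path("mydata/foo"))) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
        fs.close();
    }
}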

Map Reduce - Introduction

Parallel job processing framework
Written in Java
Close integration with HDFS
Provides:
Auto-partitioning of a job into sub-tasks
Auto-retry on failures
Linear scalability
Locality of task execution
Plugin-based framework for extensibility

Map-Reduce
MapReduce programs are executed in two main phases, called mapping and reducing.

In the mapping phase, MapReduce takes the input data and feeds each data element to the mapper.
In the reducing phase, the reducer processes all the outputs from the mapper and arrives at a final result.
The mapper is meant to filter and transform the input into something that the reducer can aggregate over.
MapReduce uses lists and (key/value) pairs as its main data primitives.

Map-Reduce

Map-Reduce Program
Based on two functions: Map and Reduce
Every Map/Reduce program must specify a Mapper and optionally a Reducer
Operates on key and value pairs

Map-Reduce works like a Unix pipeline:

cat input | grep  | sort           | uniq -c | cat > output
Input     | Map   | Shuffle & Sort | Reduce  | Output

cat /var/log/auth.log* | grep "session opened" | cut -d' ' -f10 | sort | uniq -c > ~/userlist

Map function: takes a key/value pair and generates a set of intermediate key/value pairs
map(k1, v1) -> list(k2, v2)
Reduce function: merges all intermediate values associated with the same intermediate key
reduce(k2, list(v2)) -> list(k3, v3)
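
For example, the classic word-count job fits these signatures directly: k1/v1 are a line's byte offset and text, k2/v2 are (word, 1), and k3/v3 are (word, total count). A minimal sketch against the org.apache.hadoop.mapreduce API (class names are illustrative):

// Word count: map emits (word, 1) pairs, reduce sums the counts per word.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // map(k1, v1) -> list(k2, v2): (byte offset, line) -> [(word, 1), ...]
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // reduce(k2, list(v2)) -> list(k3, v3): (word, [1, 1, ...]) -> (word, count)
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}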

Map-Reduce on Hadoop

Hadoop and its elements

(Diagram: end-to-end dataflow across Machine 1 .. Machine M.) Input files (File 1 .. File N) on HDFS are divided into splits (Split 1 .. Split M); on each machine a Record Reader turns its split into (key, value) pairs for a Map task (Map 1 .. Map M); map output passes through an optional Combiner (Combiner 1 .. Combiner C) and a Partitioner, which assigns each key to one of the partitions (Partition 1 .. Partition P); the partitions are shuffled to the Reducers (Reducer 1 .. Reducer R), and each reducer writes its output files (File 1 .. File O) back to HDFS.
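
To tie these elements together, a driver program wires the mapper, combiner and reducer to HDFS input and output paths and submits the job; a minimal sketch reusing the WordCount classes sketched earlier (paths and the reduce-task count are placeholders):

// Driver: configures and submits the job; here the combiner reuses the reducer logic.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setCombinerClass(WordCount.IntSumReducer.class);  // local pre-aggregation on each mapper
        job.setReducerClass(WordCount.IntSumReducer.class);
        job.setNumReduceTasks(2);                              // number of partitions / output files

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path("/user/demo/input"));   // placeholder paths
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/output"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}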

Hadoop Eco-system
Hadoop Common: The common utilities that support the other Hadoop subprojects.
Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data.
Hadoop MapReduce: A software framework for distributed processing of large data sets
on compute clusters.
Other Hadoop-related projects at Apache include:
Avro: A data serialization system.
Cassandra: A scalable multi-master database with no single points of failure.
Chukwa: A data collection system for managing large distributed systems.
HBase: A scalable, distributed database that supports structured data storage for large tables.
Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying.
Mahout: A scalable machine learning and data mining library.
Pig: A high-level data-flow language and execution framework for parallel computation.
ZooKeeper: A high-performance coordination service for distributed applications.

Exercise task
You have time-series data (timestamp, ID, value) collected from 10,000 sensors every millisecond. Your central system stores this data and allows more than 500 people to concurrently access it and execute queries on it. While the last month of data is accessed most frequently, some analytics algorithms build models using historical data as well.
Task:
Provide an architecture for such a system that meets the following goals:
Fast
Available
Fair

Or, provide analytics algorithm and data-structure design considerations (e.g. k-means clustering or regression) for three months' worth of this data.

Group / individual presentation

End of session
Day 1: Introduction
