
Training (Day 1)

Introduction

Big-data

Four parameters:
Velocity: Streaming data and large-volume data movement.
Volume: Scale from terabytes to zettabytes.
Variety: Manage the complexity of multiple relational and non-relational data types and schemas.
Voracity: Produced data has to be consumed quickly before it becomes meaningless.

Not just internet companies


Big Data Shouldn't Be a Silo
It must be an integrated part of the enterprise information architecture

Data >> Information >> Business Value


Retail: By combining data feeds on inventory, transactional records, social media and online trends, retailers can make real-time decisions around product and inventory mix adjustments, marketing and promotions, pricing or product quality issues.
Financial Services: By combining data across various groups and services such as financial markets, money management and lending, financial services companies can gain a comprehensive view of their individual customers and markets.
Government: By collecting and analyzing data across agencies, locations and employee groups, the government can reduce redundancies, identify productivity gaps, pinpoint consumer issues and find procurement savings across agencies.
Healthcare: Big data in healthcare could be used to help improve hospital operations, provide better tracking of procedural outcomes, help accelerate research and grow knowledge bases more rapidly. According to a 2011 Cleveland Clinic study, leveraging big data is estimated to drive $300B in annual value through an 8% systematic cost savings.

Processing Granularity

(Diagram: data size grows from small to large down the list; each processing granularity level maps to a platform.)

Pipelined / Instruction level -> Single-core
Concurrent / Thread level -> Multi-core
Service / Object level -> Cluster
Indexed / File level -> Grid of clusters
Mega / Block level -> Embarrassingly parallel processing (MapReduce, distributed file system)
Virtual / System level -> Cloud computing

Reference: Bina Ramamurthy, 2011

Single-core, single processor
Single-core, multi-processor
Multi-core, single processor
Multi-core, multi-processor
Cluster of processors (single or multi-core) with shared memory
Cluster of processors with distributed memory

Broadly, the approach in HPC is to distribute the work across a cluster of machines, which access a shared file system hosted by a SAN.

How to Process Big Data?

Need to process large datasets (>100 TB)
Just reading 100 TB of data can be overwhelming:
Takes ~11 days to read on a standard computer
Takes about a day across a 10 Gbit link (a very high-end storage solution)
On a single node (@50 MB/s): ~23 days
On a 1000-node cluster: ~33 minutes
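
These figures follow from simply dividing the data size by the available throughput; a minimal back-of-the-envelope sketch in Java (the throughput figures are illustrative assumptions, not measurements):

// Back-of-the-envelope read-time estimates for 100 TB; all throughput figures are assumptions.
public class ReadTimeEstimate {
    public static void main(String[] args) {
        double dataBytes = 100e12;       // 100 TB
        double standardDiskBps = 100e6;  // ~100 MB/s "standard computer" (assumed)
        double singleNodeBps = 50e6;     // 50 MB/s per cluster node
        double tenGbitBps = 10e9 / 8;    // 10 Gbit/s link = 1.25 GB/s
        int clusterNodes = 1000;

        System.out.printf("Standard computer:   %.1f days%n", dataBytes / standardDiskBps / 86400);
        System.out.printf("10 Gbit link:        %.1f days%n", dataBytes / tenGbitBps / 86400);
        System.out.printf("Single node @50MB/s: %.1f days%n", dataBytes / singleNodeBps / 86400);
        System.out.printf("1000-node cluster:   %.0f minutes%n", dataBytes / (singleNodeBps * clusterNodes) / 60);
    }
}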

Examples
Web logs;
RFID;
sensor networks;
social networks;
social data (due to the social data revolution),
Internet text and documents;
Internet search indexing;
call detail records;
astronomy,
atmospheric science,
genomics,
biogeochemical,
biological, and
other complex and/or interdisciplinary scientific research;
military surveillance;
medical records;
photography archives;
video archives; and
large-scale e-commerce.

Not so easy
Moving data from storage cluster to computation cluster is not feasible
In large clusters
Failure is expected, rather than exceptional.
In large clusters, computers fail every day
Data is corrupted or lost
Computations are disrupted
The number of nodes in a cluster may not be constant.
Nodes can be heterogeneous.
Very expensive to build reliability into each application
A programmer worries about errors, data motion, communication
Traditional debugging and performance tools don't apply
Need a common infrastructure and standard set of tools to handle this complexity
Efficient, scalable, fault-tolerant and easy to use

Why are Hadoop and MapReduce needed?


The answer to this question comes from another trend in disk drives:
seek time is improving more slowly than transfer rate.

Seeking is the process of moving the disk's head to a particular place on the disk to read or write data.
It characterizes the latency of a disk operation, whereas the transfer rate corresponds to a disk's bandwidth.
If the data access pattern is dominated by seeks, it will take longer to read or write large portions of the dataset than streaming through it, which operates at the transfer rate.
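
To make the seek-versus-streaming trade-off concrete, here is a minimal sketch; the disk parameters (1 TB dataset, 100-byte records, 10 ms average seek, 100 MB/s transfer rate) are illustrative assumptions, not figures from the slide:

// Seek-dominated updates vs. streaming the whole dataset; all figures are assumptions.
public class SeekVsStream {
    public static void main(String[] args) {
        double datasetBytes = 1e12;     // 1 TB dataset (assumed)
        double recordBytes = 100;       // 100-byte records (assumed)
        double seekSeconds = 0.010;     // 10 ms average seek (assumed)
        double transferBps = 100e6;     // 100 MB/s transfer rate (assumed)
        double updateFraction = 0.01;   // update 1% of the records

        double records = datasetBytes / recordBytes;
        double seekTime = records * updateFraction * seekSeconds;  // one seek per updated record
        double streamTime = datasetBytes / transferBps;            // read/rewrite everything sequentially

        System.out.printf("Seek-based updates of 1%% of records: %.1f days%n", seekTime / 86400);
        System.out.printf("Streaming the entire dataset:         %.1f hours%n", streamTime / 3600);
    }
}

Even touching only 1% of the records record-by-record is far slower than streaming and rewriting the entire dataset sequentially, which is the access pattern MapReduce is built around.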

Why are Hadoop and MapReduce needed?


On the other hand, for updating a small proportion of records in a database,
a traditional B-Tree (the data structure used in relational databases, which is
limited by the rate it can perform seeks) works well.
For updating the majority of a database, a B-Tree is less efficient than
MapReduce, which uses Sort/Merge to rebuild the database.

MapReduce can be seen as a complement to an RDBMS.


MapReduce is a good fit for problems that need to analyze the whole
dataset, in a batch fashion, particularly for ad hoc analysis.


Hadoop distributions
Apache Hadoop
Apache Hadoop-based Services for Windows Azure
Cloudera's Distribution Including Apache Hadoop (CDH)
Hortonworks Data Platform
IBM InfoSphere BigInsights
Platform Symphony MapReduce
MapR Hadoop Distribution
EMC Greenplum MR (using MapR's M5 Distribution)
Zettaset Data Platform
SGI Hadoop Clusters (uses Cloudera distribution)
Grand Logic JobServer
OceanSync Hadoop Management Software
Oracle Big Data Appliance (uses Cloudera distribution)

What's up with the names?

When naming software projects, Doug Cutting seems to have been inspired by his family.
Lucene is his wife's middle name, and her maternal grandmother's first name.
His son, as a toddler, used "Nutch" as the all-purpose word for meal, and later named a yellow stuffed elephant Hadoop.
Doug said he was looking for a name that wasn't already a web domain and wasn't trademarked, so "I tried various words that were in my life but not used by anybody else. Kids are pretty good at making up words."

Hadoop features
Distributed framework for processing and storing data, generally on commodity hardware.
Completely open source.
Written in Java
Runs on Linux, Mac OS X, Windows and Solaris.
Client apps can be written in various languages.
Scalable: store and process petabytes; scale by adding hardware
Economical: 1000s of commodity machines
Efficient: run tasks where the data is located
Reliable: data is replicated, failed tasks are rerun
Primarily used for batch data processing, not real-time / user-facing applications

Components of Hadoop
HDFS (Hadoop Distributed File System)
Modeled on GFS
Reliable, high-bandwidth file system that can store terabytes and petabytes of data.

Map-Reduce
Uses the map/reduce metaphor from the Lisp language
A distributed processing framework that processes data stored on HDFS as key-value pairs.

(Diagram: the MapReduce dataflow.) Input -> Map -> Shuffle & Sort -> Reduce -> Output. Clients (Client 1, Client 2) submit jobs to the processing framework; input data is read from the DFS, processed by the Map tasks, shuffled and sorted, aggregated by the Reduce tasks, and the output data is written back to the DFS.

HDFS
Very Large Distributed File System
10K nodes, 100 million files, 10 PB
Linearly scalable
Supports Large files (in GBs or TBs)

Economical
Uses Commodity Hardware
Nodes fail every day. Failure is expected, rather than exceptional.
The number of nodes in a cluster is not constant.

Optimized for Batch Processing

HDFS Goals
Highly fault-tolerant
runs on commodity HW, which can fail frequently

High throughput of data access


Streaming access to data

Large files
Typical file is gigabytes to terabytes in size
Support for tens of millions of files

Simple coherency
Write-once-read-many access model

HDFS: Files and Blocks

Data Organization
Data is organized into files and directories
Files are divided into uniform-sized large blocks
Typically 128 MB
Blocks are distributed across cluster nodes

Fault Tolerance
Blocks are replicated (default 3) to handle hardware failure
Replication is based on rack-awareness for performance and fault tolerance
Checksums of data are kept for corruption detection and recovery
The client reads both checksum and data from the DataNode; if the checksum fails, it tries other replicas

HDFS: Files and Blocks

High Throughput:
The client talks to both the NameNode and the DataNodes
Data is not sent through the NameNode
Throughput of the file system scales nearly linearly with the number of nodes
HDFS exposes block placement so that computation can be migrated to data
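
As an illustration of that last point, a minimal client-side sketch that uses the Hadoop FileSystem API to ask the NameNode where the blocks of a file live (the path /user/demo/data.txt is a hypothetical example):

// Lists which DataNodes host each block of a file, so work can be scheduled near the data.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/data.txt");   // hypothetical path
        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset " + block.getOffset()
                    + " length " + block.getLength()
                    + " hosts " + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}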

HDFS Components
NameNode
Manages namespace operations such as opening, creating and renaming files
Maps each file name to its list of blocks and their locations
File metadata
Authorization and authentication
Collects block reports from DataNodes on block locations
Replicates missing blocks
Keeps ALL namespace in memory, plus checkpoints & journal

DataNode
Handles block storage on multiple volumes and data integrity
Clients access blocks directly from DataNodes for read and write
DataNodes periodically send block reports to the NameNode
Block creation, deletion and replication upon instruction from the NameNode

HDFS Architecture

(Diagram.) The NameNode (the master) holds the metadata, mapping each file name to its list of blocks, e.g.:
name: /users/joeYahoo/myFile -> blocks: {1, 3}
name: /users/bobYahoo/someData.gzip -> blocks: {2, 4, 5}
The client fetches this metadata from the NameNode and then performs I/O directly against the DataNodes (the slaves), which store the replicated blocks.

Hadoop DFS Interface

Simple commands
hdfs dfs -ls, -du, -rm, -rmr
Uploading files
hdfs dfs -copyFromLocal foo mydata/foo
Downloading files
hdfs dfs -moveToLocal mydata/foo foo
hdfs dfs -cat mydata/foo
Admin
hdfs dfsadmin -report
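
The same operations are also available programmatically through the FileSystem API; a minimal sketch mirroring the commands above (foo and mydata/ are the same hypothetical names):

// Programmatic equivalents of -copyFromLocal, -ls and -cat using the FileSystem API.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class DfsOps {
    public static void main(String[] args) throws IOException {
        FileSystem fs = FileSystem.get(new Configuration());

        // hdfs dfs -copyFromLocal foo mydata/foo
        fs.copyFromLocalFile(new Path("foo"), new Path("mydata/foo"));

        // hdfs dfs -ls mydata
        for (FileStatus status : fs.listStatus(new Path("mydata"))) {
            System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }

        // hdfs dfs -cat mydata/foo
        try (FSDataInputStream in = fs.open(new Path("mydata/foo"))) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
        fs.close();
    }
}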

Map Reduce - Introduction

Parallel job processing framework
Written in Java
Close integration with HDFS
Provides:
Auto-partitioning of a job into sub-tasks
Auto-retry on failures
Linear scalability
Locality of task execution
Plugin-based framework for extensibility

Map-Reduce
MapReduce programs are executed in two main phases, called mapping and reducing.

In the mapping phase, MapReduce takes the input data and feeds each data element to the mapper.
In the reducing phase, the reducer processes all the outputs from the mapper and arrives at a final result.
The mapper is meant to filter and transform the input into something that the reducer can aggregate over.
MapReduce uses lists and (key/value) pairs as its main data primitives.

Map-Reduce

Map-Reduce Program
Based on two functions: Map and Reduce
Every Map/Reduce program must specify a Mapper and optionally a Reducer
Operates on key and value pairs

Map-Reduce works like a Unix pipeline:

cat input | grep  | sort           | uniq -c | cat > output
Input     | Map   | Shuffle & Sort | Reduce  | Output

cat /var/log/auth.log* | grep "session opened" | cut -d' ' -f10 | sort | uniq -c > ~/userlist

Map function: takes a key/value pair and generates a set of intermediate key/value pairs
map(k1, v1) -> list(k2, v2)
Reduce function: merges all intermediate values associated with the same intermediate key
reduce(k2, list(v2)) -> list(k3, v3)
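
For example, the classic word-count job fits these signatures directly: k1/v1 are a line's byte offset and text, k2/v2 are (word, 1), and k3/v3 are (word, total count). A minimal sketch against the org.apache.hadoop.mapreduce API (class names are illustrative):

// Word count: map emits (word, 1) pairs, reduce sums the counts per word.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // map(k1, v1) -> list(k2, v2): (byte offset, line) -> [(word, 1), ...]
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // reduce(k2, list(v2)) -> list(k3, v3): (word, [1, 1, ...]) -> (word, count)
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}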

Map-Reduce on Hadoop

Hadoop and its elements

(Diagram: end-to-end dataflow across Machine 1 .. Machine M.) Input files (File 1 .. File N) on HDFS are divided into splits (Split 1 .. Split M); on each machine a Record Reader turns its split into (key, value) pairs for a Map task (Map 1 .. Map M); map output passes through an optional Combiner (Combiner 1 .. Combiner C) and a Partitioner, which assigns each key to one of the partitions (Partition 1 .. Partition P); the partitions are shuffled to the Reducers (Reducer 1 .. Reducer R), and each reducer writes its output files (File 1 .. File O) back to HDFS.
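
To tie these elements together, a driver program wires the mapper, combiner and reducer to HDFS input and output paths and submits the job; a minimal sketch reusing the WordCount classes sketched earlier (paths and the reduce-task count are placeholders):

// Driver: configures and submits the job; here the combiner reuses the reducer logic.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setCombinerClass(WordCount.IntSumReducer.class);  // local pre-aggregation on each mapper
        job.setReducerClass(WordCount.IntSumReducer.class);
        job.setNumReduceTasks(2);                              // number of partitions / output files

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path("/user/demo/input"));   // placeholder paths
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/output"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}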

Hadoop Eco-system
Hadoop Common: The common utilities that support the other Hadoop subprojects.
Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data.
Hadoop MapReduce: A software framework for distributed processing of large data sets
on compute clusters.
Other Hadoop-related projects at Apache include:
Avro: A data serialization system.
Cassandra: A scalable multi-master database with no single points of failure.
Chukwa: A data collection system for managing large distributed systems.
HBase: A scalable, distributed database that supports structured data storage for large tables.
Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying.
Mahout: A scalable machine learning and data mining library.
Pig: A high-level data-flow language and execution framework for parallel computation.
ZooKeeper: A high-performance coordination service for distributed applications.

Exercise task
You have time-series data (timestamp, ID, value) collected from 10,000 sensors every millisecond. Your central system stores this data and allows more than 500 people to concurrently access it and execute queries on it. While the last month of data is accessed most frequently, some analytics algorithms build models using historical data as well.
Task:
Provide an architecture for such a system that meets the following goals:
Fast
Available
Fair

Or, provide analytics algorithm and data-structure design considerations (e.g. k-means clustering or regression) for three months' worth of this data.

Group / individual presentation

End of session
Day 1: Introduction
