
SCHEDULE

MODULE 1. Introduction to Big Data and Hadoop

MODULE 2. HDFS Internals, Hadoop configuration, and Data loading

MODULE 3. Introduction to MapReduce

MODULE 4. Advanced MapReduce concepts

MODULE 5. Introduction to Pig

MODULE 6. Advanced Pig and Introduction to Hive

MODULE 7. Advanced Hive

MODULE 8. Extending Hive and Introduction to HBase

MODULE 9. Advanced HBase and Oozie

MODULE 10. Project set-up Discussion

MODULE 1. Introduction to Big Data and Hadoop

Big data is a popular term used to describe the exponential growth and availability of data, both structured and unstructured.

More data may lead to more accurate analyses. More accurate analyses may lead to more confident decision
making. And better decisions can mean greater operational efficiencies, cost reductions and reduced risk.

The hopeful vision is that organizations will be able to take data from any source, harness relevant data and
analyze it to find answers that enable 1) cost reductions, 2) time reductions, 3) new product development and
optimized offerings, and 4) smarter business decision making.

Big data analytics is often associated with cloud computing because the analysis of large data sets in real time requires a platform like Hadoop to store large data sets across a distributed cluster and MapReduce to coordinate, combine and process data from multiple sources.

Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a
distributed computing environment.

Hadoop makes it possible to run applications on systems with thousands of nodes involving thousands of
terabytes. Its distributed file system facilitates rapid data transfer rates among nodes and allows the system to
continue operating uninterrupted in case of a node failure.

This approach lowers the risk of system failure, even if a significant number of nodes become inoperative.

Companies that need to process large and varied data sets frequently look to Apache Hadoop as a potential
tool, because it offers the ability to process, store and manage huge amounts of both structured and
unstructured data.

The open source Hadoop framework is built on top of a distributed file system and a cluster architecture that
enable it to transfer data rapidly and continue operating even if one or more compute nodes fail.

But Hadoop isn't a cure-all system for big data application needs as a whole. And while big-name Internet
companies like Yahoo, Facebook, Twitter, eBay and Google are prominent users of the technology, Hadoop
projects are new undertakings for many other types of organizations.

Big data also comes from scientific fields such as genomics and astronomy.

WHAT IS BIG DATA

Huge amounts of data (TB or PB), big in volume and constantly in motion

Velocity : real-time capture and real-time analytics

Volume : petabytes per day/week

Variety : unstructured data (web logs, audio, video, images) as well as structured data

Such data cannot be stored affordably on a single physical machine, so storage is distributed across multiple systems

The file system therefore has to be a distributed one

Big Data in Industry


1. Financial services : 22%
2. Technology : 16%
3. Telecommunications : 14%
4. Retail : 9%
5. Government : 7%

Big Data is a collection of large datasets that cannot be processed using traditional computing techniques. It is not a
single technique or tool; rather, it involves many areas of business and technology.
The data in it can be of three types:

Structured data : Relational data.

Semi Structured data : XML data.

Unstructured data : Word, PDF, Text, Media Logs.

Benefits of Big Data

Using information from social media, such as the preferences and product perceptions of their consumers, product
companies and retail organizations can plan their production.

Using data about the previous medical history of patients, hospitals can provide better and quicker
service.

Big Data Technologies

Big data technologies are important in providing more accurate analysis, which may lead to more concrete
decision-making resulting in greater operational efficiencies, cost reductions, and reduced risks for the business.

There are various technologies in the market from different vendors including Amazon, IBM, Microsoft, etc., to
handle big data.

CHALLENGES ASSOCIATED WITH Big Data


1. Storage
2. Capture
3. Sharing
4. Visualization
5. Curation

Storage : Some vendors are using increased memory and powerful parallel processing to crunch large
volumes of data extremely quickly. Another method is to put the data in memory and use a grid computing
approach, where many machines are used to solve a problem. Both approaches allow organizations to explore
huge data volumes and gain business insights in near-real time.

Capture : Even if you can capture and analyze data quickly and put it in the proper context for the audience
that will be consuming the information, the value of the data for decision-making purposes is undermined
if the data is not accurate or timely. This is a challenge with any data analysis, but when considering the
volumes of information involved in big data projects, it becomes even more pronounced.

Sharing :

REAL TIME PROCESSING

Real time data processing involves a continual input, process and output of data

Data processing time is typically very small (a fraction of a second)

Examples

Complex event processing (CEP) platforms, which combine data from multiple sources to detect patterns and
attempt to identify either opportunities or threats

Operational intelligence (OI) platforms, which use real-time data processing and CEP to gain insight into
operations by running query analysis against live feeds and event data

OI is near-real-time analytics over operational data and provides visibility across many data sources. The goal
is to obtain near-real-time insights using continuous analytics so that the organization can take immediate
action

BATCH PROCESSING

Executing a series of non-interactive jobs all at one time.

Batch jobs can be stored up during working hours and then executed during the evening or whenever the
computer is idle.

Batch processing is an efficient and preferred way of processing high volumes of data

Data processing programs are run over a group of transactions collected over an agreed business time period

Data is collected, entered and processed, and then the batch results are produced for every batch window. Batch
processing requires separate programs for input, processing and output

Examples

An example of batch processing is the way that credit card companies process billing. The customer does not
receive a bill for each separate credit card purchase but one monthly bill for all of that month's purchases. The
bill is created through batch processing, where all of the data are collected and held until the bill is processed
as a batch at the end of the billing cycle (a toy sketch of this appears below).

Financial reporting and forecasting
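As a toy illustration of the credit-card billing example above (plain Java, not Hadoop code), the sketch below collects the purchases of one batch window, groups them by customer and produces one bill per customer; the Purchase record, the customer ids and the amounts are invented for the example.

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class BillingBatch {

    // Hypothetical record representing one card purchase captured during the batch window.
    record Purchase(String customerId, double amount) {}

    public static void main(String[] args) {
        List<Purchase> window = List.of(
                new Purchase("C1", 25.00),
                new Purchase("C2", 10.50),
                new Purchase("C1", 4.75));

        // One pass over the whole batch: group by customer and total the amounts.
        Map<String, Double> bills = window.stream()
                .collect(Collectors.groupingBy(Purchase::customerId,
                        Collectors.summingDouble(Purchase::amount)));

        // Each customer gets a single bill for the whole window.
        bills.forEach((customer, total) ->
                System.out.printf("Bill for %s: %.2f%n", customer, total));
    }
}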

BIG DATA BATCH PROCESSING


Diagram: operational data, social data, historic data and service data are Extracted, Transformed and Loaded (ETL) into the big data analysis platform, which feeds BI.

HOW HADOOP CAME INTO EXISTENCE

Hadoop and Big Data have become synonymous, but they are two different things. Hadoop is a parallel
programming model implemented on a bunch of low-cost clustered processors, and it is intended to
support data-intensive distributed applications. That is what Hadoop is all about.

Due to the advent of new technologies, devices, and communication means like social networking sites, the
amount of data produced by mankind is growing rapidly every year.

Traditionally, an enterprise has a computer to store and process big data. For storage, the programmers
take the help of their choice of database vendors such as Oracle, IBM, etc. In this approach, the user interacts
with the application, which in turn handles the part of data storage and analysis.

This approach works fine for applications that process less voluminous data that can be accommodated
by standard database servers, or up to the limit of the processor that is processing the data. But when it comes to
dealing with huge amounts of scalable data, processing such data through a single database becomes a
bottleneck.

Google solved this problem using an algorithm called MapReduce. This algorithm divides the task into small
parts, assigns them to many computers, and collects the results from them, which, when integrated, form the
result dataset. Using the solution provided by Google, Doug Cutting and his team developed an open source
project called HADOOP.

WHAT IS HADOOP

Hadoop is an Apache open source framework written in Java that allows distributed processing of large
datasets across clusters of computers using simple programming models.

The Hadoop framework application works in an environment that provides distributed storage and
computation across clusters of computers.

Hadoop is designed to scale up from a single server to thousands of machines, each offering local computation
and storage.

Hadoop supports any type of data

Hadoop is best suited to big data batch processing

Hadoop = Storage + Compute grid

As mentioned above, Hadoop is a place to store data and is known for its distributed file system

Concepts that come under Hadoop include HDFS, MapReduce, Pig, Hive, Sqoop, Flume and HBase

Core components of Hadoop

HDFS

MapReduce

HADOOP KEY CHARACTERISTICS


Reliable

Scalable

Flexible

Economical

HADOOP KEY DIFFERENTIATORS


Robust

Accessible

Scalable

Simple

Hadoop is a system for large scale data processing. It has two main components:

HDFS : distributed across nodes, natively redundant, self-healing, high-bandwidth clustered storage; the NameNode tracks block locations

MapReduce : splits a task across processors near the data and assembles the results; the JobTracker manages the TaskTrackers

MODULE 2. Hadoop Distributed File System

The most common file system used by Hadoop is the Hadoop Distributed File System (HDFS).

It is designed to run on large clusters (thousands of computers) of small computer machines in a reliable,
fault-tolerant manner.

HDFS uses a master/slave architecture where master consists of a single Name Node that manages the file
system metadata and one or more slave Data Nodes that store the actual data.

A file in an HDFS namespace is split into several blocks and those blocks are stored in a set of Data Nodes.
The Name Node determines the mapping of blocks to the Data Nodes.

The Data Nodes take care of read and write operations on the file system. They also take care of block
creation, deletion and replication based on instructions given by the Name Node.

HDFS provides a shell like any other file system, and a list of commands is available to interact with the file
system.

The file system is a storage-side component.
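The shell commands referred to above (for example hadoop fs -ls, hadoop fs -put, hadoop fs -cat) have a programmatic counterpart in the Java FileSystem API. A minimal sketch that lists a directory, assuming the cluster configuration files are on the classpath and that /user/demo is an existing directory (both are assumptions for the example):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListHdfsDirectory {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Programmatic equivalent of "hadoop fs -ls /user/demo" (the path is just an example).
        for (FileStatus status : fs.listStatus(new Path("/user/demo"))) {
            System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }
        fs.close();
    }
}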

HDFS COMPONENTS

1. Name node : http://localhost:50070/dfshealth.jsp

On the storage side it acts as the master of the system; HDFS has only one Name node

It maintains, manages and administers the data blocks present on the Data nodes

The default data block size is 64 MB (it can be changed; see the configuration sketch at the end of this subsection)

The Name node determines the mapping of blocks to the Data nodes

For read/write operations to be fast, the seek time should be small

Increasing the block size reduces the share of time spent seeking relative to streaming the data, so read/write
operations become faster overall

The Name node also keeps track of the overall file directory structure

If the Name node fails, a backup of its metadata must be maintained

So here the Secondary Name node acts as the backup

We have to configure a Secondary Name node
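A minimal sketch of the block-size configuration mentioned above, assuming a Hadoop 2.x client where the property is named dfs.blocksize (in Hadoop 1.x it was dfs.block.size) and assuming the default file system is HDFS; the 128 MB value is only an example, and the same property is normally set cluster-wide in hdfs-site.xml:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class BlockSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Override the default block size (64 MB in older releases) with 128 MB for this client.
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024);

        FileSystem fs = FileSystem.get(conf);
        System.out.println("Default block size now: " + fs.getDefaultBlockSize());
        fs.close();
    }
}

A per-file block size can also be passed directly to the FileSystem.create overload that takes a blockSize argument.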

Secondary Name node :

The Secondary Name node gets the metadata from the Name node every hour

If the Name node fails in the middle of an hour, the changes made since the last copy cannot be traced back


Diagram: the Secondary Name node takes the metadata from the Name node every hour and keeps a secure copy.
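The one-hour interval above is just the default; it is controlled by a checkpoint-period property (fs.checkpoint.period in Hadoop 1.x, dfs.namenode.checkpoint.period in Hadoop 2.x, both in seconds). A minimal sketch that reads the effective value from the client configuration, assuming the Hadoop 1.x-style property name:

import org.apache.hadoop.conf.Configuration;

public class CheckpointPeriod {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Reads the value from the configuration files on the classpath; 3600 s (one hour) is the default.
        long seconds = conf.getLong("fs.checkpoint.period", 3600);
        System.out.println("Secondary Name node checkpoints every " + seconds + " seconds");
    }
}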

HDFS ARCHITECTURE

Diagram: clients send metadata operations (open, create, get block locations) to the Name node, while the actual reads and writes go directly to the Data nodes, which are spread across racks (Rack 1, Rack 2). The Name node sends block operations and replication instructions to the Data nodes.

RACK AWARENESS

Diagram: the blocks of File 1, File 2 and File 3 are each replicated three times and spread across Rack 1, Rack 2 and Rack 3, so that the failure of a single rack does not lose all copies of any block.

HDFS FILE WRITE OPERATION


Diagram: the flow of an HDFS file write.

1. The user opens the HDFS client (the Distributed File System object).
2. The client issues a create request, which goes to the Name node.
3. The Name node shows the locations of the Data nodes that should hold the blocks.
4. The client writes the data to the first Data node, which forwards it along a pipeline to the other Data nodes.
5. Acknowledgement packets travel back up the pipeline to the client.
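A minimal sketch of this write flow from the client's point of view, using the Java FileSystem API; the path /user/demo/hello.txt and its contents are assumptions for the example, and the Data node pipeline and acknowledgements happen inside the output stream:

import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Step 2 above: "create" goes to the Name node, which picks the Data nodes.
        FSDataOutputStream out = fs.create(new Path("/user/demo/hello.txt"));

        // Step 4: the bytes are streamed to the first Data node and replicated down the pipeline.
        out.write("hello HDFS\n".getBytes(StandardCharsets.UTF_8));

        // Closing flushes the last packet and waits for the acknowledgements (step 5).
        out.close();
        fs.close();
    }
}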

HDFS FILE READ OPERATION


Diagram: the flow of an HDFS file read.

1. The user opens the file through the HDFS client (the Distributed File System object).
2. The client gets the block locations from the Name node.
3. An FS Data Input Stream reads the blocks directly from the Data nodes that hold them.
4. The user reads the data from the stream.
5. The stream is closed.
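The corresponding read flow, again a minimal sketch with the Java FileSystem API, reading back the example file assumed in the write sketch; IOUtils.copyBytes simply copies the stream to standard output:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Steps 1-2: open asks the Name node for the block locations.
        FSDataInputStream in = fs.open(new Path("/user/demo/hello.txt"));

        // Steps 3-4: the stream reads the blocks directly from the Data nodes.
        IOUtils.copyBytes(in, System.out, 4096, false);

        // Step 5: close the stream.
        in.close();
        fs.close();
    }
}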

MAP REDUCE FRAMEWORK


Mapper : it takes the processing to the data and allows the data to be processed in parallel

Reducer : it is responsible for processing one or more values which share a common key

Diagram: persistent input data is split into input key-value pairs that are handled by several Map tasks; the transient intermediate data is shuffled to the Reduce tasks, which produce the output key-value pairs written back to persistent storage.
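A minimal sketch of a Mapper and Reducer pair, the classic word count, using the org.apache.hadoop.mapreduce API (Hadoop 2.x style Job.getInstance; on Hadoop 1.x the job would be built with new Job(conf, ...)); the input and output paths are taken from the command line and the class and job names are only examples:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // The Mapper takes the processing to the data: each map task reads one input split
    // and emits an intermediate (word, 1) pair for every word it sees.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // The Reducer processes all values that share a common key: it sums the 1s
    // for each word and writes the final (word, count) pair.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));      // persistent input data
        FileOutputFormat.setOutputPath(job, new Path(args[1]));    // persistent output data
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

It would typically be packaged into a jar and submitted with hadoop jar wordcount.jar WordCount <input dir> <output dir> (the jar name is an example).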
