
SCHEDULE

MODULE 1. Introduction to Big Data and Hadoop

MODULE 2. HDFS Internals, Hadoop configuration, and Data loading

MODULE 3. Introduction to MapReduce

MODULE 4. Advanced MapReduce concepts

MODULE 5. Introduction to Pig

MODULE 6. Advanced Pig and Introduction to Hive

MODULE 7. Advanced Hive

MODULE 8. Extending Hive and Introduction to HBase

MODULE 9. Advanced HBase and Oozie

MODULE 10. Project set-up Discussion

MODULE 1. Introduction to Big Data and Hadoop

Big data is a popular term used to describe the exponential growth and availability of data, both structured and unstructured.

More data may lead to more accurate analyses. More accurate analyses may lead to more confident decision
making. And better decisions can mean greater operational efficiencies, cost reductions and reduced risk.

The hopeful vision is that organizations will be able to take data from any source, harness relevant data and
analyze it to find answers that enable 1) cost reductions, 2) time reductions, 3) new product development and
optimized offerings, and 4) smarter business decision making.

Big data analytics is often associated with cloud computing because the analysis of large data sets in real time requires a platform like Hadoop to store large data sets across a distributed cluster and MapReduce to coordinate, combine and process data from multiple sources.

Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a
distributed computing environment.

Hadoop makes it possible to run applications on systems with thousands of nodes involving thousands of
terabytes. Its distributed file system facilitates rapid data transfer rates among nodes and allows the system to
continue operating uninterrupted in case of a node failure.

This approach lowers the risk of system failure, even if a significant number of nodes become inoperative.

Companies that need to process large and varied data sets frequently look to Apache Hadoop as a potential
tool, because it offers the ability to process, store and manage huge amounts of both structured and
unstructured data.

The open source Hadoop framework is built on top of a distributed file system and a cluster architecture that
enable it to transfer data rapidly and continue operating even if one or more compute nodes fail.

But Hadoop isn't a cure-all system for big data application needs as a whole. And while big-name Internet
companies like Yahoo, Facebook, Twitter, eBay and Google are prominent users of the technology, Hadoop
projects are new undertakings for many other types of organizations.

Big data also comes from scientific fields such as genomics and astronomy.

WHAT IS BIG DATA

Huge amounts of data (TB or PB), big in volume and constantly in motion

Velocity : real-time capture and real-time analytics

Volume : petabytes per day/week

Variety : unstructured data (web logs, audio, video, images) as well as structured data

Such data cannot be stored affordably on a single physical machine, so storage is distributed across multiple systems

The file system therefore has to be a distributed one

Big Data in Industry


1. Financial services : 22%
2. Technology : 16%
3. Telecommunications : 14%
4. Retail : 9%
5. Government : 7%

Big Data is a collection of large datasets that cannot be processed using traditional computing techniques. It is not a
single technique or tool; rather, it involves many areas of business and technology.
The data in it can be of three types:

Structured data : Relational data.

Semi Structured data : XML data.

Unstructured data : Word, PDF, Text, Media Logs.

Benefits of Big Data

Using information from social media, such as the preferences and product perceptions of their consumers, product
companies and retail organizations can plan their production.

Using data about the previous medical history of patients, hospitals can provide better and quicker
service.

Big Data Technologies

Big data technologies are important in providing more accurate analysis, which may lead to more concrete
decision-making resulting in greater operational efficiencies, cost reductions, and reduced risks for the business.

There are various technologies in the market from different vendors including Amazon, IBM, Microsoft, etc., to
handle big data.

CHALLENGES ASSOCIATED WITH Big Data


1. Storage
2. Capture
3. Sharing
4. Visualization
5. Curation

Storage : Some vendors are using increased memory and powerful parallel processing to crunch large
volumes of data extremely quickly. Another method is to put the data in memory and use a grid computing
approach, where many machines are used to solve a problem. Both approaches allow organizations to explore
huge data volumes and gain business insights in near-real time.

Capture : Even if you can capture and analyze data quickly and put it in the proper context for the audience
that will be consuming the information, the value of the data for decision-making purposes is undermined
if the data is not accurate or timely. This is a challenge with any data analysis, but when considering the
volumes of information involved in big data projects, it becomes even more pronounced.

Sharing :

REAL TIME PROCESSING

Real time data processing involves a continual input, process and output of data

Data processing time is typically very small (a fraction of a second)

Examples

Complex event processing (CEP) platforms, which combine data from multiple sources to detect patterns and
attempt to identify either opportunities or threats

Operational intelligence (OI) platforms, which use real-time data processing and CEP to gain insight into
operations by running query analysis against live feeds and event data

OI is near-real-time analytics over operational data and provides visibility across many data sources. The goal
is to obtain near-real-time insights using continuous analytics so that the organization can take immediate
action

BATCH PROCESSING

Executing a series of non-interactive jobs all at one time.

Batch jobs can be stored up during working hours and then executed during the evening or whenever the
computer is idle.

Batch processing is an efficient and preferred way of processing high volumes of data

Data processing programs are run over a group of transactions collected over an agreed business time period

Data is collected, entered and processed, and then the batch results are produced for every batch window. Batch
processing requires separate programs for input, processing and output

Examples

An example of batch processing is the way that credit card companies process billing. The customer does not
receive a bill for each separate credit card purchase but one monthly bill for all of that month's purchases. The
bill is created through batch processing, where all of the data are collected and held until the bill is processed
as a batch at the end of the billing cycle (a toy sketch of this appears below).

Financial reporting and forecasting
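As a toy illustration of the credit-card billing example above (plain Java, not Hadoop code), the sketch below collects the purchases of one batch window, groups them by customer and produces one bill per customer; the Purchase record, the customer ids and the amounts are invented for the example.

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class BillingBatch {

    // Hypothetical record representing one card purchase captured during the batch window.
    record Purchase(String customerId, double amount) {}

    public static void main(String[] args) {
        List<Purchase> window = List.of(
                new Purchase("C1", 25.00),
                new Purchase("C2", 10.50),
                new Purchase("C1", 4.75));

        // One pass over the whole batch: group by customer and total the amounts.
        Map<String, Double> bills = window.stream()
                .collect(Collectors.groupingBy(Purchase::customerId,
                        Collectors.summingDouble(Purchase::amount)));

        // Each customer gets a single bill for the whole window.
        bills.forEach((customer, total) ->
                System.out.printf("Bill for %s: %.2f%n", customer, total));
    }
}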

BIG DATA BATCH PROCESSING


Diagram: operational data, social data, historic data and service data are Extracted, Transformed and Loaded (ETL) into the big data analysis platform, which feeds BI.

HOW HADOOP CAME INTO EXISTENCE

Hadoop and Big Data have become synonymous, but they are two different things. Hadoop is a parallel
programming model implemented on a bunch of low-cost clustered processors, and it is intended to
support data-intensive distributed applications. That is what Hadoop is all about.

Due to the advent of new technologies, devices, and communication means like social networking sites, the
amount of data produced by mankind is growing rapidly every year.

Traditionally, an enterprise has a computer to store and process big data. For storage, the programmers
take the help of their choice of database vendors such as Oracle, IBM, etc. In this approach, the user interacts
with the application, which in turn handles the part of data storage and analysis.

This approach works fine for applications that process less voluminous data that can be accommodated
by standard database servers, or up to the limit of the processor that is processing the data. But when it comes to
dealing with huge amounts of scalable data, processing such data through a single database becomes a
bottleneck.

Google solved this problem using an algorithm called MapReduce. This algorithm divides the task into small
parts, assigns them to many computers, and collects the results from them, which, when integrated, form the
result dataset. Using the solution provided by Google, Doug Cutting and his team developed an open source
project called HADOOP.

WHAT IS HADOOP

Hadoop is an Apache open source framework written in Java that allows distributed processing of large
datasets across clusters of computers using simple programming models.

The Hadoop framework application works in an environment that provides distributed storage and
computation across clusters of computers.

Hadoop is designed to scale up from a single server to thousands of machines, each offering local computation
and storage.

Hadoop supports any type of data

Hadoop is best suited to big data batch processing

Hadoop = Storage + Compute grid

As mentioned above, Hadoop is a place to store data and is known for its distributed file system

Concepts that come under Hadoop include HDFS, MapReduce, Pig, Hive, Sqoop, Flume and HBase

Core components of Hadoop

HDFS

MapReduce

HADOOP KEY CHARACTERISTICS


Reliable

Scalable

Flexible

Economical

HADOOP KEY DIFFERENTIATORS


Robust

Accessible

Scalable

Simple

Hadoop is a system for large scale data processing. It has two main components:

HDFS : distributed across nodes, natively redundant, self-healing, high-bandwidth clustered storage; the NameNode tracks block locations

MapReduce : splits a task across processors near the data and assembles the results; the JobTracker manages the TaskTrackers

MODULE 2. Hadoop Distributed File System

The most common file system used by Hadoop is the Hadoop Distributed File System (HDFS).

It is designed to run on large clusters (thousands of computers) of small computer machines in a reliable,
fault-tolerant manner.

HDFS uses a master/slave architecture where master consists of a single Name Node that manages the file
system metadata and one or more slave Data Nodes that store the actual data.

A file in an HDFS namespace is split into several blocks and those blocks are stored in a set of Data Nodes.
The Name Node determines the mapping of blocks to the Data Nodes.

The Data Nodes take care of read and write operations on the file system. They also take care of block
creation, deletion and replication based on instructions given by the Name Node.

HDFS provides a shell like any other file system, and a list of commands is available to interact with the file
system.

The file system is a storage-side component.
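The shell commands referred to above (for example hadoop fs -ls, hadoop fs -put, hadoop fs -cat) have a programmatic counterpart in the Java FileSystem API. A minimal sketch that lists a directory, assuming the cluster configuration files are on the classpath and that /user/demo is an existing directory (both are assumptions for the example):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListHdfsDirectory {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Programmatic equivalent of "hadoop fs -ls /user/demo" (the path is just an example).
        for (FileStatus status : fs.listStatus(new Path("/user/demo"))) {
            System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }
        fs.close();
    }
}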

HDFS COMPONENTS

1. Name node : http://localhost:50070/dfshealth.jsp

On the storage side it acts as the master of the system; HDFS has only one Name node

It maintains, manages and administers the data blocks present on the Data nodes

The default data block size is 64 MB (it can be changed; see the configuration sketch at the end of this subsection)

The Name node determines the mapping of blocks to the Data nodes

For read/write operations to be fast, the seek time should be small

Increasing the block size reduces the share of time spent seeking relative to streaming the data, so read/write
operations become faster overall

The Name node also keeps track of the overall file directory structure

If the Name node fails, a backup of its metadata must be maintained

So here the Secondary Name node acts as the backup

We have to configure a Secondary Name node
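A minimal sketch of the block-size configuration mentioned above, assuming a Hadoop 2.x client where the property is named dfs.blocksize (in Hadoop 1.x it was dfs.block.size) and assuming the default file system is HDFS; the 128 MB value is only an example, and the same property is normally set cluster-wide in hdfs-site.xml:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class BlockSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Override the default block size (64 MB in older releases) with 128 MB for this client.
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024);

        FileSystem fs = FileSystem.get(conf);
        System.out.println("Default block size now: " + fs.getDefaultBlockSize());
        fs.close();
    }
}

A per-file block size can also be passed directly to the FileSystem.create overload that takes a blockSize argument.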

Secondary Name node :

The Secondary Name node gets the metadata from the Name node every hour

If the Name node fails in the middle of an hour, the changes made since the last copy cannot be traced back


Diagram: the Secondary Name node takes the metadata from the Name node every hour and keeps a secure copy.
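The one-hour interval above is just the default; it is controlled by a checkpoint-period property (fs.checkpoint.period in Hadoop 1.x, dfs.namenode.checkpoint.period in Hadoop 2.x, both in seconds). A minimal sketch that reads the effective value from the client configuration, assuming the Hadoop 1.x-style property name:

import org.apache.hadoop.conf.Configuration;

public class CheckpointPeriod {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Reads the value from the configuration files on the classpath; 3600 s (one hour) is the default.
        long seconds = conf.getLong("fs.checkpoint.period", 3600);
        System.out.println("Secondary Name node checkpoints every " + seconds + " seconds");
    }
}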

HDFS ARCHITECTURE

Diagram: clients send metadata operations (open, create, get block locations) to the Name node, while the actual reads and writes go directly to the Data nodes, which are spread across racks (Rack 1, Rack 2). The Name node sends block operations and replication instructions to the Data nodes.

RACK AWARENESS

Diagram: the blocks of File 1, File 2 and File 3 are each replicated three times and spread across Rack 1, Rack 2 and Rack 3, so that the failure of a single rack does not lose all copies of any block.

HDFS FILE WRITE OPERATION


Diagram: the flow of an HDFS file write.

1. The user opens the HDFS client (the Distributed File System object).
2. The client issues a create request, which goes to the Name node.
3. The Name node shows the locations of the Data nodes that should hold the blocks.
4. The client writes the data to the first Data node, which forwards it along a pipeline to the other Data nodes.
5. Acknowledgement packets travel back up the pipeline to the client.
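A minimal sketch of this write flow from the client's point of view, using the Java FileSystem API; the path /user/demo/hello.txt and its contents are assumptions for the example, and the Data node pipeline and acknowledgements happen inside the output stream:

import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Step 2 above: "create" goes to the Name node, which picks the Data nodes.
        FSDataOutputStream out = fs.create(new Path("/user/demo/hello.txt"));

        // Step 4: the bytes are streamed to the first Data node and replicated down the pipeline.
        out.write("hello HDFS\n".getBytes(StandardCharsets.UTF_8));

        // Closing flushes the last packet and waits for the acknowledgements (step 5).
        out.close();
        fs.close();
    }
}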

HDFS FILE READ OPERATION


Diagram: the flow of an HDFS file read.

1. The user opens the file through the HDFS client (the Distributed File System object).
2. The client gets the block locations from the Name node.
3. An FS Data Input Stream reads the blocks directly from the Data nodes that hold them.
4. The user reads the data from the stream.
5. The stream is closed.
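The corresponding read flow, again a minimal sketch with the Java FileSystem API, reading back the example file assumed in the write sketch; IOUtils.copyBytes simply copies the stream to standard output:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Steps 1-2: open asks the Name node for the block locations.
        FSDataInputStream in = fs.open(new Path("/user/demo/hello.txt"));

        // Steps 3-4: the stream reads the blocks directly from the Data nodes.
        IOUtils.copyBytes(in, System.out, 4096, false);

        // Step 5: close the stream.
        in.close();
        fs.close();
    }
}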

MAP REDUCE FRAMEWORK


Mapper : it takes the processing to the data and allows the data to be processed in parallel

Reducer : it is responsible for processing one or more values which share a common key

Diagram: persistent input data is split into input key-value pairs that are handled by several Map tasks; the transient intermediate data is shuffled to the Reduce tasks, which produce the output key-value pairs written back to persistent storage.
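A minimal sketch of a Mapper and Reducer pair, the classic word count, using the org.apache.hadoop.mapreduce API (Hadoop 2.x style Job.getInstance; on Hadoop 1.x the job would be built with new Job(conf, ...)); the input and output paths are taken from the command line and the class and job names are only examples:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // The Mapper takes the processing to the data: each map task reads one input split
    // and emits an intermediate (word, 1) pair for every word it sees.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // The Reducer processes all values that share a common key: it sums the 1s
    // for each word and writes the final (word, count) pair.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));      // persistent input data
        FileOutputFormat.setOutputPath(job, new Path(args[1]));    // persistent output data
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

It would typically be packaged into a jar and submitted with hadoop jar wordcount.jar WordCount <input dir> <output dir> (the jar name is an example).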
