
BIG DATA ANALYTICS USING APACHE HADOOP

SEMINAR REPORT
Submitted in partial fulfilment of
the requirements for the award of Bachelor of Technology Degree
in Computer Science and Engineering
of the University of Kerala

Submitted by
ABIN BABY
Roll No : 1
Seventh Semester
B.Tech Computer Science and Engineering

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING


COLLEGE OF ENGINEERING
TRIVANDRUM
2014

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING


COLLEGE OF ENGINEERING
TRIVANDRUM

CERTIFICATE

This is to certify that this seminar report entitled BIG DATA ANALYTICS USING
APACHE HADOOP is a bona fide record of the work done by Abin Baby under our
guidance, towards partial fulfilment of the requirements for the award of the Degree of
Bachelor of Technology in Computer Science and Engineering of the University of
Kerala during the year 2011-2015.

Dr. Abdul Nizar A
Professor, Dept. of CSE
(Head of the Department)

Mrs. Sabitha S
Assoc. Professor, Dept. of CSE
(Guide)

Mrs. Rani Koshi
Assoc. Professor, Dept. of CSE
(Guide)


ACKNOWLEDGEMENTS
I would like to express my sincere gratitude and heartfelt indebtedness to my
guide Dr. Abdul Nizar A, Head of the Department, Department of Computer Science and
Engineering, for the valuable guidance and encouragement in pursuing this seminar.

I am also very much thankful to Mrs. Sabitha S, Associate Professor,
Department of Computer Science and Engineering, for her help and support.

I also extend my hearty gratitude to the Seminar Co-ordinator, Mrs. Rani Koshi,
Associate Professor, Department of CSE, College of Engineering Trivandrum, for
providing the necessary facilities and for her sincere co-operation. My sincere thanks are
extended to all the teachers of the Department of CSE and to all my friends for their help
and support.

Above all, I thank God for the immense grace and blessings at all stages of the project.

Abin Baby


ABSTRACT
The paradigm of processing huge datasets has shifted from centralized
architectures to distributed architectures. As enterprises faced the problem of gathering
large chunks of data, they found that the data could not be processed using any of the
existing centralized architecture solutions. Apart from time constraints, enterprises
faced issues of efficiency, performance and elevated infrastructure cost when processing
data in a centralized environment.

With the help of distributed architectures, these large organizations were able to
overcome the problem of extracting relevant information from a huge data dump. One
of the best open source tools used in the market to harness the distributed architecture
in order to solve data processing problems is Apache Hadoop. Using Apache Hadoop's
various components, such as data clusters, the MapReduce algorithm and distributed
processing, we can resolve various location-based complex data problems and provide
the relevant information back to the system, thereby improving the user experience.


TABLE OF CONTENTS

1. Introduction
2. Big Data
   2.1 Attributes of Big Data
   2.2 Big Data Applications
3. Big Data Analytics
   3.1 Challenges of Big Data Analytics
   3.2 Objectives of Big Data Analytics
4. Apache Hadoop
5. Hadoop Distributed File System
   5.1 Architecture
6. Hadoop Map-Reduce
   6.1 Architecture
   6.2 Map Reduce Paradigm
   6.3 Comparing RDBMS and Map Reduce
7. Apache Hadoop Ecosystem
8. Conclusion
9. Bibliography

CHAPTER 1

INTRODUCTION
The amount of data generated every day is expanding in a drastic manner. Big data is a
popular term used to describe data whose scale runs into zettabytes. Governments, companies
and many other organisations try to acquire and store data about their citizens and customers
in order to know them better and predict customer behaviour. Social networking
websites generate new data every second, and handling such data is one of the major
challenges companies are facing. Data stored in data warehouses is causing
disruption because it is in a raw format; proper analysis and processing must be done in
order to produce usable information out of it. Big Data has to deal with large and
complex datasets that can be structured, semi-structured or unstructured and will
typically not fit into memory to be processed. They have to be processed in place, which
means that computation has to be done where the data resides.

Big data challenges include analysis, capture, curation, search, sharing, storage,
transfer, visualization, and privacy violations. The trend towards larger data sets is due to
the additional information derivable from analysis of a single large set of related data, as
compared to separate smaller sets with the same total amount of data, allowing
correlations to be found to "spot business trends, prevent diseases, combat crime and so
on".
Big Data usually includes datasets with sizes beyond the ability of commonly used
software tools to capture, manage and process within a tolerable elapsed time. It is not
possible for traditional systems to process this amount of data within the time frame
mandated by the business. Big Data volumes are a constantly moving target, as of 2012
ranging from a few dozen terabytes to many petabytes of data in a single dataset. Faced
with this seemingly insurmountable challenge, entirely new platforms, called Big Data
platforms, have emerged.

New tools are being used to handle such large amounts of data in a short time.
Apache Hadoop is a Java-based programming framework which is used for processing
large data sets in a distributed computing environment. Hadoop is used in systems
where multiple nodes are present which can together process terabytes of data. Hadoop
uses its own file system, HDFS, which facilitates fast transfer of data, can sustain node
failures and avoids failure of the system as a whole. Hadoop uses the MapReduce
algorithm, which breaks the big data down into smaller chunks and performs operations
on it. The Hadoop framework is used by many big companies like Google, Yahoo and IBM
for applications such as search engines, advertising and information gathering and
processing.

Various technologies come hand-in-hand to accomplish this task, such as the
Spring Hadoop Data Framework for the basic foundations and running of the MapReduce
jobs, Apache Maven for building the code, REST web services for the communication,
and lastly Apache Hadoop for distributed processing of the huge dataset.

CHAPTER 2
BIG DATA
Every day, we create 2.5 quintillion bytes of data, so much that 90% of the data
in the world today has been created in the last two years alone. This data comes from
everywhere: sensors used to gather climate information, posts to social media sites,
digital pictures and videos, purchase transaction records, and cell phone GPS signals, to
name a few. This data is big data.

The data lying in the servers of a company was, until yesterday, just data, sorted
and filed. Suddenly, the term Big Data became popular, and now the data in a
company is Big Data. The term covers each and every piece of data an organization has
stored till now. It includes data stored in clouds and even the URLs that you have
bookmarked. A company might not have digitized all its data, and it may not have
structured all of it yet. But all the digital, paper, structured and unstructured data with
the company is now Big Data.

Big data is an all-encompassing term for any collection of data sets so large and
complex that it becomes difficult to process using traditional data processing
applications. It refers to the large amounts, at least terabytes, of poly-structured data
that flows continuously through and around organizations, including video, text, sensor
logs, and transactional records. The business benefits of analyzing this data can be
significant. According to a recent study by the MIT Sloan School of Management,
organizations that use analytics are twice as likely to be top performers in their industry
as those that don't.
Big data burst upon the scene in the first decade of the 21st century, and the first
organizations to embrace it were online and startup firms. In a nutshell, Big Data is your
data. It's the information owned by your company, obtained and processed through new
techniques to produce value in the best way possible.

Companies have sought for decades to make the best use of information to
improve their business capabilities. However, it is the structure (or lack thereof) and
size of Big Data that makes it so unique. Big Data is also special because it represents
both significant information, which can open new doors, and the way this information
is analyzed to help open those doors. The analysis goes hand-in-hand with the
information, so in this sense "Big Data" represents both a noun ("the data") and a verb
("combing the data to find value"). The days of keeping company data in Microsoft Office
documents on carefully organized file shares are behind us, much like the bygone era of
sailing across the ocean in tiny ships. That 50 gigabyte file share in 2002 looks quite tiny
compared to a modern-day 50 terabyte marketing database containing customer
preferences and habits.
The world's technological per-capita capacity to store information has roughly
doubled every 40 months since the 1980s. As of 2012, every day 2.5 exabytes (2.5 x 10^18
bytes) of data were created. The challenge for large enterprises is determining who
should own big data initiatives that straddle the entire organization.
Some of the popular organizations that hold Big Data are as follows:
Facebook: It has 40 PB of data and captures 100 TB/day
Yahoo!: It has 60 PB of data
Twitter: It captures 8 TB/day
eBay: It has 40 PB of data and captures 50 TB/day
How much data is considered Big Data differs from company to company.
Though it is true that one company's Big Data is another's small data, there is something
common: the data does not fit in memory or on a single disk, there is a rapid influx of
data that needs to be processed, and the organization would benefit from distributed
software stacks. For some companies, 10 TB of data would be considered Big Data, and
for others 1 PB would be Big Data. So only you can determine whether your data is
really Big Data. It is sufficient to say that it would start in the low terabyte range.

ATTRIBUTES OF BIG DATA


As far back as 2001, industry analyst Doug Laney (currently with Gartner)
articulated the now mainstream definition of big data as the three Vs of big data:
volume, velocity and variety.

Volume. The quantity of data that is generated is very important in this context. It is the
size of the data which determines the value and potential of the data under
consideration, and whether it can actually be considered Big Data or not. The name
Big Data itself contains a term related to size, and hence the characteristic. Many
factors contribute to the increase in data volume: transaction-based data stored through
the years, unstructured data streaming in from social media, and increasing amounts of
sensor and machine-to-machine data being collected. In the past, excessive data volume
was a storage issue. But with decreasing storage costs, other issues emerge, including
how to determine relevance within large data volumes and how to use analytics to
create value from relevant data.

It is estimated that 2.5 quintillion bytes of data are generated every day.
40 zettabytes of data will be created by 2020, an increase of 30 times over 2005.
6 billion people around the world are using mobile phones.

Velocity. The term velocity in this context refers to the speed of generation of data, or
how fast the data is generated and processed to meet the demands and the challenges
which lie ahead in the path of growth and development. Data is streaming in at
unprecedented speed and must be dealt with in a timely manner. RFID tags, sensors and
smart metering are driving the need to deal with torrents of data in near-real time.
Reacting quickly enough to deal with data velocity is a challenge for most organizations.

20 hours of video are uploaded every minute.

2.9 million emails are sent every second.

The New York Stock Exchange captures 1 TB of information in every trading session.

Variety. The next aspect of Big Data is its variety. This means that the category to which
Big Data belongs is also a very essential fact that needs to be known by data analysts.
This helps the people who are closely analyzing the data, and are associated with it, to
use the data effectively to their advantage, thus upholding the importance of Big Data.
Data today comes in all types of formats: structured, numeric data in traditional
databases; information created from line-of-business applications; unstructured text
documents, email, video, audio, stock ticker data and financial transactions. Managing,
merging and governing different varieties of data is something many organizations still
grapple with.

The global size of data in health care was about 150 exabytes as of 2011.

30 billion pieces of content are shared on Facebook every month.

4 billion hours of video are watched on YouTube each month.

Veracity. In addition to the increasing velocities and varieties of data, data flows can be
highly inconsistent, with periodic peaks. Is something trending in social media? Daily,
seasonal and event-triggered peak data loads can be challenging to manage, even more
so when unstructured data is involved. This is a factor which can be a problem for those
who analyse the data.

Poor data quality costs the US economy around $3.1 trillion a year.

Complexity. Data management can become a very complex process, especially when
large volumes of data come from multiple sources. These data need to be linked,
connected and correlated in order to grasp the information that they are supposed to
convey. This situation is therefore termed the complexity of Big Data.

Volatility. Big data volatility refers to how long data is valid and how long it should be
stored. In this world of real-time data, you need to determine at what point data is no
longer relevant to the current analysis.

BIG DATA APPLICATIONS

ADVERTISING : Big data analytics helps companies like Google and other advertising
companies to identify the behaviour of a person and to target ads accordingly. Big
data analytics enables more personal and targeted ads.

ONLINE MARKETING : Big data analytics is used by online retailers like Amazon, eBay,
Flipkart etc. to identify their potential customers, give them offers, vary the price of
products according to trends, and so on.

HEALTH CARE : The average amount of data per hospital will increase from 167 TB to
665 TB in 2015. With Big Data, medical professionals can improve patient care and
reduce cost by extracting relevant clinical information.

CUSTOMER SERVICE : Service representatives can use data to gain a more holistic
view of their customers, understanding their likes and dislikes in real time.

INSURANCE : An insurance or citizen service provider can apply advanced analytics on
data to detect fraud quickly.

UNDERSTANDING AND OPTIMIZING BUSINESS PERFORMANCE : Big data is also
increasingly used to optimize business processes. Retailers are able to optimize their
stock based on predictions generated from social media data, web search trends and
weather forecasts. One particular business process that is seeing a lot of big data
analytics is supply chain or delivery route optimization.

FINANCIAL TRADING : High-Frequency Trading (HFT) is an area where big data finds a
lot of use today. Here, big data algorithms are used to make trading decisions. Today,
the majority of equity trading takes place via data algorithms that increasingly take
into account signals from social media networks and news websites to make buy and
sell decisions in split seconds.

IMPROVING AND OPTIMIZING CITY MANAGEMENT : Big data is used to improve
many aspects of our cities and countries. For example, it allows cities to optimize traffic
flows based on real time traffic information as well as social media and weather data. A
number of cities are currently piloting big data analytics with the aim of turning
themselves into Smart Cities, where the transport infrastructure and utility processes
are all joined up.

IMPROVING SECURITY AND LAW ENFORCEMENT : Big data is applied heavily in
improving security and enabling law enforcement. I am sure you are aware of the
revelations that the National Security Agency (NSA) in the U.S. uses big data analytics to
foil terrorist plots (and maybe spy on us). Others use big data techniques to detect and
prevent cyber attacks.

OPTIMIZING MACHINE AND DEVICE PERFORMANCE : Big data analytics helps
machines and devices become smarter and more autonomous. For example, big data
tools are used to operate Google's self-driving car. The Toyota Prius is fitted with
cameras, GPS as well as powerful computers and sensors to safely drive on the road
without the intervention of human beings. Big data tools are also used to optimize
energy grids using data from smart meters.

IMPROVING SCIENCE AND RESEARCH : Science and research are currently being
transformed by the new possibilities big data brings. Take, for example, CERN, the Swiss
nuclear physics lab with its Large Hadron Collider, the world's largest and most
powerful particle accelerator. Experiments to unlock the secrets of our universe, how
it started and how it works, generate huge amounts of data. The CERN data center has
65,000 processors to analyze its 30 petabytes of data. However, it uses the computing
power of thousands of computers distributed across 150 data centers worldwide to
analyze the data.

IMPROVING SPORTS PERFORMANCE : Most elite sports have now embraced big data
analytics. We have the IBM SlamTracker tool for tennis tournaments; we use video
analytics that track the performance of every player in a football or baseball game; and
sensor technology in sports equipment such as basketballs or golf clubs allows us to get
feedback (via smartphones and cloud servers) on our game and how to improve it.

CHAPTER 3
BIG DATA ANALYTICS
Big data is difficult to work with using most relational database management
systems and desktop statistics and visualization packages, requiring instead "massively
parallel software running on tens, hundreds, or even thousands of servers".

Rapidly ingesting, storing, and processing big data requires a cost-effective
infrastructure that can scale with the amount of data and the scope of analysis. Most
organizations with traditional data platforms, typically relational database
management systems (RDBMS) coupled to enterprise data warehouses (EDW) using
ETL tools, find that their legacy infrastructure is either technically incapable or
financially impractical for storing and analyzing big data. A traditional ETL process
extracts data from multiple sources, then cleanses, formats, and loads it into a data
warehouse for analysis. When the source data sets are large, fast, and unstructured,
traditional ETL can become the bottleneck, because it is too complex to develop, too
expensive to operate, and takes too long to execute.

By most accounts, 80 percent of the development effort in a big data project goes into
data integration and only 20 percent goes toward data analysis. Furthermore, a
traditional EDW platform can cost upwards of USD 60K per terabyte. Analyzing one
petabyte, the amount of data Google processes in one hour, would cost USD 60M.
Clearly, more of the same is not a big data strategy that any CIO can afford, so we
require more efficient analytics for Big Data.

Big Analytics delivers competitive advantage in two ways compared to the
traditional analytical model. First, Big Analytics describes the efficient use of a simple
model applied to volumes of data that would be too large for the traditional analytical
environment. Research suggests that a simple algorithm with a large volume of data is
more accurate than a sophisticated algorithm with little data. The algorithm is not the
competitive advantage; the ability to apply it to huge amounts of data without
compromising performance generates the competitive edge.

Second, Big Analytics refers to the sophistication of the model itself. Increasingly,
analysis algorithms are provided directly by database management system (DBMS)
vendors. To pull away from the pack, companies must go well beyond what is provided
and innovate by using newer, more sophisticated statistical analysis.
To analyze such a large volume of data, big data analytics is typically performed
using specialized software tools and applications for predictive analytics, data mining,
text mining, forecasting and data optimization. Collectively these processes are separate
but highly integrated functions of high-performance analytics. Using big data tools and
software enables an organization to process extremely large volumes of data that a
business has collected to determine which data is relevant and can be analyzed to drive
better business decisions in the future.


CHALLENGES OF BIG DATA ANALYTICS

For most organizations, big data analysis is a challenge. Consider the sheer
volume of data and the many different formats of the data (both structured and
unstructured data) collected across the entire organization, and the many different ways
different types of data can be combined, contrasted and analyzed to find patterns and
other useful information.
1. Meeting the need for speed : In today's hypercompetitive business environment,
companies not only have to find and analyze the relevant data they need, they must find
it quickly. Visualization helps organizations perform analyses and make decisions much
more rapidly, but the challenge is going through the sheer volumes of data and
accessing the level of detail needed, all at a high speed. One possible solution is
hardware. Some vendors are using increased memory and powerful parallel processing
to crunch large volumes of data extremely quickly. Another method is putting data
in-memory but using a grid computing approach, where many machines are used to
solve a problem.
2. Understanding the data : It takes a lot of understanding to get data in the right
shape so that you can use visualization as part of data analysis. For example, if the data
comes from social media content, you need to know who the user is in a general sense,
such as a customer using a particular set of products, and understand what it is you're
trying to visualize out of the data. One solution to this challenge is to have the proper
domain expertise in place. Make sure the people analyzing the data have a deep
understanding of where the data comes from, what audience will be consuming the data
and how that audience will interpret the information.
3. Addressing data quality : Even if you can find and analyze data quickly and put it in
the proper context for the audience that will be consuming the information, the value of
data for decision-making purposes will be jeopardized if the data is not accurate or
timely. This is a challenge with any data analysis, but when considering the volumes of
information involved in big data projects, it becomes even more pronounced. Again, data
visualization will only prove to be a valuable tool if the data quality is assured. To address
this issue, companies need to have a data governance or information management process
in place to ensure the data is clean.
4. Displaying meaningful results : Plotting points on a graph for analysis becomes
difficult when dealing with extremely large amounts of information or a variety of
categories of information. For example, imagine you have 10 billion rows of retail SKU
data that you're trying to compare. The user trying to view 10 billion plots on the screen
will have a hard time seeing so many data points. One way to resolve this is to cluster
data into a higher-level view where smaller groups of data become visible. By grouping
the data together, or binning, you can more effectively visualize the data.
5. Dealing with outliers : The graphical representations of data made possible by
visualization can communicate trends and outliers much faster than tables containing
numbers and text. Users can easily spot issues that need attention simply by glancing at
a chart. Outliers typically represent about 1 to 5 percent of data, but when you're
working with massive amounts of data, viewing 1 to 5 percent of the data is rather
difficult. How do you represent those points without getting into plotting issues?
Possible solutions are to remove the outliers from the data (and therefore from the
chart) or to create a separate chart for the outliers.
The availability of new in-memory technology and high-performance analytics
that use data visualization is providing a better way to analyze data more quickly than
ever. Visual analytics enables organizations to take raw data and present it in a
meaningful way that generates the most value. Nevertheless, when used with big data,
visualization is bound to lead to some challenges. If you're prepared to deal with these
hurdles, the opportunity for success with a data visualization strategy is much greater.


OBJECTIVES OF BIG DATA ANALYTICS


1. Cost Reduction from Big Data Technologies : Some organizations pursuing big data
believe strongly that MIPS and terabyte storage for structured data are now most
cheaply delivered through big data technologies like Hadoop clusters. Organizations
that were focused on cost reduction made the decision to adopt big data tools primarily
within the IT organization on largely technical and economic criteria.
2. Time Reduction from Big Data : The second common objective of big data
technologies and solutions is time reduction. Many companies make use of big data
analytics to generate real-time decisions and thereby save a lot of time. Another key objective
involving time reduction is to be able to interact with the customer in real time, using
analytics and data derived from the customer experience.
3. Developing New Big Data-Based Offerings : One of the most ambitious things an
organization can do with big data is to employ it in developing new product and service
offerings based on data. Many of the companies that employ this approach are online
firms, which have an obvious need to employ data-based products and services.
4. Supporting Internal Business Decisions : The primary purpose behind traditional,
small data analytics was to support internal business decisions. What offers should be
presented to a customer? Which customers are most likely to stop being customers
soon? How much inventory should be held in the warehouse? How should we price our
products?
These types of decisions employ big data when there are new, less structured data
sources that can be applied to the decision. For example, any data that can shed light on
customer satisfaction is helpful, and much data from customer interactions is
unstructured.


CHAPTER 4
APACHE HADOOP

Doug Cutting and Mike Cafarella helped create Apache Hadoop in 2005 out
of necessity as data from the web exploded and grew far beyond the ability of
traditional systems to handle it. Hadoop was initially inspired by papers published by
Google outlining its approach to handling an avalanche of data, and it has since become
the de facto standard for storing, processing and analyzing hundreds of terabytes, and
even petabytes, of data.
Apache Hadoop is an open source distributed software platform for storing and
processing data. Written in Java, it runs on a cluster of industry-standard servers
configured with direct-attached storage. Using Hadoop, you can store petabytes of data
reliably on tens of thousands of servers while scaling performance cost-effectively by
merely adding inexpensive nodes to the cluster.
Apache Hadoop is 100% open source, and pioneered a fundamentally new way
of storing and processing data. Instead of relying on expensive, proprietary hardware
and different systems to store and process data, Hadoop enables distributed parallel
processing of huge amounts of data across inexpensive, industry-standard servers that
both store and process the data, and can scale without limits. With Hadoop, no data is
too big. And in today's hyper-connected world, where more and more data is being
created every day, Hadoop's breakthrough advantages mean that businesses and
organizations can now find value in data that was recently considered useless.
Reveal Insight From All Types of Data, From All Types of Systems

Hadoop can handle all types of data from disparate systems: structured, unstructured,
log files, pictures, audio files, communications records, email, just about anything you
can think of, regardless of its native format. Even when different types of data have been
stored in unrelated systems, one can dump it all into a Hadoop cluster with no prior
need for a schema. In other words, we do not need to know how we intend to query the
data before storing it; Hadoop lets one decide later, and over time the data can reveal
answers to questions one never even thought to ask.

By making all data useable, not just what is in databases, Hadoop lets one see
relationships that were hidden before and reveal answers that have always been just
out of reach. One can start making more decisions based on hard data instead of
hunches and look at complete data sets, not just samples.

Redefine the Economics of Data: Keep Everything, Forever, Online

In addition, Hadoop's cost advantages over legacy systems redefine the
economics of data. Legacy systems, while fine for certain workloads, simply were not
engineered with the needs of Big Data in mind and are far too expensive to be used for
general purposes with today's largest data sets.

One of the cost advantages of Hadoop is that, because it relies on an internally
redundant data structure and is deployed on industry-standard servers rather than
expensive specialized data storage systems, you can afford to store data that was not
previously viable. And we all know that once data is on tape, it is essentially the same as
if it had been deleted, accessible only in extreme circumstances.
Hadoop consists of the Hadoop Common package, which provides filesystem and
OS level abstractions, a MapReduce engine (either MapReduce/MR1 or YARN/MR2) and
the Hadoop Distributed File System (HDFS). The Hadoop Common package contains the
necessary Java Archive (JAR) files and scripts needed to start Hadoop.
For effective scheduling of work, every Hadoop-compatible file system should
provide location awareness: the name of the rack (more precisely, of the network
switch) where a worker node is. Hadoop applications can use this information to run
work on the node where the data is, and, failing that, on the same rack/switch, reducing
backbone traffic. HDFS uses this method when replicating data to try to keep different
copies of the data on different racks. The goal is to reduce the impact of a rack power
outage or switch failure, so that even if these events occur, the data may still be
readable.


HDFS : Self-healing, high-bandwidth, clustered storage.

MapReduce : Distributed, fault-tolerant resource management, coupled with scalable
data processing.

YARN : YARN is the architectural center of Hadoop that allows multiple data processing
engines such as interactive SQL, real-time streaming, data science and batch processing
to handle data stored in a single platform, unlocking an entirely new approach to
analytics. It is the foundation of the new generation of Hadoop and is enabling
organizations everywhere to realize a modern data architecture.

YARN is the prerequisite for Enterprise Hadoop, providing resource
management and a central platform to deliver consistent operations, security, and data
governance tools across Hadoop clusters.

YARN also extends the power of Hadoop to incumbent and new technologies
found within the data center so that they too can take advantage of cost-effective,
linear-scale storage and processing. It provides ISVs and developers a consistent
framework for writing data access applications that run in Hadoop.


CHAPTER 5
HADOOP DISTRIBUTED FILE SYSTEM (HDFS)
HDFS is Hadoop's own rack-aware filesystem, which is a UNIX-based data
storage layer of Hadoop. HDFS is derived from concepts of the Google File System. An
important characteristic of Hadoop is the partitioning of data and computation across
many (thousands of) hosts, and the execution of application computations in parallel,
close to their data. On HDFS, data files are replicated as sequences of blocks in the
cluster. A Hadoop cluster scales computation capacity, storage capacity, and I/O
bandwidth by simply adding commodity servers. HDFS can be accessed from
applications in many different ways. Natively, HDFS provides a Java API for applications
to use.
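
As a brief, hedged illustration of that Java API, the sketch below writes a small file into
HDFS and reads it back. The NameNode address and the file path are assumptions made
for the example, not values taken from any particular cluster.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsApiSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Assumed NameNode address; in practice this comes from core-site.xml
            conf.set("fs.defaultFS", "hdfs://namenode:8020");
            FileSystem fs = FileSystem.get(conf);

            Path file = new Path("/user/abin/demo.txt");   // hypothetical path

            // Write a small file; HDFS splits it into blocks and replicates them
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.write("hello hdfs\n".getBytes("UTF-8"));
            }

            // Stream the file back and print its first line
            try (FSDataInputStream in = fs.open(file);
                 BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
                System.out.println(reader.readLine());
            }
            fs.close();
        }
    }
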
The Hadoop clusters at Yahoo! span 40,000 servers and store 40 petabytes of
application data, with the largest Hadoop cluster being 4,000 servers. Also, one hundred
other organizations worldwide are known to use Hadoop.
HDFS was designed to be a scalable, fault-tolerant, distributed storage system
that works closely with MapReduce. HDFS will just work under a variety of physical
and systemic circumstances. By distributing storage and computation across many
servers, the combined storage resource can grow with demand while remaining
economical at every size.
HDFS supports parallel reading and writing and is optimized for streaming reads
and writes. The bandwidth scales linearly with the number of nodes. HDFS provides a
block redundancy factor, which is normally 3, i.e. every block will be replicated 3 times
on various nodes. This helps to achieve higher fault tolerance.
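
Purely as an illustrative sketch of this redundancy factor (the file path below is
hypothetical), a client can inspect or override the number of replicas for a single file
through the same Java API:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationSketch {
        public static void main(String[] args) throws Exception {
            // Reads fs.defaultFS and other settings from the configuration on the classpath
            FileSystem fs = FileSystem.get(new Configuration());
            Path file = new Path("/user/abin/demo.txt");   // hypothetical file

            // Ask the NameNode how many replicas this file's blocks should have
            FileStatus status = fs.getFileStatus(file);
            System.out.println("replication factor: " + status.getReplication());

            // Request a higher replication factor for this one file (the cluster default is usually 3)
            fs.setReplication(file, (short) 5);
            fs.close();
        }
    }
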
These specific features ensure that the Hadoop clusters are highly functional and highly
available:


Rack awareness allows consideration of a node's physical location when
allocating storage and scheduling tasks.

Minimal data motion. MapReduce moves compute processes to the data on
HDFS and not the other way around. Processing tasks can occur on the physical
node where the data resides. This significantly reduces the network I/O patterns,
keeps most of the I/O on the local disk or within the same rack, and provides
very high aggregate read/write bandwidth.

Utilities diagnose the health of the file system and can rebalance the data on
different nodes.

Rollback allows system operators to bring back the previous version of HDFS
after an upgrade, in case of human or system errors.

Standby NameNode provides redundancy and supports high availability.

Highly operable. Hadoop handles different types of cluster failures that might
otherwise require operator intervention. This design allows a single operator to
maintain a cluster of thousands of nodes.

ARCHITECTURE

HDFS stores large files by dividing them into blocks (usually 64 or 128 MB) and
replicating the blocks on three or more servers. Data is organized into files and
directories.

HDFS has a master/slave architecture. An HDFS cluster consists of a single
NameNode, a master server that manages the file system namespace and regulates
access to files by clients. In addition, there are a number of DataNodes, usually one per
node in the cluster, which manage storage attached to the nodes that they run on. HDFS
exposes a file system namespace and allows user data to be stored in files. Internally, a
file is split into one or more blocks and these blocks are stored in a set of DataNodes.
NameNode : It executes file system namespace operations like opening, closing, and
renaming files and directories. It also determines the mapping of blocks to DataNodes.
The NameNode actively monitors the number of replicas of a block. When a replica of a
block is lost due to a DataNode failure or disk failure, the NameNode creates another
replica of the block. The NameNode maintains the namespace tree and the mapping of
blocks to DataNodes, holding the entire namespace image in RAM.

DataNodes : These are responsible for serving read and write requests from the file
system's clients. The DataNodes also perform block creation, deletion, and replication
upon instruction from the NameNode.

The NameNode and DataNode are pieces of software designed to run on commodity
machines. These machines typically run a GNU/Linux operating system (OS). HDFS is
built using the Java language; any machine that supports Java can run the NameNode or
the DataNode software. Usage of the highly portable Java language means that HDFS can
be deployed on a wide range of machines. A typical deployment has a dedicated
machine that runs only the NameNode software. Each of the other machines in the
cluster runs one instance of the DataNode software. HDFS provides automatic block
replication if any node fails. Unlike RAID systems, a failed disk or node need not be
repaired immediately; typically, repair is done periodically for a collection of failures,
which makes it more efficient.

CHAPTER 6
HADOOP MAP REDUCE

Central to the scalability of Apache Hadoop is the distributed processing
framework known as MapReduce. MapReduce helps programmers solve data-parallel
problems for which the data set can be sub-divided into small parts and processed
independently. MapReduce is an important advance because it allows ordinary
developers, not just those skilled in high-performance computing, to use parallel
programming constructs without worrying about the complex details of intra-cluster
communication, task monitoring, and failure handling. MapReduce simplifies all that.
MapReduce is a programming model for processing large datasets distributed on
a large cluster. MapReduce is the heart of Hadoop. Its programming paradigm allows
performing massive data processing across thousands of servers configured with
Hadoop clusters. It is derived from Google's MapReduce. Hadoop MapReduce is a
software framework for easily writing applications which process large amounts of
data (multi-terabyte datasets) in parallel on large clusters (thousands of nodes) of
commodity hardware in a reliable, fault-tolerant manner.
ARCHITECTURE

MapReduce is also implemented over a master-slave architecture. Classic
MapReduce involves job submission, job initialization, task assignment, task execution,
progress and status updates, and job completion-related activities, which are mainly
managed by the JobTracker node and executed by the TaskTrackers. The client
application submits a job to the JobTracker, and the input is divided across the cluster.
The JobTracker then calculates the number of map and reduce tasks to be processed and
commands the TaskTrackers to start executing the job. Each TaskTracker copies the
resources to its local machine and launches a JVM to run the map and reduce programs
over the data. Along with this, the TaskTracker periodically sends updates to the
JobTracker, which can be considered a heartbeat that helps to update the job ID, job
status, and usage of resources.

JobTracker : This is the master node of the MapReduce system, which
manages the jobs and resources in the cluster (TaskTrackers). The JobTracker
tries to schedule each map task as close as possible to the actual data being
processed, on a TaskTracker which is running on the same DataNode as the
underlying block.

TaskTracker : These are the slaves that are deployed on each machine. They
are responsible for running the map and reduce tasks as instructed by the
JobTracker.


MAP REDUCE PARADIGM

The MapReduce paradigm is divided into two phases, Map and Reduce, which
mainly deal with key and value pairs of data. The Map and Reduce tasks run sequentially
in a cluster; the output of the Map phase becomes the input for the Reduce phase. These
phases are explained as follows:

Map phase: Once divided, datasets are assigned to the task tracker to
perform the Map phase. A functional operation is performed over the data,
emitting the mapped key and value pairs as the output of the Map phase.

Reduce phase: The master node then collects the answers to all the
subproblems and combines them in some way to form the output; the answer
to the problem it was originally trying to solve.


The five common steps of parallel computing are as follows:

1. Preparing the Map() input: This takes the input data row by row and emits
key-value pairs per row, or as explicitly defined per the requirement.
Map input: list (k1, v1)
2. Run the user-provided Map() code.
Map output: list (k2, v2)
3. Shuffle the Map output to the Reduce processors, grouping similar keys and
routing them to the same reducer.
4. Run the user-provided Reduce() code: This phase runs the custom
reducer code designed by the developer on the shuffled data and emits keys
and values.
Reduce input: (k2, list(v2))
Reduce output: (k3, v3)
5. Produce the final output: Finally, the master node collects all reducer output,
combines it and writes it to a text file.
EXAMPLE
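
The classic word-count job, which ships with Hadoop as its standard example, shows
these steps in Java. The sketch below follows that example; the input and output HDFS
paths are passed on the command line, and the class names are illustrative.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map phase: emit (word, 1) for every word in a line of input
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private final static IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce phase: sum the counts for each word after the shuffle groups them
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);   // optional local pre-aggregation
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));     // HDFS input directory
            FileOutputFormat.setOutputPath(job, new Path(args[1]));   // HDFS output directory
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The map phase emits (word, 1) pairs, the shuffle groups equal words together, and the
reduce phase sums each group, producing one line per distinct word with its total count.
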


COMPARING RDBMS AND MAP-REDUCE

                  Traditional RDBMS              MapReduce
Data Size         Gigabytes (Terabytes)          Petabytes (Exabytes)
Data Format       Relational Tables              Key/Value Pairs
Access            Interactive and Batch          Batch
Updates           Read and Write many times      Write once, Read many times
Structure         Static schema                  Dynamic schema
Integrity         High (ACID)                    Low
Scaling           Non-linear                     Linear
Processing        Online Transactions            Offline Batch Processing
Querying          Declarative Queries            Functional Programming

CHAPTER 7
APACHE HADOOP ECOSYSTEM

The Apache Hadoop ecosystem consists of many other components apart from Hadoop
HDFS and Hadoop MapReduce. They include:
1. Apache Pig (Scripting platform) : Pig is a platform for analyzing large data sets that
consists of a high-level language for expressing data analysis programs, coupled with
infrastructure for evaluating these programs. The salient property of Pig programs is
that their structure is amenable to substantial parallelization, which in turn enables
them to handle very large data sets. At the present time, Pig's infrastructure layer
consists of a compiler that produces sequences of Map-Reduce programs. Pig's language
layer currently consists of a textual language called Pig Latin, which is easy to use,
optimized, and extensible.
2. Apache Hive : Hive is a data warehouse system for Hadoop that facilitates easy data
summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop
compatible file systems. It provides a mechanism to project structure onto this data and
query the data using a SQL-like language called HiveQL. Hive also allows traditional
map/reduce programmers to plug in their custom mappers and reducers when it is
inconvenient or inefficient to express this logic in HiveQL.
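
Because HiveQL is exposed through the standard JDBC interface, a Java client can query
Hive roughly as sketched below; this assumes a HiveServer2 instance on its default port
and a hypothetical table named pageviews, neither of which comes from this report.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveJdbcSketch {
        public static void main(String[] args) throws Exception {
            // HiveServer2 JDBC driver; assumed to be on the classpath
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            try (Connection con = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "", "");
                 Statement stmt = con.createStatement();
                 // HiveQL looks like SQL but is compiled into MapReduce jobs
                 ResultSet rs = stmt.executeQuery(
                     "SELECT url, COUNT(*) AS hits FROM pageviews GROUP BY url")) {
                while (rs.next()) {
                    System.out.println(rs.getString("url") + "\t" + rs.getLong("hits"));
                }
            }
        }
    }
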
3. Apache HBase : HBase (Hadoop Database) is a distributed, column-oriented
database. HBase uses HDFS for the underlying storage. It supports both batch-style
computations using MapReduce and point queries (random reads).
The main components of HBase are as described below:

HBase Master is responsible for negotiating load balancing across all Region
Servers and maintaining the state of the cluster. It is not part of the actual data
storage or retrieval path.

RegionServer is deployed on each machine and hosts data and processes I/O
requests.
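
A hedged sketch of a point query against HBase from Java is shown below; it assumes
the HBase 1.x client API and a hypothetical table named users with a column family
named info.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseSketch {
        public static void main(String[] args) throws Exception {
            // The ZooKeeper quorum and other settings are read from hbase-site.xml
            Configuration conf = HBaseConfiguration.create();
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Table table = connection.getTable(TableName.valueOf("users"))) {

                // Write one cell: row key "row1", column family "info", qualifier "name"
                Put put = new Put(Bytes.toBytes("row1"));
                put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Abin"));
                table.put(put);

                // Random read (point query) of the same row
                Result result = table.get(new Get(Bytes.toBytes("row1")));
                byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
                System.out.println(Bytes.toString(name));
            }
        }
    }
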

4. Apache Mahout : Mahout is a scalable machine learning library that implements many
different approaches to machine learning. The project currently contains
implementations of algorithms for classification, clustering, frequent item set mining,
genetic programming and collaborative filtering. Mahout is scalable along three
dimensions: it scales to reasonably large data sets by leveraging algorithm properties or
implementing versions based on Apache Hadoop.
5. Apache HCatalog : HCatalog is a metadata abstraction layer for referencing data
without using the underlying filenames or formats. It insulates users and scripts from
how and where the data is physically stored. WebHCat provides a REST-like web API for
HCatalog and related Hadoop components. Application developers make HTTP requests
to access the Hadoop MapReduce, Pig, Hive, and HCatalog DDL from within the
applications. Data and code used by WebHCat is maintained in HDFS. HCatalog DDL
commands are executed directly when requested. MapReduce, Pig, and Hive jobs are
placed in queue by WebHCat and can be monitored for progress or stopped as required.
Developers also specify a location in HDFS into which WebHCat should place Pig, Hive,
and MapReduce results.
6. Apache ZooKeeper : ZooKeeper is a centralized service for maintaining
configuration information, naming, providing distributed synchronization, and
providing group services which are very useful for a variety of distributed systems.
HBase is not operational without ZooKeeper.
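
To make the idea of centralized configuration concrete, the following hedged Java
sketch stores a small value in a znode and reads it back; the connection string and the
znode path are assumptions made purely for illustration.

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ZooKeeperSketch {
        public static void main(String[] args) throws Exception {
            // Connect to a (hypothetical) ZooKeeper server; the watcher just logs events
            ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, new Watcher() {
                public void process(WatchedEvent event) {
                    System.out.println("event: " + event.getType());
                }
            });

            // Store a small piece of configuration under a znode
            String path = "/app-config";
            if (zk.exists(path, false) == null) {
                zk.create(path, "replication=3".getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE,
                          CreateMode.PERSISTENT);
            }

            // Any node in the cluster can read the same configuration back
            byte[] data = zk.getData(path, false, null);
            System.out.println(new String(data));
            zk.close();
        }
    }
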
7. Apache Flume : Flume is a top level project at the Apache Software Foundation.
While it can function as a general purpose event queue manager, in the context of
Hadoop it is most often used as a log aggregator, collecting log data from many diverse
sources and moving them to a centralized data store.


CONCLUSION
Big data is considered the next big thing in the world of information
technology. The use of Big Data is becoming a crucial way for leading companies to
outperform their peers. Big Data will help to create new growth opportunities and
entirely new categories of companies, such as those that aggregate and analyse industry
data. Sophisticated analytics of big data can substantially improve decision-making,
minimise risks, and unearth valuable insights that would otherwise remain hidden.

Big data analytics is the process of examining big data to uncover hidden
patterns, unknown correlations and other useful information that can be used to make
better decisions. With big data analytics, data scientists and others can analyze huge
volumes of data that conventional analytics and business intelligence solutions can't
touch.

Apache Hadoop is a popular open source big data analytics tool. Using Hadoop,
we can store petabytes of data reliably on tens of thousands of servers while scaling
performance cost-effectively by merely adding inexpensive nodes to the cluster.
Instead of relying on expensive, proprietary hardware and different systems to store
and process data, Hadoop enables distributed parallel processing of huge amounts of
data across inexpensive, industry-standard servers that both store and process the data,
and can scale without limits.


BIBLIOGRAPHY

[1] Nandimath, J.; Banerjee, E.; Patil, A.; Kakade, P.; Vaidya, S., "Big data analysis using
Apache Hadoop", Information Reuse and Integration (IRI), 2013 IEEE 14th
International Conference on.

[2] Grover, R.; Carey, M.J., "Extending Map-Reduce for Efficient Predicate-Based
Sampling", Data Engineering (ICDE), 2012 IEEE 28th International Conference on.

[3] Singh, K.; Kaur, R., "Hadoop: Addressing challenges of Big Data", Advance Computing
Conference (IACC), 2014 IEEE International.

[4] Chandarana, P.; Vijayalakshmi, M., "Big Data analytics frameworks", Circuits,
Systems, Communication and Information Technology Applications (CSCITA), 2014
International Conference on.

[5] "MapReduce: Simplified Data Processing on Large Clusters". Available at
http://labs.google.com/papers/mapreduceosdi04.pdf

[6] Jason Venner, "Pro Hadoop: Build scalable distributed applications in the cloud".

[7] Tom White, "Hadoop: The Definitive Guide", O'Reilly.
