
What is Big Data?

'Big Data' is a term used to describe collections of data that are huge in size and yet
growing exponentially with time. In short, such data is so large and complex that none
of the traditional data-management tools can store or process it efficiently.

Examples of Big Data:

1. The New York Stock Exchange generates about one terabyte of new trade data
per day.
2. Statistics show that 500+ terabytes of new data get ingested into the databases of
the social media site Facebook every day. This data is mainly generated from photo
and video uploads, message exchanges, comments, etc.
3. A single jet engine can generate 10+ terabytes of data in 30 minutes of flight time.
With many thousands of flights per day, data generation reaches many petabytes.

Data sets grow rapidly, in part because they are increasingly gathered by cheap and
numerous information-sensing Internet of Things devices such as mobile devices, aerial
sensors (remote sensing), software logs, cameras, microphones, radio-frequency
identification (RFID) readers, and wireless sensor networks.
An IDC report predicted that the global data volume would grow exponentially from
4.4 zettabytes in 2013 to 44 zettabytes in 2020. By 2025, IDC predicts there will be
163 zettabytes of data.

One zettabyte equals 10^21 bytes, or one billion terabytes.

Characteristics of 'Big Data':


Volume

The amount of data matters. With big data, you’ll have to process high volumes of low-
density, unstructured data. This can be data of unknown value, such as Twitter data
feeds, clickstreams on a webpage or a mobile app, or sensor-enabled equipment. For
some organizations, this might be tens of terabytes of data. For others, it may be
hundreds of petabytes.

Velocity

Velocity is the fast rate at which data is received and (perhaps) acted on. Normally, the
highest velocity of data streams directly into memory versus being written to disk. Some
internet-enabled smart products operate in real time or near real time and will require
real-time evaluation and action.
Variety

Variety refers to the many types of data that are available. Traditional data types were
structured and fit neatly in a relational database. With the rise of big data, data comes in
new unstructured data types. Unstructured and semistructured data types, such as text,
audio, and video, require additional preprocessing to derive meaning and support
metadata.

Two more Vs have emerged over the past few years: Value and Variability.

Variability – This refers to the inconsistency the data can show at times, which hampers
the ability to handle and manage the data effectively.

Value – Data has intrinsic value, but it's of no use until that value is discovered. Equally
important: how truthful is your data, and how much can you rely on it?

The exponential growth of data first presented challenges to cutting-edge
businesses such as Google, Yahoo, Amazon, and Microsoft. Existing
tools were becoming inadequate to process such large data sets. Google was the first
to publicize MapReduce—a system they had used to scale their data processing needs.
This system aroused a lot of interest because many other businesses were facing
similar scaling challenges, and it wasn’t feasible for everyone to reinvent their own
proprietary tool. Doug Cutting saw an opportunity and led the charge to develop an
open source version of this MapReduce system, called Hadoop.

History of Apache Hadoop:

Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used
text search library. Hadoop has its origins in Apache Nutch, an open source web search
engine, itself a part of the Lucene project. Nutch was started in 2002, and a working
crawler and search system quickly emerged.
However, its creators realized that their architecture wouldn’t scale to the billions of
pages on the Web. In 2004, Nutch's developers set about writing an open source
implementation of Google's distributed filesystem, the Nutch Distributed Filesystem
(NDFS). Also in 2004, Google published
the paper that introduced MapReduce to the world. Early in 2005, the Nutch developers
had a working MapReduce implementation in Nutch, and by the middle of that year all
the major Nutch algorithms had been ported to run using MapReduce and NDFS. In
January 2006, Yahoo! hired Doug Cutting, and a month later Yahoo! decided to abandon
its prototype and adopt Hadoop. In January 2008, Hadoop was made its own top-level
project at Apache, confirming its success and its diverse, active community. By this time,
Hadoop was being used by many other companies besides Yahoo!, such as Last.fm,
Facebook, and the New York Times.
Types of 'Big Data':
Big data can be found in three types:

1. Structured
2. Unstructured
3. Semi-structured

1. Structured:
Any data that can be stored, accessed, and processed in a fixed format is termed
'structured' data.
Example:

A 'Student' table in a database is an example of structured data.

2. Unstructured:
Any data with an unknown form or structure is classified as unstructured data.
Typical examples of unstructured data are images, videos, Facebook messages, etc.

3. Semi-structured:
Semi-structured data can contain both forms of data.
This type of data has some particular format, but no rigid schema.
Example: XML data.
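
To make the three types concrete, the small Java sketch below shows the same student
information in each form. It is only an illustration; the class, field, and tag names
(Student, rollNo, and so on) are made up for this example, not taken from any particular
system.

    public class DataTypeExamples {

        // Structured: every record follows the same fixed schema,
        // like a row in a 'Student' table in a relational database.
        static class Student {
            int rollNo;
            String name;
            String department;
        }

        public static void main(String[] args) {
            Student s = new Student();   // structured record
            s.rollNo = 101;
            s.name = "Asha";
            s.department = "CSE";

            // Semi-structured: self-describing tags (XML) give it some format,
            // but there is no rigid, predeclared table schema.
            String semiStructured =
                    "<student><rollNo>101</rollNo>"
                  + "<name>Asha</name><dept>CSE</dept></student>";

            // Unstructured: free text (or an image/video file) with no inherent model.
            String unstructured =
                    "Asha from CSE, roll no. 101, joined in 2015 and likes databases.";

            System.out.println(s.name + " | " + semiStructured + " | " + unstructured);
        }
    }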

RDBMS vs HADOOP:

                  Traditional RDBMS            Hadoop (MapReduce)
Data size         Gigabytes                    Petabytes
Access            Interactive and batch        Batch
Updates           Read and write many times    Write once, read many times
Transactions      ACID                         None
Structure         Schema-on-write              Schema-on-read
Integrity         High                         Low
Scaling           Nonlinear                    Linear

SCALE-OUT INSTEAD OF SCALE-UP


Scaling commercial relational databases is expensive. Their design is more friendly to
scaling up: to run a bigger database you need to buy a bigger machine. In fact, it's not
unusual to see server vendors market their expensive high-end machines as
"database-class servers." Unfortunately, at some point there won't be a big enough
machine available for the larger data sets. More importantly, the high-end machines are
not cost effective for many applications. For example, a machine with four times the
power of a standard PC costs a lot more than putting four such PCs in a cluster.

Hadoop is designed to be a scale-out architecture operating on a cluster of commodity
PC machines. Adding more resources means adding more machines to the Hadoop
cluster. Hadoop clusters with ten to hundreds of machines are standard. In fact, other
than for development purposes, there's no reason to run Hadoop on a single server.
KEY/VALUE PAIRS INSTEAD OF RELATIONAL TABLES:
A fundamental tenet of relational databases is that data resides in tables having
relational structure defined by a schema. Although the relational model has great formal
properties, many modern applications deal with data types that don't fit well into this
model. Text documents, images, and XML files are popular examples. Also, large data
sets are often unstructured or semistructured. Hadoop uses key/value pairs as its basic
data unit, which is flexible enough to work with the less-structured data types. In
Hadoop, data can originate in any form, but it eventually transforms into (key/value)
pairs for the processing functions to work on.
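
For instance, when Hadoop reads a plain text file with its default text input format, each
record reaches the processing functions as a pair: the key is the byte offset of the line
within the file and the value is the line itself. The plain-Java sketch below only
illustrates that idea; the sample log lines and offsets are invented.

    import java.util.AbstractMap.SimpleEntry;
    import java.util.List;
    import java.util.Map;

    public class KeyValueSketch {
        public static void main(String[] args) {
            // Each record is just (key, value): here, (byte offset, raw log line).
            // No table schema has to be declared before the data can be processed.
            List<Map.Entry<Long, String>> records = List.of(
                    new SimpleEntry<>(0L,  "2024-01-01 10:00:01 user=alice action=login"),
                    new SimpleEntry<>(45L, "2024-01-01 10:00:07 user=bob action=upload"));

            for (Map.Entry<Long, String> r : records) {
                System.out.println("key=" + r.getKey() + "  value=" + r.getValue());
            }
        }
    }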
FUNCTIONAL PROGRAMMING (MAPREDUCE) INSTEAD OF DECLARATIVE QUERIES (SQL)
SQL is fundamentally a high-level declarative language. You query data by stating the
result you want and let the database engine figure out how to derive it. Under
MapReduce you specify the actual steps in processing the data, which is more
analogous to an execution plan for a SQL engine. Under SQL you have query
statements; under MapReduce you have scripts and code. MapReduce allows you to
process data in a more general fashion than SQL queries. For example, you can build
complex statistical models from your data or reformat your image data. SQL is not well
designed for such tasks.
On the other hand, when working with data that does fit well into relational structures,
some people may find MapReduce less natural to use. Those who are accustomed to
the SQL paradigm may find it challenging to think in the MapReduce way at first. Note,
however, that many extensions are available that let you take advantage of the
scalability of Hadoop while programming in more familiar paradigms; some even enable
you to write queries in a SQL-like language that is automatically compiled into
MapReduce code for execution.
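
As a rough sketch of what "specifying the actual steps" looks like, here is the classic
word-count job written against Hadoop's Java MapReduce API: the map step turns each
line into (word, 1) pairs and the reduce step sums the counts per word, where the same
result in SQL would be a single GROUP BY query. The class names below are
illustrative, and the job setup (driver) code is omitted.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {

        // Map step: for each input line, emit a (word, 1) pair per word.
        public static class TokenizerMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {

            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce step: sum all the 1s emitted for the same word.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {

            private final IntWritable result = new IntWritable();

            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }
    }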

OFFLINE BATCH PROCESSING INSTEAD OF ONLINE TRANSACTIONS


Hadoop is designed for offline processing and analysis of large-scale data. It doesn’t
work for random reading and writing of a few records, which is the type of load for
online transaction processing. In fact, as of this writing (and in the foreseeable future),
Hadoop is best used as a write-once, read-many-times type of data store. In this aspect
it’s similar to data warehouses in the SQL world.
