
Big Data Analysis

ASSIGNMENT #1

Presented to
Meritorious Professor Dr. Aqil Burni
Head of Actuarial Sciences
Institute of Business Management

Adnan Alam Khan (Std_18090)


Q#1: What is Big Data?



Definition:
Big data usually includes data sets with sizes beyond the ability of
commonly used software tools to capture, curate, manage, and process
data within a tolerable elapsed time. [Wikipedia]
Big Data is the new buzzword sweeping the worlds of IT and analytics. The
number of new data sources and the amount of data generated by existing
and new sources are growing at an incredible rate. Some recent statistics
illustrate this explosion:
* Facebook generates 130 terabytes of data each day just in user logs; an
additional 200-400 terabytes are generated by users posting pictures.
* Google processes 25 petabytes (a petabyte is about 1,000 terabytes) each
day.
* The Large Hadron Collider, the world's largest and highest-energy particle
accelerator, built by the European Organization for Nuclear Research,
generates one petabyte of data each second.
The total amount of data created worldwide in 2011 was about one
zettabyte (1,000,000 petabytes), and this is projected to grow by 50%-60%
each year. With all of this data, the demand for skilled statistical problem
solvers has never been greater.
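As a rough back-of-the-envelope illustration of these figures (a sketch only: the one-zettabyte 2011 baseline and the 50%-60% growth rates come from the statistics above, while the five-year horizon and the name projected_zettabytes are illustrative choices), the projection can be written in a few lines of Python:

# Back-of-the-envelope projection of worldwide data volume.
# Assumes the 2011 baseline of 1 zettabyte and a constant annual growth
# rate of 50%-60%, as quoted in the text above.

PETABYTES_PER_ZETTABYTE = 1_000_000   # 1 ZB = 10^6 PB = 10^9 TB

def projected_zettabytes(years_after_2011, annual_growth):
    """Compound the 1 ZB baseline forward at a fixed annual growth rate."""
    return 1.0 * (1.0 + annual_growth) ** years_after_2011

for rate in (0.50, 0.60):
    zb = projected_zettabytes(5, rate)   # five years after 2011, i.e. 2016
    print(f"growth {rate:.0%}: about {zb:.1f} ZB "
          f"({zb * PETABYTES_PER_ZETTABYTE:,.0f} PB)")

Even at the lower growth rate, the volume roughly doubles every year and a half, which is the point the statistics above are making.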
Characteristics of Big Data
Big data can be described by the following characteristics:
Volume - The quantity of data that is generated is very important in this
context. The size of the data determines its value and potential, and
whether it can actually be considered Big Data at all. The name "Big Data"
itself contains a term related to size, hence this characteristic.
Variety - The next aspect of Big Data is its variety, meaning the categories
to which the data belong. Knowing the variety is essential for the analysts
who work closely with the data, helping them use it effectively to their
advantage and thus upholding the importance of Big Data.
Velocity - Velocity in this context refers to the speed at which data are
generated and processed to meet the demands and challenges that lie ahead
on the path of growth and development.
Variability - This is a factor that can be a problem for those who analyze
the data. It refers to the inconsistency the data can show at times, which
hampers the process of handling and managing the data effectively.
Veracity - The quality of the data being captured can vary greatly. The
accuracy of any analysis depends on the veracity of the source data.
Complexity - Data management can become a very complex process,
especially when large volumes of data come from multiple sources. These
data need to be linked, connected, and correlated in order to grasp the
information they are supposed to convey. This situation is therefore
termed the complexity of Big Data.
Technical definitions of Big Data
Despite all of the above, we still need a good definition of Big Data. Two
such definitions come to mind.
The first is the Three Vs: Volume, Velocity, and Variety. Here, Big Data come
in massive quantities, may come at you fast, and may also come in various
forms (e.g. structured and unstructured). To this, we add a fourth V:
Variability. Big Data are highly variable, covering the full range of
experiences in the human condition and the physical world.
The second definition may have more appeal to some: Big Data are data
that are expensive to manage and hard to extract value from. This
definition recognizes that the many forms and varieties of Big Data may be
difficult to collect, manage, process, aggregate, or summarize.
The Berkeley AMP Lab suggests three pillars for dealing with Big Data:
algorithms, machines, and people. New algorithms are needed to deal with
Big Data. The lab recognizes that many existing statistical and data mining
algorithms will not scale, nor will some existing software handle Big Data
(e.g. R is well known to have issues with memory). Furthermore, as data
become bigger, using sampling for prediction or projection may miss
important facts and phenomena.
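To illustrate the point about sampling with a small, purely hypothetical calculation (the number of rare records and the 1% sampling fraction are invented for the example and do not come from the AMP Lab), consider how easily a modest random sample can miss a rare phenomenon:

# Hypothetical example: 10 rare records hidden in a large data set,
# analyzed through a 1% simple random sample.
k_rare = 10              # records exhibiting the rare phenomenon
sample_fraction = 0.01   # a 1% simple random sample

# Each rare record is excluded with probability 0.99, so the chance that
# the sample contains none of them is about 0.99 ** 10.
p_miss_all = (1 - sample_fraction) ** k_rare
print(f"probability the sample misses every rare record: {p_miss_all:.1%}")
# Prints roughly 90%: most samples would reveal nothing about the phenomenon.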
Diagrammatic explanation

Q: Why are we doing this?


Q: What is Information Insights doing with Big Data?
Information Insights is meeting the challenge of Big Data by organizing our
efforts around algorithms, machines, and people.
Algorithms: We have taken the most state-of-the-art statistical methods,
namely hierarchical Bayesian statistical models, and we have parallelized
their implementation. This has resulted in blazing speed and the ability to
do sophisticated analysis on terabytes of data. We continue to innovate by
adding new statistical models that can be used in a variety of applications,
including advanced econometrics, variable selection, hidden Markov
models, Bayesian data fusion, tree-structured models, Bayesian CART, and
Random Forests.
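The company's parallelized implementation is not reproduced here; the following is only a generic sketch of the underlying idea of fitting a model independently on each data partition and then combining the results, using Python's standard multiprocessing module and a simple per-partition mean as a stand-in for a real hierarchical Bayesian fit (all names and numbers in the sketch are illustrative):

# Generic sketch: fit a trivial per-partition "model" (a sample mean) in
# parallel across chunks of the data, then combine the partial results.
# A real system would fit something far richer on each partition.
from multiprocessing import Pool
from statistics import mean
import random

def fit_partition(partition):
    """Stand-in for a per-partition model fit; here just the mean."""
    return mean(partition)

if __name__ == "__main__":
    random.seed(42)
    data = [random.gauss(10.0, 2.0) for _ in range(1_000_000)]

    # Split the data so that each worker process receives one partition.
    n_chunks = 8
    chunk_size = len(data) // n_chunks
    partitions = [data[i * chunk_size:(i + 1) * chunk_size]
                  for i in range(n_chunks)]

    with Pool(processes=n_chunks) as pool:
        partial_results = pool.map(fit_partition, partitions)

    # Combine the per-partition estimates (equal-sized chunks, so a plain
    # average of the partition means equals the overall mean).
    print(f"combined estimate: {mean(partial_results):.3f}")

The appeal of this pattern is that each partition can be processed on a separate core or machine, which is what makes terabyte-scale analysis feasible.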



Machines: We have purchased our own in-house High Performance
Computation Cluster (cloud), which has been tuned for high-speed, complex
mathematical and matrix calculations. We are able to get over 80% of
theoretical performance from this HPCC, compared with the 50% or so
obtained when not so tuned. We currently have over 100 computation cores
and half a terabyte of RAM. A planned expansion is expected to add an
additional 128 cores and 4 terabytes of RAM.
People: We have invested considerable time in finding and training super
smart people. The company's atmosphere and ethos combine to foster
collaboration, communication, and innovation.
