
Big Data Notes:

Big Data refers to data that, because of its size, speed or format (that is, its volume,
velocity and variety), cannot be easily stored, manipulated or analysed with traditional
methods such as spreadsheets, relational databases or common statistical software.
What is big data not?
It is not regular data, and it is not something that an experienced data analyst is
automatically ready to deal with. To put it another way, big data does not fit the familiar
analytic paradigm: it will not fit into the rows and columns of an Excel file, it cannot be
analysed with conventional multiple regression, and it will not fit on a normal
desktop computer.
One way of describing big data is by three Vs: Volume, Velocity and Variety.
The simplest definition of big data is data that is too big to work with on your own
computer. This is a relative definition, however: what is big for one system today
will seem small a few years from now.
Volume:
Moore's Law: the physical capacity and performance of computers double roughly every
2 years. For example, the maximum number of rows one can have in a single
spreadsheet has changed over time. Previously it was 65,000; now it is over a million,
which seems like a lot, but if we are logging internet activity that can occur
hundreds of thousands of times per second, we will reach a million rows very
quickly. Photos and videos, on the other hand, consume a very large amount of
storage space.
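A quick back-of-the-envelope sketch makes the point above concrete: at modern logging rates, a million-row spreadsheet fills up in seconds or minutes. The event rates below are illustrative assumptions, not measurements.

```python
# How quickly would event logging exhaust a one-million-row spreadsheet?
ROW_LIMIT = 1_048_576  # row limit of a modern Excel worksheet

for events_per_second in (100, 1_000, 100_000):
    seconds_to_fill = ROW_LIMIT / events_per_second
    print(f"{events_per_second:>7} events/s fills the sheet in "
          f"{seconds_to_fill:,.0f} s ({seconds_to_fill / 60:,.1f} min)")
```

Even the slowest assumed rate here, 100 events per second, fills the sheet in under three hours.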
Velocity: this is when data is coming in very fast. In conventional scientific
research, it could take months to gather data, and hence years to publish
the results. Such data is not only time-consuming to gather, it is also generally
static. By contrast, Twitter processes more than 6,000 tweets per second globally,
meaning more than 500 million tweets per day and more than 200 billion tweets
every year. One can see the figures live at www.internetlivestats.com/twitter-statistics/
Even a simple temperature sensor hooked up to a microprocessor through a serial
connection, sending just one bit of data at a time, can overwhelm a computer if
left running for a long period of time. This kind of constant influx of streaming data
poses special challenges for analysis because the data itself is a moving target. If
one is accustomed to working with static data sets in a program such as SPSS or R,
the complexity of streaming data can be very daunting.
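A minimal sketch of why streaming data needs a different mindset: instead of loading a complete data set, we update summary statistics incrementally as each reading arrives. The sensor readings here are simulated; a real source would be a serial port or message queue.

```python
def running_mean(stream):
    """Yield the mean of everything seen so far, one reading at a time."""
    count, total = 0, 0.0
    for reading in stream:
        count += 1
        total += reading
        yield total / count  # no need to hold the whole history in memory

simulated_readings = [21.5, 21.7, 22.0, 21.9]  # assumed temperature values
for mean in running_mean(simulated_readings):
    print(f"mean so far: {mean:.2f}")
```

The point is that the analysis never sees a finished data set: every new reading shifts the answer, which is exactly what makes a moving target hard for tools built around static files.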
Variety: big data is not just the rows and columns of a nicely formatted data set in a
spreadsheet; instead we can have data in many different
formats. We can have unstructured text such as books, blog posts, comments on
news articles and tweets. One researcher has estimated that 80% of enterprise
data is unstructured, and it can also include photos, videos and audio. Similarly,
data sets can include things like network graph data, or so-called NoSQL
("Not Only SQL") data: graphs of social connections, hierarchical structures and
documents. Any data that does not fit well into the rows and columns of a
conventional relational database or spreadsheet poses serious analytical
challenges. In fact, a recent study by Forrester Research found that variety is the
biggest factor leading companies to adopt Big Data solutions.
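A small illustration of variety: the same customer represented as a flat relational row versus a nested NoSQL-style document. All fields here are invented for illustration.

```python
# A relational row: fixed columns, fits a spreadsheet or SQL table.
relational_row = ("C001", "Ada", "ada@example.com")

# A document: nested and schema-free, mixing graph-like links with
# unstructured free text - it fits poorly into rows and columns.
document = {
    "id": "C001",
    "name": "Ada",
    "contacts": {"email": "ada@example.com", "twitter": "@ada"},
    "friends": ["C002", "C007"],            # graph of social connections
    "comments": ["Great article!"],         # unstructured text
}
print(len(document["friends"]), "connections,",
      len(document["comments"]), "free-text comment")
```

Forcing the document into the row shape would either drop the nested fields or explode them across many extra tables, which is the analytical challenge the notes describe.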
The final question is: do we really need all three
elements for a data set to count as big data? All three may apply
to a big data set, but any one of them alone can be too much for the standard
approach. In fact, Big Data really means data that cannot be handled with the
standard approach, and as a result it presents a number of special challenges.
In addition, there are other Vs that can help in understanding Big Data:
1. Veracity: does the data in your system have enough information at a
micro level for you to draw accurate conclusions about larger groups?
2. Validity: is the data clean and well managed, abiding by certain standards?
3. Value: is the data worth enough to justify the ROI (Return on Investment)?
4. Variability: the data can change over time and place, and there are many
uncontrolled factors that may introduce noise into the data unless you specifically
measure them.
5. Venue: where the data is located, and how that affects its formatting.
6. Vocabulary: refers to the metadata that describes data combined from multiple
sources.
7. Vagueness: do you really understand the goals and have clarity of
purpose in using Big Data?
How is Big Data used?
Big data for consumers: most of the time when people talk about Big Data,
they are talking about the commercial setting, about how businesses can use
big data in advertising and marketing. But one important place where big data is
especially used is for consumers. While the data and algorithms involve incredibly
sophisticated processing, it is nearly invisible: the results are so clean that they give
just a little piece of information, but exactly what we need.
Common applications of big data from consumer standpoint:
1. On an Apple iPhone or iPad, this is what Siri does. For instance, if one
asks "What's the weather like?", Siri knows exactly what we mean,
where we are and what time we are talking about. It can also do things like look for
restaurants.
2. Similarly, Amazon.com makes recommendations for books. For instance, if one
looks at a specific book category, Amazon's recommendation engine generates a
list of other relevant books that might interest the buyer.
3. Google Now: what Google Now does is make recommendations before we ask
for them. Especially when it is linked to the calendar or location sensing on the
phone, it knows where we are and where we need to be, and it can tell us things like
traffic or weather before we even ask. This is based on the enormous amount of
information about what people search for, provided in a
pre-emptive manner.
Big data for business:
Big data is revolutionising the way people do e-commerce.
1. Google ad searches: whenever we search for something on the Google search engine,
we get the information, but we also get the relevant ads that users are most likely to
respond to.
2. Predictive marketing: here big data determines who the audience will
be before they get there. It helps in predicting major life events, like graduating,
getting married or starting a new job, each of which is associated with a
whole series of commercial transactions. Companies look at consumer
behaviour to accomplish this. They may look at how often we log into their
website, what credit card we use, and how often we look at particular items before
moving on to something else. They can look at whether we have applied for an
account with their organisation. They can use demographic information: things like
our age, our marital status, our home address, how far we live
from the store, our estimated salary, whether we have moved recently, and what
websites we visit. Companies may also purchase data about ethnicity,
job history, magazine subscriptions, where we attended college, or whether we have
talked about something online. An enormous amount of information is
potentially available.
3. Fraud detection: online retailers lose about 3.5 billion USD every year to
fraudsters, so this is a big issue. Companies can use a number of signals to
reduce fraud, especially in online transactions. They can look at the point of sale
(POS) and how you are making the purchase; they can look at the geolocation and
IP address, and which computer you are using to access the website; they can look
at the login time, or at biometrics. For instance, the way people move the mouse, or
the time they take between pressing keys, are distinctive measurements of
people; when holding a cell phone, people of different heights hold it at
different angles, as measured by the accelerometer in the phone. All of these can
help determine whether purchases are made by the people claiming to make
them. These patterns of detail, held in extraordinarily large data
sets, let companies verify identities at scale.
Big data for research:
Let us take a look at a few examples where Big Data has influenced scientific
progress:
1. Google Flu Trends, where search patterns for flu-related words were able
to identify outbreaks of the flu in the United States
much faster than the research by the Centers for Disease Control could. Wikipedia
searches can identify them with even greater accuracy.
2. Google Books project: over the last few years, Google has scanned more
than 30 million published books and made them available in digital format. For
instance, plotting the frequency of the words Math, Arithmetic and Algebra over
the last 200 years, Arithmetic shows a strong spike in the 1920s and 30s, whereas
the word Math has risen over the last 50-60 years.
Ten ways Big Data is different from small data:
1. Goals: small data has a specific goal, whereas Big Data may have a goal when it
starts but that goal evolves over time.
2. Location: small data is usually on one computer in a single location, whereas
Big Data can be spread across multiple servers, computers and geographic locations.
3. Data structure: small data is structured, with rows and columns, whereas
Big Data can be in unstructured formats and files, across multiple disciplines.
4. Data preparation: small data is usually prepared by the end user for their own
purposes, whereas with Big Data the data is often prepared by one group of people,
analysed by a second group and then used by a third group,
each with different purposes and in different disciplines.
5. Longevity: small data is kept for a specific period of time, because it has a
clear ending point, but Big Data, since it involves a lot of cost and feeds into other
projects, can be kept for a very long, and uncertain, period of
time.
6. Measurements: small data is typically measured with a single protocol using set
units, usually at the same time, but with Big Data the measurements can
use multiple units, because the sources are geographically widespread, and require
a fair amount of conversion to a uniform standard.
7. Reproducibility: a small data set can be reproduced in its entirety, but Big Data
is difficult to reproduce because it has multiple input data sources.
8. Stakes: with small data, if things go wrong the cost is not significant, whereas with
Big Data the cost can be huge.
9. Introspection: this means the data describes itself. In small data, the data itself is
organised and hence easy to understand, whereas in Big Data the items are difficult
to identify.
10. Analysis: small data can be analysed in one go with simple procedures, whereas
Big Data may require extraction and other intermediate steps that deal with a small
part of the data at a time, eventually aggregating the results into a meaningful analysis.
Sources of big data:
Big data can be generated from multiple sources, and human beings create a lot of
it. Let us take a look at some of these sources:
1. Intentional data: photos, videos, audio, text on social
networks, clicking "Like", web searches, bookmarked web pages, emails and text
messages, cell phone calls, online purchases, etc.


2. Metadata: data about data, or second-order human-generated data. Metadata
can sometimes be larger than the actual piece of data it describes. Because it is
computer generated, it is easier to read and perform computations on. For example,
a photograph carries Exif (Exchangeable Image File Format) metadata. Another
example is email metadata, recording when the email was sent, to whom, in what
context and so on. Example: immersion.media.mit.edu
Machine-generated data: examples include cell phones connecting to towers,
satellite radio, RFID readings, readings from medical devices, and web crawlers.
M2M (machine-to-machine) data is generated by the IoT (Internet of Things).
Examples include smart sensors, smart homes that turn on lights, smart grids, and smart
cities where machines are connected to each other using chips. Applications include
monitoring production lines, smart meters for utility systems, thermostats,
lightbulbs and environmental monitoring.
Cloud computing: examples are
1. IaaS (Infrastructure/hardware as a Service): CPU, RAM and other hardware in the
cloud.
2. PaaS (Platform as a Service): developers create applications on this platform.
Example: Microsoft Azure.
3. DaaS (Data as a Service): an online service that provides access to data, similar
to data marts.
Hadoop:
1. It is not a single product.
2. It is a collection of applications.
3. It is a framework on a platform.
HDFS (Hadoop Distributed File System): stores files across many
computers.
Components of Hadoop:
1. MapReduce: Map splits a task into pieces, and Reduce combines the output.
2. Classic MapReduce job management has been replaced by YARN (Yet Another
Resource Negotiator). With YARN, Hadoop can do batch processing and proper
streaming of data, and can also build graphs of social media activity.
3. Pig: writes MapReduce programs using the Pig Latin programming language.
4. Hive: summarises queries and analyses data using the HiveQL language.
5. HBase: a NoSQL database.
6. Storm: allows processing of streaming data.
7. Spark: allows in-memory processing of data in computer RAM.
8. Giraph: allows graph analysis.
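The Map and Reduce phases described in point 1 can be sketched in a few lines of plain Python. This is a single-process toy, not Hadoop itself: real MapReduce distributes the same two phases across many machines.

```python
from itertools import groupby
from operator import itemgetter

documents = ["big data is big", "data about data"]  # toy input

# Map phase: split the work into (key, value) pieces -
# emit (word, 1) for every word in every document.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group the pairs by key (the word).
mapped.sort(key=itemgetter(0))

# Reduce phase: combine the output by summing the counts per word.
counts = {word: sum(v for _, v in group)
          for word, group in groupby(mapped, key=itemgetter(0))}
print(counts)
```

Because each mapped pair is independent, and each word's reduction only needs that word's pairs, both phases parallelise naturally, which is the core idea behind Hadoop's scalability.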
Hadoop can be installed on any computer or run on a cloud platform, and it is used
by companies such as Yahoo, LinkedIn, Facebook and Google, among others. It is an
open-source project from Apache: free, and modifiable by anyone.
ETL: the term comes from data warehousing and means Extract, Transform and
Load.
1. Extract: the process of pulling data from storage, such as a database.
2. Transform: the process of putting all the data into a common format.
3. Load: the process of loading the data into software for analysis.
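The three steps above can be sketched as a toy pipeline. The CSV source, the Fahrenheit-to-Celsius transform and the SQLite target are all invented for illustration; a real pipeline would extract from a production database or API.

```python
import csv, io, sqlite3

raw = "city,temp_f\nOslo,41\nCairo,95\n"  # pretend source system

# Extract: pull rows out of the source (here, a CSV string).
rows = list(csv.DictReader(io.StringIO(raw)))

# Transform: put everything into a common format (Celsius, floats).
transformed = [(r["city"], round((float(r["temp_f"]) - 32) * 5 / 9, 1))
               for r in rows]

# Load: push the clean rows into the analysis store (in-memory SQLite).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE weather (city TEXT, temp_c REAL)")
db.executemany("INSERT INTO weather VALUES (?, ?)", transformed)
print(db.execute("SELECT * FROM weather").fetchall())
```

Keeping the three steps separate, as here, is the main design point: the extract and load ends can be swapped out without touching the transformation logic in the middle.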
