
Acknowledgements

I begin my dissertation with gratitude to God, Who is gracious and merciful.
Successfully completing any task gives us satisfaction as well as inner strength for future
challenges, but no one ever accomplishes such a task alone. A person is always accompanied by a
few people who offer support and suggestions that help the work reach successful completion. I am
therefore pleased to thank all those who motivated me and provided their kind support at all
stages of my research work.
Firstly, I would like to honour my institute, Ideal Institute of Technology, Ghaziabad,
which has provided me with a workplace to learn recent techniques and the conceptual
background to strengthen my programming and professional skills.
I am very grateful to Dr. S.K. Chaudhrai (Director) and Dr. Yaduvir Singh
(Associate Professor & Head, Department of Computer Science & Engineering), Ideal
Institute of Technology, Ghaziabad, for their helpful attitude and encouragement from time to
time to excel in our studies.
I would like to express my sincere and heartfelt gratitude to my honourable, esteemed
supervisor Ms. Anjali Goel (Assistant Professor), Department of Computer Science and
Engineering, for her kind and valuable guidance towards the successful completion of this
presentation work. I am glad to have worked under her supervision.
Furthermore, I am thankful to all faculty members for motivating me, and to the staff of the
computer labs in the department for providing excellent facilities, issuing me
a well-configured computer system and providing regular maintenance.
I would like to extend special thanks to all my batch mates for their love, encouragement
and constant support.
Last but not least, I would like to thank my parents for supporting me in every way to
complete my presentation report.

1302810019

Abstract

Big Data may well be the Next Big Thing in the IT world. Big data burst upon
the scene in the first decade of the 21st century. The first organizations to
embrace it were online and startup firms. Firms like Google, eBay, LinkedIn, and
Facebook were built around big data from the beginning. Like many new
information technologies, big data can bring about dramatic cost reductions,
substantial improvements in the time required to perform a computing task, or
new product and service offerings. Big Data is similar to small data, but bigger
in size; and because the data is bigger, it requires different approaches:
new techniques, tools and architectures that aim to solve new problems, or
old problems in a better way. Big Data generates value from the storage and
processing of very large quantities of digital information that cannot be
analyzed with traditional computing techniques.

Table of Contents

Certificate
Acknowledgements
Abstract
Table of Contents
List of Figures
Chapter 1 : Introduction to Big Data
Chapter 2 : What is Big Data?
Chapter 3 : Characteristics of Big Data
Chapter 4 : Architecture
Chapter 5 : Why Big Data?
Chapter 6 : How Big Data is Different?
Chapter 7 : Big Data Sources
Chapter 8 : Tools Used in Big Data
Chapter 9 : Applications
Chapter 10 : Risks of Big Data
Chapter 11 : Benefits of Big Data
Chapter 12 : Big Data Impacts & Future
Chapter 13 : Conclusion
Chapter 14 : References

List of Figures

Figure No.   Title
3.1          3 Vs Characteristics
4.1          Brainstorming Big Data Architecture
4.3          Processing Big Data
5.1          Standard table size for Big Data
6.1          Traditional data v/s Big Data
7.1          Big Data Sources
7.2          Big Data analytics

CHAPTER 1
INTRODUCTION TO BIG DATA
Big data is a broad term for data sets so large or complex that
traditional data processing applications are inadequate. Challenges
include analysis, capture, data curation, search, sharing, storage,
transfer, visualization, and information privacy. The term often refers
simply to the use of predictive analytics or certain other advanced
methods to extract value from data, and seldom to a particular size of
data set. Accuracy in big data may lead to more confident decision
making. And better decisions can mean greater operational efficiency,
cost reductions and reduced risk.
Analysis of data sets can find new correlations to "spot business
trends, prevent diseases, combat crime and so on." Scientists,
practitioners of media and advertising and governments alike regularly
meet difficulties with large data sets in areas including Internet search,
finance and business informatics. Scientists encounter limitations in e-Science work, including meteorology, genomics, connectomics,
complex physics simulations, and biological and environmental
research.
Data sets grow in size in part because they are increasingly being
gathered by cheap and numerous information-sensing mobile devices,
aerial (remote sensing), software logs, cameras, microphones, radio-frequency identification (RFID) readers, and wireless sensor networks.

The world's technological per-capita capacity to store information has


roughly doubled every 40 months since the 1980s; as of 2012, every
day 2.5 exabytes (2.5 × 10^18 bytes) of data were created. The challenge for
large enterprises is determining who should own big data initiatives
that straddle the entire organization.
Work with big data is necessarily uncommon; most analysis is of "PC
size" data, on a desktop PC or notebook that can handle the available
data set.
Relational database management systems and desktop statistics and
visualization packages often have difficulty handling big data. The
work instead requires "massively parallel software running on tens,
hundreds, or even thousands of servers". What is considered "big data"
varies depending on the capabilities of the users and their tools, and
expanding capabilities make Big Data a moving target. Thus, what is
considered to be "Big" in one year will become ordinary in later years.
"For some organizations, facing hundreds of gigabytes of data for the
first time may trigger a need to reconsider data management options.
For others, it may take tens or hundreds of terabytes before data size
becomes a significant considerations.

CHAPTER 2
WHAT IS BIG DATA?
Big data usually includes data sets with sizes beyond the ability of
commonly used software tools to capture, curate, manage, and process
data within a tolerable elapsed time. Big data "size" is a constantly
moving target, as of 2012 ranging from a few dozen terabytes to many
petabytes of data. Big data is a set of techniques and technologies that
require new forms of integration to uncover large hidden values from
large datasets that are diverse, complex, and of a massive scale.
In a 2001 research report and related lectures, META Group (now
Gartner) analyst Doug Laney defined data growth challenges and
opportunities as being three-dimensional, i.e. increasing volume
(amount of data), velocity (speed of data in and out), and variety (range
of data types and sources). Gartner, and now much of the industry,
continue to use this "3Vs" model for describing big data. In 2012,
Gartner updated its definition as follows: "Big data is high volume,
high velocity, and/or high variety information assets that require new
forms of processing to enable enhanced decision making, insight
discovery and process optimization." Additionally, a new V "Veracity"
is added by some organizations to describe it.
While Gartner's definition (the 3Vs) is still widely used, the growing
maturity of the concept fosters a sounder distinction between big
data and Business Intelligence, regarding data and their use:
Business Intelligence uses descriptive statistics with data with
high information density to measure things, detect trends etc.;

Big data uses inductive statistics and concepts from nonlinear


system identification to infer laws (regressions, nonlinear
relationships, and causal effects) from large sets of data with low
information density to reveal relationships, dependencies and
perform predictions of outcomes and behaviors.
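To make this contrast concrete, here is a minimal, self-contained sketch (our own toy example in Python, not from any source cited in this report) of the inductive style described above: a regression inferred from a large, noisy, low-information-density sample.

```python
import numpy as np

rng = np.random.default_rng(42)

# A large, noisy sample (low information density) around a hidden law y = 3x + 2.
x = rng.uniform(0, 10, size=100_000)
y = 3.0 * x + 2.0 + rng.normal(scale=5.0, size=x.size)

# Inductive step: infer the relationship (a simple regression) from the data alone.
slope, intercept = np.polyfit(x, y, deg=1)
print(f"estimated law: y = {slope:.2f}x + {intercept:.2f}")  # close to y = 3x + 2
```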
A more recent, consensual definition states that "Big Data represents the Information assets characterized by such a
High Volume, Velocity and Variety to require specific Technology and
Analytical Methods for its transformation into Value".


CHAPTER 3
CHARACTERISTICS OF BIG DATA

3.1 Big data can be described by the following characteristics:

[Figure: 3 Vs Characteristics of Big Data (Volume, Velocity, Variety)]

Figure : 3.1

Volume - The quantity of data that is generated is very important
in this context. It is the size of the data which determines the
value and potential of the data under consideration, and whether it
can actually be considered Big Data or not. The name "Big Data"
itself contains a term related to size, hence this characteristic.

Variety - The next aspect of Big Data is its variety. The category
to which the data belongs is also a very essential fact that needs
to be known by the data analysts. This helps the people who closely
analyze the data, and are associated with it, to use the data
effectively to their advantage, thus upholding the importance of
the Big Data.

Velocity - The term velocity in this context refers to the speed of


generation of data or how fast the data is generated and processed
to meet the demands and the challenges which lie ahead in the
path of growth and development.

Variability - This is a factor which can be a problem for those
who analyze the data. It refers to the inconsistency which the data
can show at times, hampering the process of handling and managing
the data effectively.

Veracity - The quality of the data being captured can vary greatly.
Accuracy of analysis depends on the veracity of the source data.

Complexity - Data management can become a very complex
process, especially when large volumes of data come from
multiple sources. These data need to be linked, connected and
correlated in order to grasp the information they are supposed
to convey. This situation is therefore termed the complexity
of Big Data.

3.2 Factory work and Cyber-physical systems may have a 6C system:


1. Connection (sensor and networks),
2. Cloud (computing and data on demand),
3. Cyber (model and memory),
4. Content/context (meaning and correlation),
5. Community (sharing and collaboration), and
6. Customization (personalization and value).


In this scenario and in order to provide useful insight to the factory


management and gain correct content, data has to be processed with
advanced tools (analytics and algorithms) to generate meaningful
information. Considering the presence of visible and invisible issues in
an industrial factory, the information generation algorithm has to be
capable of detecting and addressing invisible issues such as machine
degradation, component wear, etc. on the factory floor.


CHAPTER 4
ARCHITECTURE
4.1 In 2000, Seisint Inc. developed a C++-based distributed file-sharing
framework for data storage and querying. Structured, semi-structured
and/or unstructured data is stored and distributed across multiple
servers. Querying of data is done with a modified C++ dialect called ECL,
which applies a schema-on-read method to create the structure of the
stored data at query time. In 2004 LexisNexis acquired Seisint Inc., and in
2008 it acquired ChoicePoint, Inc. and their high-speed parallel
processing platform. The two platforms were merged into HPCC
Systems, which in 2011 was open-sourced under the Apache v2.0 License.
Currently HPCC and the Quantcast File System are the only publicly
available platforms capable of analyzing multiple exabytes of data.
In 2004, Google published a paper on a process called MapReduce
that used such an architecture. The MapReduce framework provides
a parallel processing model and associated implementation to process
huge amounts of data. With MapReduce, queries are split and
distributed across parallel nodes and processed in parallel (the Map
step). The results are then gathered and delivered (the Reduce step).
The framework was very successful, so others wanted to replicate the
algorithm. Therefore, an implementation of the MapReduce
framework was adopted by an Apache open source project named
Hadoop.
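As a minimal illustration of the Map and Reduce steps described above, the following sketch (plain Python, not the Hadoop API; the function names and toy input are invented for this example) splits a word count across chunks, maps each chunk, then gathers and reduces the results.

```python
from collections import defaultdict
from itertools import chain

def map_chunk(chunk):
    # Map step: each node turns its chunk of text into (word, 1) pairs.
    return [(word.lower(), 1) for word in chunk.split()]

def reduce_counts(pairs):
    # Reduce step: all pairs sharing a key are combined into one total.
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

if __name__ == "__main__":
    # The query (a word count) is split across "parallel nodes"...
    chunks = ["big data is big", "data about data"]
    mapped = [map_chunk(c) for c in chunks]           # processed in parallel (Map)
    # ...then the results are gathered and delivered (Reduce).
    print(reduce_counts(chain.from_iterable(mapped)))
    # {'big': 2, 'data': 3, 'is': 1, 'about': 1}
```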
MIKE2.0 is an open approach to information management that
acknowledges the need for revisions due to big data implications in
an article titled "Big Data Solution Offering".


The methodology addresses handling big data in terms of useful


permutations of data sources, complexity in interrelationships, and
difficulty in deleting (or modifying) individual records.
Recent studies show that the use of a multiple-layer architecture is an
option for dealing with big data. The distributed parallel architecture
distributes data across multiple processing units, and the parallel
processing units provide data much faster by improving processing
speeds. This type of architecture inserts data into a parallel DBMS,
which implements the use of MapReduce and Hadoop frameworks.
This type of framework looks to make the processing power
transparent to the end user by using a front-end application server.
Big Data Analytics for Manufacturing Applications can be based on a
5C architecture (connection, conversion, cyber, cognition, and
configuration).

Figure: 4.1


4.2 Big Data Lake


With the changing face of the business and IT sectors, the capture and
storage of data has evolved into a sophisticated system. The big data
lake allows an organization to shift its focus from centralized control to
a shared model in order to respond to the changing dynamics of information
management. This enables quick segregation of data into the data lake,
thereby reducing the overhead time.

4.3 Storing, Selecting and Processing of Big Data


4.3.1 STORING
Analyzing your data characteristics
Selecting data sources for analysis
Eliminating redundant data
Establishing the role of NoSQL
Overview of Big Data stores
Data models: key-value, graph, document, column-family (see the sketch after this list)
Hadoop Distributed File System
HBase
Hive
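To make the data-model terms above concrete, here is a minimal sketch, in plain Python structures, of how the same customer record might look in each of the four NoSQL models; the field names and values are invented purely for illustration.

```python
# Key-value model: an opaque value looked up by a single key.
kv_store = {"customer:42": '{"name": "Asha", "city": "Ghaziabad"}'}

# Document model: a self-describing, possibly nested record addressed by an id.
document = {
    "_id": 42,
    "name": "Asha",
    "orders": [{"item": "laptop", "qty": 1}, {"item": "mouse", "qty": 2}],
}

# Column-family model (HBase-style): row key -> column families -> columns.
column_family_row = {
    "row_key": "42",
    "profile": {"name": "Asha", "city": "Ghaziabad"},
    "orders": {"2015-01-03": "laptop", "2015-01-07": "mouse"},
}

# Graph model: entities as nodes, relationships as labelled edges.
nodes = {42: {"label": "Customer", "name": "Asha"},
         7: {"label": "Product", "name": "laptop"}}
edges = [(42, "BOUGHT", 7)]

print(document["orders"][0]["item"])  # -> laptop
```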

4.3.2 SELECTING BIG DATA STORES


Choosing the correct data stores based on your data characteristics
Moving code to data
Implementing polyglot data store solutions
Aligning business goals to the appropriate data store.
4.3.3 PROCESSING
Integrating disparate data stores
Mapping data to the programming framework
Connecting and extracting data from storage
Transforming data for processing
Subdividing data in preparation for Hadoop MapReduce
Employing Hadoop MapReduce
Creating the components of Hadoop MapReduce jobs (see the sketch after this list)
Distributing data processing across server farms
Executing Hadoop MapReduce jobs
Monitoring the progress of job flows
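One common way to create the components of a Hadoop MapReduce job listed above is Hadoop Streaming, where the mapper and reducer are small scripts that read stdin and write stdout. The sketch below is an illustrative word-count pair in that style; the script name and the local test pipeline are assumptions, not taken from this report.

```python
#!/usr/bin/env python
# Mapper and reducer for a word count, combined in one script for brevity;
# in a real Hadoop Streaming job they would typically be two separate scripts.
import sys

def mapper(lines):
    # Emit one "word<TAB>1" record per word (the Map step).
    for line in lines:
        for word in line.strip().split():
            print(f"{word.lower()}\t1")

def reducer(lines):
    # Streaming delivers records sorted by key, so counts for a word are contiguous.
    current, total = None, 0
    for line in lines:
        word, count = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    role = sys.argv[1] if len(sys.argv) > 1 else "map"
    (mapper if role == "map" else reducer)(sys.stdin)
```

Locally the pair can be tested with a pipeline such as: python wordcount.py map < input.txt | sort | python wordcount.py reduce, which mimics the shuffle-and-sort the framework performs between the two steps (wordcount.py is a hypothetical name for the script above).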


Figure: 4.3

CHAPTER 5
WHY BIG DATA?
5.1 Standard table for Big Data size


Figure: 5.1

The growth of Big Data is driven by:

Increase of storage capacities
Increase of processing power
Availability of data (different data types)
Every day we create 2.5 quintillion bytes of data; 90% of the data in the world today has been created in the last two years alone
Facebook (FB) generates 10 TB of data daily
Twitter generates 7 TB of data daily
IBM claims 90% of today's stored data was generated in just the
last two years.

CHAPTER 6
HOW BIG DATA IS DIFFERENT?
1) Automatically generated by a machine (e.g. a sensor embedded in an engine)
2) Typically an entirely new source of data (e.g. use of the internet)
3) Not designed to be friendly (e.g. text streams)
4) May not have much value, so there is a need to focus on the important part

Figure: 6.1

CHAPTER 7
BIG DATA SOURCES

[Figure: Big Data sources, showing users, applications, sensors and systems feeding large and growing Big Data files]

Figure: 7.1

7.1 Data generation points (examples):


Mobiles
Microphones
Readers/Scanners
Science facilities
Programs/Software
Social Media
Cameras

7.2 Big Data Analytics:


Examining large amounts of data
Extracting appropriate information
Identification of hidden patterns and unknown correlations
Competitive advantage
Better business decisions: strategic and operational
Effective marketing, customer satisfaction, increased revenue

Figure: 7.2

CHAPTER 8
TOOLS USED IN BIG DATA
8.1 TOOLS


Big data requires exceptional technologies to efficiently process


large quantities of data within tolerable elapsed times. A 2011
McKinsey report suggests suitable technologies include A/B testing,
crowdsourcing, data fusion and integration, genetic algorithms,
machine learning, natural language processing, signal processing,
simulation, time series analysis and visualization. Multidimensional
big data can also be represented as tensors, which can be handled more
efficiently by tensor-based computation, such as multilinear
subspace learning. Additional technologies being applied to big data
include massively parallel-processing (MPP) databases, search-based
applications, data mining, distributed file systems, distributed
databases, cloud-based infrastructure (applications, storage and
computing resources) and the Internet.
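To illustrate the tensor representation mentioned above, here is a minimal NumPy sketch (our own toy example, not from any cited tool) that stores sensor readings as a 3-way tensor and unfolds it along one mode, the basic operation behind multilinear subspace learning.

```python
import numpy as np

# A toy 3-way tensor of sensor readings: (days, sensors, hourly readings) = (4, 3, 24).
readings = np.random.default_rng(0).normal(size=(4, 3, 24))

def unfold(tensor, mode):
    """Mode-n unfolding: move axis `mode` to the front and flatten the rest."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

# Unfolding along the sensor mode gives one row per sensor (3 x 96), a matrix
# that standard subspace methods (e.g. PCA via the SVD) can then work on.
sensor_matrix = unfold(readings, mode=1)
print(sensor_matrix.shape)  # (3, 96)
```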
Some but not all MPP relational databases have the ability to store
and manage petabytes of data. Implicit is the ability to load, monitor,
back up, and optimize the use of the large data tables in the RDBMS.
DARPA's Topological Data Analysis program seeks the fundamental
structure of massive data sets and in 2008 the technology went public
with the launch of a company called Ayasdi.
The practitioners of big data analytics processes are generally hostile
to slower shared storage, preferring direct-attached storage (DAS) in
its various forms from solid state drive (SSD) to high capacity SATA
disk buried inside parallel processing nodes. The perception of
shared storage architectures (Storage Area Network (SAN) and
Network-Attached Storage (NAS)) is that they are relatively slow,
complex, and expensive. These qualities are not consistent with big
data analytics systems that thrive on system performance, commodity
infrastructure, and low cost.
Real or near-real time information delivery is one of the defining
characteristics of big data analytics. Latency is therefore avoided
whenever and wherever possible. Data in memory is good; data on
spinning disk at the other end of an FC SAN connection is not. The


cost of a SAN at the scale needed for analytics applications is very


much higher than other storage techniques.
There are advantages as well as disadvantages to shared storage in
big data analytics, but big data analytics practitioners as of 2011 did
not favour it.
8.2 Examples:
Where processing is hosted?
Distributed Servers / Cloud (e.g. Amazon EC2)
Where data is stored?
Distributed Storage (e.g. Amazon S3)
What is the programming model?
Distributed Processing (e.g. MapReduce)
How data is stored & indexed?
High-performance schema-free databases (e.g. MongoDB; see the sketch after this list)
What operations are performed on data?
Analytic / Semantic Processing
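As a concrete illustration of the schema-free storage item above, here is a brief sketch using the standard pymongo driver; the server address, database and collection names, and the sample documents are assumptions made only for this example.

```python
from pymongo import MongoClient

# Connect to a local MongoDB server (address, database and collection names assumed).
client = MongoClient("mongodb://localhost:27017")
collection = client["bigdata_demo"]["tweets"]  # schema-free collection

# Documents need no predefined schema; each one can carry its own fields.
collection.insert_one({"user": "alice", "text": "big data!", "retweets": 3})
collection.insert_one({"user": "bob", "text": "hello", "hashtags": ["#data"]})

# Query by example and count documents without ever declaring a table structure.
print(collection.find_one({"user": "alice"}))
print(collection.count_documents({}))
```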

CHAPTER 9
APPLICATIONS

Big data has increased the demand for information management
specialists, to the extent that Software AG, Oracle Corporation, IBM, Microsoft,
SAP, EMC, HP and Dell have spent more than $15 billion on software
firms specializing in data management and analytics. In 2010, this
industry was worth more than $100 billion and was growing at almost
10 percent a year: about twice as fast as the software business as a
whole.
Developed economies make increasing use of data-intensive
technologies. There are 4.6 billion mobile-phone subscriptions
worldwide and between 1 billion and 2 billion people accessing the
internet. Between 1990 and 2005, more than 1 billion people worldwide
entered the middle class, which means more and more people who gain
money will become more literate, which in turn leads to information
growth. The world's effective capacity to exchange information through
telecommunication networks was 281 petabytes in 1986, 471 petabytes
in 1993, 2.2 exabytes in 2000 and 65 exabytes in 2007, and it is predicted
that the amount of traffic flowing over the internet will reach 667
exabytes annually by 2014. It is estimated that one third of the globally
stored information is in the form of alphanumeric text and still image
data, which is the format most useful for most big data applications.
This also shows the potential of yet unused data (i.e. in the form of
video and audio content).
While many vendors offer off-the-shelf solutions for Big Data, experts
recommend the development of in-house solutions custom-tailored to
solve the company's problem at hand if the company has sufficient
technical capabilities.

9.1 Government


The use and adoption of Big Data within governmental processes is


beneficial and allows efficiencies in terms of cost, productivity, and
innovation. That said, this process does not come without its flaws.
Data analysis often requires multiple parts of government (central and
local) to work in collaboration and create new and innovative processes
to deliver the desired outcome. Below are some thought-leading examples
within the governmental Big Data space.
9.1.1 United States of America:
In 2012, the Obama administration announced the Big Data
Research and Development Initiative, to explore how big data
could be used to address important problems faced by the
government. The initiative is composed of 84 different big data
programs spread across six departments.
Big data analysis played a large role in Barack Obama's
successful 2012 re-election campaign.
The United States Federal Government owns six of the ten most
powerful supercomputers in the world.
The Utah Data Center is a data center currently being constructed by
the United States National Security Agency. When finished, the facility
will be able to handle a large amount of information collected by the
NSA over the Internet. The exact amount of storage space is unknown,
but more recent sources claim it will be on the order of a few exabytes.
9.1.2 India
Big data analysis was, in part, responsible for the BJP and its
allies winning the highly successful 2014 Indian General Election.
The Indian Government utilises numerous techniques to ascertain
how the Indian electorate is responding to government action, as
well as ideas for policy augmentation.

9.1.3 United Kingdom


Examples of uses of big data in public services:

Data on prescription drugs: by connecting origin, location and the


time of each prescription, a research unit was able to exemplify
the considerable delay between the release of any given drug, and
a UK-wide adaptation of the National Institute for Health and
Care Excellence guidelines. This suggests that new or the most up-to-date drugs take some time to filter through to the general patient.
Joining up data: a local authority blended data about services,
such as road gritting rotas, with services for people at risk, such
as 'meals on wheels'. The connection of data allowed the local
authority to avoid any weather-related delay.
9.2 International development
Research on the effective usage of information and communication
technologies for development (also known as ICT4D) suggests that
big data technology can make important contributions but also
present unique challenges to International development.
Advancements in big data analysis offer cost-effective opportunities
to improve decision-making in critical development areas such as
health care, employment, economic productivity, crime, security,
and natural disaster and resource management. However,
longstanding challenges for developing regions such as inadequate
technological infrastructure and economic and human resource
scarcity exacerbate existing concerns with big data such as privacy,
imperfect methodology, and interoperability issues.
9.3 Manufacturing
Based on the TCS 2013 Global Trend Study, improvements in supply
planning and product quality provide the greatest benefit of big
data for manufacturing. Big data provides an infrastructure for
transparency in the manufacturing industry, which is the ability to
unravel uncertainties such as inconsistent component performance
and availability. Predictive manufacturing, as an applicable
approach toward near-zero downtime and transparency, requires
vast amounts of data and advanced prediction tools for the systematic
processing of data into useful information. A conceptual framework
of predictive manufacturing begins with data acquisition, where
different types of sensory data are available, such as
acoustics, vibration, pressure, current, voltage and controller data.
Vast amounts of sensory data, in addition to historical data, constitute
the big data in manufacturing. The generated big data
acts as the input into predictive tools and preventive
strategies such as Prognostics and Health
Management (PHM).
Cyber-Physical Models
Current PHM implementations mostly utilize data captured during
actual usage, while analytical algorithms can perform more
accurately when more information from throughout the machine's
lifecycle, such as system configuration, physical knowledge and
working principles, is included. There is a need to systematically
integrate, manage and analyze machinery or process data during
different stages of the machine life cycle to handle data and
information more efficiently and to achieve better transparency of
machine health condition for the manufacturing industry.
With this motivation, a cyber-physical (coupled) model scheme
has been developed (see http://www.imscenter.net/cyberphysical-platform). The coupled model is a digital twin of the real
machine that operates in the cloud platform and simulates the
health condition with integrated knowledge from both data-driven
analytical algorithms as well as other available physical
knowledge. It can also be described as a 5S systematic approach
consisting of Sensing, Storage, Synchronization, Synthesis and
Service. The coupled model first constructs a digital image from
the early design stage. System information and physical
knowledge are logged during product design, based on which a
simulation model is built as a reference for future analysis. Initial
parameters may be statistically generalized and they can be tuned
using data from testing or the manufacturing process using
parameter estimation. After this, the simulation model can be
considered as a mirrored image of the real machine, which is able


to continuously record and track machine condition during the
later utilization stage. Finally, with ubiquitous connectivity
offered by cloud computing technology, the coupled model also
provides better accessibility of machine condition for factory
managers in cases where physical access to actual equipment or
machine data is limited.
9.4 Media
Internet of Things (IoT)
To understand how the media utilizes Big Data, it is first
necessary to provide some context on the mechanisms used in the
media process. It has been suggested by Nick Couldry and Joseph
Turow that practitioners in media and advertising approach big
data as many actionable points of information about millions of
individuals. The industry appears to be moving away from the
traditional approach of using specific media environments such as
newspapers, magazines, or television shows, and instead taps into
consumers with technologies that reach targeted people at optimal
times in optimal locations. The ultimate aim is to serve, or
convey, a message or content that is (statistically speaking) in line
with the consumer's mindset. For example, publishing
environments are increasingly tailoring messages (advertisements)
and content (articles) to appeal to consumers that have been
exclusively gleaned through various data-mining activities.
Targeting of consumers (for advertising by marketers)
Data-capture
Big Data and the IoT work in conjunction. From a media
perspective, data is the key derivative of device interconnectivity
and allows accurate targeting. The Internet of Things, with the
help of big data, therefore transforms the media industry,
companies and even governments, opening up a new era of

economic growth and competitiveness. The intersections of


people, data and intelligent algorithms have far-reaching impacts
on media efficiency. The wealth of data generated allows an
elaborate layer on the present targeting mechanisms of the
industry.
9.5 Technology
eBay.com uses two data warehouses at 7.5 petabytes and 40PB,
as well as a 40PB Hadoop cluster, for search, consumer
recommendations, and merchandising (see "Inside eBay's 90PB data
warehouse").

Amazon.com handles millions of back-end operations every


day, as well as queries from more than half a million third-party
sellers. The core technology that keeps Amazon running is
Linux-based and as of 2005 they had the world's three largest
Linux databases, with capacities of 7.8 TB, 18.5 TB, and 24.7
TB.

Facebook handles 50 billion photos from its user base.

As of August 2012, Google was handling roughly 100 billion


searches per month.

9.6 Private sector


9.6.1 Retail
Walmart handles more than 1 million customer transactions every
hour, which are imported into databases estimated to contain more
than 2.5 petabytes (2560 terabytes) of data, the equivalent of 167
times the information contained in all the books in the US Library
of Congress.
9.6.2 Retail Banking


FICO Card Detection System protects accounts world-wide.


The volume of business data worldwide, across all companies,
doubles every 1.2 years, according to estimates.
9.6.3 Real Estate
Windermere Real Estate uses anonymous GPS signals from
nearly 100 million drivers to help new home buyers determine
their typical drive times to and from work throughout various
times of the day.
9.7 Science
The Large Hadron Collider experiments represent about 150 million
sensors delivering data 40 million times per second. There are nearly
600 million collisions per second. After filtering and refraining from
recording more than 99.99995% of these streams, there are 100
collisions of interest per second.
As a result, only working with less than 0.001% of the sensor
stream data, the data flow from all four LHC experiments
represents a 25 petabyte annual rate before replication (as of
2012). This becomes nearly 200 petabytes after replication.
If all sensor data were to be recorded in LHC, the data flow
would be extremely hard to work with. The data flow would
exceed a 150 million petabyte annual rate, or nearly 500
exabytes per day, before replication. To put the number in
perspective, this is equivalent to 500 quintillion (5 × 10^20) bytes
per day, almost 200 times more than all the other sources
combined in the world.
The Square Kilometre Array is a telescope which consists of
millions of antennas and is expected to be operational by 2024.
Collectively, these antennas are expected to gather 14 exabytes and
store one petabyte per day. It is considered to be one of the most
ambitious scientific projects ever undertaken.

9.8 Science and Research


When the Sloan Digital Sky Survey (SDSS) began collecting
astronomical data in 2000, it amassed more in its first few
weeks than all data collected in the history of astronomy.
Continuing at a rate of about 200 GB per night, SDSS has
amassed more than 140 terabytes of information. When the
Large Synoptic Survey Telescope, successor to SDSS, comes
online in 2016, it is anticipated to acquire that amount of data
every five days.
Decoding the human genome originally took 10 years to
process; now it can be achieved in less than a day: the DNA
sequencers have divided the sequencing cost by 10,000 in the
last ten years, which is 100 times cheaper than the reduction in
cost predicted by Moore's Law.
The NASA Center for Climate Simulation (NCCS) stores 32
petabytes of climate observations and simulations on the
Discover supercomputing cluster.

CHAPTER 10

RISKS OF BIG DATA


Being overwhelmed by data
- Need the right people to solve the right problems
Costs escalating too fast
- It isn't necessary to capture 100% of the data
Privacy concerns around many sources of big data
- Self-regulation
- Legal regulation
10.1 Leading Technology Vendors
10.1.1 Example Vendors
IBM Netezza
EMC Greenplum
Oracle Exadata
10.1.2 Commonality
MPP architectures
Commodity Hardware
RDBMS based
Full SQL compliance


CHAPTER 11
BENEFITS OF BIG DATA
Real-time big data isn't just a process for storing petabytes or
exabytes of data in a data warehouse; it's about the ability to make
better decisions and take meaningful actions at the right time.
Fast forward to the present and technologies like Hadoop give you
the scale and flexibility to store data before you know how you are
going to process it.
Technologies such as MapReduce, Hive and Impala enable you to
run queries without changing the data structures underneath.
Our newest research finds that organizations are using big data to
target customer-centric outcomes, tap into internal data and build a
better information ecosystem.
Big Data is already an important part of the $64 billion database
and data analytics market.

It offers commercial opportunities comparable in scale to enterprise
software in the late 1980s, the Internet boom of the 1990s, and the
social media explosion of today.


CHAPTER 12
BIG DATA IMPACTS & FUTURE

12.1 BIG DATA IMPACT


Big data is a troublesome force, presenting opportunities as well as
challenges to IT organizations.
By 2015 there will be 4.4 million IT jobs in Big Data; 1.9 million of
them in the US alone.

India will require a minimum of 1 lakh (100,000) data scientists in the next
couple of years, in addition to data analysts and data managers, to
support the Big Data space.

12.1.1 Potential Value of Big Data:


$300 billion potential annual value to US health care.
$600 billion potential annual consumer surplus from using
personal location data.
A potential 60% increase in retailers' operating margins.
12.1.2 India Big Data:

Gaining traction
Huge market opportunities for IT services (82.9% of revenues)
and analytics firms (17.1%).
The current market size is $200 million, expected to reach $1 billion by 2015.
The opportunity for Indian service providers lies in offering
services around Big Data implementation and analytics for
global multinationals.


12.2 FUTURE OF BIG DATA


More than $15 billion has been spent on software firms specializing in
data management and analytics.
This industry on its own is worth more than $100 billion and is
growing at almost 10% a year, roughly twice as fast as
the software business as a whole.
In February 2012, the open source analyst firm Wikibon released
the first market forecast for Big Data, listing $5.1B revenue in
2012 with growth to $53.4B in 2017.
The McKinsey Global Institute estimates that data volume is
growing 40% per year, and will grow 44x between 2009 and
2020.


CHAPTER 13
CONCLUSION
The availability of Big Data, low-cost commodity hardware, and
new information management and analytic software have produced a
unique moment in the history of data analysis. The convergence of
these trends means that we have the capabilities required to analyze
astonishing data sets quickly and cost-effectively for the first time in
history. These capabilities are neither theoretical nor trivial. They
represent a genuine leap forward and a clear opportunity to realize
enormous gains in terms of efficiency, productivity, revenue, and
profitability.
The Age of Big Data is here, and these are truly
revolutionary times if both business and technology
professionals continue to work together and deliver on
the promise.


CHAPTER 14
REFERENCES

www.google.com
www.wikipedia.com
www.studymafia.org

