Copyright 2014 EMC Corporation. All rights reserved.
Why Hadoop?
A Fast and Cheap Way to Exploit Massive Amounts of New Data Sources
- Internet of Things
- Mobile Sensors
- Dark Data
- Smart Grids
- Social Media
- Oil Exploration
- Video Surveillance
- Medical Imaging
Why Hadoop?
Save Money or Make Money: Improve Company Performance

Increase Revenue
- Increase Demand and Spend Efficiency
- Increase Customer Acquisition: Ad Optimization, Hyper-Targeting, Campaign Optimization, Purchase Funnel Analysis
- Increase Customer Engagement: Customer Segmentation, Churn Prevention, Customer Lifetime Value, Ad Effectiveness Analytics
- Manage Demand / Increase Basket Size: Demand Analysis, Price Optimization, Affinity Analytics, Next Best Offer, Cross-Sell / Upsell
- Increase Reach: Digital Marketing, Social Media

Reduce Costs
- Click Fraud Detection
- Improve Customer Loyalty: Social Graph / Influencers, Transaction Anomaly Detection
- Production Cost / Efficiency, Supply / Demand Forecasting
- General and Administrative: Workforce Analytics, Employee Churn, IT / Security Analytics
- Loyalty Program Analytics, Customer Satisfaction, Customer Care Analytics
- Market Mix Modeling, Coupon Redemption
Hadoop Overview
Hadoop is an open-source framework from Apache that enables parallel batch processing of very large data sets.
MapReduce is the Hadoop processing model that divides the workload so multiple nodes can process it in parallel.
HDFS is the distributed file system that holds the data. It provides data protection and locality by storing multiple replicas of each block (usually 3).
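The divide-the-workload idea behind MapReduce can be illustrated with a toy sketch. This is plain Python, not actual Hadoop code: the three phases below (map, shuffle, reduce) run in one process here, but in Hadoop each phase runs distributed across the cluster's nodes.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter

# Map phase: each input line is split into (word, 1) pairs.
# In Hadoop, different nodes would map different splits of the input.
def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

# Shuffle: group intermediate pairs by key, so each reducer
# sees every count emitted for one word.
def shuffle(pairs):
    for key, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield key, [count for _, count in group]

# Reduce phase: sum the counts per word.
def reduce_phase(grouped):
    return {word: sum(counts) for word, counts in grouped}

lines = ["hadoop splits the work", "the work runs in parallel"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["the"])   # "the" appears in both lines -> 2
```

Because the map and reduce functions are pure per-record/per-key operations, the framework is free to run them on as many nodes as the cluster has, which is the source of Hadoop's scalability.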
The traditional deployment pattern creates problems:
- Multiple, siloed clusters to manage
- Redundant common data duplicated in separate clusters
- Peak compute and I/O resource is limited to the number of nodes in each independent cluster

[Diagram: separate Production, Test, and Experimentation clusters per department and workload — e.g. log files, social data, Dept B's ad targeting, a recommendation engine — each holding its own copy of common data.]
GUI simplifies management tasks

[Diagram: Apache HDFS architecture — a single NameNode tracking metadata for the data nodes, with each block replicated (3x by default) across the cluster.]
Traditional Apache Hadoop on direct-attached storage:
- 3X mirroring of every block
- Fixed scalability (compute and storage must scale together)
- Manual import/export of data
- Single NameNode
- No protocol support beyond HDFS

EMC Isilon scale-out storage:
- Independent scalability of compute and storage
- Multi-protocol access to the same data
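The "3X mirroring" figure above is the stock HDFS default, controlled by the documented `dfs.replication` property in `hdfs-site.xml`. As a sketch, lowering it is a one-line config change (whether that is safe depends on what other protection, such as Isilon's, is in place):

```xml
<!-- hdfs-site.xml: default block replication factor.
     3 is the Apache Hadoop default behind the "3X mirroring" cost. -->
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
```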
Hadoop in a VM

Combined storage/compute in each VM:
- VM lifecycle determined by the DataNode
- Limited elasticity
- Limited to Hadoop workloads

Separate storage with multi-tenant compute (compute VMs for tenants T1 and T2 sharing a storage layer):
- Separates compute from data
- Elastic compute
- Enables shared workloads and raises utilization
- Enables deployment of multiple Hadoop runtime versions
Source: http://www.vmware.com/resources/techresources/10360
Customer Case Study: WGSN (Retail)

Challenges
- Rapidly launch a new market intelligence service for fashion retailers
- Support large and growing volumes of Big Data

Solution
- Pivotal Greenplum Database
- Pivotal HD
- EMC Isilon
- Pivotal Data Science Labs

"Performance, scalability, and tight integration with Hadoop were the key reasons we chose Isilon. We also felt very comfortable with the partnership between EMC and Pivotal. In the end, the EMC and Pivotal solution offered the ideal balance of storage and compute with the right level of support."

Results
- Fast deployment with native Hadoop integration, enabling rapid launch of the new service
- Delivered high-performance scalability
- Simplified platform administration
Elasticity
Multi-tenancy
Portability
https://community.emc.com/docs/DOC-26892