Big Data and Real Time Analytics – Streams and Hadoop
Infrastructure Matters – 2014 Briefing
© 2014 IBM Corporation

Big Data is more than just Hadoop
Service Oriented Finance CMO: “What can you tell me about Big Data? I want to know all about Hadoop.”
IBM: “Big Data is a lot more than Hadoop! And our competitors don’t understand this. They cannot deliver value on an entire set of Big Data use cases.”

IBM Big Data Solutions
Analyze all data, from any source, with the right technology
[Figure: IBM Big Data & Analytics Infrastructure. All Data flows through a set of zones into New/Enhanced Applications. The stack rests on Systems, Security and Storage, and is offered On premise, Cloud, or As a service.]
Zones shown in the figure:
• Real-time analytics zone
• Information ingestion and operational information zone
• Exploration, landing and archive zone
• Enterprise warehouse, data mart and analytic appliances zone
• Information governance zone
Products named in the figure: InfoSphere Streams, BigInsights and Hadoop, DB2 BLU, PureData System for Analytics, InfoSphere Server and DataStage, Cognos, SPSS
Questions answered by the applications:
• What is happening? Discovery and exploration
• Why did it happen? Reporting, analysis, content analytics
• What could happen? Predictive analytics and modeling
• What action should I take? Decision management
• Cognitive Fabric

There are two main types of Big Data
Data in motion (Real-time Analytics Zone, Stream Computing)
• Data typically not stored
• Tremendous velocity
• Ultra low latency required
• Multiple data sources
• Huge volumes of unstructured data
Data at rest (Landing and Analytics Zone, Hadoop System)
• Data stored on disk
• Huge volumes of unstructured data
• No pre-defined schemas
• Too large for traditional tools to process in a timely manner
Our competitors do not address both of these!

New programming models and low cost hardware solve Big Data problems
• Streaming and Apache Hadoop applications
• Proven frameworks to process large amounts of data
• Streaming for data in motion, Hadoop for data at rest
• Enable applications to transparently work with large clusters of nodes in parallel
Clusters of low cost POWER8 servers are ideal for Hadoop and streaming applications
[Figure: a Streaming Application running on a Streaming Cluster, next to a Hadoop Cluster]

Service Oriented Finance wants to gain a competitive advantage from Big Data
Service Oriented Finance Market Manager: “This application will give our market managers a real advantage!”
Service Oriented Finance wants to deploy a stock trading application with the following requirements:
• Process millions of trades per second
• Application must scale
• Constant flow of input data
• Microsecond latency
• Unstructured trade data input
• Sophisticated analytics logic
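
A constant flow of input with tight latency implies handling each trade as it arrives rather than storing the full stream first. The following is a minimal plain-Python sketch of that idea, not InfoSphere Streams code; the `Trade` fields, the `rolling_vwap` name and the sample values are hypothetical, and trade volumes are assumed to be positive.

```python
from collections import deque
from typing import Iterable, Iterator, NamedTuple


class Trade(NamedTuple):
    # Hypothetical record layout, for illustration only.
    symbol: str
    price: float
    volume: int


def rolling_vwap(trades: Iterable[Trade], window: int) -> Iterator[float]:
    """Yield the volume-weighted average price over the last `window` trades.

    Each trade is touched once, on arrival. Only the current window is kept
    in memory, never the whole stream.
    """
    recent: deque = deque()
    notional = 0.0  # running sum of price * volume inside the window
    volume = 0      # running sum of volume inside the window
    for trade in trades:
        recent.append(trade)
        notional += trade.price * trade.volume
        volume += trade.volume
        if len(recent) > window:
            old = recent.popleft()  # evict the oldest trade from the window
            notional -= old.price * old.volume
            volume -= old.volume
        yield notional / volume


sample = [Trade("SOF", 10.0, 100), Trade("SOF", 11.0, 300), Trade("SOF", 12.0, 100)]
vwaps = list(rolling_vwap(sample, window=2))
print(vwaps)
```

Because `rolling_vwap` is a generator, it can be fed from any iterable, including one that never ends.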

InfoSphere Streams is a platform for Real-time Analytics on Big Data in motion
InfoSphere Streams can meet these requirements!
Streams is a platform for real-time analytics on Big Data
Our competitors do not have this capability
[Figure: Streams takes in sensor, video, audio, text and relational data sources and applies powerful analytics to deliver just in time decisions, at millions of events per second with microsecond latency.]

Streams makes it easy for data in motion programming
• Developer Role
  • Eclipse based tools
  • Visual application monitoring
  • Built in accelerators
• Administrator Role
  • Stream data visualization
  • Start/stop jobs
• Business User Role
  • Stream data visualization
[Figure: InfoSphere Streams Console]

Streams programming is drag and drop simple
[Figure: Application composition (Optimized Compilation) from Source Adapters, an Operator Repository, and Sink Adapters]
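
The source adapter, operator and sink adapter roles named on this slide can be sketched with Python generators. This is an illustration of the composition pattern only, not the Streams programming language or its toolkit; every function name and the record format below are assumptions made for the example.

```python
from typing import Callable, Iterable, Iterator, List


def source_adapter(lines: Iterable[str]) -> Iterator[dict]:
    """Source adapter: parse raw "symbol,price" records into tuples (dicts)."""
    for line in lines:
        symbol, price = line.split(",")
        yield {"symbol": symbol, "price": float(price)}


def filter_operator(stream: Iterable[dict], keep: Callable[[dict], bool]) -> Iterator[dict]:
    """Operator: pass through only the tuples for which keep(tuple) is true."""
    return (item for item in stream if keep(item))


def map_operator(stream: Iterable[dict], fn: Callable[[dict], dict]) -> Iterator[dict]:
    """Operator: transform every tuple with fn."""
    return (fn(item) for item in stream)


def sink_adapter(stream: Iterable[dict], out: List[dict]) -> None:
    """Sink adapter: deliver results to an external target (here, a list)."""
    for item in stream:
        out.append(item)


# Compose the application by wiring a source, two operators and a sink.
raw_records = ["IBM,190.5", "XYZ,3.2", "IBM,191.0"]
results: List[dict] = []
pipeline = map_operator(
    filter_operator(source_adapter(raw_records), lambda t: t["symbol"] == "IBM"),
    lambda t: {**t, "price_cents": round(t["price"] * 100)},
)
sink_adapter(pipeline, results)
print(results)
```

Nothing runs until the sink pulls from the pipeline, so tuples flow through all operators one at a time instead of being materialized between stages.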

Streams Studio provides a rich set of Eclipse based tools
Drag and drop simple

Streams provides more throughput than Apache Storm in email analysis benchmark
[Figure: Throughput - Four Nodes. Bar chart of throughput (emails/s, axis 0 to 40000) for Streams and Storm at parallelism (dataset) settings x4 (100%), x8 (100%), x8 (200%) and x8 (400%). Callout: 2.6x-12.3x more throughput for Streams.]
Based on IBM internal tests comparing InfoSphere Streams against Apache Storm. Results may not be typical and will vary based on actual workload, configuration, applications and other variables in a production environment. Users of this document should verify the applicable data for their specific environment. Contact IBM and see what we can do for you.
https://www.ibmdw.net/streamsdev/wp-content/uploads/sites/15/2014/04/Streams-and-Storm-April-2014-Final.pdf

InfoSphere BigInsights includes Apache Hadoop to process data at rest
[Figure: a Hadoop Cluster with Processing and Storage layers. Input is processed by a MapReduce Java Program to produce a Result.]
• Composed of a cluster of inexpensive hardware
• Nodes have processors, memory and disks
• Special file system: Hadoop Distributed File System (HDFS)
• Special programming model: MapReduce

The Hadoop Distributed File System (HDFS) distributes data across a Hadoop cluster
[Figure: inputFile.txt is split into blocks B1, B2 and B3 (B = block). The blocks and their replicas R1, R2 and R3 (R = replica) are spread across Node 1, Node 2, Node 3 … Node n of the Hadoop Distributed File System.]
• A distributed file system that spans all the nodes in a Hadoop cluster
• Files are split automatically at load time into blocks and spread among Data Nodes
• System assumes nodes will fail
• Achieves reliability by replicating data across multiple nodes
• Elastically scalable
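
The split-and-replicate behaviour described above can be illustrated with a toy model. This sketch is not the HDFS implementation: the round-robin placement is a deliberate simplification (real HDFS placement also considers racks and node load), the block size is tiny for readability, and all function names are hypothetical.

```python
from typing import Dict, List


def split_into_blocks(data: bytes, block_size: int) -> List[bytes]:
    """Cut a file into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, nodes: List[str], replication: int) -> Dict[int, List[str]]:
    """Assign each block to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


def readable_after_failure(placement: Dict[int, List[str]], failed: str) -> bool:
    """True if every block still has a copy on some node other than `failed`."""
    return all(any(node != failed for node in copies) for copies in placement.values())


data = b"x" * 10                       # stand-in for inputFile.txt
blocks = split_into_blocks(data, block_size=4)
nodes = ["node1", "node2", "node3", "node4"]
placement = place_replicas(len(blocks), nodes, replication=3)
print(placement)
print(readable_after_failure(placement, "node2"))
```

With three copies of each block on distinct nodes, losing any single node leaves every block readable, which is the reliability property the slide describes.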

The MapReduce framework sends programs out to the data
[Figure: a MapReduce Job is distributed as Map and Reduce Tasks that run on top of HDFS on Node 1, Node 2, Node 3 … Node n.]
• MapReduce job is sent out to each node
• Map and Reduce tasks run in parallel across nodes
• Hadoop framework does a lot of the “heavy lifting”
  • e.g., moving data between map and reduce tasks
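
The map, shuffle and reduce steps can be shown with a single-process word count. This is a sketch of the programming model only; in Hadoop the map and reduce functions are typically written in Java and the framework runs them in parallel across nodes. The function names here are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_task(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit (word, 1) for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group values by key. This is the data movement between
    map and reduce tasks that the framework handles."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_task(word: str, counts: List[int]) -> Tuple[str, int]:
    """Reduce: combine all values for one key into a single result."""
    return word, sum(counts)


def run_job(lines: Iterable[str]) -> Dict[str, int]:
    mapped = (pair for line in lines for pair in map_task(line))
    grouped = shuffle(mapped)
    return dict(reduce_task(word, counts) for word, counts in grouped.items())


counts = run_job(["big data", "Big Hadoop data"])
print(counts)
```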

BigInsights makes it easy for all Big Data roles
• Developer Role
  • Eclipse based tooling
  • Read/write access to HDFS
  • Extensive views of jobs and workflows in system
  • Application staging, launch and scheduling center
  • Many built in accelerators
• Administrator Role
  • Complete management of cluster
    − Monitor/start/stop components
    − Add/remove nodes
  • Portal style dashboards
• Business User Role
  • No Java required
  • Spreadsheet tooling
  • Visualization
[Figure: InfoSphere BigInsights Console]

Service Oriented Finance wants to analyze customer complaints
Service Oriented Finance CMO: “We need to know what our customers are complaining about.”
IBM: “We can help you do that with sentiment analysis using BigInsights.”

Sentiment Analysis - A Big Data challenge but also a Big Data opportunity
Finding sentiment from social media data
Huge volumes of unstructured data
Trying to determine…
• Feelings - Attitudes
• Emotions - Opinions
• Thoughts - Desires
Areas shown on the slide:
• Product demand
• New product acceptance
• Competitive threats
• Threats to brand reputations
• Advertisement targets

DEMO: Using BigInsights to Analyze negative sentiment on Twitter
Topic: Service Oriented Finance
Data Source: Twitter
Sample tweets:
• “The service reps are very nice and helpful”
• “I love the check guard feature!”
• “I don’t trust the web site for on-line banking”
• “The ATM fees are ridiculous!”
Likes
• Love the check guard feature
• Like the on-line bill pay feature
• Like that the ATMs are located all over the city
• Like the service representatives
Dislikes
• Don’t trust the on-line banking feature
• Don’t like to wait in line for a long time
• Don’t like the ATM fees
• Hate the overdraft fees
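
To make the like/dislike split concrete, here is a very small lexicon-based classifier applied to the four sample tweets from the slide. It is a sketch for illustration and not how BigInsights text analytics works; the word lists, the negation rule and the `classify` function are all assumptions chosen to fit these examples.

```python
import re

# Hypothetical, hand-picked word lists; a real lexicon would be far larger.
POSITIVE = {"love", "like", "nice", "helpful", "trust"}
NEGATIVE = {"hate", "ridiculous"}
NEGATORS = {"don't", "not", "never"}


def classify(text: str) -> str:
    """Label text as "like", "dislike" or "neutral" by counting sentiment words.

    A negator flips the polarity of the next sentiment word that follows it.
    """
    normalized = text.lower().replace("\u2019", "'")  # curly to straight apostrophe
    tokens = re.findall(r"[a-z']+", normalized)
    score = 0
    negate = False
    for token in tokens:
        if token in NEGATORS:
            negate = True
        elif token in POSITIVE or token in NEGATIVE:
            polarity = 1 if token in POSITIVE else -1
            score += -polarity if negate else polarity
            negate = False
    if score > 0:
        return "like"
    if score < 0:
        return "dislike"
    return "neutral"


tweets = [
    "The service reps are very nice and helpful",
    "I love the check guard feature!",
    "I don't trust the web site for on-line banking",
    "The ATM fees are ridiculous!",
]
labels = [classify(tweet) for tweet in tweets]
print(list(zip(tweets, labels)))
```

A word-count approach like this misses sarcasm and context, which is part of why the slide frames sentiment analysis as a Big Data challenge.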

Architecture matters when you design a microprocessor for emerging big workloads
It’s not about the number of transistors, it is what you do with them to handle Big Workloads
POWER7 to POWER8
- 1.2 Billion Transistors to 4.2 Billion Transistors
- 45nm to 22nm
- 567 mm² to 650 mm²
Westmere EX to Ivy Bridge EX
- 2.6 Billion Transistors to 4.3 Billion Transistors
- 32nm to 22nm
- 513 mm² to 541 mm²
POWER8 vs. Ivy Bridge EX
- 96 threads/socket vs. 30
- 4x Memory Bandwidth
- 3x on-die Cache
- Cache latency reduced by 50%
- 5x I/O Bandwidth
- 15 metal layers vs. 9
- eDRAM vs SRAM
POWER8 Unique Technology
- CAPI Technology
- Integrated PCIe
- Transactional Memory
- L4 Cache
- Dynamic Overclocking
CAPI: Coherent Accelerator Processor Interface

Is Ivy Bridge a new breakthrough architecture?
• The Ivy Bridge CPU micro-architecture is a “shrink” from Sandy Bridge
• Ivy Bridge is a “tick” in the “tick/tock” Intel release cycle compared to Sandy Bridge
  • A “tick” means same architecture as previous with some minor improvements
• The major improvement is the 22 nm Tri-gate transistor technology from the prior 32 nm technology
  • More transistors
  • More cores, sockets
  • 20% more L3 cache
Does the 22nm technology result in better performance?
nm = nanometer, the unit used to describe a CMOS semiconductor device fabrication process. A smaller nm process allows for more transistors.
Source for Tick Tock Model: http://www.intel.com/content/www/us/en/silicon-innovations/intel-tick-tock-model-general.html
http://www.techradar.com/us/news/computing-components/processors/intel-ivy-bridge-what-you-need-to-know-1077240

Intel’s performance per Core is not increasing over previous generation
[Figure: bar chart of RPE2 per Core (axis 0 to 2500) for 2 Socket HP Servers]
• Sandy Bridge EP, 2.9 GHz, 16 cores: 2283
• Ivy Bridge EP, 2.7 GHz, 24 cores: 2069
• Ivy Bridge EX, 2.8 GHz, 30 cores: 2049
The number shown is best in each category (sockets and number of cores).
RPE2 numbers are derived from the following six benchmark inputs: SAP SD Two-Tier, TPC-C, TPC-H, SPECjbb2006 and two SPEC CPU2006 components.
Source of RPE2: (Gartner) http://www.gartner.com/technology/research/RPE2-methodology-details.jsp
The data in this tool is derived from RPE2 from Ideas International. Ideas International was acquired by Gartner, Inc. in 2012. © 2014 Gartner, Inc. and/or its affiliates. All rights reserved.
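
The per-core comparison can be expressed as a percentage change against the older generation. The sketch below assumes the three chart values map to the processors in the order they are listed (2283 for Sandy Bridge EP, 2069 for Ivy Bridge EP, 2049 for Ivy Bridge EX); that mapping is read off the flattened chart and is an assumption, as are the variable names.

```python
# RPE2 per core, assuming chart values pair with processors in listed order.
rpe2_per_core = {
    "Sandy Bridge EP": 2283,
    "Ivy Bridge EP": 2069,
    "Ivy Bridge EX": 2049,
}

baseline = rpe2_per_core["Sandy Bridge EP"]

# Percentage change in per-core RPE2 relative to Sandy Bridge EP.
change_pct = {
    name: round((value - baseline) / baseline * 100, 1)
    for name, value in rpe2_per_core.items()
}
print(change_pct)
```

Under that assumption, both Ivy Bridge parts come out roughly 9 to 10 percent lower per core than the Sandy Bridge EP result.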

The new POWER8 scale-out servers – innovation to put data to work
• POWER8 roll-out is leading with scale-out (1 and 2 Socket) systems
• Expanded Linux focus: Ubuntu, KVM, and OpenStack
• OpenPOWER Innovations
1 & 2 Socket Power Systems (Scale-out)
• S812L: 1-socket, 2U; up to 12 cores; 512 GB memory; 6 PCIe Gen 3; Linux only; PowerVM & PowerKVM; August 29, 2014
• S822L: 2-socket, 2U; up to 24 cores; 1 TB memory; 9 PCIe Gen 3; Linux only; PowerVM & PowerKVM; June 10, 2014
• S822: 2-socket, 2U; up to 20 cores; 1 TB memory; 9 PCIe Gen 3; AIX & Linux; PowerVM; June 10, 2014
• S814: 1-socket, 4U; up to 8 cores; 512 GB memory; 7 PCIe Gen 3; AIX, IBM i, Linux; PowerVM; June 10, 2014
• S824L: 2-socket, 4U; up to 24 cores; Linux only; NVIDIA GPU; 2H 2014
• S824: 2-socket, 4U; up to 24 cores; 1 TB memory; 11 PCIe Gen 3; AIX, IBM i, Linux; PowerVM; June 10, 2014

Power S822L servers are priced competitively to Intel Ivy Bridge servers
Comparable TCA: Linux on Intel Ivy Bridge + KVM vs. Linux on POWER8 + KVM
Servers compared: Dell PowerEdge R720, HP ProLiant DL380p G8, IBM Power S822L
• Server list price* (3-year warranty, on-site): Dell $12,605; HP $14,068; IBM $14,895
• Virtualization (2 sockets, 3 yr. 9x5 sub./supp.): Dell $2,998, KVM for Red Hat on x86 (RHEV); HP $2,998, KVM for Red Hat on x86 (RHEV); IBM $2,998, KVM for Linux on Power (PowerKVM)
• Linux OS list price (RHEL, 2 sockets, unlimited guests, 9x5, 3 yr. sub./supp.): Dell $5,697, Red Hat subscription and Red Hat support; HP $5,697, Red Hat subscription and Red Hat support; IBM $4,489, Red Hat subscription and IBM support
• Total list price (total cost of acquisition): Dell $21,300; HP $22,763; IBM $22,382
• Processor / cores: Dell and HP, two 2.7 GHz E5-2697, Ivy Bridge, 12-core processors; IBM, two 3.4 GHz POWER8, 10-core
• Configuration: Dell and HP, 64 GB memory, 2 x 300GB 15k HDD, 10 Gb two port; IBM, same memory, HDD, NIC
* Based on US pricing for Power S822L announcing on April 28, 2014 matching configuration table above. Source: hp.com, dell.com, vmware.com
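
The total cost of acquisition on this slide is the sum of three list-price components per server. The short check below uses the figures from the slide; the dictionary layout and key names are only an illustration.

```python
# List prices in US dollars, taken from the comparison above.
quotes = {
    "Dell PowerEdge R720": {"server": 12605, "virtualization": 2998, "linux_os": 5697},
    "HP ProLiant DL380p G8": {"server": 14068, "virtualization": 2998, "linux_os": 5697},
    "IBM Power S822L": {"server": 14895, "virtualization": 2998, "linux_os": 4489},
}

# Total cost of acquisition = server + virtualization + Linux OS subscription.
totals = {name: sum(parts.values()) for name, parts in quotes.items()}

for name, total in totals.items():
    print(f"{name}: ${total:,}")
```

The computed totals match the slide: the S822L has the highest server price but a lower Linux OS price, which places its total between the two x86 servers.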

POWER architecture is ideal for Hadoop and Streams workloads
Multi-Threading
• Twice the thread capability to use to run parallel Java workloads
• More threads enable faster BLU Acceleration Single-Instruction, Multiple Data (SIMD) vector processing
• More threads enable faster Hadoop workload processing
High I/O Performance
• High memory and I/O bandwidth enable faster Cognos ad hoc queries and better SPSS scoring throughput
• High I/O performance enables faster Cognos ad hoc queries and better SPSS scoring throughput
• Higher speed that significantly outperforms competitive x86 Hadoop configurations
Processor and Cache
• Large processor cache per core, large numbers of cores and high DRAM capacity on a single server benefit Cognos Cubes performance
• Extensive Soft-Error Recovery, self-healing for solid faults, and alternate processor recovery are ideal for Hadoop-based workloads
Java Optimization
• IBM’s Java Virtual Machine is specifically optimized for the POWER architecture to deliver optimal performance of big data and analytics solutions

BigInsights on POWER beats the competition with TeraSort benchmark
[Figure: bar chart of Normalized Performance in GB / Core / Min (axis 0 to 1.4). Competitor bars: 0.73, 0.75 and 0.67. POWER7+ with BigInsights: 1.04. POWER8 with BigInsights: 1.27. Callout: 1.7X.]
Configurations:
• Competitor, Xeon E5-2697v2: 8 nodes, 192 cores, 1 TB
• Competitor, Xeon E5-2630v2: 24 nodes, 288 cores, 10 TB
• Competitor, Xeon E5-2665: 16 nodes, 256 cores, 10 TB
• POWER7+ 7R2 with BigInsights: 10 nodes, 120 cores, 1 TB
• POWER8 S822L with BigInsights: 8 nodes, 192 cores, 10 TB
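
The chart normalizes results as gigabytes sorted per core per minute, so clusters of different sizes can be compared. The sketch below shows that metric and one way to arrive at the 1.7X callout; treating 1.7X as POWER8 (1.27) against the best competitor bar (0.75) is an assumption, and the function name and the sample inputs are hypothetical.

```python
def gb_per_core_per_min(data_gb: float, cores: int, minutes: float) -> float:
    """Normalized sort throughput: data volume divided by cores and by runtime."""
    return data_gb / cores / minutes


# Hypothetical run: 1000 GB sorted on 100 cores in 10 minutes.
example = gb_per_core_per_min(1000, 100, 10)

# Values read from the chart. Which competitor the callout refers to is assumed.
power8 = 1.27
best_competitor = 0.75
speedup = round(power8 / best_competitor, 1)

print(example, speedup)
```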

Big SQL V3.0: Bringing SQL on Hadoop to the next level
• Massively parallel SQL engine on Hadoop
• Architected from the ground up for low latency and high throughput
• Comprehensive SQL support
• The same SQL you use on your data warehouse should run with few or no modifications
• Full support for sub-queries
• All standard join operations
• Stored procedures / User defined functions
• Supports all modern file formats
[Figure: a SQL Based Application calls the Big SQL Engine (SQL MPP Run-time), which reads data in HDFS stored as CSV, Seq, Parquet, RC, Avro, ORC, JSON or Custom formats.]

Big SQL on POWER8 delivers Big Data queries faster than Hive on Ivy Bridge
“TPC-DS inspired” workload queries
BigInsights v3.0 on 8 S822L data nodes (S822L, 3.3 GHz, 24 cores, 256 GB Memory, RHEL 6.5)
[Figure: time to complete queries in 7 concurrent streams. Bars labeled 8 hours and 40 min. Callouts: 11.2x Faster, 4.7x Lower Cost.]

Hadoop implementation pain points
• Building clusters is very complex, time consuming and can become costly
  • Difficult to plan, architect and build out a cluster of nodes and network components
  • Difficult to manage the cluster when deployed
  • Inability to share resources results in “cluster sprawl”
• x86 deployments are not flexible enough
  • Built with a fixed processor to disk ratio
  • When you add more machines you get both processors and hard drives
  • Some Hadoop workloads require more of one or the other but not always both at the same time
• HDFS NameNode is a single point of failure
• HDFS is not POSIX compliant and wastes storage
  • Requires use of non standard commands and utilities
  • Copies of data may need to be stored twice
    − Once in Linux file system and once in HDFS
• Hadoop workload management is very basic
  • Does not utilize cluster resources to their maximum
  • Does not support multi-tenancy
  • Can lead to silos of under utilized clusters

Introducing the IBM Solution for Hadoop – Power Systems Edition
• Best-in-class hardware
  ► IBM POWER Systems
  ► IBM DCS3700 Storage
• Best-in-class software
  • IBM BigInsights
  • IBM Platform Computing
    − IBM Platform Symphony
    − IBM Platform Cluster Manager
  • IBM Elastic Storage
    − Formerly GPFS
[Figure: solution stack, top to bottom: IBM InfoSphere BigInsights or Open-source Hadoop; IBM Platform Symphony; IBM Platform Cluster Manager; Distributed File System (IBM Elastic Storage, HDFS); Linux Operating Environment (RHEL, SUSE); IBM Power Systems (IBM Power 7+, Power8)]

Platform Symphony provides enterprise class Hadoop cluster management
• Replaces the open source MapReduce scheduler of Hadoop
  • Existing Hadoop applications run unchanged
• Better performance and utilization in cluster overall and within individual jobs
  • Scheduling is more efficient
  • More efficient data transfer mechanisms for shuffle phase of MapReduce
  • Uses generic slots rather than fixed map and reduce slots which waste resources
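
Why fixed map and reduce slots can waste resources is easy to see in a toy model. This sketch is not Platform Symphony's scheduler; it only counts how many pending tasks could run at once under each slot scheme, and the function names and the example numbers are hypothetical.

```python
def running_with_fixed_slots(map_tasks: int, reduce_tasks: int,
                             map_slots: int, reduce_slots: int) -> int:
    """Tasks that can run when each slot only accepts one task type."""
    return min(map_tasks, map_slots) + min(reduce_tasks, reduce_slots)


def running_with_generic_slots(map_tasks: int, reduce_tasks: int,
                               total_slots: int) -> int:
    """Tasks that can run when any slot accepts any task type."""
    return min(map_tasks + reduce_tasks, total_slots)


# A cluster with 10 slots, split 6 map / 4 reduce in the fixed scheme,
# during a map-heavy phase with 10 map tasks and no reduce tasks pending.
fixed = running_with_fixed_slots(map_tasks=10, reduce_tasks=0,
                                 map_slots=6, reduce_slots=4)
generic = running_with_generic_slots(map_tasks=10, reduce_tasks=0,
                                     total_slots=10)
print(fixed, generic)
```

In this example the fixed scheme leaves the four reduce slots idle while map tasks wait, whereas generic slots keep all ten busy.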

Platform Symphony provides enterprise class Hadoop cluster management (cont.)
• Multi-tenancy is built in; the system is designed for concurrency
  • Reduces costs through better resource sharing and utilization
  • Provides dynamic priority based concurrency
    − Interactive workloads like BigSheets can be given higher priority (e.g., during the day)
    − Priorities can be adjusted in real time
  • Preemptive scheduling policies ensure service levels are met while also allowing the cluster to be fully utilized
    − Jobs entering the system are immediately given the priority they are guaranteed
    − Applications can decide if they allow other applications to use their unused capacity
  • GUI provided to configure, manage and monitor policies and workload
• Allows running multiple versions of Hadoop
  • Helps control costs and provides flexibility during upgrades

Platform Cluster Manager removes the complexity of standing up clusters
• Provides comprehensive set of tools to install, provision and configure cluster management environment
  • GUI, automated installation scripts and guidance
• Provides flexible provisioning options
  • Bare-metal (i.e., no virtualization)
  • KVM (virtualized cluster environment)
• Provides GUI for “end users” to provision, deploy, manage and monitor clusters
  • Improves administrator and/or user productivity
  • Deploy clusters easier and faster
  • Saves on administration costs
  • Lowers risk of deploying Hadoop clusters

IBM Elastic Storage (GPFS) brings enterprise class file system capabilities to Hadoop
• No single point of failure, unlike the HDFS NameNode
  • Metadata is distributed
• GPFS is POSIX compliant
  • Any Linux command/utility can be run against data, not just Hadoop commands
    − Easier to use and manage
  • Makes archival and backup easier
  • Other workload types can access data in cluster
  • Saves storage space and administrator work
    − Avoids moving data around to process inside and outside of HDFS
    − Avoids having duplicate data inside and outside of HDFS
• Allows variable block sizes to exist on same cluster to meet the needs of different applications
[Figure: Native OS Applications (POSIX)]

IBM Solution for Hadoop – Power Systems Edition key architecture components

PowerLinux Data Node
IBM PowerLinux 7R2 (P7+ based, P8 coming in 4Q2014)
2 sockets Power7 3.55 GHz CPU
Data: 29 x 900 GB SAS HDDs, JBOD I/O Exp
OS: 1 x 300 GB SAS HDD
128 GB DDR3 RDIMMs

PowerLinux Management Node (JobTracker, NameNode, Console)
IBM PowerLinux 7R2 (P7+ based, P8 coming in 4Q2014)
2 sockets Power7 3.55 GHz CPU
OS: 6 x 600 GB SAS HDD, mirrored
128 GB DDR3 RDIMMs

DCS3700
IBM DCS3700 Storage f/c 1818-80c
• A maximum of 60 x 3.5" HDD, 4 TB each
• Standard controller (3000) validated

1GbE, 10GbE Switches
1GbE: IBM RackSwitch G8052
− 48 × 1 GbE RJ45 ports, four 10 GbE SFP+ ports
− Low 130 W power rating and variable speed fans to reduce power consumption
10GbE: IBM RackSwitch G8264
− Optimized for applications requiring high bandwidth and low latency
− Up to 64 1 Gb/10 Gb SFP+ ports, four 40 Gb QSFP+ ports, 1.28 Tbps non-blocking throughput
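
The drive counts above translate into raw capacity with simple arithmetic. The sketch below uses decimal units and the drive figures from this slide; the replication factor of 3 is an assumption (the HDFS default) included only to show how replication reduces usable space, and the variable names are hypothetical.

```python
# Raw capacity of one PowerLinux data node: 29 data drives of 900 GB each.
data_node_raw_gb = 29 * 900

# Raw capacity of a fully populated DCS3700: 60 drives of 4 TB each.
dcs3700_raw_tb = 60 * 4

# Assumption: data stored with 3 copies of every block (HDFS default).
replication = 3
data_node_usable_gb = data_node_raw_gb // replication

print(f"Data node raw capacity: {data_node_raw_gb} GB")
print(f"DCS3700 raw capacity: {dcs3700_raw_tb} TB")
print(f"Data node usable at {replication}x replication: {data_node_usable_gb} GB")
```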

Learn more about Big Data Analytics Solutions on Power
IBM Analytics Solutions for Power Systems: http://www-03.ibm.com/systems/power/solutions/bigdata-analytics/index.html
The PowerLinux Community (DeveloperWorks): www.ibm.com/developerworks/group/tpl/
@thinkpowerlinux
