
Big Data and Real Time Analytics –

Streams and Hadoop

Infrastructure Matters – 2014 Briefing

© 2014 IBM Corporation


Big Data is more than just Hadoop

Service Oriented Finance CMO: “What can you tell me about Big Data? I want to know all about Hadoop.”

IBM: “Big Data is a lot more than Hadoop! And our competitors don’t understand this. They cannot deliver value on an entire set of Big Data use cases.”

IBM Big Data Solutions

[Diagram: IBM Big Data & Analytics reference architecture. All Data flows through the zones below into New/Enhanced Applications.]

Zones:
• Information ingestion and operational information zone
• Real-time analytics zone
• Exploration, landing and archive zone
• Enterprise warehouse, data mart and analytic appliances zone
• Information governance zone

Products shown: InfoSphere Streams, InfoSphere Discovery, BigInsights and Hadoop, DB2 BLU, PureData System for Analytics, Cognos, SPSS, InfoSphere Server and DataStage

Questions answered:
• What is happening? Discovery and exploration
• Why did it happen? Reporting, analysis, content analytics
• What could happen? Predictive analytics and modeling
• What action should I take? Decision management
• Cognitive Fabric

Foundation: Systems, Security, Storage
Delivery: On premise, Cloud, As a service

IBM Big Data & Analytics Infrastructure: Analyze all data, from any source, with the right technology

There are two main types of Big Data

Data in motion (Real-time Analytics Zone, Stream Computing)
• Data typically not stored
• Tremendous velocity
• Ultra low latency required
• Multiple data sources
• Huge volumes of unstructured data

Data at rest (Landing and Analytics Zone, Hadoop System)
• Data stored on disk
• Huge volumes of unstructured data
• No pre-defined schemas
• Too large for traditional tools to process in a timely manner

Our competitors do not address both of these!


New programming models and low cost hardware solve Big Data problems

• Streaming and Apache Hadoop applications
  • Proven frameworks to process large amounts of data
  • Streaming for data in motion, Hadoop for data at rest
  • Enable applications to transparently work with large clusters of nodes in parallel

Clusters of low cost POWER8 servers are ideal for Hadoop and streaming applications

[Diagram: a Streaming Cluster running a Streaming Application, and a Hadoop Cluster]


Service Oriented Finance wants to gain a competitive advantage from Big Data

Service Oriented Finance wants to deploy a stock trading application with the following requirements:
• Process millions of trades per second
• Application must scale
• Constant flow of input data
• Microsecond latency
• Unstructured trade data input
• Sophisticated analytics logic

Service Oriented Finance Market Manager: “This application will give our market managers a real advantage!”


InfoSphere Streams is a platform for Real-time Analytics on Big Data in motion

InfoSphere Streams can meet these requirements!
• Streams is a platform for real-time analytics on Big Data
• Our competitors do not have this capability

[Diagram: sensor, video, audio, text and relational data sources feed powerful analytics that deliver just in time decisions, at millions of events per second with microsecond latency]
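As a rough illustration of analytics on data in motion, the sketch below computes a moving average over a trade feed as each trade arrives, keeping only a small window in memory instead of storing the feed. It is plain Python, not InfoSphere Streams code; the `Trade` layout, the window size and the function name are hypothetical and chosen only for this example.

```python
from collections import deque
from typing import Deque, Dict, Iterable, Iterator, NamedTuple, Tuple


class Trade(NamedTuple):
    """Hypothetical trade tuple; real feeds carry many more fields."""
    symbol: str
    price: float


def moving_average(trades: Iterable[Trade], window: int) -> Iterator[Tuple[str, float]]:
    """Yield (symbol, average of the last `window` prices) as each trade arrives."""
    recent: Dict[str, Deque[float]] = {}
    for trade in trades:
        prices = recent.setdefault(trade.symbol, deque(maxlen=window))
        prices.append(trade.price)  # the oldest price drops out automatically
        yield trade.symbol, sum(prices) / len(prices)


# A finite list stands in for a constant flow of input data.
feed = [Trade("IBM", 10.0), Trade("IBM", 20.0), Trade("IBM", 30.0), Trade("IBM", 40.0)]
results = list(moving_average(feed, window=3))
```

Because the function is a generator, each result is available as soon as its trade is seen, and memory use depends on the window size rather than on the length of the feed.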


Streams makes data in motion programming easy

• Developer Role
  • Eclipse based tools
  • Visual application monitoring
  • Built in accelerators
• Administrator Role
  • Visual application monitoring
  • Stream data visualization
  • Start/stop jobs
• Business User Role
  • Visual application monitoring
  • Stream data visualization

[Screenshot: InfoSphere Streams Console]


Streams programming is drag and drop simple

[Diagram: Source Adapters, operators from the Operator Repository, and Sink Adapters are combined by Application composition (Optimized Compilation)]
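The composition idea in the diagram (a source adapter, a chain of operators, a sink adapter) can be sketched with Python generators. This is only an analogy under stated assumptions: it is not Streams code, and every name here (`source`, `filter_op`, `map_op`, `compose`, `sink`) is hypothetical.

```python
from typing import Callable, Iterable, Iterator, List


def source(lines: Iterable[str]) -> Iterator[dict]:
    """Source adapter: turn raw 'symbol,price' text into tuples (dicts here)."""
    for line in lines:
        symbol, price = line.split(",")
        yield {"symbol": symbol, "price": float(price)}


def filter_op(predicate: Callable[[dict], bool]):
    """Operator: pass through only the tuples that satisfy the predicate."""
    def run(stream: Iterator[dict]) -> Iterator[dict]:
        return (t for t in stream if predicate(t))
    return run


def map_op(fn: Callable[[dict], dict]):
    """Operator: transform each tuple."""
    def run(stream: Iterator[dict]) -> Iterator[dict]:
        return (fn(t) for t in stream)
    return run


def compose(stream: Iterator[dict], *operators) -> Iterator[dict]:
    """Application composition: chain the operators in order."""
    for op in operators:
        stream = op(stream)
    return stream


def sink(stream: Iterator[dict], out: List[str]) -> None:
    """Sink adapter: write each result somewhere (a list in this sketch)."""
    for t in stream:
        out.append(f"{t['symbol']}:{t['cents']}")


raw = ["IBM,182.5", "XYZ,3.1", "IBM,183.0"]
output: List[str] = []
flow = compose(
    source(raw),
    filter_op(lambda t: t["symbol"] == "IBM"),
    map_op(lambda t: {**t, "cents": int(round(t["price"] * 100))}),
)
sink(flow, output)
```

Each stage only knows about the stream it receives, so operators can be reordered or swapped without changing the source or the sink.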


Streams Studio provides a rich set of Eclipse based tools

Drag and drop simple


Streams provides more throughput than Apache Storm in email analysis benchmark

[Chart: Throughput - Four Nodes. Y axis: Throughput (emails/s), 0 to 40,000. X axis: Parallelism (dataset): x4 (100%), x8 (100%), x8 (200%), x8 (400%). Series: Streams, Storm. Callout: 2.6x-12.3x more Streams throughput.]

Based on IBM internal tests comparing InfoSphere Streams against Apache Storm. Results may not be typical and will vary based on actual workload, configuration, applications and other variables in a production environment. Users of this document should verify the applicable data for their specific environment. Contact IBM and see what we can do for you.
https://www.ibmdw.net/streamsdev/wp-content/uploads/sites/15/2014/04/Streams-and-Storm-April-2014-Final.pdf

InfoSphere BigInsights includes Apache Hadoop to process data at rest

[Diagram: a Hadoop Cluster with Processing and Storage layers; Input flows through a MapReduce Java Program to produce a Result]

• Consists of a cluster of inexpensive hardware
  • Nodes have processors, memory and disks
• Special file system: Hadoop Distributed File System (HDFS)
• Special programming model: MapReduce


The Hadoop Distributed File System (HDFS) distributes data across a Hadoop cluster

[Diagram: inputFile.txt is split into blocks B1, B2 and B3 (B = block). The blocks and their replicas R1, R2 and R3 (R = replica) are spread across Node 1, Node 2, Node 3 … Node n of the Hadoop Distributed File System.]

• A distributed file system that spans all the nodes in a Hadoop cluster
• Files are split automatically at load time into blocks and spread among Data Nodes
• System assumes nodes will fail
• Achieves reliability by replicating data across multiple nodes
• Elastically scalable

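The block and replica behavior described above can be sketched in a few lines of Python. This is a simplified model, not HDFS code: the round-robin placement is an assumption made for illustration (real HDFS placement also takes racks and node load into account), and the file size, block size, node names and function names are hypothetical.

```python
from typing import Dict, List


def split_into_blocks(file_size: int, block_size: int) -> List[int]:
    """Return the size in bytes of each block for a file of `file_size` bytes."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_replicas(num_blocks: int, nodes: List[str], replication: int) -> Dict[str, List[str]]:
    """Map each block to `replication` distinct nodes, round-robin (a simplification)."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement: Dict[str, List[str]] = {}
    for b in range(num_blocks):
        placement[f"B{b + 1}"] = [nodes[(b + r) % len(nodes)] for r in range(replication)]
    return placement


def readable_after_failure(placement: Dict[str, List[str]], failed_node: str) -> bool:
    """True if every block still has at least one copy on a surviving node."""
    return all(any(n != failed_node for n in holders) for holders in placement.values())


MB = 1024 * 1024
blocks = split_into_blocks(300 * MB, 128 * MB)  # sizes assumed for illustration
placement = place_replicas(len(blocks), ["node1", "node2", "node3", "node4"], replication=3)
survives = readable_after_failure(placement, "node3")
```

With three copies of every block on different nodes, losing any single node leaves the whole file readable, which is the sense in which the system assumes nodes will fail.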
The MapReduce framework sends programs out to the data

[Diagram: a MapReduce Job is distributed to Node 1, Node 2, Node 3 … Node n; each node runs Map and Reduce Tasks on top of its local HDFS storage]

• MapReduce job is sent out to each node
• Map and Reduce tasks run in parallel across nodes
• Hadoop framework does a lot of the “heavy lifting”
  • e.g., moving data between map and reduce tasks
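A minimal single-process sketch of the MapReduce programming model follows, using the classic word count. It imitates the three steps (map, shuffle, reduce) in plain Python and does not use the Hadoop API; in a real cluster the framework runs the map and reduce functions on many nodes and performs the shuffle itself. The function names and input splits are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit (word, 1) for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all values by key (the framework's 'heavy lifting')."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: combine all values seen for one key."""
    return key, sum(values)


def run_job(splits: List[List[str]]) -> Dict[str, int]:
    """Run map over every split, then shuffle, then reduce."""
    # Each split stands for the blocks stored on one node.
    mapped = [pair for split in splits for line in split for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


splits = [["big data", "data at rest"], ["data in motion"]]
counts = run_job(splits)
```

The developer writes only `map_phase` and `reduce_phase`; everything in `shuffle` and `run_job` corresponds to work the framework takes over.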


BigInsights makes it easy for all Big Data roles

• Developer Role
  • Eclipse based tooling
  • Read/write access to HDFS
  • Extensive views of jobs and workflows in system
  • Application staging, launch and scheduling center
  • Many built in accelerators

• Administrator Role
  • Complete management of cluster
    − Monitor/start/stop components
    − Add/remove nodes
  • Portal style dashboards

• Business User Role
  • No Java required
  • Spreadsheet tooling
  • Visualization

[Screenshot: InfoSphere BigInsights Console]


Service Oriented Finance wants to analyze customer complaints

Service Oriented Finance CMO: “We need to know what our customers are complaining about.”

IBM: “We can help you do that with sentiment analysis using BigInsights.”

Sentiment Analysis - A Big Data challenge but also a Big Data opportunity

Finding sentiment from social media data (huge volumes of unstructured data):
• Feelings - Attitudes
• Emotions - Opinions
• Thoughts - Desires

Trying to determine…
• Product demand
• New product acceptance
• Competitive threats
• Threats to brand reputations
• Advertisement targets


DEMO: Using BigInsights to Analyze negative sentiment on Twitter

Topic: Service Oriented Finance
Data Source: Twitter

Sample tweets:
• “The service reps are very nice and helpful”
• “I love the check guard feature!”
• “I don’t trust the web site for on-line banking”
• “The ATM fees are ridiculous!”

Likes
• Love the check guard feature
• Like the on-line bill pay feature
• Like that the ATMs are located all over the city
• Like the service representatives

Dislikes
• Don’t trust the on-line banking feature
• Don’t like to wait in line for a long time
• Don’t like the ATM fees
• Hate the overdraft fees
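To make the idea of sorting tweets into likes and dislikes concrete, here is a deliberately tiny lexicon-based sketch in Python. It is not how BigInsights performs text analytics; the word lists, the negation rule and the function name are all hypothetical and far too small for real use.

```python
import re
from typing import List

# Hand-made lexicon, assumed for illustration only.
POSITIVE = {"love", "like", "nice", "helpful", "trust"}
NEGATIVE = {"hate", "ridiculous"}
NEGATORS = {"don't", "not", "never"}


def score(text: str) -> int:
    """Sum word polarities; a negator flips the next sentiment word."""
    total = 0
    negate = False
    for token in re.findall(r"[a-z']+", text.lower()):
        if token in NEGATORS:
            negate = True
            continue
        polarity = (token in POSITIVE) - (token in NEGATIVE)  # +1, -1 or 0
        if polarity:
            total += -polarity if negate else polarity
            negate = False
    return total


tweets: List[str] = [
    "The service reps are very nice and helpful",
    "I love the check guard feature!",
    "I don't trust the web site for on-line banking",
    "The ATM fees are ridiculous!",
]
negative = [t for t in tweets if score(t) < 0]
```

In this toy model the last two tweets land in the negative list. Real social media text needs far richer handling (sarcasm, misspellings, context), which is why the slides describe sentiment analysis as a Big Data challenge.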


Architecture matters when you design a microprocessor for emerging big workloads
It’s not about the number of transistors, it is what you do with them to handle Big Workloads

POWER7 to POWER8
- 1.2 Billion Transistors to 4.2 Billion Transistors
- 45nm to 22nm
- 567 mm² to 650 mm²

Westmere EX to Ivy Bridge EX
- 2.6 Billion Transistors to 4.3 Billion Transistors
- 32nm to 22nm
- 513 mm² to 541 mm²

POWER8 vs. Ivy Bridge EX
- 96 threads/socket vs. 30
- 4x Memory Bandwidth
- 3x on-die Cache
- Cache latency reduced by 50%
- 5x I/O Bandwidth
- 15 metal layers vs. 9
- eDRAM vs SRAM

POWER8 Unique Technology
- CAPI Technology
- Integrated PCIe
- Transactional Memory
- L4 Cache
- Dynamic Overclocking

CAPI: Coherent Accelerator Processor Interface

Is Ivy Bridge a new breakthrough architecture?
• The Ivy Bridge CPU micro-architecture is a “shrink” from Sandy Bridge
• Ivy Bridge is a “tick” in the “tick/tock” Intel release cycle compared to Sandy Bridge
  • A “tick” means the same architecture as the previous generation with some minor improvements
  • The major improvement is the move to 22 nm Tri-gate transistor technology from the prior 32 nm technology
    − More transistors
    − More cores, sockets
    − 20% more L3 cache

Does the 22nm technology result in better performance?

nm = nanometer, the unit used to describe the CMOS semiconductor device fabrication process. A smaller nm process allows for more transistors.

Source for Tick Tock Model: http://www.intel.com/content/www/us/en/silicon-innovations/intel-tick-tock-model-general.html
http://www.techradar.com/us/news/computing-components/processors/intel-ivy-bridge-what-you-need-to-know-1077240

Intel’s performance per Core is not increasing over previous generation

[Chart: RPE2 per Core, 2 Socket HP Servers. Y axis: RPE2 per Core, 0 to 2500. Bars: Sandy Bridge EP (2.9 GHz, 16 cores), Ivy Bridge EP (2.7 GHz, 24 cores), Ivy Bridge EX (2.8 GHz, 30 cores). Values shown: 2283, 2069, 2049.]

The number shown is best in each category (sockets and number of cores).

RPE2 numbers are derived from the following six benchmark inputs: SAP SD Two-Tier, TPC-C, TPC-H, SPECjbb2006 and two SPEC CPU2006 components.

Source of RPE2: (Gartner) http://www.gartner.com/technology/research/RPE2-methodology-details.jsp

The data in this tool is derived from RPE2 from Ideas International. Ideas International was acquired by Gartner, Inc. in 2012. © 2014 Gartner, Inc. and/or its affiliates. All rights reserved.


The new POWER8 scale-out servers – innovation to put data to work

• POWER8 roll-out is leading with scale-out (1 and 2 Socket) systems
• Expanded Linux focus: Ubuntu, KVM, and OpenStack
• OpenPOWER Innovations

1 & 2 Socket Power Systems (Scale-out)

S812L
• 1-socket, 2U
• Up to 12 cores
• 512 GB memory
• 6 PCIe Gen 3
• Linux only
• PowerVM & PowerKVM
• August 29, 2014

S822L
• 2-socket, 2U
• Up to 24 cores
• 1 TB memory
• 9 PCIe Gen 3
• Linux only
• PowerVM & PowerKVM
• June 10, 2014

S822
• 2-socket, 2U
• Up to 20 cores
• 1 TB memory
• 9 PCIe Gen 3
• AIX & Linux
• PowerVM
• June 10, 2014

S814
• 1-socket, 4U
• Up to 8 cores
• 512 GB memory
• 7 PCIe Gen 3
• AIX, IBM i, Linux
• PowerVM
• June 10, 2014

S824L
• 2-socket, 4U
• Up to 24 cores
• Linux only
• NVIDIA GPU
• 2H 2014

S824
• 2-socket, 4U
• Up to 24 cores
• 1 TB memory
• 11 PCIe Gen 3
• AIX, IBM i, Linux
• PowerVM
• June 10, 2014

Power S822L servers are priced competitively with Intel Ivy Bridge servers

Comparable TCA: Linux on Intel Ivy Bridge + KVM vs. Linux on POWER8 + KVM

Server list price* (3-year warranty, on-site)
• Dell PowerEdge R720: $12,605
• HP ProLiant DL380 G8: $14,068
• IBM Power S822L: $14,895

Virtualization (2 sockets, 3 yr. 9x5 sub./supp.)
• Dell PowerEdge R720: $2,998, KVM for Red Hat on x86 (RHEV)
• HP ProLiant DL380 G8: $2,998, KVM for Red Hat on x86 (RHEV)
• IBM Power S822L: $2,998, KVM for Linux on Power (PowerKVM)

Linux OS list price (RHEL, 2 sockets, unlimited guests, 9x5, 3 yr. sub./supp.)
• Dell PowerEdge R720: $5,697, Red Hat subscription and Red Hat support
• HP ProLiant DL380 G8: $5,697, Red Hat subscription and Red Hat support
• IBM Power S822L: $4,489, Red Hat subscription and IBM support

Total list price (Total cost of acquisition)
• Dell PowerEdge R720: $21,300
• HP ProLiant DL380 G8: $22,763
• IBM Power S822L: $22,382

Configuration
• Dell R720 and HP ProLiant DL380p G8: two 2.7 GHz E5-2697 Ivy Bridge 12-core processors; 64 GB memory, 2 x 300GB 15k HDD, 10 Gb two port
• IBM Power S822L: two 3.4 GHz POWER8 10-core processors; same memory, HDD, NIC

* Based on US pricing for Power S822L announcing on April 28, 2014 matching configuration table above. Source: hp.com, dell.com, vmware.com
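The total cost of acquisition on this slide is the sum of three list prices per server. The short Python sketch below recomputes the totals from the line items shown on the slide; the dictionary layout and key names are hypothetical.

```python
# Line items copied from the slide (US list prices, in dollars).
costs = {
    "Dell PowerEdge R720": {"server": 12605, "virtualization": 2998, "linux_os": 5697},
    "HP ProLiant DL380 G8": {"server": 14068, "virtualization": 2998, "linux_os": 5697},
    "IBM Power S822L": {"server": 14895, "virtualization": 2998, "linux_os": 4489},
}

# Total cost of acquisition = server + virtualization + Linux OS subscription.
totals = {name: sum(items.values()) for name, items in costs.items()}

# Difference between the IBM total and each x86 total (positive = IBM costs more).
ibm = totals["IBM Power S822L"]
delta_vs_dell = ibm - totals["Dell PowerEdge R720"]
delta_vs_hp = ibm - totals["HP ProLiant DL380 G8"]
```

The recomputed totals match the slide: the S822L configuration lists for $1,082 more than the Dell configuration and $381 less than the HP configuration.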


POWER architecture is ideal for Hadoop and Streams workloads

Multi-Threading
• Twice the thread capability for running parallel Java workloads
• More threads enable faster BLU Acceleration Single-Instruction, Multiple Data (SIMD) vector processing
• More threads enable faster Hadoop workload processing

High I/O Performance
• High memory and I/O bandwidth enable faster Cognos ad hoc queries and better SPSS scoring throughput
• High I/O performance enables faster Cognos ad hoc queries and better SPSS scoring throughput
• Higher speed that significantly outperforms competitive x86 Hadoop configurations

Processor and Cache
• Large processor cache per core, large numbers of cores and high DRAM capacity on a single server benefit Cognos Cubes performance
• Extensive Soft-Error Recovery, self-healing for solid faults, and alternate processor recovery are ideal for Hadoop-based workloads

Java Optimization
• IBM’s Java Virtual Machine is specifically optimized for the POWER architecture to deliver optimal performance of big data and analytics solutions


BigInsights on POWER beats the competition with TeraSort benchmark

[Chart: Normalized Performance, GB / Core / Min. Callout: 1.7X.]

• Competitor, Xeon E5-2697v2: 8 nodes, 192 cores, 1 TB: 0.73 GB/core/min
• Competitor, Xeon E5-2630v2: 24 nodes, 288 cores, 10 TB: 0.75 GB/core/min
• Competitor, Xeon E5-2665: 16 nodes, 256 cores, 10 TB: 0.67 GB/core/min
• POWER7+ with BigInsights, 7R2: 10 nodes, 120 cores, 1 TB: 1.04 GB/core/min
• POWER8 with BigInsights, S822L: 8 nodes, 192 cores, 10 TB: 1.27 GB/core/min
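The chart's metric normalizes sort throughput by cluster size, so clusters with different core counts can be compared. The Python sketch below shows the arithmetic. It assumes the chart's bar labels read left to right, which gives 0.75 GB/core/min for the best competitor result and 1.27 GB/core/min for POWER8; the function name and the sample run are hypothetical.

```python
def gb_per_core_per_minute(data_gb: float, cores: int, minutes: float) -> float:
    """Normalized throughput: data sorted, divided by cores and by elapsed minutes."""
    return data_gb / (cores * minutes)


# Hypothetical run, only to show the units: 1000 GB on 100 cores in 10 minutes.
example = gb_per_core_per_minute(1000, 100, 10)

# Values as read from the chart (an assumption about which bar is which).
best_competitor = 0.75
power8 = 1.27
advantage = power8 / best_competitor
```

Under that reading, the ratio rounds to the 1.7X called out on the chart.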


Big SQL V3.0: Bringing SQL on Hadoop to the next level

• Massively parallel SQL engine on Hadoop
  • Architected from the ground up for low latency and high throughput
• Comprehensive SQL support
  • The same SQL you use on your data warehouse should run with few or no modifications
  • Full support for sub-queries
  • All standard join operations
  • Stored procedures / User defined functions
• Supports all modern file formats

[Diagram: a SQL Based Application calls Big SQL (SQL Engine, MPP Run-time), which reads from HDFS. File formats: CSV, Seq, Parquet, RC, Avro, ORC, JSON, Custom.]
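To show the kind of warehouse-style SQL the slide refers to (a join combined with a sub-query), the sketch below runs such a query from Python. It uses the standard library's sqlite3 module purely as a stand-in so the example is runnable anywhere; it does not connect to Big SQL, and the tables, columns and data are hypothetical.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in SQL engine.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE complaints (customer_id INTEGER, topic TEXT);
    INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO complaints VALUES
        (1, 'ATM fees'), (1, 'overdraft fees'), (2, 'ATM fees');
""")

# Join plus sub-query: count all complaints for customers who complained about ATM fees.
query = """
    SELECT c.name, COUNT(*) AS n
    FROM customers AS c
    JOIN complaints AS k ON k.customer_id = c.id
    WHERE c.id IN (SELECT customer_id FROM complaints WHERE topic = 'ATM fees')
    GROUP BY c.name
    ORDER BY n DESC, c.name
"""
rows = conn.execute(query).fetchall()
conn.close()
```

The point of the slide is that a query written in this standard style should need few or no changes when it is pointed at data stored in Hadoop.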


Big SQL on POWER8 delivers Big Data queries faster than Hive on Ivy Bridge

“TPC-DS inspired” workload queries

BigInsights v3.0 on 8 S822L data nodes
• S822L, 3.3 GHz
• 24 cores
• 256 GB Memory
• RHEL 6.5

[Chart: Time to complete queries in 7 concurrent streams. Values shown: 8 hours and 40 min. Callouts: 11.2x Faster, 4.7x Lower Cost.]


Hadoop implementation pain points
• Building clusters is very complex and time consuming, and can become costly
  • Difficult to plan, architect and build out a cluster of nodes and network components
  • Difficult to manage the cluster when deployed
  • Inability to share resources results in “cluster sprawl”
• x86 deployments are not flexible enough
  • Built with a fixed processor to disk ratio
  • When you add more machines you get both processors and hard drives
  • Some Hadoop workloads require more of one or the other but not always both at the same time
• HDFS NameNode is a single point of failure
• HDFS is not POSIX compliant and wastes storage
  • Requires use of non-standard commands and utilities
  • Copies of data may need to be stored twice
    − Once in Linux file system and once in HDFS
• Hadoop workload management is very basic
  • Does not utilize cluster resources to their maximum
  • Does not support multi-tenancy
  • Can lead to silos of underutilized clusters

Introducing the IBM Solution for Hadoop – Power Systems Edition

• Best-in-class hardware
  ► IBM POWER Systems
  ► IBM DCS3700 Storage

• Best-in-class software
  • IBM BigInsights
  • IBM Platform Computing
    − IBM Platform Symphony
    − IBM Platform Cluster Manager
  • IBM Elastic Storage
    − Formerly GPFS

[Diagram: solution stack, top to bottom]
• IBM InfoSphere BigInsights or Open-source Hadoop
• IBM Platform Symphony, IBM Platform Cluster Manager
• Distributed File System: IBM Elastic Storage, HDFS
• Linux Operating Environment: RHEL, SUSE
• IBM Power Systems: IBM Power 7+, Power8


Platform Symphony provides enterprise class Hadoop cluster management
• Replaces the open source MapReduce scheduler of Hadoop
• Existing Hadoop applications run unchanged
• Better performance and utilization in cluster overall and within individual jobs
  • Scheduling is more efficient
  • More efficient data transfer mechanisms for shuffle phase of MapReduce
  • Uses generic slots rather than fixed map and reduce slots which waste resources
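The claim that fixed map and reduce slots waste resources can be illustrated with a small calculation. The Python sketch below is a toy model, not Platform Symphony's scheduler: the slot counts, the task counts and the function name are hypothetical.

```python
def tasks_running(map_tasks: int, reduce_tasks: int,
                  map_slots: int, reduce_slots: int,
                  generic: bool = False) -> int:
    """Return how many waiting tasks can run at once on one node.

    Fixed slots: map tasks may only use map slots, reduce tasks only reduce slots.
    Generic slots: any task may use any slot.
    """
    if generic:
        return min(map_tasks + reduce_tasks, map_slots + reduce_slots)
    return min(map_tasks, map_slots) + min(reduce_tasks, reduce_slots)


# A node with 8 map slots and 4 reduce slots during a map-heavy phase:
# 12 map tasks are waiting and no reduce tasks are ready yet.
fixed = tasks_running(12, 0, map_slots=8, reduce_slots=4)
generic = tasks_running(12, 0, map_slots=8, reduce_slots=4, generic=True)
```

In this example the fixed layout runs 8 tasks and leaves 4 reduce slots idle, while the generic layout runs all 12.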


Platform Symphony provides enterprise class Hadoop cluster management (cont.)
• Multi-tenancy is built in: the system is designed for concurrency
  • Reduces costs through better resource sharing and utilization
  • Provides dynamic priority based concurrency
    − Interactive workloads like BigSheets can be given higher priority (e.g., during the day)
    − Priorities can be adjusted in real time
  • Preemptive scheduling policies ensure service levels are met while also allowing the cluster to be fully utilized
    − Jobs entering the system are immediately given the priority they are guaranteed
    − Applications can decide if they allow other applications to use their unused capacity
  • GUI provided to configure, manage and monitor policies and workload
• Allows running multiple versions of Hadoop
  • Helps control costs and provides flexibility during upgrades


Platform Cluster Manager removes the complexity of standing up clusters
• Provides comprehensive set of tools to install, provision and configure cluster management environment
  • GUI, automated installation scripts and guidance
• Provides flexible provisioning options
  • Bare-metal (i.e., no virtualization)
  • KVM (virtualized cluster environment)
• Provides GUI for “end users” to provision, deploy, manage and monitor clusters
• Improves administrator and/or user productivity
  • Deploy clusters easier and faster
  • Saves on administration costs
  • Lowers risk of deploying Hadoop clusters


IBM Elastic Storage (GPFS) brings enterprise class file system capabilities to Hadoop
• No single point of failure like HDFS NameNode
  • Metadata is distributed
• GPFS is POSIX compliant
  • Any Linux command/utility can be run against data, not just Hadoop commands
    − Easier to use and manage
  • Makes archival and backup easier
  • Other workload types can access data in cluster
  • Saves storage space and administrator work
    − Avoids moving data around to process inside and outside of HDFS
    − Avoids having duplicate data inside and outside of HDFS
• Allows variable block sizes to exist on same cluster to meet the needs of different applications

[Diagram label: Native OS Applications (POSIX)]


IBM Solution for Hadoop – Power Systems Edition key architecture components

PowerLinux Data Node
• IBM PowerLinux 7R2 (P7+ based, P8 coming in 4Q2014)
• 2 sockets Power7 3.55 GHz CPU
• Data: 29 x 900 GB SAS HDDs, JBOD I/O Exp
• OS: 1 x 300 GB SAS HDD
• 128 GB DDR3 RDIMMs

PowerLinux Management Node (JobTracker, NameNode, Console)
• IBM PowerLinux 7R2 (P7+ based, P8 coming in 4Q2014)
• 2 sockets Power7 3.55 GHz CPU
• OS: 6 x 600 GB SAS HDD, mirrored
• 128 GB DDR3 RDIMMs

DCS3700
• IBM DCS3700 Storage f/c 1818-80c
• A maximum of 60 x 3.5" HDD, 4 TB each
• Standard controller (3000) validated

1GbE, 10GbE Switches
• 1GbE: IBM RackSwitch G8052
  – 48 × 1 GbE RJ45 ports, four 10 GbE SFP+ ports
  – Low 130 W power rating and variable speed fans to reduce power consumption
• 10GbE: IBM RackSwitch G8264
  – Optimized for applications requiring high bandwidth and low latency
  – Up to 64 1 Gb/10 Gb SFP+ ports, four 40 Gb QSFP+ ports, 1.28 Tbps non-blocking throughput

Learn more about Big Data Analytics Solutions on Power

IBM Analytics Solutions for Power Systems
http://www-03.ibm.com/systems/power/solutions/bigdata-analytics/index.html

The PowerLinux Community (DeveloperWorks)
www.ibm.com/developerworks/group/tpl/

@thinkpowerlinux
