InfoSphere BigInsights
• Augments open source Hadoop
with enterprise capabilities
– Enterprise-class storage
– Security
– Performance Optimization
– Enterprise integration
– Development tooling
– Analytic Accelerators
– Application and industry accelerators
– Visualization
Enterprise-Class Storage and Security
• IBM GPFS-SNC (Shared-Nothing Cluster) parallel file system
can replace HDFS to provide Enterprise-ready storage
– Better performance
– Better availability
• No single point of failure
– Better management
• Full POSIX compliance, supports multiple storage technologies
– Better security
• Kernel-level file system that can exploit OS-level security
• Security provided by reducing the surface area and securing
access to administrative interfaces and key Hadoop services
– LDAP authentication and reverse-proxy support restrict access to
authorized users
• Clients outside the cluster must use REST HTTP access
– Defines 4 roles not available in Hadoop for finer-grained security:
• System Admin, Data Admin, Application Admin, and User
• Installer automatically lets you map these roles to LDAP groups and users
– GPFS-SNC means the cluster is aware of the underlying OS security
services without added complexity
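The role-to-LDAP mapping described above can be pictured with a small sketch. This is only an illustration of the idea, not the BigInsights installer's format or API: the four role names come from the slide, while the LDAP group names and the helper functions `roles_for` and `is_authorized` are hypothetical.

```python
from enum import Enum


class Role(Enum):
    """The four roles named on the slide."""
    SYSTEM_ADMIN = "System Admin"
    DATA_ADMIN = "Data Admin"
    APPLICATION_ADMIN = "Application Admin"
    USER = "User"


# Hypothetical LDAP group names, chosen for illustration only.
ROLE_TO_LDAP_GROUPS = {
    Role.SYSTEM_ADMIN: {"cn=bi-sysadmins"},
    Role.DATA_ADMIN: {"cn=bi-dataadmins"},
    Role.APPLICATION_ADMIN: {"cn=bi-appadmins"},
    Role.USER: {"cn=bi-users", "cn=analysts"},
}


def roles_for(user_groups):
    """Return every role granted by the user's LDAP group memberships."""
    groups = set(user_groups)
    return {role for role, mapped in ROLE_TO_LDAP_GROUPS.items() if mapped & groups}


def is_authorized(user_groups, required_role):
    """True if any of the user's groups maps to the required role."""
    return required_role in roles_for(user_groups)
```

A user in no mapped group receives no role, which matches the slide's point that access is restricted to authorized users.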
Workload Optimization
Optimized performance for big data analytic workloads
Capabilities
• Speed: 10-100x faster than traditional systems
[Figure: data warehouse with dedicated high-performance disk storage.]
[Figure: analytics architecture. Labels: R GUI client, Eclipse client, partner ADE or IDE, partner visualization, nzAdaptors, nzAnalytics, host(s), analytic workbench, model building, large data set.]
[Figure labels: Structured, Unstructured, Streaming]
Benefits
• Transform and aggregate any volume of information
• Deliver data in batch or real time through visually designed logic
• Hundreds of built-in transformation functions
• Metadata-driven productivity, enabling collaboration
Parallel pipelining
[Figure: pipeline stages Import, Clean 1, Clean 2, Merge, Analyze]
• Inter-node communications
• Parallel access to sources
• Parallelization of operations
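As a rough illustration of pipelining combined with partition-level parallelism, the sketch below streams each partition's records through chained generator stages and processes partitions concurrently before a merge and analyze step. The stage names follow the slide; the cleaning rules, the thread pool, and the word-count analysis are assumptions made for the example and are not taken from the product.

```python
from concurrent.futures import ThreadPoolExecutor


def import_records(partition):
    """Import stage: stream raw records from one source partition."""
    for raw in partition:
        yield raw


def clean_1(records):
    """Clean 1 (assumed rule): trim whitespace and drop empty records."""
    for record in records:
        record = record.strip()
        if record:
            yield record


def clean_2(records):
    """Clean 2 (assumed rule): normalize case."""
    for record in records:
        yield record.lower()


def run_pipeline(partition):
    """Pipelining: each record flows through all stages without intermediate storage."""
    return list(clean_2(clean_1(import_records(partition))))


def merge_and_analyze(partitions):
    """Process partitions in parallel, then merge and count occurrences."""
    with ThreadPoolExecutor(max_workers=max(1, len(partitions))) as pool:
        cleaned = list(pool.map(run_pipeline, partitions))
    counts = {}
    for part in cleaned:
        for record in part:
            counts[record] = counts.get(record, 0) + 1
    return counts
```

Generators give the pipelined behavior within a partition; the executor stands in for the node-level parallelism shown on the slide.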
Data Lineage Extender
• Support governance requirements for business provenance
• Extended visibility to enterprise data integration flows outside of InfoSphere
Information Server
• Comprehensive understanding of data lineage for trusted information
• Popular business use cases
– Non-IBM ETL tools and applications
– Mainframe Common Business-Oriented Language (COBOL) programs
– External scripts, Java programs, or web services
– Stored procedures
– Custom transformations
Lineage tracking with BigInsights
• Business Users (Visualization & Discovery)
– Visualization of a large volume and wide variety of data
• Developers (Application Development)
– Similarity in tooling and languages
– Mature open source tools with enterprise capabilities
– Integration among environments
• Administrators (Systems Management)
– Consoles to aid in systems management
Visualization - Spreadsheet-style user interface
• Browser-based
[Figure: the MyAlgorithm class below is shown on each data partition in HDFS / GPFS and in the Mapper and Reducer.]
class MyAlgorithm {
    initializeTask()
    beginIteration()
    processRecord()
    mergeTasks()
    endIteration()
}
… and reducers would be replaced with UDAPs (User-Defined Analytic Processes), …
… corresponding accumulators remain the same across the data partitions
… accumulators between Hadoop-based and Netezza-based implementations
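The lifecycle methods named in the figure can be sketched as follows. This is a minimal, assumption-laden illustration: the five method names come from the slide (written here in Python naming style), while the choice of a running mean as the algorithm, the accumulator fields, and the `run` driver that stands in for the framework are all hypothetical.

```python
class MeanAlgorithm:
    """Hypothetical algorithm that computes a mean with per-partition accumulators."""

    def initialize_task(self):
        # Create the accumulators once per task.
        self.total = 0.0
        self.count = 0

    def begin_iteration(self):
        # Reset the accumulators at the start of each pass over the data.
        self.total = 0.0
        self.count = 0

    def process_record(self, value):
        # Fold one record into the local accumulators.
        self.total += value
        self.count += 1

    def merge_tasks(self, other):
        # Combine accumulators from another partition's task.
        self.total += other.total
        self.count += other.count

    def end_iteration(self):
        # Produce the result of this pass from the merged accumulators.
        return self.total / self.count if self.count else None


def run(partitions):
    """Stand-in driver: one task per partition, then merge (assumed call order)."""
    tasks = []
    for partition in partitions:
        task = MeanAlgorithm()
        task.initialize_task()
        task.begin_iteration()
        for record in partition:
            task.process_record(record)
        tasks.append(task)
    if not tasks:
        return None
    merged = tasks[0]
    for task in tasks[1:]:
        merged.merge_tasks(task)
    return merged.end_iteration()
```

Because the algorithm only touches its accumulators through these hooks, the same class could in principle be driven by different back ends, which is the portability point the slide makes.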
Objects can be connected into workflows with their
deployment optimized using semantic properties
• D = 5*(B’*A + A*C)
– Transpose
• BasicOnePassTask
• Can execute in Mapper or Reducer
– MM (matrix multiply)
• BasicOnePassMergeTask
• Has Map and Reduce components
– Add (matrix add)
• BasicOnePassKeyedTask
• Executes in Reducer and can be piggybacked
– Multiply (scalar multiply)
• BasicOnePassTask
• Can execute in Mapper or Reducer
• Entire computation can be executed in one map-reduce job
due to differentiation of BasicTasks
[Figure: workflow over inputs B, A, C. Transpose produces B’; two MM operators produce B’*A and A*C; Add and Multiply follow. Phase labels: MAP, REDUCE.]
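To make the operator composition concrete, here is a small single-machine sketch of D = 5*(B’*A + A*C) built from the four operations named above. It only shows how the operators chain together; it does not model the Mapper/Reducer placement or the fusion into one map-reduce job, and the function names are illustrative rather than part of any BigInsights API.

```python
def transpose(m):
    """Transpose: swap rows and columns."""
    return [list(row) for row in zip(*m)]


def matmul(x, y):
    """MM: matrix multiply using rows of x and columns of y."""
    y_cols = transpose(y)
    return [[sum(a * b for a, b in zip(row, col)) for col in y_cols] for row in x]


def add(x, y):
    """Add: element-wise matrix addition."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]


def scalar_multiply(k, x):
    """Multiply: scale every element by k."""
    return [[k * a for a in row] for row in x]


def compute_d(a, b, c):
    """Compose the operators exactly as in D = 5*(B'*A + A*C)."""
    return scalar_multiply(5, add(matmul(transpose(b), a), matmul(a, c)))
```

For the shapes to agree, B’*A and A*C must have the same dimensions, so square matrices of equal size are the simplest inputs to try.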
SystemML compiles an R-like language into
MapReduce jobs and database jobs
Example: A = B * (C / D)
Example Operations
• X*Y cell-wise multiplication: z_ij = x_ij ∙ y_ij
• X/Y cell-wise division: z_ij = x_ij / y_ij
• Input DML parsed into statement blocks with typed variables
• Each high-level operator (hop) operates on matrices, vectors and scalars
• Each low-level operator (lop) operates on key-value pairs and scalars
• Multiple low-level operators combined in a MapReduce job
[Figure: A = B * (C / D) shown as binary hops (Multiply over B and a Divide over C, D), as binary lops (Multiply, Divide) with a Group lop, and as a MapReduce job labeled M1 and R1.]
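The distinction between operators on matrices and operators on key-value pairs can be illustrated with a short sketch that evaluates A = B * (C / D) cell by cell. This is not SystemML code or its internal representation: the ((row, column), value) encoding and the helper names are assumptions used to show how a cell-wise binary operation reduces to matching keys.

```python
import operator


def to_cells(matrix):
    """Represent a matrix as key-value pairs: (row, col) -> value."""
    return {(i, j): v for i, row in enumerate(matrix) for j, v in enumerate(row)}


def from_cells(cells, rows, cols):
    """Rebuild a dense matrix from key-value pairs."""
    return [[cells[(i, j)] for j in range(cols)] for i in range(rows)]


def binary_cellwise(op, x, y):
    """Apply op to the two values that share each (row, col) key."""
    return {key: op(x[key], y[key]) for key in x}


def evaluate(b, c, d):
    """A = B * (C / D), with cell-wise division feeding cell-wise multiplication."""
    rows, cols = len(b), len(b[0])
    quotient = binary_cellwise(operator.truediv, to_cells(c), to_cells(d))
    product = binary_cellwise(operator.mul, to_cells(b), quotient)
    return from_cells(product, rows, cols)
```

Since both operations match on the same key, they need no regrouping between them, which suggests why several low-level operators can share one job.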
Approximately thirty data-parallel algorithms have been
implemented to date using these and related APIs
• Simple Statistics
– CrossTab
– Descriptive Statistics
• Clustering
– K-Means Clustering
– Kernel K-Means
– Fuzzy K-Means
– Iclust
• Dimensionality Reduction
– Principal Components Analysis
– Kernel PCA
– Non-Negative Matrix Factorization
– Doubly-sparse NMF
• Graph Algorithms
– Connected Graph Analysis
– Page Rank
– Hubs and Authorities
– Link Diffusion
– Social Network Analysis (Leadership)
• Regression Modeling
– Linear Regression
– Regularized Linear Models
– Logistic Regression
– Transform Regression
– Conjugate Gradient Solver
– Conjugate Gradient Lanczos Solver
• Support Vector Machines
– Support Vector Machines
– Ensemble SVM
• Trees and Rules
– Adaptive Decision Trees
– Random Decision Trees
– Frequent Item Sets - Apriori
– Frequent Item Sets - FP-Growth
– Sequence Mining
• Miscellaneous
– k-Nearest Neighbors
– Outlier Detection
MARIO incorporates AI planning technology to enable ease of use
Analytic accelerators
– Analytics, operators, rule sets
Application Accelerators
– Analytics
– Models
– Adapters
Analytic Accelerators Designed for Variety
Accelerators Improve Time to Value
• Over 100 sample applications
• User Defined Toolkits
• Standard Toolkits
• Industry Data Models: Banking, Insurance, Telco, Healthcare, Retail
Big Data Platform - Analytic Applications