
Copyright 2011, Oracle and/or its affiliates. All rights reserved.

Oracle & Big Data
Dominic Giles
Agenda

Introduction to Big Data


What is big data?
Who's doing it?
Who are our competitors?
Introduction to Hadoop
What makes up Hadoop
An introduction to Map Reduce
Developing a Map Reduce process
Building a cluster

Agenda

Introduction to Pig
Why Pig
Developing in Pig Latin
Introduction to Hive
SQL for Hadoop
Oracle noSQL Database
What is a no*SQL database
Who uses them?
Oracle noSQL Database

Agenda

Oracle and Big Data


What are we doing
Introduction to R
What is R
Why is Oracle using R

Big Data

What is Big Data?

Big Data

It's not always big
200GB+
Big Data

But it can be very big
1PB+
Big Data

Typically Regarded as

Junk Data
It's easier to throw it away than do something useful with it

Web Logs
Sensor Output
Historical Data
Big Data

It's not one type of data:
Unstructured, Structured, Semi-Structured, Documents, XML, Images
Big Data

It's nearly always parallel
Tens to thousands of shared-nothing commodity servers
Big Data

Business as usual
A 2002 study of how customers of Canadian Tire were using the company's credit card found that 2,220 of 100,000 cardholders who used their credit cards in drinking places missed four payments within the next 12 months. By contrast, only 530 of the cardholders who used their credit cards at the dentist missed four payments within the next 12 months.
By CHARLES DUHIGG, published May 2009
One current feature of big data is the difficulty working with it using relational databases and desktop statistics/visualization packages, requiring instead "massively parallel software running on tens, hundreds, or even thousands of servers."

Jacobs, A. (6 July 2009). The Pathologies of Big Data. ACM Queue
Big Data
Who's Doing it?
Who are our competitors?
Everyone who's interested in managing data
Introduction to Hadoop

An open source project to provide a
framework for the distributed processing
of data across clusters of computers

Hadoop
A Comparative Study

A Typical Exadata Cluster

Two Exadata Racks


16 Compute Nodes
200TB Disk (80TB Usable)

Hadoop
A Comparative Study
A Typical Hadoop Cluster

Thirty Racks
540 Compute Nodes
19PB Disk (6PB Usable)
Hadoop
Typical Hadoop Server
Typically 2U servers
Name Node
1GB of memory per million blocks (typically 24-32GB)
2 six-core CPUs
More resilient hardware than Data/Task Tracker nodes
Data/Task Tracker Node
2 six-core CPUs
24GB of memory
4-12 disks
Typically no RAID functionality
Hadoop
The core kernel is made up of two components:
HDFS : a distributed file system
Task Manager : schedules and parallelises jobs
HDFS
A distributed file system
Responsible for distributing files throughout the cluster
Designed for high throughput rather than low latency
Typical files are gigabytes in size
Files are broken down into chunks
HDFS is rack aware
(Diagram: NameNode / JobTracker → Data Nodes)
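The chunking idea above can be sketched in a few lines of Python. This is a toy illustration under assumed defaults of the era (64MB blocks, 3 replicas) and a simple round-robin placement; it is not the real HDFS placement code, which is rack aware.

```python
# Toy sketch of HDFS-style block planning. BLOCK_SIZE and REPLICATION
# mirror commonly cited defaults; round-robin placement is an
# illustrative assumption -- real HDFS places replicas rack-aware.
BLOCK_SIZE = 64 * 1024 * 1024
REPLICATION = 3

def plan_blocks(file_size, data_nodes):
    """Return (block_index, replica_nodes) for each block of a file."""
    n_blocks = -(-file_size // BLOCK_SIZE)  # ceiling division
    plan = []
    for b in range(n_blocks):
        # Pick REPLICATION distinct nodes, rotating through the cluster
        replicas = [data_nodes[(b + r) % len(data_nodes)]
                    for r in range(REPLICATION)]
        plan.append((b, replicas))
    return plan

# A 20GB file breaks into 320 blocks of 64MB
plan = plan_blocks(20 * 1024**3, ["node1", "node2", "node3", "node4"])
```

The NameNode holds exactly this kind of mapping as metadata, which is why it is consulted on every read and write.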
HDFS
Meta Data

The NameNode is responsible for managing the meta data of the file system
It is consulted every time a file needs to be written or read
Currently the NameNode is a single point of failure
(Diagram: Customers.txt (20GB) → NameNode / JobTracker → Data Nodes)
HDFS

The Name Node is arguably Hadoop's weakest link
A Secondary Name Node doesn't provide failover capability, just simple recovery
Fixed in Hadoop release 0.23
(Diagram: Name Node with Secondary Name Node)
HDFS
Accessing HDFS

Access to HDFS is performed via the hadoop utility or via FUSE

$ hadoop fs -ls
Found 2 items
drwxr-xr-x  - oracle supergroup      0 2011-11-23 03:03 /user/oracle/input
drwxr-xr-x  - oracle supergroup      0 2011-12-01 09:24 /user/oracle/output

$ hadoop fs -mkdir domsdata
$ hadoop fs -copyFromLocal Employees.dat domsdata
$ hadoop fs -ls domsdata
Found 1 items
-rw-r--r--  1 oracle supergroup  49398 2011-12-06 18:00 /user/oracle/domsdata/Employees.dat
HDFS

Let's take a quick look
Map Reduce
Processing the data
Map Reduce programs are designed to process large
quantities of data in parallel
Data is processed using a key value pair
Every value has a key
Data is immutable, i.e. changes to the data being
processed are not reflected in the original file
All the developer needs to do is implement map() and
reduce() functions. The framework does the rest
Map Reduce
Map : The first step
Takes an element and converts it into something based
on a key ready for the next step of processing
This might be parsing a comma delimited string
Cleaning the data
Rearranging the elements in a field
performing a simple calculation

A Map Reduce Process
The basic logic

Input:
Key=Row#, Value="smith, john, 25, 12-Jan-1970, 10000, 20"
Key=Row#, Value="jones, david, 25, 21-Aug-1970, 10050, 30"

Map(Key, Value) -- pseudo code
Begin
  Break up the Value by delimiter(",")
  Name = InitCap(Field2 + Field1)
  Department = Field6
  Key = Department
  Value = Name, Field3, Field4, Field5, Field6
  Output(Key, Value)
End

Output:
Key=20, Value="John Smith, 25, 12-Jan-1970, 10000, 20"
Key=30, Value="David Jones, 25, 21-Aug-1970, 10050, 30"
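The Map step of this example can be tried out as plain Python. This is a standalone sketch of the logic only (the function and field names are illustrative, not Hadoop API code), assuming the comma-delimited record layout shown on the slide: surname, forename, age, date of birth, salary, department.

```python
def emp_map(row_num, line):
    """Map(Key, Value): re-key a comma-delimited employee record
    (surname, forename, age, dob, salary, dept) by its department."""
    fields = [f.strip() for f in line.split(",")]
    surname, forename, age, dob, salary, dept = fields
    name = f"{forename.title()} {surname.title()}"   # InitCap(Field2 + Field1)
    # New key is the department; value carries the remaining fields
    return dept, f"{name}, {age}, {dob}, {salary}, {dept}"

key, value = emp_map(1, "smith, john, 25, 12-Jan-1970, 10000, 20")
# key == "20", value == "John Smith, 25, 12-Jan-1970, 10000, 20"
```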
Map Reduce
Map/Partitioning phase
It's also possible to reduce unnecessary network exchanges by implementing an
optional combiner step before the reduce process
(Diagram: on each node, Map Process → optional Combiner Process → Partition Process)
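As a concrete illustration of why a combiner helps, the word-count-style sketch below (an assumed example, not taken from the deck) pre-aggregates map output on one node, so fewer (key, value) pairs cross the network to the reducers.

```python
from collections import defaultdict

def combine(mapped_pairs):
    """Optional combiner: partially aggregate (key, count) pairs
    produced by the map step on this node before partitioning."""
    partial = defaultdict(int)
    for key, count in mapped_pairs:
        partial[key] += count
    return sorted(partial.items())

# Six pairs from the mapper shrink to three before leaving the node
combined = combine([("a", 1), ("b", 1), ("a", 1), ("c", 1), ("a", 1), ("b", 1)])
# combined == [('a', 3), ('b', 2), ('c', 1)]
```

The combiner must be an operation that can be applied repeatedly without changing the final answer (here, addition), since the framework may run it zero or more times.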
Map Reduce
Reduce : The last step
Takes an iterator (list) of values with the same key and
reduces it by operations such as
Aggregating
Filtering
Sampling
The reduce process in turn writes its results back to HDFS
Unlike some other Map Reduce paradigms, Hadoop does
not insist that the Reduce process returns a single
element
A Reduce Process
The basic logic

Input:
Key=30, Value="David Jones, 25, 21-Aug-1970, 10050, 30"
Key=30, Value="John Bracken, 25, 15-Dec-1955, 15050, 30"
Key=30, Value="Peter Thompson, 25, 06-Jun-1945, 800, 30"

Reduce(Key, Value List) -- pseudo code
Begin
  For every Value in the Value List
    Break up the Value by delimiter(",")
    Name = Field1
    Salary = Field4
    If Salary > MaxSalary then MaxValue = this Value
  End Loop
  Value = MaxValue.Name, MaxValue.Salary
  Output(Key, Value)
End

Output:
Key=30, Value="John Bracken, 15050"
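The Reduce logic of this example, sketched as standalone Python (illustrative names and field positions, not Hadoop API code): given every value that shares one key, it keeps the record with the highest salary.

```python
def emp_reduce(key, values):
    """Reduce(Key, ValueList): emit the name and salary of the
    highest-paid employee among the values for this key."""
    best_name, best_salary = None, None
    for value in values:
        fields = [f.strip() for f in value.split(",")]
        name, salary = fields[0], int(fields[3])   # salary is the 4th field
        if best_salary is None or salary > best_salary:
            best_name, best_salary = name, salary
    return key, f"{best_name}, {best_salary}"

out = emp_reduce("30", [
    "David Jones, 25, 21-Aug-1970, 10050, 30",
    "John Bracken, 25, 15-Dec-1955, 15050, 30",
    "Peter Thompson, 25, 06-Jun-1945, 800, 30",
])
# out == ("30", "John Bracken, 15050")
```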
Map Reduce
Putting it all together
Map and Reduce processes run concurrently
Data is sent to a reduce process based on its key
Reduce processes don't communicate with one another
If a process fails it is restarted on another node
(Diagram: HDFS /users/hr/input → Map Process → Partition Process / Sort → Reduce Process → HDFS /users/hr/output, across Node1, Node2 and Node3)
Map Reduce
Chaining Jobs

Collections of Map Reduce processes are chained together
Output from one process acts as the input to another Map Reduce process
Workflows and coordination are implemented in code or via frameworks like ZooKeeper
(Diagram: Clean and sort emails → Generate text index → Analyse emails)
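Chaining can be pictured as feeding one job's output records into the next. A minimal sketch, where each stage stands in for a full Map Reduce job (the email-processing stages are illustrative placeholders, not real jobs):

```python
def chain(records, *stages):
    """Run each stage on the previous stage's output, like chained
    Map Reduce jobs writing to and reading back from HDFS."""
    for stage in stages:
        records = stage(records)
    return records

clean = lambda emails: [e.strip().lower() for e in emails]   # "clean and sort emails"
index = lambda emails: sorted(set(emails))                   # "generate text index"

result = chain(["  Bob@x.COM ", "ann@y.com", "bob@x.com "], clean, index)
# result == ['ann@y.com', 'bob@x.com']
```

In a real cluster each intermediate result lands in HDFS, and a coordinator (hand-written code or a framework) decides when the next job may start.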
Map Reduce
Jobs and Tasks
A Job is
All classes and jar files needed to run a map reduce program
Jobs can be submitted from the command line
$> hadoop jar minimumsalary.jar /users/hr/input /users/hr/output

A Task is
The process responsible for executing the individual map reduce
steps. They are executed on nodes selected by the Job Tracker

Map Reduce

Let's take a quick look
Map Reduce
Monitoring the Cluster
Via the built-in web interfaces, or via more sophisticated tools such as Cloudera Manager
Configuring a Hadoop Cluster
Quick Configuration Overview
Core infrastructure is roughly a 50MB download
All the files that need to be modified are located in
the $HADOOP_HOME/conf directory
core-site.xml : locations, ports and global properties
hdfs-site.xml : HDFS parameters, e.g. replication level
mapred-site.xml : Map Reduce parameters, e.g. the job tracker location
slaves : list of data nodes/task trackers
Ensure ssh is configured to enable remote startup and
shutdown of daemons
Hadoop
Miscellaneous
Adding new nodes is a relatively painless task
Adding new data nodes does not trigger a rebalance of storage
A rebalance can be invoked via a script or by temporarily changing the level of
replication (not always a good idea)
Update $HADOOP_HOME/conf/slaves and then start the daemons
$> bin/hadoop datanode

Removing nodes simply requires them to be excluded
and their data allowed to migrate to the remaining nodes
Hadoop
The Family
The Hadoop family consists of a collection of utilities/applications:
Pig : Data Analysis
Hive : SQL
Sqoop : Database import/export
Zoo Keeper : Coordination
Avro : Serialisation
Map Reduce
HBase : Columnar Database
HDFS : Distributed File System
And a couple that aren't quite so interesting to our Oracle stack
Introduction to PIG

PIG
A higher level language
Designed to simplify the analysis of large data sets
Simpler than developing Java Map Reduce processes
Developers code in Pig Latin
Faster to develop in and easier to understand
Its structure allows the system to implement
optimisations whilst allowing the developer to work on
semantics
Extensible

PIG
Pig Latin
Similar to many popular scripting languages e.g. Python,
Ruby
10 lines of Pig Latin is typically equivalent to several
hundred lines of Java
Provides common operations like join, group, sort, filter
Typically used for rapid prototyping, ad-hoc queries, web
log processing
PIG
Running Pig
Grunt : A simple Shell
Submitting scripts directly to the server
Java interface
Via an IDE
Eclipse has a plugin that allows for textual and graphical coding

PIG
Some simple examples
Log into the shell (Local Mode)
$> pig -x local

Load data from the file system and display it

A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);
B = FOREACH A GENERATE name;
DUMP B;
(John)
(Mary)
(Bill)
(Joe)
PIG
Some simple examples
Load some data delimited by commas, order it by telephone number (ascending), and store it to the file system

A = LOAD 'PhoneNumber.dat' USING PigStorage(',') AS (name:chararray, telephone:chararray);
B = FOREACH A GENERATE telephone, name;
X = ORDER B BY telephone ASC;
STORE X into 'justnames';

Group the data by the first three digits, count the area codes, and only print out the tuples with more than 3 entries

C = GROUP A by SUBSTRING(telephone,0,3);
D = FOREACH C GENERATE $0 as areacodeid, COUNT($1) as cnt;
E = FILTER D BY cnt > 3;
DUMP E;
PIG

Pig in Action
Introduction to Hive

Hive
Batch ETL for Hadoop
The Hive Query Language is similar in its constructs to SQL
SQL is common throughout industry, providing an easy
migration of skills to the Hadoop environment
Hive is not an ad hoc query environment. Even simple
queries can take many minutes to complete
Creation of Map Reduce jobs can be time consuming
Hive QL provides abstraction from the complexity of the
Hadoop cluster
The Hive compiler turns SQL into map reduce operations
Hive
Hive
Hadoop is not a No SQL datastore
Over 99% of the work run against Facebook's Hadoop
clusters are Hive jobs
Only a small proportion is hand-coded map reduce operations
Hive is designed to make it simpler to process huge data
sets
The output of Hive is often loaded into a standard
relational database for faster ad hoc analysis
Hive
Architecture
(Architecture diagram: User Space – Hive CLI, JDBC, ODBC, user-defined Map Reduce scripts, UDF/UDAF;
Hive QL – Parser, Planner, Optimiser;
MetaStore – typically mySQL;
Execution – Map Reduce;
SerDe File Formats – Text File, Sequence File, RCFile;
HDFS)
Hive
Basic DDL
Tables
Create Table
Partitions
Buckets (Maps to files)
Add/Remove columns
Add/Remove partitions
Views
Index

Hive
Basic Datatypes
Integers : tinyint, smallint, int, bigint
Booleans
Floating point : float, double
Strings
Complex Types : Structs, Maps, Arrays
Note : no Date or timestamp datatypes. These need to be held as strings
Hive
SQL
Select statements
Group by, order, Equi Join, Outer join, Sub Queries
Partition pruning
Insert Statements
Multi Insert
No Updates or Deletes
Data is immutable. It can only be added to
Tables and partitions can be dropped

Hive
SQL Example

SELECT r.*, s.*
FROM r JOIN (
       SELECT key, count(1) AS count
       FROM s
       GROUP BY key) s
  ON r.key = s.key
WHERE s.count > 100
Hive

Hive in Action
Introduction to Sqoop

Sqoop
Database Import/Export to HDFS
Provides a heterogeneous means of importing and
exporting data to/from relational databases
Access is typically via JDBC
Access to Oracle, mySQL, SQL Server etc.
Third-party suppliers, e.g. Quest, provide higher-performance
alternatives to the defaults
Growing in scope and capabilities to become a fully
fledged ETL solution
Sqoop
Example

$ sqoop import \
  --query 'SELECT a.*, b.* FROM a JOIN b on (a.id == b.id) WHERE $CONDITIONS' \
  --split-by a.id --target-dir /user/foo/joinresults

$ sqoop export --connect jdbc:mysql://db.example.com/foo --table bar \
  --export-dir /results/bar_data
Hadoop
Known Limitations
Query latency
Hadoop is batch orientated
Meta data management
Developers have to know the location of data and its datatypes
Its model is insert/append, not update
It's not possible to use it as a transactional store
HBase can be used to provide random read/write capability
Summary
Hadoop
Massively distributed data processing framework
Allows the developer to concentrate on the process
and not its parallelisation
Growing in popularity with enterprise customers
NoSQL a brief history

Early 2000s: Web 2.0 companies started looking for RDBMS alternatives
2003: memcached (a cached k-v store to reduce load on RDBMS)
2004: Google published MapReduce distributed processing paper
2006: Google published BigTable distributed database paper
2007: Amazon published Dynamo paper
2008+: Several open source projects are launched to productize NoSQL solutions
2010+: Enterprises start to investigate NoSQL solutions
What is NoSQL?

Not-only-SQL (2009)
Broad class of non-relational DBMS systems that typically:
Provide horizontal/distributed scalability
Avoid joins
Have relaxed consistency guarantees
Don't require a structured schema
Are application/developer-centric
No standards
Rapidly evolving set of solutions (122+ on nosql-database.org)
Highly variable feature set
UnQL launched in July
Majority are open source
Key-Value Store Workloads
Data Capture

Write data as fast as you can
Minimal indexing
No referential integrity
Relaxed durability guarantees (lower value data)
Scale write throughput via data distribution
Optimize write throughput per storage node (master, append-only log file)
Asynchronous replication
Bulk operation support is useful for some applications
Workload can be steady and/or bursty
Throughput more important than latency
Primary NoSQL use cases

Social Networks (LinkedIn, Facebook, Digg, Google+, etc.)
Personalization (Amazon, eBay, Yahoo, etc.)
Web-centric services (Apple, Cisco, AT&T, HP, Motorola, Nokia, Pros)
Customer service
Device tracking
Airline pricing
Intelligence community
Financial services (JP Morgan, Wells Fargo)
Fraud detection
Document search (Thomson Reuters, ExLibris)
Scientific research
Geophysical (Halliburton)
Biomedical
Who are the primary players?
NoSQL Databases
Key-value : Oracle NoSQL DB*, Voldemort*, Tokyo Cabinet, Redis, Riak, CitrusLeaf, GenieDB*, Amazon Dynamo*, Google LevelDB
Columnar : Cassandra, HBase, HyperTable, Google BigTable
Document : MongoDB, CouchDB, RavenDB
Graph : OrientDB, GraphDB

(*) Built on top of Berkeley DB
Key Value Example
Key             Maps to                  Value (Opaque)
Smith John      Node 1, Partition 1034   12 The Crescent, Oxford, Oxfordshire : 0145 677 3455 : 3-Jan-1989
Smith Sue       Node 12, Partition 675   Flat 4, 191 Easy Street, Reading, Berkshire : 1604 444 4678 : 21-Apr-1978
Jones David     Node 3, Partition 3002   47 Alpine Grove, Reading, Berkshire : 1604 336 7890 : 20-Feb-1979
Wright Michael  Node 1, Partition 56     The Cider House, Main Street, Dorchester, West Dorset : 01203 393 4443 : 06-Aug-1962

(Values are stored as opaque byte strings)
Key Value Example
Major Key       Minor Key     Maps to                  Value (Opaque)
Smith John      Address       Node 1, Partition 1034   12 The Crescent, Oxford, Oxfordshire
Smith John      Phone Number  Node 1, Partition 1034   0145 677 3455
Smith John      DOB           Node 1, Partition 1034   3-Jan-1989
Smith Sue       Address       Node 12, Partition 675   Flat 4, 191 Easy Street, Reading, Berkshire
Smith Sue       Phone Number  Node 12, Partition 675   1604 444 4678
Smith Sue       DOB           Node 12, Partition 675   21-Apr-1978
Jones David     Address       Node 3, Partition 3002   47 Alpine Grove, Reading, Berkshire
Jones David     Phone Number  Node 3, Partition 3002   1604 336 7890
Jones David     DOB           Node 3, Partition 3002   20-Feb-1979
Wright Michael  Address       Node 1, Partition 56     The Cider House, Main Street, Dorchester, West Dorset
Wright Michael  Phone Number  Node 1, Partition 56     01203 393 4443
Wright Michael  DOB           Node 1, Partition 56     06-Aug-1962

(All minor keys under the same major key map to the same node and partition; values are stored as opaque byte strings)
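The major/minor key scheme can be mimicked with a plain Python dict. This is a toy model of the idea only (hash the major key to pick a partition, so all of a major key's minor keys colocate); it is not the Oracle NoSQL Database API, and the class and method names are illustrative.

```python
class ToyKVStore:
    """Toy major/minor key-value store: every record sharing a major
    key hashes to the same partition, so related minor keys colocate."""
    def __init__(self, n_partitions=1024):
        self.n_partitions = n_partitions
        self.data = {}

    def partition(self, major):
        # Only the major key determines placement
        return hash(major) % self.n_partitions

    def put(self, major, minor, value):
        self.data[(major, minor)] = value

    def get(self, major, minor):
        return self.data[(major, minor)]

store = ToyKVStore()
store.put(("Smith", "John"), "phonenumber", "0145 677 3455")
store.put(("Smith", "John"), "DOB", "3-Jan-1989")
# Both records land in the same partition because they share a major key
```

Colocation is what makes multi-record operations on one major key (e.g. reading a person's address, phone number and DOB together) cheap: they never span nodes.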
Oracle NoSQL Database
A distributed, scalable key-value database

Simple Data Model
Key-value pair with major+sub-key paradigm
Read/insert/update/delete operations

Scalability
Dynamic data partitioning and distribution
Optimized data access via intelligent driver

High availability
One or more replicas
Disaster recovery through location of replicas
Resilient to partition master failures
No single point of failure

Transparent load balancing
Reads from master or replicas
Driver is network topology & latency aware

Elastic (planned for Release 2)
Online addition/removal of Storage Nodes
Automatic data redistribution

(Diagram: Applications with NoSQL DB Drivers connecting to Storage Nodes in Data Center A and Data Center B)
Oracle NoSQLDB

Oracle NoSQL DB
Performance
Performance is a function of the number of
storage nodes and the replication factor used
In this example each replication group (3 nodes)
holds 100 million records
The benchmark scaled from 1 replication group to 32 (96 nodes)
Inserting and Querying a NoSQL DB
Simple Java API

// Specify our major key and minor key
majorComponents.add("Smith");
majorComponents.add("John");
minorComponents.add("phonenumber");

// Create the key
Key myKey = Key.createKey(majorComponents, minorComponents);

// Create and serialise the data we want to insert
String data = "408 555 5556";
Value myValue = Value.createValue(data.getBytes());

// Write the data to the database
store.put(myKey, myValue);

// Read it back
ValueVersion returnedData = store.get(myKey);
String phoneNumber = new String(returnedData.getValue());
Oracle and Big Data

Big Data
Divided Solution Spectrum

(Diagram: Acquire → Organise → Analyse.
Low density: Distributed File System / Key Value Store → Map Reduce Solutions → Advanced Analytics.
High density: Relational Database (OLTP) → ETL → Relational Database (Data Warehouse) → Advanced Analytics.)
Big Data
Oracle Big Data Software

(Diagram: Acquire → Organise → Analyse.
Low density: HDFS / Hadoop / Oracle NoSQL DB → Oracle Loader for Hadoop → Oracle11g In-Database Analytics.
High density: Oracle11g → ODI → Oracle11g → Oracle BIEE.)
Big Data
Oracle Engineered Solutions

(Diagram: Acquire → Organise → Analyse.
Low density: Big Data Appliance (HDFS, Hadoop, Oracle NoSQL DB) → Oracle Loader for Hadoop → Exalytics In-Database Analytics.
High density: Exadata (Oracle11g) → ODI → Oracle11g → Oracle BIEE.)
Big Data
Oracle & Big Data
A new generation of products to simplify the
introduction of big data to the enterprise
Big Data Appliance
A powerful pre-built, optimised cluster of machines
648TB per rack
216 cores, 864GB memory
Cloudera Hadoop Distribution
Software
Oracle NoSQL DB (Community Edition)
Oracle Loader for Hadoop
Oracle Data Integrator for Hadoop
Big Data Appliance Software Breakdown
Oracle Big Data Appliance runs Cloudera CDH3
Software is preinstalled and optimised for a balanced configuration
Node roles include: HDFS Data Nodes, Job Tracker, Hive Server, ODI Agent, MySQL Master, Secondary Name Node, Cloudera Cluster Manager, Zoo Keeper, Hadoop Name Node, HBase Master Node
Additional racks use all nodes as HDFS data nodes/NoSQL DB servers
Hardware includes 3 x InfiniBand switches and a KVM switch
Oracle Loader for Hadoop
HDFS export for the Oracle Database
High-performance load of HDFS data into a single
partitioned or non-partitioned table
Support for scalar datatypes
Direct load
Runs as a Hadoop Map Reduce job
Online and offline load modes
(Diagram: HDFS → Oracle11g Database)
Oracle SQL HDFS Connector
SQL Access to HDFS data from Oracle

Makes HDFS files accessible to the Oracle Database
through external table definitions
(Diagram: HDFS Files → External Table → Oracle11g)
Oracle R Connector for Hadoop

Enables the execution of R scripts on huge quantities of data
Provides R API access to Hadoop and Oracle
(Diagram: Oracle R Connector ↔ Hadoop; Oracle R Enterprise ↔ Oracle Database)
ODI & Big Data Appliance
(Diagram: Oracle Big Data Appliance → InfiniBand → Oracle Exadata → InfiniBand → Oracle Exalytics.
Oracle Loader for Hadoop loads; Oracle Data Integrator activates transforms via MapReduce (Hive).
Stream → Acquire → Organize → Analyze & Visualize)
Oracle Data Integrator Enterprise Edition
High Performance, Productivity and Low TCO

(Diagram: Legacy Sources, Application Sources, OLTP DB Sources → Any Data Warehouse, Any Planning System)
E-LT transformation vs. E-T-L
Declarative set-based design
Change Data Capture
Hot-pluggable architecture: pluggable Knowledge Modules
Optimized Data Loading through E-LT
The key to improved performance and reduced costs
Conventional ETL architecture: Extract → Transform (manual scripts) → Load
Next generation architecture (E-LT): Extract → Load → Transform
E-LT provides a flexible architecture for optimized performance
Benefits:
Leverages set-based transformations
No additional network hops
Takes advantage of existing hardware
ODI & Big Data Connectors
Ease of use
Easy-to-use graphical user interface
Reduced development/maintenance time
Reduced management time
Increased reusability
Fast
Uses fast Oracle connectors (typically 5x faster)
No network hop; set-based transformation on the source and/or target
Makes use of existing hardware: the BDA has 216 cores which ODI can use
Packaged with the BDA and Connectors
ODI Designer – Declarative Design
Package describing the end-to-end process
Interface artifact
Model artifact
Topology
Big Data Knowledge Modules
IKM File to Hive (LOAD DATA) : load unstructured data from a file (local file system or HDFS) into Hive
IKM Hive Control Append : transform and validate structured data on Hive
IKM Hive Transform : transform unstructured data on Hive
IKM File/Hive to Oracle (OLH) : load processed data in Hive to Oracle
RKM Hive : reverse engineer Hive tables to generate models
Generate and Automate
Using ODI & Oracle Loader for Hadoop
Use ODI's easy-to-use graphical user interface
Generate Map Reduce and data transformation code (Java, HiveQL, SQL) to run on Hadoop
Invoke Oracle Loader for Hadoop
Use the drag-and-drop interface in ODI to run OLH from an event (time, data, etc.)
Data Lineage and Auditing
Large numbers of data flows in a complex environment: how do you get an overview?
Web-based end-to-end data lineage:
1. Understand your data flows
2. Follow the path of data
3. Drill down to transformations
Processing Big Data with ODI
Summary
Create source and target models for File, DB, Hive
Using the Loading KM, map data from the local file system or HDFS into the corresponding Hive table
Using the KM for semi-structured data, process data in Hive
Using the KM for structured data, process data in Hive
Using the KM for OLH, unload Hive tables to the corresponding Oracle tables
Oracle Data Integrator for Big Data
Putting Together the Unique Advantages
Simplifies creation of Hadoop and MapReduce code to boost productivity
Integrates big data heterogeneously via industry standards: Hadoop, MapReduce, Hive, NoSQL, HDFS
Unifies integration tooling across unstructured/semi-structured and structured data
Optimizes loading of big data to Oracle Exadata using Oracle Big Data Connectors
Engineered for running on, and integrating with, Oracle Big Data Appliance via Big Data Connectors
ODI & Big Data Connector Benefits
Fast : optimised, high-performance data loading between Hadoop and Oracle Database
Easy to use through a graphical user interface
Lower CPU utilisation on the Oracle RDBMS while improving load rates
Optimised connector for Oracle R to analyze raw data on HDFS, leveraging Hadoop
Integrated and tested on the Big Data Appliance with full support
End-to-end data lineage and auditing
Oracle and Big Data

(Diagram: Oracle Big Data Appliance → InfiniBand → Oracle Exadata → InfiniBand → Oracle Exalytics.
Stream → Acquire → Organize → Analyze & Visualize)
