
Copyright 2011, Oracle and/or its affiliates. All rights reserved.

Oracle & Big Data
Dominic Giles
Agenda

Introduction to Big Data


What is big data?
Who's doing it?
Who are our competitors?
Introduction to Hadoop
What makes up Hadoop
An introduction to Map Reduce
Developing a Map Reduce process
Building a cluster

Agenda

Introduction to Pig
Why Pig
Developing in Pig Latin
Introduction to Hive
SQL for Hadoop
Oracle noSQL Database
What is a no*SQL database
Who uses them?
Oracle noSQL Database

Agenda

Oracle and Big Data


What are we doing
Introduction to R
What is R
Why is Oracle using R

Big Data

What is Big Data?

Big Data

It's not always big
200GB+
Big Data

But it can be very big
1PB+
Big Data

Typically Regarded as

Junk Data
It's easier to throw it away than do something useful with it

Web Logs
Sensor Output
Historical Data
Big Data

It's not one type of data:
Unstructured, Structured, Semi-Structured, Documents, XML, Images
Big Data

It's nearly always parallel
Tens to thousands of shared-nothing commodity servers
Big Data

Business as usual
A 2002 study of how customers of Canadian Tire were using the company's credit card found that 2,220 of 100,000 cardholders who used their credit cards in drinking places missed four payments within the next 12 months. By contrast, only 530 of the cardholders who used their credit cards at the dentist missed four payments within the next 12 months.
By CHARLES DUHIGG, published May 2009
One current feature of big data is the difficulty working with it using relational databases and desktop statistics/visualization packages, requiring instead "massively parallel software running on tens, hundreds, or even thousands of servers."

Jacobs, A. (6 July 2009). The Pathologies of Big Data. ACM Queue
Big Data
Who's Doing it?
Who are our competitors?
Everyone who's interested in managing data
Introduction to Hadoop

An open source project to provide a
framework for the distributed processing
of data across clusters of computers

Hadoop
A Comparative Study

A Typical Exadata Cluster

Two Exadata Racks


16 Compute Nodes
200TB Disk (80TB Usable)

Hadoop
A Comparative Study
A Typical Hadoop Cluster

Thirty Racks
540 Compute Nodes
19PB Disk (6PB Usable)
Hadoop
Typical Hadoop Server
Typically 2U servers
Name Node
1GB of memory per million blocks (typically 24-32GB)
2 six-core CPUs
More resilient hardware than Data/Task Tracker nodes
Data/Task Tracker Node
2 six-core CPUs
24GB of memory
4-12 disks
Typically no RAID functionality
Hadoop
The core kernel is made up of two components:
HDFS : a distributed file system
Task Manager : schedules and parallelises jobs
HDFS
A distributed file system
Responsible for distributing files throughout the cluster
Designed for high throughput rather than low latency
Typical files are gigabytes in size
Files are broken down into chunks
HDFS is rack aware
(Diagram: NameNode / JobTracker → Data Nodes)
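The chunking idea above can be sketched in a few lines of Python. This is a toy illustration under assumed defaults of the era (64MB blocks, 3 replicas) and a simple round-robin placement; it is not the real HDFS placement code, which is rack aware.

```python
# Toy sketch of HDFS-style block planning. BLOCK_SIZE and REPLICATION
# mirror commonly cited defaults; round-robin placement is an
# illustrative assumption -- real HDFS places replicas rack-aware.
BLOCK_SIZE = 64 * 1024 * 1024
REPLICATION = 3

def plan_blocks(file_size, data_nodes):
    """Return (block_index, replica_nodes) for each block of a file."""
    n_blocks = -(-file_size // BLOCK_SIZE)  # ceiling division
    plan = []
    for b in range(n_blocks):
        # Pick REPLICATION distinct nodes, rotating through the cluster
        replicas = [data_nodes[(b + r) % len(data_nodes)]
                    for r in range(REPLICATION)]
        plan.append((b, replicas))
    return plan

# A 20GB file breaks into 320 blocks of 64MB
plan = plan_blocks(20 * 1024**3, ["node1", "node2", "node3", "node4"])
```

The NameNode holds exactly this kind of mapping as metadata, which is why it is consulted on every read and write.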
HDFS
Meta Data

The NameNode is responsible for managing the meta data of the file system
It is consulted every time a file needs to be written or read
Currently the NameNode is a single point of failure
(Diagram: Customers.txt (20GB) → NameNode / JobTracker → Data Nodes)
HDFS

The Name Node is arguably Hadoop's weakest link
A Secondary Name Node doesn't provide failover capability, just simple recovery
Fixed in Hadoop release 0.23
(Diagram: Name Node with Secondary Name Node)
HDFS
Accessing HDFS

Access to HDFS is performed via the hadoop utility or via FUSE

$ hadoop fs -ls
Found 2 items
drwxr-xr-x  - oracle supergroup      0 2011-11-23 03:03 /user/oracle/input
drwxr-xr-x  - oracle supergroup      0 2011-12-01 09:24 /user/oracle/output

$ hadoop fs -mkdir domsdata
$ hadoop fs -copyFromLocal Employees.dat domsdata
$ hadoop fs -ls domsdata
Found 1 items
-rw-r--r--  1 oracle supergroup  49398 2011-12-06 18:00 /user/oracle/domsdata/Employees.dat
HDFS

Let's take a quick look
Map Reduce
Processing the data
Map Reduce programs are designed to process large
quantities of data in parallel
Data is processed using a key value pair
Every value has a key
Data is immutable, i.e. changes to the data being
processed are not reflected in the original file
All the developer needs to do is implement map() and
reduce() functions. The framework does the rest
Map Reduce
Map : The first step
Takes an element and converts it into something based
on a key ready for the next step of processing
This might be parsing a comma delimited string
Cleaning the data
Rearranging the elements in a field
performing a simple calculation

A Map Reduce Process
The basic logic

Input:
Key=Row#, Value="smith, john, 25, 12-Jan-1970, 10000, 20"
Key=Row#, Value="jones, david, 25, 21-Aug-1970, 10050, 30"

Map(Key, Value) -- pseudo code
Begin
  Break up the Value by delimiter(",")
  Name = InitCap(Field2 + Field1)
  Department = Field6
  Key = Department
  Value = Name, Field3, Field4, Field5, Field6
  Output(Key, Value)
End

Output:
Key=20, Value="John Smith, 25, 12-Jan-1970, 10000, 20"
Key=30, Value="David Jones, 25, 21-Aug-1970, 10050, 30"
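The Map step of this example can be tried out as plain Python. This is a standalone sketch of the logic only (the function and field names are illustrative, not Hadoop API code), assuming the comma-delimited record layout shown on the slide: surname, forename, age, date of birth, salary, department.

```python
def emp_map(row_num, line):
    """Map(Key, Value): re-key a comma-delimited employee record
    (surname, forename, age, dob, salary, dept) by its department."""
    fields = [f.strip() for f in line.split(",")]
    surname, forename, age, dob, salary, dept = fields
    name = f"{forename.title()} {surname.title()}"   # InitCap(Field2 + Field1)
    # New key is the department; value carries the remaining fields
    return dept, f"{name}, {age}, {dob}, {salary}, {dept}"

key, value = emp_map(1, "smith, john, 25, 12-Jan-1970, 10000, 20")
# key == "20", value == "John Smith, 25, 12-Jan-1970, 10000, 20"
```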
Map Reduce
Map/Partitioning phase
It's also possible to reduce unnecessary network exchanges by implementing an
optional combiner step before the reduce process
(Diagram: on each node, Map Process → optional Combiner Process → Partition Process)
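As a concrete illustration of why a combiner helps, the word-count-style sketch below (an assumed example, not taken from the deck) pre-aggregates map output on one node, so fewer (key, value) pairs cross the network to the reducers.

```python
from collections import defaultdict

def combine(mapped_pairs):
    """Optional combiner: partially aggregate (key, count) pairs
    produced by the map step on this node before partitioning."""
    partial = defaultdict(int)
    for key, count in mapped_pairs:
        partial[key] += count
    return sorted(partial.items())

# Six pairs from the mapper shrink to three before leaving the node
combined = combine([("a", 1), ("b", 1), ("a", 1), ("c", 1), ("a", 1), ("b", 1)])
# combined == [('a', 3), ('b', 2), ('c', 1)]
```

The combiner must be an operation that can be applied repeatedly without changing the final answer (here, addition), since the framework may run it zero or more times.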
Map Reduce
Reduce : The last step
Takes an iterator (list) of values with the same key and
reduces it by operations such as
Aggregating
Filtering
Sampling
The reduce process in turn writes its results back to HDFS
Unlike some other Map Reduce paradigms, Hadoop does
not insist that the Reduce process returns a single
element
A Reduce Process
The basic logic

Input:
Key=30, Value="David Jones, 25, 21-Aug-1970, 10050, 30"
Key=30, Value="John Bracken, 25, 15-Dec-1955, 15050, 30"
Key=30, Value="Peter Thompson, 25, 06-Jun-1945, 800, 30"

Reduce(Key, Value List) -- pseudo code
Begin
  For every Value in the Value List
    Break up the Value by delimiter(",")
    Name = Field1
    Salary = Field4
    If Salary > MaxSalary then MaxValue = this Value
  End Loop
  Value = MaxValue.Name, MaxValue.Salary
  Output(Key, Value)
End

Output:
Key=30, Value="John Bracken, 15050"
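The Reduce logic of this example, sketched as standalone Python (illustrative names and field positions, not Hadoop API code): given every value that shares one key, it keeps the record with the highest salary.

```python
def emp_reduce(key, values):
    """Reduce(Key, ValueList): emit the name and salary of the
    highest-paid employee among the values for this key."""
    best_name, best_salary = None, None
    for value in values:
        fields = [f.strip() for f in value.split(",")]
        name, salary = fields[0], int(fields[3])   # salary is the 4th field
        if best_salary is None or salary > best_salary:
            best_name, best_salary = name, salary
    return key, f"{best_name}, {best_salary}"

out = emp_reduce("30", [
    "David Jones, 25, 21-Aug-1970, 10050, 30",
    "John Bracken, 25, 15-Dec-1955, 15050, 30",
    "Peter Thompson, 25, 06-Jun-1945, 800, 30",
])
# out == ("30", "John Bracken, 15050")
```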
Map Reduce
Putting it all together
Map and Reduce processes run concurrently
Data is sent to a reduce process based on its key
Reduce processes don't communicate with one another
If a process fails it is restarted on another node
(Diagram: HDFS /users/hr/input → Map Process → Partition Process / Sort → Reduce Process → HDFS /users/hr/output, across Node1, Node2 and Node3)
Map Reduce
Chaining Jobs

Collections of Map Reduce processes are chained together
Output from one process acts as the input to another Map Reduce process
Workflows and coordination are implemented in code or via frameworks like ZooKeeper
(Diagram: Clean and sort emails → Generate text index → Analyse emails)
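Chaining can be pictured as feeding one job's output records into the next. A minimal sketch, where each stage stands in for a full Map Reduce job (the email-processing stages are illustrative placeholders, not real jobs):

```python
def chain(records, *stages):
    """Run each stage on the previous stage's output, like chained
    Map Reduce jobs writing to and reading back from HDFS."""
    for stage in stages:
        records = stage(records)
    return records

clean = lambda emails: [e.strip().lower() for e in emails]   # "clean and sort emails"
index = lambda emails: sorted(set(emails))                   # "generate text index"

result = chain(["  Bob@x.COM ", "ann@y.com", "bob@x.com "], clean, index)
# result == ['ann@y.com', 'bob@x.com']
```

In a real cluster each intermediate result lands in HDFS, and a coordinator (hand-written code or a framework) decides when the next job may start.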
Map Reduce
Jobs and Tasks
A Job is
All classes and jar files needed to run a map reduce program
Jobs can be submitted from the command line
$> hadoop jar minimumsalary.jar /users/hr/input /users/hr/output

A Task is
The process responsible for executing the individual map reduce
steps. They are executed on nodes selected by the Job Tracker

Map Reduce

Let's take a quick look
Map Reduce
Monitoring the Cluster
Via the built-in web interfaces, or via more sophisticated tools such as Cloudera Manager
Configuring a Hadoop Cluster
Quick Configuration Overview
Core infrastructure is roughly a 50MB download
All the files that need to be modified are located in
the $HADOOP_HOME/conf directory
core-site.xml : locations, ports and global properties
hdfs-site.xml : HDFS parameters, e.g. replication level
mapred-site.xml : Map Reduce parameters, e.g. the job tracker location
slaves : list of data nodes/task trackers
Ensure ssh is configured to enable remote startup and
shutdown of daemons
Hadoop
Miscellaneous
Adding new nodes is a relatively painless task
Adding new data nodes does not trigger a rebalance of storage
A rebalance can be invoked via a script or by temporarily changing the level of
replication (not always a good idea)
Update $HADOOP_HOME/conf/slaves and then start the daemons
$> bin/hadoop datanode

Removing nodes simply requires them to be excluded
and their data allowed to migrate to the remaining nodes
Hadoop
The Family
The Hadoop family consists of a collection of utilities/applications:
Pig : Data Analysis
Hive : SQL
Sqoop : Database import/export
Zoo Keeper : Coordination
Avro : Serialisation
Map Reduce
HBase : Columnar Database
HDFS : Distributed File System
And a couple that aren't quite so interesting to our Oracle stack
Introduction to PIG

PIG
A higher level language
Designed to simplify the analysis of large data sets
Simpler than developing Java Map Reduce processes
Developers code in Pig Latin
Faster to develop in and easier to understand
Its structure allows the system to implement
optimisations whilst allowing the developer to work on
semantics
Extensible

PIG
Pig Latin
Similar to many popular scripting languages e.g. Python,
Ruby
10 lines of Pig Latin is typically equivalent to several
hundred lines of Java
Provides common operations like join, group, sort, filter
Typically used for rapid prototyping, ad-hoc queries, web
log processing
PIG
Running Pig
Grunt : A simple Shell
Submitting scripts directly to the server
Java interface
Via an IDE
Eclipse has a plugin that allows for textual and graphical coding

PIG
Some simple examples
Log into the shell (Local Mode)
$> pig -x local

Load data from the file system and display it

A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);
B = FOREACH A GENERATE name;
DUMP B;
(John)
(Mary)
(Bill)
(Joe)
PIG
Some simple examples
Load some data delimited by commas, order it by telephone number (ascending), and store it to the file system

A = LOAD 'PhoneNumber.dat' USING PigStorage(',') AS (name:chararray, telephone:chararray);
B = FOREACH A GENERATE telephone, name;
X = ORDER B BY telephone ASC;
STORE X into 'justnames';

Group the data by the first three digits, count the area codes, and only print out the tuples with more than 3 entries

C = GROUP A by SUBSTRING(telephone,0,3);
D = FOREACH C GENERATE $0 as areacodeid, COUNT($1) as cnt;
E = FILTER D BY cnt > 3;
DUMP E;
PIG

Pig in Action
Introduction to Hive

Hive
Batch ETL for Hadoop
The Hive Query Language is similar in its constructs to SQL
SQL is common throughout industry, providing an easy
migration of skills to the Hadoop environment
Hive is not an ad hoc query environment. Even simple
queries can take many minutes to complete
Creation of Map Reduce jobs can be time consuming
Hive QL provides abstraction from the complexity of the
Hadoop cluster
The Hive compiler turns SQL into map reduce operations
Hive
Hive
Hadoop is not a No SQL datastore
Over 99% of the work run against Facebook's Hadoop
clusters are Hive jobs
Only a small proportion is hand-coded map reduce operations
Hive is designed to make it simpler to process huge data
sets
The output of Hive is often loaded into a standard
relational database for faster ad hoc analysis
Hive
Architecture
(Architecture diagram: User Space – Hive CLI, JDBC, ODBC, user-defined Map Reduce scripts, UDF/UDAF;
Hive QL – Parser, Planner, Optimiser;
MetaStore – typically mySQL;
Execution – Map Reduce;
SerDe File Formats – Text File, Sequence File, RCFile;
HDFS)
Hive
Basic DDL
Tables
Create Table
Partitions
Buckets (Maps to files)
Add/Remove columns
Add/Remove partitions
Views
Index

Hive
Basic Datatypes
Integers : tinyint, smallint, int, bigint
Booleans
Floating point : float, double
Strings
Complex Types : Structs, Maps, Arrays
Note : no Date or timestamp datatypes. These need to be held as strings
Hive
SQL
Select statements
Group by, order, Equi Join, Outer join, Sub Queries
Partition pruning
Insert Statements
Multi Insert
No Updates or Deletes
Data is immutable. It can only be added to
Tables and partitions can be dropped

Hive
SQL Example

SELECT r.*, s.*
FROM r JOIN (
       SELECT key, count(1) AS count
       FROM s
       GROUP BY key) s
  ON r.key = s.key
WHERE s.count > 100
Hive

Hive in Action
Introduction to Sqoop

Sqoop
Database Import/Export to HDFS
Provides a heterogeneous means of importing and
exporting data to/from relational databases
Access is typically via JDBC
Access to Oracle, mySQL, SQL Server etc.
Third-party suppliers, e.g. Quest, provide higher-performance
alternatives to the defaults
Growing in scope and capabilities to become a fully
fledged ETL solution
Sqoop
Example

$ sqoop import \
  --query 'SELECT a.*, b.* FROM a JOIN b on (a.id == b.id) WHERE $CONDITIONS' \
  --split-by a.id --target-dir /user/foo/joinresults

$ sqoop export --connect jdbc:mysql://db.example.com/foo --table bar \
  --export-dir /results/bar_data
Hadoop
Known Limitations
Query latency
Hadoop is batch orientated
Meta data management
Developers have to know the location of data and its datatypes
Its model is insert/append, not update
It's not possible to use it as a transactional store
HBase can be used to provide random read/write capability
Summary
Hadoop
Massively distributed data processing framework
Allows the developer to concentrate on the process
and not its parallelisation
Growing in popularity with enterprise customers
NoSQL a brief history

Early 2000s: Web 2.0 companies started looking for RDBMS alternatives
2003: memcached (a cached k-v store to reduce load on RDBMS)
2004: Google published MapReduce distributed processing paper
2006: Google published BigTable distributed database paper
2007: Amazon published Dynamo paper
2008+: Several open source projects are launched to productize NoSQL solutions
2010+: Enterprises start to investigate NoSQL solutions
What is NoSQL?

Not-only-SQL (2009)
Broad class of non-relational DBMS systems that typically:
Provide horizontal/distributed scalability
Avoid joins
Have relaxed consistency guarantees
Don't require a structured schema
Are application/developer-centric
No standards
Rapidly evolving set of solutions (122+ on nosql-database.org)
Highly variable feature set
UnQL launched in July
Majority are open source
Key-Value Store Workloads
Data Capture

Write data as fast as you can
Minimal indexing
No referential integrity
Relaxed durability guarantees (lower value data)
Scale write throughput via data distribution
Optimize write throughput per storage node (master, append-only log file)
Asynchronous replication
Bulk operation support is useful for some applications
Workload can be steady and/or bursty
Throughput more important than latency
Primary NoSQL use cases

Social Networks (LinkedIn, Facebook, Digg, Google+, etc.)
Personalization (Amazon, eBay, Yahoo, etc.)
Web-centric services (Apple, Cisco, AT&T, HP, Motorola, Nokia, Pros)
Customer service
Device tracking
Airline pricing
Intelligence community
Financial services (JP Morgan, Wells Fargo)
Fraud detection
Document search (Thomson Reuters, ExLibris)
Scientific research
Geophysical (Halliburton)
Biomedical
Who are the primary players?
NoSQL Databases
Key-value : Oracle NoSQL DB*, Voldemort*, Tokyo Cabinet, Redis, Riak, CitrusLeaf, GenieDB*, Amazon Dynamo*, Google LevelDB
Columnar : Cassandra, HBase, HyperTable, Google BigTable
Document : MongoDB, CouchDB, RavenDB
Graph : OrientDB, GraphDB

(*) Built on top of Berkeley DB
Key Value Example
Key             Maps to                  Value (Opaque)
Smith John      Node 1, Partition 1034   12 The Crescent, Oxford, Oxfordshire : 0145 677 3455 : 3-Jan-1989
Smith Sue       Node 12, Partition 675   Flat 4, 191 Easy Street, Reading, Berkshire : 1604 444 4678 : 21-Apr-1978
Jones David     Node 3, Partition 3002   47 Alpine Grove, Reading, Berkshire : 1604 336 7890 : 20-Feb-1979
Wright Michael  Node 1, Partition 56     The Cider House, Main Street, Dorchester, West Dorset : 01203 393 4443 : 06-Aug-1962

(Values are stored as opaque byte strings)
Key Value Example
Major Key       Minor Key     Maps to                  Value (Opaque)
Smith John      Address       Node 1, Partition 1034   12 The Crescent, Oxford, Oxfordshire
Smith John      Phone Number  Node 1, Partition 1034   0145 677 3455
Smith John      DOB           Node 1, Partition 1034   3-Jan-1989
Smith Sue       Address       Node 12, Partition 675   Flat 4, 191 Easy Street, Reading, Berkshire
Smith Sue       Phone Number  Node 12, Partition 675   1604 444 4678
Smith Sue       DOB           Node 12, Partition 675   21-Apr-1978
Jones David     Address       Node 3, Partition 3002   47 Alpine Grove, Reading, Berkshire
Jones David     Phone Number  Node 3, Partition 3002   1604 336 7890
Jones David     DOB           Node 3, Partition 3002   20-Feb-1979
Wright Michael  Address       Node 1, Partition 56     The Cider House, Main Street, Dorchester, West Dorset
Wright Michael  Phone Number  Node 1, Partition 56     01203 393 4443
Wright Michael  DOB           Node 1, Partition 56     06-Aug-1962

(All minor keys under the same major key map to the same node and partition; values are stored as opaque byte strings)
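The major/minor key scheme can be mimicked with a plain Python dict. This is a toy model of the idea only (hash the major key to pick a partition, so all of a major key's minor keys colocate); it is not the Oracle NoSQL Database API, and the class and method names are illustrative.

```python
class ToyKVStore:
    """Toy major/minor key-value store: every record sharing a major
    key hashes to the same partition, so related minor keys colocate."""
    def __init__(self, n_partitions=1024):
        self.n_partitions = n_partitions
        self.data = {}

    def partition(self, major):
        # Only the major key determines placement
        return hash(major) % self.n_partitions

    def put(self, major, minor, value):
        self.data[(major, minor)] = value

    def get(self, major, minor):
        return self.data[(major, minor)]

store = ToyKVStore()
store.put(("Smith", "John"), "phonenumber", "0145 677 3455")
store.put(("Smith", "John"), "DOB", "3-Jan-1989")
# Both records land in the same partition because they share a major key
```

Colocation is what makes multi-record operations on one major key (e.g. reading a person's address, phone number and DOB together) cheap: they never span nodes.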
Oracle NoSQL Database
A distributed, scalable key-value database

Simple Data Model
Key-value pair with major+sub-key paradigm
Read/insert/update/delete operations

Scalability
Dynamic data partitioning and distribution
Optimized data access via intelligent driver

High availability
One or more replicas
Disaster recovery through location of replicas
Resilient to partition master failures
No single point of failure

Transparent load balancing
Reads from master or replicas
Driver is network topology & latency aware

Elastic (planned for Release 2)
Online addition/removal of Storage Nodes
Automatic data redistribution

(Diagram: Applications with NoSQL DB Drivers connecting to Storage Nodes in Data Center A and Data Center B)
Oracle NoSQLDB

Oracle NoSQL DB
Performance
Performance is a function of the number of
storage nodes and the replication factor used
In this example each replication group (3 nodes)
holds 100 million records
The benchmark scaled from 1 replication group to 32 (96 nodes)
Inserting and Querying a NoSQL DB
Simple Java API

// Specify our major key and minor key
majorComponents.add("Smith");
majorComponents.add("John");
minorComponents.add("phonenumber");

// Create the key
Key myKey = Key.createKey(majorComponents, minorComponents);

// Create and serialise the data we want to insert
String data = "408 555 5556";
Value myValue = Value.createValue(data.getBytes());

// Write the data to the database
store.put(myKey, myValue);

// Read it back
ValueVersion returnedData = store.get(myKey);
String phoneNumber = new String(returnedData.getValue());
Oracle and Big Data

Big Data
Divided Solution Spectrum

(Diagram: Acquire → Organise → Analyse.
Low density: Distributed File System / Key Value Store → Map Reduce Solutions → Advanced Analytics.
High density: Relational Database (OLTP) → ETL → Relational Database (Data Warehouse) → Advanced Analytics.)
Big Data
Oracle Big Data Software

(Diagram: Acquire → Organise → Analyse.
Low density: HDFS / Hadoop / Oracle NoSQL DB → Oracle Loader for Hadoop → Oracle11g In-Database Analytics.
High density: Oracle11g → ODI → Oracle11g → Oracle BIEE.)
Big Data
Oracle Engineered Solutions

(Diagram: Acquire → Organise → Analyse.
Low density: Big Data Appliance (HDFS, Hadoop, Oracle NoSQL DB) → Oracle Loader for Hadoop → Exalytics In-Database Analytics.
High density: Exadata (Oracle11g) → ODI → Oracle11g → Oracle BIEE.)
Big Data
Oracle & Big Data
A new generation of products to simplify the
introduction of big data to the enterprise
Big Data Appliance
A powerful pre-built, optimised cluster of machines
648TB per rack
216 cores, 864GB memory
Cloudera Hadoop Distribution
Software
Oracle NoSQL DB (Community Edition)
Oracle Loader for Hadoop
Oracle Data Integrator for Hadoop
Big Data Appliance Software Breakdown
Oracle Big Data Appliance runs Cloudera CDH3
Software is preinstalled and optimised for a balanced configuration
Node roles include: HDFS Data Nodes, Job Tracker, Hive Server, ODI Agent, MySQL Master, Secondary Name Node, Cloudera Cluster Manager, Zoo Keeper, Hadoop Name Node, HBase Master Node
Additional racks use all nodes as HDFS data nodes/NoSQL DB servers
Hardware includes 3 x InfiniBand switches and a KVM switch
Oracle Loader for Hadoop
HDFS export for the Oracle Database
High-performance load of HDFS data into a single
partitioned or non-partitioned table
Support for scalar datatypes
Direct load
Runs as a Hadoop Map Reduce job
Online and offline load modes
(Diagram: HDFS → Oracle11g Database)
Oracle SQL HDFS Connector
SQL Access to HDFS data from Oracle

Makes HDFS files accessible to the Oracle Database
through external table definitions
(Diagram: HDFS Files → External Table → Oracle11g)
Oracle R Connector for Hadoop

Enables the execution of R scripts on huge quantities of data
Provides R API access to Hadoop and Oracle
(Diagram: Oracle R Connector ↔ Hadoop; Oracle R Enterprise ↔ Oracle Database)
ODI & Big Data Appliance
(Diagram: Oracle Big Data Appliance → InfiniBand → Oracle Exadata → InfiniBand → Oracle Exalytics.
Oracle Loader for Hadoop loads; Oracle Data Integrator activates transforms via MapReduce (Hive).
Stream → Acquire → Organize → Analyze & Visualize)
Oracle Data Integrator Enterprise Edition
High Performance, Productivity and Low TCO

(Diagram: Legacy Sources, Application Sources, OLTP DB Sources → Any Data Warehouse, Any Planning System)
E-LT transformation vs. E-T-L
Declarative set-based design
Change Data Capture
Hot-pluggable architecture: pluggable Knowledge Modules
Optimized Data Loading through E-LT
The key to improved performance and reduced costs
Conventional ETL architecture: Extract → Transform (manual scripts) → Load
Next generation architecture (E-LT): Extract → Load → Transform
E-LT provides a flexible architecture for optimized performance
Benefits:
Leverages set-based transformations
No additional network hops
Takes advantage of existing hardware
ODI & Big Data Connectors
Ease of use
Easy-to-use graphical user interface
Reduced development/maintenance time
Reduced management time
Increased reusability
Fast
Uses fast Oracle connectors (typically 5x faster)
No network hop; set-based transformation on the source and/or target
Makes use of existing hardware: the BDA has 216 cores which ODI can use
Packaged with the BDA and Connectors
ODI Designer – Declarative Design
Package describing the end-to-end process
Interface artifact
Model artifact
Topology
Big Data Knowledge Modules
IKM File to Hive (LOAD DATA) : load unstructured data from a file (local file system or HDFS) into Hive
IKM Hive Control Append : transform and validate structured data on Hive
IKM Hive Transform : transform unstructured data on Hive
IKM File/Hive to Oracle (OLH) : load processed data in Hive to Oracle
RKM Hive : reverse engineer Hive tables to generate models
Generate and Automate
Using ODI & Oracle Loader for Hadoop
Use ODI's easy-to-use graphical user interface
Generate Map Reduce and data transformation code (Java, HiveQL, SQL) to run on Hadoop
Invoke Oracle Loader for Hadoop
Use the drag-and-drop interface in ODI to run OLH from an event (time, data, etc.)
Data Lineage and Auditing
Large numbers of data flows in a complex environment: how do you get an overview?
Web-based end-to-end data lineage:
1. Understand your data flows
2. Follow the path of data
3. Drill down to transformations
Processing Big Data with ODI
Summary
Create source and target models for File, DB, Hive
Using the Loading KM, map data from the local file system or HDFS into the corresponding Hive table
Using the KM for semi-structured data, process data in Hive
Using the KM for structured data, process data in Hive
Using the KM for OLH, unload Hive tables to the corresponding Oracle tables
Oracle Data Integrator for Big Data
Putting Together the Unique Advantages
Simplifies creation of Hadoop and MapReduce code to boost productivity
Integrates big data heterogeneously via industry standards: Hadoop, MapReduce, Hive, NoSQL, HDFS
Unifies integration tooling across unstructured/semi-structured and structured data
Optimizes loading of big data to Oracle Exadata using Oracle Big Data Connectors
Engineered for running on, and integrating with, Oracle Big Data Appliance via Big Data Connectors
ODI & Big Data Connector Benefits
Fast : optimised, high-performance data loading between Hadoop and Oracle Database
Easy to use through a graphical user interface
Lower CPU utilisation on the Oracle RDBMS while improving load rates
Optimised connector for Oracle R to analyze raw data on HDFS, leveraging Hadoop
Integrated and tested on the Big Data Appliance with full support
End-to-end data lineage and auditing
Oracle and Big Data

(Diagram: Oracle Big Data Appliance → InfiniBand → Oracle Exadata → InfiniBand → Oracle Exalytics.
Stream → Acquire → Organize → Analyze & Visualize)
