Cloud Computing Era (Practice)
Phoenix Liau
Trend Micro
Three Major Trends to Change the World
Cloud Computing
Big Data
Mobile

NIST definition of cloud computing:
Essential Characteristics
Service Models
Deployment Models
Delivered as a service (as-a-service) over the Internet
Scalable and elastic IT
It's About the Ecosystem

Cloud Computing (SaaS, PaaS, IaaS) generates Big Data
Big Data (structured and semi-structured data, Enterprise Data Warehouse) leads to Business Insights
Business Insights create Competition, Innovation, Productivity
What is Big Data?
A set of files
A database
A single file
What is the problem?

Getting the data to the processors becomes the bottleneck
Quick calculation
Typical disk data transfer rate: 75 MB/sec
Time taken to transfer 100 GB of data to the processor: approx. 22 minutes!
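The quick calculation can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: it assumes 1 GB = 1000 MB, and the 100-disk parallel case is a hypothetical added for comparison, not a figure from the slides.

```python
# Figures quoted on the slide.
DISK_RATE_MB_PER_SEC = 75
DATA_SIZE_GB = 100


def transfer_minutes(size_gb, rate_mb_per_sec, mb_per_gb=1000):
    """Minutes needed to stream size_gb through one disk at the given rate."""
    seconds = size_gb * mb_per_gb / rate_mb_per_sec
    return seconds / 60


single_disk = transfer_minutes(DATA_SIZE_GB, DISK_RATE_MB_PER_SEC)
print(f"one disk: {single_disk:.1f} minutes")

# Hypothetical comparison: the same data spread evenly over 100 disks
# that are read in parallel.
parallel_seconds = single_disk * 60 / 100
print(f"100 disks in parallel: {parallel_seconds:.1f} seconds")
```

Reading from many disks at once is the reason a distributed file system moves the computation to where the data already sits.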
The Era of Big Data: Are You Ready?

Businesses are driving the growth of big data. Storing huge volumes of data capably, managing them efficiently, and turning them into business value are major challenges for enterprises.
Overwhelming quantities of big data will challenge enterprise storage infrastructure and data center architecture, causing chain reactions in database storage, data mining, business intelligence, cloud computing, and computing applications.
Data for business commercial analysis

2011: multi-terabyte (TB)
2020: 35.2 ZB (1 ZB = 1 billion TB)
Who Needs It?

Enterprise Database: when to use?
Ad-hoc Reporting (<1 sec)
Multi-step Transactions
Lots of Inserts/Updates/Deletes

Hadoop: when to use?
Affordable Storage/Compute
Unstructured or Semi-structured
Resilient Auto Scalability
Hadoop!
Apache Hadoop project
Inspired by Google's MapReduce and Google File System papers.
Open source, flexible, and available architecture for large-scale computation and data processing on a network of commodity hardware
Open Source Software + Commodity Hardware
IT Cost Reduction
Hadoop Core
MapReduce
HDFS
2011 Cloudera, Inc. All Rights Reserved.
HDFS
Hadoop Distributed File System
Redundancy
Fault Tolerant
Scalable
Self Healing
Write Once, Read Many Times
Java API
Command Line Tool
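The redundancy and self-healing properties listed above can be illustrated with a toy model. This is a sketch only, not HDFS code: the 64 MB block size, the replication factor of 3, the round-robin placement, and the node names are all assumptions made for illustration.

```python
from itertools import cycle

BLOCK_SIZE_MB = 64   # assumed block size
REPLICATION = 3      # assumed replication factor


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the sizes of the blocks a file is cut into."""
    full, rest = divmod(file_size_mb, block_size_mb)
    return [block_size_mb] * full + ([rest] if rest else [])


def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin."""
    ring = cycle(nodes)
    return {b: {next(ring) for _ in range(replication)} for b in range(num_blocks)}


def heal(placement, dead_node, nodes, replication=REPLICATION):
    """Drop a failed node and copy under-replicated blocks to live nodes."""
    live = [n for n in nodes if n != dead_node]
    for holders in placement.values():
        holders.discard(dead_node)
        for candidate in live:
            if len(holders) >= replication:
                break
            holders.add(candidate)
    return placement


nodes = ["node1", "node2", "node3", "node4", "node5"]  # hypothetical cluster
blocks = split_into_blocks(200)                        # a 200 MB file
placement = place_replicas(len(blocks), nodes)
heal(placement, "node2", nodes)                        # simulate losing node2
print(blocks)
print(placement)
```

After the simulated failure every block is back to three replicas on the surviving nodes, which is the behavior the "Self Healing" bullet refers to.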
MapReduce
Two Phases of Functional Programming
Redundancy
Fault Tolerant
Scalable
Self Healing
Java API
Hadoop Core
Java -> MapReduce
Java -> HDFS
Word Count Example
Map input (Key: offset, Value: line)
0: The cat sat on the mat
22: The aardvark sat on the sofa
Map output (Key: word, Value: count)
Reduce output (Key: word, Value: sum of count)
The Hadoop Ecosystem
The Ecosystem is the System

Hadoop has become the kernel of the distributed operating system for Big Data
No one uses the kernel alone
A collection of projects at Apache
Relation Map
Hue (Web Console)
Mahout (Data Mining)
Oozie (Job Workflow & Scheduling)
Zookeeper (Coordination)
Sqoop/Flume (Data integration)
Pig/Hive (Analytical Language)
MapReduce Runtime (Dist. Programming Framework)
Hbase (Column NoSQL DB)
Hadoop Distributed File System (HDFS)
Zookeeper Coordination Framework

What is ZooKeeper?
A centralized service for maintaining
Configuration information
Providing distributed synchronization
A set of tools to build distributed applications that can safely handle partial failures
ZooKeeper was designed to store coordination data
Status information
Configuration
Location information
Flume / Sqoop Data Integration Framework

What's the problem with data collection?

Data collection is currently a priori and ad hoc
A priori: decide what you want to collect ahead of time
Ad hoc: each kind of data source goes through its own collection path
What is Flume (and how can it help?)
A distributed data collection service

Efficiently collects, aggregates, and moves large amounts of data
Fault tolerant, with many failover and recovery mechanisms
One-stop solution for data collection of all formats
Flume: High-Level Overview

Logical Node
Source
Sink
Flume Architecture
Log, Log, ... -> Flume Node, Flume Node -> HDFS
Flume Sources and Sinks

Local Files
HDFS
Stdin, Stdout
Twitter
IRC
IMAP
Sqoop
Easy, parallel database import/export
What do you want to do?
Import data from an RDBMS into HDFS
Export data from HDFS back into an RDBMS
RDBMS <-> Sqoop <-> HDFS
Sqoop Examples
$ sqoop import --connect jdbc:mysql://localhost/world --username root --table City
...
$ hadoop fs -cat City/part-m-00000
1,Kabul,AFG,Kabol,1780000
2,Qandahar,AFG,Qandahar,237500
3,Herat,AFG,Herat,186800
4,Mazar-e-Sharif,AFG,Balkh,127800
5,Amsterdam,NLD,Noord-Holland,731200
...
Pig / Hive Analytical Language

Why Hive and Pig?

Although MapReduce is very powerful, it can also be complex to master
Many organizations have business or data analysts who are skilled at writing SQL queries, but not at writing Java code
Many organizations have programmers who are skilled at writing code in scripting languages
Hive and Pig are two projects which evolved separately to help such people analyze huge amounts of data via MapReduce
Hive was initially developed at Facebook, Pig at Yahoo!
Hive
Developed by Facebook
What is Hive?
An SQL-like interface to Hadoop
Data warehouse infrastructure that provides data summarization and ad hoc querying on top of Hadoop
MapReduce for execution
HDFS for storage
Hive Query Language

Basic SQL: Select, From, Join, Group-By
Equi-Join, Multi-Table Insert, Multi-Group-By
Batch query
SELECT storeid, COUNT(*) FROM purchases WHERE price > 100 GROUP BY storeid
SQL -> Hive -> MapReduce
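The batch-query style shown for Hive can be tried locally with any SQL engine. The sketch below uses Python's built-in sqlite3 module as a stand-in for Hive (an assumption made so the example runs without a cluster). The table contents are hypothetical, and the query is written in its aggregate form, selecting the grouping column and a count, because a grouped query returns one row per group.

```python
import sqlite3

# In-memory database standing in for a Hive table (illustration only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (storeid INTEGER, price REAL)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?)",
    [(1, 250.0), (1, 80.0), (2, 120.0), (2, 300.0), (3, 40.0)],  # hypothetical rows
)

# Count the purchases above 100 for each store.
rows = conn.execute(
    "SELECT storeid, COUNT(*) FROM purchases "
    "WHERE price > 100 GROUP BY storeid ORDER BY storeid"
).fetchall()
print(rows)  # [(1, 1), (2, 2)]
```

On Hadoop, Hive would compile a query of this shape into MapReduce jobs: the WHERE filter runs in the map phase and the per-store count in the reduce phase.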
Pig
Initiated by Yahoo!
A high-level scripting language (Pig Latin)

Process data one step at a time
Simple way to write MapReduce programs
Easy to understand
Easy to debug
A = load 'a.txt' as (id, name, age, ...);
B = load 'b.txt' as (id, address, ...);
C = JOIN A BY id, B BY id; STORE C into 'c.txt'
Script -> Pig -> MapReduce
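What the JOIN step in the Pig script computes can be sketched in plain Python. This is not how Pig executes it (Pig turns the script into MapReduce jobs); the sample rows and the helper name `join_by_id` are hypothetical.

```python
# Hypothetical contents of a.txt (id, name, age) and b.txt (id, address).
A = [(1, "Ann", 31), (2, "Bob", 45), (3, "Cho", 27)]
B = [(1, "12 Main St"), (3, "9 Elm Rd"), (4, "5 Oak Ave")]


def join_by_id(left, right):
    """Inner join on the first field, keeping every field of both rows."""
    index = {}
    for row in right:
        index.setdefault(row[0], []).append(row)
    return [l + r for l in left for r in index.get(l[0], [])]


C = join_by_id(A, B)
for row in C:
    print(row)
```

Only ids present in both inputs survive, and each output row carries the fields of both relations, including both copies of the id.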
Hive vs. Pig

 | Hive | Pig
Language | HiveQL (SQL-like) | Pig Latin, a scripting language
Schema | Table definitions that are stored in a metastore | A schema is optionally defined at runtime
Programmatic Access | JDBC, ODBC | PigServer
WordCount Example
Input:
Hello World Bye World
Hello Hadoop Goodbye Hadoop
For the given sample input the map emits:
<Hello, 1>
<World, 1>
<Bye, 1>
<World, 1>
<Hello, 1>
<Hadoop, 1>
<Goodbye, 1>
<Hadoop, 1>
The reduce just sums up the values:
<Bye, 1>
<Goodbye, 1>
<Hadoop, 2>
<Hello, 2>
<World, 2>
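The map, shuffle, and reduce steps behind this example can be simulated in a few lines of single-process Python. This is a sketch of the data flow only, not Hadoop code; the function names are hypothetical, and the sorting in `shuffle` mimics the way the framework presents keys to the reducer in order.

```python
from collections import defaultdict


def map_phase(offset, line):
    """Map: (offset, line) -> one (word, 1) pair per word."""
    return [(word, 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key and sort the keys, as happens between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(word, counts):
    """Reduce: (word, [1, 1, ...]) -> (word, sum)."""
    return (word, sum(counts))


lines = ["Hello World Bye World", "Hello Hadoop Goodbye Hadoop"]

mapped = []
offset = 0
for line in lines:
    mapped.extend(map_phase(offset, line))
    offset += len(line) + 1  # byte offset of the next line

result = [reduce_phase(word, counts) for word, counts in shuffle(mapped).items()]
print(result)
# [('Bye', 1), ('Goodbye', 1), ('Hadoop', 2), ('Hello', 2), ('World', 2)]
```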
WordCount Example In MapReduce

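import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

    // Map: emit (word, 1) for every token in the input line.
    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reduce: sum the counts collected for each word.
    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "wordcount");
        job.setJarByClass(WordCount.class); // lets the cluster locate the job's classes
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
        job.waitForCompletion(true);
    }
}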
WordCount Example By Pig

A = LOAD 'wordcount/input' USING PigStorage as (token:chararray);
B = GROUP A BY token;
C = FOREACH B GENERATE group, COUNT(A) as count;
DUMP C;
WordCount Example By Hive

CREATE TABLE wordcount (token STRING);
LOAD DATA LOCAL INPATH 'wordcount/input'
OVERWRITE INTO TABLE wordcount;
SELECT token, count(*) FROM wordcount GROUP BY token;
The Story So Far

SQL -> Hive -> MapReduce
Script -> Pig -> MapReduce
Java -> MapReduce
Java -> HDFS
RDBMS (SQL) -> Sqoop -> HDFS
FS (Posix) -> Flume -> HDFS
Hbase Column NoSQL DB

Structured-data vs Raw-data
Inspired by Google's Bigtable
Coordinated by Zookeeper
Low Latency
Random Reads And Writes
Distributed Key/Value Store
Simple API
PUT
GET
DELETE
SCAN
Hbase Data Model

Cells are versioned
Table rows are sorted by row key
Region: a row range [start-key:end-key]
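The three properties above (versioned cells, rows sorted by key, regions as key ranges) can be modelled with a small in-memory class. This is a toy sketch, not the HBase client API: the class `ToyTable`, its method signatures, and the explicit integer timestamps are assumptions for illustration.

```python
import bisect


class ToyTable:
    """Rows kept sorted by key; each cell keeps several timestamped versions."""

    def __init__(self):
        self._keys = []   # row keys in sorted order
        self._rows = {}   # row key -> {column -> [(timestamp, value), ...]}

    def put(self, row, column, value, ts):
        if row not in self._rows:
            bisect.insort(self._keys, row)
            self._rows[row] = {}
        versions = self._rows[row].setdefault(column, [])
        versions.append((ts, value))
        versions.sort(reverse=True)  # newest version first

    def get(self, row, column, versions=1):
        """Return up to `versions` newest (timestamp, value) pairs of a cell."""
        return self._rows.get(row, {}).get(column, [])[:versions]

    def scan(self, start_key, end_key):
        """Row keys in [start_key, end_key), the kind of range one region serves."""
        lo = bisect.bisect_left(self._keys, start_key)
        hi = bisect.bisect_left(self._keys, end_key)
        return self._keys[lo:hi]


table = ToyTable()
table.put("row3", "mycf:col1", "c", ts=1)
table.put("row1", "mycf:col1", "a", ts=1)
table.put("row1", "mycf:col1", "a2", ts=2)   # a newer version of the same cell
table.put("row2", "mycf:col1", "b", ts=1)

latest = table.get("row1", "mycf:col1")
history = table.get("row1", "mycf:col1", versions=2)
in_range = table.scan("row1", "row3")
print(latest, history, in_range)
```

Because the keys are sorted, a range scan is a contiguous slice, which is what makes splitting a table into regions by [start-key:end-key] cheap.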
Hbase workflow
HBase Examples
hbase> create 'mytable', 'mycf'
hbase> list
hbase> put 'mytable', 'row1', 'mycf:col1', 'val1'
hbase> scan 'mytable'
hbase> disable 'mytable'
hbase> drop 'mytable'
Oozie Job Workflow & Scheduling

What is Oozie?
A Java Web Application

Oozie is a workflow scheduler for Hadoop
Crond for Hadoop
Triggered by
Time
Data
Example workflow: Job 1, Job 2, Job 3, Job 4, Job 5
Oozie Features
Component Independent
MapReduce
Hive
Pig
Sqoop
Streaming
Mahout Data Mining

What is Mahout?
Machine-learning tool
Distributed and scalable machine learning algorithms on the Hadoop platform
Makes building intelligent applications easier and faster
Mahout Use Cases

Yahoo: Spam Detection
Foursquare: Recommendations
SpeedDate.com: Recommendations
Adobe: User Targeting
Amazon: Personalization Platform
Use case Example

Predict what the user likes based on
His/her historical behavior
Aggregate behavior of people similar to him/her
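The idea of predicting likes from the behavior of similar users can be sketched as a tiny user-based recommender. This is not Mahout's API or algorithm; the Jaccard similarity, the function names, and the sample data are all assumptions chosen to keep the illustration short.

```python
# Hypothetical history: the items each user has liked.
likes = {
    "alice": {"A", "B", "C"},
    "bob": {"A", "B", "D"},
    "carol": {"E"},
}


def jaccard(a, b):
    """Overlap between two sets of liked items, from 0.0 to 1.0."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(user, likes):
    """Rank items the user has not seen by the similarity of the users who like them."""
    scores = {}
    for other, items in likes.items():
        if other == user:
            continue
        similarity = jaccard(likes[user], items)
        for item in items - likes[user]:
            scores[item] = scores.get(item, 0.0) + similarity
    ranked = sorted(scores, key=lambda item: (-scores[item], item))
    return [item for item in ranked if scores[item] > 0]


suggestions = recommend("alice", likes)
print(suggestions)  # ['D']
```

Bob shares two of Alice's likes, so his remaining item is suggested; Carol shares none, so her item scores zero and is dropped.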
Conclusion
Today, we introduced:
Why Hadoop is needed
The basic concepts of HDFS and MapReduce
What sort of problems can be solved with Hadoop
What other projects are included in the Hadoop ecosystem
Recap Hadoop Ecosystem

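Hue (Web Console)
Mahout (Data Mining)
Oozie (Job Workflow & Scheduling)
Zookeeper (Coordination)
Sqoop/Flume (Data integration)
Pig/Hive (Analytical Language)
MapReduce Runtime (Dist. Programming Framework)
Hbase (Column NoSQL DB)
Hadoop Distributed File System (HDFS)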
Case Study
Collaboration in the underground
New Unique Malware Discovered

1M unique malware samples every month
The traditional approach is no longer sufficient to handle today's big data
New Design Concept for Threat Intelligence

Human Intelligence
CDN / xSP
Honeypot
Web Crawler
Trend Micro Mail Protection
Trend Micro Web Protection
Trend Micro Endpoint Protection
150M+ Worldwide Endpoints/Sensors
Challenges We Face

The concept is great, but ...
6 TB of data and 15B lines of logs received daily by
It becomes the Big Data Challenge!
Issues to Address
Raw Data -> Information -> Threat Intelligence/Solution
Volume: Infinite
Time: No Delay
Target: Keep Changing Threats
SPN Feedback
SPN High Level Architecture

[Architecture diagram]
Inputs: SPAM, CDN Log, HTTP POST, Web Pages, HTTP Download
Ingest: L4 -> Log Receiver -> Log Post Processing
SPN infrastructure: Adhoc-Query (Pig), MapReduce, HBase, Hadoop Distributed File System (HDFS)
Other components: Lumber Jack, Circus (Ambari), Tracking Logging System (TLS), Malware Classification Correlation Platform, Global Object Cache (GOC), Message Bus
Applications: Email Reputation Service, Web Reputation Service, File Reputation Service
Feedback Information
Trend Micro Big Data process capacity
85 Web Reputation
30 Email Reputation
70 File Reputation
6 TB raw logs
1.5
Trend Micro: Web Reputation Services

[Pipeline diagram; row labels: Technology, Process, Operation, Trend Micro Products / Technology]
User Traffic | Honeypot: 8 billion/day
CDN Cache (Akamai): 40% filtered, 4.8 billion/day remain
Rating Server for Known Threats: 82% filtered, 860 million/day remain
Unknown & Prefilter, Page Download, Threat Analysis (High Throughput Web Service; Hadoop Cluster; Web Crawling, Machine Learning, Data Mining): 99.98% filtered
25,000 malicious URLs/day
Process time: 15 minutes
Block malicious URL within 15 minutes once it goes online!
Big Data Cases
Line Data on HBase

Line data
MODEL: <key> -> <model>
INDEX: <key> -> <property in model>
User: <userID> -> <User obj>, <userID> <-> <phone>
Consistency in HBase
Contact model: use column qualifier to store
Support range query (e.g. message box)
Pig at LinkedIn
LinkedIn - Pig Example

views = LOAD '/data/awesome' USING VoldemortStorage();
views = LOAD '/data/etl/tracking/extracted/profile-view'
USING VoldemortStorage('date.range', 'num.days=90;days.ago=1');
Facebook Messages
Facebook Open Source Stack
Memcached --> App Server Cache

ZooKeeper --> Small Data Coordination Service
HBase --> Database Storage Engine
HDFS --> Distributed FileSystem
Hadoop --> Asynchronous Map-Reduce Jobs
Questions?
Thank you!
