
Cloud Computing Era (Practice)

Phoenix Liau
Trend Micro

Three Major Trends to Change the World

Cloud Computing

Big Data

Mobile


(NIST) :
Essential Characteristics
Service Models
Deployment Models

(as-a-service) Internet
(scalable) (elastic) IT

It's About the Ecosystem


Structured, Semistructured

Enterprise Data
Warehouse

Cloud Computing (SaaS, PaaS, IaaS)
Generate

Big Data
Lead

Business Insights
create

Competition, Innovation, Productivity

What is Big Data?

A set of files

A database

A single file

What is the problem?

Getting the data to the processors becomes the bottleneck
Quick calculation
Typical disk data transfer rate: 75 MB/sec
Time taken to transfer 100 GB of data to the processor: approx. 22 minutes!
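
The quick calculation above can be reproduced in a few lines. This is only a back-of-the-envelope sketch: it assumes binary gigabytes (1 GB = 1024 MB), and the 100-disk case is a hypothetical added here to show why spreading the data over many disks removes the bottleneck; it is not a figure from the slides.

```python
# Back-of-the-envelope check of the figures quoted on the slide.
DISK_RATE_MB_PER_S = 75   # typical disk transfer rate from the slide
DATA_SIZE_GB = 100        # data volume from the slide
MB_PER_GB = 1024          # assumption: binary gigabytes


def transfer_minutes(size_gb, rate_mb_per_s, disks=1):
    """Minutes needed to stream size_gb when `disks` disks are read in parallel."""
    seconds = size_gb * MB_PER_GB / (rate_mb_per_s * disks)
    return seconds / 60


single = transfer_minutes(DATA_SIZE_GB, DISK_RATE_MB_PER_S)
# Hypothetical: the same data spread evenly over 100 disks.
parallel = transfer_minutes(DATA_SIZE_GB, DISK_RATE_MB_PER_S, disks=100)

print(f"1 disk:    {single:.1f} minutes")
print(f"100 disks: {parallel * 60:.1f} seconds")
```

Reading from one disk takes a little under 23 minutes, in line with the slide's "approx. 22 minutes"; with the data spread over many disks the same read drops to seconds.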

The Era of Big Data: Are You Ready?


Businesses are driving the growth of big data. Storing huge volumes of data capably, managing them efficiently, and turning them into business value are major challenges for enterprises.
Overwhelming quantities of big data will challenge enterprise storage infrastructure and data center architecture, which will cause chain reactions in database storage, data mining, business intelligence, cloud computing, and computing applications.

Data for business commercial analysis


2011: multi-terabyte (TB)
2020: 35.2 ZB (1 ZB = 1 billion TB)

Who Needs It?

Hadoop: when to use?
Affordable Storage/Compute
Unstructured or Semi-structured
Resilient Auto Scalability

Enterprise Database: when to use?
Ad-hoc Reporting (<1sec)
Multi-step Transactions
Lots of Inserts/Updates/Deletes

Hadoop!

Apache Hadoop project
Inspired by Google's MapReduce and Google File System papers.
Open-source, flexible, and available architecture for large-scale computation and data processing on a network of commodity hardware
Open Source Software + Commodity Hardware
IT Cost Reduction

Hadoop Core

MapReduce
HDFS

2011 Cloudera, Inc. All Rights Reserved.

HDFS
Hadoop Distributed File System
Redundancy
Fault Tolerant
Scalable
Self Healing
Write Once, Read Many Times
Java API
Command Line Tool


MapReduce
Two Phases of Functional Programming
Redundancy
Fault Tolerant
Scalable
Self Healing
Java API


Hadoop Core
MapReduce (Java)
HDFS (Java)

Word Count Example

Map input - Key: offset, Value: line
0: The cat sat on the mat
22: The aardvark sat on the sofa

Map output - Key: word, Value: count

Reduce output - Key: word, Value: sum of count
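
The key/value flow above can be imitated on a single machine. The sketch below is not Hadoop code; it is a plain-Python illustration in which the function names `map_phase`, `shuffle` and `reduce_phase` are made up for this example, and lower-casing each word is an added assumption so that "The" and "the" are counted together.

```python
from collections import defaultdict


def map_phase(offset, line):
    """Mapper: (offset, line) -> list of (word, 1) pairs. The offset is unused."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key and sort the keys, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(word, counts):
    """Reducer: (word, [1, 1, ...]) -> (word, sum of counts)."""
    return word, sum(counts)


# The two input records shown on the slide (key: offset, value: line).
records = [(0, "The cat sat on the mat"), (22, "The aardvark sat on the sofa")]

mapped = [pair for offset, line in records for pair in map_phase(offset, line)]
result = dict(reduce_phase(word, counts) for word, counts in shuffle(mapped).items())
print(result)
```

In a real job the map calls run in parallel on different nodes and the shuffle moves data across the network; only the shape of the data at each step is the same.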

The Hadoop Ecosystems

The Ecosystem is the System


Hadoop has become the kernel of the distributed operating system for Big Data
No one uses the kernel alone
A collection of projects at Apache

Relation Map

Hue (Web Console)
Mahout (Data Mining)
Oozie (Job Workflow & Scheduling)
Zookeeper (Coordination)
Sqoop/Flume (Data integration)
Pig/Hive (Analytical Language)
MapReduce Runtime (Dist. Programming Framework)
Hbase (Column NoSQL DB)
Hadoop Distributed File System (HDFS)

Zookeeper Coordination Framework



What is ZooKeeper?
A centralized service for
Maintaining configuration information
Providing distributed synchronization
A set of tools to build distributed applications that can safely handle partial failures
ZooKeeper was designed to store coordination data
Status information
Configuration
Location information

Flume / Sqoop Data Integration Framework



What's the problem with data collection?

Data collection is currently a priori and ad hoc
A priori: decide what you want to collect ahead of time
Ad hoc: each kind of data source goes through its own collection path

What is Flume (and how can it help?)

A distributed data collection service
Efficiently collects, aggregates, and moves large amounts of data
Fault tolerant, with many failover and recovery mechanisms
One-stop solution for data collection of all formats

Flume: High-Level Overview


Logical Node
Source
Sink
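
The source/sink pairing of a logical node can be pictured with a toy pipeline. This is not the Flume API: `log_source`, `CollectorSink` and `run_node` are hypothetical names invented for this sketch, and an in-memory list stands in for HDFS.

```python
from typing import Iterable, Iterator, List


def log_source(lines: Iterable[str]) -> Iterator[str]:
    """Hypothetical source: turns raw log lines into events."""
    for line in lines:
        event = line.strip()
        if event:                      # skip blank lines
            yield event


class CollectorSink:
    """Hypothetical sink; a list stands in for files written to HDFS."""

    def __init__(self) -> None:
        self.events: List[str] = []

    def write(self, event: str) -> None:
        self.events.append(event)


def run_node(source: Iterator[str], sink: CollectorSink) -> int:
    """One logical node: drain the source into the sink, return the event count."""
    delivered = 0
    for event in source:
        sink.write(event)
        delivered += 1
    return delivered


# Two made-up logs, each handled by its own node, aggregated into one sink.
hdfs_stand_in = CollectorSink()
web_log = ["GET /index.html 200\n", "\n", "GET /missing 404\n"]
app_log = ["user=42 action=login\n"]
total = sum(run_node(log_source(log), hdfs_stand_in) for log in (web_log, app_log))
print(total, hdfs_stand_in.events)
```

The point of the design is that sources and sinks are interchangeable: swapping the list-backed sink for another destination would not change the node logic.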

Flume Architecture

Log -> Flume Node
Log -> Flume Node
...
Flume Nodes -> HDFS


Flume Sources and Sinks


Local Files
HDFS
Stdin, Stdout
Twitter
IRC
IMAP


Sqoop
Easy, parallel database import/export
What do you want to do?
Import data from RDBMS to HDFS
Export data from HDFS back into RDBMS

Sqoop

HDFS <-> Sqoop <-> RDBMS

Sqoop Examples
$ sqoop import --connect jdbc:mysql://localhost/world --username root --table City
...
$ hadoop fs -cat City/part-m-00000
1,Kabul,AFG,Kabol,1780000
2,Qandahar,AFG,Qandahar,237500
3,Herat,AFG,Herat,186800
4,Mazar-e-Sharif,AFG,Balkh,127800
5,Amsterdam,NLD,Noord-Holland,731200
...


Pig / Hive Analytical Language



Why Hive and Pig?


Although MapReduce is very powerful, it can also be complex to master
Many organizations have business or data analysts who are skilled at writing SQL queries, but not at writing Java code
Many organizations have programmers who are skilled at writing code in scripting languages
Hive and Pig are two projects which evolved separately to help such people analyze huge amounts of data via MapReduce
Hive was initially developed at Facebook, Pig at Yahoo!

Hive

Developed by Facebook

What is Hive?
An SQL-like interface to Hadoop
Data warehouse infrastructure that provides data summarization and ad hoc querying on top of Hadoop
MapReduce for execution
HDFS for storage

Hive Query Language


Basic-SQL : Select, From, Join, Group-By
Equi-Join, Multi-Table Insert, Multi-Group-By
Batch query
SELECT storeid, COUNT(*) FROM purchases WHERE price > 100 GROUP BY storeid

Hive

SQL -> Hive -> MapReduce

Pig

Initiated by Yahoo!

A high-level scripting language (Pig Latin)
Process data one step at a time
Simple to write MapReduce programs
Easy to understand
Easy to debug

A = load 'a.txt' as (id, name, age, ...);
B = load 'b.txt' as (id, address, ...);
C = JOIN A BY id, B BY id;
STORE C into 'c.txt';
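
For readers new to Pig, the JOIN step in the script above behaves like an inner equi-join on the id field. The following is a rough single-machine sketch of that behaviour, not how Pig executes it; the function name `join_by_id` and the sample rows are made up for illustration.

```python
from collections import defaultdict


def join_by_id(a_rows, b_rows):
    """Inner equi-join on the first field, similar in effect to JOIN A BY id, B BY id."""
    b_index = defaultdict(list)
    for row in b_rows:
        b_index[row[0]].append(row)        # index B by id
    # Each output row carries all fields of A followed by all fields of B.
    return [a + b for a in a_rows for b in b_index.get(a[0], [])]


# Made-up rows standing in for a.txt (id, name, age) and b.txt (id, address).
A = [(1, "Ann", 31), (2, "Bob", 45), (3, "Cid", 27)]
B = [(1, "Taipei"), (3, "Tokyo"), (4, "Paris")]
C = join_by_id(A, B)
print(C)
```

Rows whose id appears in only one input (2 and 4 here) are dropped, which is the default inner-join behaviour.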

Pig

Script -> Pig -> MapReduce

Hive vs. Pig

                      Hive                                          Pig
Language              HiveQL (SQL-like)                             Pig Latin, a scripting language
Schema                Table definitions stored in a metastore       A schema is optionally defined at runtime
Programmatic Access   JDBC, ODBC                                    PigServer

WordCount Example
Input
Hello World Bye World
Hello Hadoop Goodbye Hadoop

For the given sample input the map emits
<Hello, 1>
<World, 1>
<Bye, 1>
<World, 1>
<Hello, 1>
<Hadoop, 1>
<Goodbye, 1>
<Hadoop, 1>

The reduce just sums up the values
<Bye, 1>
<Goodbye, 1>
<Hadoop, 2>
<Hello, 2>
<World, 2>

WordCount Example In MapReduce


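import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {
    // Mapper: emits <word, 1> for every token of each input line.
    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer: sums the counts collected for each word.
    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    // Driver: args[0] is the input path, args[1] the output path.
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "wordcount");
        job.setJarByClass(WordCount.class); // lets the cluster locate the job's classes
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
    }
}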

WordCount Example By Pig


A = LOAD 'wordcount/input' USING PigStorage as (token:chararray);
B = GROUP A BY token;
C = FOREACH B GENERATE group, COUNT(A) as count;
DUMP C;

WordCount Example By Hive


CREATE TABLE wordcount (token STRING);
LOAD DATA LOCAL INPATH 'wordcount/input'
OVERWRITE INTO TABLE wordcount;
SELECT token, count(*) FROM wordcount GROUP BY token;

The Story So Far

Hive (SQL)
Pig (Script)
MapReduce (Java)
HDFS (Java)
Sqoop (SQL, RDBMS)
Flume (Posix, FS)

Hbase Column NoSQL DB



Structured-data vs Raw-data

Inspired by
Coordinated by Zookeeper
Low Latency
Random Reads And Writes
Distributed Key/Value Store
Simple API

PUT
GET
DELETE
SCAN

Hbase Data Model


Cells are versioned
Table rows are sorted by row key
Region: a row range [start-key:end-key]
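
To make the data model concrete, here is a toy in-memory table with the same ingredients: row keys kept sorted, cells that keep every version, and a range scan. It is only a sketch under stated assumptions: `TinyTable` is a made-up class, a simple counter stands in for timestamps, and the scan treats the end key as exclusive. None of this is the HBase client API.

```python
import bisect
import itertools


class TinyTable:
    """Toy key/value table: sorted row keys, versioned cells, range scans."""

    def __init__(self):
        self._row_keys = []                 # always kept in sorted order
        self._rows = {}                     # row -> {column -> [(version, value), ...]}
        self._clock = itertools.count(1)    # stand-in for cell timestamps

    def put(self, row, column, value):
        if row not in self._rows:
            bisect.insort(self._row_keys, row)
            self._rows[row] = {}
        self._rows[row].setdefault(column, []).append((next(self._clock), value))

    def get(self, row, column):
        """Return the newest version of a cell, or None if it does not exist."""
        versions = self._rows.get(row, {}).get(column)
        return versions[-1][1] if versions else None

    def versions(self, row, column):
        """Return all stored versions of a cell, oldest first."""
        return list(self._rows.get(row, {}).get(column, []))

    def delete(self, row):
        if row in self._rows:
            del self._rows[row]
            self._row_keys.pop(bisect.bisect_left(self._row_keys, row))

    def scan(self, start_key, end_key):
        """Yield (row, newest cells) for start_key <= row < end_key, in key order."""
        lo = bisect.bisect_left(self._row_keys, start_key)
        hi = bisect.bisect_left(self._row_keys, end_key)
        for row in self._row_keys[lo:hi]:
            yield row, {col: vs[-1][1] for col, vs in self._rows[row].items()}


table = TinyTable()
table.put("row2", "mycf:col1", "val3")
table.put("row1", "mycf:col1", "val1")
table.put("row1", "mycf:col1", "val1-new")   # second version of the same cell
table.put("row3", "mycf:col1", "val4")
scanned = list(table.scan("row1", "row3"))
print(scanned)
```

Because rows are sorted by key, a contiguous key range such as [row1, row3) can be served by one region, which is what makes range scans cheap.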

Hbase workflow

HBase Examples
hbase> create 'mytable', 'mycf'
hbase> list
hbase> put 'mytable', 'row1', 'mycf:col1', 'val1'
hbase> put 'mytable', 'row1', 'mycf:col2', 'val2'
hbase> put 'mytable', 'row2', 'mycf:col1', 'val3'
hbase> scan 'mytable'
hbase> disable 'mytable'
hbase> drop 'mytable'


Oozie Job Workflow & Scheduling



What is Oozie?

A Java Web Application
Oozie is a workflow scheduler for Hadoop
Crond for Hadoop
Triggered by
Time
Data

Job 1 Job 2
Job 3
Job 4 Job 5
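
A workflow of dependent jobs like the five pictured above is a directed acyclic graph, and a scheduler must run each job only after its prerequisites finish. The sketch below shows that ordering idea with Python's standard `graphlib` module (Python 3.9+). The dependency edges are hypothetical, since the slide text does not preserve the arrows, and `run_workflow` is a made-up helper, not part of Oozie.

```python
from graphlib import TopologicalSorter

# Hypothetical dependencies: job -> set of jobs that must finish first.
workflow = {
    "job1": set(),
    "job2": set(),
    "job3": {"job1", "job2"},
    "job4": {"job3"},
    "job5": {"job3"},
}


def run_workflow(dag, run_job):
    """Run every job once all of its prerequisites are done; return the order used."""
    order = []
    sorter = TopologicalSorter(dag)
    sorter.prepare()                          # raises CycleError if the graph has a cycle
    while sorter.is_active():
        for job in sorted(sorter.get_ready()):   # jobs whose prerequisites are done
            run_job(job)
            order.append(job)
            sorter.done(job)
    return order


executed = run_workflow(workflow, run_job=lambda name: print("running", name))
print(executed)
```

Jobs returned together by `get_ready()` have no dependency on each other, so a real scheduler could launch them in parallel.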

Oozie Features
Component Independent

MapReduce
Hive
Pig
Sqoop
Streaming


Mahout Data Mining



What is Mahout?
Machine-learning tool
Distributed and scalable machine learning algorithms on the Hadoop platform
Makes building intelligent applications easier and faster

Mahout Use Cases


Yahoo: Spam Detection
Foursquare: Recommendations
SpeedDate.com: Recommendations
Adobe: User Targeting
Amazon: Personalization Platform


Use case Example


Predict what the user likes based on
His or her historical behavior
Aggregate behavior of people similar to him or her
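
The two ingredients above (a user's own history plus the behavior of similar users) can be combined in a very small recommender. This is an illustrative sketch only, not Mahout's algorithm or API: it assumes Jaccard similarity over sets of liked items, and the users, items and the `recommend` function are all made up.

```python
def jaccard(a, b):
    """Similarity of two sets of liked items: shared items / all items."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(target, likes, top_n=2):
    """Rank items the target has not seen by the similarity of the users who like them."""
    mine = likes[target]
    scores = {}
    for user, items in likes.items():
        if user == target:
            continue
        similarity = jaccard(mine, items)
        for item in items - mine:                 # only items new to the target
            scores[item] = scores.get(item, 0.0) + similarity
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, score in ranked[:top_n] if score > 0]


# Made-up historical behavior: user -> set of liked items.
likes = {
    "alice": {"hadoop", "hive", "pig"},
    "bob": {"hadoop", "hive", "hbase"},
    "carol": {"pig", "mahout"},
    "dave": {"oozie"},
}
print(recommend("alice", likes))
```

Items liked only by users with nothing in common with the target (such as "oozie" here) receive a zero score and are not recommended.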

Conclusion
Today, we introduced:
Why Hadoop is needed
The basic concepts of HDFS and MapReduce
What sort of problems can be solved with Hadoop
What other projects are included in the Hadoop ecosystem

Recap Hadoop Ecosystem


Hue (Web Console)
Mahout (Data Mining)
Oozie (Job Workflow & Scheduling)
Zookeeper (Coordination)
Sqoop/Flume (Data integration)
Pig/Hive (Analytical Language)
MapReduce Runtime (Dist. Programming Framework)
Hbase (Column NoSQL DB)
Hadoop Distributed File System (HDFS)

Case Study

Collaboration in the underground

New Unique Malware Discovered


1M unique malware samples every month

The traditional approach is no longer sufficient to handle today's big data

New Design Concept for Threat Intelligence


Human Intelligence
CDN / xSP
Honeypot
Web Crawler
Trend Micro Mail Protection
Trend Micro Web Protection
Trend Micro Endpoint Protection

150M+ Worldwide Endpoints/Sensors

Challenges We Face


The Concept is Great, but ...
6TB of data and 15B lines of logs received daily by

It becomes the Big Data Challenge!

Issues to Address

Raw Data

Information

Threat
Intelligence/Solution

Volume: Infinite
Time: No Delay
Target: Keep Changing Threats

SPN Feedback

SPN High Level Architecture

[Figure: SPAM and CDN Log (HTTP POST) and Web Pages (HTTP Download) pass through L4 to the Log Receivers, followed by Log Post Processing]

SPN infrastructure

Adhoc-Query (Pig)
MapReduce
HBase
Hadoop Distributed File System (HDFS)
Lumber Jack
Circus (Ambari)
Tracking Logging System (TLS)
Malware Classification
Correlation Platform
Global Object Cache (GOC)
Feedback Information
Message Bus
Application
Email Reputation Service
Web Reputation Service
File Reputation Service

Trend Micro Big Data process capacity

85 Web Reputation
30 Email Reputation
70 File Reputation
6 TB raw logs
1.5

Trend Micro: Web Reputation Services

Technology: Trend Micro Products / Technology; CDN Cache; High Throughput Web Service; Hadoop Cluster; Web Crawling; Machine Learning; Data Mining
Process: User Traffic | Honeypot; Akamai; Rating Server for Known Threats; Unknown & Prefilter; Page Download; Threat Analysis
Operation: 8 billion/day; 40% filtered; 4.8 billion/day; 82% filtered; 860 million/day; 99.98% filtered; 25,000 malicious URLs/day; 15 minutes

Block a malicious URL within 15 minutes once it goes online!

Big Data Cases

Line Data on HBase


Line data
MODEL: <key> -> <model>
INDEX: <key> -> <property in model>
User: <userID> -> <User obj>, <userID> <-> <phone>

Consistency in HBase
Contact model: use column qualifier to store
Support range query (e.g. message box)

Pig at LinkedIn

LinkedIn - Pig Example


views = LOAD '/data/awesome' USING VoldemortStorage();
views = LOAD '/data/etl/tracking/extracted/profile-view'
USING VoldemortStorage('date.range', 'num.days=90;days.ago=1');

Facebook Messages

Facebook Open Source Stack

Memcached --> App Server Cache


ZooKeeper --> Small Data Coordination Service
HBase --> Database Storage Engine
HDFS --> Distributed FileSystem
Hadoop --> Asynchronous Map-Reduce Jobs

Questions?

Thank you!
