
Terapot: Massive Email Archiving with Hadoop & Friends


- Commercial Hadoop Application

Jason Han
Founder & CEO, NexR
jshan@nexr.co.kr

Next Revolution, Toward Open Platform


#2

NexR: Introduction

Offering Hadoop & Cloud Computing Platform and Services

Hadoop Provisioning & Management Hadoop & Cloud Computing Services

Academic Support
Massive Email Archiving MapReduce Workflow Program

Massive Data Storage & Processing Platform

Cloud Computing Platform (Compatible with Amazon AWS)

icube-cc (Compute)
icube-sc (Storage)
#3

Email Archiving: Objectives

  Regulatory compliance
  e-Discovery: Litigation and legal discovery
  E-mail backup and disaster recovery
  Messaging system & storage optimization
  Monitoring of internal and external e-mail content
#4

Email Archiving: Architecture

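[Figure: Email servers feed the email archiving servers (HA) through crawling and journaling. The archiving servers provide search & discovery, keep metadata and indexes on a DB server, and move aging email over the storage network to archival storage (DAS, SAN, NAS, tape library).]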
#5

Email Archiving: Challenges

  Explosive growth of digital data


-  6x growth from 2006 to 2010 (988 XB in 2010)
-  95% (939 XB) is unstructured data, including email
-  Increasing cost and complexity of archiving
-> Requiring scalable & low-cost archiving

  Tightening of data retention regulations
-  Retention, disposal, e-Discovery, security
-  HIPAA (healthcare) 21-23 yrs, SEC17 (trading) 6 yrs, OSHA (toxic) 30 yrs, SOX (finance) 5 yrs, J-SOX, K-SOX
-> Requiring scalable archiving & fast discovery

  Need for intelligent data management
-  Knowledge management from email data
-  Filtering, monitoring, data mining, etc.
-> Requiring integration with intelligent systems
#6

Email Archiving: Regulatory Compliance


#7

Email Archiving: Problems

[Figure: the conventional architecture (email servers, crawling and journaling, email archiving servers (HA), DB server with metadata and indexes, storage network, archival storage on DAS, SAN, NAS, and tape library), annotated with its problems:]
-  Centralized search is slow & not scalable
-  Storage is expensive & not scalable
-  Discovery from tape is slow
#8

Terapot: When Hadoop Met Email Archiving…


  Scale-out architecture with Hadoop
-  Hadoop HDFS for archiving email data
-  Hadoop MapReduce for crawling & indexing
-  Apache Lucene for search & discovery

[Figure: Email servers feed the email archiving servers (HA) through distributed crawling and journaling. Hadoop MapReduce (crawling, indexing, etc.) runs on Hadoop HDFS (archiving), alongside a metadata DB server and a journaling server, with distributed search & discovery on top.]


#9

Terapot: Overview
  Design Principles
-  Shared-nothing architecture -> Unlimited scalability
-  Inexpensive hardware -> Low cost
-  Using open source software -> Fast development
-  Exploiting parallelism -> High performance
-  Integrating with analysis -> High intelligence

  Features
  Distributed massive email archiving
  High scalability
-  Thousands of servers, billions of emails
  High performance
-  Fast search, in under 1-2 seconds for each user account
-  Fast discovery in parallel with MapReduce
  High intelligence
-  Email data mining, such as social network analysis
  Supports both on-premise and cloud (hosted) versions
  Developed with various open source software
#10

Terapot: Open Source Software Stack

Frontend Layer

Apache Tomcat Apache JAMES

Crawling Indexing Searching Email Mining


Downloading
Zookeeper

Apache Lucene Hive


MySQL

Hadoop MapReduce

Hadoop HDFS

Backend Layer
#11

Terapot: Architecture
Terapot Clients: SOAP, REST, JSON
Email Sources: POP3 Server, Mail Server, HTTP/FTP/SFTP Server, NAS/NFS

Terapot Frontend: Search Gateway, MailServer, MR Workflow Manager, Analyzer

Searching
Real-Time Indexing
Batch processing: Crawling, Indexing, Merging
Analysis: ETL, Mining

Hadoop MapReduce, Lucene, & Hive

Storage: HDFS (email, index), Local (index)
#12

Terapot Data Archiving Flow


Search Layer
1. Search emails

Real-Time Indexing Layer
1. Send email
2. Deliver email
3. Push email
4. Save email & build index files at runtime
5. Forward email
6. Receive email

Batch Processing Layer
1. Fetch emails in parallel
2. Save emails
3. Build index files

[Figure components: Internet, SMTP Server, HTTP/FTP/SFTP Server, NAS/NFS, Crawler (MR), Indexing (MR), Real-Time Index, index shards, HDFS (emails, index)]


#13

Terapot Data Analysis Flow

Report Retrieval Layer
1. View report for archiving data

Data Analysis Layer
1. Send HiveQL to analyze data
2. Generate report in MySQL

ETL Layer
1. Fetch emails in parallel
2. Store large data

[Figure components: NexR Terapot Front, Terapot Mining Engine, Terapot Archiving Storage, Transform (MR), Hive, MySQL (analysis data), HDFS shards (analysis data)]


#14

Technical Features
  Distributed Archiving
-  Hadoop HDFS for storing email data
-  Compression and deduplication for storage space efficiency
  Distributed Crawling & Indexing
-  Implemented with Hadoop MapReduce
-  Supports both push-based crawling (HTTP) and pull-based crawling (SFTP, FTP, HTTP, NFS, etc.)
-  Supports batch indexing & merging by MapReduce and real-time indexing for instant archiving
  Distributed Search
-  Shards a search job and executes it in parallel
-  Email is searchable as soon as it is received (due to real-time indexing)
  Parallel Download
-  Downloads full search results in parallel with MapReduce
-  Supports various download protocols (Local FS, HDFS, FTP, SFTP, HTTP, etc.)
  Standard Client Interface
-  Supports REST/SOAP and JSON interfaces
  Management
-  Configurable MapReduce job scheduling (crawling, indexing, merging, etc.)
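
The slides do not say how compression and deduplication are implemented. As a rough illustration of the general idea only, the sketch below keeps each distinct email body once, keyed by its content hash, and compresses it. The class name `DedupStore`, the use of SHA-256 and zlib, and the in-memory dictionaries standing in for HDFS are all assumptions, not Terapot's design.

```python
import hashlib
import zlib


class DedupStore:
    """Content-addressed store: identical email bodies are kept once, compressed."""

    def __init__(self):
        self.blobs = {}  # content digest -> compressed bytes (stand-in for HDFS files)
        self.refs = {}   # email key -> content digest

    def put(self, key, raw: bytes):
        """Store an email; a body that was seen before only adds a reference."""
        digest = hashlib.sha256(raw).hexdigest()
        if digest not in self.blobs:
            self.blobs[digest] = zlib.compress(raw)
        self.refs[key] = digest

    def get(self, key) -> bytes:
        """Return the original bytes for an email key."""
        return zlib.decompress(self.blobs[self.refs[key]])


if __name__ == "__main__":
    store = DedupStore()
    body = b"Quarterly report attached.\n" * 50
    store.put("alice/1", body)
    store.put("bob/7", body)  # same content delivered to a second mailbox
    print(len(store.refs), "emails,", len(store.blobs), "stored blob(s)")
```

Hashing whole messages only catches exact duplicates; deduplicating shared attachments would need hashing at a finer granularity.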
#15

Crawling
  Store Massive Email Data in HDFS through MapReduce
-  The Hadoop utility (dfs -put) just copies data sequentially
-  Each Crawling MR task takes a range of data and stores it in parallel

[Figure: the crawling client passes data location information as INPUT; splitting assigns a range to each Crawling MR task, and each task writes {key,email}* records into HDFS.]
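
A minimal sketch of the range-splitting idea on this slide, not Terapot's actual code: the list of source locations is split into ranges, and each worker fetches and stores its own range in parallel. The function names, the thread pool standing in for MapReduce tasks, and the dictionary standing in for HDFS are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def split_ranges(locations, num_tasks):
    """Split the source locations into at most num_tasks contiguous ranges."""
    size = max(1, -(-len(locations) // num_tasks))  # ceiling division
    return [locations[i:i + size] for i in range(0, len(locations), size)]


def crawl_range(locations, fetch):
    """One crawling task: fetch every email in its range, emit (key, email) pairs."""
    return [(loc, fetch(loc)) for loc in locations]


def crawl(locations, fetch, num_tasks=4):
    """Run the crawling tasks in parallel and collect their output in one store."""
    store = {}  # hypothetical stand-in for HDFS
    with ThreadPoolExecutor(max_workers=num_tasks) as pool:
        tasks = pool.map(lambda r: crawl_range(r, fetch),
                         split_ranges(locations, num_tasks))
        for pairs in tasks:
            store.update(pairs)
    return store


if __name__ == "__main__":
    fake_source = {f"mail/{i}.eml": f"body {i}" for i in range(10)}
    archived = crawl(sorted(fake_source), fake_source.get, num_tasks=3)
    print(len(archived), "emails archived")
```

The contrast with a sequential copy is that no single process reads all of the data; each range is read and written independently.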
#16

Indexing
  Indexing Email Data with MapReduce
-  Each Indexing MR task takes a range of data and builds a Lucene index in parallel

[Figure: the indexing client passes email data in HDFS as INPUT; splitting assigns a range to each Indexing MR task, and each task emits {key,index}* records.]
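
To make the per-range indexing concrete, here is a toy sketch in which each task builds its own small inverted index over one range of emails. It is not Lucene and not Terapot's code: the whitespace tokenizer, the function names, and the plain dictionaries used as indexes are assumptions for illustration.

```python
from collections import defaultdict


def index_range(emails):
    """One indexing task: build an inverted index (term -> sorted email keys)."""
    postings = defaultdict(set)
    for key, text in emails:
        for term in text.lower().split():
            postings[term].add(key)
    return {term: sorted(keys) for term, keys in postings.items()}


def build_indexes(emails, num_tasks):
    """Split (key, email) pairs into ranges and build one independent index per range."""
    size = max(1, -(-len(emails) // num_tasks))  # ceiling division
    ranges = [emails[i:i + size] for i in range(0, len(emails), size)]
    # Each call below is independent, so it could run as a separate parallel task.
    return [index_range(r) for r in ranges]


if __name__ == "__main__":
    sample = [("m1", "Hello Hadoop"), ("m2", "hello lucene"), ("m3", "hadoop hdfs")]
    for i, index in enumerate(build_indexes(sample, num_tasks=2)):
        print(i, index)
```

Because the ranges do not overlap, the resulting indexes can be searched independently or merged later in a separate step.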
#17

Real-Time Indexing
  Indexing Email Data at Runtime
-  Indexes in memory as each new email arrives
-  Flushes the RT shard into HDFS periodically
[Figure: mail arrives at JAMES, where a forwarding Mailet component passes the email data to the RT shards. Each RT shard keeps emails and a local index, and is flushed periodically into HDFS.]
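
The behaviour described on this slide can be sketched as a small in-memory shard that indexes each email on arrival and hands its buffered contents to durable storage in batches. This is an illustration under stated assumptions: the class `RealTimeShard`, the count-based flush trigger (the slide says the flush is periodic), and the list standing in for HDFS are all hypothetical.

```python
from collections import defaultdict


class RealTimeShard:
    """Toy real-time shard: indexes each email in memory, flushes in batches."""

    def __init__(self, flush_every, flush_target):
        self.flush_every = flush_every    # assumed policy: flush after N emails
        self.flush_target = flush_target  # list standing in for HDFS
        self.emails = {}
        self.index = defaultdict(set)

    def on_email(self, key, text):
        """Index a newly arrived email so it is searchable immediately."""
        self.emails[key] = text
        for term in text.lower().split():
            self.index[term].add(key)
        if len(self.emails) >= self.flush_every:
            self.flush()

    def search(self, term):
        """Return the keys of buffered emails that contain the term."""
        return sorted(self.index.get(term.lower(), ()))

    def flush(self):
        """Move the buffered emails and index to the durable store and start fresh."""
        if self.emails:
            snapshot = {t: sorted(keys) for t, keys in self.index.items()}
            self.flush_target.append((dict(self.emails), snapshot))
        self.emails = {}
        self.index = defaultdict(set)
```

After a flush the shard no longer answers for those emails itself; in a full system the flushed index would be served by the regular search shards.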
#18

Searching
  Distributed Search
-  Indexes are split & stored on local disks
-  Each shard is responsible for searching a range of the index

[Figure: the searching client sends a search to the shards and RT shards. Each shard searches its local index and reads email from HDFS. Zookeeper holds shard state & index information and sends notifications when they are updated.]
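
A minimal scatter-gather sketch of the sharded search described above: the same query goes to every shard in parallel and the partial results are merged. The dictionaries used as shard indexes, the function names, and the thread pool are assumptions for illustration; shard discovery through Zookeeper is not modelled.

```python
from concurrent.futures import ThreadPoolExecutor


def search_shard(shard_index, term):
    """Search one shard's local index (term -> list of email keys)."""
    return shard_index.get(term, [])


def distributed_search(shards, term):
    """Send the same query to every shard in parallel and merge the hits."""
    with ThreadPoolExecutor(max_workers=max(1, len(shards))) as pool:
        partial = pool.map(lambda s: search_shard(s, term), shards)
        return sorted(key for hits in partial for key in hits)


if __name__ == "__main__":
    shards = [
        {"hadoop": ["m1"], "hello": ["m1", "m2"]},
        {"hadoop": ["m3"], "hdfs": ["m3"]},
    ]
    print(distributed_search(shards, "hadoop"))
```

Since every shard covers only its own range of the index, adding shards spreads both the index size and the query work.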
#19

Parallel Downloading
  Downloading Massive Search Results in Parallel
  Support various types of communications for downloading
-  The Downloading MR job sorts search results globally & pushes them to the targets
[Figure: the download client sends a download request. DL Map tasks read search results from the shards and either write results directly or pass them to DL Reduce tasks, which perform a distributed global sort over HDFS. Results are written to Local, HDFS, FTP, SFTP, or HTTP targets.]
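
The slides do not describe how the distributed global sort works. One common way to get a globally sorted result from parallel reducers is range partitioning: pick split points, route each record to the reducer that owns its key range, and sort within each reducer. The sketch below shows that technique as an assumption, not as Terapot's implementation; for simplicity it derives the split points from all keys rather than from a sample.

```python
import bisect


def pick_split_points(keys, num_reducers):
    """Choose num_reducers - 1 boundaries that divide the keys into even ranges."""
    ordered = sorted(keys)
    if not ordered:
        return []
    return [ordered[len(ordered) * i // num_reducers] for i in range(1, num_reducers)]


def global_sort(map_outputs, num_reducers):
    """Range-partition records from all mappers, then sort each partition locally."""
    records = [rec for out in map_outputs for rec in out]
    splits = pick_split_points(records, num_reducers)
    partitions = [[] for _ in range(num_reducers)]
    for rec in records:
        # bisect_right maps a record to the index of the key range that contains it
        partitions[bisect.bisect_right(splits, rec)].append(rec)
    return [sorted(p) for p in partitions]


if __name__ == "__main__":
    outputs = [[5, 1, 9], [3, 7], [8, 2, 6, 4]]
    for i, part in enumerate(global_sort(outputs, num_reducers=3)):
        print("reducer", i, part)
```

Because every key in partition i is no greater than any key in partition i + 1, the reducers can push their outputs to the targets independently and the concatenation is still in global order.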
#20

Email Data Analysis


  Analysis Process
-  ETL (Extract-Transform-Load) of archived email data into Hive table format
-  Analyzing data using Hive with various analysis algorithms
  Generating the analysis result report

[Figure: ETL MR tasks load archiving data and write results into Hive. Terapot Mining executes HiveQL against Hive, generates the report, and writes the result to MySQL.]
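
As an illustration of the transform step, the sketch below flattens one raw email into sender/recipient rows and renders them as tab-separated text, a layout a Hive table can be declared to read. The row layout `(sender, recipient, link_type)` and the function names are assumptions; the slides do not show Terapot's actual table schema.

```python
from email import message_from_string
from email.utils import getaddresses


def to_rows(raw_email):
    """Flatten one RFC 822 message into (sender, recipient, link_type) rows."""
    msg = message_from_string(raw_email)
    senders = getaddresses([msg.get("From", "")])
    sender = senders[0][1] if senders else ""
    rows = []
    for field in ("To", "Cc"):
        for _name, addr in getaddresses(msg.get_all(field, [])):
            rows.append((sender, addr, field.upper()))
    return rows


def to_tsv(rows):
    """Render rows as tab-separated lines (assumed text layout for the table)."""
    return "\n".join("\t".join(row) for row in rows)


if __name__ == "__main__":
    raw = (
        "From: Alice <alice@example.com>\n"
        "To: bob@example.com, Carol <carol@example.com>\n"
        "Cc: dave@example.com\n"
        "Subject: hi\n"
        "\n"
        "body\n"
    )
    print(to_tsv(to_rows(raw)))
```

One row per sender/recipient link keeps later queries, such as mail frequency per pair, down to a simple group-by.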
#21

Types of Analysis
  Social Network Analysis
-  Personal Network Analysis
    -  Computing distance between recipients or senders based on TO, CC, FROM links
    -  Analyzing the statistics of mail frequency
-  Domain Analysis
    -  Computing distance between recipients' domains based on TO, CC, FROM links
  Keyword Analysis (in progress)
-  Keyword frequency for each user
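
The slides do not define "distance". One plausible reading is the number of hops between two addresses in a graph whose edges come from the FROM, TO, and CC fields; the sketch below computes that with a breadth-first search. The undirected-graph model, the dictionary-shaped email records, and the function names are assumptions for illustration.

```python
from collections import defaultdict, deque


def build_graph(emails):
    """Link the sender of each email with every TO and CC recipient (undirected)."""
    graph = defaultdict(set)
    for mail in emails:
        for rcpt in mail.get("to", []) + mail.get("cc", []):
            graph[mail["from"]].add(rcpt)
            graph[rcpt].add(mail["from"])
    return graph


def distance(graph, src, dst):
    """Return the number of hops between two addresses, or None if unconnected."""
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, hops = queue.popleft()
        if node == dst:
            return hops
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, hops + 1))
    return None


if __name__ == "__main__":
    mails = [
        {"from": "a@x.com", "to": ["b@x.com"], "cc": ["c@y.com"]},
        {"from": "c@y.com", "to": ["d@y.com"]},
    ]
    graph = build_graph(mails)
    print(distance(graph, "b@x.com", "d@y.com"))
```

Domain analysis could reuse the same functions by replacing each address with the part after the `@` before building the graph.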
#22

Terapot Performance
  Experimental Environment
-  11 Intel servers: 1 master + 10 slaves
-  Xeon 2.0 GHz, 2 CPUs, 16 GB memory, 4 TB disk
-  Number of emails: 270 million (index size: 270 GB)
  Results
Index on local disks
Number of Emails    Number of Results    Response Time (sec)
67,217,298          12,547,398           1.4
134,434,596         25,094,796           1.4
201,651,894         37,642,194           1.4
268,869,192         50,189,592           1.4

Index in HDFS
Number of Emails    Number of Results    Response Time (sec)
67,217,298          12,547,398           2.8
134,434,596         25,094,796           2.8
201,651,894         37,642,194           3.2
268,869,192         50,189,592           3.2
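
Two things can be read off these numbers with simple arithmetic: every query matches the same fraction of the corpus (about 18.7% of the emails), and searching an index held in HDFS takes between 2.0 and about 2.3 times as long as searching one on local disks. The short script below recomputes both ratios from the figures in the tables; the variable names are illustrative only.

```python
# (emails, results, local_sec, hdfs_sec), copied from the two result tables
ROWS = [
    (67_217_298, 12_547_398, 1.4, 2.8),
    (134_434_596, 25_094_796, 1.4, 2.8),
    (201_651_894, 37_642_194, 1.4, 3.2),
    (268_869_192, 50_189_592, 1.4, 3.2),
]

# Fraction of indexed emails returned by the query in each run
hit_ratios = [results / emails for emails, results, _, _ in ROWS]

# How much slower the HDFS-backed index is than the local-disk index
slowdowns = [hdfs / local for _, _, local, hdfs in ROWS]

for (emails, _, _, _), ratio, slow in zip(ROWS, hit_ratios, slowdowns):
    print(f"{emails:>12,} emails: hit ratio {ratio:.3f}, HDFS/local {slow:.2f}x")
```

The local-disk response time stays flat at 1.4 seconds while the corpus and the result set both grow fourfold.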


#23

Demonstration
#24

www.nexr.co.kr

Hadoop & Cloud Computing


Company
