(Hadoop) Terapot: Massive Email Archiving With Hadoop

Terapot: Massive Email Archiving with Hadoop & Friends

- Commercial Hadoop Application
Jason Han
Founder & CEO, NexR
jshan@nexr.co.kr
Next Revolution, Toward Open Platform

#2
NexR: Introduction
Offering Hadoop & Cloud Computing Platform and Services
  Hadoop Provisioning & Management
  Hadoop & Cloud Computing Services
  Academic Support
  Massive Email Archiving
  MapReduce Workflow Program
  Massive Data Storage & Processing Platform
  Cloud Computing Platform (Compatible with Amazon AWS)
  icube-cc (Compute), icube-sc (Storage)
#3
Email Archiving: Objectives
  Regulatory compliance
  e-Discovery: Litigation and legal discovery
  E-mail backup and disaster recovery
  Messaging system & storage optimization
  Monitoring of internal and external e-mail content
#4
Email Archiving: Architecture
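[Diagram: Email Servers are connected to the Email Archiving Servers (HA) by Crawling and Journaling. The archiving side comprises a DB Server, Search & Discovery, Metadata, and Indexes. A Storage Network links it to Archival Storage for aging email: DAS, SAN, NAS, Tape Library.]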
#5
Email Archiving: Challenges
  Explosive growth of digital data
-  6 times more data in 2010 (988 XB) than in 2006
-  95% (939 XB) is unstructured data, including email
-  Increasing cost and complexity of archiving
-> Requiring scalable & low-cost archiving
  Reinforcement of data retention regulation
-  Retention, Disposal, e-Discovery, Security
-  HIPAA (Healthcare) 21 ~ 23 yrs, SEC17 (Trading) 6 yrs, OSHA (Toxic) 30 yrs, SOX (Finance) 5 yrs, J-SOX, K-SOX
-> Requiring scalable archiving & fast discovery
  Need for intelligent data management
-  Knowledge management from email data
-  Filtering, monitoring, data mining, etc.
-> Requiring integration with intelligent systems
#6
Email Archiving: Regulatory Compliance

#7
Email Archiving: Problems
[Diagram: the same architecture as in slide #4 (Email Servers, Crawling, Journaling, DB Server, Email Archiving Servers (HA), Search & Discovery, Metadata, Indexes, Storage Network, Archival Storage for aging email on DAS, SAN, NAS, Tape Library), annotated with three problems:]
-  Centralized search is slow & not scalable
-  Storage is expensive & not scalable
-  Discovery from tape is slow
#8
Terapot: When Hadoop Met Email Archiving…

  Scale-out architecture with Hadoop
-  Hadoop HDFS for archiving email data
-  Hadoop MapReduce for crawling & indexing
-  Apache Lucene for search & discovery
[Diagram: Email Servers and a Journaling Server feed the Email Archiving Servers (HA) through distributed crawling and journaling. Inside the archiving servers: Hadoop MapReduce (Crawling, Indexing, etc.), Hadoop HDFS (Archiving), a Metadata DB Server, and Distributed Search & Discovery.]

#9
Terapot: Overview
  Design Principles
    Shared nothing architecture -> Unlimited scalability
    Inexpensive hardware -> Low cost
    Using open source software -> Fast development
    Exploiting parallelism -> High performance
    Integrating with analysis -> High intelligence
  Features
    Distributed massive email archiving
    High scalability
      Thousands of servers, billions of emails
    High performance
      Fast search, under 1-2 seconds for each user account
      Fast discovery in parallel with MapReduce
    High intelligence
      Email data mining, such as social network analysis
    Support for both an on-premise version and a cloud (hosted) version
    Development with various open source software
#10
Terapot: Open Source Software Stack
[Diagram: software stack]
  Frontend Layer: Apache Tomcat, Apache JAMES
  Functions: Crawling, Indexing, Searching, Downloading, Email Mining
  Components: Zookeeper, Apache Lucene, Hive, MySQL
  Backend Layer: Hadoop MapReduce, Hadoop HDFS
#11
Terapot: Architecture
[Diagram: architecture]
  Terapot Clients: HTTP/SOAP, REST, JSON
  Email Sources: POP3 Server, Mail Server, NAS/NFS Server, FTP/SFTP
  Terapot Frontend: Search Gateway, MailServer, MR Workflow Manager, Analyzer
  Functions: Searching, Real-Time Indexing, Batch processing (Crawling, Indexing, Merging), Analysis (ETL, Mining)
  Engines: Hadoop MapReduce, Lucene, & Hive
  Storage: HDFS (email, index), Local (index)
#12
Terapot Data Archiving Flow

[Diagram: three layers over HDFS, which holds emails and indexes]
  Search Layer
    1. Search emails (served by the Shards and their indexes)
  Real-Time Indexing Layer (Internet, SMTP Server, Real-Time Shards)
    1. Send email
    2. Deliver email
    3. Push email
    4. Save email & build index files in runtime
    5. Forward email
    6. Receive email
  Batch Processing Layer (HTTP/FTP/SFTP and NAS/NFS Server sources, Crawler (MR), Indexing (MR))
    1. Fetch emails in parallel
    2. Save emails
    3. Build index files
#13
Terapot Data Analysis Flow
[Diagram: three layers]
  Report Retrieval Layer (NexR Terapot Front, MySQL)
    1. View report for archiving data
    2. Generate report in MySQL
  Data Analysis Layer (Terapot Mining Engine, HIVE, analysis data in HDFS)
    1. Send HiveQL to analysis data
    2. Store large data
  ETL Layer (Terapot Archiving Storage, Shards)
    1. Fetch emails in parallel
    Transform (MR)
#14
Technical Features
  Distributed Archiving
  Hadoop HDFS for storing email data
  Compression and deduplication for storage space efficiency
  Distributed Crawling & Indexing
  Implemented by Hadoop MapReduce
  Support both push-based crawling (HTTP) and pull-based crawling (SFTP, FTP, HTTP, NFS, etc.)
  Support batch indexing & merging by MapReduce and real-time indexing for instant archiving
  Distributed Search
  Sharding a search job and executing it in parallel
  Searchable instantly on receiving an email (due to real-time indexing)
  Parallel Download
  Download full search results in parallel by MapReduce
  Support various download protocols (Local FS, HDFS, FTP, SFTP, HTTP, etc.)
  Standard Client Interface
  Support REST/SOAP and JSON interface
  Management
  Configurable MapReduce job scheduling (crawling, indexing, merging, etc)
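
The slide lists compression and deduplication for storage efficiency but does not say how Terapot implements them. The following is a minimal sketch of one common approach, content-addressed storage, and is an assumption rather than Terapot's actual method. The class name `DedupStore` and the choice of SHA-256 and zlib are hypothetical.

```python
import hashlib
import zlib


class DedupStore:
    """Toy content-addressed store: identical messages are kept once, compressed.

    Hypothetical illustration only; an in-memory dict stands in for HDFS.
    """

    def __init__(self):
        self.blobs = {}  # content digest -> compressed bytes
        self.refs = {}   # message key -> content digest

    def put(self, key, raw):
        """Store raw message bytes under key, reusing an existing blob if present."""
        digest = hashlib.sha256(raw).hexdigest()
        if digest not in self.blobs:
            self.blobs[digest] = zlib.compress(raw)
        self.refs[key] = digest
        return digest

    def get(self, key):
        """Return the original message bytes for key."""
        return zlib.decompress(self.blobs[self.refs[key]])
```

With this layout, the same message delivered to many mailboxes costs one compressed blob plus one small reference per mailbox.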
#15
Crawling
  Store Massive Email Data in HDFS through MapReduce
  The Hadoop utility (dfs -put) just copies data sequentially
  Each Crawling MR takes & stores a range of data in parallel
[Diagram: the Client supplies Crawling Data Location Information as INPUT; it is split among several Crawling MR tasks, each of which writes {key,email}* records to HDFS.]
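
The idea of each crawling task taking and storing its own range of data can be sketched as follows. This is a simplified single-process illustration, not Hadoop code: a thread pool stands in for the MapReduce tasks, a dict stands in for HDFS, and the function names and the `msg-` key format are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_ranges(total, num_tasks):
    """Split [0, total) into num_tasks contiguous ranges of near-equal size."""
    base, extra = divmod(total, num_tasks)
    ranges, start = [], 0
    for i in range(num_tasks):
        end = start + base + (1 if i < extra else 0)
        ranges.append((start, end))
        start = end
    return ranges


def crawl_range(source, start, end):
    """One crawling task: fetch its range and emit (key, email) pairs."""
    return [(f"msg-{i:08d}", source[i]) for i in range(start, end)]


def crawl(source, num_tasks):
    """Run all crawling tasks in parallel and collect their output."""
    archive = {}  # stand-in for HDFS
    with ThreadPoolExecutor(max_workers=num_tasks) as pool:
        futures = [
            pool.submit(crawl_range, source, start, end)
            for start, end in split_ranges(len(source), num_tasks)
        ]
        for future in futures:
            archive.update(future.result())
    return archive
```

Because the ranges do not overlap, the tasks need no coordination with each other, which is what lets the crawl scale with the number of tasks.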
#16
Indexing
  Indexing Email Data with MapReduce
  Each Indexing MR takes a range of data and builds a Lucene index in parallel
[Diagram: the Client submits the email data in HDFS as INPUT; it is split among several Indexing MR tasks, each of which emits {key,index}* records.]
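
A rough sketch of per-range indexing followed by merging is shown below. It uses a plain in-memory inverted index instead of Lucene, so it illustrates only the shape of the computation; `build_index` and `merge_indexes` are hypothetical names, and whitespace tokenization is an assumption.

```python
from collections import defaultdict


def build_index(pairs):
    """One indexing task: build an inverted index (term -> keys) for its range."""
    index = defaultdict(set)
    for key, text in pairs:
        for term in text.lower().split():
            index[term].add(key)
    return index


def merge_indexes(indexes):
    """Merging step: combine the per-range indexes into a single index."""
    merged = defaultdict(set)
    for index in indexes:
        for term, keys in index.items():
            merged[term] |= keys
    return merged
```

For example, indexing two ranges separately and merging them yields the same postings as indexing all the emails in one pass.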
#17
Real-Time Indexing
  Indexing Email Data at Runtime
  Indexing in memory when a new email arrives
  Flushing the RT-Shard periodically into HDFS
[Diagram: Mail arrives at JAMES; a Mailet component forwards the email data to the RT Shards, which keep a Local Index and write emails to HDFS. Each Real-Time Shard is periodically flushed into HDFS.]
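
The in-memory indexing and periodic flush described above might look roughly like this. It is a sketch under stated assumptions: a local directory stands in for HDFS, JSON files stand in for Lucene segments, and the class name `RealTimeShard`, the `flush_interval` parameter, and the injectable `clock` are all hypothetical.

```python
import json
import os
import time


class RealTimeShard:
    """Toy real-time shard: indexes emails in memory and flushes periodically."""

    def __init__(self, flush_dir, flush_interval=60.0, clock=time.monotonic):
        self.flush_dir = flush_dir          # stand-in for an HDFS directory
        self.flush_interval = flush_interval
        self.clock = clock
        self.index = {}                     # term -> list of message keys
        self.segments = 0
        self.last_flush = clock()

    def add(self, key, text):
        """Index a newly arrived email, then flush if the interval has elapsed."""
        for term in set(text.lower().split()):
            self.index.setdefault(term, []).append(key)
        if self.clock() - self.last_flush >= self.flush_interval:
            self.flush()

    def search(self, term):
        """Search only the in-memory (not yet flushed) part of the index."""
        return list(self.index.get(term.lower(), []))

    def flush(self):
        """Write the in-memory index out as one segment file and reset it."""
        if self.index:
            name = f"segment-{self.segments:05d}.json"
            with open(os.path.join(self.flush_dir, name), "w", encoding="utf-8") as f:
                json.dump(self.index, f)
            self.segments += 1
            self.index = {}
        self.last_flush = self.clock()
```

In this sketch a flushed segment is no longer searchable through the shard itself; in a full system the flushed index would be picked up by the regular search shards.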
#18
Searching
  Distributed Search
  Indexes are split & stored on local disks
  Each shard is responsible for searching a range of the index
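[Diagram: the Searching Client sends a Search to the Shards and RT Shards. Each Shard searches its Local Index and reads email from HDFS. Zookeeper holds shard state & index information, which the shards update, and issues notifications.]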
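
The fan-out and merge pattern behind sharded search can be sketched as below. This is an illustration only: the shard list is static here, whereas the slide tracks shard state in Zookeeper, and the `Shard` class, the `(score, key)` hit format, and `distributed_search` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


class Shard:
    """Toy shard holding its own slice of the index: term -> [(score, key)]."""

    def __init__(self, index):
        self.index = index

    def search(self, term, limit):
        """Return this shard's best hits for term, highest score first."""
        return sorted(self.index.get(term, []), reverse=True)[:limit]


def distributed_search(shards, term, limit=10):
    """Query every shard in parallel, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(lambda shard: shard.search(term, limit), shards))
    merged = sorted((hit for part in partials for hit in part), reverse=True)
    return merged[:limit]
```

Each shard returns at most `limit` hits, so the merge step handles a bounded amount of data regardless of how large the full index is.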
#19
Parallel Downloading
  Downloading Massive Search Results in Parallel
  Support various types of communication for downloading
  Downloading MR sorts search results globally & pushes them to the targets
[Diagram: the Download Client sends a Download Request. DL Map tasks read search results from the Shards, and DL Reduce tasks write the results to the targets (Local, HDFS, FTP, SFTP, HTTP) after a distributed global sort on HDFS. One DL Map path is labelled "write result directly".]
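
One standard way to obtain a global sort from parallel reducers is range partitioning: every record is routed to a reducer by key range, so the sorted reducer outputs concatenate into one sorted result. The slides do not say that Terapot does exactly this, so treat the sketch below as an assumption; `partition`, `global_sort`, and the `boundaries` parameter are hypothetical names.

```python
import bisect


def partition(records, boundaries):
    """Route each (sort_key, value) record to a bucket by key range."""
    buckets = [[] for _ in range(len(boundaries) + 1)]
    for record in records:
        buckets[bisect.bisect_right(boundaries, record[0])].append(record)
    return buckets


def global_sort(map_outputs, boundaries):
    """Shuffle map outputs to reducers by range, then sort within each reducer."""
    reducers = [[] for _ in range(len(boundaries) + 1)]
    for output in map_outputs:
        for i, bucket in enumerate(partition(output, boundaries)):
            reducers[i].extend(bucket)
    return [sorted(records) for records in reducers]
```

Since reducer i only ever holds keys smaller than those of reducer i+1, each reducer can push its part to the download target independently and the parts stay in global order.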
#20
Email Data Analysis

  Analysis Process
  ETL (Extract-Transform-Load) of email archiving data into Hive table format
  Analyzing data using Hive with various analysis algorithms
  Generating the analysis result report
[Diagram: Terapot Mining loads the archiving data through several ETL MR tasks, which write their results to HIVE. Terapot Mining then executes HiveQL and generates a report, with the results written to MySQL.]
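
The transform step of the ETL can be illustrated by flattening one raw email into a delimited row of the kind a Hive text table can load. The column layout (key, from, to, cc, date, subject) and the function name `to_row` are assumptions; the slides do not give Terapot's actual Hive schema.

```python
from email import message_from_string
from email.utils import getaddresses


def to_row(key, raw):
    """Flatten one raw RFC 822 message into a tab-separated row."""
    msg = message_from_string(raw)

    def addrs(header):
        # Collect every address in the header, lower-cased and comma-joined.
        pairs = getaddresses(msg.get_all(header, []))
        return ",".join(addr.lower() for _, addr in pairs)

    fields = [
        key,
        addrs("From"),
        addrs("To"),
        addrs("Cc"),
        msg.get("Date", ""),
        msg.get("Subject", ""),
    ]
    # Tabs and newlines inside a field would break the row format.
    return "\t".join(f.replace("\t", " ").replace("\n", " ") for f in fields)
```

Running this function inside each ETL map task would produce files that a tab-delimited Hive table could then query with HiveQL.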
#21
Types of Analysis
  Social Network Analysis
  Personal Network Analysis
  Computing distance between recipients or senders based on TO, CC, FROM links
  Analyzing the statistics of mail frequency
  Domain Analysis
  Computing distance between recipients' domains based on TO, CC, FROM
  Keyword Analysis (in progress)
  Keyword frequency for each user
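
The slides do not define the distance measure used in the personal network analysis. One plausible reading, assumed here, is hop count in an undirected graph whose edges are the FROM-to-TO/CC links, computed with breadth-first search. The function names and the `(sender, recipients)` message format are hypothetical.

```python
from collections import Counter, defaultdict, deque


def build_graph(messages):
    """Build an undirected contact graph and per-pair mail frequency.

    messages: iterable of (sender, recipients), recipients being TO + CC.
    """
    graph = defaultdict(set)
    freq = Counter()
    for sender, recipients in messages:
        for recipient in recipients:
            graph[sender].add(recipient)
            graph[recipient].add(sender)
            freq[(sender, recipient)] += 1
    return graph, freq


def distance(graph, src, dst):
    """Hop count between two addresses, or None if they are not connected."""
    if src == dst:
        return 0
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        node, hops = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor == dst:
                return hops + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, hops + 1))
    return None
```

Replacing each address with its domain part before building the graph gives the corresponding domain-level analysis.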
#22
Terapot Performance
  Experimental Environment
  11 Intel Servers: 1 Master + 10 Slaves
  Xeon 2.0 GHz, 2 CPUs, 16 GB Memory, 4 TB Disk
  Number of emails: 270 million (Index size: 270 GB)
  Results
Indexing in local disks
Number of Emails    Number of Results    Response Time (sec)
67,217,298          12,547,398           1.4
134,434,596         25,094,796           1.4
201,651,894         37,642,194           1.4
268,869,192         50,189,592           1.4

Indexing in HDFS
Number of Emails    Number of Results    Response Time (sec)
67,217,298          12,547,398           2.8
134,434,596         25,094,796           2.8
201,651,894         37,642,194           3.2
268,869,192         50,189,592           3.2

#23
Demonstration
#24
www.nexr.co.kr
Hadoop & Cloud Computing Company
