Jason Han
Founder & CEO, NexR
jshan@nexr.co.kr
NexR: Introduction
Academic Support
Massive Email Archiving MapReduce Workflow Program
Regulatory compliance
e-Discovery: Litigation and legal discovery
E-mail backup and disaster recovery
Messaging system & storage optimization
Monitoring of internal and external e-mail content
[Figure: conventional email archiving architecture. Components: email servers; crawling and journaling; email archiving servers (HA); DB server (metadata); search & discovery (indexes); storage network; archival storage (DAS, SAN, NAS); aging email on a tape library.]
[Figure: the conventional email archiving architecture, annotated with its problems.]
Centralized search is slow & not scalable
Storage is expensive & not scalable
Discovery from tape is slow
[Figure: distributed email archiving architecture. Components: email servers; email archiving servers (HA); distributed crawling and journaling; Hadoop MapReduce (crawling, indexing, etc.); Hadoop HDFS (archiving); metadata DB server; journaling server.]
Terapot: Overview
Design Principles
Shared-nothing architecture -> unlimited scalability
Inexpensive hardware -> low cost
Open source software -> fast development
Exploiting parallelism -> high performance
Integration with analysis -> high intelligence
Features
Distributed massive email archiving
High scalability
thousands of servers, billions of emails
High Performance
Fast search, within 1-2 seconds for each user account
Fast discovery in parallel with MapReduce
High Intelligence
Email data mining, such as social network analysis
Support both an on-premise version and a cloud (hosted) version
Development with various open source software
[Figure: Terapot layer stack. Labels: Frontend Layer, Hadoop MapReduce, Hadoop HDFS, Backend Layer.]
Terapot: Architecture
[Figure: Terapot architecture. Terapot clients reach the Terapot frontend over HTTP (SOAP, REST, JSON). Email sources: POP3 server, mail server, NAS/NFS server, FTP/SFTP server. Storage: HDFS (email, index) and local disk (index).]
[Figure: archiving and search workflow. Labels: Internet, SMTP server, crawler (MR), indexing (MR), real-time index shard, index shards. Steps shown: search emails; fetch emails in parallel, save emails, build index files; receive email, push email, save email & build index files at runtime.]
[Figure: mining workflow. Labels: NexR Terapot Front, Terapot Mining Engine, Terapot Archiving Storage, Transform (MR), Hive, analysis data in HDFS and MySQL. Steps shown: fetch emails in parallel; send HiveQL to analysis data; generate report in MySQL; view report for archiving data.]
Technical Features
Distributed Archiving
Hadoop HDFS for storing email data
Compression and deduplication for storage space efficiency
Distributed Crawling & Indexing
Implemented by Hadoop MapReduce
Support both push-based crawling (HTTP) and pull-based crawling (SFTP, FTP, HTTP, NFS, etc.)
Support batch indexing & merging by MapReduce, and real-time indexing for instant archiving
Distributed Search
Split a search job across shards and execute the pieces in parallel
Searchable instantly on receiving an email (due to real-time indexing)
Parallel Download
Download full search results in parallel by MapReduce
Support various download protocols (Local FS, HDFS, FTP, SFTP, HTTP, etc.)
Standard Client Interface
Support REST/SOAP and JSON interfaces
Management
Configurable MapReduce job scheduling (crawling, indexing, merging, etc)
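The "compression and deduplication" item under Distributed Archiving can be illustrated with a small sketch. The slides do not say how Terapot implements either one, so everything here is an assumption: the class name DedupStore is hypothetical, SHA-256 content hashing stands in for the deduplication scheme, zlib stands in for the compressor, and plain dictionaries stand in for HDFS.

```python
import hashlib
import zlib


class DedupStore:
    """Toy content-addressed store: one compressed copy per unique message."""

    def __init__(self):
        self.blobs = {}  # content digest -> compressed bytes (stand-in for HDFS)
        self.refs = {}   # email key -> content digest

    def put(self, key: str, raw: bytes) -> bool:
        """Store an email; return True only if its content was not seen before."""
        digest = hashlib.sha256(raw).hexdigest()
        is_new = digest not in self.blobs
        if is_new:
            self.blobs[digest] = zlib.compress(raw)
        self.refs[key] = digest
        return is_new

    def get(self, key: str) -> bytes:
        """Return the original bytes for an email key."""
        return zlib.decompress(self.blobs[self.refs[key]])
```

With this layout, the same message delivered to many mailboxes costs one compressed blob plus one small reference per mailbox.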
Crawling
Store Massive Email Data in HDFS through MapReduce
The Hadoop utility (dfs -put) only copies data sequentially
Each Crawling MR task fetches and stores its own range of the data, so the ranges are processed in parallel
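The idea of giving each crawling task its own range can be sketched in a few lines. This is not Hadoop code: a thread pool stands in for the MapReduce tasks, an in-memory list stands in for the email source, and the names split_ranges, crawl_range, and crawl are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_ranges(total, num_tasks):
    """Divide [0, total) into num_tasks contiguous, near-equal ranges."""
    base, extra = divmod(total, num_tasks)
    ranges, start = [], 0
    for i in range(num_tasks):
        end = start + base + (1 if i < extra else 0)
        ranges.append((start, end))
        start = end
    return ranges


def crawl_range(source, start, end):
    """Stand-in for one crawling task: fetch a range, emit (key, email) pairs."""
    return [(f"msg-{i}", source[i]) for i in range(start, end)]


def crawl(source, num_tasks=4):
    """Run one crawling task per range in parallel and collect the pairs."""
    ranges = split_ranges(len(source), num_tasks)
    with ThreadPoolExecutor(max_workers=num_tasks) as pool:
        parts = list(pool.map(lambda r: crawl_range(source, *r), ranges))
    return [pair for part in parts for pair in part]
```

Because every task owns a disjoint range, no coordination is needed between tasks, which is what lets the copy scale beyond a single sequential stream.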
[Figure: crawling. The client supplies the crawling data location information; the input is split across Crawling MR tasks, which write {key,email}* records to HDFS.]
Indexing
Indexing Email Data with MapReduce
Each Indexing MR task takes a range of the data and builds a Lucene index for it; the tasks run in parallel
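A rough picture of per-range indexing followed by merging, under loud assumptions: a toy inverted index (term to set of email keys) replaces Lucene, whitespace splitting replaces real analysis, and the function names index_split and merge_indexes are hypothetical.

```python
from collections import defaultdict


def index_split(emails):
    """Build a partial inverted index (term -> set of keys) for one split.

    emails is an iterable of (key, text) pairs, mirroring {key,email}*.
    """
    index = defaultdict(set)
    for key, text in emails:
        for term in text.lower().split():
            index[term].add(key)
    return index


def merge_indexes(partials):
    """Merge partial indexes from several splits into one index."""
    merged = defaultdict(set)
    for partial in partials:
        for term, keys in partial.items():
            merged[term] |= keys
    return merged
```

Each call to index_split depends only on its own split, so the calls can run on different machines; only the merge step sees more than one partial index.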
[Figure: indexing. On a client request, email data in HDFS is split across Indexing MR tasks, which emit {key,index}* records.]
Real-Time Indexing
Indexing Email Data in Runtime
Index each new email in memory as it arrives
Flush the RT-Shard periodically into HDFS
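The in-memory-then-flush pattern can be sketched as follows. The class RealTimeShard and its methods are hypothetical, a plain list stands in for HDFS, and flush is called explicitly rather than on a timer. The slides do not describe how flushed emails are served afterwards, so this sketch simply resets the shard.

```python
class RealTimeShard:
    """In-memory index for newly arrived emails, flushed in batches."""

    def __init__(self):
        self.index = {}   # term -> set of email keys, held in memory only
        self.buffer = []  # (key, text) pairs not yet flushed

    def add(self, key, text):
        """Index a new email immediately so it is searchable at once."""
        self.buffer.append((key, text))
        for term in text.lower().split():
            self.index.setdefault(term, set()).add(key)

    def search(self, term):
        """Return the keys of buffered emails containing the term."""
        return sorted(self.index.get(term.lower(), ()))

    def flush(self, durable_store):
        """Move buffered emails to durable storage and reset the shard.

        durable_store is a plain list standing in for HDFS (assumption).
        Returns the number of emails flushed.
        """
        flushed = len(self.buffer)
        durable_store.extend(self.buffer)
        self.buffer = []
        self.index = {}
        return flushed
```

A periodic caller (for example a scheduler thread) would invoke flush at a fixed interval; the interval trades memory use against the number of small writes.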
[Figure: real-time indexing. Incoming mail reaches JAMES, where a Mailet component forwards emails to the RT shards; each RT shard keeps a local index and is periodically flushed into HDFS as email data.]
Searching
Distributed Search
Indexes are split & stored on local disks
Each shard is responsible for searching one range of the index
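A minimal scatter-gather sketch of this arrangement, assuming each shard holds a simple term-to-keys mapping in place of its local Lucene index. The class Shard and the function distributed_search are hypothetical names, and a thread pool stands in for remote calls to shard servers.

```python
from concurrent.futures import ThreadPoolExecutor


class Shard:
    """One shard: searches only its own range of the index."""

    def __init__(self, index):
        self.index = index  # term -> list of email keys held by this shard

    def search(self, term):
        return list(self.index.get(term, []))


def distributed_search(shards, term):
    """Send the query to every shard in parallel and merge the hits."""
    if not shards:
        return []
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(lambda s: s.search(term), shards))
    return sorted(key for part in partials for key in part)
```

Query latency is then bounded by the slowest shard rather than by the total index size, which is the point of splitting the index.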
[Figure: distributed search. The client sends a search to the shards, including the RT shard; each shard searches its local index and reads emails from HDFS. Zookeeper holds shard state & index information and issues notifications.]
Parallel Downloading
Downloading Massive Search Results in Parallel
Support various types of communications for downloading
The Downloading MR job sorts search results globally & pushes them to the download targets
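One common way to get a global sort out of parallel workers is range partitioning: pick boundary values, route each record to the partition that covers it, and sort each partition independently. The slides do not say which method Terapot uses, so this is an assumed technique, and the names choose_boundaries and global_sort are hypothetical.

```python
import bisect


def choose_boundaries(sample, num_partitions):
    """Pick num_partitions - 1 split points from a sorted sample."""
    ordered = sorted(sample)
    step = len(ordered) / num_partitions
    return [ordered[int(step * i)] for i in range(1, num_partitions)]


def global_sort(records, num_partitions=3):
    """Return a list of sorted partitions whose concatenation is sorted."""
    partitions = [[] for _ in range(num_partitions)]
    if not records:
        return partitions
    boundaries = choose_boundaries(records, num_partitions)
    for rec in records:
        # Map side: route the record to the partition covering its range.
        partitions[bisect.bisect_right(boundaries, rec)].append(rec)
    # Reduce side: each partition is sorted on its own.
    return [sorted(part) for part in partitions]
```

Because partition i only holds values no larger than those in partition i + 1, each reducer can push its output to the target without waiting for the others.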
[Figure: parallel downloading. The client sends a download request; DL Map tasks read results from the shards, DL Reduce tasks perform a distributed global sort, and results are written to Local FS, HDFS, FTP, SFTP, or HTTP targets.]
[Figure: Terapot Mining. ETL MR tasks load archiving data and write results into Hive; Terapot Mining executes HiveQL and generates reports.]
Types of Analysis
Social Network Analysis
Personal Network Analysis
Computing distance between recipients or senders based on TO, CC, FROM links
Analyzing the statistics of mail frequency
Domain Analysis
Computing distance between recipients' domains based on TO, CC, FROM links
Keyword Analysis (in progress)
Keyword frequency for each user
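For the personal network analysis, one plausible reading of "distance" is the number of hops between two addresses in a graph built from FROM, TO, and CC links. The slides do not define the metric, so hop count is an assumption, as are the function names build_graph and distance and the dictionary layout of each email.

```python
from collections import defaultdict, deque


def build_graph(emails):
    """Link each sender to every TO and CC recipient (undirected)."""
    graph = defaultdict(set)
    for mail in emails:
        sender = mail["from"]
        for rcpt in mail.get("to", []) + mail.get("cc", []):
            graph[sender].add(rcpt)
            graph[rcpt].add(sender)
    return graph


def distance(graph, src, dst):
    """Breadth-first search: hops from src to dst, or None if unreachable."""
    if src == dst:
        return 0
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        node, hops = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt == dst:
                return hops + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, hops + 1))
    return None
```

The same two functions cover the domain analysis if each address is replaced by its domain part before the graph is built.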
Terapot Performance
Experimental Environment
11 Intel Servers: 1 Master + 10 Slaves
Xeon 2.0 GHz, 2 CPUs, 16 GB memory, 4 TB disk
The number of emails: 270 million (index size: 270 GB)
Results
Indexing in local disks
Number of Emails Number of Results Response Time (sec)
Indexing in HDFS
Number of Emails Number of Results Response Time (sec)
Demonstration
www.nexr.co.kr