
Flexible Analytics for Large Data Sets

Using Elasticsearch
Copyright 2012 TNR Global, LLC
May be reprinted with permission
Christopher Miles
Storing and analyzing large amounts of data are two very different challenges, but what if we could combine them? Traditional databases certainly provide the flexible query tools we need in order to answer our questions, but often our data lacks the uniformity these products require or is simply too large. In this paper we will store large amounts of data in one of the leading NoSQL stores, Elasticsearch, and leverage the powerful indexing tools it provides in order to generate detailed analytics that we can use in response to exploratory, ad-hoc queries.
Introduction
Big Data promises access to the entirety of our accumulated data and the ability to mine this data for nuggets of golden information that can revolutionize the way we handle our problem space. Unfortunately, much of this analysis needs to be done in batches, and that isn't conducive to the kind of interactive exploration that many of us need.
Tools like HBase, for instance, can handle large amounts of information, but in order to provide metrics over the entire data set we often need to lay out clearly what those metrics are when we construct the layout of our data store; these values need to be calculated as we
load in our data. Additional measures that become critical afterwards can only be added after costly re-processing of the relevant portion of our data set, perhaps even our entire data set. Tools like Hadoop enable us to process huge amounts of data quickly and efficiently, but these processing jobs are often run in batches. Depending on the size of the data set and the complexity of processing, a particular job could take anywhere from a few minutes to several weeks to finish.
We are looking for a solution that answers our questions as soon as we ask them. Something that can handle very large data sets and provide access to both our most granular pieces of data and aggregated information. We'd like this solution to be flexible enough that it can answer a wide range of questions, ideally fast enough that we can engage in something that is more like a dialogue and less like postal correspondence. While we have yet to find this idealized tool, we have discovered that Elasticsearch can provide a good amount of the speed and flexibility we are seeking.
Overview
Before we dive into details, let us outline the various components of our solution.
Indexing and Data Storage
Elasticsearch provides both an indexing service and a data store. It does not require a detailed schema, although one may be provided if needed. Any data that we can express as JSON can be easily stored and indexed with Elasticsearch. Unlike some other search services, we can always retrieve our raw JSON data when we need it (for instance, if we needed to re-index our data set).
One of the unique features of Elasticsearch that makes it especially well suited for our purposes is the ease with which we can scale our solution. It provides us with the ability to either add or remove resources (that is, individual machines running Elasticsearch) at any time. We might do this in order to support a growing data set or, perhaps, to satisfy an increasing number of requests and improve the performance of our solution. While other products are looking to add this ability to painlessly scale (Lucid Imagination is working on this functionality in their development builds of Solr), they are not there yet.
Providing a tutorial on installing and configuring Elasticsearch is outside the scope of this document; however, there is an abundance of information online. We have another white paper available that walks through setting up a cluster suitable for development purposes; in addition, the Elasticsearch project has a very detailed tutorial that describes deployment and setup in Amazon's EC2 environment.
Storing Raw Data
The next piece of the puzzle is to process our data set and load it into Elasticsearch. Certainly some of us have data that is already in JSON or a JSON-like format, but most of us do not. In any case we will want to do some level of processing on this data as we load it into our data store. We may want to normalize values or engage in some level of sanity checking.
Hadoop provides an ideal solution to this type of problem. It enables us to store a large data
set over many individual machines (a cluster), providing us with low-cost access to a very large
amount of storage space. Hadoop will also provide us with the ability to harness the processing
power of these machines so that we can digest this data as quickly as possible.
While Hadoop will provide us with the mechanism we need to handle our initial loading of data, there's no need for us to rely on a batch processing mechanism forever. Once we have loaded our existing data, we'll have our crawlers load additional log data as they do their work. We'll see this data ingested into Elasticsearch and reflected in our index in near real time.
In this document we'll be loading log data from our web crawlers. We run web crawls for several clients; in this case we'll be processing the log files that cover a good-sized data set: over 250 million web pages each crawler run. This type of data provides a good example set; there are several fields that we can use to provide aggregate statistics, and there will be many different ways people will want to slice and dice this data set for analysis. Typical log entries look something like the following:
Table 1: Sample Crawl Log Entry
Date Time URL Response Index Status
2011-06-25 21:33:03 http://www.acme.com/ 200 New
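To make the transformation concrete, an entry like the one above can be split into its component fields. The sketch below is a hypothetical Python illustration (the paper's actual ETL code is written in Clojure later on); the field names match the mapping described later in this paper.

```python
from urllib.parse import urlparse

def parse_log_line(line):
    """Split a whitespace-delimited crawl log entry into its fields.
    Assumes the simple five-column layout shown in Table 1."""
    date, time, url, response, status = line.split()
    return {
        "host": urlparse(url).netloc,    # server portion of the URL
        "url": url,
        "timestamp": date + "T" + time,  # combined date and time
        "response": response,
        "status": status,
    }

entry = parse_log_line("2011-06-25 21:33:03 http://www.acme.com/ 200 New")
```

A real parser would also need to guard against malformed lines; we touch on that error handling when we discuss the Hadoop job below.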
For our initial load of data we'll be collecting these log files from the individual crawler machines and consolidating them onto our Hadoop cluster for processing (approximately 115GB of raw data per crawler run).
Extracting, Transforming and Loading Data
The last component of our solution involves extracting the data we want from our log files (in our case, pretty much all of it), transforming that data into a format our data store can handle (JSON for Elasticsearch), and then loading this data into our data store. This process of extracting, transforming and loading is commonly referred to as the ETL process or cycle.
Once again we'll be leveraging Hadoop in order to spread the processing of the log data over all of our machines; we want to get the job done as quickly as possible. While processing
data in batches isn't something we want to do often, when it's unavoidable Hadoop provides a great way to perform these types of tasks. As Hadoop processes our data, it will write this data to a RabbitMQ message queue. At the same time, Elasticsearch will consume this data and add it to our index.
After our initial load of log data, we'll want additional information to be loaded as soon as it's produced. We have full control over our crawlers (we use the open-source Heritrix product), so altering the crawler to log this information directly to our message queue will be a painless configuration change. For products that don't provide that level of customization, other solutions exist. We have had reasonable success with both logstash and Flume.
It's important to decouple the process that extracts and transforms the source data from the process that performs the indexing. Certainly we could write directly to Elasticsearch from our ETL process; our primary issue with that layout is that the speed at which we can extract and transform our data is tied directly to the speed of our indexer. When we're processing large amounts of data we want to complete that process as quickly as possible. The message queue will effectively sever our ETL process from our indexing process, letting each run at its own speed.
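The decoupling described above can be sketched with an in-memory queue standing in for RabbitMQ. This is a hypothetical Python illustration, not our production setup; the point is simply that the producer (ETL) and consumer (indexer) run at independent speeds, connected only by the queue.

```python
import queue
import threading

buffer = queue.Queue()  # stands in for the RabbitMQ queue
indexed = []            # stands in for the Elasticsearch index

def etl_producer():
    # The ETL side writes as fast as it can, regardless of the indexer.
    for i in range(100):
        buffer.put({"doc": i})
    buffer.put(None)  # sentinel: no more data

def index_consumer():
    # The indexing side drains the queue at its own pace.
    while True:
        item = buffer.get()
        if item is None:
            break
        indexed.append(item)

producer = threading.Thread(target=etl_producer)
consumer = threading.Thread(target=index_consumer)
producer.start(); consumer.start()
producer.join(); consumer.join()
```

If the consumer is halted, items simply accumulate in the queue, which is exactly the behavior we rely on during cluster restarts.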
[Figure 1: Message Queue Separates ETL Process from Indexing — the Hadoop map/reduce process feeds the message queue, which in turn feeds the Elasticsearch index]
Another benefit of breaking these two processes apart is that we can halt one without impacting the other. If we needed to bring our Elasticsearch cluster offline for some reason, data that is being loaded will simply accumulate in the message queue. When the cluster comes back online and indexing resumes, Elasticsearch will simply consume messages off the queue until it catches up.
In this paper we'll be writing our Hadoop ETL process in Clojure, a Lisp dialect that runs on the Java Virtual Machine (JVM). While we are very comfortable implementing solutions directly in Java, we have found that scripting languages like Clojure let us get our solutions off the ground much faster. Hadoop supports a wide variety of languages via the streaming utility, including both Ruby and Python; you'll want to choose the language that is the best fit for your environment.
Providing a tutorial on installing and configuring Hadoop is outside the scope of this document. Rather, we will be concentrating on configuring Elasticsearch, loading our data with Hadoop and then querying Elasticsearch to get the aggregate data we need for our reporting application. There are many resources available on the internet that cover the installation and configuration of Hadoop, including the Hadoop project itself and Cloudera.
The RabbitMQ project provides great documentation for their product. In addition to detailed installation instructions for the major platforms, they also provide information specific to Amazon's EC2 environment. We recommend setting up a clustered environment; this will make it much easier to add and remove nodes from your cluster and ensure that data is never lost.
Configuring Elasticsearch
Before we start indexing our data, let's go over some of the more important configuration options for our Elasticsearch cluster. While we certainly can go far with the out-of-the-box settings, there are a couple that bear a closer look. We will not examine every single setting; Elasticsearch is a very flexible product that can be molded to fit a great many use cases. Rather, we'll be looking at the ones that we need to think about in order to get up and running.
Index Sharding and Cluster Nodes
By default Elasticsearch will break our index into a set of five shards; each shard represents a chunk of our index that we can potentially migrate to another machine in our cluster. In addition, we can choose to have one or more backup copies of each shard; this is an important feature, as it enables us to remove a machine (or node) from our cluster without being concerned about losing any data. By default Elasticsearch will configure one copy (or replica) of each shard in our index.
Deciding how many shards we'd like to comprise our index represents all of the configuration we need in order to have a resilient index that is easy to scale. Elasticsearch itself will decide which shards live on which servers in order to provide a responsive cluster that can handle the loss of one or more nodes (depending on the size of the cluster). When only one node is available, all of our shards will reside on that one node (obviously this is our least fault-tolerant configuration). As nodes are added to our cluster, Elasticsearch will move both active and replica shards out to these new nodes. Constrained by the size of the cluster, it will do its best to ensure that no one node serves both the active copy of a shard and that shard's replica. When a node is removed from our cluster, Elasticsearch will nominate the missing shards' replicas as the new active shards and will create new replica shards if necessary. We have another paper available
that demonstrates this functionality in a development environment.
It's important to take a moment and think about how large your data set may grow and how many nodes you may eventually add to your cluster. The default settings of five shards with one replica per shard will let you scale your cluster up to ten nodes, each node containing one shard; five nodes will be hosting active shards while the other five host the replicas.
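The arithmetic behind that ceiling is simple: the total number of shard copies is the shard count multiplied by one plus the replica count, and that total is the largest number of nodes that can each usefully host a shard. A quick sketch:

```python
def max_useful_nodes(shards, replicas_per_shard):
    # Each shard has one active copy plus its replicas; once every
    # node hosts exactly one copy, adding nodes gains nothing.
    return shards * (1 + replicas_per_shard)

print(max_useful_nodes(5, 1))   # the Elasticsearch defaults
```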
We will be using Amazon EC2's high-memory, double extra large instances; each instance will have 34.2GB of memory. Elasticsearch isn't the only thing we'll be running on these machines, so we've allocated 25GB of RAM to Elasticsearch itself, leaving the remainder available for other uses. With five machines, that will provide us with 125GB of RAM available solely for managing our index. Certain Elasticsearch operations (for instance faceting and sorting) will require loading all of the data in the result set into RAM. For instance, if we wanted to facet on the server crawled across the entire data set, Elasticsearch will load all of the data in our host field into memory and then calculate the aggregate values. Let me be clear that Elasticsearch will only load data from the requested result set into memory. If we wanted to facet on the server for all requests occurring in the month of June, 2011, then only the host field for transactions that fall in the month of June, 2011 will be loaded; a much smaller set of data that will require much less memory to be available.
We will be storing the logs from the last four crawler runs in our index, and we would like the freedom to facet on data across the entire result set. Our data set will be approximately one billion rows, or 460GB of raw data. Each row of data is composed of four distinct items (transaction date and time, the URL crawled, the source server's response and our indexer's response), so one fourth of our data set is 115GB; that in turn would mean that if we were to facet on a field across our entire data set, each node in our five-machine cluster might need to load 23GB of data into RAM.
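Those back-of-the-envelope figures can be reproduced as follows. This is only a rough sketch using the sizes quoted above; real memory use depends on field sizes and indexing overhead.

```python
raw_data_gb = 460        # four crawler runs of log data
fields_per_row = 4       # date/time, URL, server response, index status
nodes = 5

one_field_gb = raw_data_gb / fields_per_row  # data volume of one field
per_node_gb = one_field_gb / nodes           # loaded per node for a facet

print(one_field_gb, per_node_gb)
```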
These numbers are conservative approximations; there are many factors that they do not take into account (the size of each field of data, how much larger a field may become when indexed, etc.). Still, they provide us with some numbers that we can wrap our heads around. Our five-shard cluster with five nodes looks like it will meet our needs, provided we do not facet on any fields that have been tokenized by our indexer. In this scenario we also cannot grow our index capacity further (we can add five more nodes, but they will only host replica shards, increasing performance but not the maximum index size). Instead we'll configure Elasticsearch to use ten shards (with one replica each) even though we currently plan on running a five-node cluster. With this configuration we will always have the ability to add more nodes hosting replica shards to increase our query performance and more nodes hosting primary shards to increase the maximum capacity of our index.
Once we have begun adding data to our index we will no longer have the option to add or remove shards. The number of shards that we choose now will stay with the index over its entire lifespan. This means that if you decide you want to increase the number of shards after you have added data to your index, you will need to create an entirely new index and migrate the data from the old to the new. A shard represents a piece (or partition) of our index; as more
nodes are added, shards are balanced across these machines. Once we have added enough nodes that each one hosts only one shard, we will have reached the maximum size of our cluster.
These settings are configured in the Elasticsearch configuration file, elasticsearch.yml. While you may configure these settings on a per-index basis, it is our recommendation that the settings in the configuration file reflect the largest index you are willing to support. This will prevent someone in your organization from accidentally creating an index that cannot be scaled to take advantage of your cluster's full capacity.
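For reference, the relevant lines in elasticsearch.yml would look something like the following (a sketch matching the plan above: ten shards with one replica each):

```yaml
index.number_of_shards: 10
index.number_of_replicas: 1
```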
The amount of memory allocated to Elasticsearch can be set in the elasticsearch.in.sh file. A sample is provided with the Elasticsearch installation; you can copy the file to one of the suggested locations and customize it there.
Ensuring Safe Cluster Restarts with the Gateway Feature
Elasticsearch provides a gateway function that will store the state of your indexes and cluster between full restarts (that is, restarts that involve taking every node offline). As your index and cluster configuration change, Elasticsearch will write these changes to the gateway. After a restart, each node will look to the gateway to restore its state. By default Elasticsearch will use the local gateway, storing this information on the local file system of each node. Unfortunately the local gateway disables caching index data in memory, so it's best to use a gateway that leverages a shared file system.
Since we are already using Hadoop to load our data into our index, we will be using the Hadoop gateway. Note that there are other options available; for instance, the Amazon S3 gateway also presents an attractive option for our setup. You will need to configure these settings in the elasticsearch.yml file.
Once again, this is a setting that will stay with your index over its entire lifespan. If you decide to change your gateway implementation after adding data to your index, you'll need to configure a new index with the new settings and then migrate your data from the old to the new.
Elasticsearch Head
While Elasticsearch does not come with a web-based interface out-of-the-box, the Elasticsearch Head project provides this functionality. It can be used to quickly gauge the status of our cluster, and detailed information on each index and node is also kept within easy reach. We can browse through our index or even run queries through the provided interface.
Figure 2: Sample Screenshot of Elasticsearch Head
Installation of the plugin is painless; we recommend installing Elasticsearch Head on every node. In this configuration any active node may be used to access the web-based interface.
RabbitMQ River
Elasticsearch provides processes that run inside of the Elasticsearch cluster and may pull (or be pushed) data that will then be added to an index; such a process is called a river. We'll be using the RabbitMQ river; the project also provides several other implementations, including one for CouchDB. You may have as many of these processes as needed. In practice, one node of your cluster will be assigned the task of running and monitoring a particular river; only one instance of each river will run at a time. When the node responsible for a particular river is taken offline, another node will be assigned that river and will take over processing.
Creating Our Index
With Elasticsearch up and running, the only thing left is to create our index. In line with Elasticsearch's philosophy of reasonable defaults, we could simply skip this step altogether; a new index will be created the first time we attempt to use it to store data. Elasticsearch will attempt to figure out what data type is in each field of our document by looking at the JSON data and, in some cases, it will attempt to parse out the data in text fields. For instance, when a new text field is detected, Elasticsearch will attempt to parse that field as a date and, if it succeeds, that field will be treated as a date value (the raw text associated with the JSON document will also be retained). This behavior is configurable; the documentation goes into greater detail.
We will be providing a mapping for our index at creation time. We know what our data will look like, and creating the mapping is not so arduous. We'll use the command-line tool
curl for illustration purposes. Depending on your scenario, it may make sense to have your application create missing indexes when it attempts to load in data.
curl -XPUT http://node1.tnrglobal.com:9200/crawl_log/
We're going to be storing each transaction from our log files (that is, every line from every log file) in our index. The command below will create a mapping that tells Elasticsearch what type of data will be in our transaction documents and how to treat that data. While we are providing a schema for our data, this is not required; Elasticsearch does not demand one, and if we provided documents with more or fewer fields Elasticsearch will not complain.
curl -XPOST http://node1.tnrglobal.com:9200/crawl_log -d '{
  "mappings" : {
    "transaction" : {
      "properties" : {
        "host" : {"type" : "string"},
        "host_raw" : {"type" : "string", "index" : "not_analyzed"},
        "url" : {"type" : "string"},
        "url_hash" : {"type" : "string", "index" : "not_analyzed"},
        "timestamp" : {"type" : "date"},
        "response" : {"type" : "string"},
        "status" : {"type" : "string"}
      }
    }
  }
}'
The host field will contain only the server host name portion of a transaction's URL. In addition to indexing the name of the host, we're also storing the same data in a "not analyzed" form; this means that the host_raw field will be searchable but it won't be processed in any other way (i.e. tokenized). We will be able to search the host_raw field for exact host name matches (i.e. "www.acme.com") and use the host field for bits and pieces of host names (i.e. "www" and "acme"). In addition, if we were to facet over the host_raw field we could categorize each transaction by its host without incurring the overhead of loading tokenized field data into memory.
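To illustrate the difference, a simple tokenizer splits a host name into separate terms, roughly the way the analyzed host field is treated. This is a hypothetical Python sketch; Elasticsearch's actual analyzers are more sophisticated.

```python
import re

def simple_tokenize(text):
    # Lowercase and split on non-alphanumeric characters, roughly
    # what a standard analyzer does to a host name. The not_analyzed
    # host_raw field would instead keep the whole string as one term.
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

tokens = simple_tokenize("www.acme.com")
```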
While we index the url field from the log file, we're also storing a hash of that field. One of the problems with the url field is that we don't have a great deal of control over what it looks like. Some URLs are very lengthy, and that could cause problems if we were to use them when constructing links in our reporting application. For this reason we store a hash of each URL as well; if we need to generate a link for a URL or if we need to fetch transactions for a specific URL, we can use the hash instead of the actual URL's text.
The only other interesting field is the timestamp field; this will contain the date and time the transaction was logged. When we load in our data we'll be combining the date and time fields from our log entries into this one field. When we provide data for Elasticsearch to index,
it's going to try to parse the timestamp value we provide using the Joda Time "date with optional time" format. This will handle most sanely formatted dates; we'll be formatting our date like so: yyyy-MM-ddTHH:mm:ss.
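Combining the two log columns into that format is straightforward. A Python sketch for illustration (our actual ETL code does this in Clojure):

```python
from datetime import datetime

def combine_timestamp(date_str, time_str):
    # Parse the separate date and time columns, then emit the
    # yyyy-MM-ddTHH:mm:ss format our mapping expects.
    dt = datetime.strptime(date_str + " " + time_str, "%Y-%m-%d %H:%M:%S")
    return dt.strftime("%Y-%m-%dT%H:%M:%S")

ts = combine_timestamp("2011-06-25", "21:33:03")
```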
Lastly, we create the Elasticsearch process that will monitor our RabbitMQ message queue and add new entries to our index. Again, we're using curl for illustration purposes; another approach may make more sense depending on the scenario.
curl -XPUT http://node1.tnrglobal.com:9200/_river/crawl_queue/_meta -d '{
  "type" : "rabbitmq",
  "rabbitmq" : {
    "host" : "localhost",
    "port" : 5672,
    "user" : "guest",
    "pass" : "guest",
    "vhost" : "/",
    "queue" : "index",
    "routing_key" : "index",
    "exchange_type" : "direct",
    "exchange_durable" : true,
    "queue_durable" : true,
    "queue_auto_delete" : false
  },
  "index" : {
    "bulk_size" : 100,
    "bulk_timeout" : "10ms",
    "ordered" : false
  }
}'
Be sure to create the appropriate queue on the RabbitMQ side as well. From there you can
decide if you want to mirror the queue across the cluster, etc.
Extract, Transform and Load (ETL)
With our index up and running, we're ready to start loading in data! Our next step will be to create a Hadoop map/reduce job that will read through our source log files and write that data to our RabbitMQ message queue. As items are written to this queue, Elasticsearch will consume those items and add them to our index.
We have found that many organizations that aren't already using a framework (like Hadoop) that leverages the map/reduce approach to data processing tend to be reluctant to use these tools. While map/reduce may be counter-intuitive, we do feel that it's much easier to grasp with a straightforward example. To that end we are going to go into greater detail on how we use this approach to process our log data.
To give you an idea of what our JSON documents will look like, the command below will
insert one transaction into our index.
curl -XPUT node1.tnrglobal.com:9200/crawl_log/transaction/9f419923fe473293b3e2da8a3ead0797 -d '{
  "host" : "jobs.telegraph.co.uk",
  "host_raw" : "jobs.telegraph.co.uk",
  "url" : "http://jobs.telegraph.co.uk/search-results-rss.aspx?discipline=33",
  "url_hash" : "2f5960385dc2bcb15a4fbd9898114b3e",
  "timestamp" : "2012-04-18T17:27:12",
  "response" : "200",
  "status" : "new"
}'
We're using a hash of the URL combined with the date and time of the transaction to generate a unique ID. We do this so that if we have a problem during indexing and need to re-load a particular file, we won't end up with more than one entry for any particular log entry.
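One way to build such an ID is sketched below. Note that this is an assumption for illustration: we use MD5 over the concatenated URL and timestamp, but the exact hash function and combination scheme are implementation details you can choose; the key property is that the same log line always yields the same document ID.

```python
import hashlib

def transaction_id(url, timestamp):
    # Hash the URL together with the transaction's date and time so
    # that re-loading the same log line produces the same document ID,
    # letting Elasticsearch overwrite rather than duplicate it.
    return hashlib.md5((url + timestamp).encode("utf-8")).hexdigest()

doc_id = transaction_id(
    "http://jobs.telegraph.co.uk/search-results-rss.aspx?discipline=33",
    "2012-04-18T17:27:12")
```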
We'll be using Clojure (a Lisp that runs on the JVM) to implement our mapper and reducer tasks. While we are comfortable using Java for these tasks, we find we can get the same job completed much faster when we use a scripting language like Clojure. The Hadoop Streaming API enables you to use almost any language you like.
Project Setup
Writing a Hadoop job is not as much of a chore as you might think. Clojure projects typically use the Leiningen tool to manage the build process and dependent libraries. After installing Leiningen (also a straightforward task) we ask it to create a new project for us.
$ lein new crawl-log-loader
A new project will be created. We can then replace the project definition file (project.clj) with the text below.
(defproject crawl-log-loader "1.0"
  :description "Load Crawler Log data"
  :dependencies [[org.clojure/clojure "1.3.0"]
                 [clojure-hadoop "1.4.1"]
                 [com.mefesto/wabbitmq "0.2.0"]
                 [org.clojure/tools.logging "0.2.3"]
                 [org.clojure/tools.cli "0.2.1"]]
  :dev-dependencies [[org.codehaus.jackson/jackson-mapper-asl "1.9.2"]
                     [org.slf4j/jcl104-over-slf4j "1.4.3"]
                     [org.slf4j/slf4j-log4j12 "1.4.3"]]
  :main crawl-log-loader.core)
The Clojure Hadoop library provides the tools we need to implement our job, while WabbitMQ is used to communicate with our RabbitMQ server. We include the Clojure logging
and command-line utilities to make our application easier to manage; the remaining three libraries are required for any Hadoop job.
Next we open up our main project file and import the libraries that we'll be using to write our job.
(ns crawl-log-loader.core
(:use [clojure.tools.logging]
[clojure.tools.cli]
[com.mefesto.wabbitmq])
(:require [clojure.string :as string]
[clojure-hadoop.gen :as gen]
[clojure-hadoop.imports :as imp])
(:import [org.apache.hadoop.util Tool]
[java.net URL]
[org.apache.commons.logging LogFactory]
[org.apache.commons.logging Log]))
We import the logging library in order to log our progress, followed by WabbitMQ. The command-line tools are used to parse the arguments passed in from the command line, specifically the location of the source files to process. Next we import the Clojure Hadoop tools and the Java URL library for parsing the logged data. Lastly, we use the Apache Log4J library to perform the actual logging.
Now we implement our job. We're going to do this a little backwards in order to keep things easy to follow. First we'll set up our Tool, the object that will represent our job as a whole. We'll then code up our mapper, reducer and a couple of support functions that we'll need to parse our log data and write to our RabbitMQ queue.
Tool
Our Tool will handle our command line arguments and then set up the rest of our job.
(defn tool-run
  "Provides the main function needed to bootstrap the Hadoop application."
  [^Tool this args-in]

  ;; define our command line flags and parse out the provided
  ;; arguments
  (let [[options args banner]
        (cli args-in
             ["-h" "--help"
              "Show usage information" :default false :flag true]
             ["-p" "--path" "HDFS path of data to consume"]
             ["-o" "--output" "HDFS path for the output report"])]

    ;; display the help message
    (if (:help options)
      (do (println banner) 0)

      ;; setup and run our job
      (do
        (doto (Job.)
          (.setJarByClass (.getClass this))
          (.setJobName "crawl-log-load")
          (.setOutputKeyClass Text)
          (.setOutputValueClass LongWritable)
          (.setMapperClass (Class/forName "crawl-log-loader.core_mapper"))
          (.setReducerClass (Class/forName "crawl-log-loader.core_reducer"))
          (.setInputFormatClass TextInputFormat)
          (.setOutputFormatClass TextOutputFormat)
          (FileInputFormat/setInputPaths (:path options))
          (FileOutputFormat/setOutputPath (Path. (:output options)))
          (.waitForCompletion true))
        0))))
Parsing command line arguments is a dull but important function of our Tool. As you can see above, we use the CLI library both to set up our arguments and to parse those arguments out into a hash-map. More information on how this library works can be found on the project's web page.
Hadoop wants our application to return a status code that indicates either healthy completion of the job or that we exited under an error condition. We return 0 to indicate that our job exited normally. For this job it's more important to plow through our data set; any log entries that give us a problem will be logged, and then we'll move on and try the next entry. We'd rather discard some data than halt a job that's been running for a couple of hours.
If our application isn't invoked with the -h or --help flag, we set up the Hadoop job. We create a new Job object and set several fields. Note that we set the output key and value class.
The main purpose of this job is to load data into RabbitMQ (and by extension Elasticsearch), but we'll also output the number of transactions per host. This could be used for any number of things; we'll be using it to double-check the data that ends up in our index.
We then set the mapper and reducer classes; we'll write those up next. We set the input and output formats; the TextInputFormat reads plain text files line-by-line, a good fit for our log input. The TextOutputFormat writes out plain text files.
Mapper
Next we code up our mapping function. Even though we're covering this after the code for the
Tool, this would appear above the Tool in the actual source code.
(defn mapper-map
  "Provides the function for handling our map task. We parse the data,
  post it to our queue and then write out the host and 1. This output
  is used to provide a summary report that details the number of URLs
  logged per host."
  [this key value ^MapContext context]

  ;; parse the data
  (let [parsed-data (parse-data value)]

    ;; post the data to our queue
    (post-log-data parsed-data)

    ;; write our counter for our reduce task
    (.write context
            (Text. (:host parsed-data))
            (LongWritable. 1))))
This is straightforward: each line of our source log files is provided to our mapping function
in the form of a key (the line number from which it was read) and a value (the entry in
the log file). For our purposes the key is not a useful value and can be safely ignored. The
entry in the log file, on the other hand, represents our data, and we parse it out into its
components with the parse-data function.
Next, we pass our parsed log data over to our post-log-data function. This function will then
write the log entry to our RabbitMQ queue where it can be read and loaded by Elasticsearch.
Lastly we output the host component of the crawled URL along with the number 1. Later
on in the process, our reduce function will add together all of the values (that is, the 1s) for
each URL host; this will comprise our final report detailing how many URLs were crawled for
each host.
Reducer
Next we code up our reducer function.
(defn reducer-reduce
  "Provides the function for our reduce task. We sum the values for
  each host yielding the number of URLs logged per host."
  [this key values ^ReduceContext context]

  ;; sum the values for each host
  (let [sum (reduce + (map (fn [^LongWritable value]
                             (.get value))
                           values))]

    ;; write out the total
    (.write context key (LongWritable. sum))))
This function is also very simple. We map over the incoming values (Hadoop wraps these
values in LongWritable instances), pull out the actual numbers and reduce those values (by
adding them together) into our final sum. We write out the key (the name of the host) and the
sum (the total number of transactions for this host).
These two functions clearly illustrate the components of the map/reduce cycle. One function
maps over all of the incoming data and produces values that are later reduced by the other
function. Because these functions are stateless, neither needs any knowledge of the other. This
is what lets us spread the processing of this data over all of the nodes in our cluster without
having to spend a lot of time and processing power communicating between them.
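Stripped of the Hadoop machinery, the shape of this cycle can be sketched in a few lines of Python; the field positions mirror our log layout, but the helper names here are our own:

```python
from collections import defaultdict

def map_entry(line):
    """Emit a (host, 1) pair for one log line, mirroring mapper-map."""
    url = line.split()[3]        # the fourth field is the URL
    host = url.split("/")[2]     # crude host extraction from the URL
    return (host, 1)

def reduce_counts(pairs):
    """Sum the values for each host, mirroring reducer-reduce."""
    totals = defaultdict(int)
    for host, count in pairs:
        totals[host] += count
    return dict(totals)

lines = [
    "2012-01-01-00:00:00 200 ok http://example.com/a",
    "2012-01-01-00:00:01 200 ok http://example.com/b",
    "2012-01-01-00:00:02 404 err http://other.org/c",
]
report = reduce_counts(map(map_entry, lines))
# report == {"example.com": 2, "other.org": 1}
```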
https://en.wikipedia.org/wiki/Map_reduce
Support Functions
We called two subordinate functions in our mapping function; these are detailed here. The first
parses out the individual fields of data from our log file entries.
(defn parse-data
  "Parses a String representing a row of data from a crawler log entry
  into a hash-map of data."
  [text]

  ;; parse out the row of data by splitting on whitespace
  (let [data (string/split (str text) #"\s+")]
    (if (< 3 (count data))
      (cond
        (not (= 19 (count (nth data 0))))
        (warn (str "Invalid timestamp for logged row: \"" text "\""))

        (not (< 0 (count (nth data 1))))
        (warn (str "Missing response for logged row: \"" text "\""))

        (not (< 0 (count (nth data 2))))
        (warn (str "Missing status for logged row: \"" text "\""))

        (not (< 0 (count (nth data 3))))
        (warn (str "Missing URL for logged row: \"" text "\""))

        :else
        (try
          ;; parse out our host
          (let [host (parse-host (nth data 3))]

            ;; parse out the date and time
            (try
              (let [timestamp (.parse (SimpleDateFormat. "yyyy-MM-dd-HH:mm:ss")
                                      (nth data 0))]
                {:timestamp (.format (SimpleDateFormat. "yyyy-MM-dd'T'HH:mm:ss")
                                     timestamp)
                 :response (nth data 1)
                 :status (nth data 2)
                 :url (nth data 3)
                 :host host})
              (catch Exception exception
                (warn (str "Couldn't parse timestamp from logged row: \"" text "\"")))))
          (catch Exception exception
            (warn (str "Couldn't parse URL for logged row: \"" text "\"")))))
      (warn (str "Couldn't parse data from logged row: \"" text "\"")))))
First we split our log entry into fields by breaking it apart when we see white space. We
double-check to make sure that we have at least the four items of data that we need to continue.
Next we check each piece of data and verify that it's at least present. We then attempt to parse
the date and time of our entry and assemble a data structure that represents this log entry.
Our other function, post-log-data, simply writes this data to our message queue using
the Elasticsearch bulk data format. The Elasticsearch river will then read these messages and
apply them to our index.
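The bulk format itself pairs an action line (naming the target index and document type) with the document source, one JSON object per line. Here is a minimal Python sketch of rendering a single parsed entry; the index and type names are illustrative:

```python
import json

def bulk_index_message(entry, index="crawl_log", doc_type="transaction"):
    """Render one parsed log entry as an Elasticsearch bulk-format string."""
    action = {"index": {"_index": index, "_type": doc_type}}
    return json.dumps(action) + "\n" + json.dumps(entry) + "\n"

message = bulk_index_message({
    "timestamp": "2012-01-01T00:00:00",
    "response": "200",
    "status": "ok",
    "url": "http://example.com/a",
    "host": "example.com",
})
```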
With our job coded up, we can tell Leiningen to build our project. We want all of the
dependent libraries to be included in one complete JAR file in order to keep deployment to our
Hadoop cluster easy.
$ lein uberjar
Once complete, our JAR file will be created; we can then copy this over to our Hadoop
cluster and run our job. We'll tell it where to find our log files on the cluster by passing in the
path with the -p flag and where to write out the completed report with the -o flag.
As our job runs we can monitor how quickly Hadoop is processing our data as well as how
fast Elasticsearch is indexing that data by inspecting the RabbitMQ web-based administration
interface. We can check on the indexing side of the process and view the data loaded into
Elasticsearch in near real time through Elasticsearch Head.
Indexing Log Data in Near Real Time
Once our initial load of historical data has been completed, we don't want to rely on a
recurring batch process in order to keep our index up-to-date. Instead we'll alter our web
crawlers to write this data directly to our message queue as crawls are conducted. Once again
we will reap the benefit of isolating our processing function from our indexing function; the
speed of our crawler will not be impacted by the speed of our indexer. The crawlers will write
messages to the queue at their own speed; Elasticsearch will read and process these messages
at its own pace.
You may not have the option to alter the application that is producing your log data. Both
logstash and Flume provide solutions here; they can handle a great many products
and use-cases. You can delegate to one of these tools the task of writing your data to the
message queue as it arrives.
Analytics and Interactive Reports
With our data loaded and indexed we can now move on to querying that data and, eventually,
providing an interactive reporting tool for our customers. Elasticsearch's strong and flexible
http://www.elasticsearch.org/guide/reference/api/bulk.html
http://www.rabbitmq.com/management.html
http://logstash.net/docs/1.1.0/outputs/elasticsearch_river
https://github.com/kenshoo/flume-rabbitmq-sink
faceting support makes this an easy task.
Queries and Faceting
First, we'll author a query that will provide us with the total number of transactions for each type
of web server response the crawler encountered. For instance, pages that were loaded without
issue will have a 200 code; pages that were missing or had an incorrect URL might have a
404 code.
curl -X POST "http://node1.tnrglobal.com:9200/crawl_log/_search?pretty=true" -d '
{
  "query" : { "query_string" : { "query" : "+_type:transaction" } },
  "facets" : {
    "responses" : { "terms" : { "field" : "response" } }
  },
  "size" : 0
}
'
The query is in two parts. The first asks Elasticsearch to return all of the documents in our
index that are of the type transaction; this is, in fact, every document in our data set. Next
we provide faceting instructions: we request one set of facets called responses (we can call
each set of facets whatever we like). We then tell Elasticsearch that we'd like a term facet over
the document field response.
A term facet returns the number of items that match each term in the provided field.
By default the returned data will be ordered by the number of documents; that is, the term
with the most matching documents will be the first in the result set. Terms with zero matching
documents will not be returned at all (unless the all_terms parameter is set).
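The bookkeeping behind a term facet is essentially a per-term tally; in Python terms, the same counts over a handful of response codes could be computed like this:

```python
from collections import Counter

responses = ["200", "200", "404", "200", "302", "404"]

# count documents per term, most frequent first -- the default
# ordering of a terms facet
facet = Counter(responses).most_common()
# facet == [("200", 3), ("404", 2), ("302", 1)]
```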
The data returned by Elasticsearch is listed below; we've elided the information in the middle
to save space.
http://www.elasticsearch.org/guide/reference/api/search/facets/index.html
https://en.wikipedia.org/wiki/List_of_HTTP_status_codes
http://www.elasticsearch.org/guide/reference/api/search/facets/terms-facet.html
{
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 254881619,
    "max_score" : 1.0,
    "hits" : [ ]
  },
  "facets" : {
    "responses" : {
      "_type" : "terms",
      "missing" : 0,
      "total" : 254881619,
      "other" : 874096,
      "terms" : [ {
        "term" : "200",
        "count" : 208331697
      }, {
        "term" : "302",
        "count" : 21151858
      }, {
        "term" : "304",
        "count" : 6795804
      },
      ...
      {
        "term" : "500",
        "count" : 779267
      }, {
        "term" : "403",
        "count" : 330583
      } ]
    }
  }
}
Elasticsearch tells us the number of shards that responded to our request and whether any failed
to respond (if enough nodes were offline we would receive partial results). The hits stanza
details the number of documents that matched our query and the score that represents the best
match.
Our faceting data is returned in the last stanza. In the terms value, we can see each term
and the number of documents that match that term. This is the kind of aggregate data that we
will use for our reporting tool; the data above might fit best in a pie chart, for instance.
In addition to the term facet, Elasticsearch provides several other types of facet queries.
The range facet lets you specify a set of ranges; Elasticsearch will return both the number
of documents that fall within each range as well as aggregated data based on a field.
The histogram and date histogram facets return data across intervals. For instance, using
the date histogram facet you can facet across a date field and return documents grouped
by an interval of time, perhaps by day, week or month.
Query and filter facets provide the number of documents that match a particular query.
Statistical facets allow you to compute statistical data based on the value of a numeric
field. This includes computations like total, sum, sum of squares, average, mean, minimum,
maximum, etc.
Elasticsearch provides a terms_stats facet that allows you to combine a term facet with a
statistical facet, faceting over a set of terms in one field while also calculating a statistical
value on another.
A geo_distance facet can be used to calculate ranges of distances from a provided geographical
location as well as aggregated information (like a total).
We won't detail the use of every type of facet; the Elasticsearch documentation provides
examples for each one. However, we will provide an example of using the date histogram as
it is a good match for our problem space.
The query below will display the number of pages crawled by day.
curl -X POST "http://node1.tnrglobal.com:9200/crawl_log/_search?pretty=true" -d '{
  "query" : { "query_string" : { "query" : "+_type:transaction" } },
  "size" : 0,
  "facets" : {
    "histogram" : {
      "date_histogram" : { "field" : "timestamp", "interval" : "day" }
    }
  }
}'
Elasticsearch will fetch all of our documents and then bucket them by the day of their
timestamps. The data returned will be in order by date, with the earliest date listed first.
http://www.elasticsearch.org/guide/reference/api/search/facets/range-facet.html
http://www.elasticsearch.org/guide/reference/api/search/facets/histogram-facet.html
http://www.elasticsearch.org/guide/reference/api/search/facets/date-histogram-facet.html
http://www.elasticsearch.org/guide/reference/api/search/facets/query-facet.html
http://www.elasticsearch.org/guide/reference/api/search/facets/filter-facet.html
http://www.elasticsearch.org/guide/reference/api/search/facets/statistical-facet.html
http://www.elasticsearch.org/guide/reference/api/search/facets/terms-stats-facet.html
http://www.elasticsearch.org/guide/reference/api/search/facets/geo-distance-facet.html
{
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 254881619,
    "max_score" : 1.0,
    "hits" : [ ]
  },
  "facets" : {
    "histogram" : {
      "_type" : "date_histogram",
      "entries" : [ {
        "time" : 1308873600000,
        "count" : 111658
      }, {
        "time" : 1308960000000,
        "count" : 10361119
      }, {
        "time" : 1309046400000,
        "count" : 8331424
      }, {
        "time" : 1309132800000,
        "count" : 5845844
      }, {
        "time" : 1309219200000,
        "count" : 4863707
      },
      ...
      ]
    }
  }
}
The time field provided in the returned data contains the time as UTC milliseconds since
the epoch (similar to a POSIX time value but with millisecond granularity).
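Converting such a value back into a readable date takes one division, as this Python snippet shows for the first bucket above:

```python
from datetime import datetime, timezone

millis = 1308873600000  # the "time" of the first bucket above
when = datetime.fromtimestamp(millis / 1000, tz=timezone.utc)
print(when.strftime("%Y-%m-%d %H:%M"))  # -> 2011-06-24 00:00
```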
We've been concentrating on facet queries as these provide one of the main tools you'll need
when creating a reporting tool. Elasticsearch is also an accomplished indexing tool and
provides a wide range of queries built around its own query language. Under the hood,
Elasticsearch leverages the powerful Apache Lucene search engine and provides access to
all of the features that Lucene provides.
https://en.wikipedia.org/wiki/Unix_time
http://www.elasticsearch.org/guide/reference/query-dsl/
http://lucene.apache.org/core/
Constructing an Interactive Reporting Tool
We won't provide a tutorial on how to create a reporting tool; everyone's needs will vary greatly,
as will the tools they might use to build such a solution. Typically you'll want to create a web-based
application using a lightweight framework, perhaps something like Flask for Python,
Compojure for Clojure or even something implemented in PHP. Whatever language you
have the most experience with will be the best choice.
Instead we'll provide an overview of an application that we've created to meet this need. We
ended up implementing our solution in Python with the Flask framework and used the Twitter
Bootstrap template to rapidly develop our solution. The pie chart depicting the response
codes encountered by the crawler is pictured below.
Figure 3: Response Chart
Using the date histogram facet, it was simple and painless to offer a bar chart depicting
the transactions by date.
Figure 4: Transactions by Day
As the customer drills down into data by host, we provide the more traditional table-based
detail. In addition to listing the transactions for the selected host, we can also provide the same
http://flask.pocoo.org/
https://github.com/weavejester/compojure
http://php.net/
http://twitter.github.com/bootstrap/
pie and bar charts by faceting on the results for the selected host. Pictured below is a traditional
table of matching items.
Figure 5: Host Specific Transaction Table
Below our table of transactions, we provide colorful charts that clearly illustrate what the
listed transactions have in common; in this case they highlight the fact that the vast majority of
URLs for this particular site were crawled without issue.
Figure 6: Host Specific Response Chart
Conclusion
It is clear that Elasticsearch can gracefully handle the large data sets that are typically associated
with Big Data. By planning ahead for your maximum estimated storage load you can make
some smart choices that will ensure your index can keep pace with your growing data set.
The faceting functions provided by Elasticsearch allow you to query and explore your data set
as you collect it, in stark contrast to tools like HBase where it's necessary to plan out what
aggregate values you'd like to track and then calculate these values at load time. Once your
initial loading of data is completed, you can then feed your data to your index as you collect
it, ensuring that your customers can make use of this information as quickly as possible.
This is not to say that there is no place for tools like HBase! The faceting features provided
by Elasticsearch are very flexible but cannot cover every possible scenario. They can, however,
allow for the wide range of ad-hoc querying that we all need in order to actively explore a large
data set.
TNR Global, LLC
245 Russell Street
Suite 10
Hadley, MA 01035
info@tnrglobal.com
For updates to information in this paper and
other search technologies we work with, click
here: http://www.tnrglobal.com/blog