Practice: Process logs with Apache Hadoop

30 May 2012
Logs are an essential part of any computing system, supporting capabilities from audits to
error management. As logs grow and the number of log sources increases (such as in cloud
environments), a scalable system is necessary to efficiently process logs. This practice session
explores processing logs with Apache Hadoop from a typical Linux system.
Logs come in all shapes, but as applications and infrastructures grow, the result is a massive
amount of distributed data that's useful to mine. From web and mail servers to kernel and boot
logs, modern servers hold a rich set of information. Massive amounts of distributed data are a
perfect application for Apache Hadoop, as are log files: time-ordered, structured textual data.
You can use log processing to extract a variety of information. One of its most common uses is to
extract errors or count the occurrence of some event within a system (such as login failures). You
can also extract some types of performance data, such as connections or transactions per second.
Other useful information includes the extraction (map) and construction of site visits (reduce) from
a web log. This analysis can also support detection of unique user visits in addition to file access
statistics.
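To make the error-extraction case concrete, here is a minimal sketch in the map/reduce style described above. The log format and the regular expression are illustrative assumptions (SSH-style "Failed password" lines), not taken from the exercises:

```python
import re
from collections import Counter

# Illustrative pattern for SSH login failures in auth.log-style lines.
FAIL_RE = re.compile(r"Failed password for (?:invalid user )?(\S+)")

def map_failures(lines):
    """Map step: emit (user, 1) for every failed-login line."""
    for line in lines:
        match = FAIL_RE.search(line)
        if match:
            yield match.group(1), 1

def reduce_counts(pairs):
    """Reduce step: sum the emitted counts per user."""
    totals = Counter()
    for user, count in pairs:
        totals[user] += count
    return dict(totals)

if __name__ == "__main__":
    sample = ["sshd[100]: Failed password for root from 192.0.2.1"]
    print(reduce_counts(map_failures(sample)))
```

The same two-phase shape (emit key/value pairs, then aggregate by key) carries over directly to counting site visits or unique users from a web log.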
Overview
About this article
You may want to read these articles before working through the exercises:
Prerequisites
To get the most from these exercises, you should have a basic working knowledge of Linux.
Some knowledge of virtual appliances is also useful for bringing up a simple environment.
Create a script that extracts all log lines that match the predefined criteria.
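The exercise's predefined criteria are not reproduced in this excerpt; as a minimal sketch, assuming the criterion can be expressed as a regular expression supplied on the command line, such a script might look like:

```python
import re
import sys

def extract_matching(lines, pattern):
    """Return only the log lines that match the given criterion (a regex)."""
    criterion = re.compile(pattern)
    return [line for line in lines if criterion.search(line)]

if __name__ == "__main__":
    # Usage: python extract.py PATTERN < logfile
    if len(sys.argv) > 1:
        for line in extract_matching(sys.stdin, sys.argv[1]):
            sys.stdout.write(line)
```

Because it reads stdin and writes stdout, the same script can later serve unchanged as a filtering mapper under Hadoop streaming.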
Exercise solutions
The specific output depends on your particular Hadoop installation and configuration.
[Truncated output: only the group column (supergroup) of the HDFS file listing survives in this excerpt.]
This example assumes that you performed the steps of Exercise 2 (to ingest data into HDFS).
Listing 3 provides the map application.
Listing 5 illustrates the process of invoking the Python MapReduce example in Hadoop.
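Listings 3 through 5 are not reproduced in this excerpt. As a hedged sketch of what a streaming-style map and reduce pair looks like: a streaming mapper reads raw lines on stdin and writes tab-separated key/value pairs, and the reducer receives those pairs sorted by key. The severity-count logic and all paths below are illustrative assumptions, not the article's actual listings:

```python
"""Hadoop-streaming-style map and reduce over log lines (a sketch only).

A typical streaming invocation (jar and HDFS paths are illustrative):
  hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
      -input /user/hue/logs -output /user/hue/output \
      -mapper map.py -reducer reduce.py -file map.py -file reduce.py
"""

def mapper(lines):
    """Emit 'key<TAB>1' for each log line, keyed on its first token."""
    for line in lines:
        fields = line.split()
        if fields:
            yield "%s\t1" % fields[0]

def reducer(pairs):
    """Sum counts per key; assumes input sorted by key, as Hadoop provides."""
    last_key, total = None, 0
    for pair in pairs:
        key, _, count = pair.partition("\t")
        if key != last_key and last_key is not None:
            yield "%s\t%d" % (last_key, total)
            total = 0
        last_key = key
        total += int(count)
    if last_key is not None:
        yield "%s\t%d" % (last_key, total)
```

In a real job the two functions live in separate scripts that loop over `sys.stdin`; Hadoop handles the shuffle and sort between them.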
Resources
Learn
Distributed computing with Linux and Hadoop (Ken Mann and M. Tim Jones,
developerWorks, December 2008): Discover Apache's Hadoop, a Linux-based software
framework that enables distributed manipulation of vast amounts of data, including parallel
indexing of internet web pages.
Distributed data processing with Hadoop, Part 1: Getting started (M. Tim Jones,
developerWorks, May 2010): Explore the Hadoop framework, including its fundamental
elements, such as the Hadoop file system (HDFS), common node types, and ways to monitor
and manage Hadoop using its core web interfaces. Learn to install and configure a single-node Hadoop cluster, and delve into the MapReduce application.
Distributed data processing with Hadoop, Part 2: Going further (M. Tim Jones,
developerWorks, June 2010): Configure a more advanced setup with Hadoop in a multi-node cluster for parallel processing. You'll work with MapReduce functionality in a parallel
environment and explore command line and web-based management aspects of Hadoop.
Distributed data processing with Hadoop, Part 3: Application development (M. Tim Jones,
developerWorks, July 2010): Explore the Hadoop APIs and data flow and learn to use them
with a simple mapper and reducer application.
Data processing with Apache Pig (M. Tim Jones, developerWorks, February 2012): Pigs
are known for rooting around and digging out anything they can consume. Apache Pig does
the same thing for big data. Learn more about this tool and how to put it to work in your
applications.
Writing a Hadoop MapReduce Program in Python (Michael G. Noll, updated October 2011,
published September 2007): Learn to write a simple MapReduce program for Hadoop in the
Python programming language in this tutorial.
IBM InfoSphere BigInsights Basic Edition offers a highly scalable and powerful analytics platform that can handle data throughput rates of millions of events or messages per second.
The Open Source developerWorks zone provides a wealth of information on open source
tools and using open source technologies.
developerWorks Web development specializes in articles covering various web-based
solutions.
Stay current with developerWorks technical events and webcasts focused on a variety of IBM
products and IT industry topics.
Attend a free developerWorks Live! briefing to get up-to-speed quickly on IBM products and
tools, as well as IT industry trends.
Watch developerWorks on-demand demos ranging from product installation and setup demos
for beginners, to advanced functionality for experienced developers.
Follow developerWorks on Twitter, or subscribe to a feed of Linux tweets on developerWorks.
Get products and technologies
Cloudera's Hadoop Demo VM (May 2012): Start using Apache Hadoop with a set of virtual machines that includes a Linux image and a preconfigured Hadoop instance.
Practice: Process logs with Apache Hadoop