
Performing Indexing Operations using Hadoop MapReduce and Terrier

Atharva Patel
atharva_patel@daiict.ac.in
DA-IICT

Sai Dinesh
sai_dinesh@daiict.ac.in
DA-IICT

Abstract

This paper introduces the MapReduce mechanism for distributing data- or computation-intensive tasks across a large number of computation nodes to achieve a speedup in the operation. As an example, we describe how an indexing operation over a large amount of data can be carried out using the MapReduce technique. Following that is a guide for setting up a cluster running Hadoop and configuring Terrier to make use of this platform. The paper also describes some results derived from experiments carried out at the University of Glasgow. The paper can turn out to be a useful resource for a beginner in the area of MapReduce and the Hadoop platform.

1. Introduction

Indexing a large real-world data corpus is considered a very data- and computation-intensive task. For academic researchers, especially in the field of Information Retrieval, it is essential to be able to verify their ideas and concepts on actual large document collections. Carrying out these operations on a single-node computer is a very poor choice, as certain operations such as indexing can take hours or days. The problem becomes even worse when the size of the data collection is very large (>100 GB).

It therefore seems a very rational choice to make use of a distributed computing mechanism to perform such operations in comparably less time. For academic learning and research purposes, Terrier v3.0 is popular software, as it provides a very easy way to change the parameters of the various operations it supports, and its open-source code is written in Java, a very popular programming language. Hadoop MapReduce, a tool developed by Yahoo and Apache, provides a very good platform for carrying out such data- and computation-intensive tasks. The MapReduce methodology was developed by Google Inc. [12]; it provides a great amount of abstraction to users, who can perform their computations without worrying about the issues involved in parallelism, fault tolerance, load balancing, data distribution, etc.

This paper gives the reader an introductory description of the map and reduce methodology, with an example of performing an indexing operation on a sample data collection from FIRE 2010, the TELEGRAPH English-language collection, around 2 GB in size. Section 2 contains the description of the MapReduce mechanism. Section 3 describes indexing as an example of using the MapReduce methodology for distributing a task. Section 4 presents the setup guide and configuration details; this section can prove very useful for people willing to make use of Hadoop for various Information Retrieval tasks using Terrier or any other software tool. In Section 5 we present the results derived from an experiment carried out at the University of Glasgow, which demonstrate the computational speedup that can be achieved by deploying Hadoop clusters for indexing large data.

2. Introduction to the MapReduce method [1]

The computation takes a set of input key/value pairs and produces a set of output key/value pairs. The user of the MapReduce library expresses the computation as two functions: Map and Reduce.

Map, written by the user, takes an input pair and produces a set of intermediate key/value pairs. The MapReduce library groups together all intermediate values associated with the same intermediate key I and passes them to the Reduce function.

The Reduce function, also written by the user, accepts an intermediate key I and a set of values for that key. It merges these values together to form a possibly smaller set of values. Typically just zero or one output value is produced per Reduce invocation. The intermediate values are supplied to the user's reduce function via an iterator. This allows us to handle lists of values that are too large to fit in memory.

Symbolically, these operations can be expressed as:

map(k1, v1) → list(k2, v2)
reduce(k2, list(v2)) → list(v2)
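To make the Map and Reduce roles concrete, below is the classic word-count example written against the Hadoop Java API (the org.apache.hadoop.mapreduce API of Hadoop 0.20 and later). It is a standard illustration rather than code from the experiments in this paper: map emits an intermediate (word, 1) pair for every token, and reduce sums the list of counts for each word.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // map(k1, v1) -> list(k2, v2): emits (word, 1) for every token in the line
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // reduce(k2, list(v2)) -> list(v2): sums the counts for one word
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}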

3. Single-pass indexing using scalable MapReduce [2]

In a normal operation of single-pass indexing without MapReduce, compressed posting lists for each term are built in memory as the corpus is scanned. When local memory is exhausted, the partial indices are flushed to disk, and the final index is built from the merged flushes. Elias-Gamma compression is used to store document-identifier (doc-id) deltas, ensuring postings are fully compressed, both in memory and on disk.

When performing this single-pass indexing task using MapReduce, document processing is split over m map tasks, with each map task processing its own subset of the input data. When memory runs low, or all the documents for that map have been processed, the partial index is flushed from the map task by emitting a set of <term, posting list> pairs. As in single-pass indexing, the posting lists are compressed to minimize the data that is transferred between map and reduce tasks. Moreover, since each map task is not aware of its context in the overall indexing job, the doc-ids used in the emitted posting lists cannot be globally correct; instead, these doc-ids start from 0 in each flush.

The partial indices (flushes) are then sorted by term, map and flush numbers before being passed to one or more reduce tasks. Each reducer collates the posting lists to create the final inverted index for the documents it processed. In particular, as the flushes are collected at an appropriate reduce task, the posting lists for each term are merged by map number and flush number, to ensure that the posting lists for each term are in a globally correct doc-id ordering. The reduce function takes each term in turn and merges the posting lists for that term into a full posting list. Figure 1 presents an example of correcting document IDs in a distributed MapReduce indexing setting of 200 documents. Note that the number of reduce tasks therefore determines the final number of inverted index shards created.

Figure 1: Correcting document IDs while merging. [2]
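The doc-id correction shown in Figure 1 can be sketched in a few lines of Java. The sketch below is our own illustration, not Terrier's implementation (which is described in [2]), and all names in it are invented for the example. Assuming a reducer receives the flushes for one term already sorted by map and flush number, together with the number of documents each flush contained, it renumbers the local doc-ids by adding the count of all documents seen in earlier flushes:

import java.util.ArrayList;
import java.util.List;

public class DocIdCorrection {

    // A posting whose doc-id is local to one flush (it restarts from 0 there).
    static class Posting {
        final int localDocId;
        final int termFrequency;
        Posting(int localDocId, int termFrequency) {
            this.localDocId = localDocId;
            this.termFrequency = termFrequency;
        }
    }

    // flushes: one posting list per flush, already in map/flush order.
    // docsPerFlush[i]: number of documents processed in flush i.
    // Returns (globalDocId, termFrequency) pairs in globally correct order.
    static List<int[]> mergeTerm(List<List<Posting>> flushes, int[] docsPerFlush) {
        List<int[]> merged = new ArrayList<int[]>();
        int offset = 0; // total documents in all earlier flushes
        for (int f = 0; f < flushes.size(); f++) {
            for (Posting p : flushes.get(f)) {
                merged.add(new int[] { p.localDocId + offset, p.termFrequency });
            }
            offset += docsPerFlush[f];
        }
        return merged;
    }
}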

4. Setup guide for installing Hadoop on a cluster

Figure 2: Visualization of the operation of Hadoop MapReduce. [8]

References [7] and [8], listed below, provide excellent step-by-step guidance for installing and running Hadoop on Linux systems. There are several corrections to them that we would like to mention here.
As suggested in reference [7], while creating a new user account called 'hadoop', one should specify the -m option in the command:

useradd -g hadoop_user -s /bin/bash -d /home/hadoop -m hadoop

This option tells the system to create a home directory for the user; without it, the system will not generate a home directory for the new 'hadoop' user. (Note that the hadoop_user group must already exist; if needed, create it first with groupadd hadoop_user.)
Different machines may have different versions and locations of Java installed on them. It is preferable to use the same version of the Java JDK on all the machines; otherwise, every machine should have at least Sun JDK 1.6 or a more recent version installed, and the JAVA_HOME path should be set correctly, pointing to the installation directory of whichever Java version is installed. The two most popular places where it is found are /usr/lib/jvm and /usr. JAVA_HOME should be set in the hadoop-env.sh file, as this file is executed every time Hadoop starts, and it overrides the paths set by the .bashrc, .bash_profile or .profile files.
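For example, on a machine where the Sun JDK lives under /usr/lib/jvm (the exact directory name below is only illustrative and differs between distributions), the relevant line in conf/hadoop-env.sh would be:

export JAVA_HOME=/usr/lib/jvm/java-6-sun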
If the user adds the IP addresses of all the master and slave nodes to the /etc/hosts file, then he/she should use only the host names mapped in /etc/hosts in all the configuration files, such as hadoop-site.xml in older versions like 0.18.x, or core-site.xml, hdfs-site.xml and mapred-site.xml in newer ones. Violating this may cause Hadoop to throw exceptions such as: expected hdfs://hostname:9000, found hdfs://AA.BB.CC.DD:9000.

While starting the Hadoop cluster using the ./start-all.sh script, it is not clearly reported whether all the services (namenode, jobtracker, datanode, tasktracker, secondarynamenode, etc.) started successfully or not. If the user wants to be sure about this, he/she should start each of these services one by one: a command like ./hadoop namenode will start the namenode on the master machine, and ./hadoop jobtracker will start the MapReduce jobtracker on the master machine. These commands also print specific messages on screen if exceptions are thrown during startup, so the user can work them out accordingly.

In reference [8] the author mentions an exception about the namespaceID being incompatible between the namenode and a datanode, and suggests two workarounds for it. There is one more way, in which one can update the namespaceIDs of all the slaves just by running a script from the master node [9].

In our experiment we often failed to start or connect to the jobtracker from the namenode. The temporary solution we applied was to run the jobtracker on the same machine as the namenode, by giving the same hostname in the configuration files for the namenode (the fs.default.name property in core-site.xml) and the jobtracker (the mapred.job.tracker property in mapred-site.xml).
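To illustrate this workaround, the two properties can point at the same host. The host name master matches the examples used later in this paper; the jobtracker port 9001 is only a conventional choice, not mandated:

In core-site.xml:

<property>
  <name>fs.default.name</name>
  <value>hdfs://master:9000</value>
</property>

In mapred-site.xml:

<property>
  <name>mapred.job.tracker</name>
  <value>master:9001</value>
</property>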
For carrying out operations like indexing, we need to store and replicate the document collection on the datanodes. The namenode stores the metadata of this collection, while the datanodes store the actual data. For this we can use the following command to copy data from the local machine to the Hadoop file system:

./hadoop fs -put <absolute path to data on the local storage> <location on hadoop file system>

For example:

./hadoop fs -put /home/hadoop/English/TELEGRAPH-UTF8/2004_utf8 /home/hadoop/dataToBeIndexed/
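A quick way to confirm that the copy succeeded, reusing the example path above, is to list the target directory and compare the reported sizes with the local files:

./hadoop fs -ls /home/hadoop/dataToBeIndexed/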
Plugin
If one has started the datanodes individually on the slave machines, then one can see the datanodes receiving blocks of different sizes from the master node. After the transfer of the data completes, the datanodes and the namenode start verifying each block. The verification process is slow and time consuming, as one can see from the progress advancing really slowly on the screen. Initially, for testing purposes, one should try putting small files into the Hadoop file system to check whether everything is working fine; after that, one should go on to storing larger data files.
by us was carried out by Richard M. C.
The next step is to configure Hadoop for running Terrier jobs on top of it. The guides provided for Terrier v3.0 in [10] and [11] are very helpful.

As described in reference [11], the collection.spec file in Terrier v3.0 needs to have file locations written in the format:

hdfs://master:9000/home/hadoop/dataToBeIndexed

If this file is created using the data available on local storage, we did not find a command in Hadoop or Terrier that would make collection.spec contain file locations in such a format. For that purpose, a trick might work for you. First set up Terrier by running the command ./trec-setup.sh <absolute path to data to be indexed available on local machine>. This will create a collection.spec file containing file locations of the format /home/hadoop/English/TELEGRAPH_UTF8/document1. Now open the collection.spec file in some text editor with a find-and-replace feature, and replace the prefix /home (or whatever the leading part of the file locations is) with something like hdfs://master:9000/home, according to the structure in the Hadoop file system where you previously stored the data using the ./hadoop fs -put <src> <target> command. Apply this replace operation to the whole collection.spec file and it will be ready for use by Terrier for finding the documents to be indexed.
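The same replacement can also be done non-interactively. A minimal sketch using GNU sed, run from the directory containing collection.spec and assuming the /home prefix and HDFS location used above:

sed -i 's|^/home|hdfs://master:9000/home|' collection.spec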
One very important thing to take care of is to add the line

terrier.plugins=org.terrier.utility.io.HadoopPlugin

to the Terrier properties before adding the index location path. Otherwise trec_terrier.sh will not be able to recognize the hdfs:// scheme used in the index path property:

terrier.index.path=hdfs://master:9000/home/hadoop/Indices/TELEGRAPH
5. An overview of the speedup achieved by the MapReduce cluster for the indexing operation

An experiment similar to the one described by us was carried out by Richard M. C. McCreadie, Craig Macdonald and Iadh Ounis at the University of Glasgow [2]. Below is the figure showing the speedup they obtained by trying out different configurations for the number of cluster machines and the number of reducers.

Image source: [2]

Here they used the .GOV2 data corpus, which is around 25 GB in size. The results of this experiment showed that increasing the computing power and the number of reducers leads to an almost linear increase in the speedup of the complete indexing operation. Below is the graph showing the total time required for indexing various data collections (WT2G, WT10G, .GOV and .GOV2) on 1-machine and 8-machine Hadoop MapReduce clusters respectively.
Image source: [2]
6. Conclusion

As per the description given above, it can be concluded that using Hadoop MapReduce for carrying out data- and computation-intensive tasks can be a very wise and effective decision. We believe that the references to setup guides and the error-correction tricks will turn out to be a useful resource for a beginner setting up a Hadoop cluster and running indexing jobs using Terrier v3.0. The discussion of the speedup that can be achieved by adding more machines and reducers to the cluster can give the user a better idea of the benefits of deploying Hadoop for performing such intensive tasks.
7. Acknowledgement

We are grateful to our instructor Prof. Prasenjit Majumder for providing us the opportunity to learn the Hadoop MapReduce and Terrier platforms through our course on Introduction to Information Retrieval. We also cordially thank our institute, DA-IICT, for providing the computational resources for carrying out these experiments. Their guidance and support have motivated us and have greatly contributed to the existence of this paper.

References

1. Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters.
2. Richard M. C. McCreadie, Craig Macdonald and Iadh Ounis. On Single-Pass Indexing with MapReduce.
3. Elias-Gamma compression, used to store document-identifier deltas.
4. Java download and installation guide: http://java.com/en/download/help/linux_install.xml
5. Hadoop download sites: http://www.apache.org/dyn/closer.cgi/hadoop/core/
6. Disabling firewall and iptables: http://www.cyberciti.biz/faq/turn-on-turn-off-firewall-in-linux/
7. Setting up a Hadoop cluster: http://www.comp.nus.edu.sg/~shilei/document/Setting_Up_a_Hadoop_Cluster.pdf
8. Setting up and running a Hadoop cluster on the Ubuntu Linux operating system: http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/
9. NamespaceID incompatibility in datanodes, workaround: http://www.hadoop-blog.com/2010/12/error-is-starting-datanode-error.html
10. Hadoop and Terrier configuration setup guide: http://terrier.org/docs/v3.0/hadoop_configuration.html
11. Guide for carrying out single-pass indexing using Terrier 3.0 and Hadoop 0.18.3: http://terrier.org/docs/v3.0/hadoop_indexing.html
12. MapReduce: Simplified Data Processing on Large Clusters: http://labs.google.com/papers/mapreduce-osdi04.pdf
13. Apache Hadoop: http://hadoop.apache.org/
