MapReduce
Mike McGrath mmcgrath@umd.edu
College of Information Studies
University of Maryland
College Park, MD 20742
ABSTRACT
In this paper, I describe an implementation of the Hubs and Authorities link analysis algorithm in Hadoop, and show the results of applying this algorithm to a mention graph collected from the Twitter microblogging service. In the context of Twitter, each user's Twitter feed is treated as a node in the graph, and each retweet or mention of the user by another user is treated as a link to the feed. The performance of the algorithm on the Twitter mention graph and on a large webgraph data set is analyzed, and the results of the Hubs and Authorities calculation on the Twitter data are presented.

1. INTRODUCTION
The last decade has seen phenomenal growth in the proliferation and use of social networking sites on the Web. One of the most popular of these sites is Twitter, a microblogging website that allows users to post messages ("tweets") of 140 characters or less to a feed that can be followed by other users. On Twitter, an informal syntax has developed for mentioning other users and reposting their messages. Common practice when mentioning other users in a tweet is to precede user names with the at-sign (@). When reposting another user's tweet in one's own feed ("retweeting"), it is customary to precede the reposted message with the acronym RT and then mention the original poster using the @user convention. This network of posts, reposts, and mentions forms a directed graph, which lends itself well to the sort of link analysis normally performed on networks of hyperlinked pages on the web (e.g. PageRank, HITS, SALSA). In this paper, I perform such an analysis on a set of Twitter data using Kleinberg's Hubs and Authorities algorithm [1] to generate a ranked list of authoritative Twitter users based on the number of times they are mentioned or retweeted by other users. The algorithm itself has been implemented using Apache Hadoop, an open-source implementation of the MapReduce framework originally described by Dean and Ghemawat [2]. Hadoop allows for rapid development of software for performing distributed computation on large datasets.
2. BACKGROUND

2.1 Hubs and Authorities Algorithm
Hubs and Authorities is an iterative link analysis algorithm designed to be applied to a network of hyperlinked web pages. The algorithm assigns two values to each page in the network: a hub value based on the values of its outgoing links, and an authority value based on the values of its incoming links. Hubs and authorities exist in a mutually reinforcing relationship. Good hubs are those pages that link to many authoritative pages, and good authorities are those pages that are linked to by good hub pages. The authority value is designed to measure the quality of the actual content of the page, while the hub value measures the quality of the links on the page. The algorithm can be specified as follows:

• Initialize the hub and authority values for each node to 1
• For k steps, where k is some natural number:
    o For each node: update its authority value by adding the sum of the hub values of all incoming links to the node's current authority value
    o For each node: update its hub value by adding the sum of the authority values of all outgoing links to the node's current hub value
    o Normalize each authority value by dividing it by the square root of the quadratic sum (the sum of the squares) of all authority values
    o Normalize each hub value by dividing it by the square root of the quadratic sum of all hub values
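These steps are straightforward to state in a few lines of sequential code. Below is a minimal in-memory sketch in Java; it is illustrative only, not the distributed implementation described in Section 3, and it assumes the hub and auth maps have been pre-initialized to 1.0 for every node appearing in links:

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Minimal in-memory sketch of the Hubs and Authorities update.
    // Assumes hub and auth already map every node to 1.0, and that
    // links.get(n) lists the nodes that n points to (outgoing links).
    public class HitsSketch {
        static Map<String, List<String>> links = new HashMap<>();

        static void iterate(Map<String, Double> hub,
                            Map<String, Double> auth, int k) {
            for (int step = 0; step < k; step++) {
                // Authority update: each link n -> t adds hub(n) to auth(t).
                for (Map.Entry<String, List<String>> e : links.entrySet())
                    for (String t : e.getValue())
                        auth.merge(t, hub.get(e.getKey()), Double::sum);
                // Hub update: each link n -> t adds auth(t) to hub(n).
                for (Map.Entry<String, List<String>> e : links.entrySet())
                    for (String t : e.getValue())
                        hub.merge(e.getKey(), auth.get(t), Double::sum);
                normalize(auth);
                normalize(hub);
            }
        }

        // Divide every value by the square root of the quadratic sum.
        static void normalize(Map<String, Double> values) {
            double quadraticSum = 0.0;
            for (double v : values.values()) quadraticSum += v * v;
            double norm = Math.sqrt(quadraticSum);
            values.replaceAll((node, v) -> v / norm);
        }
    }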
In the context of Twitter, we can view each Twitter user's message feed as a node, and each mention or retweet by another user as a link to that node. Good Twitter hubs, then, are users who frequently mention or retweet highly authoritative Twitter users, and Twitter authorities are those users that are often mentioned or retweeted by highly ranked Twitter hubs.

2.2 MapReduce, Apache Hadoop, and Cloud9
The Hubs and Authorities analysis performed for this paper was implemented on the Hadoop framework, using components from the Cloud9 library.

MapReduce is a software framework and programming paradigm developed at Google to simplify distributed computing on very large datasets across clusters of commodity hardware. In MapReduce, a set of input key-value pairs is fed through a map function, which is applied to each pair to generate a set of intermediate key-value pairs. These intermediate key-value pairs are then fed to a reduce function, which performs some aggregate operation on the set of all values belonging to a particular key. The advantage of the MapReduce framework is that it allows map tasks to run in parallel, handles distribution of the intermediate key-value pairs to reducers, and then allows all reduce tasks to run in parallel. The framework handles task scheduling and distribution of data amongst the nodes in the cluster, saving the developer considerable time and effort.
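As a concrete illustration of the paradigm, the canonical word count program can be written against the classic org.apache.hadoop.mapred API (the same JobConf-era API used by the jobs described later in this paper). This is a sketch for illustration, not code from the implementation:

    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class WordCount {
        public static class Map extends MapReduceBase
                implements Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            public void map(LongWritable offset, Text line,
                            OutputCollector<Text, IntWritable> out,
                            Reporter reporter) throws IOException {
                // Map: emit an intermediate (word, 1) pair per token.
                for (String word : line.toString().split("\\s+"))
                    if (!word.isEmpty()) out.collect(new Text(word), ONE);
            }
        }

        public static class Reduce extends MapReduceBase
                implements Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text word, Iterator<IntWritable> counts,
                               OutputCollector<Text, IntWritable> out,
                               Reporter reporter) throws IOException {
                // Reduce: aggregate all counts belonging to the same word.
                int sum = 0;
                while (counts.hasNext()) sum += counts.next().get();
                out.collect(word, new IntWritable(sum));
            }
        }
    }

The jobs described in Section 3 follow this same mapper/reducer structure, with more elaborate key and value types.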
Hadoop is an open-source implementation of the MapReduce framework described by Dean and Ghemawat. It was originally developed at Yahoo! but is now maintained under the auspices of the Apache Foundation.

Cloud9 is a library for Hadoop developed at the University of Maryland, designed both to serve as a teaching tool and to support research in data-intensive processing, particularly text processing. Cloud9 provides a number of data structures that proved useful in the implementation of Hubs and Authorities in Hadoop. Documentation for the Cloud9 library is available at http://www.umiacs.umd.edu/~jimmylin/cloud9/docs/
2.3 Data
Each record in the mention graph is a key-value pair, where the key is a Cloud9 PairOfStrings object and the value is a Hadoop IntWritable object. The mention graph was generated from a set of roughly 29 million tweets posted between 2006 and 2009. The tweets were culled via the Twitter API and do not cover a constant window of time; they are essentially the previous N tweets collected from each user in some subset of Twitter users. The mention graph occupies about 53 MB on the cluster's distributed file system.

Because the retweet and mention graph is not especially large, I also ran the algorithm on a 12 GB web graph, in order to gauge performance on a large dataset.
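For illustration only, a single mention graph record might be built as follows. The (mentioning user, mentioned user) ordering of the pair is an assumption made for this sketch, and the Cloud9 package path varies between versions of the library:

    import org.apache.hadoop.io.IntWritable;
    import edu.umd.cloud9.io.pair.PairOfStrings; // path varies by Cloud9 version

    public class MentionRecordExample {
        public static void main(String[] args) {
            // One hypothetical mention graph record: user "alice" mentioned
            // or retweeted user "bob" five times. The (mentioner, mentioned)
            // field order is an assumption made for this illustration.
            PairOfStrings key = new PairOfStrings("alice", "bob");
            IntWritable value = new IntWritable(5);
            System.out.println(key + "\t" + value);
        }
    }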
2.4 Execution Environment
The experiments described in this paper were performed on a 416-node cluster of commodity machines provided by Google and IBM as part of their Academic Cloud Computing Initiative. The specifications of the individual compute nodes have not been provided, but each node in the cluster has been configured to run two map tasks or two reduce tasks at a time, for a total system capacity of 828 concurrent map tasks or reduce tasks.
2.5 Previous Research
Others have attempted to implement the Hubs and Authorities algorithm using the MapReduce framework. Dong [3] produced a three-step implementation of the algorithm using Hadoop that provided the inspiration for the implementation discussed in this paper; however, his implementation suffered from some efficiency drawbacks. All key-value pairs in his implementation were passed from job to job as text rather than as sequence files. His implementation also did not make use of combiners in strategic areas where they could have boosted performance, and he performed file system reads in the map phase of certain jobs that would have been more efficiently performed during job configuration.

An alternate system for ranking Twitter users exists on the web at http://trst.me. This site, built by a team of data analysts known as Infochimps, performs a PageRank-like link analysis on Twitter data comprising 1.6 billion tweets collected since 2006. Their implementation is based on a graph of Twitter follower links and is described in more detail at http://trst.me/about.
3. DESIGN AND IMPLEMENTATION
The Hubs and Authorities algorithm was implemented as a set of five individual Hadoop jobs, as shown in Figure 1. The first two jobs (Auth Formatter and Hub Formatter) read in the Twitter mention graph data and output it in a format that can be used for computation by the actual hubs and authorities calculation job. The third job (HubsAndAuthorities) updates the hub and authority values for each node; the fourth job (Normalization Step 1) finds the quadratic sums of all of the hub and authority values in the graph; and the fifth job (Normalization Step 2) completes the normalization task by dividing each hub and authority value by the square root of the correct quadratic sum.
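One plausible shape for a driver chaining these five jobs for k iterations is sketched below. The empty inner classes are stand-ins for the real job classes, and every path name is an illustrative assumption; the actual driver code is not shown in this paper:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class PipelineSketch {
        // Stand-ins for the real job classes described in this section.
        static class AuthFormatter {}
        static class HubFormatter {}
        static class HubsAndAuthorities {}
        static class Normalization1 {}
        static class Normalization2 {}

        static JobConf job(Class<?> jobClass, String out, String... ins) {
            JobConf conf = new JobConf(jobClass);
            Path[] paths = new Path[ins.length];
            for (int i = 0; i < ins.length; i++) paths[i] = new Path(ins[i]);
            FileInputFormat.setInputPaths(conf, paths);
            FileOutputFormat.setOutputPath(conf, new Path(out));
            return conf;
        }

        public static void main(String[] args) throws Exception {
            int k = Integer.parseInt(args[0]);
            // The two formatters run once, before the first iteration.
            JobClient.runJob(job(AuthFormatter.class, "auth0", "mentions"));
            JobClient.runJob(job(HubFormatter.class, "hub0", "mentions"));
            String prev = null;
            for (int i = 1; i <= k; i++) {
                // Update step, then the two-job normalization pass.
                JobClient.runJob(i == 1
                    ? job(HubsAndAuthorities.class, "raw1", "auth0", "hub0")
                    : job(HubsAndAuthorities.class, "raw" + i, prev));
                JobClient.runJob(job(Normalization1.class, "sums" + i, "raw" + i));
                // Normalization Step 2 loads the factors from sums<i> during
                // job configuration, as described in Section 3.3.2.
                prev = "norm" + i;
                JobClient.runJob(job(Normalization2.class, prev, "raw" + i));
            }
        }
    }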
3.1 Formatters
One formatter job produces hub weight data for each node, and the other produces authority weight data for each node. Output of these formatters is in the following format for authority data:

<name> (A, (<auth rank>, [incoming links]))

where the key is a Hadoop Text object storing the node name, and the value is a Cloud9 Tuple object with the left value set to the symbol "A" to indicate that this is authority data. The right value of this tuple is another Tuple, whose left value is a DoubleWritable representing the authority weight for this node, and the right value is a Cloud9 ArrayListWritable object containing the names of all of the incoming links to the node. For example, a node alice with an initial authority weight of 3.0 that is mentioned by bob and carol would be emitted as alice (A, (3.0, [bob, carol])). I ran the Authority formatter twice. On one execution, I initialized each node's authority value to its number of retweets or mentions provided in the mention graph. On the second execution, I simply initialized all authority values to 1.0.

Similarly, hub data is formatted like so:

<name> (H, (<hub rank>, [outgoing links]))

where the key is once again the node name, and the value is a Cloud9 Tuple object with the left value set to the symbol "H," indicating that this is hub data. The right value of this tuple is another Tuple, whose left value is a DoubleWritable representing the hub weight for this node, and the right value is a Cloud9 ArrayListWritable object containing the names of all of the outgoing links from this node. Hub values were initialized to 1.0 in both executions of the algorithm.
3.2 Hub and Authority Update

3.2.1 Map
The HubsAndAuthorities job reads in the output from the formatters or from a previous iteration of the algorithm. The map task reads in each key-value pair from the input and, for each input node n, emits the following (the first three outputs are produced from n's authority record, the last three from its hub record):

<n> (A, (<current auth. value>, []))
<n> (A, (-1.0, [incoming links]))
for each incoming node i in the incoming link list:
    <i> (H, (<auth value of n>, []))
<n> (H, (<current hub value>, []))
<n> (H, (-1.0, [outgoing links]))
for each outgoing node o in the outgoing link list:
    <o> (A, (<hub value of n>, []))

The hub and authority values of -1.0 are dummy values assigned to the output key-value pairs containing the incoming/outgoing adjacency lists; they signal to the reducer that these key-value pairs should not be used to update the authority or hub values.
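The emission logic above can be sketched in plain Java as follows, with a simple Record class and in-memory emit() standing in for the Hadoop Text key and the nested Cloud9 Tuple value used by the actual implementation:

    import java.util.ArrayList;
    import java.util.List;

    public class UpdateMapSketch {
        // Stand-in for the (tag, (weight, links)) value structure.
        static class Record {
            final String tag;          // "A" or "H"
            final double weight;       // current value, or -1.0 for adjacency lists
            final List<String> links;  // incoming (A) or outgoing (H) links
            Record(String tag, double weight, List<String> links) {
                this.tag = tag; this.weight = weight; this.links = links;
            }
        }

        static List<String> output = new ArrayList<>();

        static void emit(String node, Record r) {
            output.add(node + " (" + r.tag + ", (" + r.weight + ", " + r.links + "))");
        }

        // Called once per input record for node n.
        static void map(String n, Record r) {
            List<String> none = new ArrayList<>();
            // Pass the node's own value and its adjacency list through,
            // the latter marked with the dummy weight -1.0.
            emit(n, new Record(r.tag, r.weight, none));
            emit(n, new Record(r.tag, -1.0, r.links));
            if (r.tag.equals("A")) {
                // Each node i that links to n needs n's authority value
                // to update its hub value.
                for (String i : r.links) emit(i, new Record("H", r.weight, none));
            } else {
                // Each node o that n links to needs n's hub value
                // to update its authority value.
                for (String o : r.links) emit(o, new Record("A", r.weight, none));
            }
        }
    }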
3.3.2 Normalization Step 2
The Normalization Step 2 job reads in the hub and authority values computed by the HubsAndAuthorities job, as well as the normalization factors computed in Step 1 of the normalization process, and then divides each hub or authority value by the correct normalization factor. The normalization factors are read in from the filesystem during job configuration and are distributed to the various mappers using the Hadoop JobConf object.

3.3.2.1 Map
In the configuration stage, each mapper reads the normalization factors from the JobConf. The mapper then reads in each key-value pair from the input and divides each hub or authority value by the correct normalization factor (based on whether it is tagged with an "H" or an "A"). Mapper output has the same structure as the formatter output described in Section 3.1.
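The configure-then-map pattern described in 3.3.2.1 looks roughly like the following sketch. A bare DoubleWritable stands in for the Cloud9 Tuple value, the tag is assumed to be encoded as a key suffix rather than stored inside the value as the actual job does, and the JobConf property names are illustrative assumptions:

    import java.io.IOException;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class NormalizeMapper extends MapReduceBase
            implements Mapper<Text, DoubleWritable, Text, DoubleWritable> {
        private double hubNorm, authNorm;

        @Override
        public void configure(JobConf conf) {
            // Normalization factors were placed in the JobConf by the
            // driver after Normalization Step 1 ran.
            hubNorm = Double.parseDouble(conf.get("hits.hubNorm"));
            authNorm = Double.parseDouble(conf.get("hits.authNorm"));
        }

        public void map(Text key, DoubleWritable value,
                        OutputCollector<Text, DoubleWritable> out,
                        Reporter reporter) throws IOException {
            // Divide by the factor matching the record's tag.
            double norm = key.toString().endsWith("/H") ? hubNorm : authNorm;
            out.collect(key, new DoubleWritable(value.get() / norm));
        }
    }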
3.3.2.2 Reduce
The reduce phase for Step 2 of the normalization process is simply the Hadoop identity reducer. No computation needs to be performed during the reduce phase.

The output from the normalization phase can be fed back into the HubsAndAuthorities job to begin another iteration of the computation.