Everything You Need to Know
by Daniel Jebaraj
Contents
Ignore HDInsight at Your Own Peril: Everything You Need to Know
    Abstract
Introduction
Storage and Analysis of Big Data
Hadoop/HDInsight
    Scalable Storage: Hadoop Distributed File System
Scalable Processing
    MapReduce
        Map
        Shuffle
        Reduce
    MapReduce Sample: Java Implementation of Word Count
        Prerequisites
        Compiling the Provided Java Sample
        Upload the Input Text Document to HDFS
    C# Implementation of Word Count
        Review Results
        Important Notes
        C# Mapper
        C# Reducer
MapReduce the Easy Way
Building a Simple Recommendation Engine
    Perfectly Correlated Data
    Uncorrelated Data
    C# Implementation to Calculate Correlations
Simple Recommendation System Using Pig
    Load and Store
    Relation
    Joins
    Filter
    Projection
    Grouping
    Dump
    Pig Script That Analyzes Movie Ratings
        Load the Data from HDFS
        Obtain a List of Unique Movie Combinations
        Project the Data to a More Usable Form
        Obtain Groups Containing Ratings for Each Pair of Movies
        Calculating Correlations
        Project Final Results
        Dump Final Results for Review
        Running the Script
        Results
    Applying the Same Concepts to a Much Larger Set of Data
        Structure of u.item
        Structure of u.data
        Running the Script
The Role of Traditional BI
Data Mining Post-ETL
Data Mining with Big Data
Big Data Processing Is Not Just for Big Data
Conclusion: Harnessing Your Data Is Easier Than You Think
How Can Syncfusion Help?
Contact Information
Appendix A: Installing and Configuring HDInsight on a Single Node (Pseudo-Cluster)
Appendix B: Configuring NetBeans for HDInsight Development on Windows
Introduction
There has been a virtual explosion in the amount of data being created. Not very long ago, transactional
information was the main source of data. In the past, only a few large organizations accumulated
unwieldy amounts of transactional data. The need to store and process such amounts of data was not a
common business requirement for most organizations.
Now, the situation has changed dramatically. Organizations have woken up to the reality that huge
amounts of data are being generated on a daily basis by people and machines.
Consider some examples:
Such data can be big: difficult to store and process using traditional methods. This unwieldiness is what distinguishes big data from other data.
In spite of storage and processing difficulties, big data offers potentially huge business value. It provides
an opportunity to gather insight concerning business activities in ways previously not possible. Customer
web log information, for instance, can be used to predict valuable trends that an organization may not
otherwise be aware of. Machine-generated information can be used to predict failure, probability of
accidents, and other such events long before they happen. Social media signals can be used to predict
the failure or success of specific marketing initiatives.
This white paper focuses on the storage and processing of big data using the HDInsight distribution (the
terms Hadoop and HDInsight are used interchangeably). Understanding this is critical to harnessing big
data and putting it to use to further business goals.
Hadoop/HDInsight
Against this backdrop, Hadoop has gained broad acceptance as an effective storage and processing mechanism for big data. Hadoop is an open-source implementation of systems that Google built internally to solve big data problems related to storing indexes for web-scale data.
Hadoop at its core has two pieces: one for storing large amounts of unstructured data in a cost-effective
manner and another for processing large amounts of data in a cost-effective manner.
The data storage solution is named Hadoop Distributed File System (HDFS).
The processing solution is an implementation of the MapReduce programming model
documented by Google.
Each file that is stored by HDFS is split into large blocks (typically 64 MB each, but this setting is
configurable).
Each block is then stored on multiple machines that are part of the HDFS cluster. A centralized
metadata store has information on where individual parts of a file are stored.
Considering that HDFS is implemented on commodity hardware, machines and disks are
expected to fail. When a node fails, HDFS will ensure that data blocks the node held are
replicated to other systems.
This scheme allows for the storage of large files in a fault-tolerant manner across multiple machines.
http://research.google.com/archive/mapreduce.html
HDFS visually
In HDFS, the metadata store is typically on a machine referred to as the name node. The nodes where
data is stored are referred to as data nodes. In the previous diagram, there are three data nodes. Each
of these nodes contains a copy of each block of data that is stored on the HDFS cluster. A production
implementation of HDFS will have many more nodes, but the essential structure still applies.
The data blocks stored on individual machines also play an important role in efficient processing by Hadoop's implementation of MapReduce, but we will have more to say about that shortly.
Scalable processing
Before we discuss MapReduce, it will be helpful to carefully consider the issues associated with scaling
out the processing of data across multiple machines. We will do this using a simple example. Assume we
have a text file, and we would like an individual count of all words that appear in that text file.
This is the pseudo-code for a simple word-counting program that runs on a single machine:
Open the text file for reading, and read each line.
Parse each line into words.
Increment and store the count of each word as it appears in a dictionary or similar structure.
Close the file and output summary information.
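These steps can be sketched as a small single-machine Java program; the class name and word-splitting rule are ours and purely illustrative.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

public class SingleNodeWordCount {
    public static Map<String, Integer> countWords(Iterable<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            // Parse each line into words; punctuation is ignored for simplicity.
            for (String word : line.split("[^A-Za-z]+")) {
                if (word.isEmpty()) continue;
                counts.merge(word, 1, Integer::sum); // increment the running count
            }
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        // Open the text file for reading, read each line, and output the summary.
        Map<String, Integer> counts = countWords(Files.readAllLines(Paths.get(args[0])));
        counts.forEach((word, count) -> System.out.println(word + "\t" + count));
    }
}
```

The whole state of the computation lives in one in-memory dictionary, which is exactly what stops this design from scaling past a single machine.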
Simple enough. Now consider that you have several gigabytes (or even petabytes) of text files. How will we modify the simple program described above to process this kind of information by scaling out across multiple machines?
Some issues to consider:

Data storage: The system should provide a way to store the data being processed.

Data distribution: The system should be able to distribute data to each of the processing nodes.

Scale up vs. scale out: It will not be ideal to implement such a processing system on a single machine. A powerful machine can certainly process gigabytes of text, but there is a limit to this kind of scaling.
As we consider these aspects, it is evident that implementing a custom version of a truly scalable parallel
system across multiple machines is not a trivial task, even for a problem as simple as counting words.
Hadoop makes scaling out processing easier by implementing solutions to these issues, summarized in
the following table.
Issue considered            Hadoop's solution
Data storage                HDFS
Parallelizable algorithm    MapReduce
Fault tolerance             HDFS (replication) and MapReduce (task re-execution)
Aggregation                 MapReduce (Shuffle and Reduce)
Storage of results          HDFS
MapReduce
We have seen that Hadoop as of version 1.x mandates the MapReduce programming model.
MapReduce is a functional programming model that moves away from shared resources and related
synchronization or contention issues. It instead uses simple parts that are inherently scalable to achieve
complex solutions.
Google's paper on MapReduce provides the following description:
MapReduce is a programming model and an associated implementation for
processing and generating large data sets. Users specify a map function that
processes a key/value pair to generate a set of intermediate key/value pairs, and a
reduce function that merges all intermediate values associated with the same
intermediate key. Many real-world tasks are expressible in this model.
Programs written in this functional style are automatically parallelized and
executed on a large cluster of commodity machines. The run-time system takes
care of the details of partitioning the input data, scheduling the program's
execution across a set of machines, handling machine failures, and managing the
required inter-machine communication. This allows programmers without any
experience with parallel and distributed systems to easily utilize the resources of a
large distributed system.
The MapReduce programming model is not hard to understand, especially if we study it using a simple example. MapReduce as implemented in Hadoop consists of three stages: Map, Shuffle, and Reduce. We will look at each of these stages in detail.
Map
The Map stage takes input in the form of a key and a value, processes the input, and then outputs another key and value. In this sense, it is no different from the implementation of map in several programming environments.
Considering the word count example, a Map task takes each line of input and, for each word it sees, emits the word as a key with the value 1.

Input to Mapper:

Key:   {any number indicating the index within the block being processed}
Value: Twinkle, Twinkle Little Star

Output by Mapper:

Key        Value
Twinkle    1
Twinkle    1
Little     1
Star       1

We assume that punctuation does not count in our context. Note that the word "Twinkle" was seen twice during processing, and therefore appears twice with 1 as the value and "Twinkle" as the key.
Shuffle
Once the Map stage is over, data collected from the Mappers (remember, there could be several
Mappers operating in parallel) will be sent to the Shuffle stage.
During the Shuffle stage, all values that have the same key are collected and stored as a conceptual list
tied to the key under which they were registered.
In the word count example, assuming the single line of text we observed earlier was the only input, this
is what the output by the Shuffle phase should be:
Key        List of values
Twinkle    1, 1
Little     1
Star       1
The Shuffle stage guarantees that data under a specific key will be sent to exactly one reducer (the next
stage).
Shuffle is not typically implemented by the application. Hadoop implements shuffle and guarantees that
all data values that belong to a single key will be gathered together and passed to a single reducer. In the instance mentioned above, the key "Twinkle" will be processed by a single reducer; it will never be processed by more than one reducer. Data under different keys can, of course, be routed to different reducers.
Reduce
The reducer's role is to process the transformed data and output yet another key-value pair. This is the key-value pair that is actually written to the output. In the word count sample, the reducer can simply return the word as a key again, with the value being a summation of all the ones that appear in the provided list of values. This will, of course, be the number of times the word has appeared in the text: the desired output.
Key        Value
Twinkle    2
Little     1
Star       1
The beauty of MapReduce is that once a problem is broken into MapReduce terms and tested on a small
amount of data, you can be confident you have a scalable solution that can handle large volumes of
similar data.
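Before looking at the real Hadoop code, the three stages can be simulated in a few lines of plain Java. This is an illustrative sketch, not Hadoop's API: the method names mirror the stages, and everything runs in memory.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;

public class MapReduceSimulation {
    // Map: for each word in the line, emit the pair (word, 1).
    public static List<Entry<String, Integer>> map(String line) {
        List<Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.split("[^A-Za-z]+"))
            if (!word.isEmpty()) out.add(new SimpleEntry<>(word, 1));
        return out;
    }

    // Shuffle: gather all values emitted under the same key into one list.
    public static Map<String, List<Integer>> shuffle(List<Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new HashMap<>();
        for (Entry<String, Integer> p : pairs)
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        return grouped;
    }

    // Reduce: sum the list of ones to get the final count per word.
    public static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new HashMap<>();
        grouped.forEach((word, ones) ->
                counts.put(word, ones.stream().mapToInt(Integer::intValue).sum()));
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(reduce(shuffle(map("Twinkle, Twinkle Little Star"))));
    }
}
```

Because map works line by line and reduce works key by key, Hadoop is free to run many copies of each on different machines; only the shuffle needs global coordination.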
We will now review a working implementation of the word count problem implemented using
MapReduce in Java and C#.
We chose to show the solution in both Java and C#. Java is the native language of the Hadoop environment; other languages such as C# are supported by streaming through stdin and stdout. Java is the language you will often turn to when reviewing available sample code or implementing more advanced Hadoop features, so it is a good idea to have a working knowledge of using Java with Hadoop.
Once you have a compiled JAR file, please follow these steps to execute the sample:
8. Use this command to view the content of the output directory: hadoop fs -cat warpeacecount/part*
9. You should observe results dumped to the console, as shown in the following image.
After running the Java MapReduce implementation of word count, take a look at the three parts of the code, reproduced here:
The Mapper
The Mapper takes lines of input and, for each word seen, returns the word as a key with the value 1, as described in our earlier walkthrough.
// Template arguments state the types of the input key/value and output key/value.
public static class Map extends MapReduceBase
    implements Mapper<LongWritable, // the input's key type (position within the file)
                      Text,         // the input's value type (a line of text in this case)
                      Text,         // the output's key type (the word that was seen)
                      IntWritable> {// the output's value type (the number 1 each time the word was seen)

    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> oc, Reporter rep) throws IOException {
        String line = value.toString();
        // Split the line into words and emit (word, 1) for each one.
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            oc.collect(word, one);
        }
    }
}
Reducer
The Reducer aggregates output from the Shuffle stage, as seen in the earlier walkthrough. It then
outputs each word as a key with its total count as a value.
public static class Reduce extends MapReduceBase implements Reducer<Text,
IntWritable, Text, IntWritable> {
@Override
public void reduce(Text key, Iterator<IntWritable> values,
OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
int sum = 0;
while (values.hasNext()) {
sum += values.next().get();
}
output.collect(key, new IntWritable(sum));
}
}
public static void main(String[] args) throws IOException {
    JobConf jobConf = new JobConf(WordCount.class);
    jobConf.setJobName("wordcount");

    jobConf.setOutputKeyClass(Text.class);
    jobConf.setOutputValueClass(IntWritable.class);
    jobConf.setMapperClass(Map.class);
    jobConf.setReducerClass(Reduce.class);
    jobConf.setInputFormat(TextInputFormat.class);
    jobConf.setOutputFormat(TextOutputFormat.class);

    // Input and output locations on HDFS are passed as command-line arguments.
    FileInputFormat.addInputPath(jobConf, new Path(args[0]));
    FileOutputFormat.setOutputPath(jobConf, new Path(args[1]));

    JobClient.runJob(jobConf);
}
If you review each line in the code sample and observe the results, you should have a working
understanding of how MapReduce works.
Review results
hadoop fs -cat warpeacecountcs/part*
Important notes

The C# sample uses the Hadoop SDK available on CodePlex. We have included copies of the assemblies and files needed, so you will not need to build the SDK to work with the sample.

If you do have issues running the C# sample, we recommend building the Hadoop SDK from source and then running the sample against the updated dependencies.

Though the Hadoop SDK is also available through NuGet, we do not recommend going that route, since we experienced some issues when building against the NuGet version.
The C# versions of the Mapper and Reducer are shown in the following sample. If you compare them
with the Java version, you will see they have similar functionality.
C# Mapper
public class WordCountMapper : MapperBase
{
public override void Map(string inputLine, MapperContext context)
14
{
try
{
string[] words = inputLine.Split(' ');
foreach (string word in words)
context.EmitKeyValue(word, "1");
}
catch (ArgumentException ex)
{
return;
}
}
}
C# Reducer
public class WordCountReducer : ReducerCombinerBase
{
public override void Reduce(string key, IEnumerable<string> values,
ReducerCombinerContext context)
{
context.EmitKeyValue(key, values.Count().ToString());
}
}
We will not work with Hive in this article, but we will spend a fair amount of time with Pig. Hive provides a SQL-like approach to specifying MapReduce jobs. If you are interested in Hive, we encourage you to check out the material available online and the book Programming Hive.
Pig and Hive are both compelling environments for authoring MapReduce jobs. As developers, we prefer
Pig since its syntax is closer to that of a programming language. If you come from a SQL background, you
may prefer Hive. HDInsight has great support for both. You will not be at a disadvantage choosing one
over the other for most tasks.
In the next section, we will look into building a simple product recommendation engine using Pig. The
task of building a product recommendation engine is a real-world, big data use case. We will simplify its
specification and implementation in order to make it easier to understand, but the fundamental ideas
will remain the same as those in actual use. Working through this sample will give you a good
understanding of using Pig for complex MapReduce tasks.
Perfectly correlated data

x     y
1     10
2     20
3     30
4     40
5     50
6     60
6     60
7     70
8     80
9     90
10    100
On the other hand, consider the table below. The two columns are not related in an evident manner.
Uncorrelated data
x     y
1     3123
2     12321
3     3123
4     12312
5     5555555
6     2323123
6     123213
7     23123
8     12313
9     1231232
10    13

http://www.amazon.com/Programming-Hive-Edward-Capriolo/dp/1449319335
A mathematical way to measure the extent of correlation is the Pearson product-moment correlation coefficient. You can read about the formula and complete details on Wikipedia. The Pearson coefficient can vary between -1 and 1, as summarized below.

Pearson coefficient value    Comments
-1                           Perfectly correlated data, but as one value rises the other decreases
 0                           Uncorrelated data
+1                           Perfectly correlated data; the values rise and fall together

The Pearson coefficient can be any number between these values. Please refer to the Microsoft Excel file named cor.xlsx, available under the folder correlation-excel in the sample code folder. It has simple examples of correlations. Excel has built-in support for calculating the Pearson product-moment correlation coefficient.
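For readers who prefer code to spreadsheets, the standard formula is easy to compute directly. This is an illustrative sketch, not part of the sample code; it checks itself against a perfectly correlated series (y = 10x), which must yield a coefficient of 1.

```java
public class PearsonExample {
    // Pearson product-moment correlation coefficient of two equal-length series.
    public static double pearson(double[] x, double[] y) {
        int n = x.length;
        double sumX = 0, sumY = 0, sumXY = 0, sumX2 = 0, sumY2 = 0;
        for (int i = 0; i < n; i++) {
            sumX += x[i];
            sumY += y[i];
            sumXY += x[i] * y[i];
            sumX2 += x[i] * x[i];
            sumY2 += y[i] * y[i];
        }
        double numerator = n * sumXY - sumX * sumY;
        double denominator = Math.sqrt(n * sumX2 - sumX * sumX)
                           * Math.sqrt(n * sumY2 - sumY * sumY);
        return numerator / denominator;
    }

    public static void main(String[] args) {
        double[] x = {1, 2, 3, 4, 5};
        double[] y = {10, 20, 30, 40, 50};   // y = 10x, perfectly correlated
        System.out.println(pearson(x, y));   // 1.0, up to floating-point rounding
    }
}
```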
Applying this information to our problem (deriving recommendations for related products), consider the following:

Assume there are two movies, Lord of the Rings and The Chronicles of Narnia, that we wish to evaluate to see if they are similar (similarity being defined in this context as the possibility that someone liking one will also like the other). Assume users watched both movies and rated them, as given in the following table.

Name      The Lord of the Rings
Jack      2
Mark      4
Albert    4
John      5

http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient
http://office.microsoft.com/en-us/excel-help/correl-HP005209023.aspx
Using Excel's CORREL function, we calculate the Pearson correlation coefficient to be 0.8705715. Please refer to the sample Excel file correlation-excel\lord of the rings.xlsx to play with the provided data.
It is clear that the ratings for the two movies are strongly correlated. Now assume you have similar
ratings for thousands of movies from millions of users. It should be possible to calculate the correlation
coefficients for each pair of movies where ratings from the same user are available for both. Once these
have been calculated, they can be loaded into a relational database system; we should be able to quickly
look up the top N movies simply by looking at the pre-calculated correlation values.
Note: There are other ways to calculate correlations, and it is entirely possible that one method is vastly superior to another for certain kinds of data. We use the Pearson product-moment correlation coefficient since it is one of the most commonly used and is easily calculated using Excel. The method we use also has a substantial number of shortcomings (dealing with sparse data is one). As stated earlier, it does, however, serve as a useful example for understanding more complex uses of MapReduce.
Consider the data set ratings.csv, available in the folder named data included with the sample code for this document. It has data in the following form.
Name of movie critic    Name of movie         Rating
Lisa Rose               Lady in the Water     2.5
Lisa Rose               Snakes on a Plane     3.5
Lisa Rose               Just My Luck          3
Lisa Rose               Superman Returns      3.5
Lisa Rose               You Me and Dupree     2.5
Lisa Rose               The Night Listener    3
Gene Seymour            Lady in the Water     3
Gene Seymour            Snakes on a Plane     3.5
Gene Seymour            Just My Luck          1.5
Gene Seymour            Superman Returns      5
Gene Seymour            The Night Listener    3
This data is only slightly different from the form we considered earlier. We need to obtain pairs of movies, along with the ratings given to both movies in each pair by the same user. Once we do this, we will have, for each pair of movies, a list of rating pairs contributed by the same users, and we can then calculate the correlation coefficient for each pair of movies.
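This reshaping step can be sketched in plain Java before we express it in Pig. The record type and method names below are ours, and the sketch runs in memory; the Pig script performs the same grouping at scale.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RatingPairs {
    public record Rating(String critic, String movie, double score) {}

    // For each critic, emit every ordered pair of distinct movies (movie1 < movie2
    // alphabetically) together with that critic's two ratings. The result maps each
    // movie pair to the list of rating pairs contributed by individual critics.
    public static Map<List<String>, List<double[]>> pairRatings(List<Rating> ratings) {
        Map<String, List<Rating>> byCritic = new HashMap<>();
        for (Rating r : ratings)
            byCritic.computeIfAbsent(r.critic(), k -> new ArrayList<>()).add(r);

        Map<List<String>, List<double[]>> pairs = new HashMap<>();
        for (List<Rating> rs : byCritic.values())
            for (Rating a : rs)
                for (Rating b : rs)
                    if (a.movie().compareTo(b.movie()) < 0)
                        pairs.computeIfAbsent(List.of(a.movie(), b.movie()),
                                              k -> new ArrayList<>())
                             .add(new double[]{a.score(), b.score()});
        return pairs;
    }
}
```

Feeding in the table above, the pair (Lady in the Water, Snakes on a Plane) collects one rating pair from Lisa Rose and one from Gene Seymour; those per-pair lists are exactly what the correlation calculation consumes.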
This data set and sample were adapted from the excellent book Programming Collective Intelligence: Building Smart Web 2.0 Applications by Toby Segaran - http://www.amazon.com/Programming-Collective-Intelligence-Building-Applications/dp/0596529325.
http://www.codeproject.com/Articles/42492/Using-LINQ-to-Calculate-Basic-Statistics
// The Pearson extension method comes from the LINQ statistics helpers referenced
// above; the method head shown here is a reconstruction of the elided signature.
public static double Correlation(List<double> ratings1, List<double> ratings2)
{
    return ratings1.Pearson(ratings2);
}
Once we are able to calculate the correlation coefficient for a pair of movies, obtaining recommendations is simply a matter of finding the related movies with the highest correlation scores relative to the movie in question. The code is given in the following sample and is straightforward.

In the code, the threshold parameter exists to ensure that movies with a very low or negative correlation are not picked up. Low correlations can certainly be a problem if your data set is very sparse and does not contain enough ratings. For our purpose, the threshold is set to -1. For practical use, it may need to be set to around 0.5.
return results.ToArray();
}
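In outline, the elided lookup amounts to filtering by the threshold and sorting by correlation. A Java sketch of the same idea follows; the class and method names are ours, and the correlation values are those computed for Superman Returns from this data set.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RelatedMovies {
    // Given precomputed (movie, correlation) pairs relative to one movie, keep
    // those above the threshold and return them sorted by correlation, highest first.
    public static List<Map.Entry<String, Double>> topRelated(
            Map<String, Double> correlations, double threshold) {
        List<Map.Entry<String, Double>> results = new ArrayList<>();
        for (Map.Entry<String, Double> e : correlations.entrySet())
            if (e.getValue() > threshold) results.add(e);
        results.sort((a, b) -> Double.compare(b.getValue(), a.getValue()));
        return results;
    }

    public static void main(String[] args) {
        // Correlations to "Superman Returns" from the sample data set.
        Map<String, Double> c = new HashMap<>();
        c.put("You Me and Dupree", 0.657951694959769);
        c.put("Lady in the Water", 0.487950036474267);
        c.put("Snakes on a Plane", 0.111803398874989);
        c.put("The Night Listener", -0.179847194799054);
        c.put("Just My Luck", -0.422890031611031);
        topRelated(c, -1).forEach(e ->
                System.out.println(e.getKey() + "\t" + e.getValue()));
    }
}
```

With the threshold at -1, every movie is returned in descending order of correlation; raising it to 0.5 would keep only You Me and Dupree.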
Running the program to obtain the top related movies based on the movie Superman Returns provides the following output:

You Me and Dupree      0.657951694959769
Lady in the Water      0.487950036474267
Snakes on a Plane      0.111803398874989
The Night Listener     -0.179847194799054
Just My Luck           -0.422890031611031
Relation
Pig works with collections of data that it refers to as relations. A relation is not to be confused with a
relational database relation. A relation in Pig terminology is simply a collection of data. It is best to think
of relations as similar to a table with rows and columns of data (for our current context). When grouped,
relations can also contain keys with an associated collection of values for each unique key.
Joins
Pig can accomplish Joins in a manner that is conceptually intuitive for users who have worked with
relational data. It can join two relations using a common key.
Filter
Pig can apply filters to data. A provided predicate is checked to see if data should be included or
excluded.
Projection
Pig can project from an existing collection in a manner similar to the SQL SELECT statement. Pig's equivalent statement is named GENERATE.
Grouping
Pig can group data by one or more keys. Once grouped, you do not have to flatten the resulting data.
You can maintain a hierarchical structure with keys and lists of values related to the keys. These can
then be projected as needed.
Dump
Pig includes a Dump statement that can dump the contents of a relation to the console. Dump is useful
when working with Pig since you can run commands without writing the results to disk.
Movie name            Rating
Lady in the Water     2.5
Snakes on a Plane     3.5
Just My Luck          3
Superman Returns      3.5
You Me and Dupree     2.5
The Night Listener    3
After the join, the result of joining just the first row with the second relation should appear as seen in the following table, which combines the first row, repeating it, with each of Lisa Rose's ratings.
Lisa Rose    Lady in the Water    2.5    Lisa Rose    Lady in the Water     2.5
Lisa Rose    Lady in the Water    2.5    Lisa Rose    Snakes on a Plane     3.5
Lisa Rose    Lady in the Water    2.5    Lisa Rose    Just My Luck          3
Lisa Rose    Lady in the Water    2.5    Lisa Rose    Superman Returns      3.5
Lisa Rose    Lady in the Water    2.5    Lisa Rose    You Me and Dupree     2.5
Lisa Rose    Lady in the Water    2.5    Lisa Rose    The Night Listener    3
As you can see, the first row is a duplicate that needs to be filtered from our result. After filtering, the
results derived from the first row of data will appear as seen in the following table.
Lisa Rose    Lady in the Water    2.5    Lisa Rose    Snakes on a Plane     3.5
Lisa Rose    Lady in the Water    2.5    Lisa Rose    Just My Luck          3
Lisa Rose    Lady in the Water    2.5    Lisa Rose    Superman Returns      3.5
Lisa Rose    Lady in the Water    2.5    Lisa Rose    You Me and Dupree     2.5
Lisa Rose    Lady in the Water    2.5    Lisa Rose    The Night Listener    3
The filter operation removes combinations of movies that are identical. One point to note is that after a
join, we refer to fields using the form original_relation_name::field_name.
filtered = FILTER combined BY ratings1::movie < ratings2::movie;
Calculating correlations
We now have all the information we need to calculate correlations. Pig offers built-in support for
calculating correlations. We make use of the pairs of ratings that have been gathered during the
grouping.
The COR function that calculates the correlation returns a list of records with additional information
besides the correlation value. Use the Flatten statement to flatten the results from COR into a single
Tuple of data.
correlations = foreach grouped_ratings generate group.movie1 as movie1, group.movie2 as movie2,
    FLATTEN(COR(movie_pairs.rating1, movie_pairs.rating2)) as (var1, var2, correlation);
Results
(Just My Luck,Superman Returns,-0.42289003161103106)
(Just My Luck,Lady in the Water,-0.944911182523068)
(Just My Luck,Snakes on a Plane,-0.3333333333333333)
(Just My Luck,You Me and Dupree,-0.4856618642571827)
(Just My Luck,The Night Listener,0.5555555555555556)
(Superman Returns,You Me and Dupree,0.657951694959769)
(Superman Returns,The Night Listener,-0.1798471947990542)
(Lady in the Water,Superman Returns,0.4879500364742666)
(Lady in the Water,Snakes on a Plane,0.7637626158259734)
(Lady in the Water,You Me and Dupree,0.3333333333333333)
(Lady in the Water,The Night Listener,-0.6123724356957946)
(Snakes on a Plane,Superman Returns,0.11180339887498948)
(Snakes on a Plane,You Me and Dupree,-0.6454972243679028)
(Snakes on a Plane,The Night Listener,-0.5663521139548541)
(The Night Listener,You Me and Dupree,-0.25)
If you compare these results with the results from the C# version, you will observe that they are
identical.
The MovieLens site offers ratings data sets with 100,000, one million, and 10 million records.11 The structure of the data set is explained below. We are only interested in the u.item and u.data files.
Structure of u.item
Each line in this file has information on a specific movie. The only two fields that we will end up using are the first two, containing the unique ID for the movie and the name of the movie.

Movie ID   Movie name
1          Toy Story (1995)
Structure of u.data
Each line in this file has identifiers for critics, the movie they rated, and the rating they gave. There is also a column with timestamp information, which is not needed for our purpose.

Critic ID   Movie ID   Rating   Timestamp (unused)
196         242        3        881250949
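A quick way to verify the two file formats is to parse one line of each by hand. A minimal Python sketch (the sample values are the ones shown above; the trailing fields of u.item, which we do not use, are replaced with a placeholder):

```python
# u.data is tab-separated: critic ID, movie ID, rating, timestamp.
data_line = "196\t242\t3\t881250949"
critic, movie, rating, _timestamp = data_line.split("\t")
record = (int(critic), int(movie), float(rating))  # timestamp discarded

# u.item is pipe-separated; only the first two fields matter to us.
item_line = "1|Toy Story (1995)|<remaining fields elided>"
fields = item_line.split("|")
movie_id, movie_name = int(fields[0]), fields[1]

print(record, movie_id, movie_name)
```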
The complete Pig script used to perform this analysis is given in the following sample. Observe that it is similar to the script we used with the smaller data set. The only major difference is that we perform an extra join to include movie names as part of the results, since the names are stored in a separate file, u.item.
-- Load MovieLens file u.data twice since we need to do a self-join to obtain unique pairs of movies as before.
-- This script assumes you have uploaded the u.data file and u.item file into a folder named movielens on HDFS.
ratings1 = load 'movielens/u.data' as (critic:long, movie:long, rating:double);
ratings2 = load 'movielens/u.data' as (critic:long, movie:long, rating:double);
-- Self-join on critic to pair up ratings made by the same critic
combined = JOIN ratings1 BY critic, ratings2 BY critic;
-- Since movies are identified by IDs of type long, we filter and remove cases where both movies are identical.
-- The resulting relation contains unique pairs of movies.
filtered = FILTER combined BY ratings1::movie != ratings2::movie;
11 We ran our test on a single machine with the 100k record data set. For larger data sets, it may be better to run on a true cluster, one that you set up and configure locally, or better yet, one provisioned on Windows Azure's HDInsight service: http://www.windowsazure.com/en-us/services/hdinsight/.
-- Load item names and do a join to get actual names instead of just ID references
-- Notice that the separator between fields is a '|' in the file u.item.
movies = load 'movielens/u.item' using PigStorage('|') as (movie:long, moviename:chararray);
-- Get the name of the first movie
named_results = JOIN results BY movie1, movies BY movie;
-- Get the name of the second movie
named_results2 = JOIN named_results BY results::movie2, movies BY movie;
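The two joins simply attach a human-readable name to each of the two movie IDs in every result row. The same lookup can be sketched in Python with a dictionary standing in for the movies relation (the IDs, names, and correlations below are taken from the results table that follows):

```python
# movies: movie ID -> name, as loaded from u.item (illustrative subset).
movies = {
    4: "Get Shorty (1995)",
    1469: "Tom and Huck (1995)",
    1489: "Chasers (1994)",
}

# results: (movie1, movie2, correlation) rows, as produced by COR.
results = [
    (1469, 4, -0.6651330399133046),
    (1489, 4, 0.8703882797784892),
]

# The two Pig joins amount to two dictionary lookups per row.
named_results2 = [(m1, movies[m1], m2, movies[m2], corr)
                  for m1, m2, corr in results]

for row in named_results2:
    print(row)
```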
Movie 1 ID   Correlation Coefficient   Movie 1 ID (repeated)13   Movie 1 Name                          Movie 2 ID   Movie 2 Name
1469         -0.6651330399133046       1469                      Tom and Huck (1995)                   4            Get Shorty (1995)
1489         0.8703882797784892        1489                      Chasers (1994)                        4            Get Shorty (1995)
1510         NaN                       1510                      Mad Dog Time (1996)                   4            Get Shorty (1995)
1475         -0.3273268353539886       1475                      Bhaji on the Beach (1993)             4            Get Shorty (1995)
1419         0.6750771560841521        1419                      Highlander III: The Sorcerer (1994)   4            Get Shorty (1995)
1436         1.0                       1436                      Mr. Jones (1993)                      4            Get Shorty (1995)
1656         NaN                       1656                      Little City (1998)                    4            Get Shorty (1995)
12 MovieLens usage terms prohibit the distribution of the data. You will have to download a copy yourself in order to test this script.
13 Repeated due to the joins with u.item. We should have added a projection, but we did not do so to keep the code succinct.
Looking at a couple of result rows (the Chasers and Tom and Huck rows), it is a safe bet to recommend Get Shorty to those who like Chasers. It is a bad idea to recommend Get Shorty to those who like Tom and Huck.
The data contains fields that are repeated as well as several NaN values. It would be a good exercise to
modify the Pig script so NaN values, which appear due to a lack of common ratings for the pair of
movies, are removed. Also, you can modify projections so duplicate fields are removed.
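For the first part of the exercise, it helps to know that NaN compares unequal to itself, which gives a compact filter. A Python sketch of the idea, using illustrative rows from the results table (in Pig, one approach would be an analogous FILTER on the correlation field):

```python
nan = float("nan")

# (movie1, movie2, correlation) rows; NaN means no common raters.
rows = [
    (1489, 4, 0.8703882797784892),
    (1510, 4, nan),
    (1656, 4, nan),
]

# NaN != NaN, so keeping rows where the correlation equals itself
# discards exactly the NaN rows.
cleaned = [r for r in rows if r[2] == r[2]]
print(cleaned)
```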
The content thus far should have given you a good overview of the fundamentals of working with
Hadoop/HDInsight. You should now have enough of an understanding of the general environment
related to big data to briefly review some related topics.
Contact information
Syncfusion, Inc.
2501 Aerial Center Parkway
Suite 200
Morrisville, NC 27560
USA
Sales@syncfusion.com
4. The installation creates a shortcut to a command-line environment configured for running Hadoop. Navigate to this shortcut and start the environment.
5. The following dialog will be displayed. Select Libraries and check to see if JDK 1.6 (this is the
version of the JDK that corresponds to Java 6) is selected. If you do not see JDK 1.6 selected,
please select it. If JDK 1.6 is not listed, click Manage Platforms.
Syncfusion | Appendix B: Configuring NetBeans for HDInsight Development on Windows
7. A dialog will then be displayed with a file selector that can be pointed to the location of the JDK, as shown in the following image.
8. Now, make sure JDK 1.6 is selected as the platform, and close the selection dialog. The project should then display JDK 1.6 under the Libraries tree entry.
9. The word count Java project already contains a reference to the hadoop-core-1.1.0-SNAPSHOT.jar file. In new projects that you create, you should include a reference to this library (installed by HDInsight to {install disk}:\Hadoop\hadoop-1.1.0-SNAPSHOT\hadoop-core-1.1.0-SNAPSHOT.jar). You may have to add additional library references if you use additional features. Please consult the included documentation for this information.
10. Once these settings are in place, you should be able to build the project using the Run > Build Project menu option. A JAR file will be created and available under a folder named dist under the main project folder. This JAR file can be deployed to Hadoop clusters.