Everything You Need to Know
by Daniel Jebaraj
Contents
Ignore HDInsight at Your Own Peril: Everything You Need to Know
    Abstract
Introduction
Storage and Analysis of Big Data
Hadoop/HDInsight
    Scalable Storage: Hadoop Distributed File System
Scalable Processing
    MapReduce
        Map
        Shuffle
        Reduce
    MapReduce Sample: Java Implementation of Word Count
        Prerequisites
        Compiling the Provided Java Sample
        Upload the Input Text Document to HDFS
    C# Implementation of Word Count
        Review Results
        Important Notes
        C# Mapper
        C# Reducer
MapReduce the Easy Way
Building a Simple Recommendation Engine
    Perfectly Correlated Data
    Uncorrelated Data
    C# Implementation to Calculate Correlations
Simple Recommendation System Using Pig
    Load and Store
    Relation
    Joins
    Filter
    Projection
    Grouping
    Dump
    Pig Script That Analyzes Movie Ratings
        Load the Data from HDFS
        Obtain a List of Unique Movie Combinations
        Project the Data to a More Usable Form
        Obtain Groups Containing Ratings for Each Pair of Movies
        Calculating Correlations
        Project Final Results
        Dump Final Results for Review
        Running the Script
        Results
    Applying the Same Concepts to a Much Larger Set of Data
        Structure of u.item
        Structure of u.data
        Running the Script
The Role of Traditional BI
Data Mining Post-ETL
Data Mining with Big Data
Big Data Processing Is Not Just for Big Data
Conclusion: Harnessing Your Data Is Easier Than You Think
How Can Syncfusion Help?
Contact Information
Appendix A: Installing and Configuring HDInsight on a Single Node (Pseudo-Cluster)
Appendix B: Configuring NetBeans for HDInsight Development on Windows
Introduction
There has been a virtual explosion in the amount of data being created. Not very long ago, transactional
information was the main source of data. In the past, only a few large organizations accumulated
unwieldy amounts of transactional data. The need to store and process such amounts of data was not a
common business requirement for most organizations.
Now, the situation has changed dramatically. Organizations have woken up to the reality that huge
amounts of data are being generated on a daily basis by people and machines.
Consider some examples:
Such data can be big: difficult to store and process using traditional methods. This unwieldiness is what distinguishes big data from other data.
In spite of storage and processing difficulties, big data offers potentially huge business value. It provides
an opportunity to gather insight concerning business activities in ways previously not possible. Customer
web log information, for instance, can be used to predict valuable trends that an organization may not
otherwise be aware of. Machine-generated information can be used to predict failure, probability of
accidents, and other such events long before they happen. Social media signals can be used to predict
the failure or success of specific marketing initiatives.
This white paper focuses on the storage and processing of big data using the HDInsight distribution (the
terms Hadoop and HDInsight are used interchangeably). Understanding this is critical to harnessing big
data and putting it to use to further business goals.
Hadoop/HDInsight
Against this backdrop, Hadoop has gained broad acceptance as an effective storage and processing mechanism for big data. Hadoop is an open-source implementation of systems that Google built internally to solve big data problems related to storing indexes for web-scale data.
Hadoop at its core has two pieces: one for storing large amounts of unstructured data in a cost-effective
manner and another for processing large amounts of data in a cost-effective manner.
The data storage solution is named Hadoop Distributed File System (HDFS).
The processing solution is an implementation of the MapReduce programming model
documented by Google.
Each file that is stored by HDFS is split into large blocks (typically 64 MB each, but this setting is
configurable).
Each block is then stored on multiple machines that are part of the HDFS cluster. A centralized
metadata store has information on where individual parts of a file are stored.
Considering that HDFS is implemented on commodity hardware, machines and disks are
expected to fail. When a node fails, HDFS will ensure that data blocks the node held are
replicated to other systems.
This scheme allows for the storage of large files in a fault-tolerant manner across multiple machines.
http://research.google.com/archive/mapreduce.html
HDFS visually
In HDFS, the metadata store is typically on a machine referred to as the name node. The nodes where
data is stored are referred to as data nodes. In the previous diagram, there are three data nodes. Each
of these nodes contains a copy of each block of data that is stored on the HDFS cluster. A production
implementation of HDFS will have many more nodes, but the essential structure still applies.
The data blocks stored on individual machines also play an important role in efficient processing by Hadoop's implementation of MapReduce, but we will have more to say about that shortly.
Scalable processing
Before we discuss MapReduce, it will be helpful to carefully consider the issues associated with scaling
out the processing of data across multiple machines. We will do this using a simple example. Assume we
have a text file, and we would like an individual count of all words that appear in that text file.
This is the pseudo-code for a simple word-counting program that runs on a single machine:
Open the text file for reading, and read each line.
Parse each line into words.
Increment and store the count of each word as it appears in a dictionary or similar structure.
Close the file and output summary information.
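These steps can be sketched as a small single-machine Java program; the class name and word-splitting rule are ours and purely illustrative.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

public class SingleNodeWordCount {
    public static Map<String, Integer> countWords(Iterable<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            // Parse each line into words; punctuation is ignored for simplicity.
            for (String word : line.split("[^A-Za-z]+")) {
                if (word.isEmpty()) continue;
                counts.merge(word, 1, Integer::sum); // increment the running count
            }
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        // Open the text file for reading, read each line, and output the summary.
        Map<String, Integer> counts = countWords(Files.readAllLines(Paths.get(args[0])));
        counts.forEach((word, count) -> System.out.println(word + "\t" + count));
    }
}
```

The whole state of the computation lives in one in-memory dictionary, which is exactly what stops this design from scaling past a single machine.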
Simple enough. Now consider that you have several gigabytes (or even petabytes) of text files. How will we modify the simple program described above to process this kind of information by scaling out across multiple machines?
Some issues to consider:

Data storage: The system should provide a way to store the data being processed.

Data distribution: The system should be able to distribute data to each of the processing nodes.

Scale up vs. scale out: It will not be ideal to implement such a processing system on a single machine. A powerful machine can certainly process gigabytes of text, but there is a limit to this kind of scaling.
As we consider these aspects, it is evident that implementing a custom version of a truly scalable parallel
system across multiple machines is not a trivial task, even for a problem as simple as counting words.
Hadoop makes scaling out processing easier by implementing solutions to these issues, summarized in
the following table.
Issue considered            Hadoop's solution
Data storage                HDFS
Parallelizable algorithm    MapReduce
Fault tolerance             HDFS (replication) and MapReduce (task re-execution)
Aggregation                 MapReduce (Shuffle and Reduce)
Storage of results          HDFS
MapReduce
We have seen that Hadoop as of version 1.x mandates the MapReduce programming model.
MapReduce is a functional programming model that moves away from shared resources and related
synchronization or contention issues. It instead uses simple parts that are inherently scalable to achieve
complex solutions.
Google's paper on MapReduce provides the following description:
MapReduce is a programming model and an associated implementation for
processing and generating large data sets. Users specify a map function that
processes a key/value pair to generate a set of intermediate key/value pairs, and a
reduce function that merges all intermediate values associated with the same
intermediate key. Many real-world tasks are expressible in this model.
Programs written in this functional style are automatically parallelized and
executed on a large cluster of commodity machines. The run-time system takes
care of the details of partitioning the input data, scheduling the program's
execution across a set of machines, handling machine failures, and managing the
required inter-machine communication. This allows programmers without any
experience with parallel and distributed systems to easily utilize the resources of a
large distributed system.
The MapReduce programming model is not hard to understand, especially if we study it using a simple example. MapReduce as implemented in Hadoop consists of three stages: Map, Shuffle, and Reduce. We will look at each of these stages in detail.
Map
The Map stage takes input in the form of a key and a value, processes the input, and then outputs another key and value. In this sense, it is no different from the implementation of map in several programming environments.
Considering the word count example, a Map task takes each line of input and, for each word it sees, emits the word as a key with the value 1.

Input to Mapper:

Key:   {any number indicating the index within the block being processed}
Value: Twinkle, Twinkle Little Star

Output by Mapper:

Key        Value
Twinkle    1
Twinkle    1
Little     1
Star       1

We assume that punctuation does not count in our context. Note that the word "Twinkle" was seen twice during processing, and therefore appears twice with 1 as the value and "Twinkle" as the key.
Shuffle
Once the Map stage is over, data collected from the Mappers (remember, there could be several
Mappers operating in parallel) will be sent to the Shuffle stage.
During the Shuffle stage, all values that have the same key are collected and stored as a conceptual list
tied to the key under which they were registered.
In the word count example, assuming the single line of text we observed earlier was the only input, this
is what the output by the Shuffle phase should be:
Key        List of values
Twinkle    1, 1
Little     1
Star       1
The Shuffle stage guarantees that data under a specific key will be sent to exactly one reducer (the next
stage).
Shuffle is not typically implemented by the application. Hadoop implements shuffle and guarantees that
all data values that belong to a single key will be gathered together and passed to a single reducer. In the instance mentioned above, the key "Twinkle" will be processed by a single reducer; it will never be processed by more than one reducer. Data under different keys can, of course, be routed to different reducers.
Reduce
The reducer's role is to process the transformed data and output yet another key-value pair. This is the key-value pair that is actually written to the output. In the word count sample, the reducer can simply return the word as a key again, with the value being a summation of all the ones that appear in the provided list of values. This will, of course, be the number of times the word has appeared in the text: the desired output.
Key        Value
Twinkle    2
Little     1
Star       1
The beauty of MapReduce is that once a problem is broken into MapReduce terms and tested on a small
amount of data, you can be confident you have a scalable solution that can handle large volumes of
similar data.
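Before looking at the real Hadoop code, the three stages can be simulated in a few lines of plain Java. This is an illustrative sketch, not Hadoop's API: the method names mirror the stages, and everything runs in memory.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;

public class MapReduceSimulation {
    // Map: for each word in the line, emit the pair (word, 1).
    public static List<Entry<String, Integer>> map(String line) {
        List<Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.split("[^A-Za-z]+"))
            if (!word.isEmpty()) out.add(new SimpleEntry<>(word, 1));
        return out;
    }

    // Shuffle: gather all values emitted under the same key into one list.
    public static Map<String, List<Integer>> shuffle(List<Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new HashMap<>();
        for (Entry<String, Integer> p : pairs)
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        return grouped;
    }

    // Reduce: sum the list of ones to get the final count per word.
    public static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new HashMap<>();
        grouped.forEach((word, ones) ->
                counts.put(word, ones.stream().mapToInt(Integer::intValue).sum()));
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(reduce(shuffle(map("Twinkle, Twinkle Little Star"))));
    }
}
```

Because map works line by line and reduce works key by key, Hadoop is free to run many copies of each on different machines; only the shuffle needs global coordination.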
We will now review a working implementation of the word count problem implemented using
MapReduce in Java and C#.
We chose to show the solution in both Java and C#. Java is the native language of the Hadoop environment; other languages such as C# are supported by streaming through stdin and stdout. Java is the language you will often turn to when reviewing available sample code or implementing more advanced Hadoop features, so it is a good idea to have a working knowledge of using Java with Hadoop.
Once you have a compiled JAR file, please follow these steps to execute the sample:
8. Use this command to view the content of the output directory: hadoop fs -cat warpeacecount/part*
9. You should observe results dumped to the console, as shown in the following image.
After running the Java MapReduce implementation of word count, take a look at the three parts of the code, reproduced here:
The Mapper
The Mapper takes lines of input and, for each word seen, returns the word as a key with the value 1, as described in our earlier walkthrough.
// Template arguments state the types of the input key/value and output key/value.
public static class Map extends MapReduceBase
    implements Mapper<LongWritable, // the input's key type (position within the file)
                      Text,         // the input's value type (a line of text in this case)
                      Text,         // the output's key type (the word that was seen)
                      IntWritable> {// the output's value type (the number 1 each time the word was seen)

    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> oc, Reporter rep) throws IOException {
        String line = value.toString();
        // Split the line into words and emit (word, 1) for each one.
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            oc.collect(word, one);
        }
    }
}
Reducer
The Reducer aggregates output from the Shuffle stage, as seen in the earlier walkthrough. It then
outputs each word as a key with its total count as a value.
public static class Reduce extends MapReduceBase implements Reducer<Text,
IntWritable, Text, IntWritable> {
@Override
public void reduce(Text key, Iterator<IntWritable> values,
OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
int sum = 0;
while (values.hasNext()) {
sum += values.next().get();
}
output.collect(key, new IntWritable(sum));
}
}
public static void main(String[] args) throws IOException {
    JobConf jobConf = new JobConf(WordCount.class);
    jobConf.setJobName("wordcount");

    jobConf.setOutputKeyClass(Text.class);
    jobConf.setOutputValueClass(IntWritable.class);
    jobConf.setMapperClass(Map.class);
    jobConf.setReducerClass(Reduce.class);
    jobConf.setInputFormat(TextInputFormat.class);
    jobConf.setOutputFormat(TextOutputFormat.class);

    // Input and output locations on HDFS are passed as command-line arguments.
    FileInputFormat.addInputPath(jobConf, new Path(args[0]));
    FileOutputFormat.setOutputPath(jobConf, new Path(args[1]));

    JobClient.runJob(jobConf);
}
If you review each line in the code sample and observe the results, you should have a working
understanding of how MapReduce works.
Review results
hadoop fs -cat warpeacecountcs/part*
Important notes

The C# sample uses the Hadoop SDK available on CodePlex. We have included copies of the assemblies and files needed, so you will not need to build the SDK to work with the sample.

If you do have issues running the C# sample, we recommend building the Hadoop SDK from source and then running the sample against the updated dependencies.

Though the Hadoop SDK is also available through NuGet, we do not recommend going that route, since we experienced some issues when building against the NuGet version.
The C# versions of the Mapper and Reducer are shown in the following sample. If you compare them
with the Java version, you will see they have similar functionality.
C# Mapper
public class WordCountMapper : MapperBase
{
public override void Map(string inputLine, MapperContext context)
14
{
try
{
string[] words = inputLine.Split(' ');
foreach (string word in words)
context.EmitKeyValue(word, "1");
}
catch (ArgumentException ex)
{
return;
}
}
}
C# Reducer
public class WordCountReducer : ReducerCombinerBase
{
public override void Reduce(string key, IEnumerable<string> values,
ReducerCombinerContext context)
{
context.EmitKeyValue(key, values.Count().ToString());
}
}
We will not work with Hive in this article, but we will spend a fair amount of time with Pig. Hive provides a SQL-like approach to specifying MapReduce jobs. If you are interested in Hive, we encourage you to check out the material available online and the book Programming Hive.
Pig and Hive are both compelling environments for authoring MapReduce jobs. As developers, we prefer
Pig since its syntax is closer to that of a programming language. If you come from a SQL background, you
may prefer Hive. HDInsight has great support for both. You will not be at a disadvantage choosing one
over the other for most tasks.
In the next section, we will look into building a simple product recommendation engine using Pig. The
task of building a product recommendation engine is a real-world, big data use case. We will simplify its
specification and implementation in order to make it easier to understand, but the fundamental ideas
will remain the same as those in actual use. Working through this sample will give you a good
understanding of using Pig for complex MapReduce tasks.
Perfectly correlated data

x     y
1     10
2     20
3     30
4     40
5     50
6     60
6     60
7     70
8     80
9     90
10    100
On the other hand, consider the table below. The two columns are not related in an evident manner.
Uncorrelated data
x     y
1     3123
2     12321
3     3123
4     12312
5     5555555
6     2323123
6     123213
7     23123
8     12313
9     1231232
10    13

http://www.amazon.com/Programming-Hive-Edward-Capriolo/dp/1449319335
A mathematical way to measure the extent of correlation is the Pearson product-moment correlation coefficient. You can read about the formula and complete details on Wikipedia. The Pearson coefficient can vary between -1 and 1, as summarized below.

Pearson coefficient value    Comments
-1                           Perfectly correlated data, but as one value rises the other decreases
 0                           Uncorrelated data
+1                           Perfectly correlated data; the values rise and fall together

The Pearson coefficient can be any number between these values. Please refer to the Microsoft Excel file named cor.xlsx, available under the folder correlation-excel in the sample code folder. It has simple examples of correlations. Excel has built-in support for calculating the Pearson product-moment correlation coefficient.
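For readers who prefer code to spreadsheets, the standard formula is easy to compute directly. This is an illustrative sketch, not part of the sample code; it checks itself against a perfectly correlated series (y = 10x), which must yield a coefficient of 1.

```java
public class PearsonExample {
    // Pearson product-moment correlation coefficient of two equal-length series.
    public static double pearson(double[] x, double[] y) {
        int n = x.length;
        double sumX = 0, sumY = 0, sumXY = 0, sumX2 = 0, sumY2 = 0;
        for (int i = 0; i < n; i++) {
            sumX += x[i];
            sumY += y[i];
            sumXY += x[i] * y[i];
            sumX2 += x[i] * x[i];
            sumY2 += y[i] * y[i];
        }
        double numerator = n * sumXY - sumX * sumY;
        double denominator = Math.sqrt(n * sumX2 - sumX * sumX)
                           * Math.sqrt(n * sumY2 - sumY * sumY);
        return numerator / denominator;
    }

    public static void main(String[] args) {
        double[] x = {1, 2, 3, 4, 5};
        double[] y = {10, 20, 30, 40, 50};   // y = 10x, perfectly correlated
        System.out.println(pearson(x, y));   // 1.0, up to floating-point rounding
    }
}
```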
Applying this information to our problem (deriving recommendations for related products), consider the following:

Assume there are two movies, Lord of the Rings and The Chronicles of Narnia, that we wish to evaluate to see if they are similar (similarity being defined in this context as the possibility that someone liking one will also like the other). Assume users watched both movies and rated them, as given in the following table.

Name      The Lord of the Rings
Jack      2
Mark      4
Albert    4
John      5

http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient
http://office.microsoft.com/en-us/excel-help/correl-HP005209023.aspx
Using Excel's CORREL function, we calculate the Pearson correlation coefficient to be 0.8705715. Please refer to the sample Excel file correlation-excel\lord of the rings.xlsx to play with the provided data.
It is clear that the ratings for the two movies are strongly correlated. Now assume you have similar
ratings for thousands of movies from millions of users. It should be possible to calculate the correlation
coefficients for each pair of movies where ratings from the same user are available for both. Once these
have been calculated, they can be loaded into a relational database system; we should be able to quickly
look up the top N movies simply by looking at the pre-calculated correlation values.
Note: There are other ways to calculate correlations, and it is entirely possible that one method is vastly superior to another for certain kinds of data. We use the Pearson product-moment correlation coefficient since it is one of the most commonly used and is easily calculated using Excel. The method we use also has a substantial number of shortcomings (dealing with sparse data is one). As stated earlier, it does, however, serve as a useful example for understanding more complex uses of MapReduce.
Consider the data set ratings.csv, available in the folder named data included with the sample code for this document. It has data in the following form.
Name of movie critic    Name of movie         Rating
Lisa Rose               Lady in the Water     2.5
Lisa Rose               Snakes on a Plane     3.5
Lisa Rose               Just My Luck          3
Lisa Rose               Superman Returns      3.5
Lisa Rose               You Me and Dupree     2.5
Lisa Rose               The Night Listener    3
Gene Seymour            Lady in the Water     3
Gene Seymour            Snakes on a Plane     3.5
Gene Seymour            Just My Luck          1.5
Gene Seymour            Superman Returns      5
Gene Seymour            The Night Listener    3
This data is only slightly different from the form we considered earlier. We need to obtain pairs of movies, along with the ratings given to both movies in each pair by the same user. Once we do this, we will have, for each pair of movies, a list of rating pairs contributed by the same users, and we can then calculate the correlation coefficient for each pair of movies.
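This reshaping step can be sketched in plain Java before we express it in Pig. The record type and method names below are ours, and the sketch runs in memory; the Pig script performs the same grouping at scale.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RatingPairs {
    public record Rating(String critic, String movie, double score) {}

    // For each critic, emit every ordered pair of distinct movies (movie1 < movie2
    // alphabetically) together with that critic's two ratings. The result maps each
    // movie pair to the list of rating pairs contributed by individual critics.
    public static Map<List<String>, List<double[]>> pairRatings(List<Rating> ratings) {
        Map<String, List<Rating>> byCritic = new HashMap<>();
        for (Rating r : ratings)
            byCritic.computeIfAbsent(r.critic(), k -> new ArrayList<>()).add(r);

        Map<List<String>, List<double[]>> pairs = new HashMap<>();
        for (List<Rating> rs : byCritic.values())
            for (Rating a : rs)
                for (Rating b : rs)
                    if (a.movie().compareTo(b.movie()) < 0)
                        pairs.computeIfAbsent(List.of(a.movie(), b.movie()),
                                              k -> new ArrayList<>())
                             .add(new double[]{a.score(), b.score()});
        return pairs;
    }
}
```

Feeding in the table above, the pair (Lady in the Water, Snakes on a Plane) collects one rating pair from Lisa Rose and one from Gene Seymour; those per-pair lists are exactly what the correlation calculation consumes.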
This data set and sample were adapted from the excellent book Programming Collective Intelligence: Building Smart Web 2.0 Applications by Toby Segaran - http://www.amazon.com/Programming-Collective-Intelligence-Building-Applications/dp/0596529325.
http://www.codeproject.com/Articles/42492/Using-LINQ-to-Calculate-Basic-Statistics
// The Pearson extension method comes from the LINQ statistics helpers referenced
// above; the method head shown here is a reconstruction of the elided signature.
public static double Correlation(List<double> ratings1, List<double> ratings2)
{
    return ratings1.Pearson(ratings2);
}
Once we are able to calculate the correlation coefficient for a pair of movies, obtaining recommendations is simply a matter of finding the related movies with the highest correlation scores relative to the movie in question. The code is given in the following sample and is straightforward.

In the code, the threshold parameter exists to ensure that movies with a very low or negative correlation are not picked up. Low correlations can certainly be a problem if your data set is very sparse and does not contain enough ratings. For our purpose, the threshold is set to -1. For practical use, it may need to be set to around 0.5.
return results.ToArray();
}
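In outline, the elided lookup amounts to filtering by the threshold and sorting by correlation. A Java sketch of the same idea follows; the class and method names are ours, and the correlation values are those computed for Superman Returns from this data set.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RelatedMovies {
    // Given precomputed (movie, correlation) pairs relative to one movie, keep
    // those above the threshold and return them sorted by correlation, highest first.
    public static List<Map.Entry<String, Double>> topRelated(
            Map<String, Double> correlations, double threshold) {
        List<Map.Entry<String, Double>> results = new ArrayList<>();
        for (Map.Entry<String, Double> e : correlations.entrySet())
            if (e.getValue() > threshold) results.add(e);
        results.sort((a, b) -> Double.compare(b.getValue(), a.getValue()));
        return results;
    }

    public static void main(String[] args) {
        // Correlations to "Superman Returns" from the sample data set.
        Map<String, Double> c = new HashMap<>();
        c.put("You Me and Dupree", 0.657951694959769);
        c.put("Lady in the Water", 0.487950036474267);
        c.put("Snakes on a Plane", 0.111803398874989);
        c.put("The Night Listener", -0.179847194799054);
        c.put("Just My Luck", -0.422890031611031);
        topRelated(c, -1).forEach(e ->
                System.out.println(e.getKey() + "\t" + e.getValue()));
    }
}
```

With the threshold at -1, every movie is returned in descending order of correlation; raising it to 0.5 would keep only You Me and Dupree.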
Running the program to obtain the top related movies based on the movie Superman Returns provides the following output:

You Me and Dupree      0.657951694959769
Lady in the Water      0.487950036474267
Snakes on a Plane      0.111803398874989
The Night Listener     -0.179847194799054
Just My Luck           -0.422890031611031
Relation
Pig works with collections of data that it refers to as relations. A relation is not to be confused with a
relational database relation. A relation in Pig terminology is simply a collection of data. It is best to think
of relations as similar to a table with rows and columns of data (for our current context). When grouped,
relations can also contain keys with an associated collection of values for each unique key.
Joins
Pig can accomplish Joins in a manner that is conceptually intuitive for users who have worked with
relational data. It can join two relations using a common key.
Filter
Pig can apply filters to data. A provided predicate is checked to see if data should be included or
excluded.
Projection
Pig can project from an existing collection in a manner similar to the SQL SELECT statement. Pig's equivalent statement is named GENERATE.
Grouping
Pig can group data by one or more keys. Once grouped, you do not have to flatten the resulting data.
You can maintain a hierarchical structure with keys and lists of values related to the keys. These can
then be projected as needed.
Dump
Pig includes a Dump statement that can dump the contents of a relation to the console. Dump is useful
when working with Pig since you can run commands without writing the results to disk.
Movie name            Rating
Lady in the Water     2.5
Snakes on a Plane     3.5
Just My Luck          3
Superman Returns      3.5
You Me and Dupree     2.5
The Night Listener    3
After the join, the result of joining just the first row with the second relation should appear as seen in the following table, which combines the first row, repeating it, with each of Lisa Rose's ratings.
Lisa Rose    Lady in the Water    2.5    Lisa Rose    Lady in the Water     2.5
Lisa Rose    Lady in the Water    2.5    Lisa Rose    Snakes on a Plane     3.5
Lisa Rose    Lady in the Water    2.5    Lisa Rose    Just My Luck          3
Lisa Rose    Lady in the Water    2.5    Lisa Rose    Superman Returns      3.5
Lisa Rose    Lady in the Water    2.5    Lisa Rose    You Me and Dupree     2.5
Lisa Rose    Lady in the Water    2.5    Lisa Rose    The Night Listener    3
As you can see, the first row is a duplicate that needs to be filtered from our result. After filtering, the
results derived from the first row of data will appear as seen in the following table.
Lisa Rose    Lady in the Water    2.5    Lisa Rose    Snakes on a Plane     3.5
Lisa Rose    Lady in the Water    2.5    Lisa Rose    Just My Luck          3
Lisa Rose    Lady in the Water    2.5    Lisa Rose    Superman Returns      3.5
Lisa Rose    Lady in the Water    2.5    Lisa Rose    You Me and Dupree     2.5
Lisa Rose    Lady in the Water    2.5    Lisa Rose    The Night Listener    3
The filter operation removes combinations of movies that are identical. One point to note is that after a
join, we refer to fields using the form original_relation_name::field_name.
filtered = FILTER combined BY ratings1::movie < ratings2::movie;
Calculating correlations
We now have all the information we need to calculate correlations. Pig offers built-in support for
calculating correlations. We make use of the pairs of ratings that have been gathered during the
grouping.
The COR function that calculates the correlation returns a list of records with additional information
besides the correlation value. Use the Flatten statement to flatten the results from COR into a single
Tuple of data.
correlations = foreach grouped_ratings generate group.movie1 as movie1, group.movie2 as movie2,
    FLATTEN(COR(movie_pairs.rating1, movie_pairs.rating2)) as (var1, var2, correlation);
Results
(Just My Luck,Superman Returns,-0.42289003161103106)
(Just My Luck,Lady in the Water,-0.944911182523068)
(Just My Luck,Snakes on a Plane,-0.3333333333333333)
(Just My Luck,You Me and Dupree,-0.4856618642571827)
(Just My Luck,The Night Listener,0.5555555555555556)
(Superman Returns,You Me and Dupree,0.657951694959769)
(Superman Returns,The Night Listener,-0.1798471947990542)
(Lady in the Water,Superman Returns,0.4879500364742666)
(Lady in the Water,Snakes on a Plane,0.7637626158259734)
(Lady in the Water,You Me and Dupree,0.3333333333333333)
(Lady in the Water,The Night Listener,-0.6123724356957946)
(Snakes on a Plane,Superman Returns,0.11180339887498948)
(Snakes on a Plane,You Me and Dupree,-0.6454972243679028)
(Snakes on a Plane,The Night Listener,-0.5663521139548541)
(The Night Listener,You Me and Dupree,-0.25)
If you compare these results with the results from the C# version, you will observe that they are
identical.
The MovieLens site offers ratings data sets with 100,000, one million, and 10 million records.11 The structure of the data set is explained below. We are only interested in the u.item and u.data files.
Structure of u.item
Each line in this file has information on a specific movie. The only two fields that we will end up using are the first two, containing the unique ID for the movie and the name of the movie.

Movie ID   Movie name
1          Toy Story (1995)
Structure of u.data
Each line in this file has identifiers for critics, the movie they rated, and the rating they gave. There is also a column with timestamp information, which is not needed for our purpose.

Critic ID   Movie ID   Rating   Timestamp (unused)
196         242        3        881250949
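A quick way to verify the two file formats is to parse one line of each by hand. A minimal Python sketch (the sample values are the ones shown above; the trailing fields of u.item, which we do not use, are replaced with a placeholder):

```python
# u.data is tab-separated: critic ID, movie ID, rating, timestamp.
data_line = "196\t242\t3\t881250949"
critic, movie, rating, _timestamp = data_line.split("\t")
record = (int(critic), int(movie), float(rating))  # timestamp discarded

# u.item is pipe-separated; only the first two fields matter to us.
item_line = "1|Toy Story (1995)|<remaining fields elided>"
fields = item_line.split("|")
movie_id, movie_name = int(fields[0]), fields[1]

print(record, movie_id, movie_name)
```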
The complete Pig script used to perform this analysis is given in the following sample. Observe that it is similar to the script we used with the smaller data set. The only major difference is that we perform an extra join to include movie names as part of the results, since the names are stored in a separate file, u.item.
-- Load MovieLens file u.data twice since we need to do a self-join to obtain unique pairs of movies as before.
-- This script assumes you have uploaded the u.data file and u.item file into a folder named movielens on HDFS.
ratings1 = load 'movielens/u.data' as (critic:long, movie:long, rating:double);
ratings2 = load 'movielens/u.data' as (critic:long, movie:long, rating:double);
-- Self-join on critic to pair up ratings made by the same critic
combined = JOIN ratings1 BY critic, ratings2 BY critic;
-- Since movies are identified by IDs of type long, we filter and remove cases where both movies are identical.
-- The resulting relation contains unique pairs of movies.
filtered = FILTER combined BY ratings1::movie != ratings2::movie;
11 We ran our test on a single machine with the 100k record data set. For larger data sets, it may be better to run on a true cluster, one that you set up and configure locally, or better yet, one provisioned on Windows Azure's HDInsight service: http://www.windowsazure.com/en-us/services/hdinsight/.
-- Load item names and do a join to get actual names instead of just ID references
-- Notice that the separator between fields is a '|' in the file u.item.
movies = load 'movielens/u.item' using PigStorage('|') as (movie:long, moviename:chararray);
-- Get the name of the first movie
named_results = JOIN results BY movie1, movies BY movie;
-- Get the name of the second movie
named_results2 = JOIN named_results BY results::movie2, movies BY movie;
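The two joins simply attach a human-readable name to each of the two movie IDs in every result row. The same lookup can be sketched in Python with a dictionary standing in for the movies relation (the IDs, names, and correlations below are taken from the results table that follows):

```python
# movies: movie ID -> name, as loaded from u.item (illustrative subset).
movies = {
    4: "Get Shorty (1995)",
    1469: "Tom and Huck (1995)",
    1489: "Chasers (1994)",
}

# results: (movie1, movie2, correlation) rows, as produced by COR.
results = [
    (1469, 4, -0.6651330399133046),
    (1489, 4, 0.8703882797784892),
]

# The two Pig joins amount to two dictionary lookups per row.
named_results2 = [(m1, movies[m1], m2, movies[m2], corr)
                  for m1, m2, corr in results]

for row in named_results2:
    print(row)
```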
Movie 1 ID   Correlation Coefficient   Movie 1 ID (repeated)13   Movie 1 Name                          Movie 2 ID   Movie 2 Name
1469         -0.6651330399133046       1469                      Tom and Huck (1995)                   4            Get Shorty (1995)
1489         0.8703882797784892        1489                      Chasers (1994)                        4            Get Shorty (1995)
1510         NaN                       1510                      Mad Dog Time (1996)                   4            Get Shorty (1995)
1475         -0.3273268353539886       1475                      Bhaji on the Beach (1993)             4            Get Shorty (1995)
1419         0.6750771560841521        1419                      Highlander III: The Sorcerer (1994)   4            Get Shorty (1995)
1436         1.0                       1436                      Mr. Jones (1993)                      4            Get Shorty (1995)
1656         NaN                       1656                      Little City (1998)                    4            Get Shorty (1995)
12 MovieLens usage terms prohibit the distribution of the data. You will have to download a copy yourself in order to test this script.
13 Repeated due to the joins with u.item. We should have added a projection, but we did not do so to keep the code succinct.
Looking at a couple of result rows (the Chasers and Tom and Huck rows), it is a safe bet to recommend Get Shorty to those who like Chasers. It is a bad idea to recommend Get Shorty to those who like Tom and Huck.
The data contains fields that are repeated as well as several NaN values. It would be a good exercise to
modify the Pig script so NaN values, which appear due to a lack of common ratings for the pair of
movies, are removed. Also, you can modify projections so duplicate fields are removed.
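For the first part of the exercise, it helps to know that NaN compares unequal to itself, which gives a compact filter. A Python sketch of the idea, using illustrative rows from the results table (in Pig, one approach would be an analogous FILTER on the correlation field):

```python
nan = float("nan")

# (movie1, movie2, correlation) rows; NaN means no common raters.
rows = [
    (1489, 4, 0.8703882797784892),
    (1510, 4, nan),
    (1656, 4, nan),
]

# NaN != NaN, so keeping rows where the correlation equals itself
# discards exactly the NaN rows.
cleaned = [r for r in rows if r[2] == r[2]]
print(cleaned)
```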
The content thus far should have given you a good overview of the fundamentals of working with
Hadoop/HDInsight. You should now have enough of an understanding of the general environment
related to big data to briefly review some related topics.
Contact information
Syncfusion, Inc.
2501 Aerial Center Parkway
Suite 200
Morrisville, NC 27560
USA
Sales@syncfusion.com
4. The installation creates a shortcut to a command-line environment configured for running Hadoop. Navigate to this shortcut and start the environment.
5. The following dialog will be displayed. Select Libraries and check to see if JDK 1.6 (this is the
version of the JDK that corresponds to Java 6) is selected. If you do not see JDK 1.6 selected,
please select it. If JDK 1.6 is not listed, click Manage Platforms.
Syncfusion | Appendix B: Configuring NetBeans for HDInsight Development on Windows
7. A dialog will then be displayed with a file selector that can be pointed to the location of the JDK, as shown in the following image.
8. Now, make sure JDK 1.6 is selected as the platform, and close the selection dialog. The project should then display JDK 1.6 under the Libraries tree entry.
9. The word count Java project already contains a reference to the hadoop-core-1.1.0-SNAPSHOT.jar file. In new projects that you create, you should include a reference to this library (installed by HDInsight to {install disk}:\Hadoop\hadoop-1.1.0-SNAPSHOT\hadoop-core-1.1.0-SNAPSHOT.jar). You may have to add additional library references if you use additional features. Please consult the included documentation for this information.
10. Once these settings are in place, you should be able to build the project using the Run > Build Project menu option. A JAR file will be created and available under a folder named dist under the main project folder. This JAR file can be deployed to Hadoop clusters.