
SYNOPSIS

A Minor Project Synopsis submitted

in partial fulfillment of the requirements for the degree of
Bachelor of Technology
in Information Technology (IT)
(Session 2016-2017)

Project Coordinator:
Mr. Rinkaj Goyal

Submitted by:
Kushagra Bhandari (01216401513)
Sahil Gupta (02016401513)

(SIGNATURE)

University School of Information and Communication Technology


Guru Gobind Singh Indraprastha University
Dwarka Sector-16C, New Delhi (Delhi)

September-2016

ABSTRACT:
Recommender systems are found in many e-commerce applications today. They typically present the user with a list of items the user might prefer, or predict how much the user would prefer each item. Two common approaches for generating recommendations are collaborative filtering and content-based filtering. By combining these two approaches, a hybrid recommendation system can be built that considers both the user's ratings and the items' features when recommending items. A limited amount of data can be analyzed with existing data analysis tools, but for something like the www.Movielens.com dataset, whose size runs into gigabytes, a big data analysis tool such as Hadoop is required. Hadoop is a software framework for the distributed processing of large data sets. It uses the MapReduce paradigm to distribute processing over clusters of computers, reducing the time involved in analyzing item features (for example, the keywords of a book). The proposed system is reliable and fault tolerant compared to existing recommendation systems, as it collects ratings from users to predict their interests and analyzes each item to extract its features. The system is also adaptive, since it updates the rating list frequently and recomputes the user's current interests. Experimental results show that the proposed system
1. INTRODUCTION
Big data analysis is one of the emerging disciplines in data mining, concerned with large volumes of unstructured data that are very difficult to store and retrieve efficiently. Big data does not refer only to exabytes or petabytes of data: whenever the amount of data to be processed exceeds the capacity of the system, we speak of big data. The three perspectives of big data are volume, velocity and variety [1]. Volume refers to the amount of data being processed; it had moved into petabytes and zettabytes as of 2014 and is expected to keep increasing. Velocity refers to the speed at which the data can be processed with a minimal error rate. Variety refers to all types of data, from unstructured raw data to semi-structured and structured data, which can be readily analyzed and used for decision making and predictive analysis.
This exponential growth in data has led to many vital challenges in business, and existing tools have become inadequate for processing such large sets of data. To overcome this, Google introduced a programming model called MapReduce [2], which was considered a great evolution in the field of data mining. Soon after, a tool called Hadoop was introduced. Hadoop is a tool for analyzing large sets of data using distributed clusters, and it can also be used for parallel programming. There are many big data analysis tools, but the key traits that make Hadoop distinct from the others are:


Accessible - Hadoop can run on large, distributed clusters of nodes or on cloud computing services such as Amazon's Elastic Compute Cloud (EC2).
Robust - Hadoop is architected to withstand hardware malfunctions such as node shutdowns or data loss, and it can gracefully handle most such failures with the help of the secondary NameNode.
Scalable - Hadoop can be scaled by adding more nodes once the multi-node cluster has been set up.
Simple - users can easily write parallel code with the help of Hadoop.

Personalized recommendations are ubiquitous on social networks and shopping sites these days. How do they do it? As long as enough user interaction data is available for items, e.g., products on shopping sites, a recommendation engine based on what is known as collaborative filtering is not that difficult to build.
2. PROBLEM DOMAIN
Let's take a small digression into algorithm complexity and big-O notation. Since we are interested in finding the correlation between pairs of items, the complexity is O(n x n). If a shopping site has 500,000 products, we may potentially have to perform on the order of 500,000 x 500,000 = 250 billion computations. Granted, the correlation data will be sparse, because it is unlikely that every pair of items will have some user interested in both of them. Still, it is challenging to process this amount of data. Since user interest in products changes with time, the correlation result has a temporal aspect to it: the correlation calculation needs to be done periodically so that the results stay up to date. Since the correlation calculation lends itself well to the divide and conquer pattern, we will use Hadoop.

3. SOLUTION DOMAIN
I will follow a technique called Item Based Collaborative Filtering. The basic idea
is simple and it involves two main steps.

Product Rating
What we want to correlate is the rating of different products, i.e., we are going to use ratings to find similarities between products. If products are explicitly rated by users, e.g., on a scale of 1-5, we can use that number directly. Let's assume we don't have any such rating data: the site may not offer a product rating feature, and even if the feature is available, visitors may simply ignore it.
Instead, we take a more intuitive, pragmatic approach by monitoring user behavior.

The rating logic maps each kind of user interaction with a product to a score. Depending on the nature of the interaction, the product is rated on a scale of 1-5, as sketched below. We could make this more sophisticated by taking into account parameters such as the amount of time spent on the product page, how recent the user behavior data is, and so on.
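A minimal sketch of such a rating logic in Java; the specific behaviors and their scores are illustrative assumptions rather than the exact entries of the table.

public class ImplicitRating {

    // Kinds of user interaction tracked; illustrative assumptions.
    public enum Behavior {
        VIEWED_PRODUCT_PAGE,
        SEARCHED_FOR_PRODUCT,
        ADDED_TO_WISHLIST,
        ADDED_TO_CART,
        PURCHASED
    }

    // Map each behavior to an implicit rating on the 1-5 scale.
    public static int rate(Behavior behavior) {
        switch (behavior) {
            case VIEWED_PRODUCT_PAGE:  return 1;
            case SEARCHED_FOR_PRODUCT: return 2;
            case ADDED_TO_WISHLIST:    return 3;
            case ADDED_TO_CART:        return 4;
            case PURCHASED:            return 5;
            default: throw new IllegalArgumentException("unknown behavior");
        }
    }
}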

Rating Correlation

Hadoop Processing
Essentially we have to generate the product of ratings for every pair of products rated by some user.
We have two kinds of input. The first contains the mean and standard deviation of the ratings for all the products. For reasons explained later, this input needs to be in the format of one row for each product Id pair, followed by the mean and standard deviation of the rating for each product in the pair.
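One plausible concrete layout for such a row, reconstructed from the field descriptions that follow, is:

pid1,pid2,m1,s1,m2,s2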

Here pid is the product Id, m is the mean rating and s is the standard deviation of the rating. When a row of this type is processed by the mapper, it emits the pid pair as the key and the rest as the value.
Let me take a short digression to explore how we can generate such data. We can use MapReduce again. Given a list of product Ids with the associated mean and standard deviation of their ratings, how can we generate such a pairwise list? There are far too many combinations to enumerate naively, so how can we reduce the scale of the problem? The idea is to hash the product Ids into groups keyed by pairs of leading characters, so that each pair of products meets in exactly one group.

There will be approximately 1800 such groups and each group will only have those
product Ids that start with the corresponding characters. For example, the group keyed
by a3 will only have product Ids that start with either a or 3.

Within each group it is easier to create the unique product Id pairs, because we are dealing with smaller sets of data. Finally, we combine the results of the individual groups to get the final list, which will contain all unique product Id pairs. A sketch of this pair-generation job is shown below.
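A minimal sketch of the pair-generation job in Java, assuming case-sensitive alphanumeric product Ids (62 characters give 62 x 63 / 2 = 1953 group keys, in line with the roughly 1800 groups mentioned above) and group keys formed from the sorted pair of leading characters; the class names are illustrative.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class PidPairGenerator {

    static final String CHARS =
            "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";

    // Emits each "pid,m,s" row under every group whose key contains the
    // pid's first character, so every possible pair meets in some group.
    public static class GroupMapper
            extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text row, Context ctx)
                throws IOException, InterruptedException {
            char first = row.toString().charAt(0);
            for (char other : CHARS.toCharArray()) {
                String groupKey = first <= other
                        ? "" + first + other : "" + other + first;
                ctx.write(new Text(groupKey), row);
            }
        }
    }

    // Pairs up the rows within one group, emitting pid1,pid2,m1,s1,m2,s2.
    public static class PairReducer
            extends Reducer<Text, Text, NullWritable, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            String k = key.toString();
            char c = k.charAt(0), d = k.charAt(1);
            List<String[]> rows = new ArrayList<>();
            for (Text v : values) rows.add(v.toString().split(","));
            for (int i = 0; i < rows.size(); i++) {
                for (int j = i + 1; j < rows.size(); j++) {
                    String[] a = rows.get(i), b = rows.get(j);
                    // In a mixed group such as "3a", only cross pairs belong
                    // here; same-character pairs are produced once, by the
                    // group whose key repeats that character (e.g. "aa").
                    if (c != d && a[0].charAt(0) == b[0].charAt(0)) continue;
                    // Order the two pids consistently.
                    if (a[0].compareTo(b[0]) > 0) { String[] t = a; a = b; b = t; }
                    ctx.write(NullWritable.get(), new Text(
                            a[0] + "," + b[0] + "," + a[1] + "," + a[2]
                                 + "," + b[1] + "," + b[2]));
                }
            }
        }
    }
}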
The other input contains the ratings for all users, with one row per user. Each row contains the user Id followed by multiple pairs of product Id and rating.
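A plausible layout for such a row, reconstructed from the description that follows, is:

uid,pid1,r1,pid2,r2,...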

Here uid is the user Id and pid is the product Id. Each row has a variable number of product Id and rating pairs, depending on how many products have rating data for the given user. When a row of this input is processed by the mapper, it emits multiple keys and values: all possible pairs of pids are enumerated, each pair is emitted as a key, and the corresponding product of the two ratings is emitted as the value.
For any pid pair, the grouping needs to be done in such a way that the first value in the list of values in the reducer input is the two means and standard deviations for the two products. The subsequent values are the products of ratings for the two products.

Mapper
The mapper implementation is sketched below. The key has three tokens: two pids followed by 0 or 1. We will be using a custom group comparator, and with 0 or 1 appended to the key, the value containing the means and standard deviations for any given pid pair will appear before all the values containing the rating products in the reducer input. That is the reason for appending 0 or 1 to the key.
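A minimal sketch of the mapper, assuming the two row layouts shown earlier; it tells the record types apart by field count (stat rows always have six fields, rating rows an odd number), which is an assumption of this sketch.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class RatingCorrelationMapper
        extends Mapper<LongWritable, Text, Text, Text> {

    private final Text outKey = new Text();
    private final Text outVal = new Text();

    @Override
    protected void map(LongWritable offset, Text row, Context ctx)
            throws IOException, InterruptedException {
        String[] f = row.toString().split(",");
        if (f.length % 2 == 0) {
            // Stat row pid1,pid2,m1,s1,m2,s2: append 0 so it sorts ahead
            // of the rating products for the same pid pair.
            outKey.set(f[0] + "," + f[1] + ",0");
            outVal.set(f[2] + "," + f[3] + "," + f[4] + "," + f[5]);
            ctx.write(outKey, outVal);
        } else {
            // Rating row uid,pid1,r1,pid2,r2,...: emit the product of
            // ratings for every pid pair, appending 1 to the key.
            for (int i = 1; i < f.length; i += 2) {
                for (int j = i + 2; j < f.length; j += 2) {
                    String p1 = f[i], p2 = f[j];
                    int product = Integer.parseInt(f[i + 1])
                                * Integer.parseInt(f[j + 1]);
                    // Order the pids so the key matches the stat-row key.
                    if (p1.compareTo(p2) > 0) { String t = p1; p1 = p2; p2 = t; }
                    outKey.set(p1 + "," + p2 + ",1");
                    outVal.set(Integer.toString(product));
                    ctx.write(outKey, outVal);
                }
            }
        }
    }
}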
Here is some example mapper output. The first row shows the means and standard deviations for a pid pair; the next two rows are products of ratings for pid pairs. The key consists of two pids followed by 0 or 1, depending on the record type.
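With hypothetical pids and values, purely to illustrate the format:

p12,p47,0    3.6,0.9,2.8,1.1
p12,p47,1    12
p12,p63,1    8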

Reducer
Here is an example reducer input. The first element in the list of values holds the means and standard deviations of the ratings for the pid pair in the key; the following values in the list are the products of ratings corresponding to that pid pair.
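Again with hypothetical values, purely to illustrate the shape of the input:

key: p12,p47,0    values: [ 3.6,0.9,2.8,1.1  12  20  15 ]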

Every call to the reducer creates one line of output text, consisting of three comma-separated tokens: pid1, pid2 and the correlation. A sketch of the reducer is shown below.
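A minimal sketch of the reducer, assuming the Pearson correlation corr = (E[r1*r2] - m1*m2) / (s1*s2) computed over the co-rating users; the exact statistic used is an assumption of this sketch.

import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class RatingCorrelationReducer
        extends Reducer<Text, Text, NullWritable, Text> {

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context ctx)
            throws IOException, InterruptedException {
        double m1 = 0, s1 = 0, m2 = 0, s2 = 0;
        double sum = 0;
        long count = 0;
        boolean first = true;
        for (Text v : values) {
            if (first) {
                // First value (guaranteed by the key sort order and the
                // group comparator) is m1,s1,m2,s2 for the pid pair.
                String[] stat = v.toString().split(",");
                m1 = Double.parseDouble(stat[0]);
                s1 = Double.parseDouble(stat[1]);
                m2 = Double.parseDouble(stat[2]);
                s2 = Double.parseDouble(stat[3]);
                first = false;
            } else {
                // Remaining values are products of ratings for the pair.
                sum += Double.parseDouble(v.toString());
                count++;
            }
        }
        if (count > 0 && s1 > 0 && s2 > 0) {
            double corr = (sum / count - m1 * m2) / (s1 * s2);
            String[] pid = key.toString().split(",");
            // Output: pid1,pid2,correlation
            ctx.write(NullWritable.get(),
                      new Text(pid[0] + "," + pid[1] + "," + corr));
        }
    }
}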
Partitioning
Unless we implement a partitioner, there is no guarantee that all the data for the same product Id pair will go to the same reducer in the Hadoop cluster.
We need to write a partitioner based on the first two tokens of the key, i.e., the two pid values. Here is a sketch of the implementation.
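A minimal sketch: only the two pid tokens take part in the hash, so records with suffix 0 and suffix 1 for the same pair land on the same reducer.

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class PidPairPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numReducers) {
        String[] tokens = key.toString().split(",");
        String pidPair = tokens[0] + "," + tokens[1];
        // Mask the sign bit so the result is a valid partition index.
        return (pidPair.hashCode() & Integer.MAX_VALUE) % numReducers;
    }
}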

The return value determines which of the numReducers reducers will process the data for a given pid pair.
Group Comparator
We need to ensure that all the data for a given pid pair gets fed into the reducer in one call. To do that, we need to take control of the grouping away from Hadoop's defaults and into our own hands.
Just like the partitioner, the group comparator is based on the first two tokens of the key, as shown below.
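A minimal sketch: keys compare equal when their first two tokens match, so both record types for a pid pair are grouped into a single reduce call, while the full-key sort still puts the 0-suffixed stat record first.

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class PidPairGroupComparator extends WritableComparator {

    public PidPairGroupComparator() {
        super(Text.class, true);
    }

    @Override
    @SuppressWarnings("rawtypes")
    public int compare(WritableComparable a, WritableComparable b) {
        // Compare only the two pid tokens; ignore the trailing 0/1.
        String[] ta = a.toString().split(",");
        String[] tb = b.toString().split(",");
        int cmp = ta[0].compareTo(tb[0]);
        return cmp != 0 ? cmp : ta[1].compareTo(tb[1]);
    }
}

It would be registered on the job with job.setGroupingComparatorClass(PidPairGroupComparator.class), alongside job.setPartitionerClass(PidPairPartitioner.class).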

Final Thoughts
We are most of the way through our recommendation engine. At this point we have several reducer output files in hand, where each row contains two pid values and the corresponding correlation coefficient, covering every product pair for which rating data is available.
Armed with the correlation values and the ratings for a target user, we can make product recommendations for that user. But that's another MapReduce job.

4. SYSTEM DOMAIN
Platform Specification:
Software:
Windows operating system
NetBeans 7.0.1
Apache Hadoop 2.7.2
WAMP Server

Hardware:
Pentium III processor, 500 MHz
Minimum 256 MB RAM
5 GB hard disk
