
Proposal for a

Thesis in the Field of
Information Technology
In Partial Fulfillment of the Requirements
For a Master of Liberal Arts Degree

Harvard University
Extension School

April 07, 2014

Luis F. Montoya
13926 NW 21st Lane
Gainesville, FL 32606
352.331.7230
luism@ieee.org

Proposed Start Date: April 2014
Anticipated Date of Graduation: After June 2014
Thesis Director: TBD

1. Thesis Title

Implementation of a Hadoop Cluster with the Necessary Tools for a Small-to-Midsize Enterprise Using Off-the-Shelf Computers and the Ubuntu Operating System, and Determination of the Data Size That Justifies Its Usage over Relational Databases.

2. Abstract

Since Google announced its MapReduce programming model for processing large data sets, new models have kept emerging, offering performance improvements for specific tasks. One of these models is Hadoop, but it is sometimes more efficient to use relational databases instead if the data is not large enough. In other words, there should be a threshold of data size or data structure that merits migration from relational databases to Hadoop/MapReduce.
This project intends to create a small Hadoop cluster and study a way to find that threshold for both structured and unstructured data (i.e., data that does not fit into tables). Several web publications and blogs have discussed this topic (Stucchio, C.), (O'Grady, S.), (Cross, B.), but none of these sources makes a thorough analysis of where that threshold lies, in terms of both data size and data structure.

3. Description of the Project

3.1 Introduction to Hadoop

Apache Hadoop is an open-source project for the distributed processing of large data sets across clusters of commodity servers (What is Hadoop?). The purpose of this approach is to use lower-cost hardware and rely on the software's ability to detect and handle failures. Hadoop is a collection of projects for solving large and complex data problems.
Apache Hadoop has two main subprojects: MapReduce and HDFS. MapReduce is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster. The work is broken down into mapper and reducer tasks that manipulate data stored across a cluster of servers. HDFS, the Hadoop Distributed File System, allows applications to run across multiple servers. Data in a Hadoop cluster is broken down into blocks and distributed throughout the cluster. In this way, the map and reduce functions can be executed on smaller subsets of the larger data sets, providing the scalability required for big data processing.
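To make the division into mapper and reducer tasks concrete, the following is a minimal word-count sketch written for Hadoop Streaming, which allows the mapper and reducer to be ordinary scripts that read standard input and write standard output. The scripts are illustrative only and are not part of the planned experiments.

    #!/usr/bin/env python
    # mapper.py -- emits one "word<TAB>1" line for every word read from stdin.
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print("%s\t%d" % (word, 1))

    #!/usr/bin/env python
    # reducer.py -- sums the counts for each word.
    # Hadoop Streaming sorts mapper output by key, so identical words arrive together.
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print("%s\t%d" % (current_word, current_count))
            current_word, current_count = word, int(count)
    if current_word is not None:
        print("%s\t%d" % (current_word, current_count))

A run of this pair on the cluster would look roughly like the following, where the path to the streaming jar and the input/output directories depend on the distribution and data in use:

    hadoop jar /path/to/hadoop-streaming.jar -file mapper.py -file reducer.py \
        -mapper mapper.py -reducer reducer.py -input /data/in -output /data/out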
The intention of this project is to implement a Hadoop cluster using low-priced personal computers running a Linux OS, although at the time of writing, Amazon Web Services (AWS) has just slashed prices for its cloud computing services, which probably makes AWS more practical from the hardware-maintenance point of view, especially for small cluster sizes. If the local-machine approach is used, the computers can be obtained locally and relatively inexpensively. The cluster will have between 4 and 10 of these servers.
The implementation section discusses the hardware and software requirements in more detail. Needless to say, the software to be installed has to be open source to minimize costs, as required by small businesses trying to build their own cloud infrastructure.
Since MapReduce programming is challenging (O'Grady), one or both of the most widely used higher-level projects, Hive and Pig, will be examined and/or used. These projects provide SQL-like interfaces that complement the MapReduce functionality.
Finally, for near-real-time queries on large data, one of the tools to be used is Impala from Cloudera, which outperforms Hive on certain types of queries. The cluster implemented here will have Impala installed.
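As an illustration of how such a Hive-versus-Impala comparison might be timed, the sketch below runs the same aggregate query through the hive and impala-shell command-line clients on a cluster node and reports wall-clock times. The table name and query are placeholders for whatever data set ends up being loaded.

    #!/usr/bin/env python
    # time_query.py -- rough wall-clock comparison of Hive vs. Impala on one query.
    # Assumes it runs on a node where the hive and impala-shell clients are configured;
    # the "weather" table is a placeholder.
    import subprocess
    import time

    QUERY = "SELECT station_id, AVG(temperature) FROM weather GROUP BY station_id;"

    def timed(cmd):
        start = time.time()
        subprocess.check_call(cmd)
        return time.time() - start

    hive_seconds = timed(["hive", "-e", QUERY])
    impala_seconds = timed(["impala-shell", "-q", QUERY])

    print("Hive:   %.1f s" % hive_seconds)
    print("Impala: %.1f s" % impala_seconds)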

3.2 The Data Sources and Data Manipulation

Since this project will use data of various sizes, the idea is to obtain data from sources that generate information continuously, such as weather feeds, Wikipedia data dumps, Google Developers data dumps, or the Public Data Sets on Amazon Web Services. After the data is obtained, it needs to be merged (if necessary), cleaned, analyzed, and finally presented in a meaningful way to the end user. To accomplish these tasks, the R programming language will be used. R is well suited when data is obtained from different sources, because programming in R can present the data in a more usable form.
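Although this step is planned in R, the following Python/pandas sketch illustrates the kind of merge-and-clean pass intended; the file names and column names are hypothetical.

    #!/usr/bin/env python
    # clean_merge.py -- illustrative merge/clean pass (file and column names are hypothetical).
    import pandas as pd

    # Two hypothetical extracts: daily readings plus station metadata.
    readings = pd.read_csv("readings.csv")   # station_id, record_date, temperature
    stations = pd.read_csv("stations.csv")   # station_id, name, latitude, longitude

    merged = readings.merge(stations, on="station_id", how="inner")
    merged = merged.dropna(subset=["temperature"])  # drop rows with missing measurements
    merged.to_csv("weather_clean.csv", index=False)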

3.3 Software Tools to be Used

The software distribution to be used for this project is Cloudera's Hadoop distribution (Cloudera Downloads), and the servers will have installed all the open-source programs required to run this project successfully, such as MySQL, R, Hive, Impala, and MongoDB. The Cloudera distribution installs Hive and Impala as part of its package. Other tools will be installed as the need arises.
One of the most important aspects to consider when switching to Hadoop is the ETL (extraction, transformation, and loading) step for the data. If the data is only of medium size but unstructured, using Hadoop integrated with R might provide fast and interactive queries. Other tools necessary for the success of this project will also be studied.
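For example, a minimal loading step might copy a cleaned file into HDFS and declare a Hive external table over it. The paths, table name, and columns below are placeholders.

    #!/usr/bin/env python
    # load_to_hive.py -- copy a cleaned CSV into HDFS and expose it as a Hive external table.
    # All paths, table names, and columns are placeholders.
    import subprocess

    subprocess.check_call(["hdfs", "dfs", "-mkdir", "-p", "/data/weather"])
    subprocess.check_call(["hdfs", "dfs", "-put", "weather_clean.csv", "/data/weather/"])

    ddl = """
    CREATE EXTERNAL TABLE IF NOT EXISTS weather (
      station_id STRING,
      record_date STRING,
      temperature DOUBLE
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/data/weather';
    """
    subprocess.check_call(["hive", "-e", ddl])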

4. Work Plan

4.1 Create a Hadoop cluster of about 4 to 10 nodes on AWS and run some of the standard experiments to verify that the cluster is operating normally.
4.2 Obtain some of the data mentioned in Section 3.2 and run and time several queries on the Hadoop cluster.
4.3 Of the data obtained, investigate which is suitable for queries using the MySQL relational database management system (RDBMS), and run the same type of queries as in 4.2 against MySQL installed on an AWS instance (see the timing sketch after this list). Repeat the tests with different data sizes until the queries fail to produce results in a timely manner. Also, investigate possible ways to manipulate the unstructured data, using R and related tools, so that it is friendlier to an RDBMS.
4.4 In parallel with the AWS implementation, a small Hadoop cluster will be built using off-the-shelf computers, each with at least 1 GB of memory and a 500 GB hard drive. The same experiments will be run on this homebrew cluster, and the results will be compared and analyzed.
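As mentioned in 4.3, the MySQL side of the timing comparison might be scripted roughly as follows, assuming the mysql-connector-python client; the connection parameters and table name are placeholders mirroring the data loaded into Hadoop.

    #!/usr/bin/env python
    # mysql_timing.py -- time one query against MySQL for comparison with the Hive/Impala runs.
    # Connection parameters and the table name are placeholders.
    import time
    import mysql.connector

    conn = mysql.connector.connect(host="localhost", user="bench",
                                   password="secret", database="benchmarks")
    cursor = conn.cursor()

    start = time.time()
    cursor.execute("SELECT station_id, AVG(temperature) FROM weather GROUP BY station_id")
    rows = cursor.fetchall()
    elapsed = time.time() - start

    print("MySQL returned %d rows in %.1f s" % (len(rows), elapsed))
    cursor.close()
    conn.close()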

5. Implementation

5.1 Create an AWS EC2 virtual server running CentOS or Red Hat Linux, of type m1.medium or m1.large, capable of handling Hadoop. Install Hadoop on it and create an image of this instance. (More details of the installation process will be given in the final document.)
5.2 Create several instances of the image created in 5.1 (a brief launch sketch appears after this list).

5.3 Configure the Hadoop cluster (White, T.).

5.4 Run the experiments described in Section 4.

5.5 Repeat the same implementation and experiments using the homebrew Hadoop cluster.
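A brief sketch of step 5.2, launching several worker instances from the image created in 5.1, might look like the following. It assumes the boto3 AWS SDK for Python; the AMI ID, key pair, instance type, and count are placeholders.

    #!/usr/bin/env python
    # launch_cluster.py -- start several EC2 instances from a prepared Hadoop image.
    # Assumes the boto3 AWS SDK; the AMI ID, key name, type, and counts are placeholders.
    import boto3

    ec2 = boto3.resource("ec2", region_name="us-east-1")
    instances = ec2.create_instances(
        ImageId="ami-xxxxxxxx",       # image created in step 5.1
        InstanceType="m1.medium",     # or m1.large, as in step 5.1
        MinCount=4,
        MaxCount=4,
        KeyName="hadoop-cluster-key",
    )
    for instance in instances:
        print("Launched", instance.id)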

6. Results

TBD

7. Conclusion

TBD

8. References

8.1 White, Tom. Hadoop: The Definitive Guide, 3rd Edition. O'Reilly Media, 2012.

8.2 What is Hadoop? IBM Software, retrieved from http://www-01.ibm.com/software/data/infosphere/hadoop/

8.3 Amazon Web Services (AWS), retrieved from https://aws.amazon.com/

8.4 Cloudera Inc., Cloudera Downloads, retrieved from http://www.cloudera.com/content/support/en/downloads.html

8.5 O'Grady, S. What Factors Justify the Use of Apache Hadoop? Retrieved from http://redmonk.com/sogrady/2011/01/13/apache-hadoop/

8.6 Big Data Now. O'Reilly Media, 2013.

8.7 Planning for Big Data. O'Reilly Radar Team, 2012.

8.8 Stucchio, C. Don't Use Hadoop - Your Data Isn't That Big. Retrieved from http://www.chrisstucchio.com/blog/2013/hadoop_hatred.html

8.9 Cross, B. Big Data Is Less About Size And More About Freedom. Retrieved from http://techcrunch.com/2010/03/16/big-data-freedom/

8.10 Szegedi, I. Integrating R with Cloudera Impala for Real-Time Queries on Hadoop. Retrieved from https://bighadoop.wordpress.com/2013/11/25/integrating-r-with-cloudera-impala-for-real-timequeries-on-hadoop/
