
R-Hadoop Integration on Ubuntu
This manual is a guide to R and Hadoop integration on Ubuntu 12.04.

Pre-requisites
We assume that the user has the following up and running before starting the R and Hadoop integration:
Ubuntu 12.04
Hadoop 1.x+

If you do not have Hadoop preinstalled on your Ubuntu machine, please follow the Single-node-cluster-(pseudo-distributed-mode-cluster.pdf guide in your LMS under Module-7 to set up the environment for R integration with Hadoop. Once the Hadoop installation is done, make sure that all the processes are running:
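
One way to check the running processes is the JDK's jps tool, which lists Java processes as "<pid> <name>" lines. The sketch below assumes a single-node Hadoop 1.x setup, where the five daemons are NameNode, DataNode, SecondaryNameNode, JobTracker and TaskTracker. The helper name missing_daemons is hypothetical and not part of Hadoop.

```shell
# The five Hadoop 1.x daemons expected on a single-node cluster (assumption).
HADOOP_DAEMONS="NameNode DataNode SecondaryNameNode JobTracker TaskTracker"

# missing_daemons (hypothetical helper): reads jps-style output
# ("<pid> <name>" per line) on stdin and prints every expected daemon
# that does not appear in it.
missing_daemons() {
    jps_output=$(cat)
    for daemon in $HADOOP_DAEMONS; do
        # Compare against the whole second field, so "NameNode" does not
        # accidentally match "SecondaryNameNode".
        if ! printf '%s\n' "$jps_output" |
            awk -v d="$daemon" '$2 == d { found = 1 } END { exit !found }'; then
            echo "$daemon"
        fi
    done
}
```

With a JDK on the PATH, running `jps | missing_daemons` should print nothing when all five daemons are up.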

Note: R integration with Hadoop has issues with java-openjdk. To resolve them, oracle-java6 must be installed on the machine. To install oracle-java6, follow these steps:

Run the command:


sudo apt-get update

When prompted, select Yes to accept the license agreement.

Edit the .bashrc file:


# Set Hadoop-related environment variables
export CONF=/home/user/hadoop-1.2.0/conf
# Set JAVA_HOME
export JAVA_HOME=/usr/lib/jvm/java-6-oracle
# Add Hadoop bin/ directory to PATH
export PATH=$PATH:/home/user/hadoop-1.2.0/bin

Note: Replace these paths with the exact locations on your system.

Make sure JAVA_HOME is set to the correct java location.
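
A quick sanity check for JAVA_HOME is to confirm that the directory contains an executable bin/java. The following is a minimal sketch; check_java_home is a hypothetical helper name, not a standard command.

```shell
# check_java_home DIR (hypothetical helper)
# Succeeds only if DIR contains an executable bin/java.
check_java_home() {
    if [ -x "$1/bin/java" ]; then
        echo "OK: $1/bin/java found"
    else
        echo "ERROR: no executable java under $1/bin" >&2
        return 1
    fi
}
```

After reloading .bashrc (for example with `source ~/.bashrc`), call it as `check_java_home "$JAVA_HOME"`.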

Installing RHadoop
RHadoop consists mainly of the following three R packages:
rmr2
rhdfs
rhbase

The rmr2 package provides Hadoop MapReduce functionality in R, rhdfs provides HDFS file operations in R, and rhbase provides HBase connectivity from R.

Step #1: Update the sources.list.


sudo gedit /etc/apt/sources.list

Add the line:


deb http://cran.cnr.berkeley.edu/bin/linux/ubuntu/ precise/
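
As an alternative to editing the file by hand, the line can be appended from the shell. The sketch below uses a hypothetical helper, add_apt_source, that appends a line only when it is not already present, so running it twice does not create a duplicate entry. It takes the target file as a parameter so it can be tried on a copy first.

```shell
# add_apt_source LINE FILE (hypothetical helper)
# Appends LINE to FILE unless an identical line is already present.
add_apt_source() {
    line=$1
    file=$2
    # -x: match the whole line, -F: fixed string (no regex), -q: quiet
    if ! grep -qxF -- "$line" "$file" 2>/dev/null; then
        printf '%s\n' "$line" >> "$file"
    fi
}
```

Writing to /etc/apt/sources.list requires root privileges, so the function would need to be defined and called in a root shell, for example `add_apt_source "deb http://cran.cnr.berkeley.edu/bin/linux/ubuntu/ precise/" /etc/apt/sources.list`.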

Step #2: Update the package lists.


sudo apt-get update

Step #3: Install r-base package.


sudo apt-get install r-base

Checking the version of R:
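
The version can be read from the first line of the `R --version` output, which has the form `R version X.Y.Z (date) ...`. The small sketch below extracts just the version number; r_version_from is a hypothetical helper name.

```shell
# r_version_from (hypothetical helper): reads `R --version` output on stdin
# and prints only the version number from its first line.
r_version_from() {
    sed -n '1s/^R version \([0-9][0-9.]*\).*/\1/p'
}
```

With R installed, it is used as `R --version | r_version_from`.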

Download the following packages from http://cran.cnr.berkeley.edu/:
bitops
rhdfs
digest
rJava
functional
RJSONIO
plyr
rmr2
Rcpp
stringr
reshape2

The installation requires the corresponding tar.gz archives to be downloaded. If the downloaded files are in Downloads, give the following command:

To untar the zipped file:

Then run the R CMD INSTALL command with sudo privileges for each package, in this order:
Rcpp
RJSONIO
digest
functional
stringr
plyr
bitops
reshape2
rmr2
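
The per-package installs above can be sketched as a loop. The function below only prints the install commands, so the order can be reviewed before anything is run. It assumes the archives follow the usual `<name>_<version>.tar.gz` naming and sit in one directory. print_install_commands is a hypothetical helper name.

```shell
# Packages in the order this manual lists them, with rmr2 last.
RHADOOP_PKGS="Rcpp RJSONIO digest functional stringr plyr bitops reshape2 rmr2"

# print_install_commands DIR (hypothetical helper)
# Prints one "sudo R CMD INSTALL <tarball>" line for every package whose
# <name>_<version>.tar.gz archive exists in DIR. Nothing is installed.
print_install_commands() {
    dir=$1
    for pkg in $RHADOOP_PKGS; do
        for tarball in "$dir"/"$pkg"_*.tar.gz; do
            # An unmatched glob stays literal, so skip paths that do not exist.
            [ -e "$tarball" ] || continue
            echo "sudo R CMD INSTALL $tarball"
        done
    done
}
```

For example, `print_install_commands ~/Downloads` lists the commands for the archives found in Downloads; each printed line can then be run by hand.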

Before installing the rJava package, reconfigure R's Java settings so that R picks up the Oracle JRE:


sudo JAVA_HOME=/usr/lib/jvm/java-6-oracle/jre R CMD javareconf

rJava package
sudo R CMD INSTALL rJava_0.9-3.tar.gz

rhdfs package (set HADOOP_CMD to the location of the hadoop binary on your system)
sudo HADOOP_CMD=/home/istvan/hadoop/bin/hadoop R CMD INSTALL rhdfs_1.0.5.tar.gz

Make sure that the following packages are installed:
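
One way to verify this from the shell is to look for each package's directory in the R library. An installed R package has a DESCRIPTION file in its own directory, which the sketch below checks for. missing_r_packages is a hypothetical helper name, and the library path in the usage note is an assumption about where `sudo R CMD INSTALL` places packages on Ubuntu.

```shell
# missing_r_packages LIBDIR PKG... (hypothetical helper)
# Prints each PKG that has no DESCRIPTION file under LIBDIR/PKG.
missing_r_packages() {
    libdir=$1
    shift
    for pkg in "$@"; do
        [ -f "$libdir/$pkg/DESCRIPTION" ] || echo "$pkg"
    done
}
```

For example, `missing_r_packages /usr/local/lib/R/site-library rmr2 rhdfs rJava` prints nothing if all three are present. Adjust the path if your library is elsewhere (running `.libPaths()` inside R shows the directories in use).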

Getting started with RHadoop


In principle, an RHadoop MapReduce job is similar to R's lapply function, which applies a function over a list or vector.

Without the mapreduce function, we could write simple R code to double all the numbers from 1 to 100:
> ints = 1:100
> doubleInts = sapply(ints, function(x) 2*x)
> head(doubleInts)
[1] 2 4 6 8 10 12

With the RHadoop rmr2 package, we can use the mapreduce function to implement the same calculation (see the doubleInts.R script):

Sys.setenv(HADOOP_HOME="/home/vikas/hadoop")
Sys.setenv(HADOOP_CMD="/home/vikas/hadoop/bin/hadoop")
library(rmr2)
library(rhdfs)
ints = to.dfs(1:100)
calc = mapreduce(input = ints, map = function(k, v) cbind(v, 2*v))
from.dfs(calc)
$val
