Pre-requisites: We assume that the user has the following up and running before starting R and Hadoop integration:
Ubuntu 12.04
Hadoop 1.x. If you do not have Hadoop preinstalled on your Ubuntu machine, please follow the Single-node-cluster-(pseudo-distributed-mode-cluster.pdf guide present in your LMS under Module-7 to set up the environment for R integration with Hadoop. Once the Hadoop installation is done, make sure that all the processes are running:
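A quick way to confirm that the Hadoop daemons are up is the jps utility that ships with the JDK. The process names in the comment below assume a Hadoop 1.x pseudo-distributed setup, so treat this as a sketch rather than authoritative output:

```shell
# List running JVM processes; on a healthy Hadoop 1.x single-node
# cluster you should see entries similar to:
#   NameNode, DataNode, SecondaryNameNode, JobTracker, TaskTracker
jps
```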
Note: R integration with Hadoop has issues with java-openjdk. To resolve this, we need oracle-java6 installed on the machine. To install oracle-java6, follow these steps:
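The original steps are not reproduced in this copy. One common approach on Ubuntu 12.04 at the time was the third-party WebUpd8 PPA, sketched below; the PPA and package names are assumptions based on that era's tooling (the PPA has since been discontinued), so adapt them to whatever Oracle Java 6 source you actually use:

```shell
# Add the third-party PPA that provided Oracle Java installers
sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
# Downloads and installs Oracle Java 6 (prompts to accept the license)
sudo apt-get install oracle-java6-installer
# Verify that the Oracle JVM is now the active Java version
java -version
```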
Installing RHadoop
RHadoop mainly consists of the following three R packages: rmr2, rhdfs, and rhbase. The rmr2 package provides Hadoop MapReduce functionality in R, rhdfs provides HDFS file operations in R, and rhbase provides HBase connectivity from R.
Download the following packages from http://cran.cnr.berkeley.edu/: bitops, rhdfs, digest, rJava, functional, RJSONIO, plyr, rmr2, Rcpp, stringr, reshape2. The installation requires the corresponding tar.gz archives to be downloaded. If the downloaded files are in Downloads, give the following command:
Rcpp Package
RJSONIO Package
digest Package
functional package
stringr package
plyr package
bitops package
reshape2 package
rmr2 package
rJava package
sudo R CMD INSTALL rJava_0.9-3.tar.gz
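The same R CMD INSTALL pattern applies to each of the packages listed above. A sketch of the full sequence, in rough dependency order, is below; the wildcarded file names are assumptions, so substitute the exact versions of the tar.gz archives you downloaded:

```shell
cd ~/Downloads
# Install the CRAN dependencies first (file names are examples;
# substitute the archives you actually downloaded)
sudo R CMD INSTALL Rcpp_*.tar.gz
sudo R CMD INSTALL RJSONIO_*.tar.gz
sudo R CMD INSTALL digest_*.tar.gz
sudo R CMD INSTALL functional_*.tar.gz
sudo R CMD INSTALL stringr_*.tar.gz
sudo R CMD INSTALL plyr_*.tar.gz
sudo R CMD INSTALL bitops_*.tar.gz
sudo R CMD INSTALL reshape2_*.tar.gz
sudo R CMD INSTALL rJava_*.tar.gz
# RHadoop packages last, since they depend on the ones above
sudo R CMD INSTALL rhdfs_*.tar.gz
sudo R CMD INSTALL rmr2_*.tar.gz
```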
Without the mapreduce function, we could write simple R code to double all the numbers from 1 to 100:
> ints = 1:100
> doubleInts = sapply(ints, function(x) 2*x)
> head(doubleInts)
[1]  2  4  6  8 10 12
With the RHadoop rmr2 package, we can use the mapreduce function to implement the same calculation; see the doubleInts.R script:
Sys.setenv(HADOOP_HOME="/home/vikas/hadoop")
Sys.setenv(HADOOP_CMD="/home/vikas/hadoop/bin/hadoop")
library(rmr2)
library(rhdfs)
ints = to.dfs(1:100)
calc = mapreduce(input = ints,
                 map = function(k, v) cbind(v, 2*v))
from.dfs(calc)
$val
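Assuming the script above is saved as doubleInts.R, it can be run non-interactively with Rscript, as sketched below; this requires R, the rmr2 and rhdfs packages, and a running Hadoop cluster. from.dfs returns a list with $key and $val components, where $val here is a two-column matrix pairing each integer with its double:

```shell
# Run the RHadoop script outside the R console
# (HADOOP_HOME and HADOOP_CMD are set inside the script itself)
Rscript doubleInts.R
```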