Prediction(2)R running through Spark/Hadoop Cluster

1. How to Load the Config in R
install.packages("yaml", repos="http://cran.rstudio.com/")

library("yaml")
config <- yaml.load_file("config.yaml")

config$spark$home

This code runs fine in RStudio, and it can also be run directly from the shell:
> Rscript scripts/WordCount.R
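
For reference, these scripts only read config$spark$home and (later) config$spark$server from config.yaml, so a minimal file could look like the sketch below. The values are placeholders based on the install path used later in this post; adjust them to your own cluster.

# config.yaml - placeholder values, adjust to your environment
spark:
  home: /home/carl/install/spark-1.4.1-bin-hadoop2.6
  server: spark://your-master-host:7077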

2. Prepare Hadoop Data
Create the Directory
> hadoop fs -mkdir /user/carl/sparkR

Upload the file
> cd /home/carl/install/spark-1.4.1-bin-hadoop2.6/examples/src/main/resources

> hadoop fs -put ./people.json /user/carl/sparkR/
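
A quick listing confirms the file landed in HDFS:

> hadoop fs -ls /user/carl/sparkR/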

3. This R Script Runs Well on the Hadoop Cluster
#install.packages("yaml", repos="http://cran.rstudio.com/")

# Read the Spark settings from config.yaml
library("yaml")
config <- yaml.load_file("config.yaml")

spark_home <- config$spark$home
spark_r_location <- paste0(spark_home, "/R/lib")
spark_server <- config$spark$server

# Load the SparkR package shipped with the Spark distribution
library("SparkR", lib.loc = spark_r_location)

# Initialize the Spark context and the SQL context
sc <- sparkR.init(master = spark_server, appName = "SparkR_Wordcount",
                  sparkHome = spark_home)
sqlContext <- sparkRSQL.init(sc)

# Load people.json from HDFS (relative to /user/carl) into a DataFrame
path <- file.path("sparkR/people.json")
peopleDF <- jsonFile(sqlContext, path)

printSchema(peopleDF)
head(peopleDF)

This runs well both in RStudio and via Rscript.
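
Once peopleDF is loaded, the same sqlContext can also be used to query it. The snippet below is a small sketch against the SparkR 1.4 API (registerTempTable and sql); the name and age columns are assumed from the standard people.json example that ships with Spark.

# Register the DataFrame as a temporary table and query it with SQL
registerTempTable(peopleDF, "people")
teenagers <- sql(sqlContext, "SELECT name FROM people WHERE age >= 13 AND age <= 19")
head(teenagers)

# Shut down the SparkR context when finished
sparkR.stop()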

Tips
1. Error Message:
trying to use CRAN without setting a mirror

Solution:
install.packages("yaml", repos="http://cran.rstudio.com/")

Adding the repos argument there fixes the problem.
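
Alternatively, set a default CRAN mirror once per session (or in ~/.Rprofile) so that later install.packages() calls do not need an explicit repos argument:

# Set a default CRAN mirror, then install without repos
options(repos = c(CRAN = "http://cran.rstudio.com/"))
install.packages("yaml")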

References:
http://www.mayin.org/ajayshah/KB/R/

http://stackoverflow.com/questions/5272846/how-to-get-parameters-from-config-file-in-r-script

wordcount example
https://github.com/amplab-extras/SparkR-pkg/blob/master/examples/wordcount.R

Reposted from sillycat.iteye.com/blog/2242559