Prediction(1)Data Collection
All the data are stored in JSON format in S3 buckets. We can verify and view the JSON data with this online tool:
http://www.jsoneditoronline.org/
I did the implementation on Zeppelin, which is a really useful tool. Some of the important code is as follows:
val date_pattern = "2015/08/{17,18,19,20}" //week1
//val date_pattern = "2015/08/{03,04,05,06,07,08,09}" //week2
//val date_pattern = "2015/{07/27,07/28,07/29,07/30,07/31,08/01,08/02}"
//val date_pattern = "2015/07/29"
val clicks = sqlContext.jsonFile(s"s3n://mybucket/click/${date_pattern}/*/*")
This code follows the date pattern and loads all the matching files.
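The `{...}` part of the pattern is a glob that is expanded by the Hadoop filesystem layer, not by Spark itself. As a rough illustration in plain Scala (a simplified sketch, handling only a single brace group), the week1 pattern corresponds to these concrete path prefixes:

```scala
// Expand a one-level {a,b,c} glob into concrete path prefixes,
// mimicking what the Hadoop glob matcher does for the week1 pattern.
def expandBraces(pattern: String): Seq[String] = {
  val braces = """\{([^}]*)\}""".r
  braces.findFirstMatchIn(pattern) match {
    case Some(m) =>
      m.group(1).split(",").toSeq.map { alt =>
        pattern.substring(0, m.start) + alt + pattern.substring(m.end)
      }
    case None => Seq(pattern) // no braces: the pattern is already concrete
  }
}

val prefixes = expandBraces("2015/08/{17,18,19,20}")
// prefixes: Seq("2015/08/17", "2015/08/18", "2015/08/19", "2015/08/20")
```

Each expanded prefix is then matched against the S3 keys, so all four days of click logs are loaded in one `jsonFile` call.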
clicks.registerTempTable("clicks")
//clicks.printSchema
This registers the data as a temporary table; uncommenting the printSchema line prints out the schema of the JSON data.
val jobs = sc.textFile("s3n://mybucket/jobs/publishers/xxx.xml.gz")
import sqlContext.implicits._
val jobsDF = jobs.toDF()
This loads the gzip-compressed text file and converts it to a DataFrame.
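Calling `toDF()` on an `RDD[String]` gives a DataFrame with a single string column. If we want named columns, each line can be parsed into a case class first and then converted. The `Job` fields and the tab-separated layout below are assumptions for illustration only, since the real XML job feed format is not shown here:

```scala
// Hypothetical record type for one job line; the real feed is XML,
// so this is only a sketch of the map-then-toDF pattern.
case class Job(jobId: String, title: String)

def parseLine(line: String): Job = {
  val parts = line.split("\t")
  Job(parts(0), parts(1))
}

val job = parseLine("j-100\tSenior Engineer")
// job: Job("j-100", "Senior Engineer")
// In Spark this would become: val jobsDF = jobs.map(parseLine).toDF()
```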
%sql
select SUBSTR(timestamp,0,10), job_id, count(*) from applications group by SUBSTR(timestamp,0,10), job_id
The %sql interpreter gives us the ability to write SQL directly and displays the result below as a chart.
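The `SUBSTR(timestamp, 0, 10)` in the query keeps just the date portion of an ISO-8601 timestamp, so the counts are grouped per day per job. In plain Scala the same truncation is a simple substring (the sample timestamp below is a made-up value for illustration):

```scala
// SUBSTR(timestamp, 0, 10) in the SQL keeps the first 10 characters,
// i.e. the yyyy-MM-dd date part of the timestamp.
val timestamp = "2015-08-20T13:45:02Z" // hypothetical sample value
val clickDate = timestamp.substring(0, 10)
// clickDate: "2015-08-20"
```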
val clickDF = sqlContext.sql("select SUBSTR(timestamp,0,10) as click_date, job_id, count(*) as count from clicks where SUBSTR(timestamp,0,10)='2015-08-20' group by SUBSTR(timestamp,0,10), job_id")
import org.apache.spark.sql.functions._
val clickFormattedDF = clickDF.orderBy(asc("click_date"),desc("count"))
These commands run the query and sort the resulting DataFrame for us.
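The `orderBy(asc("click_date"), desc("count"))` sort puts earlier dates first and, within each date, the most-clicked jobs first. Sketched on an in-memory collection of (click_date, job_id, count) rows (the sample data is made up):

```scala
// Mirror of orderBy(asc("click_date"), desc("count")) on plain tuples:
// sort ascending by date, then descending by count (via negation).
val rows = Seq(
  ("2015-08-20", "job-2", 5L),
  ("2015-08-19", "job-1", 9L),
  ("2015-08-20", "job-3", 12L)
)
val sorted = rows.sortBy { case (date, _, count) => (date, -count) }
// sorted: Seq(("2015-08-19","job-1",9), ("2015-08-20","job-3",12), ("2015-08-20","job-2",5))
```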
val appFile = "s3n://mybucket/date_2015_08_20"
clickFormattedDF.printSchema
sc.parallelize(clickFormattedDF.collect, 1).saveAsTextFile(appFile)
This collects the result to the driver and writes it back to S3 as a single text file.
Here is the place to check the Hadoop cluster:
http://localhost:9026/cluster
And once we start the Spark context, we can visit this URL to check the Spark status:
http://localhost:4040/
References:
http://www.jsoneditoronline.org/
https://spark.apache.org/docs/1.3.0/sql-programming-guide.html
https://gist.github.com/bigsnarfdude/d9c0ceba1aa8c1cfa4e5
https://spark.apache.org/docs/1.3.1/api/scala/index.html#org.apache.spark.sql.DataFrame
Reposted from sillycat.iteye.com/blog/2237763