Getting Started with Spark

Using Spark to analyze Sogou query logs

Download the abridged version of the user query log (SogouQ.reduced); the full version is available at http://download.labs.sogou.com/dl/q.html.

Data format description:

access time\tuser ID\t[query word]\trank of the URL in the returned results\tsequence number of the user's click\tURL clicked by the user

The user ID is assigned automatically from the cookie information when the user accesses the search engine with a browser, so different queries entered from the same browser at the same time correspond to the same user ID. We want to implement the following:
1. Top 10 most popular query words
2. Top 10 users by number of queries
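
To make the field layout concrete, here is a small stand-alone Scala sketch that splits one made-up sample line on tabs. Note that in the downloaded file the rank and the click order are separated by a space rather than a tab, which is why the program further below treats the clicked URL as field index 4; the sample line itself is invented, not real data:

object LineFormatDemo {
  def main(args: Array[String]): Unit = {
    // Made-up sample line in the documented format (tab-separated);
    // the rank and click order share one field, so splitting on '\t'
    // yields five fields: time, user ID, query, rank/click, URL.
    val sample = "00:00:00\t12345678901234567890\t[spark]\t1 1\twww.example.com/page"
    val Array(time, userId, query, rankAndClick, url) = sample.split('\t')
    println(s"time=$time, user=$userId, query=$query, rank/click=$rankAndClick, url=$url")
  }
}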

 

The downloaded file will be stored in HDFS. For installing Hadoop, I refer to this blog post: http://blog.csdn.net/stark_summer/article/details/4242427.

Since the downloaded file is encoded in GBK, it needs to be transcoded to UTF-8 before being uploaded to HDFS:

 

find *.txt -exec sh -c "iconv -f GB18030 -t UTF8 {} > {}.txt" \;
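
Optionally, the file utility can be used to confirm the result is now UTF-8 (the file name here stands in for whichever converted copy the command above produced):

file SogouQ.reduced.txt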

Then upload the transcoded file to HDFS:

 

 

hadoop fs -mkdir /data
hadoop fs -mkdir /data/sogou
hadoop fs -put /root/dfs/SogouQ.reduced /data/sogou
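
A quick listing confirms the file landed where the Spark program below expects it:

hadoop fs -ls /data/sogou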

 

Next, we can write a Spark program to solve the problems above:

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object WebLogInfo {
  def main(args: Array[String]) {
    val dataFile = "hdfs://vs1:9000/data/sogou/SogouQ.reduced"
    val resultFile = "hdfs://vs1:9000/data/result"
    val conf = new SparkConf().setAppName("WebLogInfoApp")
    val sc = new SparkContext(conf)

    // Split each line on tabs and keep only well-formed records
    // (index 1 = user ID, index 2 = query word, index 4 = clicked URL).
    val linesRdd = sc.textFile(dataFile).map(line => line.split('\t')).filter(_.length >= 5)

    // Top 10 users by number of queries.
    val userArray = linesRdd.map(w => (w(1), 1)).reduceByKey(_ + _).map(x => (x._2, x._1)).sortByKey(false).take(10)
    val userRdd = sc.parallelize(userArray, 1).map(x => (x._2, x._1))

    // Top 10 most popular query words.
    val wordArray = linesRdd.map(w => (w(2), 1)).reduceByKey(_ + _).map(x => (x._2, x._1)).sortByKey(false).take(10)
    val wordRdd = sc.parallelize(wordArray, 1).map(x => (x._2, x._1))

    // Top 50 clicked domains (the part of the URL before the first '/').
    val urlArray = linesRdd.map(w => (w(4).split('/')(0), 1)).reduceByKey(_ + _).map(x => (x._2, x._1)).sortByKey(false).take(50)
    val urlRdd = sc.parallelize(urlArray, 1).map(x => (x._2, x._1))

    // Write all three result sets into a single output file on HDFS.
    (userRdd ++ wordRdd ++ urlRdd).repartition(1).saveAsTextFile(resultFile)
    sc.stop()
  }
}

Package the code into a jar, upload it to the Spark cluster, and submit it to compute the results.
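
A minimal submit command might look like the following; the jar name and the master URL (here assumed to run on the same host as the HDFS namenode) are placeholders that depend on your build and cluster setup:

spark-submit --class WebLogInfo --master spark://vs1:7077 weblog-info.jar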

 
