Using Spark to analyze Sogou logs
Download the abridged version of the user query log; the full version is available at http://download.labs.sogou.com/dl/q.html
Data format description:
access time\tuser ID\t[query word]\trank of the URL in the returned results\tsequence number of the user's click\tURL clicked by the user
The user ID is assigned automatically from the browser's cookie when the user visits the search engine, so different queries issued from the same browser correspond to the same user ID. Implement the following:
1. Top 10 most popular query words
2. Top 10 users by number of queries
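To make the record layout concrete, here is a sketch of how one line splits into fields. The sample line and its values are made up for illustration; in SogouQ.reduced the rank and click order share one tab-delimited field separated by a space, so a tab split yields 5 fields:

```scala
object ParseExample {
  // Split one log line into its tab-separated fields
  def parse(line: String): Array[String] = line.split('\t')

  def main(args: Array[String]): Unit = {
    // Hypothetical line in the documented format (values are made up);
    // clicked URLs in this dataset carry no "http://" scheme
    val line = "20111230000005\tu123456\t[spark]\t1 2\tspark.apache.org/downloads"
    val fields = parse(line)
    println(fields(1))                // user ID
    println(fields(2))                // query word, in brackets
    println(fields(4).split('/')(0))  // domain part of the clicked URL
  }
}
```

Field 1 (the user ID) and field 2 (the query word) are what the two top-10 tasks aggregate over.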
The downloaded files are stored in HDFS. For installing Hadoop, I followed this blog: http://blog.csdn.net/stark_summer/article/details/4242427.
Since the downloaded file is GBK-encoded, it needs to be converted to UTF-8 before uploading to HDFS:
find . -name "*.txt" -exec sh -c 'iconv -f GB18030 -t UTF-8 "$1" > "$1.utf8"' _ {} \;
Then upload the converted file to HDFS:
hadoop fs -mkdir /data
hadoop fs -put /root/dfs/SogouQ.reduced /data/sogou
Next, we can write a Spark program that implements the tasks above:
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object WebLogInfo {
  def main(args: Array[String]) {
    val dataFile = "hdfs://vs1:9000/data/sogou/SogouQ.reduced"
    val resultFile = "hdfs://vs1:9000/data/result"
    val conf = new SparkConf().setAppName("WebLogInfoApp")
    val sc = new SparkContext(conf)
    // Split each line on tabs and keep only well-formed records
    val linesRdd = sc.textFile(dataFile).map(line => line.split('\t')).filter(_.length >= 5)
    // Top 10 users by number of queries (field 1 is the user ID)
    val userArray = linesRdd.map(w => (w(1), 1)).reduceByKey(_ + _).map(x => (x._2, x._1)).sortByKey(false).take(10)
    val userRdd = sc.parallelize(userArray, 1).map(x => (x._2, x._1))
    // Top 10 query words (field 2 is the query word)
    val wordArray = linesRdd.map(w => (w(2), 1)).reduceByKey(_ + _).map(x => (x._2, x._1)).sortByKey(false).take(10)
    val wordRdd = sc.parallelize(wordArray, 1).map(x => (x._2, x._1))
    // Top 50 clicked domains (field 4 is the clicked URL; keep the part before the first '/')
    val urlArray = linesRdd.map(w => (w(4).split('/')(0), 1)).reduceByKey(_ + _).map(x => (x._2, x._1)).sortByKey(false).take(50)
    val urlRdd = sc.parallelize(urlArray, 1).map(x => (x._2, x._1))
    // Merge the three result sets and write them to HDFS as a single file
    (userRdd ++ wordRdd ++ urlRdd).repartition(1).saveAsTextFile(resultFile)
    sc.stop()
  }
}
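The core pattern used three times above (count by key, then take the N most frequent) can be illustrated without a cluster using plain Scala collections. This is a sketch with made-up data, not part of the Spark job; groupBy plus a descending sort plays the role of reduceByKey(_ + _) followed by sortByKey(false) and take(n):

```scala
object TopNExample {
  // Count occurrences of each key and return the n most frequent,
  // mirroring reduceByKey(_ + _) -> descending sort -> take(n)
  def topN[K](items: Seq[K], n: Int): Seq[(K, Int)] =
    items.groupBy(identity)
         .map { case (k, vs) => (k, vs.size) }
         .toSeq
         .sortBy(-_._2)
         .take(n)

  def main(args: Array[String]): Unit = {
    val queries = Seq("spark", "hadoop", "spark", "hdfs", "spark", "hadoop")
    println(topN(queries, 2))  // List((spark,3), (hadoop,2))
  }
}
```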
The results can be obtained by packaging the code into a jar and submitting it to the Spark cluster.
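A possible submit sequence is sketched below. The jar path, Scala version, and master URL are assumptions (the class name WebLogInfo comes from the code above, and vs1 from the HDFS URIs); adjust them to your build and cluster:

```shell
# Package the program (e.g. with sbt); jar name below is illustrative
sbt package

# Submit to the cluster; --master URL is an assumption based on host vs1
spark-submit \
  --class WebLogInfo \
  --master spark://vs1:7077 \
  target/scala-2.10/weblog-info_2.10-1.0.jar

# Inspect the single output file written to HDFS
hadoop fs -cat /data/result/part-00000
```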