Word frequency statistics and PV counting in the Spark shell

I tested every step below myself and wrote it up the way I understand it; feel free to use it as a reference, and please leave a comment to correct me if anything is wrong.

Sample data:

[hadoop@h201 ~]$ cat hh.txt

hello,world

hello,hadoop

hello,oracle

hadoop,oracle

hello,world

hello,hadoop

hello,oracle

hadoop,oracle

 

Word frequency statistics, sorted in descending order of word count, explained step by step

1. Load the file into an RDD

scala> val file = sc.textFile("hdfs://h201:9000/hh.txt")

2. Split each line on commas with flatMap, producing a flat collection of single words; the `_` is a placeholder for each input element

scala> val h1 = file.flatMap(_.split(","))

3. map turns each word into a key-value pair of the form (word, 1). reduceByKey() then applies the function in parentheses to the values of each key, iterating over them; `_+_` means accumulate, so the result is one (k, v) pair per distinct word with its total count

scala> val h2 = h1.map(x => (x, 1)).reduceByKey(_ + _)
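
To see what h2 holds at this point, you can collect it in the shell (a quick sanity check only, since collect pulls everything back to the driver; with the sample hh.txt above, the pairs come back in no particular order):

scala> h2.collect
res0: Array[(String, Int)] = Array((hello,6), (hadoop,4), (oracle,4), (world,2))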

4. A second map receives the (k, v) pairs from the previous step and swaps key and value on output. For example:

the input ("hello",5) becomes (5,"hello")

scala> val h3 = h2.map(x => (x._2, x._1))

5. Sort the result by key

scala> val h4 = h3.sortByKey(false)   // false = descending, true = ascending

6. Use map to swap the sorted key-value pairs back again. For example:

(5,hello) (4,hadoop) becomes (hello,5) (hadoop,4)

scala> val h5 = h4.map(x => (x._2, x._1))

7. The word frequency statistics are now complete and sorted in descending order of count. The last step is to write the result to HDFS; note that the argument to saveAsTextFile is a directory, not a file

scala> h5.saveAsTextFile("hdfs://h201:9000/output1")
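
To verify the output from the Linux shell you can use the standard hdfs dfs -cat command (the part-file names depend on the number of partitions; with the sample data the contents look something like this):

[hadoop@h201 ~]$ hdfs dfs -cat hdfs://h201:9000/output1/part-*
(hello,6)
(hadoop,4)
(oracle,4)
(world,2)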

The operations above were split into steps for easy understanding; they can all be combined into a single line, as follows:

scala> file.flatMap(_.split(",")).map(x => (x, 1)).reduceByKey(_ + _).map(x => (x._2, x._1)).sortByKey(false).map(x => (x._2, x._1)).saveAsTextFile("hdfs://h201:9000/output1")
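
As a side note, the swap-sort-swap sequence can also be replaced with RDD.sortBy, which sorts by an arbitrary key function (here the count) without changing the shape of the pairs. The path hdfs://h201:9000/output2 is just an example, since the output directory must not already exist:

scala> file.flatMap(_.split(",")).map(x => (x, 1)).reduceByKey(_ + _).sortBy(_._2, false).saveAsTextFile("hdfs://h201:9000/output2")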

 

Difference between flatMap() and map()

flatMap() and map() apply the same function to each input line but produce differently shaped results.

Sample data:

hello,world

hello,hadoop

hello,oracle

Load the file as an RDD:

scala> val file = sc.textFile("hdfs://xxx:9000/xx.txt")

Both examples use the same split method to separate on commas.

scala> val fm = file.flatMap(_.split(","))

After each line is split on commas, every resulting word is placed in one flat collection, so anything that consumes fm downstream receives one word at a time:

In Java terms this is {hello,world,hello,hadoop,hello,oracle}, the equivalent of a one-dimensional array.

scala> val m = file.map(_.split(","))

After each line is split on commas, each line becomes a string array, and those arrays go into one outer collection, so anything that consumes m downstream receives one array at a time:

In Java terms this is {{hello,world},{hello,hadoop},{hello,oracle}}, the equivalent of a two-dimensional array.
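
A quick way to see the difference in the shell, using the three-line sample above (again, collect is only for small test data):

scala> fm.collect
res0: Array[String] = Array(hello, world, hello, hadoop, hello, oracle)

scala> m.collect
res1: Array[Array[String]] = Array(Array(hello, world), Array(hello, hadoop), Array(hello, oracle))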

This distinction is useful when counting PV (page views) from Apache access logs. For example, with logs in this format:

123.23.4.5 - - xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

23.12.4.5 - - xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

we only need the first space-separated column (the IP address). flatMap is not suitable here, so we use map:

scala> val file = sc.textFile("hdfs://h201:9000/access.log")

scala> val h1 = file.map(_.split(" ", 2))    // split on the first space, into at most two columns

scala> val h2 = h1.map(x => (x(0), 1))       // take column 0 of each array, i.e. the IP

scala> val h3 = h2.reduceByKey(_ + _)        // count the number of hits for each IP

Sorting and saving follow exactly the same pattern as the word-count example above.
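
For completeness, a minimal sketch of those remaining steps (the output path hdfs://h201:9000/output_pv is just an example):

scala> val h4 = h3.map(x => (x._2, x._1)).sortByKey(false)   // sort by hit count, descending
scala> val h5 = h4.map(x => (x._2, x._1))                    // swap back to (IP, count)
scala> h5.saveAsTextFile("hdfs://h201:9000/output_pv")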

