Scala tip: word count in one line of code

Word count in one line

In Scala, a single line of code is enough to get a word-count result.

Suppose we have the following text:

Hello mr apache spark

Hello world apache spark

Hello we want study spark

Hello we want study apache

Hello apache and hadoop


In the Scala REPL, store the data in a list to simulate the input:

scala> val txt = List("Hello mr apache spark", "Hello world apache spark", "Hello we want study spark", "Hello we want study apache", "Hello apache and hadoop")
txt: List[String] = List(Hello mr apache spark, Hello world apache spark, Hello we want study spark, Hello we want study apache, Hello apache and hadoop)

What is our approach?

1. Tokenize: produce a list containing every word that appears

2. Count how many times each word appears

3. Sort

1. Tokenizing

First we map over the list, splitting each line of text on spaces. This gives us a list of arrays, where each array holds the words of one line:

scala> txt.map(_.split(" "))
res7: List[Array[String]] = List(Array(Hello, mr, apache, spark), Array(Hello, world, apache, spark), Array(Hello, we, want, study, spark), Array(Hello, we, want, study, apache), Array(Hello, apache, and, hadoop))

  

As you can see, the result is a List[Array[String]]: each Array inside the list holds individual words.

But we don't want a list of arrays; we want the list to hold the words directly, so we run:

scala> txt.map(_.split(" ")).flatten
res8: List[String] = List(Hello, mr, apache, spark, Hello, world, apache, spark, Hello, we, want, study, spark, Hello, we, want, study, apache, Hello, apache, and, hadoop)

  

Now the data has exactly the shape we want.

The two steps above, 1. map and 2. flatten, can in fact be combined into a single operation:

scala> txt.flatMap(_.split(" "))
res9: List[String] = List(Hello, mr, apache, spark, Hello, world, apache, spark, Hello, we, want, study, spark, Hello, we, want, study, apache, Hello, apache, and, hadoop)
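To convince ourselves that map + flatten and flatMap really are equivalent, here is a minimal standalone check using the same sample lines (the variable names are my own, not from the REPL session above):

```scala
// Same sample lines as in the post
val txt = List(
  "Hello mr apache spark",
  "Hello world apache spark",
  "Hello we want study spark",
  "Hello we want study apache",
  "Hello apache and hadoop"
)

// map followed by flatten produces exactly the same List[String] as flatMap
val viaMapFlatten = txt.map(_.split(" ")).flatten
val viaFlatMap    = txt.flatMap(_.split(" "))
```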

  

2. Counting

With the result above, we need to count how many times each word appears. How?

The idea, borrowed from the shape of Hadoop's mapper function, is to turn every word into a key/value pair, with the word as the key and the number 1 as the value:

scala> txt.flatMap(_.split(" ")).map((_, 1))
res10: List[(String, Int)] = List((Hello,1), (mr,1), (apache,1), (spark,1), (Hello,1), (world,1), (apache,1), (spark,1), (Hello,1), (we,1), (want,1), (study,1), (spark,1), (Hello,1), (we,1), (want,1), (study,1), (apache,1), (Hello,1), (apache,1), (and,1), (hadoop,1))

  

This gives us a list of individual (word, 1) pairs (two-element tuples).

Next, we group them by key:

scala> txt.flatMap(_.split(" ")).map((_, 1)).groupBy(_._1)
res21: scala.collection.immutable.Map[String,List[(String, Int)]] = Map(want -> List((want,1), (want,1)), world -> List((world,1)), hadoop -> List((hadoop,1)), spark -> List((spark,1), (spark,1), (spark,1)), apache -> List((apache,1), (apache,1), (apache,1), (apache,1)), Hello -> List((Hello,1), (Hello,1), (Hello,1), (Hello,1), (Hello,1)), mr -> List((mr,1)), we -> List((we,1), (we,1)), study -> List((study,1), (study,1)), and -> List((and,1)))

  

groupBy yields a Map whose keys are the words and whose values are Lists holding the (word, 1) tuples produced in the previous step.

In this Map, the size of each value List is already the number of times that word appears, so we need to convert each value into that count:

scala> txt.flatMap(_.split(" ")).map((_, 1)).groupBy(_._1).map( t => (t._1, t._2.size))
res2: scala.collection.immutable.Map[String,Int] = Map(want -> 2, world -> 1, hadoop -> 1, spark -> 3, apache -> 4, Hello -> 5, mr -> 1, we -> 2, study -> 2, and -> 1)

  

Here we call map with an anonymous function. t stands for one entry of the Map above (key: a word, value: the list of tuples).

The function builds a new pair whose first element is the word and whose second element is the size of that list.

The result is a Map with words as keys and occurrence counts as values.
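The same value-to-size conversion can also be written with a pattern match that gives the tuple parts readable names instead of t._1 and t._2; a small sketch (the val names here are my own):

```scala
val txt = List(
  "Hello mr apache spark",
  "Hello world apache spark",
  "Hello we want study spark",
  "Hello we want study apache",
  "Hello apache and hadoop"
)

// Identical to .map(t => (t._1, t._2.size)), but the pattern match
// names the word and its list of (word, 1) pairs explicitly
val counts = txt
  .flatMap(_.split(" "))
  .map((_, 1))
  .groupBy(_._1)
  .map { case (word, pairs) => (word, pairs.size) }
```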

3. Sorting

scala> txt.flatMap(_.split(" ")).map((_, 1)).groupBy(_._1).map( t => (t._1, t._2.size)).toList.sortBy(_._2)
res5: List[(String, Int)] = List((world,1), (hadoop,1), (mr,1), (and,1), (want,2), (we,2), (study,2), (spark,3), (apache,4), (Hello,5))

  

Here we sort the Map we obtained (key: word, value: count).

Since Map does not support sortBy, we convert it to a List first and then call sortBy.

That gives a result sorted in ascending order, but we want descending order, so we run one last step:

scala> txt.flatMap(_.split(" ")).map((_, 1)).groupBy(_._1).map( t => (t._1, t._2.size)).toList.sortBy(_._2).reverse
res8: List[(String, Int)] = List((Hello,5), (apache,4), (spark,3), (study,2), (we,2), (want,2), (and,1), (mr,1), (hadoop,1), (world,1))

  

Calling reverse flips the list into descending order.

And that gives us exactly the result we wanted.

So, in summary:

txt.flatMap(_.split(" ")).map((_, 1)).groupBy(_._1).map( t => (t._1, t._2.size)).toList.sortBy(_._2).reverse
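A small variant worth knowing: sortBy on the negated count sorts descending directly, which lets you drop the trailing reverse. A sketch over the same data (not from the original post):

```scala
val txt = List(
  "Hello mr apache spark",
  "Hello world apache spark",
  "Hello we want study spark",
  "Hello we want study apache",
  "Hello apache and hadoop"
)

// sortBy(-_._2) orders by count, largest first, so no .reverse is needed
val sorted = txt
  .flatMap(_.split(" "))
  .map((_, 1))
  .groupBy(_._1)
  .map(t => (t._1, t._2.size))
  .toList
  .sortBy(-_._2)
```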

  

Let's print it out to take a look:

scala> txt.flatMap(_.split(" ")).map((_, 1)).groupBy(_._1).map( t => (t._1, t._2.size)).toList.sortBy(_._2).reverse.map(println)
(Hello,5)
(apache,4)
(spark,3)
(study,2)
(we,2)
(want,2)
(and,1)
(mr,1)
(hadoop,1)
(world,1)
res10: List[Unit] = List((), (), (), (), (), (), (), (), (), ())

  

Or:

scala> for( i <- txt.flatMap(_.split(" ")).map((_, 1)).groupBy(_._1).map( t => (t._1, t._2.size)).toList.sortBy(_._2).reverse) println(i)
(Hello,5)
(apache,4)
(spark,3)
(study,2)
(we,2)
(want,2)
(and,1)
(mr,1)
(hadoop,1)
(world,1)
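As an aside: on Scala 2.13 and later, the groupBy/map pair can be collapsed into a single call to groupMapReduce (this method does not exist on 2.12 and earlier, so check your Scala version):

```scala
val txt = List(
  "Hello mr apache spark",
  "Hello world apache spark",
  "Hello we want study spark",
  "Hello we want study apache",
  "Hello apache and hadoop"
)

// groupMapReduce(key)(value)(reduce):
// group by the word itself, map each occurrence to 1, then sum the 1s
val counts = txt
  .flatMap(_.split(" "))
  .groupMapReduce(identity)(_ => 1)(_ + _)

val sorted = counts.toList.sortBy(_._2).reverse
```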

  

Reposting is welcome, and so are comments and suggestions.

If this post helped you, please give it a recommend. Thanks♪(・ω・)ノ

https://www.cnblogs.com/bigdatacaoyu


Reposted from www.cnblogs.com/bigdatacaoyu/p/10925404.html