Big Data Basics: Word Count (Word Frequency Statistics)

2018-12-13 17:44:41

Counting word frequencies in a file is the "hello world" of big data. Here is how simple the implementation is:

1 Single-machine processing on Linux

grep -Eo "\b[[:alpha:]]+\b" test_word.log | sort | uniq -c | sort -rn | head -10
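
The one-liner above can be unpacked stage by stage. The following is a minimal sketch, assuming GNU grep and coreutils are available; the function name `top_words` and the sample input are illustrative and do not come from the original post.

```shell
#!/bin/sh
# top_words FILE [N]: print the N most frequent alphabetic words in FILE.
# The function name is a hypothetical helper, not part of the original post.
top_words() {
    grep -Eo '\b[[:alpha:]]+\b' "$1" |  # one alphabetic word per line
        sort |                          # bring identical words together
        uniq -c |                       # collapse each run into "count word"
        sort -k1,1nr -k2,2 |            # highest count first, ties by word
        head -n "${2:-10}"              # keep the top N (default 10)
}

# Illustrative sample input, written to a temporary file.
sample=$(mktemp)
printf 'hello world\nhello barney\n' > "$sample"
top_words "$sample" 10
```

The second sort key is a deliberate change from the plain `sort -rn` in the one-liner: it makes the order of words with equal counts deterministic (alphabetical), which may or may not be what a given use case wants.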

2 Distributed processing with Spark (Scala)

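import org.apache.spark.{SparkConf, SparkContext}

// Master URL and app name are expected to come from spark-submit; in spark-shell, sc already exists.
val sparkConf = new SparkConf()
val sc = new SparkContext(sparkConf)
// Split lines on whitespace, drop empty tokens, count each word, sort by count descending, print the top 10.
sc.textFile("test_word.log").flatMap(_.split("\\s+")).filter(_.nonEmpty).map((_, 1)).reduceByKey(_ + _).sortBy(_._2, ascending = false).take(10).foreach(println)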

The test file test_word.log contains:

hello world
hello barney

The Linux command prints the following (the Spark job prints the same counts as (word,count) tuples):

2 hello
1 world
1 barney

Reposted from www.cnblogs.com/barneywill/p/10115301.html