1. Spark deduplication (use each row as the key and group by it, which collapses duplicate rows; then keep only the keys)
Original data:
2012-3-1 A
2012-3-2 B
2012-3-3 C
2012-3-2 b
Implementation source code: rdd.filter(_.trim().length() > 0).map(line => (line.trim(), "")).groupByKey().sortByKey(true).keys.foreach(println)
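The one-liner above can be sketched as a small standalone program. This is a minimal sketch, assuming Spark is on the classpath and runs in local mode. The object name `DedupSketch`, the helper names `dedupByGrouping` and `dedupByDistinct`, and the in-memory sample rows are illustrative assumptions, not part of the original. The second helper swaps in `distinct()` as an alternative to the `groupByKey` approach; it is not the original author's method.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object DedupSketch {
  // Same idea as the one-liner: use each trimmed, non-empty line as a key,
  // group by that key so duplicates collapse, then keep only the sorted keys.
  def dedupByGrouping(lines: RDD[String]): RDD[String] =
    lines
      .map(_.trim)
      .filter(_.nonEmpty)
      .map(line => (line, ""))
      .groupByKey()
      .sortByKey(ascending = true)
      .keys

  // Alternative (assumption, not from the original): distinct() gives the
  // same rows without building a group of values for every key.
  def dedupByDistinct(lines: RDD[String]): RDD[String] =
    lines
      .map(_.trim)
      .filter(_.nonEmpty)
      .distinct()
      .sortBy(line => line)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("DedupSketch")
      .master("local[*]") // local mode, assumed for illustration
      .getOrCreate()

    // Hypothetical sample with one blank row and one duplicated row.
    val sample = spark.sparkContext.parallelize(
      Seq("2012-3-1 A", "2012-3-2 B", "", "2012-3-3 C", " 2012-3-2 B "))

    // collect() brings the rows to the driver so they print in sorted order.
    dedupByGrouping(sample).collect().foreach(println)

    spark.stop()
  }
}
```

The comparison is case-sensitive, so rows that differ only in letter case are kept as separate rows. Calling `foreach(println)` directly on an RDD prints from the executors, and the order across partitions is not guaranteed; collecting first keeps the sorted order.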
2. Data cleaning (filtering)
Original data:
https://blog.csdn.net/weixin_42540606/article/details/81100882
http://192.168.20.111:8080/
https://www.cnblogs.com/redhat0019/p/8665491.html
http://192.168.20.111:50070/dfshealth.html#tab-overview
http://192.168.20.124:1082/osgiWeb/page/hgu/index.jsp