I. Using filter on an RDD
1. Remove the header row (the first line)
scala> val collegesRdd= sc.textFile("/user/hdfs/CollegeNavigator.csv")
collegesRdd: org.apache.spark.rdd.RDD[String] = /user/hdfs/CollegeNavigator.csv MapPartitionsRDD[3] at textFile at <console>:24
scala> collegesRdd.count
res1: Long = 504
scala> val header= collegesRdd.first
header: String = "Name","Address","Website","Type","Awards offered","Campus setting","Campus housing","Student population","Undergraduate students","Graduation Rate","Transfer-Out Rate","Cohort Year *","Net Price **","Largest Program","IPEDS ID","OPE ID"
scala> val headerlessRdd= collegesRdd.filter( line=>{ line!= header } )
headerlessRdd: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[2] at filter at <console>:28
Note that removing the header with collegesRdd.filter(line => line != header) is itself already a use of filter.
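Comparing every line against the header string works, but it checks all rows just to discard one. A common alternative is to drop only the first line of the first partition. The sketch below is a minimal illustration, assuming a single input file whose header is the first line of partition 0; the helper name dropHeader is hypothetical and not part of the original notes.

```scala
// Hypothetical helper: skip the first line of partition 0, pass other partitions through.
// Assumption: the header is the first line of the first partition (single input file).
def dropHeader(partitionIndex: Int, lines: Iterator[String]): Iterator[String] =
  if (partitionIndex == 0) lines.drop(1) else lines

// In spark-shell, with collegesRdd defined as above:
// val headerlessRdd2 = collegesRdd.mapPartitionsWithIndex((i, it) => dropHeader(i, it))
```

Unlike the string comparison, this does not need the header text at all, but it would silently drop a data row if the file had no header.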
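Next, keep only the rows whose "Student population" field is longer than four characters:
val filterRdd= headerlessRdd.filter(line =>{
  // Split on "," (quote, comma, quote); index 7 is the "Student population" column
  val count=line.split("\",\"")(7)
  // Length of the population string, i.e. its number of digits if it is purely numeric
  val len=count.length()
  len>4
})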
scala> filterRdd.count
res8: Long = 121
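This returns the schools with a student population of 10,000 or more: a value longer than four characters has at least five digits, assuming the field contains only digits.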
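The length check relies on the field being plain digits. A more explicit variant parses the field as a number and compares it with a threshold. This is only a sketch: the names studentPopulation and isLargeSchool are hypothetical, the stripping of thousands separators is an assumption about the data, and >= is used because a length greater than 4 corresponds to 10,000 or more.

```scala
import scala.util.Try

// Hypothetical helper: read the "Student population" field (index 7) as an Int.
// Returns None when the row is too short or the field is not numeric.
def studentPopulation(line: String): Option[Int] = {
  val fields = line.split("\",\"")
  if (fields.length <= 7) None
  // Assumption: values may contain thousands separators such as 12,345
  else Try(fields(7).replace(",", "").trim.toInt).toOption
}

// Hypothetical predicate: true when the population is at least `threshold`.
def isLargeSchool(line: String, threshold: Int = 10000): Boolean =
  studentPopulation(line).exists(_ >= threshold)

// In spark-shell, with headerlessRdd defined as above:
// val largeRdd = headerlessRdd.filter(line => isLargeSchool(line))
```

Rows that cannot be parsed are treated as not matching instead of throwing an exception inside the Spark task.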
II. Writing a filter function
The example above can also be written as a named function that is passed to filter:
def filterfunc(line :String):Boolean ={
  val count=line.split("\",\"")(7)
  val len=count.length()
  len > 4
}
val filterRdd2=headerlessRdd.filter(filterfunc)
This gives the same result:
scala> filterRdd2.count
18/11/20 03:41:33 WARN spark.ExecutorAllocationManager: No stages are running, but numRunningTasks != 0
res10: Long = 121
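Because filter accepts any String => Boolean, the column index and the length limit do not have to be hard-coded. The sketch below builds such a predicate from parameters; longerThan and bigSchool are hypothetical names introduced only for illustration.

```scala
// Hypothetical helper: build a predicate that is true when the field at `column`
// has more than `length` characters. Rows with too few fields do not match.
def longerThan(column: Int, length: Int): String => Boolean =
  line => {
    val fields = line.split("\",\"")
    fields.length > column && fields(column).length > length
  }

// Same condition as filterfunc: column 7, more than 4 characters.
val bigSchool = longerThan(column = 7, length = 4)

// In spark-shell, with headerlessRdd defined as above:
// val filterRdd3 = headerlessRdd.filter(bigSchool)
```

The bounds check also avoids an ArrayIndexOutOfBoundsException on malformed rows, which the original one-liner would throw.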