Printing the contents of an RDD to the logs [one post is all you need]

Printing elements of an RDD

Another common idiom is attempting to print out the elements of an RDD using rdd.foreach(println) or rdd.map(println). On a single machine, this will generate the expected output and print all the RDD’s elements. However, in cluster mode, the output to stdout being called by the executors is now writing to the executor’s stdout instead, not the one on the driver, so stdout on the driver won’t show these! To print all elements on the driver, one can use the collect() method to first bring the RDD to the driver node thus: rdd.collect().foreach(println). This can cause the driver to run out of memory, though, because collect() fetches the entire RDD to a single machine; if you only need to print a few elements of the RDD, a safer approach is to use the take(): rdd.take(100).foreach(println).

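The passage above on RDD output is quoted from the official Spark documentation.
In other words: on a single machine, rdd.foreach(println) or rdd.map(println) produces the expected output and prints every element of the RDD.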

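In cluster mode, however, the println calls run on the executors, so the output is written to each executor's stdout rather than to the driver's, and nothing shows up on the driver. To print on the driver, first bring the RDD to the driver node with collect(): rdd.collect().foreach(println). Because collect() pulls the entire RDD onto the driver, this can cause the driver to run out of memory.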

If you only need to print a limited number of elements, a safer approach is take(): rdd.take(100).foreach(println)
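As a minimal sketch of the take() approach, the program below builds a small RDD and prints only its first few elements on the driver. The object name PrintRddSample, the helper firstN, the local[*] master, and the sample data are all assumptions made for illustration; they do not come from the original post.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object PrintRddSample {
  // Hypothetical helper: take(n) returns at most n elements to the driver,
  // so driver memory use stays bounded regardless of the RDD's size.
  def firstN(rdd: RDD[Int], n: Int): Array[Int] = rdd.take(n)

  def main(args: Array[String]): Unit = {
    // local[*] is assumed here so the sketch runs on one machine;
    // on a real cluster the master is normally supplied by spark-submit.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("PrintRddSample")
      .getOrCreate()

    val rdd = spark.sparkContext.parallelize(1 to 1000)

    // The array already lives on the driver, so println writes to the driver's stdout.
    firstN(rdd, 5).foreach(println)

    spark.stop()
  }
}
```

Replacing take(n) with collect() would print every element, at the cost of holding the whole RDD in driver memory.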

Example: the following prints the RDD's contents as expected on one machine (the driver):

// collect() brings every row to the driver, so the println calls below run there.
rdd.collect().foreach(row => {
    println("row.length===:" + row.length)
    for (i <- 0 until row.length)
        println("===<" + i + ">===:" + row.get(i))
})

With the following, the output may not be visible on the driver:

// foreach runs on the executors; in cluster mode the output goes to each executor's stdout.
rdd.foreach(row => {
    println("row.length===:" + row.length)
    for (i <- 0 until row.length)
        println("===<" + i + ">===:" + row.get(i))
})
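One way to keep the per-row formatting while still seeing the output on the driver is to build the strings on the executors and bring back only a bounded number of them. The sketch below assumes the RDD holds org.apache.spark.sql.Row values, as the row.length and row.get(i) calls above suggest; RowPreview, describe, and preview are hypothetical names introduced for this example.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SparkSession}

object RowPreview {
  // Hypothetical helper: renders one Row in the same layout as the loops above.
  def describe(row: Row): String = {
    val fields = (0 until row.length).map(i => s"===<$i>===:${row.get(i)}")
    (s"row.length===:${row.length}" +: fields).mkString("\n")
  }

  // map(describe) runs on the executors; take(limit) returns at most
  // `limit` formatted strings to the driver.
  def preview(rdd: RDD[Row], limit: Int): Array[String] =
    rdd.map(describe).take(limit)

  def main(args: Array[String]): Unit = {
    // local[*] and the sample rows are assumptions for this sketch.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("RowPreview")
      .getOrCreate()

    val rdd = spark.sparkContext.parallelize(Seq(Row(1, "a"), Row(2, "b"), Row(3, "c")))

    // Printing happens on the driver, so the lines appear in the driver's log.
    preview(rdd, 2).foreach(println)

    spark.stop()
  }
}
```

Compared with rdd.collect().foreach(...), this keeps driver memory bounded by limit rather than by the size of the RDD.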


Reposted from blog.csdn.net/sjmz30071360/article/details/88787971