Spark Learning: Getting Started with RDDs

Hello World

Upload a file to HDFS

$ hadoop fs -ls /
$ hadoop fs -mkdir /input
$ hadoop fs -put /usr/local/spark-2.4.5/README.md /input
2020-04-10 15:22:06,562 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
$ hadoop fs -ls /input
Found 1 items
-rw-r--r--   2 root supergroup       3756 2020-04-10 15:22 /input/README.md

Read the file in pyspark

>>> lines = sc.textFile("/input/README.md")
>>> lines.persist()
/input/README.md MapPartitionsRDD[13] at textFile at NativeMethodAccessorImpl.java:0
>>> lines.count()
104                                                                             
>>> lines.first()
u'# Apache Spark'

Summary

  1. Create some input RDDs from external data.
  2. Transform them to define new RDDs using transformations like filter() .
  3. Ask Spark to persist() any intermediate RDDs that will need to be reused.
  4. Launch actions such as count() and first() to kick off a parallel computation, which is then optimized and executed by Spark.
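Putting the four steps together: a minimal sketch of the same workflow, assuming the /input/README.md uploaded above and the sc SparkContext provided by the pyspark shell (the "Spark" filter is just an illustrative choice):

lines = sc.textFile("/input/README.md")            # 1. create an input RDD from external data
sparkLines = lines.filter(lambda l: "Spark" in l)  # 2. a transformation defines a new RDD
sparkLines.persist()                               # 3. persist the RDD that will be reused
print(sparkLines.count())                          # 4. actions kick off the actual computation
print(sparkLines.first())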

I. Creating RDDs

There are two ways:

  • From an existing collection in the driver program, with sc.parallelize()
lines = sc.parallelize(["pandas", "i like pandas"])
  • From an external file
lines = sc.textFile("/path/to/README.md")

II. RDD Operations

There are two kinds of operations: transformations and actions.

  • Transformation: RDD => RDD; common ones include map() and filter()
  • Action: RDD => value; common ones include count() and first()

1. Transformations

filter

inputRDD = sc.textFile("log.txt")
errorsRDD = inputRDD.filter(lambda x: "error" in x)
warningsRDD = inputRDD.filter(lambda x: "warning" in x)
badLinesRDD = errorsRDD.union(warningsRDD)
# The following are actions
print("Input had %d concerning lines" % badLinesRDD.count())
print("Here are 10 examples:")
for line in badLinesRDD.take(10):
    print(line)

Creating and transforming an RDD does not actually execute anything; those steps only define a computation graph, somewhat like TensorFlow. The data is only traversed when an action runs.
This example obviously has a more concise way to write it:

badLinesRDD = inputRDD.filter(lambda x: "error" in x or "warning" in x)
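Because transformations are lazy, even reading the file is deferred. A quick way to see this (a sketch; the path below is deliberately nonexistent):

missing = sc.textFile("/no/such/path")  # returns an RDD immediately, no error yet
# The failure only appears when an action forces the data to be read:
# missing.count()  # raises an error because the input path does not exist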

map

nums = sc.parallelize([1, 2, 3, 4])
squared = nums.map(lambda x: x * x)

# Action: collect() brings every element back to the driver
for num in squared.collect():
    print("%i " % num)

Note: collect() is an action that returns all of the RDD's elements to the driver. Never use it when the dataset is larger than a single machine's memory!
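When the data may be large, take(n) or first() are the safer ways to peek at a result, since only the requested elements are sent back to the driver. A small sketch:

bigRDD = sc.parallelize(range(1000000))
print(bigRDD.take(5))   # only 5 elements come back to the driver
# bigRDD.collect()      # would pull all 1,000,000 elements into driver memory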

flatMap

>>> lines = sc.parallelize(["hello world", "hi"])
>>> words = lines.flatMap(lambda line: line.split(" "))
>>> words.first()
'hello'
>>> words.collect()
['hello', 'world', 'hi']

The difference from map(): map() returns exactly one output element for each input element (so mapping split() would give an RDD of lists), while flatMap() flattens the results into a single RDD of words.
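A minimal sketch of the contrast, reusing the lines RDD from above:

lines = sc.parallelize(["hello world", "hi"])

print(lines.map(lambda line: line.split(" ")).collect())
# [['hello', 'world'], ['hi']]   -> one list per input element

print(lines.flatMap(lambda line: line.split(" ")).collect())
# ['hello', 'world', 'hi']       -> results flattened into one RDD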

2. Actions

  • count(): the number of elements in the RDD
  • first(): the first element
  • collect(): every element (see the caution above)
  • take(10): the first 10 elements
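A small sketch exercising each of these actions on a toy RDD:

nums = sc.parallelize([1, 2, 3, 4, 5])

print(nums.count())    # 5
print(nums.first())    # 1
print(nums.collect())  # [1, 2, 3, 4, 5]
print(nums.take(3))    # [1, 2, 3]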

III. Passing Custom Functions

Via a lambda expression

word = rdd.filter(lambda s: "error" in s)

Via a named function

def containsError(s):
	return "error" in s

word = rdd.filter(containsError)

Note

Do not pass a class's member function (or anything that references self) to filter(); otherwise the entire object gets serialized and shipped to every worker node, causing unnecessary overhead. The sketch below illustrates this.
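A common illustration of the pitfall (a sketch; the SearchFunctions class and query field are hypothetical names): referencing self inside the lambda drags the whole object into the closure, while copying the field into a local variable first ships only the string.

class SearchFunctions(object):
    def __init__(self, query):
        self.query = query

    def getMatchesMemberReference(self, rdd):
        # Problem: the lambda references self, so the entire
        # SearchFunctions object is serialized and sent to every node
        return rdd.filter(lambda x: self.query in x)

    def getMatchesNoReference(self, rdd):
        # Safe: copy the field into a local variable; only the string is shipped
        query = self.query
        return rdd.filter(lambda x: query in x)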

Source: blog.csdn.net/itnerd/article/details/105444627