Spark RDD transformations & actions

Introduction:

1. A transformation produces a new RDD. There are many ways to do this, for example generating an RDD from a data source, or deriving a new RDD from an existing RDD.

2. An action produces a value or a result (for example, caching an RDD directly into memory).

All transformations are lazy: submitting a transformation by itself performs no computation; the computation is triggered only when an action is submitted.
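
A minimal sketch of this lazy behaviour, assuming sc is an existing SparkContext (as in spark-shell):

    val nums = sc.parallelize(1 to 10)   // defines an RDD, nothing runs yet
    val doubled = nums.map(_ * 2)        // transformation: only the lineage is recorded
    val total = doubled.reduce(_ + _)    // action: this line triggers the actual computation
    println(total)                       // 110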


Transformation operations:

map(func): applies func to each element of the RDD on which map is called and returns a new RDD; the result is again a distributed dataset.
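
A minimal map sketch (assuming sc is a SparkContext, as above):

    val rdd = sc.parallelize(Seq(1, 2, 3))
    val squared = rdd.map(x => x * x)    // new RDD containing 1, 4, 9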

filter(func): applies func to each element of the RDD on which filter is called and returns a new RDD containing the elements for which func is true.
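
A minimal filter sketch:

    val rdd = sc.parallelize(1 to 6)
    val evens = rdd.filter(_ % 2 == 0)   // new RDD containing 2, 4, 6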

flatMap(func): similar to map, but each input element may produce multiple output elements (func returns a sequence, and the results are flattened).
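
A minimal flatMap sketch, using word splitting as the example:

    val lines = sc.parallelize(Seq("hello world", "hello spark"))
    val words = lines.flatMap(_.split(" "))   // "hello", "world", "hello", "spark"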

mapPartitions(func): much like map, but map works element by element while mapPartitions works on a whole partition at a time.
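
A minimal mapPartitions sketch; the function receives an Iterator over one partition and must return an Iterator:

    val rdd = sc.parallelize(1 to 8, 4)                           // 4 partitions
    val partSums = rdd.mapPartitions(iter => Iterator(iter.sum))  // one sum per partition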

mapPartitionsWithSplit(func): similar to mapPartitions, but func runs on one split (partition) at a time, so it also receives the partition index; in later Spark versions this operation is called mapPartitionsWithIndex.
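
A minimal sketch using the current name mapPartitionsWithIndex; the function receives the partition index plus an Iterator over that partition:

    val rdd = sc.parallelize(1 to 8, 4)
    val tagged = rdd.mapPartitionsWithIndex((idx, iter) => iter.map(x => (idx, x)))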

sample(withReplacement, fraction, seed): samples the dataset, with or without replacement, keeping roughly the given fraction of elements and using the given random seed.
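
A minimal sample sketch:

    val rdd = sc.parallelize(1 to 100)
    // keep roughly 10% of the elements, without replacement, with seed 42
    val sampled = rdd.sample(withReplacement = false, fraction = 0.1, seed = 42)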

union(otherDataset): returns a new dataset containing the union of the elements of the source dataset and the given dataset.

distinct([numTasks]): returns a new dataset containing the distinct elements of the source dataset.
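
A combined sketch for union and distinct:

    val a = sc.parallelize(Seq(1, 2, 3))
    val b = sc.parallelize(Seq(3, 4, 5))
    val u = a.union(b)       // 1, 2, 3, 3, 4, 5 (duplicates are kept)
    val d = u.distinct()     // 1, 2, 3, 4, 5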

groupByKey([numTasks]): returns a dataset of (K, Seq[V]) pairs, i.e. the key / value-list shape that the reduce function receives in Hadoop.

reduceByKey(func, [numTasks]): conceptually a groupByKey followed by reducing each Seq[V] with the given reduce func, producing (K, V) pairs; useful for sums, averages and the like.
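
A combined sketch for groupByKey and reduceByKey on the same pair RDD (in spark-shell the key-value operations are available through the usual implicits):

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
    val grouped = pairs.groupByKey()         // ("a", [1, 3]), ("b", [2])
    val summed  = pairs.reduceByKey(_ + _)   // ("a", 4), ("b", 2)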

sortByKey([ascending], [numTasks]): sorts by key, in ascending or descending order; ascending is a Boolean.
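
A minimal sortByKey sketch:

    val pairs = sc.parallelize(Seq(("b", 2), ("a", 1), ("c", 3)))
    val asc  = pairs.sortByKey()                    // ("a",1), ("b",2), ("c",3)
    val desc = pairs.sortByKey(ascending = false)   // ("c",3), ("b",2), ("a",1)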

join(otherDataset, [numTasks]): given two key-value datasets of (K, V) and (K, W) pairs, returns a dataset of (K, (V, W)) pairs; numTasks is the number of concurrent tasks.

cogroup(otherDataset, [numTasks]): given two key-value datasets of (K, V) and (K, W) pairs, returns a dataset of (K, Seq[V], Seq[W]) tuples; numTasks is the number of concurrent tasks.
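
A combined sketch for join and cogroup:

    val left  = sc.parallelize(Seq(("a", 1), ("b", 2)))
    val right = sc.parallelize(Seq(("a", "x"), ("a", "y")))
    val joined    = left.join(right)      // ("a", (1, "x")), ("a", (1, "y"))
    val cogrouped = left.cogroup(right)   // ("a", ([1], ["x", "y"])), ("b", ([2], []))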

cartesian(otherDataset): Cartesian product; a dataset of m elements combined with one of n elements yields m*n pairs.
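
A minimal cartesian sketch:

    val m = sc.parallelize(Seq(1, 2))
    val n = sc.parallelize(Seq("a", "b", "c"))
    val prod = m.cartesian(n)   // 2 * 3 = 6 pairs: (1,"a"), (1,"b"), ..., (2,"c")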


Action operations:

reduce(func): put plainly, aggregation; the supplied function takes two arguments and returns a single value, and it must be commutative and associative so the result is correct when computed in parallel.
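
A minimal reduce sketch; addition satisfies both laws:

    val nums = sc.parallelize(1 to 100)
    val sum = nums.reduce(_ + _)   // 5050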

collect(): typically used after a filter or another operation that leaves a result small enough; returns the whole dataset to the driver as an array.
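
A minimal collect sketch:

    val small = sc.parallelize(1 to 1000).filter(_ % 100 == 0)
    val arr: Array[Int] = small.collect()   // brings the 10 surviving elements back to the driver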

count(): returns the number of elements in the dataset.

first(): returns the first element of the dataset.

take(n): returns the first n elements; the result is returned to the driver program.
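
A combined sketch for count, first and take:

    val rdd = sc.parallelize(Seq(5, 1, 4, 2, 3))
    rdd.count()   // 5
    rdd.first()   // 5 (the first element)
    rdd.take(3)   // Array(5, 1, 4), returned to the driver program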

takeSample(withReplacement, num, seed): returns a random sample of num elements from the dataset, using the random seed seed.
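
A minimal takeSample sketch:

    val rdd = sc.parallelize(1 to 100)
    // returns exactly 5 elements to the driver, sampled without replacement, seed 42
    val picked = rdd.takeSample(withReplacement = false, num = 5, seed = 42)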

saveAsTextFile(path): writes the dataset to a text file on the local filesystem, HDFS, or any filesystem supported by HDFS; Spark converts each record to one line of text and writes it to the file.

saveAsSequenceFile(path): only usable on key-value pair RDDs; writes them as a Hadoop SequenceFile to the local or Hadoop filesystem.
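
A combined sketch for the two save actions; the output paths below are placeholders and may be local paths or hdfs:// URIs:

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2)))
    pairs.map { case (k, v) => s"$k,$v" }.saveAsTextFile("/tmp/rdd-text-out")   // one line per record
    pairs.saveAsSequenceFile("/tmp/rdd-seq-out")                                // key-value pairs only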

countByKey(): only available on key-value RDDs; returns a map from each key to the number of elements with that key.
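
A minimal countByKey sketch:

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
    val counts = pairs.countByKey()   // Map(a -> 2, b -> 1), returned to the driver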

foreach(func): applies func to each element of the dataset.
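
A minimal foreach sketch; note that func runs on the executors, not on the driver:

    val rdd = sc.parallelize(1 to 3)
    rdd.foreach(x => println(x))   // printed in the executor logs; typically used for side effects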
