[Spark] Spark basic operations

Foreword

Transformation: creates a new RDD from an existing dataset. A short sketch using these transformations follows the list below.

(1) map(func): applies func to each element of the RDD and returns a new RDD; the returned dataset is a distributed dataset.

(2) filter(func): applies func to each element of the RDD and returns a new RDD made up of the elements for which func returns true.

(3) flatMap(func): similar to map, but each input element can produce multiple output elements, which are flattened into the result.

(4) mapPartitions(func): similar to map, but map operates on each element while mapPartitions operates on each partition.

(5) mapPartitionsWithSplit(func): similar to mapPartitions, but func is applied to one split (partition) at a time, so func also receives the partition index (in recent Spark versions this operation is called mapPartitionsWithIndex).

(6) sample(withReplacement, fraction, seed): samples the dataset with the given fraction, with or without replacement, using the given random seed.

(7) union(otherDataset): returns a new dataset containing the union of the elements of the source dataset and the given dataset.

(8) distinct([numTasks]): returns a new dataset containing the distinct elements of the source dataset.

(9) groupByKey([numTasks]): returns (K, Seq[V]) pairs, analogous to the reduce phase in Hadoop, which receives a key and a list of values.

(10) reduceByKey(func, [numTasks]): applies the given reduce func to the (K, Seq[V]) pairs produced by groupByKey, e.g. to compute a sum or an average.

(11) sortByKey([ascending], [numTasks]): sorts by key in ascending or descending order; ascending is a boolean.
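A minimal sketch of these transformations in the spark-shell; the sample data and the RDD names (lines, words, pairs, and so on) are purely illustrative and not taken from the original exercise:

// illustrative in-memory dataset
val lines = sc.parallelize(Seq("a b", "c d e", "a b"))

val words     = lines.flatMap(line => line.split(" "))   // (3) one line can yield many words
val lengths   = words.map(word => word.length)           // (1) one element in, one element out
val longWords = words.filter(word => word.length > 1)    // (2) keep elements where func returns true
val unique    = words.distinct()                         // (8) remove duplicates
val sampled   = words.sample(false, 0.5, 42)             // (6) fraction 0.5, seed 42, without replacement
val combined  = words.union(sampled)                     // (7) elements of both RDDs

// key-value transformations
val pairs   = words.map(word => (word, 1))
val grouped = pairs.groupByKey()                         // (9) (K, Seq[V])
val counts  = pairs.reduceByKey(_ + _)                   // (10) e.g. a word count
val sorted  = counts.sortByKey(true)                     // (11) ascending by key

// (4) and (5): per-partition variants
val perPartitionCounts = words.mapPartitions(iter => Iterator(iter.size))
val withIndex = words.mapPartitionsWithIndex((i, iter) => iter.map(w => (i, w)))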

Action:

An action runs a computation on the RDD and either returns a result value or writes the result to external storage. A short sketch using these actions follows the list below.

(1) reduce(func): aggregates the elements of the dataset using func, which takes two arguments and returns one value; func must be commutative and associative.

(2) collect(): returns all elements of the dataset packed into an array; generally used when the result is small enough, e.g. after a filter.

(3) count(): returns the number of elements in the dataset.

(4) first(): returns the first element in the dataset.

(5) take(n): returns the first n elements of the dataset.

(6) takeSample(withReplacement, num, seed): returns a random sample of num elements of the dataset, using the random seed seed.

(7) saveAsTextFile(path): writes the dataset as a text file to the local file system, HDFS, or any other file system supported by Hadoop; Spark converts each record into one line of text and writes it to the file.

(8) saveAsSequenceFile(path): can only be used on key-value RDDs; generates a SequenceFile and writes it to the local file system or to Hadoop/HDFS.

(9) countByKey(): acts on a (K, V) RDD and returns a map from each key to its number of occurrences.

(10) foreach(func): applies func to each element of the dataset.
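And a matching sketch of the actions; the nums and kv RDDs and the output paths are illustrative only, not part of the original exercise:

val nums = sc.parallelize(1 to 10)

nums.reduce(_ + _)                 // (1) commutative, associative func -> 55
nums.collect()                     // (2) Array(1, 2, ..., 10)
nums.count()                       // (3) 10
nums.first()                       // (4) 1
nums.take(3)                       // (5) Array(1, 2, 3)
nums.takeSample(false, 3, 42)      // (6) 3 random elements, seed 42

// (7) each record becomes one line of text (path made up for this sketch)
nums.saveAsTextFile("hdfs://localhost:9000/myspark3/output/nums")

val kv = nums.map(n => (n % 2, n))
kv.saveAsSequenceFile("hdfs://localhost:9000/myspark3/output/kv")   // (8) key-value RDDs only
kv.countByKey()                    // (9) Map(0 -> 5, 1 -> 5)
nums.foreach(n => println(n))      // (10) runs func on every element (output appears on the executors)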

Operations

// file is an RDD of text lines, e.g. val file = sc.textFile(...)
file.filter(line => line.length > 10).first()   // first line longer than 10 characters
file.union(file).count()                        // union with itself, then count
file.sample(true, 0.5).count                    // count a 50% sample taken with replacement

In the Spark shell, load the data and turn it into an RDD:
val rdd = sc.textFile("hdfs://localhost:9000/myspark3/wordcount/buyer_favorite")
Compute the statistics on the RDD and print the result:
rdd.map(line => (line.split('\t')(1).toInt, line.split('\t')(0))).sortByKey(true).collect
Map rdd1 and rdd2 (two RDDs loaded in the same way) to pull out the two key columns from each:
val rdd11 = rdd1.map(line => (line.split('\t')(0), line.split('\t')(2)))
val rdd22 = rdd2.map(line => (line.split('\t')(1), line.split('\t')(2)))
Join the data in rdd11 and rdd22 on their keys to get the final result:
val rddresult = rdd11 join rdd22
Use the collect() method to trigger the job and view the result:
rddresult.collect
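
If you want to inspect or persist the joined result, the actions listed above apply directly; the output path here is just an example, not one from the original exercise:

rddresult.collect().foreach(println)                                  // print the joined records on the driver
rddresult.saveAsTextFile("hdfs://localhost:9000/myspark3/join_result") // or write them out as text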

Origin blog.csdn.net/weixin_44039347/article/details/91598465