RDD Operations (3)

I. How do you create an RDD?

1. Parallelizing an existing collection in your driver program

Example:
scala> val data=Array(1,2,3,4,5)
data: Array[Int] = Array(1, 2, 3, 4, 5)

scala> val dts=sc.parallelize(data)
dts: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:26

scala> dts.collect
res0: Array[Int] = Array(1, 2, 3, 4, 5)

Question: where do the 2 tasks for this job come from?

Answer: Spark will run one task for each partition of the cluster. One partition means one task, so the 2 is determined by the number of partitions.

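For example, when reading from HDFS, each block usually maps to one partition and therefore one task, so a dataset made up of many small files produces many tasks.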
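The link between partitions and tasks can be checked directly. The sketch below assumes a local master with 2 threads (`local[2]`) and an illustrative app name; inside spark-shell, `SparkContext.getOrCreate` simply returns the existing context. It is one way to inspect partition counts, not the only one.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: local master with 2 threads; the app name is illustrative.
val conf = new SparkConf().setAppName("partition-count-sketch").setMaster("local[2]")
val sc = SparkContext.getOrCreate(conf)

val data = Array(1, 2, 3, 4, 5)

// No explicit partition count: Spark falls back to sc.defaultParallelism.
val byDefault = sc.parallelize(data)
println(s"default partitions: ${byDefault.getNumPartitions}")

// The second argument of parallelize sets the partition count explicitly.
val four = sc.parallelize(data, 4)
println(s"explicit partitions: ${four.getNumPartitions}")

// glom() turns each partition into an array, which shows how elements were spread.
val layout = four.glom().collect()
println(layout.map(_.mkString("[", ",", "]")).mkString(" "))
```

An action such as `collect` on `four` would then be scheduled as one task per partition.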

Question: why are 2-4 partitions per CPU usually recommended?

Typically you want 2-4 partitions for each CPU in your cluster

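Answer: with exactly one task per core, cores often sit idle, because a core whose task finishes early has nothing left to run. With several partitions per core, the next task starts as soon as the previous one finishes, which avoids wasting resources.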
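A minimal sketch of sizing partitions relative to cores follows. It assumes that `sc.defaultParallelism` is a reasonable stand-in for the number of available cores (true in local mode; on a cluster it can be overridden by `spark.default.parallelism`), and the factor of 3 is a hypothetical pick from the 2-4 range.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: local master with 2 threads; the app name is illustrative.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("partitions-per-core-sketch").setMaster("local[2]"))

// Assumption: defaultParallelism is used here as a proxy for the core count.
val cores = sc.defaultParallelism
val partitionsPerCore = 3 // hypothetical choice inside the suggested 2-4 range
val numPartitions = cores * partitionsPerCore

// With more partitions than cores, a core that finishes early picks up the next task.
val rdd = sc.parallelize(1 to 1000, numPartitions)
val total = rdd.map(_.toLong).reduce(_ + _)
println(s"partitions = ${rdd.getNumPartitions}, sum = $total")
```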

2. Referencing a dataset in an external storage system

such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.

(1) Text file RDDs can be created using SparkContext’s textFile method.

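When you use this method on a distributed Spark cluster, the input source must be reachable from every worker. In other words, do not run in standalone cluster mode while pointing textFile at a file that exists only on one machine's local disk; the job will fail because the data source cannot be found.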

val dts=sc.textFile("hdfs://192.168.137.251:8020/ruozedata.txt")
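To try textFile without an HDFS cluster, the sketch below writes a small temporary file and reads it back through a file: URI. The file contents, app name, and `local[2]` master are assumptions for illustration; this only works as written in local mode, because the temporary file exists on a single machine.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: local master, so the driver and executors share one filesystem.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("textfile-sketch").setMaster("local[2]"))

// Hypothetical sample file with three lines.
val path = Files.createTempFile("textfile-sketch", ".txt")
Files.write(path, "ruoze\nlaoj\nhello\n".getBytes(StandardCharsets.UTF_8))

// textFile returns an RDD[String] with one element per line.
val lines = sc.textFile(path.toUri.toString)
val lineCount = lines.count()
val totalChars = lines.map(_.length).reduce(_ + _)
println(s"lines = $lineCount, characters = $totalChars")
```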

II. Some notes on reading files with Spark

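(1) If you use a path on the local filesystem (a path starting with file), the file must also be accessible at the same path on the worker nodes. Either copy the file to all workers or use a network-mounted shared file system.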

(2) textFile can read every file in a directory, can read compressed files, and accepts wildcards; for example, /a/*.txt reads all the .txt files under /a.

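(3) The textFile method takes an optional second argument that controls the number of partitions. By default, Spark creates one partition for each block of the file on HDFS. You can ask for more partitions by passing a larger value, but you cannot have fewer partitions than blocks.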
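Notes (2) and (3) can be tried out locally with the sketch below. The temporary directory, file names, and the requested count of 8 are all hypothetical, and the "file://" prefix assumes a Unix-like path. The second argument of textFile is a minimum, so the resulting partition count may be higher than requested.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: local master, so temporary files are visible to every task.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("read-notes-sketch").setMaster("local[2]"))

// Hypothetical sample directory: two .txt files and one .log file.
val dir = Files.createTempDirectory("read-notes-sketch")
def write(name: String, body: String): Unit = {
  Files.write(dir.resolve(name), body.getBytes(StandardCharsets.UTF_8))
  ()
}
write("a.txt", "1\n2\n")
write("b.txt", "3\n")
write("c.log", "x\n")

val base = "file://" + dir.toAbsolutePath

// Note (2): a directory path reads every file; a wildcard narrows the selection.
val allLines = sc.textFile(base).count()
val txtLines = sc.textFile(base + "/*.txt").count()
println(s"all files: $allLines lines, *.txt only: $txtLines lines")

// Note (3): request at least 8 partitions for a single, larger file.
val bigFile = Files.createTempFile("read-notes-big", ".txt")
Files.write(bigFile, (1 to 1000).mkString("\n").getBytes(StandardCharsets.UTF_8))
val withHint = sc.textFile(bigFile.toUri.toString, 8)
val hintedPartitions = withHint.getNumPartitions
val bigLines = withHint.count()
println(s"partitions = $hintedPartitions, lines = $bigLines")
```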

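SparkContext.wholeTextFiles reads a directory containing multiple small text files and returns each of them as a (filename, content) pair, whereas textFile returns one record per line:

scala> val rdd=sc.wholeTextFiles("hdfs://192.168.137.251:9000/data")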
rdd: org.apache.spark.rdd.RDD[(String, String)] = hdfs://192.168.137.251:9000/data MapPartitionsRDD[5] at wholeTextFiles at <console>:24

scala> rdd.collect
res2: Array[(String, String)] =
Array((hdfs://192.168.137.251:9000/data/1,"
ruoze
laoj
hello
ptthon
zidong
youmuyou
"), (hdfs://192.168.137.251:9000/data/2,"
ruoze
laoj
hello
ptthon
zidong

(4)