Article Directory
1. What is an RDD
(1) RDD concept
1. Spark provides a core data abstraction called the Resilient Distributed Dataset (RDD). All or part of this dataset can be cached in memory and reused across multiple computations. An RDD is, in effect, a collection of data distributed across multiple nodes.
2. The "resilient" in RDD mainly means that when memory is insufficient, the data can be persisted to disk, and that RDDs provide efficient fault tolerance.
"Distributed dataset" means that the dataset is stored on different nodes, with each node holding a part of it.
(2) RDD example
Store the dataset (hello, world, scala, spark, love, spark, happy) on three nodes: node 1 stores (hello, world), node 2 stores (scala, spark, love), and node 3 stores (spark, happy). The data on the three nodes can then be computed in parallel, and together the data on the three nodes forms one RDD.
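The layout above can be sketched with plain Scala collections (a local simulation for illustration only, not actual Spark code; the three nested lists stand in for the three nodes' partitions):

```scala
// Simulate the dataset split across three nodes as three partitions.
val partitions = List(
  List("hello", "world"),         // node 1
  List("scala", "spark", "love"), // node 2
  List("spark", "happy")          // node 3
)

// Each partition can be processed independently (in parallel on a real
// cluster); here we just count the words per partition and combine.
val perPartitionCounts = partitions.map(_.size)
val total = perPartitionCounts.sum
println(s"per-partition sizes: $perPartitionCounts, total: $total")
```

On a real cluster, Spark would schedule one task per partition, which is exactly the parallelism the example describes.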
(3) Main features of RDD
RDDs are immutable, but an RDD can be transformed into a new RDD by an operation.
RDDs are partitioned: an RDD consists of many partitions, and each partition is processed by one task.
Operating on an RDD is equivalent to operating on each of its partitions.
An RDD has a series of functions that perform calculations on its partitions, called operators.
There are dependencies between RDDs, which allow computations to be pipelined and avoid storing intermediate data.
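The pipelining idea can be illustrated with ordinary Scala collections: a lazy view chains transformations without materializing an intermediate collection, loosely analogous to how Spark pipelines chained transformations (a local analogy only, not Spark code):

```scala
val data = (1 to 8).toList

// With .view, map and filter are fused into a single pass when the
// result is forced; no intermediate List is built between the steps.
val pipelined = data.view.map(_ * 2).filter(_ > 8).toList
println(pipelined) // List(10, 12, 14, 16)
```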
2. Prepare
(1) Prepare files
1. Prepare local system files
Create test.txt in the /home directory, containing words separated by spaces
2. Start the HDFS service
Execute the command: start-dfs.sh
3. Upload files to HDFS
Upload test.txt to the /park directory of HDFS
View the file content
(2) Start Spark Shell
1. Start the Spark service
Execute the command: start-all.sh
2. Start Spark Shell
View Spark's WebUI interface
3. Create RDD
(1) Create an RDD from a collection of objects
Spark can convert a collection of objects into an RDD through the parallelize() or makeRDD() methods.
1. Use the parallelize() method to create RDD
Execute the command: val rdd = sc.parallelize(List(1, 2, 3, 4, 5, 6, 7, 8))
2. Use the makeRDD() method to create RDD
Execute command: val rdd = sc.makeRDD(List(1, 2, 3, 4, 5, 6, 7, 8))
Execute the command: rdd.collect(), which collects the RDD's data for display
The parentheses of the action operator collect() can be omitted
3. Brief description: the two methods are equivalent; makeRDD() simply calls parallelize() internally
(2) Create RDD from external storage
Spark's textFile() method can read data from the local file system or from external storage systems such as HDFS and create an RDD; the only difference is the source path of the data
1. Read local system files
Execute the command: val rdd = sc.textFile("file:///home/test.txt"), to read the local file into an RDD
Execute the command: val lines = rdd.collect(), to view the content of the RDD and save it to the value lines
Execute the command: lines.foreach(println) (using the foreach traversal operator)
Execute the command: for (line <- lines) println(line) (using a for loop)
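Both traversal styles behave the same on any Scala collection; a quick local illustration (plain Scala, independent of Spark, with sample lines assumed for demonstration):

```scala
val lines = Array("hello world", "scala spark", "spark happy")

// Style 1: foreach, passing println as the function to apply
lines.foreach(println)

// Style 2: a for comprehension over the same array
for (line <- lines) println(line)
```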
2. Read files on HDFS
Execute the command: `val rdd = sc.textFile("hdfs://master:9000/park/test.txt")`
Execute the command: val lines = rdd.collect, to view the content of the RDD
To get the lines containing "spark", execute the command: val sparkLines = rdd.filter((line) => line.contains("spark")) (filter is a transformation operator)
There is a simpler way to write this; execute the command: `val sparkLines = rdd.filter(_.contains("spark"))`
Use the foreach traversal operator to display the content of sparkLines
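The two filter forms above are equivalent; the underscore is just Scala shorthand for a one-parameter function. A plain-Scala illustration of both (the sample lines are assumed, not the actual file content):

```scala
val lines = List("hello world", "scala spark love", "spark happy")

// Explicit parameter form
val sparkLines1 = lines.filter((line) => line.contains("spark"))

// Underscore shorthand; produces exactly the same result
val sparkLines2 = lines.filter(_.contains("spark"))

sparkLines2.foreach(println)
```

Because filter is a transformation, in real Spark code neither line triggers any computation until an action such as collect() or foreach() is called.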