Spark Basics Study Notes: Creating RDDs

1. What is RDD

(1) RDD concept

1. Spark's core data abstraction is the Resilient Distributed Dataset (RDD). All or part of an RDD can be cached in memory and reused across multiple computations. An RDD is essentially a collection of data distributed across multiple nodes.
2. The "resilient" in RDD mainly means that when memory is insufficient, the data can be persisted to disk, and that RDDs provide efficient fault tolerance.
3. "Distributed dataset" means the dataset is stored across different nodes, with each node holding a part of it.

(2) RDD example

Store the dataset (hello, world, scala, spark, love, spark, happy) on three nodes: node 1 stores (hello, world), node 2 stores (scala, spark, love), and node 3 stores (spark, happy). The data on the three nodes can then be computed in parallel, and together the data on the three nodes forms one RDD.


(3) Main features of RDD

- RDDs are immutable, but an RDD can be transformed into a new RDD for further operations.
- RDDs are partitionable: an RDD consists of many partitions, and each partition is executed by one Task.
- Operating on an RDD is equivalent to operating on each of its partitions.
- An RDD has a series of functions that perform computations on its partitions, called operators.
- There are dependencies between RDDs, which allows computations to be pipelined and avoids storing intermediate data.
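The features above can be seen in a short spark-shell session. This is only a sketch: it assumes the SparkContext `sc` that spark-shell provides automatically.

```scala
// A sketch of the RDD features above, run in spark-shell.
val rdd1 = sc.parallelize(List(1, 2, 3, 4), 2) // partitionable: 2 partitions
println(rdd1.getNumPartitions)                 // each partition runs as one Task

// Immutable: map does not change rdd1, it returns a NEW RDD.
val rdd2 = rdd1.map(_ * 10)

// rdd2 depends on rdd1; the map is lazy and pipelined, so no intermediate
// data is stored until the action collect() triggers the computation.
rdd2.collect() // Array(10, 20, 30, 40)
```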

2. Prepare

(1) Prepare files

1. Prepare local system files

Create test.txt in the /home directory


The file contains words separated by spaces.

2. Start the HDFS service

Execute the command: `start-dfs.sh`

3. Upload files to HDFS

Upload test.txt to the /park directory of HDFS

View the file content.

(2) Start Spark Shell

1. Start the Spark service

Execute the command: start-all.sh

2. Start Spark Shell

View the Web UI of Spark Shell.

3. Create RDD

(1) Create an RDD from a collection of objects

Spark can convert a collection of objects into an RDD using the parallelize() or makeRDD() method.

1. Use the parallelize() method to create RDD

Execute the command: `val rdd = sc.parallelize(List(1, 2, 3, 4, 5, 6, 7, 8))`

2. Use the makeRDD() method to create RDD

Execute command: val rdd = sc.makeRDD(List(1, 2, 3, 4, 5, 6, 7, 8))
Execute the command: `rdd.collect()` to collect the RDD's data and display it.


The parentheses of the action operator collect() can be omitted.

3. Brief description

Both methods create an RDD from a local collection; in Spark's source code, makeRDD() is implemented as a call to parallelize(), so the two are interchangeable here.
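A short spark-shell sketch comparing the two creation methods (`sc` is the SparkContext that spark-shell provides):

```scala
// Create the same RDD with parallelize() and with makeRDD().
val rdd1 = sc.parallelize(List(1, 2, 3, 4, 5, 6, 7, 8))
val rdd2 = sc.makeRDD(List(1, 2, 3, 4, 5, 6, 7, 8))

rdd1.collect() // Array(1, 2, 3, 4, 5, 6, 7, 8)
rdd2.collect   // same result; parentheses on the action operator may be omitted
```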

(2) Create RDD from external storage

Spark's textFile() method can read data from the local file system or from external storage systems such as HDFS and create an RDD; the only difference is the source path of the data.

1. Read local system files

Execute the command: `val lines = rdd.collect()` to collect the RDD's content into the value `lines`.
Execute the command: `lines.foreach(println)` (using the foreach traversal operator).

Execute the command: `for (line <- lines) println(line)`, which achieves the same traversal with a for loop.
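The local-file steps above can be sketched as one spark-shell session. The `file://` path is an assumption based on the test.txt created in the /home directory earlier; adjust it to your environment.

```scala
// Read a local file into an RDD, collect it, and traverse it two ways.
val rdd = sc.textFile("file:///home/test.txt")
val lines = rdd.collect() // bring the lines back to the driver as an Array[String]

lines.foreach(println)            // foreach traversal operator
for (line <- lines) println(line) // equivalent for-loop traversal
```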

2. Read files on HDFS

Execute the command: `val rdd = sc.textFile("hdfs://master:9000/park/test.txt")`
Execute the command: `val lines = rdd.collect` to view the content of the RDD.
To get the lines containing "spark", execute the command: `val sparkLines = rdd.filter((line) => line.contains("spark"))` (filter is a transformation operator).

There is a simpler way to write it. Execute the command: `val sparkLines = rdd.filter(_.contains("spark"))`
Use the foreach traversal operator to display the content of sparkLines.

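Putting the HDFS steps together as one sketch (the host and port follow the `hdfs://master:9000` address used above):

```scala
// Read the HDFS file, keep only the lines containing "spark", and print them.
val rdd = sc.textFile("hdfs://master:9000/park/test.txt")
val sparkLines = rdd.filter(_.contains("spark")) // transformation: lazy, no work yet

// collect() is the action that triggers the computation;
// printing then happens locally on the driver.
sparkLines.collect().foreach(println)
```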


Origin blog.csdn.net/py20010218/article/details/125357264