【SparkCore-RDD】

What is RDD

RDD (Resilient Distributed Dataset) is the most basic data abstraction in Spark.
In the code it is an abstract class that represents a resilient, immutable, partitionable collection of elements that can be computed in parallel.
RDDs are lazily evaluated: transformations only build up the lineage, and nothing is computed until an action is triggered.
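
As a minimal sketch of this lazy behaviour (the class name LazyDemo is just an illustrative placeholder), the map call below only records the lineage; nothing is executed until the collect() action runs:

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class LazyDemo {

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("LazyDemo").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4));
        // map is a transformation: it only records the lineage, nothing runs yet
        JavaRDD<Integer> doubled = numbers.map(x -> x * 2);
        // collect is an action: only now is a job actually submitted and executed
        doubled.collect().forEach(System.out::println);

        sc.stop();
    }
}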

Five characteristics of RDD

An RDD is described by five main properties (listed in the comments of the RDD source code):
1. A list of partitions.
2. A function for computing each partition.
3. A list of dependencies on other RDDs.
4. Optionally, a Partitioner for key-value RDDs (e.g. a hash partitioner).
5. Optionally, a list of preferred locations for computing each partition (e.g. the HDFS block locations).

RDD programming

RDD creation

There are three ways to create RDDs in Spark: creating RDDs from collections, creating RDDs from external storage, and creating them from other RDDs.

First, let's set up a Maven project.
1. Create a Maven project and add the spark-core dependency to the pom file:

<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.12</artifactId>
        <version>3.1.3</version>
    </dependency>
</dependencies>

2. If you do not want a large number of log messages printed at runtime, add a log4j.properties file under the resources folder with the following configuration:

log4j.rootCategory=ERROR, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

# Set the default spark-shell log level to ERROR. When running the spark-shell, the
# log level for this class is used to overwrite the root logger's log level, so that
# the user can have different defaults for the shell and regular Spark apps.
log4j.logger.org.apache.spark.repl.Main=ERROR

# Settings to quiet third party logs that are too verbose
log4j.logger.org.spark_project.jetty=ERROR
log4j.logger.org.spark_project.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=ERROR
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=ERROR
log4j.logger.org.apache.parquet=ERROR
log4j.logger.parquet=ERROR

# SPARK-9183: Settings to avoid annoying messages when looking up nonexistent UDFs in SparkSQL with Hive support
log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL
log4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=ERROR

Create an RDD from a collection

Code:

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class Test01_list {

    public static void main(String[] args) {
        // 1. Create the Spark configuration
        SparkConf conf = new SparkConf().setAppName("SparkCore").setMaster("local[*]");
        // 2. Create the SparkContext
        JavaSparkContext sc = new JavaSparkContext(conf);
        // 3. Write the code
        // parallelize(List, number of partitions)
        JavaRDD<Integer> javaRDD = sc.parallelize(Arrays.asList(1, 2, 3, 45), 2);

        javaRDD.collect().forEach(System.out::println);
        // 4. Release resources
        sc.stop();
    }
}

Partitioning rules for the code above (4 elements, 2 partitions):

Partition 0: 1, 2
Partition 1: 3, 45

parallelize distributes elements using integer division over left-closed, right-open index ranges: partition i holds the elements whose indices fall in [i * length / numSlices, (i + 1) * length / numSlices). When the division is not exact, the leftover elements end up in the later partitions.
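
You can check this rule yourself with glom(), which gathers each partition into a list. A minimal sketch, assuming the same sc and javaRDD as in the example above (add import java.util.List):

// glom() turns each partition into a List, so the per-partition contents become visible
JavaRDD<List<Integer>> partitions = javaRDD.glom();
// expected output for parallelize(Arrays.asList(1, 2, 3, 45), 2): [1, 2] and [3, 45]
partitions.collect().forEach(System.out::println);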

Create RDD from external storage

Code:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class Test02_file {

    public static void main(String[] args) {
        // 1. Create the Spark configuration
        SparkConf conf = new SparkConf().setAppName("SparkCore").setMaster("local[*]");
        // 2. Create the SparkContext
        JavaSparkContext sc = new JavaSparkContext(conf);
        // 3. Write the code
        // textFile(path, minimum number of partitions)
        JavaRDD<String> javaRDD = sc.textFile("Input", 2);
        javaRDD.collect().forEach(System.out::println);
        // 4. Release resources
        sc.stop();
    }
}

Partitioning rules:
The actual number of partitions is computed by a formula (from Hadoop's FileInputFormat).
First obtain the total length of the input, totalSize, and compute the target split size: goalSize = totalSize / numSplits, where numSplits is the minimum number of partitions passed to textFile.
Then take the block size (128 MB by default) and compute the split size: splitSize = Math.max(minSize, Math.min(goalSize, blockSize)).
Finally the file is cut into chunks of splitSize using the 1.1x rule (a chunk is only split off while the remaining bytes are more than 1.1 times splitSize), and each chunk becomes a partition.

In actual development, you only need to compare the total file size divided by the requested number of partitions with the block size: the smaller of the two becomes the split size.
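
A small worked sketch of this formula with made-up numbers (it assumes Hadoop's default minimum split size of 1 byte and the default 128 MB block size):

public class SplitSizeDemo {

    public static void main(String[] args) {
        long totalSize = 300L * 1024 * 1024;   // a 300 MB file (illustrative value)
        int numSplits = 2;                     // minimum partitions requested, e.g. textFile(path, 2)
        long minSize = 1;                      // Hadoop's default minimum split size
        long blockSize = 128L * 1024 * 1024;   // default HDFS block size (128 MB)

        long goalSize = totalSize / numSplits;                              // 150 MB
        long splitSize = Math.max(minSize, Math.min(goalSize, blockSize));  // 128 MB
        System.out.println("splitSize = " + splitSize);
        // 300 MB / 128 MB = 2.34, so the file is cut into 3 splits (128 MB, 128 MB, 44 MB);
        // the last 44 MB stays as a single split because it is less than 1.1 * splitSize
    }
}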

Create an RDD from another RDD

This mainly means producing a new RDD by applying a transformation operator to an existing one; transformation operators are covered in detail in my next blog post, and a small sketch follows below.
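
As a minimal sketch (assuming the same sc as in the earlier examples), filter below is a transformation operator that produces a new RDD from an existing one:

// start from an existing RDD
JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 45), 2);
// filter is a transformation: it returns a new RDD rather than modifying the old one
JavaRDD<Integer> bigNumbers = numbers.filter(x -> x > 2);
bigNumbers.collect().forEach(System.out::println); // prints 3 and 45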


Origin blog.csdn.net/Tonystark_lz/article/details/127078181