Spark Learning --- 2: Spark Core (RDD overview; RDD programming: creation, partition rules, transformation operators, action operators)

1. Overview of RDDs

1.1 What is an RDD

RDD (Resilient Distributed Dataset) is Spark's abstraction of a distributed data set. In code it is an abstract class, and it represents an elastic, immutable, partitionable collection of elements that can be computed in parallel.

1.2 Five characteristics of RDD

1. A list of partitions, the basic units of the data set, which mark which partition each piece of data belongs to.
2. A function for computing each partition.
3. A list of dependencies on other RDDs.
4. Optionally, a Partitioner, the partitioning function of a key-value RDD, which controls which partition each record flows to.
5. Optionally, a list of preferred locations for computing each partition. When nodes and partitions do not line up one to one, the task is scheduled on a preferred node first: moving computation is cheaper than moving data, unless resources are insufficient.
(A conceptual sketch of these five members follows.)
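
As a rough illustration, the five characteristics can be pictured as the abstract members below. This is a hypothetical Java sketch for explanation only, not Spark's actual API (Spark declares the corresponding members on its Scala RDD abstract class); Partition, Dependency, Partitioner and ConceptualRDD are placeholder names.

import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.Optional;

// Placeholder types standing in for Spark's real Partition, Dependency and Partitioner classes.
interface Partition {}
interface Dependency {}
interface Partitioner {}

// A conceptual, Java-flavored paraphrase of the five RDD characteristics.
abstract class ConceptualRDD<T> {
    // 1. A list of partitions, the basic units of the data set
    protected abstract List<Partition> getPartitions();

    // 2. A function for computing each partition
    protected abstract Iterator<T> compute(Partition split);

    // 3. Dependencies on other RDDs
    protected abstract List<Dependency> getDependencies();

    // 4. Optionally, a Partitioner for key-value RDDs
    protected Optional<Partitioner> partitioner() {
        return Optional.empty();
    }

    // 5. Optionally, the preferred locations (e.g. HDFS block hosts) for each partition
    protected List<String> getPreferredLocations(Partition split) {
        return Collections.emptyList();
    }
}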

2. RDD programming

2.1 Creation of RDDs

There are three ways to create an RDD in Spark:
1. Create from a collection
2. Create from external storage
3. Create from other RDDs

2.1.1 IDEA environment preparation

1. Create a maven project named SparkCore
2. Add the spark-core dependency and, if needed, the Scala compiler plug-in to the pom file

<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.12</artifactId>
        <version>3.3.0</version>
    </dependency>
</dependencies>

3. If you don't want a flood of logs when running, you can add a log4j2.properties file under the resources folder with the following logging configuration:

log4j.rootCategory=ERROR, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

# Set the default spark-shell log level to ERROR. When running the spark-shell, the
# log level for this class is used to overwrite the root logger's log level, so that
# the user can have different defaults for the shell and regular Spark apps.
log4j.logger.org.apache.spark.repl.Main=ERROR

# Settings to quiet third party logs that are too verbose
log4j.logger.org.spark_project.jetty=ERROR
log4j.logger.org.spark_project.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=ERROR
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=ERROR
log4j.logger.org.apache.parquet=ERROR
log4j.logger.parquet=ERROR

# SPARK-9183: Settings to avoid annoying messages when looking up nonexistent UDFs in SparkSQL with Hive support
log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL
log4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=ERROR

2.1.2 Creating an IDEA live template (shortcut)

The SparkConf/SparkContext setup code is boilerplate that appears in every example, so we can define an IDEA live template to generate it with one click.
1. Click File -> Settings... -> Editor -> Live Templates -> output -> Live Template

(screenshots of the live template setup steps omitted)

// The code for the live template body (step 8 in the screenshots)
// 1. Create the configuration object
SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("sparkCore");

// 2. Create the JavaSparkContext
JavaSparkContext sc = new JavaSparkContext(conf);

// TODO: write your code here

// x. Stop the context
sc.stop();

Also set up automatic package import in IDEA (screenshot omitted).

2.1.3 Create from collection

1. Create package com.zhm.spark
2. Create class Test01_createRDDWithList

package com.zhm.spark;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.util.Arrays;
import java.util.List;

public class Test01_createRDDWithList {

    public static void main(String[] args) {

        // 1. Create the configuration object
        SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("sparkCore");

        // 2. Create the JavaSparkContext
        JavaSparkContext sparkContext = new JavaSparkContext(conf);

        // 3. TODO Write the code -- create an RDD from a list of strings
        JavaRDD<String> stringRDD = sparkContext.parallelize(Arrays.asList("hello", "zhm"));

        // 4. Collect the RDD
        List<String> result = stringRDD.collect();

        // 5. Print the collected results
        result.forEach(System.out::println);

        // 6. Stop the sparkContext
        sparkContext.stop();
    }
}

Running result: (screenshot omitted)

2.1.4 Creating an RDD from an external storage system

RDDs can be created from data sets in external storage systems, such as the local file system and any data source supported by Hadoop (HDFS, HBase, etc.).
1. Data preparation
In the SparkCore project, right-click the project name -> create an input folder -> right-click the input folder -> create word.txt and edit it with the following content:

hello world
hello zhm
hello future

2. Create RDD

package com.zhm.spark;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.util.List;

public class Test02_createRDDWithFile {

    public static void main(String[] args) {

        // 1. Create the configuration object
        SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("sparkCore");

        // 2. Create the JavaSparkContext
        JavaSparkContext javaSparkContext = new JavaSparkContext(conf);

        // 3. Write the code -- read the file under ./input and create an RDD
        JavaRDD<String> fileRDD = javaSparkContext.textFile("./input/word.txt");

        // 4. Collect the RDD
        List<String> result = fileRDD.collect();

        // 5. Print the collected results
        result.forEach(System.out::println);

        // 6. Stop the sparkContext
        javaSparkContext.stop();
    }
}

Running result: (screenshot omitted)

2.2 Partition rules

2.2.1 Creating an RDD from a Collection

1. Create a package name: com.zhm.spark.partition
2. Code verification

package com.zhm.spark.partition;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.util.Arrays;

/**
 * @ClassName Test01_ListPartition
 * @Description TODO
 * @Author Zouhuiming
 * @Date 2023/6/27 11:42
 * @Version 1.0
 */
public class Test01_ListPartition {

    public static void main(String[] args) {

        // 1. Create the configuration object
        SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("sparkCore");

        // 2. Create the JavaSparkContext
        JavaSparkContext javaSparkContext = new JavaSparkContext(conf);

        // 3. Write the code
        // Partition boundaries for 5 elements in 2 slices (intervals are left-open, right-closed over element positions):
        //   slice 0: 0*(5/2)=0   to 1*(5/2)=2.5  -> (0, 2.5]  -> elements 1, 2
        //   slice 1: 1*(5/2)=2.5 to 2*(5/2)=5    -> (2.5, 5]  -> elements 3, 4, 5
        JavaRDD<Integer> integerRDD = javaSparkContext.parallelize(Arrays.asList(11, 12, 36, 14, 05), 2);

        // 4. Save the RDD to files and inspect them to see how the data was partitioned
        integerRDD.saveAsTextFile("output");

//        JavaRDD<String> stringRDD = javaSparkContext.parallelize(Arrays.asList("1", "2", "3", "4", "5"), 2);
//        stringRDD.saveAsTextFile("output");

        // 5. Stop the javaSparkContext
        javaSparkContext.stop();
    }
}

Running result: (screenshots omitted)

2.2.2 Creating an RDD from a file

package com.zhm.spark.partition;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

/**
 * @ClassName Test02_FilePartition
 * @Description TODO
 * @Author Zouhuiming
 * @Date 2023/6/27 13:54
 * @Version 1.0
 */
public class Test02_FilePartition {

    public static void main(String[] args) {

        // 1. Create the configuration object
        SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("sparkCore");

        // 2. Create the JavaSparkContext
        JavaSparkContext javaSparkContext = new JavaSparkContext(conf);

        // 3. Write the code -- read input/1.txt, suggesting a minimum of 3 partitions
        JavaRDD<String> stringJavaRDD = javaSparkContext.textFile("input/1.txt", 3);

        // 4. Save stringJavaRDD to files to inspect the partitioning
        stringJavaRDD.saveAsTextFile("output");

        // 5. Release resources
        javaSparkContext.stop();
    }
}


Running result: (screenshots omitted)
1. Partition rules
(1) How the number of partitions is computed for JavaRDD<String> stringJavaRDD = javaSparkContext.textFile("input/1.txt", 3):
  • totalSize = 10 // the actual length of the file in bytes; check your file's line separator, because different separators give different lengths
  • goalSize = totalSize / 3 = 3 (bytes) // the target number of bytes per split
  • totalSize / goalSize = 10 / 3 would give splits of 3, 3 and 4 bytes
  • Because the remaining 4 bytes are more than 1.1 times goalSize, Hadoop's 1.1x split rule carves off one more split, so the final split sizes are 3, 3, 3, 1 and four partitions are created (a sketch of this calculation follows).
(2) Spark reads files the Hadoop way, that is line by line, so a whole line always goes to one partition regardless of byte counts.
(3) The position from which each partition starts reading is calculated in byte offsets.
(4) Each partition's offset range is then mapped to whole lines (diagram omitted).
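
The following is a minimal sketch of that split-size calculation, assuming it mirrors Hadoop's FileInputFormat split loop with the 1.1 slop factor while ignoring HDFS block size and compression; SplitSizeSketch and splitSizes are illustrative names, not Spark or Hadoop API.

import java.util.ArrayList;
import java.util.List;

public class SplitSizeSketch {

    // Assumes totalSize >= minPartitions so that goalSize is at least 1 byte.
    static List<Long> splitSizes(long totalSize, int minPartitions) {
        long goalSize = totalSize / minPartitions;   // target bytes per split
        final double SPLIT_SLOP = 1.1;               // the 1.1x rule
        List<Long> sizes = new ArrayList<>();
        long remaining = totalSize;
        while ((double) remaining / goalSize > SPLIT_SLOP) {
            sizes.add(goalSize);
            remaining -= goalSize;
        }
        if (remaining > 0) {
            sizes.add(remaining);                    // the last, possibly smaller, split
        }
        return sizes;
    }

    public static void main(String[] args) {
        // For the 10-byte file and minPartitions = 3 above, this prints [3, 3, 3, 1]
        System.out.println(splitSizes(10, 3));
    }
}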

2.3 Transformation operators

2.3.1 Value type

Create package name com.zhm.spark.operator.value

2.3.1.1 map() mapping

1. Usage: given a mapping function f, map(f) transforms the RDD element by element.
2. The mapping function f:
(1) f can be a function with an explicit signature, or an anonymous inner class / lambda.
(2) The parameter type of f must match the element type of the RDD; the output type is up to the developer.
3. Explanation:
f takes one argument and can be written as an anonymous subclass of Function. When an RDD executes map, every data item in the RDD is traversed and f is applied to it once, producing a new RDD. Each element of the new RDD is obtained by applying f, in order, to the corresponding element of the original RDD.
4. Requirement: append " Thank you" to the end of each line of the Lover.txt file.
5. Implementation

package com.zhm.spark.operator.value;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;

/**
 * @ClassName StudyMap
 * @Description Operates on each individual element
 * @Author Zouhuiming
 * @Date 2023/6/27 14:04
 * @Version 1.0
 */
public class StudyMap {

    public static void main(String[] args) {

        // 1. Create the configuration object
        SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("sparkCore");

        // 2. Create the JavaSparkContext
        JavaSparkContext sparkContext = new JavaSparkContext(conf);

        // 3. Write the code -- operate on each element
        JavaRDD<String> stringJavaRDD = sparkContext.textFile("input/Lover.txt");

        // Lambda form
        JavaRDD<String> mapRDD = stringJavaRDD.map(s -> s + " Thank you");

        // Anonymous inner class form
        JavaRDD<String> mapRDD1 = stringJavaRDD.map(new Function<String, String>() {

            @Override
            public String call(String s) throws Exception {

                return s + " Thank you";
            }
        });

        // 4. Collect and print the results
        mapRDD.collect().forEach(System.out::println);
        System.out.println("++++++++++++++++++++++++");
        mapRDD1.collect().forEach(System.out::println);

        // 5. Stop the sparkContext
        sparkContext.stop();
    }
}

Running result: (screenshot omitted)

2.3.1.2 flatMap() flattening

flatMap, like map, is a data-mapping operator.
1. Usage: flatMap(f) transforms the RDD element by element.
2. Difference from map:
The mapping function of map has the shape (element) -> (element), while the mapping function of flatMap has the shape (element) -> (collection).
3. Process:
(1) f produces a collection for each element.
(2) The "outer wrapper" of each collection is removed and its elements are collected individually.
4. Function description:
Similar to map, each element of the RDD is transformed by applying the function f, and the results are wrapped into a new RDD. The difference is that in flatMap the return value of f is a collection, and every element of that collection is split out and placed into the new RDD on its own.
5. Case: create a collection whose elements are sub-collections, and flatten the data of all sub-collections into one large collection.
6. Implementation

package com.zhm.spark.operator.value;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/**
 * @ClassName StudyFLatMap
 * @Description Flattens collections into individual elements
 * @Author Zouhuiming
 * @Date 2023/6/27 14:09
 * @Version 1.0
 */
public class StudyFLatMap {

    public static void main(String[] args) {

        // 1. Create the configuration object
        SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("sparkCore");

        // 2. Create the JavaSparkContext
        JavaSparkContext sparkContext = new JavaSparkContext(conf);

        // 3. Write the logic -- build a list arrayList whose elements are lists of strings
        ArrayList<List<String>> arrayList = new ArrayList<>();

        arrayList.add(Arrays.asList("1", "2", "3"));
        arrayList.add(Arrays.asList("4", "5", "6"));
        arrayList.add(Arrays.asList("7", "8", "9"));

        // 4. Create an RDD from arrayList
        JavaRDD<List<String>> listJavaRDD = sparkContext.parallelize(arrayList);

        // 5. Use flatMap to flatten each element of the RDD; the type parameter is the element type after flattening
        JavaRDD<String> stringJavaRDD = listJavaRDD.flatMap(new FlatMapFunction<List<String>, String>() {

            @Override
            public Iterator<String> call(List<String> strings) throws Exception {

                return strings.iterator();
            }
        });

        // 6. Collect the RDD and print it
        System.out.println("--------- flatMap test on the RDD built from a collection ------------");
        stringJavaRDD.collect().forEach(System.out::println);

        // TODO When reading from a file, you have to turn each element into a collection yourself
        // 7. Read the data from a file
        JavaRDD<String> javaRDD = sparkContext.textFile("input/word.txt");

        // 8. Split each line on spaces, wrap the resulting String array as a list and return that list's iterator
        JavaRDD<String> stringJavaRDD1 = javaRDD.flatMap(new FlatMapFunction<String, String>() {

            @Override
            public Iterator<String> call(String s) throws Exception {

                String[] split = s.split(" ");
                return Arrays.asList(split).iterator();
            }
        });

        // 9. Collect the RDD and print it
        System.out.println("----- flatMap test on the RDD built from the file system ---");
        stringJavaRDD1.collect().forEach(System.out::println);

        // 10. Stop the sparkContext
        sparkContext.stop();
    }
}


Running result: (screenshot omitted)

2.3.1.3 filter() filtering

1. Usage: filter(f) applies the predicate function f to the RDD element by element.
2. The predicate function f:
(1) f is a function of type (RDD element type) -> Boolean.
(2) The parameter type of f must match the element type of the RDD, and f may only return true or false.
3. Function description:
(1) When filter(f) is called on an RDD, f is applied to every element of the RDD.
(2) Elements for which f returns true are kept, and elements for which f returns false are filtered out.
4. Requirement: create an RDD and keep only the elements whose remainder when divided by 2 is 0 (the even numbers).
5. Code implementation

package com.zhm.spark.operator.value;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;

import java.util.Arrays;

/**
 * @ClassName StudyFilter
 * @Description Filters elements
 * @Author Zouhuiming
 * @Date 2023/6/27 14:26
 * @Version 1.0
 */
public class StudyFilter {

    public static void main(String[] args) {

        // 1. Create the configuration object
        SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("sparkCore");

        // 2. Create the JavaSparkContext
        JavaSparkContext sparkContext = new JavaSparkContext(conf);

        // 3. TODO Create an RDD from a collection
        JavaRDD<Integer> javaRDD = sparkContext.parallelize(Arrays.asList(1, 2, 3, 4, 5), 2);

        // Keep only the elements whose remainder when divided by 2 is 0
        JavaRDD<Integer> filterRDD = javaRDD.filter(new Function<Integer, Boolean>() {

            @Override
            public Boolean call(Integer integer) throws Exception {

                return integer % 2 == 0;
            }
        });

        System.out.println("------ filter operator test ------");
        filterRDD.collect().forEach(System.out::println);

        // x. Stop the sparkContext
        sparkContext.stop();
    }
}


Running result: (screenshot omitted)

2.3.1.4 groupBy() grouping

1. Usage: groupBy(f) applies the function f to each element.
2. The function f:
(1) f is user-defined and may return any type.
(2) The return value of f becomes the key in the result of groupBy, and the original element becomes the value.
(3) groupBy returns a new, repartitioned K-V type RDD.
3. Function description: grouping; elements are grouped by the return value of the supplied function, and the values that share the same key are put into one iterator.
4. Case: create an RDD and group its elements by value % 2.
5. Code implementation

package com.zhm.spark.operator.value;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;

import java.util.Arrays;

/**
 * @ClassName StudyGroupBy
 * @Description TODO
 * @Author Zouhuiming
 * @Date 2023/6/27 14:32
 * @Version 1.0
 */
public class StudyGroupBy {

    public static void main(String[] args) throws InterruptedException {

        // 1. Create the configuration object
        SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("sparkCore");

        // 2. Create the JavaSparkContext
        JavaSparkContext sparkContext = new JavaSparkContext(conf);

        // 3. TODO Create an RDD from a collection
        JavaRDD<Integer> javaRDD = sparkContext.parallelize(Arrays.asList(1, 2, 3, 4, 5), 2);

        // 4. Save to files to see the partitioning before groupBy
        javaRDD.saveAsTextFile("outputGroupByBefore");

        // 5. Apply groupBy to the RDD; the grouping rule is value % 2
        JavaPairRDD<Integer, Iterable<Integer>> integerIterableJavaPairRDD = javaRDD.groupBy(new Function<Integer, Integer>() {

            @Override
            public Integer call(Integer integer) throws Exception {

                return integer % 2;
            }
        });

        // 6. The key type can easily be changed
        JavaPairRDD<Boolean, Iterable<Integer>> booleanIterableJavaPairRDD = javaRDD.groupBy(new Function<Integer, Boolean>() {

            @Override
            public Boolean call(Integer integer) throws Exception {

                return integer % 2 == 0;
            }
        });

        // 7. Print the results
        System.out.println("--- RDD contents after groupBy ---");
        integerIterableJavaPairRDD.collect().forEach(System.out::println);
        booleanIterableJavaPairRDD.collect().forEach(System.out::println);

        // Keep the application alive so the web UI at http://localhost:4040 can be inspected
        Thread.sleep(1000000000L);
        // x. Stop the sparkContext
        sparkContext.stop();
    }
}


Running result: (screenshot omitted)
6. Notes:
(1) groupBy involves a shuffle.
(2) Shuffle: the process of redistributing and regrouping data across partitions.
(3) Shuffle data is always written to disk. You can run the program in local mode and observe this in the Spark web UI on port 4040 (screenshot omitted).

2.3.1.5 distinct() deduplication

1. Usage: distinct(numPartitions) performs distributed deduplication of the RDD.
2. Parameter numPartitions: the number of partitions of the RDD after deduplication.
3. Function description: removes duplicate elements and puts the remaining elements into a new RDD.
4. Code implementation

package com.zhm.spark.operator.value;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.util.Arrays;

/**
 * @ClassName StudyDistinct
 * @Description Deduplication
 * @Author Zouhuiming
 * @Date 2023/6/27 14:44
 * @Version 1.0
 */
public class StudyDistinct {

    public static void main(String[] args) throws InterruptedException {

        // 1. Create the configuration object
        SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("sparkCore");

        // 2. Create the JavaSparkContext
        JavaSparkContext sparkContext = new JavaSparkContext(conf);

        // 3. TODO Create an RDD from a collection
        JavaRDD<Integer> javaRDD = sparkContext.parallelize(Arrays.asList(1, 2, 3, 1, 2, 3, 4, 5, 6), 2);

        // 4. Use the distinct operator; it deduplicates in a distributed way, which is slower but will not OOM
        JavaRDD<Integer> distinctRDD = javaRDD.distinct();

        // 5. Collect and print
        distinctRDD.collect().forEach(System.out::println);

        // Keep the application alive so the web UI at http://localhost:4040 can be inspected
        Thread.sleep(100000000L);
        // 6. Stop the sparkContext
        sparkContext.stop();
    }
}


Running result: (screenshot omitted)

5. distinct also involves a shuffle (screenshot omitted).

2.3.1.6 sortBy() sorting

1. Usage: RDD.sortBy(f, ascending, numPartitions)
2. Parameters:
(1) Function f: applied to each element; its return value is used as the sort key.
(2) ascending: a Boolean, true by default; it decides whether the elements of the resulting RDD are in ascending or descending order.
(3) numPartitions: the number of partitions of the sorted RDD.
3. Function description:
This operation sorts the data. Each element can first be processed by the function f, and the data is then sorted by the result of f; the default order is ascending. By default the resulting RDD has the same number of partitions as the original RDD. Spark's sort result is globally ordered.
4. Case: create an RDD and sort it by numeric value in ascending order (flip the flag for descending order).
5. Code implementation

package com.zhm.spark.operator.value;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;

import java.util.Arrays;

/**
 * @ClassName StudySortBy
 * @Description TODO
 * @Author Zouhuiming
 * @Date 2023/6/27 14:48
 * @Version 1.0
 */
public class StudySortBy {

    public static void main(String[] args) {

        // 1. Create the configuration object
        SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("sparkCore");

        // 2. Create the JavaSparkContext
        JavaSparkContext sparkContext = new JavaSparkContext(conf);

        // 3. TODO Create the RDD
        JavaRDD<Integer> javaRDD = sparkContext.parallelize(Arrays.asList(1, 3, 2, 9, 6, 5, 3), 2);

        // 4. Sort javaRDD with sortBy (the function result is the sort key; true means ascending)
        JavaRDD<Integer> javaRDD1 = javaRDD.sortBy(new Function<Integer, Integer>() {

            @Override
            public Integer call(Integer integer) throws Exception {

                return integer;
            }
        }, true, 2);

        // 5. Collect and print
        javaRDD1.collect().forEach(System.out::println);

        // x. Stop the sparkContext
        sparkContext.stop();
    }
}


Running result: (screenshot omitted)

2.3.2 Key-Value Type

Create package name: com.zhm.spark.operator.keyvalue

2.3.2.1 mapToPair()

1. Usage: RDD.mapToPair(f) applies the function f to each record of the parent RDD to produce a new record <k, v>.
2. Function: converts a Value-type RDD into a Key-Value-type RDD.
3. Code implementation

package com.zhm.spark.operator.keyvalue;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.PairFunction;
import scala.Tuple2;

import java.util.Arrays;
import java.util.Iterator;

/**
 * @ClassName StudyMapToPair
 * @Description Converts a non-KV RDD into a KV (key-value) RDD
 * @Author Zouhuiming
 * @Date 2023/6/27 14:52
 * @Version 1.0
 */
public class StudyMapToPair {

    public static void main(String[] args) {

        // 1. Create the configuration object
        SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("sparkCore");

        // 2. Create the JavaSparkContext
        JavaSparkContext sparkContext = new JavaSparkContext(conf);

        // 3. TODO Create the RDD
        JavaRDD<Integer> javaRDD = sparkContext.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6), 2);

        // 4. Use mapToPair to convert javaRDD into a KV-type RDD
        JavaPairRDD<Integer, Integer> pairRDD = javaRDD.mapToPair(new PairFunction<Integer, Integer, Integer>() {

            @Override
            public Tuple2<Integer, Integer> call(Integer integer) throws Exception {

                return new Tuple2<>(integer, integer);
            }
        });

        // 5. Collect and print
        System.out.println("------ KV RDD obtained by converting a V-type RDD ------");
        pairRDD.collect().forEach(System.out::println);

        // TODO Create a KV RDD directly from a collection
        JavaPairRDD<Integer, Integer> integerIntegerJavaPairRDD = sparkContext.parallelizePairs(Arrays.asList(new Tuple2<>(1, 1), new Tuple2<>(2, 2), new Tuple2<>(3, 3), new Tuple2<>(4, 5)));

        // 6. Collect and print
        System.out.println("----- KV RDD created directly from a collection -----");
        integerIntegerJavaPairRDD.collect().forEach(System.out::println);

        sparkContext.stop();
    }
}

Running result: (screenshot omitted)

2.3.2.2 mapValues() only operates on Value

1. Usage: newRDD = oldRDD.mapValues(func)
2. Parameter func: a user-defined function that operates only on the v of each (k, v) record in oldRDD.
3. Function description: for an RDD of type (K, V), operates only on V.
4. Requirement: create a pairRDD and append the suffix " Fighting" to each value.
5. Code implementation

package com.zhm.spark.operator.keyvalue;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import scala.Tuple2;

import java.util.Arrays;

/**
 * @ClassName StudyMapValues
 * @Description Operates on the v of a KV-type RDD
 * @Author Zouhuiming
 * @Date 2023/6/27 15:01
 * @Version 1.0
 */
public class StudyMapValues {

    public static void main(String[] args) {

        // 1. Create the configuration object
        SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("sparkCore");

        // 2. Create the JavaSparkContext
        JavaSparkContext sparkContext = new JavaSparkContext(conf);

        // 3. TODO Create the KV RDD
        JavaPairRDD<Integer, Integer> javaPairRDD = sparkContext.parallelizePairs(Arrays.asList(new Tuple2<>(1, 1), new Tuple2<>(2, 2), new Tuple2<>(3, 3),
                new Tuple2<>(4, 4), new Tuple2<>(5, 5)), 2);

        // 4. Append the suffix " Fighting" to the v of each pair
        JavaPairRDD<Integer, String> resultRDD = javaPairRDD.mapValues(new Function<Integer, String>() {

            @Override
            public String call(Integer integer) throws Exception {

                return integer + " Fighting";
            }
        });

        // 5. Collect and print
        resultRDD.collect().forEach(System.out::println);

        // 6. Stop the sparkContext
        sparkContext.stop();
    }
}


Running result: (screenshot omitted)

2.3.2.3 groupByKey() regroups according to K

1. Usage: KVRDD.groupByKey()
2. Function description:
groupByKey groups the values of each key into a single result set without aggregating them.
The operation can take a partitioner or a number of partitions (HashPartitioner is used by default).
3. Requirement: count the number of occurrences of each word.
4. Code implementation

package com.zhm.spark.operator.keyvalue;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.PairFunction;
import scala.Tuple2;

import java.util.Arrays;

/**
 * @ClassName StudyGroupByKey
 * @Description Group values by key
 * @Author Zouhuiming
 * @Date 2023/6/27 15:07
 * @Version 1.0
 */
public class StudyGroupByKey {

    public static void main(String[] args) {

        // 1. Create the configuration object
        SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("sparkCore");

        // 2. Create the JavaSparkContext
        JavaSparkContext sparkContext = new JavaSparkContext(conf);

        // 3. TODO Create the RDD
        JavaRDD<String> javaRDD = sparkContext.parallelize(Arrays.asList("a", "a", "a", "b", "b", "b", "b", "a"), 2);

        // 4. Build a KV RDD from javaRDD
        JavaPairRDD<String, Integer> pairRDD = javaRDD.mapToPair(new PairFunction<String, String, Integer>() {

            @Override
            public Tuple2<String, Integer> call(String s) throws Exception {

                return new Tuple2<>(s, 1);
            }
        });

        // 5. Group the values of identical keys
        JavaPairRDD<String, Iterable<Integer>> groupByKeyRDD = pairRDD.groupByKey();

        // 6. Collect and print the RDD contents
        System.out.println("----- contents of groupByKeyRDD ------");
        groupByKeyRDD.collect().forEach(System.out::println);

        sparkContext.stop();
    }
}

5. Running result: (screenshot omitted)

2.3.2.4 reduceByKey() aggregates V according to K

1. Usage: KVRDD.reduceByKey(f)
2. Function description: aggregates the elements of an RDD[K, V] by combining the Vs that share the same K. There are several overloads, which can set the number of partitions of the new RDD.
3. Requirement: count the number of occurrences of each word.
4. Code implementation

package com.zhm.spark.operator.keyvalue;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import scala.Tuple2;

import java.util.Arrays;

/**
 * @ClassName StudyReduceByKey
 * @Description TODO
 * @Author Zouhuiming
 * @Date 2023/6/27 15:15
 * @Version 1.0
 */
public class StudyReduceByKey {

    public static void main(String[] args) {

        // 1. Create the configuration object
        SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("sparkCore");

        // 2. Create the JavaSparkContext
        JavaSparkContext sparkContext = new JavaSparkContext(conf);

        // 3. TODO Create the RDD
        JavaRDD<String> javaRDD = sparkContext.parallelize(Arrays.asList("a", "a", "a", "b", "b", "b", "b", "a"), 2);

        // 4. Build a KV RDD from javaRDD
        JavaPairRDD<String, Integer> javaPairRDD = javaRDD.mapToPair(new PairFunction<String, String, Integer>() {

            @Override
            public Tuple2<String, Integer> call(String s) throws Exception {

                return new Tuple2<>(s, 1);
            }
        });

        // 5. Aggregate identical keys to count how many times each word occurs
        JavaPairRDD<String, Integer> result = javaPairRDD.reduceByKey(new Function2<Integer, Integer, Integer>() {

            @Override
            public Integer call(Integer integer, Integer integer2) throws Exception {

                return integer + integer2;
            }
        });

        // 6. Collect and print the RDD contents
        System.out.println("Contents of result:");
        result.collect().forEach(System.out::println);

        // x. Stop the sparkContext
        sparkContext.stop();
    }
}


Running result: (screenshot omitted)

2.3.2.5 Difference between reduceByKey and groupByKey

1. reduceByKey: aggregates by key; a combine (pre-aggregation) step runs on the map side before the shuffle, and the result is an RDD[K, V].
2. groupByKey: groups by key and shuffles directly, without pre-aggregation.
3. When it does not affect the business logic, prefer reduceByKey. Summation is not affected by pre-aggregation, but a plain average is (you cannot average partial averages). In that case, convert the values into a form that can be pre-aggregated, for example (sum, count) pairs, before combining, as sketched below.
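
Here is a minimal sketch, not from the original post, of computing a per-key average with reduceByKey by first mapping each value to a (sum, count) pair so that map-side pre-aggregation stays correct; StudyAverageByKey is an illustrative class name.

package com.zhm.spark.operator.keyvalue;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

import java.util.Arrays;

public class StudyAverageByKey {

    public static void main(String[] args) {

        // 1. Create the configuration object and the JavaSparkContext
        SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("sparkCore");
        JavaSparkContext sparkContext = new JavaSparkContext(conf);

        // 2. Sample (key, score) pairs
        JavaPairRDD<String, Integer> scores = sparkContext.parallelizePairs(Arrays.asList(
                new Tuple2<>("a", 1), new Tuple2<>("a", 3), new Tuple2<>("b", 4)), 2);

        // 3. value -> (sum, count), so partial results from different partitions can be merged safely
        JavaPairRDD<String, Tuple2<Integer, Integer>> sumCount =
                scores.mapValues(v -> new Tuple2<Integer, Integer>(v, 1));

        // 4. Merge the (sum, count) pairs per key; this step can pre-aggregate before the shuffle
        JavaPairRDD<String, Tuple2<Integer, Integer>> reduced =
                sumCount.reduceByKey((t1, t2) -> new Tuple2<Integer, Integer>(t1._1() + t2._1(), t1._2() + t2._2()));

        // 5. Divide once per key, after all the data has been combined
        JavaPairRDD<String, Double> average =
                reduced.mapValues(t -> (double) t._1() / t._2());

        average.collect().forEach(System.out::println);   // prints (a,2.0) and (b,4.0)

        // x. Stop the sparkContext
        sparkContext.stop();
    }
}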

2.3.2.6 sortByKey() sorts according to K

1. Usage: kvRDD.sortByKey(true/false)
2. Function description: called on a (K, V) RDD; K must be orderable (it implements the Ordered/Comparable interface, or you supply a Comparator). Returns a (K, V) RDD sorted by key.
3. Parameter: true for ascending order, false for descending order.
4. Requirement: create a pairRDD and sort it by key in ascending and descending order.
5. Code implementation

package com.zhm.spark.operator.keyvalue;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

import java.util.Arrays;

/**
 * @ClassName StudySortByKey
 * @Description TODO
 * @Author Zouhuiming
 * @Date 2023/6/27 15:47
 * @Version 1.0
 */
public class StudySortByKey {

    public static void main(String[] args) {

        // 1. Create the configuration object
        SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("sparkCore");

        // 2. Create the JavaSparkContext
        JavaSparkContext sparkContext = new JavaSparkContext(conf);

        // 3. TODO Create the KV RDD
        JavaPairRDD<Integer, String> javaPairRDD = sparkContext.parallelizePairs(Arrays.asList(new Tuple2<>(4, "d"), new Tuple2<>(3, "c"), new Tuple2<>(1, "a"),
                new Tuple2<>(2, "b")));

        // 4. Collect and print
        System.out.println("Before sorting:");
        javaPairRDD.collect().forEach(System.out::println);

        // 5. Sort the RDD by key (true = ascending)
        JavaPairRDD<Integer, String> sortByKeyRDD = javaPairRDD.sortByKey(true);

        // Collect and print
        System.out.println("After sorting:");
        sortByKeyRDD.collect().forEach(System.out::println);

        // x. Stop the sparkContext
        sparkContext.stop();
    }
}


Running result: (screenshot omitted)
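
If the key type is not naturally ordered, or a different ordering is wanted, the Java API also has sortByKey overloads that take a java.util.Comparator. A small sketch, not from the original post, that sorts the same pairs by key in descending order (StudySortByKeyComparator is an illustrative class name; this assumes the sortByKey(Comparator, boolean) overload of JavaPairRDD):

package com.zhm.spark.operator.keyvalue;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

import java.util.Arrays;
import java.util.Comparator;

public class StudySortByKeyComparator {

    public static void main(String[] args) {

        // 1. Create the configuration object and the JavaSparkContext
        SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("sparkCore");
        JavaSparkContext sparkContext = new JavaSparkContext(conf);

        // 2. Create the KV RDD
        JavaPairRDD<Integer, String> javaPairRDD = sparkContext.parallelizePairs(Arrays.asList(
                new Tuple2<>(4, "d"), new Tuple2<>(3, "c"), new Tuple2<>(1, "a"), new Tuple2<>(2, "b")));

        // 3. Sort by key using an explicit, serializable comparator (reverseOrder => descending keys)
        JavaPairRDD<Integer, String> descRDD =
                javaPairRDD.sortByKey(Comparator.<Integer>reverseOrder(), true);

        // 4. Collect and print: (4,d) (3,c) (2,b) (1,a)
        descRDD.collect().forEach(System.out::println);

        // x. Stop the sparkContext
        sparkContext.stop();
    }
}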

2.4 Action operators

An action operator triggers the execution of the whole job, because transformation operators are lazily evaluated and do not run immediately; the short sketch below illustrates this laziness.
Create package name: com.zhm.spark.operator.action
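
The sketch below, not from the original post, shows the laziness: the map function does not run until an action such as collect is called (LazyEvaluationDemo is an illustrative class name).

package com.zhm.spark.operator.action;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.util.Arrays;

public class LazyEvaluationDemo {

    public static void main(String[] args) {

        // 1. Create the configuration object and the JavaSparkContext
        SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("sparkCore");
        JavaSparkContext sparkContext = new JavaSparkContext(conf);

        JavaRDD<Integer> numbers = sparkContext.parallelize(Arrays.asList(1, 2, 3));

        // 2. Transformation only: nothing is printed here, the map is just recorded in the lineage
        JavaRDD<Integer> doubled = numbers.map(x -> {
            System.out.println("mapping " + x);   // runs only when a job is triggered
            return x * 2;
        });
        System.out.println("map declared, but not executed yet");

        // 3. Action: this triggers the job, and only now does the map function actually run
        System.out.println(doubled.collect());    // [2, 4, 6]

        // x. Stop the sparkContext
        sparkContext.stop();
    }
}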

2.4.1 collect(): returns the data set in the form of an array

1. Usage: RDD.collect()
2. Function description: returns all elements of the data set to the driver program as an array.
Note: all the data is pulled to the Driver side, so use it with caution.
3. Requirement: create an RDD and collect its contents to the Driver side for printing (the code is at the end).

2.4.2 count() returns the number of elements in the RDD

1. Usage: RDD.count(); the return value is of type Long.
2. Function description: returns the number of elements in the RDD.
3. Requirement: create an RDD and count its elements (the code is at the end).

2.4.3 first() returns the first element in the RDD

1. Usage: RDD.first(); the return type is the element type.
2. Function description: returns the first element of the RDD.
3. Requirement: create an RDD and return its first element (the code is at the end).

2.4.4 take() returns an array consisting of the first n elements of RDD

1. Usage: RDD.take(int num); the return value is a List of the RDD's element type.
2. Function description: returns an array made up of the first n elements of the RDD.
3. Requirement: create an RDD and take its first 3 elements (the code is at the end).

2.4.5 countByKey() counts the number of each key

1. Usage: pairRDD.countByKey(); the return type is Map<[key type of the RDD], Long>.
2. Function description: counts the number of records for each key.
3. Requirement: create a pairRDD and count the number of records for each key (the code is at the end).

2.4.6 save related operators

1. saveAsTextFile(path)
(1) Function: saves the RDD as a text file.
(2) Description: writes the elements of the data set as a text file to HDFS or another supported file system. For each element, Spark calls its toString method to convert it to a line of text.
2. saveAsObjectFile(path)
(1) Function: serializes the elements into objects and saves them to a file.
(2) Description: serializes the elements of the RDD and stores them in a file.
(The code is at the end; a sketch of reading an object file back follows.)
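
A small sketch, not from the original post, showing that an RDD written with saveAsObjectFile can be read back with JavaSparkContext.objectFile (ObjectFileRoundTrip and the outputObjectDemo path are illustrative; the output directory must not exist yet):

package com.zhm.spark.operator.action;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.util.Arrays;

public class ObjectFileRoundTrip {

    public static void main(String[] args) {

        // 1. Create the configuration object and the JavaSparkContext
        SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("sparkCore");
        JavaSparkContext sparkContext = new JavaSparkContext(conf);

        // 2. Write: serialize the elements and store them under the given directory
        JavaRDD<Integer> numbers = sparkContext.parallelize(Arrays.asList(1, 2, 3));
        numbers.saveAsObjectFile("outputObjectDemo");

        // 3. Read back: deserialize the files into an RDD of the same element type
        JavaRDD<Integer> restored = sparkContext.objectFile("outputObjectDemo");
        restored.collect().forEach(System.out::println);

        // x. Stop the sparkContext
        sparkContext.stop();
    }
}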

2.4.7 foreach() traverses each element in RDD

1. Function: iterates over each element of the RDD and applies the given function to each in turn.
2. Requirement: create an RDD and print each element.

2.4.8 All code implementation

package com.zhm.spark.operator.action;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.VoidFunction;
import scala.Tuple2;

import java.util.Arrays;

/**
 * @ClassName TestAll
 * @Description TODO
 * @Author Zouhuiming
 * @Date 2023/6/28 13:47
 * @Version 1.0
 */
public class TestAll {

    public static void main(String[] args) {

        // Set the user name used when writing data to HDFS
        System.setProperty("HADOOP_USER_NAME", "zhm");

        // 1. Create the configuration object
        SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("TestAll");

        // 2. Create the JavaSparkContext
        JavaSparkContext sparkContext = new JavaSparkContext(conf);

        // 3. Get the RDD
//        JavaRDD<Integer> javaRDD = sparkContext.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6, 7));
        JavaPairRDD<Integer, String> javaPairRDD = sparkContext.parallelizePairs(Arrays.asList(new Tuple2<>(1, "z"), new Tuple2<>(2, "h"), new Tuple2<>(3, "m"),
                new Tuple2<>(1, "zhm")
        ), 2);

        // 4. collect(): return the data set as an array
        // Note: all the data is pulled to the driver side, use with caution
        System.out.println("------------- collect test -------------");
        javaPairRDD.collect().forEach(System.out::println);

        // 5. count(): return the number of elements in the RDD
        System.out.println("------------- count test -------------");
        System.out.println(javaPairRDD.count());

        // 6. first(): return the first element of the RDD
        System.out.println("------------- first test -------------");
        System.out.println(javaPairRDD.first());

        // 7. take(): return the first n elements of the RDD
        System.out.println("------------- take test -------------");
        javaPairRDD.take(3).forEach(System.out::println);

        // 8. countByKey(): count the number of records for each key
        System.out.println("------------- countByKey test -------------");
        System.out.println(javaPairRDD.countByKey());

        // 9. save-related operators
        // Store the data as text
        javaPairRDD.saveAsTextFile("outputText");
        // Store the data as serialized objects
        javaPairRDD.saveAsObjectFile("outputObject");

        // 10. foreach(): iterate over every element of the RDD
        javaPairRDD.foreach(new VoidFunction<Tuple2<Integer, String>>() {

            @Override
            public void call(Tuple2<Integer, String> integerStringTuple2) throws Exception {

                System.out.println(integerStringTuple2._1() + ":" + integerStringTuple2._2());
            }
        });

        // x. Stop the sparkContext
        sparkContext.stop();
    }
}


Running result: (screenshot omitted)

Origin: blog.csdn.net/qq_44804713/article/details/131568017