If you know java 8 streams, you already know big data!

If you know the stream API of any language, there is no reason you cannot pick up big data development.

As the saying goes, a man chasing a woman has a mountain to cross, while a woman chasing a man only has a veil between them. Learning big data from zero feels like facing a mountain; but if you already know java streams, or the stream API of any language, only a thin veil separates you from big data.

This article uses java stream operations as a reference to explain some basic spark operators. The same ideas apply to flink, another popular big data framework.

Preparation

The test data is below; the columns are name, age, department, and level, respectively.

张三,20,研发部,普通员工
李四,31,研发部,普通员工
李丽,36,财务部,普通员工
张伟,38,研发部,经理
杜航,25,人事部,普通员工
周歌,28,研发部,普通员工

Create an Employee class.

@Getter
@Setter
@AllArgsConstructor
@NoArgsConstructor
@ToString
static
class Employee implements Serializable {
    private String name;
    private Integer age;
    private String department;
    private String level;
}

Versions: jdk:1.8 spark:3.2.0 scala:2.12.15.

The scala dependency above is only required by the spark framework itself.

Since scala is still a fairly niche language, this article demonstrates the spark code in java.

map operators

java stream map

map is a one-to-one operation: it applies an arbitrary transformation to each upstream record and produces exactly one output record. This idea is the same in java, spark, and flink.

We first use a java stream to read the file, then use map to turn each line into an Employee object.

List<String> list = FileUtils.readLines(new File("f:/test.txt"), "utf-8");
List<Employee> employeeList = list.stream().map(word -> {
    List<String> words = Arrays.stream(word.split(",")).collect(Collectors.toList());
    Employee employee = new Employee(words.get(0), Integer.parseInt(words.get(1)), words.get(2), words.get(3));
    return employee;
}).collect(Collectors.toList());

employeeList.forEach(System.out::println);

Transformed data:

JavaStreamDemo.Employee(name=张三, age=20, department=研发部, level=普通员工)
JavaStreamDemo.Employee(name=李四, age=31, department=研发部, level=普通员工)
JavaStreamDemo.Employee(name=李丽, age=36, department=财务部, level=普通员工)
JavaStreamDemo.Employee(name=张伟, age=38, department=研发部, level=经理)
JavaStreamDemo.Employee(name=杜航, age=25, department=人事部, level=普通员工)
JavaStreamDemo.Employee(name=周歌, age=28, department=研发部, level=普通员工)

spark map

First obtain a SparkSession, read the file, and get back a Dataset (spark's distributed dataset abstraction).

SparkSession session = SparkSession.builder().master("local[*]").getOrCreate();
Dataset<Row> reader = session.read().text("F:/test.txt");
reader.show();

show() prints the current dataset; it is an action operator. The result:

+-----------------------+
|                  value|
+-----------------------+
|张三,20,研发部,普通员工|
|李四,31,研发部,普通员工|
|李丽,36,财务部,普通员工|
|    张伟,38,研发部,经理|
|杜航,25,人事部,普通员工|
|周歌,28,研发部,普通员工|
+-----------------------+
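As a side note of mine (not in the original example), spark's CSV reader can split the commas for you, so you get named columns without writing a map at all; the column names below are my own choice:

Dataset<Row> csv = session.read()
        .option("encoding", "utf-8")                 // the test file is utf-8 encoded
        .csv("F:/test.txt")                          // spark splits each line on commas
        .toDF("name", "age", "department", "level"); // rename the default _c0.._c3 columns
csv.show();

That said, the rest of this article keeps the plain text reader so the map examples have something to do.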

Now that we have the raw data, we use the one-to-one map operation to convert each row into an Employee object. We deliberately avoid lambda expressions here so the types are clearer.

We implement the call method of the MapFunction interface: each time a row arrives, we split it and convert it into an object.

  1. Note that, unlike a back-end web application with unified exception handling, a big data job, and a streaming job in particular, needs to catch exceptions inside every operator in order to stay online 7x24. You never know how well the upstream data was cleaned; a single dirty record that throws during processing is enough to bring the whole application down if it is not caught and handled.

  2. Spark operators come in two kinds: transformations and actions. Transformations are lazy: they build up a DAG and only turn one dataset (rdd/df) into another; nothing runs until an action operator is encountered. The operators demonstrated today are all transformations.

Typical action operators are show, collect, save and so on: show the results locally, for example, or save them to a database or HDFS once the computation is done.

  3. A spark job is split into a driver and executors when it runs. That is not the focus of this article, so I will not expand on it; just remember that the driver distributes the code to the executors on the nodes of the cluster and does not take part in the computation itself. Roughly speaking, code outside an operator, such as part a in the sample below, runs on the driver, while code inside an operator, part b, runs on executors on different servers. So be very careful when a variable defined outside an operator is used inside one!! Do not take it for granted that everything written in one main method runs in the same JVM (see the sketch after this list).

This involves serialization, and because driver and executors live in different JVMs, comparing objects with "==" can go wrong!!

This is the mindset that has to change when moving from back-end web development to big data development.

In short: for back-end web services, we implement the distribution ourselves; in big data, the framework provides distribution out of the box.
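To make point 3 concrete, here is a small illustration of my own (not part of the original example), reusing the reader dataset from above. The local variable is defined on the driver, serialized, and shipped to every executor, which is why content should be compared with equals() rather than == across that boundary:

String keyword = "研发部"; // defined outside the operator: lives on the driver
Dataset<Row> rdRows = reader.filter(new FilterFunction<Row>() {
    @Override
    public boolean call(Row row) throws Exception {
        // inside the operator: runs on an executor against a deserialized copy of 'keyword',
        // so compare content with equals()/contains(), never with ==
        return row.mkString().contains(keyword);
    }
});
rdRows.show();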

MapFunction

// a: outside the operator, runs on the driver
Dataset<Employee> employeeDataset = reader.map(new MapFunction<Row, Employee>() {
    @Override
    public Employee call(Row row) throws Exception {
        // b: inside the operator, runs on the executors
        Employee employee = null;
        try {
            // gson.fromJson(); using gson here would involve serialization issues
            List<String> list = Arrays.stream(row.mkString().split(",")).collect(Collectors.toList());
            employee = new Employee(list.get(0), Integer.parseInt(list.get(1)), list.get(2), list.get(3));
        } catch (Exception exception) {
            // log it
            // a streaming job must run 7x24; any single dirty upstream record could fail the task
            // and kill the job, so catch exceptions here
            exception.printStackTrace();
        }
        return employee;
    }
}, Encoders.bean(Employee.class));

employeeDataset.show();

output

+---+----------+--------+----+
|age|department|   level|name|
+---+----------+--------+----+
| 20|    研发部|普通员工|张三|
| 31|    研发部|普通员工|李四|
| 36|    财务部|普通员工|李丽|
| 38|    研发部|    经理|张伟|
| 25|    人事部|普通员工|杜航|
| 28|    研发部|普通员工|周歌|
+---+----------+--------+----+

MapPartitionsFunction

What is the difference between map and mapPartitions in spark?

map processes records one at a time; mapPartitions processes one whole partition at a time.

Is the latter necessarily more efficient than the former?

Not necessarily; it depends on the workload, for example on whether there is expensive per-batch setup to amortize (a sketch of that case follows after the code below).

Here we apply the same logic as in the map example. Notice that the call method now receives an Iterator, i.e. a whole batch of data.

We take the batch, map each element to an object one-to-one, and return the batch as an Iterator.

Dataset<Employee> employeeDataset2 = reader.mapPartitions(new MapPartitionsFunction<Row, Employee>() {
    @Override
    public Iterator<Employee> call(Iterator<Row> iterator) throws Exception {
        List<Employee> employeeList = new ArrayList<>();
        while (iterator.hasNext()){
            Row row = iterator.next();
            try {
                List<String> list = Arrays.stream(row.mkString().split(",")).collect(Collectors.toList());
                Employee employee = new Employee(list.get(0), Integer.parseInt(list.get(1)), list.get(2), list.get(3));
                employeeList.add(employee);
            } catch (Exception exception) {
                // log it
                // a streaming job must run 7x24; any single dirty upstream record could fail the task
                // and kill the job, so catch exceptions here
                exception.printStackTrace();
            }
        }
        return employeeList.iterator();
    }
}, Encoders.bean(Employee.class));

employeeDataset2.show();

The output is the same as for map, so it is not repeated here.
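A typical case where mapPartitions pays off is expensive per-batch setup (opening a database connection, building a heavy client, and so on), because anything created inside call() exists once per partition rather than once per record. The sketch below is my own and skips the exception handling shown above for brevity; it simply counts how many rows each partition handled:

Dataset<Employee> employeeDataset3 = reader.mapPartitions(new MapPartitionsFunction<Row, Employee>() {
    @Override
    public Iterator<Employee> call(Iterator<Row> iterator) throws Exception {
        // anything set up here happens once per partition, not once per record
        int handled = 0;
        List<Employee> result = new ArrayList<>();
        while (iterator.hasNext()) {
            Row row = iterator.next();
            List<String> list = Arrays.stream(row.mkString().split(",")).collect(Collectors.toList());
            result.add(new Employee(list.get(0), Integer.parseInt(list.get(1)), list.get(2), list.get(3)));
            handled++;
        }
        System.out.println("this partition handled " + handled + " rows"); // printed on the executor
        return result.iterator();
    }
}, Encoders.bean(Employee.class));
employeeDataset3.show();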

flatMap operators

What is the difference between map and flatMap?

map is one-to-one, while flatMap is one-to-many; in java streams, flatMap is described as flattening.

This kind of thinking is consistent in java, spark, and flink.

java stream flatMap

The following code maps each raw record to 2 objects and returns them.

List<Employee> employeeList2 = list.stream().flatMap(word -> {
    List<String> words = Arrays.stream(word.split(",")).collect(Collectors.toList());
    List<Employee> lists = new ArrayList<>();
    Employee employee = new Employee(words.get(0), Integer.parseInt(words.get(1)), words.get(2), words.get(3));
    lists.add(employee);
    Employee employee2 = new Employee(words.get(0)+"_2", Integer.parseInt(words.get(1)), words.get(2), words.get(3));
    lists.add(employee2);
    return lists.stream();
}).collect(Collectors.toList());
employeeList2.forEach(System.out::println);

output

JavaStreamDemo.Employee(name=张三, age=20, department=研发部, level=普通员工)
JavaStreamDemo.Employee(name=张三_2, age=20, department=研发部, level=普通员工)
JavaStreamDemo.Employee(name=李四, age=31, department=研发部, level=普通员工)
JavaStreamDemo.Employee(name=李四_2, age=31, department=研发部, level=普通员工)
JavaStreamDemo.Employee(name=李丽, age=36, department=财务部, level=普通员工)
JavaStreamDemo.Employee(name=李丽_2, age=36, department=财务部, level=普通员工)
JavaStreamDemo.Employee(name=张伟, age=38, department=研发部, level=经理)
JavaStreamDemo.Employee(name=张伟_2, age=38, department=研发部, level=经理)
JavaStreamDemo.Employee(name=杜航, age=25, department=人事部, level=普通员工)
JavaStreamDemo.Employee(name=杜航_2, age=25, department=人事部, level=普通员工)
JavaStreamDemo.Employee(name=周歌, age=28, department=研发部, level=普通员工)
JavaStreamDemo.Employee(name=周歌_2, age=28, department=研发部, level=普通员工)

spark flatMap

Here we implement the call method of FlatMapFunction: it receives one record at a time, and because the return value is an Iterator, it can emit multiple records.

Dataset<Employee> employeeDatasetFlatmap = reader.flatMap(new FlatMapFunction<Row, Employee>() {
    @Override
    public Iterator<Employee> call(Row row) throws Exception {
        List<Employee> employeeList = new ArrayList<>();
        try {
            List<String> list = Arrays.stream(row.mkString().split(",")).collect(Collectors.toList());
            Employee employee = new Employee(list.get(0), Integer.parseInt(list.get(1)), list.get(2), list.get(3));
            employeeList.add(employee);

            Employee employee2 = new Employee(list.get(0)+"_2", Integer.parseInt(list.get(1)), list.get(2), list.get(3));
            employeeList.add(employee2);
        } catch (Exception exception) {
            exception.printStackTrace();
        }
        return employeeList.iterator();
    }
}, Encoders.bean(Employee.class));
employeeDatasetFlatmap.show();

output

+---+----------+--------+------+
|age|department|   level|  name|
+---+----------+--------+------+
| 20|    研发部|普通员工|  张三|
| 20|    研发部|普通员工|张三_2|
| 31|    研发部|普通员工|  李四|
| 31|    研发部|普通员工|李四_2|
| 36|    财务部|普通员工|  李丽|
| 36|    财务部|普通员工|李丽_2|
| 38|    研发部|    经理|  张伟|
| 38|    研发部|    经理|张伟_2|
| 25|    人事部|普通员工|  杜航|
| 25|    人事部|普通员工|杜航_2|
| 28|    研发部|普通员工|  周歌|
| 28|    研发部|普通员工|周歌_2|
+---+----------+--------+------+

groupBy operators

Just like in SQL, both java streams and spark can group a dataset with groupBy and then apply aggregate functions on top of the groups; it is also possible to just group and obtain the sub-dataset for each key.

java stream groupBy

Group statistics by department:

Map<String, Long> map = employeeList.stream().collect(Collectors.groupingBy(Employee::getDepartment, Collectors.counting()));
System.out.println(map);

output

{财务部=1, 人事部=1, 研发部=4}
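For comparison with the spark average below, the java stream version of the same per-department average age (my own addition, reusing employeeList from earlier) would be:

Map<String, Double> avgAgeByDept = employeeList.stream()
        .collect(Collectors.groupingBy(Employee::getDepartment,
                Collectors.averagingInt(Employee::getAge)));
System.out.println(avgAgeByDept); // e.g. {财务部=36.0, 人事部=25.0, 研发部=29.25}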

spark groupBy

Group the dataset of mapped objects by department, then compute the head count and the average age of each department.

RelationalGroupedDataset datasetGroupBy = employeeDataset.groupBy("department");
// how many employees each department has
datasetGroupBy.count().show();
// the average age of each department
datasetGroupBy.avg("age").withColumnRenamed("avg(age)","avgAge").show();

The outputs are

+----------+-----+
|department|count|
+----------+-----+
|    财务部|    1|
|    人事部|    1|
|    研发部|    4|
+----------+-----+
+----------+------+
|department|avgAge|
+----------+------+
|    财务部|  36.0|
|    人事部|  25.0|
|    研发部| 29.25|
+----------+------+
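If you want both aggregates in a single pass, RelationalGroupedDataset.agg can combine them. A minimal sketch of mine, assuming a static import of org.apache.spark.sql.functions.* for count, lit and avg:

datasetGroupBy
        .agg(count(lit(1)).alias("headcount"), avg("age").alias("avgAge"))
        .show();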

spark groupByKey

The difference between spark groupBy and groupByKey: the former applies an aggregate function on top of the grouping to produce an aggregated value, while the latter only groups the data without computing anything.

Similar to java stream:

Map<String, List<Employee>> map2 = employeeList.stream().collect(Collectors.groupingBy(Employee::getDepartment));
System.out.println(map2);

output

{财务部=[JavaStreamDemo.Employee(name=李丽, age=36, department=财务部, level=普通员工)], 
人事部=[JavaStreamDemo.Employee(name=杜航, age=25, department=人事部, level=普通员工)], 
研发部=[JavaStreamDemo.Employee(name=张三, age=20, department=研发部, level=普通员工), JavaStreamDemo.Employee(name=李四, age=31, department=研发部, level=普通员工), JavaStreamDemo.Employee(name=张伟, age=38, department=研发部, level=经理), JavaStreamDemo.Employee(name=周歌, age=28, department=研发部, level=普通员工)]}

Use spark groupByKey.

First we get a key-value (one-to-many) grouped dataset. The call() method here returns the grouping key.

KeyValueGroupedDataset keyValueGroupedDataset = employeeDataset.groupByKey(new MapFunction<Employee, String>() {
    @Override
    public String call(Employee employee) throws Exception {
        // return the grouping key; here we group by department
        return employee.getDepartment();
    }
}, Encoders.STRING());

Then mapGroups is applied on top of keyValueGroupedDataset; in its call() method we receive the key together with all the original records belonging to that key.

keyValueGroupedDataset.mapGroups(new MapGroupsFunction() {
    @Override
    public Object call(Object key, Iterator iterator) throws Exception {
        System.out.println("key = " + key);
        while (iterator.hasNext()) {
            System.out.println(iterator.next());
        }
        return iterator;
    }
}, Encoders.bean(Iterator.class))
        .show(); // show() itself is meaningless here; it only triggers the computation

output

key = 人事部
SparkDemo.Employee(name=杜航, age=25, department=人事部, level=普通员工)
key = 研发部
SparkDemo.Employee(name=张三, age=20, department=研发部, level=普通员工)
SparkDemo.Employee(name=李四, age=31, department=研发部, level=普通员工)
SparkDemo.Employee(name=张伟, age=38, department=研发部, level=经理)
SparkDemo.Employee(name=周歌, age=28, department=研发部, level=普通员工)
key = 财务部
SparkDemo.Employee(name=李丽, age=36, department=财务部, level=普通员工)
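As a side note of mine, if the grouped dataset is declared with its generic types, mapGroups can return a typed result instead of the raw Iterator trick above. A minimal sketch that builds a "department:headcount" string per group (the explicit casts disambiguate the java overloads):

KeyValueGroupedDataset<String, Employee> grouped = employeeDataset.groupByKey(
        (MapFunction<Employee, String>) Employee::getDepartment, Encoders.STRING());

Dataset<String> deptCounts = grouped.mapGroups(
        (MapGroupsFunction<String, Employee, String>) (dept, employees) -> {
            long count = 0;
            while (employees.hasNext()) { employees.next(); count++; }
            return dept + ":" + count; // e.g. 研发部:4
        }, Encoders.STRING());
deptCounts.show();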

reduce operators

reduce literally means to decrease, cut down, or shrink; it is also called reduction.

It walks through the dataset, combining the current element with the previous result pairwise; the result of each step becomes the "previous" value for the next step, until a single object remains.

For example, given the five values [1, 2, 3, 4, 5], summing them with reduce proceeds step by step: 1+2=3, 3+3=6, 6+4=10, and finally 10+5=15.
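A quick illustration of that chain with plain numbers (my own, not from the test data):

int sum = java.util.stream.IntStream.rangeClosed(1, 5).reduce(0, Integer::sum);
System.out.println(sum); // ((((0+1)+2)+3)+4)+5 = 15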

For example, with the test data above, suppose I want the total age of all employees. Using the stream's built-in aggregate function, the result is a plain int.

java stream reduce

int age = employeeList.stream().mapToInt(e -> e.age).sum();
System.out.println(age);//178

The above calculation can also be performed using reduce

int age1 = employeeList.stream().mapToInt(e -> e.getAge()).reduce(0,(a,b) -> a+b);
System.out.println(age1);// 178

But what if I want to sum the ages and still get a complete Employee object back?

JavaStreamDemo.Employee(name=周歌, age=178, department=研发部, level=普通员工)

You can use reduce to loop the data set in pairs, add the ages, and return the last traversed object.

The pre of the following code represents the previous object, and the current represents the current object.

/**
 * pre: the previous object
 * current: the current object
 */
Employee reduceEmployee = employeeList.stream().reduce(new Employee(), (pre, current) -> {
    // on the first iteration the previous object's age is still null
    if (pre.getAge() == null) {
        current.setAge(current.getAge());
    } else {
        current.setAge(pre.getAge() + current.getAge());
    }
    return current;
});
System.out.println(reduceEmployee);

spark reduce

The basic idea of spark reduce is the same as in java streams.

Look directly at the code:

Employee datasetReduce = employeeDataset.reduce(new ReduceFunction<Employee>() {
    @Override
    public Employee call(Employee t1, Employee t2) throws Exception {
        // depending on the version, see whether you need to check t1 == null
        t2.setAge(t1.getAge() + t2.getAge());
        return t2;
    }
});

System.out.println(datasetReduce);

output

SparkDemo.Employee(name=周歌, age=178, department=研发部, level=普通员工)

Other common operators

Employee employee = employeeDataset.filter("age > 30").limit(3).sort("age").first();
System.out.println(employee);
// SparkDemo.Employee(name=李四, age=31, department=研发部, level=普通员工)

The dataset can also be registered as a table, and then you can use SQL, which is far more expressive, for all kinds of operations. SQL is now a first-class citizen in flink, and spark is no different. Here is a very simple example.

employeeDataset.registerTempTable("table");
session.sql("select * from table where age > 30 order by age desc limit 3").show();

output

+---+----------+--------+----+
|age|department|   level|name|
+---+----------+--------+----+
| 38|    研发部|    经理|张伟|
| 36|    财务部|普通员工|李丽|
| 31|    研发部|普通员工|李四|
+---+----------+--------+----+
session.sql("select " +
        "  concat_ws(',', collect_set(name)) as names, " + // similar to group_concat in MySQL
        "  avg(age) as age, " +
        "  department " +
        "from table " +
        "where age > 30 " +
        "group by department " +
        "order by age desc " +
        "limit 3").show();

output

+---------+----+----------+
|    names| age|department|
+---------+----+----------+
|     李丽|36.0|    财务部|
|张伟,李四|34.5|    研发部|
+---------+----+----------+
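One small caveat from me: registerTempTable is deprecated in newer spark versions; from spark 2.0 on, the equivalent call is createOrReplaceTempView, used the same way:

employeeDataset.createOrReplaceTempView("employee_view");
session.sql("select count(*) as cnt from employee_view where age > 30").show();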

summary

This article introduces some common operator operations in spark based on the similarity of java stream.

This article is only a very brief introduction.

If you are interested, back-end developers can try it themselves very easily: there is no environment to set up locally, just add the spark maven dependency.

I pasted all the code of this article at the end.

Java stream source code:


import lombok.*;
import org.apache.commons.io.FileUtils;

import java.io.File;
import java.io.IOException;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class JavaStreamDemo {
    public static void main(String[] args) throws IOException {
        /**
         * 张三,20,研发部,普通员工
         * 李四,31,研发部,普通员工
         * 李丽,36,财务部,普通员工
         * 张伟,38,研发部,经理
         * 杜航,25,人事部,普通员工
         * 周歌,28,研发部,普通员工
         */
        List<String> list = FileUtils.readLines(new File("f:/test.txt"), "utf-8");
        List<Employee> employeeList = list.stream().map(word -> {
            List<String> words = Arrays.stream(word.split(",")).collect(Collectors.toList());
            Employee employee = new Employee(words.get(0), Integer.parseInt(words.get(1)), words.get(2), words.get(3));
            return employee;
        }).collect(Collectors.toList());

        // employeeList.forEach(System.out::println);

        List<Employee> employeeList2 = list.stream().flatMap(word -> {
            List<String> words = Arrays.stream(word.split(",")).collect(Collectors.toList());
            List<Employee> lists = new ArrayList<>();
            Employee employee = new Employee(words.get(0), Integer.parseInt(words.get(1)), words.get(2), words.get(3));
            lists.add(employee);
            Employee employee2 = new Employee(words.get(0)+"_2", Integer.parseInt(words.get(1)), words.get(2), words.get(3));
            lists.add(employee2);
            return lists.stream();
        }).collect(Collectors.toList());
        // employeeList2.forEach(System.out::println);

        Map<String, Long> map = employeeList.stream().collect(Collectors.groupingBy(Employee::getDepartment, Collectors.counting()));
        System.out.println(map);
        Map<String, List<Employee>> map2 = employeeList.stream().collect(Collectors.groupingBy(Employee::getDepartment));
        System.out.println(map2);

        int age = employeeList.stream().mapToInt(e -> e.age).sum();
        System.out.println(age);// 178

        int age1 = employeeList.stream().mapToInt(e -> e.getAge()).reduce(0,(a,b) -> a+b);
        System.out.println(age1);// 178

        /**
         * pre: the previous object
         * current: the current object
         */
        Employee reduceEmployee = employeeList.stream().reduce(new Employee(), (pre,current) -> {
            if (pre.getAge() == null) {
                current.setAge(current.getAge());
            } else {
                current.setAge(pre.getAge() + current.getAge());
            }
            return current;
        });
        System.out.println(reduceEmployee);

    }

    @Getter
    @Setter
    @AllArgsConstructor
    @NoArgsConstructor
    @ToString
    static
    class Employee implements Serializable {
        private String name;
        private Integer age;
        private String department;
        private String level;
    }
}

Spark source code:

 


import com.google.gson.Gson;
import lombok.*;
import org.apache.spark.api.java.function.*;
import org.apache.spark.sql.*;

import java.io.Serializable;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.stream.Collectors;

public class SparkDemo {
    public static void main(String[] args) {
        SparkSession session = SparkSession.builder().master("local[*]").getOrCreate();
        Dataset<Row> reader = session.read().text("F:/test.txt");
        // reader.show();
        /**
         * +-----------------------+
         * |                  value|
         * +-----------------------+
         * |张三,20,研发部,普通员工|
         * |李四,31,研发部,普通员工|
         * |李丽,36,财务部,普通员工|
         * |张伟,38,研发部,经理|
         * |杜航,25,人事部,普通员工|
         * |周歌,28,研发部,普通员工|
         * +-----------------------+
         */

        // local demo only; in a real distributed environment, using gson here involves serialization issues
        // code outside any operator runs on the driver
        // code inside any operator runs on the executors, i.e. on different server nodes
        Gson gson = new Gson();
        // a: outside the operator, runs on the driver
        Dataset<Employee> employeeDataset = reader.map(new MapFunction<Row, Employee>() {
            @Override
            public Employee call(Row row) throws Exception {
                // b: inside the operator, runs on the executors
                Employee employee = null;
                try {
                    // gson.fromJson(); using gson here would involve serialization issues
                    List<String> list = Arrays.stream(row.mkString().split(",")).collect(Collectors.toList());
                    employee = new Employee(list.get(0), Integer.parseInt(list.get(1)), list.get(2), list.get(3));
                } catch (Exception exception) {
                    // log it
                    // a streaming job must run 7x24; any single dirty upstream record could fail the task and kill the job, so catch exceptions here
                    exception.printStackTrace();
                }
                return employee;
            }
        }, Encoders.bean(Employee.class));

        // employeeDataset.show();
        /**
         * +---+----------+--------+----+
         * |age|department|   level|name|
         * +---+----------+--------+----+
         * | 20|    研发部|普通员工|张三|
         * | 31|    研发部|普通员工|李四|
         * | 36|    财务部|普通员工|李丽|
         * | 38|    研发部|    经理|张伟|
         * | 25|    人事部|普通员工|杜航|
         * | 28|    研发部|普通员工|周歌|
         */

        Dataset<Employee> employeeDataset2 = reader.mapPartitions(new MapPartitionsFunction<Row, Employee>() {
            @Override
            public Iterator<Employee> call(Iterator<Row> iterator) throws Exception {
                List<Employee> employeeList = new ArrayList<>();
                while (iterator.hasNext()){
                    Row row = iterator.next();
                    try {
                        List<String> list = Arrays.stream(row.mkString().split(",")).collect(Collectors.toList());
                        Employee employee = new Employee(list.get(0), Integer.parseInt(list.get(1)), list.get(2), list.get(3));
                        employeeList.add(employee);
                    } catch (Exception exception) {
                        // log it
                        // a streaming job must run 7x24; any single dirty upstream record could fail the task and kill the job, so catch exceptions here
                        exception.printStackTrace();
                    }
                }
                return employeeList.iterator();
            }
        }, Encoders.bean(Employee.class));

        // employeeDataset2.show();
        /**
         * +---+----------+--------+----+
         * |age|department|   level|name|
         * +---+----------+--------+----+
         * | 20|    研发部|普通员工|张三|
         * | 31|    研发部|普通员工|李四|
         * | 36|    财务部|普通员工|李丽|
         * | 38|    研发部|    经理|张伟|
         * | 25|    人事部|普通员工|杜航|
         * | 28|    研发部|普通员工|周歌|
         * +---+----------+--------+----+
         */

        Dataset<Employee> employeeDatasetFlatmap = reader.flatMap(new FlatMapFunction<Row, Employee>() {
            @Override
            public Iterator<Employee> call(Row row) throws Exception {
                List<Employee> employeeList = new ArrayList<>();
                try {
                    List<String> list = Arrays.stream(row.mkString().split(",")).collect(Collectors.toList());
                    Employee employee = new Employee(list.get(0), Integer.parseInt(list.get(1)), list.get(2), list.get(3));
                    employeeList.add(employee);

                    Employee employee2 = new Employee(list.get(0)+"_2", Integer.parseInt(list.get(1)), list.get(2), list.get(3));
                    employeeList.add(employee2);
                } catch (Exception exception) {
                    exception.printStackTrace();
                }
                return employeeList.iterator();
            }
        }, Encoders.bean(Employee.class));
//        employeeDatasetFlatmap.show();
        /**
         * +---+----------+--------+------+
         * |age|department|   level|  name|
         * +---+----------+--------+------+
         * | 20|    研发部|普通员工|  张三|
         * | 20|    研发部|普通员工|张三_2|
         * | 31|    研发部|普通员工|  李四|
         * | 31|    研发部|普通员工|李四_2|
         * | 36|    财务部|普通员工|  李丽|
         * | 36|    财务部|普通员工|李丽_2|
         * | 38|    研发部|    经理|  张伟|
         * | 38|    研发部|    经理|张伟_2|
         * | 25|    人事部|普通员工|  杜航|
         * | 25|    人事部|普通员工|杜航_2|
         * | 28|    研发部|普通员工|  周歌|
         * | 28|    研发部|普通员工|周歌_2|
         * +---+----------+--------+------+
         */

        RelationalGroupedDataset datasetGroupBy = employeeDataset.groupBy("department");
        // how many employees each department has
        // datasetGroupBy.count().show();
        /**
         * +----------+-----+
         * |department|count|
         * +----------+-----+
         * |    财务部|    1|
         * |    人事部|    1|
         * |    研发部|    4|
         * +----------+-----+
         */
        /**
         * the average age of each department
         */
        // datasetGroupBy.avg("age").withColumnRenamed("avg(age)","avgAge").show();
        /**
         * +----------+--------+
         * |department|avg(age)|
         * +----------+--------+
         * |    财务部|    36.0|
         * |    人事部|    25.0|
         * |    研发部|   29.25|
         * +----------+--------+
         */

        KeyValueGroupedDataset keyValueGroupedDataset = employeeDataset.groupByKey(new MapFunction<Employee, String>() {
            @Override
            public String call(Employee employee) throws Exception {
                // return the grouping key; here we group by department
                return employee.getDepartment();
            }
        }, Encoders.STRING());

        keyValueGroupedDataset.mapGroups(new MapGroupsFunction() {
            @Override
            public Object call(Object key, Iterator iterator) throws Exception {
                System.out.println("key = " + key);
                while (iterator.hasNext()){
                    System.out.println(iterator.next());
                }
                return iterator;
                /**
                 * key = 人事部
                 * SparkDemo.Employee(name=杜航, age=25, department=人事部, level=普通员工)
                 * key = 研发部
                 * SparkDemo.Employee(name=张三, age=20, department=研发部, level=普通员工)
                 * SparkDemo.Employee(name=李四, age=31, department=研发部, level=普通员工)
                 * SparkDemo.Employee(name=张伟, age=38, department=研发部, level=经理)
                 * SparkDemo.Employee(name=周歌, age=28, department=研发部, level=普通员工)
                 * key = 财务部
                 * SparkDemo.Employee(name=李丽, age=36, department=财务部, level=普通员工)
                 */
            }
        }, Encoders.bean(Iterator.class))
                .show(); // show() itself is meaningless here; it only triggers the computation


        Employee datasetReduce = employeeDataset.reduce(new ReduceFunction<Employee>() {
            @Override
            public Employee call(Employee t1, Employee t2) throws Exception {
                // depending on the version, see whether you need to check t1 == null
                t2.setAge(t1.getAge() + t2.getAge());
                return t2;
            }
        });

        System.out.println(datasetReduce);


        Employee employee = employeeDataset.filter("age > 30").limit(3).sort("age").first();
        System.out.println(employee);
        // SparkDemo.Employee(name=李四, age=31, department=研发部, level=普通员工)

        employeeDataset.registerTempTable("table");
        session.sql("select * from table where age > 30 order by age desc limit 3").show();

        /**
         * +---+----------+--------+----+
         * |age|department|   level|name|
         * +---+----------+--------+----+
         * | 38|    研发部|    经理|张伟|
         * | 36|    财务部|普通员工|李丽|
         * | 31|    研发部|普通员工|李四|
         * +---+----------+--------+----+
         */


    }

    @Getter
    @Setter
    @AllArgsConstructor
    @NoArgsConstructor
    @ToString
    public static class Employee implements Serializable {
        private String name;
        private Integer age;
        private String department;
        private String level;
    }
}

The spark maven dependencies (the unnecessary spark-streaming / kafka dependency has been commented out and can be removed entirely).


<properties>
        <maven.compiler.source>1.8</maven.compiler.source>
        <maven.compiler.target>1.8</maven.compiler.target>
        <scala.version>2.12.15</scala.version>
        <spark.version>3.2.0</spark.version>
        <encoding>UTF-8</encoding>
    </properties>
    <dependencies>
        <!-- scala dependency -->
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>${scala.version}</version>
        </dependency>
        <!-- spark dependency -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.12</artifactId>
            <version>${spark.version}</version>
        </dependency>

        <dependency>
            <groupId>org.projectlombok</groupId>
            <artifactId>lombok</artifactId>
            <version>1.18.2</version>
            <scope>provided</scope>
        </dependency>

        <!--<dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming-kafka-0-10_2.12</artifactId>
            <version>${spark.version}</version>
        </dependency>-->

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql-kafka-0-10_2.12</artifactId>
            <version>${spark.version}</version>
        </dependency>

        <dependency>
            <groupId>com.google.code.gson</groupId>
            <artifactId>gson</artifactId>
            <version>2.7</version>
        </dependency>

        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>5.1.34</version>
        </dependency>

    </dependencies>

Origin: blog.csdn.net/2301_77463738/article/details/131385499