Converting between POJOs and Datasets in Spark

0x0 Dataset to POJO

Method:

  1. Run the query and convert the resulting Dataset<Row> to an RDD
  2. Call createDataFrame on the RDD, passing in the target schema
  3. Call the as method to convert the Dataset<Row> to a Dataset of the corresponding POJO
  4. Call collectAsList() to materialize the result

The code is as follows:

1. Table structure

+--------+---------+-------+
|col_name|data_type|comment|
+--------+---------+-------+
|      id|   string|   null|
|    name|   string|   null|
|   class|   string|   null|
+--------+---------+-------+

2. POJO type

public class Student {
    String id;
    String name;
    String major;
    ...
}
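The `...` in the class body stands for the usual accessors. `Encoders.bean` relies on JavaBean conventions, so the class needs a public no-arg constructor plus a getter/setter pair for every field; a minimal sketch of what the full bean is assumed to look like:

```java
// Sketch of the complete bean (assumed layout, not shown in the original):
// Encoders.bean follows JavaBean conventions, so a public no-arg constructor
// and getter/setter pairs are required for the encoder to map columns to fields.
public class Student {
    private String id;
    private String name;
    private String major;

    public Student() {}

    public String getId() { return id; }
    public void setId(String id) { this.id = id; }
    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
    public String getMajor() { return major; }
    public void setMajor(String major) { this.major = major; }
}
```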

3. Conversion code

SparkSession spark = CloudUtils.getSparkSession();

// Query the raw data
Dataset<Row> student = spark.sql("select * from `event`.`student`");

// Build the schema; field names are applied to the rows positionally,
// so the table's `class` column is exposed as `major` to match the POJO
List<StructField> fields = new ArrayList<>();
fields.add(DataTypes.createStructField("id", DataTypes.StringType, true));
fields.add(DataTypes.createStructField("name", DataTypes.StringType, true));
fields.add(DataTypes.createStructField("major", DataTypes.StringType, true));
StructType schema = DataTypes.createStructType(fields);

// Convert the query result to a POJO list
List<Student> students = spark.createDataFrame(student.toJavaRDD(), schema)
        .as(Encoders.bean(Student.class))
        .collectAsList();
System.out.println(students);

Note:
The timestamp type in a Dataset is not compatible with java.util.Date, but it is compatible with java.sql.Timestamp.
To work around this, convert the Dataset to JSON first and then parse the JSON into POJOs:

// Query the data and collect it as a list of JSON strings
List<String> jsonList = spark.sql("select * from `event`.`user`")
        .toJSON()
        .collectAsList();
// Parse the JSON into POJOs (FastJSON is used here)
List<User> users = jsonList.stream()
        .map(jsonString -> JSON.parseObject(jsonString, User.class))
        .collect(Collectors.toList());
System.out.println(users);

0x1 POJO to Dataset

1. Table structure

+---------+---------+-------+
|col_name |data_type|comment|
+---------+---------+-------+
| user_id |   string|   null|
|user_name|   string|   null|
|user_age |   int   |   null|
+---------+---------+-------+

2. POJO type

public class User{
    String userId;
    String userName;
    Integer userAge;
    ...
}

3. Conversion code

// Get the list of users
List<User> users = createUsers();
// Convert to a Dataset with createDataFrame
Dataset<Row> ds = spark.createDataFrame(users, User.class);
// Rename the camelCase columns to snake_case to match the table
// (camelToUnderline is a user-supplied helper)
String[] columns = ds.columns();
String[] newColumns = Arrays.stream(columns)
        .map(column -> camelToUnderline(column))
        .toArray(String[]::new);
// toDF returns a new Dataset with the renamed columns; keep the reference
Dataset<Row> renamed = ds.toDF(newColumns);
renamed.show();
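The `camelToUnderline` helper referenced above is left to the reader in the original; a minimal sketch (the method name and regex-based approach are assumptions, not part of any Spark API):

```java
public class ColumnNames {
    // Convert a camelCase column name like "userName" to snake_case "user_name"
    // by inserting an underscore before each upper-case letter that follows a
    // lower-case letter or digit, then lower-casing the whole string.
    static String camelToUnderline(String name) {
        return name.replaceAll("([a-z0-9])([A-Z])", "$1_$2").toLowerCase();
    }

    public static void main(String[] args) {
        System.out.println(camelToUnderline("userName")); // prints "user_name"
        System.out.println(camelToUnderline("userAge"));  // prints "user_age"
    }
}
```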

Also note:
For types that createDataFrame cannot convert directly, the JSON transition works here as well:

// Create the user list
List<User> users = createUsers();
// Serialize each user to a JSON string
List<String> jsonList = users.stream()
        .map(JSON::toJSONString)
        .collect(Collectors.toList());
// Wrap the JSON strings in a Dataset<String>
Dataset<String> jsonDataset = spark.createDataset(jsonList, Encoders.STRING());
// Let Spark infer the schema and produce a Dataset<Row>
Dataset<Row> ds = spark.read().json(jsonDataset.toJavaRDD());
ds.show();

Output (FastJSON serializes Date fields as epoch milliseconds by default, hence the long values in the birthday column):

+------------+---+----+
|    birthday| id|name|
+------------+---+----+
|689875200000|  1| AAA|
|689875200000|  2| BBB|
+------------+---+----+
