Reading MySQL Data with Spark

1. What is ELT

Most data engineers are familiar with ETL: Extract, Transform, and Load. As computing platforms have grown more capable, data engineers increasingly work in ELT order instead: Extract, Load, and then Transform. The advantage is that data transformation can be delegated to a powerful computing platform, while the data synchronization tool only needs to focus on extracting and loading data, which is simpler and improves developer efficiency.

2. Why choose Spark

a) As business data volumes grow, many traditional ETL tools that run on a single machine can no longer keep up, and a data synchronization system with large-scale processing capability has become an indispensable part of a big data analysis system. Since Spark runs on distributed platforms and has good support for accessing a variety of databases, it is a good choice for building data synchronization tools;

b) Spark DataFrame provides a rich operation API and supports running SQL directly on DataFrames, so simple data transformations can already be performed during the EL process;

c) A Spark program is simple to deploy: just submit the code with the spark-submit command.

2.1. Spark ETL without the T

This walkthrough does not cover the Transform step; it practices only the E and L operations, with the goal of using Spark proficiently to synchronize data between heterogeneous data sources.

 

2.3. Reading MySQL with Java Spark

 


import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class MysqlReader {

    // Connection settings; in practice these come from configuration
    private String url;       // e.g. a MySQL JDBC URL
    private String tableName;
    private String userName;
    private String passwd;

    public Dataset<Row> reader(SparkSession sparkSession) {
        // Read the whole table through a single JDBC connection
        Dataset<Row> rowDataset = sparkSession.read()
                .format("jdbc")
                .option("url", url)
                .option("dbtable", tableName)
                .option("user", userName)
                .option("password", passwd)
                .option("driver", "com.mysql.cj.jdbc.Driver")
                .load();
        return rowDataset;
    }
}
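The `url` option above is expected to be a standard MySQL JDBC URL. As a minimal sketch (the host, port, database name, and query parameters below are illustrative assumptions, not values from this post), it can be assembled like this:

```java
public class JdbcUrlBuilder {
    // Assemble a MySQL JDBC URL. The query parameters are common
    // choices for Connector/J, not requirements of Spark itself.
    public static String mysqlUrl(String host, int port, String db) {
        return "jdbc:mysql://" + host + ":" + port + "/" + db
                + "?useUnicode=true&characterEncoding=utf8&useSSL=false";
    }

    public static void main(String[] args) {
        System.out.println(mysqlUrl("localhost", 3306, "test_db"));
    }
}
```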

The above code has a drawback: when the table is large, the whole table is read at once through a single session, which risks an OOM error. The read can therefore be rewritten as follows:

Dataset<Row> rowDataset = sparkSession.read()
                .format("jdbc")
                .option("url", url)
                .option("dbtable", tableName)
                .option("user", userName)
                .option("password", passwd)
                .option("driver", "com.mysql.cj.jdbc.Driver")
                .option("partitionColumn", columnName)
                .option("lowerBound", 1)
                .option("upperBound", 1000)
                .option("fetchsize", 1000)
                .option("numPartitions", 3)
                .load();

Looking at the official documentation, you can see that the partitionColumn option must appear together with numPartitions, lowerBound, and upperBound. Among them:

partitionColumn: the column used to split the read into partitions. It should be of a numeric, date, or timestamp type, typically the primary key or an indexed column;

numPartitions: the number of parallel partitions the read is split into; the data ends up in that many Spark partitions;

lowerBound: all rows whose partition column value is below the lower bound (here 1) land in the first partition;

upperBound: all rows whose value is at or above the upper bound (here 1000) land in the last partition; note that lowerBound and upperBound only decide the partition stride, they do not filter out any rows;

fetchsize: the maximum number of rows fetched per round trip, which effectively throttles the rate of reading from MySQL; reading too fast can overload the MySQL server.

With lowerBound = 1, upperBound = 1000, and numPartitions = 3, the stride is roughly 1000 / 3 = 333, so the log shows 334 and 667 as the boundaries that divide the data into the three partitions.
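The boundary arithmetic can be reproduced in plain Java. This is a simplified sketch of how Spark's JDBC source turns the bounds into per-partition WHERE predicates, mirroring the idea rather than Spark's exact internal code:

```java
import java.util.ArrayList;
import java.util.List;

public class PartitionBounds {
    // Simplified sketch: split [lowerBound, upperBound) into
    // numPartitions strides and emit one WHERE predicate per partition.
    public static List<String> wherePredicates(String column,
                                               long lower, long upper, int numPartitions) {
        long stride = upper / numPartitions - lower / numPartitions; // integer division
        List<String> preds = new ArrayList<>();
        long current = lower + stride;
        for (int i = 0; i < numPartitions; i++) {
            if (i == 0) {
                // Everything below the first boundary (and NULLs) lands in partition 0
                preds.add(column + " < " + current + " OR " + column + " IS NULL");
            } else if (i == numPartitions - 1) {
                // Everything at or above the last boundary lands in the final partition
                preds.add(column + " >= " + (current - stride));
            } else {
                preds.add(column + " >= " + (current - stride)
                        + " AND " + column + " < " + current);
            }
            current += stride;
        }
        return preds;
    }

    public static void main(String[] args) {
        // lowerBound = 1, upperBound = 1000, numPartitions = 3
        // gives stride 333 and boundaries 334 and 667
        for (String p : wherePredicates("id", 1, 1000, 3)) {
            System.out.println(p);
        }
    }
}
```

Running this prints three predicates split at 334 and 667, matching the partition boundaries seen in the Spark log.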

 


Origin blog.csdn.net/Aaron_ch/article/details/112056275