Flink SQL source code walkthrough --- why the Flink JDBC where condition is not pushed down to the database

This article uses a concrete case to show how Flink SQL loads connectors, builds the source/sink operators, and connects to the database. It should help you understand the underlying mechanism, locate the SQL-generation logic in the code, and reach the conclusion that the where condition is not pushed down to the database for execution.

(For the solution, see the next article: How Flink SQL (JDBC) can support pushing the where condition down to the database.)

The case is as follows:

create table mysql_test_12 (
ID STRING,
NAME STRING,
primary key(ID) not enforced
) with (
'connector' = 'jdbc',
'url' = 'jdbc:mysql://${mysql_hosts}:${mysql_port}/sitrdw001?useSSL=false&useUnicode=true&characterEncoding=UTF-8',
'username' = '${mysql_username}',
'password' = '${mysql_pass}',
'scan.fetch-size' = '1000',
'table-name' = 'test_12'
);

create table es_test_12 (
ID STRING,
NAME STRING,
primary key(ID) not enforced
) with (
'connector' = '${es_connector}',
'hosts' = '${es_hosts}',
'username' = '${es_username}',
'password' = '${es_pass}',
'index' = 'test_12'
);

insert into es_test_12
select
   *
from mysql_test_12
where ID = '20200604'
;

This is a very simple case: the source side connects to a MySQL database, the sink side connects to ES, and the rows with ID = '20200604' are read and written into ES.

1. The overall concept

Before going into details, it is necessary to be familiar with the overall framework of flink sql.
catalog: the abstraction of the table catalog. The create table mysql_test_12 ... statement in the case above registers the table in the catalog, which mainly stores information such as the database name, table name, column names, and column data types;

DynamicTableSourceFactory: each connector has a fixed factory class, which mainly processes the configuration items (the options in the with clause, such as 'scan.fetch-size' = '1000'), performs validation and encapsulation, and finally produces the DynamicTableSource.

DynamicTableSource: the data source; this is where the executable SQL statement is built and where the concrete execution class, the ScanRuntimeProvider, is created.

ScanRuntimeProvider: the concrete class that executes the SQL; it runs the query and exposes the interfaces for reading and iterating over the data.

The sink side is similar to the source, providing the ability to write to external systems; it will not be expanded on here.
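
To make these roles concrete, below is a minimal sketch of a custom connector built on the same public interfaces. This is my own simplified example, not Flink source code: the 'demo' identifier, the hostname option, and the stubbed runtime provider are made up, and the signatures follow the Flink 1.13+ table API. Such a factory would also need to be registered in a META-INF/services/org.apache.flink.table.factories.Factory file for Flink to discover it.

    import java.util.Collections;
    import java.util.HashSet;
    import java.util.Set;
    import org.apache.flink.configuration.ConfigOption;
    import org.apache.flink.configuration.ConfigOptions;
    import org.apache.flink.table.connector.ChangelogMode;
    import org.apache.flink.table.connector.source.DynamicTableSource;
    import org.apache.flink.table.connector.source.ScanTableSource;
    import org.apache.flink.table.factories.DynamicTableSourceFactory;
    import org.apache.flink.table.factories.FactoryUtil;

    /** Factory: turns the WITH options of a CREATE TABLE into a DynamicTableSource. */
    public class DemoTableFactory implements DynamicTableSourceFactory {

        // Hypothetical option; the jdbc connector defines 'url', 'table-name', etc.
        private static final ConfigOption<String> HOSTNAME =
                ConfigOptions.key("hostname").stringType().noDefaultValue();

        // This is what gets matched against 'connector' = '...' in the DDL.
        @Override
        public String factoryIdentifier() {
            return "demo";
        }

        @Override
        public Set<ConfigOption<?>> requiredOptions() {
            Set<ConfigOption<?>> options = new HashSet<>();
            options.add(HOSTNAME);
            return options;
        }

        @Override
        public Set<ConfigOption<?>> optionalOptions() {
            return Collections.emptySet();
        }

        // Called once the factory has been matched; validates options and builds the source.
        @Override
        public DynamicTableSource createDynamicTableSource(Context context) {
            FactoryUtil.TableFactoryHelper helper =
                    FactoryUtil.createTableFactoryHelper(this, context);
            helper.validate(); // checks required/optional options, e.g. 'scan.fetch-size'
            return new DemoTableSource(helper.getOptions().get(HOSTNAME));
        }

        /** Source: decides how data is read and hands out the runtime provider. */
        public static class DemoTableSource implements ScanTableSource {
            private final String hostname;

            DemoTableSource(String hostname) {
                this.hostname = hostname;
            }

            @Override
            public ChangelogMode getChangelogMode() {
                return ChangelogMode.insertOnly();
            }

            // The jdbc connector builds its SELECT statement here and wraps it in
            // InputFormatProvider.of(...); this sketch leaves the body stubbed out.
            @Override
            public ScanRuntimeProvider getScanRuntimeProvider(ScanContext runtimeProviderContext) {
                throw new UnsupportedOperationException("stubbed out in this sketch");
            }

            @Override
            public DynamicTableSource copy() {
                return new DemoTableSource(hostname);
            }

            @Override
            public String asSummaryString() {
                return "demo source";
            }
        }
    }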

2. Create the source node
In the case, 'connector' is set to 'jdbc', so how does Flink create the JDBC source node?
CatalogSourceTable is the entry class for creating source nodes. You can see that a JdbcDynamicTableSource is created there. Stepping into the implementation shows that it mainly calls FactoryUtil.createTableSource(...). From that code, Flink first obtains a DynamicTableSourceFactory and then calls factory.createDynamicTableSource(context) to obtain the concrete source implementation.

Regarding how the factory is located, you can keep debugging for a deeper understanding if you are interested. The main logic, briefly summarized:
1. Flink loads all classes implementing the Factory interface that are registered under META-INF/services (the Java SPI mechanism);
2. It traverses each factory and calls its factoryIdentifier() method to get the identifier and match it. For example, the IDENTIFIER of JdbcDynamicTableFactory is 'jdbc', which matches the connector value in the SQL (see the sketch below);
3. Once the factory is found, its createDynamicTableSource() method is called to return the source.
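
Here is a simplified sketch of that lookup. It is my own paraphrase of the logic, not the actual FactoryUtil code; the method name findSourceFactory and the error handling are invented for illustration.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.ServiceLoader;
    import org.apache.flink.table.factories.DynamicTableSourceFactory;
    import org.apache.flink.table.factories.Factory;

    public class FactoryDiscoverySketch {

        // Simplified lookup: load every Factory registered under META-INF/services via
        // the Java SPI, then keep the source factories whose factoryIdentifier()
        // equals the 'connector' option ('jdbc' in our case).
        public static DynamicTableSourceFactory findSourceFactory(
                ClassLoader classLoader, String connector) {
            List<DynamicTableSourceFactory> matches = new ArrayList<>();
            for (Factory factory : ServiceLoader.load(Factory.class, classLoader)) {
                if (factory instanceof DynamicTableSourceFactory
                        && connector.equals(factory.factoryIdentifier())) {
                    matches.add((DynamicTableSourceFactory) factory);
                }
            }
            if (matches.size() != 1) {
                throw new IllegalStateException(
                        "expected exactly one factory for connector '" + connector + "'");
            }
            // The real code then calls matches.get(0).createDynamicTableSource(context),
            // which for 'jdbc' returns a JdbcDynamicTableSource.
            return matches.get(0);
        }
    }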

At this point, the source node is created.
3. Why the where condition is not pushed down to the database
To understand whether the where condition is pushed down, we need to look at how the SQL is built. The DynamicTableSource (whose JDBC implementation is JdbcDynamicTableSource) is responsible for constructing the SQL. The core code is as follows:

    @Override
    public ScanRuntimeProvider getScanRuntimeProvider(ScanContext runtimeProviderContext) {

        // The concrete object used to execute the SQL
        final JdbcRowDataInputFormat.Builder builder =
                JdbcRowDataInputFormat.builder()
                        .setDrivername(options.getDriverName())
                        .setDBUrl(options.getDbURL())
                        .setUsername(options.getUsername().orElse(null))
                        .setPassword(options.getPassword().orElse(null))
                        .setAutoCommit(readOptions.getAutoCommit());

        // Set the fetch size
        if (readOptions.getFetchSize() != 0) {
            builder.setFetchSize(readOptions.getFetchSize());
        }
        final JdbcDialect dialect = options.getDialect();

        // Generate the SELECT statement from the schema.
        // For our case: query = "SELECT `ID`, `NAME` FROM `test_12`"
        String query =
                dialect.getSelectFromStatement(
                        options.getTableName(), physicalSchema.getFieldNames(), new String[0]);
        // If partitioned scanning is configured, append
        // " WHERE {scan.partition.column} BETWEEN ? AND ?" to the SQL
        if (readOptions.getPartitionColumnName().isPresent()) {
            long lowerBound = readOptions.getPartitionLowerBound().get();
            long upperBound = readOptions.getPartitionUpperBound().get();
            int numPartitions = readOptions.getNumPartitions().get();
            builder.setParametersProvider(
                    new JdbcNumericBetweenParametersProvider(lowerBound, upperBound)
                            .ofBatchNum(numPartitions));
            query +=
                    " WHERE "
                            + dialect.quoteIdentifier(readOptions.getPartitionColumnName().get())
                            + " BETWEEN ? AND ?";
        }
        // Apply the limit
        if (limit >= 0) {
            query = String.format("%s %s", query, dialect.getLimitClause(limit));
        }
        builder.setQuery(query);
        final RowType rowType = (RowType) physicalSchema.toRowDataType().getLogicalType();
        builder.setRowConverter(dialect.getRowConverter(rowType));
        builder.setRowDataTypeInfo(
                runtimeProviderContext.createTypeInformation(physicalSchema.toRowDataType()));

        return InputFormatProvider.of(builder.build());
    }

From the logic above, it is not hard to see that the SQL is generated only from the schema and the scan.partition options, and where ID = '20200604' is never concatenated onto it. The where condition is instead evaluated in memory by a filter operator in the Flink job.
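
For illustration, the following plain-JDBC sketch shows roughly what reaches MySQL at runtime for our case. It is my own example with a hypothetical host and credentials, not the connector's code.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class FullScanIllustration {
        public static void main(String[] args) throws Exception {
            // Hypothetical connection details standing in for the ${...} placeholders.
            String url = "jdbc:mysql://mysql_host:3306/sitrdw001?useSSL=false";
            // The query built by getScanRuntimeProvider() has no WHERE clause...
            String query = "SELECT `ID`, `NAME` FROM `test_12`";
            try (Connection conn = DriverManager.getConnection(url, "user", "pass");
                    PreparedStatement stmt = conn.prepareStatement(query)) {
                stmt.setFetchSize(1000); // corresponds to 'scan.fetch-size' = '1000'
                try (ResultSet rs = stmt.executeQuery()) {
                    while (rs.next()) {
                        // ...so every row of test_12 is shipped back and deserialized here;
                        // the predicate ID = '20200604' is only applied afterwards by a
                        // filter/Calc operator inside the Flink job.
                    }
                }
            }
        }
    }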

The problem this can cause: even if only one row actually needs to be processed, Flink SQL will load all of test_12 into memory. With a large table, processing performance drops accordingly.
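
You can check this behavior with Flink's EXPLAIN support. The snippet below is my own sketch (it assumes the JDBC connector and a MySQL driver are on the classpath, and the host/credentials are placeholders; the DDL is trimmed to the source table). It prints the optimized plan; for the connector version discussed here the predicate shows up in a Calc/Filter node above TableSourceScan rather than inside the scanned query.

    import org.apache.flink.table.api.EnvironmentSettings;
    import org.apache.flink.table.api.TableEnvironment;

    public class ExplainPlanCheck {
        public static void main(String[] args) {
            TableEnvironment tEnv =
                    TableEnvironment.create(
                            EnvironmentSettings.newInstance().inStreamingMode().build());

            // Same source table as the case above, with placeholder connection details.
            tEnv.executeSql(
                    "create table mysql_test_12 ("
                            + " ID STRING, NAME STRING, primary key(ID) not enforced"
                            + ") with ("
                            + " 'connector' = 'jdbc',"
                            + " 'url' = 'jdbc:mysql://mysql_host:3306/sitrdw001',"
                            + " 'username' = 'user', 'password' = 'pass',"
                            + " 'table-name' = 'test_12')");

            // Prints the plan; look at where the ID = '20200604' predicate appears.
            System.out.println(
                    tEnv.explainSql("select * from mysql_test_12 where ID = '20200604'"));
        }
    }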

Origin blog.csdn.net/samur2/article/details/129366310