Flink: high-performance writes to a relational database (Oracle or MySQL)

More and more big-data developers are working on real-time computing, and Flink has become a key technology there, not only because of its stream processing but also because of how well it integrates with other systems. Beyond writing to message middleware, a Flink job sometimes also needs to write to a traditional database such as Oracle or MySQL.

When connecting to a relational database, we usually reach for a connection pool such as c3p0. In traditional business development, or when data volumes are small, that is fine; with large volumes, though, the write rate falls far short. At this point many people will say: raise the parallelism and trade resources for throughput. That does improve efficiency, but it consumes too many Flink slots, the pool size is hard to control, and too many connections put the database under heavy pressure. You also have to make sure the pool is closed properly: when the Flink task or the cluster restarts, are the pooled connections actually released? I ran into all of these problems during development, so here is my solution: optimize the SQL and use multithreading.

My Oracle scenario: query Oracle for the current record; if it exists, update it, and if it does not, insert it (an upsert).

SQL optimization

Most people who don't specialize in ETL will SELECT first and then INSERT or UPDATE. That works, but it puts heavy pressure on the database: two round trips and two statements per record. Instead I use MERGE INTO, which pushes the conditional insert-or-update logic into a single statement evaluated inside the database. My SQL appears in the code below.
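To make the pattern concrete before the full sink code, here is the general shape of an Oracle upsert with MERGE INTO. The table `stats` and its two columns are made up for this illustration; the real statement in the code below uses the actual table and key columns.

```sql
-- Hypothetical upsert: one round trip instead of SELECT followed by INSERT/UPDATE.
-- The bind variables (?) are filled in from the incoming record.
MERGE INTO stats a
USING (SELECT ? AS day_time, ? AS total_count FROM dual) b
ON (a.day_time = b.day_time)
WHEN MATCHED THEN
  UPDATE SET a.total_count = a.total_count + b.total_count
WHEN NOT MATCHED THEN
  INSERT (day_time, total_count) VALUES (b.day_time, b.total_count)
```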

Using multithreading instead of a connection pool

A connection pool avoids repeatedly creating connections, but Flink's open() method achieves the same thing, because open() is called only once per parallel sink instance.

Without further ado, the code:

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple5;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ***OracleSinkMultiThread extends RichSinkFunction<Tuple5<String, Long, Long, Double, Double>> {
    private static final Logger LOGGER = LoggerFactory.getLogger(***OracleSinkMultiThread.class);
    private List<Tuple2<Connection, PreparedStatement>> connectionList = new ArrayList<>();
    private int index = 0;
    private ExecutorService executorService;

    @Override
    public void open(Configuration parameters) throws Exception {
        super.open(parameters);
        Class.forName("oracle.jdbc.driver.OracleDriver");
        for (int i = 0; i < 10; i++) {
            Connection connection = DriverManager.getConnection("***", "***", "***");
            // MERGE INTO does the select-then-insert/update in one statement on the database side.
            PreparedStatement statement = connection.prepareStatement(
                    "merge into TABLE_NAME a using (select ? DAY_TIME, ? HOUR_TIME, ? PROV_NAME, ? INTERFACE_NAME, "
                  + "? TOTAL_COUNT, ? FAIL_COUNT, ? TOTAL_TIME, ? FAIL_TIME from dual) b "
                  + "on (a.DAY_TIME = b.DAY_TIME and a.HOUR_TIME = b.HOUR_TIME "
                  + "and a.PROV_NAME = b.PROV_NAME and a.INTERFACE_NAME = b.INTERFACE_NAME) "
                  + "when matched then update set a.TOTAL_COUNT = a.TOTAL_COUNT + b.TOTAL_COUNT, "
                  + "a.FAIL_COUNT = a.FAIL_COUNT + b.FAIL_COUNT, a.TOTAL_TIME = a.TOTAL_TIME + b.TOTAL_TIME, "
                  + "a.FAIL_TIME = a.FAIL_TIME + b.FAIL_TIME "
                  + "where DAY_TIME = ? and HOUR_TIME = ? and PROV_NAME = ? and INTERFACE_NAME = ? "
                  + "when not matched then insert values (b.DAY_TIME, b.HOUR_TIME, b.PROV_NAME, b.INTERFACE_NAME, "
                  + "b.TOTAL_COUNT, b.FAIL_COUNT, b.TOTAL_TIME, b.FAIL_TIME, ?)");
            connectionList.add(new Tuple2<>(connection, statement));
        }
        executorService = Executors.newFixedThreadPool(10);

    }

    @Override
    public void invoke(Tuple5<String, Long, Long, Double, Double> value, Context context) throws Exception {
        String[] split = value.f0.split("_");
        String[] time = split[2].split("\t");
        String day_time = time[1];
        String hour_time = time[2];

        String provinceName = split[3];
        String INTERFACE_NAME = split[4];
        // Pick the next statement round-robin. Parameter binding happens inside the
        // worker thread, synchronized on the statement, so a later invoke() cannot
        // rebind parameters while an earlier merge is still executing.
        final PreparedStatement statement = connectionList.get(index).f1;
        index = (index + 1) % connectionList.size();

        final long currentTime = new Date().getTime();
        executorService.execute(new Runnable() {
            @Override
            public void run() {
                synchronized (statement) {
                    try {
                        statement.setString(1, day_time);
                        statement.setString(2, hour_time);
                        statement.setString(3, provinceName);
                        statement.setString(4, INTERFACE_NAME);
                        statement.setString(5, String.valueOf(value.f1));
                        statement.setString(6, String.valueOf(value.f2));
                        statement.setString(7, String.valueOf(value.f3));
                        statement.setString(8, String.valueOf(value.f4));
                        statement.setString(9, day_time);
                        statement.setString(10, hour_time);
                        statement.setString(11, provinceName);
                        statement.setString(12, INTERFACE_NAME);
                        statement.setDate(13, new java.sql.Date(currentTime));
                        statement.execute();
                    } catch (SQLException e) {
                        LOGGER.error(String.format("%s -> *** !", "***"), e);
                    }
                }
            }
        });
    }

    @Override
    public void close() throws Exception {
        super.close();
        if (executorService != null) {
            executorService.shutdown();
            // Wait for in-flight merges to finish before closing the connections.
            executorService.awaitTermination(60, java.util.concurrent.TimeUnit.SECONDS);
        }
        for (Tuple2<Connection, PreparedStatement> tuple2 : connectionList) {
            if (tuple2.f0 != null) {
                tuple2.f0.close();
            }
            if (tuple2.f1 != null) {
                tuple2.f1.close();
            }
        }
    }
}

This sink extends RichSinkFunction. Initialization is fast because, unlike a traditional connection pool, open() creates the connections only once, when the task starts. The loop size controls how many connections and worker threads are used, so you can keep the job's parallelism, and therefore its slot consumption, low. And when the task shuts down, close() releases every connection cleanly.
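The core trick above, a fixed set of workers with records routed to them round-robin, can be sketched with nothing but the JDK. `RoundRobinSketch` and `distribute` are names invented for this illustration, and the JDBC call is replaced by a counter. Using one single-thread executor per "connection" means no two tasks ever touch the same slot concurrently, which is the property the real sink needs for its shared PreparedStatements.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class RoundRobinSketch {
    // Distribute `records` tasks over `workers` single-threaded executors,
    // round-robin, mimicking one PreparedStatement per worker in the sink above.
    static int[] distribute(int records, int workers) throws InterruptedException {
        AtomicInteger[] handled = new AtomicInteger[workers];
        ExecutorService[] pools = new ExecutorService[workers];
        for (int i = 0; i < workers; i++) {
            handled[i] = new AtomicInteger();
            pools[i] = Executors.newSingleThreadExecutor();
        }
        int index = 0;
        for (int r = 0; r < records; r++) {
            final AtomicInteger counter = handled[index];
            pools[index].execute(counter::incrementAndGet); // stands in for statement.execute()
            index = (index + 1) % workers;                  // round-robin, as in invoke()
        }
        int[] result = new int[workers];
        for (int i = 0; i < workers; i++) {
            pools[i].shutdown();
            pools[i].awaitTermination(10, TimeUnit.SECONDS); // drain before reading counts
            result[i] = handled[i].get();
        }
        return result;
    }

    public static void main(String[] args) throws InterruptedException {
        int[] counts = distribute(100, 4);
        for (int i = 0; i < counts.length; i++) {
            System.out.println("connection " + i + " handled " + counts[i] + " records");
        }
    }
}
```

With 100 records over 4 workers, each counter ends at 25: the load spreads evenly without any shared mutable state between workers.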


Origin: blog.csdn.net/qq_44962429/article/details/105976314