7. Flink: Asynchronous IO

1. Flink Asynchronous IO Overview

1.1 The need for asynchronous IO

Async I/O is a much-requested feature that Alibaba contributed to the community, introduced in Flink 1.2. Its main purpose is to keep communication latency with external systems (such as waiting for a response from a database) from becoming a bottleneck in the dataflow. For real-time processing, when a job needs to use data from external storage, the interaction has to be handled carefully: the processing delay of each individual external request must not be allowed to dominate the overall progress of the job.
For example, when an operator such as a MapFunction accesses external storage, the interaction is synchronous: a request is sent to the database, and the MapFunction waits for the response. In many cases this waiting takes up most of the function's time. With asynchronous interaction, a single function instance can have many requests in flight at once and receive the responses concurrently. The waiting time can then be overlapped with sending other requests and receiving other responses, so it is amortized over many requests. In most use cases this leads to significantly higher throughput.
Figure 1.1: Flink asynchronous IO (synchronous vs. asynchronous database access)

Note: throughput can also be improved by scaling the MapFunction to a much higher degree of parallelism, but that means higher resource overhead: more MapFunction instances mean more tasks, more threads, more Flink-internal network connections, more database connections, more buffers, and more internal state to checkpoint.
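To make the contrast concrete, here is a minimal sketch of the synchronous pattern described above. DatabaseClient, its blocking queryBlocking method, and host, port and credentials are hypothetical placeholders (the same ones used by the official template in section 2.2); each map call blocks until the database responds, so one parallel instance handles only one request at a time:

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;

class SyncDatabaseRequest extends RichMapFunction<String, Tuple2<String, String>> {

    /** Hypothetical blocking client, mirroring the async template in section 2.2 */
    private transient DatabaseClient client;

    @Override
    public void open(Configuration parameters) throws Exception {
        client = new DatabaseClient(host, port, credentials);
    }

    @Override
    public Tuple2<String, String> map(String key) throws Exception {
        // the operator thread waits here; no other element is processed meanwhile
        String dbResult = client.queryBlocking(key);
        return new Tuple2<>(key, dbResult);
    }
}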

1.2 Prerequisites for using asynchronous IO

To use asynchronous IO with Flink, the database being accessed needs a client that supports asynchronous requests. Fortunately, many popular databases provide such clients. If no asynchronous client exists, you can instead create multiple synchronous clients and run them in a thread pool, using the pool to emulate asynchronous behavior. Of course, this approach is usually less efficient than a genuinely asynchronous client.
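As a sketch of that workaround (assuming a hypothetical blocking SyncClient with a query method), the blocking calls can be handed to a private thread pool inside a RichAsyncFunction, so the operator thread itself never blocks:

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

import java.util.Collections;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class ThreadPoolAsyncFunction extends RichAsyncFunction<String, String> {

    private transient SyncClient client;        // hypothetical synchronous client
    private transient ExecutorService executor; // pool that emulates an async client

    @Override
    public void open(Configuration parameters) throws Exception {
        client = new SyncClient();
        // the pool size bounds how many blocking requests run concurrently
        executor = Executors.newFixedThreadPool(10);
    }

    @Override
    public void asyncInvoke(String key, ResultFuture<String> resultFuture) {
        // submit the blocking call to the pool so asyncInvoke returns immediately
        executor.submit(() -> {
            try {
                resultFuture.complete(Collections.singleton(client.query(key)));
            } catch (Exception e) {
                resultFuture.completeExceptionally(e);
            }
        });
    }

    @Override
    public void close() throws Exception {
        if (executor != null) executor.shutdown();
        if (client != null) client.close();
    }
}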

2. Using Asynchronous IO in Flink

2.1 How to use asynchronous IO

Flink's asynchronous IO API lets users apply asynchronous request clients within a data stream. The API itself handles the integration with the data stream, the order of messages, event time, and fault tolerance.
Assuming an asynchronous client exists for the target database, using asynchronous IO takes three steps:
1. Implement AsyncFunction or RichAsyncFunction; this function dispatches the requests asynchronously.
2. Provide a callback that retrieves the result of the operation and hands it to the ResultFuture.
3. Apply the asynchronous IO operation to a DataStream.

Take a look at the source of the AsyncFunction interface:

public interface AsyncFunction<IN, OUT> extends Function, Serializable {
    void asyncInvoke(IN var1, ResultFuture<OUT> var2) throws Exception;

    default void timeout(IN input, ResultFuture<OUT> resultFuture) throws Exception {
        resultFuture.completeExceptionally(new TimeoutException("Async function call has timed out."));
    }
}

Two methods matter here: asyncInvoke must be implemented, and timeout can optionally be overridden:

 void asyncInvoke(IN var1, ResultFuture<OUT> var2):
 this is where the actual logic of the external operation is implemented; var1 is the input element, and var2 is the collection through which the results are returned

 default void timeout(IN input, ResultFuture<OUT> resultFuture):
 this method is called when an asynchronous request times out; its parameters serve the same purpose as above

RichAsyncFunction is a rich variant that extends AbstractRichFunction and implements AsyncFunction, so it additionally provides the open and close methods. We generally use open to create the client connection to external storage (for example a JDBC connection to MySQL) and close to shut the client down; asyncInvoke and timeout are used exactly as described above and will not be repeated here. In practice, RichAsyncFunction is what we usually extend.

2.2 The official website's asynchronous IO template

class AsyncDatabaseRequest extends RichAsyncFunction<String, Tuple2<String, String>> {

    /** The database specific client that can issue concurrent requests with callbacks */
    private transient DatabaseClient client;

    @Override
    public void open(Configuration parameters) throws Exception {
client = new DatabaseClient(host, port, credentials);
    }

    @Override
    public void close() throws Exception {
        client.close();
    }

    @Override
    public void asyncInvoke(String key, final ResultFuture<Tuple2<String, String>> resultFuture) throws Exception {

        // issue the asynchronous request, receive a future for result
        final Future<String> result = client.query(key);

        // set the callback to be executed once the request by the client is complete
        // the callback simply forwards the result to the result future
        CompletableFuture.supplyAsync(new Supplier<String>() {

            @Override
            public String get() {
                try {
                    return result.get();
                } catch (InterruptedException | ExecutionException e) {
                    // Normally handled explicitly.
                    return null;
                }
            }
        }).thenAccept( (String dbResult) -> {
            resultFuture.complete(Collections.singleton(new Tuple2<>(key, dbResult)));
        });
    }
}

// create the original stream
DataStream<String> stream = ...;

// apply the async IO operation to the data stream
DataStream<Tuple2<String, String>> resultStream =
    AsyncDataStream.unorderedWait(stream, new AsyncDatabaseRequest(), 1000, TimeUnit.MILLISECONDS, 100);

Note that the queried data must in the end be placed into the resultFuture, i.e. handed back to the framework via resultFuture.complete. The ResultFuture is completed with the first call of ResultFuture.complete; all subsequent complete calls are ignored.

2.3 Points to note when using asynchronous IO

2.3.1 Parameters of AsyncDataStream.unorderedWait()

There are five parameters: in, asyncObject, timeout, timeUnit, and capacity.

in: the input data stream

asyncObject: the instance of the asynchronous IO operation class (the AsyncFunction implementation)

timeout:
the time after which an asynchronous request is considered failed; any request exceeding it counts as failed. This parameter mainly serves to weed out dead or failed requests.

timeUnit: the unit of the timeout, e.g. TimeUnit.MILLISECONDS for milliseconds

capacity:
this parameter defines the maximum number of asynchronous requests being processed at the same time. Even though asynchronous IO brings higher throughput, the operation can still be a bottleneck for a real-time application. Limiting the number of concurrent requests means the operator will not accumulate an ever-growing backlog of pending requests, but once the capacity is exhausted, backpressure is triggered.

2.3.2 Timeout Handling

When an asynchronous IO request times out, an exception is thrown by default and the job is restarted. If you want to handle timeouts yourself, you can override the AsyncFunction.timeout method.
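For example, here is a minimal sketch of such an override, matching the AsyncDatabaseRequest types from section 2.2; the fallback value "timeout-default" is purely illustrative. It emits a default record instead of failing the job:

    @Override
    public void timeout(String input, ResultFuture<Tuple2<String, String>> resultFuture) throws Exception {
        // emit a placeholder result instead of the default TimeoutException (illustrative fallback)
        resultFuture.complete(Collections.singleton(new Tuple2<>(input, "timeout-default")));
    }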

2.3.3 Order of results

The order in which the concurrent requests issued by an AsyncFunction complete is unpredictable. To control the order in which results are emitted, Flink offers two modes:

1). Unordered
Result records are emitted as soon as their asynchronous request finishes. The order of records in the stream after the asynchronous IO operation is generally different from before: results are emitted in completion order, not in the order the requests were issued. With processing time as the basic time characteristic, this mode has the lowest latency and the lowest overhead. It is requested with AsyncDataStream.unorderedWait(...).

2). Ordered
In this mode the order of the stream is preserved: result records are emitted in the same order in which their asynchronous requests were triggered, which is the order of the events in the original stream. To achieve this, the operator buffers result records until all earlier records have been emitted. This usually introduces extra latency and some checkpointing overhead, because compared with unordered mode, result records are held in the operator's checkpointed state for a longer time. It is requested with AsyncDataStream.orderedWait(...).

2.3.4 Order and watermarks under event time

When event time is used, the asynchronous IO operation handles watermarks correctly. For the two order modes this means the following:

1). Unordered
Watermarks do not overtake records, and records do not overtake watermarks, meaning watermarks establish order boundaries. Records are emitted unordered only between two consecutive watermarks. A record arriving after a given watermark is emitted only after that watermark has been emitted, and a watermark is emitted only after all records preceding it have been emitted. This means that in the presence of watermarks, unordered mode introduces some of the same latency and management overhead as ordered mode; the size of that overhead depends on the watermark frequency. In short: results are ordered across watermarks but unordered between any two watermarks.

2). Ordered
The order of watermarks is preserved, just like the order of records. Compared with processing time, the overhead does not change significantly. Remember that ingestion time is a special case of event time in which watermarks are generated automatically from the sources' processing time.

2.3.5 Fault Tolerance

The asynchronous IO operation provides an exactly-once fault tolerance guarantee: in-flight asynchronous requests are stored in checkpoints and are restored and re-triggered when recovering from a checkpoint.

2.4 Using asynchronous IO to query data from MySQL

1. Maven pom dependencies

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>SparkDemo</groupId>
    <artifactId>SparkDemoTest</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <spark.version>2.1.0</spark.version>
        <scala.version>2.11.8</scala.version>
        <hadoop.version>2.7.3</hadoop.version>
        <scala.binary.version>2.11</scala.binary.version>
        <flink.version>1.6.1</flink.version>

    </properties>

    <dependencies>
        <dependency>
            <groupId>org.apache.logging.log4j</groupId>
            <artifactId>log4j-core</artifactId>
            <version>2.9.0</version>
        </dependency>

        <!--spark and es depend on different netty versions by default (3.x vs 4.1.32),
        which makes some methods incompatible; use the newer version directly here, otherwise errors occur-->
        <dependency>
            <groupId>io.netty</groupId>
            <artifactId>netty-all</artifactId>
            <version>4.1.32.Final</version>
        </dependency>

        <!--flink-->
        <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-java -->
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-java</artifactId>
            <version>1.6.1</version>
        </dependency>

        <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-streaming-java -->
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-streaming-java_2.11</artifactId>
            <version>1.6.1</version>
            <!--<scope>provided</scope>-->
        </dependency>

        <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-streaming-scala -->
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-streaming-scala_2.11</artifactId>
            <version>1.6.1</version>
        </dependency>

        <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-scala -->
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-scala_2.11</artifactId>
            <version>1.6.1</version>
        </dependency>

        <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-clients -->
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-clients_2.11</artifactId>
            <version>1.6.1</version>
        </dependency>

        <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-table -->
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-table_2.11</artifactId>
            <version>1.6.1</version>
            <scope>provided</scope>
        </dependency>

        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>${hadoop.version}</version>
        </dependency>

        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>fastjson</artifactId>
            <version>1.2.22</version>
        </dependency>

        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-connector-kafka-0.10_${scala.binary.version}</artifactId>
            <version>${flink.version}</version>
        </dependency>

        <!--mysql asynchronous client-->
        <!-- https://mvnrepository.com/artifact/io.vertx/vertx-core -->
        <dependency>
            <groupId>io.vertx</groupId>
            <artifactId>vertx-core</artifactId>
            <version>3.7.0</version>
        </dependency>

        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>8.0.12</version>
        </dependency>

        <dependency>
            <groupId>io.vertx</groupId>
            <artifactId>vertx-jdbc-client</artifactId>
            <version>3.7.0</version>
        </dependency>

        <!-- https://mvnrepository.com/artifact/io.vertx/vertx-web -->
        <dependency>
            <groupId>io.vertx</groupId>
            <artifactId>vertx-web</artifactId>
            <version>3.7.0</version>
        </dependency>

        <!-- https://mvnrepository.com/artifact/com.github.ben-manes.caffeine/caffeine -->
        <dependency>
            <groupId>com.github.ben-manes.caffeine</groupId>
            <artifactId>caffeine</artifactId>
            <version>2.6.2</version>
        </dependency>

    </dependencies>

    <!--the following plugin is required for maven to compile scala; without it the scala code is silently ignored-->
    <build>
        <plugins>

            <plugin>
                <groupId>org.scala-tools</groupId>
                <artifactId>maven-scala-plugin</artifactId>
                <version>2.15.2</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>compile</goal>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>

            <plugin>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.6.0</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>

            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-surefire-plugin</artifactId>
                <version>2.19</version>
                <configuration>
                    <skip>true</skip>
                </configuration>
            </plugin>

        </plugins>
    </build>
</project>

2. The source code
The target MySQL table has the following format:

id    name
1     king
2     tao
3     ming

We need to look up the id by the name.

Code:

package flinktest;

import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import io.vertx.core.Vertx;
import io.vertx.core.VertxOptions;
import io.vertx.core.json.JsonObject;
import io.vertx.ext.jdbc.JDBCClient;
import io.vertx.ext.sql.ResultSet;
import io.vertx.ext.sql.SQLClient;
import io.vertx.ext.sql.SQLConnection;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.AsyncDataStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.TimeUnit;

/**
 * flink async IO demo: interacting with MySQL through async IO.
 * The ordinary jdbc client does not support asynchronous access, so we
 * use vertx's async jdbc client here (async IO requires a client that
 * supports asynchronous operations).
 *
 * Goal: for each element of the source, query the matching row from
 * MySQL via async IO, then print it.
 */
public class AsyncToMysql {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        List<String> sourceList = new ArrayList<>();
        //build the source elements, later used as the values of the sql where clause
        sourceList.add("king");
        sourceList.add("tao");
        DataStreamSource<String> source = env.fromCollection(sourceList);

        //apply the async IO handler
        DataStream<JsonObject> result = AsyncDataStream.unorderedWait(
                source,
                new MysqlAsyncFunc(),
                10, //don't set the timeout too short when running locally in the IDE, since local latency is relatively high
                TimeUnit.SECONDS,
                20).setParallelism(1);
        result.print();
        try {
            env.execute("TEST async");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    /**
     * Custom async IO handler, extending the RichAsyncFunction class
     */
    private static class MysqlAsyncFunc extends RichAsyncFunction<String, JsonObject> {
        private transient SQLClient mysqlClient;
        private Cache<String, String> cache;

        @Override
        public void open(Configuration parameters) throws Exception {
            super.open(parameters);
            //build a cache for the mysql queries, using the high-performance Caffeine library
            cache = Caffeine
                    .newBuilder()
                    .maximumSize(1025)
                    .expireAfterAccess(10, TimeUnit.MINUTES) //cache expiry time
                    .build();

            //build the mysql jdbc connection
            JsonObject mysqlClientConfig = new JsonObject();
            //jdbc connection parameters
            mysqlClientConfig.put("url", "jdbc:mysql://192.168.50.121:3306/test?useSSL=false&serverTimezone=UTC&useUnicode=true")
                    .put("driver_class", "com.mysql.cj.jdbc.Driver")
                    .put("max_pool_size", 20)
                    .put("user", "root")
                    .put("password", "xxxxx");

            //vertx runtime options, such as the pool sizes
            VertxOptions vo = new VertxOptions();
            vo.setEventLoopPoolSize(10);
            vo.setWorkerPoolSize(20);

            Vertx vertx = Vertx.vertx(vo);
            mysqlClient = JDBCClient.createNonShared(vertx, mysqlClientConfig);
            if (mysqlClient != null) {
                System.out.println("连接mysql成功!!!");
            }
        }

        //clean up the environment
        @Override
        public void close() throws Exception {
            super.close();
            //close the mysql connection and clear the cache
            if (mysqlClient != null) {
                mysqlClient.close();
            }

            if (cache != null) {
                cache.cleanUp();
            }
        }

        @Override
        public void asyncInvoke(String input, ResultFuture<JsonObject> resultFuture) throws Exception {
            System.out.println("key is:" + input);
            String key = input;

            //check the cache first and return directly on a hit
            String cacheIfPresent = cache.getIfPresent(key);
            JsonObject output = new JsonObject();
            if (cacheIfPresent != null) {
                output.put("name", key);
                output.put("id-name", cacheIfPresent);
                resultFuture.complete(Collections.singleton(output));
                return; //on a cache hit there is no need to query the database
            }

            System.out.println("开始查询");
            mysqlClient.getConnection(conn -> {
                if (conn.failed()) {
                    resultFuture.completeExceptionally(conn.cause());
                    return; //no usable connection, stop here
                }

                final SQLConnection sqlConnection = conn.result();

                //build the query statement (simple concatenation for the demo; use parameters in production)
                String querySql = "select id,name from customer where name='" + key + "'";
                System.out.println("executing sql: " + querySql);
                //run the query and collect the result
                sqlConnection.query(querySql, res -> {
                    if (res.failed()) {
                        System.out.println("query failed");
                        resultFuture.completeExceptionally(res.cause());
                        return;
                    }

                    System.out.println("query succeeded, fetching the result");
                    ResultSet result = res.result();
                    List<JsonObject> rows = result.getRows();
                    System.out.println("number of rows: " + rows.size());
                    if (rows.isEmpty()) {
                        //no matching row: complete with an empty result so the framework is not left waiting
                        resultFuture.complete(Collections.emptyList());
                        return;
                    }

                    //emit the result and update the cache
                    for (JsonObject row : rows) {
                        String name = row.getString("name");
                        String id = row.getInteger("id").toString();
                        String desc = id + "-" + name;
                        System.out.println("result: " + desc);
                        output.put("name", key);
                        output.put("id-name", desc);
                        cache.put(key, desc);
                        //only the first complete call takes effect (see section 2.2)
                        resultFuture.complete(Collections.singleton(output));
                    }
                });

                //close the connection
                sqlConnection.close(done -> {
                    if (done.failed()) {
                        throw new RuntimeException(done.cause());
                    }
                });

            });
        }
    }
}

Origin: blog.51cto.com/kinglab/2457541