Flink's DataSource Trilogy Part Two: Built-in Connectors

This article is the second in the "Flink's DataSource Trilogy" series. The previous article, "Flink's DataSource Trilogy Part One: Direct API", covered the StreamExecutionEnvironment APIs that create a DataSource directly. What we practice today is Flink's built-in connectors, i.e. the highlighted part of the figure below; these connectors are used through the addSource method of StreamExecutionEnvironment:
[figure: overview of Flink DataSource types, with the built-in connectors highlighted]
Today's practice uses Kafka as the data source: first we receive and process String messages, then we receive JSON messages and deserialize the JSON into bean instances;

Flink's DataSource Trilogy article links

  1. "One of Flink's DataSource Trilogy: Direct API"
  2. "Flink's DataSource Trilogy Part Two: Built-in Connector"
  3. "Flink's DataSource Trilogy Part Three: Customization"

Source code download

If you don't want to write code, the source code of the whole series can be downloaded from GitHub; the addresses and link information are shown in the table below (https://github.com/zq2599/blog_demos):

Name | Link | Remarks
Project homepage | https://github.com/zq2599/blog_demos | The project's homepage on GitHub
Git repository (https) | https://github.com/zq2599/blog_demos.git | Repository address of the project source code, https protocol
Git repository (ssh) | git@github.com:zq2599/blog_demos.git | Repository address of the project source code, ssh protocol

There are multiple folders in this git project; the application for this article is under the flinkdatasourcedemo folder, as shown in the red box below:
[figure: repository folder layout, with flinkdatasourcedemo highlighted]

Environment and version

The environment and versions for this practice are as follows:

  1. JDK:1.8.0_211
  2. Flink:1.9.2
  3. Maven:3.6.0
  4. Operating system: macOS Catalina 10.15.3 (MacBook Pro 13-inch, 2018)
  5. IDEA:2018.3.5 (Ultimate Edition)
  6. Kafka:2.4.0
  7. Zookeeper:3.5.5

Please make sure the above is ready before continuing with the practice;

Matching Flink and Kafka versions

  1. Flink's official documentation describes Kafka version compatibility in detail, at: https://ci.apache.org/projects/flink/flink-docs-stable/dev/connectors/kafka.html
  2. The key point is the officially described universal Kafka connector, introduced in Flink 1.7, which can be used with Kafka 1.0.0 or higher:
    [figure: official description of the universal Kafka connector]
  3. In the figure below, the red box marks the library my project depends on, and the blue box marks the class used to connect to Kafka; readers can pick the suitable library and class from the table according to their own Kafka version, as in the sketch after this list:
    [figure: official table matching Kafka versions to connector libraries and classes]
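For example, a minimal pom sketch, assuming a Kafka 0.11.x broker instead of the 1.0+ broker used in this article: per the official table you would depend on the version-specific connector and use FlinkKafkaConsumer011 instead of FlinkKafkaConsumer (the version below is illustrative, matching the Flink dependency used later):

<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-connector-kafka-0.11_2.11</artifactId>
  <version>1.10.0</version>
</dependency>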

Practice: String message processing

  1. Create a topic named test001 on kafka, for example with the following command (a verification command follows it):
./kafka-topics.sh \
--create \
--zookeeper 192.168.50.43:2181 \
--replication-factor 1 \
--partitions 2 \
--topic test001
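To confirm the topic was created, you can list the topics against the same Zookeeper address (a quick check, assuming the environment above):
./kafka-topics.sh \
--list \
--zookeeper 192.168.50.43:2181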
  2. Continue with the flinkdatasourcedemo project created in the previous article; open pom.xml and add the following dependency:
<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-connector-kafka_2.11</artifactId>
  <version>1.10.0</version>
</dependency>
  3. Add the class Kafka240String.java, which connects to the broker and performs a WordCount on the received String messages (a sketch of the Splitter it uses follows the code):
package com.bolingcavalry.connector;

import com.bolingcavalry.Splitter;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import java.util.Properties;
public class Kafka240String {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // set the parallelism
        env.setParallelism(2);

        Properties properties = new Properties();
        // broker address
        properties.setProperty("bootstrap.servers", "192.168.50.43:9092");
        // zookeeper address
        properties.setProperty("zookeeper.connect", "192.168.50.43:2181");
        // consumer group id
        properties.setProperty("group.id", "flink-connector");
        // instantiate the consumer
        FlinkKafkaConsumer<String> flinkKafkaConsumer = new FlinkKafkaConsumer<>(
                "test001",
                new SimpleStringSchema(),
                properties
        );
        // start from the latest offset, i.e. discard historical messages
        flinkKafkaConsumer.setStartFromLatest();

        // get the DataSource via addSource
        DataStream<String> dataStream = env.addSource(flinkKafkaConsumer);

        // split the strings from kafka into words and count them in 5-second windows
        dataStream
                .flatMap(new Splitter())
                .keyBy(0)
                .timeWindow(Time.seconds(5))
                .sum(1)
                .print();

        env.execute("Connector DataSource demo : kafka");
    }
}
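The Splitter used above comes from the previous article in this series; for readers who skipped it, here is a minimal sketch consistent with how it is used here, emitting a (word, 1) tuple per space-separated word (the actual class in the repository may differ in detail):

package com.bolingcavalry;

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

public class Splitter implements FlatMapFunction<String, Tuple2<String, Integer>> {
    @Override
    public void flatMap(String sentence, Collector<Tuple2<String, Integer>> out) throws Exception {
        // emit a (word, 1) tuple for every space-separated word in the message
        for (String word : sentence.split(" ")) {
            out.collect(new Tuple2<>(word, 1));
        }
    }
}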
  4. Make sure the kafka topic has been created, then run Kafka240String; consuming messages and counting words works as expected (test messages can be produced with the console producer shown after this list):
    [figure: console output of Kafka240String showing the word counts]
  5. The practice of receiving String messages from kafka is now complete; next, let's try JSON-format messages;
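For step 4, test messages can be sent with the console producer that ships with Kafka; a sketch, assuming the broker address used throughout this article (each line typed at the > prompt becomes one message):

./kafka-console-producer.sh \
--broker-list 192.168.50.43:9092 \
--topic test001
>hello world
>hello flink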

Practice: JSON message processing

  1. The JSON messages we receive next will be deserialized into bean instances, which requires a JSON library; I chose Gson;
  2. Add the gson dependency in pom.xml:
<dependency>
  <groupId>com.google.code.gson</groupId>
  <artifactId>gson</artifactId>
  <version>2.8.5</version>
</dependency>
  3. Add the class Student.java, an ordinary bean with only two fields, id and name:
package com.bolingcavalry;

public class Student {

    private int id;

    private String name;

    public int getId() {
        return id;
    }

    public void setId(int id) {
        this.id = id;
    }

    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }
}
  4. Add the class StudentSchema.java, an implementation of the DeserializationSchema interface, used when deserializing JSON into a Student instance (a quick standalone check of the Gson mapping follows the code):
package com.bolingcavalry.connector;

import com.bolingcavalry.Student;
import com.google.gson.Gson;
import org.apache.flink.api.common.serialization.DeserializationSchema;
import org.apache.flink.api.common.serialization.SerializationSchema;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import java.io.IOException;

public class StudentSchema implements DeserializationSchema<Student>, SerializationSchema<Student> {

    private static final Gson gson = new Gson();

    /**
     * Deserialization: convert a byte array into a Student instance
     * @param bytes
     * @return
     * @throws IOException
     */
    @Override
    public Student deserialize(byte[] bytes) throws IOException {
        return gson.fromJson(new String(bytes), Student.class);
    }

    @Override
    public boolean isEndOfStream(Student student) {
        return false;
    }

    /**
     * Serialization: convert a Student instance into a byte array
     * (not used when only consuming, so it returns an empty array)
     * @param student
     * @return
     */
    @Override
    public byte[] serialize(Student student) {
        return new byte[0];
    }

    @Override
    public TypeInformation<Student> getProducedType() {
        return TypeInformation.of(Student.class);
    }
}
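To sanity-check the Gson mapping outside of Flink, a throwaway snippet like the following (not part of the project; the class name and sample values are invented for illustration) should print Tom:

package com.bolingcavalry.connector;

import com.bolingcavalry.Student;
import com.google.gson.Gson;

public class StudentSchemaCheck {
    public static void main(String[] args) {
        // deserialize a sample JSON message the same way StudentSchema.deserialize does
        Student student = new Gson().fromJson("{\"id\":1,\"name\":\"Tom\"}", Student.class);
        System.out.println(student.getName());
    }
}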
  5. Add the new class Kafka240Bean.java, which connects to the broker, converts each received JSON message into a Student instance, and counts the occurrences of each name; the window is still 5 seconds:
package com.bolingcavalry.connector;

import com.bolingcavalry.Student;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import java.util.Properties;

public class Kafka240Bean {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // set the parallelism
        env.setParallelism(2);

        Properties properties = new Properties();
        // broker address
        properties.setProperty("bootstrap.servers", "192.168.50.43:9092");
        // zookeeper address
        properties.setProperty("zookeeper.connect", "192.168.50.43:2181");
        // consumer group id
        properties.setProperty("group.id", "flink-connector");
        // instantiate the consumer
        FlinkKafkaConsumer<Student> flinkKafkaConsumer = new FlinkKafkaConsumer<>(
                "test001",
                new StudentSchema(),
                properties
        );
        // start from the latest offset, i.e. discard historical messages
        flinkKafkaConsumer.setStartFromLatest();

        // get the DataSource via addSource
        DataStream<Student> dataStream = env.addSource(flinkKafkaConsumer);

        // the JSON from kafka is deserialized into Student instances; count each name in 5-second windows
        dataStream.map(new MapFunction<Student, Tuple2<String, Integer>>() {
            @Override
            public Tuple2<String, Integer> map(Student student) throws Exception {
                return new Tuple2<>(student.getName(), 1);
            }
        })
                .keyBy(0)
                .timeWindow(Time.seconds(5))
                .sum(1)
                .print();

        env.execute("Connector DataSource demo : kafka bean");
    }
}
  6. During the test, send JSON-format strings to kafka, and flink counts the occurrences of each name (sample messages follow this step):
    [figure: console output showing the count for each name]
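A sketch of such a test session, assuming the same broker and topic as before; each JSON line must carry the Student bean's id and name fields (the values here are invented for illustration):

./kafka-console-producer.sh \
--broker-list 192.168.50.43:9092 \
--topic test001
>{"id":1,"name":"Tom"}
>{"id":2,"name":"Jerry"}
>{"id":3,"name":"Tom"}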
    At this point, the practice with the built-in connector is complete; in the next article, we will implement a custom DataSource together;

Welcome to follow my WeChat official account: programmer Xinchen

Original article: blog.csdn.net/boling_cavalry/article/details/105471798