Flink 1.17 Tutorial: Source operator (Source) concept, read data from collection, read data from file, read data from Socket, read data from Kafka, read data from data generator, supported data types

Source operator (Source)

Flink can obtain data from various sources and then build a DataStream on it for transformation. In general, the input of the data is called the data source, and the operator that reads the data is called the source operator. The source is therefore the entry point of the whole processing program.

Before Flink 1.12, the old way to add a source was to call the addSource() method of the execution environment:

DataStream<String> stream = env.addSource(...);

The parameter passed to this method is a "source function", which must implement the SourceFunction interface.
Starting with Flink 1.12, the new Source architecture that unifies stream and batch processing is preferred:

DataStreamSource<String> stream = env.fromSource(...);

Flink ships with many ready-made Source implementations, and many external connectors also provide a corresponding Source, which is usually enough to cover real-world needs.

Preparation

For convenience in the exercises, WaterSensor is used as the data model here. It has three fields: id (the sensor id), ts (a timestamp), and vc (the water level reading).
The specific code is as follows:

import java.util.Objects;

public class WaterSensor {

    public String id;
    public Long ts;
    public Integer vc;

    public WaterSensor() {
    }

    public WaterSensor(String id, Long ts, Integer vc) {
        this.id = id;
        this.ts = ts;
        this.vc = vc;
    }

    public String getId() {
        return id;
    }

    public void setId(String id) {
        this.id = id;
    }

    public Long getTs() {
        return ts;
    }

    public void setTs(Long ts) {
        this.ts = ts;
    }

    public Integer getVc() {
        return vc;
    }

    public void setVc(Integer vc) {
        this.vc = vc;
    }

    @Override
    public String toString() {
        return "WaterSensor{" +
                "id='" + id + '\'' +
                ", ts=" + ts +
                ", vc=" + vc +
                '}';
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) {
            return true;
        }
        if (o == null || getClass() != o.getClass()) {
            return false;
        }
        WaterSensor that = (WaterSensor) o;
        return Objects.equals(id, that.id) &&
                Objects.equals(ts, that.ts) &&
                Objects.equals(vc, that.vc);
    }

    @Override
    public int hashCode() {
        return Objects.hash(id, ts, vc);
    }
}

It should be noted that the WaterSensor class defined here has the following characteristics:

  • The class is public
  • It has a public no-argument constructor
  • All fields are public
  • All field types are serializable

Flink treats such a class as a special POJO (Plain Old Java Object, essentially an ordinary JavaBean) data type, which makes parsing and serializing the data easier. In addition, we override the toString method so that test output is easier to read.
This custom POJO class is used frequently in the code that follows, so whenever it appears later, simply import the class defined here.

Read data from collection

The simplest way to read data is to create a Java collection directly in the code and then call the fromCollection method of the execution environment to read it. This is equivalent to temporarily storing the data in memory as a special data structure and using it as a data source; it is generally used for testing.

public static void main(String[] args) throws Exception {

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    List<Integer> data = Arrays.asList(1, 22, 3);
    DataStreamSource<Integer> ds = env.fromCollection(data);

    ds.print();

    env.execute();
}
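
Besides fromCollection, the execution environment also offers fromElements, which takes the elements directly as arguments. A minimal sketch using the WaterSensor POJO defined above (class wrapper and imports omitted, as in the other examples):

public static void main(String[] args) throws Exception {

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    // fromElements wraps the given objects into a bounded stream;
    // Flink derives the type information from the WaterSensor POJO automatically
    DataStreamSource<WaterSensor> sensors = env.fromElements(
            new WaterSensor("s1", 1L, 1),
            new WaterSensor("s2", 2L, 2),
            new WaterSensor("s3", 3L, 3));

    sensors.print();

    env.execute();
}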

Read data from file

In real applications, the data will of course not be written directly in the code. Usually we obtain it from storage media; a common case is reading log files. This is also the most common way to read bounded data.
To read files, add the file connector dependency:

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-files</artifactId>
    <version>${flink.version}</version>
</dependency>

An example is as follows:

public static void main(String[] args) throws Exception {

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    FileSource<String> fileSource = FileSource
            .forRecordStreamFormat(new TextLineInputFormat(), new Path("input/word.txt"))
            .build();

    env.fromSource(fileSource, WatermarkStrategy.noWatermarks(), "file").print();

    env.execute();
}

Notes:

  • The path parameter can be a directory or a single file; it can also point to HDFS, using a path of the form hdfs://… (a sketch follows this list);
  • The path can be a relative path or an absolute path;
  • A relative path is resolved against the system property user.dir: when running in IDEA this is the project root, while in standalone cluster mode it is the root directory of the cluster node.
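
As noted in the first point above, the same FileSource builder also accepts a directory or an HDFS path. A minimal sketch, where both paths are hypothetical and either source is then passed to env.fromSource exactly as in the example above (reading from HDFS additionally requires the Hadoop client dependencies on the classpath):

// Read every file under a local directory (relative path, resolved against user.dir)
FileSource<String> dirSource = FileSource
        .forRecordStreamFormat(new TextLineInputFormat(), new Path("input/"))
        .build();

// Read a single file from HDFS
FileSource<String> hdfsSource = FileSource
        .forRecordStreamFormat(new TextLineInputFormat(), new Path("hdfs://hadoop102:8020/flink/input/word.txt"))
        .build();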

Read data from Socket

Whether read from a collection or from a file, the data is in fact bounded. In stream-processing scenarios, data is usually unbounded.
Reading a socket text stream, which we used earlier, is such a streaming scenario. However, because of its low throughput and poor stability, this approach is generally only used for testing.

DataStream<String> stream = env.socketTextStream("localhost", 7777);
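
For completeness, a minimal runnable sketch of the socket source is shown below; it assumes a text server has been started first, for example with nc -lk 7777 on the local machine:

public static void main(String[] args) throws Exception {

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    // Each line received on the socket becomes one String element of the stream
    DataStream<String> stream = env.socketTextStream("localhost", 7777);

    stream.print();

    env.execute();
}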

Read data from Kafka

Flink officially provides the connection tool flink-connector-kafka, which gives us a ready-made Kafka source; in the new Source architecture this is the KafkaSource used below (replacing the older SourceFunction-based FlinkKafkaConsumer).
So if we want to use Kafka as the data source, we only need to add the Kafka connector dependency. Flink officially provides a universal Kafka connector that tracks the latest version of the Kafka client; currently it supports Kafka 0.10.0 and above. The dependency to import is as follows.

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-kafka</artifactId>
    <version>${flink.version}</version>
</dependency>

The code is as follows:

public class SourceKafka {

    public static void main(String[] args) throws Exception {

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        KafkaSource<String> kafkaSource = KafkaSource.<String>builder()
                .setBootstrapServers("hadoop102:9092")              // Kafka broker address
                .setTopics("topic_1")                               // topic to consume
                .setGroupId("atguigu")                              // consumer group id
                .setStartingOffsets(OffsetsInitializer.latest())    // start reading from the latest offsets
                .setValueOnlyDeserializer(new SimpleStringSchema()) // deserialize only the value, as a String
                .build();

        DataStreamSource<String> stream = env.fromSource(kafkaSource, WatermarkStrategy.noWatermarks(), "kafka-source");

        stream.print("Kafka");

        env.execute();
    }
}
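
If during testing you want to replay the topic from the beginning rather than only read new records, the builder also accepts other OffsetsInitializer strategies. A sketch under the same (example) connection settings:

KafkaSource<String> kafkaSourceFromEarliest = KafkaSource.<String>builder()
        .setBootstrapServers("hadoop102:9092")
        .setTopics("topic_1")
        .setGroupId("atguigu")
        // start from the earliest offset of each partition instead of the latest
        .setStartingOffsets(OffsetsInitializer.earliest())
        .setValueOnlyDeserializer(new SimpleStringSchema())
        .build();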

Read data from data generator

Flink has provided a built-in DataGen connector since version 1.11. It is mainly used to generate random data for testing streaming jobs and for performance testing when no real data source is available. Flink 1.17 provides a version of this connector based on the new Source API, which requires the following dependency:

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-datagen</artifactId>
    <version>${flink.version}</version>
</dependency>

The code is as follows:

public class DataGeneratorDemo {

    public static void main(String[] args) throws Exception {

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);

        DataGeneratorSource<String> dataGeneratorSource =
                new DataGeneratorSource<>(
                        // generator function: maps an auto-incrementing Long to one output record
                        new GeneratorFunction<Long, String>() {
                            @Override
                            public String map(Long value) throws Exception {
                                return "Number:" + value;
                            }
                        },
                        Long.MAX_VALUE,                    // upper bound on the number of generated records
                        RateLimiterStrategy.perSecond(10), // emit at most 10 records per second
                        Types.STRING                       // type information of the generated records
                );

        env.fromSource(dataGeneratorSource, WatermarkStrategy.noWatermarks(), "datagenerator").print();

        env.execute();
    }
}
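
The same connector can also generate POJO records. A minimal sketch (the generation logic is made up for illustration) that produces WaterSensor objects from the class defined earlier, dropped into the same main method as the demo above:

// GeneratorFunction receives an auto-incrementing Long and returns one record
DataGeneratorSource<WaterSensor> sensorSource =
        new DataGeneratorSource<>(
                (GeneratorFunction<Long, WaterSensor>) value ->
                        new WaterSensor("s" + (value % 3 + 1), value, value.intValue()),
                Long.MAX_VALUE,                     // upper bound on the number of generated records
                RateLimiterStrategy.perSecond(10),  // at most 10 records per second
                Types.POJO(WaterSensor.class)       // type information for the generated records
        );

env.fromSource(sensorSource, WatermarkStrategy.noWatermarks(), "sensor-datagen").print();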

Data types supported by Flink

1) Flink's type system

Flink uses "type information" (TypeInformation) to represent data types in a unified way. The TypeInformation class is the base class of all type descriptors in Flink. It covers the basic properties of a type and generates a dedicated serializer, deserializer, and comparator for each data type.

2) Data types supported by Flink

Flink supports the common Java and Scala data types. Internally, Flink groups the supported types into several categories, which can be seen in the Types utility class:

(1) Basic types
All Java basic types and their wrapper classes, plus Void, String, Date, BigDecimal, and BigInteger.

(2) Array type
Including basic type array (PRIMITIVE_ARRAY) and object array (OBJECT_ARRAY).

(3) Composite data types

  • Java tuple types (TUPLE): Flink's built-in tuple types, part of the Java API; at most 25 fields, i.e. Tuple0 to Tuple25; null fields are not supported.
  • Scala case classes and Scala tuples: null fields are not supported.
  • Row type (ROW): can be regarded as a tuple with an arbitrary number of fields, and null fields are supported.
  • POJO: Flink's own class pattern, similar to a Java bean.

(4) Auxiliary types
Option, Either, List, Map, etc.

(5) Generic type (GENERIC)
Flink supports all Java and Scala classes. However, a class that does not follow the POJO requirements above is treated by Flink as a generic type: Flink regards it as a black box, cannot access its internal fields, and serializes it with Kryo instead of Flink's own serializers.
Among these types, tuples and POJOs are the most flexible, because they allow complex nested types to be built. Compared with tuples, POJOs additionally allow fields to be referenced by name when defining keys, which greatly improves code readability. Therefore, in practice, the element type of a stream-processing program is usually defined as a Flink POJO type, as illustrated in the sketch below.

Flink's requirements for POJO types are as follows:

  • The class is public
  • It has a public no-argument constructor
  • All fields are public (or accessible through getters and setters)
  • All field types are serializable
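
For example (a sketch, assuming an execution environment env as in the earlier examples), a stream of the WaterSensor POJO defined above can be keyed by its id field with a simple key selector:

DataStreamSource<WaterSensor> sensors = env.fromElements(
        new WaterSensor("s1", 1L, 1),
        new WaterSensor("s1", 2L, 2),
        new WaterSensor("s2", 3L, 3));

// Key the stream by the POJO's id field via a method-reference key selector
KeyedStream<WaterSensor, String> keyedById = sensors.keyBy(WaterSensor::getId);

keyedById.print();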

3) Type hints (Type Hints)

Flink also has a type-extraction system that can analyze a function's input and return types, obtain the type information automatically, and pick the corresponding serializers and deserializers. However, because of generic type erasure in Java, in some special cases (such as lambda expressions) the automatically extracted information is not precise enough: it is as if Flink were only told that the element is made up of a "bow", a "hull", and a "stern", with no way to reconstruct what the whole "ship" looks like. In such cases, type information must be provided explicitly for the application to work correctly or to perform well.
To solve this kind of problem, the Java API provides special "type hints".
Recall the earlier word-count streaming program: after converting each String word into a (word, count) tuple, we explicitly specified the return type with returns. For the lambda passed to map, the system can only infer that the return type is Tuple2, not Tuple2<String, Long>; only by explicitly telling the system the return type can the data be parsed correctly.

.map(word -> Tuple2.of(word, 1L))
.returns(Types.TUPLE(Types.STRING, Types.LONG));

Flink also provides a dedicated TypeHint class, which captures generic type information and keeps it available at runtime so that enough type information is preserved. We can also explicitly specify the element type of the converted DataStream through the .returns() method:

returns(new TypeHint<Tuple2<Integer, SomeType>>() {})
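
Applied to the word-count example above, the TypeHint form is equivalent to the Types.TUPLE version; a sketch:

.map(word -> Tuple2.of(word, 1L))
.returns(new TypeHint<Tuple2<String, Long>>() {});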
