Types and basic usage of the Flink dataSource


Foreword:
Looking at a Flink job from the Java API level,
a job is basically made up of three kinds of components:

Name         Effect
dataSource   The data source of a Flink job.
dataStream   The data-processing pipeline. Once a dataSource goes through a transformation such as filter or flatMap, it becomes part of a dataStream (similar to Spark's RDD).
sink         The equivalent of a Spark action: the operation that actually triggers execution of the Flink job.

So this article is mainly about Flink's data sources, i.e. the dataSource side of things.

Non-parallel dataSource

A non-parallel dataSource is a dataSource whose parallelism is 1.
The degree of parallelism can be checked with:

dataSource.getParallelism();

Unbounded non-parallel dataSource

socketTextStream()

socketTextStream() reads text data from a socket. To keep things intuitive, the example below
does not use lambda expressions, which makes it slightly more verbose.

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class StreamWordCount {

    public static void main(String[] args) throws Exception {
        // 1. Create a Flink streaming execution environment
        StreamExecutionEnvironment environment = StreamExecutionEnvironment.getExecutionEnvironment();
        // 2. Use the environment to create an abstract data set (a dataStream)
        // DataStreamSource<String> dataStream = environment.socketTextStream("192.168.237.130", 8888);
        DataStreamSource<String> dataStream = environment.socketTextStream(args[0], Integer.parseInt(args[1]));
        // 3. Call methods on the dataStream: transformations (optional) and a sink (required,
        //    similar to a Spark action, it submits the job).
        //    A transformation turns one dataStream into a new dataStream.
        SingleOutputStreamOperator<String> dataStream2 = dataStream.flatMap(new FlatMapFunction<String, String>() {
            @Override
            public void flatMap(String line, Collector<String> out) throws Exception {
                // Split a line into words
                String[] words = line.split(" ");
                for (String word : words) {
                    // Emit each word
                    out.collect(word);
                }
            }
        });
        // 4. Pair each word with the number 1, returning a new dataStream
        SingleOutputStreamOperator<Tuple2<String, Integer>> dataStream3 =
                dataStream2.map(new MapFunction<String, Tuple2<String, Integer>>() {
                    @Override
                    public Tuple2<String, Integer> map(String word) throws Exception {
                        return Tuple2.of(word, 1);
                    }
                });
        // 5. Group and aggregate: keyBy the word, then sum the counts.
        //    The numbers are field indices into Tuple2<String, Integer>.
        SingleOutputStreamOperator<Tuple2<String, Integer>> dataStream4 =
                dataStream3.keyBy(0).sum(1);
        // End of transformations

        // 6. Call a sink and start the job
        dataStream4.print();
        environment.execute("StreamWordCount");
    }
}

The data set obtained through socketTextStream() can be unbounded:
as long as data keeps flowing through the corresponding socket port, Flink keeps computing on it in real time.
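As a side note, socketTextStream() also has an overload that takes an explicit line delimiter plus a retry setting for temporarily lost connections. A minimal sketch (host and port are placeholder values) that could replace the socketTextStream line in StreamWordCount above:

// Hypothetical variant of the source line: explicit "\n" delimiter, plus a retry
// setting so Flink keeps trying to reconnect for a while if the socket goes down.
DataStreamSource<String> dataStream =
        environment.socketTextStream("localhost", 8888, "\n", 3);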

Bounded non-parallel dataSource

A bounded dataSource means that once the job is created, its data set is fixed and no new data will arrive. Flink does not have to keep running and waiting for new data; once the finite data set has been processed, the program stops.

fromCollection()

As the name implies, fromCollection() reads data from a Collection; here, for simplicity, a List is used.
Note that this method and fromElements() below are generally used for testing and verification.
Why? For example, when writing a small Flink WordCount demo, we might be used to the first approach, socketTextStream(), to obtain data, but that requires typing the input manually on the client side, which is cumbersome. With fromCollection() we can define some data up front, which is more convenient.

import java.util.Arrays;

import org.apache.flink.api.common.functions.FilterFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SourceDemo1 {

    public static void main(String[] args) throws Exception {
        // Create a streaming execution environment
        StreamExecutionEnvironment environment = StreamExecutionEnvironment.getExecutionEnvironment();
        // Parallelize a client-side collection into an abstract data set.
        // This method is usually used for testing, i.e. to mock Flink's input data,
        // so there is no need to push data through a socket.
        // Note that this is a bounded data stream: once the data is processed, the job exits.
        DataStream<Integer> nums = environment.fromCollection(Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9));
        // This prints 1: the parallelism is 1, meaning only one subTask produces the data.
        System.out.println("=========" + nums.getParallelism() + "=========");
        DataStream<Integer> sum = nums.filter(new FilterFunction<Integer>() {
            @Override
            public boolean filter(Integer value) throws Exception {
                return value % 2 == 0;
            }
        });
        sum.print();
        environment.execute("SourceDemo1");
    }
}

fromElements()

fromElements() takes the elements that make up the data source directly as arguments.

import org.apache.flink.api.common.functions.FilterFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SourceDemo1 {

    public static void main(String[] args) throws Exception {
        // Create a streaming execution environment
        StreamExecutionEnvironment environment = StreamExecutionEnvironment.getExecutionEnvironment();
        // Create the abstract data set (the original Source)
        DataStream<Integer> nums = environment.fromElements(1, 2, 3, 4, 5, 6, 7, 8, 9);
        // This prints 1: the parallelism is 1, meaning only one subTask produces the data.
        System.out.println("=========" + nums.getParallelism() + "=========");

        DataStream<Integer> sum = nums.filter(new FilterFunction<Integer>() {
            @Override
            public boolean filter(Integer value) throws Exception {
                return value % 2 == 0;
            }
        });
        sum.print();
        environment.execute("SourceDemo1");
    }
}

So far, all three of the above dataSources have a parallelism of 1. Next we introduce dataSources whose parallelism is not 1.

Parallel dataSource

A parallel dataSource is a dataSource whose degree of parallelism is > 1.
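To make the distinction concrete, here is a minimal sketch (the class name ParallelismCheck is mine, and the printed parallelism depends on your machine and configuration) contrasting a parallel source, fromParallelCollection() (covered next), with a non-parallel one:

import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.NumberSequenceIterator;

public class ParallelismCheck {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment environment = StreamExecutionEnvironment.getExecutionEnvironment();

        // A parallel source: by default it takes the environment's parallelism
        // (typically the number of CPU cores) and can be adjusted explicitly.
        DataStreamSource<Long> parallelNums =
                environment.fromParallelCollection(new NumberSequenceIterator(1L, 10L), Long.class);
        parallelNums.setParallelism(4); // allowed: this source supports parallelism > 1
        System.out.println("parallel source: " + parallelNums.getParallelism());

        // A non-parallel source is fixed at parallelism 1;
        // calling setParallelism(4) on it would throw an IllegalArgumentException.
        DataStreamSource<Integer> singleNums = environment.fromElements(1, 2, 3);
        System.out.println("non-parallel source: " + singleNums.getParallelism());

        // A sink is still needed before execute()
        parallelNums.print();
        singleNums.print();
        environment.execute("ParallelismCheck");
    }
}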

fromParallelCollection()

import org.apache.flink.api.common.functions.FilterFunction;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.NumberSequenceIterator;

public class SourceDemo2 {

    public static void main(String[] args) throws Exception {
        // Create a streaming execution environment
        StreamExecutionEnvironment environment = StreamExecutionEnvironment.getExecutionEnvironment();
        // Create the abstract data set (the original Source)
        DataStreamSource<Long> nums = environment.fromParallelCollection(new NumberSequenceIterator(1, 10), Long.class);
        System.out.println("=========" + nums.getParallelism() + "=========");
        // DataStream<Integer> nums = environment.fromElements(1, 2, 3, 4, 5, 6, 7, 8, 9);
        SingleOutputStreamOperator<Long> sum = nums.filter(new FilterFunction<Long>() {
            @Override
            public boolean filter(Long value) throws Exception {
                return value % 2 == 0;
            }
        });
        sum.print();
        environment.execute("SourceDemo2");
    }
}

Note: fromParallelCollection needs the element type to be specified, here Long.class.
The printed result shows, first, the parallelism of the source (8 in my run, the default parallelism of my machine),
and then the output data.
A quick note on the output format:

  1. The first column is the 1-based index of the subTask that produced the record.
  2. The second column is the output value (in this demo, an even number).

For example, a line such as 3> 6 means that subTask 3 emitted the value 6.


generateSequence()

Just change this part to:

DataStreamSource<Long> nums = environment.generateSequence(1, 100);

That is, the data source is the integers from 1 to 100.
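For completeness, a minimal runnable sketch (the class name GenerateSequenceDemo is mine; in newer Flink versions generateSequence() is marked deprecated in favor of fromSequence(), but it behaves the same way for this purpose):

import org.apache.flink.api.common.functions.FilterFunction;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class GenerateSequenceDemo {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment environment = StreamExecutionEnvironment.getExecutionEnvironment();
        // A parallel, bounded source producing the longs 1 to 100
        DataStreamSource<Long> nums = environment.generateSequence(1, 100);
        System.out.println("=========" + nums.getParallelism() + "=========");
        // Keep only the even numbers, as in the earlier demos
        SingleOutputStreamOperator<Long> evens = nums.filter(new FilterFunction<Long>() {
            @Override
            public boolean filter(Long value) throws Exception {
                return value % 2 == 0;
            }
        });
        evens.print();
        environment.execute("GenerateSequenceDemo");
    }
}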

readTextFile()

readTextFile() obtains data from a file and is also a bounded dataSource.
The code:

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class TextFileSource {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment environment = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStreamSource<String> dataStreamSource = environment.readTextFile(args[0]);
        // Check the parallelism of the source
        int parallelism = dataStreamSource.getParallelism();
        System.out.println("-------------->:" + parallelism);
        SingleOutputStreamOperator<Tuple2<String, Integer>> words = dataStreamSource.flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
            @Override
            public void flatMap(String line, Collector<Tuple2<String, Integer>> out) throws Exception {
                String[] words = line.split(" ");
                for (String word : words) {
                    out.collect(Tuple2.of(word, 1));
                }
            }
        });
        System.out.println("===============>:" + words.getParallelism());
        SingleOutputStreamOperator<Tuple2<String, Integer>> sum = words.keyBy(0).sum(1);
        sum.print();
        environment.execute("TextFileSource");
    }
}

Run configuration:
Since the file path is read from args[0], open your IDE's run configuration and add the path of your file
to the program arguments; in my case it is a file called words.txt on the desktop.

Note: Be sure to apply this configuration before running the code, otherwise an error will be reported.
The output then shows the word count for each word in the file.
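If you prefer not to configure program arguments, for a quick local test you can also replace the readTextFile(args[0]) line in TextFileSource with a hard-coded path (the path below is only a placeholder):

// Placeholder path instead of args[0], useful for quick local testing
DataStreamSource<String> dataStreamSource =
        environment.readTextFile("/home/yourname/Desktop/words.txt");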

Summary

DataSources are divided into bounded sets and unbounded sets,
so Flink can do offline (batch) computation as well as real-time computation.

  1. If the data is a bounded set, the execution environment generally used is ExecutionEnvironment; the amount of data is fixed, and what Flink does is offline (batch) computation. (Even if the streaming StreamExecutionEnvironment is used, the program stops once the data has been processed, so it is effectively offline, i.e. batch, computation.) A minimal batch sketch is shown after this list.

  2. If the data is an unbounded set, the execution environment generally used is StreamExecutionEnvironment, and Flink does real-time stream computation.

  3. DataSources can also be divided into parallel and non-parallel dataSources, depending on which method of the environment is used.

  4. A parallel dataSource has a degree of parallelism > 1; generally speaking, multiple threads (multiple subTasks; a subTask is the real unit of execution) produce the data at the same time. A non-parallel dataSource has a parallelism of 1, and only one thread produces the data.
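As mentioned in point 1, bounded data is usually handled with ExecutionEnvironment (the DataSet API). Here is a minimal batch WordCount sketch; the class name BatchWordCount and the file path are placeholders of mine:

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

public class BatchWordCount {

    public static void main(String[] args) throws Exception {
        // Batch (offline) execution environment for bounded data
        ExecutionEnvironment environment = ExecutionEnvironment.getExecutionEnvironment();
        // Bounded source: the whole file is read once (placeholder path)
        DataSet<String> lines = environment.readTextFile("/path/to/words.txt");
        DataSet<Tuple2<String, Integer>> counts = lines
                .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
                    @Override
                    public void flatMap(String line, Collector<Tuple2<String, Integer>> out) throws Exception {
                        for (String word : line.split(" ")) {
                            out.collect(Tuple2.of(word, 1));
                        }
                    }
                })
                .groupBy(0)   // group by the word
                .sum(1);      // sum the counts
        // For the DataSet API, print() is a sink that also triggers execution,
        // so no separate execute() call is needed here.
        counts.print();
    }
}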

Source                     Non-parallel   Parallel   Usable as an unbounded set (real-time stream processing)?
socketTextStream()         Yes            -----      Yes
fromCollection()           Yes            -----      No (usually used for testing)
fromElements()             Yes            -----      No
fromParallelCollection()   -----          Yes        No
generateSequence()         -----          Yes        No
readTextFile()             -----          Yes        Yes

The data in this table is not necessarily definitive (especially the column on whether a source can be used as an unbounded set), because strictly speaking, as long as data keeps arriving, for example new data is added before your program ends, it is stream processing. A stream can run for a long time or a short time, so you cannot draw hard conclusions.

The table is only a reference.
Take readTextFile() as an example: if you read a file from HDFS while, on the other side, systems such as Spark, Hadoop, or ES keep writing new data into HDFS, then Flink is effectively doing stream processing as well. If it is used as in the demo in this article, it is an offline process. So judge each case on its own terms.
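Related to this, the streaming API also has readFile() with a FileProcessingMode. The sketch below (class name, path, and the 5-second interval are placeholder choices of mine) monitors a path and re-scans it periodically, which turns file reading into a genuinely unbounded source:

import org.apache.flink.api.java.io.TextInputFormat;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.FileProcessingMode;

public class ContinuousFileSource {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment environment = StreamExecutionEnvironment.getExecutionEnvironment();
        String path = "hdfs://namenode:9000/data/words"; // placeholder path
        // Monitor the path and re-scan it every 5 seconds; data that keeps arriving
        // keeps the job running, so this behaves like an unbounded source.
        DataStreamSource<String> lines = environment.readFile(
                new TextInputFormat(new Path(path)),
                path,
                FileProcessingMode.PROCESS_CONTINUOUSLY,
                5000);
        lines.print();
        environment.execute("ContinuousFileSource");
    }
}

Note that in this mode Flink re-processes the entire content of a file when that file is modified, so it is usually pointed at a directory where new files are added rather than at files that are appended to.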


Origin blog.csdn.net/Zong_0915/article/details/107734403