Flink series: implementing an incremental file WordCount in Java and deploying the job to YARN

Stateful Computations over Data Streams - Apache Flink®

Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Flink runs in all common cluster environments and performs computations at any scale and at in-memory speed.

Next, we will introduce the important aspects of Flink's architecture.

Processing unbounded and bounded data

Any type of data can form a stream of events. Credit card transactions, sensor measurements, machine logs, and records of user interactions on websites or mobile applications all form data streams.

Data can be processed as an unbounded or a bounded stream.

  • Unbounded streams have a defined start but no defined end. They produce data endlessly, so they must be processed continuously: events need to be handled promptly after ingestion, because it is impossible to wait for all of the input to arrive - the input is infinite and will never be complete. Processing unbounded data usually requires that events be ingested in a specific order, such as the order in which they occurred, so that the completeness of the results can be reasoned about.

  • Bounded streams have a defined start and also a defined end. A bounded stream can be processed by ingesting all of its data before performing any computation. Ordered ingestion is not required, because a bounded data set can always be sorted. Processing of bounded streams is also commonly called batch processing.

Apache Flink excels at processing both unbounded and bounded data sets. Precise control of time and state enables Flink's runtime to run any kind of application on unbounded streams. Bounded streams are processed internally by algorithms and data structures that are specifically designed for fixed-size data sets, yielding excellent performance.
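As a plain-Java analogy (this is ordinary JDK code, not Flink's API, and the class and method names are just for illustration), a bounded source can be fully materialized and even sorted before any processing, while an unbounded source can only be consumed incrementally, element by element:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class BoundedVsUnbounded {

    // Bounded: the full data set is available, so it can be sorted before processing
    static List<Integer> sortedBounded() {
        return Stream.of(3, 1, 2)
                .sorted()
                .collect(Collectors.toList());
    }

    // Unbounded: Stream.iterate is infinite; we can never wait for "all" elements,
    // we can only take them as they arrive (here: the first n)
    static List<Integer> firstN(int n) {
        return Stream.iterate(0, i -> i + 1)
                .limit(n)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(sortedBounded()); // [1, 2, 3]
        System.out.println(firstN(3));       // [0, 1, 2]
    }
}
```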

Preparing the Environment

Package version
Apache Hadoop 3.2.1
IntelliJ IDEA 2018.3
CentOS 7
JDK 1.8
Apache Flink 1.10.0

Our goal

Using Java and Flink, implement a program that counts the words in a file; when we write new content to the file and save it, the program automatically counts the newly added content incrementally.

We will then submit the job to run on YARN.

FileWindowWordCount

Adding the dependencies

Gradle

//    The following four dependencies are required for developing Flink applications in Java
    // https://mvnrepository.com/artifact/org.apache.flink/flink-core
    compile group: 'org.apache.flink', name: 'flink-core', version: '1.10.0'
    // https://mvnrepository.com/artifact/org.apache.flink/flink-java
    compile group: 'org.apache.flink', name: 'flink-java', version: '1.10.0'
    // https://mvnrepository.com/artifact/org.apache.flink/flink-streaming-java
    compile group: 'org.apache.flink', name: 'flink-streaming-java_2.12', version: '1.10.0'
    // https://mvnrepository.com/artifact/org.apache.flink/flink-clients
    compile group: 'org.apache.flink', name: 'flink-clients_2.12', version: '1.10.0'

Code

package cn.flink;


import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.java.io.TextInputFormat;
import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.FileProcessingMode;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

/**
 * Incrementally reads the contents of a file and counts the words in it.
 * Adapted from the SocketStreamWordCount example on the Apache Flink website.
 */
public class LocalFileWindowWordCount {

    public static void main(String[] args) throws Exception {

        // The file to process is specified via the --input parameter.
        // ParameterTool.get() returns null for a missing parameter, so check for null
        // instead of relying on an exception being thrown.
        final ParameterTool params = ParameterTool.fromArgs(args);
        final String input = params.get("input");
        if (input == null) {
            System.err.println("No input file specified. Please run 'LocalFileWindowWordCount --input <input>'");
            return;
        }

        // get the execution environment
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Path path = new Path(input);
        TextInputFormat textInputFormat = new TextInputFormat(path);

        // get the input data by continuously monitoring the file, re-scanning it every 2000 ms
        DataStream<String> text = env.readFile(textInputFormat, input, FileProcessingMode.PROCESS_CONTINUOUSLY, 2000);

        // parse the data, group it, window it, and aggregate the counts
        DataStream<WordWithCount> windowCounts = text
                .flatMap(new FlatMapFunction<String, WordWithCount>() {
                    @Override
                    public void flatMap(String value, Collector<WordWithCount> out) {
                        for (String word : value.split("\\s")) {
                            out.collect(new WordWithCount(word, 1L));
                        }
                    }
                })
                .keyBy("word")
                .timeWindow(Time.seconds(1), Time.seconds(1))
                .reduce(new ReduceFunction<WordWithCount>() {
                    @Override
                    public WordWithCount reduce(WordWithCount a, WordWithCount b) {
                        return new WordWithCount(a.word, a.count + b.count);
                    }
                });

        // print the results with two parallel tasks (matching the -p 2 used when submitting to YARN)
        windowCounts.print().setParallelism(2);

        env.execute("Local File Window WordCount");
    }

    // Data type for words with count
    public static class WordWithCount {

        public String word;
        public long count;

        public WordWithCount() {}

        public WordWithCount(String word, long count) {
            this.word = word;
            this.count = count;
        }

        @Override
        public String toString() {
            return word + " : " + count;
        }
    }
}
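The flatMap (tokenize) and reduce (sum per key) steps above can be sanity-checked in plain Java, without Flink: for one window's worth of input, the pipeline is equivalent to splitting each line on whitespace and summing a counter per word. The class and method names below are just for illustration:

```java
import java.util.HashMap;
import java.util.Map;

public class WordCountSketch {

    // Equivalent of flatMap (split on whitespace, emit (word, 1))
    // followed by reduce (sum the counts per word) for a single window
    static Map<String, Long> countWords(String line) {
        Map<String, Long> counts = new HashMap<>();
        for (String word : line.split("\\s")) {
            counts.merge(word, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Long> counts = countWords("to be or not to be");
        System.out.println(counts.get("to"));  // 2
        System.out.println(counts.get("not")); // 1
    }
}
```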

Running it in the IDE to see the effect

First, create a new empty text file. Mine is D:\tmp\input.txt.

Run the project, then open input.txt in Notepad, write a single line, and save the file. The word counts appear in the console output.

OK, the results look correct. The job can now be uploaded to the server and run there.

bin/flink run -m yarn-cluster -p 2 -yjm 700m -ytm 1024m -c cn.flink.LocalFileWindowWordCount ~/ApacheFlink-0.0.3-alpha.jar --input /root/wordcount/input.txt

Here -m yarn-cluster submits the job to YARN, -p 2 sets the parallelism to 2, -yjm and -ytm set the JobManager and TaskManager memory, and -c specifies the entry class inside the jar.


Origin blog.csdn.net/wangxudongx/article/details/104627763