Apache Flink Quick Start Example

In this article, we will start from scratch and show you how to build your first Apache Flink (hereinafter "Flink") application.

Prepare the development environment

Flink can run on Linux, Mac OS X, or Windows. To develop Flink applications on a local machine, you need a Java 8.x and Maven environment.
If you have a Java 8 environment, running the following command prints version information like this:

$ java -version
java version "1.8.0_65"
Java(TM) SE Runtime Environment (build 1.8.0_65-b17)
Java HotSpot(TM) 64-Bit Server VM (build 25.65-b01, mixed mode)

If you have a Maven environment, running the following command prints version information like this:

$ mvn -version
Apache Maven 3.5.4 (1edded0938998edf8bf061f1ceb3cfdeccf443fe; 2018-06-18T02:33:14+08:00)
Maven home: /Users/wuchong/dev/maven
Java version: 1.8.0_65, vendor: Oracle Corporation, runtime: /Library/Java/JavaVirtualMachines/jdk1.8.0_65.jdk/Contents/Home/jre
Default locale: zh_CN, platform encoding: UTF-8
OS name: "mac os x", version: "10.13.6", arch: "x86_64", family: "mac"

In addition, we recommend IntelliJ IDEA (the free Community Edition is sufficient) as the IDE for developing Flink applications. Eclipse also works, but Eclipse has some known problems with mixed Scala and Java projects, so we do not recommend it. In the next section, we'll show you how to create a Flink project and import it into IntelliJ IDEA.
Create a Maven project
We will use the Flink Maven Archetype to create the project structure and some initial default dependencies. In your working directory, run the following command to create the project:

mvn archetype:generate \
    -DarchetypeGroupId=org.apache.flink \
    -DarchetypeArtifactId=flink-quickstart-java \
    -DarchetypeVersion=1.6.1 \
    -DgroupId=my-flink-project \
    -DartifactId=my-flink-project \
    -Dversion=0.1 \
    -Dpackage=myflink \
    -DinteractiveMode=false

You can change the groupId, artifactId, and package above to values you prefer. With the parameters above, Maven will automatically create the following project structure for you:

$ tree my-flink-project
my-flink-project
├── pom.xml
└── src
    └── main
        ├── java
        │   └── myflink
        │       ├── BatchJob.java
        │       └── StreamingJob.java
        └── resources
            └── log4j.properties

Our pom.xml file already contains the required Flink dependencies, and there are several example application skeletons under src/main/java.
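
For reference, the dependency section of the generated pom.xml typically looks like the sketch below. This is a simplified illustration, not a verbatim copy; the exact artifact names, properties, and scopes come from the archetype version, so check your generated file:

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-java</artifactId>
    <version>${flink.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-streaming-java_${scala.binary.version}</artifactId>
    <version>${flink.version}</version>
</dependency>

Next, we will start writing the first Flink program.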

Start IntelliJ IDEA, select "Import Project", and choose the pom.xml under the my-flink-project root directory. Then follow the wizard to complete the project import.
Create a SocketWindowWordCount.java file under src/main/java/myflink:

package myflink;

public class SocketWindowWordCount {

    public static void main(String[] args) throws Exception {

    }
}

This program is still an empty skeleton; we will fill in the code step by step. Note that we will not write out the import statements below, because the IDE adds them automatically. At the end of this section I will show the complete code; if you want to skip the steps, you can paste the final complete code into your editor directly.
The first step in a Flink program is to create a StreamExecutionEnvironment. This is the entry class, which can be used to set parameters, create data sources, and submit the job. So let's add it to the main method:

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

Next we create a data source that reads data from a socket on local port 9000:

DataStream<String> text = env.socketTextStream("localhost", 9000, "\n");

This creates a DataStream of strings. In Flink, DataStream is the core API for stream processing; it defines a great many common operations (such as filtering, transformation, aggregation, windowing, and joining). In this example, we are interested in the number of times each word appears in a particular time window, say a 5-second window. To that end, we first need to parse each line of string data into words and counts (represented as a Tuple2<String, Integer>): the first field is the word, the second field is the count, and the count is initialized to 1. We implement the parsing with a flatMap, because one line of input may contain more than one word.

DataStream<Tuple2<String, Integer>> wordCounts = text
                .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
                    @Override
                    public void flatMap(String value, Collector<Tuple2<String, Integer>> out) {
                        for (String word : value.split("\\s")) {
                            out.collect(Tuple2.of(word, 1));
                        }
                    }
                });
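
As a side note (not part of this quickstart's required steps), the same parsing logic can be written with a Java 8 lambda. Because of Java type erasure, Flink cannot infer the Tuple2 type from a lambda, so it must be declared explicitly via returns(...); the Types class used here is org.apache.flink.api.common.typeinfo.Types:

// Equivalent lambda version; returns(...) supplies the type information
// that type erasure removes from the lambda.
DataStream<Tuple2<String, Integer>> wordCounts = text
        .flatMap((String value, Collector<Tuple2<String, Integer>> out) -> {
            for (String word : value.split("\\s")) {
                out.collect(Tuple2.of(word, 1));
            }
        })
        .returns(Types.TUPLE(Types.STRING, Types.INT));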

Then we group the data stream by the word field (i.e., field index 0); here we can simply use the keyBy(int index) method to obtain a stream of Tuple2 data keyed by word. Then we can specify the desired window on the stream and compute a result based on the data in the window. In our example, we want to aggregate the word counts every 5 seconds, and each window starts counting from scratch:

DataStream<Tuple2<String, Integer>> windowCounts = wordCounts
                .keyBy(0)
                .timeWindow(Time.seconds(5))
                .sum(1);

The second call, .timeWindow(), specifies that we want 5-second tumbling windows. The third call specifies the sum aggregation function for each key and each window, which in our example adds up the count field (i.e., field index 1). The resulting data stream outputs, every 5 seconds, the number of occurrences of each word during those 5 seconds.
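
As an aside beyond what this example needs, timeWindow also accepts a second argument to create a sliding window instead of a tumbling one. For instance, this sketch would emit, every second, the counts over the last 5 seconds:

// Sliding-window variant: a 5-second window evaluated every 1 second.
DataStream<Tuple2<String, Integer>> slidingCounts = wordCounts
        .keyBy(0)
        .timeWindow(Time.seconds(5), Time.seconds(1))
        .sum(1);
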
The last thing to do is print the data stream to the console and start execution:

windowCounts.print().setParallelism(1);
env.execute("Socket Window WordCount");

The final call, env.execute(), is required to start the actual Flink job. All operator operations (such as creating sources, aggregating, and printing) only build an internal graph of operators; only when execute() is called is the job submitted to a cluster or executed on the local machine.
Here is the complete code, slightly simplified:

package myflink;

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

public class SocketWindowWordCount {

    public static void main(String[] args) throws Exception {

        // Create the execution environment
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Get the input data by connecting to a socket; here we connect to local port 9000.
        // If port 9000 is already in use, switch to another port.
        DataStream<String> text = env.socketTextStream("localhost", 9000, "\n");

        // Parse the data, group by word, window, and aggregate
        DataStream<Tuple2<String, Integer>> windowCounts = text
                .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
                    @Override
                    public void flatMap(String value, Collector<Tuple2<String, Integer>> out) {
                        for (String word : value.split("\\s")) {
                            out.collect(Tuple2.of(word, 1));
                        }
                    }
                })
                .keyBy(0)
                .timeWindow(Time.seconds(5))
                .sum(1);

        // Print the results to the console; note that this uses single-threaded printing rather than multi-threaded
        windowCounts.print().setParallelism(1);

        env.execute("Socket Window WordCount");
    }
}

Run the program
To run the sample program, first start netcat in a terminal to provide the input stream:

nc -lk 9000

On Windows, you can install ncat from https://nmap.org/ncat/ and run:

ncat -lk 9000

Then simply run the main method of SocketWindowWordCount.
Just type words into the netcat console, and you will see the frequency of each word in the output console of SocketWindowWordCount. If you want to see a count greater than 1, type the same word repeatedly within 5 seconds.
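
For example (a hypothetical session, assuming the words are typed within the same 5-second window), entering the following in the netcat console:

hello world
hello

should produce output like this in the SocketWindowWordCount console:

(hello,2)
(world,1)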

