Apache Flink stream processing example

Wikipedia publishes a log of wiki edits to an IRC channel. By listening to this channel, we can monitor edit events within a given time window in real time. As a stream computing engine, Apache Flink is well suited to processing such stream data, and, like frameworks such as Hadoop MapReduce, it provides a clean abstraction that keeps the business logic code very simple. Let's get a feel for programming with Flink through this small example.

Building Maven projects with Flink Quickstart

Flink provides the flink-quickstart-java and flink-quickstart-scala archetypes, which let developers using Maven create projects from a unified template. Starting from the project template avoids many setup pitfalls.

The command to generate the project is as follows

$ mvn archetype:generate \
    -DarchetypeGroupId=org.apache.flink \
    -DarchetypeArtifactId=flink-quickstart-java \
    -DarchetypeCatalog=https://repository.apache.org/content/repositories/snapshots/ \
    -DarchetypeVersion=1.6-SNAPSHOT \
    -DgroupId=wiki-edits \
    -DartifactId=wiki-edits \
    -Dversion=0.1 \
    -Dpackage=wikiedits \
    -DinteractiveMode=false

Note that newer versions of the Maven archetype plugin (3.0+) no longer support the -DarchetypeCatalog parameter. One workaround is to pin the plugin version by changing the first line of the command to mvn org.apache.maven.plugins:maven-archetype-plugin:2.4:generate \.
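For reference, the pinned-plugin variant of the full command would look like this (same parameters as above; only the first line changes):

$ mvn org.apache.maven.plugins:maven-archetype-plugin:2.4:generate \
    -DarchetypeGroupId=org.apache.flink \
    -DarchetypeArtifactId=flink-quickstart-java \
    -DarchetypeCatalog=https://repository.apache.org/content/repositories/snapshots/ \
    -DarchetypeVersion=1.6-SNAPSHOT \
    -DgroupId=wiki-edits \
    -DartifactId=wiki-edits \
    -Dversion=0.1 \
    -Dpackage=wikiedits \
    -DinteractiveMode=false

Alternatively, remove the -DarchetypeCatalog line and modify ~/.m2/settings.xml as follows; the key part is the repository under //profiles/profile/repositories, which sets the repository address Maven searches for archetypes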

<settings xmlns="http://maven.apache.org/SETTINGS/1.0.0"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/SETTINGS/1.0.0
                      http://maven.apache.org/xsd/settings-1.0.0.xsd">

  <profiles>
    <profile>
      <id>acme</id>
      <repositories>
        <repository>
            <id>archetype</id>
            <name>Apache Development Snapshot Repository</name>
            <url>https://repository.apache.org/content/repositories/snapshots/</url>
            <releases>
                <enabled>false</enabled>
            </releases>
            <snapshots>
                <enabled>true</enabled>
            </snapshots>
        </repository>
      </repositories>
    </profile>
  </profiles>

  <activeProfiles>
    <activeProfile>acme</activeProfile>
  </activeProfiles>

</settings>

After the project template is generated successfully, you should see a wiki-edits directory under the current directory. Run rm wiki-edits/src/main/java/wikiedits/*.java to remove the sample Java files that come with the template.

In order to monitor Wikipedia's IRC channel, add the following dependencies to the pom.xml file: the Flink client and the wikiedits connector

        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-clients_${scala.binary.version}</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-connector-wikiedits_${scala.binary.version}</artifactId>
            <version>${flink.version}</version>
        </dependency>
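Both dependencies refer to the ${flink.version} and ${scala.binary.version} properties, which the quickstart archetype defines in the <properties> section of the generated pom.xml. If you are writing the pom by hand, they would look roughly like this (assuming the archetype version used above and Flink's default Scala 2.11 build):

    <properties>
        <flink.version>1.6-SNAPSHOT</flink.version>
        <scala.binary.version>2.11</scala.binary.version>
    </properties>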

Write Flink programs

The rest of the coding assumes you are working in an IDE, mainly to spare you from writing verbose import statements by hand. The full code, including template code such as the imports, is given at the end.

First we create the skeleton of the main program in src/main/java/wikiedits/WikipediaAnalysis.java

package wikiedits;

public class WikipediaAnalysis {
    public static void main(String[] args) throws Exception {

    }
}

The first step in a Flink stream processing program is to create a stream execution context, StreamExecutionEnvironment, which is similar to the Configuration class in other frameworks and is used to configure various parameters of the Flink program and its runtime. The corresponding statement is as follows

StreamExecutionEnvironment see = StreamExecutionEnvironment.getExecutionEnvironment();
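The environment is also where runtime parameters are configured. For instance (an optional tweak, not needed for this example), you can set the default parallelism of all operators:

// Optional: run every operator with 4 parallel instances by default.
// Purely illustrative; the example works with the environment's defaults.
see.setParallelism(4);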

Next we add a source connection that uses the edit log of the Wikipedia IRC channel as the data source

DataStream<WikipediaEditEvent> edits = see.addSource(new WikipediaEditsSource());

This statement creates a DataStream filled with WikipediaEditEvent elements. Once we have the data stream, we can apply further operations to it.

Our goal is to count the number of bytes modified by each user on Wikipedia within a given time window, say five seconds, so we key each WikipediaEditEvent by its username. Because Flink remained compatible with Java 1.6 for a long time, older versions provided the KeySelector functional interface for specifying the key

KeyedStream<WikipediaEditEvent, String> keyedEdits = edits
    .keyBy(new KeySelector<WikipediaEditEvent, String>() {
        @Override
        public String getKey(WikipediaEditEvent event) {
            return event.getUser();
        }
    });

Current versions of Flink mainly target Java 8, so we can also rewrite this rather tedious code with a method reference

KeyedStream<WikipediaEditEvent, String> keyedEdits = edits
        .keyBy(WikipediaEditEvent::getUser);

This statement defines the variable keyedEdits, which is conceptually a stream of (String, WikipediaEditEvent) pairs, that is, a stream keyed by strings (usernames) with WikipediaEditEvent values. This step is similar to the Shuffle phase of MapReduce: processing on keyedEdits is automatically grouped by key, so we can directly fold over the data to aggregate the modified bytes of the same username

DataStream<Tuple2<String, Long>> result = keyedEdits
    .timeWindow(Time.seconds(5))
    .fold(new Tuple2<>("", 0L), new FoldFunction<WikipediaEditEvent, Tuple2<String, Long>>() {
        @Override
        public Tuple2<String, Long> fold(Tuple2<String, Long> acc, WikipediaEditEvent event) {
            acc.f0 = event.getUser();
            acc.f1 += event.getByteDiff();
            return acc;
        }
    });

In newer versions of Flink, FoldFunction is deprecated because it cannot support partial (incremental) aggregation. If that nags at you, the code above can be rewritten in a MapReduce-like style, where each method call does what its name suggests. Two details are worth noting: the returns call works around the type erasure of the lambda's Tuple2 result type, and because map produces a new, un-keyed stream, we key it again by the username field before windowing

DataStream<Tuple2<String, Long>> result = keyedEdits
        .map((event) -> new Tuple2<>(event.getUser(), Long.valueOf(event.getByteDiff())))
        .returns(new TypeHint<Tuple2<String, Long>>(){})
        .keyBy(0)                      // map() drops the keying, so re-key by the username field
        .timeWindow(Time.seconds(5))
        .reduce((acc, a) -> new Tuple2<>(acc.f0, acc.f1 + a.f1));

The processed result stream contains the information we need: it is a stream filled with Tuple2<String, Long> tuples of (username, modified bytes), which we can print with result.print().

This completes the main processing logic, but a Flink program still needs to call the execute method on the StreamExecutionEnvironment variable to actually run. When this method is called, the entire Flink program is translated into a task graph and submitted to the Flink cluster.
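As an aside (optional, and not used in the final code below), execute also accepts a job name, which is displayed in logs and in the Flink web UI:

// Optionally give the job a human-readable name.
see.execute("Wikipedia Edit Analysis");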

The code for the entire program, including the template code, looks like this

package wikiedits;

import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.connectors.wikiedits.WikipediaEditEvent;
import org.apache.flink.streaming.connectors.wikiedits.WikipediaEditsSource;
import org.apache.flink.api.java.tuple.Tuple2;

public class WikipediaAnalysis {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment see = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<WikipediaEditEvent> edits = see.addSource(new WikipediaEditsSource());
        KeyedStream<WikipediaEditEvent, String> keyedEdits = edits
                .keyBy(WikipediaEditEvent::getUser);
        DataStream<Tuple2<String, Long>> result = keyedEdits
                .map((event) -> new Tuple2<>(event.getUser(), Long.valueOf(event.getByteDiff())))
                .returns(new TypeHint<Tuple2<String, Long>>(){})
                .keyBy(0)
                .timeWindow(Time.seconds(5))
                .reduce((acc, a) -> new Tuple2<>(acc.f0, acc.f1 + a.f1));
        result.print();
        see.execute();
    }
}

You can run the program from the IDE and see console output similar to the following. The number at the head of each line indicates which parallel instance of the print sink produced that result.

1> (LilHelpa,1966)
2> (1.70.80.5,2066)
3> (Beyond My Ken,-6550)
4> (Aleksandr Grigoryev,725)
1> (6.77.155.31,1943)
2> (Serols,1639)
3> (ClueBot NG,1907)
4> (GSS,3155)
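To run the job on an actual Flink cluster instead of inside the IDE, a typical flow (sketched here; the jar name follows the artifactId and version chosen above, and the flink binary lives under your Flink installation's bin directory) is to package the project with Maven and submit the jar with the flink CLI:

$ cd wiki-edits && mvn clean package
$ flink run -c wikiedits.WikipediaAnalysis target/wiki-edits-0.1.jar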
