Wikipedia publishes a log of wiki modifications to an IRC channel, so we can monitor modification events within a given time window in real time by listening to this channel. As a stream computing engine, Apache Flink is well suited to processing stream data, and, like frameworks such as Hadoop MapReduce, it provides good abstractions that make writing business logic very simple. Let's get a feel for Flink programming through this simple example.
Building a Maven project with Flink Quickstart
Flink provides the flink-quickstart-java and flink-quickstart-scala archetypes, which let developers using Maven create a unified project template. Using the project template avoids many deployment pitfalls.
The command to generate this project is as follows:
$ mvn archetype:generate \
-DarchetypeGroupId=org.apache.flink \
-DarchetypeArtifactId=flink-quickstart-java \
-DarchetypeCatalog=https://repository.apache.org/content/repositories/snapshots/ \
-DarchetypeVersion=1.6-SNAPSHOT \
-DgroupId=wiki-edits \
-DartifactId=wiki-edits \
-Dversion=0.1 \
-Dpackage=wikiedits \
-DinteractiveMode=false
Note that newer versions of Maven no longer support the -DarchetypeCatalog parameter. You can either change the first line to mvn org.apache.maven.plugins:maven-archetype-plugin:2.4:generate \, or remove the -DarchetypeCatalog line and modify .m2/settings.xml as follows. The key part is //profiles/profile/repositories, which sets the repository address used to look up the archetype:
<settings xmlns="http://maven.apache.org/SETTINGS/1.0.0"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:schemaLocation="http://maven.apache.org/SETTINGS/1.0.0
                              http://maven.apache.org/xsd/settings-1.0.0.xsd">
  <profiles>
    <profile>
      <id>acme</id>
      <repositories>
        <repository>
          <id>archetype</id>
          <name>Apache Development Snapshot Repository</name>
          <url>https://repository.apache.org/content/repositories/snapshots/</url>
          <releases>
            <enabled>false</enabled>
          </releases>
          <snapshots>
            <enabled>true</enabled>
          </snapshots>
        </repository>
      </repositories>
    </profile>
  </profiles>
  <activeProfiles>
    <activeProfile>acme</activeProfile>
  </activeProfiles>
</settings>
After the project template is downloaded successfully, you should see a wiki-edits directory under the current directory. Execute rm wiki-edits/src/main/java/wikiedits/*.java to remove the Java files that come with the template.
To monitor Wikipedia's IRC channel, add the following dependencies to the pom.xml file: the Flink client and the WikiEdits connector.
<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-clients_${scala.binary.version}</artifactId>
  <version>${flink.version}</version>
</dependency>
<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-connector-wikiedits_${scala.binary.version}</artifactId>
  <version>${flink.version}</version>
</dependency>
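These dependencies refer to the scala.binary.version and flink.version Maven properties. The quickstart template already defines them in its pom.xml; if yours does not, a properties block along these lines would be needed (the versions shown are illustrative, matching the archetype version used above):

```xml
<properties>
  <flink.version>1.6-SNAPSHOT</flink.version>
  <scala.binary.version>2.11</scala.binary.version>
</properties>
```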
Writing the Flink program
The rest of this walkthrough assumes you are writing in an IDE, mainly to avoid verbose import statements; the full code, including template code such as the imports, is given at the end. First we create the file src/main/java/wikiedits/WikipediaAnalysis.java containing the main program:
package wikiedits;
public class WikipediaAnalysis {
public static void main(String[] args) throws Exception {
}
}
The first step in a Flink stream-processing program is to create the stream execution context, StreamExecutionEnvironment, which plays a role similar to the Configuration class in other frameworks: it configures the parameters of the Flink program and its runtime. The corresponding statement is as follows:
StreamExecutionEnvironment see = StreamExecutionEnvironment.getExecutionEnvironment();
Next we create a connection using the logs of the Wikipedia IRC channel as the data source
DataStream<WikipediaEditEvent> edits = see.addSource(new WikipediaEditsSource());
This statement creates a DataStream<WikipediaEditEvent> filled with WikipediaEditEvent records. Once we have the data stream, we can apply further operations to it.
Our goal is to count the number of bytes modified by each user on Wikipedia within a given time window, say five seconds, so we key each WikipediaEditEvent by the username. Flink used to be compatible with Java 1.6, so older versions provided the KeySelector functional interface to specify the key:
KeyedStream<WikipediaEditEvent, String> keyedEdits = edits
    .keyBy(new KeySelector<WikipediaEditEvent, String>() {
        @Override
        public String getKey(WikipediaEditEvent event) {
            return event.getUser();
        }
    });
Current versions of Flink target Java 8, so we can rewrite this verbose code with a method reference:
KeyedStream<WikipediaEditEvent, String> keyedEdits = edits
    .keyBy(WikipediaEditEvent::getUser);
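To build intuition for what keying does, here is a plain-Java sketch (no Flink involved; the class name and sample data are made up) that groups hypothetical edit events by username, the way keyBy conceptually routes events with equal keys to the same group:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class KeyBySketch {
    // Group byte diffs by username, mimicking how keyBy partitions the stream
    static Map<String, List<Integer>> groupByUser(String[] users, int[] byteDiffs) {
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (int i = 0; i < users.length; i++) {
            groups.computeIfAbsent(users[i], k -> new ArrayList<>()).add(byteDiffs[i]);
        }
        return groups;
    }

    public static void main(String[] args) {
        // Hypothetical usernames and byte diffs from the edit stream
        System.out.println(groupByUser(new String[]{"alice", "bob", "alice"},
                                       new int[]{100, 50, -30}));
        // prints {alice=[100, -30], bob=[50]}
    }
}
```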
This statement defines a variable keyedEdits, which is conceptually a stream of (String, WikipediaEditEvent) pairs, that is, a data stream keyed by the username string with WikipediaEditEvent values. This step is similar to the shuffle phase of MapReduce: operations on keyedEdits are automatically grouped by key, so we can directly fold the stream to aggregate the modified bytes of each username:
DataStream<Tuple2<String, Long>> result = keyedEdits
    .timeWindow(Time.seconds(5))
    .fold(new Tuple2<>("", 0L), new FoldFunction<WikipediaEditEvent, Tuple2<String, Long>>() {
        @Override
        public Tuple2<String, Long> fold(Tuple2<String, Long> acc, WikipediaEditEvent event) {
            acc.f0 = event.getUser();
            acc.f1 += event.getByteDiff();
            return acc;
        }
    });
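To see what the fold computes per key and window, here is a plain-Java sketch (the class name and sample numbers are made up) that applies the same accumulator logic to one user's events in a single window:

```java
public class FoldSketch {
    // Same logic as the FoldFunction: start from ("", 0L), keep the user,
    // and accumulate the byte diffs
    static String foldWindow(String[] users, int[] byteDiffs) {
        String user = "";
        long bytes = 0L;
        for (int i = 0; i < users.length; i++) {
            user = users[i];
            bytes += byteDiffs[i];
        }
        return "(" + user + "," + bytes + ")";
    }

    public static void main(String[] args) {
        // Three hypothetical edits by one user inside a 5-second window
        System.out.println(foldWindow(new String[]{"alice", "alice", "alice"},
                                      new int[]{100, -30, 25}));  // prints (alice,95)
    }
}
```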
In newer versions of Flink, FoldFunction is deprecated because it cannot support partial (incremental) aggregation. If we want tidier code, we can rewrite the above in a MapReduce-like style, where each method call does what its name suggests. Note that map drops the keying, so we key the stream by the username field again before windowing, and we call returns to work around the type erasure of the lambda's generic return type:
DataStream<Tuple2<String, Long>> result = keyedEdits
    .map(event -> new Tuple2<>(event.getUser(), Long.valueOf(event.getByteDiff())))
    .returns(new TypeHint<Tuple2<String, Long>>(){})
    .keyBy(0)  // re-key by the username field, since map loses the keying
    .timeWindow(Time.seconds(5))
    .reduce((acc, a) -> new Tuple2<>(acc.f0, acc.f1 + a.f1));
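The reason reduce supports incremental aggregation is that it combines two values of the same type, so partial results from different subsets of the data can themselves be combined. A plain-Java sketch (the class name and numbers are illustrative) of that property:

```java
public class ReduceSketch {
    // The core of the reduce lambda: combine two byte-count aggregates for the same key
    static long combine(long accBytes, long nextBytes) {
        return accBytes + nextBytes;
    }

    public static void main(String[] args) {
        long[] diffs = {100, -30, 25, 7};
        // Aggregate everything left to right...
        long all = 0;
        for (long d : diffs) all = combine(all, d);
        // ...or combine two partial aggregates: the result is the same,
        // which is why a window reduce can keep one running value per key
        // instead of buffering every element the way fold's replacement must
        long left = combine(diffs[0], diffs[1]);
        long right = combine(diffs[2], diffs[3]);
        System.out.println(all == combine(left, right));  // prints true
    }
}
```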
The processed result stream contains the information we need: it is a Tuple2<String, Long> stream filled with (username, modified bytes) tuples, which we can print with result.print().
The main processing logic of the program is now complete, but Flink still requires us to call the execute method on the StreamExecutionEnvironment variable to actually run the program. When this method is called, the entire Flink program is translated into a task graph and submitted to the Flink cluster.
The complete program, including the template code, looks like this:
package wikiedits;
import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.connectors.wikiedits.WikipediaEditEvent;
import org.apache.flink.streaming.connectors.wikiedits.WikipediaEditsSource;
import org.apache.flink.api.java.tuple.Tuple2;
public class WikipediaAnalysis {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment see = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<WikipediaEditEvent> edits = see.addSource(new WikipediaEditsSource());

        KeyedStream<WikipediaEditEvent, String> keyedEdits = edits
            .keyBy(WikipediaEditEvent::getUser);

        DataStream<Tuple2<String, Long>> result = keyedEdits
            .map(event -> new Tuple2<>(event.getUser(), Long.valueOf(event.getByteDiff())))
            .returns(new TypeHint<Tuple2<String, Long>>(){})
            .keyBy(0)  // re-key by the username field, since map loses the keying
            .timeWindow(Time.seconds(5))
            .reduce((acc, a) -> new Tuple2<>(acc.f0, acc.f1 + a.f1));

        result.print();

        see.execute();
    }
}
You can run the program from the IDE and see console output similar to the following. The number at the start of each line identifies the parallel instance of print that produced the result.
1> (LilHelpa,1966)
2> (1.70.80.5,2066)
3> (Beyond My Ken,-6550)
4> (Aleksandr Grigoryev,725)
1> (6.77.155.31,1943)
2> (Serols,1639)
3> (ClueBot NG,1907)
4> (GSS,3155)