Setting up a Maven project
We use Maven to create the Flink project. Run the following command:
$ mvn archetype:generate \
    -DarchetypeGroupId=org.apache.flink \
    -DarchetypeArtifactId=flink-quickstart-java \
    -DarchetypeVersion=1.9.0 \
    -DgroupId=wiki-edits \
    -DartifactId=wiki-edits \
    -Dversion=0.1 \
    -Dpackage=wikiedits \
    -DinteractiveMode=false
You can change the groupId, artifactId and package if you like. With the values above, the project directory structure is as follows:
$ tree wiki-edits
wiki-edits/
├── pom.xml
└── src
    └── main
        ├── java
        │   └── wikiedits
        │       ├── BatchJob.java
        │       └── StreamingJob.java
        └── resources
            └── log4j.properties
The generated project already contains some sample code. We start from scratch, so delete the sample classes in src/main/java:
$ rm wiki-edits/src/main/java/wikiedits/*.java
Finally, add the dependencies that our program needs to pom.xml:
<dependencies>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-java</artifactId>
        <version>${flink.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-streaming-java_2.11</artifactId>
        <version>${flink.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-clients_2.11</artifactId>
        <version>${flink.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-connector-wikiedits_2.11</artifactId>
        <version>${flink.version}</version>
    </dependency>
</dependencies>
Writing a Flink program
Open your IDE and add the file src/main/java/wikiedits/WikipediaAnalysis.java:
package wikiedits;

public class WikipediaAnalysis {

    public static void main(String[] args) throws Exception {
    }
}
For now this is just an empty main method. The first step in a Flink program is to create a StreamExecutionEnvironment (or an ExecutionEnvironment if you are writing a batch job). It is used to set execution parameters and to create sources that read from external systems. Add this to the main method:
StreamExecutionEnvironment see = StreamExecutionEnvironment.getExecutionEnvironment();
Next, we create a source that reads from the Wikipedia IRC log:
DataStream<WikipediaEditEvent> edits = see.addSource(new WikipediaEditsSource());
This creates a DataStream of WikipediaEditEvent elements that we can process further. Since we want results per user, the first step is to key the stream by user name:
KeyedStream<WikipediaEditEvent, String> keyedEdits = edits
    .keyBy(new KeySelector<WikipediaEditEvent, String>() {
        @Override
        public String getKey(WikipediaEditEvent event) {
            return event.getUser();
        }
    });
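To make the idea of keying more concrete, here is a minimal plain-Java sketch of what a key selector does conceptually: it extracts a key from each element, and elements with the same key end up together. It does not use Flink at all; the Edit class and the groupByKey helper are hypothetical stand-ins invented for this illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

// Plain-Java illustration only; no Flink classes are used here.
class KeyBySketch {

    // Hypothetical stand-in for WikipediaEditEvent with just the two fields we need.
    static final class Edit {
        final String user;
        final int byteDiff;

        Edit(String user, int byteDiff) {
            this.user = user;
            this.byteDiff = byteDiff;
        }
    }

    // Groups elements by the key that the selector extracts from each of them.
    static Map<String, List<Edit>> groupByKey(List<Edit> edits, Function<Edit, String> keySelector) {
        return edits.stream()
                .collect(Collectors.groupingBy(keySelector, TreeMap::new, Collectors.toList()));
    }

    public static void main(String[] args) {
        List<Edit> edits = Arrays.asList(
                new Edit("AnomieBOT", 100),
                new Edit("KasparBot", -245),
                new Edit("AnomieBOT", 55));
        Map<String, List<Edit>> byUser = groupByKey(edits, e -> e.user);
        byUser.forEach((user, list) -> System.out.println(user + " -> " + list.size() + " edit(s)"));
    }
}
```

Unlike this batch-style grouping, a keyed stream never materializes all elements; it only guarantees that elements with the same key are handled by the same parallel instance.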
Next, we specify the window over which results are computed and the aggregation to apply to it. In this example we sum the number of bytes each user added or removed within the window. Because the data stream is unbounded, an aggregation can only be computed over a bounded slice of it, so we need a window; here we use a window of 5 seconds.
DataStream<Tuple2<String, Long>> result = keyedEdits
    .timeWindow(Time.seconds(5))
    .aggregate(new AggregateFunction<WikipediaEditEvent, Tuple2<String, Long>, Tuple2<String, Long>>() {
        @Override
        public Tuple2<String, Long> createAccumulator() {
            return new Tuple2<>("", 0L);
        }

        @Override
        public Tuple2<String, Long> add(WikipediaEditEvent value, Tuple2<String, Long> accumulator) {
            accumulator.f0 = value.getUser();
            accumulator.f1 += value.getByteDiff();
            return accumulator;
        }

        @Override
        public Tuple2<String, Long> getResult(Tuple2<String, Long> accumulator) {
            return accumulator;
        }

        @Override
        public Tuple2<String, Long> merge(Tuple2<String, Long> a, Tuple2<String, Long> b) {
            return new Tuple2<>(a.f0, a.f1 + b.f1);
        }
    });
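The effect of a keyed 5-second window with a summing aggregation can be sketched in plain Java. This is an illustration under simplifying assumptions, not how Flink implements windows: timestamps are passed in explicitly (the job above uses the time at which elements are processed), they are assumed to be non-negative, and the class and method names are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration only; it mimics the idea of a keyed tumbling window, not Flink's runtime.
class TumblingWindowSketch {

    static final long WINDOW_SIZE_MS = 5_000L; // corresponds to Time.seconds(5) in the text

    // Start of the tumbling window that contains the given timestamp (non-negative timestamps assumed).
    static long windowStart(long timestampMs) {
        return timestampMs - (timestampMs % WINDOW_SIZE_MS);
    }

    // window start -> (user -> summed byte diff): one running sum per key and window
    private final Map<Long, Map<String, Long>> windows = new TreeMap<>();

    // Counterpart of add(): fold one edit into the running sum of its user and window.
    void add(long timestampMs, String user, int byteDiff) {
        windows.computeIfAbsent(windowStart(timestampMs), w -> new TreeMap<>())
               .merge(user, (long) byteDiff, Long::sum);
    }

    // Counterpart of getResult(): the sum for one user in one window (0 if there were no edits).
    long result(long windowStartMs, String user) {
        return windows.getOrDefault(windowStartMs, new TreeMap<>()).getOrDefault(user, 0L);
    }

    public static void main(String[] args) {
        TumblingWindowSketch sketch = new TumblingWindowSketch();
        sketch.add(1_000L, "AnomieBOT", 100);
        sketch.add(4_999L, "AnomieBOT", 55);   // same window [0, 5000)
        sketch.add(5_000L, "AnomieBOT", -20);  // next window [5000, 10000)
        System.out.println(sketch.result(0L, "AnomieBOT"));     // prints 155
        System.out.println(sketch.result(5_000L, "AnomieBOT")); // prints -20
    }
}
```

The sketch keeps every window in memory forever; a streaming engine instead emits the result of a window once it closes and then discards its state.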
Finally, print the results and start the execution:
result.print();
see.execute();
The last call is needed to actually start the job. All operations, such as creating sources, transformations and sinks, only build up an internal directed graph of operations. Only when execute() is called is this graph run on the local machine or submitted to a cluster.
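This "build first, run later" behaviour can be illustrated with a small plain-Java sketch. It is only an analogy under simplifying assumptions: the LazyPipelineSketch class and its map and execute methods are hypothetical names invented here, and the "graph" is just a linear list of steps.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Plain-Java illustration of deferred execution; class and method names are hypothetical.
class LazyPipelineSketch {

    private final List<Function<String, String>> steps = new ArrayList<>();

    // Registers a transformation; nothing is executed yet.
    LazyPipelineSketch map(Function<String, String> step) {
        steps.add(step);
        return this;
    }

    // Runs every registered step, in order, on each input element.
    List<String> execute(List<String> input) {
        List<String> output = new ArrayList<>();
        for (String element : input) {
            String current = element;
            for (Function<String, String> step : steps) {
                current = step.apply(current);
            }
            output.add(current);
        }
        return output;
    }

    public static void main(String[] args) {
        int[] calls = {0}; // counts how often a step actually ran
        LazyPipelineSketch pipeline = new LazyPipelineSketch()
                .map(s -> { calls[0]++; return s.trim(); })
                .map(s -> { calls[0]++; return s.toUpperCase(); });
        System.out.println("calls before execute: " + calls[0]); // prints 0
        System.out.println(pipeline.execute(Arrays.asList(" a ", " b "))); // prints [A, B]
        System.out.println("calls after execute: " + calls[0]);  // prints 4
    }
}
```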
The complete code is as follows:
package wikiedits;

import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.connectors.wikiedits.WikipediaEditEvent;
import org.apache.flink.streaming.connectors.wikiedits.WikipediaEditsSource;

public class WikipediaAnalysis {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment see = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<WikipediaEditEvent> edits = see.addSource(new WikipediaEditsSource());

        KeyedStream<WikipediaEditEvent, String> keyedEdits = edits
            .keyBy(new KeySelector<WikipediaEditEvent, String>() {
                @Override
                public String getKey(WikipediaEditEvent event) {
                    return event.getUser();
                }
            });

        DataStream<Tuple2<String, Long>> result = keyedEdits
            .timeWindow(Time.seconds(5))
            .aggregate(new AggregateFunction<WikipediaEditEvent, Tuple2<String, Long>, Tuple2<String, Long>>() {
                @Override
                public Tuple2<String, Long> createAccumulator() {
                    return new Tuple2<>("", 0L);
                }

                @Override
                public Tuple2<String, Long> add(WikipediaEditEvent value, Tuple2<String, Long> accumulator) {
                    accumulator.f0 = value.getUser();
                    accumulator.f1 += value.getByteDiff();
                    return accumulator;
                }

                @Override
                public Tuple2<String, Long> getResult(Tuple2<String, Long> accumulator) {
                    return accumulator;
                }

                @Override
                public Tuple2<String, Long> merge(Tuple2<String, Long> a, Tuple2<String, Long> b) {
                    return new Tuple2<>(a.f0, a.f1 + b.f1);
                }
            });

        result.print();

        see.execute();
    }
}
You can build and run the program with Maven from the command line:
$ mvn clean package
$ mvn exec:java -Dexec.mainClass=wikiedits.WikipediaAnalysis
The output looks similar to this:
1> (Fenix down,114)
6> (AnomieBOT,155)
8> (BD2412bot,-3690)
7> (IgnorantArmies,49)
3> (Ckh3111,69)
5> (Slade360,0)
7> (Narutolovehinata5,2195)
6> (Vuyisa2001,79)
4> (Ms Sarah Welch,269)
4> (KasparBot,-245)
The number at the start of each line tells you which parallel instance of the print sink produced the output.
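Because the stream is keyed by user, all edits of one user are handled by the same parallel instance. The following plain-Java sketch illustrates that idea with a deliberately simplified rule (hash of the key modulo the parallelism). This is an assumption made for illustration only: Flink's actual assignment works differently (it goes through key groups and its own hashing), so the numbers produced here will not match the prefixes in the real output. The parallelism of 8 is also an assumption, taken from the highest prefix in the sample output.

```java
// Plain-Java illustration only; Flink's actual key-to-instance assignment is different.
class SubtaskSketch {

    // Maps a key to a 1-based parallel instance number, similar in spirit to the "N>" prefix.
    static int subtaskFor(String key, int parallelism) {
        return Math.floorMod(key.hashCode(), parallelism) + 1;
    }

    public static void main(String[] args) {
        int parallelism = 8; // assumed; the sample output shows prefixes up to 8
        for (String user : new String[] {"AnomieBOT", "KasparBot", "AnomieBOT"}) {
            // The same user always maps to the same instance number.
            System.out.println(subtaskFor(user, parallelism) + "> " + user);
        }
    }
}
```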
Exercise: running on a cluster and writing to Kafka
Before you start, set up a local Flink cluster on your machine and install Kafka.
We need the Kafka connector so that we can use Kafka as a sink. As a first step, add its dependency to pom.xml:
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-kafka-0.11_2.11</artifactId>
    <version>${flink.version}</version>
</dependency>
Next, we modify our program: the print sink is replaced with a Kafka sink, as follows:
result
    .map(new MapFunction<Tuple2<String, Long>, String>() {
        @Override
        public String map(Tuple2<String, Long> tuple) {
            return tuple.toString();
        }
    })
    .addSink(new FlinkKafkaProducer011<>("localhost:9092", "wiki-result", new SimpleStringSchema()));
The following imports are also needed:
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.common.functions.MapFunction;
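The map step turns each tuple into a plain string such as (AnomieBOT,155), which is the format seen in the sample output earlier. As a minimal sketch of what a downstream reader of the topic might do with such strings, the following plain-Java helper formats and parses them. It uses neither Flink nor Kafka classes; the class and method names are hypothetical, and splitting on the last comma is a design choice made here on the assumption that user names may themselves contain commas.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

// Plain-Java illustration only; helper names are hypothetical, no Flink or Kafka classes are used.
class TupleStringSketch {

    // Formats a (user, byteDiff) pair in the style of the sample output, e.g. "(AnomieBOT,155)".
    static String format(String user, long byteDiff) {
        return "(" + user + "," + byteDiff + ")";
    }

    // Parses such a string back; splits on the last comma because user names may contain commas.
    static Map.Entry<String, Long> parse(String text) {
        if (text.length() < 2 || text.charAt(0) != '(' || text.charAt(text.length() - 1) != ')') {
            throw new IllegalArgumentException("not a tuple string: " + text);
        }
        String inner = text.substring(1, text.length() - 1);
        int comma = inner.lastIndexOf(',');
        if (comma < 0) {
            throw new IllegalArgumentException("missing separator: " + text);
        }
        String user = inner.substring(0, comma);
        Long byteDiff = Long.valueOf(inner.substring(comma + 1));
        return new SimpleEntry<>(user, byteDiff);
    }

    public static void main(String[] args) {
        String line = format("Fenix down", 114L);
        Map.Entry<String, Long> parsed = parse(line);
        System.out.println(line);                                    // prints (Fenix down,114)
        System.out.println(parsed.getKey() + " changed " + parsed.getValue() + " bytes");
    }
}
```

For anything beyond a quick experiment, a structured format such as JSON would be easier to parse reliably than this string form.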
Build the jar file with Maven:
$ mvn clean package
The resulting jar file is target/wiki-edits-0.1.jar.
Now we need to start the Flink cluster:
$ cd my/flink/directory
$ bin/start-cluster.sh
We also need to create the Kafka topic so that our program can write to it:
$ cd my/kafka/directory
$ bin/kafka-topics.sh --create --zookeeper localhost:2181 \
    --replication-factor 1 --partitions 1 --topic wiki-result
Now we are ready to run our jar file on the local Flink cluster:
$ cd my/flink/directory
$ bin/flink run -c wikiedits.WikipediaAnalysis path/to/wiki-edits-0.1.jar
The log output looks like this:
03/08/2016 15:09:27 Job execution switched to status RUNNING.
03/08/2016 15:09:27 Source: Custom Source(1/1) switched to SCHEDULED
03/08/2016 15:09:27 Source: Custom Source(1/1) switched to DEPLOYING
03/08/2016 15:09:27 Window(TumblingProcessingTimeWindows(5000), ProcessingTimeTrigger, AggregateFunction$3, PassThroughWindowFunction) -> Sink: Print to Std. Out (1/1) switched from CREATED to SCHEDULED
03/08/2016 15:09:27 Window(TumblingProcessingTimeWindows(5000), ProcessingTimeTrigger, AggregateFunction$3, PassThroughWindowFunction) -> Sink: Print to Std. Out (1/1) switched from SCHEDULED to DEPLOYING
03/08/2016 15:09:27 Window(TumblingProcessingTimeWindows(5000), ProcessingTimeTrigger, AggregateFunction$3, PassThroughWindowFunction) -> Sink: Print to Std. Out (1/1) switched from DEPLOYING to RUNNING
03/08/2016 15:09:27 Source: Custom Source(1/1) switched to RUNNING
You can open http://localhost:8081 to see how the job is running. Only two operators are shown, because for performance reasons the operations after the window are folded into a single operator; Flink calls this chaining.
You can inspect the data written to the Kafka topic with the Kafka console consumer:
bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic wiki-result