Flink series: implementing an incremental file WordCount in Java and deploying the job to YARN
Stateful Computations over Data Streams - Apache Flink®
Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Flink is designed to run in all common cluster environments, and to perform computations at in-memory speed and at any scale. Below we introduce the important aspects of Flink's architecture.
Processing unbounded and bounded data
Any kind of data is produced as a stream of events: credit card transactions, sensor measurements, machine logs, and user interactions on websites or mobile applications all form data streams.
Data can be processed as unbounded or bounded streams.
Unbounded streams have a defined start but no defined end. They generate data endlessly, so they must be processed continuously: events have to be handled promptly after they are ingested. It is not possible to wait for all of the input to arrive before processing, because the input is infinite and will never be complete. Processing unbounded data usually requires that events are ingested in a specific order, such as the order in which they occurred, so that the completeness of the results can be reasoned about.
Bounded streams have a defined start and a defined end. A bounded stream can be processed after all of its data has been ingested. Since all the data of a bounded stream can be sorted, ordered ingestion is not required. Processing of bounded streams is also known as batch processing.
Apache Flink excels at processing both bounded and unbounded data sets. Precise control of time and state enables Flink's runtime to run any kind of application on unbounded streams. Bounded streams are processed internally by algorithms and data structures that are specifically designed for fixed-size data sets, yielding excellent performance.
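As a rough plain-Java analogy (this is illustrative only, not the Flink API; the class name and values are made up for the sketch): a bounded source can be fully collected and even sorted before processing, while an unbounded source must be processed element by element as it arrives.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class BoundedVsUnbounded {
    public static void main(String[] args) {
        // bounded: the data set ends, so we can ingest everything and sort it
        List<Integer> sorted = Stream.of(3, 1, 2)
                .sorted()
                .collect(Collectors.toList());
        System.out.println(sorted);

        // unbounded: an endless generator; sorting the whole stream is
        // impossible, but per-element processing works fine
        Stream.iterate(0, n -> n + 1)
                .map(n -> n * n)
                .limit(5) // limit only so this demo terminates
                .forEach(System.out::println);
    }
}
```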
Preparing the Environment
| Package | Version |
|---|---|
| Apache Hadoop | 3.2.1 |
| IntelliJ IDEA | 2018.3 |
| CentOS | 7 |
| JDK | 1.8 |
| Apache Flink | 1.10.0 |
Our goal
Implement a program in Java with Flink that counts the words in a file; when we append new content to the file and save it, the program automatically counts the added content as well.
We will then submit this job to run on YARN.
FileWindowWordCount
Adding the dependencies
Gradle
// The following four dependencies are required for developing a Flink application in Java
// https://mvnrepository.com/artifact/org.apache.flink/flink-core
compile group: 'org.apache.flink', name: 'flink-core', version: '1.10.0'
// https://mvnrepository.com/artifact/org.apache.flink/flink-java
compile group: 'org.apache.flink', name: 'flink-java', version: '1.10.0'
// https://mvnrepository.com/artifact/org.apache.flink/flink-streaming-java
compile group: 'org.apache.flink', name: 'flink-streaming-java_2.12', version: '1.10.0'
// https://mvnrepository.com/artifact/org.apache.flink/flink-clients
compile group: 'org.apache.flink', name: 'flink-clients_2.12', version: '1.10.0'
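To submit the job to YARN later, the application is usually packaged as a jar. A hedged sketch of the packaging step using the Gradle Shadow plugin (the plugin version is an example and the archive names are chosen only to match the `ApacheFlink-0.0.3-alpha.jar` used in the submit command below; adjust to your build):

```groovy
plugins {
    id 'java'
    // Shadow builds a fat jar containing the compile dependencies;
    // the version here is an example, use one compatible with your Gradle
    id 'com.github.johnrengelman.shadow' version '5.2.0'
}

shadowJar {
    // produces build/libs/ApacheFlink-0.0.3-alpha-all.jar by default
    archiveBaseName = 'ApacheFlink'
    archiveVersion = '0.0.3-alpha'
}
```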
Code
package cn.flink;
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.java.io.TextInputFormat;
import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.FileProcessingMode;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;
/**
 * Incrementally reads the contents of a file and counts the words.
 * Adapted from the SocketWindowWordCount example on the Apache Flink website.
 */
public class LocalFileWindowWordCount {

    public static void main(String[] args) throws Exception {

        // the file to process is specified with the --input parameter
        final ParameterTool params = ParameterTool.fromArgs(args);
        final String input = params.get("input");
        if (input == null) {
            System.err.println("No input specified. Please run 'LocalFileWindowWordCount --input <input>'");
            return;
        }

        // get the execution environment
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Path path = new Path(input);
        TextInputFormat textInputFormat = new TextInputFormat(path);

        // monitor the file every 2000 ms and re-process it when it changes,
        // so that newly written content is picked up
        DataStream<String> text = env.readFile(textInputFormat, input,
                FileProcessingMode.PROCESS_CONTINUOUSLY, 2000);

        // parse the data, group it, window it, and aggregate the counts
        DataStream<WordWithCount> windowCounts = text
                .flatMap(new FlatMapFunction<String, WordWithCount>() {
                    @Override
                    public void flatMap(String value, Collector<WordWithCount> out) {
                        for (String word : value.split("\\s")) {
                            out.collect(new WordWithCount(word, 1L));
                        }
                    }
                })
                .keyBy("word")
                .timeWindow(Time.seconds(1), Time.seconds(1))
                .reduce(new ReduceFunction<WordWithCount>() {
                    @Override
                    public WordWithCount reduce(WordWithCount a, WordWithCount b) {
                        return new WordWithCount(a.word, a.count + b.count);
                    }
                });

        // print the results to stdout with a parallelism of 2
        windowCounts.print().setParallelism(2);

        env.execute("Local File Window WordCount");
    }

    // Data type for words with count
    public static class WordWithCount {

        public String word;
        public long count;

        public WordWithCount() {}

        public WordWithCount(String word, long count) {
            this.word = word;
            this.count = count;
        }

        @Override
        public String toString() {
            return word + " : " + count;
        }
    }
}
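One detail of the `flatMap` above is worth knowing: `value.split("\\s")` splits on single whitespace characters, so a run of consecutive spaces produces empty-string "words" that get counted too, while `"\\s+"` would collapse the run into one separator. A small stand-alone Java check of this behavior (no Flink needed):

```java
import java.util.Arrays;

public class SplitCheck {
    public static void main(String[] args) {
        String line = "hello  world"; // note the double space

        // "\\s" splits on every single whitespace char: an empty token appears
        String[] single = line.split("\\s");
        System.out.println(Arrays.toString(single)); // [hello, , world]

        // "\\s+" treats a whole run of whitespace as one separator
        String[] runs = line.split("\\s+");
        System.out.println(Arrays.toString(runs)); // [hello, world]
    }
}
```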
Run it in the IDE to see the effect
First create a new empty text file; mine is at D:\tmp\input.txt
Running the Project
Open the input.txt file with Notepad, write a single line of text, and save it.
OK, the result is as expected.
The jar can now be uploaded to the server and run there.
bin/flink run -m yarn-cluster -p 2 -yjm 700m -ytm 1024m -c cn.flink.LocalFileWindowWordCount ~/ApacheFlink-0.0.3-alpha.jar --input /root/wordcount/input.txt
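With the job running on YARN, the incremental behavior can be exercised by appending a line to the monitored file (the path below is the one from the submit command above):

```shell
# append a line to the watched file; in PROCESS_CONTINUOUSLY mode Flink
# re-reads the file on change (polling interval 2000 ms in our code),
# so the new words show up in the TaskManager stdout shortly afterwards
echo "hello flink hello yarn" >> /root/wordcount/input.txt
```

Note that, per the Flink documentation, `PROCESS_CONTINUOUSLY` re-processes the entire file whenever it is modified, so counts for words written earlier are emitted again along with the new ones.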