I. Introduction
The DataStream API gets its name from the DataStream class used to represent collections of data in Flink programs. You can think of them as immutable collections of data that can contain duplicates. The data can be bounded (finite) or unbounded (infinite), but the API used to process it is the same.
A DataStream is similar in usage to a regular Java collection, but differs in some key ways. It is immutable: once created, you cannot add or remove elements. You also cannot simply inspect its internal elements; you can only manipulate a DataStream through API operations, also known as transformations.
You create an initial DataStream by adding a source to your Flink program. From it you can derive new DataStreams with API methods such as map and filter, and join the derived streams together. As before, processing a DataStream mainly consists of the combination Source + Transformation + Sink:
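The Source + Transformation + Sink shape can be sketched with plain Scala collections. This is a conceptual analogy only (a real Flink job builds a lazy dataflow on a StreamExecutionEnvironment), and all names here are made up for illustration:

```scala
// Conceptual analogy only: plain Scala collections stand in for streams.
object PipelineShape {
  // Source: produce records
  val source: List[String] = List("spark", "flink", "storm")

  // Transformations: each step derives a new, immutable collection,
  // just as each DataStream operation derives a new DataStream
  val transformed: List[String] =
    source.filter(_.startsWith("f")).map(_.toUpperCase)

  // Sink: consume the result (here, just print it)
  def runSink(): Unit = transformed.foreach(println)
}
```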
Tips:
Unlike before, the execution environment of the DataSet API is:
val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
The execution environment of the DataStream API is:
val env = StreamExecutionEnvironment.getExecutionEnvironment
II. File-Based
Most of the interfaces here are similar to those of DataSet. Because the env differs, the resulting type also differs: a DataStream instead of a DataSet.
1.readTextFile(path)
Reads text files, i.e. files that conform to the TextInputFormat specification, line by line and returns them as Strings.
val textLines = env.readTextFile("path")
2.readFile(fileInputFormat, path)
Read (once) the file according to the specified file input format.
class selfFileInputFormat() extends FileInputFormat[String] {
  // true when the current split has no more records to read
  override def reachedEnd(): Boolean = ???
  // produce the next record (left unimplemented in this skeleton)
  override def nextRecord(ot: String): String = ???
}
val dataStream = env.readFile(new selfFileInputFormat(), "")
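To make the reachedEnd / nextRecord contract concrete, here is a toy pure-Scala stand-in that pulls records from an in-memory Vector instead of a real file split. This is an illustration of the pull protocol only, not Flink's actual FileInputFormat:

```scala
// Toy stand-in for the FileInputFormat pull contract: the runtime
// repeatedly checks reachedEnd() and calls nextRecord() until the
// split is exhausted. Records come from an in-memory Vector here.
class ToyInputFormat(records: Vector[String]) {
  private var pos = 0

  // true once every record of this "split" has been handed out
  def reachedEnd(): Boolean = pos >= records.length

  // return the next record; the reuse parameter mirrors Flink's
  // signature, where an object can be recycled to avoid allocation
  def nextRecord(reuse: String): String = {
    val record = records(pos)
    pos += 1
    record
  }
}
```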
3.readFile(fileInputFormat, path, watchType, interval, pathFilter, typeInfo)
The two methods above are API conveniences; under the hood they both call this function, which reads the file at the given path based on the given fileInputFormat. Depending on the provided watchType, the source may periodically (every interval milliseconds) monitor the path for new data.
Tips:
Under the hood, Flink splits the file-reading process into two subtasks: directory monitoring and data reading. Each subtask is implemented by a separate entity. Monitoring is implemented by a single non-parallel (parallelism = 1) task, while reading is performed by multiple tasks running in parallel; the parallelism of the latter equals the parallelism of the job. The role of the single monitoring task is to scan the directory (periodically or only once, depending on watchType), find the files to process, divide them into shards, and assign those shards to downstream readers. A reader is the role that actually fetches the data. Each shard is read by exactly one reader, while a reader can read multiple shards one after another.
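The monitor/reader split described above can be sketched in plain Scala. This is a simplification (real Flink readers pull splits on demand rather than receiving a static round-robin assignment), and all names here are illustrative:

```scala
// Simplified sketch of the directory-monitor / reader split:
// a single monitor slices files into shards and hands them out
// round-robin to the parallel readers.
case class Shard(file: String, index: Int)

def assignShards(files: Seq[String], shardsPerFile: Int, readers: Int): Map[Int, Seq[Shard]] = {
  // The monitor's job: enumerate files and slice each into shards
  val shards = for {
    f <- files
    i <- 0 until shardsPerFile
  } yield Shard(f, i)
  // Hand shards out round-robin; each shard goes to exactly one reader
  shards.zipWithIndex
    .groupBy { case (_, idx) => idx % readers }
    .map { case (reader, pairs) => reader -> pairs.map(_._1) }
}
```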
FileProcessingMode.PROCESS_CONTINUOUSLY
When a file is modified, its contents are completely reprocessed. This may break the "exactly once" semantics, since appending data at the end of the file will cause the entire contents of the file to be reprocessed.
FileProcessingMode.PROCESS_ONCE
The source scans the path once and exits without waiting for the readers to finish reading the files. The readers will continue to read data until all file contents have been read. Closing the source means there are no more checkpoints after that point. This can result in slower recovery after a node failure, as the job will resume reading from the last checkpoint.
val env = StreamExecutionEnvironment.getExecutionEnvironment
// Custom TextFormat
class selfFileInputFormat() extends FileInputFormat[String] {
  override def reachedEnd(): Boolean = ???
  override def nextRecord(ot: String): String = ???
}
// Implicit typeInfo for the String records
implicit val typeInfo = TypeInformation.of(classOf[String])
// Watch mode: keep monitoring the path for new data
val watchType = FileProcessingMode.PROCESS_CONTINUOUSLY
// File filter: filterPath returns true for paths to IGNORE, so negate
// to keep only .txt files; logic to skip in-progress files can go here too
val filePathFilter = new FilePathFilter {
  override def filterPath(filePath: Path): Boolean = {
    !filePath.getPath.endsWith(".txt")
  }
}
val dataStream = env.readFile(new selfFileInputFormat(), "", watchType, 60L, filePathFilter)
III. Collection-Based
1.fromCollection(Collection)
Creates a data stream from a Java java.util.Collection. All elements in the collection must be of the same type. After importing the Scala conversions, the corresponding Scala collections can be used as well; usage here is similar to DataSet.
val dataStream: DataStream[String] = env.fromCollection(Array("spark", "flink"))
2.fromCollection(Iterator, Class)
Creates a data stream from an iterator. The class parameter specifies the data type of the elements returned by the iterator.
val dataStream: DataStream[String] = env.fromCollection(Iterator("spark", "flink"))
3.fromElements(T ...)
Creates a data stream from the given sequence of objects. All objects must be of the same type.
val dataStream: DataStream[String] = env.fromElements("spark", "flink")
4.fromParallelCollection(SplittableIterator, Class)
Creates a data stream, in parallel, from an iterator. The class parameter specifies the data type of the elements returned by the iterator.
val itSequence = new NumberSequenceIterator(0, 100)
val dataStream = env.fromParallelCollection(itSequence)
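The way a SplittableIterator such as NumberSequenceIterator divides its range across parallel subtasks can be sketched as follows. This is a simplified stand-in for illustration, not Flink's exact partitioning logic:

```scala
// Sketch: divide the inclusive range [from, to] into `parallelism`
// contiguous, near-equal sub-ranges, one per parallel subtask.
def splitRange(from: Long, to: Long, parallelism: Int): Seq[(Long, Long)] = {
  val total = to - from + 1
  (0 until parallelism).map { i =>
    val start = from + i * total / parallelism
    val end   = from + (i + 1) * total / parallelism - 1
    (start, end)
  }
}
```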
5.generateSequence(from, to)
Generates, in parallel, the sequence of numbers in the given interval. (In newer Flink versions this method is deprecated in favor of fromSequence.)
val numbers = env.generateSequence(1, 10000000)
IV. Socket-Based
Read from Socket. Elements can be separated by delimiters.
1. Start Socket
Execute the following command in a local terminal; the text you then type will be sent to the socket:
nc -lk 9999
2. Read Socket
val env = StreamExecutionEnvironment.getExecutionEnvironment
val text = env.socketTextStream("localhost", 9999)
val counts = text.flatMap { _.toLowerCase.split("\\W+") filter { _.nonEmpty } }
.map { (_, 1) }
.keyBy(_._1)
.window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
.sum(1)
counts.print()
env.execute("Window Stream WordCount")
The above example uses keyBy to perform a wordCount on the words within 5 s tumbling windows; the following is the output:
3> (hello,1)
5> (world,1)
If you continue to input words, the program will emit a wordCount for each 5 s window.
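What each 5 s tumbling window computes can be modeled on plain Scala collections: within one window, keyBy(_._1).sum(1) amounts to grouping the (word, 1) pairs by key and summing. This is a conceptual model of the per-window result, not Flink's incremental aggregation:

```scala
// Model of one tumbling window's wordCount: tokenize the window's
// lines, pair each word with 1, group by word, and sum the 1s.
def windowWordCount(windowLines: Seq[String]): Map[String, Int] =
  windowLines
    .flatMap(_.toLowerCase.split("\\W+")).filter(_.nonEmpty)
    .map((_, 1))
    .groupBy(_._1)
    .map { case (word, pairs) => word -> pairs.map(_._2).sum }
```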
V. AddSource
1. Official API
The previous article mentioned the external systems supported by Flink and their corresponding supported modes. For the connector categories below that support source, the official API and Maven dependencies can be used to read and load data into a DataStream.
| Connector category | Way of support |
| --- | --- |
| Apache Kafka | source/sink |
| Apache Cassandra | sink |
| Amazon Kinesis Streams | source/sink |
| Elasticsearch | sink |
| FileSystem | sink |
| RabbitMQ | source/sink |
| Google PubSub | source/sink |
| Hybrid Source | source |
| Apache NiFi | source/sink |
| Apache Pulsar | source |
| Twitter Streaming API | source |
| JDBC | sink |
2. Self-Defined
A custom data source needs to extend RichSourceFunction[T] and define its data type T. It mainly implements the run method (produce data) and the cancel method (stop producing data). This is similar to a custom receiver in Spark Streaming or a custom spout in Storm. The following example keeps reading new content from a text file at 1 s intervals and emits it:
import java.io.{BufferedReader, FileReader}
import java.util.concurrent.TimeUnit
import org.apache.commons.lang3.StringUtils
import org.apache.flink.streaming.api.functions.source.{RichSourceFunction, SourceFunction}

class SourceFromFile extends RichSourceFunction[String] {
  @volatile private var isRunning = true

  override def run(ctx: SourceFunction.SourceContext[String]): Unit = {
    val bufferedReader = new BufferedReader(new FileReader("data.txt"))
    while (isRunning) {
      // readLine returns null at end of file; keep polling for appended lines
      val line = bufferedReader.readLine
      if (!StringUtils.isBlank(line)) {
        ctx.collect(line)
      }
      TimeUnit.SECONDS.sleep(1)
    }
  }

  override def cancel(): Unit = {
    isRunning = false
  }
}
val dataStream = env.addSource(new SourceFromFile()).print()
VI. Summary
Combined with the earlier Flink / Scala - DataSource article summarizing DataSet data acquisition, data acquisition for both of Flink's data structures, DataSet and DataStream, has now been covered. As a stream-processing engine, Flink is better at handling DataStream streaming data, and more stream-processing methods will be introduced later.