Flink / Scala - DataSource: A Summary of Getting Data with DataStream

1. Introduction

The DataStream API gets its name from the special DataStream class used to represent collections of data in Flink programs. You can think of them as immutable collections of data that may contain duplicates. The data can be bounded (finite) or unbounded (infinite), but the API used to process it is the same.

A DataStream is similar in usage to a regular Java collection, but differs in some key ways. It is immutable, which means you cannot add or remove elements once it is created. You also cannot simply inspect its internal elements; you can only work on them through DataStream API operations, also known as transformations.

You create an initial DataStream by adding a source to your Flink program. You can then derive new DataStreams from it using API methods such as map and filter, and join the derived streams together. As before, processing a DataStream mainly consists of the combination Source + Transformation + Sink:

 Tips:

Unlike before, where the execution environment for the DataSet API was:

    val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment

the execution environment for the DataStream API is:

    val env = StreamExecutionEnvironment.getExecutionEnvironment

 

2. File-Based

Most of the interfaces here are similar to those of the DataSet API. Because the env is different, the resulting type is different as well: DataStream instead of DataSet.

1.readTextFile(path)

Reads text files, i.e. files conforming to the TextInputFormat specification, line by line and returns them as strings.

    val textLines = env.readTextFile("path")

2.readFile(fileInputFormat, path)

Read (once) the file according to the specified file input format.

    // Custom FileInputFormat skeleton; the ??? bodies are placeholders (see the sketch below)
    class selfFileInputFormat() extends FileInputFormat[String] {
      override def reachedEnd(): Boolean = ???

      override def nextRecord(ot: String): String = ???
    }

    val dataStream = env.readFile(new selfFileInputFormat(), "")
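Since the skeleton above leaves reachedEnd and nextRecord unimplemented (??? throws NotImplementedError at runtime), here is a minimal sketch of what a concrete line-by-line implementation could look like. It is an illustration only, not from the original: the class name LineFileInputFormat is made up, it relies on the stream field inherited from FileInputFormat, and split-boundary handling is omitted for brevity.

    import java.io.{BufferedReader, InputStreamReader}
    import org.apache.flink.api.common.io.FileInputFormat
    import org.apache.flink.core.fs.FileInputSplit

    // Hypothetical concrete implementation: reads the assigned split line by line
    class LineFileInputFormat extends FileInputFormat[String] {
      private var reader: BufferedReader = _
      private var nextLine: String = _

      override def open(split: FileInputSplit): Unit = {
        super.open(split)                                  // opens the inherited `stream` for this split
        reader = new BufferedReader(new InputStreamReader(stream))
        nextLine = reader.readLine()                       // pre-fetch so reachedEnd() stays cheap
      }

      override def reachedEnd(): Boolean = nextLine == null

      override def nextRecord(reuse: String): String = {
        val current = nextLine
        nextLine = reader.readLine()
        current
      }

      override def close(): Unit = {
        if (reader != null) reader.close()
        super.close()
      }
    }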

3.readFile(fileInputFormat, path, watchType, interval, pathFilter, typeInfo)

The two methods above are convenience calls; under the hood they both invoke this method, which reads the files on the given path according to the provided fileInputFormat. Depending on the watchType provided, the source may periodically (every interval milliseconds) monitor the path for new data.

Tips:

Under the hood, Flink splits the file reading process into two subtasks, directory monitoring and data reading. Each subtask is implemented by a separate entity. Monitoring is implemented by a single non-parallel (parallelism = 1) task, while reading is performed by multiple tasks running in parallel, whose parallelism equals the parallelism of the job. The role of the single monitoring task is to scan the directory (periodically or only once, depending on the watchType), find the files to process, divide them into splits, and assign those splits to the downstream readers. The readers are the ones that actually read the data. Each split is read by exactly one reader, while a reader can read multiple splits one after another.

FileProcessingMode.PROCESS_CONTINUOUSLY

When a file is modified, its contents are reprocessed in full. This can break "exactly-once" semantics, since appending data to the end of a file causes the entire file to be reprocessed.

FileProcessingMode.PROCESS_ONCE

The source scans the path once and exits; it does not wait for the readers to finish reading the files. The readers, of course, keep reading until all file contents have been read. Once the source is closed, there are no more checkpoints after that point. This can result in slower recovery after a node failure, because the job will resume reading from the last checkpoint.

    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Custom FileInputFormat (skeleton)
    class selfFileInputFormat() extends FileInputFormat[String] {
      override def reachedEnd(): Boolean = ???

      override def nextRecord(ot: String): String = ???
    }
    // Implicit TypeInformation (typeInfo)
    implicit val typeInfo = TypeInformation.of(classOf[String])
    // Watch mode (watchType)
    val watchType = FileProcessingMode.PROCESS_CONTINUOUSLY
    // File path filter; filterPath returns true for paths that should be IGNORED,
    // so here everything except .txt files is skipped. Logic to skip files that
    // are still being written could also be added here.
    val filePathFilter = new FilePathFilter {
      override def filterPath(filePath: Path): Boolean = {
        !filePath.getPath.endsWith(".txt")
      }
    }

    val dataStream = env.readFile(new selfFileInputFormat(), "", watchType, 60L, filePathFilter)

3. Collection-Based

1.fromCollection(Collection)

Creates a data stream from a Java java.util.Collection. All elements in the collection must be of the same type. With the Scala conversions, the corresponding Scala collections can also be used, and usage is similar to that of DataSet; a sketch using a java.util.List follows the example below.

    val dataStream: DataStream[String] = env.fromCollection(Array("spark", "flink"))
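As a sketch of the java.util.Collection variant (illustration only; it assumes the standard org.apache.flink.streaming.api.scala._ import is in scope to provide the implicit TypeInformation):

    import java.util
    import scala.collection.JavaConverters._
    import org.apache.flink.streaming.api.scala._

    // Convert a java.util.List to a Scala Seq before handing it to fromCollection
    val javaList: util.List[String] = util.Arrays.asList("spark", "flink")
    val fromJavaList: DataStream[String] = env.fromCollection(javaList.asScala)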

2.fromCollection(Iterator, Class)

Creates a data stream from an iterator. The class parameter specifies the data type of the elements returned by the iterator.

    val dataStream: DataStream[String] = env.fromCollection(Iterator("spark", "flink"))

3.fromElements(T ...)

Creates a data stream from the given sequence of objects. All objects must be of the same type.

    val dataStream: DataStream[String] = env.fromElements("spark", "flink")

4.fromParallelCollection(SplittableIterator, Class)

Creates a data stream, in parallel, from an iterator. The class parameter specifies the data type of the elements returned by the iterator.

    val itSequence = new NumberSequenceIterator(0, 100)
    val dataStream = env.fromParallelCollection(itSequence)

5.generateSequence(from, to)

Generates the sequence of numbers in the given interval, in parallel.

    val numbers = env.generateSequence(1, 10000000)

4. Socket-Based

Reads from a socket. Elements can be separated by a delimiter; a sketch of the overload with an explicit delimiter follows.
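A minimal sketch of the overload that takes an explicit delimiter and retry count (illustration only; the host, port, and values are placeholders):

    // hostname, port, record delimiter, number of reconnect attempts
    val lines = env.socketTextStream("localhost", 9999, '\n', 3)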

1. Start Socket

Run the following command in a local terminal, then type in some words:

    nc -lk 9999

 

2. Read Socket

    val env = StreamExecutionEnvironment.getExecutionEnvironment

    val text = env.socketTextStream("localhost", 9999)

    val counts = text.flatMap { _.toLowerCase.split("\\W+") filter { _.nonEmpty } }
      .map { (_, 1) }
      .keyBy(_._1)
      .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
      .sum(1)

    counts.print()

    env.execute("Window Stream WordCount")

The above example uses keyBy to perform a word count over the words in a 5 s tumbling window; the output looks like this:

    3> (hello,1)
    5> (world,1)

If you keep entering words, the program emits word counts for each 5 s window.

5. AddSource

1. Official API

A previous article covered the external connectors supported by Flink and the modes each supports. The connector categories below that support source can be used, via the official APIs and the corresponding Maven dependencies, to read and load data into a DataStream (a Kafka sketch follows the table).

Connector                 Supported as
Apache Kafka              source/sink
Apache Cassandra          sink
Amazon Kinesis Streams    source/sink
Elasticsearch             sink
FileSystem                sink
RabbitMQ                  source/sink
Google PubSub             source/sink
Hybrid Source             source
Apache NiFi               source/sink
Apache Pulsar             source
Twitter Streaming API     source
JDBC                      sink
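As an illustration of the official-API route, a Kafka source could be wired up roughly as follows. This is a sketch only, not from the original: it assumes the flink-connector-kafka Maven dependency and a recent Flink version, and the broker address, topic, and group id are placeholders.

    import org.apache.flink.api.common.eventtime.WatermarkStrategy
    import org.apache.flink.api.common.serialization.SimpleStringSchema
    import org.apache.flink.connector.kafka.source.KafkaSource
    import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer
    import org.apache.flink.streaming.api.scala._

    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Build a KafkaSource that reads string values from a topic
    val kafkaSource = KafkaSource.builder[String]()
      .setBootstrapServers("localhost:9092")
      .setTopics("demo-topic")
      .setGroupId("demo-group")
      .setStartingOffsets(OffsetsInitializer.earliest())
      .setValueOnlyDeserializer(new SimpleStringSchema())
      .build()

    // Turn the source into a DataStream; no watermarks, since this sketch ignores event time
    val kafkaStream: DataStream[String] =
      env.fromSource(kafkaSource, WatermarkStrategy.noWatermarks(), "Kafka Source")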

2. Custom Source

A custom data source extends RichSourceFunction[T], where T is the data type it produces. It mainly implements the run method (produce data) and the cancel method (stop producing data). This is similar to a custom receiver in Spark Streaming or a custom spout in Storm. The following example keeps reading new content from a text file at 1 s intervals and emits it:

    import java.io.{BufferedReader, FileReader}
    import java.util.concurrent.TimeUnit

    import org.apache.commons.lang3.StringUtils
    import org.apache.flink.streaming.api.functions.source.{RichSourceFunction, SourceFunction}

    class SourceFromFile extends RichSourceFunction[String] {
      private var isRunning = true

      override def run(ctx: SourceFunction.SourceContext[String]): Unit = {
        val bufferedReader = new BufferedReader(new FileReader("data.txt"))
        while (isRunning) {
          // readLine returns null once the current end of the file is reached
          val line = bufferedReader.readLine
          if (!StringUtils.isBlank(line)) {
            ctx.collect(line)
          }
          // poll for newly appended content once per second
          TimeUnit.SECONDS.sleep(1)
        }
      }

      override def cancel(): Unit = {
        isRunning = false
      }
    }

    val dataStream = env.addSource(new SourceFromFile())
    dataStream.print()

 

6. Summary

Combined with the earlier article, Flink / Scala - DataSource: a summary of getting data with DataSet, the acquisition of Flink's two data structures, DataSet and DataStream, has now been covered. As a stream processing engine, Flink is better suited to processing DataStream streaming data; future articles will introduce more stream processing methods.


Origin: blog.csdn.net/BIT_666/article/details/123332887