Flink Series (Part 2): Setting Up the Flink Development Environment

1. Installing the Scala Plugin

Flink provides APIs for both the Java and Scala languages. If you want to develop Flink programs in Scala, you can install the Scala plugin in IDEA to get code hints, syntax highlighting, and other features. Open IDEA and click File => Settings => Plugins to open the plugin installation page, then search for the Scala plugin and install it. Restart IDEA after the installation completes for it to take effect.


2. Flink Project Initialization

2.1 Building with the Official Scaffolding

Flink officially supports two build tools, Maven and Gradle, for building Java-based Flink projects, and two build tools, SBT and Maven, for building Scala-based Flink projects. Maven is used as the example here because it can build both Java and Scala programs. Note that Flink 1.9 requires Maven 3.0.4 or higher. Once Maven is installed, you can build the project in either of two ways:

1. Building directly with the Maven Archetype

You can build directly with the following mvn statement, then follow the interactive prompts and enter the groupId, artifactId, package name, and other information in turn, and wait for the initialization to complete:

$ mvn archetype:generate                               \
      -DarchetypeGroupId=org.apache.flink              \
      -DarchetypeArtifactId=flink-quickstart-java      \
      -DarchetypeVersion=1.9.0

Note: If you want to create a Scala-based project, simply replace flink-quickstart-java with flink-quickstart-scala; the same applies below.

2. Building quickly with the official script

To make project initialization more convenient, the official site provides a quick-start script that can be invoked directly with the following command:

$ curl https://flink.apache.org/q/quickstart.sh | bash -s 1.9.0

This approach also initializes the project by executing the maven archetype command under the hood; the script it runs reads as follows:

PACKAGE=quickstart

mvn archetype:generate \
  -DarchetypeGroupId=org.apache.flink \
  -DarchetypeArtifactId=flink-quickstart-java \
  -DarchetypeVersion=${1:-1.8.0} \
  -DgroupId=org.myorg.quickstart \
  -DartifactId=$PACKAGE	\
  -Dversion=0.1 \
  -Dpackage=org.myorg.quickstart \
  -DinteractiveMode=false

Compared with the first approach, this one simply specifies the groupId, artifactId, and version information directly instead of prompting for it.

2.2 Building with IDEA

If the development tool you are using is IDEA, you can select the Flink Maven Archetype directly on the project creation page to initialize the project.


If this Archetype is not available in your IDEA, you can add it by clicking ADD ARCHETYPE in the upper right corner and filling in the required information, which can be taken from the archetype:generate statement above. After clicking OK to save, the Archetype will remain in your IDEA, and for each subsequent project you only need to select it directly.


After selecting the Flink Archetype, click the NEXT button; all remaining steps are the same as for a normal Maven project.

3. Project Structure

3.1 Project Structure

After creation completes, the automatically generated project contains two sample classes, BatchJob and StreamingJob.


BatchJob is the batch-processing sample; its source code is as follows:

import org.apache.flink.api.scala._

object BatchJob {
  def main(args: Array[String]) {
    val env = ExecutionEnvironment.getExecutionEnvironment
      // ... define batch transformations here ...
    env.execute("Flink Batch Scala API Skeleton")
  }
}

getExecutionEnvironment returns the execution environment for batch processing: when the program runs locally it returns a local execution environment, and when it runs on a cluster it returns the cluster's execution environment. If you want the execution environment for stream processing, you only need to replace ExecutionEnvironment with StreamExecutionEnvironment; the corresponding sample code is in StreamingJob:

import org.apache.flink.streaming.api.scala._

object StreamingJob {
  def main(args: Array[String]) {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
      // ... define streaming transformations here ...
    env.execute("Flink Streaming Scala API Skeleton")
  }
}


Note that for stream-processing programs the call to env.execute() is required, otherwise the streaming program will not be executed; for batch programs it is optional, since sinks such as print() trigger execution on their own.

3.2 Dependencies

The project skeleton created from the Maven archetype mainly provides the following core dependencies: flink-scala is used for developing batch programs, flink-streaming-scala is used for developing stream-processing programs, and scala-library provides the class library required by the Scala language. If you chose the Java language when creating the skeleton with Maven, the flink-java and flink-streaming-java dependencies are provided by default instead.

<!-- Apache Flink dependencies -->
<!-- These dependencies are provided, because they should not be packaged into the JAR file. -->
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-scala_${scala.binary.version}</artifactId>
    <version>${flink.version}</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-streaming-scala_${scala.binary.version}</artifactId>
    <version>${flink.version}</version>
    <scope>provided</scope>
</dependency>

<!-- Scala Library, provided by Flink as well. -->
<dependency>
    <groupId>org.scala-lang</groupId>
    <artifactId>scala-library</artifactId>
    <version>${scala.version}</version>
    <scope>provided</scope>
</dependency>
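As a rough orientation (an informal mapping, not an exhaustive description of each artifact), the imports used by the sample programs in this article come from these dependencies:

// Provided by flink-scala: the batch (DataSet) Scala API.
import org.apache.flink.api.scala._
// Provided by flink-streaming-scala: the streaming (DataStream) Scala API.
import org.apache.flink.streaming.api.scala._
// scala-library supplies the Scala standard library that both of the above build on.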

It is particularly worth noting that the scope of all the dependencies above is set to provided, which means they will not be packaged into the final JAR. This is because the Flink installation already ships these dependencies in flink-dist_*.jar under its lib directory, which contains all of Flink's core classes and dependencies.


However, marking the scope as provided causes a ClassNotFoundException when the project is started inside IDEA. For this reason, when a project is created with IDEA the following profile configuration is also generated automatically:

<!-- This profile helps to make things run out of the box in IntelliJ -->
<!-- It adds Flink's core classes to the runtime class path. -->
<!-- Otherwise they are missing in IntelliJ, because the dependency is 'provided' -->
<profiles>
    <profile>
        <id>add-dependencies-for-IDEA</id>

        <activation>
            <property>
                <name>idea.version</name>
            </property>
        </activation>

        <dependencies>
            <dependency>
                <groupId>org.apache.flink</groupId>
                <artifactId>flink-scala_${scala.binary.version}</artifactId>
                <version>${flink.version}</version>
                <scope>compile</scope>
            </dependency>
            <dependency>
                <groupId>org.apache.flink</groupId>
                <artifactId>flink-streaming-scala_${scala.binary.version}</artifactId>
                <version>${flink.version}</version>
                <scope>compile</scope>
            </dependency>
            <dependency>
                <groupId>org.scala-lang</groupId>
                <artifactId>scala-library</artifactId>
                <version>${scala.version}</version>
                <scope>compile</scope>
            </dependency>
        </dependencies>
    </profile>
</profiles>

In the profile whose id is add-dependencies-for-IDEA, all of the core dependencies are marked as compile. You therefore do not need to change any code; just check this profile in IDEA's Maven panel, and the Flink project can be run directly inside IDEA.


4. Word Count Example

After the project is created, you can write a simple word count example to try running the Flink project. The following uses the Scala language to walk through both a batch program and a stream-processing program:

4.1 Batch Example

import org.apache.flink.api.scala._

object WordCountBatch {

  def main(args: Array[String]): Unit = {
    val benv = ExecutionEnvironment.getExecutionEnvironment
    val dataSet = benv.readTextFile("D:\\wordcount.txt")
    dataSet.flatMap { _.toLowerCase.split(",")}
            .filter (_.nonEmpty)
            .map { (_, 1) }
            .groupBy(0)
            .sum(1)
            .print()
  }
}

The contents of wordcount.txt are as follows:

a,a,a,a,a
b,b,b
c,c
d,d

No other Flink environment needs to be configured on the local machine; the main method can be run directly, and the results can be observed in the console.
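For the sample input above, the printed counts should be (a,5), (b,3), (c,2), and (d,2), although the order of the tuples may vary.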


4.2 Streaming Example

import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

object WordCountStreaming {

  def main(args: Array[String]): Unit = {

    val senv = StreamExecutionEnvironment.getExecutionEnvironment

    val dataStream: DataStream[String] = senv.socketTextStream("192.168.0.229", 9999, '\n')
    dataStream.flatMap { line => line.toLowerCase.split(",") }
              .filter(_.nonEmpty)
              .map { word => (word, 1) }
              .keyBy(0)
              .timeWindow(Time.seconds(3))
              .sum(1)
              .print()
    senv.execute("Streaming WordCount")
  }
}

The program keys the stream by word and counts occurrences within 3-second windows of the text received on the specified host and port. For example, the following command can be used to open a service on that port:

nc -lk 9999

After entering test data, you can observe how the stream-processing program handles it.
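For example, if a,a,b is typed into the nc session within a single 3-second window, the program should print something like (a,2) and (b,1) once that window fires.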

5. Using the Scala Shell

For day-to-day demos, if you do not want to start IDEA every time just to observe a test result, you can, as with Spark, run programs directly in the Scala Shell, which is more intuitive and saves time for everyday learning. The Flink installation packages can be downloaded from:

https://flink.apache.org/downloads.html

Most Flink versions provide installation packages built against both Scala 2.11 and Scala 2.12.


After downloading and decompressing the package, the Scala Shell is located in the bin directory of the installation; it can be started in local mode directly with the following command:

./start-scala-shell.sh local

Once startup completes, the shell already provides batch execution environments (benv and btenv) and stream-processing execution environments (senv and stenv), so Flink programs written in Scala can be run directly in it.
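As a minimal sketch (not taken from the original screenshots), the following word count could be typed into the shell using the pre-created batch environment benv; the literal input strings simply mirror the wordcount.txt contents used earlier:

// Word count over an in-memory data set, using the shell's built-in batch environment benv.
val counts = benv
  .fromElements("a,a,a,a,a", "b,b,b", "c,c", "d,d")
  .flatMap(_.toLowerCase.split(","))
  .filter(_.nonEmpty)
  .map((_, 1))
  .groupBy(0)
  .sum(1)
counts.print()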


Finally, an explanation of a common exception: the Flink version used here is 1.9.1, and the exception below is thrown at startup. According to the official explanation, none of the current Scala 2.12 installation packages support the Scala Shell for the time being, so if you want to use the Scala Shell you can only choose a Scala 2.11 installation package.

[root@hadoop001 bin]# ./start-scala-shell.sh local
Error: Could not find or load main class org.apache.flink.api.scala.FlinkShell

More articles in this big data series can be found in the GitHub open source project: Big Data Getting Started


Origin juejin.im/post/5dd2661cf265da0bd20af2b3