Flink is an open-source big data stream processing framework that handles both batch and stream processing, offering fault tolerance, high throughput, and low latency. This article briefly describes installing and running Flink on Windows and Linux, covering both a local debugging environment and a cluster environment, and then introduces how to set up a Flink development project.
First of all, to run Flink, we need to download and decompress Flink's binary package. The download address is as follows: https://flink.apache.org/downloads.html
We can choose a Flink distribution bundled with a specific Scala version; here we download the latest 1.9 release, Apache Flink 1.9.0 for Scala 2.12.
Once the download succeeds, Flink can be run on Windows either through the bundled bat scripts or through Cygwin.
On Linux, the setup differs between single-node, cluster, and Hadoop deployments.
Run via Windows bat file
First start a cmd command line window, enter the Flink folder, and run the start-cluster.bat script in the bin directory.
Note: Flink requires a Java runtime; make sure the Java environment variables are configured on the system.
$ cd flink
$ cd bin
$ start-cluster.bat
Starting a local cluster with one JobManager process and one TaskManager process.
You can terminate the processes via CTRL-C in the spawned shell windows.
Web interface by default on http://localhost:8081/.
After a successful startup, we can visit http://localhost:8081/ in a browser to see the Flink web dashboard.
Run via Cygwin
Cygwin is a UNIX-like emulation environment that runs on the Windows platform. Download it from the official site: http://cygwin.com/install.html
After the installation succeeds, start the Cygwin terminal and run the start-cluster.sh script.
$ cd flink
$ bin/start-cluster.sh
Starting cluster.
After a successful startup, we can visit http://localhost:8081/ in a browser to see the Flink web dashboard.
Install flink on Linux system
Single node installation
Single-node installation on Linux mirrors the Cygwin steps: download Apache Flink 1.9.0 for Scala 2.12, unzip it, and simply run start-cluster.sh.
Cluster installation
The cluster installation is divided into the following steps:
1. Copy the decompressed Flink directory to each machine.
2. Select one machine as the master node, then set the master's hostname in conf/flink-conf.yaml on every machine:
jobmanager.rpc.address: <master hostname>
3. Edit conf/slaves and list all worker nodes, one hostname per line:
work01
work02
4. Start the cluster on the master
bin/start-cluster.sh
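As a concrete sketch of steps 2 and 3, assuming a master host named master01 and the two workers work01 and work02 (all hostnames here are placeholders), conf/flink-conf.yaml would contain:

```yaml
# conf/flink-conf.yaml (identical on every machine): address of the JobManager (master) node
jobmanager.rpc.address: master01
```

and conf/slaves is a plain list of the worker hostnames (work01 and work02, one per line), which start-cluster.sh reads in order to launch a TaskManager on each worker via SSH.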
Install on Hadoop
We can also choose to run Flink on a YARN cluster.
Download the Flink for Hadoop package
Make sure HADOOP_HOME has been set correctly
Start bin/yarn-session.sh
Run the flink sample program
Batch processing example:
Submit Flink's bundled batch example program, which counts the number of occurrences of each word in its input:
$ bin/flink run examples/batch/WordCount.jar
Starting execution of program
Executing WordCount example with default input data set.
Use --input to specify file input.
Printing result to stdout. Use --output to specify output path.
(a,5)
(action,1)
(after,1)
(against,1)
(all,2)
(and,12)
(arms,1)
(arrows,1)
(awry,1)
(ay,1)
The run above used the example's built-in default data set; pass --input and --output to use your own files.
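To make the output format above concrete, the counting logic that the WordCount example performs can be sketched in plain Java. This is an illustrative stdlib sketch of the same word-frequency computation, not the Flink DataSet job itself:

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class WordCountSketch {
    // Split on non-word characters, lowercase, then count occurrences per word,
    // keeping the words sorted so the output matches Flink's (word,count) lines.
    public static Map<String, Long> count(String text) {
        return Arrays.stream(text.toLowerCase().split("\\W+"))
                .filter(w -> !w.isEmpty())
                .collect(Collectors.groupingBy(w -> w, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        count("To be, or not to be").forEach((w, c) ->
                System.out.println("(" + w + "," + c + ")"));
        // prints (be,2) (not,1) (or,1) (to,2), one pair per line
    }
}
```

The real job distributes exactly this kind of grouping and counting across the cluster's TaskManagers.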
We can also check the job's status in the web dashboard.
Stream processing example:
Start the nc server:
nc -l 9000
Submit Flink's streaming example program:
bin/flink run examples/streaming/SocketWindowWordCount.jar --port 9000
This is a stream processing example from Flink's examples directory: it reads text from the socket and counts the words that arrive within each time window.
Type some words into the nc session:
$ nc -l 9000
lorem ipsum
ipsum ipsum ipsum
bye
The counts appear in the TaskManager log output:
$ tail -f log/flink-*-taskexecutor-*.out
lorem : 1
bye : 1
ipsum : 4
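The per-window tally that produced the log lines above can likewise be sketched in plain Java. This is an illustrative stand-in for what SocketWindowWordCount does within one time window, not the actual Flink DataStream code:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WindowTallySketch {
    // Tally the words received during one window; when the window fires,
    // each entry is emitted as a "word : count" line, as seen in the log.
    public static Map<String, Integer> tallyWindow(List<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // the same input typed into nc in the example above
        tallyWindow(List.of("lorem ipsum", "ipsum ipsum ipsum", "bye"))
                .forEach((w, c) -> System.out.println(w + " : " + c));
    }
}
```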
Stop Flink:
$ ./bin/stop-cluster.sh
With Flink installed, the next step is to quickly scaffold a Flink project, after which you can start developing and running your own code.
Build tools
Flink projects can be built using different build tools. To get started quickly, Flink provides project templates for the following build tools:
- Maven
- Gradle
These templates can help you build the project structure and create the initial build file.
Maven
Environmental requirements
The only requirements are Maven 3.0.4 (or higher) and Java 8.x.
Create project
Use one of the following commands to create the project:
Use Maven archetypes
$ mvn archetype:generate \
-DarchetypeGroupId=org.apache.flink \
-DarchetypeArtifactId=flink-quickstart-java \
-DarchetypeVersion=1.9.0
Run quickstart script
curl https://flink.apache.org/q/quickstart.sh | bash -s 1.9.0
After downloading, check the project directory structure:
tree quickstart/
quickstart/
├── pom.xml
└── src
└── main
├── java
│ └── org
│ └── myorg
│ └── quickstart
│ ├── BatchJob.java
│ └── StreamingJob.java
└── resources
└── log4j.properties
The sample project is a Maven project containing two classes: StreamingJob and BatchJob, the basic skeleton programs for DataStream and DataSet programs respectively. Each class's main method is the program's entry point, used both for in-IDE testing/execution and for deployment.
We recommend importing this project into an IDE for development and testing. IntelliJ IDEA supports Maven projects out of the box. If you are using Eclipse, the m2e plugin can import Maven projects; some Eclipse bundles include this plugin by default, otherwise you need to install it manually.
Please note: for Flink, the default JVM heap memory may be too small, and you should increase it manually. In Eclipse, open Run Configurations -> Arguments and write -Xmx800m into the VM Arguments box. In IntelliJ IDEA, it is recommended to modify the JVM options from the menu Help | Edit Custom VM Options.
Build project
If you want to build/package your project, run the command mvn clean package in the project directory. Afterwards you will find a JAR file that contains your application, together with any connectors and libraries added as dependencies: target/<artifact-id>-<version>.jar.
Note: If you use a class other than StreamingJob as the application's main class/entry point, we recommend updating the mainClass setting in pom.xml accordingly, so that Flink can run the application from the JAR file without the main class being specified.
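For reference, in the generated quickstart pom.xml this setting lives in the maven-shade-plugin's manifest transformer; a minimal sketch of the relevant fragment (the class name shown is the template default, swap in your own entry class):

```xml
<!-- inside the maven-shade-plugin <configuration> of the quickstart pom.xml -->
<transformers>
  <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
    <mainClass>org.myorg.quickstart.StreamingJob</mainClass>
  </transformer>
</transformers>
```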
Gradle
Environmental requirements
The only requirements are to use Gradle 3.x (or higher) and install Java 8.x.
Create project
Use one of the following commands to create the project:
Gradle example:
build.gradle
buildscript {
    repositories {
        jcenter() // this applies only to the Gradle 'Shadow' plugin
    }
    dependencies {
        classpath 'com.github.jengelman.gradle.plugins:shadow:2.0.4'
    }
}

plugins {
    id 'java'
    id 'application'
    // shadow plugin to produce fat JARs
    id 'com.github.johnrengelman.shadow' version '2.0.4'
}

// artifact properties
group = 'org.myorg.quickstart'
version = '0.1-SNAPSHOT'
mainClassName = 'org.myorg.quickstart.StreamingJob'
description = """Flink Quickstart Job"""

ext {
    javaVersion = '1.8'
    flinkVersion = '1.9.0'
    scalaBinaryVersion = '2.11'
    slf4jVersion = '1.7.7'
    log4jVersion = '1.2.17'
}

sourceCompatibility = javaVersion
targetCompatibility = javaVersion
tasks.withType(JavaCompile) {
    options.encoding = 'UTF-8'
}

applicationDefaultJvmArgs = ["-Dlog4j.configuration=log4j.properties"]

task wrapper(type: Wrapper) {
    gradleVersion = '3.1'
}

// declare where to find the dependencies of your project
repositories {
    mavenCentral()
    maven { url "https://repository.apache.org/content/repositories/snapshots/" }
}

// NOTE: We cannot use the "compileOnly" or "shadow" configurations, since that would make
// the code impossible to run in the IDE or via "gradle run". We also cannot exclude
// transitive dependencies from the shadowJar
// (see https://github.com/johnrengelman/shadow/issues/159).
// -> Explicitly define the libraries we want to include in the "flinkShadowJar" configuration!
configurations {
    flinkShadowJar // dependencies which go into the shadowJar

    // always exclude these (also from transitive dependencies), since Flink provides them
    flinkShadowJar.exclude group: 'org.apache.flink', module: 'force-shading'
    flinkShadowJar.exclude group: 'com.google.code.findbugs', module: 'jsr305'
    flinkShadowJar.exclude group: 'org.slf4j'
    flinkShadowJar.exclude group: 'log4j'
}

// declare the dependencies for your production and test code
dependencies {
    // --------------------------------------------------------------
    // Compile-time dependencies that should not be included in the
    // shadow jar; Flink provides these in its lib directory.
    // --------------------------------------------------------------
    compile "org.apache.flink:flink-java:${flinkVersion}"
    compile "org.apache.flink:flink-streaming-java_${scalaBinaryVersion}:${flinkVersion}"

    // --------------------------------------------------------------
    // Dependencies that should be included in the shadow jar, e.g.
    // connectors. These must be in the flinkShadowJar configuration!
    // --------------------------------------------------------------
    //flinkShadowJar "org.apache.flink:flink-connector-kafka-0.11_${scalaBinaryVersion}:${flinkVersion}"

    compile "log4j:log4j:${log4jVersion}"
    compile "org.slf4j:slf4j-log4j12:${slf4jVersion}"

    // Add test dependencies here.
    // testCompile "junit:junit:4.12"
}

// make compileOnly dependencies available for tests:
sourceSets {
    main.compileClasspath += configurations.flinkShadowJar
    main.runtimeClasspath += configurations.flinkShadowJar

    test.compileClasspath += configurations.flinkShadowJar
    test.runtimeClasspath += configurations.flinkShadowJar

    javadoc.classpath += configurations.flinkShadowJar
}

run.classpath = sourceSets.main.runtimeClasspath

jar {
    manifest {
        attributes 'Built-By': System.getProperty('user.name'),
                'Build-Jdk': System.getProperty('java.version')
    }
}

shadowJar {
    configurations = [project.configurations.flinkShadowJar]
}
settings.gradle
rootProject.name = 'quickstart'
Or run quickstart script
bash -c "$(curl https://flink.apache.org/q/gradle-quickstart.sh)" -- 1.9.0 2.11
View the directory structure:
tree quickstart/
quickstart/
├── README
├── build.gradle
├── settings.gradle
└── src
└── main
├── java
│ └── org
│ └── myorg
│ └── quickstart
│ ├── BatchJob.java
│ └── StreamingJob.java
└── resources
└── log4j.properties
The sample project is a Gradle project containing two classes: StreamingJob and BatchJob, the basic skeleton programs for DataStream and DataSet programs. Each class's main method is the program's entry point, used both for in-IDE testing/execution and for deployment.
We recommend importing this project into your IDE for development and testing. IntelliJ IDEA supports Gradle projects after the Gradle plugin is installed. Eclipse supports Gradle projects through the Eclipse Buildship plugin (since the shadow plugin requires a recent Gradle version, make sure to specify Gradle version >= 3.0 in the last step of the import wizard). You can also use Gradle's IDE integration to create project files from Gradle.
Build project
If you want to build/package the project, run the command gradle clean shadowJar in the project directory. Afterwards you will find a JAR file that contains your application, together with any connectors and libraries added as dependencies: build/libs/<project>-<version>-all.jar.
Note: If you use a class other than StreamingJob as the application's main class/entry point, we recommend updating the mainClassName setting in build.gradle accordingly, so that Flink can run the application from the JAR file without the main class being specified.