Shang Silicon Valley Big Data Flink 1.17 Practical Tutorial - Notes 01【Flink Overview, Flink Quick Start】


Table of contents

Basics

Chapter 01 - Flink Overview

P001【001_Course Introduction】09:18

P002【002_Flink Overview_What is Flink】14:13

P003【003_Flink Overview_Flink Development History & Features】08:14

P004【004_Flink Overview_Differences from Spark Streaming & Application Scenarios & Layered APIs】12:50

Chapter 02 - Flink Quick Start

P005【005_Flink Quick Start_Create Maven Project & Import Dependencies】03:49

P006【006_Flink Quick Start_WordCount with Batch Processing】18:00

P007【007_Flink Quick Start_WordCount with Stream Processing_Coding】13:13

P008【008_Flink Quick Start_WordCount with Stream Processing_Demo & Comparison】06:03

P009【009_Flink Quick Start_WordCount with Stream Processing_Unbounded Stream_Coding】13:57

P010【010_Flink Quick Start_WordCount with Stream Processing_Unbounded Stream_Demo & Comparison】05:14


This tutorial is based on the new Flink 1.17 release and is divided into four parts: Basics, Core, Advanced, and SQL. Through animated explanations and hands-on case demonstrations, it walks you through building reliable and efficient data processing applications with Flink.

Basics

P001【001_Course Introduction】09:18

Who is using Flink? Large companies such as Tencent, Huawei, Didi, Alibaba, Kuaishou, and Amazon all use it.

Flink features

  1. Unified stream and batch processing
    1. The same code can run in either streaming or batch mode
    2. The same SQL can run in either streaming or batch mode
  2. Excellent performance
    1. High throughput
    2. Low latency
  3. Scalable computation
    1. Horizontally scalable architecture
    2. Support for very large state and incremental checkpointing
    3. At large companies, Flink applications:
      1. Process trillions of events per day
      2. Maintain several terabytes of state
      3. Run on thousands of CPU cores
  4. Ecosystem compatibility
    1. Integrates with YARN
    2. Integrates with Kubernetes
    3. Supports standalone deployment
  5. High fault tolerance
    1. Automatic retry on failure
    2. Consistent checkpoints
    3. Exactly-once state consistency guarantees in failure scenarios

Course Features

  1. Progresses from the basics to advanced topics
  2. Animations to explain key and difficult points
  3. Based on the latest version (1.17)
  4. Detailed content

Chapter 01 - Flink Overview

P002【002_Flink Overview_What is Flink】14:13

Flink's official website: https://flink.apache.org/

The core goal of Flink is "Stateful Computations over Data Streams".

To be specific: Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams.

P003【003_Flink Overview_Flink Development History & Features】08:14

Our goals for processing data are: low latency, high throughput, accuracy of results, and good fault tolerance.

The main features of Flink are as follows:

  1. High throughput and low latency: processes millions of events per second with millisecond-level latency.
  2. Accurate results: Flink provides both event-time and processing-time semantics. For out-of-order event streams, event-time semantics still yield consistent and accurate results.
  3. Exactly-once state consistency guarantees.
  4. Connects to the most commonly used external systems, such as Kafka, Hive, JDBC, HDFS, and Redis.
  5. High availability: with its high-availability setup, tight integration with Kubernetes, YARN, and Mesos, fast recovery from failures, and dynamic task scaling, Flink can run 7×24 with very little downtime.

P004【004_Flink Overview_Differences from Spark Streaming & Application Scenarios & Layered APIs】12:50

Table: Flink vs. Spark Streaming

                      | Flink                       | Spark Streaming
  Computational model | Stream computing            | Micro-batch
  Time semantics      | Event time, processing time | Processing time
  Windows             | Many, flexible              | Fewer, inflexible (a window must be an integer multiple of the batch interval)
  State               | Yes                         | No
  Streaming SQL       | Yes                         | No

Chapter 02 - Flink Quick Start

P005【005_Flink Quick Start_Create Maven Project & Import Dependencies】03:49
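In pom.xml, declare the Flink version as a property and add the two core dependencies: flink-streaming-java for the DataStream API and flink-clients for running and submitting jobs.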

    <properties>
        <flink.version>1.17.0</flink.version>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-streaming-java</artifactId>
            <version>${flink.version}</version>
        </dependency>

        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-clients</artifactId>
            <version>${flink.version}</version>
        </dependency>
    </dependencies>
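The unbounded-stream demo in P009 notes that seeing the Flink Web UI while running inside IDEA requires one more dependency. A minimal sketch of that optional addition, following the same pattern as above:

        <!-- Optional: enables the local Web UI when using
             StreamExecutionEnvironment.createLocalEnvironmentWithWebUI (see P009) -->
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-runtime-web</artifactId>
            <version>${flink.version}</version>
        </dependency>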

P006【006_Flink Quick Start_WordCount with Batch Processing】18:00

Ctrl + P (IntelliJ IDEA): view a method's parameter hints.

Contents of input/word.txt:

hello flink
hello world
hello java

package com.atguigu.wc;

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.operators.AggregateOperator;
import org.apache.flink.api.java.operators.DataSource;
import org.apache.flink.api.java.operators.FlatMapOperator;
import org.apache.flink.api.java.operators.UnsortedGrouping;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

/**
 * TODO Implement WordCount with the DataSet API
 */
public class WordCountBatchDemo {
    public static void main(String[] args) throws Exception {
        // TODO 1. Create the execution environment
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // TODO 2. Read data from a file
        DataSource<String> lineDS = env.readTextFile("input/word.txt");

        // TODO 3. Split lines and convert to (word, 1)
        FlatMapOperator<String, Tuple2<String, Integer>> wordAndOne = lineDS.flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
            @Override
            public void flatMap(String value, Collector<Tuple2<String, Integer>> out) throws Exception {
                // TODO 3.1 Split the line into words by spaces
                String[] words = value.split(" ");
                // TODO 3.2 Convert each word into a (word, 1) tuple
                for (String word : words) {
                    Tuple2<String, Integer> wordTuple2 = Tuple2.of(word, 1);
                    // TODO 3.3 Emit to downstream via the Collector
                    out.collect(wordTuple2);
                }
            }
        });

        // TODO 4. Group by word (position 0 of the tuple)
        UnsortedGrouping<Tuple2<String, Integer>> wordAndOneGroupBy = wordAndOne.groupBy(0);

        // TODO 5. Aggregate within each group
        AggregateOperator<Tuple2<String, Integer>> sum = wordAndOneGroupBy.sum(1); // 1 is the position index, i.e., the second field

        // TODO 6. Print the result
        sum.print();
    }
}
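With the three-line word.txt shown above, the batch job computes each word's final count once and prints one tuple per group; the exact output order may vary:

(java,1)
(world,1)
(hello,3)
(flink,1)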

P007【007_Flink Quick Start_WordCount with Stream Processing_Coding】13:13

2.2.2 Stream processing

package com.atguigu.wc;

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

/**
 * TODO Implement WordCount with the DataStream API: read from a file (bounded stream)
 *
 */
public class WordCountStreamDemo {
    public static void main(String[] args) throws Exception {
        // TODO 1. Create the execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // TODO 2. Read data from a file
        DataStreamSource<String> lineDS = env.readTextFile("input/word.txt");

        // TODO 3. Process the data: split, convert, group, aggregate
        // TODO 3.1 Split and convert
        SingleOutputStreamOperator<Tuple2<String, Integer>> wordAndOneDS = lineDS // <input type, output type>
                .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
                    @Override
                    public void flatMap(String value, Collector<Tuple2<String, Integer>> out) throws Exception {
                        // Split by spaces
                        String[] words = value.split(" ");
                        for (String word : words) {
                            // Convert to a (word, 1) tuple
                            Tuple2<String, Integer> wordsAndOne = Tuple2.of(word, 1);
                            // Emit to downstream via the Collector
                            out.collect(wordsAndOne);
                        }
                    }
                });
        // TODO 3.2 Group by key
        KeyedStream<Tuple2<String, Integer>, String> wordAndOneKS = wordAndOneDS.keyBy(
                new KeySelector<Tuple2<String, Integer>, String>() {
                    @Override
                    public String getKey(Tuple2<String, Integer> value) throws Exception {
                        return value.f0;
                    }
                }
        );
        // TODO 3.3 Aggregate
        SingleOutputStreamOperator<Tuple2<String, Integer>> sumDS = wordAndOneKS.sum(1);

        // TODO 4. Print the result
        sumDS.print();

        // TODO 5. Execute: similar to calling ssc.start() at the end of a Spark Streaming job
        env.execute();
    }
}

/**
 * Given an interface A with a method a():
 *
 * 1. The normal way to implement the interface:
 *    1.1 Define a class B that implements interface A and its method a()
 *    1.2 Create an instance of B:  B b = new B()
 *
 * 2. An anonymous implementation of the interface:
 *    new A() {
 *        a() {
 *            ...
 *        }
 *    }
 */

P008【008_Flink Quick Start_WordCount with Stream Processing_Demo & Comparison】06:03

The main differences to observe compared with the batch program WordCountBatchDemo:

  1. The execution environment is different: stream programs use a StreamExecutionEnvironment.
  2. The data object types returned by each transformation are different.
  3. Grouping calls the keyBy method, which takes an anonymous function as a key selector (KeySelector) to specify the current grouping key.
  4. At the end of the code, env's execute method must be called to start the job.

Running the streaming version on the same input also shows a difference in output, as sketched below.
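A possible run of WordCountStreamDemo on the three-line word.txt above (the `n>` prefix is the subtask index added by print(); prefixes and interleaving vary with parallelism and scheduling). Unlike the batch job, every incoming record triggers an updated, incremental count for its key:

3> (java,1)
6> (world,1)
5> (hello,1)
5> (hello,2)
5> (hello,3)
8> (flink,1)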

P009【009_Flink Quick Start_WordCount with Stream Processing_Unbounded Stream_Coding】13:57

2) Read the socket text stream

In a real production environment, data streams are in fact unbounded: they have a beginning but no end, so we must process the incoming data continuously. To simulate this scenario, we can listen on a socket port and keep sending data to that port.

[atguigu@node001 ~]$ sudo yum install -y netcat

[atguigu@node001 ~]$ nc -lk 7777

Here -l makes nc listen on the given port (7777), and -k keeps it listening for new connections after a client disconnects.

package com.atguigu.wc;

import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

/**
 * TODO Implement WordCount with the DataStream API: read from a socket (unbounded stream)
 *
 */
public class WordCountStreamUnboundedDemo {
    public static void main(String[] args) throws Exception {
        // TODO 1. Create the execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // When running in IDEA you can also get the Web UI; generally used for local testing
        // Requires the extra dependency flink-runtime-web
        // When running in IDEA without an explicit parallelism, the default is the number of logical cores on your machine
        // StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironmentWithWebUI(new Configuration());

        env.setParallelism(3);

        // TODO 2. Read data from a socket
        DataStreamSource<String> socketDS = env.socketTextStream("node001", 7777);

        // TODO 3. Process the data: split, convert, group, aggregate
        SingleOutputStreamOperator<Tuple2<String, Integer>> sum = socketDS
                .flatMap(
                        (String value, Collector<Tuple2<String, Integer>> out) -> {
                            String[] words = value.split(" ");
                            for (String word : words) {
                                out.collect(Tuple2.of(word, 1));
                            }
                        }
                )
                .setParallelism(2)
                // Java lambdas lose generic type information to erasure, so declare the return type explicitly
                .returns(Types.TUPLE(Types.STRING, Types.INT))
                // .returns(new TypeHint<Tuple2<String, Integer>>() {})
                .keyBy(value -> value.f0)
                .sum(1);

        // TODO 4. Print the result
        sum.print();

        // TODO 5. Execute
        env.execute();
    }
}

/**
 * Parallelism priority (highest to lowest):
 * operator-level in code > environment-level in code > specified at submission > config file
 */
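In the demo above, env.setParallelism(3) is the environment-level setting, and .setParallelism(2) on the flatMap operator overrides it for that operator alone. A parallelism given at submission time (for example, bin/flink run -p 4 ...) applies only where the code sets none, and parallelism.default in conf/flink-conf.yaml is the final fallback.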

P010【010_Flink Quick Start_WordCount with Stream Processing_Unbounded Stream_Demo & Comparison】05:14
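A sketch of the demo, assuming the job from P009 is running and nc -lk 7777 is live on node001: type lines into the nc session and the job prints updated counts immediately. For example, entering `hello flink` and then `hello world` might produce console output like this (subtask prefixes vary):

2> (hello,1)
3> (flink,1)
2> (hello,2)
1> (world,1)

Unlike the bounded-file version, the job never finishes on its own; it keeps waiting for new socket data until it is cancelled.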


Source: blog.csdn.net/weixin_44949135/article/details/130895033