Apache Flink: Overview and Getting Started

What is Apache Flink?

Apache Flink is a distributed engine for large-scale data processing that performs stateful computations over bounded (finite) and unbounded (infinite) data streams. It can be deployed in a variety of cluster environments and performs computations quickly over data of any size.

Distributed engine for large-scale data processing
  • A distributed, highly available compute engine for large-scale data processing

    Bounded and unbounded streams
  • Bounded stream: a data stream with a defined beginning and end, i.e. batch data in the traditional sense; handled by batch processing
  • Unbounded stream: a data stream that never ends, i.e. real-time data; handled by stream processing

    Stateful computation
  • A well-designed state mechanism enables better fault tolerance and task recovery, and makes Exactly-Once semantics achievable

    Various cluster environments
  • Can be deployed standalone, as Flink on YARN, Flink on Mesos, Flink on Kubernetes (k8s), etc.

Flink Application

Streams

Data in the real world is produced continuously, so data processing should reflect that reality: true stream processing. Batch processing is a special case of stream processing.

  • These are the bounded and unbounded streams described above; the official website provides a diagram illustrating them.

State

In stream-computing scenarios, all stream computations are essentially incremental processing.
For example, when computing a metric (PV, UV, etc.) over the past several hours, the result of each computation, i.e. the state, must be saved after a batch of data has been processed and merged into the next computation.
Furthermore, by retaining computation state and checkpointing it, stream computation can achieve fault tolerance, task recovery, and Exactly-Once processing semantics, as the sketch below illustrates.
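
A minimal sketch of this idea, assuming the classic DataStream API also used later in this article (the class and state names are illustrative): a RichFlatMapFunction keeps a running count per key in Flink-managed ValueState, merging each new element into the previously saved result.

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

// Illustrative example: incrementally count words, merging each new element
// into the previously saved result (the state)
public class RunningCount extends RichFlatMapFunction<Tuple2<String, Integer>, Tuple2<String, Integer>> {

    // Flink-managed state: the running count for the current key
    private transient ValueState<Integer> countState;

    @Override
    public void open(Configuration parameters) {
        countState = getRuntimeContext().getState(
                new ValueStateDescriptor<>("count", Integer.class));
    }

    @Override
    public void flatMap(Tuple2<String, Integer> value,
                        Collector<Tuple2<String, Integer>> out) throws Exception {
        Integer previous = countState.value();      // null before the first element of this key
        int updated = (previous == null ? 0 : previous) + value.f1;
        countState.update(updated);                 // save the state for the next computation
        out.collect(Tuple2.of(value.f0, updated));
    }
}

Used after keyBy, this state is included in Flink's checkpoints, which is exactly what makes task recovery and Exactly-Once semantics possible.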

Time

Flink distinguishes three notions of time (see the sketch after this list):

  • Event Time: the time at which the event was actually generated
  • Processing Time: the time at which the event is processed by the Flink program
  • Ingestion Time: the time at which the event enters the Flink program
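
A minimal sketch of putting Event Time to work, assuming the classic (pre-1.12) API that this article's example also uses: the program extracts each event's own timestamp and emits watermarks that tolerate events arriving up to five seconds out of order. The data and field layout are illustrative.

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.windowing.time.Time;

public class EventTimeSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Switch from the default (Processing Time) to Event Time
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

        env.fromElements(Tuple2.of("a", 1_000L), Tuple2.of("b", 4_000L))
           // Extract each event's own timestamp (f1) and emit watermarks
           // that tolerate events arriving up to 5 seconds out of order
           .assignTimestampsAndWatermarks(
                   new BoundedOutOfOrdernessTimestampExtractor<Tuple2<String, Long>>(Time.seconds(5)) {
                       @Override
                       public long extractTimestamp(Tuple2<String, Long> element) {
                           return element.f1; // f1 holds the event-generation time
                       }
                   })
           .print();

        env.execute("Event Time Sketch");
    }
}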

API

Flink's APIs are divided into three layers: the closer to the SQL layer, the higher the abstraction and the lower the flexibility, but the easier the API is to use.

  • SQL / Table layer: process data directly with SQL (see the sketch after this list)
  • DataStream / DataSet API: the core APIs for stream and batch processing, on which custom WaterMark, Window, State and other operations can be implemented
  • ProcessFunction: also called the runtime layer; the lowest-level API, stateful and event-driven
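
A minimal sketch of the topmost layer, assuming a Flink version contemporary with this article with the flink-table dependencies on the classpath (exact package paths vary by version; the table and column names are illustrative): a DataStream is registered as a table and queried directly with SQL.

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.java.StreamTableEnvironment;
import org.apache.flink.types.Row;

public class SqlLayerSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);

        DataStream<Tuple2<String, Integer>> words =
                env.fromElements(Tuple2.of("hello", 1), Tuple2.of("flink", 1), Tuple2.of("hello", 1));

        // Register the stream as a table with named columns, then query it with SQL
        tableEnv.registerDataStream("WordCounts", words, "word, cnt");
        Table result = tableEnv.sqlQuery("SELECT word, SUM(cnt) FROM WordCounts GROUP BY word");

        // A grouped aggregation produces updates, hence a retract stream
        tableEnv.toRetractStream(result, Row.class).print();
        env.execute("SQL Layer Sketch");
    }
}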

Flink Architecture

Data Pipeline Applications

I.e., real-time streaming ETL: the ETL flow is split into a continuous stream.
Traditionally, ETL is performed by periodically scheduling SQL files or MapReduce jobs. In real-time ETL scenarios, the batch ETL logic is rewritten as stream processing, which spreads out the computational load and improves timeliness.
Commonly used for real-time data warehouses and real-time search engines; a sketch of such a pipeline follows.
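
A hypothetical sketch of such a streaming ETL pipeline (the source, transformations and sink path are illustrative assumptions, not from the original article): raw lines are read, cleaned and reshaped on the fly, and written out continuously.

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class StreamEtlSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.socketTextStream("localhost", 9000)   // Extract: hypothetical raw-event source
           .filter(line -> !line.isEmpty())       // Transform: drop empty records
           .map(String::toLowerCase)              // Transform: normalize
           .writeAsText("/tmp/etl-out");          // Load: hypothetical sink path

        env.execute("Streaming ETL Sketch");
    }
}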

Data Analytics Applications

I.e., data analytics, covering both streaming and batch analysis of data, such as real-time reports and real-time dashboards.

Event-driven Applications

I.e., event-driven applications. In a conventional design, computation state is stored in a third-party system (e.g. HBase, Redis).
In Flink, state is stored inside the program itself, avoiding unnecessary I/O costs and giving higher throughput and lower-latency access to state, as in the sketch below.
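
A minimal sketch of an event-driven function (the names are illustrative, not from the original article): a KeyedProcessFunction counts events per key in Flink-internal state and uses a processing-time timer to emit the count one minute after the last event, with no external store involved.

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Illustrative event-driven example: state lives inside Flink, no external store needed
public class CountWithTimeout extends KeyedProcessFunction<String, String, String> {

    private transient ValueState<Long> countState;

    @Override
    public void open(Configuration parameters) {
        countState = getRuntimeContext().getState(
                new ValueStateDescriptor<>("count", Long.class));
    }

    @Override
    public void processElement(String event, Context ctx, Collector<String> out) throws Exception {
        Long count = countState.value();
        countState.update(count == null ? 1L : count + 1);
        // React to the event by arming a timer one minute in the future
        ctx.timerService().registerProcessingTimeTimer(
                ctx.timerService().currentProcessingTime() + 60_000);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) throws Exception {
        out.collect(ctx.getCurrentKey() + " -> " + countState.value());
    }
}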

The first Flink program

Development environment

Code Example:

package source.streamDataSource;


import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;


public class SocketWindowWordCount {

    public static void main(String[] args) throws Exception {

        if (args.length != 2) {
            System.err.println("Usage:\nSocketWindowWordCount <hostname> <port>");
            return;
        }

        // Read the program arguments
        String hostname = args[0];
        int port = Integer.parseInt(args[1]);

        // Entry class, used to set up the environment, parameters, etc.
        StreamExecutionEnvironment see = StreamExecutionEnvironment.getExecutionEnvironment();
        
        // Set the time characteristic
        see.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime);

        // Read stream data from the given host and port; returns a DataStreamSource
        DataStreamSource<String> text = see.socketTextStream(hostname, port, "\n", 5);

        // Apply operations, i.e. transformations, on the DataStreamSource
        DataStream<Tuple2<String, Integer>> windowCount = text
                // flatMap, an implementation of the FlatMapFunction interface: splits the input and emits each element as a (word, count) pair
                .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
            @Override
            public void flatMap(String value, Collector<Tuple2<String, Integer>> collector) throws Exception {
                for (String word : value.split("\\s")) {
                    collector.collect(Tuple2.of(word, 1));
                }
            }
        })
                // Specify the key by position and aggregate on it
                .keyBy(0)
                // Specify the window size
                .timeWindow(Time.seconds(5))
                // Sum over field 1 for each key
                // (an equivalent reduce implementation is shown below)
//                .reduce(new ReduceFunction<Tuple2<String, Integer>>() {
//                    @Override
//                    public Tuple2<String, Integer> reduce(Tuple2<String, Integer> stringIntegerTuple2, Tuple2<String, Integer> t1) throws Exception {
//                        return Tuple2.of(stringIntegerTuple2.f0, stringIntegerTuple2.f1+t1.f1);
//                    }
//                });
                .sum(1);

        // Execute the print sink with a single thread (parallelism 1)
        windowCount.print().setParallelism(1);
        see.execute("Socket Window WordCount");

        // Examples of other transformation operations
//        windowCount
//                .map(new MapFunction<Tuple2<String,Integer>, String>() {
//                    @Override
//                    public String map(Tuple2<String, Integer> stringIntegerTuple2) throws Exception {
//                        return stringIntegerTuple2.f0;
//                    }
//                })
//                .print();
//
//        text.filter(new FilterFunction<String>() {
//            @Override
//            public boolean filter(String s) throws Exception {
//                return s.contains("h");
//            }
//        })
//                .print();
//
//        SplitStream<String> split = text.split(new OutputSelector<String>() {
//            @Override
//            public Iterable<String> select(String value) {
//                ArrayList<String> strings = new ArrayList<>();
//                if (value.contains("h"))
//                    strings.add("hadoop");
//                else
//                    strings.add("noHadoop");
//                return strings;
//
//            }
//        });
//
//        split.select("hadoop").print();
//        split.select("noHadoop").map(new MapFunction<String, String>() {
//            @Override
//            public String map(String s) throws Exception {
//
//                return s.toUpperCase();
//            }
//        }).print();

    }
}
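
To try the example locally, you can, for instance, first start a text source with nc -lk 9000 and then run the job with localhost 9000 as its two arguments; each line typed into the netcat session is split into words, and the per-window counts are printed every five seconds.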
