Flink + Kafka real-time WordCount, plus a fix for a common error

1. Flink

Introduction to Flink

Flink is a distributed processing engine for both streaming and batch data. It is implemented mainly in Java and is developed largely through open-source community contributions. Flink's primary target is streaming data; batch data is treated as just a special case of a stream. In other words, Flink views every task as a stream, which is its most distinctive feature. Flink also supports fast local iteration and certain cyclic (iterative) tasks.

Features of Flink:

Flink is an open-source framework for distributed stream processing:
1>. It keeps results accurate even when the data source is out of order or data arrives late.
2>. It is stateful and fault-tolerant, can recover seamlessly from failures, and maintains exactly-once semantics.
3>. It runs at large distributed scale.
4>. It is widely used in real-time computing scenarios (Blink, which powers the real-time transaction-volume display for Alibaba's Double Eleven, is based on Flink).

Flink guarantees exactly-once semantics for stateful computations; "state" here means the program can retain the results of the data it has already processed.
Flink supports stream processing and windowing with event-time semantics, offering flexible windows based on time, counts, or sessions.
Flink's fault tolerance is lightweight: it lets the system sustain high throughput while still providing exactly-once consistency guarantees, and it recovers from failures with zero data loss.
Flink delivers high throughput and low latency.
Flink's savepoints provide a version-control mechanism, so applications can be updated or historical data reprocessed without data loss and with minimal downtime.
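As an illustration of the windowing mentioned above, here is a minimal sketch (not from the original post) that turns the word count built later in this article into a count over 5-second tumbling windows, using the KeyedStream.timeWindow shorthand of this Flink generation; it assumes the stream and LineSplitter defined below:

import org.apache.flink.streaming.api.windowing.time.Time;

// count words over 5-second tumbling windows instead of keeping a global running total
DataStream<Tuple2<String, Integer>> windowedCounts = stream
        .flatMap(new LineSplitter())     // split each line into (word, 1) pairs
        .keyBy(0)                        // group by the word
        .timeWindow(Time.seconds(5))     // 5-second tumbling windows
        .sum(1);                         // sum the counts within each window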

2. Kafka

Introduction to Kafka

Kafka is an open-source stream-processing platform developed by the Apache Software Foundation and written in Scala and Java. It is a high-throughput distributed publish-subscribe messaging system that can handle all the action-stream data of a consumer-scale website. Such actions (web browsing, searches, and other user activity) are a key ingredient of many social features on the modern web, and because of the throughput requirements this data is usually handled through log processing and log aggregation. For log data destined for offline analysis systems such as Hadoop, which are limited in real-time processing, Kafka is a feasible solution. Kafka aims to unify online and offline message processing through Hadoop's parallel loading mechanism, and also to deliver real-time messages across a cluster.

Kafka characteristics

Kafka is a high-throughput distributed publish-subscribe messaging system with the following characteristics:
1>. It provides message persistence through an on-disk data structure that maintains stable performance even with terabytes of stored messages.
2>. High throughput: even on very ordinary hardware, Kafka can support millions of messages per second.
3>. It supports partitioning messages across Kafka servers and consuming them with consumer clusters.
4>. It supports parallel data loading into Hadoop.

Kafka installation, configuration, and basic use

Because this post has Flink consume Kafka data locally to implement WordCount, Kafka does not need much configuration: just download the installation package from the Apache official website and unzip it.
Here we create a topic named test.
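A topic-creation command for this Kafka generation would look like the following sketch (0.8/0.10-era releases create topics through ZooKeeper, assumed here to run on localhost:2181):

bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test

Then input a data stream in the console producer: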

bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test


Monitor the data stream written by the producer from the console consumer:

bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --from-beginning


3. Implementing WordCount with Flink and Kafka

1>. Create a Maven project


2>. Configure the pom dependencies required by Flink and the Flink Kafka connector

<dependencies>
        <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-clients -->
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-clients_2.11</artifactId>
            <version>1.0.0</version>
        </dependency>

        <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-streaming-java -->
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-streaming-java_2.11</artifactId>
            <version>1.0.0</version>
            <scope>provided</scope>
        </dependency>

        <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-java -->
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-java</artifactId>
            <version>1.0.0</version>
        </dependency>

        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-connector-kafka-0.8_2.11</artifactId>
            <version>1.0.0</version>
        </dependency>
    </dependencies>
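Note that flink-streaming-java is marked provided above. When the job is submitted to a Flink cluster, those classes come from the runtime; when running the main class directly from an IDE, the scope may need to be removed (or the IDE configured to include provided dependencies), otherwise the streaming classes will be missing from the classpath.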

3>. Obtain the Flink StreamExecutionEnvironment

final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

4>. Set the checkpoint interval (this configures Flink's state and checkpointing mechanism)

env.enableCheckpointing(1000);
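The checkpointing mode can also be passed explicitly; the following sketch is equivalent to the line above, since EXACTLY_ONCE is the default:

import org.apache.flink.streaming.api.CheckpointingMode;

env.enableCheckpointing(1000, CheckpointingMode.EXACTLY_ONCE);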

5>. Configure the IPs and ports of Kafka and ZooKeeper

Properties properties = new Properties();
properties.setProperty("bootstrap.servers", "192.168.1.20:9092");
properties.setProperty("zookeeper.connect", "192.168.1.20:2181");
properties.setProperty("group.id", "test");

6>. Load the Kafka and ZooKeeper configuration into a Flink Kafka consumer

FlinkKafkaConsumer08<String> myConsumer = new FlinkKafkaConsumer08<String>("test", new SimpleStringSchema(),properties);

7>. Add the consumer as a source, turning the Kafka data into a Flink DataStream

DataStream<String> stream = env.addSource(myConsumer);

8>. Implement the computation model and output the result: key the stream by the word (tuple field 0) and keep a running sum of the counts (tuple field 1)

DataStream<Tuple2<String, Integer>> counts = stream.flatMap(new LineSplitter()).keyBy(0).sum(1);

counts.print();

The specific logic of the computation model:

public static final class LineSplitter implements FlatMapFunction<String, Tuple2<String, Integer>> {
        private static final long serialVersionUID = 1L;

        public void flatMap(String value, Collector<Tuple2<String, Integer>> out) {
            // lowercase the line and split it on runs of non-word characters
            String[] tokens = value.toLowerCase().split("\\W+");
            for (String token : tokens) {
                if (token.length() > 0) {
                    // emit a (word, 1) pair for every non-empty token
                    out.collect(new Tuple2<String, Integer>(token, 1));
                }
            }
        }
    }

4. Verification

1>. Enter some lines in the Kafka producer console.

2>. The Flink client prints the updated counts immediately.
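As an illustration with assumed sample input: typing

hello flink hello kafka

into the producer console makes the Flink console print something like

2> (hello,1)
4> (flink,1)
2> (hello,2)
1> (kafka,1)

where the numeric prefix is the index of the parallel subtask that printed the record, and the count for a repeated word keeps increasing because sum(1) maintains a running total per key.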

Complete code

package com.scn;

import java.util.Properties;

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer08;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;
import org.apache.flink.util.Collector;

public class FilnkCostKafka {
    public static void main(String[] args) throws Exception {
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(1000); // checkpoint every 1000 ms

        // Kafka broker, ZooKeeper, and consumer-group settings
        Properties properties = new Properties();
        properties.setProperty("bootstrap.servers", "192.168.1.20:9092");
        properties.setProperty("zookeeper.connect", "192.168.1.20:2181");
        properties.setProperty("group.id", "test");

        // consume the "test" topic, deserializing each record as a plain string
        FlinkKafkaConsumer08<String> myConsumer = new FlinkKafkaConsumer08<String>("test", new SimpleStringSchema(),
                properties);

        DataStream<String> stream = env.addSource(myConsumer);

        // split lines into words, key by the word, and keep a running count per word
        DataStream<Tuple2<String, Integer>> counts = stream.flatMap(new LineSplitter()).keyBy(0).sum(1);

        counts.print();

        env.execute("WordCount from Kafka data");
    }

    public static final class LineSplitter implements FlatMapFunction<String, Tuple2<String, Integer>> {
        private static final long serialVersionUID = 1L;

        public void flatMap(String value, Collector<Tuple2<String, Integer>> out) {
            String[] tokens = value.toLowerCase().split("\\W+");
            for (String token : tokens) {
                if (token.length() > 0) {
                    out.collect(new Tuple2<String, Integer>(token, 1));
                }
            }
        }
    }

}

Problem: during testing, the client keeps connecting and disconnecting in a loop, repeatedly logging:

java.nio.channels.ClosedChannelException
    at kafka.network.BlockingChannel.send(BlockingChannel.scala:110)
    at kafka.consumer.SimpleConsumer.liftedTree1$1(SimpleConsumer.scala:98)
    at kafka.consumer.SimpleConsumer.kafka$consumer$SimpleConsumer$$sendRequest(SimpleConsumer.scala:83)
    at kafka.consumer.SimpleConsumer.getOffsetsBefore(SimpleConsumer.scala:149)
    at kafka.consumer.SimpleConsumer.earliestOrLatestOffset(SimpleConsumer.scala:188)
    at kafka.consumer.ConsumerFetcherThread.handleOffsetOutOfRange(ConsumerFetcherThread.scala:84)
    at kafka.server.AbstractFetcherThread$$anonfun$addPartitions$2.apply(AbstractFetcherThread.scala:187)
    at kafka.server.AbstractFetcherThread$$anonfun$addPartitions$2.apply(AbstractFetcherThread.scala:182)
    at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
    at scala.collection.immutable.Map$Map2.foreach(Map.scala:130)
    at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
    at kafka.server.AbstractFetcherThread.addPartitions(AbstractFetcherThread.scala:182)
    at kafka.server.AbstractFetcherManager$$anonfun$addFetcherForPartitions$2.apply(AbstractFetcherManager.scala:88)
    at kafka.server.AbstractFetcherManager$$anonfun$addFetcherForPartitions$2.apply(AbstractFetcherManager.scala:78)
    at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
    at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:224)
    at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403)
    at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
    at kafka.server.AbstractFetcherManager.addFetcherForPartitions(AbstractFetcherManager.scala:78)
    at kafka.consumer.ConsumerFetcherManager$LeaderFinderThread.doWork(ConsumerFetcherManager.scala:95)
    at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:63)

The above exception occurs because the host name of the Kafka server is not mapped to its IP address on the machine running the client.

On Linux the hosts file is /etc/hosts; on Windows it lives in C:\Windows\System32\drivers\etc.
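A sketch of the entry to add, assuming the broker machine's hostname is kafka-server (a hypothetical name; use the hostname the broker actually advertises):

192.168.1.20    kafka-server

Alternatively (an assumption about this setup, not from the original post), setting advertised.host.name in the broker's server.properties to the broker's IP address avoids depending on hostname resolution.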
 

 


Origin blog.csdn.net/xiaoyutongxue6/article/details/88861087