Kafka Best Practices

foreword

Kafka best practices, involving

  1. Typical usage scenarios
  2. Best Practices for Kafka Usage

Kafka typical usage scenarios

Data Streaming

Kafka can be connected to multiple mainstream streaming data processing technologies such as Spark, Flink, and Flume. Taking advantage of the high throughput of Kafka, customers can establish a transmission channel through Kafka to transmit massive data from the application side to the stream data processing engine. After the data is processed and analyzed, it can support back-end big data analysis, AI model training, etc. business.

Kafka Data Stream

log platform

The most commonly used scenario of Kafka and the scenario I am most familiar with is the log analysis system. A typical implementation is to deploy a log collector (such as Fluentd, Filebeat, or Logstash, etc.) on the client side for log collection, and send the data to Kafka, and then perform data calculations through the back-end ES, etc., and then build a display layer such as Kibana. Presentation of statistical analysis data.

fluentd-kafka-es

fluentd-kafka-logstash

internet of things

The Internet of Things (IoT) is gaining traction as valuable use cases emerge. However, a key challenge is integrating devices and machines to process data in real time and at scale. Apache Kafka® and its surrounding ecosystem, including Kafka Connect, Kafka Streams, have become the technology of choice for integrating and processing such datasets.

Kafka is already used in many IoT deployments, including consumer IoT and industrial Internet of Things (IIoT). Most scenarios require reliable, scalable, and secure end-to-end integration that supports real-time bidirectional communication and data processing. Some specific use cases are:

  • Connected Vehicle Infrastructure
  • Smart Cities and Smart Homes
  • Smart Retail and Customer 360
  • smart manufacturing

The specific implementation architecture is shown in the figure below:

edge-datacenter-cloud

kafka-gateway

best practice to use

Reliability Best Practices

Satisfy different reliability based on producer and consumer configuration

ProducerAt Least Once

The producer needs to set request.required.acks = ALL, the master node on the server side writes successfully and the backup node synchronizes successfully to return the Response.

Consumer At Least Once

After receiving the message, the consumer should first perform the corresponding business operation, and then commit to indicate that the message has been processed . This processing method can ensure that a message can be consumed again when the business processing fails. Note that the consumer enable.auto.commitparameter needs to be set Falseto ensure that the commit action is manually controlled.

ProducerAt Most Once

To ensure that a message can be delivered at most once, it needs to be set request.required.acks = 0and set at the same time retries = 0. The principle here is that the producer does not retry when encountering any exceptions, and does not consider whether the broker responds to the successful writing.

Consumer At Most Once

To ensure that a message can be consumed at most once, consumers need to commit to indicate that the message has been processed after receiving the message, and then perform corresponding business operations . The principle here is that the consumer does not need to care about the processing result of the actual business, and immediately commits after receiving the message to tell the broker that the message has been successfully processed. Note that the consumer enable.auto.commitparameterFalse needs to be set to ensure that the commit action is manually controlled.

Producer Exactly-once

Since version 0.11 of Kafka , the semantics of idempotent messages has been added . By setting parameters, the message idempotence of a single partition can be realized.enable.idempotence=true

If the topic involves multiple partitions or multiple messages need to be encapsulated into one transaction to ensure idempotence, you need to increase Transaction control, as follows:

// 开启幂等控制参数
producerProps.put("enbale.idempotence", "true");
// 初始化事务
producer.initTransactions();
// 设置事务 ID
producerProps.put("transactional.id", "id-001");

try{
  // 开始事务,并在事务中发送 2 条消息
  producer.beginTranscation();
  producer.send(record0);
  producer.send(record1);
  // 提交事务
  producer.commitTranscation();
} catch( Exception e ) {
  producer.abortTransaction();
  producer.close();
}

Consumer Exactly-once

It needs to be set isolation.level=read_committedand set enable.auto.commit = falseto ensure that the consumer only consumes the messages that the producer has committed. The consumer business needs to ensure transactionality to avoid repeated processing of messages, such as persisting the message to the database, and then submitting the commit to the server.

Choose the appropriate semantics according to the business scenario

Use At Least Once semantics to support businesses that can accept a small number of repeated messages

At Least Once is the most commonly used semantics, which can ensure that messages are sent and consumed as much as possible, with a good balance between performance and reliability, and can be used as the default mode . The business side can also ensure idempotency by adding a unique business primary key to the message body , and ensure that messages with the same business primary key are only processed once on the consumer side.

Using Exactly Once semantic support requires strong idempotent business

Exactly Once semantics are generally used for key businesses that absolutely do not allow repetition. Typical cases are order and payment related scenarios .

Use At Most Once semantics to support non-critical business

At Most Once semantics are generally used in non-critical business , business is not sensitive to message loss , just need to try to ensure the successful production and consumption of messages. A typical scenario where At Most Once semantics is used is message notification , where a small number of missing messages has little impact. In contrast, sending notifications repeatedly will cause a poor user experience.

Performance Tuning Best Practices

Reasonably set the number of partitions for Topic

The following summarizes the dimensions that are recommended for tuning performance through partitions. It is recommended that you tune the overall performance of the system based on theoretical analysis and stress testing.

Consider dimensions illustrate
throughput Increasing the number of partitions can increase the concurrency of message consumption. When the bottleneck of the system lies in the consumption end, and the consumption end can be expanded horizontally, increasing the partition can increase the system throughput. Each partition under each topic in Kafka is an independent message processing channel. The messages in a partition can only be consumed by one consumer group at the same time. When the number of consumer groups exceeds the number of partitions, the redundant consumer group Idle will appear.
message sequence Kafka can guarantee the order of messages within a partition, but the order of messages between partitions cannot be guaranteed. When adding partitions, you need to consider the impact of message order on business.
Instance Partition upper limit The increase of Partition will consume more underlying resources such as memory, IO and file handles. When planning the number of partitions for a topic, you need to consider the upper limit of partitions that the Kafka cluster can support.

Description of the relationship between producers, consumers and partitions.

kafka_groups

Reasonably set the batch size

If the topic has multiple partitions, the producer needs to confirm which partition to send the message to first. When sending multiple messages to the same partition, the Producer client will package the related messages into a Batch and send them to the server in batches. Generally, a small batch will cause the Producer client to generate a large number of requests, causing the request queue to be queued at the client and the server, thereby pushing up message sending and consumption delays as a whole.

An appropriate batch size can reduce the number of requests initiated by the client to the server when sending messages, and improve the throughput and delay of message sending as a whole.

Batch parameters are described as follows:

parameter illustrate
batch.size The amount of cached messages sent to each partition (the sum of the bytes of the message content, not the number of messages). When the set value is reached, a network request will be triggered, and then the Producer client will send messages to the server in batches.
linger.ms The maximum time each message will be in the cache. If this time is exceeded, the Producer client will batch.sizeignore the limit and send the message to the server immediately.
buffer.memory When the total size of all cached messages exceeds this value, the message will be sent to the server, batch.sizeand the linger.msrestrictions of and will be ignored. buffer.memoryThe default value of is 32MB, which can guarantee sufficient performance for a single Producer.

There is no general method for selecting Batch-related parameter values. It is recommended to perform pressure testing and tuning for performance-sensitive business scenarios.

Use sticky partitions to handle bulk sends

Kafka producers and servers have a batch sending mechanism when sending messages, and only messages sent to the same Partition will be put into the same Batch. In a large batch sending scenario, if messages are scattered into multiple Partitions, multiple small batches may be formed, causing the batch sending mechanism to fail and reduce performance.

Kafka's default strategy for selecting partitions is as follows

Scenes Strategy
Message specified Key Hash the Key of the message, and then select a partition based on the hash result to ensure that messages with the same Key will be sent to the same partition.
The message does not specify a Key The default strategy is to cycle through all partitions of the topic and send messages to each partition in a round-robin fashion.

It can be seen from the default mechanism that the selection of partitions is very random. Therefore, in the scenario of mass transfer, it is recommended to set partitioner.classthe parameter and specify a custom partition selection algorithm to implement sticky partitions .

One of the implementation methods is to use the same partition for a fixed period of time, and switch to the next partition after a period of time to avoid data from being scattered into multiple different partitions.

General Best Practices

Kafka's Guarantee of Message Order

Kafka will guarantee the order of messages in the same partition. If there are multiple partitions in the topic, the global order cannot be guaranteed. If you need to guarantee the global order, you need to control the number of partitions to 1.

Set a unique Key for the message

The message of the message queue Kafka has two fields: Key (message identifier) ​​and Value (message content). For easy tracking, it is recommended to set a unique Key for the message. After that, you can track a message through the Key, print the sending log and consumption log, and understand the production and consumption of the message.

Reasonably set the retry strategy of the queue

In a distributed environment, due to reasons such as the network, messages may occasionally fail to be sent. The reason may be that the message has been sent successfully but the ACK mechanism failed or the message was not sent successfully. The default parameters can meet most scenarios, but the following retry parameters can be set as needed according to business needs:

parameter illustrate
retries The number of retries, the default value is 3, but for applications with zero tolerance for data loss, please consider setting it to Integer.MAX_VALUE(effective and maximum).
retry.backoff.ms The retry interval is recommended to be set to 1000.

:exclamation: Note:

If you want to implement At Most Once semantics, retries need to be turned off.

Access best practices

Connecting Spark Streaming to Kafka

Spark Streaming is an extension of Spark Core for high-throughput and fault-tolerant processing of persistent data. Currently, external inputs supported include Kafka, Flume, HDFS/S3, Kinesis, Twitter, and TCP socket.Alt text

Spark Streaming abstracts continuous data into DStream (Discretized Stream), and DStream consists of a series of continuous RDDs (elastic distributed data sets), and each RDD is data generated within a certain time interval. Using functions to process DStream is actually to process these RDDs.Alt text

When using Spark Streaming as the data input of Kafka, the stable version and experimental version of Kafka can be supported:

Kafka Version spark-streaming-kafka-0.8 spark-streaming-kafka-0.10
Broker Version 0.8.2.1 or higher 0.10.0 or higher
Api Maturity Deprecated Stable
Language Support Scala、Java、Python Scala、Java
Receiver DStream Yes No
Direct DStream Yes Yes
SSL / TLS Support No Yes
Offset Commit Api No Yes
Dynamic Topic Subscription No Yes

This practice uses the Kafka dependency of version 0.10.2.1.

Steps

Step 1: Create Kafka cluster and Topic

The steps to create a Kafka cluster are omitted, and then create a Topic testnamed .

Step 2: Prepare the server environment

Centos6.8 system

package version
sbt 0.13.16
hadoop 2.7.3
spark 2.1.0
protobuf 2.5.0
ssh CentOS default installation
Java 1.8

The specific installation steps are omitted, including the following steps:

  1. install sbt
  2. Install protobuf
  3. Install Hadoop
  4. Install Spark

Step 3: Connect to Kafka

Produce messages to Kafka

The Kafka dependency of version 0.10.2.1 is used here.

  1. build.sbtAdd dependencies in :
name := "Producer Example"
version := "1.0"
scalaVersion := "2.11.8"
libraryDependencies += "org.apache.kafka" % "kafka-clients" % "0.10.2.1"
  1. Configuration producer_example.scala:

    import java.util.Properties
    import org.apache.kafka.clients.producer._
    object ProducerExample extends App {
        val  props = new Properties()
        props.put("bootstrap.servers", "172.0.0.1:9092") //实例信息中的内网 IP 与端口
    
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    
        val producer = new KafkaProducer[String, String](props)
        val TOPIC="test"  //指定要生产的 Topic
        for(i<- 1 to 50){
            val record = new ProducerRecord(TOPIC, "key", s"hello $i") //生产 key 是"key",value 是 hello i 的消息
            producer.send(record)
        }
        val record = new ProducerRecord(TOPIC, "key", "the end "+new java.util.Date)
        producer.send(record)
        producer.close() //最后要断开
    }

For more usage of ProducerRecord, please refer to ProducerRecord documentation.

Consume messages from Kafka
DirectStream
  1. build.sbtAdd dependencies in :
name := "Consumer Example"
version := "1.0"
scalaVersion := "2.11.8"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.1.0"
libraryDependencies += "org.apache.spark" %% "spark-streaming" % "2.1.0"
libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.1.0"
  1. Configuration DirectStream_example.scala:
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.OffsetRange
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import collection.JavaConversions._
import Array._
object Kafka {
    def main(args: Array[String]) {
        val kafkaParams = Map[String, Object](
            "bootstrap.servers" -> "172.0.0.1:9092",
            "key.deserializer" -> classOf[StringDeserializer],
            "value.deserializer" -> classOf[StringDeserializer],
            "group.id" -> "spark_stream_test1",
            "auto.offset.reset" -> "earliest",
            "enable.auto.commit" -> "false"
        )

        val sparkConf = new SparkConf()
        sparkConf.setMaster("local")
        sparkConf.setAppName("Kafka")
        val ssc = new StreamingContext(sparkConf, Seconds(5))
        val topics = Array("spark_test")

        val offsets : Map[TopicPartition, Long] = Map()

        for (i <- 0 until 3){
            val tp = new TopicPartition("spark_test", i)
            offsets.updated(tp , 0L)
        }
        val stream = KafkaUtils.createDirectStream[String, String](
            ssc,
            PreferConsistent,
            Subscribe[String, String](topics, kafkaParams)
        )
        println("directStream")
        stream.foreachRDD{ rdd=>
            //输出获得的消息
            rdd.foreach{iter =>
                val i = iter.value
                println(s"${i}")
            }
            //获得offset
            val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
            rdd.foreachPartition { iter =>
                val o: OffsetRange = offsetRanges(TaskContext.get.partitionId)
                println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
            }
        }

        // Start the computation
        ssc.start()
        ssc.awaitTermination()
    }
}
RDD
  1. Configuration build.sbt(the configuration is the same as above, click to view ).
  2. Configuration RDD_example:
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.OffsetRange
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import collection.JavaConversions._
import Array._
object Kafka {
    def main(args: Array[String]) {
        val kafkaParams = Map[String, Object](
            "bootstrap.servers" -> "172.0.0.1:9092",
            "key.deserializer" -> classOf[StringDeserializer],
            "value.deserializer" -> classOf[StringDeserializer],
            "group.id" -> "spark_stream",
            "auto.offset.reset" -> "earliest",
            "enable.auto.commit" -> (false: java.lang.Boolean)
        )
        val sc = new SparkContext("local", "Kafka", new SparkConf())
        val java_kafkaParams : java.util.Map[String, Object] = kafkaParams
        //按顺序向 parition 拉取相应 offset 范围的消息,如果拉取不到则阻塞直到超过等待时间或者新生产消息达到拉取的数量
        val offsetRanges = Array[OffsetRange](
            OffsetRange("spark_test", 0, 0, 5),
            OffsetRange("spark_test", 1, 0, 5),
            OffsetRange("spark_test", 2, 0, 5)
        )
        val range = KafkaUtils.createRDD[String, String](
            sc,
            java_kafkaParams,
            offsetRanges,
            PreferConsistent
        )
        range.foreach(rdd=>println(rdd.value))
        sc.stop()
    }
}

For more kafkaParamsusage , refer to the kafkaParams documentation.

Connecting Flume to Kafka

Apache Flume 是一个分布式、可靠、高可用的日志收集系统,支持各种各样的数据来源(如 HTTP、Log 文件、JMS、监听端口数据等),能将这些数据源的海量日志数据进行高效收集、聚合、移动,最后存储到指定存储系统中(如 Kafka、分布式文件系统、Solr 搜索服务器等)。

Flume 基本结构如下:

Flume 以 agent 为最小的独立运行单位。一个 agent 就是一个 JVM,单个 agent 由 Source、Sink 和 Channel 三大组件构成。

Flume 与 Kafka

把数据存储到 HDFS 或者 HBase 等下游存储模块或者计算模块时需要考虑各种复杂的场景,例如并发写入的量以及系统承载压力、网络延迟等问题。Flume 作为灵活的分布式系统具有多种接口,同时提供可定制化的管道。 在生产处理环节中,当生产与处理速度不一致时,Kafka 可以充当缓存角色。Kafka 拥有 partition 结构以及采用 append 追加数据,使 Kafka 具有优秀的吞吐能力;同时其拥有 replication 结构,使 Kafka 具有很高的容错性。 所以将 Flume 和 Kafka 结合起来,可以满足生产环境中绝大多数要求。

准备工作

  • 下载 Apache Flume (1.6.0以上版本兼容 Kafka)
  • 下载 Kafka工具包 (0.9.x以上版本,0.8已经不支持)
  • 确认 Kafka 的 Source、 Sink 组件已经在 Flume 中。

接入方式

Kafka 可作为 Source 或者 Sink 端对消息进行导入或者导出。

Kafka Source

配置 kafka 作为消息来源,即将自己作为消费者,从 Kafka 中拉取数据传入到指定 Sink 中。主要配置选项如下:

配置项 说明
channels 自己配置的 Channel
type 必须为:org.apache.flume.source.kafka.KafkaSource
kafka.bootstrap.servers Kafka Broker 的服务器地址
kafka.consumer.group.id 作为 Kafka 消费端的 Group ID
kafka.topics Kafka 中数据来源 Topic
batchSize 每次写入 Channel 的大小
batchDurationMillis 每次写入最大间隔时间

示例:

tier1.sources.source1.type = org.apache.flume.source.kafka.KafkaSource 
tier1.sources.source1.channels = channel1
tier1.sources.source1.batchSize = 5000
tier1.sources.source1.batchDurationMillis = 2000
tier1.sources.source1.kafka.bootstrap.servers = localhost:9092
tier1.sources.source1.kafka.topics = test1, test2
tier1.sources.source1.kafka.consumer.group.id = custom.g.id

更多内容请参考 Apache Flume 官网

Kafka Sink

配置 Kafka 作为内容接收方,即将自己作为生产者,推到 Kafka Server 中等待后续操作。主要配置选项如下:

配置项 说明
channel 自己配置的 Channel
type 必须为:org.apache.flume.sink.kafka.KafkaSink
kafka.bootstrap.servers Kafka Broker 的服务器
kafka.topics Kafka 中数据来源 Topic
kafka.flumeBatchSize 每次写入的 Bacth 大小
kafka.producer.acks Kafka 生产者的生产策略

示例:

a1.sinks.k1.channel = c1
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.topic = mytopic
a1.sinks.k1.kafka.bootstrap.servers = localhost:9092
a1.sinks.k1.kafka.flumeBatchSize = 20
a1.sinks.k1.kafka.producer.acks = 1

更多内容请参考 Apache Flume 官网

Storm 接入 Kafka

Storm 是一个分布式实时计算框架,能够对数据进行流式处理和提供通用性分布式 RPC 调用,可以实现处理事件亚秒级的延迟,适用于对延迟要求比较高的实时数据处理场景。

Storm 工作原理

在 Storm 的集群中有两种节点,控制节点Master Node和工作节点Worker NodeMaster Node上运行Nimbus进程,用于资源分配与状态监控。Worker Node上运行Supervisor进程,监听工作任务,启动executor执行。整个 Storm 集群依赖zookeeper负责公共数据存放、集群状态监听、任务分配等功能。

用户提交给 Storm 的数据处理程序称为topology,它处理的最小消息单位是tuple,一个任意对象的数组。topologyspoutbolt构成,spout是产生tuple的源头,bolt可以订阅任意spoutbolt发出的tuple进行处理。

Storm with Kafka

Storm 可以把 Kafka 作为spout,消费数据进行处理;也可以作为bolt,存放经过处理后的数据提供给其它组件消费。

Centos6.8系统

package version
maven 3.5.0
storm 2.1.0
ssh 5.3
Java 1.8

前提条件

操作步骤

步骤1:创建 Topic

步骤2:添加 Maven 依赖

pom.xml 配置如下:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>storm</groupId>
  <artifactId>storm</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <name>storm</name> 
     <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    </properties>
    <dependencies>
        <dependency>
            <groupId>org.apache.storm</groupId>
            <artifactId>storm-core</artifactId>
            <version>2.1.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.storm</groupId>
            <artifactId>storm-kafka-client</artifactId>
            <version>2.1.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.kafka</groupId>
            <artifactId>kafka_2.11</artifactId>
            <version>0.10.2.1</version>
            <exclusions>
                <exclusion>
                    <groupId>org.slf4j</groupId>
                    <artifactId>slf4j-log4j12</artifactId>
                </exclusion>
            </exclusions>
        </dependency>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.12</version>
            <scope>test</scope>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <artifactId>maven-assembly-plugin</artifactId>
                <configuration>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                    <archive>
                        <manifest>
                            <mainClass>ExclamationTopology</mainClass>
                        </manifest>
                    </archive>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>
        </plugins>
    </build>
</project>

步骤3:生产消息

使用 spout/bolt

topology 代码:

//TopologyKafkaProducerSpout.java
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.StormSubmitter;
import org.apache.storm.kafka.bolt.KafkaBolt;
import org.apache.storm.kafka.bolt.mapper.FieldNameBasedTupleToKafkaMapper;
import org.apache.storm.kafka.bolt.selector.DefaultTopicSelector;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.utils.Utils;

import java.util.Properties;

public class TopologyKafkaProducerSpout {
    //申请的kafka实例ip:port
    private final static String BOOTSTRAP_SERVERS = "xx.xx.xx.xx:xxxx";
    //指定要将消息写入的topic
    private final static String TOPIC = "storm_test";
    public static void main(String[] args) throws Exception {
        //设置producer属性
        //函数参考:https://kafka.apache.org/0100/javadoc/index.html?org/apache/kafka/clients/consumer/KafkaConsumer.html
        //属性参考:http://kafka.apache.org/0102/documentation.html
        Properties properties = new Properties();
        properties.put("bootstrap.servers", BOOTSTRAP_SERVERS);
        properties.put("acks", "1");
        properties.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        properties.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        //创建写入kafka的bolt,默认使用fields("key" "message")作为生产消息的key和message,也可以在FieldNameBasedTupleToKafkaMapper()中指定
        KafkaBolt kafkaBolt = new KafkaBolt()
                .withProducerProperties(properties)
                .withTopicSelector(new DefaultTopicSelector(TOPIC))
                .withTupleToKafkaMapper(new FieldNameBasedTupleToKafkaMapper());
        TopologyBuilder builder = new TopologyBuilder();
        //一个顺序生成消息的spout类,输出field是sentence
        SerialSentenceSpout spout = new SerialSentenceSpout();
        AddMessageKeyBolt bolt = new AddMessageKeyBolt();
        builder.setSpout("kafka-spout", spout, 1);
        //为tuple加上生产到kafka所需要的fields
        builder.setBolt("add-key", bolt, 1).shuffleGrouping("kafka-spout");
        //写入kafka
        builder.setBolt("sendToKafka", kafkaBolt, 8).shuffleGrouping("add-key");

        Config config = new Config();
        if (args != null && args.length > 0) {
            //集群模式,用于打包jar,并放到storm运行
            config.setNumWorkers(1);
            StormSubmitter.submitTopologyWithProgressBar(args[0], config, builder.createTopology());
        } else {
            //本地模式
            LocalCluster cluster = new LocalCluster();
            cluster.submitTopology("test", config, builder.createTopology());
            Utils.sleep(10000);
            cluster.killTopology("test");
            cluster.shutdown();
        }

    }
}

创建一个顺序生成消息的 spout 类:

import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

import java.util.Map;
import java.util.UUID;

public class SerialSentenceSpout extends BaseRichSpout {

    private SpoutOutputCollector spoutOutputCollector;

    @Override
    public void open(Map map, TopologyContext topologyContext, SpoutOutputCollector spoutOutputCollector) {
        this.spoutOutputCollector = spoutOutputCollector;
    }

    @Override
    public void nextTuple() {
        Utils.sleep(1000);
        //生产一个UUID字符串发送给下一个组件
        spoutOutputCollector.emit(new Values(UUID.randomUUID().toString()));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer outputFieldsDeclarer) {
        outputFieldsDeclarer.declare(new Fields("sentence"));
    }
}

tuple 加上 key、message 两个字段,当 key 为 null 时,生产的消息均匀分配到各个 partition,指定了 key 后将按照 key 值 hash 到特定 partition 上:

//AddMessageKeyBolt.java
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class AddMessageKeyBolt extends BaseBasicBolt {

    @Override
    public void execute(Tuple tuple, BasicOutputCollector basicOutputCollector) {
        //取出第一个filed值
        String messae = tuple.getString(0);
        //System.out.println(messae);
        //发送给下一个组件
        basicOutputCollector.emit(new Values(null, messae));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer outputFieldsDeclarer) {
        //创建发送给下一个组件的schema
        outputFieldsDeclarer.declare(new Fields("key", "message"));
    }
}
使用 trident

使用 trident 类生成 topology:

//TopologyKafkaProducerTrident.java
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.StormSubmitter;
import org.apache.storm.kafka.trident.TridentKafkaStateFactory;
import org.apache.storm.kafka.trident.TridentKafkaStateUpdater;
import org.apache.storm.kafka.trident.mapper.FieldNameBasedTupleToKafkaMapper;
import org.apache.storm.kafka.trident.selector.DefaultTopicSelector;
import org.apache.storm.trident.TridentTopology;
import org.apache.storm.trident.operation.BaseFunction;
import org.apache.storm.trident.operation.TridentCollector;
import org.apache.storm.trident.tuple.TridentTuple;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

import java.util.Properties;

public class TopologyKafkaProducerTrident {
    //申请的kafka实例ip:port
    private final static String BOOTSTRAP_SERVERS = "xx.xx.xx.xx:xxxx";
    //指定要将消息写入的topic
    private final static String TOPIC = "storm_test";
    public static void main(String[] args) throws Exception {
        //设置producer属性
        //函数参考:https://kafka.apache.org/0100/javadoc/index.html?org/apache/kafka/clients/consumer/KafkaConsumer.html
        //属性参考:http://kafka.apache.org/0102/documentation.html
        Properties properties = new Properties();
        properties.put("bootstrap.servers", BOOTSTRAP_SERVERS);
        properties.put("acks", "1");
        properties.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        properties.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        //设置Trident
        TridentKafkaStateFactory stateFactory = new TridentKafkaStateFactory()
                .withProducerProperties(properties)
                .withKafkaTopicSelector(new DefaultTopicSelector(TOPIC))
                //设置使用fields("key", "value")作为消息写入  不像FieldNameBasedTupleToKafkaMapper有默认值
                .withTridentTupleToKafkaMapper(new FieldNameBasedTupleToKafkaMapper("key", "value"));
        TridentTopology builder = new TridentTopology();
        //一个批量产生句子的spout,输出field为sentence
        builder.newStream("kafka-spout", new TridentSerialSentenceSpout(5))
                .each(new Fields("sentence"), new AddMessageKey(), new Fields("key", "value"))
                .partitionPersist(stateFactory, new Fields("key", "value"), new TridentKafkaStateUpdater(), new Fields());

        Config config = new Config();
        if (args != null && args.length > 0) {
            //集群模式,用于打包jar,并放到storm运行
            config.setNumWorkers(1);
            StormSubmitter.submitTopologyWithProgressBar(args[0], config, builder.build());
        } else {
            //本地模式
            LocalCluster cluster = new LocalCluster();
            cluster.submitTopology("test", config, builder.build());
            Utils.sleep(10000);
            cluster.killTopology("test");
            cluster.shutdown();
        }

    }

    private static class AddMessageKey extends BaseFunction {

        @Override
        public void execute(TridentTuple tridentTuple, TridentCollector tridentCollector) {
            //取出第一个filed值
            String messae = tridentTuple.getString(0);
            //System.out.println(messae);
            //发送给下一个组件
            //tridentCollector.emit(new Values(Integer.toString(messae.hashCode()), messae));
            tridentCollector.emit(new Values(null, messae));
        }
    }
}

创建一个批量生成消息的 spout 类:

//TridentSerialSentenceSpout.java
import org.apache.storm.Config;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.trident.operation.TridentCollector;
import org.apache.storm.trident.spout.IBatchSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

import java.util.Map;
import java.util.UUID;

public class TridentSerialSentenceSpout implements IBatchSpout {

    private final int batchCount;

    public TridentSerialSentenceSpout(int batchCount) {
        this.batchCount = batchCount;
    }

    @Override
    public void open(Map map, TopologyContext topologyContext) {

    }

    @Override
    public void emitBatch(long l, TridentCollector tridentCollector) {
        Utils.sleep(1000);
        for(int i = 0; i < batchCount; i++){
            tridentCollector.emit(new Values(UUID.randomUUID().toString()));
        }
    }

    @Override
    public void ack(long l) {

    }

    @Override
    public void close() {

    }

    @Override
    public Map<String, Object> getComponentConfiguration() {
        Config conf = new Config();
        conf.setMaxTaskParallelism(1);
        return conf;
    }

    @Override
    public Fields getOutputFields() {
        return new Fields("sentence");
    }
}

步骤4:消费消息

使用 spout/bolt
//TopologyKafkaConsumerSpout.java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.StormSubmitter;
import org.apache.storm.kafka.spout.*;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

import java.util.HashMap;
import java.util.Map;

import static org.apache.storm.kafka.spout.FirstPollOffsetStrategy.LATEST;

public class TopologyKafkaConsumerSpout {
    //申请的kafka实例ip:port
    private final static String BOOTSTRAP_SERVERS = "xx.xx.xx.xx:xxxx";
    //指定要将消息写入的topic
    private final static String TOPIC = "storm_test";

    public static void main(String[] args) throws Exception {
        //设置重试策略
        KafkaSpoutRetryService kafkaSpoutRetryService = new KafkaSpoutRetryExponentialBackoff(
                KafkaSpoutRetryExponentialBackoff.TimeInterval.microSeconds(500),
                KafkaSpoutRetryExponentialBackoff.TimeInterval.milliSeconds(2),
                Integer.MAX_VALUE,
                KafkaSpoutRetryExponentialBackoff.TimeInterval.seconds(10)
        );
        ByTopicRecordTranslator<String, String> trans = new ByTopicRecordTranslator<>(
                (r) -> new Values(r.topic(), r.partition(), r.offset(), r.key(), r.value()),
                new Fields("topic", "partition", "offset", "key", "value"));
        //设置consumer参数
        //函数参考http://storm.apache.org/releases/1.1.0/javadocs/org/apache/storm/kafka/spout/KafkaSpoutConfig.Builder.html
        //参数参考http://kafka.apache.org/0102/documentation.html
        KafkaSpoutConfig spoutConfig = KafkaSpoutConfig.builder(BOOTSTRAP_SERVERS, TOPIC)
                .setProp(new HashMap<String, Object>(){
   
   {
                    put(ConsumerConfig.GROUP_ID_CONFIG, "test-group1"); //设置group
                    put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, "50000"); //设置session超时
                    put(ConsumerConfig.REQUEST_TIMEOUT_MS_CONFIG, "60000"); //设置请求超时
                }})
                .setOffsetCommitPeriodMs(10_000) //设置自动确认时间
                .setFirstPollOffsetStrategy(LATEST) //设置拉取最新消息
                .setRetry(kafkaSpoutRetryService)
                .setRecordTranslator(trans)
                .build();

        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("kafka-spout", new KafkaSpout(spoutConfig), 1);
        builder.setBolt("bolt", new BaseRichBolt(){
            private OutputCollector outputCollector;
            @Override
            public void declareOutputFields(OutputFieldsDeclarer outputFieldsDeclarer) {

            }

            @Override
            public void prepare(Map map, TopologyContext topologyContext, OutputCollector outputCollector) {
                this.outputCollector = outputCollector;
            }

            @Override
            public void execute(Tuple tuple) {
                System.out.println(tuple.getStringByField("value"));
                outputCollector.ack(tuple);
            }
        }, 1).shuffleGrouping("kafka-spout");

        Config config = new Config();
        config.setMaxSpoutPending(20);
        if (args != null && args.length > 0) {
            config.setNumWorkers(3);
            StormSubmitter.submitTopologyWithProgressBar(args[0], config, builder.createTopology());
        }
        else {
            LocalCluster cluster = new LocalCluster();
            cluster.submitTopology("test", config, builder.createTopology());
            Utils.sleep(20000);
            cluster.killTopology("test");
            cluster.shutdown();
        }
    }
}
使用 trident
//TopologyKafkaConsumerTrident.java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.StormSubmitter;
import org.apache.storm.generated.StormTopology;
import org.apache.storm.kafka.spout.ByTopicRecordTranslator;
import org.apache.storm.kafka.spout.trident.KafkaTridentSpoutConfig;
import org.apache.storm.kafka.spout.trident.KafkaTridentSpoutOpaque;
import org.apache.storm.trident.Stream;
import org.apache.storm.trident.TridentTopology;
import org.apache.storm.trident.operation.BaseFunction;
import org.apache.storm.trident.operation.TridentCollector;
import org.apache.storm.trident.tuple.TridentTuple;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

import java.util.HashMap;

import static org.apache.storm.kafka.spout.FirstPollOffsetStrategy.LATEST;


public class TopologyKafkaConsumerTrident {
    //申请的kafka实例ip:port
    private final static String BOOTSTRAP_SERVERS = "xx.xx.xx.xx:xxxx";
    //指定要将消息写入的topic
    private final static String TOPIC = "storm_test";

    public static void main(String[] args) throws Exception {
        ByTopicRecordTranslator<String, String> trans = new ByTopicRecordTranslator<>(
                (r) -> new Values(r.topic(), r.partition(), r.offset(), r.key(), r.value()),
                new Fields("topic", "partition", "offset", "key", "value"));
        //设置consumer参数
        //函数参考http://storm.apache.org/releases/1.1.0/javadocs/org/apache/storm/kafka/spout/KafkaSpoutConfig.Builder.html
        //参数参考http://kafka.apache.org/0102/documentation.html
        KafkaTridentSpoutConfig spoutConfig = KafkaTridentSpoutConfig.builder(BOOTSTRAP_SERVERS, TOPIC)
                .setProp(new HashMap<String, Object>(){
   
   {
                    put(ConsumerConfig.GROUP_ID_CONFIG, "test-group1"); //设置group
                    put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "true"); //设置自动确认
                    put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, "50000"); //设置session超时
                    put(ConsumerConfig.REQUEST_TIMEOUT_MS_CONFIG, "60000"); //设置请求超时
                }})
                .setFirstPollOffsetStrategy(LATEST) //设置拉取最新消息
                .setRecordTranslator(trans)
                .build();

        TridentTopology builder = new TridentTopology();
//      Stream spoutStream = builder.newStream("spout", new KafkaTridentSpoutTransactional(spoutConfig)); //事务型
        Stream spoutStream = builder.newStream("spout", new KafkaTridentSpoutOpaque(spoutConfig));
        spoutStream.each(spoutStream.getOutputFields(), new BaseFunction(){
            @Override
            public void execute(TridentTuple tridentTuple, TridentCollector tridentCollector) {
                System.out.println(tridentTuple.getStringByField("value"));
                tridentCollector.emit(new Values(tridentTuple.getStringByField("value")));
            }
        }, new Fields("message"));

        Config conf = new Config();
        conf.setMaxSpoutPending(20);conf.setNumWorkers(1);
        if (args != null && args.length > 0) {
            conf.setNumWorkers(3);
            StormSubmitter.submitTopologyWithProgressBar(args[0], conf, builder.build());
        }
        else {
            StormTopology stormTopology = builder.build();
            LocalCluster cluster = new LocalCluster();
            cluster.submitTopology("test", conf, stormTopology);
            Utils.sleep(10000);
            cluster.killTopology("test");
            cluster.shutdown();stormTopology.clear();
        }
    }
}

步骤5:提交 Storm

使用 mvn package 编译后,可以提交到本地集群进行 debug 测试,也可以提交到正式集群进行运行。

storm jar your_jar_name.jar topology_name
storm jar your_jar_name.jar topology_name tast_name

Logstash 接入 Kafka

Logstash 是一个开源的日志处理工具,可以从多个源头收集数据、过滤收集的数据并对数据进行存储作为其他用途。

Logstash 灵活性强,拥有强大的语法分析功能,插件丰富,支持多种输入和输出源。Logstash 作为水平可伸缩的数据管道,与 Elasticsearch 和 Kibana 配合,在日志收集检索方面功能强大。

Logstash 工作原理

Logstash 数据处理可以分为三个阶段:inputs → filters → outputs。

  1. inputs:产生数据来源,例如文件、syslog、redis 和 beats 此类来源。
  2. filters:修改过滤数据, 在 Logstash 数据管道中属于中间环节,可以根据条件去对事件进行更改。一些常见的过滤器包括:grok、mutate、drop 和 clone 等。
  3. outputs:将数据传输到其他地方,一个事件可以传输到多个 outputs,当传输完成后这个事件就结束。Elasticsearch 就是最常见的 outputs。

同时 Logstash 支持编码解码,可以在 inputs 和 outputs 端指定格式。

Logstash 接入 Kafka 的优势

  • 可以异步处理数据:防止突发流量。
  • 解耦:当 Elasticsearch 异常的时候不会影响上游工作。

:exclamation: 注意:​ Logstash 过滤消耗资源,如果部署在生产 server 上会影响其性能。

操作步骤

准备工作

步骤1:创建 Topic

创建一个名为 logstash_test的 Topic。

步骤2:接入 Kafka

作为 inputs 接入
  1. 执行 bin/logstash-plugin list,查看已经支持的插件是否含有 logstash-input-kafka

  2. .bin/ 目录下编写配置文件 input.conf。 此处将标准输出作为数据终点,将 Kafka 作为数据来源。

    input {
        kafka {
            bootstrap_servers => "xx.xx.xx.xx:xxxx" // kafka 实例接入地址
            group_id => "logstash_group"  // kafka groupid 名称
            topics => ["logstash_test"] // kafka topic 名称
            consumer_threads => 3 // 消费线程数,一般与 kafka 分区数一致
            auto_offset_reset => "earliest"
        }
    }
    output {
        stdout{codec=>rubydebug}
    }
  3. 执行以下命令启动 Logstash,进行消息消费。

    ./logstash -f input.conf

会看到刚才 Topic 中的数据被消费出来。

作为 outputs 接入
  1. 执行 bin/logstash-plugin list,查看已经支持的插件是否含有 logstash-output-kafka

  2. 在.bin/目录下编写配置文件 output.conf。 此处将标准输入作为数据来源,将 Kafka 作为数据目的地。

    input {
        input {
          stdin{}
      }
    }
    
    output {
       kafka {
            bootstrap_servers => "xx.xx.xx.xx:xxxx"  // ckafka 实例接入地址
            topic_id => "logstash_test" // ckafka topic 名称
           }
    }
  3. 执行如下命令启动 Logstash,向创建的 Topic 发送消息。

    ./logstash -f output.conf
  4. 启动 Kafka 消费者,检验上一步的生产数据。

    ./kafka-console-consumer.sh --bootstrap-server 172.0.0.1:9092 --topic logstash_test --from-begging --new-consumer

Filebeats 接入 Kafka

Beats 平台 集合了多种单一用途数据采集器。这些采集器安装后可用作轻量型代理,从成百上千或成千上万台机器向目标发送采集数据。 Beats 有多种采集器,您可以根据自身的需求下载对应的采集器。本文以 Filebeat(轻量型日志采集器)为例,向您介绍 Filebeat 接入 Kafka 的操作指方法,及接入后常见问题的解决方法。

前提条件

操作步骤

步骤1:创建 Topic

创建一个名为 test的 Topic。

步骤2:准备配置文件

进入 Filebeat 的安装目录,创建配置监控文件 filebeat.yml。

#======= Filebeat prospectors ==========
filebeat.prospectors:
- input_type: log 
# 此处为监听文件路径
  paths:
    - /var/log/messages

#=======  Outputs =========

#------------------ kafka -------------------------------------
output.kafka:
  version:0.10.2 // 根据不同 Kafka 集群版本配置
  # 设置为Kafka实例的接入地址
  hosts: ["xx.xx.xx.xx:xxxx"]
  # 设置目标topic的名称
  topic: 'test'
  partition.round_robin:
    reachable_only: false

  required_acks: 1
  compression: none
  max_message_bytes: 1000000

  # SASL 需要配置下列信息,如果不需要则下面两个选项可不配置
  username: "yourinstance#yourusername"  //username 需要拼接实例ID和用户名
  password: "yourpassword"

步骤4:Filebeat 发送消息

  1. 执行如下命令启动客户端。

    sudo ./filebeat -e -c filebeat.yml 
  2. 为监控文件增加数据(示例为写入监听的 testlog 文件)。

    echo ckafka1 >> testlog
    echo ckafka2 >> testlog
    echo ckafka3 >> testlog
  3. 开启 Consumer 消费对应的 Topic,获得以下数据。

    {"@timestamp":"2017-09-29T10:01:27.936Z","beat":{"hostname":"10.193.9.26","name":"10.193.9.26","version":"5.6.2"},"input_type":"log","message":"ckafka1","offset":500,"source":"/data/ryanyyang/hcmq/beats/filebeat-5.6.2-linux-x86_64/testlog","type":"log"}
    {"@timestamp":"2017-09-29T10:01:30.936Z","beat":{"hostname":"10.193.9.26","name":"10.193.9.26","version":"5.6.2"},"input_type":"log","message":"ckafka2","offset":508,"source":"/data/ryanyyang/hcmq/beats/filebeat-5.6.2-linux-x86_64/testlog","type":"log"}
    {"@timestamp":"2017-09-29T10:01:33.937Z","beat":{"hostname":"10.193.9.26","name":"10.193.9.26","version":"5.6.2"},"input_type":"log","message":"ckafka3","offset":516,"source":"/data/ryanyyang/hcmq/beats/filebeat-5.6.2-linux-x86_64/testlog","type":"log"}

SASL/PLAINTEXT 模式

如果您需要进行 SALS/PLAINTEXT 配置,则需要配置用户名与密码。 在 Kafka 配置区域新增加 username 和 password 配置即可。

参考链接

消息队列 CKafka - 文档中心 - 腾讯云 (tencent.com)

Three people walk together, there must be my teacher; knowledge sharing, the world is for the public. This article is written by Dongfeng Weiming technical blog EWhisper.cn .

Guess you like

Origin blog.csdn.net/east4ming/article/details/129485758