Foreword
This article covers Kafka best practices, including:
- Typical usage scenarios
- Best practices for using Kafka
Typical Kafka Usage Scenarios
Data Streaming
Kafka integrates with mainstream stream-processing technologies such as Spark, Flink, and Flume. Taking advantage of Kafka's high throughput, customers can build a transmission channel through Kafka to move massive amounts of data from the application side to a stream-processing engine. After the data is processed and analyzed, it can feed downstream big data analysis, AI model training, and other businesses.
Log Platform
The most common Kafka scenario, and the one I am most familiar with, is the log analysis system. A typical implementation deploys a log collector (such as Fluentd, Filebeat, or Logstash) on the client side to collect logs and send them to Kafka; the data is then processed by a backend such as Elasticsearch, and finally presented through a display layer such as Kibana for statistical analysis.
Internet of Things
The Internet of Things (IoT) is gaining traction as valuable use cases emerge. However, a key challenge is integrating devices and machines to process data in real time and at scale. Apache Kafka® and its surrounding ecosystem, including Kafka Connect and Kafka Streams, have become the technology of choice for integrating and processing such datasets.
Kafka is already used in many IoT deployments, including consumer IoT and industrial Internet of Things (IIoT). Most scenarios require reliable, scalable, and secure end-to-end integration that supports real-time bidirectional communication and data processing. Some specific use cases are:
- Connected Vehicle Infrastructure
- Smart Cities and Smart Homes
- Smart Retail and Customer 360
- Smart Manufacturing
Best Practices for Kafka Usage
Reliability Best Practices
Meeting different reliability requirements through producer and consumer configuration
Producer At Least Once
The producer needs to set request.required.acks = ALL (equivalent to acks = all in the Java client), so that the server returns a response only after the leader has written the message successfully and the in-sync replicas have synchronized it.
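As a minimal sketch (assuming the Java client; the broker address and topic name are placeholders), an at-least-once producer can be configured like this:
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class AtLeastOnceProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("acks", "all");   // wait for the leader and all in-sync replicas
        props.put("retries", 3);    // retry transient failures; duplicates are possible
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("demo-topic", "key", "value"));
        }
    }
}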
Consumer At Least Once
After receiving a message, the consumer should first perform the corresponding business operation and only then commit the offset to indicate that the message has been processed. This ensures that a message can be consumed again if business processing fails. Note that the consumer parameter enable.auto.commit needs to be set to false to ensure that the commit action is controlled manually.
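A minimal sketch of an at-least-once consumer, assuming a recent Java client (broker address, topic, and group ID are placeholders): process the records first, then commit manually.
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class AtLeastOnceConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "demo-group");
        props.put("enable.auto.commit", "false"); // commit offsets manually
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("demo-topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(1000));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.println(record.key() + " -> " + record.value()); // business processing first...
                }
                consumer.commitSync(); // ...then commit, so a processing failure leads to re-consumption
            }
        }
    }
}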
Producer At Most Once
To ensure that a message is delivered at most once, set request.required.acks = 0 and retries = 0 at the same time. The principle is that the producer does not retry on any exception and does not care whether the broker acknowledges the write.
Consumer At Most Once
To ensure that a message is consumed at most once, the consumer commits the offset immediately after receiving the message, and only then performs the corresponding business operation. The principle is that the consumer does not care about the actual result of business processing; it commits right after receiving the message to tell the broker that the message has been handled. Note that the consumer parameter enable.auto.commit needs to be set to false to ensure that the commit action is controlled manually.
Producer Exactly-once
Since Kafka 0.11, idempotent message semantics have been available: by setting enable.idempotence=true, message idempotence within a single partition can be achieved.
If the topic involves multiple partitions, or multiple messages need to be wrapped into one transaction to guarantee idempotence, transaction control needs to be added, as shown below:
// enable idempotence and configure the transactional ID before creating the producer
producerProps.put("enable.idempotence", "true");
producerProps.put("transactional.id", "id-001");
Producer<String, String> producer = new KafkaProducer<>(producerProps);
// initialize transactions
producer.initTransactions();
try {
    // begin a transaction and send 2 messages within it
    producer.beginTransaction();
    producer.send(record0);
    producer.send(record1);
    // commit the transaction
    producer.commitTransaction();
} catch (Exception e) {
    producer.abortTransaction();
    producer.close();
}
Consumer Exactly-once
The consumer needs to set isolation.level=read_committed and enable.auto.commit = false to ensure that it only consumes messages that the producer has committed. The consumer's business logic also needs to be transactional to avoid processing a message twice, for example by persisting the message to a database and committing the offset to the server afterwards.
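A minimal sketch of the corresponding consumer configuration (Java client, imports as in the at-least-once consumer sketch above; names are placeholders):
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("group.id", "demo-group");
props.put("isolation.level", "read_committed"); // only read messages from committed transactions
props.put("enable.auto.commit", "false");       // commit offsets manually, ideally together with the business transaction
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);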
Choose the appropriate semantics according to the business scenario
Use At Least Once semantics to support businesses that can accept a small number of repeated messages
At Least Once is the most commonly used semantics. It ensures that messages are produced and consumed whenever possible, strikes a good balance between performance and reliability, and can be used as the default mode. The business side can also guarantee idempotence by adding a unique business key to the message body and ensuring that messages with the same business key are processed only once on the consumer side.
Use Exactly Once semantics to support businesses that require strong idempotence
Exactly Once semantics are generally used for critical businesses that cannot tolerate duplicates at all. Typical cases are order and payment related scenarios.
Use At Most Once semantics to support non-critical business
At Most Once semantics are generally used for non-critical businesses that are not sensitive to message loss and only need best-effort production and consumption of messages. A typical scenario for At Most Once is message notification, where losing a few messages has little impact, whereas sending a notification repeatedly leads to a poor user experience.
Performance Tuning Best Practices
Reasonably set the number of partitions for Topic
The following table summarizes the dimensions to consider when tuning performance through the number of partitions. It is recommended to tune the overall performance of the system based on both theoretical analysis and stress testing.
Dimension | Description |
---|---|
Throughput | Increasing the number of partitions increases the concurrency of message consumption. When the bottleneck lies on the consumer side and consumers can be scaled out horizontally, adding partitions increases system throughput. Each partition of a topic in Kafka is an independent message channel, and a partition can be consumed by only one consumer of a consumer group at a time; when the number of consumers in a group exceeds the number of partitions, the extra consumers sit idle. |
Message order | Kafka guarantees the order of messages within a partition, but not across partitions. When adding partitions, consider the impact of message ordering on the business. |
Instance partition limit | More partitions consume more underlying resources such as memory, IO, and file handles. When planning the number of partitions for a topic, consider the upper limit of partitions the Kafka cluster can support. |
Figure: the relationship between producers, consumers, and partitions.
Reasonably set the batch size
If a topic has multiple partitions, the producer first determines which partition to send each message to. When multiple messages are sent to the same partition, the producer client packs them into a batch and sends them to the server in one request. In general, a small batch causes the producer client to generate a large number of requests, which queue up on both the client and the server and push up the overall latency of message production and consumption.
An appropriate batch size reduces the number of requests the client sends to the server, improving overall throughput and reducing the latency of message sending.
Batch parameters are described as follows:
Parameter | Description |
---|---|
batch.size | The amount of message data cached per partition (the total size in bytes of the message contents, not the number of messages). When this value is reached, a network request is triggered and the producer client sends the messages to the server in a batch. |
linger.ms | The maximum time each message stays in the cache. If this time is exceeded, the producer client ignores the batch.size limit and immediately sends the messages to the server. |
buffer.memory | When the total size of all cached messages exceeds this value, the messages are sent to the server, and the batch.size and linger.ms limits are ignored. The default value of buffer.memory is 32 MB, which guarantees sufficient performance for a single producer. |
There is no universal rule for choosing batch-related parameter values. For performance-sensitive business scenarios, it is recommended to determine them through stress testing and tuning.
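As an illustration only (the values below are arbitrary starting points, not recommendations), the batch-related parameters are set on the producer configuration, continuing the producer sketch above:
props.put("batch.size", 65536);        // 64 KB of messages per partition before a send is triggered
props.put("linger.ms", 10);            // wait up to 10 ms to fill a batch
props.put("buffer.memory", 33554432);  // 32 MB total producer buffer (the default)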
Use sticky partitions to handle bulk sends
Kafka producers batch messages when sending them to the server: only messages destined for the same partition are put into the same batch. In large-batch sending scenarios, if messages are scattered across multiple partitions, many small batches may be formed, defeating the batching mechanism and degrading performance.
Kafka's default partition selection strategy is as follows:
Scenario | Strategy |
---|---|
The message specifies a key | Hash the message key and choose a partition based on the hash result, so that messages with the same key are sent to the same partition. |
The message does not specify a key | The default strategy is to cycle through all partitions of the topic and send messages to each partition in a round-robin fashion. |
As the default mechanism shows, partition selection is quite random. In bulk sending scenarios, it is therefore recommended to set the partitioner.class parameter and specify a custom partition selection algorithm to implement sticky partitioning.
One way to implement this is to use the same partition for a fixed period of time and then switch to the next partition, so that data is not scattered across many different partitions. A sketch of such a partitioner is shown below.
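A minimal sketch of such a time-based sticky partitioner, assuming the Java client (the class name and switching interval are illustrative, not a definitive implementation):
import java.util.Map;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;

public class TimeBasedStickyPartitioner implements Partitioner {
    private static final long STICKY_INTERVAL_MS = 30_000L; // switch partitions every 30 s (illustrative)

    @Override
    public void configure(Map<String, ?> configs) { }

    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        // all messages produced within the same time window go to the same partition
        long window = System.currentTimeMillis() / STICKY_INTERVAL_MS;
        return (int) (window % numPartitions);
    }

    @Override
    public void close() { }
}
Register it on the producer with props.put("partitioner.class", TimeBasedStickyPartitioner.class.getName()). Note that recent Java clients (2.4 and later) already apply sticky behavior to keyless messages in their default partitioner.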
General Best Practices
Kafka's Guarantee of Message Order
Kafka guarantees the order of messages within the same partition. If a topic has multiple partitions, global order cannot be guaranteed; if global order is required, the number of partitions must be limited to 1.
Set a unique Key for the message
A Kafka message has two fields: Key (message identifier) and Value (message content). To make tracking easier, it is recommended to set a unique key for each message. You can then trace a message by its key, print it in the sending and consumption logs, and understand how the message was produced and consumed.
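Continuing the producer sketch above (the UUID key is just one illustrative way to generate a unique identifier), a message with a traceable key could be produced and logged like this:
String key = java.util.UUID.randomUUID().toString(); // unique business/trace key
ProducerRecord<String, String> record = new ProducerRecord<>("demo-topic", key, "payload");
producer.send(record, (metadata, exception) -> {
    if (exception == null) {
        System.out.printf("sent key=%s partition=%d offset=%d%n", key, metadata.partition(), metadata.offset());
    } else {
        System.err.printf("failed to send key=%s: %s%n", key, exception.getMessage());
    }
});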
Reasonably set the retry strategy of the queue
In a distributed environment, messages occasionally fail to be sent due to the network or other reasons: either the message was actually written successfully but the acknowledgment (ACK) was lost, or the message was genuinely not sent. The default parameters suit most scenarios, but the following retry parameters can be adjusted according to business needs:
Parameter | Description |
---|---|
retries | The number of retries. The default value is 3; for applications with zero tolerance for data loss, consider setting it to Integer.MAX_VALUE (the effective maximum). |
retry.backoff.ms | The retry interval; a value of 1000 is recommended. |
:exclamation: Note:
If you want to implement At Most Once semantics, retries need to be turned off.
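A hedged example of the retry-related producer settings (values follow the table above; adjust them to the business and continue from the producer sketch in the reliability section):
props.put("retries", 3);              // number of retries; set to 0 for At Most Once
props.put("retry.backoff.ms", 1000);  // wait 1 second between retries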
Integration Best Practices
Connecting Spark Streaming to Kafka
Spark Streaming is an extension of Spark Core for high-throughput, fault-tolerant processing of continuous data streams. Currently supported external inputs include Kafka, Flume, HDFS/S3, Kinesis, Twitter, and TCP sockets.
Spark Streaming abstracts continuous data as a DStream (Discretized Stream). A DStream consists of a series of continuous RDDs (Resilient Distributed Datasets), each holding the data generated within a certain time interval, so processing a DStream with functions is in fact processing these RDDs.
When using Spark Streaming to consume data from Kafka, both the stable and the experimental Kafka integrations are supported:
Kafka Version | spark-streaming-kafka-0.8 | spark-streaming-kafka-0.10 |
---|---|---|
Broker Version | 0.8.2.1 or higher | 0.10.0 or higher |
Api Maturity | Deprecated | Stable |
Language Support | Scala, Java, Python | Scala, Java |
Receiver DStream | Yes | No |
Direct DStream | Yes | Yes |
SSL / TLS Support | No | Yes |
Offset Commit Api | No | Yes |
Dynamic Topic Subscription | No | Yes |
This practice uses the Kafka dependency of version 0.10.2.1.
Steps
Step 1: Create Kafka cluster and Topic
The steps for creating a Kafka cluster are omitted here; after that, create a Topic named test.
Step 2: Prepare the server environment
CentOS 6.8 system
package | version |
---|---|
sbt | 0.13.16 |
hadoop | 2.7.3 |
spark | 2.1.0 |
protobuf | 2.5.0 |
ssh | CentOS default installation |
Java | 1.8 |
The specific installation steps are omitted, including the following steps:
- Install sbt
- Install protobuf
- Install Hadoop
- Install Spark
Step 3: Connect to Kafka
Produce messages to Kafka
The Kafka dependency of version 0.10.2.1 is used here.
Add the dependencies in build.sbt:
name := "Producer Example"
version := "1.0"
scalaVersion := "2.11.8"
libraryDependencies += "org.apache.kafka" % "kafka-clients" % "0.10.2.1"
Configure producer_example.scala:
import java.util.Properties
import org.apache.kafka.clients.producer._
object ProducerExample extends App {
    val props = new Properties()
    props.put("bootstrap.servers", "172.0.0.1:9092") //private IP and port from the instance details
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)
    val TOPIC = "test" //the topic to produce to
    for (i <- 1 to 50) {
        //produce messages whose key is "key" and value is "hello i"
        val record = new ProducerRecord(TOPIC, "key", s"hello $i")
        producer.send(record)
    }
    val record = new ProducerRecord(TOPIC, "key", "the end " + new java.util.Date)
    producer.send(record)
    producer.close() //close the producer at the end
}
For more usage of ProducerRecord, please refer to ProducerRecord documentation.
Consume messages from Kafka
DirectStream
Add the dependencies in build.sbt:
name := "Consumer Example"
version := "1.0"
scalaVersion := "2.11.8"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.1.0"
libraryDependencies += "org.apache.spark" %% "spark-streaming" % "2.1.0"
libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.1.0"
Configure DirectStream_example.scala:
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.OffsetRange
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.TaskContext
import collection.JavaConversions._
import Array._
object Kafka {
def main(args: Array[String]) {
val kafkaParams = Map[String, Object](
"bootstrap.servers" -> "172.0.0.1:9092",
"key.deserializer" -> classOf[StringDeserializer],
"value.deserializer" -> classOf[StringDeserializer],
"group.id" -> "spark_stream_test1",
"auto.offset.reset" -> "earliest",
"enable.auto.commit" -> "false"
)
val sparkConf = new SparkConf()
sparkConf.setMaster("local")
sparkConf.setAppName("Kafka")
val ssc = new StreamingContext(sparkConf, Seconds(5))
val topics = Array("spark_test")
val offsets = collection.mutable.Map[TopicPartition, Long]()
for (i <- 0 until 3){
val tp = new TopicPartition("spark_test", i)
offsets += (tp -> 0L) //starting offsets; they can be passed to Subscribe as an optional third argument
}
val stream = KafkaUtils.createDirectStream[String, String](
ssc,
PreferConsistent,
Subscribe[String, String](topics, kafkaParams)
)
println("directStream")
stream.foreachRDD{ rdd=>
//print the received messages
rdd.foreach{iter =>
val i = iter.value
println(s"${i}")
}
//get the offset ranges
val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
rdd.foreachPartition { iter =>
val o: OffsetRange = offsetRanges(TaskContext.get.partitionId)
println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
}
}
// Start the computation
ssc.start()
ssc.awaitTermination()
}
}
RDD
Configure build.sbt (the configuration is the same as above).
Configure RDD_example.scala:
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.OffsetRange
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import collection.JavaConversions._
import Array._
object Kafka {
def main(args: Array[String]) {
val kafkaParams = Map[String, Object](
"bootstrap.servers" -> "172.0.0.1:9092",
"key.deserializer" -> classOf[StringDeserializer],
"value.deserializer" -> classOf[StringDeserializer],
"group.id" -> "spark_stream",
"auto.offset.reset" -> "earliest",
"enable.auto.commit" -> (false: java.lang.Boolean)
)
val sc = new SparkContext("local", "Kafka", new SparkConf())
val java_kafkaParams : java.util.Map[String, Object] = kafkaParams
//pull the messages in the given offset range from each partition in order; if no messages are available, block until the wait time is exceeded or newly produced messages reach the requested count
val offsetRanges = Array[OffsetRange](
OffsetRange("spark_test", 0, 0, 5),
OffsetRange("spark_test", 1, 0, 5),
OffsetRange("spark_test", 2, 0, 5)
)
val range = KafkaUtils.createRDD[String, String](
sc,
java_kafkaParams,
offsetRanges,
PreferConsistent
)
range.foreach(record => println(record.value))
sc.stop()
}
}
For more kafkaParams usage, refer to the kafkaParams documentation.
Connecting Flume to Kafka
Apache Flume is a distributed, reliable, and highly available log collection system. It supports a wide variety of data sources (such as HTTP, log files, JMS, and listening ports) and can efficiently collect, aggregate, and move massive amounts of log data from these sources into a specified storage system (such as Kafka, a distributed file system, or a Solr search server).
The basic structure of Flume is as follows:
Flume uses the agent as its smallest independent unit of operation. An agent is a single JVM, consisting of three major components: Source, Sink, and Channel.
Flume and Kafka
When writing data into downstream storage or computing modules such as HDFS or HBase, many complex situations need to be considered, such as the volume of concurrent writes, the load the system can bear, and network latency. Flume, as a flexible distributed system, provides many interfaces as well as customizable pipelines. In the production-processing chain, Kafka can act as a buffer when production and processing speeds differ. Kafka's partition structure and append-only writes give it excellent throughput, and its replication mechanism gives it high fault tolerance. Combining Flume with Kafka therefore meets the vast majority of requirements in production environments.
Preparation
- Download Apache Flume (versions 1.6.0 and above are compatible with Kafka)
- Download the Kafka toolkit (version 0.9.x or above; 0.8 is no longer supported)
- Confirm that the Kafka Source and Sink components are available in Flume.
Integration methods
Kafka can act as a Source or a Sink to import or export messages.
Kafka Source
Configure Kafka as the message source: Flume acts as a consumer and pulls data from Kafka into the specified Sink. The main configuration options are as follows:
Option | Description |
---|---|
channels | The channel(s) you have configured |
type | Must be: org.apache.flume.source.kafka.KafkaSource |
kafka.bootstrap.servers | The Kafka broker server addresses |
kafka.consumer.group.id | The group ID Flume uses as a Kafka consumer |
kafka.topics | The Kafka topic(s) the data comes from |
batchSize | The size of each batch written to the Channel |
batchDurationMillis | The maximum interval between writes |
Example:
tier1.sources.source1.type = org.apache.flume.source.kafka.KafkaSource
tier1.sources.source1.channels = channel1
tier1.sources.source1.batchSize = 5000
tier1.sources.source1.batchDurationMillis = 2000
tier1.sources.source1.kafka.bootstrap.servers = localhost:9092
tier1.sources.source1.kafka.topics = test1, test2
tier1.sources.source1.kafka.consumer.group.id = custom.g.id
For more information, see the Apache Flume website.
Kafka Sink
Configure Kafka as the receiver: Flume acts as a producer and pushes data to the Kafka server for subsequent processing. The main configuration options are as follows:
Option | Description |
---|---|
channel | The channel you have configured |
type | Must be: org.apache.flume.sink.kafka.KafkaSink |
kafka.bootstrap.servers | The Kafka broker servers |
kafka.topic | The Kafka topic the data is written to |
kafka.flumeBatchSize | The batch size of each write |
kafka.producer.acks | The acknowledgment strategy of the Kafka producer |
Example:
a1.sinks.k1.channel = c1
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.topic = mytopic
a1.sinks.k1.kafka.bootstrap.servers = localhost:9092
a1.sinks.k1.kafka.flumeBatchSize = 20
a1.sinks.k1.kafka.producer.acks = 1
For more information, see the Apache Flume website.
Connecting Storm to Kafka
Storm is a distributed real-time computing framework that supports stream processing and general-purpose distributed RPC calls, achieving sub-second processing latency per event. It is suitable for real-time data processing scenarios with demanding latency requirements.
How Storm works
A Storm cluster has two kinds of nodes: the control node (Master Node) and the worker node (Worker Node). The Master Node runs the Nimbus process, which handles resource allocation and state monitoring. Each Worker Node runs a Supervisor process, which listens for work assignments and starts executors to run them. The whole Storm cluster relies on ZooKeeper for shared data storage, cluster state monitoring, task assignment, and so on.
The data processing program that a user submits to Storm is called a topology. The smallest unit of message it handles is a tuple, an array of arbitrary objects. A topology consists of spouts and bolts: a spout is the source that produces tuples, and a bolt can subscribe to the tuples emitted by any spout or bolt and process them.
Storm with Kafka
Storm can use Kafka as a spout and consume its data for processing; it can also use Kafka as a bolt and store processed data for other components to consume.
CentOS 6.8 system
package | version |
---|---|
maven | 3.5.0 |
storm | 2.1.0 |
ssh | 5.3 |
Java | 1.8 |
Prerequisites
- Download and install JDK 8. For details, see Download JDK 8.
- Download and install Storm. See Apache Storm downloads.
- A Kafka cluster has been created.
Steps
Step 1: Create a Topic
Step 2: Add Maven dependencies
Configure pom.xml as follows:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>storm</groupId>
<artifactId>storm</artifactId>
<version>0.0.1-SNAPSHOT</version>
<name>storm</name>
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
</properties>
<dependencies>
<dependency>
<groupId>org.apache.storm</groupId>
<artifactId>storm-core</artifactId>
<version>2.1.0</version>
</dependency>
<dependency>
<groupId>org.apache.storm</groupId>
<artifactId>storm-kafka-client</artifactId>
<version>2.1.0</version>
</dependency>
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka_2.11</artifactId>
<version>0.10.2.1</version>
<exclusions>
<exclusion>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-log4j12</artifactId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.12</version>
<scope>test</scope>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<artifactId>maven-assembly-plugin</artifactId>
<configuration>
<descriptorRefs>
<descriptorRef>jar-with-dependencies</descriptorRef>
</descriptorRefs>
<archive>
<manifest>
<mainClass>ExclamationTopology</mainClass>
</manifest>
</archive>
</configuration>
<executions>
<execution>
<id>make-assembly</id>
<phase>package</phase>
<goals>
<goal>single</goal>
</goals>
</execution>
</executions>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<configuration>
<source>1.8</source>
<target>1.8</target>
</configuration>
</plugin>
</plugins>
</build>
</project>
Step 3: Produce messages
Using spout/bolt
Topology code:
//TopologyKafkaProducerSpout.java
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.StormSubmitter;
import org.apache.storm.kafka.bolt.KafkaBolt;
import org.apache.storm.kafka.bolt.mapper.FieldNameBasedTupleToKafkaMapper;
import org.apache.storm.kafka.bolt.selector.DefaultTopicSelector;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.utils.Utils;
import java.util.Properties;
public class TopologyKafkaProducerSpout {
//ip:port of the Kafka instance you created
private final static String BOOTSTRAP_SERVERS = "xx.xx.xx.xx:xxxx";
//the topic to write messages to
private final static String TOPIC = "storm_test";
public static void main(String[] args) throws Exception {
//set producer properties
//API reference: https://kafka.apache.org/0100/javadoc/index.html?org/apache/kafka/clients/consumer/KafkaConsumer.html
//configuration reference: http://kafka.apache.org/0102/documentation.html
Properties properties = new Properties();
properties.put("bootstrap.servers", BOOTSTRAP_SERVERS);
properties.put("acks", "1");
properties.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
properties.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
//create the bolt that writes to Kafka; by default the fields "key" and "message" are used as the key and value of the produced message, which can also be specified in FieldNameBasedTupleToKafkaMapper()
KafkaBolt kafkaBolt = new KafkaBolt()
.withProducerProperties(properties)
.withTopicSelector(new DefaultTopicSelector(TOPIC))
.withTupleToKafkaMapper(new FieldNameBasedTupleToKafkaMapper());
TopologyBuilder builder = new TopologyBuilder();
//a spout class that generates messages sequentially; its output field is "sentence"
SerialSentenceSpout spout = new SerialSentenceSpout();
AddMessageKeyBolt bolt = new AddMessageKeyBolt();
builder.setSpout("kafka-spout", spout, 1);
//add the fields required for producing to Kafka to the tuple
builder.setBolt("add-key", bolt, 1).shuffleGrouping("kafka-spout");
//write to Kafka
builder.setBolt("sendToKafka", kafkaBolt, 8).shuffleGrouping("add-key");
Config config = new Config();
if (args != null && args.length > 0) {
//cluster mode: package the jar and submit it to a Storm cluster
config.setNumWorkers(1);
StormSubmitter.submitTopologyWithProgressBar(args[0], config, builder.createTopology());
} else {
//local mode
LocalCluster cluster = new LocalCluster();
cluster.submitTopology("test", config, builder.createTopology());
Utils.sleep(10000);
cluster.killTopology("test");
cluster.shutdown();
}
}
}
Create a spout class that generates messages sequentially:
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;
import java.util.Map;
import java.util.UUID;
public class SerialSentenceSpout extends BaseRichSpout {
private SpoutOutputCollector spoutOutputCollector;
@Override
public void open(Map map, TopologyContext topologyContext, SpoutOutputCollector spoutOutputCollector) {
this.spoutOutputCollector = spoutOutputCollector;
}
@Override
public void nextTuple() {
Utils.sleep(1000);
//produce a UUID string and emit it to the next component
spoutOutputCollector.emit(new Values(UUID.randomUUID().toString()));
}
@Override
public void declareOutputFields(OutputFieldsDeclarer outputFieldsDeclarer) {
outputFieldsDeclarer.declare(new Fields("sentence"));
}
}
Add the key and message fields to the tuple. When the key is null, produced messages are distributed evenly across partitions; when a key is specified, messages are hashed to a particular partition based on the key value:
//AddMessageKeyBolt.java
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
public class AddMessageKeyBolt extends BaseBasicBolt {
@Override
public void execute(Tuple tuple, BasicOutputCollector basicOutputCollector) {
//get the first field value
String message = tuple.getString(0);
//System.out.println(message);
//emit to the next component
basicOutputCollector.emit(new Values(null, message));
}
@Override
public void declareOutputFields(OutputFieldsDeclarer outputFieldsDeclarer) {
//declare the schema emitted to the next component
outputFieldsDeclarer.declare(new Fields("key", "message"));
}
}
Using Trident
Generate the topology using Trident classes:
//TopologyKafkaProducerTrident.java
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.StormSubmitter;
import org.apache.storm.kafka.trident.TridentKafkaStateFactory;
import org.apache.storm.kafka.trident.TridentKafkaStateUpdater;
import org.apache.storm.kafka.trident.mapper.FieldNameBasedTupleToKafkaMapper;
import org.apache.storm.kafka.trident.selector.DefaultTopicSelector;
import org.apache.storm.trident.TridentTopology;
import org.apache.storm.trident.operation.BaseFunction;
import org.apache.storm.trident.operation.TridentCollector;
import org.apache.storm.trident.tuple.TridentTuple;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;
import java.util.Properties;
public class TopologyKafkaProducerTrident {
//ip:port of the Kafka instance you created
private final static String BOOTSTRAP_SERVERS = "xx.xx.xx.xx:xxxx";
//the topic to write messages to
private final static String TOPIC = "storm_test";
public static void main(String[] args) throws Exception {
//set producer properties
//API reference: https://kafka.apache.org/0100/javadoc/index.html?org/apache/kafka/clients/consumer/KafkaConsumer.html
//configuration reference: http://kafka.apache.org/0102/documentation.html
Properties properties = new Properties();
properties.put("bootstrap.servers", BOOTSTRAP_SERVERS);
properties.put("acks", "1");
properties.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
properties.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
//configure Trident
TridentKafkaStateFactory stateFactory = new TridentKafkaStateFactory()
.withProducerProperties(properties)
.withKafkaTopicSelector(new DefaultTopicSelector(TOPIC))
//use fields("key", "value") for the produced message; unlike FieldNameBasedTupleToKafkaMapper there are no default values
.withTridentTupleToKafkaMapper(new FieldNameBasedTupleToKafkaMapper("key", "value"));
TridentTopology builder = new TridentTopology();
//a spout that produces sentences in batches; the output field is "sentence"
builder.newStream("kafka-spout", new TridentSerialSentenceSpout(5))
.each(new Fields("sentence"), new AddMessageKey(), new Fields("key", "value"))
.partitionPersist(stateFactory, new Fields("key", "value"), new TridentKafkaStateUpdater(), new Fields());
Config config = new Config();
if (args != null && args.length > 0) {
//cluster mode: package the jar and submit it to a Storm cluster
config.setNumWorkers(1);
StormSubmitter.submitTopologyWithProgressBar(args[0], config, builder.build());
} else {
//local mode
LocalCluster cluster = new LocalCluster();
cluster.submitTopology("test", config, builder.build());
Utils.sleep(10000);
cluster.killTopology("test");
cluster.shutdown();
}
}
private static class AddMessageKey extends BaseFunction {
@Override
public void execute(TridentTuple tridentTuple, TridentCollector tridentCollector) {
//get the first field value
String message = tridentTuple.getString(0);
//System.out.println(message);
//emit to the next component
//tridentCollector.emit(new Values(Integer.toString(message.hashCode()), message));
tridentCollector.emit(new Values(null, message));
}
}
}
Create a spout class that generates messages in batches:
//TridentSerialSentenceSpout.java
import org.apache.storm.Config;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.trident.operation.TridentCollector;
import org.apache.storm.trident.spout.IBatchSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;
import java.util.Map;
import java.util.UUID;
public class TridentSerialSentenceSpout implements IBatchSpout {
private final int batchCount;
public TridentSerialSentenceSpout(int batchCount) {
this.batchCount = batchCount;
}
@Override
public void open(Map map, TopologyContext topologyContext) {
}
@Override
public void emitBatch(long l, TridentCollector tridentCollector) {
Utils.sleep(1000);
for(int i = 0; i < batchCount; i++){
tridentCollector.emit(new Values(UUID.randomUUID().toString()));
}
}
@Override
public void ack(long l) {
}
@Override
public void close() {
}
@Override
public Map<String, Object> getComponentConfiguration() {
Config conf = new Config();
conf.setMaxTaskParallelism(1);
return conf;
}
@Override
public Fields getOutputFields() {
return new Fields("sentence");
}
}
Step 4: Consume messages
Using spout/bolt
//TopologyKafkaConsumerSpout.java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.StormSubmitter;
import org.apache.storm.kafka.spout.*;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;
import java.util.HashMap;
import java.util.Map;
import static org.apache.storm.kafka.spout.FirstPollOffsetStrategy.LATEST;
public class TopologyKafkaConsumerSpout {
//ip:port of the Kafka instance you created
private final static String BOOTSTRAP_SERVERS = "xx.xx.xx.xx:xxxx";
//the topic to consume messages from
private final static String TOPIC = "storm_test";
public static void main(String[] args) throws Exception {
//configure the retry policy
KafkaSpoutRetryService kafkaSpoutRetryService = new KafkaSpoutRetryExponentialBackoff(
KafkaSpoutRetryExponentialBackoff.TimeInterval.microSeconds(500),
KafkaSpoutRetryExponentialBackoff.TimeInterval.milliSeconds(2),
Integer.MAX_VALUE,
KafkaSpoutRetryExponentialBackoff.TimeInterval.seconds(10)
);
ByTopicRecordTranslator<String, String> trans = new ByTopicRecordTranslator<>(
(r) -> new Values(r.topic(), r.partition(), r.offset(), r.key(), r.value()),
new Fields("topic", "partition", "offset", "key", "value"));
//configure consumer parameters
//API reference: http://storm.apache.org/releases/1.1.0/javadocs/org/apache/storm/kafka/spout/KafkaSpoutConfig.Builder.html
//parameter reference: http://kafka.apache.org/0102/documentation.html
KafkaSpoutConfig spoutConfig = KafkaSpoutConfig.builder(BOOTSTRAP_SERVERS, TOPIC)
.setProp(new HashMap<String, Object>(){
{
put(ConsumerConfig.GROUP_ID_CONFIG, "test-group1"); //set the consumer group
put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, "50000"); //set the session timeout
put(ConsumerConfig.REQUEST_TIMEOUT_MS_CONFIG, "60000"); //set the request timeout
}})
.setOffsetCommitPeriodMs(10_000) //set the automatic offset commit interval
.setFirstPollOffsetStrategy(LATEST) //start consuming from the latest messages
.setRetry(kafkaSpoutRetryService)
.setRecordTranslator(trans)
.build();
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("kafka-spout", new KafkaSpout(spoutConfig), 1);
builder.setBolt("bolt", new BaseRichBolt(){
private OutputCollector outputCollector;
@Override
public void declareOutputFields(OutputFieldsDeclarer outputFieldsDeclarer) {
}
@Override
public void prepare(Map map, TopologyContext topologyContext, OutputCollector outputCollector) {
this.outputCollector = outputCollector;
}
@Override
public void execute(Tuple tuple) {
System.out.println(tuple.getStringByField("value"));
outputCollector.ack(tuple);
}
}, 1).shuffleGrouping("kafka-spout");
Config config = new Config();
config.setMaxSpoutPending(20);
if (args != null && args.length > 0) {
config.setNumWorkers(3);
StormSubmitter.submitTopologyWithProgressBar(args[0], config, builder.createTopology());
}
else {
LocalCluster cluster = new LocalCluster();
cluster.submitTopology("test", config, builder.createTopology());
Utils.sleep(20000);
cluster.killTopology("test");
cluster.shutdown();
}
}
}
Using Trident
//TopologyKafkaConsumerTrident.java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.StormSubmitter;
import org.apache.storm.generated.StormTopology;
import org.apache.storm.kafka.spout.ByTopicRecordTranslator;
import org.apache.storm.kafka.spout.trident.KafkaTridentSpoutConfig;
import org.apache.storm.kafka.spout.trident.KafkaTridentSpoutOpaque;
import org.apache.storm.trident.Stream;
import org.apache.storm.trident.TridentTopology;
import org.apache.storm.trident.operation.BaseFunction;
import org.apache.storm.trident.operation.TridentCollector;
import org.apache.storm.trident.tuple.TridentTuple;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;
import java.util.HashMap;
import static org.apache.storm.kafka.spout.FirstPollOffsetStrategy.LATEST;
public class TopologyKafkaConsumerTrident {
//ip:port of the Kafka instance you created
private final static String BOOTSTRAP_SERVERS = "xx.xx.xx.xx:xxxx";
//the topic to consume messages from
private final static String TOPIC = "storm_test";
public static void main(String[] args) throws Exception {
ByTopicRecordTranslator<String, String> trans = new ByTopicRecordTranslator<>(
(r) -> new Values(r.topic(), r.partition(), r.offset(), r.key(), r.value()),
new Fields("topic", "partition", "offset", "key", "value"));
//configure consumer parameters
//API reference: http://storm.apache.org/releases/1.1.0/javadocs/org/apache/storm/kafka/spout/KafkaSpoutConfig.Builder.html
//parameter reference: http://kafka.apache.org/0102/documentation.html
KafkaTridentSpoutConfig spoutConfig = KafkaTridentSpoutConfig.builder(BOOTSTRAP_SERVERS, TOPIC)
.setProp(new HashMap<String, Object>(){
{
put(ConsumerConfig.GROUP_ID_CONFIG, "test-group1"); //set the consumer group
put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "true"); //enable automatic offset commit
put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, "50000"); //set the session timeout
put(ConsumerConfig.REQUEST_TIMEOUT_MS_CONFIG, "60000"); //set the request timeout
}})
.setFirstPollOffsetStrategy(LATEST) //start consuming from the latest messages
.setRecordTranslator(trans)
.build();
TridentTopology builder = new TridentTopology();
// Stream spoutStream = builder.newStream("spout", new KafkaTridentSpoutTransactional(spoutConfig)); //transactional variant
Stream spoutStream = builder.newStream("spout", new KafkaTridentSpoutOpaque(spoutConfig));
spoutStream.each(spoutStream.getOutputFields(), new BaseFunction(){
@Override
public void execute(TridentTuple tridentTuple, TridentCollector tridentCollector) {
System.out.println(tridentTuple.getStringByField("value"));
tridentCollector.emit(new Values(tridentTuple.getStringByField("value")));
}
}, new Fields("message"));
Config conf = new Config();
conf.setMaxSpoutPending(20);conf.setNumWorkers(1);
if (args != null && args.length > 0) {
conf.setNumWorkers(3);
StormSubmitter.submitTopologyWithProgressBar(args[0], conf, builder.build());
}
else {
StormTopology stormTopology = builder.build();
LocalCluster cluster = new LocalCluster();
cluster.submitTopology("test", conf, stormTopology);
Utils.sleep(10000);
cluster.killTopology("test");
cluster.shutdown();stormTopology.clear();
}
}
}
Step 5: Submit to Storm
After compiling with mvn package, the topology can be submitted to a local cluster for debug testing, or to a production cluster to run:
storm jar your_jar_name.jar topology_name
storm jar your_jar_name.jar topology_name task_name
Connecting Logstash to Kafka
Logstash is an open-source log processing tool that collects data from multiple sources, filters the collected data, and stores it for other uses.
Logstash is highly flexible, has powerful parsing capabilities, offers a rich set of plugins, and supports many input and output sources. As a horizontally scalable data pipeline, Logstash works with Elasticsearch and Kibana to provide powerful log collection and search capabilities.
How Logstash works
Logstash data processing has three stages: inputs → filters → outputs.
- inputs: produce the source data, for example files, syslog, redis, and beats.
- filters: modify and filter the data. Filters are the intermediate stage of the Logstash pipeline and can alter events based on conditions. Common filters include grok, mutate, drop, and clone.
- outputs: transmit the data elsewhere. An event can be sent to multiple outputs, and the event ends once transmission completes. Elasticsearch is the most common output.
Logstash also supports codecs, so data formats can be specified on the inputs and outputs sides.
Advantages of connecting Logstash to Kafka
- Asynchronous data processing: protects against traffic bursts.
- Decoupling: when Elasticsearch fails, the upstream work is not affected.
:exclamation: Note: Logstash filtering consumes resources; deploying it on a production server will affect that server's performance.
Steps
Preparation
- Download and install Logstash. See Download Logstash.
- Download and install JDK 8. See Download JDK 8.
- A Kafka cluster has been created.
Step 1: Create a Topic
Create a Topic named logstash_test.
Step 2: Connect to Kafka
Connecting as an input
Run bin/logstash-plugin list to check whether the supported plugins include logstash-input-kafka.
Write the configuration file input.conf in the .bin/ directory. Here standard output is used as the data destination and Kafka as the data source.
input {
    kafka {
        bootstrap_servers => "xx.xx.xx.xx:xxxx"   # Kafka instance address
        group_id => "logstash_group"              # Kafka consumer group ID
        topics => ["logstash_test"]               # Kafka topic name
        consumer_threads => 3                     # number of consumer threads, usually equal to the number of partitions
        auto_offset_reset => "earliest"
    }
}
output {
    stdout { codec => rubydebug }
}
Run the following command to start Logstash and consume messages.
./logstash -f input.conf
You will see the data previously written to the Topic being consumed.
Connecting as an output
Run bin/logstash-plugin list to check whether the supported plugins include logstash-output-kafka.
Write the configuration file output.conf in the .bin/ directory. Here standard input is used as the data source and Kafka as the data destination.
input {
    stdin {}
}
output {
    kafka {
        bootstrap_servers => "xx.xx.xx.xx:xxxx"   # Kafka instance address
        topic_id => "logstash_test"               # Kafka topic name
    }
}
Run the following command to start Logstash and send messages to the Topic you created.
./logstash -f output.conf
Start a Kafka consumer to verify the data produced in the previous step.
./kafka-console-consumer.sh --bootstrap-server 172.0.0.1:9092 --topic logstash_test --from-beginning --new-consumer
Connecting Filebeat to Kafka
The Beats platform is a collection of single-purpose data shippers. Once installed, they act as lightweight agents that send collected data from hundreds or thousands of machines to the target. Beats offers several shippers, and you can download the one that suits your needs. This section takes Filebeat (a lightweight log shipper) as an example to describe how to connect Filebeat to Kafka and how to resolve common problems after integration.
Prerequisites
- Download and install Filebeat (see Download Filebeat)
- Download and install JDK 8 (see Download JDK 8)
- A Kafka cluster has been created
Steps
Step 1: Create a Topic
Create a Topic named test.
Step 2: Prepare the configuration file
Go to the Filebeat installation directory and create the monitoring configuration file filebeat.yml.
#======= Filebeat prospectors ==========
filebeat.prospectors:
- input_type: log
# paths of the files to monitor
paths:
- /var/log/messages
#======= Outputs =========
#------------------ kafka -------------------------------------
output.kafka:
version: 0.10.2  # configure according to the Kafka cluster version
# set the access address of the Kafka instance
hosts: ["xx.xx.xx.xx:xxxx"]
# set the name of the target topic
topic: 'test'
partition.round_robin:
reachable_only: false
required_acks: 1
compression: none
max_message_bytes: 1000000
# For SASL, configure the following two options; otherwise they can be omitted
username: "yourinstance#yourusername"  # username is the instance ID concatenated with the user name
password: "yourpassword"
Step 3: Send messages with Filebeat
Run the following command to start the client.
sudo ./filebeat -e -c filebeat.yml
Add data to the monitored file (this example writes to the monitored testlog file).
echo ckafka1 >> testlog
echo ckafka2 >> testlog
echo ckafka3 >> testlog
Start a consumer on the corresponding Topic and you will receive the following data.
{"@timestamp":"2017-09-29T10:01:27.936Z","beat":{"hostname":"10.193.9.26","name":"10.193.9.26","version":"5.6.2"},"input_type":"log","message":"ckafka1","offset":500,"source":"/data/ryanyyang/hcmq/beats/filebeat-5.6.2-linux-x86_64/testlog","type":"log"}
{"@timestamp":"2017-09-29T10:01:30.936Z","beat":{"hostname":"10.193.9.26","name":"10.193.9.26","version":"5.6.2"},"input_type":"log","message":"ckafka2","offset":508,"source":"/data/ryanyyang/hcmq/beats/filebeat-5.6.2-linux-x86_64/testlog","type":"log"}
{"@timestamp":"2017-09-29T10:01:33.937Z","beat":{"hostname":"10.193.9.26","name":"10.193.9.26","version":"5.6.2"},"input_type":"log","message":"ckafka3","offset":516,"source":"/data/ryanyyang/hcmq/beats/filebeat-5.6.2-linux-x86_64/testlog","type":"log"}
SASL/PLAINTEXT mode
If you need SASL/PLAINTEXT, configure a username and password: simply add the username and password settings in the Kafka configuration section shown above.
References
Message Queue CKafka - Documentation Center - Tencent Cloud (tencent.com)
"When three people walk together, one of them can be my teacher"; knowledge is meant to be shared. This article was written by Dongfeng Weiming, technical blog EWhisper.cn.