Spark Series (XVI) - Spark Streaming Integration with Kafka

1. Version Overview

Spark provides two separate integration artifacts for different versions of Kafka: spark-streaming-kafka-0-8 and spark-streaming-kafka-0-10. The main differences between them are as follows:

|                            | spark-streaming-kafka-0-8 | spark-streaming-kafka-0-10 |
| -------------------------- | ------------------------- | -------------------------- |
| Kafka version              | 0.8.2.1 or higher         | 0.10.0 or higher           |
| API state                  | Deprecated (Kafka 0.8 support is deprecated as of Spark 2.3.0) | Stable |
| Language support           | Scala, Java, Python       | Scala, Java                |
| Receiver DStream           | Yes                       | No                         |
| Direct DStream             | Yes                       | Yes                        |
| SSL / TLS support          | No                        | Yes                        |
| Offset commit API          | No                        | Yes                        |
| Dynamic topic subscription | No                        | Yes                        |

The Kafka version used in this article is kafka_2.12-2.2.0, so the second of these two approaches is used for the integration.

2. Project Dependencies

The project is built with Maven, and the main dependencies are as follows:

<properties>
    <scala.version>2.12</scala.version>
    <!-- Spark version, matching the Kafka integration module below -->
    <spark.version>2.4.3</spark.version>
</properties>

<dependencies>
    <!-- Spark Streaming-->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming_${scala.version}</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <!-- Spark Streaming Kafka integration dependency -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming-kafka-0-10_${scala.version}</artifactId>
        <version>2.4.3</version>
    </dependency>
</dependencies>

The complete source code for this article can be found in the repository: spark-streaming-kafka.

3. Integrating with Kafka

An input stream is created by calling the createDirectStream method on the KafkaUtils object. The complete code is as follows:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
  * Spark Streaming integration with Kafka
  */
object KafkaDirectStream {

  def main(args: Array[String]): Unit = {

    val sparkConf = new SparkConf().setAppName("KafkaDirectStream").setMaster("local[2]")
    val streamingContext = new StreamingContext(sparkConf, Seconds(5))

    val kafkaParams = Map[String, Object](
      /*
       * Specify the list of broker addresses. The list does not need to contain every broker:
       * the client will discover the other brokers from the ones provided. It is recommended
       * to supply at least two broker addresses for fault tolerance.
       */
      "bootstrap.servers" -> "hadoop001:9092",
      /*Key deserializer*/
      "key.deserializer" -> classOf[StringDeserializer],
      /*Value deserializer*/
      "value.deserializer" -> classOf[StringDeserializer],
      /*ID of the consumer group this consumer belongs to*/
      "group.id" -> "spark-streaming-group",
      /*
       * This property specifies what the consumer should do when it reads a partition that has
       * no committed offset or whose offset is invalid:
       * latest: start from the latest records (records produced after the consumer started)
       * earliest: start reading the partition's records from the beginning
       */
      "auto.offset.reset" -> "latest",
      /*Whether to commit offsets automatically*/
      "enable.auto.commit" -> (true: java.lang.Boolean)
    )

    /*Multiple topics can be subscribed to at the same time*/
    val topics = Array("spark-streaming-topic")
    val stream = KafkaUtils.createDirectStream[String, String](
      streamingContext,
      /*Location strategy*/
      PreferConsistent,
      /*Subscribe to the topics*/
      Subscribe[String, String](topics, kafkaParams)
    )

    /*Print the input stream*/
    stream.map(record => (record.key, record.value)).print()

    streamingContext.start()
    streamingContext.awaitTermination()
  }
}

3.1 ConsumerRecord

Each record in the input stream obtained here is in fact an instance of ConsumerRecord<K, V>, which carries all of the record's available information. Its source code is as follows:

public class ConsumerRecord<K, V> {
    
    public static final long NO_TIMESTAMP = RecordBatch.NO_TIMESTAMP;
    public static final int NULL_SIZE = -1;
    public static final int NULL_CHECKSUM = -1;
    
    /*Topic name*/
    private final String topic;
    /*Partition number*/
    private final int partition;
    /*Offset*/
    private final long offset;
    /*Timestamp*/
    private final long timestamp;
    /*What the timestamp represents (create time or log-append time)*/
    private final TimestampType timestampType;
    /*Size of the serialized key*/
    private final int serializedKeySize;
    /*Size of the serialized value*/
    private final int serializedValueSize;
    /*Record headers*/
    private final Headers headers;
    /*Key*/
    private final K key;
    /*Value*/
    private final V value;
    .....   
}
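
As an illustration, these fields can be read directly from each record in the stream created earlier. A minimal sketch that prints the metadata and payload of every record in each batch (output goes to the executor console; in local mode that is the local console):

stream.foreachRDD { rdd =>
  rdd.foreach { record =>
    println(s"topic=${record.topic} partition=${record.partition} " +
      s"offset=${record.offset} key=${record.key} value=${record.value}")
  }
}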

3.2 Consumer Properties

In the sample code, kafkaParams encapsulates the Kafka consumer properties. These are independent of Spark Streaming and are defined by Kafka's native API. The server addresses and the key and value deserializers are mandatory; the other configurations are optional. The main optional configuration items are listed below, with a combined example after the list:

1. fetch.min.bytes

The minimum number of bytes the consumer will fetch from the server. If the amount of available data is smaller than this value, the broker waits until enough data is available before returning it to the consumer.

2. fetch.max.wait.ms

The maximum time the broker will wait before returning data to the consumer.

3. max.partition.fetch.bytes

The maximum number of bytes returned to the consumer per partition.

4. session.timeout.ms

How long a consumer can go without contacting the server before it is considered dead.

5. auto.offset.reset

This property specifies what the consumer should do when it reads a partition that has no committed offset or whose offset is invalid:

  • latest (default): when the offset is invalid, the consumer starts reading from the latest records (records produced after the consumer started);
  • earliest: when the offset is invalid, the consumer reads the partition's records from the beginning.

6. enable.auto.commit

Whether offsets are committed automatically. The default is true; to avoid duplicated or lost data, it can be set to false.

7. client.id

The client ID, used by the server to identify the source of messages.

8. max.poll.records

The maximum number of records a single call to the poll() method can return.

9. receive.buffer.bytes and send.buffer.bytes

The sizes of the TCP socket buffers used for receiving and sending packets. The default value of -1 means the operating system defaults are used.
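
Any of these optional settings can simply be added to the kafkaParams map next to the mandatory ones. Below is a minimal sketch; the chosen values are purely illustrative, not recommendations:

val kafkaParams = Map[String, Object](
  /*mandatory settings*/
  "bootstrap.servers" -> "hadoop001:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "spark-streaming-group",
  /*optional settings, illustrative values only*/
  "fetch.min.bytes" -> (1: java.lang.Integer),
  "max.poll.records" -> (500: java.lang.Integer),
  "client.id" -> "spark-streaming-client",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)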

3.3 Location Strategies

Spark Streaming provides the following three location strategies for specifying how Kafka topic partitions are assigned to the executors that run the Spark program:

  • PreferConsistent: distributes all partitions evenly across the available executors;
  • PreferBrokers: can be used when the Spark executors and the Kafka brokers run on the same machines; it preferentially assigns each partition to the executor on the machine hosting that partition's leader broker;
  • PreferFixed: lets you specify a mapping from partitions to particular hosts, explicitly assigning a partition to a specific host. Its constructors are as follows:

@Experimental
def PreferFixed(hostMap: collection.Map[TopicPartition, String]): LocationStrategy =
  new PreferFixed(new ju.HashMap[TopicPartition, String](hostMap.asJava))

@Experimental
def PreferFixed(hostMap: ju.Map[TopicPartition, String]): LocationStrategy =
  new PreferFixed(hostMap)
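
As an illustration, a PreferFixed strategy could be built roughly as follows; the partition-to-host mapping is only an assumed example using the topic and host from this article:

import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.kafka010.LocationStrategies

/*Pin partition 0 of the test topic to the hadoop001 host (illustrative mapping)*/
val hostMap = Map(new TopicPartition("spark-streaming-topic", 0) -> "hadoop001")
val locationStrategy = LocationStrategies.PreferFixed(hostMap)

The resulting strategy is then passed to createDirectStream in place of PreferConsistent.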

3.4 Subscription

Spark Streaming provides two ways of subscribing to topics: Subscribe and SubscribePattern. The latter subscribes to all topics whose names match a regular expression. Their signatures are as follows:

/**
  * @param topics collection of topics to subscribe to
  * @param kafkaParams Kafka consumer parameters
  * @param offsets (optional): offsets to start from on initial startup. If absent, the saved
  *                offsets or the value of the auto.offset.reset property will be used
  */
def Subscribe[K, V](
    topics: ju.Collection[jl.String],
    kafkaParams: ju.Map[String, Object],
    offsets: ju.Map[TopicPartition, jl.Long]): ConsumerStrategy[K, V] = { ... }

/**
  * @param pattern regular expression matching the topics to subscribe to
  * @param kafkaParams Kafka consumer parameters
  * @param offsets (optional): offsets to start from on initial startup. If absent, the saved
  *                offsets or the value of the auto.offset.reset property will be used
  */
def SubscribePattern[K, V](
    pattern: ju.regex.Pattern,
    kafkaParams: collection.Map[String, Object],
    offsets: collection.Map[TopicPartition, Long]): ConsumerStrategy[K, V] = { ... }

In the sample code we did not specify the third parameter, offsets, so the program uses the default value of the auto.offset.reset property, latest: when there is no valid offset, the consumer starts reading from the latest records produced after it starts up.
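
If you do want to start from explicit offsets, they can be passed as the third argument. A minimal sketch, assuming the overload that accepts Scala collections and using a purely illustrative starting offset:

import org.apache.kafka.common.TopicPartition

/*Start partition 0 of the topic at offset 0 (illustrative value)*/
val offsets = Map(new TopicPartition("spark-streaming-topic", 0) -> 0L)
val consumerStrategy = Subscribe[String, String](topics, kafkaParams, offsets)

The resulting consumer strategy is then passed to createDirectStream exactly as before.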

3.5 Committing Offsets

In the sample code, we set enable.auto.commit to true, which means offsets are committed automatically. In some cases you may need higher reliability, for example committing offsets only after the business logic has fully processed the data; in that case manual commits can be used. To commit manually, you can call Kafka's native API:

  • commitSync: for synchronous commits;
  • commitAsync: for asynchronous commits.

For details on how to commit offsets, see: Kafka Consumers in Detail.
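
The spark-streaming-kafka-0-10 integration also exposes each batch's offset ranges and a commit API on the stream itself, so offsets can be committed after a batch has been processed (with enable.auto.commit set to false). A minimal sketch based on the pattern described in the official integration guide:

import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

stream.foreachRDD { rdd =>
  /*Capture the offset ranges of this batch*/
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  /*... process the records of this batch here ...*/
  /*Commit the offsets asynchronously once processing has finished*/
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}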

4. Starting and Testing

4.1 Creating a Topic

1. Start Kafka

Kafka depends on ZooKeeper, which must be started first. You can use the ZooKeeper instance bundled with Kafka, or a separately installed one:

# Start a separately installed ZooKeeper
bin/zkServer.sh start

# Start the ZooKeeper bundled with Kafka
bin/zookeeper-server-start.sh config/zookeeper.properties

Start a single Kafka node for testing:

bin/kafka-server-start.sh config/server.properties

2. Create a topic

# Create a topic for testing
bin/kafka-topics.sh --create \
                    --bootstrap-server hadoop001:9092 \
                    --replication-factor 1 \
                    --partitions 1  \
                    --topic spark-streaming-topic

# List all topics
bin/kafka-topics.sh --list --bootstrap-server hadoop001:9092

3. Create a producer

Here we create a Kafka console producer for sending test data:

bin/kafka-console-producer.sh --broker-list hadoop001:9092 --topic spark-streaming-topic

4.2 Local Mode Test

Here I run the Spark Streaming program directly in local mode. After it starts, send data with the producer and observe the results in the console.

The console output shows that the data stream is received successfully. Since kafka-console-producer.sh sends data without a key by default, the key is null. The output also shows the group.id specified in the program and the client.id that the program assigned automatically.

Reference material

  1. https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html

More articles in this big data series can be found in the GitHub open-source project: Getting Started with Big Data.

Source: www.cnblogs.com/heibaiying/p/11359594.html