How to implement batch-stream-integrated elastic data processing based on Apache Pulsar and Spark?

The current state of batch and stream processing

In the field of massively parallel data analysis, AMPLab's "One stack to rule them all" vision proposed Apache Spark as a unified engine to support common data processing scenarios such as batch processing, stream processing, interactive query, and machine learning. Spark Structured Streaming, officially released with Spark 2.2.0 in July 2017, uses Spark SQL as the unified underlying execution engine for both stream and batch processing, optimizes queries over bounded tables (static historical data), and gives users the Dataset/DataFrame API for joint processing of streaming and batch data, further blurring the boundary between stream and batch processing.

Apache Flink, on the other hand, entered the public eye around 2016. With what was then the stronger stream processing engine, native watermark support, "exactly-once" data consistency guarantees, and support for scenarios such as unified batch and stream computing, it became a strong challenger to Spark. Whether they use Spark or Flink, what users really care about is how to make better use of their data and tap its value faster. Streaming data and static data are no longer separate entities, but two different representations of the same data.

In practice, however, building a batch-stream-unified data platform is not just a task for the compute engine layer. In traditional solutions, near-real-time stream and event data are usually stored in message queues (such as RabbitMQ) and real-time data pipelines (such as Apache Kafka), while the static data required for batch processing is stored in file systems and object storage. This means that, on the one hand, data analysis has to join data stored in the two types of systems in order to guarantee both correctness and freshness of the results; on the other hand, operations teams have to periodically dump stream data to file or object storage and keep the total amount of data held in stream form below a threshold to preserve the performance of the message queues and data pipelines (the partition-based architecture of such systems tightly couples message serving with message storage and relies heavily on the file system, so performance drops sharply as data volume grows). This manual data relocation not only raises the operational cost of the system; the cleaning, reading, and loading involved in relocation also consume a great deal of cluster resources.

At the same time, from the popularity of Mesos and YARN and the rise of Docker to the widespread adoption of Kubernetes, the entire infrastructure stack has been moving toward containerization, and the traditional architecture that tightly couples message serving with message computation does not adapt well to containerized environments. Take Kafka as an example: its partition-centric architecture tightly couples message serving and message storage. A Kafka partition is strongly bound to one physical machine or a group of machines, which makes the partition data rebalancing process during machine failure or cluster expansion expensive and lengthy; its partition-granular storage design also cannot make good use of existing cloud storage resources; and its comparatively simple design leaves many architectural shortcomings in multi-tenant management and I/O isolation to be solved before containerization.

Introduction to Pulsar

Apache Pulsar is a multi-tenant, high-performance, enterprise-grade message publish-subscribe system. Originally developed at Yahoo, it graduated from the Apache incubator in September 2018 and became a top-level open source project of the Apache Software Foundation. Pulsar is built on the publish-subscribe (pub-sub) model: producers publish messages to topics, and consumers subscribe to topics, process the messages they receive, and send an acknowledgement (Ack) once processing is complete. Pulsar provides four subscription types, which can coexist on the same topic and are distinguished by subscription name (a small client sketch follows the list):

  • Exclusive subscription - only one consumer is allowed at a time under a given subscription name.

  • Shared subscription - multiple consumers can attach to the subscription, and each consumer receives a portion of the messages.

  • Failover subscription - multiple consumers can connect to the same topic, but only one of them receives messages; another consumer starts receiving only when the current one fails.

  • Key_Shared subscription (beta feature) - multiple consumers connect to the same topic, and messages with the same key are always delivered to the same consumer.
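To make the subscription model concrete, here is a minimal sketch using the Pulsar Java client from Scala. It attaches two consumers to the same Shared subscription so that each receives a portion of the messages; the service URL, topic, subscription, and consumer names are placeholders rather than values from this article.

import org.apache.pulsar.client.api.{PulsarClient, SubscriptionType}

object SharedSubscriptionSketch extends App {
  val client = PulsarClient.builder()
    .serviceUrl("pulsar://localhost:6650")             // placeholder cluster address
    .build()

  // Two consumers under the same subscription name with the Shared type:
  // messages published to the topic are split between them.
  def newWorker(name: String) = client.newConsumer()
    .topic("persistent://public/default/orders")       // placeholder topic
    .subscriptionName("order-processors")
    .subscriptionType(SubscriptionType.Shared)         // Exclusive / Failover / Key_Shared also available
    .consumerName(name)
    .subscribe()

  val worker1 = newWorker("worker-1")
  val worker2 = newWorker("worker-2")

  // Each worker acknowledges the messages it receives (receive() blocks until one arrives).
  val msg = worker1.receive()
  worker1.acknowledge(msg)

  worker1.close(); worker2.close(); client.close()
}

Switching the subscriptionType argument is the only change needed to try the Exclusive, Failover, or Key_Shared behavior described above.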

Pulsar was designed with multi-tenancy in mind from the start. Tenants can span multiple clusters, and each tenant has its own authentication and authorization scheme; the tenant is also the management unit for storage quotas, message time-to-live (TTL), and isolation policies. Pulsar's multi-tenancy is fully reflected in the topic URL, whose structure is persistent://tenant/namespace/topic. The namespace is the most basic management unit in Pulsar: at the namespace level we can set permissions, adjust replication options, manage cross-cluster data replication, control message expiration, and perform other key tasks.

Pulsar's unique architecture

The most fundamental difference between Pulsar and other messaging systems is that it adopts a layered architecture that separates compute from storage. A Pulsar cluster consists of two layers: a stateless service layer made up of a group of brokers that accept and deliver messages, and a distributed storage layer made up of a group of Apache BookKeeper storage nodes (bookies) that provide high availability, strong consistency, and low latency.

Like Kafka, Pulsar stores topic data based on the logical concept of topic partitions. The difference is that Kafka's physical storage is also partition-based: each partition must be stored as a whole (as a directory) on a broker, whereas each Pulsar topic partition is essentially a distributed log stored on BookKeeper. Each log is further divided into segments, and each segment is stored as a BookKeeper ledger, evenly distributed across multiple bookies. The layered storage architecture and segment-centric sharded storage are Pulsar's two key design concepts, and they give Pulsar many important advantages: unlimited topic partitions, instant storage scaling without data migration, seamless broker failure recovery, seamless cluster expansion, seamless storage (bookie) failure recovery, and independent scalability.

A messaging system decouples producers from consumers, but the messages themselves are still structured in nature, so producers and consumers need a coordination mechanism to reach consensus on the message structure used in production and consumption and thereby achieve type safety. Pulsar has a built-in schema registry that provides a way to negotiate message types on the messaging system side: clients agree on topic-level message type information by uploading a schema, and Pulsar takes care of message type checking and the automatic serialization and deserialization of typed messages, reducing the cost of repeatedly developing and maintaining message-parsing code across multiple applications. Schema definition and type safety are, of course, an optional mechanism and add no performance overhead to the publishing or consumption of untyped messages.
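As a minimal illustration of the schema mechanism (not code from this article), the sketch below declares a typed producer and consumer with the Pulsar Java client, which can be used directly from Scala. Schema.STRING is chosen for brevity; Pulsar also ships AVRO and JSON schemas for structured payload classes. The URL and topic are placeholders.

import org.apache.pulsar.client.api.{PulsarClient, Schema}

object TypedMessagingSketch extends App {
  val client = PulsarClient.builder()
    .serviceUrl("pulsar://localhost:6650")              // placeholder cluster address
    .build()

  // A typed producer: the schema is registered and checked on the broker side.
  val producer = client.newProducer(Schema.STRING)
    .topic("persistent://public/default/typed-topic")   // placeholder topic
    .create()
  producer.send("hello, typed Pulsar")

  // A typed consumer: payloads are deserialized to String automatically.
  val consumer = client.newConsumer(Schema.STRING)
    .topic("persistent://public/default/typed-topic")
    .subscriptionName("typed-sub")
    .subscribe()
  val msg = consumer.receive()
  println(msg.getValue)
  consumer.acknowledge(msg)

  producer.close(); consumer.close(); client.close()
}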

Read and write Pulsar data in Spark  

Since Structured Streaming was officially released in Spark 2.2, Spark has kept SparkSession as the single entry point of a program: you only need to write Dataset/DataFrame API programs that declaratively describe the operations on the data, while the details of query optimization and batch/stream execution are handled by the Spark SQL engine. A data processing job needs to define three parts: how the DataFrame is produced, how it is transformed, and how it is written out. Integrating Pulsar with Spark as a streaming data platform therefore comes down to solving two problems: how to read data from Pulsar (Source) and how to write results back to Pulsar (Sink).
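This three-part structure (define how the DataFrame is produced, declare the transformation, define where results are written) can be sketched with Spark's built-in rate source and console sink, independently of Pulsar; all names below are illustrative.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.window

object StructuredStreamingSketch extends App {
  // Since Spark 2.x, SparkSession is the single program entry point.
  val spark = SparkSession.builder().appName("ss-sketch").master("local[*]").getOrCreate()
  import spark.implicits._

  // 1. Define how the DataFrame is produced (the built-in "rate" source as a stand-in).
  val input = spark.readStream.format("rate").option("rowsPerSecond", "10").load()

  // 2. Declare the transformation with the DataFrame API;
  //    the Spark SQL engine decides how to execute it incrementally.
  val counts = input.groupBy(window($"timestamp", "10 seconds")).count()

  // 3. Define where the results are written out and start the query.
  counts.writeStream
    .format("console")
    .outputMode("complete")
    .start()
    .awaitTermination()
}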

To read batch and streaming data from Pulsar as a source and to write batch and streaming results back to Pulsar, we built the Spark Pulsar Connector.

Support for Structured Streaming

The figure above shows the main components of Structured Streaming (hereafter SS):

  • Input and output - to provide fine-grained fault tolerance, SS requires the input data source (Source) to be replayable; to provide end-to-end exactly-once semantics, the output (Sink) must support idempotent writes (writing a message multiple times has the same effect as writing it once, which a DBMS or KV store can enforce through key constraints).

  • API - users specify queries over one or more streams and tables with Spark SQL's batch API (SQL or DataFrame) and define an output table that holds all the results, while the engine internally decides how to write the results to the Sink incrementally. To support stream processing, SS adds several interfaces on top of the existing Spark SQL API (illustrated in the sketch after this list):

    • Trigger - controls how often the engine triggers stream processing and updates the results in the Sink.

    • Watermark policy - the user designates a field as the event time to decide how late-arriving data is handled.

    • Stateful operators - users can track and update mutable state inside an operator by key to implement complex business requirements (for example, session-based windows).

  • Execution layer - when a query is received, SS decides how to execute it incrementally, optimizes it, and starts execution. SS has two execution models to choose from:

    • Microbatch model - the default execution mode. Similar to Spark Streaming's DStream, the stream is cut into micro-batches and each batch is processed separately. This mode supports mechanisms such as dynamic load balancing and failure recovery, and suits applications whose primary performance metric is throughput.

    • Continuous mode - long-running operators are launched on the cluster, suitable for relatively simple, latency-sensitive applications.

  • Log and state store - SS relies on two kinds of persistent storage for fault tolerance: a write-ahead log (WAL) that records the position in each data source that has been successfully consumed and durably written out, and a large-scale state store that holds snapshots of the internal state of long-running aggregation operators. When a failure occurs, SS restores the stream processing state from the snapshot position by replaying the messages after it.
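The Trigger and Watermark interfaces from the list above look as follows in user code; this is a generic sketch on the built-in rate source, with illustrative intervals and column names, not connector-specific code.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.window
import org.apache.spark.sql.streaming.Trigger

object TriggerWatermarkSketch extends App {
  val spark = SparkSession.builder().appName("trigger-watermark").master("local[*]").getOrCreate()
  import spark.implicits._

  val events = spark.readStream.format("rate").option("rowsPerSecond", "5").load()

  // Watermark: treat "timestamp" as event time and tolerate 30 seconds of lateness;
  // data arriving later than that may be dropped from windowed aggregations.
  val windowed = events
    .withWatermark("timestamp", "30 seconds")
    .groupBy(window($"timestamp", "1 minute"))
    .count()

  // Trigger: run a micro-batch every 10 seconds. For simple map-like queries,
  // Trigger.Continuous("1 second") would select the continuous execution mode instead.
  windowed.writeStream
    .format("console")
    .outputMode("update")
    .trigger(Trigger.ProcessingTime("10 seconds"))
    .start()
    .awaitTermination()
}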

At the source-code level, the Source trait defines the functionality a replayable data source must provide.

trait Source {
 def schema: StructType                                       // schema of the data this source produces
 def getOffset: Option[Offset]                                // latest offset currently available, if any
 def getBatch(start: Option[Offset], end: Offset): DataFrame  // data in the offset range (start, end]
 def commit(end: Offset): Unit                                // everything up to `end` is processed and may be discarded
 def stop(): Unit                                             // release resources
}

trait Sink {
 def addBatch(batchId: Long, data: DataFrame): Unit           // idempotently write the batch identified by batchId
}

Take the microbatch execution mode as an example:

  1. At the very beginning of each microbatch, SS asks the source for its current latest offset (getOffset) and persists it to the WAL.

  2. The source then returns the data in the interval defined by the start and end offsets that SS provides (getBatch).

  3. SS triggers the optimization and compilation of the computation logic and writes the results out to the sink (addBatch); only at this point are the actual data fetching and computation carried out.

  4. After the data has been completely written out to the sink, SS notifies the source that the data can be discarded (commit) and records the successfully executed batchId in the commit log it maintains internally (see the sketch after this list).
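From a user's point of view, the WAL and the commit log mentioned in steps 1 and 4 live under the query's checkpoint directory (in its offsets/ and commits/ subdirectories). A minimal sketch, with an illustrative path and assuming a streaming DataFrame df has been defined elsewhere:

// The checkpoint location stores the per-batch offset WAL and the committed batchIds,
// which is what allows a restarted query to resume exactly where it left off.
val query = df.writeStream
 .format("console")
 .option("checkpointLocation", "/tmp/ss-checkpoints/demo")   // illustrative path
 .start()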

In the Pulsar connector implementation specifically:

  1. Before any batch starts executing, SS calls the schema method to obtain the structure of the messages; inside schema, we fetch the schemas of all the topics from Pulsar's Schema Registry and check them for consistency.

  2. We then create a consumer for each topic partition and return that partition's data over the range (start, end] (a rough sketch follows this list).

  3. When the commit notification arrives from SS, we mark message consumption as complete in Pulsar via resetCursor on the topics. On the Sink side, the producers we build append the actual data obtained in addBatch to the corresponding topics as messages.
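Step 2 can be pictured with Pulsar's low-level Reader API. The sketch below is only an approximation of what a source task does, not the connector's actual code: it reads one topic partition from a start MessageId up to and including an end MessageId. The topic name and the end offset are placeholders.

import org.apache.pulsar.client.api.{MessageId, PulsarClient}
import java.util.concurrent.TimeUnit

object RangeReadSketch extends App {
  val client = PulsarClient.builder()
    .serviceUrl("pulsar://localhost:6650")                      // placeholder cluster address
    .build()

  // One reader per topic partition; a real source task would start just after
  // the last committed offset, since the range is (start, end].
  val endMessageId: MessageId = MessageId.latest                // placeholder for the batch's end offset
  val reader = client.newReader()
    .topic("persistent://public/default/events-partition-0")   // placeholder partition topic
    .startMessageId(MessageId.earliest)
    .create()

  var done = false
  while (!done) {
    val msg = reader.readNext(1, TimeUnit.SECONDS)
    if (msg == null) done = true                                // nothing more to read for now
    else {
      println(new String(msg.getData))                          // hand the payload to Spark in the real connector
      if (msg.getMessageId.compareTo(endMessageId) >= 0) done = true // stop once "end" (inclusive) is reached
    }
  }
  reader.close(); client.close()
}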

Support for batch jobs

A batch job executed at a particular point in time can be viewed as analyzing a snapshot, taken at that point in time, of the streaming data in the Pulsar platform. Spark queries historical data at the granularity of a Relation; the Spark Pulsar Connector provides an implementation of the createRelation method that builds a table from the topic partitions specified by the user and returns a Dataset carrying the schema information. During query planning, the connector works in two steps: first, it looks up the schemas of the one or more user-specified topics in the Pulsar Schema Registry and checks that the schemas of multiple topics are consistent; second, it partitions all the user-specified topic partitions into tasks, and the resulting splits become the execution granularity of the Spark source tasks.

Pulsar offers two levels of interfaces for accessing its data: the topic-partition-based Consumer/Reader interface, which reads data sequentially with traditional message-receiving semantics, and a segment-level read interface that reads segment data directly. Accordingly, a batch job reading from Pulsar can run at two granularities (that is, two degrees of read parallelism): topic-partition granularity (each topic partition becomes one split), or segment granularity (the segments of a topic partition are organized into multiple splits, so one topic partition maps to several splits). You can choose the read parallelism that matches the parallelism requirement of the batch job and the compute resources available. Writing the results of a batch job to Pulsar is equally straightforward: you only need to specify the output topics and a message routing rule (round-robin or by key), and each producer created in a Sink task sends the outgoing messages to the corresponding topic partitions.

How to use the Spark Pulsar Connector

Create a streaming Source from one or more topics.

val df = spark
 .readStream
 .format("pulsar")
 .option("service.url", "pulsar://localhost:6650")
 .option("admin.url", "http://localhost:8080")
 .option("topicsPattern", "topic.*") // Subscribe to a pattern
 // .option("topics", "topic1,topic2")    // Subscribe to multiple topics
 // .option("topic", "topic1"). //subscribe to a single topic
 .option("startingOffsets", startingOffsets)
 .load()
df.selectExpr("CAST(__key AS STRING)", "CAST(value AS STRING)")
 .as[(String, String)]


Build a batch Source.

val df = spark
 .read
 .format("pulsar")
 .option("service.url", "pulsar://localhost:6650")
 .option("admin.url", "http://localhost:8080")
 .option("topicsPattern", "topic.*")
 .option("startingOffsets", "earliest")
 .option("endingOffsets", "latest")
 .load()
df.selectExpr("CAST(__key AS STRING)", "CAST(value AS STRING)")
 .as[(String, String)]


Use the topic field carried in the data itself to continuously sink to multiple topics.

val ds = df
 .selectExpr("topic", "CAST(__key AS STRING)", "CAST(value AS STRING)")
 .writeStream
 .format("pulsar")
 .option("service.url", "pulsar://localhost:6650")
 .start()


Write batch results back to Pulsar.

df.selectExpr("CAST(__key AS STRING)", "CAST(value AS STRING)")
 .write
 .format("pulsar")
 .option("service.url", "pulsar://localhost:6650")
 .option("topic", "topic1")
 .save()

Note

Since the Spark Pulsar Connector supports consuming and writing structured messages, message metadata fields (event time, publish time, key, and messageId) are prefixed with a double underscore in the Spark schema (for example, __eventTime) to avoid potential naming conflicts with fields in the message payload.
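For example, assuming the metadata columns follow the double-underscore naming pattern described above (the exact column set should be checked against the connector version in use), a batch read can project the metadata alongside the payload:

val meta = spark
 .read
 .format("pulsar")
 .option("service.url", "pulsar://localhost:6650")
 .option("admin.url", "http://localhost:8080")
 .option("topic", "topic1")
 .load()
 .selectExpr(
   "CAST(__key AS STRING)",   // message key
   "__eventTime",             // event time set by the producer
   "__publishTime",           // time the broker accepted the message (assumed column name)
   "__messageId")             // Pulsar message id (assumed column name)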

