Application and Practice of Streaming Data Processing in Baidu Data Factory

The data factory originally used the Hive engine for offline batch analysis, PB-scale queries, and production of some core report data. As we promoted the platform, however, we found that users also needed complex analysis, real-time processing, and data mining. We began tracking Spark when Spark 1.0 was released, and rolled it out across the team at Spark 1.6, which at that point meant Spark Streaming. Spark 1.6's streaming API was based on RDDs and was not aligned with Spark's batch API.

Later, with the release of Spark 2.2 and 2.3, Spark introduced Structured Streaming, whose API is fully consistent with the batch API. At that point we carried out a complete architecture upgrade and built a more powerful stack around Spark: unified metadata handling and unified resource scheduling together with modules such as CRM and PFS; a unified Spark-based compute engine, into which the previous Hive stack was fully integrated; plus multiple submission methods, security management, and so on. Together these form a complete product.

image

This is our current overall technical architecture. The lower left corner is unified metadata processing, covering file metadata as well as structured metadata such as Hive or MySQL, all handled in a single layer. Alongside it is the unified scheduling engine: users register a queue, and at execution time only the queue is specified in order to obtain concrete resources; K8S and other backends are also supported. The layer above is the unified Spark engine, which currently supports SQL and the Dataset API and can run a variety of complex workloads.

The layer above that is the Jupyter-based workspace, our in-house scheduling, and our in-house stream-computing job management. Together they make up the complete data factory.

Application of Streaming Data Processing in Baidu Data Factory

Next is the core part: what Baidu has done around Spark stream and batch processing, mainly Spark streaming SQL, real-time-to-offline conversion, and real-time big-screen (dashboard) display.

image

First of all, Spark itself has a complete set of APIs and a dedicated engine for analysis. All stream and batch queries go through this API stack for a series of steps including syntax analysis, semantic analysis, and various optimizations. Notice that the lower right corner of the figure is vacant: Spark does not currently provide this piece. Many of our users came from Hive and are most comfortable with SQL, so when they want to process streaming data they urgently need a SQL engine that helps them turn their Hive SQL into streaming SQL.

We did some development around this. The upper part of the figure is Spark's API layer, where stream and batch processing are already well unified and differ only in read versus readStream. For batch SQL the mapping is straightforward: Spark reads a Source, which maps to a FROM table; the concrete processing maps to operations such as select, join, and union; and the write at the end maps to a target table such as a kafka_Table. Streaming SQL can be mapped in exactly the same way.

For example, in this case the job reads from Kafka and writes to HDFS, and the mapped streaming SQL uses a Kafka_Table: we explicitly define the Kafka table, and the user's processing becomes a select *. Note the STREAM keyword added in the middle. Spark itself processes everything with one unified engine; at the API layer, readStream means streaming and read means batch. We made the corresponding distinction at the SQL layer: the same SQL runs as batch SQL without the STREAM keyword and becomes a streaming job with it. The engine underneath is exactly the same; the only difference is that, when the plan is finally converted, the last part of the original plan is converted into a streaming plan.
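To make the read/readStream distinction and the SQL mapping concrete, here is a minimal sketch; the broker address, topic name, and table name are hypothetical, and the STREAM keyword shown in the comments is the SQL-layer extension described above, not standard Spark SQL (the spark-sql-kafka package must be on the classpath).

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("stream-batch-mapping").getOrCreate()

// Batch entry point: `read` maps to a plain FROM over the table.
val batchDf = spark.read
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092") // hypothetical broker
  .option("subscribe", "click_log")                 // hypothetical topic
  .load()

// Streaming entry point: `readStream` on the same source; only the entry point differs.
val streamDf = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "click_log")
  .load()

// Corresponding SQL in the data factory (STREAM is the SQL-layer extension):
//   batch:     SELECT * FROM kafka_table
//   streaming: SELECT STREAM * FROM kafka_table
```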

This also raises the question of how the Table is stored. As shown in the figure, storing a batch Table involves little more than the table name, schema, properties, and definitions such as location and other configuration. The Kafka Table we create can be mapped into the same structure: its table name maps to Table Name, and so on, in one-to-one correspondence with the metastore side, so we end up with complete table storage information on the Hive side. A virtual path is generated, but nothing is actually created or executed. In the data factory, once a table is defined this way it can be used by many people (temporary tables excepted): you can authorize others to read it, apply column pruning, and run desensitized analysis on it. That is why we have to consider the generality of the Table.

The first point is that a Table we create must not bundle multiple data sources. If you have three data sources, define three tables; you can then join the different sources, run joint analysis, and so on, all of which works fine in Spark.

The second point is that a Table should only hold general configuration. Once the table is shared, authorized, and used in computation, every consumer has their own application scenario, so scenario-specific information such as watermark configuration must not be mixed in; that kind of information cannot be stored in the Table.
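As a rough illustration of the kind of table definition being described, here is a hypothetical DDL sketch registered in the Hive metastore; the table, broker, and topic names are made up, and only general configuration goes into the table, while scenario-specific settings such as watermarks are supplied by each query.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("kafka-table-ddl")
  .enableHiveSupport() // table metadata lives in the Hive metastore
  .getOrCreate()

// Hypothetical Kafka table: only source type, brokers, and topic are stored.
// In stock Spark this table is readable as a batch source; reading it as a
// stream through SQL relies on the STREAM extension described above.
spark.sql("""
  CREATE TABLE IF NOT EXISTS ads_click_kafka
  USING kafka
  OPTIONS (
    kafka.bootstrap.servers 'broker:9092',
    subscribe 'click_log'
  )
""")
```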

In addition, Spark natively supports Kafka both as a streaming Table and as a batch source, so the Table must be readable either as a stream or in batches. Now that the Table is defined, how do we read it? This is where we modified Spark. The main change is to the logic of the semantic analysis layer; the whole logical framework is shown on the right.

image

Our main change is in the semantic analysis layer. We add an analysis rule, FindDataSource, which resolves streaming tables, and a dedicated executable node, SQLStreamingSink, which handles the streaming read: it picks up the watermark and other configuration and ultimately produces a streaming plan for execution. After semantic analysis the query becomes a SQLStreamingSink, and from that point on execution is scheduled exactly like the normal API path, so API and SQL are fully consistent beyond the semantic layer; the only difference in the whole pipeline is at the semantic layer itself.
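The actual change is a patch inside Spark itself, but as a rough sketch of where such a rule hooks in, the snippet below uses Spark's public SparkSessionExtensions API; the class names FindDataSourceSketch and StreamingSqlExtensions are hypothetical, and the rule body is a placeholder rather than the real FindDataSource / SQLStreamingSink logic.

```scala
import org.apache.spark.sql.{SparkSession, SparkSessionExtensions}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Placeholder rule: a real FindDataSource would match relations whose table
// metadata marks them as streaming sources and rewrite them into a streaming
// relation that is eventually planned as a SQLStreamingSink.
case class FindDataSourceSketch(spark: SparkSession) extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan // no-op in this sketch
}

// Registered with: --conf spark.sql.extensions=StreamingSqlExtensions
class StreamingSqlExtensions extends (SparkSessionExtensions => Unit) {
  override def apply(ext: SparkSessionExtensions): Unit =
    ext.injectResolutionRule(session => FindDataSourceSketch(session))
}
```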

image

To summarize: first, we provide unified Hive metadata storage, and all of our Tables are based on Hive. Hive is only one possible choice; the point is to manage data uniformly in a data-warehouse fashion. We added FindDataSource to handle the mapping from a Table to a Source; user operations such as groupBy, filter, and sum are analyzed during semantic resolution. We have offered this solution to the community. If you are interested in better ways to handle Streaming-join-Batch processing, search for SQLStreaming among Spark's patches to find the ongoing discussion; you are welcome to join in.

The Real-Time-to-Offline Problem

Many people ask whether real-time-to-offline conversion is really a common scenario. Inside Baidu it is used quite a lot. User click logs, for example, are mostly written in real time and pushed into the Kafka message queue for stream processing, yet there is still a need to analyze the same data at the day, month, or even year level. What do you do then? We therefore provide a real-time-to-offline capability. Spark itself already offers a form of real-time-to-offline, which is the code on the right.

image

For example, you define a CSV output along with its partition path, partition, location, and so on, and then run the write, and problems appear. First, the output information is recorded entirely by hand, say on a card or in a shared document; downstream users have to ask upstream for it, and once upstream changes anything, that information is immediately out of sync.
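The figure with the original code is not reproduced here, so the following is a hedged sketch of what the native Spark approach typically looks like: a streaming file sink writing partitioned CSV to a fixed path. The source, columns, and paths are hypothetical; the point is that path, format, and partitioning live only in the code.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("realtime-to-offline").getOrCreate()

// Hypothetical streaming source, e.g. the Kafka click log from earlier.
val clicks = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "click_log")
  .load()
  .selectExpr("CAST(value AS STRING) AS line", "timestamp")
  .withColumn("dt", date_format(col("timestamp"), "yyyyMMdd"))

// Native approach: output location, format, and partitioning are hard-coded,
// which is exactly what makes later migrations and format changes painful.
val query = clicks.writeStream
  .format("csv")
  .partitionBy("dt")
  .option("path", "hdfs:///warehouse/click_log_csv")          // hypothetical path
  .option("checkpointLocation", "hdfs:///checkpoints/click")  // hypothetical path
  .start()
```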

Another issue is migration and upgrades. Cluster migration is a fact of life inside Baidu: a data center is generally decommissioned after a few years, which forces machine migrations and, with them, changes to output paths or output formats. Most data developers, however, expect the code they write to keep running for a long time. With the current native real-time-to-offline approach they would have to take the original code, modify it, rebuild, and rerun it, and this is what users complain about most in practice.

A further issue is extending file formats. Suppose the upstream party requires SequenceFile output in a specified format, while downstream needs a different SequenceFile format: how do you extend that? The development cost is high and many users resist it. So what do we offer instead? A real-time-to-data-warehouse solution.

image

As you know, Hive holds a great deal of metadata. Once a unified administrator manages it, this information is clear and visible to everyone else. When the output is upgraded or migrated, you only need to change the location, and the code change is easy. What does the code look like? It is the line of code in the lower right corner.

image

Concretely, the implementation works like this. Natively there is a FileSink that reads, say, CSV data, processes it, and writes it out as a stream. Our change turns this into Hive's FileFormat, so the data is written out batch by batch. When we create the FileSink we read the Table the user filled in, take its information, and inject it into the write. Note the listener, HiveDynamicPartitionSinker: after every batch completes it receives feedback about what was written, including which partitions were produced. With that partition information we can do some processing and finally register the partitions in Hive, so exactly which partitions have been produced can always be queried in Hive.
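The real implementation modifies FileSink and adds the HiveDynamicPartitionSinker listener inside Spark. As a rough approximation using only public APIs, the sketch below writes each micro-batch into a partitioned Hive table via foreachBatch, so the produced partitions end up registered in the metastore; the database, table, and path names are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder()
  .appName("realtime-to-warehouse")
  .enableHiveSupport()
  .config("hive.exec.dynamic.partition", "true")
  .config("hive.exec.dynamic.partition.mode", "nonstrict")
  .getOrCreate()

// Hypothetical partitioned warehouse table: location and format live in Hive
// metadata, so a migration only needs ALTER TABLE ... SET LOCATION, not a code change.
spark.sql("CREATE DATABASE IF NOT EXISTS ods")
spark.sql("""
  CREATE TABLE IF NOT EXISTS ods.click_log (line STRING)
  PARTITIONED BY (dt STRING)
  STORED AS PARQUET
""")

val clicks: DataFrame = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "click_log")
  .load()
  .selectExpr("CAST(value AS STRING) AS line",
              "date_format(timestamp, 'yyyyMMdd') AS dt")

// Append every micro-batch through the Hive table so the metastore always knows
// which partitions exist -- roughly the role of HiveDynamicPartitionSinker.
val writeBatch = (batch: DataFrame, batchId: Long) =>
  batch.write.mode("append").insertInto("ods.click_log")

val query = clicks.writeStream
  .foreachBatch(writeBatch)
  .option("checkpointLocation", "hdfs:///checkpoints/click_to_hive") // hypothetical
  .start()
```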

That is essentially the complete real-time-to-warehouse solution. Based on it we have rolled the feature out to many businesses inside Baidu. It is also used for fine-grained permission handling: once data lands in the warehouse, downstream subscribers can be given data masking, read/write permission control, or column-level permission control, all through this solution, so real-time-to-warehouse has been quite popular inside Baidu.

Real-Time to Big-Screen Display

During our rollout we also encountered users with big-screen (dashboard) display scenarios.

image

This page shows a common usage pattern: streaming data is written in real time into an OLAP system and finally shown on a big screen. What is the problem with it? It requires several extra systems, such as a Kafka system and an OLAP system, before finally connecting to the display. Each system has its own owner, and whenever something breaks in the middle you have to go back and forth between owners and coordinate with them. It gets complicated: a single network failure delays the data, and every owner has to deal with their own bottleneck. Those bottlenecks are among the hardest problems in this usage scenario.

image

How do we handle it? Our current solution is built natively on Spark SQL and Spark shell. We built an interactive analysis layer: through the Livy JDBC API, a select * (or any other query) is submitted interactively to the cluster and the result is returned to the user. Combined with Spark's MemorySink, the complex analysis and business logic run first and write their results into cluster memory; the front end then queries that in-memory data interactively through the Livy JDBC API. The data lands directly in cluster memory with no other output, and Livy reads it straight from memory, avoiding disk and extra network hops. This solution has been well received for real-time processing and deployment.
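As a hedged sketch of the MemorySink part (the Livy side is ordinary JDBC against the same session), the snippet below writes a streaming aggregation into an in-memory table that interactive queries can then read; the broker, topic, field, and table names are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("realtime-dashboard").getOrCreate()

val clicks = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "click_log")
  .load()
  .selectExpr("CAST(value AS STRING) AS ad_id")

// Keep the aggregated result in cluster memory under a queryable table name.
val query = clicks
  .groupBy(col("ad_id"))
  .count()
  .writeStream
  .outputMode("complete")          // the memory sink keeps the full result set
  .format("memory")
  .queryName("realtime_dashboard") // name of the in-memory table
  .start()

// The front end (e.g. via Livy JDBC) then runs interactive queries such as:
spark.sql("SELECT * FROM realtime_dashboard ORDER BY `count` DESC").show()
```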

image

This, then, is our overall stream and batch processing: users bring data in through various channels, and we process it stream-to-stream, stream-to-batch, or batch-to-batch. Everything in between can be mapped to Tables, from a streaming Table to a batch Table and on to the concrete business consumer, or from a streaming Table to a streaming Table, or from a batch Table to a batch Table; all of these work.

Practice of Streaming Data Processing in Baidu Data Factory

Based on this processing stack, here is a brief look at some practice inside Baidu. The figure below is the first productized version of the streaming UI; we are already designing the third. On the left is the streaming SQL submission page, where users can add their SQL; on the right is streaming monitoring, similar to Spark's own monitoring page, showing the day's data, actual throughput, data latency, and so on.

image

The advertising material analysis case is a fairly typical stream-and-batch scenario for us. In real product operations, advertisers place ads and pay based on actual click-through, exposure, and conversion rates, and many advertisers pay according to exposure, click, and conversion volumes.

In this case we need to analyze the advertising materials: generate each ad's PV, UV, click-through rate, and conversion rate from the click, exposure, and conversion logs, and compute advertising revenue from the billing data.
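As a hedged sketch of the kind of streaming aggregation this involves (the event schema, field values, and window size are assumptions for illustration), the snippet below derives per-ad PV, approximate UV, click-through rate, and conversion rate over one-minute windows; the result could then feed the dashboard or the real-time-to-warehouse paths described above.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("ad-material-metrics").getOrCreate()

// Hypothetical unified ad event schema; action is "show", "click" or "convert".
val eventSchema = new StructType()
  .add("event_time", TimestampType)
  .add("ad_id", StringType)
  .add("user_id", StringType)
  .add("action", StringType)

val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "ad_events")
  .load()
  .select(from_json(col("value").cast("string"), eventSchema).as("e"))
  .select("e.*")

// Per-ad, per-minute PV / UV / CTR / CVR.
val metrics = events
  .withWatermark("event_time", "10 minutes")
  .groupBy(window(col("event_time"), "1 minute"), col("ad_id"))
  .agg(
    sum(when(col("action") === "show", 1).otherwise(0)).as("pv"),
    approx_count_distinct("user_id").as("uv"),
    sum(when(col("action") === "click", 1).otherwise(0)).as("clicks"),
    sum(when(col("action") === "convert", 1).otherwise(0)).as("conversions")
  )
  .withColumn("ctr", col("clicks") / col("pv"))
  .withColumn("cvr", col("conversions") / col("clicks"))
```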

image

As you can see, part of the data goes straight to the big-screen display through streaming, so advertisers directly see real-time user volume, current output, and revenue. The other part goes offline through the real-time-to-offline path, where daily- and monthly-level analysis produces daily and monthly PV figures for back-end strategists to use when adjusting the ad delivery strategy.



Source: blog.51cto.com/15060462/2678175