Kafka Cluster Optimization and Application in Mafengwo's Big Data Platform

Kafka is a popular message-queue middleware that can process massive amounts of data in real time. With its high throughput, low latency, reliable delivery, and asynchronous transfer mechanism, it solves the problem of exchanging and transferring data between different systems.
Kafka is also used very widely at Mafengwo, supporting many core businesses. This article focuses on the practical application of Kafka in Mafengwo's big data platform: the relevant business scenarios, the problems we encountered at different stages of applying Kafka and how we solved them, and our plans going forward.


Part.1 Application Scenarios

From the perspective of its application in the big data platform, Kafka usage falls into three categories:
The first is using Kafka as a database, providing storage services for the platform's real-time data. Along the two dimensions of source and usage, the real-time data can be divided into business-side DB data, monitoring logs, client-side event-tracking logs (H5, WEB, APP, mini-program), and server-side logs.
The second is providing data sources for data analysis. Each event-tracking log serves as a data source and connects to the company's offline data warehouse, real-time data warehouse, and analysis systems, covering multi-dimensional queries, real-time OLAP on Druid, log details, and so on.
The third is providing data subscriptions for business sides. Besides applications inside the big data platform, we also use Kafka to provide data subscription services to core businesses such as search, recommendation, large transportation, hotels, and the content center, for scenarios such as real-time user feature computation, real-time user profile training, real-time recommendation, anti-cheating, and business monitoring and alerting.
The main applications are shown in the figure below:


Part.2 An Evolution Path in Four Stages

The big data platform introduced Kafka early on as the message middleware for its log collection and processing system, mainly because of its low latency, high throughput, multi-subscriber support, data replay, and other features that fit big data scenarios well. But as business volume grew rapidly, problems surfaced in operation and maintenance: imperfect registration and monitoring mechanisms meant problems could not be located quickly, and some online real-time tasks could not recover quickly after failures, causing message backlogs. The stability and availability of the Kafka cluster were challenged, and we went through several serious incidents.
Solving these problems became urgent and difficult for us. Targeting the pain points in the big data platform's use of Kafka, we carried out a series of practices from cluster usage to application-layer extensions, which on the whole fall into four phases:
Phase 1: version upgrade. Around the bottlenecks and problems in producing and consuming data on the platform, we carried out technology selection on Kafka versions and finally settled on version 1.1.1.
Phase 2: resource isolation. To support the rapid growth of the business, we improved resource isolation between multiple clusters and among Topics within a cluster.
Phase 3: access control and monitoring alerting.
First, in terms of security, the early Kafka cluster ran "naked" without authentication. Since multiple product lines shared the cluster, it was easy for one business to mistakenly consume another's Topics, creating data security issues. We therefore added authentication based on SASL/SCRAM + ACL.
In terms of monitoring and alerting, Kafka has become a standard input data source for real-time computation, so each consumer's Lag backlog and processing status have become important indicators of whether a real-time task is healthy. The big data platform therefore built a unified Kafka monitoring and alerting platform named "Radar" to monitor the Kafka clusters and their consumers from multiple dimensions.
Phase 4: application extensions. When Kafka was first opened up to the company's business lines, the lack of unified usage standards led to incorrect usage by some business sides. To address this pain point, we built a real-time subscription platform that delivers resources to business sides in the form of application services, automating the whole workflow: applications for data production and consumption, platform user authorization, user-side monitoring and alerting, and many other steps, giving the platform closed-loop control over resources across the demand side's entire usage process.
Next we will elaborate on a few key points.

Part.3 Core Practices

1. Version Upgrade

The big data platform had been using an early Kafka version, 0.8.3, while the latest official Kafka release had already reached 2.3. Many of the bottlenecks and problems we gradually ran into during long-term use of version 0.8 could be solved by upgrading.
For example, here are some common problems with the old version:
  • Lack of security support: data security issues existed, and resources could not be managed at a fine granularity through authentication and authorization
  • Broker under-replicated: brokers were found in an under-replicated state, but the cause was hard to determine and the problem hard to resolve
  • New features unavailable: such as transactional messages, idempotent messages, message timestamps, and message lookup by timestamp
  • Client offset management depended on ZooKeeper: heavy reliance on ZooKeeper increased operation and maintenance complexity
  • Imperfect monitoring metrics: metrics such as data sizes at the topic, partition, and broker level were missing, and monitoring tools such as Kafka Manager supported low Kafka versions poorly
We also researched the notable features of candidate target versions, such as:
  • Version 0.9: added quotas and security, of which the security authentication and authorization functions were what we cared about most
  • Version 0.10: finer-grained timestamps; you can quickly locate data by offset and look up the offset you want by timestamp. This is extremely important for data replay in real-time processing built on Kafka data sources (see the sketch after this list)
  • Version 0.11: support for idempotence and transactions, and fixes for replica data loss / data inconsistency
  • Version 1.1: operability improvements. For example, shutting down a broker via controlled shutdown used to require a long and complex process, which was greatly improved in version 1.0
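To make the timestamp lookup concrete, here is a minimal sketch of a consumer that replays data from one hour ago via offsetsForTimes, available since version 0.10.1. The broker address, group id, and topic name are placeholders, not our production values:

```java
import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
import org.apache.kafka.common.TopicPartition;

public class ReplayFromTimestamp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");   // placeholder address
        props.put("group.id", "replay-demo");             // hypothetical group
        props.put("enable.auto.commit", "false");
        props.put("key.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition tp = new TopicPartition("mobile-event", 0); // example topic
            consumer.assign(Collections.singletonList(tp));

            // Ask the broker for the earliest offset whose timestamp >= the target time
            long target = System.currentTimeMillis() - 60 * 60 * 1000L; // one hour ago
            Map<TopicPartition, OffsetAndTimestamp> found =
                consumer.offsetsForTimes(Collections.singletonMap(tp, target));

            OffsetAndTimestamp oat = found.get(tp);
            if (oat != null) {
                consumer.seek(tp, oat.offset()); // replay from that point onward
            }
            ConsumerRecords<String, String> records = consumer.poll(1000);
            records.forEach(r ->
                System.out.printf("offset=%d ts=%d%n", r.offset(), r.timestamp()));
        }
    }
}
```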
We chose version 1.1 after weighing the compatibility of Camus with the Kafka version against the important new features version 1.1 already supported for our usage scenarios. A brief note on the Camus component: it is open-sourced by LinkedIn and serves on our big data platform as an important way to dump data from Kafka to HDFS.

2. Resource Isolation

Previously, because the complexity and scale of the business were small, the big data platform divided its Kafka clusters in a relatively simple way. After a while, the company's business data all ended up mixed together; unreasonable use of Topics by one business could overload some brokers and affect other, normal businesses, and in the worst case the failure of a few brokers could affect the entire cluster and risk making company-wide business unavailable.
To solve these problems, we carried out two practices when transforming the clusters:
  • Splitting into independent clusters by functional attribute
  • Topic-level resource isolation within a cluster
(1) Cluster Splitting
We split Kafka into multiple physical clusters along functional dimensions for business isolation, reducing operation and maintenance complexity.
Taking the most important event-tracking data as an example, it is currently split across three clusters, with each cluster's function defined as follows:
  • Log cluster: event-tracking data collected from each client lands in this cluster first, and this process must not be interrupted by Kafka problems, which demands high availability. Therefore this cluster does not offer external subscriptions, keeping its consumers controllable. It also serves as the source for offline collection: data is dumped to HDFS at hourly granularity by the Camus component and then takes part in subsequent offline computation.
  • Full subscription cluster: the vast majority of Topics in this cluster are synchronized in real time from the Log cluster. As mentioned above, the Log cluster's data is not exposed externally, so the full cluster takes on the duty of serving consumer subscriptions. It is mainly used by real-time tasks inside the platform to provide analysis services for multiple business lines.
  • Personalized custom cluster: as mentioned earlier, we can split and merge data log sources according to business-side needs, and we also support customized Topics; this cluster only needs to provide storage for the Topics that land after splitting.
The overall cluster division is shown in the figure below:
(2) Resource Isolation
A Topic's traffic volume is an important basis for resource isolation within a cluster. For example, our two event-tracking data sources with the largest log volumes are the mobile-side source mobile-event and the server-side source server-event. We must avoid assigning partitions of these two Topics to the same broker nodes. By physically separating the partitions of different Topics, we avoid traffic skew across brokers; a sketch of pinning partition replicas at Topic creation time follows.
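As an illustration of such pinning, here is a minimal sketch using the Java AdminClient to create a Topic with an explicit replica assignment, so its partitions live only on a chosen set of brokers. The topic name, broker IDs, and address are illustrative, not our actual layout:

```java
import java.util.*;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class IsolatedTopicCreation {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092"); // placeholder address

        try (AdminClient admin = AdminClient.create(props)) {
            // Pin mobile-event partitions to brokers 1-3 only, keeping them
            // physically apart from server-event (which would live on brokers 4-6).
            Map<Integer, List<Integer>> assignment = new HashMap<>();
            assignment.put(0, Arrays.asList(1, 2)); // partition 0 -> leader 1, follower 2
            assignment.put(1, Arrays.asList(2, 3));
            assignment.put(2, Arrays.asList(3, 1));

            NewTopic topic = new NewTopic("mobile-event", assignment);
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```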

3. Access Control and Monitoring Alerting

(1) Access Control
As mentioned at the beginning, the early Kafka clusters ran without security authentication, so anyone who knew the broker addresses could produce and consume, which posed serious data security issues.
In general, users of SASL choose Kerberos, but for our platform's Kafka usage scenario the user system is not complicated, and Kerberos would be overkill; Kerberos itself is also relatively complex and risks causing other problems. As for encryption, since everything runs inside the internal network, we do not use SSL.
In the end, the platform's Kafka clusters adopted SASL authentication in the lightweight combination of SASL/SCRAM + ACL, with dynamically created users, to ensure data security.
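On the client side, connecting to such a cluster only takes a few extra properties. Below is a minimal producer sketch using SASL/SCRAM over plaintext (matching the no-SSL choice above); the address, username, password, and topic are placeholders:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ScramProducerDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9093");   // placeholder address
        props.put("security.protocol", "SASL_PLAINTEXT"); // internal network, no SSL
        props.put("sasl.mechanism", "SCRAM-SHA-256");
        // Account issued by the subscription platform; name/password are placeholders
        props.put("sasl.jaas.config",
            "org.apache.kafka.common.security.scram.ScramLoginModule required "
            + "username=\"demo_user\" password=\"demo_pass\";");
        props.put("key.serializer",
            "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
            "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("mobile-event", "key", "value"));
        }
    }
}
```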
(2) Monitoring and Alerting
Previously we often found during cluster usage that consumer application performance degraded for no apparent reason. The cause, with high probability, was usually that a lagging Consumer's reads no longer hit the page cache, so the kernel on the broker machine had to load the data from disk into the page cache before returning it to the Consumer. In effect, a disk meant to serve writes was now also serving reads, which affected both reads and writes for consumers and reduced the performance of the cluster.
In that case we need to identify the applications with lagging Consumers and intervene beforehand to reduce the incidence of problems, so monitoring and alerting matter greatly to both the platform and its users. Below is our practical approach.
Overall solution:
The overall solution is based on the open-source components Kafka JMX Metrics + OpenFalcon + Grafana:
  • Kafka JMX Metrics: Kafka brokers expose their internal metrics to the outside in the form of JMX metrics. Version 1.1.1 provides rich monitoring metrics that meet our monitoring needs
  • OpenFalcon: an enterprise-grade, highly available, and scalable open-source monitoring system from Xiaomi
  • Grafana: the familiar metrics visualization system, which can connect to many kinds of metrics data sources.
Monitoring:
  • Falcon-agent: deployed on every broker to parse Kafka JMX metrics and report the data (a minimal JMX polling sketch follows this list)
  • Grafana: used to visualize Falcon Kafka metrics data, with monitoring dashboards for four roles: Cluster, Broker, Topic, and Consumer
  • Eagle: obtains consumer groups' Active status and Lag backlog, and provides an API that supplies monitoring data to the "Radar" monitoring and alerting system
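To show what reading a broker's JMX metrics looks like, here is a minimal sketch that connects to a broker's JMX port and reads the standard MessagesInPerSec meter. The host and port are placeholders and depend on the JMX_PORT the broker was started with:

```java
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class BrokerJmxProbe {
    public static void main(String[] args) throws Exception {
        // JMX port is whatever the broker was started with; 9999 is just an example
        JMXServiceURL url = new JMXServiceURL(
            "service:jmx:rmi:///jndi/rmi://broker1:9999/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection conn = connector.getMBeanServerConnection();

            // Broker-level message input rate, one of Kafka's standard JMX metrics
            ObjectName name = new ObjectName(
                "kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec");
            Object rate = conn.getAttribute(name, "OneMinuteRate");
            System.out.println("MessagesInPerSec(1m): " + rate);
        }
    }
}
```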
Alerting:
Radar system: a self-developed monitoring system that obtains Kafka metrics through Falcon and Eagle and raises alerts against configured thresholds. Taking consumption as an example, Lag is an important indicator of whether consumption is healthy; if Lag keeps growing, it must be dealt with.
When a problem occurs, not only the Consumer's administrators but also its users need to know, so the alerting system must notify users as well. Concretely, an enterprise WeChat alert bot automatically notifies the owners or users of the affected consumer group, along with the Kafka cluster administrators.
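As an illustration of the Lag metric itself, here is a minimal sketch that computes per-partition lag as log-end offset minus committed offset. It uses the AdminClient's listConsumerGroupOffsets, which assumes a 2.0+ client library; the address and group id are placeholders:

```java
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class GroupLagCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092"); // placeholder address

        try (AdminClient admin = AdminClient.create(props)) {
            // Committed offsets of the group being checked ("demo-group" is hypothetical)
            Map<TopicPartition, OffsetAndMetadata> committed =
                admin.listConsumerGroupOffsets("demo-group")
                     .partitionsToOffsetAndMetadata().get();

            // Log-end offsets for the same partitions, fetched with a plain consumer
            Properties cprops = new Properties();
            cprops.put("bootstrap.servers", "broker1:9092");
            cprops.put("key.deserializer",
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");
            cprops.put("value.deserializer",
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");
            try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(cprops)) {
                Map<TopicPartition, Long> end = consumer.endOffsets(committed.keySet());
                committed.forEach((tp, om) -> {
                    long lag = end.get(tp) - om.offset();
                    System.out.printf("%s lag=%d%n", tp, lag); // alert above a threshold
                });
            }
        }
    }
}
```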
Monitoring examples:


4. Application Extensions

(1) Real-Time Data Subscription Platform
The real-time data subscription platform is a system application that manages the full workflow of Kafka usage. Through ticket-based approval, it automates the many steps involved, including applications for data production and consumption, platform user authorization, and user-side monitoring and alerting, under unified control.
The core idea is identity authentication and permission control based on Kafka data sources, improving data security while bringing Kafka's downstream applications under management.
(2) Standardized Application Workflow
Whether the need is production or consumption, the user first submits a subscription application as a ticket, including information such as the business line, Topic, and subscription method. The ticket eventually flows to the platform for approval; once approved, the user is assigned an authorized account and the broker addresses. From that point on, the user can produce and consume normally.
(3) Monitoring and Alerting
For the platform, permissions are bound to resources, where a resource can be a Topic used for production or a GroupTopic used for consumption. Once permissions are assigned, the use of those resources is automatically registered in our Radar monitoring system and monitored across the resource's entire lifecycle.
(4) Data Replay
Out of concern for data completeness and accuracy, the Lambda architecture has become a common big data architecture. On the other hand, the Lambda architecture also suffers from excessive resource usage and high development difficulty.
The real-time subscription platform can reset a consumer group to an arbitrary position, supporting replay of real-time data by time, offset, and other methods, and it supports Kappa-architecture scenarios, addressing the pain points above.
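One way such a reset can be implemented, sketched below, is to commit an explicit offset on behalf of a stopped consumer group, so that its next run resumes from the chosen position. The address, group id, topic, and offset are placeholders, and this is an illustration rather than the platform's actual code:

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class GroupOffsetReset {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092"); // placeholder address
        props.put("group.id", "demo-group");            // group being reset; must be inactive
        props.put("enable.auto.commit", "false");
        props.put("key.deserializer",
            "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.put("value.deserializer",
            "org.apache.kafka.common.serialization.ByteArrayDeserializer");

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            TopicPartition tp = new TopicPartition("mobile-event", 0); // example topic
            consumer.assign(Collections.singletonList(tp));

            // Commit an explicit position for the group; its next run resumes from here.
            long targetOffset = 123456L; // the point chosen for replay
            consumer.commitSync(Collections.singletonMap(
                tp, new OffsetAndMetadata(targetOffset)));
        }
    }
}
```

A time-based reset works the same way, except the target offset is first looked up with offsetsForTimes, as in the replay sketch earlier.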
(5) Topic Management
Why provide topic management? A simple example: when we want a user to be able to create his own Kafka Topic on the cluster, we obviously don't want him operating directly on a node. So for the services described above, whether for users or administrators, we need an interface to operate them, because not everyone can SSH into the servers.
We therefore need a service that provides management functions, creating a unified entry point and introducing topic management services, including topic creation, resource isolation designation, and topic metadata management.
(6) Data Splitting
In the previous architecture, users consumed Kafka data at the granularity of a whole Kafka Topic holding the full data of a LogSource, but many consumers only need part of a LogSource's data, perhaps just the events of a few tracking points under one application. If every downstream application has to write its own filtering rules, resources are clearly wasted and usability suffers; there are also scenarios where multiple data sources need to be merged for use together.
For these two cases, we implemented splitting, merging, and customization of Topics according to business-side needs, supporting cross-data-source merging and filtering rules with arbitrary combinations of conditions on appcode and event code; a sketch of such a split follows.
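The article does not specify which engine performs the splitting; purely as an illustration, here is how one such filter rule could look as a small Kafka Streams job. The topic names, application id, and the assumption that appcode/eventcode fields appear in the message value are all hypothetical:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class EventSplitter {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "event-splitter"); // hypothetical id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG,
            Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG,
            Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> all = builder.stream("mobile-event"); // full LogSource topic

        // Keep only records matching one appcode/event-code combination
        // (the value is assumed to carry "appcode=...;eventcode=..." fields).
        all.filter((key, value) ->
                value.contains("appcode=travel_app")
                && value.contains("eventcode=page_view"))
           .to("mobile-event-travel-pageview"); // customized downstream topic

        new KafkaStreams(builder.build(), props).start();
    }
}
```

In production, the rule set would be driven by the subscription platform's configuration rather than hard-coded as it is here.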
