Kafka Cluster Expansion, Optimization, and Application in Mafengwo's Big Data Platform

This is an original article from Mafengwo Technology. For more articles, follow the WeChat public account: mfwtech

Kafka is a popular message queue middleware that can process massive amounts of data in real time. With features such as high throughput, low latency, reliable delivery, and an asynchronous transfer mechanism, it solves the problem of exchanging and transferring data between different systems.

Kafka is also used very widely at Mafengwo, supporting many core businesses. This article focuses on the practice of Kafka in Mafengwo's big data platform: the relevant business scenarios, the problems we ran into at different stages of applying Kafka and how we solved them, and what our follow-up plans are.

Part.1 Application Scenarios

Looking at Kafka's applications in the big data platform, the scenarios fall into three categories:

The first is using Kafka as a database, providing the big data platform with storage services for real-time data. Along the two dimensions of source and usage, the real-time data divides into business DB data, monitoring-type logs, client-side tracking logs (H5, web, app, mini-program), and server-side logs.

The second is providing data sources for data analysis, with each tracking log as a data source, connecting to the company's offline data warehouse, real-time data warehouse, and analysis systems, including multi-dimensional queries, real-time OLAP on Druid, log details, and more.

The third is providing data subscriptions for business sides. Besides applications inside the big data platform itself, we also provide Kafka data subscription services to core businesses such as search, recommendation, large transportation, hotels, and the content center, for scenarios like real-time user feature computation, real-time user portrait training, real-time recommendation, anti-cheating, and business monitoring and alerting.

The main applications are shown in the figure below:

Part.2 The Road of Evolution

Four Stages

The big data platform introduced Kafka early on for the business's log collection and processing systems, mainly for its low latency, high throughput, multi-subscriber support, and data replay, which fit big data scenarios well. But as business volume grew rapidly, problems in operations and maintenance surfaced: imperfect registration and monitoring mechanisms meant problems could not be located quickly, and some online real-time tasks could not recover quickly after failures, causing message backlogs. The stability and availability of the Kafka cluster came under challenge, and we went through several serious incidents.

Solving these problems was both urgent and hard for us. Targeting the pain points in the big data platform's use of Kafka, we carried out a series of practices from the cluster to the application layer, falling overall into four phases:

The first phase: version upgrade. Targeting the bottlenecks and problems the platform hit in producing and consuming data, we evaluated current Kafka versions and ultimately settled on version 1.1.1.

The second phase: resource isolation. To support rapid business growth, we improved the construction of multiple clusters and the resource isolation between Topics within a cluster.

The third phase: access control and monitoring alerts.

First, on security: in the early days the Kafka cluster ran "naked", with no authentication. Since multiple product lines shared Kafka, one business could easily misread another's Topic, creating data security issues. We therefore added authentication based on SASL/SCRAM + ACL.

On monitoring and alerting: Kafka has become a standard input data source for real-time computation, so a consumer group's Lag backlog and processing status have become important indicators of whether a real-time task is healthy. The big data platform therefore built a unified Kafka monitoring and alerting platform named "Radar" to monitor Kafka clusters and consumers along multiple dimensions.

The fourth phase: application extension. As Kafka was opened up to the company's business lines early on, the lack of unified usage standards led to incorrect usage by some business sides. To address this pain point, we built a real-time subscription platform that delegates capabilities to business sides as application services, automating data production/consumption applications, platform user authorization, consumer monitoring and alerting, and many other steps, and establishing full closed-loop control of resources for the demand side.

Below we expand on a few of these key points.

Core Practices

1. Version Upgrade

The big data platform had long been using an early version of Kafka, 0.8.3. As of this writing, the latest official Kafka release has reached 2.3, so many of the bottlenecks and problems we gradually ran into during long-term use of version 0.8 could be solved through a version upgrade.

For example, some common problems with the old version:

  • Lack of Security support: data security issues existed, and there was no fine-grained resource management through authentication and authorization.
  • Broker under-replicated: brokers were found in an under-replicated state, but the cause was unclear and hard to resolve.
  • Newer features unavailable: e.g. transactional messages, idempotent messages, message timestamps, and message lookup by timestamp.
  • Client offset management depended on ZooKeeper, overusing ZooKeeper and adding operational complexity.
  • Incomplete monitoring metrics: indicators such as topic, partition, and broker data size were missing, and monitoring tools such as Kafka Manager supported old Kafka versions poorly.

At the same time, we researched the characteristics of several candidate target versions, for example:

  • Version 0.9: added quotas and security, of which the security authentication and authorization features were what we cared about most.
  • Version 0.10: finer-grained timestamps, making it possible to quickly locate an offset by timestamp and find the data you want. This is extremely important for data replay in real-time processing built on Kafka data sources.
  • Version 0.11: support for idempotency and transactions, and fixes for replica data loss / data inconsistency.
  • Version 1.1: operational improvements. For example, controlled shutdown: closing a Broker used to require a long and complicated process, which was greatly improved starting from version 1.0.

We ultimately chose version 1.1 after weighing Camus's compatibility with newer Kafka versions against the fact that 1.1 already supported the important new features our scenarios needed. A brief note on Camus: it is an open-source project from LinkedIn, and in our big data platform it is the main way we dump data from Kafka to HDFS.

2. Resource Isolation

Previously, because business complexity and scale were small, the big data platform divided its Kafka clusters in a fairly simple way. After a while, all of the company's business data was mixed together: irrational use by one business's Topic could overload certain Brokers and affect other, healthy businesses, and in the worst case the failure of a few Brokers could affect the entire cluster, risking company-wide business unavailability.

To solve these problems, we carried out two practices while restructuring the clusters:

  • Splitting clusters by functional attribute
  • Topic-level resource isolation within a cluster

(1) Cluster Splitting

We split Kafka into multiple physical clusters along functional dimensions to isolate services and reduce operational complexity.

Taking the most important tracking data as an example, it is currently split across three clusters, each defined as follows:

  • Log cluster: tracking data collected from each client lands in this cluster first, and this process must never be interrupted by Kafka problems, so the cluster has high availability requirements. It therefore provides no external subscriptions, keeping its consumers controllable. It also serves as the source for offline collection: the Camus component dumps its data to HDFS at hourly granularity, where part of it feeds subsequent offline computation.

  • Full-subscription cluster: the vast majority of this cluster's Topics are synchronized in real time from the Log cluster. As mentioned above, the Log cluster's data is not exposed externally, so the full-subscription cluster takes on the duty of serving consumer subscriptions. It mainly serves the platform's internal real-time tasks, which analyze the data and provide analysis services for multiple business lines.

  • Personalized custom cluster: as mentioned earlier, we can split and merge data from log sources according to business needs, and we also support customized Topics; this cluster only needs to store the Topics that land after such routing.

The overall cluster split is shown in the figure below:

(2) Resource Isolation

A Topic's traffic volume is an important basis for resource isolation within a cluster. For example, our two largest tracking-log data sources are the server-side source server-event and the mobile-side source mobile-event, and we want to avoid assigning partitions of these two Topics to the same Broker node. By physically separating the partitions of different Topics, we avoid skewing traffic onto particular Brokers.
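
The article does not show how this placement is enforced; one way, sketched below, is to create the topics with explicit replica assignments through Kafka's AdminClient, so the two heavy topics land on disjoint broker sets. The broker IDs and partition counts here are made up for illustration:

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.*;

public class IsolatedTopicCreation {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Pin server-event partitions to brokers 1-3 and mobile-event
            // partitions to brokers 4-6, so the two heavy topics never share a node.
            Map<Integer, List<Integer>> serverEventAssignment = new HashMap<>();
            serverEventAssignment.put(0, Arrays.asList(1, 2));
            serverEventAssignment.put(1, Arrays.asList(2, 3));
            serverEventAssignment.put(2, Arrays.asList(3, 1));

            Map<Integer, List<Integer>> mobileEventAssignment = new HashMap<>();
            mobileEventAssignment.put(0, Arrays.asList(4, 5));
            mobileEventAssignment.put(1, Arrays.asList(5, 6));
            mobileEventAssignment.put(2, Arrays.asList(6, 4));

            admin.createTopics(Arrays.asList(
                    new NewTopic("server-event", serverEventAssignment),
                    new NewTopic("mobile-event", mobileEventAssignment)
            )).all().get();
        }
    }
}
```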

3. Access Control and Monitoring Alerts

(1) Access Control

As mentioned at the beginning, the early Kafka cluster ran "naked" with no security authentication set up: anyone who knew a Broker's address could produce and consume, which was a serious data security problem.

Users who need SASL generally choose Kerberos, but for the platform's Kafka usage scenarios the user model is not complicated, so Kerberos would be overkill; it is also relatively complex and risks introducing other problems. As for encryption, since everything runs inside the internal network, we did not enable SSL.

In the end, the platform's Kafka clusters adopted SASL authentication based on the lightweight combination of SASL/SCRAM + ACL, with dynamically created users, to ensure data security.
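
For reference, a client connecting to a SASL/SCRAM-secured cluster only needs a few extra properties. A minimal consumer sketch, assuming a recent Java client; the username, password, topic, and addresses are placeholders:

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.util.Collections;
import java.util.Properties;

public class ScramConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "demo-group");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        // SASL/SCRAM over plaintext: authentication without SSL encryption,
        // matching the internal-network setup described above.
        props.put("security.protocol", "SASL_PLAINTEXT");
        props.put("sasl.mechanism", "SCRAM-SHA-256");
        props.put("sasl.jaas.config",
                "org.apache.kafka.common.security.scram.ScramLoginModule required "
                + "username=\"alice\" password=\"alice-secret\";");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("demo-topic"));
            consumer.poll(java.time.Duration.ofSeconds(1)).forEach(r ->
                    System.out.printf("%s -> %s%n", r.key(), r.value()));
        }
    }
}
```

On the broker side, SCRAM credentials can be created dynamically with the kafka-configs tool, which matches the dynamic user creation mentioned above.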

(2) Monitoring and Alerting

Previously we often found that the performance of consumer applications worsened for no obvious reason. The cause, with high probability, was a lagging Consumer whose reads missed the page cache, forcing the kernel on the Broker machine to first load the data from disk into the page cache before returning it to the Consumer. These disk reads interfered with the data currently being written, hurting reads and writes at the same time and degrading the cluster's overall performance.

So we needed to find lagging Consumer applications and intervene in advance to reduce the occurrence of problems; monitoring and alerting is thus highly significant both for the platform and for its users. Here are some ideas from our practice.

Overall solution:

The overall solution is built mainly on open-source components: Kafka JMX Metrics + OpenFalcon + Grafana.

  • Kafka JMX Metrics: Kafka brokers expose their internal metrics externally in the form of JMX Metrics. Version 1.1.1 provides rich monitoring metrics that satisfy our monitoring needs.
  • OpenFalcon: an enterprise-grade, highly available, scalable open-source monitoring system from Xiaomi.
  • Grafana: the well-known metrics visualization system, able to connect to a variety of metrics data sources.

On the monitoring side:

  • Falcon-agent: deployed on every Broker; parses Kafka JMX metrics and reports the data (a minimal JMX polling sketch follows this list).
  • Grafana: visualizes the Kafka metrics data in Falcon, with dashboards for the four roles Cluster, Broker, Topic, and Consumer.
  • Eagle: obtains each consumer group's Active state and Lag backlog, and exposes an API that feeds monitoring data to the alerting system "Radar".
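
By way of illustration, this is roughly what polling one of the standard broker MBeans over JMX looks like; the host, port, and metric are examples, not Mafengwo's actual agent code:

```java
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class BrokerJmxPoller {
    public static void main(String[] args) throws Exception {
        // Assumes the broker was started with JMX enabled, e.g. JMX_PORT=9999.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://broker1:9999/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection conn = connector.getMBeanServerConnection();

            // Broker-level inbound message rate, one of Kafka's standard MBeans.
            ObjectName messagesIn = new ObjectName(
                    "kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec");
            Object rate = conn.getAttribute(messagesIn, "OneMinuteRate");
            System.out.println("MessagesInPerSec (1m rate): " + rate);
        }
    }
}
```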

On the alerting side:

Radar system: a self-developed monitoring system that pulls Kafka metrics from Falcon and Eagle and raises alerts against configured thresholds. Taking consumption as an example, Lag is an important indicator of whether consumption is healthy: if Lag keeps growing, someone has to step in.
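
The article does not reveal Radar's internals, but lag itself can be computed by comparing a group's committed offsets against the log-end offsets. A rough sketch with the Java AdminClient; the group name and alert threshold are hypothetical:

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;

import java.util.Map;
import java.util.Properties;

public class LagChecker {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Committed offsets of the group being monitored.
            Map<TopicPartition, OffsetAndMetadata> committed =
                    admin.listConsumerGroupOffsets("demo-group")
                         .partitionsToOffsetAndMetadata().get();

            // Log-end offsets of the same partitions, via a throwaway consumer.
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
            try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
                Map<TopicPartition, Long> endOffsets = consumer.endOffsets(committed.keySet());
                committed.forEach((tp, om) -> {
                    long lag = endOffsets.get(tp) - om.offset();
                    if (lag > 100_000) { // hypothetical alert threshold
                        System.out.printf("ALERT %s lag=%d%n", tp, lag);
                    }
                });
            }
        }
    }
}
```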

When a problem occurs, it is not only the administrators who need to know; the Consumer's own users do too. So the alerting system must also notify users: we automatically alert the corresponding Kafka cluster administrators and the users or owners of consumer groups through an enterprise WeChat alert bot.

Monitoring Example:

4. Application Extension

(1) Real-Time Data Subscription Platform

The real-time data subscription platform is a system for full-process management of how applications use Kafka. It automates data production/consumption applications, platform user authorization, consumer monitoring and alerting, and many other steps through a ticket-approval workflow, providing unified management and control.

The core idea is to manage Kafka's data sources and downstream applications through identity authentication and access control, improving data security at the same time.

(2) Standardized Application Process

Whether producer or consumer, a user first submits a subscription application in the form of a ticket. The application includes information such as the business line, the Topic, and the subscription method; the ticket then flows to the platform for approval; once approved, the user is allocated an authorized account and Broker addresses. From that point on, the user can produce and consume normally.

(3) Monitoring and Alerting

For the platform, permissions are bound to resources, where a resource is either a Topic used for production or a Group used for consumption. Once permissions for a resource are allocated, the resource is automatically registered with our Radar monitoring system and monitored for its entire life cycle.

(4) Data Replay

For the sake of data integrity and accuracy, the Lambda architecture has become a common approach in big data. On the other hand, the Lambda architecture suffers from excessive resource usage and high development difficulty.

The real-time subscription platform can reset any consumer group to any position, supporting data replay by time, by offset, and in other ways, which supports Kappa-architecture scenarios and resolves these pain points.
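
The platform's internal mechanism is not described; one common way to implement replay from a point in time, shown below, is the offsetsForTimes lookup enabled by the 0.10 timestamp feature mentioned earlier, followed by seek. The topic name and time window are placeholders:

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
import org.apache.kafka.common.PartitionInfo;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.*;

public class ReplayFromTimestamp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "replay-group");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        long replayFrom = System.currentTimeMillis() - 2 * 60 * 60 * 1000L; // two hours ago

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Manually assign all partitions of the topic to this consumer.
            List<TopicPartition> partitions = new ArrayList<>();
            for (PartitionInfo p : consumer.partitionsFor("demo-topic")) {
                partitions.add(new TopicPartition(p.topic(), p.partition()));
            }
            consumer.assign(partitions);

            // Translate the timestamp into per-partition offsets, then seek.
            Map<TopicPartition, Long> query = new HashMap<>();
            partitions.forEach(tp -> query.put(tp, replayFrom));
            Map<TopicPartition, OffsetAndTimestamp> offsets = consumer.offsetsForTimes(query);
            offsets.forEach((tp, oat) -> {
                if (oat != null) consumer.seek(tp, oat.offset());
            });

            // From here, poll() re-reads data from the chosen point in time.
            consumer.poll(Duration.ofSeconds(1)).forEach(r ->
                    System.out.printf("replayed offset=%d value=%s%n", r.offset(), r.value()));
        }
    }
}
```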

(5) Topic Management

Why provide topic management? Take a very simple example: when a user wants to create a Kafka Topic on the cluster, we obviously don't want to let them operate directly on a cluster node. So, for users and administrators alike, we need a unified interface to operate through, since not everyone can simply SSH into the servers.

It is therefore necessary to provide topic management as a service with a unified entry point, covering topic creation, specifying resource isolation, and topic metadata management.
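
Such a service would typically wrap Kafka's AdminClient rather than shell access. A minimal sketch of the creation path; the retention setting and method shape are illustrative, not the platform's actual interface:

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

import java.util.Collections;
import java.util.Properties;

public class TopicService {
    private final AdminClient admin;

    public TopicService(String bootstrapServers) {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
        this.admin = AdminClient.create(props);
    }

    /** Create a topic on behalf of a user, with platform-controlled settings. */
    public void createTopic(String name, int partitions, short replicationFactor) throws Exception {
        NewTopic topic = new NewTopic(name, partitions, replicationFactor)
                .configs(Collections.singletonMap(
                        TopicConfig.RETENTION_MS_CONFIG, "259200000")); // 3-day retention
        admin.createTopics(Collections.singleton(topic)).all().get();
    }
}
```

Resource isolation at creation time could reuse the explicit replica assignment shown in the cluster-splitting section above.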

(6) Data Distribution

In the previous architecture, consumers consumed Kafka data at the granularity of whole Topics, with each Kafka Topic holding the full data of one LogSource. But many consumers use only part of a LogSource's data, perhaps the tracking events of just a few applications. If every downstream application had to write its own filtering rules, there would clearly be wasted resources and poor usability. Another class of scenarios needs data from multiple sources merged together for use.

For these two cases, we implemented Topic splitting, merging, and customization driven by business-side requests, supporting consolidation of data across data sources under arbitrary filtering rules on appcode and event code.
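
The article does not name the technology behind this service; as one possible realization, a Kafka Streams job can filter full log sources on an appcode rule and merge them into a custom output topic. The topic names and rule below are invented:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

import java.util.Properties;

public class TopicSplitMerge {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "split-merge-demo");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // Filter two full log sources down to one app's events...
        KStream<String, String> server = builder.<String, String>stream("server-event")
                .filter((k, v) -> v.contains("\"appcode\":\"hotel\""));
        KStream<String, String> mobile = builder.<String, String>stream("mobile-event")
                .filter((k, v) -> v.contains("\"appcode\":\"hotel\""));

        // ...and merge them into a single custom topic for the business side.
        server.merge(mobile).to("hotel-event-custom");

        new KafkaStreams(builder.build(), props).start();
    }
}
```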

Part.3 Follow-up Plans

  • Solving data duplication. To eliminate the data duplication that failure recovery and other factors cause in the real-time stream processing platform, we are trying to combine Flink's two-phase commit protocol with Kafka's transaction mechanism to achieve end-to-end exactly-once semantics (see the sketch after this list). This is on small-scale trial on the platform; if it passes the tests, it will be rolled out to production.

  • Consumer rate limiting. In write-once-read-many scenarios, a Consumer performing large numbers of disk reads will increase the latency of Produce operations and affect other consumers. So implementing Consume rate limiting through Kafka's Quota mechanism, with support for dynamically adjusting the thresholds, is one of our follow-up directions.

  • Scenario extension. Extending SDK-, HTTP-, and other message subscription and production methods on top of Kafka, to meet the needs of different language environments and scenarios.
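
For reference, Flink's Kafka connector exposes the two-phase-commit behavior as a sink setting. A fragment like the following is the usual starting point; the topic, schema, and checkpoint interval are placeholders, and whether Mafengwo's trial uses exactly this API is not stated in the article:

```java
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;
import org.apache.flink.streaming.util.serialization.KeyedSerializationSchemaWrapper;

import java.util.Properties;

public class ExactlyOnceSink {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(60_000); // Kafka transactions commit on checkpoints

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "broker1:9092");
        // Must not exceed the broker's transaction.max.timeout.ms.
        props.setProperty("transaction.timeout.ms", "600000");

        FlinkKafkaProducer<String> sink = new FlinkKafkaProducer<>(
                "output-topic",
                new KeyedSerializationSchemaWrapper<>(new SimpleStringSchema()),
                props,
                FlinkKafkaProducer.Semantic.EXACTLY_ONCE);

        env.fromElements("a", "b", "c").addSink(sink);
        env.execute("exactly-once-demo");
    }
}
```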

That is our sharing of Kafka's practical application in Mafengwo's big data platform. If you have suggestions or questions, please leave a message in the background of the Mafengwo Technology public account.

Author: BearingPoint, an R&D engineer on Mafengwo's big data platform.

Source: juejin.im/post/5e0ede02f265da5d3d2e9570