Rao Jun: The Past, Present, and Future of Apache Kafka

Welcome to the Tencent Cloud + Community for more of Tencent's practical technical content.

This article was first published in the Cloud + Community and may not be reproduced without permission.

Hello everyone. Let me briefly introduce myself. My name is Rao Jun, and I am one of the co-founders of Confluent, a start-up company in Silicon Valley. We founded the company in 2014, with the goal of helping enterprises of all kinds build data streaming on top of Kafka.

Before I start, I would like to do a quick survey: who in this room has used Kafka? About 80% of you. Okay, thank you. What I want to share today is our project: how Kafka came to be, how it developed, and some of the lessons we learned along the way. To talk about Kafka we have to go back to 2010, which is when I joined LinkedIn. Many of you are probably familiar with it: it is a social platform that connects talent with opportunity. In 2010 LinkedIn was taking shape and entering a period of rapid growth. When I joined in 2010 I was roughly employee No. 600; when I left in 2014 the company had grown to about employee No. 6,000. That is very fast growth for just four years.

Such rapid growth would not have been possible without data. Like many Internet companies, data is at the core of LinkedIn. Users provide their data to LinkedIn, directly or indirectly; through analysis, LinkedIn can extract many new insights from that data, and those insights feed back into the product. A better product attracts more users to the platform, so if you do data well you get a very healthy virtuous cycle: more users produce more data, more data enables better analysis, better analysis produces a better product, and a better product attracts more users.

Diversity of Data Sources

From a data point of view, LinkedIn's data is very diverse. The most familiar kind is transactional data, which generally lives in a database. For LinkedIn, this transactional data is quite simple: the profile you provide, the schools you attended, the connections between you and other members. But there is also a lot of non-transactional data, much of it user behavior: which link a user clicks, which search keywords they type. These are actually very valuable signals. From our internal operations there are also many operational service metrics and application logs, and finally a lot of information from smartphones, which is also very valuable. In terms of value, this non-transactional data is no less valuable than the transactional data; but in terms of traffic, it can be 100 times, or even 10 million times, the volume of the transactional data. Let's look at a small example of how LinkedIn uses this data to provide a service.

It is called People You May Know in English, abbreviated as PYMK. What this team does is recommend to a LinkedIn user other members who are not yet in their network. How does it make those recommendations? Internally it combines 30 to 40 kinds of signals to produce the final recommendation. To give some simple examples: if the two of us went to the same school, or worked at the same company, that is a strong signal that we should probably be connected. There are also indirect signals: two people, A and B, may have no obvious relationship in common, but if many people view both of their profiles within a short period of time, that suggests some hidden connection that makes them worth connecting. So in the early days of LinkedIn, if you used this service you would find many of the recommendations surprising; at first glance you might wonder why it recommended such a person to you, but when you thought about it you would find there were strong reasons behind it. LinkedIn has many similar services, and all of them rely on a wide variety of real-time data.

But in 2010 we had a big problem at LinkedIn: data integration was a very imperfect process. This picture roughly shows the state at the time, with the various data sources at the top. LinkedIn started as a fairly conventional Internet company, so at the beginning all data was stored in databases. As the company developed we added a system that collects user behavior data, a lot of data stored in local files, and other information in application logs and operational monitoring data.

Downstream, you can see the various consumers. LinkedIn initially had a data warehouse, and over time we added more and more real-time microservices which, like the batch jobs, capture more or less the same information from these different data sources. The recommendation engine we just mentioned is one such microservice, and we have many of them, including social graph processing, which analyzes the relationship between two nodes, for example how two LinkedIn members are connected and which connection is the strongest, as well as real-time search. These consumers kept multiplying, and many of them are quite real-time: from the moment the data is generated to the moment it reaches the consuming system, the delay is mostly a few seconds, or even shorter.

Point-to-Point Data Integration

What we did at the time, when we wanted to move data from a data source to a consumer, was what we called point-to-point data integration. If we wanted to get some data into the data warehouse, our approach was to write a script or a small program. A few days later another system would also need to read that data, and we would do similar work again, writing more scripts and programs. We kept doing this for a long time, but after writing five or six similar data flows we realized it is a very inefficient practice. What is the main problem? The first problem is a cross-product problem between data producers and data consumers: every time a data source is added, it has to be connected to all consumers, and every time a consumer is added, it has to be connected to all data sources. The second problem is that when we build these point-to-point data flows, we repeat a lot of the same work each time, and we never have enough time to bring any single pipeline to 100% quality. So we felt this architecture was not ideal.

Ideal Architecture

So if we wanted to improve it, what should the improvement look like? Our thinking at the time was: suppose we had an architecture with a centralized log system in the middle that could first cache the data from all sources. If we could do that, the framework would be greatly simplified. If you are a data source, you don't need to know about all the consumers; the only thing you have to do is send your data to the central log system. Likewise, if you are a consumer, you don't need to know about all the data sources; all you do is subscribe to the messages you want from the central log system. In this way the cross-product problem becomes a linear one: with N sources and M consumers you maintain N + M connections to the central log instead of N × M point-to-point pipelines. The key question in this architecture is what kind of system can serve as the central log, and that is what we were discussing at the time. We didn't want to build a new system from scratch at first; this seemed like a very common enterprise problem, and there ought to be an existing solution. If you look closely, the central log system is similar, in terms of interface, to a traditional messaging system: a messaging system also separates the producer side from the consumer side, and it is a very real-time system. So we thought, why not try some existing messaging systems? There were open-source messaging systems at the time, as well as enterprise messaging systems, but we found them very ineffective for this purpose. There are many specific reasons, but the most important one is that these traditional messaging systems were not designed for this kind of usage; in particular, their biggest problem was throughput.

Kafka First Edition: High Throughput Publish-Subscribe Messaging System

Many of those early messaging systems were designed for database-style, transactional data, and it is hard to imagine pushing large volumes of non-transactional data, such as user behavior logs or monitoring data, through a traditional messaging system. So we felt there was no ready-made solution to our problem, and we decided to build something ourselves. Around 2010 we built the first version of Kafka. The positioning of the first version was very simple: we just wanted it to be a high-throughput messaging system, and high throughput was our most important goal.

Distributed Architecture

Next, let's talk about how we achieved high throughput. The first thing we did in the first version of Kafka was to make it a distributed system. People familiar with Kafka know that it has three layers: the producer side, the server (broker) layer in the middle, and the consumer side below. The server layer usually consists of one or more nodes, and the basic concept is called a topic. A topic can be partitioned, and each partition can be placed on a disk of a particular node, so if you want to increase throughput, the simplest method is to add machines to the cluster: with more resources, in both storage and bandwidth, the cluster can take in more data. Similarly, the producer side and the consumer side are designed to be multi-threaded; in either case you can have thousands of producer threads and consumer threads writing data to or reading data from the Kafka cluster. This distributed design is something many of the older messaging systems did not have.
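
To make this concrete, here is a minimal producer sketch using the modern Java client API (which postdates the first LinkedIn version); the broker address and the topic name `user-activity` are illustrative placeholders, not details from the talk.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ActivityProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Placeholder broker address; any broker of the cluster can serve as a bootstrap server.
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Records with the same key are routed to the same partition of the topic,
            // so per-key ordering is preserved while partitions spread load across brokers.
            producer.send(new ProducerRecord<>("user-activity", "user-42", "clicked:/jobs/123"));
        }
    }
}
```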

Simple and Practical Log Storage

The second thing we did was to use a log as the storage structure. It is very simple, but it is a very efficient structure. Each partition of a topic corresponds to one such log, and each log is stored on a hard disk. In this structure, each small square corresponds to a message, and each message has an offset, which increases continuously. If you are a producer, what you do is append the message you want to write to the end of the log, where it is assigned a new, larger offset. Messages are then delivered to consumers in order: whatever order you write them in is the order in which they are read. The advantage of this is that, on the consumer side, the overhead is very small, because the consumer does not need to remember every message it has seen; it only needs to remember the offset of the last message it consumed. With that one number it can continue consuming from that position, because all messages are delivered in order, so every message before that offset has already been consumed.
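
The "remember one offset" idea is visible in today's Java consumer API. The sketch below is illustrative only; the topic, group id, and broker address are made-up placeholders.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ActivityConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder address
        props.put("group.id", "example-group");             // placeholder consumer group
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("user-activity"));
            while (true) {
                // poll() returns records in offset order within each partition; the consumer's
                // only state is the next offset to read, which it periodically commits back.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```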

Two Optimizations

This design has several advantages. The first is that its access pattern lends itself to optimization: the log is written linearly and read linearly from some position, which makes it easy for the operating system and the file system to optimize performance. The second is that the system can support multiple consumers at the same time: at any time you can have one or more consumers, one consumer can start consuming from one position and another from a different position, but no matter how many consumers you have, the data is stored only once, so from a storage perspective the cost has nothing to do with how many times the data is consumed. Another point is less obvious: since the logs are stored on disk, we can serve real-time consumers and non-real-time batch consumers at the same time. Because all the data is on disk and we can have a very large cache, the serving path on the consumer side is the same whether the consumer is real-time or not; we don't need separate optimizations, we simply rely on the operating system to decide which data can be served to consumers from memory and which has to be read from disk. The design of the framework is the same either way.

Finally, to achieve this kind of high throughput we made two small optimizations, and the two are related. The first optimization is batching, which happens at all three layers. On the server side, we just said that messages need to be written to a disk-based log, but each write to disk carries a certain overhead, so we do not write every message to disk immediately; instead we generally wait a little while, and once enough messages have accumulated we write them to disk in a batch. The overhead is still there, but it is amortized over many messages. The same is true on the producer side: when you want to send a message, we generally do not immediately send it to the server as a remote request; we wait a bit, hoping to collect a few more messages so they can be packaged together and sent to the server in one request.

Data compression is related to batching. Compression is also performed on a batch of data, and it is end to end. If you enable compression, the producer waits for a batch of data to be complete and then compresses the whole batch together; compressing a batch usually achieves a better compression ratio than compressing each message individually, because different messages tend to contain repetition. The compressed data is then sent from the producer to the server, the server stores it in the log in compressed form and delivers it to the consumer in compressed form, and only when the consumer is ready to consume a message does it decompress it. So if you enable compression, we save not only network overhead but also storage overhead. Both of these are very effective ways to achieve high throughput.
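
In the modern Java producer, both optimizations are exposed as configuration settings. A hedged sketch of the relevant knobs follows; the specific values are arbitrary examples, not recommendations from the talk.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.common.serialization.StringSerializer;

public class BatchingAndCompression {
    static KafkaProducer<String, String> buildProducer() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        // Batching: wait up to 20 ms, or until roughly 64 KB accumulates for a partition,
        // before sending a request, so the per-request overhead is amortized over many messages.
        props.put("linger.ms", "20");
        props.put("batch.size", Integer.toString(64 * 1024));

        // End-to-end compression: the whole batch is compressed on the producer, stored in
        // compressed form on the broker, and only decompressed when the consumer reads it.
        props.put("compression.type", "lz4");

        return new KafkaProducer<>(props);
    }
}
```
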
Our first version of Kafka took about half a year to build, but it took a little longer to roll it out across LinkedIn's data pipelines, because LinkedIn has many microservices. We finished that around the end of 2011, and these were some of the basic numbers at the time.

On the production side we had hundreds of thousands of messages produced, and millions of messages consumed, which was quite considerable at the time; LinkedIn then had hundreds of microservices and tens of thousands of microservice threads. More importantly, once we had done this, we achieved the democratization of data within LinkedIn. Before Kafka, if you were an engineer, a product manager, or a data analyst at LinkedIn and you wanted to build a new design or a new application, the hardest problem was that you didn't know what interface to use to read the data, or whether the data was complete. With Kafka, we greatly simplified this problem and greatly freed up engineers' ability to innovate. After that successful experience, feeling that Kafka was very useful, we continued development. The second phase was mainly about adding support for high availability.

Kafka Version 2: High Availability

In the first version, each message was stored on only one node. If that node went offline, the data could not be served, and if the machine was permanently damaged, the data was lost. So in the second version we added high availability, and we implemented it with a multi-replica mechanism. If there are multiple nodes in the cluster, we can store a message redundantly on multiple replicas; in the picture, the partitions of the same color are replicas of each other. In this setup, if one of your machines goes offline, another node that holds the same replica can continue to serve the same data. With the second version, we could expand the scope of the data Kafka collects: not only non-transactional data, but also some transactional data.
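
As a rough illustration with today's tooling, the sketch below creates a topic with a replication factor of 3 through the Java AdminClient; the topic name, partition count, and broker address are placeholders, and the `acks=all` comment shows the producer-side setting that waits for the in-sync replicas.

```java
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class ReplicatedTopicSetup {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder address

        try (AdminClient admin = AdminClient.create(props)) {
            // 6 partitions, each replicated on 3 different brokers: losing one broker still
            // leaves two full copies of every partition available to serve the same data.
            NewTopic topic = new NewTopic("transactions", 6, (short) 3);
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }

        // On the producer side, setting acks=all makes a write wait until the in-sync
        // replicas have the message before it is considered committed:
        //   props.put("acks", "all");
    }
}
```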

In 2011 we also did one more thing: that year the Kafka project was donated to the Apache Foundation. When we did this, we felt the system we had built was at least very useful inside LinkedIn, and we wanted to see whether it would be useful to other companies as well; we expected other Internet companies might find it useful. What I did not anticipate was that after it was open-sourced its use became extremely broad, not limited to Internet companies at all, but spanning the entire industry. As long as your company has some real-time data you need to collect, you can use it. A big reason is that all kinds of traditional enterprises are also going through a process of software digitization: there are traditional industries whose strength used to lie in manufacturing or retail, but which now must also be strong in software and data. Kafka provides many of these enterprises with a very effective channel for integrating real-time data. Having watched Kafka's use grow wider and wider over several years, we wanted to work on it full time, so in 2014 we left LinkedIn and founded Confluent. At this company we want to make it convenient for all kinds of enterprises to use Kafka even more broadly. The company now has more than 200 people.

Development of Kafka

Let's talk about the development since 2014. Since then we have mainly done two things in Kafka: the first is related to enterprise-level features, mainly around data integration; the second is related to stream processing. I'll talk a little about both, and skip this part of the slides. On the enterprise side, a large part of what we did relates to the data integration problem we talked about at the beginning. For many companies, if the company has been around for a while, you will find its data scattered across many systems. We just said that with Kafka it is very convenient to pull that data out, but we don't want every company to have to build its own extraction layer from scratch. So our design has two parts. The first part is a platform layer that factors the common pieces out into a framework: data distribution, parallel processing, failure detection and, after detection, rebalancing of the work. All of these common concerns live in this framework, and the framework exposes an open interface that can be used to design and implement connectors to all kinds of different data sources. On the side where data is sent out, we can do something similar for the destination systems. That is the first piece we built.

The second area is related to stream processing. Once you have a system like Kafka that can collect a lot of data in real time, the initial use is as a data transport platform. But we think that over time Kafka need not be limited to a transport platform; it can also be a platform for sharing and collaboration. Once real-time data is available, there are things you often want to do with it. For example, you may want to transform a data stream from one format to another. You may also want to do some data enrichment: you have a stream that carries some event data but only the user's id, not the user's detailed information, while your database holds much richer user information; if you can join the two together, the stream becomes much richer and allows more effective processing. And you may also want to do some real-time aggregation. In this area we want to simplify these tasks.
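
The three operations mentioned here, format transformation, enrichment by joining with reference data, and real-time aggregation, map directly onto the Kafka Streams Java DSL. The sketch below is illustrative only; the topic names, application id, and string-based record format are assumptions, not details from the talk.

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

public class ClickEnrichment {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "click-enrichment");   // placeholder id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // placeholder address
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());

        StreamsBuilder builder = new StreamsBuilder();

        // A stream of click events keyed by user id (a format transformation would be a mapValues here).
        KStream<String, String> clicks = builder.stream("clicks");

        // Enrichment: join the click stream with a table of user profiles keyed by the same id.
        KTable<String, String> profiles = builder.table("user-profiles");
        clicks.join(profiles, (click, profile) -> click + "|" + profile)
              .to("enriched-clicks");

        // Aggregation: count clicks per user in real time.
        clicks.groupByKey()
              .count()
              .toStream()
              .to("clicks-per-user", Produced.with(Serdes.String(), Serdes.Long()));

        new KafkaStreams(builder.build(), props).start();
    }
}
```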

The Future of Kafka

Looking ahead, I think Kafka is not only a platform for collecting and transporting real-time data; over time it can also become a platform for processing, exchanging, and sharing data streams, so we will do more work in that direction. As applications become more widespread, we think more and more of them will become real-time applications, and on that basis a strong ecosystem can grow around Kafka.

Finally, let me share a little story with you about one of our users, a bank in North America. It is a relatively traditional bank with decades of history, and one of its long-standing problems is that its data is very scattered. If you are a customer of this bank, you may have an account, a loan, an insurance policy, and a credit card, and all of that customer information used to be completely separate because it belonged to different business divisions. If you are a salesperson at the bank, your trouble is that you cannot see all of a customer's information. This company did a project based on Kafka: collect all the different data about a customer from the various sources in real time, and then push that information to their tens of thousands of salespeople. With this in place, salespeople have more effective, real-time information when they talk to a customer and can make more targeted recommendations, so the project was very successful.

 

For more material from this session, see the link below:

https://ask.qcloudimg.com/draft/1184429/hz4fk5b242.pdf

 

Q&A

How to use Apache Kafka vs. Apache Storm?

Related Reading

Chen Xinyu: Application of CKafka in Face Recognition PASS

Yang Yuan: Tencent Cloud Kafka Automated Operation Practice

Bai Yuqing: Knowing the design and implementation of Kubernetes-based kafka platform

 

This article has been authorized by the author for publication on Tencent Cloud + Community. Original link: https://cloud.tencent.com/developer/article/1114675?fromSource=waitui

 
