Kafka--Introduction

  • Born at LinkedIn
  • Originally built to collect and analyze massive volumes of real-time log data
  • LinkedIn then designed and developed a distributed messaging system capable of processing real-time data with high throughput, scalability, and high performance: Kafka
  • In 2010, the Kafka project was published to the GitHub open source community
  • In 2011, Kafka was donated to the Apache Software Foundation and entered the Apache Incubator
  • In 2012, Kafka graduated from the Apache Incubator
  • After that, LinkedIn employees and community members continued to maintain and improve the project
  • Today, Kafka is one of the Apache Software Foundation's top-level projects

Kafka official introduction

The original intention of Kafka design

  • Kafka was prototyped at LinkedIn, originally to process LinkedIn's activity stream data and operational data.
  • Activity stream data refers to site usage records such as browser access logs, page search records, and detailed records of page views.
  • Operational data refers to basic server metrics such as CPU usage, disk I/O, network traffic, and memory.
  • In subsequent iterations, Kafka was designed as a unified platform for handling all of a large company's real-time data. To do so, it has to meet the following requirements.

1. High throughput

  • Everyday applications such as Alipay, WeChat, and QQ have enormous user bases and generate huge volumes of data every second. To aggregate message logs in real time in such scenarios, high throughput is necessary to support high-volume event streams.

2. Highly available queues

  • Distributed message queuing systems all provide asynchronous processing. In addition, they generally need to absorb large data backlogs so that other, offline systems can load data at regular intervals.

3. Low latency

  • Real-time application scenarios impose extremely strict latency requirements: the less time processing takes, the better. This means the system must be capable of low-latency processing.

4. Distributed mechanisms

  • The system also needs to support features such as partitioning, distribution, and real-time message processing, and must guarantee that no data is lost when a machine fails.
  • To meet these needs, Kafka has many unique features, which make it resemble a database log more than a traditional messaging system.

Kafka application scenarios

  • In practice, Kafka is used in a wide range of scenarios: log collection, messaging, activity tracking, operational metrics, stream processing, event sourcing, and more.

1. Log collection

  • In real-world work, both systems and applications generate large volumes of logs. To manage them conveniently, Kafka can gather these scattered logs into a Kafka cluster and then expose the data through Kafka's unified interface to different consumers (Consumers), such as the Hadoop, HBase, and ElasticSearch application interfaces.
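The collection pattern above can be sketched with the `kafka-python` client. The broker address, topic name, and log format here are illustrative assumptions, not details from the original text; a real deployment would point at its own cluster.

```python
# Sketch: shipping scattered application logs into a Kafka cluster.
# Assumes a broker at localhost:9092 and a topic named "app-logs"
# (both hypothetical). Requires `pip install kafka-python`.
import json

def encode_log(record: dict) -> bytes:
    """Serialize one log record to JSON bytes for the Kafka message body."""
    return json.dumps(record, sort_keys=True).encode("utf-8")

def ship_logs(records):
    from kafka import KafkaProducer  # third-party client, imported lazily
    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    for record in records:
        producer.send("app-logs", value=encode_log(record))
    producer.flush()  # block until all buffered records are delivered

if __name__ == "__main__":
    ship_logs([{"level": "INFO", "msg": "service started"}])
```

Downstream systems (Hadoop, HBase, ElasticSearch, etc.) then read the same topic through their own consumers, so each log only has to be shipped once.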

2. Message system

  • For applications with heavy online traffic, Kafka can act as a buffer to reduce server load. This effectively decouples the producer (Producer) from the consumer (Consumer) while buffering message data.
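The consumer side of this buffering pattern might look like the following sketch. The topic, group id, and broker address are hypothetical; the point is that the consumer drains the buffered messages at its own pace, independently of the producer.

```python
# Sketch: a decoupled consumer reading buffered messages from Kafka.
# Topic and group names are illustrative. Requires `pip install kafka-python`.
import json

def decode_message(raw: bytes) -> dict:
    """Deserialize one JSON-encoded message body."""
    return json.loads(raw.decode("utf-8"))

def consume(handle, limit=100):
    from kafka import KafkaConsumer  # third-party client, imported lazily
    consumer = KafkaConsumer(
        "orders",                      # hypothetical topic name
        bootstrap_servers="localhost:9092",
        group_id="order-workers",      # consumers in a group share partitions
        auto_offset_reset="earliest",  # start from the oldest retained message
    )
    for i, message in enumerate(consumer):
        handle(decode_message(message.value))  # process at the consumer's pace
        if i + 1 >= limit:
            break
```

Because Kafka retains messages on disk, a temporarily slow or offline consumer group simply resumes from its last committed offset instead of forcing the producer to wait.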

3. User activity tracking

  • Kafka can be used to record the various events generated by browser or mobile App users, such as pages browsed, content searched, and content clicked.
  • The servers collect this user activity information into a Kafka cluster for storage; consumers then "consume" the activity data for real-time analysis, or load it into a Hive data warehouse for offline analysis and mining.

4. Recording operational monitoring data

  • Kafka can also be used to record operational monitoring data, collecting metrics from various distributed application systems (such as Hadoop, Hive, and HBase).

5. Stream processing

  • Kafka is itself a stream processing platform, so in practice it is often combined with other big data tools such as Spark Streaming, Storm, and Flink.
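At its core, the kind of pipeline these frameworks automate is a consume-transform-produce loop. The sketch below illustrates that loop with a word-count transform; the topic names and the transform itself are illustrative assumptions, not from the original text.

```python
# Sketch: the consume-transform-produce loop behind stream processing.
# Topic names are hypothetical. Requires `pip install kafka-python`.
from collections import Counter

def word_counts(lines):
    """Pure transform step: count words across a batch of text lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return dict(counts)

def run_pipeline():
    # Hypothetical wiring: read a batch from one topic, write results to another.
    from kafka import KafkaConsumer, KafkaProducer  # third-party client
    consumer = KafkaConsumer("raw-text", bootstrap_servers="localhost:9092")
    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    batch = [m.value.decode("utf-8") for _, m in zip(range(100), consumer)]
    for word, n in word_counts(batch).items():
        producer.send("word-counts", key=word.encode(), value=str(n).encode())
    producer.flush()
```

Frameworks like Flink or Kafka Streams add windowing, state management, and fault tolerance on top of this basic loop, which is why they are typically used instead of hand-rolled consumers for serious workloads.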

6. Event sourcing

  • Event sourcing is an application design style in which every state change is captured as a time-stamped record, and the records are stored in time order. For very large volumes of stored data, this approach is an excellent way to build a back-end program.

Kafka's position in big data projects

  • It is used in more than 90% of real-time big data projects!
  • Kafka generally serves as a "message relay station" in such projects
    (Figure: Kafka's position as a message relay station in a big data project)



Origin blog.csdn.net/qq_46893497/article/details/114177607