Big Data: What is Kafka? Kafka's Basic Concepts, Kafka Commands and Data Synchronization, the Difference Between Kafka and MQ, and What is Zero Copy

1. What is Kafka


Overview

  1. Kafka is a publish-subscribe message queue
  2. Kafka was developed by LinkedIn and later contributed to Apache
  3. Features of Kafka:
    1. Publishes and subscribes to message streams
    2. Stores message streams with fault tolerance
    3. Processes data streams promptly as they arrive
  4. Application scenarios of Kafka:
    1. Building real-time streaming pipelines that move data reliably between systems or applications
    2. Building real-time streaming applications that transform or react to data streams
  5. Kafka is built in Scala. Scala's strong native support for concurrency helps give Kafka relatively high concurrency and throughput; in practice, Kafka's throughput is roughly 60~80 MB/s. At the bottom layer, Kafka uses zero-copy technology
  6. After Kafka receives data, it writes the data to the local disk to ensure the data is not lost. Kafka does not delete data when it is consumed; by default, written data is kept until the configured retention limit is reached
  7. Kafka has no single point of failure:
    1. Nodes can be added to or removed from a Kafka cluster dynamically at any time
    2. Kafka has a replication strategy

2. Kafka's Basic Concepts


  1. broker:
    1. A broker is a node in Kafka
    2. Each broker must be assigned a number (broker id); the only requirement is that the numbers are unique
  2. topic:
    1. A topic is used to classify data
    2. In Kafka, every piece of data must be sent to a specified topic
    3. Each topic corresponds to one or more partitions
    4. When a topic is deleted, its directory is not removed immediately; it is first marked for deletion, and after about a minute the marked directory is deleted
    5. If you need the delete operation to take effect immediately, set delete.topic.enable to true in the server.properties file under the config directory

  3. partition:
    1. Each partition corresponds to a directory on disk
    2. If there are multiple Kafka nodes, the partitions are distributed evenly across the nodes. This design improves Kafka's throughput
    3. If a topic has multiple partitions, data is written to the partitions in round-robin order
  4. replicas:
    1. In Kafka, multiple replicas can be configured to ensure data availability
    2. If multiple replicas are configured, the replicas are maintained in units of partitions (see the topic-creation sketch after this list)
  5. leader and follower:
    1. In Kafka, if multiple replicas are configured, the Controller automatically holds an election among the replicas: one becomes the leader replica and the others become follower replicas
    2. Note: leader and follower refer to the master-slave relationship between replicas, not between Kafka nodes
    3. Producers and consumers interact only with the leader replica, never with the follower replicas
  6. Controller:
    1. Responsible for the election of leader and follower replicas
    2. The Controller runs on one of the Kafka nodes
    3. If the Controller goes down, a Controller process is started on another Kafka node through Zookeeper
  7. Consumer Group:
    1. By default, each consumer belongs to its own consumer group
    2. A consumer group can contain one or more consumers
    3. The same message can be subscribed to by different consumer groups, but only one consumer within the same group can consume it: sharing between groups, competition within groups
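
These concepts map directly onto Kafka's Java client. Below is a minimal sketch of creating a topic programmatically with the AdminClient, assuming a broker reachable at hadoop01:9092 (the address used in the commands in the next section) and a cluster with at least two brokers; the partition count and replication factor are illustrative values, not recommendations.

    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.NewTopic;
    import java.util.Collections;
    import java.util.Properties;

    public class CreateTopicSketch {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            // assumed broker address; adjust to your cluster
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "hadoop01:9092");
            try (AdminClient admin = AdminClient.create(props)) {
                // topic "video" with 3 partitions, each backed by 2 replicas;
                // the partitions are spread evenly across the brokers
                NewTopic video = new NewTopic("video", 3, (short) 2);
                admin.createTopics(Collections.singleton(video)).all().get();
            }
        }
    }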

3. Kafka Commands and Data Synchronization


1) Commands

Start Kafka:
    sh kafka-server-start.sh ../config/server.properties

Create a topic:
    sh kafka-topics.sh --create --zookeeper hadoop01:2181 --replication-factor 1 --partitions 1 --topic video

Delete a topic:
    sh kafka-topics.sh --delete --zookeeper hadoop01:2181 --topic novel

List all topics:
    sh kafka-topics.sh --list --zookeeper hadoop01:2181

Describe a topic:
    sh kafka-topics.sh --describe --zookeeper hadoop01:2181 --topic txt

Start a console consumer:
    sh kafka-console-consumer.sh --zookeeper hadoop01:2181 --topic video

Start a console producer:
    sh kafka-console-producer.sh --broker-list hadoop01:9092 --topic video

2) Data Synchronization

  1. The producer writes data to the leader replica
  2. The follower replicas send requests to the leader asking whether there is any data to be replicated
  3. The leader sends the data that needs to be replicated to the followers and waits for their feedback
  4. If a follower records the data successfully, it returns an ack signal
  5. After the leader receives the ack, it puts the broker id hosting that follower replica into the ISR (in-sync replicas) queue
  6. The ISR is maintained in Zookeeper. Once the leader replica is lost, the Controller first selects a replica from the ISR to become the new leader
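
On the producer side, the acks setting determines how much of this replication must complete before a write is acknowledged; acks=all makes the leader wait until every replica in the ISR has recorded the write. A minimal sketch, assuming the hadoop01:9092 broker and the video topic used in the commands above:

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;
    import java.util.Properties;

    public class AckProducerSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "hadoop01:9092");
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            // acks=all: the leader acknowledges the write only after every
            // replica in the ISR has recorded it
            props.put(ProducerConfig.ACKS_CONFIG, "all");
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                producer.send(new ProducerRecord<>("video", "key1", "hello kafka"));
            }
        }
    }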

4. The Difference Between Kafka and MQ


The difference between Kafka and MQ is explained here from four aspects.

1) In terms of architecture model:

RabbitMQ follows the AMQP protocol. A RabbitMQ broker is composed of Exchanges, Bindings, and queues, where the Exchange and Binding determine a message's routing key. The client Producer communicates with the server through a connection channel, and the Consumer obtains messages from a queue for consumption (over a long connection: messages in the queue are pushed to the consumer, which reads data from the input stream in a loop). RabbitMQ is broker-centric and has a message confirmation mechanism.

Kafka follows the general MQ structure of producer, broker, and consumer, but is centered on the consumer: the consumption position (offset) of each message is stored on the consumer side, and the consumer pulls data in batches from the broker according to that offset. There is no per-message confirmation mechanism.
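
This pull model is visible in the consumer API: the client repeatedly asks the broker for a batch of records starting from its current offset. A minimal sketch, assuming the same hadoop01:9092 broker and video topic as above; the group id is a hypothetical name:

    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;
    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;

    public class PullConsumerSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "hadoop01:9092");
            // consumers sharing this group id compete for the topic's partitions;
            // different groups each receive the full stream
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "video-group");
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singleton("video"));
                while (true) {
                    // pull a batch of records starting at the current offset;
                    // there is no per-message acknowledgement
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> record : records) {
                        System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
                    }
                }
            }
        }
    }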

 

2) In terms of throughput:

RabbitMQ is slightly inferior to Kafka in throughput, because the two have different starting points: RabbitMQ focuses on reliable message delivery and supports transactions, but does not support batch operations; depending on reliability requirements, messages can be stored in memory or on disk.

Kafka has high throughput. Internally it uses message batching and a zero-copy mechanism; data storage and retrieval are sequential batch operations on the local disk with O(1) complexity, so message processing is very efficient.
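
Batching is exposed directly in the producer configuration through the batch.size and linger.ms settings. A short sketch; the values are illustrative assumptions, not tuned recommendations:

    import org.apache.kafka.clients.producer.ProducerConfig;
    import java.util.Properties;

    public class BatchingConfigSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "hadoop01:9092"); // assumed broker address
            // group up to 32 KB of records per partition into a single request
            props.put(ProducerConfig.BATCH_SIZE_CONFIG, 32 * 1024);
            // wait up to 10 ms for a batch to fill before sending it
            props.put(ProducerConfig.LINGER_MS_CONFIG, 10);
            System.out.println(props); // these props would be passed to a KafkaProducer
        }
    }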

 

3) In terms of availability:

RabbitMQ supports mirrored queues: if the main queue fails, a mirror queue takes over.

Kafka's brokers support an active-standby mode through leader and follower replicas.

 

4) In terms of cluster load balancing:

RabbitMQ's load balancing requires a separate load balancer.

Kafka uses Zookeeper to manage the brokers and consumers in the cluster, and topics can be registered in Zookeeper. Through Zookeeper's coordination mechanism, the producer holds the broker information for the corresponding topic and can send messages to brokers at random or in round-robin order; the producer can also shard messages based on semantics, so that a message is sent to a particular partition on a broker.
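
Semantic sharding in practice: when a record carries a key, the Java client's default partitioner hashes the key so that all records with the same key land in the same partition. A minimal sketch, reusing the assumed hadoop01:9092 broker and video topic; the key "user-42" is hypothetical:

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;
    import java.util.Properties;

    public class KeyedProducerSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "hadoop01:9092");
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // same key, same partition: the default partitioner hashes the key
                producer.send(new ProducerRecord<>("video", "user-42", "play event"));
                // a null key is not pinned to one partition; such records
                // are spread across partitions
                producer.send(new ProducerRecord<>("video", null, "anonymous event"));
            }
        }
    }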

5. What is Zero Copy


        Zero copy is a technique that avoids having the CPU copy data from one storage area to another. The various zero-copy techniques developed in operating systems for device drivers, file systems, and network protocol stacks greatly improve the performance of specific applications and let those applications use system resources more efficiently. The performance improvement comes from allowing the CPU to do other work while data copying proceeds.

        Zero-copy techniques reduce the number of data copies and shared-bus operations and eliminate unnecessary intermediate copies of data in memory, thereby effectively improving data transmission efficiency. They also reduce the overhead of context switching between the user application's address space and the operating system kernel's address space. Executing a large number of copy operations is actually a simple task; from the operating system's point of view, keeping the CPU occupied with such a simple task is a waste of resources. If some other, simpler system component can take over this work instead, the CPU is freed to do other things, and system resources are used more effectively. In summary, the goals of zero-copy technology can be summarized as follows:

Avoid data copies:

① Avoid data copy operations between operating system kernel buffers.

② Avoid data copy operations between the operating system kernel and the user application address space.

③ Let user applications bypass the operating system and access hardware storage directly.

④ Let DMA do the data transfer wherever possible.

Combine multiple operations:

① Avoid unnecessary system calls and context switches.

② Allow the data to be copied to be cached first.

③ Let the hardware do the data processing wherever possible.
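
In the JVM, the classic zero-copy entry point is FileChannel.transferTo, which on Linux can map to the sendfile system call so that file bytes move from the page cache to a socket without passing through user space; this is the mechanism Kafka relies on when serving log data to consumers. A minimal sketch with a hypothetical file name and destination address:

    import java.io.FileInputStream;
    import java.net.InetSocketAddress;
    import java.nio.channels.FileChannel;
    import java.nio.channels.SocketChannel;

    public class ZeroCopySendSketch {
        public static void main(String[] args) throws Exception {
            // hypothetical log file and destination, for illustration only
            try (FileChannel file = new FileInputStream("segment.log").getChannel();
                 SocketChannel socket = SocketChannel.open(new InetSocketAddress("localhost", 9000))) {
                long position = 0;
                long remaining = file.size();
                while (remaining > 0) {
                    // transferTo can use sendfile under the hood: the kernel moves
                    // bytes from the page cache to the socket with no user-space buffer
                    long sent = file.transferTo(position, remaining, socket);
                    position += sent;
                    remaining -= sent;
                }
            }
        }
    }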


Origin: blog.csdn.net/weixin_47055922/article/details/108595116