Kafka Message Queue: A Detailed Explanation


1. What is Kafka

In stream computing, Kafka is generally used to buffer data, and Storm performs computations by consuming data from Kafka.

  1. Apache Kafka is an open-source messaging system written in Scala, developed as a project of the Apache Software Foundation.
  2. Kafka was originally developed at LinkedIn and was open-sourced in early 2011; it graduated from the Apache Incubator in October 2012. The goal of the project is to provide a unified, high-throughput, low-latency platform for processing real-time data.
  3. Kafka is a distributed message queue. Kafka categorizes messages by Topic when storing them. The sender of a message is called the Producer, and the receiver is called the Consumer. In addition, a Kafka cluster is composed of multiple Kafka instances, and each instance (server) is called a broker.
  4. Both the Kafka cluster and the consumers rely on a ZooKeeper cluster to store metadata and ensure system availability. In newer versions, consumer offsets are maintained by Kafka itself (in the internal __consumer_offsets topic) rather than in ZooKeeper.
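
To make these roles concrete, here is a minimal sketch using the CLI tools shipped with Kafka, assuming a local single-broker cluster; the topic name test is hypothetical. Older releases address kafka-topics.sh via ZooKeeper as shown, while newer ones use --bootstrap-server throughout.

# Create a topic named "test" with one partition and one replica
./kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test

# Produce messages from the console (type lines, Ctrl+C to exit)
./kafka-console-producer.sh --broker-list localhost:9092 --topic test

# Consume the topic from the beginning
./kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --from-beginning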

2. Kafka architecture

  1. Producer: The message producer is the client that sends messages to the Kafka broker;
  2. Consumer: message consumer, the client that fetches messages from the Kafka broker;
  3. Topic: can be understood as a queue;
  4. Consumer Group (CG): This is the mechanism Kafka uses to implement both broadcast (to all consumers) and unicast (to any one consumer) of a topic's messages. A topic can have multiple CGs. The topic's messages are copied (not physically copied, but conceptually) to all CGs, but each partition delivers a message to only one consumer within a CG. To implement broadcast, give each consumer its own independent CG; to implement unicast, put all the consumers in the same CG. With CGs, consumers can also be grouped freely without sending messages to multiple topics (a sketch follows this list);
  5. Broker: A kafka server is a broker. A cluster is composed of multiple brokers. A broker can hold multiple topics;
  6. Partition: To achieve scalability, a very large topic can be distributed across multiple brokers (i.e., servers). A topic can be divided into multiple partitions, and each partition is an ordered queue. Every message in a partition is assigned an ordered id (the offset). Kafka only guarantees that messages are delivered to consumers in order within a single partition; it does not guarantee the order of a topic as a whole (across multiple partitions);
  7. Offset: Kafka's log segment files are named after the base offset of the first message they contain, which makes lookup easy: to find the message at offset 2049, locate the segment file whose base offset is the largest value not exceeding 2049 (for example, 00000000000000002048.log if a segment starts at offset 2048). The first segment is 00000000000000000000.log;
  8. Brokers and consumers depend on ZooKeeper, while producers do not communicate with ZooKeeper
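
A sketch of broadcast vs. unicast with the console consumer, reusing the hypothetical topic test from above (older console consumers set the group via --consumer-property group.id=... instead of --group):

# Unicast: two consumers in the SAME group split the partitions,
# so each message is delivered to only one of them
./kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --group g1
./kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --group g1

# Broadcast: a consumer in a DIFFERENT group receives every message again
./kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --group g2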

Within ZooKeeper itself, follower nodes forward update (write) operations to the leader, which coordinates them; followers can serve read requests directly.

The role and background of Zookeeper [data consistency, high availability]

Managing the configuration of variables used in code
Providing a naming service
Improving system availability and security
Managing the Kafka cluster

3. Basic concepts

Understand the concepts of brokers, producers, consumers, consumer groups, etc.

  1. Broker
  2. Producer (producer)
    In the Kafka system, the application that writes data is generally called the "producer".
    The Kafka producer can be understood as an application interface for data interaction between the Kafka system and the outside world.
  3. Consumer
  4. Consumer Group

Understand the meaning of topics, partitions, replicas, and records in Kafka

  1. Topic
  2. Partition
  3. Replication
  4. Record

  1. What was the original design intent of Kafka?
    A high-throughput, highly available, low-latency, distributed queue
  2. What are the characteristics of Kafka?
    High throughput, high availability, low latency, and a distributed mechanism
  3. In what scenarios is Kafka used?
    Asynchronous data production, offset migration, security mechanisms, connectors, rack awareness, data streaming, timestamps, message semantics, log collection, messaging systems, user activity tracking, recording operational monitoring data, stream processing, and event sourcing
  4. What Kafka metadata is stored in ZooKeeper?
    Controller election epochs, broker nodes and topics, configuration, administrator operations, and the controller
  5. How is this metadata laid out?
    Kafka stores it under ZooKeeper znodes such as /controller_epoch, /brokers/ids, /brokers/topics, /config, and /admin (see the sketch after this list)
  6. Why do we need consumer groups?
    To scale consumers horizontally and to prevent message backlog
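
Assuming a local ZooKeeper on port 2181, these znodes can be inspected with the zookeeper-shell.sh tool that ships with Kafka:

# List the ids of registered brokers
./zookeeper-shell.sh localhost:2181 ls /brokers/ids

# Show which broker is currently the controller
./zookeeper-shell.sh localhost:2181 get /controller

# List the topics known to the cluster
./zookeeper-shell.sh localhost:2181 ls /brokers/topics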

4. Partitioned storage

4.1. How partitions store data

Partition file storage

  • There are multiple partitions under one topic, and each partition is a separate directory
  • A partition directory is named after the topic plus a sequence number, from 0 to n-1 (for example, test-0 through test-2 for a three-partition topic)

Segment file storage

  • Each segment consists of an index file and a data file: *.index is the index file, *.log is the data file
  • Kafka does not build an index entry for every message record; instead, it uses a sparse index
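
As an illustrative sketch (the directory and file names are assumptions), a partition directory under log.dirs contains one pair of files per segment, and Kafka's DumpLogSegments tool can print their contents:

# Layout of /tmp/kafka-logs/test-0
# 00000000000000000000.index
# 00000000000000000000.log
# 00000000000000170410.index
# 00000000000000170410.log

# Dump the sparse index of the first segment
./kafka-run-class.sh kafka.tools.DumpLogSegments --files /tmp/kafka-logs/test-0/00000000000000000000.index

# Dump the data file, printing the message payloads
./kafka-run-class.sh kafka.tools.DumpLogSegments --files /tmp/kafka-logs/test-0/00000000000000000000.log --print-data-log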

4.2. What are the methods for Kafka to clean up expired data?

Deletion strategy based on time and size

# The system keeps data for 7 days by default
log.retention.hours=168

# No size limit is set by default
log.retention.bytes=-1

Log compaction strategy

# To clean up expired logs with log compaction, set this property
log.cleanup.policy=compact
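
Both policies can also be overridden per topic; a sketch using kafka-configs.sh in its older ZooKeeper-based form (the topic name test is an assumption):

# Shorten retention for one topic to 1 day (retention.ms, in milliseconds)
./kafka-configs.sh --zookeeper localhost:2181 --entity-type topics --entity-name test --alter --add-config retention.ms=86400000

# Switch one topic to log compaction
./kafka-configs.sh --zookeeper localhost:2181 --entity-type topics --entity-name test --alter --add-config cleanup.policy=compact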

5. Kafka security mechanism

5.1. Understand Kafka security mechanism

Before version 0.9, Kafka had no security mechanism, which carried risks such as leaked sensitive data, deleted topics, and modified partitions.

Authentication

1. Authentication of connections between clients and Kafka brokers
2. Authentication of connections between brokers
3. Authentication of connections between brokers and ZooKeeper
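
A minimal broker-side sketch of enabling SASL/PLAIN authentication in server.properties, assuming a single listener on localhost:9092:

# Accept only SASL-authenticated plaintext connections
listeners=SASL_PLAINTEXT://localhost:9092
security.inter.broker.protocol=SASL_PLAINTEXT
sasl.mechanism.inter.broker.protocol=PLAIN
sasl.enabled.mechanisms=PLAIN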

Access control

1. Permission control over reading, writing, deleting, and modifying topics
2. Pluggable authorization, with support for integrating external authorization services
3. A built-in simple authorizer class, kafka.security.auth.SimpleAclAuthorizer
4. Deploying the security module is optional

5.2. Configure ACL

Cluster operations
cover management between broker nodes within the cluster, such as broker node upgrades, leader switching for topic partition metadata, topic partition replica settings, etc.

Topic operations
cover access rights to specific topics, such as read, delete, view, etc.

# If no ACL is found, users other than super users are denied access; the default is false
allow.everyone.if.no.acl.found=true
# Configure the super user
super.users=User:admin
# Enable ACLs by configuring the authorizer
authorizer.class.name=kafka.security.auth.SimpleAclAuthorizer

5.3. Kafka enables ACL mode

Cluster startup

# Authentication information in the file /**/reader_jaas.conf

KafkaServer {
  org.apache.kafka.common.security.plain.PlainLoginModule required
  username="admin"
  password="admin"
  user_admin="admin"
  user_reader="reader"
  user_writer="writer";
};
 
# Add the following line to the zookeeper-server-start.sh, kafka-server-start.sh, and kafka-acls.sh scripts
export KAFKA_OPTS="-Djava.security.auth.login.config=/**/reader_jaas.conf"

# Start ZooKeeper
./zookeeper-server-start.sh ../config/zookeeper.properties 1>/dev/null 2>&1 &

# Start Kafka
nohup ./kafka-server-start.sh ../config/server.properties > kafka-server-start.log 2>&1 &
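
Clients then authenticate with credentials from the JAAS file above; a minimal sketch of a client properties file for the writer user (the file name client.properties and topic test are assumptions):

# client.properties: connect as user "writer"
security.protocol=SASL_PLAINTEXT
sasl.mechanism=PLAIN
sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username="writer" password="writer";

# Pass the file to the console producer
./kafka-console-producer.sh --broker-list localhost:9092 --topic test --producer.config client.properties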

View permissions

kafka-acls.sh
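
A sketch of granting and listing permissions with kafka-acls.sh (the topic, group, and user names are assumptions):

# Allow user "reader" to consume topic "test" in group "test-group"
./kafka-acls.sh --authorizer-properties zookeeper.connect=localhost:2181 --add --allow-principal User:reader --consumer --topic test --group test-group

# List the ACLs set on topic "test"
./kafka-acls.sh --authorizer-properties zookeeper.connect=localhost:2181 --list --topic test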

6. Kafka connector

Connector core concepts

  1. Connector instances
  2. Number of tasks
  3. Event threads
  4. Converters

6.1. Understand the usage scenarios of the connector

Connectors are generally used to build data pipelines, in two scenarios:
1. As start and end points of a pipeline (for example, moving Kafka data out to HBase, or moving Oracle data into Kafka)
2. As an intermediate medium for data transmission (for example, staging massive data in Kafka as temporary storage on its way into ES)

6.2. Features and advantages

Characteristics

  1. Universal framework
  2. Stand-alone mode and distributed mode
  3. REST interface
  4. Automatic management of offset
  5. Distributed and scalable
  6. Data flow and batch integration

Connector types

  1. Source connector
  2. Sink connector

6.3. Operating Kafka Connector

Import data into Kafka in stand-alone mode

Step 1: Create the file to be imported
Step 2: Modify the configuration file ../config/connect-file-source.properties

./connect-standalone.sh  ../config/connect-standalone.properties ../config/connect-file-source.properties 
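
For reference, the stock ../config/connect-file-source.properties shipped with Kafka looks roughly like this (adjust file and topic to suit):

# Source connector: read lines from a file into a topic
name=local-file-source
connector.class=FileStreamSource
tasks.max=1
file=test.txt
topic=connect-test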

Import data into Kafka in distributed mode

./connect-distributed.sh ../config/connect-distributed.properties 

# Check the version number
curl http://dns:8083
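
In distributed mode, connectors are managed entirely through this REST interface; a sketch of two common calls (the connector name and file path are assumptions):

# List the connectors currently running
curl http://dns:8083/connectors

# Create a file source connector via REST
curl -X POST -H "Content-Type: application/json" \
  --data '{"name":"file-source","config":{"connector.class":"FileStreamSource","tasks.max":"1","file":"/tmp/test.txt","topic":"connect-test"}}' \
  http://dns:8083/connectors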

6.4. Develop a simple Kafka connector plug-in

Write the Source connector

1. SourceConnector class: initializes the connector configuration and the number of tasks
2. SourceTask class: implements reading from standard input or from a file

Write the Sink connector

1. SinkTask class: implements writing to standard output or to a file
2. SinkConnector class: initializes the connector configuration and the number of tasks

