Kafka Introduction and Installation Configuration

1. Introduction

    Kafka is a distributed publish-subscribe messaging system. It was originally developed at LinkedIn, is written in Scala, and was later donated to Apache, where it is now a top-level project.

    Kafka is a distributed, partitioned, replicated messaging system. It provides the functionality of an ordinary messaging system, but with its own unique design.

    It provides features similar to JMS (the Java Message Service specification), but its design and implementation are completely different, and it is not an implementation of the JMS specification.

    Kafka categorizes messages by Topic when storing them. The sender of a message is called a Producer and the receiver a Consumer. A Kafka cluster is composed of multiple Kafka instances, and each instance (server) is called a broker.

    The Kafka cluster, producers, and consumers all rely on ZooKeeper to guarantee system availability and to store some of the cluster's metadata.

    In summary:

    Kafka is a distributed message queue that stores data by topic. It has three roles, Producer, Consumer, and Broker, and uses ZooKeeper as the cluster's coordination tool.

1. Features of Kafka

1. High throughput

    In theory, Kafka can produce about 250,000 messages (50 MB) per second and process about 550,000 messages (110 MB) per second. In a production environment this rate fluctuates, and it is roughly bounded by disk I/O speed.

2. Persistent data storage

    Messages can be persisted to disk, so Kafka supports both batch consumption, such as ETL, and real-time applications. Persisting data to disk and replicating it prevents data loss.

3. Distributed and easy to scale

    Producers, brokers, and consumers can all be multiple and are distributed, and machines can be added without downtime.

4. Client-maintained state

    The state of message consumption is maintained on the consumer side, not the server side, and consumers rebalance automatically on failure.

2. Basic Concepts

    Kafka groups messages by topic. Data in different topics is isolated from each other.

    Programs that publish messages to Kafka topics are called producers.

    A program that subscribes to topics and consumes messages is called a consumer.

    Kafka runs in a cluster and can be composed of one or more services, each of which is called a broker.

    Producers send messages to the Kafka cluster over the network, and the cluster provides messages to consumers.

    Clients and servers communicate over TCP. Kafka provides a Java client, and clients in many other languages are supported.

1、Topic

A topic is a category under which a set of messages is grouped.

1. Partition

Kafka partitions the logs for each topic.

A partition in Kafka is the basic unit of load balancing and failure recovery.


2. Offset

    Each partition is an ordered, immutable sequence of messages that is continually appended to. Each message in a partition has a sequential number called the offset, which uniquely identifies the message within the partition. The Kafka cluster retains all published messages, whether or not they have been consumed, for a configurable period of time.

    For example, if the message retention policy is set to 2 days, a message can be consumed within two days of being published; after that it is discarded to free up space. Kafka's performance is effectively constant with respect to data size, so retaining a lot of data is not a problem.
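    For illustration, the retention policy above corresponds to a broker setting like the following in server.properties (the 48-hour value is just an example; log.retention.hours is a standard broker option):

# Example only: keep messages for 2 days before they become eligible for deletion
log.retention.hours=48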

3. Partition replicas

    Each partition is replicated on several servers in the Kafka cluster so that those servers can jointly handle data and requests; the number of replicas is configurable. Replicas are what make Kafka fault-tolerant.

    Each partition has one server acting as its "leader" and zero or more servers acting as "followers". The leader handles all reads and writes for the partition, while the followers replicate the leader's data. If the leader goes down, one of the followers automatically becomes the new leader.

    Each server in the cluster plays both roles at once: it is the leader for some of the partitions it holds and a follower for others, which gives the cluster better load balancing.

    Partitioning the log serves the following purposes:

    First, it keeps each log from growing too large to be stored on a single server. Second, each partition can be published and consumed independently, which makes concurrent operations on a topic possible.

    Partitions are thus the basic unit of distributed data storage, load balancing, and failure recovery.

2、Producers

    The producer publishes messages to the topic it specifies and is responsible for deciding which partition each message goes to. Usually a partition is selected at random by the load-balancing mechanism, but a specific partition function can also be used; the latter is the more common choice.
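    As a sketch of the second approach, the 0.8-era producer API used later in this document lets you plug in a custom partition function. The class below is hypothetical (any package and class name works); it routes messages with the same key to the same partition:

import kafka.producer.Partitioner;
import kafka.utils.VerifiableProperties;

// Hypothetical example partitioner for the 0.8 producer API.
public class KeyHashPartitioner implements Partitioner {
    // The 0.8 producer instantiates partitioners through this constructor.
    public KeyHashPartitioner(VerifiableProperties props) {}

    // Map a message key to a partition; the mask keeps the hash non-negative.
    public int partition(Object key, int numPartitions) {
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }
}

    The producer would then be configured with props.put("partitioner.class", "com.example.KeyHashPartitioner").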

3、Consumers

    In fact, the only piece of state each consumer needs to maintain is its position in the log, that is, the offset. The offset is maintained by the consumer itself. Normally it increases steadily as the consumer reads messages, but the consumer can actually read messages in any order; for example, it can reset the offset to an older value to re-read earlier messages.

    Together, these features make Kafka consumers very lightweight: they can come and go without affecting the cluster or other consumers. For example, you can use the command line to "tail" a topic without disturbing consumers that are already consuming its messages.

    There are usually two modes for consuming messages: queuing and publish-subscribe.

1. Queue mode

    In queue mode, multiple consumers can read messages from the server at the same time, but each message is read by only one of them.

    In plain terms, the consumers compete: they all pull data from the broker, there is only one copy of each message, and whichever consumer grabs it gets it.

2. Publish-subscribe mode

    In publish-subscribe mode, messages are broadcast to all consumers. In this mode, each consumer can get the same message data.

3. Consumer groups

    Consumers can join consumer groups, which are used to implement both of the modes above.

1> Within the group

    If all consumers are in the same group, this becomes the traditional queue mode, which implements load balancing among the consumers.

    Consumers within a group operate in queue mode, competing for the messages in a topic: each message is delivered to only one member of the group. Consumers in the same group may run in different programs or on different machines.

2> Between groups

    If the consumers are all in different groups, this becomes the publish-subscribe mode, and all messages are delivered to all consumers.

    When multiple consumer groups consume the same topic, the groups behave in publish-subscribe fashion toward each other: the data is shared, and every group receives all of the messages in the topic.

3> Typical usage

    The common pattern is for each topic to be consumed by some number of consumer groups, each group being one logical "subscriber". For fault tolerance and stability, each group consists of several consumers, which compete within the group to achieve load balancing. Inside a group there is competitive load balancing; between groups the data is shared without interference. This is really a publish-subscribe model in which the subscriber is a group rather than a single consumer.
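    A minimal sketch of how the two modes fall out of configuration alone, assuming two consumer properties files (the group names are made up): consumers that share a group.id split the topic's messages between them, while consumers with different group.ids each receive the full stream.

# consumer A and consumer B: same group => queue mode, each message goes to one of them
group.id=billing

# consumer C: different group => publish-subscribe, it also receives every message
group.id=audit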

4. Compared with traditional message queues

    Compared with traditional messaging systems, Kafka can guarantee orderliness well.

    Traditional queues keep ordered messages on the server. If multiple consumers consume from the server at the same time, the server distributes messages in the order they are stored; but although the server dispatches them in order, delivery to the consumers is asynchronous, so by the time the messages arrive the original order may be lost. In other words, concurrent consumption leads to disorder. To avoid this, such messaging systems often resort to the concept of an "exclusive consumer": only one consumer is allowed to consume messages, which of course sacrifices concurrency.

    Kafka does better here. Through the concept of partitioning, Kafka provides both ordering guarantees and load balancing when consumers operate concurrently. Each partition is assigned to exactly one consumer within a group, so that partition's messages are consumed by that consumer in order. Because there are multiple partitions, load can still be balanced across the consumers. Note that the number of consumers in a group cannot exceed the number of partitions: a topic supports only as many concurrent consumers as it has partitions.

    Kafka can only guarantee message ordering within a single partition, not across partitions, which already satisfies the needs of most applications. If total ordering of all messages in a topic is required, the topic must have exactly one partition, which also means each consumer group can use only one consumer.
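    Per-partition ordering is most useful when related messages share a key, since the default partitioner hashes the key to pick a partition. A sketch using the 0.8 producer API shown later in this document ("orders" and "user-42" are made-up names):

// Both messages carry the same key, so they land in the same partition
// and are therefore consumed in the order they were sent.
producer.send(new KeyedMessage<String, String>("orders", "user-42", "order-created"));
producer.send(new KeyedMessage<String, String>("orders", "user-42", "order-paid"));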

5. Reasons to choose Kafka

    Why do message queues in big data environments so often choose Kafka?

    Distributed data storage provides better performance, reliability, and scalability.

    Kafka stores data on disk, distributed by topic and partition, and persists it, providing the capacity to store massive amounts of data.

    Because it reads and writes the disk sequentially, performance depends on disk performance rather than on the volume of stored data.

3. Installation configuration

1. Download and install

    Download the Kafka installation package and upload it to the Linux server.

    Unzip:

tar -zxvf kafka_2.9.2-0.8.1.1.tgz

    After decompression the installation is essentially complete, but the corresponding configuration is still required.

2. Configuration

1. Pseudo-distributed

1> server.properties

    Modify the server.properties file.

log.dirs=/tmp/kafka-logs

    This option configures where Kafka stores its data and needs to be changed.
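    For example (the path is illustrative; any planned directory outside /tmp works):

log.dirs=/home/kafka/kafka-logs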

2> zookeeper.properties

    Modify the zookeeper.properties configuration file. This is the configuration file for Kafka's built-in ZooKeeper: to keep the software self-contained, Kafka ships with its own ZooKeeper, so there is no need to install ZooKeeper separately for single-machine (pseudo-distributed) use.

    The following item configures ZooKeeper's data storage location, which defaults to /tmp and needs to be modified.

dataDir=/tmp/zookeeper

3> Start Kafka

    Start zookeeper:

bin/zookeeper-server-start.sh config/zookeeper.properties &

Start kafka:

bin/kafka-server-start.sh config/server.properties

2. Fully distributed

1> server.properties

    In the config directory, modify server.properties and modify the following parameters in the file:

broker.id=0 # ID of this broker
port=9092 # port to listen on
log.dirs=/tmp/kafka-logs-1 # log storage directory
zookeeper.connect=yun01:2181

    The broker.id in the cluster must be unique.

    The log storage directory should be changed to a planned location; the default /tmp directory must not be used.

    zookeeper.connect must list the ip-or-hostname:port of every server in the ZooKeeper cluster.
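    For example, on a three-node cluster using this document's yun01/yun02/yun03 hostnames (assumed to resolve on every machine), the first broker might use:

# unique per broker: 0, 1, 2, ...
broker.id=0
port=9092
log.dirs=/home/kafka/kafka-logs
# identical on every broker: the full ZooKeeper ensemble
zookeeper.connect=yun01:2181,yun02:2181,yun03:2181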

2> Start

①Start zookeeper

    Execute the following startup commands on each machine:

zkServer.sh start

②Start kafka

bin/kafka-server-start.sh config/server.properties &

    By default the Kafka service occupies the console, so the trailing & lets it run in the background.
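    To detach it from the terminal entirely, a standard shell idiom (not Kafka-specific) can be used:

nohup bin/kafka-server-start.sh config/server.properties > kafka.log 2>&1 &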

3. Test

1. Create a topic

    Create a topic with 3 replicas:

bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 3 --partitions 1 --topic topicname

2. Check

1> View topics

bin/kafka-topics.sh --list --zookeeper localhost:2181

2> View topic details

    View the partition and replica information for a topic:

bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic topicname

3. Produce messages

    Send messages to the topic:

bin/kafka-console-producer.sh --broker-list localhost:9092 --topic topicname

4. Consume messages

    Consume the messages:

bin/kafka-console-consumer.sh --zookeeper localhost:2181 --from-beginning --topic topicname

5. Experiment: Fault Tolerance

    Kill a broker process to simulate downtime and observe Kafka's fault tolerance (7564 below is an example PID):

kill -9 7564
bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic topicname
bin/kafka-console-consumer.sh --zookeeper localhost:2181 --from-beginning --topic topicname
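
    The PID 7564 above is only an example. On a JVM host, the broker's actual PID can be looked up first, for instance with jps (the broker's main class shows up as "Kafka"):

jps | grep Kafka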

    Restart the downed node:

    Start zookeeper:

zkServer.sh start

Start kafka:

bin/kafka-server-start.sh config/server.properties

4. Using Kafka

1. Shell operations

1. Create a topic

bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test

2. View topics

bin/kafka-topics.sh --list --zookeeper localhost:2181

3. Produce messages

Use the command-line producer to read messages from a file or from standard input and send them to the server:

bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
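
Because the console producer reads standard input, sending a prepared file is plain shell redirection (messages.txt is a hypothetical file with one message per line):

bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test < messages.txt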

4. Consume messages

Start the command-line consumer to read messages and print them to standard output:

bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic test --from-beginning

2. Java API operation

1. Build a development environment

    Create a Java project and import the Kafka-related jar packages. The jars are in the libs directory of the Kafka installation; note that the directory contains other file types besides the jars, so copy only the jar files.
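    Alternatively, if the project uses Maven, the same client library can be pulled from Maven Central; the coordinates below match the Scala 2.9.2 / 0.8.1.1 build unpacked earlier:

<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka_2.9.2</artifactId>
    <version>0.8.1.1</version>
</dependency>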

2. Code

1> Consumer

// Required imports (Kafka 0.8 high-level consumer API plus JUnit):
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;

import org.junit.Test;

import kafka.consumer.Consumer;
import kafka.consumer.ConsumerConfig;
import kafka.consumer.ConsumerIterator;
import kafka.consumer.KafkaStream;
import kafka.javaapi.consumer.ConsumerConnector;

/**
 * Receive data.
 */
@Test
public void consumerReceive() throws Exception {
  Properties properties = new Properties();
  properties.put("zookeeper.connect", "yun01:2181,yun02:2181,yun03:2181"); // declare the ZooKeeper ensemble
  // Use a group that has not consumed this topic before: a group that has already
  // committed offsets for the topic will not re-read the earlier data.
  properties.put("group.id", "group2xx");
  // With no committed offset yet, start from the smallest offset available.
  properties.put("auto.offset.reset", "smallest");
  // Optional tuning:
  // properties.put("zookeeper.session.timeout.ms", "400");
  // properties.put("zookeeper.sync.time.ms", "200");
  // properties.put("auto.commit.interval.ms", "1000");
  ConsumerConnector consumer = Consumer.createJavaConsumerConnector(new ConsumerConfig(properties));
  Map<String, Integer> topicCountMap = new HashMap<String, Integer>();
  topicCountMap.put("my-replicated-topic", 1); // read the topic with a single stream (thread)
  Map<String, List<KafkaStream<byte[], byte[]>>> messageStreams = consumer.createMessageStreams(topicCountMap);
  // Take the single stream for this topic and iterate over incoming messages.
  KafkaStream<byte[], byte[]> stream = messageStreams.get("my-replicated-topic").get(0);
  ConsumerIterator<byte[], byte[]> iterator = stream.iterator();
  while (iterator.hasNext()) {
    System.out.println("receive:" + new String(iterator.next().message()));
  }
}

2> Producer

// Required imports (Kafka 0.8 producer API plus JUnit):
import java.util.Properties;

import org.junit.Test;

import kafka.javaapi.producer.Producer;
import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;

/**
 * Send data.
 */
@Test
public void producerSend() {
  Properties props = new Properties();
  props.put("serializer.class", "kafka.serializer.StringEncoder"); // encode message values as strings
  props.put("metadata.broker.list", "192.168.242.101:9092");       // broker list for bootstrapping metadata
  Producer<Integer, String> producer = new Producer<Integer, String>(new ProducerConfig(props));
  producer.send(new KeyedMessage<Integer, String>("my-replicated-topic", "message~xxx123asdf"));
  producer.close();
}
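
To verify the send, the console consumer from the test section can be pointed at the same topic (using the same ZooKeeper address as the consumer code above):

bin/kafka-console-consumer.sh --zookeeper yun01:2181 --from-beginning --topic my-replicated-topic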
