A Detailed Summary of the Principles of the Big Data Framework Kafka

A brief history of Kafka
Kafka was originally developed by LinkedIn. It is a distributed, partitioned, replicated message system coordinated by ZooKeeper. Its greatest strength is the ability to process large volumes of data in real time, which suits scenarios such as Hadoop-based batch systems, low-latency real-time systems, the Storm/Spark streaming engines, web/nginx logs, access logs, and messaging services. Kafka is written in Scala; LinkedIn contributed it to the Apache Foundation in 2010, and it became a top-level open source project.

1. Introduction
The design of the file storage mechanism is one of the key measures of a message queue's performance and service quality. The following analyzes how Kafka stores files efficiently, from the perspective of Kafka's file storage mechanism and physical structure, and looks at its practical application.

1.1 Kafka characteristics:
- High throughput, low latency: Kafka can process hundreds of thousands of messages per second, with latencies as low as a few milliseconds. Each topic can be divided into multiple partitions, and a consumer group consumes the partitions in parallel.
- Scalability: a Kafka cluster supports hot expansion.
- Durability and reliability: messages are persisted to local disk, and data replication is supported to prevent data loss.
- Fault tolerance: nodes in the cluster are allowed to fail (with n replicas, up to n-1 node failures are tolerated).
- High concurrency: thousands of clients can read and write simultaneously.

1.2 Kafka usage scenarios:
- Log collection: a company can use Kafka to collect the logs of its various services and expose them through Kafka as a unified interface to various consumers, for example Hadoop, HBase, Solr, and so on.
- Messaging: decoupling producers from consumers, buffering messages, and so on.
- User activity tracking: Kafka is often used to record the activities of web or app users, such as browsing, searching, and clicking. Each server publishes these events to Kafka topics, and subscribers then consume the topics for real-time monitoring and analysis, or load the data into Hadoop or a data warehouse for offline analysis and mining.
- Operational metrics: Kafka is also often used to record operational monitoring data, collecting metrics from distributed applications and producing centralized feeds of operational data, such as alarms and reports.
- Stream processing: for example with Spark Streaming and Storm.
- Event sourcing.

1.3 Kafka design ideas
- Kafka Broker leader election: the Kafka broker cluster is managed by ZooKeeper. All Kafka broker nodes race to register a temporary node in ZooKeeper; only one broker can register it successfully and the others fail, so the broker that successfully registers the temporary node becomes the Kafka Broker Controller, while the other brokers are Kafka Broker followers. (This process is called the controller registering a ZooKeeper watch.) The controller listens for events from all the other brokers. If the controller goes down, its temporary node in ZooKeeper disappears, and all the brokers again race to register the temporary node; whichever succeeds becomes the new controller and the rest remain followers. For example, once a broker goes down, the controller reads from ZooKeeper the state of every partition on the downed broker and selects one replica from each partition's ISR list as the new leader (if the entire ISR list is gone, it selects a surviving replica as leader).
A bug once occurred here. TalkingData was running Kafka 0.8.1. After the controller registers successfully in ZooKeeper, its session timeout with ZooKeeper is 6 s: if the controller fails to heartbeat ZooKeeper for 6 s, ZooKeeper considers the controller dead and deletes its temporary node; the other brokers then stop treating it as the controller and race to register the temporary node again, and whichever broker registers successfully becomes the new controller, while the old controller is expected to shut down its listeners on the various nodes and events. In TalkingData's case, under very heavy read/write traffic, the controller went 6 s without talking to ZooKeeper because of network problems, so a new controller was elected, but the old controller's shutdown kept failing. With two controllers in the Kafka cluster, incoming messages from producers could not land, and data backed up.
There was another bug while TalkingData was on Kafka 0.8.1, involving the ack setting. With ack=0, the producer reports success as soon as the message is sent, regardless of whether the partition leader on the corresponding broker actually received it, let alone durably saved it to Kafka. With ack=1, the producer reports success once the message has been synchronously stored on the leader of the corresponding topic partition; the leader then replicates the message asynchronously to the other partition replicas. With ack=all (or -1), success is returned only after the message has been stored synchronously on both the leader and the corresponding partition replicas. A controller switch, however, also switches partition leaders (the partition leaders on the old controller's broker are re-elected onto other brokers), and that can lose data. A minimal producer sketch showing the ack setting follows.
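Below is a minimal sketch of how the ack level is chosen on the producer side, using the modern Java client, where the setting is called acks (the 0.8-era producer called it request.required.acks). The broker address localhost:9092 and the topic my-topic are hypothetical.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class AckDemoProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // hypothetical broker
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // "0": success on send, no broker ack (fastest, may lose data)
        // "1": success once the partition leader has stored the message
        // "all"/"-1": success once the leader and the in-sync replicas have it
        props.put("acks", "all");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("my-topic", "key", "value")); // hypothetical topic
        }
    }
}
```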
- Consumer group: each consumer (consumer thread) can be organized into a group (consumer group), and each message of a partition can be consumed by only one consumer (consumer thread) within a given consumer group. If a message needs to be consumed by multiple consumers (consumer threads), those consumers must be in different groups. Kafka does not allow two or more consumer threads of the same consumer group to process the messages of one partition concurrently; to consume the same topic again you must start a new consumer group. So, to consume one topic for several purposes at the same time, just start multiple consumer groups. Note that the consumers must read the messages inside a partition sequentially, and a newly started consumer by default begins reading from the latest message at the tip of the partition. Kafka consumers cannot process messages from one queue concurrently and mutually exclusively the way multiple BETs do in AMQ (with pessimistic FOR UPDATE locking): when several BETs consume data from one queue, row-level pessimistic locks (FOR UPDATE) are needed to guarantee that no two threads take the same message, which degrades consumption performance until throughput is insufficient. To guarantee throughput, Kafka instead allows only one consumer thread per consumer group to access a partition. If this is not efficient enough, add partitions to scale out, then add new consumer threads to consume them. If several different businesses need the data of the same topic, use multiple consumer groups: each group reads the messages sequentially, and the offset values are independent of one another, so there is no lock contention, horizontal scalability is fully exploited, and throughput is high. This also forms the concept of distributed consumption.
    When a consumer group is started to consume a topic, then no matter how many partitions the topic has and no matter how many consumer threads are configured in the group, all the consumer threads of that group together consume all of the partitions; even if the group has only one consumer thread, that thread consumes all the partitions. Therefore the best design is for the number of consumer threads in the group to equal the number of partitions, which gives the highest efficiency.
    A given message of a partition can be consumed by only one consumer within the same consumer group. Multiple consumers of one consumer group cannot consume one partition at the same time.
    Within a consumer group, no matter how many consumers there are, the group consumes all the partitions of the topic. When the group has fewer consumers than the topic has partitions (groups A and B in the figure), some consumer threads consume multiple partitions; in short, every partition of the topic gets consumed. When the number of consumers equals the number of partitions (group C in the figure), efficiency is highest: each partition has exactly one consumer thread consuming it. When the group has more consumers than there are partitions (group D in the figure), some consumer threads sit idle. Therefore, when configuring a consumer group, only the number of consumers needs to be specified, not a mapping to partition numbers; the consumers rebalance automatically.
    The same message can be consumed by multiple different consumer groups: each group's consumer still reads the messages sequentially with O(1) reads, so the message is consumed repeatedly, once per group. This is unlike AMQ, where multiple BET consumers lock each message as they consume it so that it cannot be consumed again. A minimal consumer group sketch follows.
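A minimal consumer sketch under the same assumptions (hypothetical broker localhost:9092, hypothetical topic my-topic, modern Java client). Every process started with the same group.id joins one consumer group and is assigned a disjoint subset of the partitions; starting the same program with a different group.id re-reads the topic independently.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class GroupDemoConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // hypothetical broker
        props.put("group.id", "my-group"); // consumers sharing this id split the partitions
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic")); // hypothetical topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // each partition is read sequentially by exactly one thread in the group
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```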
- Consumer rebalance trigger conditions: (1) adding or removing a consumer triggers a rebalance of the consumer group; (2) adding or removing a broker also triggers a consumer rebalance.
- Consumer: a consumer reads the messages inside a partition sequentially (O(1) reads), so it must maintain the offset recording where it last read. With the high-level API the offset is stored in ZooKeeper; with the low-level API the consumer maintains the offset itself. In general the high-level API is used. As for the consumer's delivery guarantee: by default the message is read, the offset is committed first, and the message is processed afterwards. Autocommit defaults to true, so the offset is advanced (offset + 1) when the commit happens; if processing then fails, the message is lost, because the offset has already moved past it. The consumer can instead be configured to read and process the message before committing; the consumer then responds more slowly, and duplicate handling may be needed for reprocessed messages, but a failed message is re-read rather than lost (a sketch of this commit-after-processing pattern follows this section).
Under normal circumstances, one consumer group processes the messages of one topic. The best practice is for the number of consumers in the group to equal the number of partitions in the topic, which is the most efficient: one consumer thread processes one partition. If the group has fewer consumers than the topic has partitions, some consumer threads handle multiple partitions (this is Kafka's automatic mechanism; we do not have to specify it), but in short all of the topic's partitions get handled. If the group has more consumers than the topic has partitions, the extra consumer threads idle and do nothing while the rest each process one partition, which wastes resources, since two consumer threads can never process the same partition. In a line of several distributed services, the number of consumers inside each service is smaller than the number of partitions of the corresponding topic, but the total number of consumers across all the services equals the number of partitions, because all the distributed services consume as one consumer group. If they consumed as different consumer groups they would process duplicate messages (within one consumer group, two consumers cannot handle the same partition; different consumer groups can each process the same topic, each reading the messages sequentially, so the messages are processed repeatedly; that setup normally corresponds to two different business logics, where two consumer groups are started to process one topic).
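A minimal sketch of the commit-after-processing pattern, under the same hypothetical broker/topic assumptions: autocommit is disabled and the offset is committed only after the batch has been processed, trading the default at-most-once behavior described above for at-least-once delivery.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class CommitAfterProcessing {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // hypothetical broker
        props.put("group.id", "my-group");                // hypothetical group
        props.put("enable.auto.commit", "false");         // commit manually, after processing
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic")); // hypothetical topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // if processing throws, the offset is never committed and the
                    // message is re-read on restart: at-least-once delivery
                    System.out.println(record.value()); // placeholder for real work
                }
                consumer.commitSync(); // commit only after the whole batch is processed
            }
        }
    }
}
```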

Before delving deeper into Kafka, you need to understand a few key terms: topics, brokers, producers, consumers, and so on. The following describes the main components and terminology in detail.

In the figure above, a topic is configured with three partitions. Partition 1 has two offsets, 0 and 1. Partition 2 has four offsets: 0, 1, 2, and 3. Partition 3 has one offset, 0. The id of a replica is the same as the id of the server hosting it.

Assume that the topic's replication factor is set to 3: Kafka then creates three identical replicas of each partition and places them in the cluster so that they are available for all its operations. To balance load within the cluster, each broker stores one or more of these partitions. Multiple producers and consumers can publish and retrieve messages at the same time. (A sketch of creating such a topic follows.)
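As a minimal sketch, here is how such a topic could be created with the Java AdminClient (a newer API than the 0.8-era tooling discussed elsewhere in this article). The broker address and topic name are hypothetical, and a replication factor of 3 assumes at least three brokers in the cluster.

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateReplicatedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // hypothetical broker

        try (AdminClient admin = AdminClient.create(props)) {
            // 3 partitions, replication factor 3 (requires >= 3 brokers)
            NewTopic topic = new NewTopic("my-topic", 3, (short) 3);
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}
```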

  • Topics - a stream of messages belonging to a particular category is called a topic; data is stored in topics. A topic is split into partitions. For each topic, Kafka keeps at least one partition. Each such partition contains messages in an ordered, immutable sequence. A partition is implemented as a set of segment files of equal size.
  • Partition - a topic may have many partitions, so it can handle an arbitrary amount of data.
  • Partition offset - each message within a partition has a unique sequence id called an offset.
  • Replicas of partition - replicas are nothing but backups of a partition. Replicas are never used to read or write data by clients; they exist to prevent data loss.
  • Brokers

    • Brokers are simple systems responsible for maintaining the published data. Each broker may have zero or more partitions per topic. Assume a topic has N partitions and the cluster has N brokers: each broker then has one partition.
    • Suppose a topic has N partitions and there are more than N brokers (N + M): the first N brokers each have one partition, and the remaining M brokers have no partition of that particular topic.
    • Suppose a topic has N partitions and there are fewer than N brokers (N - M): each broker then holds one or more partitions. This case is not recommended, because the load is distributed unevenly among the brokers. (A small sketch of this counting argument follows the terminology list below.)
  • Kafka Cluster - a Kafka deployment with more than one broker is called a Kafka cluster. A Kafka cluster can be expanded without downtime. These clusters manage the persistence and replication of message data.

  • Producers - producers are publishers of messages to one or more Kafka topics. Producers send data to Kafka brokers. Every time a producer publishes a message to a broker, the broker simply appends the message to the last segment file; that is, the message is appended to a partition. A producer can also send messages to a partition of its choice.
  • Consumers - consumers read data from brokers. Consumers subscribe to one or more topics and consume the published messages by pulling data from the brokers.
  • Leader - the leader is the node responsible for all reads and writes of a given partition. Every partition has one server acting as its leader.
  • Follower - a node that follows the leader's instructions is called a follower. If the leader fails, one of the followers automatically becomes the new leader. A follower acts like a normal consumer: it pulls messages and updates its own data store.
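The partition-to-broker cases above boil down to a simple counting argument. The sketch below is only that counting argument with hypothetical sizes, not Kafka's actual placement algorithm (which also accounts for replicas and balancing).

```java
// Round-robin spread of P partitions over B brokers, covering the three cases above:
// B == P gives one partition per broker, B > P leaves brokers empty, B < P doubles up.
public class PartitionSpread {
    public static void main(String[] args) {
        int partitions = 4; // hypothetical topic with 4 partitions
        int brokers = 3;    // hypothetical cluster with 3 brokers (the B < P case)
        for (int p = 0; p < partitions; p++) {
            System.out.printf("partition %d -> broker %d%n", p, p % brokers);
        }
        // Output: brokers 0,1,2,0 -- broker 0 carries two partitions,
        // the uneven load the text above recommends avoiding.
    }
}
```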

Kafka's tools are packaged under org.apache.kafka.tools.*. They are divided into system tools and replication tools.

System Tools

System tools can be run from the command line using the run-class script. The syntax is as follows -

bin/kafka-run-class.sh package.class -- options


Some of the system tools are mentioned below -

  • Kafka Migration Tool - this tool is used to migrate a broker from one version to another.
  • Mirror Maker - this tool is used to mirror one Kafka cluster to another.
  • Consumer Offset Checker - this tool displays the consumer group, topic, partition, offset, log size, and owner for the specified set of topics and consumer group.

Replication Tools

Kafka replication is a high-level design construct. The purpose of the replication tools is to provide stronger durability and higher availability. Some of the replication tools are mentioned below -

  • Create Topic Tool - this creates a topic with a default number of partitions and replication factor, and uses Kafka's default scheme for replica assignment.

  • List Topic Tool - this tool lists the information for a given list of topics. If no topics are provided on the command line, the tool queries ZooKeeper to get all topics and lists their information. The fields displayed are topic name, partition, leader, replicas, and isr.

  • Add Partition Tool - when a topic is created, the number of partitions for the topic has to be specified. Later, more partitions may be needed as the topic's volume grows. This tool helps to add more partitions for a specific topic and also allows manual replica assignment of the added partitions.

