Qihoo 360's Kafka Practice

Business issues

  1. Cross-IDC read/write traffic was a major problem. Many businesses produced and consumed the same data in multiple IDCs, wasting large amounts of cross-IDC bandwidth; the cross-IDC network was also unstable and frequently hit anomalies.
  2. Having businesses use the official client directly is difficult; a secondary wrapper is needed:
    • Exception handling tends to be incomplete
    • Using Kafka well places high demands on business teams
    • The client cannot fully cover 360's fault-tolerance requirements: when the network or cluster is abnormal to the point of unavailability, data is lost
    • It does not support exactly-once semantics, which some businesses need
    • ......
  3. Insufficient support for high data availability: if all replicas of a partition (data shard) are placed in a single rack, the risk of losing data is high
  4. Uneven distribution of data shards (partitions). When the number of cluster nodes is much larger than the number of partitions, Kafka's default assignment algorithm can cause data skew: some nodes host many partitions while others host few, leaving hardware resource utilization low
  5. Load imbalance caused by configuration differences between new and old machines in the cluster. Hardware improves rapidly over time, and a newly added machine can outperform an old one by a large multiple; with an even distribution the load is unbalanced: new machines are underutilized while old machines are overloaded.

Technology Selection

360 evaluated Kafka along the following dimensions: community activity, client support, throughput, and reliability, backed by a side-by-side data comparison of candidate MQs.

 

 

Highlights of Kafka's architecture and design

  • Kafka's performance and throughput are high: through pagecache, the sendfile/zero-copy mechanism, and sequential disk reads and writes, it achieves high throughput even on ordinary disks, giving a relatively high cost-performance ratio.
  • Kafka ensures high data availability through its replica and ISR (in-sync replica) mechanisms.
  • A Kafka cluster has two management roles: the controller manages the cluster itself, while the coordinator handles business-level (consumer group) management. Both roles are served by brokers inside Kafka; when such a broker fails, another broker is elected to replace it. This reflects a decentralized design in Kafka, although the controller itself is still a bottleneck, analogous to Hadoop's NameNode.
  • Distributed systems are generally either CP or AP. Kafka is more flexible: each topic can be configured to lean toward CP or toward AP according to its own business characteristics.
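As a standard illustration of this per-topic flexibility (these are stock Kafka settings, not 360-specific values), a topic can be biased toward CP or toward AP:

```properties
# CP-leaning: a write succeeds only after enough in-sync replicas acknowledge it
acks=all                              # producer setting
min.insync.replicas=2                 # topic/broker setting (with replication.factor=3)
unclean.leader.election.enable=false  # never elect an out-of-sync leader

# AP-leaning: stay writable through failures, at the risk of losing data
acks=1
min.insync.replicas=1
unclean.leader.election.enable=true
```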

 

 

Based on the multi-dimensional comparison of multiple MQs above, and weighing the advantages and disadvantages, 360 ultimately chose Kafka.

Production Environment

  • Hundred-billion-scale log volume in the cluster, PB-scale data volume
  • Cluster size: more than 100 machines
  • Per-machine hardware: 24-core CPU, 10 Gb/s network, 128 GB RAM, 12 × 4 TB disks in JBOD
  • Peak of 600,000 QPS on a single topic
  • Cluster-wide peak of about 5,000,000 QPS
  • Deployed version: Kafka 1.1.1

Solutions

Addressing the major cross-IDC read/write bandwidth problem:

  • Use MirrorMaker to synchronize data between IDCs, so each piece of data is synchronized across the IDC boundary only once
    1. First, this shields businesses from the effects of cross-IDC anomalies
    2. Second, it saves inter-IDC bandwidth: the synchronization mechanism guarantees that each piece of data is transmitted across IDCs exactly once
  • All businesses read and write only their local cluster
  • Pool hardware resources with Docker to provide a service-level SLA
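Such one-way synchronization can be sketched with the MirrorMaker tool shipped with Kafka 1.1.x; the topic pattern, file names, and broker addresses below are illustrative, not 360's actual setup:

```shell
# consumer.properties: points at the source IDC's cluster, e.g.
#   bootstrap.servers=kafka-src-idc:9092
#   group.id=mirror-maker-group
# producer.properties: points at the destination IDC's cluster, e.g.
#   bootstrap.servers=kafka-dst-idc:9092
#   acks=all
bin/kafka-mirror-maker.sh \
    --consumer.config consumer.properties \
    --producer.config producer.properties \
    --whitelist 'log-topic-.*' \
    --num.streams 4
```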

Solving Kafka client ease of use and stability

  • Build a secondary wrapper around the official Kafka client, following these principles:
    1. Shield businesses from the details, exposing an interface that is as simple as possible
    2. Let the framework handle all the details, reducing the chance of business-side mistakes
    3. Add two components on top of the producer and consumer: LogProducer and LogConsumer
  • Remain available in extreme cases
    1. Guarantee availability even under network or cluster anomalies: if the network or cluster is unavailable, data first falls back to local disk and is recovered from local disk into Kafka once service is restored.
  • LogProducer provides
    1. At-least-once semantics
  • LogConsumer provides
    1. At-least-once semantics
    2. Exactly-once semantics, for which the business must implement a rollback interface
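The LogProducer component is 360-internal, but its local-disk fallback idea can be sketched as follows. This is a hypothetical reconstruction from the description above, not the real implementation; `send_fn` and the spool-file format are assumptions:

```python
import os


class LogProducer:
    """Sketch of a producer wrapper with a local-disk fallback.

    `send_fn` stands in for the real Kafka produce call; when it raises,
    the message is spooled to a local file instead of being dropped.
    """

    def __init__(self, send_fn, spool_path="producer.spool"):
        self.send_fn = send_fn
        self.spool_path = spool_path

    def send(self, message: str) -> bool:
        try:
            self.send_fn(message)          # normal path: write to Kafka
            return True
        except Exception:
            # Fallback path: persist locally so the message survives a
            # network or cluster outage (at-least-once: may duplicate).
            with open(self.spool_path, "a", encoding="utf-8") as f:
                f.write(message + "\n")
            return False

    def recover(self) -> int:
        """Replay spooled messages into Kafka after service is restored."""
        if not os.path.exists(self.spool_path):
            return 0
        with open(self.spool_path, encoding="utf-8") as f:
            pending = [line.rstrip("\n") for line in f]
        for message in pending:
            self.send_fn(message)          # raises again if still down
        os.remove(self.spool_path)
        return len(pending)
```

Because spooled messages are replayed rather than dropped, the wrapper gives at-least-once delivery; duplicates are possible, which matches the semantics listed above.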

Solving data distribution, resource utilization and balance problems

  • Use a consistent hash ring with virtual nodes to solve the uneven distribution of data shards (partitions)
    1. Build the hash ring: MD5-hash each vnode_str (e.g. hostname-v0) to obtain a virtual node key vnode_key; store the virtual node → physical node mapping in the ring dictionary, and add vnode_key to the sorted list sorted_keys
    2. Place partition replicas on the hash ring: take (topic_name + partition_num + replica_num) as the key and apply the same MD5 hash to obtain replica_key; binary-search for replica_key's position in sorted_keys; finally map through the ring dictionary to a physical node. The replica assignment is then complete
  • Distribute replicas across multiple racks to solve the low-availability problem of all replicas clustering in a single rack
    1. Implement rack awareness for replicas: attach rack information to each physical node, and record the racks already used while distributing a partition's replicas to physical nodes. If a rack would repeat, advance from the vnode_key position (+1) to the next physical node, guaranteeing that the three replicas land on three different racks (when replica = 3)
  • Assign partitions and replicas by weight to solve the load imbalance caused by configuration differences between new and old machines
    • Adding a physical node migrates only a small fraction of the data;
    • Different weights can be set per physical machine configuration, supporting heterogeneous cluster deployment;
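The ring-building and replica-placement steps above can be sketched as follows. This is a simplified reconstruction from the description, not 360's actual code; names such as `HashRing` and the virtual-node count are assumptions:

```python
import bisect
import hashlib


def _md5_key(s: str) -> int:
    """Hash a string to an integer position on the ring (step 1's MD5)."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16)


class HashRing:
    """Consistent hash ring with virtual nodes and rack awareness."""

    def __init__(self, nodes, vnodes_per_node=100):
        # nodes: {hostname: rack}
        self.racks = dict(nodes)
        self.ring = {}            # vnode_key -> physical node (hostname)
        self.sorted_keys = []     # all vnode_keys, kept sorted
        for host in nodes:
            for i in range(vnodes_per_node):
                key = _md5_key(f"{host}-v{i}")   # vnode_str, e.g. host1-v0
                self.ring[key] = host
                bisect.insort(self.sorted_keys, key)

    def assign_replicas(self, topic, partition, replication_factor=3):
        """Assign one partition's replicas to hosts on distinct racks."""
        replicas, used_racks = [], set()
        for r in range(replication_factor):
            # Step 2: hash (topic_name + partition_num + replica_num)
            replica_key = _md5_key(f"{topic}{partition}{r}")
            pos = bisect.bisect(self.sorted_keys, replica_key) % len(self.sorted_keys)
            # Rack awareness: walk the ring (+1) until we hit a host
            # that is new AND sits on a rack not used by this partition.
            while True:
                host = self.ring[self.sorted_keys[pos]]
                if host not in replicas and self.racks[host] not in used_racks:
                    break
                pos = (pos + 1) % len(self.sorted_keys)
            replicas.append(host)
            used_racks.add(self.racks[host])
        return replicas
```

The per-machine weights mentioned above fit naturally into this scheme: a heavier (newer) machine simply gets proportionally more virtual nodes, so it attracts more partitions, and adding a node only remaps the ring segments adjacent to its virtual nodes.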

Enable authentication, authorization and ACLs to improve security

  • A whitelist mechanism: legitimate topics and consumers are managed through an application work-order (ticketing) process, and illegal ones are filtered out
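Kafka's built-in ACL tool (as shipped with the 1.1.x tool set) is the standard way to enforce such a whitelist at the cluster level; the principal, topic, group, and ZooKeeper address below are illustrative:

```shell
# Allow only the approved principal to read the approved topic and group;
# with allow.everyone.if.no.acl.found=false, everything else is denied.
bin/kafka-acls.sh --authorizer-properties zookeeper.connect=zk:2181 \
    --add --allow-principal User:app-web \
    --operation Read --topic access-log --group web-consumer
```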

Monitoring and alerting support

  • Use jmx_exporter + Prometheus + Grafana for dashboards: deploy a jmx_exporter on each broker, have Prometheus pull the metrics, and finally display them with Grafana
  • Use Kafka Manager to inspect instantaneous metrics
  • Use Burrow to monitor consumer lag
  • Use Wonder for alerting; this is a 360-internal component, similar to Zabbix
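A minimal Prometheus scrape configuration for this pipeline might look like the following; the port 7071 for jmx_exporter's javaagent endpoint is an assumption, as it is deployment-specific:

```yaml
scrape_configs:
  - job_name: kafka-brokers
    static_configs:
      - targets:          # one jmx_exporter endpoint per broker
          - broker1:7071
          - broker2:7071
          - broker3:7071
```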

SLA guarantee

  • Businesses are divided into three priority levels: high-priority topics get focused protection, while low-priority topics can be degraded
  • When a service suddenly comes under high load, degrade low-priority services in the emergency, e.g. by throttling requests / replica traffic
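One stock mechanism for this kind of throttling is Kafka's client quota feature (standard tooling in Kafka 1.1.x; the entity name and byte rates below are illustrative):

```shell
# Cap a low-priority client at 1 MB/s for both produce and consume
bin/kafka-configs.sh --zookeeper zk:2181 --alter \
    --add-config 'producer_byte_rate=1048576,consumer_byte_rate=1048576' \
    --entity-type clients --entity-name low-priority-app
```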

Reference material

Kafka in-depth practice at hundred-billion data scale: https://mp.weixin.qq.com/s/5p1IgayVXvCSLLc0Zvoqew

 

Reference blog: https://www.cnblogs.com/lizherui/p/12656777.html
