Kafka in Action (7) - Deploying a Kafka Cluster Gracefully

Since it is a cluster, it must consist of multiple Kafka nodes. A pseudo-cluster made of a single Kafka node is only good for routine testing and cannot meet production needs.
A real production environment has to weigh many factors against your own business requirements. Let's look at some of them (the order below is deliberate).

1 Operating system (OS)

You may ask: isn't Kafka a big data framework running on the JVM? Java is a cross-platform language, so does it make any difference which operating system Kafka is installed on?
The difference is considerable!

Indeed, Kafka is written in Scala and Java, and its source compiles to ".class" files.
In principle it should behave the same on any OS, yet the differences between operating systems still have a considerable impact on a Kafka cluster.
Unsurprisingly, Linux hosts by far the most production deployments.

In terms of how well the operating system fits Kafka, Linux is clearly a better choice than the alternatives, Windows in particular. But can you explain the specific reasons?

1.1 I/O model

An I/O model can be loosely understood as the way the OS carries out I/O instructions.
There are usually five mainstream I/O models:

  1. Blocking I/O
    e.g. a Java Socket in blocking mode
  2. Non-blocking I/O
    e.g. a Java Socket in non-blocking mode
  3. I/O multiplexing
    e.g. the select system call on Linux
  4. Signal-driven I/O
    the epoll system call sits between the third and fourth models
  5. Asynchronous I/O
    e.g. few Linux systems truly support it, while Windows provides a threading model called IOCP that belongs to this class

I won't go into the implementation details of each model here, because that is not the point of this article.

Back to the point: what does the I/O model have to do with Kafka?
The Kafka client uses Java's selector under the hood, and that selector is implemented differently on each platform:

  • On Linux, it is implemented with epoll
  • On Windows, it is implemented with select

So on this point, deploying Kafka on Linux is an advantage because it gets more efficient I/O performance.
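
As a rough analogy only (this is Python, not Kafka's Java code), Python's standard selectors module makes the same kind of platform-dependent choice: DefaultSelector resolves to an epoll-based class on Linux and typically to a select-based class on Windows. The helper names below are hypothetical and exist only for this sketch.

```python
import selectors
import socket


def default_selector_name() -> str:
    """Return the name of the multiplexing backend Python picks on this platform."""
    return selectors.DefaultSelector.__name__


def read_when_ready(payload: bytes) -> bytes:
    """Send payload over a local socket pair and read it once the selector reports readiness."""
    sel = selectors.DefaultSelector()
    writer, reader = socket.socketpair()
    try:
        # Ask the selector to notify us when the reader side has data.
        sel.register(reader, selectors.EVENT_READ)
        writer.sendall(payload)
        for key, _mask in sel.select(timeout=1.0):
            return key.fileobj.recv(len(payload))
        return b""  # nothing became readable within the timeout
    finally:
        sel.close()
        writer.close()
        reader.close()
```

Calling default_selector_name() on a Linux machine and on a Windows machine is a quick way to see the difference the bullets above describe.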

1.2 Network data transfer efficiency

Kafka produces and consumes messages over the network, but where are the messages stored?
On disk, of course!
So Kafka has to move large amounts of data between disk and network.
Linux offers zero-copy technology: when data moves between disk and network, it avoids expensive copies between kernel space and user space, which makes the transfer fast. Linux implements this zero-copy mechanism, but regrettably, on Windows you have to wait until Java 8 update 60 to "enjoy" it.

In short, deploying Kafka on Linux lets you benefit from the fast data transfer that zero-copy provides.
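
To make the idea concrete, here is a minimal Python sketch, again an analogy and not Kafka's own code. socket.sendfile() hands the file-to-socket transfer to the operating system through os.sendfile where that is available and falls back to ordinary send() calls otherwise. The function name is hypothetical, and the sketch assumes a small file so that a single thread can write and then read without blocking.

```python
import socket


def send_file_over_socket(path: str) -> bytes:
    """Push a file through a local socket pair and return what the receiver got.

    Assumes the file is small enough to fit in the socket buffer, because the
    same thread writes first and reads afterwards.
    """
    sender, receiver = socket.socketpair()
    try:
        with open(path, "rb") as f:
            # The kernel moves file data to the socket directly when os.sendfile
            # is usable; otherwise Python falls back to read-and-send.
            sender.sendfile(f)
        sender.shutdown(socket.SHUT_WR)  # signal end of stream to the receiver
        chunks = []
        while True:
            chunk = receiver.recv(65536)
            if not chunk:
                break
            chunks.append(chunk)
        return b"".join(chunks)
    finally:
        sender.close()
        receiver.close()
```

The application code never touches the file contents on the fast path, which is the saving the paragraph above refers to.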

1.3 Community support

The community currently makes no commitment to fixing Kafka bugs found on the Windows platform. Deploying Kafka on Windows is therefore suitable only for personal testing or functional verification, not for a production environment.

2 Disk

2.1 The big question: mechanical hard drive or solid-state drive?

  • The former is cheap and offers large capacity, but fails easily!
  • The latter performs much better, but is expensive!

The recommendation is to use ordinary mechanical hard drives.

  • Although Kafka uses the disk heavily, most of its access is sequential reads and writes, which largely sidesteps the biggest weakness of mechanical disks: slow random reads and writes. From this angle an SSD offers little performance advantage, and mechanical disks are inexpensive
  • The poor reliability that comes from being easily damaged is compensated for by mechanisms Kafka provides at the software level

2.2 Should you use a disk array (RAID)?

The two main advantages of using RAID are:

  • Provide redundant disk storage space
  • Load balancing

For Kafka, however:

  • Kafka implements its own redundancy mechanism to provide high reliability
  • Through its partition design, Kafka also achieves load balancing at the software level

Seen this way, the advantages of RAID are not so obvious. In practice many large companies do still hand Kafka's underlying storage to RAID, but Kafka now offers increasingly convenient high-reliability storage options of its own, so using RAID in production seems to matter less than it used to.
In summary, companies that care about cost-effectiveness can skip RAID and simply build their storage from ordinary disks. Mechanical disks are fully capable of serving a production Kafka environment.

2.3 Disk capacity

How much storage does the cluster actually need?
Kafka must save messages to disk; by default they are kept for a period of time and then deleted automatically.
This period is configurable, but how should you plan the storage capacity of a Kafka cluster based on your own business scenario and storage needs?

Suppose a business has the following requirements:

  • It sends 100 million messages a day to the Kafka cluster
  • Each message is stored in two copies to prevent data loss
  • Messages are retained for two weeks by default

Now suppose the average message size is 1KB. How much disk space should the Kafka cluster reserve for this business?

Calculation:

  • 100 million 1KB messages per day, each kept in two copies
    100,000,000 * 1KB * 2 / 1000 / 1000 = 200GB

  • Besides message data, a Kafka cluster generally stores other types of data, such as index data
    so reserve another 10% of disk space, for a total of 220GB per day

  • To retain two weeks of data, the overall capacity is
    220GB * 14, about 3TB
  • Kafka supports data compression; assuming a compression ratio of 0.75
    the storage space to plan for is 0.75 * 3 = 2.25TB

In short, when planning disk capacity you need to consider the following factors:

  • Number of new messages per day
  • Message retention time
  • Average message size
  • Number of copies kept
  • Whether compression is enabled
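
The five factors above map naturally onto a small function. The sketch below is only an illustration: the function name, parameter names, and the 10% overhead default are assumptions taken from the worked example, and it uses the same decimal units (1GB = 1000 * 1000 KB) as the calculation.

```python
def kafka_disk_capacity_gb(
    messages_per_day: int,
    avg_message_kb: float,
    copies: int,
    retention_days: int,
    overhead_ratio: float = 0.10,    # extra space for index data and the like (assumed 10%)
    compression_ratio: float = 1.0,  # 1.0 means compression is disabled
) -> float:
    """Estimate the disk space, in GB, a Kafka cluster should reserve."""
    # Raw message volume per day, including every stored copy.
    daily_gb = messages_per_day * avg_message_kb * copies / 1000 / 1000
    # Add headroom for non-message data such as indexes.
    daily_gb_with_overhead = daily_gb * (1 + overhead_ratio)
    # Scale by the retention period, then apply the compression ratio.
    return daily_gb_with_overhead * retention_days * compression_ratio


estimate_gb = kafka_disk_capacity_gb(
    messages_per_day=100_000_000,
    avg_message_kb=1,
    copies=2,
    retention_days=14,
    compression_ratio=0.75,
)
print(f"{estimate_gb:.0f} GB")
```

With the example's inputs this gives about 2310GB, slightly above the 2.25TB in the text, because the text rounds 220GB * 14 = 3080GB down to 3TB before applying compression.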

3 Bandwidth

For a framework like Kafka that moves large amounts of data over the network, bandwidth easily becomes the bottleneck.
Ordinary Ethernet mainly comes in two bandwidths: 1Gbps (gigabit) and 10Gbps (10-gigabit). Gigabit networks in particular are standard in most companies,
so a gigabit network is used below to illustrate bandwidth planning.

What you really need to plan is the number of Kafka servers required.
Assume the data center runs gigabit Ethernet, i.e. 1Gbps, and there is a business whose goal or SLA is to process 1TB of data within 1 hour.
So the question is: how many Kafka servers does this business need?

Calculation:

The bandwidth is 1Gbps, i.e. 1Gb of data can be processed per second.
Assume each Kafka server runs on a dedicated machine, i.e. no other services are co-located with Kafka.
Usually you can only assume Kafka will use 70% of the bandwidth, because some must always be left for other applications or processes. Beyond the 70% threshold network packet loss becomes possible, so 70% is a reasonable setting; a single Kafka server can therefore use at most about 700Mbps of bandwidth.

But that is only the maximum bandwidth it can use. You cannot let a Kafka server routinely consume that much, so you usually set aside a further 2/3 of it, namely
per-server bandwidth = 700Mbps / 3 ≈ 233Mbps, rounded here to 240Mbps
The 2/3 here is actually quite conservative; you can lower it as appropriate based on how your machines are used.

With 240Mbps per server, you can calculate the number of servers needed to process 1TB of data within 1 hour.
This goal requires about 2330Mb of data per second (1TB = 1024 * 1024 * 8 Mb, spread over 3600 seconds); divided by 240, that is roughly 10 servers.
If each message also needs two additional copies, the total number of servers must be multiplied by 3, i.e. 30 servers.
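
The same reasoning can be written as a short helper. This is a sketch under the assumptions stated in the text (70% usable bandwidth, one third of that for routine use, 1TB counted as 1024 * 1024 MB); the function and parameter names are hypothetical.

```python
import math


def kafka_servers_needed(
    data_tb: float,
    window_seconds: int,
    link_mbps: float = 1000,        # gigabit Ethernet
    usable_fraction: float = 0.7,   # leave 30% for other processes
    routine_fraction: float = 1 / 3,  # keep 2/3 of the usable share in reserve
    extra_copies: int = 0,          # additional copies of each message
) -> int:
    """Estimate how many Kafka servers a throughput target requires."""
    # Convert the target volume to megabits and spread it over the time window.
    required_mbps = data_tb * 1024 * 1024 * 8 / window_seconds
    # Bandwidth a single server is allowed to use routinely.
    per_server_mbps = link_mbps * usable_fraction * routine_fraction
    base_servers = math.ceil(required_mbps / per_server_mbps)
    # Every extra copy multiplies the traffic, and so the server count.
    return base_servers * (1 + extra_copies)


print(kafka_servers_needed(1, 3600))                   # without extra copies
print(kafka_servers_needed(1, 3600, extra_copies=2))   # with two extra copies
```

For 1TB in one hour this yields 10 servers, and 30 once two extra copies are included, matching the figures above even though the helper does not round the per-server budget up to 240Mbps.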

Summary

Rather than blindly launching a Kafka environment and laboriously adjusting it afterwards, think through the cluster environment your actual business scenario needs at the outset. When weighing a deployment plan, consider all the factors together instead of evaluating a single dimension.

Origin www.cnblogs.com/JavaEdge/p/12071160.html