Good Programmer Big Data Training Shares Several Important Questions About Kafka

  The Good Programmer big data training course shares several important questions about Kafka:

  1. The concept of a segment

  A topic has multiple partitions, and each partition has multiple segments. The segment size can be set in the Kafka configuration file (log.segment.bytes), so all segments are capped at the same size. Each segment consists of index files and the corresponding data file.
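
  As a rough illustration of how segments divide a partition, the sketch below finds the segment that holds a given offset. Kafka names segment files after the segment's base offset, zero-padded to 20 digits; the lookup function, its name, and the sample base offsets are assumptions made for this example, not Kafka's own code.

```python
from bisect import bisect_right


def segment_file_names(base_offset):
    """Return the (data file, index file) names for a segment's base offset."""
    stem = f"{base_offset:020d}"  # base offset, zero-padded to 20 digits
    return f"{stem}.log", f"{stem}.index"


def find_segment(base_offsets, offset):
    """Return the base offset of the segment that holds `offset`.

    `base_offsets` must be sorted ascending. Each segment covers the offsets
    from its own base offset up to, but not including, the next base offset.
    """
    i = bisect_right(base_offsets, offset) - 1
    if i < 0:
        raise ValueError("offset is older than the first segment")
    return base_offsets[i]


# Hypothetical partition with three segments.
bases = [0, 1000, 2500]
base = find_segment(bases, 1234)
print(segment_file_names(base))
```

  Because the base offsets are sorted, the lookup is a binary search, which is one reason the base offset makes a convenient file name.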

  2. What is the data storage mechanism? (Why are writes fast?)

  First, when the broker receives data, it writes the data into the operating system's (Linux) page cache.

  The page cache uses as much free memory as it can. Kafka uses sendfile (zero-copy) to reduce copying between the operating system and the application as much as possible, and it writes data sequentially, so write speeds can reach up to 600 MB/s.
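
  The sequential-write part can be sketched as an append-only file. This is a minimal sketch, not Kafka's on-disk format: the 4-byte length prefix, the function names, and the file name are all assumptions. The code never calls fsync, so the writes land in the page cache and the operating system decides when to flush them to disk. The sendfile part is not shown.

```python
import os
import tempfile


def append_records(path, records):
    """Append length-prefixed records to the end of the file.

    Returns the byte position at which each record starts.
    """
    positions = []
    with open(path, "ab") as f:
        pos = f.seek(0, os.SEEK_END)  # always continue at the current end
        for payload in records:
            positions.append(pos)
            frame = len(payload).to_bytes(4, "big") + payload
            f.write(frame)  # no fsync: the OS flushes the page cache later
            pos += len(frame)
    return positions


def read_record(path, position):
    """Read back one record that starts at the given byte position."""
    with open(path, "rb") as f:
        f.seek(position)
        size = int.from_bytes(f.read(4), "big")
        return f.read(size)


with tempfile.TemporaryDirectory() as d:
    log_path = os.path.join(d, "demo.log")  # hypothetical file name
    first = append_records(log_path, [b"a", b"bb"])
    second = append_records(log_path, [b"ccc"])
    last = read_record(log_path, second[0])
```

  Existing bytes are never rewritten, so the disk only ever sees writes at the end of the file.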

  3. How is consumer load balancing handled?

  When the number of consumers in a consumer group changes, Kafka triggers a rebalance. It first obtains the starting partition number for each consumer, then calculates how many partitions that consumer should consume, and finally takes the hashcode of the starting partition number modulo the number of partitions.
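
  To make the "starting partition number" and "number of partitions" concrete, here is a minimal sketch of a range-style assignment, modeled loosely on the idea behind Kafka's RangeAssignor. It does not reproduce the hashcode step described above, and the function name and consumer ids are assumptions.

```python
def range_assign(num_partitions, consumers):
    """Give each consumer a contiguous range of partition numbers.

    The first `extra` consumers (in sorted order) receive one more
    partition when the partitions do not divide evenly.
    """
    members = sorted(consumers)
    per_consumer, extra = divmod(num_partitions, len(members))
    assignment = {}
    for i, member in enumerate(members):
        start = i * per_consumer + min(i, extra)        # starting partition
        count = per_consumer + (1 if i < extra else 0)  # how many to consume
        assignment[member] = list(range(start, start + count))
    return assignment


# Three consumers share seven partitions, then one consumer leaves.
before = range_assign(7, ["c1", "c2", "c3"])
after = range_assign(7, ["c1", "c2"])
print(before)
print(after)
```

  Running the function again with the new member list stands in for the rebalance: every partition is reassigned across the consumers that remain.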

1. Data distribution strategy

  By default, Kafka uses its own partitioner (DefaultPartitioner) to choose a partition. You can also write a custom partitioner, which must implement the Partitioner trait (an interface in the Java client) and its partition method.
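
  The idea behind a partitioner can be sketched in a few lines. This is an illustration only: Kafka's default partitioner hashes the key bytes with murmur2, while this sketch substitutes CRC32 from the standard library, and the round-robin handling of keyless records is an assumption (the real behavior differs between Kafka versions). The class and method names are hypothetical.

```python
import zlib
from itertools import count


class SketchPartitioner:
    """Toy partitioner: hash the key, or rotate when there is no key."""

    def __init__(self):
        self._counter = count()

    def partition(self, key, num_partitions):
        if key is None:
            # No key: spread records across partitions in turn.
            return next(self._counter) % num_partitions
        # Same key bytes always hash to the same partition.
        return zlib.crc32(key) % num_partitions


partitioner = SketchPartitioner()
keyed = [partitioner.partition(b"user-42", 3) for _ in range(3)]
keyless = [partitioner.partition(None, 3) for _ in range(4)]
```

  A custom partitioner would replace the body of partition with its own rule, for example routing by a field inside the key.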

2. How does Kafka ensure that data is not lost?

  After receiving data, Kafka stores it according to the number of replicas specified when the topic was created. Kafka synchronizes the replicas itself, and this multi-replica mechanism keeps the data safe.
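
  The effect of keeping several replicas can be shown with a toy model. This is not how Kafka's replication protocol works internally (there is no in-sync replica tracking or acknowledgement here); it only illustrates that a record copied to several replicas survives the loss of one. All names are hypothetical.

```python
class ReplicatedPartition:
    """Toy partition that copies every record to all live replicas."""

    def __init__(self, replication_factor):
        self.replicas = [[] for _ in range(replication_factor)]
        self.alive = [True] * replication_factor
        self.leader = 0

    def append(self, record):
        """Store the record on every live replica; return its offset."""
        for i, log in enumerate(self.replicas):
            if self.alive[i]:
                log.append(record)
        return len(self.replicas[self.leader]) - 1

    def fail_leader(self):
        """Mark the leader as dead and promote the next live replica."""
        self.alive[self.leader] = False
        self.leader = next(i for i, ok in enumerate(self.alive) if ok)

    def read_all(self):
        """Read the full log from the current leader."""
        return list(self.replicas[self.leader])


partition = ReplicatedPartition(replication_factor=3)
partition.append("m1")
partition.append("m2")
partition.fail_leader()  # replica 0 is lost
partition.append("m3")
surviving = partition.read_all()
```

  With a replication factor of 1 the same failure would leave no replica to promote, which is why the replica count chosen at topic creation matters.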

3. Can Kafka guarantee that the data in a topic is globally ordered?

  Kafka guarantees ordering within a partition; there is no ordering across partitions.

  How can global ordering be achieved? The simplest way is to create the topic with the number of partitions set to 1.
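
  A small simulation shows the difference. It assumes messages are spread round-robin across partitions and that a reader drains one partition after another; both choices, and the function names, are assumptions made to keep the example deterministic.

```python
def send_round_robin(messages, num_partitions):
    """Distribute messages across partitions in turn."""
    partitions = [[] for _ in range(num_partitions)]
    for i, message in enumerate(messages):
        partitions[i % num_partitions].append(message)
    return partitions


def read_partition_by_partition(partitions):
    """Drain each partition fully before moving on to the next one."""
    return [message for partition in partitions for message in partition]


messages = [1, 2, 3, 4, 5, 6]
three_partitions = read_partition_by_partition(send_round_robin(messages, 3))
one_partition = read_partition_by_partition(send_round_robin(messages, 1))
print(three_partitions)  # ordered inside each partition, not overall
print(one_partition)     # a single partition keeps the send order
```

  The trade-off is that a single partition can only be read by one consumer in a group, so global ordering gives up parallel consumption.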

4. What if you want to re-consume data that has already been consumed?

  1. Use a different consumer group.

  2. With some configuration, data produced online can be synchronized to a mirror, and a dedicated cluster can then process the large volume of data.
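
  The first option works because consumed positions are tracked per consumer group. The toy model below keeps one committed offset per group id. It assumes a new group starts from the earliest record, which in Kafka corresponds to setting auto.offset.reset to earliest (the default is latest). The class, method, and group names are hypothetical.

```python
class ToyTopic:
    """Single-partition topic that tracks a committed offset per group."""

    def __init__(self, records):
        self.records = list(records)
        self.committed = {}  # group id -> next offset to read

    def poll(self, group_id, max_records=10):
        """Return the next batch for this group and commit its new offset."""
        start = self.committed.get(group_id, 0)  # unknown group: start at 0
        batch = self.records[start:start + max_records]
        self.committed[group_id] = start + len(batch)
        return batch


topic = ToyTopic(["a", "b", "c"])
first_read = topic.poll("group-a")
second_read = topic.poll("group-a")  # nothing new for the same group
replay = topic.poll("group-b")       # a different group reads from the start
```

  The records themselves are untouched by consumption; only the per-group offset moves, so a new group id sees the whole log again.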



Origin blog.51cto.com/14479068/2431090