Notes on a Case of Ceph's Abnormal OSD Heartbeat Mechanism

Phenomenon: While deploying Ceph we ran into the following situation. In a large cluster, when a node's network failed or an OSD became abnormal, the mon was slow to mark the abnormal OSD down: only after waiting 900 s did the mon notice that the OSD had not been updating the pgmap, mark it down, and spread the updated osdmap. But during those 900 s, client IO was still being issued to the abnormal OSD, causing IO timeouts and ultimately affecting the business.

Analysis:
In the mon log we could see that the other OSDs holding heartbeat connections with the abnormal OSD did report the anomaly to the mon, yet the mon still did not mark that OSD down within a short time. Only after going through a number of books and online material on the mechanism did we find the cause.
First, let us focus on several related OSD configuration items:
(1) osd_heartbeat_min_peers: 10
(2) mon_osd_min_down_reporters: 2
(3) mon_osd_min_down_reporters_ratio: 0.5
The values above can be viewed by running ceph daemon osd.x config show on a cluster node (where x is the id of an OSD in your cluster).
So what is the cause of the problem?
When the cluster is deployed, each OSD randomly selects 10 peer OSDs to establish heartbeats with, but Ceph's mechanism does not guarantee that these 10 OSDs are all spread across different nodes. Therefore, when an OSD fails, there is a real probability that the reporters reporting to the mon do not satisfy the ratio of 0.5 — that is, the number of reporting hosts does not exceed half the number of storage hosts in the cluster — so the abnormal OSD cannot be marked down quickly through the OSD-to-OSD keep-alive heartbeat mechanism. It is only marked down after 900 s, when the mon notices that this OSD's pgmap has not been updated (a separate mechanism, which can be seen as the final insurance behind the heartbeat keep-alive), and the change is then spread through the osdmap. For the business on top, these 900 s are often unacceptable.
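To get a feel for how often the ratio check fails, here is a small simulation — my own sketch under an assumed layout (20 hosts with 5 OSDs each), not Ceph's actual code. Each OSD picks 10 random peers with no host-spreading guarantee, and we count how often those peers span fewer than 0.5 * 20 = 10 distinct hosts, i.e. how often even a unanimous report from all peers could not satisfy the ratio:

```python
import random

random.seed(42)  # deterministic for illustration

# Hypothetical cluster layout: 20 hosts, 5 OSDs per host.
OSDS = {osd_id: f"host{osd_id // 5}" for osd_id in range(100)}
TOTAL_HOSTS = len(set(OSDS.values()))

def pick_peers(osd_id, min_peers=10):
    """Random peer selection with no guarantee of spreading across hosts."""
    candidates = [o for o in OSDS if o != osd_id]
    return random.sample(candidates, min_peers)

def ratio_check_fails(osd_id, ratio=0.5):
    """True if the peers of osd_id span fewer hosts than ratio * TOTAL_HOSTS,
    so even if every peer reported the failure, mon would not mark it down."""
    peer_hosts = {OSDS[p] for p in pick_peers(osd_id)}
    return len(peer_hosts) < ratio * TOTAL_HOSTS

trials = 1000
failures = sum(ratio_check_fails(0) for _ in range(trials))
print(f"ratio check failed in {failures}/{trials} trials")
```

In this assumed layout the check can only pass when all 10 random peers land on 10 distinct hosts, so it fails most of the time — which matches the observed behavior on large clusters.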
However, this phenomenon rarely occurs on small clusters. Take a 3-node Ceph cluster as an example:
If the number of peers an OSD has established with OSDs on other nodes is less than osd_heartbeat_min_peers, the OSD will keep selecting nearby OSDs to establish heartbeat connections with (even OSDs on its own node). In a 3-node cluster, each OSD therefore ends up heartbeating with OSDs on every host, so the reporter ratio is easily satisfied.
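As a sketch of why the small-cluster case works out (again my own simplification, not Ceph's actual selection code), with 3 hosts of 3 OSDs each there are only 8 candidates, so the fallback fills the peer set with everything available and the peer hosts cover the whole cluster:

```python
def fill_peers(osd_id, osds, min_peers=10):
    """Simplified peer selection: prefer OSDs on other hosts, then fall
    back to OSDs on the same host until min_peers is reached (or we run
    out of candidates). `osds` maps osd id -> host name."""
    my_host = osds[osd_id]
    other_host = [o for o in osds if o != osd_id and osds[o] != my_host]
    same_host = [o for o in osds if o != osd_id and osds[o] == my_host]
    peers = other_host[:min_peers]
    peers += same_host[:max(0, min_peers - len(peers))]
    return peers

# 3 hosts with 3 OSDs each: only 8 candidates exist, so every OSD
# heartbeats with every other OSD and all 3 hosts are covered.
osds3 = {i: f"host{i // 3}" for i in range(9)}
peers = fill_peers(0, osds3)
print(sorted({osds3[p] for p in peers}))  # -> ['host0', 'host1', 'host2']
```

With every host covered, any single-OSD failure is reported from enough distinct hosts to clear the 0.5 ratio, which is why the 900 s fallback is rarely hit on small clusters.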
Regarding the OSD heartbeat mechanism, several requirements have been summarized online:
(1) Timeliness: through the established heartbeats, an OSD can detect anomalies of other OSDs within seconds and report them to the monitor, and the monitor can mark the abnormal OSD down and take it offline within minutes.
(2) Appropriate pressure: more peers is not necessarily better. In practical deployment scenarios, OSD heartbeats and monitor messages often share the public and cluster networks, so establishing too many heartbeat connections severely affects system performance. If the mon maintained a separate heartbeat with every OSD, the central node would carry all the load; by keeping heartbeats between the OSDs themselves, this pressure is distributed across the OSDs and the pressure on the central mon node is greatly reduced.
(3) Tolerance of network jitter: after collecting failure reports about an OSD, the mon waits for several conditions to be met rather than rushing to mark the OSD down: the time since the OSD was last seen must exceed a threshold determined by the fixed osd_heartbeat_grace plus historical network conditions; the number of reporting hosts must satisfy min_reporters and min_reporters_ratio; the failure reports must not have been cancelled by their sources within a certain time; and so on.
(4) Diffusion mechanism: there are two possible implementations for spreading the new osdmap — the mon actively pushes it, or clients and OSDs lazily fetch it themselves. To let clients perceive OSD anomalies and other changes in time, the former generally works better.
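The mark-down conditions in (3) can be sketched roughly as follows. This is a simplification under my own assumptions — the real Ceph grace calculation uses a decaying estimate of heartbeat intervals, and the parameter names below only mirror, not reproduce, the real options:

```python
from statistics import mean, pstdev

def effective_grace(interval_history, osd_heartbeat_grace=20.0):
    """Stretch the fixed grace period using historical heartbeat intervals,
    so a normally-jittery network does not trigger premature mark-downs."""
    if not interval_history:
        return osd_heartbeat_grace
    return osd_heartbeat_grace + mean(interval_history) + 2 * pstdev(interval_history)

def should_mark_down(seconds_since_last_beat, interval_history,
                     reporting_hosts, total_hosts,
                     min_down_reporters=2, min_down_reporters_ratio=0.5):
    """All conditions must hold: the grace period is exceeded, there are
    enough reporters, and enough distinct hosts report relative to the
    total number of storage hosts in the cluster."""
    if seconds_since_last_beat < effective_grace(interval_history):
        return False
    if len(reporting_hosts) < min_down_reporters:
        return False
    return len(reporting_hosts) >= min_down_reporters_ratio * total_hosts

# 3 of 10 hosts report: ratio 0.3 < 0.5, so the OSD is not marked down
# even though the grace period has long expired.
print(should_mark_down(60.0, [1.0, 1.2, 0.9], {"host1", "host2", "host3"}, 10))
```

Note how the last check reproduces the problem from this case: with the grace period long expired and more than min_down_reporters reporting, the OSD still survives because too few distinct hosts are represented among the reporters.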

Summary and implications:
Improvements can be made in two directions.
(1) In the existing mechanism, taking 0.5 of the total number of storage nodes in the cluster as the min_reporters_ratio baseline is clearly unreasonable. The baseline should instead be the number of hosts with which this OSD has actually established heartbeats, i.e. judge against 0.5 * the number of heartbeat-peer hosts.
(2) In some scenarios we define our own logical regions for data placement using a hierarchical crush structure — for example, a Ceph cluster divided into several logical regions, where a replica or shard of a piece of data exists only within one region. In that case the scope within which an OSD establishes heartbeat connections should be correspondingly streamlined and made precise.
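Both directions can be sketched together as hypothetical code — this illustrates the proposal, not any existing Ceph option:

```python
def reporters_sufficient(reporting_hosts, heartbeat_peer_hosts, ratio=0.5):
    """Direction (1): judge the ratio against the hosts this OSD actually
    heartbeats with, not against every storage host in the cluster."""
    return len(reporting_hosts) >= ratio * len(heartbeat_peer_hosts)

def region_candidates(osd_id, osd_region, min_peers=10):
    """Direction (2): when crush divides the cluster into logical regions
    and replicas never cross regions, restrict heartbeat peer candidates
    to the OSD's own region. `osd_region` maps osd id -> region name."""
    my_region = osd_region[osd_id]
    return [o for o in osd_region
            if o != osd_id and osd_region[o] == my_region][:min_peers]

# An OSD that heartbeats with 6 hosts; 4 of them report the failure,
# which now clears the ratio regardless of total cluster size.
print(reporters_sufficient({"h1", "h2", "h3", "h4"},
                           {"h1", "h2", "h3", "h4", "h5", "h6"}))
```

With direction (1), the denominator shrinks to what the OSD can actually observe, so a large cluster behaves like the 3-node case; direction (2) additionally keeps heartbeat traffic within the failure domain that actually shares data.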

The current implementation of Ceph's OSD heartbeat mechanism still has quite a few problems. Whether it will eventually be replaced by a new mechanism, we will have to wait and see.

Origin blog.51cto.com/12374206/2417781