Ceph OSD knowledge

OSD (Object Storage Device):

When Ceph stores data, it first splits the data into a number of objects (each object has an object ID; the size is configurable and defaults to 4MB). The object is the smallest storage unit in Ceph. Because the number of objects is huge, the PG (Placement Group) was introduced to effectively shrink the object-to-OSD index table, reduce metadata complexity, and make reads and writes more flexible. A PG is used to manage objects: each object is mapped to a PG by a hash, and one PG can contain many objects. PGs are in turn mapped to OSDs by the CRUSH algorithm; with three replicas, each PG is mapped to three OSDs, which guarantees data redundancy. The data write flow is illustrated below:

[Figure: data write flow (object → PG → OSD)]
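As a quick illustration of this mapping on a live cluster (a sketch; the pool name rbd and the object name obj1 are only placeholders), the object → PG → OSD mapping of any object can be queried with:

ceph osd map rbd obj1

The output shows the PG the object hashes to and the up/acting set of OSDs that CRUSH maps that PG onto (three OSDs when the pool has three replicas).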
The vast majority of the work of storing user data is carried out by the OSD daemon processes. A cluster generally contains many OSDs, and clients perform their I/O operations directly against the OSDs, without intervention from the Ceph monitors. The OSDs in a Ceph cluster are highly autonomous: data replication, data recovery and data migration are handled by the OSDs themselves without a central controller; OSDs monitor one another, so failures are captured promptly and reported to the monitors; and by exchanging and gossiping the OSDMap among clients and OSDs, Ceph achieves fast failure detection and recovery and provides uninterrupted storage service to the greatest extent possible.
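The OSDMap that this gossip keeps current can be inspected directly (a sketch; the output file path is arbitrary):

ceph osd stat                      (number of OSDs, how many are up/in, current osdmap epoch)
ceph osd getmap -o /tmp/osdmap     (export the current binary osdmap)
osdmaptool /tmp/osdmap --print     (decode and print it)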

Network communication

       In the RADOS design, the Ceph cluster network as a whole is divided into two independent planes, the public network plane and the cluster network plane. The public network plane carries communication between clients and the cluster; since clients must first obtain the OSDMap from a monitor, the monitors must also be exposed on the public network. The cluster network is used for communication between OSDs. In principle, OSD-to-OSD traffic could also reuse the public network, so why does the design separate the two planes?

       First, the traffic carried by the two is not equal. Taking 3 replicas as an example, for every client write the OSD handling it must copy the same data to the two other OSDs holding the replicas, so if the traffic from the client to the OSD has a ratio of 1, the traffic from that OSD to the other two replicas has a ratio of 2. On top of that, the cluster internally runs data recovery and data rebalancing tasks in some situations, so in theory the load on the cluster network is much higher than on the public network. Second, the nature of the traffic carried by the two is different: if the two were forced onto one network, background tasks could consume a large amount of bandwidth and the higher-priority client traffic would not be guaranteed the bandwidth it needs, whereas physically isolating the two networks simply prevents the traffic they carry from interfering with each other.
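A minimal sketch of how the two planes are declared in ceph.conf (the subnets are made-up examples):

[global]
public network = 192.168.100.0/24
cluster network = 10.10.100.0/24

With this in place, monitor and client traffic stays on the public network, while the OSDs carry replication, recovery and back-side heartbeat traffic over the cluster network.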

OSD recovery process:

After a new OSD comes online, it first contacts a monitor according to its configuration. The monitor adds it to the cluster map, sets its state to up and out, and then sends the latest version of the cluster map to the new OSD. After receiving the cluster map, the new OSD calculates which PGs it should carry, as well as the other OSDs carrying the same PGs. The new OSD then gets in touch with those OSDs. If a PG is currently in a degraded state (i.e. the number of OSDs carrying the PG is lower than normal, for example only one or two when there should be three, usually because an OSD has failed), the other OSDs will copy all objects and metadata in that PG to the new OSD. Once the data has been copied, the new OSD's state is set to up and in.
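While this copying is in progress the affected PGs show up as degraded or recovering; a minimal way to watch the process (a sketch, not tied to any particular release):

ceph -s                (cluster summary, including the count of degraded objects and the recovery rate)
ceph health detail     (lists the PGs that are currently degraded or undersized)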

OSD rebalance:

If the PG was healthy beforehand, the new OSD will instead replace one of the existing OSDs in the PG (the PG will re-elect its primary OSD) and take over its data. Once the data has been copied, the new OSD's state is set to up and in, and the replaced OSD exits the PG (its state usually remains up and in, because it still carries other PGs). The content of the cluster map is updated accordingly.
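During planned maintenance this rebalancing is often suppressed temporarily, so that an OSD that is briefly absent is not replaced and backfilled; a minimal sketch:

ceph osd set noout       (while set, down OSDs are not marked out and their data is not re-homed)
ceph osd unset noout     (restore normal behaviour once the maintenance is finished)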

Failure handling:

If an OSD finds that another OSD with which it shares a PG cannot be reached, it reports the situation to a monitor. Likewise, if an OSD daemon finds that it is itself working in an abnormal state, it proactively reports that to a monitor as well. In either case, the monitor sets the state of the affected OSD to down and in. If, after a configured period of time, the OSD still has not returned to normal, its state is set to down and out; conversely, if the OSD recovers, its state is set back to up and in. After each of these state changes, the monitor updates the cluster map and diffuses it.
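The up/down and in/out state of each OSD, as recorded in the cluster map, can be checked with (a sketch):

ceph osd tree      (shows the up/down status of every OSD in the CRUSH hierarchy)
ceph osd dump      (prints the full osdmap, including the in/out state and weight of each OSD)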

1. Heartbeats between OSDs:

     Every osd_heartbeat_interval (default 2 seconds), OSDs exchange heartbeat packets with each other; the heartbeats are sent over both the public network and the cluster network (heartbeat packets are unicast). If an OSD has not received a heartbeat from a peer OSD within osd_heartbeat_grace (default 7 seconds), it reports to the MON that the peer's heartbeat has timed out. When the MON receives reports about the same OSD from OSDs in two different failure domains, it marks that OSD DOWN. [mon_osd_min_down_reporters (default 2); reports from the same failure domain count as only one vote]
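The heartbeat values actually in effect on an OSD can be confirmed through its admin socket (a sketch; osd.0 is a placeholder):

ceph daemon osd.0 config show | grep osd_heartbeat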

2. OSDs report their own status to the MON:

      Each OSD reports its own status to the MON at least every osd_beacon_report_interval (default 100 seconds). If an OSD has not reported its status within mon_osd_report_timeout (default 300 seconds), the MON considers it down.
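Both settings can be checked the same way (a sketch; the daemon names are placeholders):

ceph daemon osd.0 config show | grep osd_beacon_report_interval
ceph daemon mon.$(hostname -s) config show | grep mon_osd_report_timeout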

OSD data is stored under /var/lib/ceph/osd/.

OSD log files are at /var/log/ceph/ceph-osd.*.

The OSD near-full watermark is "mon_osd_nearfull_ratio": "0.850000"; a warning is raised when an OSD approaches 85% full.

"mon_osd_full_ratio": "0.950000"; above 95% full, writes return an error.

 

OSD Offline Analysis

If the OSD went offline because of environmental factors, this usually manifests as a heartbeat timeout. The anomalies can be divided into network anomalies, disk anomalies, and OSD suicide caused by excessive disk pressure. Most of the relevant information can be seen in ceph.log on the monitor nodes; some further information can be obtained from the log of the OSD that went offline.
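A first pass is usually to search the monitor's cluster log for mentions of the OSD around the time it went offline (a sketch; $id is the OSD number):

grep "osd.$id" /var/log/ceph/ceph.log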

grep -rnE "FAILED assert|suicide|Seg|link|dummy io|osd_max_markdown_count|returned: -5" ceph-osd.$id.log

If any of these keywords appear, they describe a process crash:

a) FAILED assert keyword: generally some condition was not met and an assert fired, i.e. a logic bug was triggered

b) suicide keyword: generally an internal thread timed out beyond the configured threshold (default 150s), causing the OSD process to exit voluntarily

   The water level can be checked with grep "water level" ceph-osd.$id.log

c) Seg keyword: the program crashed because of an illegal memory access (segmentation fault)

d) link keyword: the NIC link went down; confirm with grep "NIC Link is Down" /var/log/messages

e) dummy io keyword: indicates that the storage layer is busy or slow, i.e. a 4k write operation did not complete within 60s; in this case, check the utilization of that OSD's data disk during the fault period

f) osd_max_markdown_count keyword: the OSD was marked down too many times, causing the OSD process to exit for a period of time; this usually means the OSD is being frequently marked down because of network problems

 

g) returned: -5 keyword: generally a disk hardware problem; it can be confirmed with grep "critical medium error" /var/log/messages
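For the disk-related cases above, the health of the drive itself can also be checked from the OS (a sketch; /dev/sdX is a placeholder for the OSD's data disk, and smartmontools is assumed to be installed):

smartctl -H /dev/sdX                          (overall SMART health verdict)
dmesg | grep -iE "medium error|i/o error"     (kernel-reported disk errors)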

Network problems

When a network anomaly causes the NIC to go down, all OSDs on that server will detect it and go offline. In version 3.2 and above, ceph.log will therefore show osd.x "mark down" messages, or "refuse" messages reported against it by the other OSDs; in this case the mon sets the OSD offline directly.

grep -rnE "stuck, cost|send msg delay|recv stuck" ceph-osd.$id.log

If any of these keywords appear, there is a high probability that the problem is network-related:

a) stuck, cost keyword: the ping message packet spent too much time on the network (by default, times larger than 1s are printed)

b) send msg delay keyword: there was a delay sending the ping message packet, which generally indicates that network transmission is slow

c) recv stuck keyword: receiving the packet from the network was slow

If the above phenomena occur, it indicates a network problem between OSDs; the packet-loss rate can be checked with ping -f $ip

Also check /var/log/messages for any NIC Down events

If an OSD detects that a NIC is down, there will be related errors, in which hb_front_server_messager represents the public network and hb_back_server_messager represents the cluster network; link 1 means the link is normal, and link 0 indicates a problem.
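The physical link state of the two NICs can also be confirmed from the OS side (a sketch; eth0 is a placeholder interface name):

ethtool eth0 | grep "Link detected"    (reports yes/no for the link)
ip -s link show eth0                   (per-interface error and drop counters)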

 
