What does the following message mean: “nfs4_reclaim_open_state: Lock reclaim failed!”
Environment
- Red Hat Enterprise Linux (RHEL) 6
- all kernels
- NFSv4
- NFS client
Issue
- We’re seeing the following message in /var/log/messages. What does it mean?
Jun 11 19:47:02 foobar kernel: nfs4_reclaim_open_state: Lock reclaim failed!
Resolution
- This is a somewhat generic message generated by the NFSv4 client code. It means the following:
- An NFS operation completed with an error status that triggered the NFS state manager thread to begin recovering state.
- In the process of recovering state, the state manager attempted to reclaim locks via nfs4_reclaim_locks() and this operation failed.
- Often this message occurs because of a lease expiration caused by a networking issue between the NFS client and NFS server. In this case the error status seen in an NFS operation will be NFS4ERR_EXPIRED.
- Unfortunately, the message is not specific enough to determine what happened. To track down the cause, either the NFS traffic leading up to the message must be captured, or the kernel must be instrumented to trace the logic (for example, with SystemTap). See the Diagnostic Steps section for more information.
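As a first triage step, it can help to pull the reclaim-failure messages and their surrounding context out of the log before setting up a capture. The following is a minimal sketch; the function name is illustrative and the log path is the usual RHEL 6 default:

```shell
#!/bin/sh
# Print each "Lock reclaim failed" occurrence with 5 lines of context
# before and after, so nearby NFS errors are visible.
show_reclaim_failures() {
    # $1 = path to the log file to search
    grep -B5 -A5 "Lock reclaim failed" "$1"
}

# Usage: show_reclaim_failures /var/log/messages
```

The surrounding context often shows related NFS client messages (for example, lease expiry or state recovery notices) that narrow down which of the scenarios below applies.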
Examples
- Example1: Seeing “nfs4_reclaim_open_state: Lock reclaim failed” while using an active/passive multi-node failover setup controlled by an exclusive lock file on NFSv4. This message may occur when the active node that holds the lock experiences a network partition, and it may or may not indicate a more severe problem.
- Example2: A delayed NFSv4 RENEW reply leads to an expired lease and triggers state recovery. Something delayed the RENEW reply on the server side, leading to the expired lease and client state recovery. This is an NFS server issue.
- Another example: Race between kernel readahead and NFSv4 LOCKU / RELEASE_LOCKOWNER - https://access.redhat.com/solutions/382283. This issue was due to an NFS client race between kernel readahead and an unlock (LOCKU). The client did not check whether READs issued by kernel readahead were still in progress before sending a LOCKU and a subsequent RELEASE_LOCKOWNER. This resulted in NFSv4 READs returning NFS4ERR_BAD_STATEID, which triggered the NFS state manager to begin recovering state. In the process of recovering state, it attempted to recover the lock, but this failed as well and produced the “Lock reclaim failed” message.
- Example3: Using NFSv4 delegations and seeing “Lock reclaim failed” messages on kernels before 2.6.32-573.el6 - https://access.redhat.com/solutions/1118233. When the NFS client tries to recover locks, the check that causes this message to be printed is invalid if the lock is covered by a delegation. In that case the message should not be printed and can be safely ignored.
- Example4: NFSv4 flock regression in certain RHEL6 kernels before 2.6.32-358.6.1.el6 - https://access.redhat.com/solutions/318163. This problem occurred when an exclusive flock was obtained on a file opened for read (the lock mode did not match the open mode of the file). Later kernels (including RHEL6) emulate flock via NFSv4 byte-range locks, and NFSv4 byte-range locks require matching the open mode of the file. The client sent a LOCK request and received an error back (10038 == NFS4ERR_OPENMODE). This error triggered NFS state recovery; lock recovery was attempted but failed with the same error, and “Lock reclaim failed” was printed. The error was not handled properly, and because of a problem with a backported patch for a separate issue, the same request was re-sent only to fail in the same manner.
Root Cause
- This message comes from the nfs4_reclaim_open_state() function in fs/nfs/nfs4state.c, shown below, which is called as part of the NFSv4 client’s state recovery. State recovery is triggered when certain operations complete with certain error codes.
fs/nfs/nfs4state.c
1155 static int nfs4_reclaim_open_state(struct nfs4_state_owner *sp, const struct nfs4_state_recovery_ops *ops)
1156 {
1157         struct nfs4_state *state;
1158         struct nfs4_lock_state *lock;
1159         int status = 0;
1160
1161         /* Note: we rely on the sp->so_states list being ordered
1162          * so that we always reclaim open(O_RDWR) and/or open(O_WRITE)
1163          * states first.
1164          * This is needed to ensure that the server won't give us any
1165          * read delegations that we have to return if, say, we are
1166          * recovering after a network partition or a reboot from a
1167          * server that doesn't support a grace period.
1168          */
1169         spin_lock(&sp->so_lock);
1170         write_seqcount_begin(&sp->so_reclaim_seqcount);
1171 restart:
1172         list_for_each_entry(state, &sp->so_states, open_states) {
1173                 if (!test_and_clear_bit(ops->state_flag_bit, &state->flags))
1174                         continue;
1175                 if (state->state == 0)
1176                         continue;
1177                 atomic_inc(&state->count);
1178                 spin_unlock(&sp->so_lock);
1179                 status = ops->recover_open(sp, state);
1180                 if (status >= 0) {
1181                         status = nfs4_reclaim_locks(state, ops);
1182                         if (status >= 0) {
1183                                 spin_lock(&state->state_lock);
1184                                 list_for_each_entry(lock, &state->lock_states, ls_locks) {
1185                                         if (!(lock->ls_flags & NFS_LOCK_INITIALIZED))
1186-->                                             printk("%s: Lock reclaim failed!\n",
1187                                                         __func__);
1188                                 }
1189                                 spin_unlock(&state->state_lock);
1190                                 nfs4_put_open_state(state);
1191                                 spin_lock(&sp->so_lock);
1192                                 goto restart;
1193                         }
1194                 }
Diagnostic Steps
Gathering a tcpdump leading up to “Lock reclaim failed”
Use the tcpdump-watch.sh script attached to Gathering data and logs to troubleshoot NFS issues, modify the ‘match=’ line to include “Lock reclaim failed”, and run the script as described in that article.
When the script detects the “Lock reclaim failed” message, it stops the tcpdump; the tcpdump output file and the /var/log/messages file should then be uploaded to Red Hat for analysis.
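The core of such a watch loop can be sketched as follows. This is an assumed structure for illustration only; the actual tcpdump-watch.sh attached to the article may differ, and the log path, match string handling, and poll interval are examples:

```shell
#!/bin/sh
# Sketch of a log watcher: poll a log file until the match string
# appears, then stop the capture.
match="Lock reclaim failed"

log_has_match() {
    # $1 = log file to check; exits 0 as soon as the string is present
    grep -q "$match" "$1"
}

watch_log() {
    # $1 = log file to watch, $2 = poll interval in seconds
    while ! log_has_match "$1"; do
        sleep "$2"
    done
    # At this point the real script would stop the running tcpdump
    # and preserve the capture file for analysis.
}

# Usage: watch_log /var/log/messages 5
```

Polling with grep is coarse but robust: it catches the message even if it was logged between polls, because the whole file is re-checked each iteration.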