Automatic detection and recovery of NFS connection failures

Business scenario

       Files are used as the interface between applications. To keep the design simple, there is no separate file-transfer module. Instead, files are uploaded and downloaded transparently at the system level through a shared NFS server. During the maintenance phase, however, the NFS share frequently becomes unreachable. Application I/O then fails with errors, which disrupts normal business operations.

 

Cause Analysis

       1> The host running the NFS server was restarted without the client-side maintenance staff knowing. The existing mount point can no longer reach the NFS server and has to be re-mounted.

       2> The NFS client host was restarted and the share was not mounted automatically, or the mount failed, so the application reports errors.

       3> A network failure makes the NFS connection unavailable.

 

Countermeasures

       Use a shell program to check the state of the NFS mounts automatically and re-mount them when necessary.

 

Program implementation

       When I check NFS manually, I use the df command. Because NFS I/O is synchronous, df hangs when the NFS connection has a problem and waits until the connection becomes available again. Whether NFS is healthy can therefore be judged by whether df completes. This has to be done in two processes, though: if a single process ran df and df hung, the whole program would hang with it. That is why two shell programs are used.
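The two-process idea can be reduced to a small shell function. This is a minimal sketch, not the author's script: the name `run_with_deadline` is hypothetical, and it tests for the background process with `kill -0` (which sends no signal and only checks that the process exists) instead of the `ps -p` used in the main program. POSIX sh has no local variables, so `deadline` and `child` stay set after the call.

```shell
#!/bin/sh
# Minimal sketch of the two-process check: start a possibly blocking command
# in the background, wait a fixed time, then see whether it is still running.
# run_with_deadline is a hypothetical helper, not part of checknfs.sh.

# Usage: run_with_deadline SECONDS COMMAND [ARGS...]
# Returns 0 if COMMAND finished within SECONDS, 2 if it is still running.
# The PID of the background process is left in $child.
run_with_deadline() {
    deadline=$1
    shift
    "$@" > /dev/null 2>&1 &
    child=$!
    sleep "$deadline"
    # kill -0 delivers no signal; it only tests whether the process exists
    if kill -0 "$child" 2> /dev/null; then
        return 2
    fi
    return 0
}

# Example: treat a df that is still running after 30 seconds as a hung mount
# run_with_deadline 30 df || echo "df still blocked after 30s"
```

As in the original design, a command that is still running is reported but not killed.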

 

1. Main shell program

       Start the subshell program in the background.

       Sleep for 30 seconds, then use its PID to check whether the subshell process has exited.

        If the child process is still running, df is blocked, which means the NFS connection is abnormal: exit with status 2.

                Note: The subshell process is not killed here. Once the NFS connection recovers, df returns and the child process exits on its own. You can also kill it manually.

        If the child process has exited, run df|grep to check whether each mount point is present. If both are present, exit with status 0.

        If a mount point is missing, run the mount command to mount it.

        If the mount succeeds, exit with status 0.

        If the mount fails, exit with status 1.

  

2. Subshell program

       Write its own PID to a file (checkdfpid.log)

       Run the df command
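Because the main program never kills a blocked checkdf.sh, running the check repeatedly while NFS is down (for example from cron, which the original does not mention) leaves one more hung df behind on every run. One way to avoid this, sketched here as an assumption rather than part of the original design, is to look at the PID recorded in checkdfpid.log before launching a new check and skip the run while that process is still alive. The function name `previous_check_running` is hypothetical. It uses `kill -0` as the liveness test, which assumes the script runs as root (as `mount` normally requires), and a reused PID can occasionally produce a false positive.

```shell
#!/bin/sh
# Sketch: detect whether the checkdf.sh started by a previous run is still
# blocked, so a new run does not start another df that may hang as well.
# previous_check_running is a hypothetical helper, not part of the original.

# Usage: previous_check_running PIDFILE
# Returns 0 if PIDFILE holds the PID of a process that still exists, 1 otherwise.
previous_check_running() {
    [ -s "$1" ] || return 1      # no PID recorded: nothing left over
    oldpid=`cat "$1"`
    [ -n "$oldpid" ] || return 1
    kill -0 "$oldpid" 2> /dev/null
}

# Possible use in checknfs.sh, after "cd $workdir" and before checkdf.sh is
# launched (and before any old checkdfpid.log is removed or overwritten):
# if previous_check_running checkdfpid.log; then
#     echo "checknfs.sh end abnormally(df block)..."
#     exit 2
# fi
```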

 

Main program (checknfs.sh)

 

#!/bin/sh

echo "checknfs.sh start..."

workdir="/home/wk"
mdirdown="/data/download"
mdirup="/data/upload"
remoteip=1.2.3.4
rcdown=0
rcup=0

cd "$workdir" || exit 1
# Remove the PID file of a previous run so a stale PID is never checked
rm -f checkdfpid.log
# Run df in a separate background process: if NFS hangs, only that process blocks
sh ./checkdf.sh &

sleep 30

pid=`cat checkdfpid.log 2>/dev/null`
echo "got checkdf.sh pid is:"$pid
# checkdf.sh has finished only if it recorded its PID and that process is gone
if [ -n "$pid" ] && ! ps -p "$pid" > /dev/null 2>&1;
then
    # df completed, so running df again here should not block
    downcount=`df | grep "$mdirdown" | wc -l`
    upcount=`df | grep "$mdirup" | wc -l`
    if [ "$downcount" -ge 1 ] && [ "$upcount" -ge 1 ];
    then
        echo "checknfs.sh end normally..."
        exit 0
    fi

    if [ "$downcount" -lt 1 ];
    then
        mount "$remoteip:$mdirdown" "$mdirdown"
        rcdown=$?
        if [ $rcdown -ne 0 ];
        then
            echo "(mount fail) -> "$mdirdown
        fi
    fi

    if [ "$upcount" -lt 1 ];
    then
        mount "$remoteip:$mdirup" "$mdirup"
        rcup=$?
        if [ $rcup -ne 0 ];
        then
            echo "(mount fail) -> "$mdirup
        fi
    fi

    if [ $rcdown -eq 0 ] && [ $rcup -eq 0 ];
    then
        echo "checknfs.sh end normally(remounted)..."
        exit 0
    else
        echo "checknfs.sh end abnormally(mount fail)..."
        exit 1
    fi
else
    # checkdf.sh is still running (df is blocked) or never recorded its PID
    echo "checknfs.sh end abnormally(df block)..."
    exit 2
fi
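The `df | grep` test in the main program matches substrings, so a mount at a path such as /data/download2 (a made-up example) would also count as /data/download, and it runs df a second time. A possible alternative, assuming Linux, is to compare the mount-point column of /proc/mounts exactly; reading the kernel's mount table should not require contacting the NFS server. The helper name `is_mounted` is hypothetical.

```shell
#!/bin/sh
# Sketch (Linux only): decide whether a directory is currently a mount point
# by reading /proc/mounts instead of parsing df output.
# is_mounted is a hypothetical helper, not part of checknfs.sh.

# Usage: is_mounted DIR -> returns 0 if DIR is a mount point, 1 otherwise
is_mounted() {
    # Field 2 of each /proc/mounts line is the mount point. Comparing it
    # exactly avoids substring matches such as /data/download2.
    awk -v dir="$1" '$2 == dir { found = 1 } END { exit !found }' /proc/mounts
}

# Possible replacement for the downcount check in checknfs.sh:
# if ! is_mounted "$mdirdown"; then
#     mount "$remoteip:$mdirdown" "$mdirdown"
# fi
```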

 

Subshell program (checkdf.sh)

 

#!/bin/sh

echo "checkdf.sh start..."
echo 'checkdf pid:'$$
echo $$ > checkdfpid.log

df

echo "checkdf.sh end normally..."
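As written, checkdf.sh runs a plain df, so any hung filesystem on the host, including an NFS share the application never uses, would be reported as a failure of these two mounts. A possible variant, offered as an assumption rather than the author's version, takes the directories to check as arguments so that df only queries the filesystems that hold them. The main program would then start it with something like `sh ./checkdf.sh "$mdirdown" "$mdirup" &`.

```shell
#!/bin/sh
# Sketch of a checkdf.sh variant that takes the directories to check as
# arguments (an assumption; the original runs df with no arguments).
# With no arguments it behaves like the original and checks every filesystem.

echo "checkdf.sh start..."
echo 'checkdf pid:'$$
# Record this process's PID for checknfs.sh, in the current directory
echo $$ > checkdfpid.log

# df queries the filesystem holding each given directory; on an unreachable
# NFS mount (hard-mounted, the Linux default) this call can block, which the
# main program detects by finding this process still alive after 30 seconds
df "$@"

echo "checkdf.sh end normally..."
```

If one of the directories is not mounted, df simply reports the parent filesystem and returns, and the main program's mount-point check then triggers the re-mount.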

 


Origin http://43.154.161.224:23101/article/api/json?id=326339198&siteId=291194637