#!/bin/bash while [ 1 ] do job_error_no=`kubectl get pod -n weifeng |grep -i "job"|grep -ci error` if [ $job_error_no -gt 0 ];then ps -fe|grep k8s_job_status_monitor|grep -v grep|awk '{print $2}'|xargs kill -9 echo "k8s job running is not stable " >> /tmp/k8s_job_error_no.log fi sleep 60 done
If k8s cluster job status appear error, the script automatically kill off their montior process, triggered by an alarm monitoring process to monitor cloud Ali cloud
Ali and so the monitoring process monitoring document https://www.cnblogs.com/weifeng1463/p/11591796.html