1. Problem description
The problem I ran into is this: nginx on an Alibaba Cloud public server reverse-proxies one of our company's intranet services. The intranet service is published to the Internet through DDNS, so its IP is dynamic. Whenever the IP of the intranet service changes, requests from the Internet get an nginx error page. The nginx error log shows entries like "connect() failed (110: Connection timed out) while connecting to upstream, client: 58.22.113.106". Searching online suggested that problems of this kind are caused by nginx's DNS cache.
nginx version: nginx/1.14.1
OS: CentOS Linux release 8.5.2111
2. Problem analysis
I tracked the problem through three separate occurrences and found that it appeared every time the IP behind the DDNS domain name changed, so DNS caching was clearly the cause. The commands I used during the investigation are as follows:
# Use the following command to view the current IP of the ddns domain name
dig @8.8.8.8 "A.com" +short
# Find the IP that nginx actually connected to: the "upstream:" field of the matching error-log lines shows the resolved address
tail -n 100 -f /var/log/nginx/error.log | grep "110: Connection timed out"
# When the problem occurs, reloading nginx makes it re-resolve the domain names in the configuration file, and service returns to normal immediately
nginx -s reload
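Because a reload reliably clears the stale address, one stopgap is to reload nginx automatically whenever the DDNS record changes. The script below is only a minimal sketch of that idea and is not part of the original setup: the function names (current_ip, ip_changed, check_and_reload), the state file path /tmp/ddns_last_ip, and the once-a-minute cron schedule are all assumptions for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical watchdog: reload nginx when the DDNS A record changes.
# STATE_FILE is an assumed location for remembering the last seen IP.
STATE_FILE="${STATE_FILE:-/tmp/ddns_last_ip}"

# Print the first IPv4 address dig returns for a name (nothing on failure).
# $1 = domain name, $2 = DNS server to query (defaults to 8.8.8.8).
current_ip() {
    dig @"${2:-8.8.8.8}" "$1" A +short \
        | grep -E '^[0-9]+(\.[0-9]+){3}$' \
        | head -n 1
}

# Succeed (exit 0) only when the new IP is non-empty and differs from
# the IP stored in the state file. A missing state file counts as a change.
ip_changed() {
    local new_ip="$1" state_file="$2" old_ip=""
    if [ -z "$new_ip" ]; then
        return 1    # lookup failed, do not treat this as a change
    fi
    if [ -f "$state_file" ]; then
        old_ip="$(cat "$state_file")"
    fi
    [ "$new_ip" != "$old_ip" ]
}

# Look up the domain, and if its IP changed, record it and reload nginx.
check_and_reload() {
    local domain="$1" ip
    ip="$(current_ip "$domain")"
    if ip_changed "$ip" "$STATE_FILE"; then
        printf '%s\n' "$ip" > "$STATE_FILE"
        nginx -s reload
    fi
}

# Usage, for example from cron once a minute:
#   check_and_reload "A.com"
```

Note that the first run, when no state file exists yet, also triggers one reload. This only shortens the outage to roughly the polling interval; it does not stop nginx from caching the address in the first place.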
Here I cite one of the articles I consulted while solving this problem, "Nginx Reverse Proxy DNS Cache Problem". That article describes two solutions, and other articles mention a third. In solving my actual problem, I tried all three of those solutions before a fourth finally worked.
3. Problem solving
See the article cited in "Problem analysis" for background. Below are the solutions in the order I tried them.
1. Upstream combined with max_fails and fail_timeout
I tested this solution and it did not solve my problem; the error came back the next day.
upstream A {
    least_conn;
    server A.com:8080 max_fails=1 fail_timeout=10s;
}
server {
    ……
    location / {
        proxy_pass https://A;
    }
}
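2. resolver + resolver_timeout
This solution did not solve the problem either; it recurred. A likely reason is that nginx resolves a literal host name in proxy_pass once, when the configuration is loaded, so the resolver directive is never consulted for it.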
server {
    ……
    resolver 114.114.114.114 8.8.8.8 valid=10s;
    resolver_timeout 3s;
    location / {
        proxy_pass https://A.com:28080;
    }
}
3. 1 + 2 combined
This combination did not solve my problem either; the error happened again.
upstream A {
    least_conn;
    server A.com:8080 max_fails=1 fail_timeout=10s;
}
server {
    ……
    resolver 114.114.114.114 8.8.8.8 valid=10s;
    resolver_timeout 3s;
    location / {
        proxy_pass https://A;
    }
}
4. Use variables in proxy_pass (this scheme needs to be combined with resolver to take effect)
I then tried a fourth solution, which solved the problem, and after observing it over a long period this is the one I now use. One issue remains: when the IP behind the DDNS domain name changes, the first request still returns the nginx error page and the log records "A.com could not be resolved (110: Operation timed out)", but the next request succeeds. If anyone knows how to fix this completely, please leave a comment; I would be very grateful!
At first I only put a variable in proxy_pass. After nginx -s reload, every request returned an nginx error, and error.log showed "no resolver defined to resolve A.com". The reason is that when proxy_pass contains a variable, nginx resolves the host name at request time through the configured resolver rather than once at configuration load, so a resolver must be defined. After I added the resolver directives, the final nginx configuration is as follows:
server {
    resolver 114.114.114.114 8.8.8.8 valid=10s;
    resolver_timeout 3s;
    location / {
        set $upstream_param "A.com:28080";
        proxy_pass https://$upstream_param;
    }
}
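The remaining "could not be resolved (110: Operation timed out)" error means a DNS lookup did not finish within resolver_timeout. One way to narrow this down is to time each configured resolver from the nginx server itself. The helper below is a hedged sketch for that check: probe_resolver is a hypothetical name, and the 3-second limit is an assumption chosen to mirror resolver_timeout 3s.

```shell
#!/usr/bin/env bash
# Hypothetical diagnostic: ask one DNS server for a name with a 3-second
# limit and a single attempt, then report whether an answer came back.
# $1 = DNS server address, $2 = domain name.
probe_resolver() {
    local server="$1" domain="$2" answer
    if answer="$(dig @"$server" "$domain" A +short +time=3 +tries=1)" \
        && [ -n "$answer" ]; then
        # Keep only the first line in case several records are returned.
        echo "$server ok ${answer%%$'\n'*}"
    else
        echo "$server FAILED"
    fi
}

# Usage: run a few times for each address on the resolver line, e.g.
#   for s in 114.114.114.114 8.8.8.8; do probe_resolver "$s" "A.com"; done
```

nginx queries the servers on the resolver line in round-robin order, so a single slow or unreachable server can make some lookups time out. If one server fails regularly in this probe, removing it from the resolver line or raising resolver_timeout may be worth testing. This is an untested suggestion, not a confirmed fix.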