
Nginx Server Timeout Issue, automatically stopping every week #950

Open
haribachala opened this issue Dec 7, 2024 · 19 comments

@haribachala

haribachala commented Dec 7, 2024

Nginx Version: nginx:1.27.1-alpine3.20

Environment: ECS Fargate Container

Description: We are using Nginx as a reverse proxy in our production environment. Every Saturday around 00:00 UTC, the Nginx server stops responding to API calls, resulting in HTTP Status Code 499 errors. The API calls take more than 60 seconds to complete. Initially, we suspected the issue was with the upstream API taking too long to respond. However, during the latest occurrence, we observed that the upstream API responds within 5 seconds when called directly. When called via Nginx, the requests fail with a timeout error.

Observations:

The Nginx health check (nginx_status) shows that active and waiting connections are fewer than 10.
The issue occurs consistently every Saturday around 00:00 UTC.
Logs show only timeout errors.
Restarting Nginx resolves the issue temporarily.

To reproduce

Use the nginx:1.27.1-alpine3.20 image in an ECS Fargate container.
Configure Nginx as a reverse proxy.
Observe the server behavior around 00:00 UTC on Saturdays.

Expected behavior

Nginx should work without any issues.

Your environment

Nginx image version: nginx:1.27.1-alpine3.20
Environments: ECS Fargate

Additional context

The issue does not occur every day; the pattern is every Saturday at 00:00 UTC. We don't have any scheduled jobs to stop the nginx server; the nginx_status endpoint is responding correctly.

@oxpa
Collaborator

oxpa commented Dec 7, 2024

Did you test the upstream from the Nginx container or from other locations?
I'm not familiar with ECS Fargate, but I'm pretty sure nginx in a container can work flawlessly for more than a week.

Having the nginx error log would help. Having the container logs may help as well.

@haribachala
Author

Thanks. Yes, I have verified the upstream from the hosted environment. I found similar issues reported on StackOverflow, but I wonder whether this issue affects the latest versions. I verified the logs and metrics of the container; they look good.

https://stackoverflow.com/questions/51147952/nginx-stops-automatically-at-particular-time
https://stackoverflow.com/questions/42622986/nginx-server-stops-automatically-and-my-site-goes-down-and-i-need-to-restart-in

@oxpa
Collaborator

oxpa commented Dec 9, 2024

Well, I can assure you there are no weekly limits and no weekly tasks in nginx or nginx docker containers.
And without seeing logs, there is not much more I can say to help you.

@haribachala
Author

haribachala commented Dec 11, 2024

Thanks. The logs don't provide much information, except that the upstream fails with HTTP code 499. I have enabled debug mode (error_log /dev/stderr debug;) and expect to gather more details soon. One observation: when an upstream call timed out, I checked the active connections using the nginx_status endpoint. The active connections were 4, with 1 waiting and 1 writing, which seems normal to me.

The health endpoints are working as expected, and there are issues only with the reverse proxy. Apart from debug mode, how can I troubleshoot this?

Quick question: if an upstream server call times out for any reason, will NGINX stop forwarding subsequent calls to that upstream?
(we have only one upstream server)

@oxpa
Collaborator

oxpa commented Dec 11, 2024

@haribachala
HTTP 499 is not an upstream code. It is a code that indicates the client closed the connection without getting a response. Basically, the client's timeout is lower than nginx's timeout towards the upstream.

The error log usually has some details on what's going on. Having upstream-related variables in the access log may also help (upstream_addr, status, etc.).
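As a rough illustration of that suggestion (the format name upstream_debug, the log destination, and the exact variable selection are illustrative, not something from this thread):

log_format upstream_debug '$remote_addr [$time_local] "$request" $status '
                          'upstream=$upstream_addr upstream_status=$upstream_status '
                          'rt=$request_time uct=$upstream_connect_time urt=$upstream_response_time';
# attach the format to the existing access log destination
access_log /dev/stdout upstream_debug;

With $upstream_addr recorded per request, it is easy to see whether nginx keeps hitting a stale upstream IP.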
Can you post your configuration? Do you have hostnames in your configuration with no dynamic resolution? Can it be that Amazon changes the IP of your upstream and nginx still tries to use an old IP? In that case your tests (say, with cURL) will succeed, but nginx may time out requests.

If you have access to the container, it's probably worth making requests both through nginx and directly, and comparing the tcpdump output for both. It should be pretty obvious, really.

@haribachala
Author

haribachala commented Dec 11, 2024

Yes, 499 is a client timeout error (this happened via the Datadog agent, which waits only 60 seconds).

custom template config:

server {
    listen 8080 default_server;
    listen [::]:8080;
    server_name ${server_name_env};

    http2 on;
    client_body_timeout 240;
    client_header_timeout 240;
    keepalive_timeout 240;
    proxy_connect_timeout 300;
    proxy_send_timeout 300;
    proxy_read_timeout 300;
    fastcgi_send_timeout 300;
    fastcgi_read_timeout 300;
    send_timeout 240;
    proxy_buffers 32 16k;
    gzip_comp_level 9;
    gzip_types text/css text/javascript application/javascript application/x-javascript;

    client_max_body_size 100M;

    location = /health {
        return 200 'OK';
        access_log off;
        add_header Content-Type text/plain;
    }

    location /route/nginx_status/ {
        stub_status;
        include /etc/nginx/conf.d/proxy_headers.conf;
        include /etc/nginx/conf.d/access/ips.conf;
        deny all;
    }

    location /route/dbservice/ {
        include /etc/nginx/conf.d/proxy_headers.conf;
        proxy_pass https://${kong_alb_url}/dbservice/;
    }

    location /route/exportservice/ {
        include /etc/nginx/conf.d/proxy_headers.conf;
        proxy_pass https://${kong_alb_url}/exportservice/;
    }

    error_page 404 /404.html;
    error_page 500 /500.html;
    error_page 502 /502.html;
    error_page 503 504 /5xx.html;

    location ~ ^/(404.html|500.html|502.html|5xx.html|scheduled-downtime.html) {
        root /etc/nginx/error;
    }

}

nginx.conf - template config:

load_module modules/ngx_http_geoip_module.so;

worker_processes auto;

error_log /dev/stderr debug;

pid /tmp/nginx.pid;

events {
    worker_connections 1024;
}

http {
    include /etc/nginx/mime.types;
    default_type application/octet-stream;
    include /etc/nginx/conf.d/access/real_ip.conf;
    underscores_in_headers on;
    map $request_uri $loggable {
        /health/ 0;
        default 1;
    }
    map $http_trace_Id $trace_Id {
        default $http_trace_Id; # Use the incoming trace_id if it exists
        '' $request_id;         # If trace_id is empty, use the request_id
    }

    log_format upstream_time '$remote_addr - $remote_user [$time_local] "$request" '
                             '$status Trace-ID: $trace_Id "$http_referer" '
                             '"$http_user_agent" "$http_x_forwarded_for" '
                             'rt=$request_time uct=$upstream_connect_time uht=$upstream_header_time urt=$upstream_response_time';

    access_log /dev/stdout upstream_time if=$loggable;

    sendfile on;
    tcp_nopush on;
    tcp_nodelay on;
    keepalive_timeout 75; # default 65

    gzip on;
    gzip_disable "msie6";

    include /etc/nginx/conf.d/*.conf;

    client_body_temp_path /tmp/client_temp;
    proxy_temp_path /tmp/proxy_temp_path;
    fastcgi_temp_path /tmp/fastcgi_temp;
    uwsgi_temp_path /tmp/uwsgi_temp;
    scgi_temp_path /tmp/scgi_temp;

}

@oxpa
Collaborator

oxpa commented Dec 11, 2024

You don't have upstream_addr in your log format. It would be interesting to compare the IP from nginx with the IP from, say, cURL.

Is it ${kong_alb_url} that doesn't reply? I assume this variable is substituted in the container configuration and contains a hostname.
Does the IP of this service change weekly?
If yes, that's probably the reason. Maybe create an upstream block with "server $alb_name resolve;" and configure a resolver.
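For orientation, a minimal sketch of that shape, assuming a placeholder hostname kong.example.internal and a placeholder VPC nameserver IP (a fuller version appears further down in the thread):

resolver 10.0.0.2 valid=30s;                  # nameserver IP is a placeholder
upstream kong {
    zone u_kong 128k;                         # shared memory zone; required for 'resolve' to work
    server kong.example.internal:443 resolve; # re-resolve the hostname instead of pinning one IP
}
# and then proxy_pass https://kong/...; instead of the raw hostname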

@haribachala
Author

kong_alb_url - yes, it's a container environment variable. It's an ALB URL; the underlying target IP(s) will change, not the ALB name. For example, for 'https://mykongHost.com', the underlying IP(s) of 'mykongHost' will change when there is a deployment.
When the ALB name resolves to an IP, does NGINX cache the IP? The requests are stateless, right?
I will add upstream_addr to the logs.

@oxpa
Collaborator

oxpa commented Dec 11, 2024

If the container has something like "proxy_pass https://mykongHost.com ...." - the name is resolved once, at nginx startup, and then never re-resolved.
What you should do is configure a resolver at the http{} level, then configure an upstream for this host and add the "resolve" parameter to that server. The resulting config should be roughly like this:

resolver $resolvers_from_host;
upstream kong {
    zone u_kong 128k;
    server myconghost:443 resolve;
    keepalive 4;
}
server {
    proxy_http_version 1.1;
    proxy_set_header Connection '';
    location / { proxy_pass https://kong;}
}

This way, "myconghost" will be properly re-resolved when needed.

@haribachala
Author

Thank you. Regarding the suggestion of 'myconghost' in the upstream block, can I still use the ALB/DNS placeholder here, considering its environment-variable value changes for each environment?

@oxpa
Collaborator

oxpa commented Dec 11, 2024

It should be a domain name that nginx can resolve. If your placeholder is substituted with a domain name - then yes, sure you can.

@haribachala
Author

resolvers_from_host is the ALB/DNS URL, right?

@oxpa
Collaborator

oxpa commented Dec 11, 2024

Nope: http://nginx.org/en/docs/http/ngx_http_core_module.html#resolver
https://github.com/nginxinc/docker-nginx/blob/master/entrypoint/15-local-resolvers.envsh
It's the IPs of the nameservers from your resolv.conf (you can use the env var from the entrypoint script above).
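As a hedged sketch, assuming the stock image entrypoint is used and the config is rendered from a *.template file, so that the NGINX_LOCAL_RESOLVERS variable exported by 15-local-resolvers.envsh gets substituted (the valid=30s value is illustrative):

# NGINX_LOCAL_RESOLVERS is populated by the entrypoint with the nameserver IPs from /etc/resolv.conf
resolver ${NGINX_LOCAL_RESOLVERS} valid=30s;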

@haribachala
Author

Thanks. I have enabled debug mode and also modified the config as follows:

resolver 10.X.X.X 10.X.X.X valid=30s;
upstream kong_url {
    zone u_kong 128k;
    server kong_alb_here:443 resolve max_fails=5 fail_timeout=360s;

location /route/dbservice/ {
    include /etc/nginx/conf.d/proxy_headers.conf;
    proxy_pass https://kong_url/dbservice/;
}

I will monitor this for some time.

@oxpa
Collaborator

oxpa commented Dec 12, 2024

@haribachala just two notes: if you don't have "keepalive" inside the upstream block - there will be no keepalive connections. Keepalive saves a lot of resources. And be careful with max_fails and fail_timeout: it's easy to misconfigure them. Many people prefer max_fails=0.
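To make both notes concrete, a hedged sketch of the upstream block with keepalive present and passive failure accounting disabled (names and sizes mirror the snippet above; they are illustrations, not tuned values):

upstream kong_url {
    zone u_kong 128k;
    server kong_alb_here:443 resolve max_fails=0; # max_fails=0 disables passive failure accounting
    keepalive 4;                                  # idle connections to the upstream cached per worker
}

As in the earlier example, upstream keepalive also needs proxy_http_version 1.1; and proxy_set_header Connection ''; in the proxied locations.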

@haribachala
Author

Sorry, keepalive 4; is there. While copying, I missed the last two lines of the upstream block. I will verify the max_fails and fail_timeout config.

After the changes, the example log is:
172.17.0.1 - - [12/Dec/2024:12:42:18 +0000] "GET /route/dbservice/uri/ HTTP/1.1" 200 Trace-ID: aad20d6b74f18ef04289cec94351d99d "-" "PostmanRuntime/7.43.0" "-" rt=1.386 uct=1.006 uht=1.388 urt=1.388 upstream_addr=10.X.X.83:443

upstream_addr is just the Kong internal ALB IP (the internal ALB has more than one IP; in my case there are 3, and for each client request one of them is printed, in round robin). To check whether resolve is working, we restarted the Kong server: the IPs of the Kong instances changed, but the ALB IP(s) did not.

@oxpa
Collaborator

oxpa commented Dec 12, 2024

The upstream connect time is a bit high though (1 second?). You can probably increase the keepalive value to have a bigger pool of connections.

Otherwise, let's wait till Saturday; this time we'll have more data to work with.
If it stops working for you on Saturday again, try requests with curl from the nginx container both directly and through nginx, and we'll figure out what's wrong.

@haribachala
Author

It may be high because I am testing from my local machine and the upstream is in another region; in prod, the upstream and the application are in the same region. Sure, next time the event happens, I will try to call the upstream from the Nginx container.

@haribachala
Author

This time the upstream ALB IP(s) have not changed yet; I will update once they do.
