
At a random point in time, PostgREST stops responding to HTTP requests. #1399

Closed
fortyanov opened this issue Oct 31, 2019 · 27 comments

@fortyanov
Contributor

Environment

  • PostgreSQL version: 12
  • PostgREST version: 6.0.2-713b214
  • Operating System: FreeBSD 12.1-STABLE

Description of the problem

At a random point in time, PostgREST stops responding to HTTP requests. We tried both a manually built binary and the build from GitHub; the problem occurs in both cases.

Our .conf file contains the following options:
db-pool = 15
db-pool-timeout = 1000

I cannot provide more detailed information because the logs contain nothing about why it stops responding to HTTP requests. When the problem occurs, reloading the database schema still works fine. The only way to get it responding again is to restart the process.

I will attach a TCP dump as soon as the problem happens again.

@fortyanov
Contributor Author

Here is the TCP dump.
dump.pcap.zip

@steve-chavez
Member

steve-chavez commented Nov 2, 2019

I think this could be the same problem as #905. That issue is frequently visited; it seems the error persists even on current versions.

Unfortunately, using tshark on the pcap file reveals nothing for me.

How about if you add this change steve-chavez@110314d and recompile from master? Perhaps that could lead to a more meaningful error message on the client.

@fortyanov
Contributor Author

I rebuilt the project with patch 110314d, but when the problem happens again there are still no log messages. My Haskell is very weak. Can you point me to the parts of the project I should look at?

@steve-chavez
Member

Oh, the patch only adds a client error message for when the empty reply happens again. I'd like to see what the error message is in that case.

@steve-chavez
Member

Also, could it be an out of memory issue?

@fortyanov
Contributor Author

fortyanov commented Nov 6, 2019

Free memory never drops below about 2 GB. The daemon returns no error messages, does not respond to any HTTP requests, and does not close the TCP connections with clients.

@fortyanov
Contributor Author

We set the parameter
db-pool-timeout = 1
and the problem has not recurred for about 2 days. I hope this information helps.

@fortyanov
Contributor Author

The problem happened again, but not consistently. It seems to be somehow related to the db-pool-timeout parameter. Do you know of any other logging tools that could give more information?

@steve-chavez
Member

Hmm.. I'm trying to reproduce the issue. Is your pg server on the same box as postgrest?

This is what I'm doing. Add an application_name query param to the postgrest db-uri:

db-uri = "postgres://user@localhost/db?application_name=postgrest_app"
db-pool-timeout = 1000000

Then we can look at pg_stat_activity:

select * from pg_stat_activity where application_name = 'postgrest_app';

And we can terminate the connection:

select pg_terminate_backend(pid) from pg_stat_activity where application_name = 'postgrest_app';

In this case the next request to postgrest will result in a 503 with the following message:

curl 'localhost:3000/'
{"details":"no connection to the server\n","message":"Database client error"} 

And the server will log:

Attempting to connect to the database...
127.0.0.1 - - [11/Nov/2019:11:49:41 -0500] "GET / HTTP/1.1" 503 - "" "curl/7.64.0"
Connection successful

PostgREST tries to reconnect to the db with exponential backoff, and in my case the next request will succeed.

Even if I shut down the PostgreSQL server, PostgREST will print server logs when trying to recover:

127.0.0.1 - - [11/Nov/2019:11:52:19 -0500] "GET / HTTP/1.1" 503 - "" "curl/7.64.0"
{"details":"could not connect to server: Connection refused\n\tIs the server running on host \"localhost\" (::1) and accepting\n\tTCP/IP connections on port 5432?\ncould not connect to server: Connection refused\n\tIs the server running on host \"localhost\" (127.0.0.1) and accepting\n\tTCP/IP connections on port 5432?\n","code":"","message":"Database connection error"}
Attempting to reconnect to the database in 0 seconds...
{"details":"could not connect to server: Connection refused\n\tIs the server running on host \"localhost\" (::1) and accepting\n\tTCP/IP connections on port 5432?\ncould not connect to server: Connection refused\n\tIs the server running on host \"localhost\" (127.0.0.1) and accepting\n\tTCP/IP connections on port 5432?\n","code":"","message":"Database connection error"}
Attempting to reconnect to the database in 1 seconds...
127.0.0.1 - - [11/Nov/2019:11:52:21 -0500] "GET / HTTP/1.1" 503 - "" "curl/7.64.0"
{"details":"could not connect to server: Connection refused\n\tIs the server running on host \"localhost\" (::1) and accepting\n\tTCP/IP connections on port 5432?\ncould not connect to server: Connection refused\n\tIs the server running on host \"localhost\" (127.0.0.1) and accepting\n\tTCP/IP connections on port 5432?\n","code":"","message":"Database connection error"}
Attempting to reconnect to the database in 2 seconds...

And clients would get 503 errors with this message:

curl localhost:3000
{"message":"Database connection lost, retrying the connection."}

Could you see if the recovery procedure is similar on your server? Maybe this would give another clue.

@fortyanov
Contributor Author

Yes, I get exactly the same behavior on our server. I've now set db-pool-timeout back to 1000 and will disable keep-alive on nginx, since I suspect it might be related to the problem.

@fortyanov
Contributor Author

In the nginx configuration file

upstream postgrest
{
  server 127.0.0.1:3000;
  keepalive 64;
}

we disabled the keepalive parameter, and now everything works fine.
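
For reference, a minimal sketch of the resulting upstream block (same listen address as above, with the keepalive directive simply dropped):

upstream postgrest
{
  server 127.0.0.1:3000;
  # keepalive 64;  # removed as a workaround
}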

@steve-chavez
Member

@fortyanov Oh, great find!

So, we should remove that recommended keepalive from the hardening postgrest doc.

Why would that happen though? The nginx doc on keepalive says:

keepalive – The number of idle keepalive connections to an upstream server that remain open for each worker process. There is no default value.

@fortyanov
Contributor Author

I don't understand the exact cause of the bug. I guess it could be some kind of race condition. Keepalive is very important for performance.

@elimisteve
Contributor

I will note that I've been using PostgREST for 3 years and I've never run into this issue. This may be worth mentioning because I always have a Go binary proxying through to PostgREST, not Nginx, which may be why!

@fortyanov
Contributor Author

fortyanov commented Nov 19, 2019

I will note that I've been using PostgREST for 3 years and I've never run into this issue. This may be worth mentioning because I always have a Go binary proxying through to PostgREST, not Nginx, which may be why!

Does your Go proxy implement keepalive (persistent connections to PostgREST)?

@elimisteve
Contributor

I'm not sure.

Following the source, I see: https://github.com/EffectiveAF/effective/blob/master/server.go#L75 => https://github.com/golang/go/blob/master/src/net/http/httputil/reverseproxy.go#L100-L124 => https://github.com/golang/go/blob/master/src/net/http/httputil/reverseproxy.go#L24-L79 => https://github.com/golang/go/blob/master/src/net/http/transport.go#L37-L54 .

Perhaps most relevantly I see

Timeout: 30 * time.Second

KeepAlive: 30 * time.Second

and

type Transport struct {
	...

	// DisableKeepAlives, if true, disables HTTP keep-alives and
	// will only use the connection to the server for a single
	// HTTP request.
	//
	// This is unrelated to the similarly named TCP keep-alives.
	DisableKeepAlives bool
	...

which defaults to false.
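
For illustration, here is a minimal sketch (placeholder addresses, not the actual proxy linked above) of a Go reverse proxy that explicitly disables upstream keep-alives:

package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

func main() {
	// Hypothetical PostgREST address; adjust to wherever the upstream listens.
	target, err := url.Parse("http://127.0.0.1:3000")
	if err != nil {
		log.Fatal(err)
	}

	proxy := httputil.NewSingleHostReverseProxy(target)

	// Never reuse upstream connections: each proxied request opens a fresh
	// connection to PostgREST, so no idle keep-alive connection can go
	// stale between the proxy and the upstream.
	proxy.Transport = &http.Transport{DisableKeepAlives: true}

	log.Fatal(http.ListenAndServe(":8080", proxy))
}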

@dwagin
Contributor

dwagin commented Nov 19, 2019

yesodweb/wai#618

Maybe this

@fortyanov
Contributor Author

We set up a unix socket on postgrest:

server-unix-socket = "/tmp/.s.PostgREST"
server-unix-socket-mode = "777"

The problem still happens.

However, when we raise the nginx keepalive_timeout

upstream postgrest
{
  server unix:/tmp/.s.PostgREST;
  keepalive 64;
  keepalive_timeout 120s;
}

so that it is larger than the keep-alive timeout in postgrest, the problem does not happen. The default keepalive timeout is 60 seconds in nginx, and I think in postgrest too. It looks like there is a race condition between postgrest and nginx when closing the connection.

@fortyanov
Contributor Author

fortyanov commented Dec 2, 2019

I'm not good at Haskell, but there could be relevant information about a library used in postgrest here:
yesodweb/wai#734 (comment)

@steve-chavez
Member

I think for now we could recommend keepalive_timeout > 60s in our nginx samples. Ideally we would output a clear error when this happens (not sure if that's possible).
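
As a rough sketch of what such a sample might look like (the address and values below are placeholders; keepalive_timeout inside an upstream block also needs a fairly recent nginx):

upstream postgrest
{
  server 127.0.0.1:3000;
  keepalive 64;
  # Keep idle upstream connections around longer than PostgREST's own
  # idle timeout, so nginx does not try to reuse a connection the
  # upstream has already closed.
  keepalive_timeout 120s;
}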

@fortyanov Did the issue appear again? Also could you estimate what kind of traffic your server is handling? I'd still like to reproduce the issue.

@steve-chavez
Member

I've also seen this on a production instance. I tried the keepalive_timeout 120s with no success. I kept getting nginx upstream timeout errors.

Tuning db-pool to higher values (up to 80, with pg max_connections=100) also didn't help.

Scenario was like this:

  • pg and pgrst on different servers.
  • 210 concurrent requests arrive and pgrst stops responding (this gives the upstream timeout errors). About 5 mins later, pgrst responds to the newly arrived 249 concurrent requests, then stops responding. ~5 mins later, it responds to 199 concurrent requests and then stops responding again.
  • The above keeps happening with similar numbers: pgrst responds -> stops responding (timeouts) -> responds.
  • The immediate solution was to fix the client and reduce the number of requests.

I'm led to believe that this is related to hasql-pool (nikita-volkov/hasql-pool#3, nikita-volkov/hasql-pool#4 (comment)). It could be that it doesn't error out when there are no more available connections.
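
One way to check whether the pool is saturated while the hang is happening (a sketch that reuses the application_name query param suggested earlier in this thread) is to compare the backends PostgREST holds against db-pool:

-- Group PostgREST's backends by state; if all db-pool connections are
-- "active" or stuck in "idle in transaction", the pool may be exhausted
-- without ever returning an error.
select state, count(*)
from pg_stat_activity
where application_name = 'postgrest_app'
group by state;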

@ruslantalpa
Contributor

It's not postgrest or hasql or nginx. It's the kernel settings for your TCP stack.

Details on what to tune are here: https://stackoverflow.com/questions/40339547/periodic-drop-in-throughput-of-nginx-reverse-proxy-what-can-it-be/40356386#40356386
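
Purely as an illustration (not taken from the linked answer), these are the kinds of values one might inspect on FreeBSD, where this issue was reported:

# Listen backlog and ephemeral port range of the TCP stack
sysctl kern.ipc.somaxconn
sysctl net.inet.ip.portrange.first net.inet.ip.portrange.last
# Sockets stuck in TIME_WAIT between nginx and its upstream
netstat -an | grep TIME_WAIT | wc -l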

@drTr0jan
Contributor

My environment

  • FreeBSD 12.2;
  • PostgREST nightly (d3a8b5f) or 7.0.1-release.
  • PostgreSQL 13.2 + PgBouncer 1.14.0
  • Nginx 1.20.0

Not working configuration

nginx.conf:

upstream postgrest
{
  server unix:/tmp/.s.PostgREST;
  keepalive 64;
#  keepalive_timeout 60s; # assumed as default
}

Working configuration

nginx.conf:

upstream postgrest
{
  server unix:/tmp/.s.PostgREST;
  keepalive 64;
  keepalive_timeout 120s;
}

Investigation

I've investigated the problem further.
The issue doesn't depend on the socket type (AF_INET or AF_UNIX).

When a local (unix) socket is used, PostgREST opens the socket after startup:

# sockstat -u
USER     COMMAND    PID   FD PROTO  LOCAL ADDRESS         FOREIGN ADDRESS
postgrest postgrest 18225 48 stream /tmp/.s.PostgREST

A trace of system calls on a working system:

# truss -p 18225
...
nanosleep({ 0.010000000 })			 = 0 (0x0)
nanosleep({ 0.010000000 })			 = 0 (0x0)
nanosleep({ 0.010000000 })			 = 0 (0x0)
nanosleep({ 0.010000000 })			 = 0 (0x0)
nanosleep({ 0.010000000 })			 = 0 (0x0)
nanosleep({ 0.010000000 })			 = 0 (0x0)
nanosleep({ 0.010000000 })			 = 0 (0x0)
_umtx_op(0x8032c46b8,UMTX_OP_NWAKE_PRIVATE,0x1,0x0,0x0) = 0 (0x0)
_umtx_op(0x802ae0f50,UMTX_OP_WAIT_UINT_PRIVATE,0x0,0x0,0x0) = 0 (0x0)
_umtx_op(0x8032c9400,UMTX_OP_MUTEX_WAKE2,0x0,0x0,0x0) = 0 (0x0)
_umtx_op(0x8032c9400,UMTX_OP_MUTEX_WAIT,0x0,0x0,0x0) = 0 (0x0)
accept4(0x30,0x42003d99a0,0x42003d9a40,0x30000000) = 47 (0x2f)
fcntl(47,F_GETFD,)				 = 1 (0x1)
fcntl(47,F_SETFD,FD_CLOEXEC)			 = 0 (0x0)
setsockopt(47,IPPROTO_TCP,TCP_NODELAY,0x4200443908,4) ERR#22 'Invalid argument'
kevent(18,0x0,0,{ },64,{ 0.000000000 })		 = 0 (0x0)
mmap(0x0,20480,PROT_READ|PROT_WRITE,MAP_PRIVATE|MAP_ANON,-1,0x0) = 34473328640 (0x806c54000)
accept4(0x30,0x4200443920,0x42004439c0,0x30000000) ERR#35 'Resource temporarily unavailable'
kevent(18,0x0,0,{ },64,{ 0.000000000 })		 = 0 (0x0)
kevent(23,{ 48,EVFILT_READ,EV_ADD|EV_ONESHOT,0,0,0x0 },1,0x0,0,{ 0.000000000 }) = 0 (0x0)
_umtx_op(0x8032c32b8,UMTX_OP_NWAKE_PRIVATE,0x1,0x0,0x0) = 0 (0x0)
recvfrom(47,"GET / HTTP/1.1\r\nHost: postgres"...,16384,0,NULL,0x0) = 79 (0x4f)
_umtx_op(0x8048812b8,UMTX_OP_NWAKE_PRIVATE,0x1,0x0,0x0) = 0 (0x0)
clock_gettime(4,{ 10091166.965890658 })		 = 0 (0x0)
_umtx_op(0x8048812b8,UMTX_OP_NWAKE_PRIVATE,0x1,0x0,0x0) = 0 (0x0)
_umtx_op(0x8032c9380,UMTX_OP_MUTEX_WAKE2,0x0,0x0,0x0) = 0 (0x0)
_umtx_op(0x8032c9380,UMTX_OP_MUTEX_WAIT,0x0,0x0,0x0) = 0 (0x0)
clock_gettime(0,{ 1620994043.980874065 })	 = 0 (0x0)
clock_gettime(4,{ 10091166.966852797 })		 = 0 (0x0)
sendto(49,"P\0\0\0006\0BEGIN ISOLATION LEVE"...,92,0,NULL,0) = 92 (0x5c)
nanosleep({ 0.010000000 })			 = 0 (0x0)
poll({ 49/POLLIN|POLLERR },1,-1)		 = 1 (0x1)
recvfrom(49,"1\0\0\0\^D2\0\0\0\^Dn\0\0\0\^DC"...,16384,0,NULL,0x0) = 32 (0x20)
sendto(49,"P\0\0\^A$\0select set_config($1,"...,654,0,NULL,0) = 654 (0x28e)
poll({ 49/POLLIN|POLLERR },1,-1)		 = 1 (0x1)
recvfrom(49,"1\0\0\0\^D2\0\0\0\^DT\0\0\0\M-n"...,16384,0,NULL,0x0) = 430 (0x1ae)
mmap(0x0,12288,PROT_READ|PROT_WRITE,MAP_PRIVATE|MAP_ANON,-1,0x0) = 34473349120 (0x806c59000)
mmap(0x0,12288,PROT_READ|PROT_WRITE,MAP_PRIVATE|MAP_ANON,-1,0x0) = 34473361408 (0x806c5c000)
sendto(49,"P\0\0\^E[\0\n    select\n      n"...,1438,0,NULL,0) = 1438 (0x59e)
poll({ 49/POLLIN|POLLERR },1,-1)		 = 1 (0x1)
recvfrom(49,"1\0\0\0\^D2\0\0\0\^DT\0\0\0\M^C"...,16384,0,NULL,0x0) = 162 (0xa2)
mmap(0x0,12288,PROT_READ|PROT_WRITE,MAP_PRIVATE|MAP_ANON,-1,0x0) = 34473373696 (0x806c5f000)
sendto(49,"P\0\0\0\M-@\0\n      select\n   "...,259,0,NULL,0) = 259 (0x103)
poll({ 49/POLLIN|POLLERR },1,-1)		 = 1 (0x1)
recvfrom(49,"1\0\0\0\^D2\0\0\0\^DT\0\0\0$\0"...,16384,0,NULL,0x0) = 78 (0x4e)
sendto(49,"P\0\0\a\M-O\0\n -- Recursively g"...,2066,0,NULL,0) = 2066 (0x812)
nanosleep({ 0.010000000 })			 = 0 (0x0)
nanosleep({ 0.010000000 })			 = 0 (0x0)
nanosleep({ 0.010000000 })			 = 0 (0x0)
nanosleep({ 0.010000000 })			 = 0 (0x0)
nanosleep({ 0.010000000 })			 = 0 (0x0)
nanosleep({ 0.010000000 })			 = 0 (0x0)
nanosleep({ 0.010000000 })			 = 0 (0x0)
nanosleep({ 0.010000000 })			 = 0 (0x0)
nanosleep({ 0.010000000 })			 = 0 (0x0)
nanosleep({ 0.010000000 })			 = 0 (0x0)
nanosleep({ 0.010000000 })			 = 0 (0x0)
nanosleep({ 0.010000000 })			 = 0 (0x0)
nanosleep({ 0.010000000 })			 = 0 (0x0)
nanosleep({ 0.010000000 })			 = 0 (0x0)
nanosleep({ 0.010000000 })			 = 0 (0x0)
nanosleep({ 0.010000000 })			 = 0 (0x0)
nanosleep({ 0.010000000 })			 = 0 (0x0)
nanosleep({ 0.010000000 })			 = 0 (0x0)
nanosleep({ 0.010000000 })			 = 0 (0x0)
poll({ 49/POLLIN|POLLERR },1,-1)		 = 1 (0x1)
recvfrom(49,"1\0\0\0\^D2\0\0\0\^DT\0\0\^A0\0"...,16384,0,NULL,0x0) = 988 (0x3dc)
sendto(49,"P\0\0\0\^N\0COMMIT\0\0\0B\0\0\0"...,52,0,NULL,0) = 52 (0x34)
poll({ 49/POLLIN|POLLERR },1,-1)		 = 1 (0x1)
recvfrom(49,"1\0\0\0\^D2\0\0\0\^Dn\0\0\0\^DS"...,16384,0,NULL,0x0) = 55 (0x37)
clock_gettime(0,{ 1620994044.198996088 })	 = 0 (0x0)
_umtx_op(0x8048812b8,UMTX_OP_NWAKE_PRIVATE,0x1,0x0,0x0) = 0 (0x0)
_umtx_op(0x8032c9300,UMTX_OP_MUTEX_WAIT,0x0,0x0,0x0) = 0 (0x0)
clock_gettime(13,{ 1620994044.000000000 })	 = 0 (0x0)
_umtx_op(0x8048642b8,UMTX_OP_NWAKE_PRIVATE,0x1,0x0,0x0) = 0 (0x0)
_umtx_op(0x8032ed140,UMTX_OP_WAIT_UINT_PRIVATE,0x0,0x0,0x0) = 0 (0x0)
clock_gettime(4,{ 10091167.185940172 })		 = 0 (0x0)
nanosleep({ 0.010000000 })			 = 0 (0x0)
clock_gettime(13,{ 1620994044.000000000 })	 = 0 (0x0)
clock_gettime(4,{ 10091167.186504214 })		 = 0 (0x0)
sendto(47,"HTTP/1.1 200 OK\r\nTransfer-Enco"...,3421,0,NULL,0) = 3421 (0xd5d)
recvfrom(47,0x806c2f5cf,16305,0,0x0,0x0)	 ERR#35 'Resource temporarily unavailable'
kevent(23,{ 47,EVFILT_READ,EV_ADD|EV_ONESHOT,0,0,0x0 },1,0x0,0,{ 0.000000000 }) = 0 (0x0)
_umtx_op(0x8048812b8,UMTX_OP_NWAKE_PRIVATE,0x1,0x0,0x0) = 0 (0x0)
_umtx_op(0x8032ed068,UMTX_OP_WAIT_UINT_PRIVATE,0x0,0x0,0x0) = 0 (0x0)
_umtx_op(0x8032c9400,UMTX_OP_MUTEX_WAKE2,0x0,0x0,0x0) = 0 (0x0)
_umtx_op(0x8032c9400,UMTX_OP_MUTEX_WAIT,0x0,0x0,0x0) = 0 (0x0)
_umtx_op(0x8032c4bb8,UMTX_OP_NWAKE_PRIVATE,0x1,0x0,0x0) = 0 (0x0)
_umtx_op(0x8032ed158,UMTX_OP_WAIT_UINT_PRIVATE,0x0,0x0,0x0) = 0 (0x0)
kevent(23,0x0,0,{ },64,{ 0.000000000 })		 = 0 (0x0)
kevent(23,0x0,0,{ },64,{ 0.000000000 })		 = 0 (0x0)
recvfrom(47,0x806c2f5cf,16305,0,NULL,0x0)	 = 0 (0x0)
close(47)					 = 0 (0x0)
nanosleep({ 0.010000000 })			 = 0 (0x0)
nanosleep({ 0.010000000 })			 = 0 (0x0)
nanosleep({ 0.010000000 })			 = 0 (0x0)
nanosleep({ 0.010000000 })			 = 0 (0x0)
nanosleep({ 0.010000000 })			 = 0 (0x0)
nanosleep({ 0.010000000 })			 = 0 (0x0)
nanosleep({ 0.010000000 })			 = 0 (0x0)
nanosleep({ 0.010000000 })			 = 0 (0x0)
nanosleep({ 0.010000000 })			 = 0 (0x0)
nanosleep({ 0.010000000 })			 = 0 (0x0)
nanosleep({ 0.010000000 })			 = 0 (0x0)
nanosleep({ 0.010000000 })			 = 0 (0x0)
nanosleep({ 0.010000000 })			 = 0 (0x0)
...

But after the issue (the hang in question), PostgREST is left with another FD:

# sockstat -u
USER     COMMAND    PID   FD PROTO  LOCAL ADDRESS         FOREIGN ADDRESS
postgrest postgrest 18225 48 stream /tmp/.s.PostgREST
postgrest postgrest 18225 52 stream /tmp/.s.PostgREST

and has stopped processing requests:

# truss -p 18225
...
nanosleep({ 0.010000000 })			 = 0 (0x0)
nanosleep({ 0.010000000 })			 = 0 (0x0)
clock_gettime(4,{ 10435412.577459186 })		 = 0 (0x0)
_umtx_op(0x8061e62b8,UMTX_OP_NWAKE_PRIVATE,0x1,0x0,0x0) = 0 (0x0)
_umtx_op(0x8032c9500,UMTX_OP_MUTEX_WAKE2,0x0,0x0,0x0) = 0 (0x0)
_umtx_op(0x8032c9500,UMTX_OP_MUTEX_WAIT,0x0,0x0,0x0) = 0 (0x0)
clock_gettime(0,{ 1621338289.592394262 })	 = 0 (0x0)
clock_gettime(4,{ 10435412.578476639 })		 = 0 (0x0)
write(46,"\M^?",1)				 = 1 (0x1)
poll({ 43/POLLIN 45/POLLIN },2,-1)		 = 1 (0x1)
read(45,"\M^?",4096)				 = 1 (0x1)
clock_gettime(4,{ 10435412.579218358 })		 = 0 (0x0)
nanosleep({ 0.010000000 })			 = 0 (0x0)
nanosleep({ 0.010000000 })			 = 0 (0x0)
...
# fstat -p 18225
USER     CMD          PID   FD MOUNT      INUM MODE         SZ|DV R/W
772      postgrest  18225    0 /dev       29   crw-rw-rw-    null rw
772      postgrest  18225    1* pipe fffff802286a9d48 <-> fffff802286a9be0      0 rw
772      postgrest  18225    2* pipe fffff802286a9d48 <-> fffff802286a9be0      0 rw
772      postgrest  18225    3
772      postgrest  18225    4* pipe fffff8022860c2f8 <-> fffff8022860c460      0 rw
772      postgrest  18225    5* pipe fffff8022860c460 <-> fffff8022860c2f8      0 rw
772      postgrest  18225    6* pipe fffff802967f72f8 <-> fffff802967f7460      0 rw
772      postgrest  18225    7* pipe fffff802967f7460 <-> fffff802967f72f8      0 rw
772      postgrest  18225    8
772      postgrest  18225    9* pipe fffff801e9567000 <-> fffff801e9567168      0 rw
772      postgrest  18225   10* pipe fffff801e9567168 <-> fffff801e9567000      0 rw
772      postgrest  18225   11* pipe fffff8028c478be0 <-> fffff8028c478d48      0 rw
772      postgrest  18225   12* pipe fffff8028c478d48 <-> fffff8028c478be0      0 rw
772      postgrest  18225   13
772      postgrest  18225   14* pipe fffff802d46d1000 <-> fffff802d46d1168      0 rw
772      postgrest  18225   15* pipe fffff802d46d1168 <-> fffff802d46d1000      0 rw
772      postgrest  18225   16* pipe fffff802455debe0 <-> fffff802455ded48      0 rw
772      postgrest  18225   17* pipe fffff802455ded48 <-> fffff802455debe0      0 rw
772      postgrest  18225   18
772      postgrest  18225   19* pipe fffff802285fe5f0 <-> fffff802285fe758      0 rw
772      postgrest  18225   20* pipe fffff802285fe758 <-> fffff802285fe5f0      0 rw
772      postgrest  18225   21* pipe fffff8029031e000 <-> fffff8029031e168      0 rw
772      postgrest  18225   22* pipe fffff8029031e168 <-> fffff8029031e000      0 rw
772      postgrest  18225   23
772      postgrest  18225   24* pipe fffff802455dd2f8 <-> fffff802455dd460      0 rw
772      postgrest  18225   25* pipe fffff802455dd460 <-> fffff802455dd2f8      0 rw
772      postgrest  18225   26* pipe fffff800198552f8 <-> fffff80019855460      0 rw
772      postgrest  18225   27* pipe fffff80019855460 <-> fffff800198552f8      0 rw
772      postgrest  18225   28
772      postgrest  18225   29* pipe fffff801e94f6be0 <-> fffff801e94f6d48      0 rw
772      postgrest  18225   30* pipe fffff801e94f6d48 <-> fffff801e94f6be0      0 rw
772      postgrest  18225   31* pipe fffff802455b5be0 <-> fffff802455b5d48      0 rw
772      postgrest  18225   32* pipe fffff802455b5d48 <-> fffff802455b5be0      0 rw
772      postgrest  18225   33
772      postgrest  18225   34* pipe fffff801e9562be0 <-> fffff801e9562d48      0 rw
772      postgrest  18225   35* pipe fffff801e9562d48 <-> fffff801e9562be0      0 rw
772      postgrest  18225   36* pipe fffff8013a9c05f0 <-> fffff8013a9c0758      0 rw
772      postgrest  18225   37* pipe fffff8013a9c0758 <-> fffff8013a9c05f0      0 rw
772      postgrest  18225   38
772      postgrest  18225   39* pipe fffff800196775f0 <-> fffff80019677758      0 rw
772      postgrest  18225   40* pipe fffff80019677758 <-> fffff800196775f0      0 rw
772      postgrest  18225   41* pipe fffff80007ef75f0 <-> fffff80007ef7758      0 rw
772      postgrest  18225   42* pipe fffff80007ef7758 <-> fffff80007ef75f0      0 rw
772      postgrest  18225   43* pipe fffff802286a82f8 <-> fffff802286a8460      0 rw
772      postgrest  18225   44* pipe fffff802286a8460 <-> fffff802286a82f8      0 rw
772      postgrest  18225   45* pipe fffff8000f02f5f0 <-> fffff8000f02f758      0 rw
772      postgrest  18225   46* pipe fffff8000f02f758 <-> fffff8000f02f5f0      0 rw
772      postgrest  18225   48* local stream fffff8026d2e5200
772      postgrest  18225   52* local stream fffff800464f9100

@steve-chavez
Member

I've investigated the problem further.
The issue doesn't depend on the socket type (AF_INET or AF_UNIX).

Thanks a lot for helping here @drTr0jan

But after the issue (the hang in question), PostgREST is left with another FD:

How did you manage to reproduce the hang?

@steve-chavez
Member

The hint above suggests that this issue was related to file descriptors.

On #2042 we noted that PostgREST will stop processing new requests when EMFILE (Too many open files) is reached; a dependency was upgraded to fix this on #2158.
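
As a rough check for descriptor exhaustion on FreeBSD (a sketch reusing the PID from the sockstat output above):

# Count open descriptors for the PostgREST process and compare against
# the per-process file descriptor limit.
fstat -p 18225 | wc -l
sysctl kern.maxfilesperproc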

Assuming this issue will be fixed by the above change as well, I'll close it - but please reopen if the problem reappears.

@kevinmlong

@steve-chavez - I'm running a self-hosted Supabase instance (with an nginx proxy in front) and running into a similar problem where we're seeing intermittent 503 errors when making requests from our clients.

I've posted a GitHub issue with Supabase here and on their Discord here.

Sounds like this may be related. I'd appreciate your insight on this issue we're experiencing.

@steve-chavez
Member

@kevinmlong That looks unrelated to this issue since you're getting a response. I've made a suggestion on supabase/supabase#12918 (comment).
