Upload of large object fails with nats: error: nats: stalled with too many outstanding async published messages #993

Open
paolobarbolini opened this issue Feb 20, 2024 · 1 comment
Labels
defect Suspected defect such as a bug or regression

Comments

@paolobarbolini

Observed behavior

When uploading large files to the Object Store, the upload sometimes stalls for a few seconds and then fails with the error: nats: error: nats: stalled with too many outstanding async published messages

Expected behavior

Retries?

Server and client version

nats-server: v2.10.10
natscli: v0.1.3

Host environment

Debian 12.5 on aarch64

Steps to reproduce

Upload a 1 GB file to an object store and observe that the upload sometimes fails.

@paolobarbolini paolobarbolini added the defect Suspected defect such as a bug or regression label Feb 20, 2024
@williamstein

I just hit this too. It happens every time for me. Steps:

  1. Install release v2.10.26 of the server and v0.1.6 of the Go CLI client.
  2. Create a tmpfs, so the file I'm uploading reads quickly from "disk".
  3. Create a 1.2 GB file (the contents probably don't matter).
  4. I'm running NATS with JetStream locally with zero load (a newly installed, minimal setup) on Ubuntu 22.04 LTS on a beefy Google Cloud server.
  5. Here is what happens every single time:
/ram$ nats object add backups
/ram$ time nats object put backups 0.zfs

739 MiB / 1.2 GiB [==================================================================================>--------------------------------------------------]

nats: error: nats: stalled with too many outstanding async published messages

real    0m4.041s
user    0m0.748s
sys     0m0.471s
/ram$ time nats object put backups 0.zfs

730 MiB / 1.2 GiB [================================================================================>----------------------------------------------------]

nats: error: nats: stalled with too many outstanding async published messages

real    0m3.913s
user    0m0.693s
sys     0m0.510s
/ram$ df -h .
Filesystem      Size  Used Avail Use% Mounted on
tmpfs            12G  2.2G  9.9G  19% /ram

I don't think the expected behavior is "Retries?". Instead, object put probably needs to be implemented so that it respects streaming/flow control, rather than trying to push as much data as possible all at once and failing if the client is too fast. Unrelated to this application, I recently wrote a similar chunked file upload system that basically proxies POST requests over NATS, and these are exactly the considerations I had to deal with throughout to make it fully robust and memory efficient.
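To make the flow-control idea concrete, here is a minimal sketch of a chunked upload that caps the number of unacknowledged publishes. It is not how the nats CLI or any NATS client is implemented. The names `put_chunks`, `publish`, and `max_in_flight` are hypothetical, and `publish` stands in for any coroutine that returns once the server has acknowledged a chunk.

```python
import asyncio
from typing import Awaitable, Callable, Iterable


async def put_chunks(
    chunks: Iterable[bytes],
    publish: Callable[[bytes], Awaitable[None]],
    max_in_flight: int = 64,
) -> int:
    """Publish every chunk with at most max_in_flight publishes outstanding.

    `publish` is a hypothetical stand-in, not a real NATS client call.
    Returns the number of chunks published.
    """
    pending = set()  # publishes that have not been acknowledged yet
    sent = 0
    try:
        for chunk in chunks:
            if len(pending) >= max_in_flight:
                # Window is full: stop reading input until an ack arrives.
                done, pending = await asyncio.wait(
                    pending, return_when=asyncio.FIRST_COMPLETED
                )
                for task in done:
                    task.result()  # re-raise a failed publish immediately
            pending.add(asyncio.ensure_future(publish(chunk)))
            sent += 1
        while pending:
            # Drain the publishes that are still outstanding.
            done, pending = await asyncio.wait(
                pending, return_when=asyncio.FIRST_COMPLETED
            )
            for task in done:
                task.result()
    finally:
        for task in pending:
            task.cancel()  # only non-empty if a publish failed
    return sent
```

Because the reader waits whenever the window is full, a fast input source (such as a file on tmpfs) is slowed to the pace of the server's acknowledgements instead of overrunning a pending-message limit, and memory use stays bounded by `max_in_flight` chunks.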

I also tried uploading the same file with the Node.js client (https://github.com/nats-io/nats.js/blob/main/obj/README.md), and it fails after a while with:

(node:200771) PromiseRejectionHandledWarning: Promise rejection was handled asynchronously (rejection id: 8)
(Use `node --trace-warnings ...` to show where the warning was created)
(node:200771) PromiseRejectionHandledWarning: Promise rejection was handled asynchronously (rejection id: 9)
(node:200771) PromiseRejectionHandledWarning: Promise rejection was handled asynchronously (rejection id: 10)
(node:200771) PromiseRejectionHandledWarning: Promise rejection was handled asynchronously (rejection id: 11)
(node:200771) PromiseRejectionHandledWarning: Promise rejection was handled asynchronously (rejection id: 12)
(node:200771) PromiseRejectionHandledWarning: Promise rejection was handled asynchronously (rejection id: 13)
(node:200771) PromiseRejectionHandledWarning: Promise rejection was handled asynchronously (rejection id: 14)
(node:200771) PromiseRejectionHandledWarning: Promise rejection was handled asynchronously (rejection id: 15)
(node:200771) PromiseRejectionHandledWarning: Promise rejection was handled asynchronously (rejection id: 16)
(node:200771) PromiseRejectionHandledWarning: Promise rejection was handled asynchronously (rejection id: 17)
...
(node:200771) PromiseRejectionHandledWarning: Promise rejection was handled asynchronously (rejection id: 2476)
(node:200771) PromiseRejectionHandledWarning: Promise rejection was handled asynchronously (rejection id: 2477)
Uncaught NatsError: TIMEOUT

Anyway, hopefully a dev will try to reproduce this, since it might be easy to reproduce.

WORKAROUND:

Install pv (the 'pipe viewer') and use it to throttle the file input as follows:

/ram$ time cat 0.zfs | pv -L 100M | nats object put backups --name=0.zfs -f
1.16GiB 0:00:11 [ 100MiB/s] [                               <=>                                                                                                     ]
Object information for backups > 0.zfs

               Size: 1.2 GiB
  Modification Time: 2025-03-04 18:06:43
             Chunks: 18,990
             Digest: SHA-256 dd54977fcae05078ca7e7d95e684f997d00d2d3bf393c7d80a73cde7113d0706

real    0m11.830s
user    0m1.341s
sys     0m2.099s

Obviously the throttle rate that works depends on many things (server load, disk speed, network), so for serious use one would probably have to start with a reasonable rate and, if the upload fails, throttle more and try again.
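The "throttle more and try again" loop could be sketched roughly as follows. This is only an illustration under stated assumptions: `upload_with_backoff` and `attempt` are hypothetical names, and `attempt(rate)` is assumed to run one throttled upload at `rate` bytes per second (for example, the pv pipeline above) and return True on success.

```python
from typing import Callable


def upload_with_backoff(
    attempt: Callable[[int], bool],
    start_rate: int,
    min_rate: int,
    factor: float = 0.5,
) -> int:
    """Retry attempt(rate) at a shrinking rate until it succeeds.

    `attempt` is a hypothetical callback that performs one throttled upload
    and returns True on success. Returns the rate that worked; raises
    RuntimeError if the upload still fails at min_rate.
    """
    if not 0 < factor < 1:
        raise ValueError("factor must be between 0 and 1")
    if not 0 < min_rate <= start_rate:
        raise ValueError("need 0 < min_rate <= start_rate")

    rate = start_rate
    while True:
        if attempt(rate):
            return rate
        if rate == min_rate:
            raise RuntimeError(f"upload failed even at {min_rate} bytes/s")
        # Throttle harder, but never go below the configured floor.
        rate = max(min_rate, int(rate * factor))
```

Halving the rate on each failure is an arbitrary choice for the sketch; the floor keeps the loop from retrying forever at uselessly low speeds.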

Anyway, I've been using NATS intensely for the last 2 months, and this may be the first real bug I have hit. NATS is one of the most amazing pieces of software I've ever found.
