
Unable to allocate 5.27 EiB ... when trying to access cluster dashboard through wrong URL #8368

Closed
jonashaag opened this issue Nov 20, 2023 · 10 comments

Comments

@jonashaag
Contributor

jonashaag commented Nov 20, 2023

Sorry for the screenshot; I don't have copy-and-paste or GitHub access on that machine.

Describe the issue:

When you try to open the dashboard through the link printed by print(client), you trigger the "Unable to allocate ... EiB" exception from the title in the scheduler.

Minimal Complete Verifiable Example:

client = Client()
print(client) # Prints a tcp:// URL that's NOT the dashboard URL

Try to open that URL in the browser (I thought it was the dashboard URL).
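
For reference, the browsable dashboard address can be read from the client directly rather than from print(client); a minimal sketch, assuming a default local cluster:

from dask.distributed import Client

client = Client()
print(client)                 # shows the scheduler's tcp:// comm address, which is not browsable
print(client.dashboard_link)  # the HTTP dashboard URL, typically http://127.0.0.1:8787/status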

Anything else we need to know?:

Environment:

  • Dask version: 2023.11
  • Python version: 3.11
  • Operating System: Linux 64
  • Install method (conda, pip, source): Conda
@fjetter
Member

fjetter commented Nov 21, 2023

What browser are you using for this?

  • Chrome and Firefox redirect this to a Google search for me
  • Safari tells me this is not a valid URL and aborts

If I just put in the IP, I get errors like ERR_INVALID_HTTP_RESPONSE or similar in Chrome and Safari.

Firefox actually manages to get through to the server and just prints the handshake info (the stuff we send over the network to a connecting peer).

[screenshot: Firefox rendering the scheduler's raw handshake bytes]

I can see how your exception can be triggered from our server code but the browser must send something rather specific to trigger such a response.

@jonashaag
Contributor Author

Ah, interesting, I hadn't considered this.

Chrome through JupyterLab proxy.

@RaiinmakerWes

RaiinmakerWes commented Jul 22, 2024

I ran into a very similar issue today. tl;dr Unable to allocate 6.34 EiB for an array with shape (7311138144931639129,) and data type uint8

Running Docker images for all three distributed layers:

  • Google Cloud VM running the scheduler via docker run --network host --mount type=bind,source="$(pwd)"/dask-env.yaml,target=/etc/dask/dask-env.yaml,readonly --name scheduler --rm ghcr.io/dask/dask dask-scheduler
  • local machine running an ngrok tcp tunnel via CLI, and Dask worker via docker run -p 13370:13370 ghcr.io/dask/dask dask worker --contact-address tcp://n.tcp.xx-xxx-n.ngrok.io:15721 --listen-address tcp://localhost:13370 tcp://xx.xxx.xx.xxx:8786
  • local machine running Dask notebook image via docker run -p 8888:8888 ghcr.io/dask/dask-notebook

I open the Jupyter notebook in Google Chrome via the http://127.0.0.1:8888/lab?token=5ff5.... URL printed by Docker, then add a cell in the notebook to connect to the scheduler and run work.

import dask
from dask.distributed import Client

def inc(x):
    print('yay')
    return x + 1

client = Client('xx.xxx.xx.xxx:8786')

x = client.submit(inc, 10)

L = client.map(inc, range(1000))

print('x result', x.result())
print('L gather', client.gather(L))

I see the printed "yay" in my local worker, and I see the task completion debug logs in my scheduler.
However, gathering the results fails with

2024-07-22 16:22:47,876 - distributed.core - DEBUG - Message from 'tcp://[local_static_ip]:61927': {'op': 'gather', 'keys': ('inc-0a4704b07c1765924dc76f5c705ae806',), 'reply': True}
2024-07-22 16:22:47,876 - distributed.core - DEBUG - Calling into handler gather
2024-07-22 16:22:47,877 - distributed.comm.core - DEBUG - Establishing connection to [ngrok_endpoint_address]:15721
2024-07-22 16:22:47,900 - distributed.comm.tcp - DEBUG - Setting TCP keepalive: nprobes=10, idle=10, interval=2
2024-07-22 16:22:47,900 - distributed.comm.tcp - DEBUG - Setting TCP user timeout: 30000 ms
2024-07-22 16:22:47,922 - distributed.utils_comm - ERROR - Unexpected error while collecting tasks ['inc-0a4704b07c1765924dc76f5c705ae806'] from tcp://[ngrok_endpoint_address]:15721
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/distributed/utils_comm.py", line 459, in retry_operation
    return await retry(
  File "/opt/conda/lib/python3.10/site-packages/distributed/utils_comm.py", line 438, in retry
    return await coro()
  File "/opt/conda/lib/python3.10/site-packages/distributed/worker.py", line 2866, in get_data_from_worker
    comm = await rpc.connect(worker)
  File "/opt/conda/lib/python3.10/site-packages/distributed/core.py", line 1533, in connect
    return connect_attempt.result()
  File "/opt/conda/lib/python3.10/site-packages/distributed/core.py", line 1423, in _connect
    comm = await connect(
  File "/opt/conda/lib/python3.10/site-packages/distributed/comm/core.py", line 377, in connect
    handshake = await comm.read()
  File "/opt/conda/lib/python3.10/site-packages/distributed/comm/tcp.py", line 227, in read
    frames_nosplit = await read_bytes_rw(stream, frames_nosplit_nbytes)
  File "/opt/conda/lib/python3.10/site-packages/distributed/comm/tcp.py", line 359, in read_bytes_rw
    buf = host_array(n)
  File "/opt/conda/lib/python3.10/site-packages/distributed/protocol/utils.py", line 29, in host_array
    return numpy.empty((n,), dtype="u1").data
numpy._core._exceptions._ArrayMemoryError: Unable to allocate 6.34 EiB for an array with shape (7311138144931639129,) and data type uint8
2024-07-22 16:22:47,924 - distributed.scheduler - ERROR - Couldn't gather keys: {'inc-0a4704b07c1765924dc76f5c705ae806': 'memory'}

I should also mention that, by monitoring the ngrok tunnel stats, I am able to see the TCP connections from the GCP scheduler into my local worker at the gather step, so I was able to verify connectivity.

(Quoting @fjetter above:) "I can see how your exception can be triggered from our server code but the browser must send something rather specific to trigger such a response."

Does my error line up with your expectations? Any ideas on why this might be happening?

@fjetter
Member

fjetter commented Jul 23, 2024

I'm not familiar with ngrok so I can't tell what's going on in your case.

The way I think the original exception was triggered is that the browser connected to the Dask server, and the server tried to engage in its application-side handshake (where it reads and writes to the TCP socket). However, instead of receiving plain bytes that correspond to our protocol, it encountered an HTTP message, which ended up triggering this exception: our protocol uses the first couple of bytes of a message to infer how much data is incoming, and we use that information to allocate memory efficiently. If those first bytes are anything else (random bytes), they are easily interpreted as a very big integer.
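
As a rough illustration of that failure mode (a hypothetical sketch, not the actual comm code; it assumes the length prefix is read as a little-endian unsigned 64-bit integer), feeding the start of an HTTP request line into such a length field produces an absurd allocation size:

import struct

# Hypothetical sketch: HTTP text arriving on a socket that expects Dask's
# binary framing, where the first 8 bytes are treated as a frame length.
http_text = b"GET / HTTP/1.1\r\nHost: localhost:8786\r\n\r\n"

(bogus_length,) = struct.unpack("<Q", http_text[:8])   # b"GET / HT" read as a uint64
print(bogus_length)                  # 6073139484287059271
print(bogus_length / 2**60, "EiB")   # ~5.27 EiB, close to the figure in the issue title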

I'm not sure what ngrok does, but if it changes the byte stream even slightly, it could cause such an exception. It could also happen if ngrok erroneously treats this connection as HTTP.

@RaiinmakerWes

Ah, I see - that makes sense.
I bet there is a connection problem in the scheduler -> ngrok -> worker direction and the error payload is triggering this.

Thanks for the insight :)

@zoltan

zoltan commented Sep 6, 2024

Just starting dask scheduler --host 0.0.0.0 in a conda environment and then trying to access http://ip:8786 results in this on 2024.7.1 from conda-forge.

@dimm0

dimm0 commented Oct 22, 2024

Any updates? I followed the guide to provision a new cluster with the k8s operator and am hitting this error.

@jacobtomlinson
Member

As @fjetter says, I think a lot of people landing on this issue are coming here because this error happens when you try to open the Dask TCP port used for communication in a browser.

Reproducer steps

  • Start the dask scheduler with dask scheduler
  • Open the Dask comm port in a browser http://localhost:8786

This results in the :���������������*�������ƒ«compressionÀ¦python“���¯pickle-protocol� message in the browser and the numpy.core._exceptions._ArrayMemoryError: Unable to allocate 7.40 EiB for an array with shape (8530211521808319815,) and data type uint8 exception in the scheduler.

This is expected behaviour. You're opening a TCP-only connection in a web browser. If you're trying to access the dashboard, you need to connect to a different port: http://localhost:8787.

The discussion about ngrok is interesting. Ngrok supports HTTP proxying (layer 7) and TCP proxying (layer 4). They support both modes as there are pros/cons to each; see this article to learn more. I assume that folks who are running into issues are using HTTP proxying instead of TCP proxying, which results in the same error as opening the TCP port in a browser. The fix should simply be to use TCP proxying.

I'm going to close this issue out as "wontfix", as hopefully this comment solves most folks' problems. I've also opened #8905 to track improving the failure mode of opening the TCP port in a browser.

If there are still ngrok-related issues that happen when using TCP proxying, then I encourage folks to open a new issue with steps to reproduce so we can look into it further.

@jacobtomlinson closed this as not planned on Oct 24, 2024
@tbazadaykin

It seems that this is indeed a bug that needs to be fixed.
Our DevOps team spent two days trying to figure out what was causing the scheduler to crash. It turned out that the production Kubernetes cluster has a monitoring system in place that pings open ports to check whether the service is alive.
Pinging a port to check service availability is a fairly common practice, so fixing this bug is definitely worth it.

@jacobtomlinson
Member

@tbazadaykin they need to ping the port via TCP, not HTTP. You mentioned Kubernetes, so I assume you're talking about a liveness probe; in that case you need to use a TCP probe. If you make an HTTP request to a non-HTTP port, there is no guarantee that it will behave as expected.
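
A minimal sketch of the difference between the two probe styles against the comm port (the host name is hypothetical; the HTTP call is left commented out because it is exactly what triggers the error):

import socket

# What a tcpSocket probe effectively does: open and close a TCP connection.
# No HTTP text is sent, so the "Unable to allocate ... EiB" error is not triggered.
with socket.create_connection(("scheduler-host", 8786), timeout=5):
    pass

# What an httpGet probe effectively does: send an HTTP request to the same port.
# The comm protocol misreads the HTTP text as a huge frame length.
# import urllib.request
# urllib.request.urlopen("http://scheduler-host:8786")  # -> EiB allocation error on the scheduler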
