
Unable to allocate 5.27 EiB ... when trying to access cluster dashboard through wrong URL #8368

Closed
jonashaag opened this issue Nov 20, 2023 · 10 comments

Comments

@jonashaag
Contributor

jonashaag commented Nov 20, 2023

Sorry for the screenshot; I don't have copy-and-paste or GitHub access on that machine.

Describe the issue:

When you try to open the dashboard through the link printed by print(client), you trigger the "Unable to allocate ... EiB" exception from the title in the scheduler.

Minimal Complete Verifiable Example:

client = Client()
print(client) # Prints a tcp:// URL that's NOT the dashboard URL

Try to open that URL in the browser (I thought it was the dashboard URL).
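
For reference, the browsable dashboard address can be read from the client directly rather than from print(client); a minimal sketch, assuming a default local cluster:

from dask.distributed import Client

client = Client()
print(client)                 # shows the scheduler's tcp:// comm address, which is not browsable
print(client.dashboard_link)  # the HTTP dashboard URL, typically http://127.0.0.1:8787/status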

Anything else we need to know?:

Environment:

  • Dask version: 2023.11
  • Python version: 3.11
  • Operating System: Linux 64
  • Install method (conda, pip, source): Conda
@fjetter
Member

fjetter commented Nov 21, 2023

What browser are you using for this?

  • Chrome and Firefox redirect this to a Google search for me
  • Safari tells me this is not a valid URL and aborts

If I just put in the IP, I get errors like ERR_INVALID_HTTP_RESPONSE or similar in Chrome and Safari.

Firefox actually manages to get through to the server and just prints the handshake info (the stuff we send over the network to a connecting peer).

[screenshot: Firefox rendering the scheduler's raw handshake bytes]

I can see how your exception can be triggered from our server code but the browser must send something rather specific to trigger such a response.

@jonashaag
Contributor Author

Ah, interesting, I hadn't considered this.

Chrome through JupyterLab proxy.

@RaiinmakerWes

RaiinmakerWes commented Jul 22, 2024

I ran into a very similar issue today. tl;dr Unable to allocate 6.34 EiB for an array with shape (7311138144931639129,) and data type uint8

Running Docker images for all three distributed layers:

  • Google Cloud VM running the scheduler via docker run --network host --mount type=bind,source="$(pwd)"/dask-env.yaml,target=/etc/dask/dask-env.yaml,readonly --name scheduler --rm ghcr.io/dask/dask dask-scheduler
  • local machine running an ngrok tcp tunnel via CLI, and Dask worker via docker run -p 13370:13370 ghcr.io/dask/dask dask worker --contact-address tcp://n.tcp.xx-xxx-n.ngrok.io:15721 --listen-address tcp://localhost:13370 tcp://xx.xxx.xx.xxx:8786
  • local machine running Dask notebook image via docker run -p 8888:8888 ghcr.io/dask/dask-notebook

I open the Jupyter notebook in Google Chrome via the http://127.0.0.1:8888/lab?token=5ff5.... URL printed by Docker, then add a cell in the notebook to connect to the scheduler and run work.

import dask
from dask.distributed import Client

def inc(x):
    print('yay')
    return x + 1

client = Client('xx.xxx.xx.xxx:8786')

x = client.submit(inc, 10)

L = client.map(inc, range(1000))

print('x result', x.result())
print('L gather', client.gather(L))

I see the printed "yay" in my local worker, and I see the task completion debug logs in my scheduler.
However, gathering the results fails with

2024-07-22 16:22:47,876 - distributed.core - DEBUG - Message from 'tcp://[local_static_ip]:61927': {'op': 'gather', 'keys': ('inc-0a4704b07c1765924dc76f5c705ae806',), 'reply': True}
2024-07-22 16:22:47,876 - distributed.core - DEBUG - Calling into handler gather
2024-07-22 16:22:47,877 - distributed.comm.core - DEBUG - Establishing connection to [ngrok_endpoint_address]:15721
2024-07-22 16:22:47,900 - distributed.comm.tcp - DEBUG - Setting TCP keepalive: nprobes=10, idle=10, interval=2
2024-07-22 16:22:47,900 - distributed.comm.tcp - DEBUG - Setting TCP user timeout: 30000 ms
2024-07-22 16:22:47,922 - distributed.utils_comm - ERROR - Unexpected error while collecting tasks ['inc-0a4704b07c1765924dc76f5c705ae806'] from tcp://[ngrok_endpoint_address]:15721
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/distributed/utils_comm.py", line 459, in retry_operation
    return await retry(
  File "/opt/conda/lib/python3.10/site-packages/distributed/utils_comm.py", line 438, in retry
    return await coro()
  File "/opt/conda/lib/python3.10/site-packages/distributed/worker.py", line 2866, in get_data_from_worker
    comm = await rpc.connect(worker)
  File "/opt/conda/lib/python3.10/site-packages/distributed/core.py", line 1533, in connect
    return connect_attempt.result()
  File "/opt/conda/lib/python3.10/site-packages/distributed/core.py", line 1423, in _connect
    comm = await connect(
  File "/opt/conda/lib/python3.10/site-packages/distributed/comm/core.py", line 377, in connect
    handshake = await comm.read()
  File "/opt/conda/lib/python3.10/site-packages/distributed/comm/tcp.py", line 227, in read
    frames_nosplit = await read_bytes_rw(stream, frames_nosplit_nbytes)
  File "/opt/conda/lib/python3.10/site-packages/distributed/comm/tcp.py", line 359, in read_bytes_rw
    buf = host_array(n)
  File "/opt/conda/lib/python3.10/site-packages/distributed/protocol/utils.py", line 29, in host_array
    return numpy.empty((n,), dtype="u1").data
numpy._core._exceptions._ArrayMemoryError: Unable to allocate 6.34 EiB for an array with shape (7311138144931639129,) and data type uint8
2024-07-22 16:22:47,924 - distributed.scheduler - ERROR - Couldn't gather keys: {'inc-0a4704b07c1765924dc76f5c705ae806': 'memory'}

I should also mention that, by monitoring the ngrok tunnel stats, I am able to see the TCP connections from the GCP scheduler into my local worker at the gather step, so I was able to verify connectivity.

(Quoting @fjetter above:) "I can see how your exception can be triggered from our server code but the browser must send something rather specific to trigger such a response."

Does my error line up with your expectations? Any ideas on why this might be happening?

@fjetter
Member

fjetter commented Jul 23, 2024

I'm not familiar with ngrok so I can't tell what's going on in your case.

The way I think the original exception was triggered is that the browser connected to the Dask server, and the server tried to engage in its application-side handshake (where it reads and writes to the TCP socket). However, instead of receiving plain bytes that correspond to our protocol, it encountered an HTTP message, which ended up triggering this exception: our protocol uses the first couple of bytes of a message to infer how much data is incoming, and we use that information to allocate memory efficiently. If those first bytes are anything else (random bytes), they are easily interpreted as a very big integer.
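
As a rough illustration of that failure mode (a hypothetical sketch, not the actual comm code; it assumes the length prefix is read as a little-endian unsigned 64-bit integer), feeding the start of an HTTP request line into such a length field produces an absurd allocation size:

import struct

# Hypothetical sketch: HTTP text arriving on a socket that expects Dask's
# binary framing, where the first 8 bytes are treated as a frame length.
http_text = b"GET / HTTP/1.1\r\nHost: localhost:8786\r\n\r\n"

(bogus_length,) = struct.unpack("<Q", http_text[:8])   # b"GET / HT" read as a uint64
print(bogus_length)                  # 6073139484287059271
print(bogus_length / 2**60, "EiB")   # ~5.27 EiB, close to the figure in the issue title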

I'm not sure what ngrok does, but if it changes the byte stream even slightly, it could cause such an exception. It could also happen if ngrok erroneously treats this connection as HTTP.

@RaiinmakerWes

Ah, I see - that makes sense.
I bet there is a connection problem in the scheduler -> ngrok -> worker direction and the error payload is triggering this.

Thanks for the insight :)

@zoltan

zoltan commented Sep 6, 2024

Just starting dask scheduler --host 0.0.0.0 in a conda environment and then trying to access http://ip:8786 results in this on 2024.7.1 from conda-forge.

@dimm0

dimm0 commented Oct 22, 2024

Any updates? I followed the guide to provision a new cluster with the k8s operator and am hitting this error.

@jacobtomlinson
Member

As @fjetter says, I think a lot of people landing on this issue are coming here because this error happens when you try to open the Dask TCP port used for communication in a browser.

Reproducer steps

  • Start the dask scheduler with dask scheduler
  • Open the Dask comm port in a browser http://localhost:8786

This results in the :���������������*�������ƒ«compressionÀ¦python“���¯pickle-protocol� message in the browser and the numpy.core._exceptions._ArrayMemoryError: Unable to allocate 7.40 EiB for an array with shape (8530211521808319815,) and data type uint8 exception in the scheduler.

This is expected behaviour. You're opening a TCP-only connection in a web browser. If you're trying to access the dashboard, you need to connect to a different port: http://localhost:8787.

The discussion about ngrok is interesting. Ngrok supports HTTP proxying (layer 7) and TCP proxying (layer 4). They support both modes as there are pros/cons to each; see this article to learn more. I assume that folks who are running into issues are using HTTP proxying instead of TCP proxying, which results in the same error as opening the TCP port in a browser. The fix should simply be to use TCP proxying.

I'm going to close this issue out as "wontfix", as hopefully this comment solves most folks' problems. I've also opened #8905 to track improving the failure mode of opening the TCP port in a browser.

If there are still ngrok-related issues that happen when using TCP proxying, then I encourage folks to open a new issue with steps to reproduce so we can look into it further.

@jacobtomlinson closed this as not planned on Oct 24, 2024
@tbazadaykin

It seems that this is indeed a bug that needs to be fixed.
Our DevOps team spent two days trying to figure out what was causing the scheduler to crash. It turned out that the production Kubernetes cluster has a monitoring system in place that pings open ports to check whether the service is alive.
Pinging a port to check service availability is a fairly common practice, so fixing this bug is definitely worth it.

@jacobtomlinson
Member

@tbazadaykin they need to ping the port via TCP, not HTTP. You mentioned Kubernetes, so I assume you're talking about a liveness probe; in that case you need to use a TCP probe. If you make an HTTP request to a non-HTTP port, there is no guarantee that it will behave as expected.
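
A minimal sketch of the difference between the two probe styles against the comm port (the host name is hypothetical; the HTTP call is left commented out because it is exactly what triggers the error):

import socket

# What a tcpSocket probe effectively does: open and close a TCP connection.
# No HTTP text is sent, so the "Unable to allocate ... EiB" error is not triggered.
with socket.create_connection(("scheduler-host", 8786), timeout=5):
    pass

# What an httpGet probe effectively does: send an HTTP request to the same port.
# The comm protocol misreads the HTTP text as a huge frame length.
# import urllib.request
# urllib.request.urlopen("http://scheduler-host:8786")  # -> EiB allocation error on the scheduler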
