[BUG] NVTabular runs into OOM or dies when scaling to large dataset #1683

Open
bschifferer opened this issue Sep 23, 2022 · 5 comments
Labels: bug (Something isn't working), P1

@bschifferer
Contributor

Describe the bug
I tried multiple workflows and ran into different issues when running NVTabular workflows on large datasets in a multi-GPU setup.

Error 1: Workers die one after another
Characteristics:

  • Dataset size: ~200 million rows
  • ~200 columns
  • some Categorify ops, some min-max normalization, some lambda ops (a sketch of this kind of setup follows the traceback below)
2022-09-15 12:59:09,526 - tornado.application - ERROR - Exception in callback <function Worker.__init__.<locals>.<lambda> at 0x7fc9dd565e50>
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/tornado/ioloop.py", line 921, in _run
    val = self.callback()
  File "/usr/local/lib/python3.8/dist-packages/distributed/worker.py", line 773, in <lambda>
    lambda: self.batched_stream.send({"op": "keep-alive"}), 60000
  File "/usr/local/lib/python3.8/dist-packages/distributed/batched.py", line 137, in send
    raise CommClosedError(f"Comm {self.comm!r} already closed.")
distributed.comm.core.CommClosedError: Comm <TCP (closed) Worker->Scheduler local=tcp://127.0.0.1:48834 remote=tcp://127.0.0.1:36323> already closed.
2022-09-15 12:59:09,533 - tornado.application - ERROR - Exception in callback <function Worker.__init__.<locals>.<lambda> at 0x7f6c60fe9940>
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/tornado/ioloop.py", line 921, in _run
    val = self.callback()
  File "/usr/local/lib/python3.8/dist-packages/distributed/worker.py", line 773, in <lambda>
    lambda: self.batched_stream.send({"op": "keep-alive"}), 60000
  File "/usr/local/lib/python3.8/dist-packages/distributed/batched.py", line 137, in send
    raise CommClosedError(f"Comm {self.comm!r} already closed.")
distributed.comm.core.CommClosedError: Comm <TCP (closed) Worker->Scheduler local=tcp://127.0.0.1:48832 remote=tcp://127.0.0.1:36323> already closed.
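
For reference, a minimal sketch of the kind of multi-GPU setup and ops described above (LocalCUDACluster plus Categorify, min-max normalization, and LambdaOp); column names, paths, partition size, and the clip lambda are illustrative, not the actual workflow:

# Illustrative multi-GPU NVTabular setup; names and paths are placeholders.
from dask_cuda import LocalCUDACluster
from dask.distributed import Client
import nvtabular as nvt

cluster = LocalCUDACluster()   # one Dask worker per visible GPU
client = Client(cluster)       # older NVTabular versions may need client= passed to nvt.Workflow explicitly

cats = ["cat1", "cat2"] >> nvt.ops.Categorify()
conts = ["num1", "num2"] >> nvt.ops.NormalizeMinMax()
lambdas = ["num3"] >> nvt.ops.LambdaOp(lambda col: col.clip(lower=0))

workflow = nvt.Workflow(cats + conts + lambdas)
dataset = nvt.Dataset("/path/to/parquet/", engine="parquet", part_size="256MB")
workflow.fit_transform(dataset).to_parquet("/path/to/output/")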

Error 2: Run into OOM
Workflow:

import nvtabular as nvt

# joint Categorify over the column pair col1/col2
features1 = (
    [['col1', 'col2']] >>
    nvt.ops.Categorify()
)

# hashed Categorify for a very high-cardinality column
features2 = (
    ['col3'] >>
    nvt.ops.Categorify(
        num_buckets=10_000_000
    )
)

targets = ['target1', 'target2']
features = features1 + features2 + targets

Characteristics:

  • ~35000 files, ~350GB parquet, 15 billion rows
MemoryError: std::bad_alloc: out_of_memory: CUDA error at: /usr/include/rmm/mr/device/cuda_memory_resource.hpp:70: cudaErrorMemoryAllocation out of memory
@bschifferer added the bug, P1, and P0 labels and removed the P1 label Sep 23, 2022
@viswa-nvidia

@benfred , please check with @bschifferer on this

@EvenOldridge
Member

@rjzamora Any idea what could be happening here? I know you've been putting in some work on Categorify. I think this is happening during the compute of all uniques, which we may want to allow as an input into the op since it's a relatively straightforward piece of information to pull from a data lake.
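
If the uniques are already sitting in the data lake, they could (assuming the vocabs argument on Categorify is available in the installed version) be passed in directly so the fit-side uniques computation is skipped for those columns; a rough sketch with hypothetical column names and values:

import cudf
import nvtabular as nvt

# Hypothetical pre-computed uniques pulled from the data lake.
# If the installed Categorify supports the `vocabs` argument, it uses these
# instead of computing the uniques itself during fit.
vocabs = {
    "col1": cudf.Series(["a", "b", "c"]),
    "col2": cudf.Series([10, 20, 30]),
}

features = ["col1", "col2"] >> nvt.ops.Categorify(vocabs=vocabs)
workflow = nvt.Workflow(features)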

@rjzamora
Collaborator

Any idea what could be happening here?

I suppose there are many possibilities, depending on whether the failure happens in the fit or the transform. For example, #1692 explains two reasons why the fit could be a problem with the current implementation (the lack of a "proper" tree reduction, and the requirement to write all uniques for a given column to disk at once).
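
To help narrow that down, it may be worth running the two stages separately and seeing which one dies; a minimal sketch (paths are placeholders, `features` as defined in the issue):

import nvtabular as nvt

dataset = nvt.Dataset("/path/to/parquet/", engine="parquet")
workflow = nvt.Workflow(features)

# Stage 1: compute statistics only (Categorify uniques, min/max, etc.)
workflow.fit(dataset)
workflow.save("/path/to/fitted_workflow")

# Stage 2: apply the fitted workflow and write the transformed output
workflow.transform(dataset).to_parquet("/path/to/output/")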

@rjzamora
Collaborator

@bschifferer - I'd like to explore if #1692 (or some variation of it) can help with this. Can you share details about the system you are running on and a representative/toy dataset where you are seeing issues? (feel free to contact me offline about the dataset)
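
In case it helps to reproduce at a smaller scale first, a scaled-down synthetic stand-in with a similar shape (many parquet files, a mix of low- and very high-cardinality categoricals plus targets) could be generated along these lines; sizes and the output path are placeholders:

import numpy as np
import pandas as pd

# Toy stand-in for the real dataset: many parquet files, a few categorical
# columns of varying cardinality, and two target columns.
rows_per_file, n_files = 1_000_000, 50
rng = np.random.default_rng(42)

for i in range(n_files):
    df = pd.DataFrame({
        "col1": rng.integers(0, 10_000, rows_per_file),       # low cardinality
        "col2": rng.integers(0, 100_000, rows_per_file),      # medium cardinality
        "col3": rng.integers(0, 500_000_000, rows_per_file),  # very high cardinality
        "target1": rng.integers(0, 2, rows_per_file),
        "target2": rng.random(rows_per_file, dtype=np.float32),
    })
    df.to_parquet(f"/path/to/toy_data/part_{i:05d}.parquet")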

@viswa-nvidia

@bschifferer, please update the status of this ticket. Are we working on this dataset now?

@viswa-nvidia added the P1 label and removed the P0 label Sep 11, 2023