[BUG] NVTabular runs into OOM or dies when scaling to large dataset #1683

Open
bschifferer opened this issue Sep 23, 2022 · 5 comments
Labels: bug (Something isn't working), P1

@bschifferer
Contributor

Describe the bug
I tried multiple workflows and ran into different issues when running NVTabular workflows on large datasets in a multi-GPU setup.

Error 1: Workers die one after another
Characteristics:

  • Dataset size: ~200 million rows
  • ~200 columns
  • some Categorify ops, some min-max normalization, some lambda ops (a sketch of this kind of setup follows the traceback below)
2022-09-15 12:59:09,526 - tornado.application - ERROR - Exception in callback <function Worker.__init__.<locals>.<lambda> at 0x7fc9dd565e50>
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/tornado/ioloop.py", line 921, in _run
    val = self.callback()
  File "/usr/local/lib/python3.8/dist-packages/distributed/worker.py", line 773, in <lambda>
    lambda: self.batched_stream.send({"op": "keep-alive"}), 60000
  File "/usr/local/lib/python3.8/dist-packages/distributed/batched.py", line 137, in send
    raise CommClosedError(f"Comm {self.comm!r} already closed.")
distributed.comm.core.CommClosedError: Comm <TCP (closed) Worker->Scheduler local=tcp://127.0.0.1:48834 remote=tcp://127.0.0.1:36323> already closed.
2022-09-15 12:59:09,533 - tornado.application - ERROR - Exception in callback <function Worker.__init__.<locals>.<lambda> at 0x7f6c60fe9940>
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/tornado/ioloop.py", line 921, in _run
    val = self.callback()
  File "/usr/local/lib/python3.8/dist-packages/distributed/worker.py", line 773, in <lambda>
    lambda: self.batched_stream.send({"op": "keep-alive"}), 60000
  File "/usr/local/lib/python3.8/dist-packages/distributed/batched.py", line 137, in send
    raise CommClosedError(f"Comm {self.comm!r} already closed.")
distributed.comm.core.CommClosedError: Comm <TCP (closed) Worker->Scheduler local=tcp://127.0.0.1:48832 remote=tcp://127.0.0.1:36323> already closed.
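
For reference, a minimal sketch of the kind of multi-GPU setup and ops described above (LocalCUDACluster plus Categorify, min-max normalization, and LambdaOp); column names, paths, partition size, and the clip lambda are illustrative, not the actual workflow:

# Illustrative multi-GPU NVTabular setup; names and paths are placeholders.
from dask_cuda import LocalCUDACluster
from dask.distributed import Client
import nvtabular as nvt

cluster = LocalCUDACluster()   # one Dask worker per visible GPU
client = Client(cluster)       # older NVTabular versions may need client= passed to nvt.Workflow explicitly

cats = ["cat1", "cat2"] >> nvt.ops.Categorify()
conts = ["num1", "num2"] >> nvt.ops.NormalizeMinMax()
lambdas = ["num3"] >> nvt.ops.LambdaOp(lambda col: col.clip(lower=0))

workflow = nvt.Workflow(cats + conts + lambdas)
dataset = nvt.Dataset("/path/to/parquet/", engine="parquet", part_size="256MB")
workflow.fit_transform(dataset).to_parquet("/path/to/output/")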

Error 2: Run into OOM
Workflow:

import nvtabular as nvt

# joint Categorify over the column pair col1/col2
features1 = (
    [['col1', 'col2']] >>
    nvt.ops.Categorify()
)

# hashed Categorify for a very high-cardinality column
features2 = (
    ['col3'] >>
    nvt.ops.Categorify(
        num_buckets=10_000_000
    )
)

targets = ['target1', 'target2']
features = features1 + features2 + targets

Characteristics:

  • ~35000 files, ~350GB parquet, 15 billion rows
MemoryError: std::bad_alloc: out_of_memory: CUDA error at: /usr/include/rmm/mr/device/cuda_memory_resource.hpp:70: cudaErrorMemoryAllocation out of memory
@bschifferer added the bug, P1, and P0 labels and removed the P1 label Sep 23, 2022
@viswa-nvidia

@benfred , please check with @bschifferer on this

@EvenOldridge
Member

@rjzamora Any idea what could be happening here? I know you've been putting in some work on Categorify. I think this is happening during the compute of all uniques, which we may want to allow as an input into the op since it's a relatively straightforward piece of information to pull from a data lake.
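
If the uniques are already sitting in the data lake, they could (assuming the vocabs argument on Categorify is available in the installed version) be passed in directly so the fit-side uniques computation is skipped for those columns; a rough sketch with hypothetical column names and values:

import cudf
import nvtabular as nvt

# Hypothetical pre-computed uniques pulled from the data lake.
# If the installed Categorify supports the `vocabs` argument, it uses these
# instead of computing the uniques itself during fit.
vocabs = {
    "col1": cudf.Series(["a", "b", "c"]),
    "col2": cudf.Series([10, 20, 30]),
}

features = ["col1", "col2"] >> nvt.ops.Categorify(vocabs=vocabs)
workflow = nvt.Workflow(features)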

@rjzamora
Collaborator

Any idea what could be happening here?

I suppose there are many possibilities, depending on whether the failure happens in the fit or the transform. For example, #1692 explains two reasons why the fit could be a problem with the current implementation (the lack of a "proper" tree reduction, and the requirement to write all uniques for a given column to disk at once).
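
To help narrow that down, it may be worth running the two stages separately and seeing which one dies; a minimal sketch (paths are placeholders, `features` as defined in the issue):

import nvtabular as nvt

dataset = nvt.Dataset("/path/to/parquet/", engine="parquet")
workflow = nvt.Workflow(features)

# Stage 1: compute statistics only (Categorify uniques, min/max, etc.)
workflow.fit(dataset)
workflow.save("/path/to/fitted_workflow")

# Stage 2: apply the fitted workflow and write the transformed output
workflow.transform(dataset).to_parquet("/path/to/output/")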

@rjzamora
Collaborator

@bschifferer - I'd like to explore if #1692 (or some variation of it) can help with this. Can you share details about the system you are running on and a representative/toy dataset where you are seeing issues? (feel free to contact me offline about the dataset)
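
In case it helps to reproduce at a smaller scale first, a scaled-down synthetic stand-in with a similar shape (many parquet files, a mix of low- and very high-cardinality categoricals plus targets) could be generated along these lines; sizes and the output path are placeholders:

import numpy as np
import pandas as pd

# Toy stand-in for the real dataset: many parquet files, a few categorical
# columns of varying cardinality, and two target columns.
rows_per_file, n_files = 1_000_000, 50
rng = np.random.default_rng(42)

for i in range(n_files):
    df = pd.DataFrame({
        "col1": rng.integers(0, 10_000, rows_per_file),       # low cardinality
        "col2": rng.integers(0, 100_000, rows_per_file),      # medium cardinality
        "col3": rng.integers(0, 500_000_000, rows_per_file),  # very high cardinality
        "target1": rng.integers(0, 2, rows_per_file),
        "target2": rng.random(rows_per_file, dtype=np.float32),
    })
    df.to_parquet(f"/path/to/toy_data/part_{i:05d}.parquet")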

@viswa-nvidia

@bschifferer, please update the status of this ticket. Are we working on this dataset now?

@viswa-nvidia added the P1 label and removed the P0 label Sep 11, 2023