[Ray component: data] ray.data.read_text raises numpy.core._exceptions._ArrayMemoryError: Unable to allocate #46293

Open
Ox0400 opened this issue Jun 27, 2024 · 2 comments
Labels
bug: Something that is supposed to be working; but isn't
data: Ray Data-related issues
P1: Issue that should be fixed within a few weeks

Comments

@Ox0400 (Contributor) commented Jun 27, 2024

What happened + What you expected to happen

ray.data.read_text crashes when loading a large text file.

I just want to repartition the file without parsing each line as JSON (to save CPU and memory); I know I could use read_json instead.
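For reference, here is a minimal sketch of the workflow I was attempting (the file path and partition count are placeholders, not the exact script that failed):

import ray

# Each line of the file becomes one row in a "text" column; no JSON parsing.
ds = ray.data.read_text("xxxx.json")   # placeholder path

# Spread the rows across more, smaller blocks to reduce per-task CPU/memory.
ds = ds.repartition(64)                # partition count chosen arbitrarily

# Trigger execution; the error reported below is raised during the read itself.
ds.materialize()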

Versions / Dependencies

ray==2.31.0

Reproduction script

root@a135306f9a92:/var/work# du -sh xxxx.json
447M    xxxx.json
root@a135306f9a92:/var/work# wc -l xxxx.json
51776 xxxx.json
root@a135306f9a92:/var/work# 
>>> ds = ray.data.read_text('xxxxxx.json')
>>> ds.schema()
2024-06-27 16:59:34,573 INFO streaming_executor.py:108 -- Starting execution of Dataset. Full logs are in /tmp/ray/session_2024-06-27_16-52-09_868831_7603/logs/ray-data
2024-06-27 16:59:34,574 INFO streaming_executor.py:109 -- Execution plan of Dataset: InputDataBuffer[Input] -> TaskPoolMapOperator[ReadText]
Running 0:   0%|          | 0/1 [00:00<?, ?it/s]
2024-06-27 17:00:37,810 ERROR streaming_executor_state.py:455 -- An exception was raised from a task of operator "ReadText->SplitBlocks(67)". Dataset execution will now abort. To ignore this exception and continue, set DataContext.max_errored_blocks.
2024-06-27 17:00:37,821 ERROR exceptions.py:73 -- Exception occurred in Ray Data or Ray Core internal code. If you continue to see this error, please open an issue on the Ray project GitHub page with the full stack trace below: https://github.com/ray-project/ray/issues/new/choose
ray.data.exceptions.SystemException

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.10/site-packages/ray/data/dataset.py", line 2528, in schema
    base_schema = self._plan.schema(fetch_if_missing=False)
  File "/usr/local/lib/python3.10/site-packages/ray/data/_internal/plan.py", line 353, in schema
    blocks_with_metadata, _, _ = self.execute_to_iterator()
  File "/usr/local/lib/python3.10/site-packages/ray/data/exceptions.py", line 86, in handle_trace
    raise e.with_traceback(None) from SystemException()
ray.exceptions.RayTaskError(ValueError): ray::ReadText->SplitBlocks(67)() (pid=8006, ip=172.17.0.3)
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 414. GiB for an array with shape (4900,) and data type <U22697406

During handling of the above exception, another exception occurred:

ray::ReadText->SplitBlocks(67)() (pid=8006, ip=172.17.0.3)
  File "/usr/local/lib/python3.10/site-packages/ray/data/_internal/execution/operators/map_operator.py", line 438, in _map_task
    for b_out in map_transformer.apply_transform(iter(blocks), ctx):
  File "/usr/local/lib/python3.10/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 451, in __call__
    for block in blocks:
  File "/usr/local/lib/python3.10/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 392, in __call__
    for data in iter:
  File "/usr/local/lib/python3.10/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 253, in __call__
    yield from self._block_fn(input, ctx)
  File "/usr/local/lib/python3.10/site-packages/ray/data/_internal/planner/plan_read_op.py", line 92, in do_read
    yield from call_with_retry(
  File "/usr/local/lib/python3.10/site-packages/ray/data/datasource/datasource.py", line 197, in __call__
    yield from result
  File "/usr/local/lib/python3.10/site-packages/ray/data/datasource/file_based_datasource.py", line 256, in read_task_fn
    yield from read_files(read_paths)
  File "/usr/local/lib/python3.10/site-packages/ray/data/datasource/file_based_datasource.py", line 222, in read_files
    for block in read_stream(f, read_path):
  File "/usr/local/lib/python3.10/site-packages/ray/data/datasource/text_datasource.py", line 41, in _read_stream
    builder.add(item)
  File "/usr/local/lib/python3.10/site-packages/ray/data/_internal/delegating_block_builder.py", line 38, in add
    self._builder.add(item)
  File "/usr/local/lib/python3.10/site-packages/ray/data/_internal/table_block.py", line 86, in add
    self._compact_if_needed()
  File "/usr/local/lib/python3.10/site-packages/ray/data/_internal/table_block.py", line 152, in _compact_if_needed
    columns = {
  File "/usr/local/lib/python3.10/site-packages/ray/data/_internal/table_block.py", line 153, in <dictcomp>
    key: convert_udf_returns_to_numpy(col) for key, col in self._columns.items()
  File "/usr/local/lib/python3.10/site-packages/ray/data/_internal/numpy_support.py", line 102, in convert_udf_returns_to_numpy
    raise ValueError(
ValueError: Failed to convert column values to numpy array: (['{"txt": "\\n\\"\\"\\"\\nA suite of tools for dealing with notebooks...\\n\\"\\"\\"\\n\\nimport gtk\\n\\ndef prepNotebook(notebook=None, group=1):\\n    \\"\\"\\"\\n    Setup a notebook for use in vw...): Unable to allocate 414. GiB for an array with shape (4900,) and data type <U22697406.
>>> 
# The item and sub-item types and lengths:
type(udf_return_col)=<class 'list'>  len(udf_return_col)=4900 
type(udf_return_col[0])=<class 'str'> len(udf_return_col[0])=2576
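For context, a minimal sketch of the root cause using plain numpy (not Ray internals): numpy stores a 1-D array of Python strings with a fixed-width unicode dtype sized to the longest element, so a single very long line forces every row in the block to that width.

import numpy as np

# Tiny demonstration of the dtype promotion: every element is padded out to
# the width of the longest string.
arr = np.array(["short", "a" * 50])
print(arr.dtype)                            # <U50

# Back-of-the-envelope for the failure above: 4900 rows, longest line of
# 22,697,406 characters, 4 bytes per UTF-32 code point.
bytes_needed = 4_900 * 22_697_406 * 4
print(f"{bytes_needed / 2**30:.1f} GiB")    # ~414.3 GiB, matching the error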

Issue Severity

High: It blocks me from completing my task.

@Ox0400 added the bug and triage labels Jun 27, 2024
@Ox0400 (Contributor, Author) commented Jun 27, 2024

#46298 (comment)

@Ox0400 (Contributor, Author) commented Jul 2, 2024

😊

@anyscalesam added the data label Jul 8, 2024
@scottjlee added the P1 label and removed the triage label Jul 16, 2024