[Data] Improve appearance of repr(dataset) #59631
base: master
Conversation
Code Review
This pull request introduces a new tabular representation for ray.data.Dataset, similar to polars, which significantly improves the user experience for inspecting datasets. The implementation is well-structured, with a fallback mechanism to the old plan-based representation to ensure robustness. The changes include new helper functions for building the ASCII table, and the tests have been updated accordingly to cover both the new and old repr behaviors.
I've found one area for performance improvement in the logic for collecting tail rows for the representation, which can be made more efficient. Overall, this is a great enhancement to Ray Data.
Hi @bveeramani — I put together this PR to implement the Polars-style repr requested in #59482. When you have a moment, could you take a quick look and let me know if this matches what you had in mind? Happy to adjust or follow up on any feedback.
python/ray/data/dataset.py
Outdated
    column_widths: List[int],
) -> List[str]:
    lines: List[str] = []
    top = _render_border("┌", "┬", "┐", "─", column_widths)
Nit: Could we use rounded corners? I think people find them more aesthetically pleasing
I've switched to rounded corners, and it does look more aesthetically pleasing. Thanks for the suggestion!
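For illustration, a border renderer along these lines (a hypothetical sketch, not the PR's exact helper) only needs different corner characters to switch between the square and rounded styles:

```python
from typing import List

def render_border(left: str, mid: str, right: str,
                  fill: str, column_widths: List[int]) -> str:
    """Join one run of `fill` per column (plus 2 chars of cell padding)
    with corner/junction characters, e.g. a table's top border."""
    segments = [fill * (w + 2) for w in column_widths]
    return left + mid.join(segments) + right

# Square corners: ("┌", "┬", "┐")   Rounded corners: ("╭", "┬", "╮")
top = render_border("╭", "┬", "╮", "─", [5, 3])
```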
python/ray/data/dataset.py
Outdated
logger = logging.getLogger(__name__)

_DATASET_REPR_MAX_ROWS = 10
What happens if you have 1000 columns? How does polars or other DataFrame libraries handle this?
@myandpr I would suggest reading through: https://github.com/pola-rs/polars/blob/ac0b7519/crates/polars-core/src/fmt.rs .
They handle truncation at both column and row level, and both should be configurable from the user's perspective.
We should test rendering of large vectors, strings etc. in repr to ensure that truncation works as expected. Each dtype should have its own config
@goutamvenkat-anyscale , Thanks for the pointers! Dataset.repr now uses the polars-style table, truncation knobs (rows/columns/value per dtype) are exposed via DataContext, and tests cover row/column truncation plus long string/list/ndarray cases.
I'm not entirely sure if this matches what you were envisioning, so please let me know if anything still needs adjusting.
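To make the truncation behavior concrete, here is a rough sketch of the head/tail split that polars-style truncation implies (names and the split heuristic are illustrative, not Ray's actual implementation):

```python
from typing import List, Tuple

def split_head_tail(n_items: int, max_items: int) -> Tuple[List[int], List[int]]:
    """Pick head/tail index ranges for a truncated preview.

    Returns (head_indices, tail_indices); tail is empty when nothing is cut,
    and a gap row ("…") is rendered between the two halves otherwise.
    """
    if n_items <= max_items:
        return list(range(n_items)), []
    # Show roughly half the budget at the top and half at the bottom.
    head = (max_items + 1) // 2
    tail = max_items - head
    return list(range(head)), list(range(n_items - tail, n_items))

# 1000 rows (or columns) with a 10-item budget -> first 5 and last 5.
head, tail = split_head_tail(1000, 10)
```

The same function works for columns, so a 1000-column dataset renders the first and last few columns with an ellipsis column in between.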
@myandpr Nice!! This is great 🔥 @owenowenisme could you review the implementation pls?
python/ray/data/dataset.py
Outdated
    return block.to_arrow()


@dataclass
Could we move these implementations into a separate module? dataset.py is already ~7,000 lines, and continuing to append to it will hurt maintainability.
Thanks for your review, and I have moved these implementations into a separate dataset_repr.py file.
python/ray/data/dataset.py
Outdated
    return head_rows, tail_rows, show_gap


def _determine_preview_row_targets(
Can we add some comments for this?
Updated.
@myandpr The test is failing, could you fix that?
Force-pushed from 5d5ed05 to b26992f.
Hi @owenowenisme, I've fixed the failing tests; could you take another look when you have time? Thanks a lot!
bveeramani left a comment
ty! Just left some comments.
Ping me whenever you're ready for another round of feedback, and I can give you a fast review
doc/source/data/loading-data.rst
Outdated
   :options: +ELLIPSIS

    MaterializedDataset(
       num_blocks=3,
       num_rows=3,
       schema={food: string, price: double}
    )
    shape: (3, 2)
    ...
    (Showing 3 of 3 rows)
Updated the doc examples in the files touched by this PR to show the full actual repr instead of .../ELLIPSIS (including loading-data.rst and quickstart.rst).
python/ray/data/context.py
Outdated
DEFAULT_DATASET_REPR_MAX_ROWS = env_integer("RAY_DATASET_REPR_MAX_ROWS", 10)
DEFAULT_DATASET_REPR_HEAD_ROWS = env_integer("RAY_DATASET_REPR_HEAD_ROWS", 5)
DEFAULT_DATASET_REPR_MAX_COLUMNS = env_integer("RAY_DATASET_REPR_MAX_COLUMNS", 10)
DEFAULT_DATASET_REPR_HEAD_COLUMNS = env_integer("RAY_DATASET_REPR_HEAD_COLUMNS", 5)
DEFAULT_DATASET_REPR_MAX_COLUMN_WIDTH = env_integer(
    "RAY_DATASET_REPR_MAX_COLUMN_WIDTH", 40
)
DEFAULT_DATASET_REPR_MAX_STRING_LENGTH = env_integer(
    "RAY_DATASET_REPR_MAX_STRING_LENGTH", 40
)
DEFAULT_DATASET_REPR_MAX_BYTES_LENGTH = env_integer(
    "RAY_DATASET_REPR_MAX_BYTES_LENGTH", 40
)
DEFAULT_DATASET_REPR_MAX_COLLECTION_ITEMS = env_integer(
    "RAY_DATASET_REPR_MAX_COLLECTION_ITEMS", 5
)
DEFAULT_DATASET_REPR_MAX_TENSOR_ELEMENTS = env_integer(
    "RAY_DATASET_REPR_MAX_TENSOR_ELEMENTS", 8
)
- What happens if we don't expose these configurations? Is it important that users are able to configure these?
- If it is important that we expose these to the user, I think it'd be cleaner to encapsulate them within a single dataclass to avoid flooding the top-level DataContext with several repr-specific parameters.
IMO I don't think it's super important to expose these knobs, but would love to get your take
Thanks for the feedback! I agree these knobs don’t need to be user‑exposed right now. I removed the dataset_repr_* fields from DataContext and kept the defaults internal to the repr implementation (and dropped the config-based tests accordingly).
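For reference, the single-dataclass encapsulation the reviewer floated might have looked like this (an illustrative sketch only; the PR ultimately dropped the user-facing knobs entirely):

```python
from dataclasses import dataclass

@dataclass
class DatasetReprOptions:
    """Groups repr tuning knobs so DataContext would gain a single field
    (e.g. `context.dataset_repr_options`) instead of ten flat attributes."""
    max_rows: int = 10
    head_rows: int = 5
    max_columns: int = 10
    max_column_width: int = 40

# Overriding one knob while keeping the rest at their defaults:
opts = DatasetReprOptions(max_rows=4)
```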
python/ray/data/dataset.py
Outdated
    except Exception:
        logger.debug("Falling back to plan-based Dataset repr.", exc_info=True)
        return self._plan.get_plan_as_string(self.__class__)
When would this fail?
This is a safety net for failures in the tabular repr path (e.g., schema fetch / BlockAccessor / pyarrow issues or formatting errors).
If anything in that path throws, we fall back to the plan-based repr so repr(ds) never crashes user workflows.
I don't think we should do a bare except because it prevents us from removing the old code path, and it'll also mask bugs that we would want to hear about and fix
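A narrowed version of the fallback might look like this sketch (exception types and function names are illustrative, not the PR's code); it catches only anticipated formatting failures so genuine bugs still surface:

```python
import logging

logger = logging.getLogger(__name__)

def render_repr(build_table, build_plan):
    """Render the tabular repr, falling back only on anticipated failure
    modes instead of a bare `except Exception` that would mask real bugs."""
    try:
        return build_table()
    except (ValueError, TypeError) as exc:  # known formatting failures only
        logger.warning("Tabular repr failed (%r); using plan repr.", exc)
        return build_plan()
```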
    if schema is None or not isinstance(schema, Schema):
        return self._plan.get_plan_as_string(self.__class__)
Is this the only code path where we use get_plan_as_string?
get_plan_as_string is used as a fallback in repr (when tabular rendering fails or schema is missing) and also when the dataset has no columns in _build_dataset_ascii_repr. Since it's a low-level detail, I removed tests that asserted it directly.
def test_dataset_repr(ray_start_regular_shared):
def test_dataset_plan_string(ray_start_regular_shared):
get_plan_as_string is a low-level implementation detail, and I don't think we need to test it.
I think we should either test the actual repr(ds), or if it's redundant with the tests below, I think we can just nuke this test.
Agreed. I removed the tests that directly assert get_plan_as_string (both test_dataset_plan_string and test_dataset_plan_as_string) to avoid coupling to low‑level plan formatting.
    assert "Dataset isn't materialized" in text
    assert "shape: (5, 1)" in text
    assert "│ id" in text
Also, should we explicitly assert that the repr looks like the expected string? I think it might be easier to read and catch more errors, but I guess the tradeoff is that it'd make the test pretty brittle
Updated.
    ctx.dataset_repr_max_rows = 4
    ctx.dataset_repr_head_rows = 1
Related to my comment above -- if we don't need to expose these configuration, I don't think we need these tests
    ]
    assert "MapBatches(func)" in ds.__repr__()
    plan_repr = ds._plan.get_plan_as_string(type(ds))
    assert "MapBatches(func)" in plan_repr
With this and test_map -- would it make sense to assert against Dataset.explain() instead?
Yes—updated both test_groupby_e2e and test_map to assert against Dataset.explain() output instead of repr / get_plan_as_string.
def _truncate_repr_if_needed(representation: str) -> str:
    """Ensure repr strings remain shorter than MAX_REPR_LENGTH."""
    if len(representation) < MAX_REPR_LENGTH:
        return representation

    ellipsis = "..."
    closing = ""
    if representation:
        closing = representation[-1]
        representation = representation[:-1]

    max_body_len = MAX_REPR_LENGTH - len(ellipsis) - len(closing) - 1
    if max_body_len < 0:
        max_body_len = 0

    truncated_body = representation[:max_body_len]
    return f"{truncated_body}{ellipsis}{closing}"
What happens if we don't include this change?
Without this, repr(trainer) can exceed MAX_REPR_LENGTH when non‑default args (like datasets/config) are large. There’s an existing test (test_base_trainer.py::test_repr) that asserts len(repr(trainer)) < MAX_REPR_LENGTH, so this truncation keeps that invariant.
Oh, I see. I think the problem is that for Train V1, it uses the dataset repr in the Trainer repr, and that looks pretty bad with the table format:
"<DummyTrainer datasets={'train': shape: (3, 1)\n╭───────╮\n│ item │\n│ --- │\n│ int64 │\n╞═══════╡\n│ 1 │\n│ 2 ...>"
Rather than truncate the repr, I think we should special-case datasets in the BaseTrainer repr and just do dataset._plan.get_plan_as_string(...). It's not great because it breaks abstraction barriers and uses a brittle low-level API, but I think it might be the best option for now because it's simple and we can't break Train V1
    is_gap: bool = False


def _prepare_columns_for_repr(
As I read the functions in this module, it wasn't obvious to me what each of the functions did without reading the implementation. Would you mind adding docstrings summarizing what each function does (not just repeating the function name), and describing the input and output types?
Added docstrings.
Force-pushed from e32fa05 to 6af81f2.
Conflicts fixed.
ty! Left a bunch of comments.
Some high-level feedback:
- I had difficulty reading dataset_repr.py. To make the module easier to read, I think we should remove unnecessary indirection (e.g., _DatasetReprConfig) and cut scope (e.g., remove special cases that might not be necessary for this initial implementation).
- Could you split out the pre-requisite test refactors into different PRs? We can land the changes faster that way.
doc/source/data/loading-data.rst
Outdated
        "food": ["spam", "ham", "eggs"],
        "price": [9.34, 5.37, 0.94]
    })
    df["food"] = df["food"].astype("string")
What does the output look like if we exclude this? I think readers of the documentation might feel confused as to why we include this in the code snippet
DEFAULT_ENABLE_GET_OBJECT_LOCATIONS_FOR_METRICS = False
Nit: Unintended diff?
@dataclass
class _DatasetReprConfig:
    max_rows: int
    head_rows: int
    max_columns: int
    head_columns: int
    max_column_width: int
    max_string_length: int
    max_bytes_length: int
    max_collection_items: int
    max_tensor_elements: int
Now that we've removed these configurations from the DataContext, this dataclass is a layer of indirection. Could you remove it, and also all of the dead logic? I think it'd really simplify some of this code
For example, once we remove _DatasetReprConfig , we can remove this snippet:
Before
max_columns = config.max_columns
if max_columns is None or max_columns <= 0:
max_columns = len(columns)
max_columns = max(1, max_columns)
if num_columns <= max_columns:
...
After
if num_columns <= _DATASET_REPR_MAX_COLUMNS:
...
_DATASET_REPR_MAX_STRING_LENGTH = 40
_DATASET_REPR_MAX_BYTES_LENGTH = 40
Do we need to define lengths for strings and bytes? Can we just truncate based on the column width?
_DATASET_REPR_ELLIPSIS = "…"
_DATASET_REPR_PREVIEW_MAX_DEPTH = 2
_DATASET_REPR_MAX_ROWS = 10
_DATASET_REPR_HEAD_ROWS = 5
_DATASET_REPR_MAX_COLUMNS = 10
_DATASET_REPR_HEAD_COLUMNS = 5
Could you add comments describing what each of these constants represent? I felt confused by _DATASET_REPR_PREVIEW_MAX_DEPTH and _DATASET_REPR_HEAD_COLUMNS in particular
python/ray/data/tests/test_stats.py
Outdated
    plan_str = ds._plan.get_plan_as_string(type(ds))
    assert plan_str == (
        "MapBatches(<lambda>)\n"
        "+- Dataset(name=test_ds, num_rows=100, schema={id: int64})"
    )
I don't think this assertion will make a lot of sense after this PR, because it's testing against an extremely low-level API that's mostly obsolete. I think the intention of this assertion is to make sure we include the dataset name in the repr and that the dataset name is correct.
If the user has specified a name, should we include that somewhere in the repr?
For the assertion here and later in this function, I think we should just test that the name is in the repr, like assert "test_ds" in repr(ds)
python/ray/data/tests/test_tensor.py
Outdated
    assert str(ds) == (
    plan_str = ds._plan.get_plan_as_string(type(ds))
    assert plan_str == (
Could you update the assertions in this file to explicitly check the count and scheme rather than test against get_plan_as_string?
def _build_dataset_ascii_repr(
    dataset: "Dataset",
    schema: "Schema",
    is_materialized: bool,
    config: _DatasetReprConfig,
) -> str:
This isn't easily testable because it requires a Ray cluster and can't be truly unit tested.
To make this easy to quickly test (and so that we can easily test all of the edge and special cases), could you split it into two functions:
- A higher-level function that accepts a dataset and maybe some other parameters as input
- A lower-level function that just accepts a schema, row count, dataset name, and head and tail dict[str, Any] rows
And then we can write tests for most of the cases with (2), put them in data/tests/unit, and they'll run immediately.
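A sketch of that split (all names are hypothetical, and the rendering is simplified to plain pipes): the lower-level function is pure and takes only plain data, so it can be unit-tested without a cluster, while the thin wrapper does the Ray calls:

```python
from typing import Any, Dict, List

def render_table(
    column_names: List[str],
    num_rows: int,
    head: List[Dict[str, Any]],
    tail: List[Dict[str, Any]],
) -> str:
    """Lower-level, pure function: builds the preview from plain data."""
    lines = [f"shape: ({num_rows}, {len(column_names)})"]
    lines.append(" | ".join(column_names))
    for row in head:
        lines.append(" | ".join(str(row[c]) for c in column_names))
    if tail:
        lines.append("…")  # gap row marking truncated middle rows
        for row in tail:
            lines.append(" | ".join(str(row[c]) for c in column_names))
    lines.append(f"(Showing {len(head) + len(tail)} of {num_rows} rows)")
    return "\n".join(lines)

def dataset_repr(dataset, max_rows: int = 10) -> str:
    """Higher-level wrapper: pulls rows and schema off the dataset,
    then delegates all formatting to the pure function above."""
    head = dataset.take(max_rows)
    names = dataset.schema().names
    return render_table(names, dataset.count(), head, [])
```

With this shape, edge cases (empty tail, wide schemas, odd values) can all be exercised through render_table with plain dicts.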
Force-pushed from 6af81f2 to a7565c5.
def _format_value(value: Any) -> str:
    if isinstance(value, np.generic):
        value = value.item()
    return str(value)
Newlines in cell values break table rendering
Medium Severity
The _format_value function uses str(value) directly, which for numpy arrays produces strings containing embedded newlines (e.g., '[[1. 1.]\n [1. 1.]]' for a 2x2 array). Since this 20-character string is shorter than _DATASET_REPR_MAX_COLUMN_WIDTH (40), it won't be truncated by _truncate_to_cell_width. The newline character is then passed to _render_row, where ljust() doesn't handle it, causing the table border structure to break across multiple lines and corrupting the ASCII table output.
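One possible fix (a hypothetical helper, not the PR's final code) is to collapse all whitespace runs before the width-based truncation step, so a cell is always a single line:

```python
def format_cell(value) -> str:
    """Format a value for one table cell, collapsing embedded newlines.

    str() of a 2x2 ndarray is "[[1. 1.]\n [1. 1.]]"; the newline would
    survive width truncation and break the row borders once the cell is
    ljust()-padded, so flatten every whitespace run to a single space."""
    text = str(value)
    return " ".join(text.split())

# The multi-line ndarray repr becomes a single-line cell value:
assert format_cell("[[1. 1.]\n [1. 1.]]") == "[[1. 1.] [1. 1.]]"
```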
Signed-off-by: yaommen <[email protected]>
Force-pushed from a7565c5 to 0b8fc4d.

Description
Improve the UX by making repr(dataset) look like polars output when using Ray Dataset.
test demo
Related issues
Fixes #59482