Improve dataset repr with polars-style table formatting #59748

ayushm98 · 2025-12-29T23:54:43Z

Summary

Transforms Dataset repr from plain text to an attractive polars-style table format with box-drawing characters.

Before

>>> ray.data.read_parquet("example://iris.parquet")
Dataset(num_rows=?, schema={sepal.length: double, sepal.width: double, petal.length: double, petal.width: double, variety: string})

After

>>> ray.data.read_parquet("example://iris.parquet")
shape: (?, 5)
┌──────────────┬─────────────┬──────────────┬─────────────┬─────────┐
│ sepal.length │ sepal.width │ petal.length │ petal.width │ variety │
│ --- double   │ --- double  │ --- double   │ --- double  │ --- str │
╞══════════════╪═════════════╪══════════════╪═════════════╪═════════╡
└──────────────┴─────────────┴──────────────┴─────────────┴─────────┘

Much easier to read at a glance!

Implementation

Added _format_dataset_as_table() helper with Unicode box-drawing
Modified ExecutionPlan.get_plan_as_string() to use table formatter
Shows shape (num_rows, num_cols) similar to polars
Displays num_blocks for MaterializedDataset
Column types shown inline with names (e.g., "--- double")
Dynamic column width calculation based on content
Long column names/types automatically truncated

Future Enhancements

Sample row display for materialized datasets (marked with TODO)
Could add ellipsis rows for large datasets
Configurable max columns to display

Testing

Verified repr output for:

Non-materialized datasets (shows schema only)
Materialized datasets (shows schema + shape)
Various column types and widths

gemini-code-assist

Code Review

This pull request significantly improves the readability of dataset representations by adopting a polars-style table format. The new _format_dataset_as_table helper function is a great addition, and the refactoring in ExecutionPlan.get_plan_as_string simplifies the code nicely. I've identified a bug in how nested plans are indented, which I've provided a fix for. I also have a suggestion to improve the clarity of the column width calculation logic. Overall, this is a valuable improvement.

gemini-code-assist · 2025-12-29T23:57:34Z

python/ray/data/_internal/plan.py

+            indented_lines = []
+            for line in dataset_str.split("\n"):
+                indented_lines.append(f"{INDENT_STR * (plan_max_depth - 1)}+- {line}")
+            plan_str += "\n".join(indented_lines)


The current indentation logic for nested plans incorrectly prefixes every line of the dataset's table representation with +- . This breaks the visual hierarchy of the plan tree. The +- should only prefix the first line of the nested dataset block, and subsequent lines should be indented to align with the content of that first line.

lines = dataset_str.split("\n") indented_lines = [f"{INDENT_STR * (plan_max_depth - 1)}+- {lines[0]}"] subsequent_indent = INDENT_STR * plan_max_depth indented_lines.extend(subsequent_indent + line for line in lines[1:]) plan_str += "\n".join(indented_lines)

gemini-code-assist · 2025-12-29T23:57:34Z

python/ray/data/_internal/plan.py

+            for row in sample_rows[:max_rows]:
+                if name in row:
+                    val_str = str(row[name])
+                    # Truncate long values
+                    if len(val_str) > 50:
+                        val_str = val_str[:47] + "..."
+                    width = max(width, len(val_str))


The logic for calculating column width contribution from sample rows can be made more direct and readable. Instead of creating a truncated string to then measure its length, you can directly use min() with the value's length and the cap of 50. This avoids unnecessary string manipulation and makes the intent clearer.

for row in sample_rows[:max_rows]: if name in row: val_len = len(str(row[name])) # The width contribution from a value is capped at 50. width = max(width, min(val_len, 50))

cursor · 2025-12-29T23:57:36Z

python/ray/data/_internal/plan.py

+            schema=schema,
+            num_rows=count,
+            num_blocks=num_blocks,
+            sample_rows=None  # TODO: Add sample data fetching in future PR


Dataset name is lost in new table representation

The _format_dataset_as_table() function doesn't accept a name parameter and self._dataset_name is never passed to it. The old code displayed name={self._dataset_name} when set, but this information is now completely lost in the representation. Named datasets will no longer show their names, which is a regression in functionality.

Additional Locations (1)

python/ray/data/_internal/plan.py#L36-L38

cursor · 2025-12-29T23:57:36Z

python/ray/data/_internal/plan.py

+            indented_lines = []
+            for line in dataset_str.split("\n"):
+                indented_lines.append(f"{INDENT_STR * (plan_max_depth - 1)}+- {line}")
+            plan_str += "\n".join(indented_lines)


Tree prefix incorrectly added to every table line

When plan_max_depth > 0, the code adds the +- prefix to every line of the multi-line table output. The +- prefix is a tree structure indicator that should only appear on the first line. Subsequent lines should have only indentation (spaces), not the +- prefix. This produces malformed output like +- shape:... followed by +- ┌───... instead of proper continuation indentation.

cursor · 2025-12-29T23:57:36Z

python/ray/data/_internal/plan.py

+        Formatted table string with box-drawing characters
+    """
+    if not schema or not schema.names:
+        return f"Dataset(num_rows={num_rows})"


AttributeError when schema is a type object

The _format_dataset_as_table() function doesn't handle the case when schema is a Python type object. The schema() method returns Union[type, "pyarrow.lib.Schema"], and the old code at line 343 explicitly handled isinstance(schema, type). However, _format_dataset_as_table() accesses schema.names which will raise AttributeError for type objects since they don't have a .names attribute.

Additional Locations (1)

python/ray/data/_internal/plan.py#L362-L364

Fixes ray-project#59482 Dataset repr now uses an attractive table format with box-drawing characters, similar to polars, making it much easier to read schema and dataset information at a glance. Before: Dataset(num_rows=150, schema={sepal.length: double, sepal.width: double, ...}) After: shape: (150, 5) ┌──────────────┬─────────────┬──────────────┬─────────────┬─────────┐ │ sepal.length │ sepal.width │ petal.length │ petal.width │ variety │ │ --- double │ --- double │ --- double │ --- double │ --- str │ ╞══════════════╪═════════════╪══════════════╪═════════════╪═════════╡ └──────────────┴─────────────┴──────────────┴─────────────┴─────────┘ Changes: - Added _format_dataset_as_table() helper function - Modified get_plan_as_string() to use table formatter - Shows shape (rows, cols) and num_blocks for materialized datasets - Uses Unicode box-drawing characters for clean borders - Column types displayed inline with column names - Future: Can add sample row display (marked with TODO) Significantly improves developer UX when inspecting datasets. Signed-off-by: ayushm98 <[email protected]>

cursor · 2026-01-11T09:53:08Z

python/ray/data/_internal/plan.py

+    for col_type, width in zip(col_types, col_widths):
+        type_str = f"--- {col_type}"
+        parts.append(f" {type_str:{width}} │")
+    lines.append("".join(parts))


Column names and types not truncated despite width cap

Low Severity

Column widths are capped at 50 characters on line 75, but when outputting column names (line 96) and types (line 103), Python's format specifier {name:{width}} only pads shorter strings - it doesn't truncate longer ones. If a column name or type exceeds 50 characters, it will overflow the allocated width, causing misalignment between the content and the box-drawing borders. The PR description claims "Long column names/types automatically truncated" but this truncation logic is missing.

ayushm98 · 2026-01-11T13:45:43Z

I've added the DCO sign-off. The failing microcheck appears to be related to data tests and linting - I'll investigate the specific failures and push fixes if needed.

owenowenisme

I think #59631 is already doing this.
I can find you something else to work onif I you are interested in.

bveeramani · 2026-01-12T08:19:39Z

Hey @ayushm98, thanks for opening this PR!

As @owenowenisme mentioned, I think we're going to proceed with #59631 for now since it's farther along the review process.

Let me know us know if you want to pick something else up -- there's a lot to do

ayushm98 requested a review from a team as a code owner December 29, 2025 23:54

gemini-code-assist bot reviewed Dec 29, 2025

View reviewed changes

cursor bot reviewed Dec 29, 2025

View reviewed changes

ray-gardener bot added data Ray Data-related issues community-contribution Contributed by the community labels Dec 30, 2025

ayushm98 force-pushed the improve-dataset-repr branch from 611c260 to 9749c1a Compare January 11, 2026 09:47

cursor bot reviewed Jan 11, 2026

View reviewed changes

owenowenisme reviewed Jan 12, 2026

View reviewed changes

bveeramani closed this Jan 12, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Improve dataset repr with polars-style table formatting #59748

Improve dataset repr with polars-style table formatting #59748

Uh oh!

ayushm98 commented Dec 29, 2025

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

gemini-code-assist bot Dec 29, 2025

Uh oh!

gemini-code-assist bot Dec 29, 2025

Uh oh!

cursor bot Dec 29, 2025

Uh oh!

cursor bot Dec 29, 2025

Uh oh!

cursor bot Dec 29, 2025

Uh oh!

cursor bot Jan 11, 2026

Uh oh!

ayushm98 commented Jan 11, 2026

Uh oh!

owenowenisme left a comment •

edited

Loading

Uh oh!

bveeramani commented Jan 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Improve dataset repr with polars-style table formatting #59748

Improve dataset repr with polars-style table formatting #59748

Uh oh!

Conversation

ayushm98 commented Dec 29, 2025

Summary

Before

After

Implementation

Future Enhancements

Testing

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist bot Dec 29, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Dec 29, 2025

Choose a reason for hiding this comment

Uh oh!

cursor bot Dec 29, 2025

Choose a reason for hiding this comment

Dataset name is lost in new table representation

Uh oh!

cursor bot Dec 29, 2025

Choose a reason for hiding this comment

Tree prefix incorrectly added to every table line

Uh oh!

cursor bot Dec 29, 2025

Choose a reason for hiding this comment

AttributeError when schema is a type object

Uh oh!

cursor bot Jan 11, 2026

Choose a reason for hiding this comment

Column names and types not truncated despite width cap

Uh oh!

ayushm98 commented Jan 11, 2026

Uh oh!

owenowenisme left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

bveeramani commented Jan 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

owenowenisme left a comment •

edited

Loading