Skip to content

Conversation

@ayushm98
Copy link

Fixes #59482

Summary

Transforms Dataset repr from plain text to an attractive polars-style table format with box-drawing characters.

Before

>>> ray.data.read_parquet("example://iris.parquet")
Dataset(num_rows=?, schema={sepal.length: double, sepal.width: double, petal.length: double, petal.width: double, variety: string})

After

>>> ray.data.read_parquet("example://iris.parquet")
shape: (?, 5)
┌──────────────┬─────────────┬──────────────┬─────────────┬─────────┐
│ sepal.lengthsepal.widthpetal.lengthpetal.widthvariety │
│ --- double--- double--- double--- double--- str │
╞══════════════╪═════════════╪══════════════╪═════════════╪═════════╡
└──────────────┴─────────────┴──────────────┴─────────────┴─────────┘

Much easier to read at a glance!

Implementation

  • Added _format_dataset_as_table() helper with Unicode box-drawing
  • Modified ExecutionPlan.get_plan_as_string() to use table formatter
  • Shows shape (num_rows, num_cols) similar to polars
  • Displays num_blocks for MaterializedDataset
  • Column types shown inline with names (e.g., "--- double")
  • Dynamic column width calculation based on content
  • Long column names/types automatically truncated

Future Enhancements

  • Sample row display for materialized datasets (marked with TODO)
  • Could add ellipsis rows for large datasets
  • Configurable max columns to display

Testing

Verified repr output for:

  • Non-materialized datasets (shows schema only)
  • Materialized datasets (shows schema + shape)
  • Various column types and widths

@ayushm98 ayushm98 requested a review from a team as a code owner December 29, 2025 23:54
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request significantly improves the readability of dataset representations by adopting a polars-style table format. The new _format_dataset_as_table helper function is a great addition, and the refactoring in ExecutionPlan.get_plan_as_string simplifies the code nicely. I've identified a bug in how nested plans are indented, which I've provided a fix for. I also have a suggestion to improve the clarity of the column width calculation logic. Overall, this is a valuable improvement.

Comment on lines +375 to +378
indented_lines = []
for line in dataset_str.split("\n"):
indented_lines.append(f"{INDENT_STR * (plan_max_depth - 1)}+- {line}")
plan_str += "\n".join(indented_lines)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The current indentation logic for nested plans incorrectly prefixes every line of the dataset's table representation with +- . This breaks the visual hierarchy of the plan tree. The +- should only prefix the first line of the nested dataset block, and subsequent lines should be indented to align with the content of that first line.

            lines = dataset_str.split("\n")
            indented_lines = [f"{INDENT_STR * (plan_max_depth - 1)}+- {lines[0]}"]
            subsequent_indent = INDENT_STR * plan_max_depth
            indented_lines.extend(subsequent_indent + line for line in lines[1:])
            plan_str += "\n".join(indented_lines)

Comment on lines +68 to +74
for row in sample_rows[:max_rows]:
if name in row:
val_str = str(row[name])
# Truncate long values
if len(val_str) > 50:
val_str = val_str[:47] + "..."
width = max(width, len(val_str))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The logic for calculating column width contribution from sample rows can be made more direct and readable. Instead of creating a truncated string to then measure its length, you can directly use min() with the value's length and the cap of 50. This avoids unnecessary string manipulation and makes the intent clearer.

            for row in sample_rows[:max_rows]:
                if name in row:
                    val_len = len(str(row[name]))
                    # The width contribution from a value is capped at 50.
                    width = max(width, min(val_len, 50))

schema=schema,
num_rows=count,
num_blocks=num_blocks,
sample_rows=None # TODO: Add sample data fetching in future PR
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dataset name is lost in new table representation

The _format_dataset_as_table() function doesn't accept a name parameter and self._dataset_name is never passed to it. The old code displayed name={self._dataset_name} when set, but this information is now completely lost in the representation. Named datasets will no longer show their names, which is a regression in functionality.

Additional Locations (1)

Fix in Cursor Fix in Web

indented_lines = []
for line in dataset_str.split("\n"):
indented_lines.append(f"{INDENT_STR * (plan_max_depth - 1)}+- {line}")
plan_str += "\n".join(indented_lines)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tree prefix incorrectly added to every table line

When plan_max_depth > 0, the code adds the +- prefix to every line of the multi-line table output. The +- prefix is a tree structure indicator that should only appear on the first line. Subsequent lines should have only indentation (spaces), not the +- prefix. This produces malformed output like +- shape:... followed by +- ┌───... instead of proper continuation indentation.

Fix in Cursor Fix in Web

Formatted table string with box-drawing characters
"""
if not schema or not schema.names:
return f"Dataset(num_rows={num_rows})"
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AttributeError when schema is a type object

The _format_dataset_as_table() function doesn't handle the case when schema is a Python type object. The schema() method returns Union[type, "pyarrow.lib.Schema"], and the old code at line 343 explicitly handled isinstance(schema, type). However, _format_dataset_as_table() accesses schema.names which will raise AttributeError for type objects since they don't have a .names attribute.

Additional Locations (1)

Fix in Cursor Fix in Web

@ray-gardener ray-gardener bot added data Ray Data-related issues community-contribution Contributed by the community labels Dec 30, 2025
Fixes ray-project#59482

Dataset repr now uses an attractive table format with box-drawing
characters, similar to polars, making it much easier to read schema
and dataset information at a glance.

Before:
  Dataset(num_rows=150, schema={sepal.length: double, sepal.width: double, ...})

After:
  shape: (150, 5)
  ┌──────────────┬─────────────┬──────────────┬─────────────┬─────────┐
  │ sepal.length │ sepal.width │ petal.length │ petal.width │ variety │
  │ --- double   │ --- double  │ --- double   │ --- double  │ --- str │
  ╞══════════════╪═════════════╪══════════════╪═════════════╪═════════╡
  └──────────────┴─────────────┴──────────────┴─────────────┴─────────┘

Changes:
- Added _format_dataset_as_table() helper function
- Modified get_plan_as_string() to use table formatter
- Shows shape (rows, cols) and num_blocks for materialized datasets
- Uses Unicode box-drawing characters for clean borders
- Column types displayed inline with column names
- Future: Can add sample row display (marked with TODO)

Significantly improves developer UX when inspecting datasets.

Signed-off-by: ayushm98 <[email protected]>
@ayushm98 ayushm98 force-pushed the improve-dataset-repr branch from 611c260 to 9749c1a Compare January 11, 2026 09:47
for col_type, width in zip(col_types, col_widths):
type_str = f"--- {col_type}"
parts.append(f" {type_str:{width}} │")
lines.append("".join(parts))
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Column names and types not truncated despite width cap

Low Severity

Column widths are capped at 50 characters on line 75, but when outputting column names (line 96) and types (line 103), Python's format specifier {name:{width}} only pads shorter strings - it doesn't truncate longer ones. If a column name or type exceeds 50 characters, it will overflow the allocated width, causing misalignment between the content and the box-drawing borders. The PR description claims "Long column names/types automatically truncated" but this truncation logic is missing.

Fix in Cursor Fix in Web

@ayushm98
Copy link
Author

I've added the DCO sign-off. The failing microcheck appears to be related to data tests and linting - I'll investigate the specific failures and push fixes if needed.

Copy link
Member

@owenowenisme owenowenisme left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think #59631 is already doing this.
I can find you something else to work onif I you are interested in.

@bveeramani
Copy link
Member

Hey @ayushm98, thanks for opening this PR!

As @owenowenisme mentioned, I think we're going to proceed with #59631 for now since it's farther along the review process.

Let me know us know if you want to pick something else up -- there's a lot to do

@bveeramani bveeramani closed this Jan 12, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

community-contribution Contributed by the community data Ray Data-related issues

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Data] Improve appearance of repr(dataset)

3 participants