-
Notifications
You must be signed in to change notification settings - Fork 7.1k
Improve dataset repr with polars-style table formatting #59748
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request significantly improves the readability of dataset representations by adopting a polars-style table format. The new _format_dataset_as_table helper function is a great addition, and the refactoring in ExecutionPlan.get_plan_as_string simplifies the code nicely. I've identified a bug in how nested plans are indented, which I've provided a fix for. I also have a suggestion to improve the clarity of the column width calculation logic. Overall, this is a valuable improvement.
| indented_lines = [] | ||
| for line in dataset_str.split("\n"): | ||
| indented_lines.append(f"{INDENT_STR * (plan_max_depth - 1)}+- {line}") | ||
| plan_str += "\n".join(indented_lines) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current indentation logic for nested plans incorrectly prefixes every line of the dataset's table representation with +- . This breaks the visual hierarchy of the plan tree. The +- should only prefix the first line of the nested dataset block, and subsequent lines should be indented to align with the content of that first line.
lines = dataset_str.split("\n")
indented_lines = [f"{INDENT_STR * (plan_max_depth - 1)}+- {lines[0]}"]
subsequent_indent = INDENT_STR * plan_max_depth
indented_lines.extend(subsequent_indent + line for line in lines[1:])
plan_str += "\n".join(indented_lines)| for row in sample_rows[:max_rows]: | ||
| if name in row: | ||
| val_str = str(row[name]) | ||
| # Truncate long values | ||
| if len(val_str) > 50: | ||
| val_str = val_str[:47] + "..." | ||
| width = max(width, len(val_str)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logic for calculating column width contribution from sample rows can be made more direct and readable. Instead of creating a truncated string to then measure its length, you can directly use min() with the value's length and the cap of 50. This avoids unnecessary string manipulation and makes the intent clearer.
for row in sample_rows[:max_rows]:
if name in row:
val_len = len(str(row[name]))
# The width contribution from a value is capped at 50.
width = max(width, min(val_len, 50))| schema=schema, | ||
| num_rows=count, | ||
| num_blocks=num_blocks, | ||
| sample_rows=None # TODO: Add sample data fetching in future PR |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dataset name is lost in new table representation
The _format_dataset_as_table() function doesn't accept a name parameter and self._dataset_name is never passed to it. The old code displayed name={self._dataset_name} when set, but this information is now completely lost in the representation. Named datasets will no longer show their names, which is a regression in functionality.
Additional Locations (1)
| indented_lines = [] | ||
| for line in dataset_str.split("\n"): | ||
| indented_lines.append(f"{INDENT_STR * (plan_max_depth - 1)}+- {line}") | ||
| plan_str += "\n".join(indented_lines) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tree prefix incorrectly added to every table line
When plan_max_depth > 0, the code adds the +- prefix to every line of the multi-line table output. The +- prefix is a tree structure indicator that should only appear on the first line. Subsequent lines should have only indentation (spaces), not the +- prefix. This produces malformed output like +- shape:... followed by +- ┌───... instead of proper continuation indentation.
| Formatted table string with box-drawing characters | ||
| """ | ||
| if not schema or not schema.names: | ||
| return f"Dataset(num_rows={num_rows})" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AttributeError when schema is a type object
The _format_dataset_as_table() function doesn't handle the case when schema is a Python type object. The schema() method returns Union[type, "pyarrow.lib.Schema"], and the old code at line 343 explicitly handled isinstance(schema, type). However, _format_dataset_as_table() accesses schema.names which will raise AttributeError for type objects since they don't have a .names attribute.
Additional Locations (1)
Fixes ray-project#59482 Dataset repr now uses an attractive table format with box-drawing characters, similar to polars, making it much easier to read schema and dataset information at a glance. Before: Dataset(num_rows=150, schema={sepal.length: double, sepal.width: double, ...}) After: shape: (150, 5) ┌──────────────┬─────────────┬──────────────┬─────────────┬─────────┐ │ sepal.length │ sepal.width │ petal.length │ petal.width │ variety │ │ --- double │ --- double │ --- double │ --- double │ --- str │ ╞══════════════╪═════════════╪══════════════╪═════════════╪═════════╡ └──────────────┴─────────────┴──────────────┴─────────────┴─────────┘ Changes: - Added _format_dataset_as_table() helper function - Modified get_plan_as_string() to use table formatter - Shows shape (rows, cols) and num_blocks for materialized datasets - Uses Unicode box-drawing characters for clean borders - Column types displayed inline with column names - Future: Can add sample row display (marked with TODO) Significantly improves developer UX when inspecting datasets. Signed-off-by: ayushm98 <[email protected]>
611c260 to
9749c1a
Compare
| for col_type, width in zip(col_types, col_widths): | ||
| type_str = f"--- {col_type}" | ||
| parts.append(f" {type_str:{width}} │") | ||
| lines.append("".join(parts)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Column names and types not truncated despite width cap
Low Severity
Column widths are capped at 50 characters on line 75, but when outputting column names (line 96) and types (line 103), Python's format specifier {name:{width}} only pads shorter strings - it doesn't truncate longer ones. If a column name or type exceeds 50 characters, it will overflow the allocated width, causing misalignment between the content and the box-drawing borders. The PR description claims "Long column names/types automatically truncated" but this truncation logic is missing.
|
I've added the DCO sign-off. The failing microcheck appears to be related to data tests and linting - I'll investigate the specific failures and push fixes if needed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think #59631 is already doing this.
I can find you something else to work onif I you are interested in.
|
Hey @ayushm98, thanks for opening this PR! As @owenowenisme mentioned, I think we're going to proceed with #59631 for now since it's farther along the review process. Let me know us know if you want to pick something else up -- there's a lot to do |
Fixes #59482
Summary
Transforms Dataset repr from plain text to an attractive polars-style table format with box-drawing characters.
Before
After
Much easier to read at a glance!
Implementation
_format_dataset_as_table()helper with Unicode box-drawingExecutionPlan.get_plan_as_string()to use table formatterFuture Enhancements
Testing
Verified repr output for: