[SPIKE]: Dataset transformer improvements #4349

blanchco · 2024-08-01T20:04:24Z

Thoughts on Data Transformer Performance

The Problem:
The data transformer currently faces performance issues because every time we reopen it, all code cells are re-executed to restore the previous state. As the number of code cells in a notebook increases, this issue compounds, leading to longer loading times. Additionally, re-running all cells may not always replicate the exact state from when the notebook was last closed. For instance, if a random function was used to generate a column in a DataFrame, reopening the notebook would yield different results each time. Although we save the notebook's history, the kernel is shut down when we leave the node, necessitating the rerun of the entire history upon reopening. Checkpointing is not a viable solution, as the state is lost when the kernel session is terminated.
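To make the reproducibility concern concrete, here is a minimal sketch in plain Python. It is not the transformer's or Beaker's actual code: `history` and `replay` are hypothetical stand-ins for the saved notebook history and the re-execution that happens on reopen, and lists of dicts stand in for DataFrames.

```python
import random

# Hypothetical saved history: each entry is the source of one notebook cell.
history = [
    "rows = [{'id': i} for i in range(5)]",
    "for row in rows: row['noise'] = random.random()",
]


def replay(cells):
    """Re-execute every saved cell in a fresh namespace, as happens on reopen."""
    namespace = {"random": random}
    for source in cells:
        exec(source, namespace)
    return namespace["rows"]


first_open = replay(history)
second_open = replay(history)

# The deterministic column is reproduced by both replays.
ids_match = [r["id"] for r in first_open] == [r["id"] for r in second_open]
# The unseeded random column is regenerated on every replay.
noise_match = [r["noise"] for r in first_open] == [r["noise"] for r in second_open]
print(ids_match, noise_match)
```

Each replay also pays the full cost of running every cell, so the work grows with the length of the history.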

A Solution:
To address this, we can save the notebook history (as we already do) and, upon reopening the node, display the history without re-executing it. The Beaker kernel would be initialized with the input dataset(s) and their respective output dataset(s). For example:

1. A dataset (dataset1) is attached to the data transformer.
2. The transformer is opened, and the Beaker kernel is initialized with dataset1.
3. The agent modifies dataset1 (e.g., adds a column), resulting in a new dataset (dataset2).
4. dataset2 is saved, the node is closed, and the session is terminated.
5. Upon reopening the node, the history is displayed but not re-executed.
6. A new session is initialized with dataset1 and dataset2 already defined.

This approach preserves the notebook's state without the need to rerun all cells, enhancing performance and ensuring consistency.
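The flow above could look roughly like the following sketch. Everything here is an assumption for illustration: `Cell`, `Session`, and `reopen` are hypothetical names, not Beaker APIs, and lists of dicts stand in for the saved datasets. The point is only that reopening seeds the namespace from saved datasets and shows the history without executing it.

```python
from dataclasses import dataclass, field


@dataclass
class Cell:
    source: str
    output: str  # output captured when the cell originally ran


@dataclass
class Session:
    """Hypothetical stand-in for a kernel session."""
    namespace: dict = field(default_factory=dict)
    displayed: list = field(default_factory=list)
    executed: int = 0

    def run(self, source):
        """Execute a new cell against the current namespace."""
        exec(source, self.namespace)
        self.executed += 1


def reopen(history, saved_datasets):
    """Start a fresh session seeded with saved datasets; history is display-only."""
    session = Session()
    session.namespace.update(saved_datasets)  # dataset1 and dataset2 are predefined
    session.displayed = list(history)         # shown to the user, never executed
    return session


dataset1 = [{"id": 1}, {"id": 2}]
dataset2 = [{"id": 1, "double": 2}, {"id": 2, "double": 4}]  # saved before closing
history = [Cell("dataset2 = [dict(r, double=r['id'] * 2) for r in dataset1]", "")]

session = reopen(history, {"dataset1": dataset1, "dataset2": dataset2})
# New work picks up from the saved state without replaying the old cell.
session.run("dataset3 = [r for r in dataset2 if r['double'] > 2]")
print(session.executed, session.namespace["dataset3"])
```

One trade-off of this design: any variable that was created in the old session but never saved as a dataset would not exist after reopening, so displayed cells that reference it could not be re-run as-is.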

blanchco · 2024-08-01T20:07:18Z

@YohannParis @mwdchang Any thoughts on the above would be great 😄. This is based on my understanding of what's happening at the moment, plus a potential solution.

cc @mattprintz

blanchco self-assigned this Aug 1, 2024

blanchco changed the title ~~Dataset transformer improvements~~ [SPIKE]: Dataset transformer improvements Aug 30, 2024
