[SPIKE]: Dataset transformer improvements #4349

Open
blanchco opened this issue Aug 1, 2024 · 1 comment

blanchco commented Aug 1, 2024

Thoughts on Data Transformer Performance

The Problem:
The data transformer is slow to reopen because every code cell is re-executed to restore the previous state. Load time grows with the number of code cells in the notebook. Re-running all cells also does not guarantee the exact state from when the notebook was last closed: if a cell used a random function to generate a column in a DataFrame, each reopen would yield different values. We do save the notebook's history, but the kernel is shut down when we leave the node, so the entire history has to be rerun on reopening. Checkpointing is not a viable solution, because the state is lost when the kernel session is terminated.
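To make the reproducibility point concrete, here is a minimal sketch in plain Python. It does not use the real Beaker kernel; `history` and `replay` are hypothetical stand-ins for the saved cell history and for the rerun that happens on reopen.

```python
import random

# Hypothetical saved history: the second cell adds a column of random values.
history = [
    "rows = [{'id': i} for i in range(5)]",
    "for row in rows: row['noise'] = random.random()",
]


def replay(cells):
    """Re-execute every cell in a fresh namespace, as a reopen would."""
    namespace = {"random": random}
    for source in cells:
        exec(source, namespace)
    return namespace["rows"]


first_open = replay(history)
second_open = replay(history)

# The deterministic column survives a replay; the random column does not.
same_ids = [r["id"] for r in first_open] == [r["id"] for r in second_open]
same_noise = [r["noise"] for r in first_open] == [r["noise"] for r in second_open]
```

Each call to `replay` also executes every cell in `history`, which is why reopen time grows with the size of the notebook.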

A Solution:
To address this, we can keep saving the notebook history (as we already do) and, when the node is reopened, display that history without re-executing it. The Beaker kernel would instead be initialized with the input dataset(s) and their respective output dataset(s). For example:

  1. A dataset (dataset1) is attached to the data transformer.
  2. The transformer is opened, and the Beaker kernel is initialized with dataset1.
  3. The agent modifies dataset1 (e.g., adds a column), resulting in a new dataset (dataset2).
  4. dataset2 is saved, the node is closed, and the session is terminated.
  5. Upon reopening the node, the history is displayed but not re-executed.
  6. A new session is initialized with dataset1 and dataset2 already defined.

This approach preserves the notebook's state without rerunning any cells, so reopening no longer slows down as the history grows, and the datasets are exactly the ones that were saved.
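The steps above could look roughly like the sketch below. It is only an illustration in plain Python, not the Beaker API: `SavedNodeState`, `Session`, `reopen_node`, and the in-memory `store` are all hypothetical names, and `load_dataset` stands in for whatever fetches a saved dataset by id.

```python
from dataclasses import dataclass, field


@dataclass
class SavedNodeState:
    """Hypothetical record persisted when the node is closed."""
    history: list      # cell sources, kept for display only
    dataset_ids: dict  # variable name -> id of the saved dataset


@dataclass
class Session:
    """Stand-in for a fresh kernel session."""
    namespace: dict = field(default_factory=dict)
    executed: list = field(default_factory=list)

    def execute(self, source):
        self.executed.append(source)
        exec(source, self.namespace)


def reopen_node(state, load_dataset):
    """Start a session with the saved datasets defined; run no history cells."""
    session = Session()
    for name, dataset_id in state.dataset_ids.items():
        session.namespace[name] = load_dataset(dataset_id)
    displayed_cells = list(state.history)  # shown to the user, never executed
    return session, displayed_cells


# Hypothetical saved datasets: dataset2 is dataset1 plus an added column.
store = {
    "id-1": [{"a": 1}, {"a": 2}],
    "id-2": [{"a": 1, "b": 10}, {"a": 2, "b": 20}],
}
state = SavedNodeState(
    history=["dataset2 = add_column(dataset1)"],  # displayed only
    dataset_ids={"dataset1": "id-1", "dataset2": "id-2"},
)

session, cells = reopen_node(state, store.__getitem__)
executed_on_reopen = list(session.executed)

# New cells can build on the restored datasets right away.
session.execute("row_count = len(dataset2)")
```

One consequence of this design is that only the datasets are restored. Any other variable defined by an earlier cell would not exist in the new session unless it is saved and loaded the same way.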

blanchco self-assigned this Aug 1, 2024

blanchco commented Aug 1, 2024

@YohannParis @mwdchang Any thoughts on the above would be great 😄. This is based on my understanding of what's happening at the moment, plus a potential solution.

cc @mattprintz

blanchco changed the title from "Dataset transformer improvements" to "[SPIKE]: Dataset transformer improvements" on Aug 30, 2024