[label bot] code duplication among notebooks #122

jlewi · 2020-04-07T16:51:54Z

I'm noticing a lot of code duplication between the various notebooks, which makes it hard to identify which notebook to use. This is probably tech debt that resulted from creating new copies of code rather than refactoring and reusing it. We should try to clean this up.

Code shared between notebooks should be moved into reusable functions, classes, or modules in the py directory.
Notebooks should call the reusable functions.
Notebooks should clearly explain what they are doing so it's obvious how different notebooks compare.
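
To make the points above concrete, here is a minimal sketch of what "move shared code into a module, then call it from the notebook" could look like. The function names (`build_issue_text`, `embed_issues`) and the stand-in `fake_embed` are hypothetical and chosen only for illustration; they are not the actual functions in the py directory.

```python
from typing import Callable, Dict, List, Sequence

# --- Would live in a shared module under the py directory (hypothetical) ---

def build_issue_text(title: str, body: str) -> str:
    """Combine an issue title and body into one string to be embedded."""
    return f"{title.strip()}\n\n{body.strip()}"


def embed_issues(
    issues: Sequence[Dict[str, str]],
    embed_fn: Callable[[str], List[float]],
) -> List[List[float]]:
    """Apply an embedding function to every issue.

    `embed_fn` is passed in so that each notebook can plug in its own
    model without copying the surrounding logic.
    """
    return [
        embed_fn(build_issue_text(issue["title"], issue["body"]))
        for issue in issues
    ]


# --- Would live in a notebook cell: only calls the shared helpers ---

def fake_embed(text: str) -> List[float]:
    """Stand-in for a real embedding model (assumption for this sketch)."""
    return [float(len(text))]


issues = [{"title": "Crash on start", "body": "Stack trace attached."}]
vectors = embed_issues(issues, fake_embed)
```

With this split, each notebook only needs a short description of what it does plus a few calls into the shared module, which makes the notebooks easier to compare.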

As an example, the following two notebooks both seem to be fetching GitHub issues and computing embeddings:

issue_loader.ipynb
Get-GitHub-Issues.ipynb
The former appears to be using the functions in embeddings.py.
The latter appears to still define those same functions inside the notebook.
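
One way to spot this kind of duplication mechanically is to compare the function names defined in a notebook's code cells against those defined in a module. The sketch below uses only the standard library and assumes the notebook is in the usual nbformat v4 JSON layout (a top-level `cells` list whose entries have `cell_type` and `source`). The helper names are hypothetical, and matching on names alone is a rough heuristic rather than proof that the code is identical.

```python
import ast
import json
from typing import Set


def functions_defined_in_source(source: str) -> Set[str]:
    """Return the names of all functions defined in a piece of Python source."""
    tree = ast.parse(source)
    return {
        node.name
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    }


def functions_defined_in_notebook(notebook_json: str) -> Set[str]:
    """Return the names of all functions defined in a notebook's code cells."""
    notebook = json.loads(notebook_json)
    names: Set[str] = set()
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        if isinstance(source, list):  # source may be stored as a list of lines
            source = "".join(source)
        # Drop IPython magics and shell escapes, which are not valid Python.
        lines = [
            line
            for line in source.splitlines()
            if not line.lstrip().startswith(("%", "!"))
        ]
        try:
            names |= functions_defined_in_source("\n".join(lines))
        except SyntaxError:
            continue  # skip cells that still cannot be parsed
    return names


def duplicated_functions(notebook_json: str, module_source: str) -> Set[str]:
    """Names defined both in the notebook and in the module."""
    return functions_defined_in_notebook(notebook_json) & functions_defined_in_source(
        module_source
    )
```

In practice, `notebook_json` would be the text of a file such as Get-GitHub-Issues.ipynb and `module_source` the text of embeddings.py; any names returned are candidates for deletion from the notebook in favor of an import.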

issue-label-bot · 2020-04-07T16:52:00Z

Issue-Label Bot is automatically applying the labels:

Label	Probability
kind/feature	0.57

Please mark this comment with 👍 or 👎 to give our bot feedback!
Links: app homepage, dashboard and code for this bot.

* This is the first step in creating an org-wide model for all of Kubeflow (kubeflow#110)
* Modify the Get-GitHub-Issues.ipynb model to reuse code in the embeddings directory.
* Add some missing packages to requirements.txt.
* Use the GitHub GraphQL client to get a list of all repositories.
* Add some missing GCS utilities
* Remove some of the duplication between Get-GitHub-Issues.ipynb and our library methods (kubeflow#122)
* Start fetching the data from BigQuery.
* Using BigQuery turned out to be a lot better for bulk pulling all of the Kubeflow issues.
* Use HDF5 to save the data.
* Start a doc to keep track of notes for how to train a Kubeflow model. Add logic to save to HDF5 and do a sanity check compared to the inference code
* Add a function to fetch the data using the GitHub API
* Related to kubeflow#126: switching the embeddings service to use the GraphQL API rather than HTML fetching
* I added this function as a way to sanity check that we get the same data using BigQuery as at inference time.

jlewi added kind/feature priority/p2 kind/process area/labelbot labels Apr 7, 2020

jlewi mentioned this issue Apr 8, 2020

Compute embeddings for all kubeflow repositories. #124

Merged
