This repository was archived by the owner on Jan 31, 2022. It is now read-only.

[label bot] code duplication among notebooks #122

Open
jlewi opened this issue Apr 7, 2020 · 1 comment

jlewi commented Apr 7, 2020

I'm noticing a lot of code duplication between the various notebooks, which makes it hard to identify which notebook to use. This is probably tech debt from creating new copies of code rather than refactoring and reusing. We should try to clean this up.

  1. Code shared between notebooks should be moved into reusable functions, classes, or modules in the py directory.
  2. Notebooks should call the reusable functions.
  3. Notebooks should clearly explain what they are doing so it's obvious how different notebooks compare.
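
A minimal sketch of what points 1 and 2 could look like. The module name `py/issue_utils.py` and both function names are hypothetical, chosen only for illustration; they are not taken from the repository.

```python
# Hypothetical contents of py/issue_utils.py (name is an assumption).
from typing import Dict, Iterable, List


def build_issue_text(title: str, body: str) -> str:
    """Combine an issue's title and body into one string for downstream processing."""
    return f"{title.strip()}\n\n{body.strip()}"


def prepare_issues(issues: Iterable[Dict[str, str]]) -> List[str]:
    """Apply the same preprocessing to every issue, whichever notebook calls it."""
    return [build_issue_text(issue["title"], issue.get("body", "")) for issue in issues]


# In a notebook cell, import the shared helper instead of re-implementing it:
#   from issue_utils import prepare_issues
sample_issues = [
    {"title": " code duplication among notebooks ", "body": "We should clean this up."},
]
texts = prepare_issues(sample_issues)
```

With the logic in one place, each notebook only needs a short cell that imports the helper and a sentence of explanation about what it does differently from the others.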

As an example, the following two notebooks both seem to be fetching GitHub issues and computing embeddings

@issue-label-bot

Issue-Label Bot is automatically applying the labels:

| Label | Probability |
| --- | --- |
| kind/feature | 0.57 |

Please mark this comment with 👍 or 👎 to give our bot feedback!
Links: app homepage, dashboard and code for this bot.

jlewi added a commit to jlewi/code-intelligence that referenced this issue Apr 8, 2020
* This is the first step in creating an org-wide model for all of Kubeflow (kubeflow#110)

* Modify the Get-GitHub-Issues.ipynb notebook to reuse code in the embeddings directory.
  * Add some missing packages to requirements.txt.
  * Use the GitHub GraphQL client to get a list of all repositories.

* Add some missing GCS utilities

* Remove some of the duplication between Get-GitHub-Issues.ipynb and our
  library methods (kubeflow#122)
jlewi added a commit to jlewi/code-intelligence that referenced this issue Apr 12, 2020
* This is the first step in creating an org-wide model for all of Kubeflow (kubeflow#110)

* Modify the Get-GitHub-Issues.ipynb notebook to reuse code in the embeddings directory.
  * Add some missing packages to requirements.txt.
  * Use the GitHub GraphQL client to get a list of all repositories.

* Add some missing GCS utilities

* Remove some of the duplication between Get-GitHub-Issues.ipynb and our
  library methods (kubeflow#122)

* Start fetching the data from BigQuery.

  * Using BigQuery turned out to be a lot better for bulk pulling all
    of the Kubeflow issues.

* Use HDF5 to save the data.

* Start a doc to keep notes on how to train a Kubeflow model. Add logic to save to HDF5 and sanity check the result against the inference code.

* Add a function to fetch the data using the GitHub API
  * Related to kubeflow#126: switching the embeddings service to use the GraphQL API rather than HTML fetching.
  * I added this function as a way to sanity check that we get the same data
    from BigQuery as at inference time.
k8s-ci-robot pushed a commit that referenced this issue Apr 12, 2020
* This is the first step in creating an org-wide model for all of Kubeflow (#110)

* Modify the Get-GitHub-Issues.ipynb notebook to reuse code in the embeddings directory.
  * Add some missing packages to requirements.txt.
  * Use the GitHub GraphQL client to get a list of all repositories.

* Add some missing GCS utilities

* Remove some of the duplication between Get-GitHub-Issues.ipynb and our
  library methods (#122)

* Start fetching the data from BigQuery.

  * Using BigQuery turned out to be a lot better for bulk pulling all
    of the Kubeflow issues.

* Use HDF5 to save the data.

* Start a doc to keep notes on how to train a Kubeflow model. Add logic to save to HDF5 and sanity check the result against the inference code.

* Add a function to fetch the data using the GitHub API
  * Related to #126: switching the embeddings service to use the GraphQL API rather than HTML fetching.
  * I added this function as a way to sanity check that we get the same data
    from BigQuery as at inference time.
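
The sanity check mentioned in the commit message above (confirming that the bulk data matches what the GitHub API returns at inference time) could be sketched roughly as follows. This is only an illustration under assumptions: the function name, the `"org/repo#number"` key format, and the compared fields are hypothetical and not taken from the repository.

```python
from typing import Dict, List, Tuple


def find_mismatches(
    bulk: Dict[str, dict],
    api: Dict[str, dict],
    fields: Tuple[str, ...] = ("title", "body"),
) -> List[str]:
    """Describe differences between two issue sources keyed by an issue identifier.

    `bulk` stands in for issues pulled in bulk, `api` for issues fetched one
    at a time; both map a hypothetical key such as "org/repo#1" to a record.
    """
    problems = []
    for key in sorted(set(bulk) | set(api)):
        if key not in bulk:
            problems.append(f"{key}: missing from bulk export")
        elif key not in api:
            problems.append(f"{key}: missing from API result")
        else:
            for field in fields:
                if bulk[key].get(field) != api[key].get(field):
                    problems.append(f"{key}: field '{field}' differs")
    return problems


# Made-up records, purely to exercise the comparison.
bulk_issues = {
    "some-org/some-repo#1": {"title": "a", "body": "x"},
    "some-org/some-repo#2": {"title": "b", "body": "y"},
}
api_issues = {
    "some-org/some-repo#1": {"title": "a", "body": "x"},
    "some-org/some-repo#2": {"title": "b", "body": "changed"},
    "some-org/some-repo#3": {"title": "c", "body": "z"},
}
mismatches = find_mismatches(bulk_issues, api_issues)
```

An empty list would mean the two sources agree on the compared fields; any entries point at issues worth inspecting before training on the bulk data.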