This repository was archived by the owner on Jan 31, 2022. It is now read-only.
[label bot] code duplication among notebooks #122
Comments
Issue-Label Bot is automatically applying the labels:
Please mark this comment with 👍 or 👎 to give our bot feedback!
jlewi added a commit to jlewi/code-intelligence that referenced this issue on Apr 8, 2020:
* This is the first step in creating an org wide model for all of Kubeflow (kubeflow#110)
* Modify the Get-GitHub-Issues.ipynb model to reuse code in the embeddings directory.
* Add some missing packages to requirements.txt.
* Use the GitHub GraphQL client to get a list of all repositories.
* Add some missing GCS utilities
* Remove some of the duplication between Get-GitHub-Issues.ipynb and our library methods (kubeflow#122)
jlewi added a commit to jlewi/code-intelligence that referenced this issue on Apr 12, 2020:
* This is the first step in creating an org wide model for all of Kubeflow (kubeflow#110)
* Modify the Get-GitHub-Issues.ipynb model to reuse code in the embeddings directory.
* Add some missing packages to requirements.txt.
* Use the GitHub GraphQL client to get a list of all repositories.
* Add some missing GCS utilities
* Remove some of the duplication between Get-GitHub-Issues.ipynb and our library methods (kubeflow#122)
* Start fetching the data from BigQuery.
* Using BigQuery turned out to be a lot better for bulk pulling all of the Kubeflow issues.
* Use HDF5 to save the data.
* Start a doc to keep track of notes for how to train a Kubeflow model. Add logic to save to HDF5 and do a sanity check compared to the inference code
* Add a function to fetch the data using the GitHub API
* related to kubeflow#126 switching the embeddings service to use the GraphQL API rather than HTML fetching
* I added this function as a way to sanity check that we get the same data using BigQuery as at inference time.
k8s-ci-robot pushed a commit that referenced this issue on Apr 12, 2020:
* This is the first step in creating an org wide model for all of Kubeflow (#110)
* Modify the Get-GitHub-Issues.ipynb model to reuse code in the embeddings directory.
* Add some missing packages to requirements.txt.
* Use the GitHub GraphQL client to get a list of all repositories.
* Add some missing GCS utilities
* Remove some of the duplication between Get-GitHub-Issues.ipynb and our library methods (#122)
* Start fetching the data from BigQuery.
* Using BigQuery turned out to be a lot better for bulk pulling all of the Kubeflow issues.
* Use HDF5 to save the data.
* Start a doc to keep track of notes for how to train a Kubeflow model. Add logic to save to HDF5 and do a sanity check compared to the inference code
* Add a function to fetch the data using the GitHub API
* related to #126 switching the embeddings service to use the GraphQL API rather than HTML fetching
* I added this function as a way to sanity check that we get the same data using BigQuery as at inference time.
I'm noticing a lot of code duplication between the various notebooks, which makes it hard to identify which notebook to use. This is probably tech debt resulting from us creating new copies of code rather than refactoring and reusing it. We should try to clean this up.
As an example, the following two notebooks both seem to fetch GitHub issues and compute embeddings:
* issue_loader.ipynb
* Get-GitHub-Issues.ipynb

The former appears to use the functions in embeddings.py, while the latter still seems to define those same functions inside the notebook.
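One possible way to find this kind of overlap, sketched below, is to parse a notebook's code cells and compare the function names they define against those in a library module. This is only an illustration and not tooling that exists in this repo: the helper names (`defined_functions`, `notebook_functions`, `duplicated_functions`) and the toy function names in the demo are hypothetical, and it assumes the notebook is standard nbformat v4 JSON. It only matches names, so it won't catch copies that were renamed.

```python
import ast
import json


def defined_functions(source):
    """Return the names of top-level functions defined in a Python source string."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        # Cells containing IPython magics (e.g. "%matplotlib inline") are not
        # valid Python; this sketch simply skips such cells entirely.
        return set()
    return {
        node.name
        for node in tree.body
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    }


def notebook_functions(notebook_json):
    """Collect function names defined across all code cells of a notebook."""
    nb = json.loads(notebook_json)
    names = set()
    for cell in nb.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        src = cell.get("source", "")
        # nbformat allows "source" to be a string or a list of lines.
        if isinstance(src, list):
            src = "".join(src)
        names |= defined_functions(src)
    return names


def duplicated_functions(notebook_json, module_source):
    """Names defined both in the notebook and in the module, sorted."""
    return sorted(notebook_functions(notebook_json) & defined_functions(module_source))


if __name__ == "__main__":
    # Toy inputs standing in for a notebook and a library module.
    toy_notebook = json.dumps({
        "cells": [
            {"cell_type": "markdown", "source": ["# Fetch issues"]},
            {"cell_type": "code", "source": ["def get_issue_text(url):\n", "    return url\n"]},
            {"cell_type": "code", "source": "def plot_counts(df):\n    return df\n"},
        ]
    })
    toy_module = "def get_issue_text(url):\n    return url\n\ndef embed(text):\n    return text\n"
    print(duplicated_functions(toy_notebook, toy_module))  # ['get_issue_text']
```

For real files you would read the `.ipynb` and `.py` contents from disk and pass them in as strings. Any name that shows up in the result is a candidate for deleting from the notebook and importing from the library instead.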