Guided Data Science

We present a recommendation system for data scientists that, given a cell of code written by the user, recommends the next line of code.

The recommendation system consists of three main parts (each explained in detail here):

Dataset Builder: Collects the data needed to build the system (see data_gathering).
- Downloaded datasets, notebooks, and metadata are stored in the datasets directory.
- The parsed TSV files used to train our models are stored in the Data directory.
Workflow-Stage Classifier: Classifies the code into the relevant data science workflow stage and provides context for the code (see Classification).
Recommendation Engine: Generates the next-line recommendation (see Chatbot).
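
To make the hand-off between the Workflow-Stage Classifier and the Recommendation Engine concrete, here is a toy sketch of the idea. It is not the code in Classification or Chatbot: the stage names, keyword rules, suggested lines, and the functions classify_stage and recommend_next_line are all hypothetical stand-ins for the trained models.

```python
# Toy illustration only: the real system uses trained models, not keyword rules.
# Stage names and keywords below are assumptions made for this sketch.
STAGE_KEYWORDS = {
    "data loading": ("read_csv", "read_table", "open("),
    "exploration": (".head(", ".describe(", ".info("),
    "visualization": ("plt.", "sns.", ".plot("),
    "modeling": (".fit(", ".predict(", "train_test_split"),
}

# Hypothetical next-line suggestions, one per stage.
NEXT_LINE_BY_STAGE = {
    "data loading": "df.head()",
    "exploration": "df.isnull().sum()",
    "visualization": "plt.show()",
    "modeling": "predictions = model.predict(X_test)",
}


def classify_stage(code):
    """Return the stage whose keywords occur most often in the code cell."""
    scores = {
        stage: sum(code.count(keyword) for keyword in keywords)
        for stage, keywords in STAGE_KEYWORDS.items()
    }
    best_stage = max(scores, key=scores.get)
    return best_stage if scores[best_stage] > 0 else "unknown"


def recommend_next_line(code):
    """Classify the cell, then look up a suggestion for that stage."""
    stage = classify_stage(code)
    return stage, NEXT_LINE_BY_STAGE.get(stage)


if __name__ == "__main__":
    print(recommend_next_line("df = pd.read_csv('train.csv')"))
```

The point of the sketch is the order of operations: the cell is first assigned a workflow stage, and that stage is then used as context when choosing the recommendation.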

[Figure: system architecture diagram]

The entire flow of creating the system is explained in the Flow.ipynb notebook.

Required libraries can be installed using the requirements.txt file. Alternatively, you can create an environment using the environment.yml file.
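
After installing, it can be handy to confirm that nothing from requirements.txt is missing. The following is a minimal sketch using only the standard library (Python 3.8+); the helper names requirement_names and missing_requirements are made up for this example, and the parsing only handles simple `name==version` style lines.

```python
import re
from importlib import metadata


def requirement_names(text):
    """Extract bare distribution names from requirements.txt-style text."""
    names = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop trailing comments
        if not line or line.startswith("-"):  # skip blanks and pip options (-r, -e, ...)
            continue
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
        if match:
            names.append(match.group(0))
    return names


def missing_requirements(text):
    """Return the listed distributions that are not installed in this environment."""
    missing = []
    for name in requirement_names(text):
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing
```

For example, `missing_requirements(open("requirements.txt").read())` returns an empty list when every listed package is installed. Version pins are not compared, only presence.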
Note that to use the Dataset Builder you must have Kaggle credentials set up.
Follow the instructions at: https://github.com/Kaggle/kaggle-api#api-credentials
You also need to configure your Kaggle username and password in the data_gathering/consts.py file.
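
Before running the Dataset Builder, you may want to check that the Kaggle API can find your credentials. The sketch below assumes the lookup described in the Kaggle API instructions linked above: either the KAGGLE_USERNAME and KAGGLE_KEY environment variables, or a kaggle.json file in ~/.kaggle (or in the directory named by KAGGLE_CONFIG_DIR). The function kaggle_credentials_found is a hypothetical helper, not part of this repository or of the Kaggle package, and it does not check data_gathering/consts.py.

```python
import os
from pathlib import Path


def kaggle_credentials_found(env=None, home=None):
    """Return True if Kaggle API credentials appear to be configured.

    env and home are injectable for testing; by default the real
    environment variables and the user's home directory are used.
    """
    env = os.environ if env is None else env

    # Option 1: credentials supplied through environment variables.
    if env.get("KAGGLE_USERNAME") and env.get("KAGGLE_KEY"):
        return True

    # Option 2: a kaggle.json token file in the config directory.
    config_dir = env.get("KAGGLE_CONFIG_DIR")
    if config_dir:
        base = Path(config_dir)
    else:
        base = Path(home if home is not None else Path.home()) / ".kaggle"
    return (base / "kaggle.json").is_file()


if __name__ == "__main__":
    print("Kaggle credentials found:", kaggle_credentials_found())
```

The check only confirms that credentials are present, not that they are valid.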
For the weak supervision process in Classification/Exploration_and_WeakSupervision.ipynb, you must have snorkel v0.7 installed. Snorkel does not support installation via pip; follow the instructions at: https://github.com/HazyResearch/snorkel#installation
