This repository contains the content optimization code for HealthHub.
You can download the Anaconda Distribution for your respective operating system here. You may also find out how to get started with Anaconda Distribution here. To verify your installation, you can head to the Command Line Interface (CLI) and run the following command:

```bash
conda list
```

You should see a list of the packages installed in your active environment, along with their versions. For more information, refer here.
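The exact packages will differ from machine to machine; for illustration only, the output begins with a header like this (the path and entries here are examples, not values from this repo):

```bash
# packages in environment at /path/to/anaconda3:
#
# Name                    Version                   Build  Channel
...
```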
Once set up, create a virtual environment using `conda` and install the dependencies:

```bash
# Create a virtual environment
conda create -n <VENV_NAME> python=3.12 -y
conda activate <VENV_NAME>

# Install dependencies
pip install -r requirements.txt
```
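To confirm that the environment is active and using the expected interpreter, you can run a quick check; `<VENV_NAME>` is whatever name you chose above:

```bash
conda env list    # the active environment is marked with an asterisk
python --version  # should report Python 3.12.x
```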
Refer to the documentation here (recommended) for instructions on installing Poetry for your operating system.
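For reference, the documentation describes an official installer script; a typical invocation looks like this (check the docs for the variant matching your OS and shell):

```bash
curl -sSL https://install.python-poetry.org | python3 -
```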
Important

For Mac users, if you encounter a `poetry: command not found` issue, add `export PATH="$HOME/.local/bin:$PATH"` to the `.zshrc` file in your home folder and run `source ~/.zshrc`.
First, create a virtual environment by running the following command:

```bash
poetry shell
```

Tip

If you see the following error: `The currently activated Python version 3.11.7 is not supported by the project (^3.12). Trying to find and use a compatible version.`, run:

```bash
poetry env use 3.12.3 # Python version used in the project
```
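To double-check which interpreter the environment now uses, Poetry provides the built-in `env info` command:

```bash
poetry env info   # prints the virtualenv path and its Python version
```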
To install the defined dependencies for your project, just run the `install` command. The `install` command reads the `pyproject.toml` file from the current project, resolves the dependencies, and installs them.

```bash
poetry install
```
Warning

If you face an error installing `gensim` with `poetry`, run this command:

```bash
poetry run python -m pip install gensim --disable-pip-version-check --no-deps --no-cache-dir --no-binary gensim
```
If there is a `poetry.lock` file in the current directory, Poetry will use the exact versions from there instead of resolving them. This ensures that everyone using the library gets the same versions of the dependencies.

If there is no `poetry.lock` file, Poetry will create one after dependency resolution.

Tip

It is best practice to commit the `poetry.lock` file to version control for more reproducible builds. For more information, refer here.
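For example, after `poetry install` (or `poetry lock`) has produced or updated the lock file, commit it like any other tracked file; this is generic git usage, not specific to this repo:

```bash
git add poetry.lock
git commit -m "Update poetry.lock"
```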
The exploratory/experimental code for content optimization is stored in the `notebooks/` folder.

- `artifacts/`: contains the output of the exploratory/experimental code
  - `notebooks/`: contains experiments generated by `papermill`
  - `outputs/`: contains the experiment outputs (i.e. confusion matrices) generated by `papermill`
    - `statistical_vector_based_embeddings_similarity_scores.xlsx`: contains the similarity scores of experiments generated from Statistical Vector-based Embedding techniques
- `content-optimization/`: contains the Kedro pipeline for data preprocessing, engineering and clustering
  - For more information about the pipeline, refer to the `README.md`.
- `data/`: contains the data used in the exploratory/experimental code
  - `healthhub_small/`: contains a small subset of Health Hub raw data
  - `healthhub_small_clean/`: contains the small subset of Health Hub cleaned data; also stores the embeddings generated from Sentence Transformers in a `parquet` format
- `notebooks/`: contains the exploratory/experimental code where the bulk of the logic is implemented
  - `logger.py`: contains the code for logging
  - `preprocess.ipynb`: contains the code for preprocessing the raw data; cleaned output will be stored in `healthhub_small_clean/`; only needs to be run once
  - `embeddings.ipynb`: contains the code for generating embeddings; embeddings will be stored in `healthhub_small_clean/`
  - `similarity.ipynb`: contains the code for calculating similarity between embeddings
  - `runner.ipynb`: contains the code for running the notebooks (`embeddings.ipynb` and `similarity.ipynb`); parameterized by `papermill`; this notebook helps you run your experiments for different models and pooling strategies and evaluate the results in the `artifacts/` folder
  - `emb_sim_statistical.ipynb`: contains the code for generating embeddings from Statistical Vector-based Embeddings (SVE) techniques and calculating the similarity between embeddings
  - `runner_statistical.ipynb`: contains the code for running the notebook `emb_sim_statistical.ipynb`; parameterized by `papermill`; this notebook helps you run your experiments for different SVE techniques and similarity metrics and evaluate the results in the `artifacts/` folder
To run the notebooks, you can use `runner.ipynb` or `runner_statistical.ipynb`:

```python
# runner.ipynb
import papermill as pm
from logger import logger

pm.inspect_notebook("<INPUT_NOTEBOOK>")  # inspects and outputs the notebook's parameters

pm.execute_notebook(
    input_path="<INPUT_NOTEBOOK>",    # input notebook path
    output_path="<OUTPUT_NOTEBOOK>",  # output notebook path
    parameters={...},                 # parameters to be passed to the notebook in a dictionary
)
```
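For a one-off run outside the runner notebooks, `papermill` can also be invoked from the CLI. The paths and the parameter name below are placeholders for illustration, not values defined by this repo; use `pm.inspect_notebook` to see the real parameters:

```bash
papermill notebooks/embeddings.ipynb artifacts/notebooks/embeddings_output.ipynb \
    -p model_name all-MiniLM-L6-v2  # hypothetical parameter; -p passes one value, repeat for more
```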
Warning

Refrain from pushing into the `main` branch directly; it is bad practice. Always create a new branch and make your changes on that branch, as sketched below.
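A minimal sketch of that flow, assuming a standard git workflow (the branch name is a placeholder):

```bash
git checkout -b feat/my-change     # create and switch to a new branch
git add .                          # stage your changes
git commit -m "Describe my change"
git push -u origin feat/my-change  # push the branch, then open a PR on GitHub
```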
Every time you complete a feature or change on a branch and want to push it to GitHub to make a pull request, you need to ensure you lint your code.
You can simply run `pre-commit run --all-files` to lint your code. For more information, refer to the pre-commit docs. To see which linters are used, refer to the `.pre-commit-config.yaml` file.
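If you want the hooks to run automatically on every commit, you can also install them once per clone; both commands below are standard `pre-commit` usage:

```bash
pre-commit install           # registers pre-commit as a git hook
pre-commit run --all-files   # lints the entire code base on demand
```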
Alternatively, there is a `Makefile` that can also lint your code base when you run the simpler command `make lint`.
Ensure that all checks are satisfied before you push to GitHub (you should see that every hook has passed). If not, please debug accordingly, or your pull request may be rejected and closed.
The `lint.yml` is a GitHub workflow that kicks off several GitHub Actions when a pull request is made. These GitHub Actions check that your code has been properly linted before it is passed for review. Once all actions have passed and the PR is approved, your changes will be merged into the `main` branch.
Note

The `pre-commit` hooks will run regardless, even if you forget to call them explicitly. Nonetheless, it is recommended to call them explicitly so you can make any necessary changes in advance.