This repository contains the content optimization code for HealthHub.
You can download the Anaconda Distribution for your respective operating system here. You may also find out how to get started with Anaconda Distribution here. To verify your installation, you can head to the Command Line Interface (CLI) and run the following command:

```bash
conda list
```

You should see a list of the packages installed in your active environment, along with their versions. For more information, refer here.
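The exact packages will differ from machine to machine; for illustration only, the output begins with a header like this (the path and entries here are examples, not values from this repo):

```bash
# packages in environment at /path/to/anaconda3:
#
# Name                    Version                   Build  Channel
...
```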
Once set up, create a virtual environment using `conda` and install the dependencies:

```bash
# Create a virtual environment
conda create -n <VENV_NAME> python=3.12 -y
conda activate <VENV_NAME>

# Install dependencies
pip install -r requirements.txt
```
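To confirm that the environment is active and using the expected interpreter, you can run a quick check; `<VENV_NAME>` is whatever name you chose above:

```bash
conda env list    # the active environment is marked with an asterisk
python --version  # should report Python 3.12.x
```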
Refer to the documentation here (recommended) for instructions on installing Poetry for your operating system.
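For reference, the documentation describes an official installer script; a typical invocation looks like this (check the docs for the variant matching your OS and shell):

```bash
curl -sSL https://install.python-poetry.org | python3 -
```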
Important

For Mac users, if you encounter a `poetry: command not found` issue, add `export PATH="$HOME/.local/bin:$PATH"` to the `.zshrc` file in your home folder and run `source ~/.zshrc`.
First, create a virtual environment by running the following command:

```bash
poetry shell
```

Tip

If you see the following error: `The currently activated Python version 3.11.7 is not supported by the project (^3.12). Trying to find and use a compatible version.`, run:

```bash
poetry env use 3.12.3 # Python version used in the project
```
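To double-check which interpreter the environment now uses, Poetry provides the built-in `env info` command:

```bash
poetry env info   # prints the virtualenv path and its Python version
```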
To install the defined dependencies for your project, just run the `install` command. The `install` command reads the `pyproject.toml` file from the current project, resolves the dependencies, and installs them.

```bash
poetry install
```
Warning

If you face an error installing `gensim` with `poetry`, run this command:

```bash
poetry run python -m pip install gensim --disable-pip-version-check --no-deps --no-cache-dir --no-binary gensim
```
If there is a `poetry.lock` file in the current directory, Poetry will use the exact versions from there instead of resolving them. This ensures that everyone using the library gets the same versions of the dependencies.

If there is no `poetry.lock` file, Poetry will create one after dependency resolution.

Tip

It is best practice to commit the `poetry.lock` file to version control for more reproducible builds. For more information, refer here.
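For example, after `poetry install` (or `poetry lock`) has produced or updated the lock file, commit it like any other tracked file; this is generic git usage, not specific to this repo:

```bash
git add poetry.lock
git commit -m "Update poetry.lock"
```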
The exploratory/experimental code for content optimization is stored in the `notebooks/` folder.

- `artifacts/`: contains the output of the exploratory/experimental code
  - `notebooks/`: contains experiments generated by `papermill`
  - `outputs/`: contains the experiment outputs (i.e. confusion matrices) generated by `papermill`
    - `statistical_vector_based_embeddings_similarity_scores.xlsx`: contains the similarity scores of experiments generated from Statistical Vector-based Embedding techniques
- `content-optimization/`: contains the Kedro pipeline for data preprocessing, engineering and clustering
  - For more information about the pipeline, refer to the `README.md`.
- `data/`: contains the data used in the exploratory/experimental code
  - `healthhub_small/`: contains a small subset of Health Hub raw data
  - `healthhub_small_clean/`: contains the small subset of Health Hub cleaned data; also stores the embeddings generated from Sentence Transformers in a `parquet` format
- `notebooks/`: contains the exploratory/experimental code where the bulk of the logic is implemented
  - `logger.py`: contains the code for logging
  - `preprocess.ipynb`: contains the code for preprocessing the raw data; cleaned output will be stored in `healthhub_small_clean/`; only needs to be run once
  - `embeddings.ipynb`: contains the code for generating embeddings; embeddings will be stored in `healthhub_small_clean/`
  - `similarity.ipynb`: contains the code for calculating similarity between embeddings
  - `runner.ipynb`: contains the code for running the notebooks (`embeddings.ipynb` and `similarity.ipynb`); parameterized by `papermill`; this notebook helps you run your experiments for different models and pooling strategies and evaluate the results in the `artifacts/` folder
  - `emb_sim_statistical.ipynb`: contains the code for generating embeddings from Statistical Vector-based Embeddings (SVE) techniques and calculating the similarity between embeddings
  - `runner_statistical.ipynb`: contains the code for running the notebook `emb_sim_statistical.ipynb`; parameterized by `papermill`; this notebook helps you run your experiments for different SVE techniques and similarity metrics and evaluate the results in the `artifacts/` folder
To run the notebooks, you can use `runner.ipynb` or `runner_statistical.ipynb`:

```python
# runner.ipynb
import papermill as pm
from logger import logger

pm.inspect_notebook("<INPUT_NOTEBOOK>")  # inspects and outputs the notebook's parameters

pm.execute_notebook(
    input_path="<INPUT_NOTEBOOK>",    # input notebook path
    output_path="<OUTPUT_NOTEBOOK>",  # output notebook path
    parameters={...},                 # parameters to be passed to the notebook in a dictionary
)
```
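For a one-off run outside the runner notebooks, `papermill` can also be invoked from the CLI. The paths and the parameter name below are placeholders for illustration, not values defined by this repo; use `pm.inspect_notebook` to see the real parameters:

```bash
papermill notebooks/embeddings.ipynb artifacts/notebooks/embeddings_output.ipynb \
    -p model_name all-MiniLM-L6-v2  # hypothetical parameter; -p passes one value, repeat for more
```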
Warning

Refrain from pushing into the `main` branch directly; it is bad practice. Always create a new branch and make your changes on that branch, as sketched below.
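A minimal sketch of that flow, assuming a standard git workflow (the branch name is a placeholder):

```bash
git checkout -b feat/my-change     # create and switch to a new branch
git add .                          # stage your changes
git commit -m "Describe my change"
git push -u origin feat/my-change  # push the branch, then open a PR on GitHub
```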
Every time you complete a feature or change on a branch and want to push it to GitHub to make a pull request, you need to ensure you lint your code.
You can simply run `pre-commit run --all-files` to lint your code. For more information, refer to the pre-commit docs. To see which linters are used, refer to the `.pre-commit-config.yaml` file.
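If you want the hooks to run automatically on every commit, you can also install them once per clone; both commands below are standard `pre-commit` usage:

```bash
pre-commit install           # registers pre-commit as a git hook
pre-commit run --all-files   # lints the entire code base on demand
```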
Alternatively, there is a `Makefile` that can also lint your code base when you run the simpler command `make lint`.
Ensure that all checks are satisfied before you push to GitHub (you should see that every hook has passed). If not, please debug accordingly, or your pull request may be rejected and closed.
The `lint.yml` is a GitHub workflow that kicks off several GitHub Actions when a pull request is made. These GitHub Actions check that your code has been properly linted before it is passed for review. Once all actions have passed and the PR is approved, your changes will be merged into the `main` branch.
Note

The `pre-commit` hooks will run regardless, even if you forget to call them explicitly. Nonetheless, it is recommended to call them explicitly so you can make any necessary changes in advance.