This tool enables fuzzy semantic search over curated datasets of genes, drugs, and diseases. It was developed to enhance user accessibility within the PharmAlchemy knowledge platform.
- Fuzzy matching using `SequenceMatcher` with adjustable thresholds
- Real-time search with confidence scoring
- Synonym-aware query expansion
- Modular design for adding new datasets
- Tkinter-based GUI
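
The matching and scoring features listed above can be sketched with `difflib.SequenceMatcher` from the Python standard library. This is a minimal illustration, not the tool's actual code: the `similarity` and `fuzzy_search` helpers, the record layout (`name` plus `synonyms`), the default threshold of 0.6, and the sample gene entries are all assumptions made for the example.

```python
from difflib import SequenceMatcher


def similarity(query, candidate):
    """Return a similarity ratio in [0, 1] for two case-normalized strings."""
    return SequenceMatcher(None, query.strip().lower(), candidate.strip().lower()).ratio()


def fuzzy_search(query, records, threshold=0.6, limit=5):
    """Rank records by their best score over the name and every synonym.

    `records` is a hypothetical list of dicts with 'name' and 'synonyms' keys.
    Records scoring below `threshold` are dropped.
    """
    scored = []
    for record in records:
        # Synonym-aware matching: compare the query against all known terms.
        terms = [record["name"], *record.get("synonyms", [])]
        best = max(similarity(query, term) for term in terms)
        if best >= threshold:
            scored.append((best, record["name"]))
    # Highest confidence first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(name, round(score, 2)) for score, name in scored[:limit]]


# Illustrative sample data only.
genes = [
    {"name": "tp53", "synonyms": ["p53", "lfs1"]},
    {"name": "brca1", "synonyms": ["rnf53", "brcc1"]},
    {"name": "egfr", "synonyms": ["erbb1", "her1"]},
]

print(fuzzy_search("p53", genes))
print(fuzzy_search("p53", genes, threshold=0.5))
```

Lowering the threshold admits weaker matches (here, a partial synonym overlap), which is the trade-off an adjustable threshold exposes: higher recall at the cost of noisier suggestions.
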

Datasets:
- `g_final.csv`: Gene symbols, synonyms, UniProt IDs
- `DrugBank Structure Links.csv`: Names, formulas, InChI/SMILES, CIDs
- `HSDN-Symptoms-DO.tsv`: Symptoms, diseases, DOIDs

To start the tool, run `python PhAlSemantic.py`. Select a dataset, choose the search field, and begin typing to receive live-ranked suggestions.

To add a new searchable dataset (e.g., pathways, enzymes):
- Prepare Your CSV/TSV File
  - Ensure the file has no null values in searchable columns.
  - Format all string fields as lowercase, whitespace-trimmed.
  - Each row should represent a single entry.
- Create a Loader Function. Define a new function (e.g., `load_pathway_data`) in `PhAlSemantic.py`, modeled after `load_gene_data`.
- Register the Dataset. Add a new entry to the `datasets` dictionary:

      'Pathways': {
          'csv_filename': 'pathway_data.csv',
          'search_options': ['pathway_name', 'pathway_id'],
          'search_labels': ['Pathway Name', 'Pathway ID'],
          'load_function': load_pathway_data
      }
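
The three steps above might fit together as in the following sketch. The signature and return value of `load_gene_data` are not shown in this README, so the loader below is hypothetical: it assumes a loader takes a file path and returns a list of entries (`candidate_data`) plus a dictionary of per-field lookup tables (`indexes`). The pathway names and IDs are made-up sample data written to a temporary file so the example runs on its own.

```python
import csv
import os
import tempfile


def load_pathway_data(csv_path):
    """Hypothetical loader: return (candidate_data, indexes) for a pathway CSV."""
    candidate_data = []
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            # Normalize every string field: trim whitespace and lowercase.
            entry = {
                key: (value or "").strip().lower()
                for key, value in row.items()
                if key is not None
            }
            # Skip rows with an empty searchable column.
            if entry.get("pathway_name") and entry.get("pathway_id"):
                candidate_data.append(entry)

    # One exact-match lookup table per searchable field.
    indexes = {
        field: {entry[field]: entry for entry in candidate_data}
        for field in ("pathway_name", "pathway_id")
    }
    return candidate_data, indexes


# Registration entry, mirroring the dictionary shown above.
datasets = {
    "Pathways": {
        "csv_filename": "pathway_data.csv",
        "search_options": ["pathway_name", "pathway_id"],
        "search_labels": ["Pathway Name", "Pathway ID"],
        "load_function": load_pathway_data,
    }
}

# Demo: write a throwaway CSV, then load it through the registered function.
with tempfile.TemporaryDirectory() as tmp:
    config = datasets["Pathways"]
    path = os.path.join(tmp, config["csv_filename"])
    with open(path, "w", newline="", encoding="utf-8") as handle:
        handle.write("pathway_name,pathway_id\n Glycolysis ,PW0001\n,PW0002\nTCA Cycle,pw0003\n")
    candidate_data, indexes = config["load_function"](path)

print(len(candidate_data))
print(sorted(indexes["pathway_id"]))
```

Normalizing inside the loader is a defensive choice for the sketch; if the file is already cleaned as described in the preparation step, the normalization is a no-op.
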
To allow searching by a new column (e.g., EC Number):
- Ensure the Column Exists. The field must be present and complete in the dataset.
- Add to Index. In the dataset loader function:

      ec_index = {}
      for entry in candidate_data:
          # Treat a missing or None value as empty so .lower() cannot fail
          ec = (entry.get('EC_Number') or '').strip().lower()
          if ec:
              ec_index[ec] = entry
      indexes['EC_Number'] = ec_index
- Expose in UI. Add the column to `search_options` and `search_labels` in your dataset config.
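
Once a column is indexed, a query against it could be resolved roughly as follows. This is a sketch under assumptions: the `lookup` helper is hypothetical, and falling back from an exact match to `difflib.get_close_matches` is one plausible strategy rather than the tool's documented behavior. The enzyme entries are illustrative sample data.

```python
from difflib import get_close_matches

# Illustrative entries; the last one has no EC number and is left out of the index.
candidate_data = [
    {"name": "hexokinase", "EC_Number": "2.7.1.1"},
    {"name": "alcohol dehydrogenase", "EC_Number": "1.1.1.1"},
    {"name": "unclassified", "EC_Number": ""},
]

# Build the index as described in the "Add to Index" step.
indexes = {}
ec_index = {}
for entry in candidate_data:
    ec = (entry.get("EC_Number") or "").strip().lower()
    if ec:
        ec_index[ec] = entry
indexes["EC_Number"] = ec_index


def lookup(field, query, cutoff=0.6):
    """Hypothetical helper: exact match first, then fuzzy match on index keys."""
    index = indexes[field]
    key = query.strip().lower()
    if key in index:
        return [index[key]]
    close = get_close_matches(key, list(index), n=3, cutoff=cutoff)
    return [index[match] for match in close]


print(lookup("EC_Number", "2.7.1.1"))  # exact hit
print(lookup("EC_Number", "2.7.1"))    # partial input, resolved by fuzzy fallback
```

Because the index maps each value to a single entry, rows that share a value overwrite one another; a column with repeated values would need a list of entries per key instead.
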
The code is released under the MIT License. The datasets require attribution to their sources (e.g., DrugBank, DOID).
If you use this tool, please cite the accompanying report: "Enhancing Biomedical Data Accessibility via Modular Semantic Search: A Fuzzy Matching System for PharmAlchemy"