A benchmarking suite for testing the performance of Splink's comparison functions across different database backends.
This repository contains benchmarks that measure and compare the execution speed of Splink's comparison functions (such as exact matching, Jaro similarity, and Jaro-Winkler similarity) on the DuckDB and Apache Spark backends.
- Install dependencies using Poetry:
poetry install
Execute the benchmarks using pytest-benchmark:
poetry run pytest benchmarks/
- Benchmarks common string comparison functions used in record linkage
- Tests against multiple database backends (DuckDB, Spark)
- Generates test datasets of configurable sizes
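- Uses pytest-benchmark, which runs each benchmark over multiple rounds and reports summary statistics (min, max, mean, standard deviation) rather than a single timing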
- Python 3.11 (Spark 3.5 does not officially support Python 3.12)