Commit: Initial commit of the code documentation notebook tests
Showing 141 changed files with 22,366 additions and 1 deletion.

---
# Release notes

## Version 1.4.1

- Major updates of documentation for open-sourcing
- Add `extra_features` option to `emm.fit_classifier` function
- Add `drop_duplicate_candidates` option to `prepare_name_pairs_pd` function
- Rename `SupervisedLayerEstimator` as `SparkSupervisedLayerEstimator`
- Consistent `carry_on_cols` behaviour between pandas and spark indexing classes
- Significant cleanup of `parameters.py`
- Remove `init_spark` file and related calls
- Cleanup of `util` and `spark_utils` functions
- Remove unused dap-related io functions

## Version 1.4.0

- Introduce `Timer` context for logging
- Removed backwards-compatibility `unionByName` helper; Spark >= 3.1 required
- Replaced custom "NON NFKD MAP" with `unidecode`
- Integration test speedup: split off long-running integration test
- Removed: `verbose`, `compute_missing`, `use_tqdm`, `save_intermediary`, `n_jobs` options, `mlflow` dependencies
- Removed: prediction explanations (bloat), unused unsupervised model, "name_clustering" aggregation
- Perf: 5-10x speedup of feature computations
- Perf: `max_frequency_nm_score` and `mean_score` aggregation methods short-circuit groups with only one record (2-3x speedup for skewed datasets)
- Tests: added requests retries with backoff for unstable connections

## Version 1.3.14

- Converted RST readme and changelog to Markdown
- Introduced new parameters for force execution and cosine similarity threads

## Version 1.3.5-1.3.13

See git history for changes.

## Version 1.3.4, Jan 2023

- Added helper function to activate mlflow tracking
- Added spark example to example.py
- Minor updates to documentation

## Version 1.3.3, Dec 2022

- Added supervised-model feature indicating matches of legal entity forms between names. Turn on with parameter
  `with_legal_entity_forms_match=True`. Example usage in
  `03-entity-matching-training-pandas-version.ipynb`; for
  the code see `calc_features/cleanco_lef_matching.py`.
- Added code for calculating discrimination threshold curves:
  `em.calc_threshold()`. Example usage in
  `03-entity-matching-training-pandas-version.ipynb`.
- Added example notebook for name aggregation. See
  `04-entity-matching-aggregation-pandas-version.ipynb`.
---
Copyright 2023 ING Analytics Wholesale Banking

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated
documentation files (the "Software"), to deal in the Software without restriction, including without limitation
the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and
to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions
of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED
TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF
CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
DEALINGS IN THE SOFTWARE.
---
################################################################################################
#
# NOTICE: pass-through licensing of bundled components
#
# Entity Matching Model gathers together a toolkit of pre-existing third-party
# open-source software components. These software components are governed by their own licenses,
# which Entity Matching Model does not modify or supersede; please consult the originating
# authors. Altogether these components carry a mixture of the following licenses: Apache 2.0,
# GNU, MIT, BSD2, BSD3.
#
# Although we have examined the licenses to verify acceptance of commercial and non-commercial
# use, please see and consult the original licenses or authors.
#
################################################################################################
#
# There are EMM functions/classes where code or techniques have been reproduced and/or modified
# from existing open-source packages. We list these here:
#
# Package: cleanco
# EMM file: emm/calc_features/cleanco_lef_matching.py
# Function: custom_basename_and_lef()
# Reference: https://github.com/psolin/cleanco/blob/master/cleanco/clean.py#L76
# License: MIT
#          https://github.com/psolin/cleanco/blob/master/LICENSE.txt
---
# Entity Matching model

[![emm package in P11945-outgoing feed in Azure Artifacts](https://feeds.dev.azure.com/INGCDaaS/49255723-5232-4e9f-9501-068bf5e381a9/_apis/public/Packaging/Feeds/P11945-outgoing/Packages/8436e3e5-0029-4c5e-9a98-a9961acdd9a0/Badge)](https://dev.azure.com/INGCDaaS/IngOne/_artifacts/feed/P11945-outgoing/PyPI/emm?preferRelease=true)

Entity Matching Model (EMM) solves the problem of matching company names between two possibly very
large datasets. EMM can match millions of names against millions of names with a distributed approach.
It uses well-established candidate selection techniques from string matching,
namely tf-idf vectorization combined with cosine similarity (with significant optimization),
both word-based and character-based, and sorted neighbourhood indexing.
These so-called indexers act complementarily when selecting realistic name-pair candidates.
On top of the indexers, EMM has a classifier with optimized string-based, rank-based, and
legal-entity-based features to estimate how confident a company name match is.
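The character-based tf-idf plus cosine-similarity idea behind the indexers can be sketched with plain scikit-learn. This is a simplified illustration with made-up example names, not EMM's optimized implementation:

```python
# Simplified sketch of tf-idf + cosine-similarity candidate selection,
# conceptually similar to EMM's character-based indexer. Example names are
# invented for illustration; EMM's actual implementation is heavily optimized.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

ground_truth = ["Apple Inc", "Microsoft Corporation", "Alphabet Inc"]
names_to_match = ["Appel Inc.", "Microsof Corp"]

# character 2-grams, like the 'characters' tokenizer with ngram=2
vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(2, 2))
gt_matrix = vectorizer.fit_transform(ground_truth)
nm_matrix = vectorizer.transform(names_to_match)

# cosine similarity between each name-to-match and all ground-truth names
sims = cosine_similarity(nm_matrix, gt_matrix)
for name, row in zip(names_to_match, sims):
    best = row.argmax()
    print(f"{name!r} -> {ground_truth[best]!r} (cosine {row[best]:.2f})")
```

Despite the typos, shared character bigrams make each noised name land on the intended ground-truth entry; an indexer would keep the top-scoring pairs above a lower bound as candidates.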
The classifier can be trained to give a string-similarity score or a probability of match.
Both types of score are useful, in particular when there are many good-looking matches to choose between.
Optionally, the EMM package can also be used to match a group of company names that belong together
to a common company name in the ground truth, for example all the different names used to address an external bank account.
This step aggregates the name-matching scores from the supervised layer into a single match.
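The aggregation idea can be pictured with a small pandas sketch. The column names and data below are hypothetical, not EMM's actual output schema, except for `nm_score`, which is the match-score column EMM produces:

```python
# Illustrative sketch of score aggregation: several noisy names belonging to
# one account are matched individually, then their scores are combined per
# (account, ground-truth candidate) and the best candidate is kept per account.
# Column names 'account' and 'gt_name' are hypothetical examples.
import pandas as pd

scored = pd.DataFrame({
    "account": ["A1", "A1", "A1", "B2", "B2"],
    "gt_name": ["Acme Ltd", "Acme Ltd", "Acme Holdings", "Beta NV", "Beta BV"],
    "nm_score": [0.9, 0.8, 0.4, 0.7, 0.6],
})

# mean score per candidate, then keep the highest-scoring candidate per account
agg = scored.groupby(["account", "gt_name"])["nm_score"].mean().reset_index()
best = agg.loc[agg.groupby("account")["nm_score"].idxmax()]
print(best)
```

This mirrors, in spirit, aggregation methods such as `mean_score` mentioned in the release notes; the real aggregation layer handles ties, frequencies, and skewed group sizes.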
The package is modular in design and works with both Pandas and Spark. A classifier trained with the former
can be used with the latter and vice versa.
For release history see `CHANGES.md`.

## Notebooks

For detailed examples of the code please see the notebooks under `notebooks/`.

- `01-entity-matching-pandas-version.ipynb`: Using the Pandas version of EMM for name-matching.
- `02-entity-matching-spark-version.ipynb`: Using the Spark version of EMM for name-matching.
- `03-entity-matching-training-pandas-version.ipynb`: Fitting the supervised model and setting a discrimination threshold (Pandas).
- `04-entity-matching-aggregation-pandas-version.ipynb`: Using the aggregation layer and setting a discrimination threshold (Pandas).

## Documentation

For documentation, design, and API see `docs/`.
## Check it out

The Entity Matching Model library requires Python >= 3.7 and is pip friendly. To get started, simply do:

```shell
pip install emm
```

or check out the code from our repository:

```shell
git clone https://github.com/ing-bank/EntityMatchingModel.git
pip install -e EntityMatchingModel/
```

where in this example the code is installed in edit mode (option `-e`).

Additional dependencies can be installed with, e.g.:

```shell
pip install "emm[spark,dev,test]"
```

You can now use the package in Python with:

```python
import emm
```

**Congratulations, you are now ready to use the Entity Matching model!**
## Quick run

As a quick example, you can do:

```python
from emm import PandasEntityMatching
from emm.data.create_data import create_example_noised_names

# generate example ground-truth names and matching noised names, with typos and missing words
ground_truth, noised_names = create_example_noised_names(random_seed=42)
train_names, test_names = noised_names[:5000], noised_names[5000:]

# two example name-pair candidate generators:
# character-based cosine similarity and sorted neighbourhood indexing
indexers = [
    {
        'type': 'cosine_similarity',
        'tokenizer': 'characters',   # character-based cosine similarity; alternative: 'words'
        'ngram': 2,                  # 2-character tokens only
        'num_candidates': 5,         # max 5 candidates per name-to-match
        'cos_sim_lower_bound': 0.2,  # lower bound on cosine similarity
    },
    {'type': 'sni', 'window_length': 3},  # sorted neighbourhood indexing window of size 3
]
em_params = {
    'name_only': True,         # only consider name information for matching
    'entity_id_col': 'Index',  # important to set both index and name columns to pick up
    'name_col': 'Name',
    'indexers': indexers,
    'supervised_on': False,    # no supervised model (yet) to select best candidates
    'with_legal_entity_forms_match': True,  # add feature that indicates match of legal entity forms (e.g. ltd != co)
}
# 1. initialize the entity matcher
p = PandasEntityMatching(em_params)

# 2. fitting: prepare the indexers based on the ground-truth names,
#    e.g. fit the tfidf matrix of the first indexer
p.fit(ground_truth)

# 3. create and fit a supervised model for the PandasEntityMatching object, to pick the best match (this takes a while).
#    Input is a "positive" names column 'Name', which are all supposed to match the ground truth,
#    and an id column 'Index' to check which candidate name-pairs are matching and which are not.
#    A fraction of these names may be turned into negative names (= no match to the ground truth).
#    (internally, candidate name-pairs are automatically generated; these are the input to the classification)
p.fit_classifier(train_names, create_negative_sample_fraction=0.5)

# 4. scoring: generate a pandas dataframe of all name-pair candidates.
#    The classifier-based probability of match is provided in the column 'nm_score'.
#    Note: you can also call p.transform() without training the classifier first.
candidates_scored_pd = p.transform(test_names)

# 5. scoring: for each name-to-match, select the best ground-truth candidate
best_candidates = candidates_scored_pd[candidates_scored_pd.best_match]
best_candidates.head()
```
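The `sni` indexer above uses sorted neighbourhood indexing: sort all names alphabetically and pair each name-to-match with the ground-truth names that fall inside a fixed window around it. A simplified, self-contained sketch with invented example names (not EMM's implementation):

```python
# Sorted neighbourhood indexing sketch: merge ground truth and names-to-match,
# sort alphabetically, and collect ground-truth names within a fixed window
# around each name-to-match in the sorted order. Example names are invented.
ground_truth = ["alpha systems", "beta corp", "gamma industries", "delta llc"]
names_to_match = ["beta corpp", "gama industries"]

window = 3  # like {'type': 'sni', 'window_length': 3}: one neighbour on each side

# merge and sort all names, remembering which source each came from
merged = sorted(
    [(n, "gt") for n in ground_truth] + [(n, "nm") for n in names_to_match]
)

half = window // 2
candidates = {}
for i, (name, src) in enumerate(merged):
    if src != "nm":
        continue
    # keep ground-truth names whose sorted position is within the window
    lo, hi = max(0, i - half), min(len(merged), i + half + 1)
    candidates[name] = [m for m, s in merged[lo:hi] if s == "gt"]

print(candidates)
```

Because sorting puts similarly spelled names next to each other, typos at the end of a name ("beta corpp") still yield the right candidate cheaply; this complements the cosine-similarity indexer, which catches typos anywhere in the name.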
For Spark, you can use the class `SparkEntityMatching` instead, with the same API as the Pandas version.
For all available examples, please see the tutorial notebooks under `notebooks/`.

## Project contributors

This package was authored by ING Analytics Wholesale Banking.

## Contact and support

Contact the WBAA team via Github issues.
Please note that INGA-WB provides support only on a best-effort basis.

## License

Copyright ING WBAA 2023. Entity Matching Model is completely free, open-source and licensed under the [MIT license](https://en.wikipedia.org/wiki/MIT_License).
---
# Minimal makefile for Sphinx documentation
#

# You can set these variables from the command line, and also
# from the environment for the first two.
SPHINXOPTS    ?=
SPHINXBUILD   ?= sphinx-build
SOURCEDIR     = source
BUILDDIR      = build

# Put it first so that "make" without argument is like "make help".
help:
	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

.PHONY: help Makefile

# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
---
Generating Documentation with Sphinx
====================================

This README is for generating and writing documentation using Sphinx.
The repository should already contain the auto-generated files
along with the regular documentation.

Installing Sphinx
-----------------

First install Sphinx. Go to http://www.sphinx-doc.org/en/stable/ or run

::

    pip install -U Sphinx
    pip install -U sphinx-rtd-theme
    conda install -c conda-forge nbsphinx

The docs/sphinx folder has the structure of a Sphinx project.
However, if you want to make a new Sphinx project, run:

::

    sphinx-quickstart

It quickly generates a conf.py file, which contains the configuration
of your Sphinx build.

Update the HTML docs
--------------------

Now we want Sphinx to auto-generate documentation from docstrings and other
documentation in the code base. Luckily Sphinx has the apidoc
functionality: it walks through a path, finds all the Python files and,
depending on your arguments, parses certain parts of the code
(docstrings, hidden classes, etc.).

**First make sure your environment is set up properly. Python must be
able to import all modules, otherwise it will not work!**

From the root of the repository:

::

    $ source setup.sh

To run the auto-generation of the documentation, type in /docs/:

::

    ./autogenerate.sh

to scan the Python files and generate \*.rst files with the documentation.
The script itself contains the usage of apidoc.

Now to make the actual documentation files, run:

::

    make clean

to clean up the old Sphinx build, and run:

::

    make html

to make the new HTML build. It will be stored in docs/build/html/
(your config can adjust this, but that is the default). The index.html is the
starting page; open this file to see the result.

What is an .rst file?
~~~~~~~~~~~~~~~~~~~~~

ReST stands for ReSTructuredText and is the format that Sphinx uses
(http://docutils.sourceforge.net/docs/user/rst/quickref.html). Sphinx looks
for other RST files to import; see index.rst for how the **toctree**
refers to other files.
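As an illustration of how a toctree stitches pages together, a minimal
index.rst might look as follows (a hypothetical sketch with invented entry
names; see the repository's actual index.rst for the real entries):

::

    Entity Matching Model
    =====================

    .. toctree::
       :maxdepth: 2

       readme
       api_index

Each entry under ``toctree`` names another RST file (without the .rst
extension) relative to the source directory, and Sphinx builds the navigation
tree from them.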
---
#!/bin/bash

# (re)create required directories
rm -rf autogen
mkdir -p source/_static autogen

# auto-generate code documentation
sphinx-apidoc -f -H API -o autogen ../../emm/
mv autogen/modules.rst autogen/api_index.rst
mv autogen/* source/

# remove auto-gen directory
rm -rf autogen
---
@ECHO OFF

pushd %~dp0

REM Command file for Sphinx documentation

if "%SPHINXBUILD%" == "" (
	set SPHINXBUILD=sphinx-build
)
set SOURCEDIR=source
set BUILDDIR=build

if "%1" == "" goto help

%SPHINXBUILD% >NUL 2>NUL
if errorlevel 9009 (
	echo.
	echo.The 'sphinx-build' command was not found. Make sure you have Sphinx
	echo.installed, then set the SPHINXBUILD environment variable to point
	echo.to the full path of the 'sphinx-build' executable. Alternatively you
	echo.may add the Sphinx directory to PATH.
	echo.
	echo.If you don't have Sphinx installed, grab it from
	echo.http://sphinx-doc.org/
	exit /b 1
)

%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
goto end

:help
%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%

:end
popd
---
API
===

.. toctree::
   :maxdepth: 4

   emm