Initial commit of the code documentation notebook tests
mbaak committed Nov 27, 2023
1 parent a92a705 commit b8b190a
Showing 141 changed files with 22,366 additions and 1 deletion.
52 changes: 52 additions & 0 deletions CHANGES.md
@@ -0,0 +1,52 @@
# Release notes

## Version 1.4.1

- Major updates of documentation for open-sourcing
- Add extra_features option to emm.fit_classifier function
- Add drop_duplicate_candidates option to prepare_name_pairs_pd function
- Rename SupervisedLayerEstimator to SparkSupervisedLayerEstimator
- Consistent carry_on_cols behavior between pandas and spark indexing classes
- Significant cleanup of parameters.py
- Remove init_spark file and related calls
- Cleanup of util and spark_utils functions
- Remove unused dap-related io functions

## Version 1.4.0

- Introduce `Timer` context for logging
- Removed backwards compatibility `unionByName` helper. Spark >= 3.1 required.
- Replaced custom "NON NFKD MAP" with `unidecode`
- Integration test speedup: split-off long-running integration test
- Removed: `verbose`, `compute_missing`, `use_tqdm`, `save_intermediary`, and `n_jobs` options, and `mlflow` dependencies
- Removed: prediction explanations (bloat), unused unsupervised model, "name_clustering" aggregation
- Perf: 5-10x speedup of feature computations
- Perf: `max_frequency_nm_score` and `mean_score` aggregation method short-circuit groups with only one record (2-3x speedup for skewed datasets)
- Tests: added requests retries with backoff for unstable connections

## Version 1.3.14

- Converted RST readme and changelog to Markdown
- Introduced new parameters for force execution and cosine similarity threads.

## Version 1.3.5-1.3.13

See git history for changes.

## Version 1.3.4, Jan 2023

- Added helper function to activate mlflow tracking.
- Added spark example to example.py
- Minor updates to documentation.

## Version 1.3.3, Dec 2022

- Added sm feature indicating matches of legal entity forms between names. Turn on with parameter
`with_legal_entity_forms_match=True`. Example usage in:
`03-entity-matching-training-pandas-version.ipynb`. For
code see `calc_features/cleanco_lef_matching.py`.
- Added code for calculating discrimination threshold curves:
`em.calc_threshold()`. Example usage in:
`03-entity-matching-training-pandas-version.ipynb`.
- Added example notebook for name aggregation. See:
`04-entity-matching-aggregation-pandas-version.ipynb`.
15 changes: 15 additions & 0 deletions LICENSE
@@ -0,0 +1,15 @@
Copyright 2023 ING Analytics Wholesale Banking

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated
documentation files (the "Software"), to deal in the Software without restriction, including without limitation
the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and
to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions
of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED
TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF
CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
DEALINGS IN THE SOFTWARE.
24 changes: 24 additions & 0 deletions NOTICE
@@ -0,0 +1,24 @@
################################################################################################
#
# NOTICE: pass-through licensing of bundled components
#
# Entity Matching Model gathers together a toolkit of pre-existing third-party
# open-source software components. These software components are governed by their own licenses
# which Entity Matching Model does not modify or supersede, please consult the originating
# authors. These components altogether have a mixture of the following licenses: Apache 2.0, GNU,
# MIT, BSD2, BSD3 licenses.
#
# Although we have examined the licenses to verify acceptance of commercial and non-commercial
# use, please see and consult the original licenses or authors.
#
################################################################################################
#
# There are EMM functions/classes where code or techniques have been reproduced and/or modified
# from existing open-source packages. We list these here:
#
# Package: cleanco
# EMM file: emm/calc_features/cleanco_lef_matching.py
# Function: custom_basename_and_lef()
# Reference: https://github.com/psolin/cleanco/blob/master/cleanco/clean.py#L76
# License: MIT
# https://github.com/psolin/cleanco/blob/master/LICENSE.txt
141 changes: 140 additions & 1 deletion README.md
@@ -1 +1,140 @@
# EntityMatchingModel
# Entity Matching model

[![emm package in P11945-outgoing feed in Azure
Artifacts](https://feeds.dev.azure.com/INGCDaaS/49255723-5232-4e9f-9501-068bf5e381a9/_apis/public/Packaging/Feeds/P11945-outgoing/Packages/8436e3e5-0029-4c5e-9a98-a9961acdd9a0/Badge)](https://dev.azure.com/INGCDaaS/IngOne/_artifacts/feed/P11945-outgoing/PyPI/emm?preferRelease=true)

Entity Matching Model (EMM) solves the problem of matching company names between two possibly very
large datasets. EMM can match millions against millions of names with a distributed approach.
It uses well-established candidate selection techniques from string matching,
namely tfidf vectorization combined with cosine similarity (with significant optimization),
both word-based and character-based, and sorted neighbourhood indexing.
These so-called indexers complement each other in selecting realistic name-pair candidates.
On top of the indexers, EMM has a classifier with optimized string-based, rank-based, and
legal-entity-based features to estimate how confident a company name match is.
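
To illustrate the idea behind the cosine-similarity indexers, here is a minimal sketch of the
general technique using scikit-learn. This is not EMM's own (heavily optimized) implementation,
just the textbook version of character-based tfidf plus cosine similarity:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

ground_truth = ["Apple Inc.", "Alphabet Inc.", "Microsoft Corporation"]
names_to_match = ["Appel Inc", "Microsft Corp"]

# character-based tfidf over 2-grams; a word-based indexer would use analyzer="word"
vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(2, 2))
gt_vectors = vectorizer.fit_transform(ground_truth)  # fit on the ground-truth names
nm_vectors = vectorizer.transform(names_to_match)    # vectorize the names-to-match

# row i holds the similarity of names_to_match[i] to every ground-truth name;
# pairs above a lower bound (e.g. 0.2) become the candidate name-pairs
similarities = cosine_similarity(nm_vectors, gt_vectors)
print(similarities.round(2))
```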

The classifier can be trained to give a string similarity score or a probability of match.
Both types of score are useful, in particular when there are many good-looking matches to choose from.
Optionally, the EMM package can also be used to match a group of company names that belong together
to a common company name in the ground truth, for example all the different names used to address an external bank account.
This step aggregates the name-matching scores from the supervised layer into a single match.
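
Conceptually, this aggregation can be illustrated with a small pandas sketch. This is an
illustration of the idea only, not the EMM API; the DataFrame layout and column names below
are hypothetical:

```python
import pandas as pd

# hypothetical scored name-pair candidates for two accounts (not EMM's output schema)
scored = pd.DataFrame({
    "account":  ["A1", "A1", "A1", "A2"],
    "gt_name":  ["Acme Ltd", "Acme Ltd", "Acme Corp", "Beta BV"],
    "nm_score": [0.91, 0.85, 0.40, 0.77],
})

# combine the per-name scores of each account into one score per ground-truth candidate ...
agg = scored.groupby(["account", "gt_name"], as_index=False)["nm_score"].mean()
# ... then keep the single best ground-truth candidate per account
best = agg.loc[agg.groupby("account")["nm_score"].idxmax()]
print(best)
```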

The package is modular in design and works with both Pandas and Spark. A classifier trained with the former
can be used with the latter and vice versa.

For release history see: ``CHANGES.md``.

## Notebooks

For detailed examples of the code please see the notebooks under `notebooks/`.

- `01-entity-matching-pandas-version.ipynb`: Using the Pandas version of EMM for name-matching.
- `02-entity-matching-spark-version.ipynb`: Using the Spark version of EMM for name-matching.
- `03-entity-matching-training-pandas-version.ipynb`: Fitting the supervised model and setting a discrimination threshold (Pandas).
- `04-entity-matching-aggregation-pandas-version.ipynb`: Using the aggregation layer and setting a discrimination threshold (Pandas).

## Documentation

For documentation, design, and API see `docs/`.


## Check it out

The Entity Matching Model library requires Python >= 3.7 and is pip-friendly. To get started, simply do:

```shell
pip install emm
```

or check out the code from our repository:

```shell
git clone https://github.com/ing-bank/EntityMatchingModel.git
pip install -e EntityMatchingModel/
```

where in this example the code is installed in edit mode (option -e).

Additional dependencies can be installed with, e.g.:

```shell
pip install "emm[spark,dev,test]"
```

You can now use the package in Python with:


```python
import emm
```

**Congratulations, you are now ready to use the Entity Matching model!**

## Quick run

As a quick example, you can do:

```python
from emm import PandasEntityMatching
from emm.data.create_data import create_example_noised_names

# generate example ground-truth names and matching noised names, with typos and missing words.
ground_truth, noised_names = create_example_noised_names(random_seed=42)
train_names, test_names = noised_names[:5000], noised_names[5000:]

# two example name-pair candidate generators: character-based cosine similarity and sorted neighbourhood indexing
indexers = [
    {
        'type': 'cosine_similarity',
        'tokenizer': 'characters',   # character-based cosine similarity. alternative: 'words'
        'ngram': 2,                  # 2-character tokens only
        'num_candidates': 5,         # max 5 candidates per name-to-match
        'cos_sim_lower_bound': 0.2,  # lower bound on cosine similarity
    },
    {'type': 'sni', 'window_length': 3}  # sorted neighbourhood indexing window of size 3.
]
em_params = {
    'name_only': True,         # only consider name information for matching
    'entity_id_col': 'Index',  # important: set both the entity-id and name columns
    'name_col': 'Name',
    'indexers': indexers,
    'supervised_on': False,    # no supervised model (yet) to select best candidates
    'with_legal_entity_forms_match': True,  # add feature that indicates match of legal entity forms (e.g. ltd != co)
}
# 1. initialize the entity matcher
p = PandasEntityMatching(em_params)

# 2. fitting: prepare the indexers based on the ground truth names, e.g. fit the tfidf matrix of the first indexer.
p.fit(ground_truth)

# 3. create and fit a supervised model for the PandasEntityMatching object, to pick the best match (this takes a while)
# input is a "positive" names column 'Name', with names that are all supposed to match the ground truth,
# and an id column 'Index' to check which candidate name-pairs match and which do not.
# A fraction of these names may be turned into negative names (= no match to the ground truth).
# (internally, candidate name-pairs are automatically generated; these are the input to the classification)
p.fit_classifier(train_names, create_negative_sample_fraction=0.5)

# 4. scoring: generate pandas dataframe of all name-pair candidates.
# The classifier-based probability of match is provided in the column 'nm_score'.
# Note: can also call p.transform() without training the classifier first.
candidates_scored_pd = p.transform(test_names)

# 5. scoring: for each name-to-match, select the best ground-truth candidate.
best_candidates = candidates_scored_pd[candidates_scored_pd.best_match]
best_candidates.head()
```

For Spark, you can use the class `SparkEntityMatching` instead, with the same API as the Pandas version; see the sketch below.
For all available examples, please see the tutorial notebooks under `notebooks/`.
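
A minimal Spark sketch along the lines of the Pandas example above (the `emm` import path for
`SparkEntityMatching` and the use of `createDataFrame` on the example pandas DataFrames are
assumptions here, not verified API details):

```python
from pyspark.sql import SparkSession
from emm import SparkEntityMatching  # import path assumed, mirroring PandasEntityMatching

spark = SparkSession.builder.getOrCreate()

# convert the example pandas DataFrames from the quick run into Spark DataFrames
ground_truth_sdf = spark.createDataFrame(ground_truth)
test_names_sdf = spark.createDataFrame(test_names)

# same parameters and the same API as the Pandas version
p = SparkEntityMatching(em_params)
p.fit(ground_truth_sdf)
candidates_scored_sdf = p.transform(test_names_sdf)
```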

## Project contributors

This package was authored by ING Analytics Wholesale Banking.

## Contact and support

Contact the WBAA team via GitHub issues.
Please note that INGA-WB provides support only on a best-effort basis.

## License

Copyright ING WBAA 2023. Entity Matching Model is completely free, open-source and licensed under the [MIT license](https://en.wikipedia.org/wiki/MIT_License).
20 changes: 20 additions & 0 deletions docs/sphinx/Makefile
@@ -0,0 +1,20 @@
# Minimal makefile for Sphinx documentation
#

# You can set these variables from the command line, and also
# from the environment for the first two.
SPHINXOPTS ?=
SPHINXBUILD ?= sphinx-build
SOURCEDIR = source
BUILDDIR = build

# Put it first so that "make" without argument is like "make help".
help:
	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

.PHONY: help Makefile

# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
78 changes: 78 additions & 0 deletions docs/sphinx/README.rst
@@ -0,0 +1,78 @@
Generating Documentation with Sphinx
====================================

This README is for generating and writing documentation using Sphinx.
The repository should already contain the auto-generated files
along with the regular documentation.

Installing Sphinx
-----------------

First install Sphinx. Go to http://www.sphinx-doc.org/en/stable/ or run

::

    pip install -U Sphinx
    pip install -U sphinx-rtd-theme
    conda install -c conda-forge nbsphinx

The docs/sphinx folder has the structure of a Sphinx project.
However, if you want to make a new Sphinx project, run:

::

    sphinx-quickstart

It quickly generates a conf.py file, which contains the configuration
for your Sphinx build.

Update the HTML docs
--------------------

Now we want Sphinx to autogenerate documentation from docstrings and other
documentation in the code base. Luckily Sphinx has the apidoc
functionality. It goes through a path, finds all the Python files and,
depending on your arguments, parses certain parts of the code
(docstrings, hidden classes, etc.).

**First make sure your environment is set up properly. Python must be
able to import all modules, otherwise it will not work!**

From the root of the repository:

::

    $ source setup.sh

To run the autogeneration of the documentation, type in docs/sphinx/:

::

    ./autogenerate.sh

to scan the Python files and generate \*.rst files with the documentation.
The script itself shows how apidoc is invoked.

Now, to make the actual documentation files, run:

::

    make clean

to clean up the old Sphinx build, and run:

::

    make html

to make the new html build. It will be stored in docs/build/html/ by
default (your config can adjust this). The index.html is the starting
page. Open this file to see the result.

What is an .rst file?
~~~~~~~~~~~~~~~~~~~~~

reST stands for reStructuredText and is the format that Sphinx uses
(http://docutils.sourceforge.net/docs/user/rst/quickref.html). Sphinx looks
for other RST files to import; see index.rst for how the **toctree**
refers to other files.
14 changes: 14 additions & 0 deletions docs/sphinx/autogenerate.sh
@@ -0,0 +1,14 @@
#!/bin/bash

# (re)create required directories
rm -rf autogen
mkdir -p source/_static autogen

# auto-generate code documentation
sphinx-apidoc -f -H API -o autogen ../../emm/
mv autogen/modules.rst autogen/api_index.rst
mv autogen/* source/

# remove auto-gen directory
rm -rf autogen

35 changes: 35 additions & 0 deletions docs/sphinx/make.bat
@@ -0,0 +1,35 @@
@ECHO OFF

pushd %~dp0

REM Command file for Sphinx documentation

if "%SPHINXBUILD%" == "" (
set SPHINXBUILD=sphinx-build
)
set SOURCEDIR=source
set BUILDDIR=build

if "%1" == "" goto help

%SPHINXBUILD% >NUL 2>NUL
if errorlevel 9009 (
	echo.
	echo.The 'sphinx-build' command was not found. Make sure you have Sphinx
	echo.installed, then set the SPHINXBUILD environment variable to point
	echo.to the full path of the 'sphinx-build' executable. Alternatively you
	echo.may add the Sphinx directory to PATH.
	echo.
	echo.If you don't have Sphinx installed, grab it from
	echo.http://sphinx-doc.org/
	exit /b 1
)

%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
goto end

:help
%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%

:end
popd
7 changes: 7 additions & 0 deletions docs/sphinx/source/api_index.rst
@@ -0,0 +1,7 @@
API
===

.. toctree::
   :maxdepth: 4

   emm