skrub

skrub (formerly dirty_cat) is a Python library that facilitates preparing your tables for machine learning.

For a detailed description of the problem of encoding dirty categorical data, see Similarity encoding for learning with dirty categorical variables [1] and Encoding high-cardinality string categorical variables [2].

If you like the package, please spread the word, and ⭐ the repository!

What can skrub do?

skrub provides tools (TableVectorizer, fuzzy_join...) and encoders (GapEncoder, MinHashEncoder...) for handling morphological similarities between strings. We usually identify three common cases: similarities, typos and variations.

The first example notebook goes in-depth on how to identify and deal with dirty data using the skrub library.
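To make "morphological similarity" concrete, here is a minimal sketch of one common way to compare strings by their spelling: the overlap between their sets of character 3-grams. It is an illustration only and uses nothing from skrub; the helper names `char_ngrams` and `ngram_similarity` are hypothetical, and the encoders in the library may work differently.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard similarity between the character n-gram sets of two strings."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        # Two strings too short to yield any n-gram are treated as identical.
        return 1.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


# A typo keeps part of the n-grams, an unrelated word shares none.
print(ngram_similarity("London", "Londn"))
print(ngram_similarity("London", "Paris"))
```

A misspelled entry still scores well above an unrelated one, which is the property that lets dirty categories be grouped or encoded together instead of being treated as entirely distinct values.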

What skrub cannot do

Semantic similarities are currently not supported. For example, the similarity between car and automobile is outside the reach of the methods implemented here.

This kind of problem is tackled by Natural Language Processing methods.

skrub can still help with handling typos and variations in this kind of setting.
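The limitation above can be seen with any purely character-based measure. The sketch below uses `difflib.SequenceMatcher` from the Python standard library as a stand-in for such a measure (an assumption made for illustration; it is not what skrub uses internally), and the helper name `surface_similarity` is hypothetical.

```python
from difflib import SequenceMatcher


def surface_similarity(a, b):
    """Character-level similarity in [0, 1], based only on shared substrings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


pairs = [
    ("automobile", "automobil"),  # typo: same word, one letter dropped
    ("automobile", "car"),        # synonym: same meaning, different spelling
]
for left, right in pairs:
    print(f"{left!r} vs {right!r}: {surface_similarity(left, right):.2f}")
```

The typo pair scores close to 1, while the synonym pair scores close to 0 even though both words mean the same thing. Capturing that second kind of relation requires a model of meaning, which is what Natural Language Processing methods provide.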

Installation (WIP)

There are currently no PyPI releases. You can still install the package from the GitHub repository with:

pip install git+https://github.com/skrub-data/skrub.git

Dependencies

Dependencies and minimal versions are listed in the setup file.

Related projects

Related projects are listed on skrub's website.

Contributing

If you want to encourage development of skrub, the best thing to do is to spread the word!

If you encounter an issue while using skrub, please open an issue and/or submit a pull request. Don't hesitate, you're helping to make this project better for everyone!

Additional resources

Introductory video (YouTube)
Overview poster for EuroSciPy 2022 (Google Drive)

References

[1] Patricio Cerda, Gaël Varoquaux, Balázs Kégl. Similarity encoding for learning with dirty categorical variables. 2018. Machine Learning journal, Springer.

[2] Patricio Cerda, Gaël Varoquaux. Encoding high-cardinality string categorical variables. 2020. IEEE Transactions on Knowledge & Data Engineering.
