Clean your image datasets effortlessly!
imgclean is a Python library that helps you quickly clean image datasets by detecting and removing:
- duplicate images (including near-duplicates using perceptual hashing)
- blurry images (using Laplacian variance)
- corrupted or unreadable files
- images that don't meet size requirements
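To make the near-duplicate idea concrete, here is a minimal, dependency-free sketch of one kind of perceptual hash (an average hash) together with the Hamming distance used to compare two hashes. It is an illustration only: the names `shrink`, `average_hash`, and `hamming_distance` are hypothetical, images are assumed to be plain nested lists of grayscale values, and imgclean's own hashing (built on imagehash) may use a different algorithm.

```python
from typing import List, Sequence

# Assumed input format for this sketch: rows of 0-255 grayscale values.
Gray = Sequence[Sequence[int]]


def shrink(pixels: Gray, size: int) -> List[List[float]]:
    """Downscale to size x size by averaging each cell.

    The image must be at least size x size so that no cell is empty.
    """
    height, width = len(pixels), len(pixels[0])
    small = []
    for cell_y in range(size):
        y0, y1 = cell_y * height // size, (cell_y + 1) * height // size
        row = []
        for cell_x in range(size):
            x0, x1 = cell_x * width // size, (cell_x + 1) * width // size
            block = [pixels[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            row.append(sum(block) / len(block))
        small.append(row)
    return small


def average_hash(pixels: Gray, hash_size: int = 8) -> int:
    """Return a hash_size * hash_size bit hash: 1 where a cell is brighter than the mean."""
    flat = [value for row in shrink(pixels, hash_size) for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | int(value > mean)
    return bits


def hamming_distance(hash_a: int, hash_b: int) -> int:
    """Count the bits that differ between two hashes."""
    return bin(hash_a ^ hash_b).count("1")


# A horizontal gradient, a slightly brighter copy, and an inverted copy.
gradient = [[x * 8 for x in range(32)] for _ in range(32)]
brighter = [[value + 3 for value in row] for row in gradient]
inverted = [[255 - value for value in row] for row in gradient]

original_hash = average_hash(gradient)
print(hamming_distance(original_hash, average_hash(brighter)))  # 0
print(hamming_distance(original_hash, average_hash(inverted)))  # 64
```

A small Hamming distance means two images look alike even when their bytes differ. How small a distance should count as a duplicate is a tuning choice.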
When building machine learning models or managing large image collections, you often encounter:
- duplicate images wasting storage and skewing training data
- blurry or low-quality images reducing model performance
- corrupted files causing pipeline failures
- images with wrong dimensions
imgclean handles all of this with a single function call!
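For readers curious about what a Laplacian-variance blur check involves, here is a minimal pure-Python sketch. It rests on several assumptions: the function names `laplacian_variance` and `is_blurry` are hypothetical, images are nested lists of grayscale values, border pixels are simply skipped, and an image is treated as blurry when the variance falls below the threshold. imgclean itself relies on OpenCV and may differ in the details.

```python
from statistics import pvariance
from typing import Sequence

# Assumed input format for this sketch: rows of 0-255 grayscale values.
Gray = Sequence[Sequence[int]]


def laplacian_variance(pixels: Gray) -> float:
    """Variance of the 4-neighbour Laplacian response over interior pixels.

    Sharp images have strong edges, so the response varies a lot;
    blurry images give responses close to zero everywhere.
    The image must be at least 3 x 3.
    """
    height, width = len(pixels), len(pixels[0])
    responses = []
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            responses.append(
                pixels[y - 1][x]
                + pixels[y + 1][x]
                + pixels[y][x - 1]
                + pixels[y][x + 1]
                - 4 * pixels[y][x]
            )
    return float(pvariance(responses))


def is_blurry(pixels: Gray, blur_threshold: float = 100.0) -> bool:
    """Assumption: a variance below the threshold means the image is blurry."""
    return laplacian_variance(pixels) < blur_threshold


# A featureless grey image versus a high-contrast checkerboard.
flat = [[128] * 16 for _ in range(16)]
checkerboard = [[255 if (x + y) % 2 == 0 else 0 for x in range(16)] for y in range(16)]

print(is_blurry(flat))          # True
print(is_blurry(checkerboard))  # False
```

Under this reading, raising the threshold flags more images as blurry, which matches the "higher = more strict" note on `blur_threshold`.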
pip install imgclean

Or install from source:
git clone https://github.com/yourusername/imgclean.git
cd imgclean
pip install -e .

Python API:
from imgclean import ImageCleaner
# create a cleaner
cleaner = ImageCleaner()
# clean a directory
results = cleaner.clean(
    "path/to/images",
    remove_duplicates=True,
    remove_blurry=True,
    blur_threshold=100,
)
print(f"Removed {results['removed']} images")
print(f"Kept {results['kept']} images")

Command line:
# basic usage
imgclean path/to/images
# with options
imgclean path/to/images --remove-duplicates --remove-blurry --blur-threshold 100
# dry run (see what would be removed without actually removing)
imgclean path/to/images --dry-run

Duplicate detection uses perceptual hashing to find duplicates and near-duplicates:
cleaner.clean("images/", remove_duplicates=True, hash_size=8)

Blur detection flags blurry images using Laplacian variance:
cleaner.clean("images/", remove_blurry=True, blur_threshold=100)

Corruption checking automatically removes corrupted or unreadable images:
cleaner.clean("images/", remove_corrupted=True)

Size filtering removes images that are too small or too large:
cleaner.clean(
    "images/",
    min_width=100,
    min_height=100,
    max_width=5000,
    max_height=5000,
)

The call returns a summary of what was kept and removed:
results = cleaner.clean("images/", remove_duplicates=True, remove_blurry=True)
print(results)
# {
# 'kept': 850,
# 'removed': 150,
# 'duplicates': 100,
# 'blurry': 30,
# 'corrupted': 20,
# 'report': [...list of removed files with reasons...]
# }

Complete example:
from imgclean import ImageCleaner
cleaner = ImageCleaner()
results = cleaner.clean(
    "path/to/images",
    # duplicate detection
    remove_duplicates=True,
    hash_size=8,  # smaller = more sensitive
    # blur detection
    remove_blurry=True,
    blur_threshold=100,  # higher = more strict
    # corruption check
    remove_corrupted=True,
    # size constraints
    min_width=100,
    min_height=100,
    max_width=5000,
    max_height=5000,
    # output
    output_dir=None,  # copy good images to new directory
    dry_run=False,  # preview without deleting
    verbose=True,  # show progress bar
)

Contributions are welcome! Feel free to open issues or submit pull requests.
MIT License. Feel free to use this in your projects!
If you find this useful, please star the repo! It helps others discover the project.
imgclean is fast! Here's what you can expect:
- 1,000 images: ~10 seconds
- 10,000 images: ~2 minutes
- 100,000 images: ~20 minutes
These are rough figures; actual performance varies with image size, hardware, and enabled features.

Common use cases:
- ML dataset preparation: clean training data before model training
- photography: organize and deduplicate photo collections
- web scraping: clean downloaded image datasets
- research: prepare datasets for computer vision research
- storage optimization: remove duplicates to save disk space
Contributions are welcome! Here's how you can help:
- fork the repository
- create a feature branch (git checkout -b feature/amazing-feature)
- commit your changes (git commit -m 'add amazing feature')
- push to the branch (git push origin feature/amazing-feature)
- open a pull request
Please make sure to:
- add tests for new features
- update documentation
- follow the existing code style
- keep commits atomic and messages clear
To report a bug, please open an issue with:
- a clear description of the bug
- steps to reproduce
- expected vs actual behavior
- your environment (OS, Python version, imgclean version)
If imgclean helped you, please:
- star this repo
- tweet about it
- write a blog post
- tell your friends
Built with:
- Pillow - image processing
- OpenCV - blur detection
- ImageHash - perceptual hashing
- tqdm - progress bars