Summary of changes in this release
New feature updates around data handling and preprocessing:
- Enable loading of Parquet and Arrow Dataset files.
- Dataset mixing via sampling probabilities in data config.
- New additional_data_handlers argument in the train function, for registering custom data handlers with the data preprocessor.
- Support multiple files, directories, pattern-based paths, HF Dataset IDs, and their combinations via data_config.
- New support for both multi-turn and single-turn chat interactions.
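To illustrate what mixing datasets via sampling probabilities means in general, here is a minimal standard-library sketch. It is not this project's implementation, and it does not use the data config schema: the function name mix_datasets, its parameters, and the toy datasets are all hypothetical. The sketch assumes that at each step one source dataset is chosen at random, weighted by its probability, and that mixing stops as soon as any source runs out.

```python
import math
import random
from itertools import islice


def mix_datasets(datasets, probabilities, seed=0):
    """Yield examples from several datasets, choosing the source of each
    example at random according to `probabilities`.

    Hypothetical helper for illustration only; not part of any library.
    """
    if len(datasets) != len(probabilities):
        raise ValueError("need exactly one probability per dataset")
    if not math.isclose(sum(probabilities), 1.0):
        raise ValueError("sampling probabilities must sum to 1.0")

    rng = random.Random(seed)  # seeded so the mix is reproducible
    iterators = [iter(d) for d in datasets]
    indices = list(range(len(datasets)))

    while True:
        # Pick which dataset supplies the next example.
        i = rng.choices(indices, weights=probabilities, k=1)[0]
        try:
            yield next(iterators[i])
        except StopIteration:
            # Assumption: stop once any source dataset is exhausted.
            return


# Toy datasets standing in for real training data.
chat = [f"chat-{n}" for n in range(1000)]
code = [f"code-{n}" for n in range(1000)]

# Draw 500 examples, roughly 80% from `chat` and 20% from `code`.
mixed = list(islice(mix_datasets([chat, code], [0.8, 0.2], seed=42), 500))
chat_share = sum(x.startswith("chat-") for x in mixed) / len(mixed)
print(f"chat share: {chat_share:.2f}")
```

Stopping at the first exhausted source is only one possible policy; an alternative is to keep sampling from the remaining sources until all are exhausted.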
New tracker:
- New MLflow tracker.
Additional Changes
- Refactor test artifacts into tests/artifacts, adding new data types, datasets, and predefined data configs for new unit tests.
- Resolve issues with deprecated training arguments.
Full list of Changes
- feat: Add support to handle Parquet Dataset files via data config by @Abhishek-TAMU in #401
- test: add arrow datasets and arrow unit tests by @willmj in #403
- feat: Perform dataset mixing via sampling probabilities in data config by @dushyantbehl in #408
- feat: Expose additional data handlers as an argument in train by @dushyantbehl in #409
- fix: Move deprecated positional arguments from SFTTrainer to SFTConfig by @Luka-D in #399
- fix: update dataclass objects directly instead of creating new variables by @kmehant in #418
- test: Add unit tests to test multiple files in single dataset by @Abhishek-TAMU in #412
- feat: Add multi and single turn chat support by @dushyantbehl in #415
- feat: Integrate MLflow tracker by @dushyantbehl in #425
- feat: Handle passing of multiple files, multiple folders, path with patterns, HF Dataset and combination by @Abhishek-TAMU in #424
- docs: Add documentation for data preprocessor release by @dushyantbehl in #423
New Contributors
Full Changelog: v2.2.0...v2.3.1