v2.3.1

Latest

aluu317 released this 23 Dec 16:57

3ec30a0

Summary of changes in this release

New feature updates around data handling and preprocessing:

Enable loading of Parquet and Arrow Dataset files.
Dataset mixing via sampling probabilities in data config.
New additional_data_handlers argument in the train function; handlers passed through it are registered with the data preprocessor.
Support multiple files, directories, pattern-based paths, HF Dataset IDs, and their combinations via data_config.
New support for both multi-turn and single-turn chat interactions.
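
To illustrate what mixing datasets by sampling probabilities means, here is a minimal standalone sketch in plain Python. It is not this library's implementation and does not use its data config format; the helper name mix_datasets and its parameters are hypothetical, chosen only for this example. It assumes sampling with replacement and probabilities that sum to 1.

```python
import random
from typing import Dict, List, Sequence


def mix_datasets(
    datasets: Sequence[Sequence[Dict]],
    probabilities: Sequence[float],
    num_samples: int,
    seed: int = 0,
) -> List[Dict]:
    """Draw num_samples records, choosing the source dataset for each draw
    according to the given sampling probabilities (hypothetical helper)."""
    if len(datasets) != len(probabilities):
        raise ValueError("need exactly one probability per dataset")
    if abs(sum(probabilities) - 1.0) > 1e-9:
        raise ValueError("sampling probabilities must sum to 1")

    rng = random.Random(seed)  # seeded so the mix is reproducible
    mixed: List[Dict] = []
    for _ in range(num_samples):
        # Pick which dataset to draw from, weighted by its probability.
        (source,) = rng.choices(range(len(datasets)), weights=probabilities, k=1)
        # Pick one record from that dataset uniformly, with replacement.
        mixed.append(rng.choice(datasets[source]))
    return mixed


if __name__ == "__main__":
    chat = [{"source": "chat", "id": i} for i in range(5)]
    code = [{"source": "code", "id": i} for i in range(5)]
    sample = mix_datasets([chat, code], [0.8, 0.2], num_samples=1000)
    share = sum(r["source"] == "chat" for r in sample) / len(sample)
    print(f"share of chat records: {share:.2f}")
```

With probabilities of 0.8 and 0.2, roughly four out of five drawn records come from the first dataset, regardless of how many records each dataset contains.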

New tracker:

New MLflow tracker

Additional Changes

Refactor test artifacts into tests/artifacts, adding new data types, datasets, and predefined data configs for new unit tests.
Resolve issues with deprecated training arguments.

Full list of Changes

feat: Add support to handle Parquet Dataset files via data config by @Abhishek-TAMU in #401
test: add arrow datasets and arrow unit tests by @willmj in #403
feat: Perform dataset mixing via sampling probabilities in data config by @dushyantbehl in #408
feat: Expose additional data handlers as an argument in train by @dushyantbehl in #409
fix: Move deprecated positional arguments from SFTTrainer to SFTConfig by @Luka-D in #399
fix: update dataclass objects directly instead of creating new variables by @kmehant in #418
test: Add unit tests to test multiple files in single dataset by @Abhishek-TAMU in #412
feat: Add multi and single turn chat support by @dushyantbehl in #415
feat: Integrate MLflow tracker by @dushyantbehl in #425
feat: Handle passing of multiple files, multiple folders, path with patterns, HF Dataset and combination by @Abhishek-TAMU in #424
docs: Add documentation for data preprocessor release by @dushyantbehl in #423

New Contributors

@Luka-D made their first contribution in #399

Full Changelog: v2.2.0...v2.3.1

Contributors

dushyantbehl, kmehant, and 3 other contributors
