chore: merge set of changes for v2.3.0 #428

Merged
merged 12 commits into from
Dec 23, 2024

Conversation

Collaborator

@aluu317 aluu317 commented Dec 23, 2024

Description of the change

Related issue number

How to verify the PR

Was the PR tested

  • I have added >=1 unit test(s) for every new method I have added.
  • I have ensured all unit tests pass

Abhishek-TAMU and others added 11 commits December 7, 2024 13:45
Code to perform dataset sampling via sampling probabilities in data

Signed-off-by: Dushyant Behl <[email protected]>
* Expose additional data handlers as an argument to the train function.
Signed-off-by: Dushyant Behl <[email protected]>
#399)

* fix: set legacy behavior to false, enable new behavior

Signed-off-by: Will Johnson <[email protected]>

* fix: Resolve push_to_hub_token warning

Signed-off-by: Will Johnson <[email protected]>

* fix: Remove max_seq_length and dataset_text_field from SFTTrainer

Signed-off-by: Will Johnson <[email protected]>

* fmt

Signed-off-by: Will Johnson <[email protected]>

* fix: Resolve tokenizer.padding_side warning

Signed-off-by: Will Johnson <[email protected]>

* nit: restructure warning fixes

Signed-off-by: Will Johnson <[email protected]>

* fix: Add packing directly to SFTConfig

Signed-off-by: Will Johnson <[email protected]>

* fmt

Signed-off-by: Will Johnson <[email protected]>

* Removed dataset_kwargs from SFTTrainer

Removed the dataset_kwargs argument from the invocation of SFTTrainer() because it will be deprecated in V1.0.0. Instead, dataset_kwargs has been added as a key to the training_args variable.

Following the example provided by HF found here: https://huggingface.co/docs/trl/en/sft_trainer#training-the-vision-language-model

Signed-off-by: Luka Dojcinovic <[email protected]>

* fix: Added max_seq_length back to SFTConfig()

Signed-off-by: Luka Dojcinovic <[email protected]>

* Removed legacy and padding_side args

Removed these args as they were based on changes from @willmj that haven't been approved yet

Signed-off-by: Luka Dojcinovic <[email protected]>

* Moved all args to additional_args

Following @kmehant suggestion.

Signed-off-by: Luka Dojcinovic <[email protected]>

* Removed packing and max_seq_length

Removed packing and max_seq_length variables from additional_args

Signed-off-by: Luka Dojcinovic <[email protected]>

* Removed check is_pretokenized_dataset

Co-authored-by: Mehant Kammakomati <[email protected]>
Signed-off-by: Luka-D <[email protected]>

* Removed max_seq_length from additional_args

Signed-off-by: Luka Dojcinovic <[email protected]>

* Removed error.log

Signed-off-by: Luka Dojcinovic <[email protected]>

* fix: move packing to SFTConfig as well

Co-authored-by: Luka-D <[email protected]>
Signed-off-by: Mehant Kammakomati <[email protected]>

---------

Signed-off-by: Will Johnson <[email protected]>
Signed-off-by: Luka Dojcinovic <[email protected]>
Signed-off-by: Luka-D <[email protected]>
Signed-off-by: Mehant Kammakomati <[email protected]>
Co-authored-by: Will Johnson <[email protected]>
Co-authored-by: Mehant Kammakomati <[email protected]>
Co-authored-by: Mehant Kammakomati <[email protected]>
…ts (#412)

* test: Add unit tests to test multiple files in single/multiple datasets

Signed-off-by: Abhishek <[email protected]>

* e2e testing unit test for multiple datasets with multiple files

Signed-off-by: Abhishek <[email protected]>

* test: multiple datasets with multiple datafiles column names

Signed-off-by: Will Johnson <[email protected]>

* PR changes

Signed-off-by: Abhishek <[email protected]>

* PR Changes

Signed-off-by: Abhishek <[email protected]>

* fix: fmt

Signed-off-by: Abhishek <[email protected]>

* Merge test_process_dataconfig_multiple_files_varied_data_formats

Signed-off-by: Abhishek <[email protected]>

---------

Signed-off-by: Abhishek <[email protected]>
Signed-off-by: Will Johnson <[email protected]>
Co-authored-by: Will Johnson <[email protected]>
Also add mlflow docs and add mlflow to docker file and as optional requirement

Signed-off-by: Dushyant Behl <[email protected]>
…atterns, HF Dataset and combination (#424)

Signed-off-by: Abhishek <[email protected]>
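
The first commit in the list above mentions dataset sampling via sampling probabilities. As a rough illustration of that idea only, here is a minimal standard-library sketch; it is not the code from this PR. The function name `sample_by_probability`, its parameters, and the wrap-around behavior for small datasets are all assumptions made for the example.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def sample_by_probability(
    datasets: Sequence[Sequence[T]],
    probabilities: Sequence[float],
    num_samples: int,
    seed: int = 42,
) -> List[T]:
    """Hypothetical helper: build a mixed sample from several datasets.

    For each draw, a source dataset is chosen at random according to
    `probabilities`, and the next unread example from that dataset is taken.
    """
    if len(datasets) != len(probabilities):
        raise ValueError("need exactly one probability per dataset")
    if any(p < 0 for p in probabilities):
        raise ValueError("probabilities must be non-negative")
    if abs(sum(probabilities) - 1.0) > 1e-6:
        raise ValueError("probabilities must sum to 1")
    if any(len(ds) == 0 for ds in datasets):
        raise ValueError("datasets must be non-empty")

    rng = random.Random(seed)          # local RNG keeps runs reproducible
    cursors = [0] * len(datasets)      # next position to read in each dataset
    mixed: List[T] = []
    for _ in range(num_samples):
        # Pick which dataset this example comes from.
        idx = rng.choices(range(len(datasets)), weights=probabilities, k=1)[0]
        source = datasets[idx]
        # Assumption: wrap around when a dataset is exhausted (oversampling).
        mixed.append(source[cursors[idx] % len(source)])
        cursors[idx] += 1
    return mixed


if __name__ == "__main__":
    chat = ["chat-0", "chat-1", "chat-2"]
    code = ["code-0", "code-1"]
    # Roughly 70% of draws come from `chat`, 30% from `code`.
    print(sample_by_probability([chat, code], [0.7, 0.3], num_samples=10))
```

Using a seeded local `random.Random` rather than the global generator keeps the mix reproducible without affecting other code; whether the actual implementation stops at the first exhausted dataset or oversamples is not stated in this conversation.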

Thanks for making a pull request! 😃
One of the maintainers will review and advise on the next steps.

@aluu317 aluu317 changed the title from "release: merge set of changes for v2.3.0" to "chore: merge set of changes for v2.3.0" Dec 23, 2024
@github-actions github-actions bot added the chore label Dec 23, 2024
@Abhishek-TAMU
Collaborator

The commits look good to me. Once this one additional PR is added, this looks good to merge.

Signed-off-by: Dushyant Behl <[email protected]>
Signed-off-by: Will Johnson <[email protected]>
Signed-off-by: Abhishek <[email protected]>
Co-authored-by: Will Johnson <[email protected]>
Co-authored-by: Abhishek <[email protected]>
@aluu317 aluu317 merged commit 3ec30a0 into release Dec 23, 2024
14 of 15 checks passed
@aluu317 aluu317 deleted the new_release_2.3.0 branch December 23, 2024 16:27
6 participants