
Draft: HACK to preprocess data and pass to SFT Trainer #38

Draft · wants to merge 6 commits into main
Conversation

@Ssukriti Ssukriti commented Feb 9, 2024

To test this, use twitter_complaints_formatted.json.

In a local installation of the SFT Trainer (https://github.com/huggingface/trl/blob/main/trl/trainer/sft_trainer.py#L248), comment out lines 248 and 249 (the check "if dataset_text_field is None and formatting_func is None:") to allow an input/output dataset.
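
For reference, this is roughly what that guard looks like once commented out locally. Only the if condition is quoted in this PR; the body of the guard is elided here rather than guessed:

```python
# trl/trainer/sft_trainer.py, around L248 of the version linked above.
# Disabling this guard lets the trainer accept a pre-tokenized
# input/output dataset without a dataset_text_field or formatting_func.
#
# if dataset_text_field is None and formatting_func is None:
#     ...  # the error this guard raises (line 249) is commented out as well
```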

Command to run

python tuning/sft_trainer.py  \
--model_name_or_path $MODEL_PATH  \
--data_path $DATA_PATH  \
--output_dir $OUTPUT_PATH  \
--num_train_epochs 20  \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 4  \
--gradient_accumulation_steps 1  \
--evaluation_strategy "no"  \
--save_strategy "epoch"  \
--learning_rate 0.03  \
--weight_decay 0.  \
--warmup_ratio 0.03  \
--lr_scheduler_type "cosine"  \
--logging_steps 1  \
--include_tokens_per_second  \
--packing False  \
--use_flash_attn False  \
--tokenizer_name_or_path $MODEL_PATH \
--torch_dtype "float32" \
--peft_method "pt" \
--num_virtual_tokens 1500 \
--prompt_tuning_init_text "Classify if the tweet is a complaint or not:"

Signed-off-by: Sukriti-Sharma4 <[email protected]>

def infer_max_steps(
    num_epochs: int,
    batch_size: int,

Gradient accumulation needs to be taken into consideration for this.
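
For illustration, a minimal sketch of how gradient accumulation could be folded into the calculation; parameter names beyond num_epochs/batch_size are assumptions, not the PR's actual code:

```python
import math

def infer_max_steps(
    num_epochs: int,
    batch_size: int,
    num_examples: int,
    gradient_accumulation_steps: int = 1,
) -> int:
    # One optimizer step consumes batch_size * gradient_accumulation_steps
    # examples, so the number of steps per epoch shrinks accordingly.
    effective_batch_size = batch_size * gradient_accumulation_steps
    steps_per_epoch = math.ceil(num_examples / effective_batch_size)
    return num_epochs * steps_per_epoch
```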

if dataset_type == IterableDataset:
    return mapped_dataset
else:
    return HFDataset(mapped_dataset)

Since we are not using an iterable dataset, this can potentially blow up memory for larger datasets. I think the processing would also happen upfront.
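
A minimal sketch of one way to keep the mapping lazy with Hugging Face datasets, so large datasets are not tokenized and held in memory up front; the model path placeholder, field names, and tokenize_fn are illustrative assumptions:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("<MODEL_PATH>")  # placeholder path

raw_dataset = load_dataset(
    "json", data_files="twitter_complaints_formatted.json", split="train"
)
# Convert to an IterableDataset so map() is applied lazily during iteration
# instead of materializing the full tokenized dataset in memory.
iterable_dataset = raw_dataset.to_iterable_dataset()

def tokenize_fn(example):
    # "input"/"output" are assumed field names for the formatted file.
    return tokenizer(example["input"] + example["output"])

mapped_dataset = iterable_dataset.map(tokenize_fn)
```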

max_concat_length = max_seq_length

# Truncate based on max source or max target length before considering as a joined sequence
model_inputs = tokenizer(source)

Hmm, truncation is off by default in the tokenizer, so no truncation would actually happen here? (as opposed to what the comment above says)
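
If truncation is really intended here, it likely has to be requested explicitly. A small sketch reusing tokenizer/source from the snippet above, with max_source_length as an assumed name for the per-source cap:

```python
# Truncation is opt-in for Hugging Face tokenizers: without truncation=True
# and a max_length, the full source is tokenized regardless of its length.
model_inputs = tokenizer(
    source,
    truncation=True,
    max_length=max_source_length,  # assumed name for the per-source cap
)
```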

@gkumbhat

QQ for the testing parameters:

  1. Not sure which model you tried with, but larger models with float32 will probably get close to OOM, especially at train time.
  2. Do we really need 1500 virtual tokens for the twitter dataset? That seems like a lot and would have implications on quality (see the sketch after this list).
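
For point 2, a minimal sketch of what a leaner prompt-tuning setup could look like with PEFT's PromptTuningConfig; the virtual-token count and model_path placeholder are illustrative assumptions, not tuned recommendations:

```python
from peft import PromptTuningConfig, PromptTuningInit

model_path = "<MODEL_PATH>"  # same base model path as in the command above

# Every virtual token lengthens each input sequence by one position, so 1500
# of them adds memory cost and can affect quality; a much smaller soft prompt
# is typical for a simple classification task like twitter complaints.
peft_config = PromptTuningConfig(
    task_type="CAUSAL_LM",
    prompt_tuning_init=PromptTuningInit.TEXT,
    prompt_tuning_init_text="Classify if the tweet is a complaint or not:",
    num_virtual_tokens=8,  # illustrative value, not a tuned recommendation
    tokenizer_name_or_path=model_path,
)
```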
