Feature/ft -> First fine tuning version #4

karinassini · 2025-11-03T19:55:00Z

Purpose

Introduce the entire Whisper fine-tuning workspace under [whisper_fine_tuning], covering data prep, training, evaluation, and Azure ML deployment assets.
Align project documentation and entrypoints around this new app structure so the team can run end-to-end Whisper fine-tuning from within language-creation.
Refresh Whisper fine-tuning docs so they reflect the new data layout under apps/whisper_fine_tuning/data/raw/audios/ and the silver-stage workflow.
Add a consolidated training README with quick-start commands, Azure ML guidance, and tuning tips aligned with the updated pipeline.

Does this introduce a breaking change?

[ ] Yes
[x] No

Pull Request Type

What kind of change does this Pull Request introduce?

[ ] Bugfix
[x] Feature
[ ] Code style update (formatting, local variables)
[ ] Refactoring (no functional changes, no api changes)
[x] Documentation content changes
[ ] Other... Please describe:

Other Information

Next Steps
Automate the raw-to-silver data ingestion flow (scheduling or pipeline scripting).
Wire the ingestion results into the existing training jobs so the full pipeline can run unattended.
Create tests

Copilot

Pull Request Overview

This PR introduces a comprehensive Whisper fine-tuning application for speech-to-text model training on custom datasets. The implementation includes data preparation pipelines, LoRA-based fine-tuning for both Whisper (Transformers) and NeMo RNNT models, evaluation utilities, and Azure ML deployment infrastructure.

Key changes:

Complete data preparation pipeline converting raw audio files into HuggingFace datasets format
Training implementations for Whisper models with LoRA/PEFT support and NeMo RNNT experiments
Evaluation scripts with WER/CER metrics
Azure ML infrastructure setup (Bicep templates, deployment configurations)
MLflow experiment tracking integration
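For readers unfamiliar with the WER/CER metrics mentioned above, here is a minimal, dependency-free sketch of how they are commonly defined: edit distance between reference and hypothesis, divided by the reference length (in words for WER, in characters for CER). This is an illustration only and is not the code in `evaluation_process.py`; the function names `edit_distance`, `wer`, and `cer` are hypothetical, and no text normalization is applied.

```python
def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion of a reference token
                    curr[j - 1] + 1,     # insertion of a hypothesis token
                    prev[j - 1] + cost,  # substitution or match
                )
            )
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edit distance over reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)
```

In practice an evaluation script would usually normalize casing and punctuation before scoring, since both metrics are sensitive to such differences.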

Reviewed Changes

Copilot reviewed 48 out of 59 changed files in this pull request and generated 19 comments.

File	Description
`apps/whisper_fine_tuning/src/utils/utils.py`	Helper function for displaying datasets in notebooks
`apps/whisper_fine_tuning/src/core/train/train_nemo.py`	NeMo RNNT trainer with LoRA support and manifest generation
`apps/whisper_fine_tuning/src/core/train/train.py`	Whisper transformer trainer with LoRA and MLflow logging
`apps/whisper_fine_tuning/src/core/train/main_train_nemo.py`	CLI entry point for NeMo RNNT training
`apps/whisper_fine_tuning/src/core/train/main_train.py`	CLI entry point for Whisper training
`apps/whisper_fine_tuning/src/core/train/config.py`	Configuration dataclasses for training and LoRA
`apps/whisper_fine_tuning/src/core/evaluation/evaluation_process.py`	Evaluation script with WER/CER metrics
`apps/whisper_fine_tuning/src/core/data_prep/*.py`	Data preparation utilities and pipelines
`apps/whisper_fine_tuning/deployment/**`	Azure ML job specs, model registration, endpoint deployment
`apps/whisper_fine_tuning/infra/**`	Bicep templates for Azure ML workspace setup
`.pre-commit-config.yaml`	Pre-commit hooks for code quality
`.gitignore`	Updated to exclude ML artifacts and local data

Copilot · 2025-11-03T19:58:26Z

apps/whisper_fine_tuning/src/utils/utils.py

+def display_table(dataset_or_sample):
+    # A helper fuction to display a Transformer dataset or single sample contains multi-line string nicely


Corrected spelling of 'fuction' to 'function'.

@copilot open a new pull request to apply changes based on this feedback

Copilot · 2025-11-03T19:58:26Z

apps/whisper_fine_tuning/src/core/train/train_nemo.py

+    manifest_file = Path(manifest_path)
+    manifest_file.parent.mkdir(parents=True, exist_ok=True)
+
+    print("*********")


Replace debug print statement with proper logging. Use logger.debug() or logger.info() instead of print() to maintain consistent logging throughout the application.
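A minimal sketch of what this suggestion could look like, assuming a module-level logger obtained from the standard `logging` module. The helper name `prepare_manifest_path` and the log message are hypothetical and not taken from `train_nemo.py`.

```python
import logging
from pathlib import Path

# Module-level logger; the application decides handlers and levels elsewhere.
logger = logging.getLogger(__name__)


def prepare_manifest_path(manifest_path: str) -> Path:
    """Create the manifest's parent directory and log the resolved path."""
    manifest_file = Path(manifest_path)
    manifest_file.parent.mkdir(parents=True, exist_ok=True)
    # Instead of print("*********"), emit a leveled record that can be filtered.
    logger.debug("Writing manifest to %s", manifest_file)
    return manifest_file
```

Passing the path as a `%s` argument rather than pre-formatting the string means the message is only built when the debug level is actually enabled.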

Copilot · 2025-11-03T19:58:26Z

apps/whisper_fine_tuning/src/core/train/train_nemo.py

+            val_cfg.num_workers = self.data_config.num_workers
+            val_cfg.pin_memory = self.data_config.pin_memory
+
+        print("*****************")


Replace debug print statement with proper logging. Use logger.debug() or logger.info() instead of print() to maintain consistent logging throughout the application.

@copilot open a new pull request to apply changes based on this feedback

Copilot · 2025-11-03T19:58:26Z

apps/whisper_fine_tuning/src/core/evaluation/evaluation_process.py

+        raise ValueError(
+            "Expected transcript column of either 'text', 'sentence', 'normalized_text' or 'transcript'. Got sample of "
+            ".join{sample.keys()}. Ensure a text column name is present in the dataset."
+        )


The error message contains invalid string formatting syntax. .join{sample.keys()} should be ''.join(sample.keys()) or use an f-string: f"...Got sample of {', '.join(sample.keys())}..." to properly display the available keys.

@copilot open a new pull request to apply changes based on this feedback
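As a sketch of the f-string variant suggested in the comment above: the wrapper function `get_text_column` and the constant `EXPECTED_TEXT_COLUMNS` are hypothetical names used only to make the example runnable, and the surrounding lookup logic is an assumption rather than the actual code in `evaluation_process.py`.

```python
# Candidate transcript column names, taken from the original error message.
EXPECTED_TEXT_COLUMNS = ("text", "sentence", "normalized_text", "transcript")


def get_text_column(sample: dict) -> str:
    """Return the first recognised transcript column name present in a sample."""
    for name in EXPECTED_TEXT_COLUMNS:
        if name in sample:
            return name
    # The middle literal is an f-string, so the keys actually found are shown.
    raise ValueError(
        "Expected transcript column of either 'text', 'sentence', 'normalized_text' or 'transcript'. "
        f"Got sample of {', '.join(sample.keys())}. "
        "Ensure a text column name is present in the dataset."
    )
```

Joining with `', '` rather than an empty string keeps the individual column names readable in the message.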

Copilot · 2025-11-03T19:58:27Z

apps/whisper_fine_tuning/src/core/train/main_train.py

+            data_collator=data_collator,
+        )
+    except Exception as e:
+        logger.error(f"erro durante o treinamento: {e}", exc_info=True)


Error message is in Portuguese. Should be in English: 'error during training' instead of 'erro durante o treinamento'.

@copilot open a new pull request to apply changes based on this feedback

Copilot · 2025-11-03T19:58:30Z

apps/whisper_fine_tuning/src/core/data_prep/main_silver_data_prep.py

+        )
+
+    if freeze_feature_encoder:
+        model.freeze_feature_encoder()


This statement is unreachable.

Copilot · 2025-11-03T19:58:30Z

apps/whisper_fine_tuning/src/core/data_prep/main_silver_data_prep.py

+        model.freeze_feature_encoder()
+
+    if freeze_encoder:
+        model.freeze_encoder()


This statement is unreachable.

Copilot · 2025-11-03T19:58:30Z

apps/whisper_fine_tuning/src/core/train/main_train.py

+        )
+
+    if freeze_feature_encoder:
+        model.freeze_feature_encoder()


This statement is unreachable.

Copilot · 2025-11-03T19:58:30Z

apps/whisper_fine_tuning/src/core/train/main_train.py

+        model.freeze_feature_encoder()
+
+    if freeze_encoder:
+        model.freeze_encoder()


This statement is unreachable.

Copilot · 2025-11-03T19:58:30Z

apps/whisper_fine_tuning/src/core/evaluation/evaluation_process.py

+            op_file = op_file + "_" + args.hf_model.replace("/", "_")
+        else:
+            op_file = op_file + "_" + args.ckpt_dir.split("/")[-1].replace("/", "_")
+        result_file = open(op_file, "w")


File may not be closed if this operation raises an exception.
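One common way to address this warning is a `with` block, which closes the file on both normal exit and exceptions. The sketch below assumes the results are written line by line; the helper name `write_results` and its parameters are hypothetical and do not appear in `evaluation_process.py`.

```python
def write_results(op_file: str, lines: list[str]) -> None:
    """Write evaluation result lines, closing the file even if a write fails."""
    # The context manager guarantees result_file.close() is called on any exit path.
    with open(op_file, "w", encoding="utf-8") as result_file:
        for line in lines:
            result_file.write(line + "\n")
```

If the file handle must stay open across a larger block of evaluation code, wrapping that whole block in the same `with` statement achieves the same guarantee.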

Copilot · 2025-11-03T20:17:08Z

@karinassini I've opened a new pull request, #5, to work on those changes. Once the pull request is ready, I'll request review from you.

Copilot · 2025-11-03T20:17:37Z

@karinassini I've opened a new pull request, #6, to work on those changes. Once the pull request is ready, I'll request review from you.

Copilot · 2025-11-03T20:19:20Z

@karinassini I've opened a new pull request, #7, to work on those changes. Once the pull request is ready, I'll request review from you.

Copilot · 2025-11-03T20:22:53Z

@karinassini I've opened a new pull request, #8, to work on those changes. Once the pull request is ready, I'll request review from you.

karinassini added 3 commits November 3, 2025 13:43

first version - FT

df0b782

fix: fixing and testing training script

3d2d02a

chore: readme improvement

4a2bb72

karinassini requested review from Cataldir and Copilot November 3, 2025 19:55

black fixes

097f30a

Copilot AI reviewed Nov 3, 2025

Copilot AI mentioned this pull request Nov 3, 2025

Fix spelling typo in utils.py comment #5

Draft

Copilot AI mentioned this pull request Nov 3, 2025

Replace debug print with logger in train_nemo.py #6

Draft

Copilot AI mentioned this pull request Nov 3, 2025

Fix invalid string formatting syntax in evaluation error message #7

Draft

Copilot AI mentioned this pull request Nov 3, 2025

Fix Portuguese error message in training script #8

Draft

karinassini force-pushed the feature/ft branch from 079b511 to 097f30a November 3, 2025 20:30

karinassini merged commit f8c8db5 into main Nov 3, 2025
1 of 3 checks passed

2 participants