Feature/ft -> First fine tuning version #4
Conversation
Pull Request Overview
This PR introduces a comprehensive Whisper fine-tuning application for speech-to-text model training on custom datasets. The implementation includes data preparation pipelines, LoRA-based fine-tuning for both Whisper (Transformers) and NeMo RNNT models, evaluation utilities, and Azure ML deployment infrastructure.
Key changes:
- Complete data preparation pipeline converting raw audio files into HuggingFace datasets format
- Training implementations for Whisper models with LoRA/PEFT support and NeMo RNNT experiments (a minimal LoRA configuration sketch follows this list)
- Evaluation scripts with WER/CER metrics
- Azure ML infrastructure setup (Bicep templates, deployment configurations)
- MLflow experiment tracking integration
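As a rough illustration of the LoRA/PEFT approach listed above, wrapping a Whisper checkpoint in a PEFT adapter typically looks like the sketch below. The checkpoint name, rank, alpha, and target modules are assumptions, not values taken from this PR.

```python
# Hypothetical LoRA setup for Whisper with PEFT; the hyperparameters and the
# checkpoint name are placeholders, not the PR's actual configuration.
from transformers import WhisperForConditionalGeneration
from peft import LoraConfig, get_peft_model

base_model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

lora_config = LoraConfig(
    r=8,                                  # adapter rank
    lora_alpha=32,                        # scaling factor
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only the adapter weights remain trainable
```

Only the low-rank adapter weights are trained, which keeps GPU memory use and checkpoint sizes small compared with full fine-tuning.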
Reviewed Changes
Copilot reviewed 48 out of 59 changed files in this pull request and generated 19 comments.
Summary per file:

| File | Description |
|---|---|
| apps/whisper_fine_tuning/src/utils/utils.py | Helper function for displaying datasets in notebooks |
| apps/whisper_fine_tuning/src/core/train/train_nemo.py | NeMo RNNT trainer with LoRA support and manifest generation |
| apps/whisper_fine_tuning/src/core/train/train.py | Whisper transformer trainer with LoRA and MLflow logging |
| apps/whisper_fine_tuning/src/core/train/main_train_nemo.py | CLI entry point for NeMo RNNT training |
| apps/whisper_fine_tuning/src/core/train/main_train.py | CLI entry point for Whisper training |
| apps/whisper_fine_tuning/src/core/train/config.py | Configuration dataclasses for training and LoRA |
| apps/whisper_fine_tuning/src/core/evaluation/evaluation_process.py | Evaluation script with WER/CER metrics |
| apps/whisper_fine_tuning/src/core/data_prep/*.py | Data preparation utilities and pipelines |
| apps/whisper_fine_tuning/deployment/** | Azure ML job specs, model registration, endpoint deployment |
| apps/whisper_fine_tuning/infra/** | Bicep templates for Azure ML workspace setup |
| .pre-commit-config.yaml | Pre-commit hooks for code quality |
| .gitignore | Updated to exclude ML artifacts and local data |
def display_table(dataset_or_sample):
    # A helper fuction to display a Transformer dataset or single sample contains multi-line string nicely
Copilot AI (Nov 3, 2025):
Corrected spelling of 'fuction' to 'function'.
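For context, a helper with this signature is often implemented along the following lines; this is a sketch only, and the pandas/HTML rendering details are assumptions rather than the PR's actual code.

```python
# Hypothetical sketch of the display helper; the real implementation in
# apps/whisper_fine_tuning/src/utils/utils.py may differ.
import pandas as pd
from datasets import Dataset
from IPython.display import HTML, display


def display_table(dataset_or_sample):
    # A helper function to display a Transformers dataset or a single sample
    # containing multi-line strings nicely in a notebook.
    pd.set_option("display.max_colwidth", None)

    if isinstance(dataset_or_sample, Dataset):
        df = dataset_or_sample.to_pandas()
    else:
        df = pd.DataFrame([dataset_or_sample])

    # Turn embedded newlines into HTML line breaks so long transcripts stay readable.
    df = df.applymap(lambda v: v.replace("\n", "<br>") if isinstance(v, str) else v)
    display(HTML(df.to_html(escape=False)))
```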
@copilot open a new pull request to apply changes based on this feedback
manifest_file = Path(manifest_path)
manifest_file.parent.mkdir(parents=True, exist_ok=True)

print("*********")
Copilot AI (Nov 3, 2025):
Replace debug print statement with proper logging. Use logger.debug() or logger.info() instead of print() to maintain consistent logging throughout the application.
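A minimal version of the suggested change, with a placeholder manifest path:

```python
# Sketch of the suggested fix: a module-level logger instead of a bare print().
# The manifest path and log message are illustrative, not the PR's actual values.
import logging
from pathlib import Path

logger = logging.getLogger(__name__)

manifest_path = "data/manifests/train_manifest.json"  # placeholder path
manifest_file = Path(manifest_path)
manifest_file.parent.mkdir(parents=True, exist_ok=True)

logger.debug("Manifest directory prepared at %s", manifest_file.parent)
```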
val_cfg.num_workers = self.data_config.num_workers
val_cfg.pin_memory = self.data_config.pin_memory

print("*****************")
Copilot AI (Nov 3, 2025):
Replace debug print statement with proper logging. Use logger.debug() or logger.info() instead of print() to maintain consistent logging throughout the application.
@copilot open a new pull request to apply changes based on this feedback
raise ValueError(
    "Expected transcript column of either 'text', 'sentence', 'normalized_text' or 'transcript'. Got sample of "
    ".join{sample.keys()}. Ensure a text column name is present in the dataset."
)
Copilot AI (Nov 3, 2025):
The error message contains invalid string formatting syntax: `.join{sample.keys()}` should be `''.join(sample.keys())`, or an f-string should be used, e.g. `f"...Got sample of {', '.join(sample.keys())}..."`, to properly display the available keys.
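A corrected version along the lines Copilot suggests might look like this; the surrounding validation is sketched in, and `sample` stands in for the dataset row being checked.

```python
# Sketch of the corrected error message using an f-string; the sample below is
# an illustrative row without a transcript column, not data from the PR.
sample = {"audio": "clip_0001.wav", "speaker_id": "spk_01"}

expected = {"text", "sentence", "normalized_text", "transcript"}
if not expected & sample.keys():
    raise ValueError(
        "Expected transcript column of either 'text', 'sentence', "
        "'normalized_text' or 'transcript'. Got sample of "
        f"{', '.join(sample.keys())}. Ensure a text column name is present in the dataset."
    )
```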
@copilot open a new pull request to apply changes based on this feedback
    data_collator=data_collator,
)
except Exception as e:
    logger.error(f"erro durante o treinamento: {e}", exc_info=True)
Copilot AI (Nov 3, 2025):
Error message is in Portuguese. Should be in English: 'error during training' instead of 'erro durante o treinamento'.
@copilot open a new pull request to apply changes based on this feedback
)

if freeze_feature_encoder:
    model.freeze_feature_encoder()
Copilot AI (Nov 3, 2025):
This statement is unreachable.
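For context, this usually happens when the freeze checks sit after an early return. The sketch below illustrates the typical cause and fix; the function shape is an assumption, not the PR's actual code.

```python
# Illustration of how the freeze checks can become unreachable and how to fix it.
# The function signature is hypothetical.

def load_model_unreachable(model, freeze_feature_encoder: bool, freeze_encoder: bool):
    return model  # returning here makes everything below dead code

    if freeze_feature_encoder:  # unreachable
        model.freeze_feature_encoder()
    if freeze_encoder:          # unreachable
        model.freeze_encoder()


def load_model_fixed(model, freeze_feature_encoder: bool, freeze_encoder: bool):
    # Apply the freeze options before returning so the checks actually run.
    if freeze_feature_encoder:
        model.freeze_feature_encoder()
    if freeze_encoder:
        model.freeze_encoder()
    return model
```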
    model.freeze_feature_encoder()

if freeze_encoder:
    model.freeze_encoder()
Copilot AI (Nov 3, 2025):
This statement is unreachable.
)

if freeze_feature_encoder:
    model.freeze_feature_encoder()
Copilot AI (Nov 3, 2025):
This statement is unreachable.
    model.freeze_feature_encoder()

if freeze_encoder:
    model.freeze_encoder()
Copilot AI (Nov 3, 2025):
This statement is unreachable.
    op_file = op_file + "_" + args.hf_model.replace("/", "_")
else:
    op_file = op_file + "_" + args.ckpt_dir.split("/")[-1].replace("/", "_")
result_file = open(op_file, "w")
Copilot AI (Nov 3, 2025):
File may not be closed if this operation raises an exception.
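A context manager avoids the leak flagged here. The sketch below mirrors the names in the quoted snippet, with placeholder values for the CLI arguments; the `if` condition and the written payload are assumptions.

```python
# Sketch of the suggested fix: open the results file with a context manager so
# it is always closed, even if writing raises. The argument values below are
# placeholders, not the PR's actual configuration.
from types import SimpleNamespace

args = SimpleNamespace(hf_model="openai/whisper-small", ckpt_dir="")  # placeholder CLI args
op_file = "eval_results"

if args.hf_model:
    op_file = op_file + "_" + args.hf_model.replace("/", "_")
else:
    op_file = op_file + "_" + args.ckpt_dir.split("/")[-1].replace("/", "_")

with open(op_file, "w") as result_file:
    result_file.write("WER/CER results go here\n")  # placeholder payload
```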
@karinassini I've opened a new pull request, #5, to work on those changes. Once the pull request is ready, I'll request review from you.

@karinassini I've opened a new pull request, #6, to work on those changes. Once the pull request is ready, I'll request review from you.

@karinassini I've opened a new pull request, #7, to work on those changes. Once the pull request is ready, I'll request review from you.

@karinassini I've opened a new pull request, #8, to work on those changes. Once the pull request is ready, I'll request review from you.
079b511 to 097f30a (compare)
Purpose
Add a consolidated training README with quick-start commands, Azure ML guidance, and tuning tips aligned with the updated pipeline.
Does this introduce a breaking change?
Pull Request Type
What kind of change does this Pull Request introduce?
Other Information
Next Steps
- Automate the raw-to-silver data ingestion flow (scheduling or pipeline scripting).
- Wire the ingestion results into the existing training jobs so the full pipeline can run unattended.
- Create tests.