@karinassini
Contributor

Purpose

  • Introduce the entire Whisper fine-tuning workspace under [whisper_fine_tuning], covering data prep, training, evaluation, and Azure ML deployment assets.
  • Align project documentation and entrypoints around this new app structure so the team can run end-to-end Whisper fine-tuning from within language-creation.
  • Refresh Whisper fine-tuning docs so they reflect the new data layout under apps/whisper_fine_tuning/data/raw/audios/ and the silver-stage workflow.
  • Add a consolidated training README with quick-start commands, Azure ML guidance, and tuning tips aligned with the updated pipeline.

Does this introduce a breaking change?

[ ] Yes
[x] No

Pull Request Type

What kind of change does this Pull Request introduce?

[ ] Bugfix
[x] Feature
[ ] Code style update (formatting, local variables)
[ ] Refactoring (no functional changes, no api changes)
[x] Documentation content changes
[ ] Other... Please describe:

Other Information

Next Steps
  • Automate the raw-to-silver data ingestion flow (scheduling or pipeline scripting).
  • Wire the ingestion results into the existing training jobs so the full pipeline can run unattended.
  • Create tests.

Copilot AI left a comment

Pull Request Overview

This PR introduces a comprehensive Whisper fine-tuning application for speech-to-text model training on custom datasets. The implementation includes data preparation pipelines, LoRA-based fine-tuning for both Whisper (Transformers) and NeMo RNNT models, evaluation utilities, and Azure ML deployment infrastructure.

Key changes:

  • Complete data preparation pipeline converting raw audio files into the Hugging Face datasets format
  • Training implementations for Whisper models with LoRA/PEFT support and NeMo RNNT experiments
  • Evaluation scripts with WER/CER metrics
  • Azure ML infrastructure setup (Bicep templates, deployment configurations)
  • MLflow experiment tracking integration
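
For readers unfamiliar with the WER/CER metrics mentioned above, here is a minimal standard-library sketch of how they are usually defined: edit distance between reference and hypothesis, divided by the reference length. This is not the code in evaluation_process.py, which may rely on a metrics library; the function names `edit_distance`, `wer`, and `cer` are hypothetical.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            # Cheapest of: deletion, insertion, substitution (or match).
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Real evaluation pipelines typically normalize text (casing, punctuation) before scoring; that step is omitted here.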

Reviewed Changes

Copilot reviewed 48 out of 59 changed files in this pull request and generated 19 comments.

| File | Description |
| --- | --- |
| apps/whisper_fine_tuning/src/utils/utils.py | Helper function for displaying datasets in notebooks |
| apps/whisper_fine_tuning/src/core/train/train_nemo.py | NeMo RNNT trainer with LoRA support and manifest generation |
| apps/whisper_fine_tuning/src/core/train/train.py | Whisper transformer trainer with LoRA and MLflow logging |
| apps/whisper_fine_tuning/src/core/train/main_train_nemo.py | CLI entry point for NeMo RNNT training |
| apps/whisper_fine_tuning/src/core/train/main_train.py | CLI entry point for Whisper training |
| apps/whisper_fine_tuning/src/core/train/config.py | Configuration dataclasses for training and LoRA |
| apps/whisper_fine_tuning/src/core/evaluation/evaluation_process.py | Evaluation script with WER/CER metrics |
| apps/whisper_fine_tuning/src/core/data_prep/*.py | Data preparation utilities and pipelines |
| apps/whisper_fine_tuning/deployment/** | Azure ML job specs, model registration, endpoint deployment |
| apps/whisper_fine_tuning/infra/** | Bicep templates for Azure ML workspace setup |
| .pre-commit-config.yaml | Pre-commit hooks for code quality |
| .gitignore | Updated to exclude ML artifacts and local data |

Comment on lines +5 to +6
def display_table(dataset_or_sample):
    # A helper fuction to display a Transformer dataset or single sample contains multi-line string nicely

Copilot AI Nov 3, 2025

Corrected spelling of 'fuction' to 'function'.

Contributor Author

@copilot open a new pull request to apply changes based on this feedback

manifest_file = Path(manifest_path)
manifest_file.parent.mkdir(parents=True, exist_ok=True)

print("*********")

Copilot AI Nov 3, 2025

Replace debug print statement with proper logging. Use logger.debug() or logger.info() instead of print() to maintain consistent logging throughout the application.
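
As a rough illustration of this suggestion, the manifest snippet could log through a module-level logger instead of printing a marker line. The helper name `prepare_manifest_path` and the log message are assumptions made for this sketch, not code from the PR.

```python
import logging
from pathlib import Path

logger = logging.getLogger(__name__)


def prepare_manifest_path(manifest_path: str) -> Path:
    """Create the manifest's parent directory and log where it will be written."""
    manifest_file = Path(manifest_path)
    manifest_file.parent.mkdir(parents=True, exist_ok=True)
    # Replaces the bare print("*********") marker with a level-filtered message.
    logger.debug("Writing manifest to %s", manifest_file)
    return manifest_file
```

Passing the path as a `%s` argument rather than pre-formatting it means the string is only built when the DEBUG level is enabled.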

val_cfg.num_workers = self.data_config.num_workers
val_cfg.pin_memory = self.data_config.pin_memory

print("*****************")

Copilot AI Nov 3, 2025

Replace debug print statement with proper logging. Use logger.debug() or logger.info() instead of print() to maintain consistent logging throughout the application.

Contributor Author

@copilot open a new pull request to apply changes based on this feedback

Comment on lines +36 to +39
raise ValueError(
"Expected transcript column of either 'text', 'sentence', 'normalized_text' or 'transcript'. Got sample of "
".join{sample.keys()}. Ensure a text column name is present in the dataset."
)

Copilot AI Nov 3, 2025

The error message contains invalid string formatting syntax. .join{sample.keys()} should be ''.join(sample.keys()) or use an f-string: f"...Got sample of {', '.join(sample.keys())}..." to properly display the available keys.
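
A minimal sketch of the f-string variant, assuming the check lives in a small helper that receives one dataset sample as a dict with string keys. The names `TEXT_COLUMNS` and `get_text_column` are hypothetical; only the message text comes from the quoted snippet.

```python
TEXT_COLUMNS = ("text", "sentence", "normalized_text", "transcript")


def get_text_column(sample: dict) -> str:
    """Return the first recognised transcript column name found in a sample."""
    for name in TEXT_COLUMNS:
        if name in sample:
            return name
    # The f-string segment interpolates the keys that are actually present.
    raise ValueError(
        "Expected transcript column of either 'text', 'sentence', "
        "'normalized_text' or 'transcript'. "
        f"Got sample of {', '.join(sample.keys())}. "
        "Ensure a text column name is present in the dataset."
    )
```

Only the middle segment needs the `f` prefix; adjacent string literals are concatenated at compile time.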

Contributor Author

@copilot open a new pull request to apply changes based on this feedback

data_collator=data_collator,
)
except Exception as e:
logger.error(f"erro durante o treinamento: {e}", exc_info=True)

Copilot AI Nov 3, 2025

Error message is in Portuguese. Should be in English: 'error during training' instead of 'erro durante o treinamento'.

Contributor Author

@copilot open a new pull request to apply changes based on this feedback

)

if freeze_feature_encoder:
model.freeze_feature_encoder()

Copilot AI Nov 3, 2025

This statement is unreachable.

model.freeze_feature_encoder()

if freeze_encoder:
model.freeze_encoder()

Copilot AI Nov 3, 2025

This statement is unreachable.

)

if freeze_feature_encoder:
model.freeze_feature_encoder()

Copilot AI Nov 3, 2025

This statement is unreachable.

model.freeze_feature_encoder()

if freeze_encoder:
model.freeze_encoder()

Copilot AI Nov 3, 2025

This statement is unreachable.

op_file = op_file + "_" + args.hf_model.replace("/", "_")
else:
op_file = op_file + "_" + args.ckpt_dir.split("/")[-1].replace("/", "_")
result_file = open(op_file, "w")

Copilot AI Nov 3, 2025

File may not be closed if this operation raises an exception.
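
One common way to address this warning is to open the results file in a `with` block, so it is closed even if a write raises. The sketch below assumes the results are written as lines of text; the helper `write_results` and its parameters are hypothetical and not taken from the evaluation script.

```python
from typing import Iterable


def write_results(op_file: str, lines: Iterable[str]) -> None:
    """Write result lines to op_file, closing the file even on errors."""
    # The context manager closes result_file when the block exits,
    # including when an exception propagates out of it.
    with open(op_file, "w", encoding="utf-8") as result_file:
        for line in lines:
            result_file.write(line + "\n")
```

If the file handle has to stay open across several functions, a `try`/`finally` that calls `result_file.close()` gives the same guarantee.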

Copilot AI commented Nov 3, 2025

@karinassini I've opened a new pull request, #5, to work on those changes. Once the pull request is ready, I'll request review from you.

Copilot AI commented Nov 3, 2025

@karinassini I've opened a new pull request, #6, to work on those changes. Once the pull request is ready, I'll request review from you.

Copilot AI commented Nov 3, 2025

@karinassini I've opened a new pull request, #7, to work on those changes. Once the pull request is ready, I'll request review from you.

Copilot AI commented Nov 3, 2025

@karinassini I've opened a new pull request, #8, to work on those changes. Once the pull request is ready, I'll request review from you.

@karinassini karinassini merged commit f8c8db5 into main Nov 3, 2025
1 of 3 checks passed