
new-contrib: Audio Whisper API with Local Device Microphones #1271

Open · wants to merge 14 commits into main

Conversation


@CarlKho-Minerva CarlKho-Minerva commented Jul 6, 2024

Summary

This PR adds a new notebook that demonstrates how to use the Whisper API to transcribe text from your device's microphone. The notebook includes steps to record audio, transcribe it using the Whisper API, and copy the transcription to the clipboard. It aims to provide a practical guide for users who want to integrate speech-to-text functionality into their applications.
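The record, transcribe, and copy steps described above can be sketched as a small pipeline. This is a hedged illustration, not the notebook's actual code: the function names are invented, and the microphone, Whisper call, and clipboard steps are injected as callables so the flow is visible without hardware or an API key (in the real notebook these would wrap something like PyAudio, `client.audio.transcriptions.create`, and `pyperclip`).

```python
from typing import Callable

def speech_to_clipboard(
    record: Callable[[], bytes],          # capture raw audio from the microphone
    transcribe: Callable[[bytes], str],   # e.g. a thin wrapper around the Whisper API
    copy_to_clipboard: Callable[[str], None],
) -> str:
    """Record audio, transcribe it, and place the text on the clipboard."""
    audio = record()
    text = transcribe(audio)
    copy_to_clipboard(text)
    return text

# Stub demo (no microphone or API key required):
clipboard = []
result = speech_to_clipboard(
    record=lambda: b"\x00\x01",              # fake audio bytes
    transcribe=lambda audio: "hello world",  # fake Whisper response
    copy_to_clipboard=clipboard.append,
)
```

Injecting the three stages also makes each one independently swappable and testable, which mirrors the modular structure the notebook aims for.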

*This pull request description was written by ChatGPT and reviewed by a human. The article itself, however, was written by a human.

Motivation

This tutorial was created because the functionality to transcribe speech to text from a microphone is not well-documented. I found the mic speech-to-text option in the ChatGPT apps (not websites) extremely helpful for day-to-day operations and wanted to save others from having to learn about different audio processing modules.

For new content

When contributing new content, read through our contribution guidelines, and mark the following action items as completed:

  • I have added a new entry in registry.yaml (and, optionally, in authors.yaml) so that my content renders on the cookbook website.
  • I have conducted a self-review of my content based on the contribution guidelines (my previous PR message elaborated on every one of these 😅):
    • Relevance: This content is related to building with OpenAI technologies and is useful to others.
    • Uniqueness: I have searched for related examples in the OpenAI Cookbook and verified that my content offers new insights or unique information compared to existing documentation.
    • Spelling and Grammar: I have checked for spelling or grammatical mistakes.
    • Clarity: I have done a final read-through and verified that my submission is well-organized and easy to understand.
    • Correctness: The information I include is correct, and all of my code executes successfully.
    • Completeness: I have explained everything fully, including all necessary references and citations.

@CarlKho-Minerva (Author)

My previous PR message, from before I updated. It's mostly justification for each criterion.

Introduction:

This contribution introduces a practical guide on using the Whisper API to transcribe audio from a device's microphone. The notebook includes steps to record audio, transcribe it using the Whisper API, and copy the transcription to the clipboard, providing an accessible and useful resource for AI builders.

Justification:

1. Relevance:
This guide is relevant as it utilizes OpenAI's Whisper API, allowing users to transcribe audio directly from their devices. This functionality aligns with OpenAI's mission to provide practical applications of AI technologies.

2. Usefulness:
The contribution is highly useful for developers who need a reliable method to convert speech to text. It can be used in various applications, such as real-time transcription services, voice command systems, and accessibility tools for individuals with hearing impairments. I found the mic speech-to-text option in the ChatGPT apps (not websites) very helpful for day-to-day operations and wanted to extend this functionality.

3. Uniqueness:
While there are existing examples of using the Whisper API, this notebook uniquely combines multiple functionalities—recording audio, transcribing it, and copying the transcription to the clipboard—into one cohesive guide. This integration simplifies the process for users and provides a complete solution in a single resource. Given that this functionality isn't extensively documented yet, I believe this tutorial can fill an important gap.

4. Clarity:
The notebook is written in clear, easy-to-understand language, with step-by-step instructions and code snippets. It includes detailed comments and explanations, making it accessible even to beginners.

5. Correctness:
The code has been tested and verified for accuracy. It includes all necessary imports and setup instructions, ensuring that users can replicate the process without errors.

6. Conciseness:
The guide is concise yet thorough, covering all essential steps without unnecessary information. It is structured to provide maximum value in a compact format.

7. Completeness:
The contribution is complete, covering everything from setting up microphone permissions to troubleshooting common issues. It provides all necessary context and resources, ensuring users have a comprehensive understanding of the process.

8. Grammar:
The notebook is free from grammatical and spelling errors, ensuring professional quality and readability.

@CarlKho-Minerva (Author)

Gently bumping this up. I'm willing to revise, and I've learned a lot about handling feedback since submitting my hackathon entry for the Gemini API ^_^

@ibigio (Collaborator) left a comment:

| Criteria | Description | Score |
| --- | --- | --- |
| Relevance | Is the content related to building with OpenAI technologies? Is it useful to others? | 4 |
| Uniqueness | Does the content offer new insights or unique information compared to existing documentation? | 4 |
| Clarity | Is the language easy to understand? Are things well-explained? Is the title clear? | 4 |
| Correctness | Are the facts, code snippets, and examples correct and reliable? Does everything execute correctly? | 2 |
| Conciseness | Is the content concise? Are all details necessary? Can it be made shorter? | 4 |
| Completeness | Is the content thorough and detailed? Are there things that weren't explained fully? | 4 |
| Grammar | Are there grammatical or spelling errors present? | 4 |

Really solid contribution, thank you! Motivation is clear, steps are broken down well, and the sections make sense. Caught a few mistakes here and there (mostly to do with using the SDK the old way), but once you correct them you're all set to merge!

examples/Whisper_transcribe_device_microphone.ipynb (8 resolved review threads, outdated)
@CarlKho-Minerva (Author)

CarlKho-Minerva commented Aug 24, 2024

Changelog

Hi @ibigio. Heavily revised my article now that I'm a month wiser. :)

Updated Image:

  • Added whisper_onChatGPTApp_cvk.gif to the images/ directory.

Content Structure:

  • Removed duplicate table of contents.
  • Replaced bolded numbered lists with H3 headers for cleaner formatting.

Code Improvements:

  • Added translation functionality alongside transcription.
  • Refactored helper logic into separate functions for better modularity.
  • Implemented .env for secure API key management.
  • Specified data types for function parameters.
  • Updated main function to include is_english parameter for language selection.
  • Added timed recording option with timed_recording and record_seconds parameters.
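
As a sketch of what the timed recording option might look like (the names, defaults, and frame arithmetic below are assumptions for illustration, not the notebook's actual code), a timed recording reads a fixed number of fixed-size chunks:

```python
def num_chunks(sample_rate: int, chunk_size: int, record_seconds: float) -> int:
    """Number of fixed-size buffers needed to cover the requested duration."""
    return int(sample_rate / chunk_size * record_seconds)

def record_timed(read_chunk, sample_rate: int = 44_100, chunk_size: int = 1024,
                 record_seconds: float = 5.0) -> bytes:
    """Timed recording loop; read_chunk is injected (e.g. a PyAudio stream.read)
    so this sketch runs without a real audio stream."""
    frames = [read_chunk(chunk_size)
              for _ in range(num_chunks(sample_rate, chunk_size, record_seconds))]
    return b"".join(frames)
```

An untimed mode would instead loop until a stop signal (e.g. a keypress), which is exactly the case a notebook can struggle with if the loop never terminates.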

OpenAI API Updates:

  • Updated OpenAI library usage.
  • Implemented client.audio.translations.create for non-English audio.

Documentation:

  • Updated docstrings to reflect new functionality.
  • Added additional demos for transcription and translation.
  • Updated troubleshooting section and FAQ to cover new features.

Terminology:

  • Updated text to reflect both "transcribe" and "translate" where appropriate.

Aesthetic Improvements:

  • Enhanced overall formatting for better readability in VS Code.
  • Created a more engaging recording when showcasing ChatGPT's Whisper Button interface.

@CarlKho-Minerva CarlKho-Minerva requested a review from ibigio August 24, 2024 10:54
@QWolfp3 QWolfp3 mentioned this pull request Aug 25, 2024
@CarlKho-Minerva (Author)

Hope everything is well, @ibigio. Is there anything else you'd want me to modify for this PR? Also, hope you saw my SWE internship application too. 🤭

@CarlKho-Minerva (Author)

@gabor-openai @ericning-o @danielin-openai @ray-openai hello folks!

Perhaps @ibigio is busy amidst all the good work he's doing for OAI.

Gently tagging you so you can give the green light to publish this, should it be satisfactory.

Thank you very much for your hard work!

@ibigio (Collaborator) left a comment:

The code now seems to run mostly free of errors; I left a couple of comments around code clarity and correctness.

Collaborator review comment:

Unless the reader speaks Filipino, they can't test this part out – how about translating from a more common second language, like Spanish?

Also, an indefinite recording makes many notebooks crash – set a 5–10 second limit as well.

# Demo: Transcribe lengthy Filipino speech and translate into English with proper grammar and punctuation
result = transcribe_audio(
    debug=False,
    prompt="Filipino spoken. Proper grammar and punctuation. Skip fillers.",
    timed_recording=False,
    record_seconds=0,
    is_english=False,
)

print("\nTranscription/Translation:", result)

Collaborator review comment:

Combining transcribing and translating in this one function is a bit odd, and it also drops the prompt param for translations. (The prompt should be in English for translation, and in the language of choice for transcription.) I'd split this out into two clear helper functions, translate and transcribe.

def process_audio(file_name, is_english=True, prompt=""):
    with open(file_name, "rb") as audio_file:
        if is_english:
            response = client.audio.transcriptions.create(
                model="whisper-1", file=audio_file, prompt=prompt
            )
        else:
            response = client.audio.translations.create(
                model="whisper-1", file=audio_file
            )

        return response.text.strip()
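
Following that suggestion, the split might look roughly like this. This is a sketch, not the final notebook code; the helper names are invented, and it assumes an openai-python v1 client (an `openai.OpenAI()` instance) is passed in by the caller:

```python
def transcribe_audio_file(client, file_name: str, prompt: str = "") -> str:
    """Same-language transcription; the prompt should be in the spoken language."""
    with open(file_name, "rb") as audio_file:
        response = client.audio.transcriptions.create(
            model="whisper-1", file=audio_file, prompt=prompt
        )
    return response.text.strip()

def translate_audio_file(client, file_name: str, prompt: str = "") -> str:
    """Translation into English; the prompt (if any) should be in English."""
    with open(file_name, "rb") as audio_file:
        response = client.audio.translations.create(
            model="whisper-1", file=audio_file, prompt=prompt
        )
    return response.text.strip()
```

Keeping the two endpoints in separate helpers also makes the prompt rule easy to enforce: English prompts for translation, same-language prompts for transcription.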

Collaborator review comment:

I don't think this is how we intend for the prompt parameter to be used – looking at our docs, it is more of an example (or examples) than an instruction.
(Screenshot of the prompt parameter docs, dated Nov 26 2024, omitted.)
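
To make that concrete (a hedged illustration; these strings are invented examples, not taken from the notebook), the documented use of prompt is to supply sample text in the desired style, or containing tricky spellings, rather than to give commands:

```python
# Instruction-style prompt: Whisper is not documented to follow commands,
# so directives like this are unlikely to behave as intended.
instruction_style = "Proper grammar and punctuation. Skip fillers."

# Example-style prompt: a snippet written the way you want the output to look,
# including spellings/terms you want matched (closer to the documented intent).
example_style = "Hello, and welcome to this talk about OpenAI, GPT-4, and DALL·E."
```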

@CarlKho-Minerva (Author)

Thanks @ibigio! Will get these fixed within an hour.
