SnapIntel

Voice-to-Voice Assistant for Instant Insights from Screenshots

SnapIntel is a personal voice-to-voice assistant that provides immediate, actionable insights from the screenshots you decide to share. Whether you're solving an issue or looking for deeper understanding, SnapIntel is here to help.

This open-source project uses Google Gemini to analyze images and generate responses. User queries are transcribed, and responses are spoken, by a choice of cloud or local services, including the local FastWhisperAPI and FastXttsAPI.

If you find SnapIntel useful, please consider leaving a star ⭐ or donating.

Video Demo


Features

  • Easy and Intuitive Interface: Use voice-to-voice interactions for a seamless user experience.
  • Privacy-Focused Assistant: Maintain control over your data; you decide what to share with a simple key combination.
  • Instant Insights: Receive actionable information quickly from screenshots you choose to analyze.
  • Local Services Integration: Integrate with FastWhisperAPI and FastXttsAPI for local query transcription and response vocalization.
  • Chat History: Records images and interactions within the session, enabling follow-up questions on images and recalling previous queries or responses.
  • Real-Time Session Logging: Automatically logs session history in a neatly formatted markdown file, accessible in real-time from the local logs folder.
  • Flexibility and Expandability: Built to adapt and grow with future enhancements and integrations.
  • Transcription Services: Supports OpenAI, Groq, Deepgram, and FastWhisperAPI (Faster Whisper) for efficient transcription of user queries.
  • Speech Services: Supports OpenAI, ElevenLabs, Cartesia, Deepgram, and FastXttsAPI (Coqui) for quick and natural-sounding vocalization of responses.
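
As an illustration of the real-time session logging feature, here is a minimal sketch of appending one exchange to a markdown log. The entry layout and the names `format_entry` and `append_entry` are assumptions made for this example; SnapIntel's actual log format may differ.

```python
from datetime import datetime
from pathlib import Path


def format_entry(query, response, timestamp, screenshot=None):
    """Render one exchange as a markdown section (hypothetical layout)."""
    lines = [f"## {timestamp:%Y-%m-%d %H:%M:%S}", "", f"**User:** {query}", ""]
    if screenshot:
        # Reference the captured image so the log shows what was shared.
        lines += [f"![screenshot]({screenshot})", ""]
    lines += [f"**Assistant:** {response}", ""]
    return "\n".join(lines) + "\n"


def append_entry(log_path, entry):
    """Append an entry to the log file, creating the folder if needed."""
    log_path = Path(log_path)
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a", encoding="utf-8") as handle:
        handle.write(entry)


if __name__ == "__main__":
    entry = format_entry("What does this error mean?", "It is a timeout.", datetime.now())
    append_entry("logs/session.md", entry)  # hypothetical file name
```

Appending after every exchange, rather than writing once at exit, is what keeps the file readable while the session is still running.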

Requirements

  • Python 3.10 or greater
  • FFmpeg. Instructions on how to install it can be found here
  • FastWhisperAPI and FastXttsAPI offer local transcription and speech solutions. Their use is optional. For information on deployment and requirements of these services, please refer to their respective documentation.
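
The two hard requirements above can be checked before launching the app. The following is a minimal sketch, not part of SnapIntel: the function name `check_environment` is hypothetical, and it assumes FFmpeg is expected to be reachable as `ffmpeg` on your PATH.

```python
import shutil
import sys

MIN_PYTHON = (3, 10)  # minimum version stated in the requirements


def check_environment(version_info=sys.version_info, which=shutil.which):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if tuple(version_info[:2]) < MIN_PYTHON:
        problems.append(
            f"Python {MIN_PYTHON[0]}.{MIN_PYTHON[1]}+ required, "
            f"found {version_info[0]}.{version_info[1]}"
        )
    # shutil.which returns None when the executable is not found on PATH.
    if which("ffmpeg") is None:
        problems.append("ffmpeg executable not found on PATH")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print("WARNING:", problem)
```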

Dependencies

This project depends on the following libraries:

  • pillow
  • python-dotenv
  • keyboard
  • requests
  • colorama
  • SpeechRecognition
  • google-generativeai
  • websocket-client
  • pyaudio
  • numpy

Installation

  1. Clone the repository:

    git clone https://github.com/3choff/SnapIntel.git
  2. Navigate to the project directory:

    cd SnapIntel
  3. Create a new environment:

    python3 -m venv SnapIntel
  4. Activate the virtual environment:

    • On Unix/Linux/macOS:

      source SnapIntel/bin/activate
    • On Windows:

      SnapIntel\Scripts\activate
  5. Install the required packages:

    pip install -r requirements.txt

Configuration

API keys

SnapIntel uses dotenv to set the API keys. Create a .env file in the root directory with your API keys. Follow the structure of the example.env file as a template.
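
As a rough sketch of how keys in a .env file reach the program, the snippet below loads the file with python-dotenv and reports any key that is missing or empty. The key names `GOOGLE_API_KEY` and `OPENAI_API_KEY` and the helper `missing_keys` are placeholders chosen for this example; the names SnapIntel actually expects are the ones in example.env.

```python
import os

# Hypothetical key names for illustration; the real ones are in example.env.
REQUIRED_KEYS = ("GOOGLE_API_KEY", "OPENAI_API_KEY")


def missing_keys(env, required=REQUIRED_KEYS):
    """Return the required keys that are absent or blank in `env`."""
    return [key for key in required if not (env.get(key) or "").strip()]


if __name__ == "__main__":
    # load_dotenv() reads a .env file into os.environ; by default it does
    # not override variables that are already set in the environment.
    from dotenv import load_dotenv

    load_dotenv()
    for key in missing_keys(os.environ):
        print(f"Missing API key: {key}")
```

You only need keys for the services you actually select, so a real check would likely derive the required list from the chosen transcription and speech services.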

Transcription and Speech services

The app supports multiple transcription and speech services right out of the box. You can select from the following options:

Transcription Services:

  • Deepgram
  • OpenAI
  • Groq
  • FastWhisperAPI, a local transcription API server using Faster Whisper.

Speech Services:

  • Deepgram
  • OpenAI
  • ElevenLabs
  • Cartesia (EXPERIMENTAL)
  • FastXttsAPI, a local speech API server using Coqui.

To change the transcription or speech service, simply edit the relevant variables in the Config.py file located in the services folder. The accepted choices are commented next to each variable.

In the same file, you can change other related variables such as voices and language.
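
To make the idea concrete, here is a hedged sketch of what service selection with validation could look like. It does not reproduce the contents of Config.py: the class `ServiceConfig`, its field names, the default values, and the lowercase choice strings are all assumptions for illustration.

```python
from dataclasses import dataclass

# Lowercase identifiers are an assumption; check the comments in Config.py
# for the exact accepted values.
TRANSCRIPTION_CHOICES = ("deepgram", "openai", "groq", "fastwhisperapi")
SPEECH_CHOICES = ("deepgram", "openai", "elevenlabs", "cartesia", "fastxttsapi")


@dataclass(frozen=True)
class ServiceConfig:
    transcription: str = "groq"   # hypothetical default
    speech: str = "deepgram"      # hypothetical default
    language: str = "en"          # hypothetical default

    def __post_init__(self):
        # Fail early on a typo instead of at the first transcription call.
        if self.transcription not in TRANSCRIPTION_CHOICES:
            raise ValueError(f"Unknown transcription service: {self.transcription!r}")
        if self.speech not in SPEECH_CHOICES:
            raise ValueError(f"Unknown speech service: {self.speech!r}")


if __name__ == "__main__":
    print(ServiceConfig(transcription="fastwhisperapi", speech="fastxttsapi"))
```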

Usage

To run SnapIntel, use the following command:

python app.py

When the app starts, it will prompt you to either start a new session or resume a previous session stored in the history folder. After making your choice, you can interact with the LLM using these key combinations:

  • Press Ctrl+Alt+Space to capture and analyze the screen and invoke the voice assistant.
  • Press Ctrl+Space to ask a question without capturing a screenshot or to ask a follow-up question.
  • Press ESC to stop speech playback.
  • Press Ctrl+C to exit the script.
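
The key combinations above can be wired up with the `keyboard` library listed in the dependencies. The following is a minimal sketch rather than SnapIntel's own code: the action names and the helpers `make_handlers` and `run` are hypothetical, and the handlers only record which action fired instead of capturing the screen or recording audio.

```python
# Hotkey strings in the syntax used by the `keyboard` library.
HOTKEYS = {
    "ctrl+alt+space": "capture_and_ask",  # screenshot plus voice query
    "ctrl+space": "ask",                  # voice query or follow-up only
    "esc": "stop_playback",               # interrupt speech output
}


def make_handlers(log):
    """Build one callback per hotkey; each appends its action name to `log`."""
    def handler_for(action):
        # A factory function avoids the late-binding closure pitfall.
        return lambda: log.append(action)

    return {hotkey: handler_for(action) for hotkey, action in HOTKEYS.items()}


def run():
    import keyboard  # third-party; imported here so the helpers above stay testable

    events = []
    for hotkey, handler in make_handlers(events).items():
        keyboard.add_hotkey(hotkey, handler)
    try:
        keyboard.wait()  # block until interrupted
    except KeyboardInterrupt:
        print("Exiting. Actions seen:", events)


if __name__ == "__main__":
    run()
```

In a real application each handler would start the corresponding capture, recording, or playback-stop routine instead of appending to a list.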

Support

If you find this project helpful and would like to support its development, there are several ways you can contribute:

  • Star: Consider leaving a star ⭐️ to increase the visibility of the project.
  • Support: Consider donating to support my work.
  • Contribute: If you're a developer, feel free to contribute to the project by submitting pull requests or opening issues.
  • Spread the Word: Share this project with others who might find it useful.

Your support means a lot and helps keep this project going. Thank you for your contribution!

Acknowledgements

This project is inspired by the combined voice and vision capabilities that OpenAI showcased in its demo of upcoming ChatGPT features. The Verbi chatbot project and the Screen to Voice Tutorial by All About AI significantly influenced this project and form the foundation for its development. I recommend checking the links if you want to know more.

License

This project is licensed under the Apache License 2.0.
