A simple, efficient AI assistant that captures screenshots, listens to voice queries, and provides answers using Google's Gemini model.
- Press and hold a keyboard shortcut to activate the assistant
- Automatically captures the active window as a screenshot
- Records audio while the shortcut is held down
- Transcribes speech to text using Groq API
- Sends the question and screenshot to Google's Gemini model
- Saves conversations to markdown files for future reference
- Speaks responses using text-to-speech
- Python 3.8 or higher
- Required Python packages (see requirements.txt)
- Clone this repository:
  git clone https://github.com/yourusername/ai-assistant.git
  cd ai-assistant
- Install dependencies:
  pip install -r requirements.txt
  Or using uv:
  uv add -r requirements.txt
- Configure the application:
  - Copy config/config.yaml.example to config/config.yaml (or create it if it doesn't exist)
  - Edit config/config.yaml to add your Gemini API key and Groq API key
  - Configure other settings as needed
python app.py
Or using uv:
uv run app.py
- Press and hold the configured keyboard shortcut (default: ctrl+alt+a)
- While holding the shortcut, speak your question
- Release the shortcut when finished speaking
- The assistant will:
- Process your question
- Display the transcribed text
- Send the question and screenshot to Gemini
- Speak the answer using text-to-speech
- Save the interaction to the current session file
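The press-and-hold behaviour in the steps above can be sketched as a small state tracker. This is a minimal illustration, not the code in core/input_manager.py: the `HoldShortcut` class and `parse_combo` helper are hypothetical names, and keys are plain strings so the logic runs without a keyboard listener. In the real application the `press` and `release` calls would presumably be driven by pynput key events.

```python
from typing import Callable, Iterable, Set


def parse_combo(text: str) -> Set[str]:
    """Split a string such as 'ctrl+alt+a' into a set of lowercase key names."""
    return {part.strip().lower() for part in text.split("+") if part.strip()}


class HoldShortcut:
    """Hypothetical helper: fires once when a combo is fully held, once on release."""

    def __init__(self, combo: Iterable[str],
                 on_activate: Callable[[], None],
                 on_deactivate: Callable[[], None]) -> None:
        self.combo: Set[str] = {key.lower() for key in combo}
        self.pressed: Set[str] = set()
        self.active = False
        self.on_activate = on_activate
        self.on_deactivate = on_deactivate

    def press(self, key: str) -> None:
        self.pressed.add(key.lower())
        # Activate only once, even if the OS repeats key-down events while held.
        if not self.active and self.combo <= self.pressed:
            self.active = True
            self.on_activate()

    def release(self, key: str) -> None:
        key = key.lower()
        self.pressed.discard(key)
        # Letting go of any key in the combination ends the recording window.
        if self.active and key in self.combo:
            self.active = False
            self.on_deactivate()


events = []
shortcut = HoldShortcut(parse_combo("ctrl+alt+a"),
                        on_activate=lambda: events.append("start"),
                        on_deactivate=lambda: events.append("stop"))
for key in ("ctrl", "alt", "a", "a"):  # the second "a" simulates key repeat
    shortcut.press(key)
shortcut.release("a")
print(events)  # ['start', 'stop']
```

Tracking the set of held keys, rather than reacting to a single hotkey event, is what makes it possible to start recording on press and stop on release.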
To create a new conversation session:
python utils/new_session.py
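As a rough idea of what creating a session and saving an interaction to it might involve, the sketch below writes a timestamped markdown file and appends one question/answer pair. The file naming scheme, the markdown layout, and the function names `new_session_file` and `append_interaction` are assumptions for illustration; the actual format used by utils/new_session.py and core/session_manager.py may differ.

```python
import tempfile
from datetime import datetime
from pathlib import Path


def new_session_file(sessions_dir: Path, now: datetime) -> Path:
    """Create a session file named after the timestamp (assumed naming scheme)."""
    sessions_dir.mkdir(parents=True, exist_ok=True)
    path = sessions_dir / f"session_{now:%Y%m%d_%H%M%S}.md"
    path.write_text(f"# Session {now:%Y-%m-%d %H:%M:%S}\n", encoding="utf-8")
    return path


def append_interaction(path: Path, question: str, answer: str, now: datetime) -> None:
    """Append one question/answer pair under a time heading (assumed layout)."""
    with path.open("a", encoding="utf-8") as fh:
        fh.write(f"\n## {now:%H:%M:%S}\n\n")
        fh.write(f"**Question:** {question}\n\n")
        fh.write(f"**Answer:** {answer}\n")


# Demo in a temporary directory so the real sessions/ folder is untouched.
demo_dir = Path(tempfile.mkdtemp()) / "sessions"
stamp = datetime(2024, 1, 2, 3, 4, 5)
session = new_session_file(demo_dir, stamp)
append_interaction(session, "What is on screen?", "A text editor.", stamp)
print(session.name)  # session_20240102_030405.md
```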
Edit config/config.yaml to customize the following settings:
- Keyboard shortcut: Change the key combination to activate the assistant
- Speech settings: Configure STT (Groq API) and TTS options
- AI settings: Set the Gemini model, API key, and parameters
- Session settings: Control how conversation sessions are managed
- Screenshot settings: Configure screenshot format and quality
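One common way to handle settings like these is to keep built-in defaults and overlay whatever the user has set. The sketch below shows that merge under stated assumptions: every key name in `DEFAULTS` is hypothetical (only the ctrl+alt+a default comes from this README), and `merge_config` is not taken from config/config_manager.py. In practice the overrides would likely come from parsing config/config.yaml with pyyaml's `yaml.safe_load`; a plain dict stands in for that here.

```python
from copy import deepcopy
from typing import Any, Dict

# Hypothetical defaults; the real key names in config.yaml may differ.
DEFAULTS: Dict[str, Any] = {
    "shortcut": "ctrl+alt+a",
    "screenshot": {"format": "png", "quality": 90},
    "ai": {"model": "", "api_key": ""},
}


def merge_config(defaults: Dict[str, Any], overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Return defaults with overrides applied, merging nested sections key by key."""
    merged = deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Stand-in for the dict that parsing the user's YAML file would produce.
user_settings = {"shortcut": "ctrl+shift+space", "screenshot": {"format": "jpeg"}}
config = merge_config(DEFAULTS, user_settings)
print(config["screenshot"])  # {'format': 'jpeg', 'quality': 90}
```

Merging section by section means a user who overrides only the screenshot format still gets the default quality.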
ai_assistant/
├── config/
│ ├── config.yaml # Configuration file
│ └── config_manager.py # Configuration management
├── core/
│ ├── input_manager.py # Keyboard shortcut handling
│ ├── screenshot.py # Active window screenshots
│ ├── audio_manager.py # Audio recording, STT, TTS
│ ├── ai_connector.py # Gemini API interaction
│ └── session_manager.py # Markdown file management
├── utils/
│ └── new_session.py # Utility to create new sessions
├── prompts/
│ └── system_prompt.md # System prompt for Gemini
├── sessions/ # Where conversation files are stored
├── app.py # Main application entry point
└── README.md # This file
- InputManager: Handles keyboard shortcuts using pynput
- ScreenshotCapture: Captures screenshots of the active window
- AudioManager: Records audio, transcribes with Groq API, and provides TTS
- AIConnector: Processes queries with Google's Gemini API
- SessionManager: Manages conversation history in markdown files
- ConfigManager: Handles application configuration
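How these components might fit together for a single query can be sketched with plain callables standing in for each one. The `Assistant` class, its field names, and the stub lambdas are all hypothetical; the real classes talk to Groq, Gemini, and the audio hardware, and their method names are not documented here.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Assistant:
    """Hypothetical coordinator; each field stands in for one component."""
    capture: Callable[[], bytes]            # ScreenshotCapture role
    transcribe: Callable[[bytes], str]      # AudioManager speech-to-text role
    ask: Callable[[str, bytes], str]        # AIConnector role
    speak: Callable[[str], None]            # AudioManager text-to-speech role
    save: Callable[[str, str], None]        # SessionManager role

    def handle(self, audio: bytes) -> str:
        """Run one query: screenshot, transcribe, ask, speak, then save."""
        screenshot = self.capture()
        question = self.transcribe(audio)
        answer = self.ask(question, screenshot)
        self.speak(answer)
        self.save(question, answer)
        return answer


# Stubs replace the real services so the flow can be exercised offline.
saved: List[Tuple[str, str]] = []
spoken: List[str] = []
assistant = Assistant(
    capture=lambda: b"fake-image",
    transcribe=lambda audio: "what is this window?",
    ask=lambda question, image: f"stub answer to: {question}",
    speak=spoken.append,
    save=lambda question, answer: saved.append((question, answer)),
)
print(assistant.handle(b"fake-audio"))  # stub answer to: what is this window?
```

Passing the components in as callables keeps the coordinator easy to test, since any of them can be swapped for a stub.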
- pynput: Keyboard input handling
- mss or similar: Screen capture
- sounddevice/soundfile: Audio recording
- groq: Speech-to-text via Groq API
- google-generativeai: Gemini API integration
- TTS libraries: Text-to-speech playback
- pillow: Image processing
- pyyaml: Configuration file parsing
- Add support for multiple AI agents
- Implement tools and function calling
- Add a graphical user interface
- Improve conversation context handling
- Add support for custom TTS voices
- Enhance active window detection for screenshots
- Improve error handling and recovery
MIT