🚀 WhatsThat (Vision-to-Audio Assistant)

A software assistant module that helps visually impaired users understand their surroundings by converting camera input into audio descriptions.


📌 Problem Statement 1

Weave AI magic with Groq


🎯 Objective

To build real-time assistive technology that guides blind people in their day-to-day travel.


🧠 Team & Approach

Team Name:

Quantumania

Team Members:

Your Approach:

  • The application captures video from the user's device camera, sends frames to a backend server for object detection with the YOLOv8s ML model (default), enhances the detections into descriptions with an LLM (llama3-70b-8192), and returns those descriptions, which are converted to audio for the user.

It can be integrated into various web apps or IoT devices so that it tells the user what is in front of them, whether a threat or a general obstacle. (We currently present the web app for demonstration purposes, with basic utility.)
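The per-frame flow above can be sketched in a few lines. The helper below is a hypothetical illustration of the step between the detector and the LLM (the function name and wording are ours, not from the repository): it collapses the raw labels from one frame into the short scene summary that would be handed to the LLM.

```python
from collections import Counter

def summarize_detections(labels):
    """Collapse the labels detected in one frame into a short
    scene summary suitable as input to the LLM step."""
    if not labels:
        return "No objects detected."
    counts = Counter(labels)  # preserves first-seen order
    parts = [f"{n} {name}{'s' if n > 1 else ''}" for name, n in counts.items()]
    return "Detected: " + ", ".join(parts)

print(summarize_detections(["person", "person", "car"]))
# -> Detected: 2 persons, 1 car
```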


System Architecture


---

Architectural Diagram


πŸ› οΈ Tech Stack

Core Technologies Used:

  • Frontend: React (for the visual demo at present; a frontend is not mandatory)
  • Backend: FastAPI
  • Object detection ML model: YOLOv8s
  • APIs: Groq (for LLM integration)
  • Hosting: Netlify (for the frontend)

Sponsor Technologies Used (if any):

  • ✅ Groq: We used Groq to tailor responses for the user as quickly as possible from the set of objects detected in the video scene
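A minimal sketch of how the Groq call can look with the official Python SDK. The model name (llama3-70b-8192) is the one named above; the prompt wording and function names are illustrative assumptions, not the repository's actual code.

```python
import os

def build_messages(scene):
    """Chat payload asking the LLM to turn raw detections into
    short spoken-style guidance (prompt wording is illustrative)."""
    return [
        {"role": "system",
         "content": "You guide a visually impaired user. Be brief and concrete."},
        {"role": "user",
         "content": f"Objects in view: {scene}. Describe the scene in one sentence."},
    ]

def describe_scene(scene):
    """Call llama3-70b-8192 via Groq's Python SDK (pip install groq)."""
    from groq import Groq
    client = Groq(api_key=os.environ["GROQ_API_KEY"])
    resp = client.chat.completions.create(
        model="llama3-70b-8192",
        messages=build_messages(scene),
    )
    return resp.choices[0].message.content
```

Groq's API is OpenAI-compatible, which is what keeps the chat-completions call this small.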

✨ Key Features

  • ✅ Modular architecture
  • ✅ Scene recognition of multiple objects, tracked over time across video frames
  • ✅ User-friendly
  • ✅ Responsive

Output Screenshots:

(Six screenshots of the running application; see the repository for the images.)

---

πŸ“½οΈ Demo & Deliverables


FAQs ❔

Q: How will a blind person use this?

Currently, for demonstration, we've added the frontend web interface, but the functionality can be integrated separately into custom hardware projects so that the application runs automatically or is switched on by voice command for real-life usage.

Q: How is Groq's API used in the application?

We use an open-source LLM (Llama) through Groq's API to quickly tailor custom responses that help the user navigate with respect to the objects in the field of view.

Q: Does it provide responses only in English?

With the open-source models currently available, responses are reliable only in English; we aim to integrate more languages from other FOSS foundations to improve diversity and inclusivity.

Q: What model is used for object detection, and what is the data source?

We use Ultralytics' open-source YOLOv8s model, pretrained on the COCO dataset; Ultralytics provides some of the industry's best computer-vision models.
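For reference, a minimal sketch of a YOLOv8s detection step using the ultralytics package. The `detect_objects`/`filter_detections` names and the 0.5 confidence threshold are our illustrative assumptions, not the repository's code.

```python
def filter_detections(detections, min_conf=0.5):
    """Keep only the labels whose confidence clears the threshold
    before they are summarized for the user."""
    return [label for label, conf in detections if conf >= min_conf]

def detect_objects(frame):
    """Run COCO-pretrained YOLOv8s on one frame (pip install ultralytics)."""
    from ultralytics import YOLO
    model = YOLO("yolov8s.pt")  # weights are downloaded on first use
    result = model(frame)[0]    # first (and only) image in the batch
    return [(result.names[int(box.cls)], float(box.conf))
            for box in result.boxes]
```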

Q: I have another question; where can I ask it?

We are open to resolving your queries and eager to collaborate. You can open an issue here or mail us at: link


✅ Tasks & Bonus Checklist

  • ✅ All members of the team completed the mandatory task – followed at least 2 of our social channels and filled the form (details in Participant Manual)
  • ✅ All members of the team completed Bonus Task 1 – sharing of badges and filled the form (2 points) (details in Participant Manual)
  • ✅ All members of the team completed Bonus Task 2 – signing up for Sprint.dev and filled the form (3 points) (details in Participant Manual)

🧪 How to Run the Project

Requirements:

Local Setup:

  1. Clone the repository:

    git clone https://github.com/Pramod-325/whatsthat.git
    cd whatsthat

  2. Open two separate terminals in the same "whatsthat" folder, and make sure uv is installed:

    uv --version   # to check that uv is installed properly

Backend Setup (in Terminal 1)

  1. Open backend folder

    cd backend        # run this in 1st Terminal
  2. Install dependencies:

     uv add -r requirements.txt          #in backend terminal (if uv is installed)
                     (or)
     pip install -r requirements.txt     #if uv is not installed
  3. Create a .env file in the backend folder with your Groq API key:

    GROQ_API_KEY=your_groq_api_key_here
    

    Then place the downloaded YOLO models in the "yolo_models" directory, or use the one provided.

  4. Start the backend server:

     uv run main.py           #if uv is installed
             (or)
     python main.py
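For context on step 3 above: the backend needs GROQ_API_KEY available in its environment at startup. Below is a minimal, self-contained sketch of what a `.env` loader does (the real project may simply use the python-dotenv package; this helper is ours, for illustration only).

```python
import os

def load_env_file(path=".env"):
    """Minimal KEY=value .env reader (python-dotenv does this and more)."""
    if not os.path.exists(path):
        return
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                key, _, value = line.partition("=")
                # Do not override variables already set in the environment
                os.environ.setdefault(key.strip(), value.strip())

load_env_file()                      # reads backend/.env if present
api_key = os.getenv("GROQ_API_KEY")  # None if the key is missing
```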

Frontend Setup (in Terminal-2)

  1. Navigate to the frontend directory from 'whatsthat' folder:

    cd frontend
  2. Install dependencies:

    npm install
  3. Start the development server:

    npm run dev

  4. Open your browser at http://localhost:5173/ (make sure a webcam is available)

Using the Application

  1. Grant camera access when prompted
  2. Click "Start Vision Assistant" to begin processing
  3. The application will detect objects and provide audio descriptions
  4. Click "Stop Vision Assistant" to end the session



🧬 Future Scope

  • 📈 Improved YOLO models with custom data training
  • 🛡️ Security enhancements, such as running everything locally
  • 🌐 More LLM integrations for native languages, for worldwide users

📎 Resources / Credits


🏁 Final Words

It's our first online hackathon and a completely new experience, which we enjoyed a lot; there were challenges, like the project working on one person's computer and not on others' 😂. We learnt how to properly collaborate online to complete a project using GitHub's core functionality, and our attempts to deploy the application will forever be memorable because of the way the Namespace community planned and executed the event, so a huge shoutout goes to them 🎊🎉🎉


About

A simplified application for visually impaired. https://whatsthat-2504.netlify.app
