Web application built with Python 3.10 and Vosk for transcribing microphone audio on the fly and detecting questions (Russian language); useful for meetings.
Features:
- recording speakers' audio data and preparing a speaker pool
- online microphone stream processing with transcription and speaker detection
- export of finished audio sessions (metadata + detected questions)
Docker, x86_64/amd64 architecture, 16+ GB RAM (the Vosk Server models use at least 8 GB)
For a manual run: Python 3.10, an installed MongoDB, a Vosk Server, and optionally Node if you wish to rebuild the JS bundles.
- `VOSK_SERVER_WS_URL` - URL where the Vosk websocket server is started, default is `ws://localhost:2700`
- `MONGODB_URI` - URL to MongoDB, default is `mongodb://localhost:27017`
- `MONGO_VOSK_DB_NAME` - name of the Mongo database, default is `nir-zoom`
- `MONGO_SPEAKERS_COL_NAME` - name of the collection with speaker records, default is `speakers`
- `MONGO_SESSIONS_COL_NAME` - name of the collection with meeting records, default is `sessions`
- `GOOD_SPK_FRAMES_NUM` - threshold for analyzing the quality of recorded speaker features, default is `300`
- `MIN_SPK_VECTORS_NUM` - threshold for checking the quantity of recorded speaker features, default is `8`
- `SPK_GOOD_RATIO` - minimum allowed ratio of (num good speaker features / num all speaker features), default is `0.65`
- `MERGE_DIFF_SEC` - how close, in seconds, two phrases must be to be combined into one, default is `2.5`
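For reference, a `.env` file with every variable at its documented default looks like this:

```
VOSK_SERVER_WS_URL=ws://localhost:2700
MONGODB_URI=mongodb://localhost:27017
MONGO_VOSK_DB_NAME=nir-zoom
MONGO_SPEAKERS_COL_NAME=speakers
MONGO_SESSIONS_COL_NAME=sessions
GOOD_SPK_FRAMES_NUM=300
MIN_SPK_VECTORS_NUM=8
SPK_GOOD_RATIO=0.65
MERGE_DIFF_SEC=2.5
```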
- Prepare a `.env` file with the needed variables, except for `VOSK_SERVER_WS_URL` and `MONGODB_URI`; these are set automatically because dedicated images of Vosk Server and MongoDB are used
- Run `docker-compose up -d` (a sketch of such a stack is shown below)
- Navigate to http://localhost:3030
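For orientation only, a minimal sketch of what such a compose stack could look like; the service names, app build, and wiring are assumptions, not the project's actual docker-compose.yml (only the `supersolik/vosk-ru-spk:latest` image and the ports are taken from this README):

```yaml
# Hypothetical sketch, not the project's actual docker-compose.yml
services:
  vosk:
    image: supersolik/vosk-ru-spk:latest   # image mentioned in this README
  mongo:
    image: mongo
  app:
    build: .               # assumed: app image built from this repo
    env_file: .env
    environment:
      VOSK_SERVER_WS_URL: ws://vosk:2700        # "set automatically" per the step above
      MONGODB_URI: mongodb://mongo:27017
    ports:
      - "3030:3030"
    depends_on:
      - vosk
      - mongo
```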
- Install dependencies using `pip install -r requirements.txt`. If you plan to run the Vosk Server code manually, install the Vosk deps using `pip install -r vosk-requirements.txt`
- Check that Vosk Server and MongoDB are up and running; for Vosk Server you can use the `supersolik/vosk-ru-spk:latest` Docker image, or download the Vosk models (ru model and spk model) and run the server code manually
- Set the env vars described above; in this case `VOSK_SERVER_WS_URL` and `MONGODB_URI` are required, to point the app to your running Vosk Server and MongoDB (see the example session below)
- Run `uvicorn backend:app.app --host 127.0.0.1 --port 3030`
- Navigate to http://localhost:3030
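Putting the manual-run steps together, a session might look like this (the env var values here are the documented defaults; the uvicorn command is the one from the step above):

```sh
export VOSK_SERVER_WS_URL=ws://localhost:2700
export MONGODB_URI=mongodb://localhost:27017
uvicorn backend:app.app --host 127.0.0.1 --port 3030
```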
REST endpoint docs are available at the `GET /docs` endpoint; open it in a browser.
The app also exposes two websocket endpoints, `/spk/ws` and `/meeting/ws`, responsible for processing the audio chunks sent from the client over websockets for speaker and meeting sessions (see the client sketch below).
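For orientation, a minimal client sketch for the `/meeting/ws` endpoint. It assumes the server accepts raw binary audio chunks and replies with text results; the chunk size, audio format, and message schema are assumptions, not documented API:

```python
# Hypothetical client sketch; chunk size, audio format, and reply schema
# are assumptions, not part of the documented websocket API.
import asyncio
import websockets

async def stream_audio(path: str = "meeting.raw") -> None:
    uri = "ws://localhost:3030/meeting/ws"
    async with websockets.connect(uri) as ws:
        with open(path, "rb") as audio:
            while chunk := audio.read(4000):  # hypothetical chunk size
                await ws.send(chunk)          # send raw audio bytes
                print(await ws.recv())        # server's recognition reply

asyncio.run(stream_audio())
```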
- To record a speaker, click on the `Record speakers` link
- To record a meeting session, click on the `Record meeting` link
- To view and export recorded meetings, click on the `Export meeting stats` link
- Input the speaker name in the form and click the `Set speaker` button; the app will initialize a speaker recording session, and the recording button will become available. The name cannot be changed; if you made a typo, you will need to reload the page and create a new speaker
- Click the `Start recording` button and allow microphone usage in the following browser prompt
- Wait for 1 sec and start speaking, loud and clear. Try to speak in long sentences, with pauses of about 2 seconds between sentences. Total speaking time should be about 1-2 minutes
- When you're done, click the `Stop recording` button. The app will analyze the recorded data and display the total time, along with the quality and quantity of the recorded data. If the recording is not good enough, recommendations for improvement will be shown
- Recording is complete; if you wish to record another speaker, simply reload the page or click the `Record another speaker` button (it appears after recording is stopped)

If you want to record different data for the same speaker, input the same name in the next recording session; the data from the previous recording will be overwritten.
- Input the meeting name in the form and select speakers from the dropdown multiselect with checkboxes. Then click the `Set meeting data` button to initialize a meeting session
- Click `Start recording` and allow microphone usage in the following browser prompt; after about 1 sec the app will start processing the microphone audio on the fly and display detected phrases with speaker names in the textarea that appears below the buttons
- When you wish to stop processing and finish recording, click the `Stop recording` button. The app will analyze the recorded data and display the total time, along with how many of the speakers you chose at the start were actually detected
- The `Export` button will appear, if you wish to export the meeting stats right away. If you don't, you can always view and export stats on the export page
- Recording is done; you can start another audio session or close the page

The app adds a timestamp to each session name, so the same name can be used for different sessions.
The app uses the microphone currently used by the browser, so if you wish to change the mic, do it in the browser or OS settings.
The page supports pagination: 8 records are shown per page, and when the number of sessions exceeds 8, controls for navigating between pages appear around the `Page N` text.
To export any record, just click `Export`; the data is exported as a zip archive with 2 CSV files: `metadata.csv` contains the data shown in the table, and `questions.csv` contains the speakers' questions data with timestamps in seconds relative to the start of the recording (a purely hypothetical sample is shown below).
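The exact column layout isn't specified here; purely as a hypothetical illustration, `questions.csv` could look something like:

```
speaker,timestamp_sec,question
Alice,12.5,"как это работает?"
Bob,47.0,"почему выбран этот подход?"
```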
- Go to `./frontend_scripts`
- Run `npm install --include dev`
- To build the JS bundles, run `npm run build_speaker` and `npm run build_meeting`
- Speaker detection is based on the cosine distance between the detected speaker feature vector (obtained via Vosk) and the pre-recorded feature vectors: for each speaker's vector set the mean distance is calculated, and the speaker with the minimal mean distance is picked (see the first sketch below)
- Question detection is rule-based; the rules are gathered in the `QUESTION_RULES` array at the top of the `vosk_utils/__init__.py` file. Each rule is a boolean function that checks for keyword entries in the passed text
- Forbidden ("bad") words are stored as a pickled Python list in `bad_words.pkl` in the `vosk_utils` folder. If you wish to extend this list, unpickle the file, extend the Python list with the needed words, and pickle it again. Bad words are simply erased from recognized phrases
- Questions are assembled during export with the following algorithm: if the rule-based check determines that a phrase is a question, at most 5 subsequent phrases from the same speaker (without interruption from other speakers) are appended to the question text. The logic behind this is that the rule-based check looks at the start of the question, while the following phrases, even after pauses, are usually still part of the question, even if they don't contain explicit question keywords (see the second sketch below)
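A minimal sketch of the speaker-picking step described in the first note above; the function and variable names are hypothetical, not taken from the codebase, and per-speaker features are assumed to be stored as lists of numpy vectors:

```python
# Hypothetical sketch of mean-cosine-distance speaker picking.
import numpy as np

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def pick_speaker(detected: np.ndarray, speakers: dict[str, list[np.ndarray]]) -> str:
    # For each speaker, average the distance between the detected vector
    # and every pre-recorded vector; pick the speaker with the minimal mean.
    means = {
        name: np.mean([cosine_distance(detected, v) for v in vectors])
        for name, vectors in speakers.items()
    }
    return min(means, key=means.get)
```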
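And a hypothetical illustration of a `QUESTION_RULES`-style check plus the export-time question assembly; the rule contents and phrase structure are assumptions based on the description above, not the actual code:

```python
# Hypothetical rules: each is a boolean function checking keyword entries.
QUESTION_RULES = [
    lambda text: text.strip().endswith("?"),
    lambda text: any(kw in text.lower() for kw in ("как", "почему", "зачем")),
]

def is_question(text: str) -> bool:
    return any(rule(text) for rule in QUESTION_RULES)

def assemble_questions(phrases: list[dict]) -> list[str]:
    # phrases: [{"speaker": str, "text": str}, ...] in chronological order.
    questions, i = [], 0
    while i < len(phrases):
        cur = phrases[i]
        if is_question(cur["text"]):
            parts, j = [cur["text"]], i + 1
            # Append at most 5 subsequent phrases from the same speaker,
            # stopping at the first interruption by another speaker.
            while j < len(phrases) and j <= i + 5 and phrases[j]["speaker"] == cur["speaker"]:
                parts.append(phrases[j]["text"])
                j += 1
            questions.append(" ".join(parts))
            i = j
        else:
            i += 1
    return questions
```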
Activate a virtual environment with Python 3.10, then install dependencies using:
`pip install -r requirements.txt`
`pip install -r vosk-requirements.txt`
Use `python3.10 main.py --help` to invoke the detailed description.
Activate a virtual environment with Python 3.10, then install dependencies using:
`pip install -r whisper-pyannote-requirements.txt`
Use `python3.10 main_whisper.py --help` to invoke the detailed description of how to run.
Link to a screencast with the first results (19.03.2022 - CLI app): https://drive.google.com/file/d/1MQdnaoQoiWK9L1MZP185QtTg8CifSfz9/view?usp=sharing