Speaker Identification System

A complete end-to-end speaker identification system for audio conversations, featuring:

Automatic transcription with speaker diarization
Speaker identification using neural embeddings
Short utterance handling
Self-improving database
Conversation management and organization

Features

Conversation Processing: Transcribe and analyze audio files with speaker diarization
Speaker Identification: Match speakers to a database of voice embeddings using NeMo TitaNet
Short Utterance Handling: Special handling for very short utterances, combining utterances when needed
Self-improving Database: Automatically adds high-quality utterances to improve future recognition
Complete Organization: Structured storage of conversations, utterances, and metadata
Rename Functionality: Tools to rename speakers and update all related files
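
The short-utterance handling listed above could work roughly as follows. This is a minimal sketch rather than the repository's actual logic: the `Utterance` dataclass, the `merge_short_utterances` helper, and the 1.5 second threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    speaker_label: str  # diarization label such as "A" or "B" (hypothetical field)
    start_ms: int
    end_ms: int
    text: str

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


def merge_short_utterances(utterances, min_ms=1500):
    """Merge adjacent utterances with the same label when either one is short.

    min_ms is an assumed threshold, not a value taken from the project.
    """
    merged = []
    for utt in utterances:
        prev = merged[-1] if merged else None
        if (
            prev is not None
            and prev.speaker_label == utt.speaker_label
            and (prev.duration_ms < min_ms or utt.duration_ms < min_ms)
        ):
            # Extend the previous segment so the combined audio is long enough to embed.
            merged[-1] = Utterance(
                prev.speaker_label, prev.start_ms, utt.end_ms, f"{prev.text} {utt.text}"
            )
        else:
            merged.append(utt)
    return merged
```

Merging only within the same diarization label keeps the combined segment from mixing two voices, which would blur the resulting embedding.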

System Architecture

┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│                 │     │                 │     │                 │
│  Audio Input    │────▶│  Transcription  │────▶│  Speaker ID     │
│  (.m4a, .wav)   │     │  (AssemblyAI)   │     │  (NeMo TitaNet) │
│                 │     │                 │     │                 │
└─────────────────┘     └─────────────────┘     └─────────────────┘
                                                        │
                                                        ▼
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│                 │     │                 │     │                 │
│  Frontend UI    │◀───▶│  Speaker DB     │◀────│  Conversation   │
│  (Web/Mobile)   │     │  (Pinecone)     │     │  Metadata       │
│                 │     │                 │     │                 │
└─────────────────┘     └─────────────────┘     └─────────────────┘

Core Scripts and Front-End Integration

Main Scripts

speaker_id_testing.py: Main script to process conversations
- Input: Audio file path
- Output: Processed conversation with metadata, transcript, and utterances
- Front-End Usage: Call via API to process new audio files
- Format: Returns JSON metadata with conversation details
update_speaker_db_verified.py: Add verified utterances to the database
- Input: Speaker directory path, confidence threshold
- Output: Updates Pinecone database with new voice embeddings
- Front-End Usage: Call to add new speaker samples or improve existing ones
rename_speaker.py: Rename speakers and update all related files
- Input: Conversation path, old speaker name, new speaker name
- Output: Updated files with the new speaker name
- Front-End Usage: Call to correct misidentified speakers
direct_model_download.py: Download the TitaNet model
- Input: None
- Output: Downloaded model to models directory
- Front-End Usage: Call during initial setup or model updates
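
The confidence threshold accepted by update_speaker_db_verified.py suggests a gating step before anything is written to Pinecone. The sketch below shows one way such a gate might look. The function names, the "Unknown" speaker label, and the 0.8 default are hypothetical, and the real script may decide differently. The outcome names follow the database_update_stats keys in the conversation metadata.

```python
from collections import Counter


def classify_for_db(utterance, known_embedding_ids, threshold=0.8):
    """Return 'added' or a 'skipped_*' reason for one utterance dict."""
    if utterance["speaker"] == "Unknown":  # assumed label for unidentified speakers
        return "skipped_unknown"
    if utterance["confidence"] < threshold:
        return "skipped_low_confidence"
    if utterance["embedding_id"] in known_embedding_ids:
        return "skipped_duplicate"
    return "added"


def summarize_update(utterances, known_embedding_ids, threshold=0.8):
    """Tally the outcome for every utterance in a conversation."""
    seen = set(known_embedding_ids)
    stats = Counter()
    for utt in utterances:
        outcome = classify_for_db(utt, seen, threshold)
        stats[outcome] += 1
        if outcome == "added":
            # Remember the ID so a repeat within the same run counts as a duplicate.
            seen.add(utt["embedding_id"])
    return dict(stats)
```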

Data Structures

Conversation Metadata (JSON)

{
  "conversation_id": "conversation_20250312_161957",
  "original_audio": "original_audio.m4a",
  "date_processed": "2025-03-12T16:19:57.123456",
  "duration_seconds": 360.5,
  "speakers": ["Mike Shaffer", "Simeon Reyes"],
  "utterances": [
    {
      "id": "utterance_001",
      "start_time": "00:00:05",
      "end_time": "00:00:10",
      "start_ms": 5000,
      "end_ms": 10000,
      "speaker": "Mike Shaffer",
      "text": "Hello, how are you?",
      "confidence": 0.85,
      "embedding_id": "speaker_Mike_Shaffer_abc123",
      "audio_file": "utterances/utterance_001.wav"
    },
    // More utterances...
  ],
  "short_utterance_stats": {
    "total": 14,
    "identified_directly": 12,
    "identified_combined": 2,
    "unidentified": 0
  },
  "database_update_stats": {
    "added": 5,
    "skipped_low_confidence": 2,
    "skipped_unknown": 1,
    "skipped_duplicate": 6
  }
}
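
A frontend that visualizes speaker participation needs little more than the utterances array. As an illustration, the sketch below loads a metadata file and totals speaking time per speaker from start_ms and end_ms. The helper names are hypothetical; the field names are the ones shown in the example above.

```python
import json
from collections import defaultdict


def load_metadata(path):
    """Read a metadata.json file into a dict."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def speaking_time_seconds(metadata):
    """Sum utterance durations per speaker, in seconds."""
    totals = defaultdict(float)
    for utt in metadata["utterances"]:
        totals[utt["speaker"]] += (utt["end_ms"] - utt["start_ms"]) / 1000.0
    return dict(totals)
```

Note that json.load accepts only strict JSON, so the `// More utterances...` placeholder in the example above stands in for further objects and would not appear in a real file.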

Speaker Database Structure (Pinecone)

Each vector in Pinecone has:

ID: Unique identifier (e.g., "speaker_Mike_Shaffer_abc123")
Vector: 192-dimensional voice embedding from TitaNet
Metadata:
- speaker_name: Name of the speaker
- source_file: Source audio file
- is_short_utterance: Whether this is a short utterance
- duration_seconds: Duration of the utterance
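
To make the record layout concrete, here is a sketch that assembles one vector in the id / values / metadata shape and compares two embeddings. It uses no Pinecone calls. The `make_record` helper and the 2 second cutoff for is_short_utterance are assumptions, and cosine similarity is used only as an example metric, since the index metric is not stated here.

```python
import math

EMBEDDING_DIM = 192  # TitaNet embedding size stated above


def make_record(vector_id, embedding, speaker_name, source_file,
                duration_seconds, short_threshold_s=2.0):
    """Build a dict with id / values / metadata for one voice embedding."""
    if len(embedding) != EMBEDDING_DIM:
        raise ValueError(f"expected {EMBEDDING_DIM} values, got {len(embedding)}")
    return {
        "id": vector_id,
        "values": [float(x) for x in embedding],
        "metadata": {
            "speaker_name": speaker_name,
            "source_file": source_file,
            # short_threshold_s is an assumed cutoff, not taken from the project.
            "is_short_utterance": duration_seconds < short_threshold_s,
            "duration_seconds": duration_seconds,
        },
    }


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```

Checking the dimension before upload catches a mismatched model early, because a Pinecone index is created with a fixed dimension.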

Front-End Development Guide

Suggested Features

Conversation Management
- Upload and process new audio files
- List all processed conversations
- View conversation details, transcripts, and utterances
- Visualize speaker timeline and participation
Speaker Management
- List all speakers in the database
- View speaker details and statistics
- Add new speakers
- Rename speakers
- Add utterances to improve speaker recognition
Transcript Visualization
- Interactive transcript with speaker highlighting
- Timeline view of speaker contributions
- Search functionality for transcript content
- Export options (JSON, TXT, SRT)
Database Management
- View database statistics
- Manage voice embeddings (add, remove)
- Adjust confidence thresholds
- Backup/restore functionality
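
Of the export options suggested above, SRT is the only one with a fixed timestamp layout, so it is worth sketching. The example below converts utterance dicts (using the start_ms, end_ms, speaker, and text fields from the conversation metadata) into SRT text. The function names and the "Speaker: text" caption style are illustrative choices, not part of the project.

```python
def ms_to_srt_timestamp(ms):
    """Format milliseconds as HH:MM:SS,mmm, the timestamp style SRT uses."""
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def utterances_to_srt(utterances):
    """Render utterances as numbered SRT cues separated by blank lines."""
    blocks = []
    for index, utt in enumerate(utterances, start=1):
        start = ms_to_srt_timestamp(utt["start_ms"])
        end = ms_to_srt_timestamp(utt["end_ms"])
        blocks.append(f"{index}\n{start} --> {end}\n{utt['speaker']}: {utt['text']}\n")
    return "\n".join(blocks)
```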

REST API Endpoints (To Be Implemented)

A full frontend would need API endpoints along these lines, none of which exist yet:

GET /api/conversations - List all conversations
GET /api/conversations/{id} - Get conversation details
POST /api/conversations - Upload and process a new conversation
DELETE /api/conversations/{id} - Delete a conversation

GET /api/speakers - List all speakers
GET /api/speakers/{id} - Get speaker details
POST /api/speakers - Add a new speaker
PUT /api/speakers/{id} - Update speaker (rename)
DELETE /api/speakers/{id} - Delete a speaker

POST /api/utterances - Add an utterance to the database
GET /api/utterances/{speaker_id} - Get utterances for a speaker
DELETE /api/utterances/{id} - Delete an utterance
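
Since none of these endpoints exist yet, the following is only a framework-agnostic sketch of the logic behind the four conversation routes. `ConversationStore` and `NotFound` are hypothetical names, and the in-memory dict stands in for whatever storage the backend ends up using. A Flask or FastAPI route would call these methods and translate `NotFound` into a 404 response.

```python
class NotFound(Exception):
    """Raised when a requested resource does not exist (would map to HTTP 404)."""


class ConversationStore:
    """In-memory stand-in for stored conversation metadata."""

    def __init__(self):
        self._items = {}

    def list_ids(self):
        # GET /api/conversations
        return sorted(self._items)

    def get(self, conversation_id):
        # GET /api/conversations/{id}
        try:
            return self._items[conversation_id]
        except KeyError:
            raise NotFound(conversation_id) from None

    def create(self, metadata):
        # POST /api/conversations (after the audio has been processed)
        self._items[metadata["conversation_id"]] = metadata
        return metadata["conversation_id"]

    def delete(self, conversation_id):
        # DELETE /api/conversations/{id}
        if self._items.pop(conversation_id, None) is None:
            raise NotFound(conversation_id)
```

Keeping this layer free of web-framework imports lets the same code serve both the API and the existing command-line scripts.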

Integration Steps

Create a Python Flask/FastAPI Backend
- Wrap existing scripts in API endpoints
- Handle file uploads and processing
- Manage authentication and permissions
- Implement caching for better performance
Build a Modern Frontend
- Use React, Vue, or Angular for the UI
- Implement responsive design for mobile/desktop
- Create interactive visualizations for transcripts
- Design an intuitive speaker management interface
Real-time Processing
- Implement WebSockets for real-time processing updates
- Show progress during long-running operations
- Provide notification system for completed processes
Deployment Considerations
- CPU/GPU requirements for the NeMo model
- Storage for audio files and processed conversations
- API rate limits for AssemblyAI and Pinecone
- Authentication and authorization

Setup

Clone this repository
Install dependencies: pip install -r requirements.txt
Set up environment variables:
- ASSEMBLYAI_API_KEY: Your AssemblyAI API key
- PINECONE_API_KEY: Your Pinecone API key
Run python direct_model_download.py to download the TitaNet model
Create a Pinecone index named "speaker-embeddings"
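
Before running any script, it can help to confirm that both API keys are set. The check below is a small sketch using only the standard library. `missing_env_vars` is a hypothetical helper, and the two variable names are the ones listed in the steps above.

```python
import os

REQUIRED_ENV_VARS = ("ASSEMBLYAI_API_KEY", "PINECONE_API_KEY")


def missing_env_vars(environ=None):
    """Return the required variable names that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in REQUIRED_ENV_VARS if not environ.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        raise SystemExit("Missing environment variables: " + ", ".join(missing))
    print("All required environment variables are set.")
```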

Directory Structure

speaker-identification/
├── models/                  # Downloaded NeMo models
├── processed_conversations/ # Organized conversation data
│   └── conversation_id/
│       ├── metadata.json    # Conversation metadata
│       ├── transcript.txt   # Formatted transcript
│       ├── original_audio.* # Original audio file
│       ├── utterances/      # Individual audio segments
│       └── speakers/        # Utterances organized by speaker
├── speaker_utterances/      # Legacy storage for utterances
├── hf_cache/                # HuggingFace cache directory
├── *.py                     # Python scripts
└── requirements.txt         # Dependencies
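
The per-conversation layout above can be created with pathlib. This is a sketch under assumptions: `create_conversation_dirs` is a hypothetical helper, and replacing spaces with underscores in speaker folder names is a guess at the naming convention, not something the project documents.

```python
from pathlib import Path


def create_conversation_dirs(root, conversation_id, speakers):
    """Create the utterances/ and speakers/ folders for one conversation."""
    conv_dir = Path(root) / "processed_conversations" / conversation_id
    (conv_dir / "utterances").mkdir(parents=True, exist_ok=True)
    (conv_dir / "speakers").mkdir(parents=True, exist_ok=True)
    for name in speakers:
        # Underscore-separated folder names are an assumed convention.
        (conv_dir / "speakers" / name.replace(" ", "_")).mkdir(exist_ok=True)
    return conv_dir
```

Because every mkdir call passes exist_ok=True, the function is safe to run again when a conversation is reprocessed.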

Notes

This repository includes conversation metadata but excludes audio files
You'll need to create your own .env file with your API keys
First-time setup requires downloading the TitaNet model (~1GB)
For production use, consider implementing a proper REST API layer
