Chad edited this page Sep 23, 2017 · 11 revisions

Scope

This is minutes’ scope document.

Vision

To cluster and recognize different speakers in audio recorded conversations, produce transcriptions of these conversations, and label individual phrases with speakers and time stamps.

Minimum Viable Scope

Given an audio recording (loosely defined for the time being) of a conversation with n speakers, identify which speakers spoke which phrases and produce a list of phrases from the conversation. Each phrase should have the following keys: speaker, start_time, end_time, and body.

Example Output

[
    {
          "speaker": 0,
          "start_time": 148172245,
          "end_time": 148172251,
          "body": "Hello everyone, thanks for coming in. We have a lot to get through today so let’s get started."
    },
    {
          "speaker": 1,
          "start_time": 148172251,
          "end_time": 148172253,
          "body": "Happy to be here."
    }
]

In general, phrases may overlap with one another in time, but no two phrases from the same speaker should overlap.
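The constraint above can be checked directly against the example output format. The following is a minimal sketch (the function name `validate_phrases` is illustrative, not part of the project): it accepts overlaps across different speakers but rejects overlapping phrases from the same speaker.

```python
from itertools import groupby


def validate_phrases(phrases):
    """Return True if no two phrases from the same speaker overlap in time.

    Phrases from different speakers are allowed to overlap.
    """
    # Group phrases by speaker, sorted by start time within each speaker.
    ordered = sorted(phrases, key=lambda p: (p["speaker"], p["start_time"]))
    for _, group in groupby(ordered, key=lambda p: p["speaker"]):
        group = list(group)
        # Adjacent phrases from one speaker must not overlap.
        for prev, curr in zip(group, group[1:]):
            if curr["start_time"] < prev["end_time"]:
                return False
    return True
```

For example, two speakers talking over one another is valid output, while a single speaker with two simultaneous phrases is not.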

Acceptable Misclassification Rates

  • A minimum of 90% accuracy on an out-of-sample cross-validation test for MVP.
  • A minimum of 99% accuracy on an out-of-sample cross-validation test for production.
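One way to measure these rates, assuming accuracy means the fraction of phrases attributed to the correct speaker on a held-out sample (the document does not pin down the metric, so this is an assumption):

```python
def speaker_accuracy(true_speakers, predicted_speakers):
    """Fraction of held-out phrases whose predicted speaker label
    matches the true speaker label.

    Assumption: accuracy is measured per phrase; the scope document
    does not specify the exact metric.
    """
    if len(true_speakers) != len(predicted_speakers):
        raise ValueError("label lists must be the same length")
    correct = sum(t == p for t, p in zip(true_speakers, predicted_speakers))
    return correct / len(true_speakers)
```

Under this definition, the MVP threshold would be `speaker_accuracy(...) >= 0.90` on out-of-sample data, and the production threshold `>= 0.99`.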

Anti-vision

Voice-to-text is a largely solved problem. Google Speech and similar APIs provide near-perfect speech recognition and transcription; therefore speech recognition is not minutes’ goal, though minutes will likely leverage some of these tools.
