Timed Text Editor Domain Problem - draft

TL;DR: The issue is that when the user corrects the text, they might delete, substitute or insert new words. These operations tend to lose the time-codes originally associated with each word. So we are investigating a way to re-align the text and preserve the time-codes, while keeping an eye on the performance issues that arise for transcripts over one hour.

Pagination might be a quick fix, but it introduces another set of problems.

Context

Some quick background for those new to the project.

slate-transcript-editor builds on top of the lessons learned from developing @bbc/react-transcript-editor (based on draftJs).

As the name suggests, slate-transcript-editor is built on top of slateJs, augmenting it with transcript-editing domain-specific functionality.

For more on "draftjs vs slatejs" for this use case, see these notes.

It is a react transcript editor component that allows users to correct automated transcriptions of audio or video generated by speech-to-text services.

It is used in projects such as autoEdit, an app to edit audio/video interviews, as well as in other situations where users might need to correct transcriptions.

The ambition is to have a component that takes in timed text (e.g. a list of words with start times), allows the user to correct the text (providing some convenience features, such as pausing playback while typing, and keeping some kind of correspondence between the text and the audio/video) and, on save, returns timed text in the same json format (referred to, for convenience, as the dpe format, after the digital paper edit project where it was first formalized).

{
  "words": [
    {
      "end": 0.46, // in seconds
      "start": 0,
      "text": "Hello"
    },
    {
      "end": 1.02,
      "start": 0.46,
      "text": "World"
    },
    ...
  ],
  "paragraphs": [
    {
      "speaker": "SPEAKER_A",
      "start": 0,
      "end": 3
    },
    {
      "speaker": "SPEAKER_B",
      "start": 3,
      "end": 19.2
    },
    ...
  ]
}

see here for more info on the dpe format

As part of slate-transcript-editor, this dpe format is then converted into the slateJs data model.
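As a rough illustration of what that conversion could look like (a minimal sketch with assumed property names, not the exact code used in slate-transcript-editor), each dpe paragraph becomes a Slate element carrying its speaker and the words that fall within its time range:

```javascript
// Minimal sketch of a dpe → slateJs conversion (illustrative only; the real
// slate-transcript-editor conversion may store things differently).
const dpeToSlate = (dpe) =>
  dpe.paragraphs.map((paragraph) => {
    // collect the words that fall inside this paragraph's time range
    const words = dpe.words.filter(
      (w) => w.start >= paragraph.start && w.start < paragraph.end
    );
    return {
      type: 'timedText',
      speaker: paragraph.speaker,
      start: paragraph.start,
      // keep the original word-level timings around so they can be
      // re-used (or re-aligned) on save
      words,
      children: [{ text: words.map((w) => w.text).join(' ') }],
    };
  });
```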

See the storybook demo for the slate-transcript-editor react component in practice.

Over time in this domain folks have tried a variety of approaches to solve this problem.

compute the timings

By listening to character insertion and deletion and detecting word boundaries, you could estimate the time-codes. This is a very fiddly approach, as there are a lot of edge cases to handle. E.g. what if a user deletes a whole paragraph? And over time the accuracy of the time-codes slowly fades (if a lot of correction is done to the text, e.g. if the STT is not very accurate).
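As a very rough sketch of the idea (hypothetical code, not from any of the projects mentioned here): after an edit you could re-split the paragraph into words and spread its original start/end across them, which shows both how simple and how lossy this approach is:

```javascript
// Hypothetical sketch: naively re-estimate word timings after an edit by
// spreading the paragraph's original time range evenly across the new words.
// Every round of correction degrades the accuracy a little more.
const estimateTimings = (editedText, paragraphStart, paragraphEnd) => {
  const tokens = editedText.trim().split(/\s+/).filter(Boolean);
  if (tokens.length === 0) return [];
  const duration = (paragraphEnd - paragraphStart) / tokens.length;
  return tokens.map((text, i) => ({
    text,
    start: paragraphStart + i * duration,
    end: paragraphStart + (i + 1) * duration,
  }));
};
```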

alignment - server side - Aeneas

Some folks have had some success running server-side alignment. For example, in pietrop/fact2_transcription_editor the editor was one giant contenteditable div, and on save it would send a plain-text version to the server (literally using .innerText). @frisch1's server-side code would then align it against the original media using the aeneas aligner by @pettarin.

Aeneas converts the text into speech (TTS) and then uses that waveform to compare against the original media, very quickly producing the alignment and restoring time-codes at either word or line level, depending on your preferences.

Aeneas uses dynamic time warping (DTW) over Mel-frequency cepstral coefficients (MFCC) (🤯). You can read more about how Aeneas works in the How Does This Thing Work? section of their docs.

This approach was somewhat successful for fact2_transcription_editor; Aeneas is very fast. However:

  • The alignment is only done on save to the database.
  • If a user continues to edit the page, over time more and more of the time-codes will disappear, until they refresh the page and the "last saved and aligned" transcript gets fetched from the db.
  • To set this up as "a reusable component" you'd always have to pair it with a server-side module to do the alignment (a sketch of such a wrapper is below).
  • Aeneas is great, but in its current form it does not exist as an npm module (as far as I am aware); it's written in python and has some system dependencies such as ffmpeg, a TTS engine, etc.
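Such a server-side module would essentially just shell out to aeneas. A sketch of what that wrapper could look like (assuming python, aeneas and ffmpeg are installed on the server; double-check the task options against the aeneas docs for your version):

```javascript
// Sketch of a server-side wrapper around the aeneas CLI (not an existing npm
// module). Assumes python + aeneas + ffmpeg + a TTS engine are installed.
const { execFile } = require('child_process');

const alignWithAeneas = (audioPath, plainTextPath, outputJsonPath) =>
  new Promise((resolve, reject) => {
    execFile(
      'python',
      [
        '-m', 'aeneas.tools.execute_task',
        audioPath,
        plainTextPath,
        // task options as documented by aeneas; adjust language etc. as needed
        'task_language=eng|is_text_type=plain|os_task_file_format=json',
        outputJsonPath,
      ],
      (error) => (error ? reject(error) : resolve(outputJsonPath))
    );
  });
```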
side note on word level time-codes and clickable words

I should mention that in fact2_transcription_editor you could click on individual words to jump to the corresponding point in the media.

With something equivalent to

<span data-start-time="0" data-end-time="0.46" class="words"> Hello </span> ...

A pattern I first came across in hyperaud.io's blog description of "hypertranscripts" by @maboa & @gridinoc.
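The wiring behind those clickable words can be as small as one delegated click handler that reads the data attributes and seeks the media element. A minimal sketch (assuming a hypothetical transcript container and media element with these ids):

```javascript
// Minimal sketch of "click a word to jump the media" via event delegation.
// Assumes a container with id="transcript" full of <span class="words"> elements
// like the one above, and a <video> or <audio> element with id="player".
const transcript = document.getElementById('transcript');
const player = document.getElementById('player');

transcript.addEventListener('click', (event) => {
  const word = event.target.closest('.words');
  if (!word) return;
  // data attributes come back as strings, hence parseFloat
  player.currentTime = parseFloat(word.dataset.startTime);
  player.play();
});
```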

STT based alignment - Gentle

Some folks have also used Gentle, by @maxhawkins, a forced aligner based on Kaldi, as a way to get alignment info.

I've personally used it in autoEdit2 as an open source, offline option for users to get transcriptions. But I haven't used it for alignment, as STT-based alignment is slower than the TTS-based kind.

alignment - client side - option 1 (stt-align)

Another option is to run the alignment client side, by doing a diff between the human-corrected (accurate) text and the timed text from the STT engine, and transposing the time-codes from the second onto the first.
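Conceptually the diff-and-transpose step boils down to something like this sketch (a hypothetical helper, much cruder than the real stt-align implementation):

```javascript
// Conceptual sketch of diff-based alignment (not the actual stt-align-node code).
// Matching words get the STT time-codes copied across; inserted words are left
// without timings and are filled in later by interpolation (see further below).
const alignWords = (sttWords, correctedText) => {
  const correctedWords = correctedText.trim().split(/\s+/);
  const aligned = correctedWords.map((text) => ({ text, start: null, end: null }));

  // Naive "diff": copy timings wherever words happen to match by index.
  // A real implementation uses a proper diff (e.g. difflib.js) to handle
  // insertions, deletions and substitutions robustly.
  aligned.forEach((word, i) => {
    const sttWord = sttWords[i];
    if (sttWord && sttWord.text.toLowerCase() === word.text.toLowerCase()) {
      word.start = sttWord.start;
      word.end = sttWord.end;
    }
  });
  return aligned;
};
```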

some more background and info on this solution

This solution was first introduced by @chrisbaume in bbc/dialogger (presented at textAV 2017). It modified CKEditor (at the time draftJS was not around yet) and ran the alignment server side in a custom python module, sttalign.py.

With @chrisbaume's help I converted the python code into a node module, stt-align-node, which is used in @bbc/react-transcript-editor and slate-transcript-editor.

One issue in converting from python to node is that for diffing python uses difflib, which is part of the standard library, while in the node module we use difflib.js, which might not be as performant (❓ 🤷‍♂️).

When a word is inserted (e.g. it was not recognized by the STT service and the user adds it manually), in this type of alignment there are no time-codes for it. Via interpolation of the time-codes of neighboring words, we can add back some time-codes. In the python version the time-code interpolation is done via numpy, linearly interpolating the missing times.

In the node version the interpolation is done via the everpolate module, and again it might not be as performant as the python version (❓ 🤷‍♂️).
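The interpolation step itself is plain linear interpolation between the nearest neighbouring words that still have time-codes. A self-contained sketch of the idea (rather than the actual numpy/everpolate based code):

```javascript
// Sketch: fill in missing start times by linearly interpolating (by word index)
// between the nearest words that still carry time-codes.
const interpolateMissingTimes = (words) => {
  const known = words
    .map((w, i) => ({ i, start: w.start }))
    .filter((w) => w.start !== null);

  return words.map((word, i) => {
    if (word.start !== null) return word;
    const before = [...known].reverse().find((k) => k.i < i);
    const after = known.find((k) => k.i > i);
    if (!before && !after) return word; // nothing to interpolate from
    if (!before || !after) return { ...word, start: (before || after).start };
    const ratio = (i - before.i) / (after.i - before.i);
    return { ...word, start: before.start + ratio * (after.start - before.start) };
  });
};
```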

However, in @bbc/react-transcript-editor and slate-transcript-editor, initially every time the user stopped typing for longer than a few seconds we'd run the alignment and trigger a save. This became very un-performant, especially for long transcriptions (e.g. approximately over 1 hour), because whether you change a paragraph or just one word, it runs the alignment across the whole text, which turned out to be a pretty expensive operation.

This led to removing user-facing word-level time-codes in the slateJs version, and removing auto-save, to improve performance on long transcriptions. However, on long transcriptions, even with manual save, the stt-align-node module can sometimes temporarily freeze the UI for a few seconds 😬 or, in the worst case scenario, even crash the page 😓 ☠️

more on retaining speaker labels after alignment

There is also a workaround for retaining speaker labels at paragraph level when using this module to run the alignment.

The module itself only aligns the words. To re-introduce the speakers, you just compare the aligned words against the paragraphs carrying the speaker info. Example of converting into slateJs format or into dpe format from slateJs.
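That comparison can be as simple as bucketing the aligned words back into the original paragraph time ranges (a sketch of the idea behind the linked conversion examples, with assumed field names):

```javascript
// Sketch: re-attach speaker labels after alignment by bucketing the aligned
// words back into the original paragraphs' time ranges.
const addSpeakers = (alignedWords, paragraphs) =>
  paragraphs.map((paragraph) => ({
    speaker: paragraph.speaker,
    start: paragraph.start,
    end: paragraph.end,
    words: alignedWords.filter(
      (w) => w.start >= paragraph.start && w.start < paragraph.end
    ),
  }));
```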

Which is why in PR #30 we are considering pagination. But before taking a closer look at that, let's consider one more option.

alignment - client side - option 2 (web-aligner)

Another option, explored by @chrisbaume at textAV 2017, was to make a webaligner (example here and code of the example here): a simple, lightweight client-side forced aligner for timed text leveraging the browser audio API (AudioContext), and doing a computation similar to Aeneas (? not sure about this last part).

This option is promising, but it was never fully fleshed out to a usable state. It might also only work when aligning small sentences, due to browser limitations (?).

Overtyper

Before considering pagination, a completely different approach to the UX problem of correcting text is overtyper, by @alexnorton & @maboa, from textAV 2017. You follow along as a range of words is highlighted while the media plays. To correct, you start typing from the last correct word you heard until the next correct one, so that the system can adjust/replace/insert all the ones in between. This makes the alignment problem a lot narrower, and new word timings can be more easily computed.

This is promising, but unfortunately, as far as I know, there hasn't been a lot of user testing of this approach to validate it.