
DAPT Explained

Authors:

  • Cyril Concolato
  • Nigel Megitt

Introduction

DAPT (Dubbing and Audio description Profiles of TTML2) defines a profile of the Timed Text Markup Language version 2.0 [TTML2] intended to support dubbing and audio description workflows worldwide and to permit usage of visual presentation features from other TTML2 profiles such as [IMSC].

This profile can be used by content providers to meet the Media Accessibility User Requirements (MAUR) for Alternative Content Technologies, specifically described video, subtitles and transcripts.

An illustration of how content can be produced in dubbing and audio description workflows, and of where DAPT content fits in, is given in the figure below. The different steps in this workflow can be performed by different authors using different tools. DAPT documents can represent content at various stages of this workflow.

(Figure: Hypothetical combined workflow for audio description and dubbing)

For more details about this figure, and the corresponding steps, refer to the [DAPT Requirements].

An example of a DAPT document used for dubbing is as follows:

<tt xmlns="http://www.w3.org/ns/ttml"
    xmlns:ttp="http://www.w3.org/ns/ttml#parameter"
    ttp:contentProfiles="http://www.w3.org/ns/ttml/profile/dapt1.0/content"
    xml:lang="en"
    xmlns:ttm="http://www.w3.org/ns/ttml#metadata"
    xmlns:daptm="http://www.w3.org/ns/ttml/profile/dapt#metadata"
    daptm:workflowType="dubbing"
    daptm:scriptType="translatedTranscript">
  <head>
    <metadata>
      <ttm:agent type="character" xml:id="character_1">
        <ttm:name type="alias">ASSANE</ttm:name>
      </ttm:agent>
    </metadata>
  </head>
  <body>
    <div begin="10s" end="13s" ttm:agent="character_1">
      <p xml:lang="fr" daptm:langSrc="original">
        <span>Et c'est grâce à ça qu'on va devenir riches.</span>
      </p>
      <p xml:lang="en" daptm:langSrc="translation">
        <span>And thanks to that, we're gonna get rich.</span>
      </p>
    </div>
  </body>
</tt>

In this example, the document represents dialogue transcribed from a movie and then translated. The character is identified so that the dialogue can be recorded by the corresponding voice actor. Both the original and translated content are preserved, to permit any last-minute adaptation of the translated content while keeping the original intent. Richer dubbing examples can be found in the draft specification.

An example of DAPT document representing audio descriptions that could be rendered on the client side is as follows:

...
    <div begin="18s" end="20s">
      <p daptm:langSrc="original">
        <span tta:speak="normal">
          The woman pulls the tiller and the boat turns.</span>
      </p>
    </div>
...

In this example, additional attributes are provided to drive a text-to-speech engine. These attributes are already defined in TTML2 and can in turn be mapped to SSML. The text-to-speech engine can be run server-side, in which case what is delivered to the end user can simply be an audio track. It can also be run client-side, for example in a browser. Other audio description examples can be found in the draft specification, including some that mix in pre-rendered audio recordings (when the quality of a text-to-speech engine is not satisfactory) and some that animate gain parameters (for example, to make the audio description more easily heard).
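
Beyond tta:speak and tta:gain, TTML2 also defines the tta:pitch and tta:pan audio style attributes. The fragment below is an illustrative sketch, not taken from the specification; the attribute values are only examples:

...
    <div begin="18s" end="20s">
      <p daptm:langSrc="original">
        <!-- illustrative values: raise the pitch by 10% and pan half-left -->
        <span tta:speak="normal" tta:pitch="+10%" tta:pan="-0.5">
          The woman pulls the tiller and the boat turns.</span>
      </p>
    </div>
...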

Goals

The dubbing and audio description industry lacks a standard exchange format to represent transcripts of what is happening in a media asset, or to represent what needs to be or has been recorded as audio tracks. Current practices include exchanging scans of printed and manually annotated documents, spreadsheets or Word documents with no clearly defined structure, or other ad hoc formats.

The goals of DAPT are:

  • provide a standard format for the exchange of transcripts in dubbing and audio description authoring workflows
  • provide a standard format for the delivery of audio descriptions to clients

The end users of the DAPT specification are content creators (audio description writers, translators, dubbing actors) and visually impaired users consuming audio descriptions on their devices with text-to-speech tools (e.g. browsers).

Non-goals

  • It is NOT the goal of DAPT to define yet another timed text format. DAPT reuses existing W3C standards.
  • It is NOT the goal of DAPT to define APIs to create or consume DAPT content. DAPT only defines a serialization format.
  • It is NOT the goal of DAPT to define how authoring tools should display DAPT content. DAPT does not define layout features.
  • It is NOT the goal of DAPT to define full-fledged audio mixing tools. DAPT only allows basic audio mixing features.

User research

DAPT is based on the results of two previous initiatives, one addressing audio description workflows and one addressing dubbing workflows.

Feature: Representing Timed Text

A key requirement of DAPT is the ability to represent text with times matching media times in the corresponding media asset.

TTML defines a rich set of timing semantics. DAPT restricts these to match the needs of dubbing and audio description workflows and to simplify implementations. For example, TTML can express times using wall-clock times; DAPT does not. In the simplest case, a DAPT document only needs the begin and end attributes, as in the following example.

...
    <div begin="10s" end="13s">
    </div>
    <div begin="18s" end="20s">
    </div>
...

Timing can also be provided at a finer granularity, to assist voice actors during the recording process. Times on nested elements are relative to the begin time of their parent, so in the following example the second span starts 1.5 seconds after the div begins, i.e. at 11.5s on the media timeline.

    <div begin="10s" end="13s">
      <p ...>
        <span begin="0s">And thanks to that,</span><span begin="1.5s"> we're gonna get rich.</span>
      </p>
    </div>

Feature: Language identification

Another key requirement of DAPT is the ability to represent the same content in multiple languages. This is necessary for dubbing workflows, as preserving the original language content for as long as possible can improve the quality of the dubbed content. DAPT also allows tagging which text content is original and which is translated. This is especially useful when the source content uses mixed languages. An example is as follows:

    <div begin="10s" end="13s">
      <p xml:lang="fr" daptm:langSrc="original">
        <span>Et c'est grâce à ça qu'on va devenir riches.</span>
      </p>
      <p xml:lang="en" daptm:langSrc="translation">
        <span>And thanks to that, we're gonna get rich.</span>
      </p>
    </div>
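
The fragment below is an illustrative sketch, not taken from the specification, of how a source line mixing two languages can tag the embedded foreign-language text with its own xml:lang:

    <div begin="30s" end="33s">
      <p xml:lang="fr" daptm:langSrc="original">
        <!-- an English phrase embedded in a French source line -->
        <span>On se retrouve au <span xml:lang="en">rooftop bar</span> à minuit.</span>
      </p>
      <p xml:lang="en" daptm:langSrc="translation">
        <span>We meet at the rooftop bar at midnight.</span>
      </p>
    </div>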

Feature: Character identification

In some workflows, it is necessary to identify the pieces of text that need to be voiced by the same voice actor, and to provide metadata about the actor. TTML2 already defines the corresponding vocabulary; DAPT prescribes how to use it, as in the following example:

...
<metadata>
  <ttm:agent type="person" xml:id="actor_A">
    <ttm:name type="full">Matthias Schoenaerts</ttm:name>
  </ttm:agent>
  <ttm:agent type="character" xml:id="character_2">
    <ttm:name type="alias">BOOKER</ttm:name>
    <ttm:actor agent="actor_A"/>
  </ttm:agent>
</metadata>
...
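
Script content then refers to the character by its identifier, as in the dubbing example above. The dialogue text here is illustrative only:

...
    <div begin="10s" end="13s" ttm:agent="character_2">
      <p daptm:langSrc="original">
        <span>Don't move.</span>
      </p>
    </div>
...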

Feature: Annotations

Dubbing or audio description workflows require adding various annotations to text content. These can be proprietary and specific to the authoring tools, or common to many workflows, such as: indicating whether the text content corresponds to dialogue in the media asset or to burned-in text in the video; or identifying whether the actor is on or off screen. TTML2 defines a basic vocabulary; DAPT uses and extends this vocabulary, as in the example below:

    <div begin="10s" end="13s">
      <ttm:desc daptm:descType="scene">Scene 1</ttm:desc>
      <ttm:desc daptm:descType="plotSignificance">High</ttm:desc>
      <p daptm:langSrc="original">
        <span>A woman climbs into a small sailing boat.</span>
      </p>
      <p xml:lang="fr" daptm:langSrc="translation">
        <span>Une femme monte à bord d'un petit bateau à voile.</span>
      </p>
    </div>
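
For on and off screen identification specifically, the draft specification defines a dedicated attribute. The fragment below is a sketch assuming the daptm:onScreen attribute and its "ON"/"OFF" values as drafted at the time of writing:

    <!-- the character speaks from off screen in this fragment -->
    <div begin="10s" end="13s" ttm:agent="character_1" daptm:onScreen="OFF">
      <p xml:lang="en" daptm:langSrc="translation">
        <span>And thanks to that, we're gonna get rich.</span>
      </p>
    </div>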

Feature: Audio support

For audio description workflows, DAPT integrates audio aspects in three ways:

  • it defines instructions to generate audio content from text content. This can be done server-side or client-side.
  • it defines how to associate pre-recorded audio content with text content.
  • it defines mixing instructions, possibly animated, to render audio description content with the rest of the audio programme.

In the following example, the pre-recorded audio content in clip3.wav corresponds to the text "The sails billow in the wind." and is intended to be mixed with a gain animation at the start and end of the text. This supports the current industry practice of specifying gain curves applied to the main programme audio, so that the audio description can be heard more clearly.

<tt ...
  daptm:workflowType="audioDescription"
  daptm:scriptType="asRecorded"
  xml:lang="en">
  ...
    <div begin="25s" end="28s">
      <p daptm:langSrc="original">
        <animate begin="0.0s" end="0.3s" tta:gain="1;0.39" fill="freeze"/>
        <animate begin="2.7s" end="3s" tta:gain="0.39;1"/>
        <span begin="0.3s" end="2.7s">
          <audio src="clip3.wav"/>
          The sails billow in the wind.</span>
      </p>
    </div>
...

Detailed design discussion

Audio resources

The current draft specification points to various open issues related to audio integration. TTML2 offers many options for audio resources (embedded resources, linked resources), and DAPT could restrict these.
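
For example, using TTML2's embedded content vocabulary, the clip3.wav resource from the earlier example could be carried inside the document instead of being linked. This is an illustrative sketch of the TTML2 mechanism, not a statement of what DAPT will require:

...
  <head>
    <resources>
      <!-- base64-encoded audio data, elided here -->
      <data xml:id="clip3" type="audio/wav">...</data>
    </resources>
  </head>
  ...
      <!-- linked form:   <audio src="clip3.wav"/> -->
      <!-- embedded form: <audio src="#clip3"/> -->
...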

Considered alternatives

Netflix TTAL

Netflix initially proposed TTAL to address the same challenges as DAPT. TTAL is a JSON-based format, but it was considered that basing DAPT on TTML was better for the industry, as TTML is already widespread in the subtitle authoring industry.

WebVTT

The Timed Text Working Group maintains two timed text formats: TTML and WebVTT. WebVTT is used for subtitling, but it is mostly targeted at playback devices rather than at interchange. While it could theoretically have been extended to address DAPT's interchange needs, TTML2 already defines most of the required vocabulary, whereas WebVTT would have needed significantly more work.

Future extensions

One comment already received is that the format would be useful as a way to exchange post-production scripts of completed programmes, which are a common deliverable to commissioners of television and movie content. Although the initial requirements appear quite narrow, the design makes it feasible to augment the profile to accommodate such additional uses in future iterations; in other words, the design is intended to be open to additional uses, rather than closed to meet only the initial narrow set of requirements.

Stakeholder Feedback / Opposition

At this stage, only positive feedback has been received from the industry (e.g. the EBU Timed Text Working Group, other studios, vendors), as recorded in the GitHub issues. The Wide Review process has been initiated (see the outreach log), and further feedback will be collected.

References & acknowledgements

TBD