Implement the IMSC HRM for EBU-TT-D documents #66

Open
wants to merge 32 commits into
base: master
Commits
e5d5d11
wip
nigelmegitt Nov 23, 2021
541a48b
Remove Pipfile and don't need pytest-catchlog
nigelmegitt Nov 26, 2021
04a1044
Load EBU-TT-D documents from XML
nigelmegitt Jan 17, 2023
379c181
Test EBU-TT-D <--> XML
nigelmegitt Oct 20, 2023
883e5e3
Fix element name hack when there's no prefix
nigelmegitt Nov 17, 2023
913a5d6
Empty validator code with tests running
nigelmegitt Nov 17, 2023
1c5f6c6
Remove content with no associated region from imsc-hrm tests
nigelmegitt Nov 23, 2023
ca05e81
Calculate if an ISD is empty
nigelmegitt Nov 23, 2023
3d2578f
Compute styles for spans in EBU-TT-D
nigelmegitt Nov 27, 2023
fcb6a56
Compute drawing area S
nigelmegitt Nov 27, 2023
e6a1ba0
Add hash function to CellFontSizeType
nigelmegitt Nov 27, 2023
e5005a3
WIP implementing textDuration
nigelmegitt Nov 27, 2023
f964cf2
Script to generate Python file for uax24
nigelmegitt Dec 1, 2023
0baeb58
Add missing region
nigelmegitt Dec 3, 2023
4395753
Complete implementation
nigelmegitt Dec 3, 2023
7ed4c34
Handle text broken into different children of span
nigelmegitt Dec 3, 2023
4a13966
log not print, don't reprocess text due to `<br/>`
nigelmegitt Dec 4, 2023
c784d56
Fix up dur003 tests
nigelmegitt Dec 4, 2023
b1a948b
Handle character codes more than 4 digits long
nigelmegitt Dec 4, 2023
70053d5
extend the character ranges by 1 at the end
nigelmegitt Dec 4, 2023
e9636da
More elegant solution to the range problem
nigelmegitt Dec 4, 2023
94af5b8
Ignore content without a region when checking if an ISD is empty
nigelmegitt Dec 4, 2023
b2a50f7
ISD handling, regions with backgrounds
nigelmegitt Dec 6, 2023
d0fdf17
Integrate the IMSC HRM Validator with a command line switch
nigelmegitt Dec 6, 2023
9321951
Add showBackground region tests
nigelmegitt Dec 6, 2023
348e650
Add unit tests for IMSC HRM Validator
nigelmegitt Dec 7, 2023
d0093e6
Edge case fixes
nigelmegitt Dec 7, 2023
45454e7
Incorporate proposed fix for w3c/imsc-hrm-tests#12
nigelmegitt Dec 7, 2023
e030def
Delete p0bmslf8_gaps.json
nigelmegitt Dec 7, 2023
8e835ad
Delete statement about conversion to EBU-TT-D
nigelmegitt Dec 7, 2023
13c2a7f
Don't re-add Pipfile
nigelmegitt Dec 26, 2023
dc086c9
Add documentation for `imscHrmValidator`
nigelmegitt Dec 26, 2023
9 changes: 9 additions & 0 deletions docs/source/ebu_tt_live.scripts.rst
@@ -58,3 +58,12 @@ scripts Package
:undoc-members:
:show-inheritance:


:mod:`imsc_hrm_validator` Module
-------------------------------------

.. autoclass:: ebu_tt_live.scripts.imsc_hrm_validator.imscHrmValidator
:members:

.. automodule:: ebu_tt_live.scripts.imsc_hrm_validator
:show-inheritance:
5 changes: 5 additions & 0 deletions docs/source/scripts_and_their_functions.rst
@@ -226,10 +226,15 @@ This script loads a file from the file system and attempts to validate it
as the specified format, either EBU-TT Part 1, EBU-TT Part 3 or EBU-TT-D.
By default the expected format is EBU-TT-D.

Additionally, EBU-TT-D documents can be validated against the
`IMSC-HRM <https://www.w3.org/TR/imsc-hrm/>`_ by adding the ``--hrm`` flag.

Example command lines:

``validator -i path/to/ebu-tt-1-file-to-test.xml -f 1``

``validator -i path/to/ebu-tt-3-file-to-test.xml -f 3``

``validator -i path/to/ebu-tt-d-file-to-test.xml -f D``

``validator -i path/to/ebu-tt-d-file-to-test.xml -f D --hrm``
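As a rough illustration of how a boolean switch such as ``--hrm`` can sit alongside the existing ``-i`` and ``-f`` options, here is a minimal `argparse` sketch. The parser below is hypothetical: the option names mirror the example command lines above, but `build_parser`, `input_file` and `doc_format` are assumptions and the real validator script may wire its arguments differently.

```python
import argparse


def build_parser():
    # Hypothetical parser: option names follow the documented command lines,
    # but the dest names and structure are illustrative assumptions.
    parser = argparse.ArgumentParser(prog='validator')
    parser.add_argument('-i', dest='input_file', required=True,
                        help='file to validate')
    parser.add_argument('-f', dest='doc_format', choices=['1', '3', 'D'],
                        default='D', help='expected format (default EBU-TT-D)')
    parser.add_argument('--hrm', action='store_true',
                        help='also validate against the IMSC-HRM')
    return parser


args = build_parser().parse_args(
    ['-i', 'path/to/ebu-tt-d-file-to-test.xml', '-f', 'D', '--hrm'])

# The docs describe HRM validation for EBU-TT-D documents only,
# so this sketch gates the extra check on the format.
run_hrm = args.hrm and args.doc_format == 'D'
```

Using ``store_true`` keeps the flag optional, so existing command lines without ``--hrm`` behave as before.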
17 changes: 17 additions & 0 deletions docs/source/validation_framework.rst
@@ -146,3 +146,20 @@ by using the context manager class and instead of the context being passed
around as a parameter among functions the binding classes call the
:py:func:`ebu_tt_live.bindings.pyxb_utils.get_xml_parsing_context` function to
gain access to the parsing context object.


Validation outside document objects
===================================

When constraints beyond the document specification need to be validated,
validation code can be written outside the document and bindings objects themselves.

IMSC-HRM validation
-------------------

The :py:class:`ebu_tt_live.scripts.imsc_hrm_validator.imscHrmValidator` class is an example
of such out-of-document validation. It provides a single
:py:meth:`ebu_tt_live.scripts.imsc_hrm_validator.imscHrmValidator.validate` method that
processes the provided validated EBU-TT-D document according to the
`IMSC-HRM <https://www.w3.org/TR/imsc-hrm/>`_ algorithm
and returns true or false as appropriate.
6 changes: 5 additions & 1 deletion ebu_tt_live/adapters/document_data.py
@@ -73,7 +73,11 @@ class XMLtoEBUTTDAdapter(IDocumentDataAdapter):
_provides = EBUTTDDocument

def convert_data(self, data, **kwargs):
return EBUTTDDocument.create_from_xml(data), kwargs
doc = EBUTTDDocument.create_from_xml(data)
kwargs.update(dict(
raw_xml=data
))
return doc, kwargs


class EBUTTDtoXMLAdapter(IDocumentDataAdapter):
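
The `convert_data` change above keeps any incoming keyword arguments and adds the original XML under a `raw_xml` key. A minimal, self-contained sketch of that pattern follows; `FakeDocument` and `XMLToFakeDocumentAdapter` are hypothetical stand-ins, not classes from `ebu_tt_live`.

```python
class FakeDocument:
    """Hypothetical stand-in for a parsed document object."""

    def __init__(self, xml):
        self.xml = xml


class XMLToFakeDocumentAdapter:
    """Illustrative adapter mirroring the (doc, kwargs) return convention."""

    def convert_data(self, data, **kwargs):
        doc = FakeDocument(data)
        # Pass the caller's kwargs through and add the unparsed source,
        # so downstream consumers can still reach the original XML.
        kwargs.update(dict(raw_xml=data))
        return doc, kwargs


adapter = XMLToFakeDocumentAdapter()
doc, out_kwargs = adapter.convert_data('<tt/>', foo='bar')
```

This matches what the updated tests check: the result keys are the caller's keys plus `raw_xml`.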
20 changes: 20 additions & 0 deletions ebu_tt_live/adapters/test/test_data/testEbuttd.xml
@@ -0,0 +1,20 @@
<?xml version="1.0" encoding="UTF-8"?>
<tt:tt ttp:timeBase="media" xml:lang="en-GB" xmlns:ebuttm="urn:ebu:tt:metadata" xmlns:ebuttp="urn:ebu:tt:parameters" xmlns:tt="http://www.w3.org/ns/ttml" xmlns:ttp="http://www.w3.org/ns/ttml#parameter" xmlns:tts="http://www.w3.org/ns/ttml#styling" xmlns:xml="http://www.w3.org/XML/1998/namespace">
<tt:head>
<tt:metadata>
<ebuttm:documentMetadata/>
</tt:metadata>
<tt:styling>
<tt:style xml:id="s0"/>
</tt:styling>
<tt:layout>
<tt:region xml:id="r0" tts:origin="10% 10%" tts:extent="80% 80%"></tt:region>
</tt:layout>
</tt:head>
<tt:body>
<tt:div>
<tt:p xml:id="ID001" begin="01:23:45.670" end="01:23:45.890">It only took me six days.</tt:p>
</tt:div>
</tt:body>
</tt:tt>

56 changes: 46 additions & 10 deletions ebu_tt_live/adapters/test/test_document_data_adapters.py
@@ -136,11 +136,49 @@ def test_sequence_id_mismatch(self):


class TestXMLtoEBUTTDAdapter(TestCase):
_output_type = documents.EBUTTDDocument
_adapter_class = document_data.XMLtoEBUTTDAdapter
_expected_keys = []
_test_xml_file = 'testEbuttd.xml'
_test_data_dir_path = os.path.join(os.path.dirname(__file__), 'test_data')
_test_xml_path = os.path.join(_test_data_dir_path, _test_xml_file)
_output_type = documents.EBUTTDDocument
_expected_keys = [
'raw_xml'
]
instance = None

def setUp(self):
self.instance = self._adapter_class()
self.assertIsInstance(self.instance, IDocumentDataAdapter)

# TODO: Finish this once we have EBUTT-D parsing
def _assert_output_type(self, result):
self.assertIsInstance(result, self._output_type)

def _assert_kwargs_passtrough(self, result_kwargs, expected_keys):
self.assertEqual(set(result_kwargs.keys()), set(expected_keys))

def _get_xml(self):
with open(self._test_xml_path, 'r') as xml_file:
xml_data = xml_file.read()
return xml_data

def _get_input(self):
return self._get_xml()

def test_success(self):
expected_keys = []
expected_keys.extend(self._expected_keys)
result, res_kwargs = self.instance.convert_data(self._get_input())
self._assert_output_type(result)
self._assert_kwargs_passtrough(res_kwargs, expected_keys)

def test_kwargs_passthrough(self):
in_kwargs = {
'foo': 'bar'
}
expected_keys = ['foo']
expected_keys.extend(self._expected_keys)
result, res_kwargs = self.instance.convert_data(self._get_input(), **in_kwargs)
self._assert_kwargs_passtrough(res_kwargs, expected_keys)


class TestEBUTT3toXMLAdapter(TestXMLtoEBUTT3Adapter):
@@ -164,20 +202,18 @@ def test_sequence_id_match(self):
pass


class TestEBUTTDtoXMLAdapter(TestEBUTT3toXMLAdapter):
class TestEBUTTDtoXMLAdapter(TestXMLtoEBUTTDAdapter):
_output_type = six.text_type
_adapter_class = document_data.EBUTTDtoXMLAdapter
_expected_keys = []

def _get_input(self):
return documents.EBUTTDDocument.create_from_xml(self._get_xml())

def _get_input(self):
input_doc = documents.EBUTTDDocument(lang='en-GB')
return input_doc

def test_sequence_id_mismatch(self):
pass

def test_sequence_id_match(self):
pass


class TestEBUTT3toEBUTTDAdapter(TestXMLtoEBUTT3Adapter):
_adapter_class = document_data.EBUTT3toEBUTTDAdapter
12 changes: 12 additions & 0 deletions ebu_tt_live/bindings/__init__.py
@@ -1539,6 +1539,11 @@ def _validateBinding_vx(self):

super(d_tt_type, self)._validateBinding_vx()

def get_timing_type(self, timedelta_in):
if self.timeBase == 'media':
return ebuttdt.FullClockTimingType(timedelta_in)
else:
log.error('d_tt_type.get_timing_type() where self.timeBase == {}'.format(self.timeBase))

raw.d_tt_type._SetSupersedingClass(d_tt_type)

@@ -1942,6 +1947,10 @@ def _semantic_before_traversal(
parent_binding=None):
self._semantic_preprocess_timing(
dataset=dataset, element_content=element_content)
self._semantic_collect_applicable_styles(
dataset=dataset,
style_type=style_type,
parent_binding=parent_binding)

def _semantic_after_traversal(
self,
@@ -2041,6 +2050,9 @@ def _validateBinding_vx(self):
raw.layout: layout,
raw.body_type: body_type,
},
'ebuttd': {
raw.d_tt_type: d_tt_type,
},
}


10 changes: 8 additions & 2 deletions ebu_tt_live/bindings/_ebuttdt.py
@@ -68,8 +68,11 @@ def _ConvertArguments_vx(cls, args, kw):
context = get_xml_parsing_context()
if context is not None:
# This means we are in XML parsing context. There should be a timeBase and a timing_attribute_name in the
# context object.
time_base = context['timeBase']
# context object. But if there's no timeBase, in the context
# of EBU-TT-D, we will assume media. Some files in the wild
# trigger this behaviour, for reasons not yet identified, i.e.
# we somehow get here without having a timeBase context set.
time_base = context.get('timeBase', 'media')
# It is possible for a timing type to exist as the value of an element not an attribute,
# in which case no timing_attribute_name is in the context; in that case don't attempt
# to validate the data against a timebase. At the moment this only affects the
@@ -611,6 +614,9 @@ def _do_eq(self, other):

def __eq__(self, other):
return self._do_eq(other)

def __hash__(self):
return hash((self.horizontal, self.vertical))


ebuttdt_raw.cellFontSizeType._SetSupersedingClass(CellFontSizeType)
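
For context on why a `__hash__` is needed here: in Python 3, a class that defines `__eq__` without also defining `__hash__` becomes unhashable, so its instances cannot be used as dict keys or set members. The sketch below shows this with two hypothetical toy classes (`CellSize` and `CellSizeNoHash`); it is not the binding class itself, which presumably hit the same rule.

```python
class CellSizeNoHash:
    """Toy class with __eq__ only: Python 3 sets __hash__ to None."""

    def __init__(self, horizontal, vertical):
        self.horizontal = horizontal
        self.vertical = vertical

    def __eq__(self, other):
        return (self.horizontal, self.vertical) == (other.horizontal, other.vertical)


class CellSize(CellSizeNoHash):
    """Same equality, plus a hash over the fields that equality compares."""

    def __eq__(self, other):
        return (self.horizontal, self.vertical) == (other.horizontal, other.vertical)

    def __hash__(self):
        return hash((self.horizontal, self.vertical))


try:
    hash(CellSizeNoHash(1.0, 2.0))
    unhashable = False
except TypeError:
    unhashable = True

# Equal values hash equally, so duplicates collapse in a set.
unique_sizes = {CellSize(1.0, 2.0), CellSize(1.0, 2.0), CellSize(1.0, 1.0)}
```

Hashing exactly the fields that `__eq__` compares keeps the invariant that equal objects have equal hashes.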
2 changes: 1 addition & 1 deletion ebu_tt_live/bindings/pyxb_utils.py
@@ -23,7 +23,7 @@ def get_xml_parsing_context():
into account the timeBase attribute on the tt element. In that case when the timeBase element is encountered by the
parser it is added to the parsing context object to help PyXB make the right type in the timingType union.

:return: dict that is te parsing context for the currently running parser
:return: dict that is the parsing context for the currently running parser
:return: None if not in parsing mode
"""
log.debug('Accessing xml_parsing_context: {}'.format(__xml_parsing_context))
16 changes: 14 additions & 2 deletions ebu_tt_live/documents/ebuttd.py
@@ -18,6 +18,7 @@ class EBUTTDDocument(SubtitleDocument, TimelineUtilMixin):
_encoding = 'UTF-8'

def __init__(self, lang):
self.load_types_for_document()
self._ebuttd_content = bindings.ttd(
timeBase='media',
head=bindings.d_head_type(
@@ -46,13 +47,23 @@ def validate(self):
document=self
)

@classmethod
def load_types_for_document(cls):
bindings.load_types_for_document('ebuttd')

@classmethod
def create_from_xml(cls, xml):
# NOTE: This is a workaround to make the bindings accept separate root element identities
# for the same name. tt comes in but we rename it to ttd to make the xsd validate.
cls.load_types_for_document()
xml_dom = minidom.parseString(xml)
if xml_dom.documentElement.tagName == 'tt':
xml_dom.documentElement.tagName = 'ttd'
if xml_dom.documentElement.namespaceURI == 'http://www.w3.org/ns/ttml':
if xml_dom.documentElement.prefix is not None and \
xml_dom.documentElement.prefix != '' and \
xml_dom.documentElement.tagName == xml_dom.documentElement.prefix + ':tt':
xml_dom.documentElement.tagName = xml_dom.documentElement.prefix + ':ttd'
elif xml_dom.documentElement.tagName == 'tt':
xml_dom.documentElement.tagName = 'ttd'
instance = cls.create_from_raw_binding(
binding=bindings.CreateFromDOM(
xml_dom
@@ -62,6 +73,7 @@ def create_from_xml(cls, xml):

@classmethod
def create_from_raw_binding(cls, binding):
cls.load_types_for_document()
instance = cls.__new__(cls)
instance._ebuttd_content = binding
return instance
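
The root-element rename in `create_from_xml` can be tried in isolation with `xml.dom.minidom`. The sketch below is a simplified restatement of the logic in the hunk above, not the project code: `renamed_root_tag` and `TTML_NS` are hypothetical names, and the function only reports the resulting tag name rather than building bindings.

```python
from xml.dom import minidom

TTML_NS = 'http://www.w3.org/ns/ttml'


def renamed_root_tag(xml):
    """Return the root tag name after the tt -> ttd rename (sketch only)."""
    root = minidom.parseString(xml).documentElement
    if root.namespaceURI == TTML_NS:
        if root.prefix and root.tagName == root.prefix + ':tt':
            # Prefixed root such as <tt:tt>: keep the prefix, rename the local part.
            root.tagName = root.prefix + ':ttd'
        elif root.tagName == 'tt':
            # Root in the default namespace: <tt xmlns="http://www.w3.org/ns/ttml">.
            root.tagName = 'ttd'
    return root.tagName


prefixed = renamed_root_tag('<tt:tt xmlns:tt="http://www.w3.org/ns/ttml"/>')
unprefixed = renamed_root_tag('<tt xmlns="http://www.w3.org/ns/ttml"/>')
other_ns = renamed_root_tag('<tt xmlns="urn:example:other"/>')
```

Checking the namespace URI first means a `tt` root from an unrelated vocabulary is left alone, and any prefix bound to the TTML namespace is handled, not only `tt:`.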
116 changes: 116 additions & 0 deletions ebu_tt_live/gen_uax24.py
@@ -0,0 +1,116 @@
"""Process the UAX24 scripts at https://www.unicode.org/Public/UCD/latest/ucd/Scripts.txt
to generate a Python equivalent.

For example a command like:
python ebu_tt_live/gen_uax24.py -scriptFile uax24scripts.txt -outFile ebu_tt_live/uax24.py

will generate a Python file that specifies script lists that can be queried.
"""

import argparse
import sys
from csv import reader

LIST_SUFFIX='_list'
TRIPLE_QUOTE='"""'
SCRIPTS_TO_LIST={
'Common': [],
'Latin': [],
'Greek': [],
'Cyrillic': [],
'Hebrew': [],
'Han': [],
'Katakana': [],
'Hiragana': [],
'Bopomofo': [],
'Hangul': [],
}

# https://stackoverflow.com/questions/14158868/python-skip-comment-lines-marked-with-in-csv-dictreader
def decomment(csvfile):
for row in csvfile:
raw = row.split('#')[0].strip()
if raw: yield raw

def writeComments(outFile):
outFile.write(TRIPLE_QUOTE)
outFile.write(
'Utility for discovering which UAX24 script a given character code is in,\n'
'useful for example in computing the copy or render times in the IMSC-HRM.\n'
'\n'
'Auto-generated from UAX24 Scripts.txt using gen_uax24.py\n')
outFile.write(TRIPLE_QUOTE)
outFile.write('\n')
return

def writeFuncs(outFile):
outFile.write(
'def lr(a, b):\n'
' return list(range(a, b + 1))\n'
'\n')
return

def genLists(csv_reader):
for row in csv_reader:
scr = row[1].strip().split(' ', maxsplit=1)[0]
if scr in SCRIPTS_TO_LIST:
SCRIPTS_TO_LIST[scr].append(row[0].strip())
return

def charOrRange(char_code: str) -> str:
range_indicator = char_code.find('..')
if range_indicator != -1:
return '*lr(0x{}, 0x{})'.format(
char_code[0:range_indicator],
char_code[range_indicator+2:] # assume already stripped of trailing spaces
)
else:
return '0x{}'.format(char_code)

def writeLists(outFile):
for script, char_codes in SCRIPTS_TO_LIST.items():
outFile.write('\n{}{} = [\n'.format(script, LIST_SUFFIX))
for char_code in char_codes:
outFile.write(' {},\n'.format(
charOrRange(char_code)
))
outFile.write(']\n')
return

def generateUax24(args) -> int:
csv_reader = reader(decomment(args.scriptFile), delimiter=';', skipinitialspace=True)
outFile = args.outFile
writeComments(outFile)
writeFuncs(outFile)
genLists(csv_reader)
writeLists(outFile)

return 1

def main():
parser = argparse.ArgumentParser()

parser.add_argument(
'-scriptFile',
type=argparse.FileType('rt'),
required=True,
help='UAX24 Scripts file',
action='store')

parser.add_argument(
'-outFile',
type=argparse.FileType('wt'),
default=sys.stdout,
nargs='?',
help='Location to write the python file representing the scripts',
action='store')

parser.set_defaults(func=generateUax24)

args = parser.parse_args()
return args.func(args)


if __name__ == "__main__":
# execute only if run as a script
main()
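
To make the generator's parsing steps easier to follow, here is a small self-contained sketch that runs the same ideas (comment stripping, `;`-delimited parsing, inclusive range expansion) directly in memory instead of writing a Python file. The `SAMPLE` lines are illustrative data in the Scripts.txt layout, and `expand` and `build_lists` are hypothetical helper names, not part of `gen_uax24.py`.

```python
import csv

# Illustrative lines in the Scripts.txt layout (not read from the real file).
SAMPLE = [
    '# Sample data for illustration only',
    '0041..005A    ; Latin # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z',
    '00AA          ; Latin # Lo       FEMININE ORDINAL INDICATOR',
    '1F600..1F64F  ; Common # So  [80] GRINNING FACE..PERSON WITH FOLDED HANDS',
]


def decomment(lines):
    # Drop everything after '#' and skip lines that end up empty.
    for line in lines:
        data = line.split('#')[0].strip()
        if data:
            yield data


def lr(a, b):
    # Inclusive range: the last code point in 'XXXX..YYYY' belongs to the script.
    return list(range(a, b + 1))


def expand(char_code):
    # Code points may be longer than four hex digits, so split on '..'
    # rather than slicing fixed widths.
    if '..' in char_code:
        first, last = char_code.split('..')
        return lr(int(first, 16), int(last, 16))
    return [int(char_code, 16)]


def build_lists(lines):
    lists = {}
    rows = csv.reader(decomment(lines), delimiter=';', skipinitialspace=True)
    for row in rows:
        script = row[1].strip()
        lists.setdefault(script, []).extend(expand(row[0].strip()))
    return lists


lists = build_lists(SAMPLE)
```

The inclusive `lr` helper is the point the "extend the character ranges by 1 at the end" and "More elegant solution to the range problem" commits appear to address: a plain `range(a, b)` would drop the final code point of every range.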