[CharArrayParser] Skip malformed EOL #25323

CastagnaIT · 2024-06-12T08:22:18Z

Description

Some subtitles may have a malformed line breaks as follow: CR+CR+LF
that is a mixed line breaks between macintosh + windows format
on other media players like VLC or MPV somewhat are able skip malformed chars to allow to try read the file in a better way

subtitles like this use case was working on Kodi <= 19 because to read text lines was using std::stringstream getline
that by default identify the new line by checking for \n and somewhat other EOL chars was ignored
but from Kodi 20 has been introduced CharArrayParser to get the text lines
this to allow read Macintosh format subtitles (usually webvtt) but dont handle this situation

Motivation and context

fix #25299

How has this been tested?

sample video + srt
Video test+srt.zip

What is the effect on users?

Subs working despite malformed EOL chars

Screenshots (if appropriate):

Types of change

Bug fix (non-breaking change which fixes an issue)
Clean up (non-breaking change which removes non-working, unmaintained functionality)
Improvement (non-breaking change which improves existing functionality)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that will cause existing functionality to change)
Cosmetic change (non-breaking change that doesn't touch code)
Student submission (PR was done for educational purposes and will be treated as such)
None of the above (please explain below)

Checklist:

My code follows the Code Guidelines of this project
My change requires a change to the documentation, either Doxygen or wiki
I have updated the documentation accordingly
I have read the Contributing document
I have added tests to cover my change
All new and existing tests passed

Some subtitles may have a malformed Macintosh line break as follow: CR+CR+LF a mixed line breaks between macintosh + windows format on other media players this is somewhat ignored to allow to try read the file in a better way

arnova · 2024-06-12T09:55:56Z

Why not simply consider the first \r or \n as EOL and skip any proceeding \r and \n ? That would probably make it even more robust and code wise simpler... ?

CastagnaIT · 2024-06-12T12:28:40Z

Why not simply consider the first \r or \n as EOL and skip any proceeding \r and \n ? That would probably make it even more robust and code wise simpler... ?

idk if i understood correctly (if so put an example)
you want to suggest to replace

  // Skip EOL chars
  if (m_data[m_position] == '\r')
  {
    m_position++;

    if (m_data[m_position] == '\n')
      m_position++;
    // Malformed EOL as \r\r\n
    else if (m_position + 1 <= m_limit && m_data[m_position] == '\r' &&
             m_data[m_position + 1] == '\n')
      m_position += 2;
  }
  else if (m_data[m_position] == '\n')
  {
    m_position++;
  }

with something similar to this?

  while (m_data[m_position] == '\r' || m_data[m_position] == '\r')
  {
    ...
  }

if so, this is yes simpler, but delete also all empty lines to be kept
examples:

Line1\r\n\r\nLine3\r\n (or: Line1\r\r\n\r\r\nLine3\r\r\n)

will result as:

Line1
Line3

instead of:

Line1
(empty line 2)
Line3

arnova · 2024-06-13T12:28:42Z

Why not simply consider the first \r or \n as EOL and skip any proceeding \r and \n ? That would probably make it even more robust and code wise simpler... ?

idk if i understood correctly (if so put an example) you want to suggest to replace
  // Skip EOL chars
  if (m_data[m_position] == '\r')
  {
    m_position++;

    if (m_data[m_position] == '\n')
      m_position++;
    // Malformed EOL as \r\r\n
    else if (m_position + 1 <= m_limit && m_data[m_position] == '\r' &&
             m_data[m_position + 1] == '\n')
      m_position += 2;
  }
  else if (m_data[m_position] == '\n')
  {
    m_position++;
  }
with something similar to this?
  while (m_data[m_position] == '\r' || m_data[m_position] == '\r')
  {
    ...
  }
if so, this is yes simpler, but delete also all empty lines to be kept examples:
Line1\r\n\r\nLine3\r\n (or: Line1\r\r\n\r\r\nLine3\r\r\n)
will result as:
Line1
Line3
instead of:
Line1
(empty line 2)
Line3

Yes, you're correct. I didn't realize this is a global function that's not exclusively used for subtitle parsing. Note that I don't know about all the different subtitle formats (except for subrip/srt) out there but I suspect for most of them (any?) this shouldn't be a problem. But again since you're modifying a global function here, you current/initial change may be best after all.

[CharArrayParser] Skip malformed EOL

a401c31

Some subtitles may have a malformed Macintosh line break as follow: CR+CR+LF a mixed line breaks between macintosh + windows format on other media players this is somewhat ignored to allow to try read the file in a better way

CastagnaIT added Type: Fix non-breaking change which fixes an issue Backport: Needed Component: Subtitles v22 "P" labels Jun 12, 2024

CastagnaIT added this to the "P" 22.0 Alpha 1 milestone Jun 12, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[CharArrayParser] Skip malformed EOL #25323

[CharArrayParser] Skip malformed EOL #25323

CastagnaIT commented Jun 12, 2024

arnova commented Jun 12, 2024

CastagnaIT commented Jun 12, 2024

arnova commented Jun 13, 2024

[CharArrayParser] Skip malformed EOL #25323

Are you sure you want to change the base?

[CharArrayParser] Skip malformed EOL #25323

Conversation

CastagnaIT commented Jun 12, 2024

Description

Motivation and context

How has this been tested?

What is the effect on users?

Screenshots (if appropriate):

Types of change

Checklist:

arnova commented Jun 12, 2024

CastagnaIT commented Jun 12, 2024

arnova commented Jun 13, 2024