Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CharArrayParser] Skip malformed EOL #25323

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

CastagnaIT
Copy link
Collaborator

Description

Some subtitles may have a malformed line breaks as follow: CR+CR+LF
that is a mixed line breaks between macintosh + windows format
on other media players like VLC or MPV somewhat are able skip malformed chars to allow to try read the file in a better way

subtitles like this use case was working on Kodi <= 19 because to read text lines was using std::stringstream getline
that by default identify the new line by checking for \n and somewhat other EOL chars was ignored
but from Kodi 20 has been introduced CharArrayParser to get the text lines
this to allow read Macintosh format subtitles (usually webvtt) but dont handle this situation

Motivation and context

fix #25299

How has this been tested?

sample video + srt
Video test+srt.zip

What is the effect on users?

Subs working despite malformed EOL chars

Screenshots (if appropriate):

Types of change

  • Bug fix (non-breaking change which fixes an issue)
  • Clean up (non-breaking change which removes non-working, unmaintained functionality)
  • Improvement (non-breaking change which improves existing functionality)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that will cause existing functionality to change)
  • Cosmetic change (non-breaking change that doesn't touch code)
  • Student submission (PR was done for educational purposes and will be treated as such)
  • None of the above (please explain below)

Checklist:

  • My code follows the Code Guidelines of this project
  • My change requires a change to the documentation, either Doxygen or wiki
  • I have updated the documentation accordingly
  • I have read the Contributing document
  • I have added tests to cover my change
  • All new and existing tests passed

Some subtitles may have a malformed Macintosh line break as follow:
CR+CR+LF
a mixed line breaks between macintosh + windows format
on other media players this is somewhat ignored to allow to try read the
file in a better way
@CastagnaIT CastagnaIT added this to the "P" 22.0 Alpha 1 milestone Jun 12, 2024
@arnova
Copy link
Member

arnova commented Jun 12, 2024

Why not simply consider the first \r or \n as EOL and skip any proceeding \r and \n ? That would probably make it even more robust and code wise simpler... ?

@CastagnaIT
Copy link
Collaborator Author

Why not simply consider the first \r or \n as EOL and skip any proceeding \r and \n ? That would probably make it even more robust and code wise simpler... ?

idk if i understood correctly (if so put an example)
you want to suggest to replace

  // Skip EOL chars
  if (m_data[m_position] == '\r')
  {
    m_position++;

    if (m_data[m_position] == '\n')
      m_position++;
    // Malformed EOL as \r\r\n
    else if (m_position + 1 <= m_limit && m_data[m_position] == '\r' &&
             m_data[m_position + 1] == '\n')
      m_position += 2;
  }
  else if (m_data[m_position] == '\n')
  {
    m_position++;
  }

with something similar to this?

  while (m_data[m_position] == '\r' || m_data[m_position] == '\r')
  {
    ...
  }

if so, this is yes simpler, but delete also all empty lines to be kept
examples:

Line1\r\n\r\nLine3\r\n (or: Line1\r\r\n\r\r\nLine3\r\r\n)

will result as:

Line1
Line3

instead of:

Line1
(empty line 2)
Line3

@arnova
Copy link
Member

arnova commented Jun 13, 2024

Why not simply consider the first \r or \n as EOL and skip any proceeding \r and \n ? That would probably make it even more robust and code wise simpler... ?

idk if i understood correctly (if so put an example) you want to suggest to replace

  // Skip EOL chars
  if (m_data[m_position] == '\r')
  {
    m_position++;

    if (m_data[m_position] == '\n')
      m_position++;
    // Malformed EOL as \r\r\n
    else if (m_position + 1 <= m_limit && m_data[m_position] == '\r' &&
             m_data[m_position + 1] == '\n')
      m_position += 2;
  }
  else if (m_data[m_position] == '\n')
  {
    m_position++;
  }

with something similar to this?

  while (m_data[m_position] == '\r' || m_data[m_position] == '\r')
  {
    ...
  }

if so, this is yes simpler, but delete also all empty lines to be kept examples:

Line1\r\n\r\nLine3\r\n (or: Line1\r\r\n\r\r\nLine3\r\r\n)

will result as:

Line1
Line3

instead of:

Line1
(empty line 2)
Line3

Yes, you're correct. I didn't realize this is a global function that's not exclusively used for subtitle parsing. Note that I don't know about all the different subtitle formats (except for subrip/srt) out there but I suspect for most of them (any?) this shouldn't be a problem. But again since you're modifying a global function here, you current/initial change may be best after all.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Subtitles not shown when playing avi file with external srt
2 participants