
Non-printable unicode characters are inserted at the beginning of the transcript #127

@kolaente


When transcribing audio, non-printable characters are repeatedly inserted at the beginning of the transcribed text. Like this:

Let's try this.

Edit: It seems like GitHub removes these characters when creating an issue, so here's a gist with the characters preserved.

I have no idea what these are, so I asked Claude:

Those characters at the beginning are combining diacritical marks - specifically, they appear to be combining macron below marks (Unicode U+0331).

These are special characters that are meant to be combined with base letters to add marks below them, but when they appear without a base character (or stacked multiple times), they can show up as a series of underlines or marks. In your text, there are about 15 of them before "Let's try this."

This kind of thing can happen when:

  • Text is copied from certain sources that use Unicode combining characters
  • There's a formatting or encoding issue
  • Someone intentionally adds them for visual effect (sometimes called "Zalgo text" when taken to extremes)

In most contexts, these would just be removed or cleaned up since they're not adding any meaningful content to your text.
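To double-check an identification like this, the code points in the raw transcript can be listed directly. The following is only a minimal sketch using Python's standard `unicodedata` module; the helper name `describe_chars` is made up for illustration and is not part of the project.

```python
import unicodedata


def describe_chars(text):
    """Return (code point, general category, Unicode name) for each character.

    Hypothetical helper for inspecting a transcript; not part of the project.
    """
    return [
        (
            f"U+{ord(ch):04X}",
            unicodedata.category(ch),
            # Some code points have no name, so fall back to a placeholder.
            unicodedata.name(ch, "<unnamed>"),
        )
        for ch in text
    ]


# Example: one combining mark followed by the first letter of the transcript.
for row in describe_chars("\u0331L"):
    print(row)
```

Characters whose category starts with `M` (`Mn`, `Mc`, `Me`) are combining marks, which matches the description above.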

To me, this seems like a bug. Maybe an easy solution would be to strip characters like these from the start of the transcript when they are present? I assume they are coming from the transcription itself.
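As a rough illustration of the trimming idea, here is a minimal Python sketch. It assumes the stray characters really are combining marks (Unicode general categories `Mn`, `Mc`, `Me`) and that they only occur at the start of the text; the function name `strip_leading_combining_marks` is hypothetical and the project's actual code may look quite different.

```python
import unicodedata


def strip_leading_combining_marks(text: str) -> str:
    """Drop combining marks that appear before the first base character.

    Assumption: the unwanted characters are in a Unicode "Mark" category
    (Mn, Mc or Me). Marks later in the text are left alone, since they
    belong to a preceding base character (e.g. a decomposed "é").
    """
    i = 0
    while i < len(text) and unicodedata.category(text[i]).startswith("M"):
        i += 1
    return text[i:]


# Example: 15 combining macron below marks (U+0331) before the transcript.
raw = "\u0331" * 15 + "Let's try this."
print(strip_leading_combining_marks(raw))
```

Only the leading run is removed, so legitimate accents inside the transcript stay intact. If the transcription turns out to emit other invisible characters as well (for example format characters in category `Cf`), the category check would need to be widened.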
