
Unicode-normalised filenames not matched by non-normalised links #611

Closed
cormacrelf opened this issue May 5, 2021 · 1 comment

cormacrelf commented May 5, 2021

Ref #419

In a nutshell

There's a second layer to the Unicode file name problem. Many filesystems and programs perform Unicode normalisation on file names, and the form in use can change on the fly as people open a zettelkasten on a different computer. This matters to Neuron because when people write [[links]] in markdown files (whose contents are not normalised), those links have to be reconciled with file names. Both file names and zettel links should be treated as having unknown Unicode normalisation, so you can get broken links in a number of ways.

What should Neuron do?

Add a call to a Unicode normalisation library like unicode-transforms to normalise to NFC when creating a Zettel ID, whether from a file name, or from a [[link]]. This enables linkage to be independent of whatever filesystem you're running on. And probably also document that Zettel IDs are NFC normalised.
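
A minimal sketch of that idea, written in Python rather than Neuron's Haskell and using the standard library's `unicodedata.normalize` in place of unicode-transforms. The helper name `zettel_id` and the `.md` suffix handling are assumptions for illustration, not Neuron's actual code.

```python
import unicodedata


def zettel_id(raw: str) -> str:
    """Return an NFC-normalised ID from a file name or a [[link]] target.

    Hypothetical helper: drops a trailing ".md" if present, then normalises.
    """
    stem = raw[:-3] if raw.endswith(".md") else raw
    return unicodedata.normalize("NFC", stem)


# The same zettel named two ways. Escapes are used so the form is unambiguous:
# "\uc774\ubd84" is the NFC spelling of the Korean name used in the repro.
nfc_name = "\uc774\ubd84"
from_file = zettel_id(unicodedata.normalize("NFD", nfc_name) + ".md")  # NFD on disk
from_link = zettel_id(nfc_name)                                        # NFC in a link
```

Both calls yield the same ID, so the lookup no longer depends on which form the filesystem or the editor happened to produce.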

Context

Here's the lay of the land as I understand it, for file names:

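  • On Linux, most filesystems (ext4, for instance) do not normalise at all and treat file names as opaque bytes; in practice names are usually NFC (canonical decompose + canonical recompose), since that is what keyboards and most tools produce.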
  • On Windows I think they just used UTF-16 and called it a day.
  • On macOS, HFS+ used to use NFD (canonical decompose only), but now it's more complicated: many apps (e.g. TextEdit) will normalise in some way, but APFS will not normalise anything itself and can happily hold NFD or NFC text, or really most byte strings.
    • nvim filename.txt on the command line does not normalise.
    • The system 'Save File' dialogue DOES normalise to NFD.
    • If you check files into Git on APFS, it normalises all file names to NFC for better compatibility. It probably only changes them on disk when you check out again, though!

However, for file contents, no text editors perform any Unicode normalisation, and nor should they; they store whatever you type in there. System input methods do not normalise either. This is all correct and good; normalisation is used only for "set semantics" where having multiple encodings of é is really confusing, i.e. you get two files in a folder both apparently called é.txt but using different Unicode representations of é.
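
To make the é.txt case concrete, here is a small Python sketch (illustrative only; the variable names are made up). The two names render identically but compare as different strings until they are normalised.

```python
import unicodedata

composed = "\u00e9.txt"     # é as a single code point (NFC)
decomposed = "e\u0301.txt"  # e followed by a combining acute accent (NFD)

# As raw strings these are two distinct set members...
raw_names = {composed, decomposed}

# ...but after NFC normalisation they collapse into one.
nfc_names = {unicodedata.normalize("NFC", name) for name in raw_names}
```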

Basically, Neuron Zettel IDs have these "set semantics". So they need to be normalised. The best form to use would be NFC, because:

  • Git did it and they're probably right
  • HFS+, the only enforced NFD normaliser, is basically dead, and NFC doesn't break it anyway: if you're serving /blah.html, HFS+ normalises the path when you ask for it.
  • It doesn't have so many display issues in dumb environments like terminals that don't recompose (image)

Repro/example

You can repro this example by saving Korean file names on a filesystem which can store file names in NFD. I'm using a Mac with APFS, and saving the file with TextEdit.

$ ls
index.md
이분.md
neuron.dhall

That filename without extension is the byte sequence e1 84 8b e1 85 b5 e1 84 87 e1 85 ae e1 86 ab, made via NFD normalisation. To be sure, run ls > file.txt and inspect its bytes in a hex editor. Here's index.md:

Link to the other zettel using text typed on the keyboard [[이분]]
This link target happens to be in NFC form, byte sequence `ec 9d b4 eb b6 84`
But user input may not be in NFC form. It's purely coincidental.
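
The two byte sequences quoted above can be reproduced without a hex editor. This is a Python sketch using only the standard library; `hex_bytes` is a made-up helper, and the link target is written with escapes so its form is unambiguous.

```python
import unicodedata


def hex_bytes(text: str) -> str:
    """Render the UTF-8 encoding of text as space-separated hex pairs."""
    return " ".join(f"{b:02x}" for b in text.encode("utf-8"))


link_target = "\uc774\ubd84"                           # precomposed syllables (NFC)
file_stem = unicodedata.normalize("NFD", link_target)  # conjoining jamo (NFD)

print(hex_bytes(link_target))  # ec 9d b4 eb b6 84
print(hex_bytes(file_stem))    # e1 84 8b e1 85 b5 e1 84 87 e1 85 ae e1 86 ab
```

The strings look the same on screen, yet `link_target == file_stem` is false, which is why the link fails to resolve.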

Basically:

image

Viewing the generated impulse.html in a hex editor shows the problem:

image

image

If you have a Mac, you can download the Apfelstrudel utility to get a sense of how differently a filename is stored in NFD versus how it is stored when you type it again in a [[link]].

On APFS, because normalisation is not enforced anywhere, you can also attempt to "fix" the zettels by running:

brew install convmv
convmv -r -f utf8 -t utf8 --nfc --notest .
# and break it again
convmv -r -f utf8 -t utf8 --nfd --notest .

Noting, again, that this only works because the index.md happened to have an NFC-normalised link in it, but that will not always be the case.
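
To see which names a conversion to NFC would touch before running anything destructive, a dry run along these lines may help. It is a Python sketch, not how convmv is implemented: `plan_nfc_renames` is a hypothetical helper that only compares strings and never touches the disk.

```python
import unicodedata


def plan_nfc_renames(names):
    """Return (old, new) pairs for the names that are not already in NFC."""
    plan = []
    for name in names:
        nfc = unicodedata.normalize("NFC", name)
        if nfc != name:
            plan.append((name, nfc))
    return plan


# The directory listing from the repro, with the Korean name stored as NFD.
nfd_name = unicodedata.normalize("NFD", "\uc774\ubd84") + ".md"
plan = plan_nfc_renames(["index.md", nfd_name, "neuron.dhall"])
```

Only the NFD-encoded name appears in `plan`; the ASCII names are already in NFC and are left alone.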

srid commented May 6, 2021

Awesome, thanks for the explanation!

@srid srid closed this as completed in 39597ae May 6, 2021
srid added a commit to srid/ema that referenced this issue May 6, 2021