Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduce Unicode CLDR artifact size #373

Closed
jaakkor2 opened this issue Feb 23, 2022 · 5 comments · Fixed by #450
Closed

Reduce Unicode CLDR artifact size #373

jaakkor2 opened this issue Feb 23, 2022 · 5 comments · Fixed by #450

Comments

@jaakkor2
Copy link

On my local drive size of
.julia\artifacts\8d7201f06ff8061c1b5bff0f661b5297ec1d3158\cldr-release-40
is 318 MB.

If I searched code correctly, only one 49 kB file is actually used
https://github.com/unicode-org/cldr/blob/release-40/common/supplemental/windowsZones.xml
here
https://github.com/JuliaTime/TimeZones.jl/blob/master/src/winzone/WindowsTimeZoneIDs.jl

Could the artifact point directly to the windowsZones.xml file?

@jaakkor2
Copy link
Author

jaakkor2 commented May 4, 2023

..\.julia\artifacts\768b8b82edd878449348f31bf989867bb13e42a9\cldr-release-43
folder size is 335 MB, and this package is using one file of it (IIUC)
..\.julia\artifacts\768b8b82edd878449348f31bf989867bb13e42a9\cldr-release-43\common\supplemental\windowsZones.xml
has the size of 49 kB.

@omus
Copy link
Member

omus commented May 5, 2023

You are correct that TimeZones.jl only requires the windowsZones.xml file from the CLDR artifacts. This is an unfortunate side effect of this package switching to use Julia artifacts which requires an archive to be referenced and not just a flat file.

In order to address this problem with the artifacts system we'd need to generate our own permanent artifact archive for each CLDR release. Probably the right approach would be to have these artifacts as GitHub release assets which follows what JLLs do. It also probably makes sense to have this not be a part of the TimeZones.jl repo but probably a CLDRWindowsZones_jll repo so that the tags/releases make more sense. Additionally, it may make more sense to make a CLDR_jll.jl and have just the windowsZones.xml included in a special release asset so we can retrieve it separately.

The end result is having a much smaller artifact footprint as well as being able to update CLDR versions without having to make a new TimeZones.jl release. Unfortunately, this is a bunch of work but I do want to tackle this before or around JuliaCon 2023.

@omus omus changed the title Minimized needed cldr artifact size Reduce Unicode CLDR artifact size Aug 22, 2023
@omus
Copy link
Member

omus commented Aug 22, 2023

The PR #439 changed the Unicode CLDR artifacts to no longer be non-lazy and platform specific. Although this doesn't impact the download size on Windows machines it does limit the downloads to only be on Windows.

It still makes sense to make separate data package with this information akin to what was done in #441. I do want to have #441 be out in the wild for a little while to ensure this artifact model works well.

@visr
Copy link
Contributor

visr commented Aug 29, 2023

Besides the artifact size, the many nested directories also can cause issues with the maximum file path on Windows.

In this case using PackageCompiler to create a library with the cldr artifact led to an error like:

ERROR: LoadError: IOError: open("libribasim\\share\\julia\\artifacts\\40b35727ea0aff9a9f28b7454004b68849caf67b\\cldr-release-43-1\\tools\\cldr-code\\src\\main\\resources\\org\\unicode\\cldr\\util\\data\\transforms\\internal_English-IPA-backwards.txt", 769, 33060): no such file or directory (ENOENT)

Enabling long paths in the Windows Registry fixed the issue.

https://learn.microsoft.com/en-us/windows/win32/fileio/maximum-file-path-limitation

@visr
Copy link
Contributor

visr commented Oct 24, 2023

It still makes sense to make separate data package with this information akin to what was done in #441.

@omus if I understand correctly windowsZones.xml is only used to fill const WINDOWS_TRANSLATION = Dict{String, String}(). Would you consider a PR that switches from the cldr artifact to an auto-generated Dict literal? It's only about 140 entries.

I had a look at the tzlocal Python package and saw they do the same. This is their dict literal, which they generate from this script. The dict lookup is done here after reading the registry (rather than tzinfo).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants