Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add UnicodeFuncs with safe-to-implement traits #197

Merged
merged 26 commits into from
Sep 6, 2023

Conversation

sffc
Copy link
Contributor

@sffc sffc commented Mar 31, 2023

This PR makes it easier for clients to implement and use hb_unicode_funcs_t.

We want to implement these in ICU4X: unicode-org/icu4x#2514

CC @hsivonen @echeran

pub trait GeneralCategoryFunc {
/// Given a code point, return the general category as a
/// [`hb_unicode_general_category_t`].
fn general_category(&self, ch: u32) -> hb_unicode_general_category_t;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thought: Perhaps these return values should not use harfbuzz-sys types such that these traits can be implemented without a harfbuzz-sys dependency (see unicode-org/icu4x#3175)

@sffc
Copy link
Contributor Author

sffc commented Apr 7, 2023

👋 Can I get a signal on whether this is the right direction and the shape of what you would approve?

@Manishearth
Copy link
Member

cc @jdm I'm happy to review this but there's a bit of a conflict of interest, would rather get a 👍 from other Servo people on whether this is a direction we want to go in.

Perhaps we should mention this at the next TSC?

(ICU4X may also be happy to take on some of the burden of maintaining this crate, the way we are doing with unicode-bidi)

@mrobinson
Copy link
Member

This seems fine to me, though I wonder if it is ever the case that someone would implement only a subset of the functions? It would be simpler to only expose a single trait for implementing functions if it is always the case that you implement all of them.

@Manishearth
Copy link
Member

One thing it helps is that it allows people to separate different data sources into different crates and implementors, even allowing for mixing.

Unclear if that level of granularity is desired, but it's a potential source of flexibility. (@sffc, thoughts?)

@sffc
Copy link
Contributor Author

sffc commented Aug 1, 2023

This seems fine to me, though I wonder if it is ever the case that someone would implement only a subset of the functions? It would be simpler to only expose a single trait for implementing functions if it is always the case that you implement all of them.

There could be use cases. For example, maybe you want to overwrite your Script data but leave the Normalizer alone. Harfbuzz seems to make no requirement that all six functions be implemented.

This is ready for re-review.

@Manishearth
Copy link
Member

Manishearth commented Aug 2, 2023

r? @hsivonen

Ideally someone with familiarity with harfbuzz reviews this, I can get it merged after that.

(TSC decision here to allow icu4x to co-maintain harfbuzz)

@waywardmonkeys
Copy link
Collaborator

@sffc If you rebase forward, the underlying macOS build issue in CI is fixed.

@hsivonen
Copy link

I have a couple of concerns:

Concern 1:

Does this add function dereferences for each call compared to the case without this abstraction? I think it does when comparing against a case where a callback without this abstraction wouldn't use the user data pointer, but I think it does not when comparing against a case where the callback without this abstraction would use the user data pointer. Is that correct? (I think it's OK if this abstraction doesn't add pointer chasing compared to the case where the user data pointer is used anyhow.)

Concern 2:

It's a bit odd for the traits to be simultaneously high-level enough to be implementable without unsafe and low-level and/or HarfBuzz-idiosyncratic in the argument and return types.

Unicode Scalar Values are represented as u32 instead of char. As noted in unicode-org/icu4x#2840 , if we took the position that calling hb_buffer_add_codepoints is conceptually like calling Rust's core::str::from_utf8_unchecked, it would be proper for the trait method signatures to use char for Unicode Scalar Value and for the glue code provided here to perform an unchecked conversion from hb_codepoint_t to char when receiving hb_codepoint_t from HarfBuzz.

Canonical Combining Class has numeric values defined by Unicode and the Unicode stability policy guarantees their possible range to fit in the u8 range. (Specifically, the guaranteed value space is one item smaller than the value space of u8, since 255 is guaranteed not to be used as a canonical combining class.) As CombiningClassFunc involves passing Canonical Combining Class only in the direction of Rust to HarfBuzz, it would make sense to use u8 in the trait and have the glue code zero-extend it in order to avoid C-dependent types in the trait signatures when possible.

The Unicode stability policy guarantees that the number of items for General Category won't change and HarfBuzz has assigned all the numeric values to fit in u8. Unfortunately, the numeric values that HarfBuzz expects are not defined by Unicode but by HarfBuzz, so leaving the type as is would require the trait implementor to get the HarfBuzz numeric assignments from somewhere. icu_harfbuzz currently gets them by depending on harfbuzz-sys. To the extent the point of this PR is to allow the trait implementor not to directly depend on harfbuzz-sys, that's not ideal.

Since the HarfBuzz numeric assignments cannot change, I suggest creating an enum in this crate alongside the traits with such that the enum representation is u8 with explicitly-assigned HarfBuzz-compatible values and having the glue code zero-extend the enum into c_uint. By making the Rust-visible enum representation have the size of u8 would enable the permutation table in icu_harfbuzz to become type-safe without becoming larger.

@sffc
Copy link
Contributor Author

sffc commented Aug 24, 2023

I have a couple of concerns:

Concern 1:

Does this add function dereferences for each call compared to the case without this abstraction? I think it does when comparing against a case where a callback without this abstraction wouldn't use the user data pointer, but I think it does not when comparing against a case where the callback without this abstraction would use the user data pointer. Is that correct? (I think it's OK if this abstraction doesn't add pointer chasing compared to the case where the user data pointer is used anyhow.)

Everything is super generic so I hope the compiler will inline these abstractions. For example, in this code:

        let compose_ptr: *mut F = Box::into_raw(f);
        extern "C" fn impl_compose<F: ComposeFunc>(
            // ...
            user_data: *mut c_void,
        ) -> hb_bool_t {
            let result = unsafe { &*(user_data as *mut F) }
                .compose(hb_codepoint_t_to_char(a), hb_codepoint_t_to_char(b));
            // ...
        }

the compiler should be able to figure out the address of .compose relative to the address of user_data, I think.

Concern 2:

It's a bit odd for the traits to be simultaneously high-level enough to be implementable without unsafe and low-level and/or HarfBuzz-idiosyncratic in the argument and return types.

Unicode Scalar Values are represented as u32 instead of char. As noted in unicode-org/icu4x#2840 , if we took the position that calling hb_buffer_add_codepoints is conceptually like calling Rust's core::str::from_utf8_unchecked, it would be proper for the trait method signatures to use char for Unicode Scalar Value and for the glue code provided here to perform an unchecked conversion from hb_codepoint_t to char when receiving hb_codepoint_t from HarfBuzz.

Changed from u32 to char: 5b8a4aa

Canonical Combining Class has numeric values defined by Unicode and the Unicode stability policy guarantees their possible range to fit in the u8 range. (Specifically, the guaranteed value space is one item smaller than the value space of u8, since 255 is guaranteed not to be used as a canonical combining class.) As CombiningClassFunc involves passing Canonical Combining Class only in the direction of Rust to HarfBuzz, it would make sense to use u8 in the trait and have the glue code zero-extend it in order to avoid C-dependent types in the trait signatures when possible.

Changed from c_uint to u8: 39af919

The Unicode stability policy guarantees that the number of items for General Category won't change and HarfBuzz has assigned all the numeric values to fit in u8. Unfortunately, the numeric values that HarfBuzz expects are not defined by Unicode but by HarfBuzz, so leaving the type as is would require the trait implementor to get the HarfBuzz numeric assignments from somewhere. icu_harfbuzz currently gets them by depending on harfbuzz-sys. To the extent the point of this PR is to allow the trait implementor not to directly depend on harfbuzz-sys, that's not ideal.

Since the HarfBuzz numeric assignments cannot change, I suggest creating an enum in this crate alongside the traits with such that the enum representation is u8 with explicitly-assigned HarfBuzz-compatible values and having the glue code zero-extend the enum into c_uint. By making the Rust-visible enum representation have the size of u8 would enable the permutation table in icu_harfbuzz to become type-safe without becoming larger.

Changed from c_uint to a #[repr(u8)] enum with Harfbuzz's values: 6868c97

@sffc
Copy link
Contributor Author

sffc commented Aug 24, 2023

@waywardmonkeys can you kick CI?

Copy link

@hsivonen hsivonen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Looking very good, though I have one nit about using Rust-idiomatic naming for GeneralCategory items.

/// [`hb_unicode_general_category_t`]: crate::sys::hb_unicode_general_category_t
#[repr(u8)]
#[derive(Debug, Copy, Clone, PartialEq, Eq)]
#[allow(non_camel_case_types)] // the names are defined by Unicode

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest dropping the underscores like ICU4X does. That way, the names would be idiomatic for a Rust enum. Even if the names are defined by Unicode, it's pretty obvious to the reader what the applied name transformation would be if the names were idiomatic to Rust.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, note the casing of the letter "u" in PrivateUse/Private_Use.

@waywardmonkeys
Copy link
Collaborator

@waywardmonkeys can you kick CI?

I can't ... I didn't get commit privs back after losing them.

@waywardmonkeys
Copy link
Collaborator

I don't understand why this harfbuzz would ever be built without harfbuzz-sys (at least in such a way that all of that code is removed). That makes it seem like some of this code should be in a third crate and not part of harfbuzz or harfbuzz-sys.

Similarly, I'm not sure what the use case is for no-std, but if that's what we want / need for harfbuzz / harfbuzz-sys, then I think we should do that as a separate commit with a clear history and rationale.

@waywardmonkeys
Copy link
Collaborator

The need for cmake is due to freetype which is an optional dependency. You should be able to build this with only pkg-config and cc and we could restructure things slightly to do away with cc (it isn't used when pkg-config is) as a required dependency and make it only required when bundled is enabled.

But I still think that perhaps having a harfbuzz-unicode-funcs crate that provides the bare minimum there and then doing a re-export of it from harfbuzz would be cleaner and less maintenance going forward (as harfbuzz needs to grow a lot of other features to be useful and it seems wrong to predicate them all on harfbuzz-sys being enabled).

@sffc
Copy link
Contributor Author

sffc commented Aug 25, 2023

Okay, how about naming it harfbuzz-traits? I can make these changes once #232 lands.

@waywardmonkeys
Copy link
Collaborator

That sounds good. And now that I understand better, that is a good place perhaps for the draw traits as well when we support that.

@sffc
Copy link
Contributor Author

sffc commented Aug 28, 2023

I merged this PR to sit on top of #232 so it is ready once #232 lands.

Copy link

@hsivonen hsivonen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-raising the previous nits. I trust the substance of the rest of the code hasn't changed--at least that's how it appeared from a cursory look.

Control = 0,
Format = 1,
Unassigned = 2,
Private_use = 3,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use should be Use with upper-case U.

I still think the underscores in these values should be removed the way ICU4X removes the upstream Unicode underscores to get idiomatic Rust enum values.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with both of those requests, fwiw.

Copy link
Collaborator

@waywardmonkeys waywardmonkeys left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that harfbuzz should re-export the traits crate, like it does for harfbuzz-sys, so that people who do depend on harfbuzz can get access to the same version that their harfbuzz dependency is using.

I might be wrong, so say so if I am!

@waywardmonkeys
Copy link
Collaborator

I think this has resolved all of my comments and concerns! If no one has other issues, I'll merge in about 12 hours.

@waywardmonkeys waywardmonkeys added this pull request to the merge queue Sep 6, 2023
Merged via the queue into servo:master with commit 2d74475 Sep 6, 2023
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants