Segmenter constructor options/preferences normalization #5856

Open
sffc opened this issue Nov 22, 2024 · 7 comments
Labels
C-segmentation (Component: Segmentation), discuss-priority (Discuss at the next ICU4X meeting), S-small (Size: One afternoon; small bug fix or enhancement)

Comments

@sffc
Member

sffc commented Nov 22, 2024

In #3284 we decided to add a content_locale option to segmenter.

In #5839, @zbraniecki asked some questions about the signature of the constructors.

Currently we have constructors of the following shapes:

  1. new: singleton static data (infallible), no content locale tailorings
  2. try_new_unstable: dynamic data, no content locale tailorings
  3. try_new_with_options: static data with content locale tailorings
  4. try_new_with_options_unstable: dynamic data with content locale tailorings
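
For readers unfamiliar with the pattern, the four shapes above can be sketched roughly as follows. This is only an illustration: `SketchSegmenter`, `SketchOptions`, `RuleProvider`, `EmptyProvider`, and `DataError` are hypothetical stand-ins rather than the real ICU4X types, and treating an unknown locale as an error is an assumption made purely to show why shapes 3 and 4 are fallible.

```rust
// Hypothetical stand-ins throughout; none of these are the real ICU4X types.

/// Stand-in for a runtime data provider (the "dynamic data" case).
pub trait RuleProvider {
    /// Returns rule data for an optional content locale, or `None` if missing.
    fn load(&self, content_locale: Option<&str>) -> Option<&'static str>;
}

/// A provider that holds no data, used to exercise the fallible paths.
pub struct EmptyProvider;

impl RuleProvider for EmptyProvider {
    fn load(&self, _content_locale: Option<&str>) -> Option<&'static str> {
        None
    }
}

/// Stand-in for an options bag carrying the content locale.
#[derive(Debug, Default, Clone)]
pub struct SketchOptions {
    pub content_locale: Option<String>,
}

#[derive(Debug, PartialEq)]
pub struct DataError;

#[derive(Debug, PartialEq)]
pub struct SketchSegmenter {
    rules: &'static str,
}

/// Stand-in for compiled (static) data lookup.
/// Assumption: an unknown locale is an error rather than a fallback.
fn compiled_rules(content_locale: Option<&str>) -> Option<&'static str> {
    match content_locale {
        None => Some("root"),
        Some("ja") => Some("ja-tailored"),
        Some(_) => None,
    }
}

impl SketchSegmenter {
    /// Shape 1: static data, no tailoring, infallible.
    pub fn new() -> Self {
        Self { rules: "root" }
    }

    /// Shape 2: dynamic data, no tailoring.
    pub fn try_new_unstable<P: RuleProvider>(provider: &P) -> Result<Self, DataError> {
        let rules = provider.load(None).ok_or(DataError)?;
        Ok(Self { rules })
    }

    /// Shape 3: static data with content-locale tailoring.
    pub fn try_new_with_options(options: SketchOptions) -> Result<Self, DataError> {
        let rules = compiled_rules(options.content_locale.as_deref()).ok_or(DataError)?;
        Ok(Self { rules })
    }

    /// Shape 4: dynamic data with content-locale tailoring.
    pub fn try_new_with_options_unstable<P: RuleProvider>(
        provider: &P,
        options: SketchOptions,
    ) -> Result<Self, DataError> {
        let rules = provider
            .load(options.content_locale.as_deref())
            .ok_or(DataError)?;
        Ok(Self { rules })
    }

    /// Exposes which rule set was selected, for illustration only.
    pub fn rules(&self) -> &'static str {
        self.rules
    }
}
```

In this sketch only shape 1 can skip the `Result`, because it never consults a provider or a locale.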

Which of the following should we do?

  1. Keep all 4 sets of constructors listed above
  2. Remove constructor sets 1 and 2. Rename 3 and 4 to not have with_options.
  3. Remove constructor set 2. Rename 3 and 4 to not have with_options. Keep constructor 1 as a compiled data infallible optimization.

And, should we take a LanguageIdentifier or a preferences bag for the content locale? (Please understand the discussion in #3284 before stating an opinion on this)

@zbraniecki @makotokato @aethanyc @Manishearth

@sffc sffc added C-segmentation Component: Segmentation needs-approval One or more stakeholders need to approve proposal labels Nov 22, 2024
@sffc sffc added this to the ICU4X 2.0 ⟨P1⟩ milestone Nov 22, 2024
@aethanyc
Contributor

If we were to reduce the number of constructors, I'm leaning toward option 3 to remove constructor set 2. I feel it is not too terrible to ask the user to pass Default::default() as the option to the constructor.

Note: there is probably a naming inconsistency in LineSegmenter. We should prepend try_ to the method name in LineSegmenter::new_auto_with_options, LineSegmenter::new_dictionary_with_options, and LineSegmenter::new_lstm_with_options, because they all take LineBreakOptions which contains content_locale.
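
As a rough illustration of the `Default::default()` suggestion above, a call site under option 3 might look like the sketch below. All names here (`SketchSegmenter`, `SketchOptions`, `DataError`, `try_new`, `untailored`, `tailored`) are hypothetical and chosen only for the example; the real constructor names were still under discussion.

```rust
// Hypothetical names throughout; this is not the ICU4X API.

/// Options bag; `None` means "no content-locale tailoring".
#[derive(Debug, Default, Clone, PartialEq)]
pub struct SketchOptions {
    pub content_locale: Option<String>,
}

#[derive(Debug, PartialEq)]
pub struct DataError;

#[derive(Debug, PartialEq)]
pub struct SketchSegmenter {
    tailored_for: Option<String>,
}

impl SketchSegmenter {
    /// The single options-taking constructor left after dropping `with_options`.
    pub fn try_new(options: SketchOptions) -> Result<Self, DataError> {
        Ok(Self {
            tailored_for: options.content_locale,
        })
    }

    /// Reports the locale this segmenter was tailored for, if any.
    pub fn tailored_for(&self) -> Option<&str> {
        self.tailored_for.as_deref()
    }
}

/// A caller with no preferences passes the default options bag.
pub fn untailored() -> Result<SketchSegmenter, DataError> {
    SketchSegmenter::try_new(Default::default())
}

/// A caller that knows the content locale fills in the field explicitly.
pub fn tailored(locale: &str) -> Result<SketchSegmenter, DataError> {
    SketchSegmenter::try_new(SketchOptions {
        content_locale: Some(locale.to_string()),
    })
}
```

The cost to the caller is one extra argument and a `Result` to handle, even when no tailoring is requested.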

@sffc
Member Author

sffc commented Nov 28, 2024

@aethanyc I lean toward option 3 as well, but I think we should take the opportunity to give the function ::new() a better name. This is also aligned with the discussion in #5554.

Some ideas:

  • new_root
  • new_invariant
  • new_without_options

@sffc
Member Author

sffc commented Dec 10, 2024

@Manishearth @makotokato @zbraniecki, anything to add to the list above?

What should we name the segmenter constructor that takes no arguments, uses compiled data, is infallible, and uses Unicode data with no CLDR locale tailorings? (It is currently named ::new() but there are reasons we want to change it.)

@Manishearth
Member

Not a huge fan of using the somewhat-CLDR-internal concept "root" in the name. new_untailored or new_no_tailoring? (tailoring is also a bit of jargon, but it's jargon in the segmentation space, not specific to CLDR. Unclear to me if that distinction actually matters)

@makotokato
Member

I don't have a strong opinion on this. Actually, we have no plans to add new options. If we support auto-phrase in CSS Text, it can be added as a CSS property.

@sffc sffc added S-small Size: One afternoon (small bug fix or enhancement) discuss-priority Discuss at the next ICU4X meeting and removed needs-approval One or more stakeholders need to approve proposal labels Jan 7, 2025
@Manishearth
Member

#5958 adds new_root or new_root_foo to all segmenters except Grapheme.

We can rename root to something else if we want, I am happy with root for now. We should perhaps get WG approval on the name.

We should also figure out what we want to do with Grapheme. Grapheme does not support any tailorings right now, but it could in the future? I'm inclined to let Grapheme have a new() and use the deprecation route if we need to split it.
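
A minimal sketch of what the "deprecation route" could look like in Rust, assuming a later split where `new()` forwards to a root constructor. `SketchGraphemeSegmenter`, `legacy_caller`, and the `since` version are hypothetical; only the `#[deprecated]` attribute itself is standard Rust.

```rust
// Hypothetical type; not the real ICU4X grapheme segmenter.
#[derive(Debug, PartialEq)]
pub struct SketchGraphemeSegmenter {
    tailored: bool,
}

impl SketchGraphemeSegmenter {
    /// Kept so existing callers still compile, but flagged at compile time.
    /// The version number below is a placeholder.
    #[deprecated(since = "0.0.0", note = "hypothetical: use `new_root` instead")]
    pub fn new() -> Self {
        Self::new_root()
    }

    /// The explicitly untailored constructor introduced by the split.
    pub fn new_root() -> Self {
        Self { tailored: false }
    }

    pub fn is_tailored(&self) -> bool {
        self.tailored
    }
}

/// Existing code keeps working; without the `allow` it would emit a warning.
#[allow(deprecated)]
pub fn legacy_caller() -> SketchGraphemeSegmenter {
    SketchGraphemeSegmenter::new()
}
```

Callers of the old name get a compiler warning pointing at the replacement, so the rename does not have to be a breaking change.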

@aethanyc
Contributor

aethanyc commented Jan 8, 2025

> We can rename root to something else if we want, I am happy with root for now. We should perhaps get WG approval on the name.

I like new_without_options proposed in #5856 (comment). Although it is a bit of a mouthful, it clearly indicates that it accepts no options such as WordBreakOptions or LineBreakOptions.


4 participants