Is καί lower case? #131

Open
johanneswilm opened this issue Oct 23, 2022 · 9 comments

Comments

@johanneswilm
Member

Hey @retorquere,
I am trying to understand what is supposed to happen with the test file Title case of latex greek text on biblatex export #564.bib. I added a number of extra fields that users have been requesting [1]. These seem to just not be listed in the biblatex documentation of the various types, even though the fields themselves are documented. I also added title casing to journal title and series. I noticed that this ended up adding case protection on a lot of imports. While this is technically correct, in most cases that is likely not what is intended. So I tried to remove the case protection for those text nodes that are entirely lowercase. In said test file there is clearly intended case protection for the word "καί", which JavaScript reads as entirely lowercase.

Could you explain why the case protection is needed? Is it not really lowercase?

[1] main...originfo

@retorquere
Contributor

καί matches ^\p{Ll}+$ so I'd say yes, this is lowercase.
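The check above can be sketched in TypeScript as follows. The helper name `isAllLowercase` is hypothetical and not taken from either project; the NFC normalization step is an added assumption, included because in decomposed form the accent becomes a separate combining mark (category Mn) that `\p{Ll}` does not match.

```typescript
// Hypothetical helper: true when `text` consists solely of lowercase
// letters (Unicode general category Ll).
function isAllLowercase(text: string): boolean {
  // Assumption: normalize to NFC first so that an accented letter is a
  // single precomposed code point rather than letter + combining mark.
  return /^\p{Ll}+$/u.test(text.normalize('NFC'))
}

const kai = '\u03BA\u03B1\u03AF' // καί, precomposed
console.log(isAllLowercase(kai))                  // true
console.log(isAllLowercase(kai.normalize('NFD'))) // true, thanks to the NFC step
console.log(isAllLowercase('\u039A\u03B1\u03AF')) // false: Καί starts with an uppercase letter
```

Note that the pattern only matches a single run of letters; a text node containing spaces or digits would need a looser test.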

While this is technically correct, in most cases that is likely not what is intended.

In this case I usually opt for configurable behavior, so both the correct and the other behavior are available.

In said test file there is clearly intended case protection for the word "καί", which JavaScript reads as entirely lowercase.

The phrase The Use of καί appears twice in the input, once with <span class="nocase"> and once without; the one without exports as The {{Use}} of Καί, and the one with exports as The {{Use}} of {{καί}}.

Technically, since bib(la)tex styles assume title-case, they never apply title-casing, so there is no need to protect all-lowercase words, but since Zotero does it the other way around, case-protecting lowercase words is necessary for it, and my exporter just follows that case-protection, so it produces {{καί}}.

@johanneswilm
Member Author

Technically, since bib(la)tex styles assume title-case, they never apply title-casing, so there is no need to protect all-lowercase words, but since Zotero does it the other way around, case-protecting lowercase words is necessary for it, and my exporter just follows that case-protection, so it produces {{καί}}.

Ok, but in that case, isn't the case protection quite useless? The words that are protected will be quite different ones depending on the direction of the conversion. Take a title like

The 2006 <cp>Season</cp> was a Scam-alike Event.

This title is imported from bibtex and the case protection is around "Season" because that could be converted to lowercase when converting to sentence case. But if instead CSL wants to convert from sentence case to title case, the word that would potentially need protection would instead be "-alike", to prevent it from turning into:

The 2006 <cp>Season</cp> was a Scam-Alike Event.

It seems like a bad decision to me that they are trying to force storing English language titles in sentence case. There will be lots of software projects trying to do both bibtex and CSL, and not allowing storing titles in title case is just asking for problems for the entire community for decades to come.

@retorquere
Contributor

Ok, but in that case, isn't the case protection quite useless?

I can consider removing it, but it's currently following explicitly stated user intent. That's why it's there.

But if instead CSL wants to convert from sentence case to title case, the word that would potentially need protection would instead be "-alike", to prevent it from turning into:

The 2006 <cp>Season</cp> was a Scam-Alike Event.

I do heuristics to detect some language features but I don't do any NLP, so you have just given me a good case for why I should simply follow user intent on this one and not second-guess the user. The user can enter that as

The 2006 <span class="nocase">Season</span> was a scam<span class="nocase">-alike</span> event.

in Zotero and that will work as intended (and may well be required in Zotero in any case for a CMOS style).

It seems like a bad decision to me that they are trying to force storing English language titles in sentence case. There will be lots of software projects trying to do both bibtex and CSL, and not allowing storing titles in title case is just asking for problems for the entire community for decades to come.

Life would have been simpler for me, yes, but I don't see why CSL should be beholden to historic choices by bib(la)tex. It would surprise me if the CSL authors had no knowledge of the existence of bib(la)tex, so I assume they had good reasons to go with sentence case. In any case, both citation processors now have software dependent on their current behavior, and there's about as much chance of getting CSL to switch to title case as there is of getting biblatex to switch to sentence case. I don't know what the rationale was for bibtex to go with title case either.

@johanneswilm
Member Author

Life would have been simpler for me, yes, but I don't see why CSL should be beholden to historic choices by bib(la)tex. It would surprise me if the CSL authors had no knowledge of the existence of bib(la)tex, so I assume they had good reasons to go with sentence case. In any case, both citation processors now have software dependent on their current behavior, and there's about as much chance of getting CSL to switch to title case as there is of getting biblatex to switch to sentence case. I don't know what the rationale was for bibtex to go with title case either.

Hmm...

Ok, so then the thing to do, I guess is to add sentence-case conversion to the converter from the intermediate format to the CSL output. There is no way to specify that a given title is in title case and will only potentially need to be converted to sentence-case, is there?

The sentence-case conversion I would create would do something like this:

  • Only be active on English language items (ignore non-English items)
  • Keep any word that is entirely uppercase in uppercase.
  • Keep uppercase letters that are inside of case-protect in uppercase but remove the case-protection.
  • Lowercase every letter outside of case-protection.
  • Put entire words that were previously in lowercase into case-protect.

Does that sound correct?
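A literal reading of the five rules above could be sketched like this in TypeScript. This is only an illustration under stated assumptions, not code from either project: the `Token` model, the `language` parameter, the `render` helper and the `<cp>` output notation are all hypothetical, and keeping the capital on the first token is an extra assumption that the list does not mention.

```typescript
// Hypothetical token model: a title split into words, spaces and punctuation,
// each flagged when it sits inside a case-protected span.
interface Token { text: string; protect: boolean }

function toSentenceCase(tokens: Token[], language: string): Token[] {
  // Rule 1: only English-language items are converted
  // (assumption: the language value starts with "en", e.g. "en-US" or "english").
  if (!/^en/i.test(language)) return tokens

  return tokens.map((tok, index) => {
    // Rule 3: protected text keeps its capitals and loses the protection.
    if (tok.protect) return { text: tok.text, protect: false }
    const hasUpper = /\p{Lu}/u.test(tok.text)
    const hasLower = /\p{Ll}/u.test(tok.text)
    // Spaces, digits and punctuation pass through untouched.
    if (!hasUpper && !hasLower) return tok
    // Rule 2: entirely uppercase words stay uppercase.
    if (!hasLower) return tok
    // Rule 5: words that were already lowercase become protected.
    if (!hasUpper) return { text: tok.text, protect: true }
    // Assumption (not in the list): the first token keeps its capital.
    if (index === 0) return tok
    // Rule 4: everything else is lowercased.
    return { text: tok.text.toLowerCase(), protect: false }
  })
}

// Hypothetical renderer using the <cp> notation from earlier in the thread.
const render = (tokens: Token[]): string =>
  tokens.map(t => (t.protect ? `<cp>${t.text}</cp>` : t.text)).join('')

const words = ['The', ' ', '2006', ' ', 'Season', ' ', 'was', ' ', 'a', ' ', 'Scam-alike', ' ', 'Event', '.']
const title: Token[] = words.map(text => ({ text, protect: text === 'Season' }))
console.log(render(toSentenceCase(title, 'en')))
// The 2006 Season <cp>was</cp> <cp>a</cp> scam-alike event.
```

Under this reading a single-letter capital such as "A" counts as an entirely uppercase word and is left alone, which may or may not be what is wanted.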

@retorquere
Contributor

Ok, so then the thing to do, I guess is to add sentence-case conversion to the converter from the intermediate format to the CSL output. There is no way to specify that a given title is in title case and will only potentially need to be converted to sentence-case, is there?

I don't understand the question

The sentence-case conversion I would create would do something like this:

  • Only be active on English language items (ignore non-English items)

Correct.

  • Keep any word that is entirely uppercase in uppercase.
  • Keep uppercase letters that are inside of case-protect in uppercase but remove the case-protection.
  • Lowercase every letter outside of case-protection.
  • Put entire words that were previously in lowercase into case-protect.

Does that sound correct?

Mostly; I have the following exceptions, I keep

  • quoted parts
  • acronyms like U.S.A.
  • stuff like Q&A

in uppercase
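As a rough illustration of two of these exceptions, a word-level check might look like the TypeScript sketch below. The patterns and the name `keepsCase` are assumptions made for this example and are not the heuristics used in bibtex-parser; quoted parts are left out because they need context beyond a single word.

```typescript
// Assumed pattern for dotted acronyms such as U.S.A. (trailing dot optional).
const DOTTED_ACRONYM = /^(?:\p{Lu}\.)+\p{Lu}\.?$/u
// Assumed pattern for ampersand forms such as Q&A.
const AMPERSAND_FORM = /^\p{Lu}+&\p{Lu}+$/u

// Hypothetical predicate: should this word be exempt from lowercasing?
function keepsCase(word: string): boolean {
  return DOTTED_ACRONYM.test(word) || AMPERSAND_FORM.test(word)
}

console.log(['U.S.A.', 'Q&A', 'Season', 'A.'].map(keepsCase))
// [ true, true, false, false ]
```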

@johanneswilm
Member Author

Ok, so then the thing to do, I guess is to add sentence-case conversion to the converter from the intermediate format to the CSL output. There is no way to specify that a given title is in title case and will only potentially need to be converted to sentence-case, is there?

I don't understand the question

I thought that maybe there was a way to specify that the field value already is in title case. I was thinking that because I can see that in the Zotero user interface I can run a translator to title or to sentence case on the title field. I guess then that information that the field is stored in title case is just lost?

Mostly; I have the following exceptions, I keep

  • quoted parts
  • acronyms like U.S.A.
  • stuff like Q&A

in uppercase

Is this a repository that is under an open source license? Is there a package just for the sentence-case translator that I can import? Or is it so little code that it doesn't really make sense and I should just write it from scratch?

@johanneswilm
Member Author

Is it essentially this part: https://github.com/retorquere/bibtex-parser/blob/master/index.ts#L18-L110 (MIT) ? Does it respect <span class="nocase">...</span> or how does that work?

@retorquere
Contributor

I thought that maybe there was a way to specify that the field value already is in title case. I was thinking that because I can see that in the Zotero user interface I can run a translator to title or to sentence case on the title field. I guess then that information that the field is stored in title case is just lost?

Ah so. That is not lost so much as that it was never present to begin with. Titles in Zotero are assumed to be stored in sentence case at all times. Some import translators make sure that happens, some don't/can't. The case-conversion functions in the UI are very naive utility functions and the user is expected to inspect/correct the result.

Is this a repository that is under an open source license? Is there a package just for the sentence-case translator that I can import? Or is it so little code that it doesn't really make sense and I should just write it from scratch?

It'd be hard to split off since it is not expected to do its work fully standalone; it post-processes the results further.

Is it essentially this part: https://github.com/retorquere/bibtex-parser/blob/master/index.ts#L18-L110 (MIT)?

Yes.

Does it respect <span class="nocase">...</span> or how does that work?

That part of the code does not, it just sentence cases everything. This line restores protected parts, but it depends on metadata generated by the peg parser.

It's fairly simple to add though. The code you highlighted sentence-cases everything; another part of the code then scans the input for <span class="nocase">...</span> and pastes that back into the sentence-cased string. That latter part would just be a few lines of code to isolate.
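The two-pass idea described here (sentence-case everything, then paste the protected spans back) might be isolated roughly as in the following TypeScript sketch. It is not the code from bibtex-parser: `naiveSentenceCase` is a deliberately crude stand-in for the real sentence-caser, `sentenceCaseKeepingNocase` is a hypothetical name, and the sketch assumes that case conversion does not change string length, which holds for plain English text.

```typescript
// Crude stand-in for a real sentence-caser: lowercases all but the first character.
function naiveSentenceCase(text: string): string {
  return text.charAt(0) + text.slice(1).toLowerCase()
}

function sentenceCaseKeepingNocase(markup: string): string {
  const nocase = /<span class="nocase">(.*?)<\/span>/g
  const ranges: { start: number; end: number }[] = []
  let plain = ''
  let last = 0
  let m: RegExpExecArray | null
  // Pass 1: strip the spans, remembering where their contents sit in the plain text.
  while ((m = nocase.exec(markup)) !== null) {
    plain += markup.slice(last, m.index)
    ranges.push({ start: plain.length, end: plain.length + m[1].length })
    plain += m[1]
    last = m.index + m[0].length
  }
  plain += markup.slice(last)

  // Pass 2: sentence-case the whole plain string (assumed length-preserving).
  const cased = naiveSentenceCase(plain)

  // Pass 3: paste the original protected text, with its span, back in.
  let out = ''
  let pos = 0
  for (const r of ranges) {
    out += cased.slice(pos, r.start)
    out += `<span class="nocase">${plain.slice(r.start, r.end)}</span>`
    pos = r.end
  }
  return out + cased.slice(pos)
}

console.log(sentenceCaseKeepingNocase('The 2006 <span class="nocase">Season</span> Was a Scam-Alike Event.'))
// The 2006 <span class="nocase">Season</span> was a scam-alike event.
```

Recording offsets rather than searching for the protected words again avoids mismatches when the same word also appears unprotected elsewhere in the title.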

@johanneswilm
Member Author

Ok, I have now added your sentence caser [1] as well as testing of export files [2] and so far this seems to work.

I usually opt for configurable behavior, so both the correct and the other behavior are available.

Makes sense. I will add that next.

[1] 10e185b
[2] 782a1ca
