Remove unnecessary LaTeX encodings à la pylatexenc.latex2text #25

Open
GraemeWatt opened this issue Jul 28, 2020 · 6 comments


GraemeWatt commented Jul 28, 2020

Firstly, thanks for releasing the packages on PyPI and npm. The Python package unicodeit v0.7.0 is now used in the @HEPData code for tweeting titles of high-energy physics publications via https://twitter.com/HEPData.

After applying unicodeit.replace to each paper title, we need to apply some cleanup operations to remove characters like $, {, }, ~, and encodings like \mathrm, \text, and \rm. Recently, we were alerted via a reply to a tweet that our code fails for \mathrm{t}\overline{\mathrm{t}}. We would need to first remove \mathrm (with appropriate matching of braces) before applying UnicodeIt, not afterwards. An alternative would be to use pylatexenc.latex2text which applies appropriate cleanup operations (although it seems \overline is not supported). The problematic paper title was:

Search for resonant $ \mathrm{t}\overline{\mathrm{t}} $ production in proton-proton collisions at $ \sqrt{s}=13 $ TeV

where UnicodeIt gives "Search for resonant $ \mathrm{t}\̅athrm{t}} $ production in proton−proton collisions at $ √{s}=13 $ TeV" and latex2text gives "Search for resonant tt production in proton-proton collisions at √(s)=13 TeV". For our intended application, it would probably make sense to switch to pylatexenc.latex2text instead of unicodeit. Clemens Lange (@clelange) pointed to some code based on pylatexenc.latex2text used for cleaning paper titles tweeted from the @CMSpapers, @LHCb_results, and @AtlasPapers Twitter accounts.

Is there any possibility to extend UnicodeIt to appropriately remove LaTeX encodings like $, {, }, ~, \mathrm, \text, \rm, etc., in a similar way to pylatexenc.latex2text? Feel free to close this issue if you think it is beyond the intended scope of UnicodeIt.
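As an illustration of the "remove \mathrm with appropriate matching of braces" step described above, here is a minimal sketch in plain Python. It is not part of unicodeit or pylatexenc: the helper names `find_matching_brace` and `strip_wrappers` and the `WRAPPERS` list are hypothetical, and declaration-style commands such as `\rm` (no braced argument) are deliberately not handled. The idea is to unwrap the font commands first, so that a converter such as `unicodeit.replace` only sees the remaining markup.

```python
import re

# Font/style commands whose braced argument is kept while the wrapper is dropped.
# This list is an assumption for illustration, not taken from any library.
WRAPPERS = ("mathrm", "text", "textrm", "mathit")
WRAPPER_RE = re.compile(r"\\(?:%s)\s*\{" % "|".join(WRAPPERS))


def find_matching_brace(s, open_idx):
    """Return the index of the '}' closing the '{' at open_idx, or -1."""
    depth = 0
    for i in range(open_idx, len(s)):
        if s[i] == "{":
            depth += 1
        elif s[i] == "}":
            depth -= 1
            if depth == 0:
                return i
    return -1


def strip_wrappers(s):
    """Replace \\cmd{...} by its contents for every cmd in WRAPPERS."""
    while True:
        m = WRAPPER_RE.search(s)
        if m is None:
            return s
        open_idx = m.end() - 1  # position of the '{' that the regex matched
        close_idx = find_matching_brace(s, open_idx)
        if close_idx == -1:
            return s  # unbalanced braces: leave the rest untouched
        s = s[:m.start()] + s[open_idx + 1:close_idx] + s[close_idx + 1:]


title = (r"Search for resonant $ \mathrm{t}\overline{\mathrm{t}} $ production "
         r"in proton-proton collisions at $ \sqrt{s}=13 $ TeV")
cleaned = strip_wrappers(title)
print(cleaned)
```

On the title above this leaves `$ t\overline{t} $` in place of the wrapped form; removing `$`, `~`, and stray braces would still be a separate step.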

@svenkreiss
Owner

Thanks @GraemeWatt for pointing that out. I think you are actually raising several separate improvements, all of which should be addressed. I will leave this issue open until they are.

@GraemeWatt
Author

When I spot more problematic paper titles, I'll add them here as examples that might be useful for future testing.

  1. Tweet for Measurement of the $CP$ violating phase $\phi_{\text{s}}$ in the $\mathrm{B}_s \to \mathrm{J}/\psi\,\phi(1020) \to \mu^+\mu^-\,\mathrm{K}^+\mathrm{K}^-$ channel in proton-proton collisions at $\sqrt{s} = 13~\mathrm{TeV}$.
    unicodeit: Measurement of the $CP$ violating phase $ϕ_{\text{s}}$ in the $\mathrm{B}ₛ → \mathrm{J}/ψ ϕ(1020) → μ⁺μ⁻ \mathrm{K}⁺\mathrm{K}⁻$ channel in proton−proton collisions at $√{s} = 13~\mathrm{TeV}$
    latex2text: Measurement of the CP violating phase ϕ_s in the B_s →J/ψ ϕ(1020) →μ^+μ^- K^+K^- channel in proton-proton collisions at √(s) = 13 TeV

HDembinski mentioned this issue Apr 5, 2023
@HDembinski

@GraemeWatt Could you have a look at https://github.com/HDembinski/unicodeitplus ? In unicodeitplus I tried to address the problems with parsing a mix of LaTeX code and normal text. Running it on the title from your tweet gives me:

Measurement of the 𝐶𝑃 violating phase 𝜙ₛ in the Bₛ→J/𝜓 𝜙(1020)→𝜇⁺𝜇⁻ K⁺K⁻ channel in proton-proton collisions at √𝑠̅=13~TeV

I am lacking a rule for ~, but that can be added easily.

@GraemeWatt
Author

@HDembinski: thanks, unicodeitplus looks great and is better suited to our use case than the original unicodeit. I've opened HEPData/hepdata#664 to make the switch after some more testing. I've already identified some minor problems and I'll open new issues in the unicodeitplus repository.

@HDembinski

Excellent, thanks!

@GraemeWatt
Author

I just wrote a Jupyter notebook that gets the titles of all (almost 10,000) HEPData records and compares the output from latex2text, unicodeit and unicodeitplus. I hope it will be useful in testing future improvements to these tools.
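For readers who want to run a similar comparison themselves, a rough sketch of such a harness follows. This is not the code from the notebook: the function names `compare_converters` and `disagreements` are hypothetical, and the converters are passed in as plain callables so the sketch runs without any of the three packages installed. In practice one would pass `unicodeit.replace`, pylatexenc's `LatexNodes2Text().latex_to_text`, and the corresponding function from unicodeitplus.

```python
def compare_converters(titles, converters):
    """Apply every converter to every title; return one dict per title."""
    rows = []
    for title in titles:
        row = {"title": title}
        for name, convert in converters.items():
            try:
                row[name] = convert(title)
            except Exception as exc:  # keep going if one tool fails on a title
                row[name] = f"ERROR: {exc}"
        rows.append(row)
    return rows


def disagreements(rows):
    """Keep only the rows where the converters did not all give the same text."""
    return [
        row for row in rows
        if len({out for name, out in row.items() if name != "title"}) > 1
    ]


# Stand-in converters, used only to show the shape of the result.
demo_converters = {
    "identity": lambda s: s,
    "strip_dollars": lambda s: s.replace("$", ""),
}
demo_rows = compare_converters(["$CP$ violation", "plain title"], demo_converters)
for row in disagreements(demo_rows):
    print(row)
```

Filtering down to the disagreeing rows keeps the output manageable when the input is thousands of titles.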
