Remove unnecessary LaTeX encodings à la pylatexenc.latex2text #25

GraemeWatt · 2020-07-28T10:55:24Z

Firstly, thanks for releasing the packages on PyPI and npm. The Python package unicodeit v0.7.0 is now used in the @HEPData code for tweeting titles of high-energy physics publications via https://twitter.com/HEPData.

After applying unicodeit.replace to each paper title, we need to apply some cleanup operations to remove characters like $, {, }, ~, and encodings like \mathrm, \text, and \rm. Recently, we were alerted via a reply to a tweet that our code fails for \mathrm{t}\overline{\mathrm{t}}. We would need to first remove \mathrm (with appropriate matching of braces) before applying UnicodeIt, not afterwards. An alternative would be to use pylatexenc.latex2text which applies appropriate cleanup operations (although it seems \overline is not supported). The problematic paper title was:

Search for resonant $ \mathrm{t}\overline{\mathrm{t}} $ production in proton-proton collisions at $ \sqrt{s}=13 $ TeV

where UnicodeIt gives "Search for resonant $ \mathrm{t}\̅athrm{t}} $ production in proton−proton collisions at $ √{s}=13 $ TeV" and latex2text gives "Search for resonant tt production in proton-proton collisions at √(s)=13 TeV". For our intended application, it would probably make sense to switch to pylatexenc.latex2text instead of unicodeit. Clemens Lange (@clelange) pointed to some code based on pylatexenc.latex2text used for cleaning paper titles tweeted from the @CMSpapers, @LHCb_results, and @AtlasPapers Twitter accounts.

Is there any possibility to extend UnicodeIt to appropriately remove LaTeX encodings like $, {, }, ~, \mathrm, \text, \rm, etc., in a similar way to pylatexenc.latex2text? Feel free to close this issue if you think it is beyond the intended scope of UnicodeIt.
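The pre-processing step mentioned above (removing `\mathrm` with proper brace matching before conversion) could be sketched roughly as follows. This is only an illustration, not part of unicodeit or pylatexenc: the function name `strip_wrappers` and the list of wrapper commands are assumptions, and escaped braces such as `\{` and declaration-style commands such as `{\rm TeV}` are not handled.

```python
import re

# Commands whose argument is kept while the wrapper itself is dropped.
# Assumption: this list only covers the brace-taking commands named in this issue.
WRAPPERS = ("mathrm", "text")


def strip_wrappers(latex, commands=WRAPPERS):
    r"""Replace \cmd{...} by its contents, matching braces so nested groups survive."""
    pattern = re.compile(r"\\(?:%s)\s*\{" % "|".join(map(re.escape, commands)))
    out = []
    pos = 0
    while True:
        match = pattern.search(latex, pos)
        if match is None:
            out.append(latex[pos:])
            break
        out.append(latex[pos:match.start()])
        # Walk forward to find the brace that closes the one just opened.
        depth = 1
        i = match.end()
        while i < len(latex) and depth:
            if latex[i] == "{":
                depth += 1
            elif latex[i] == "}":
                depth -= 1
            i += 1
        if depth:
            # Unbalanced braces: leave the remainder untouched.
            out.append(latex[match.start():])
            break
        inner = latex[match.end():i - 1]
        # Recurse so wrappers nested inside the argument are removed too.
        out.append(strip_wrappers(inner, commands))
        pos = i
    return "".join(out)


cleaned = strip_wrappers(r"\mathrm{t}\overline{\mathrm{t}}")
```

The cleaned string could then be passed to `unicodeit.replace`, which is left out here so that the sketch has no third-party dependencies.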

svenkreiss · 2020-07-28T11:45:06Z

Thanks @GraemeWatt for pointing that out. I think you are actually raising several separate things to improve, all of which should be addressed. I will definitely leave this issue open until they are.

GraemeWatt · 2020-08-12T16:34:09Z

When I spot more problematic paper titles, I'll post them here, since they might be useful for future testing.

Tweet for Measurement of the $CP$ violating phase $\phi_{\text{s}}$ in the $\mathrm{B}_s \to \mathrm{J}/\psi\,\phi(1020) \to \mu^+\mu^-\,\mathrm{K}^+\mathrm{K}^-$ channel in proton-proton collisions at $\sqrt{s} = 13~\mathrm{TeV}$.
unicodeit: Measurement of the $CP$ violating phase $ϕ_{\text{s}}$ in the $\mathrm{B}ₛ → \mathrm{J}/ψ ϕ(1020) → μ⁺μ⁻ \mathrm{K}⁺\mathrm{K}⁻$ channel in proton−proton collisions at $√{s} = 13~\mathrm{TeV}$
latex2text: Measurement of the CP violating phase ϕ_s in the B_s →J/ψ ϕ(1020) →μ^+μ^- K^+K^- channel in proton-proton collisions at √(s) = 13 TeV

HDembinski · 2023-06-15T09:00:54Z

@GraemeWatt Could you have a look at https://github.com/HDembinski/unicodeitplus ? In unicodeitplus I tried to address the issues with parsing a mix of LaTeX code and normal text. Running it on the title from your tweet gives me:

Measurement of the 𝐶𝑃 violating phase 𝜙ₛ in the Bₛ→J/𝜓 𝜙(1020)→𝜇⁺𝜇⁻ K⁺K⁻ channel in proton-proton collisions at √𝑠̅=13~TeV

I am lacking a rule for ~, but that can be added easily.

GraemeWatt · 2023-06-16T14:55:46Z

@HDembinski : thanks, unicodeitplus looks great and better suited for our use case than the original unicodeit. I've opened HEPData/hepdata#664 to make the switch after some more testing. I've already identified some minor problems and I'll open new issues in the unicodeitplus repository.

HDembinski · 2023-06-16T16:34:46Z

Excellent, thanks!

GraemeWatt · 2023-06-20T11:11:13Z

I just wrote a Jupyter notebook that gets the titles of all (almost 10,000) HEPData records and compares the output from latex2text, unicodeit and unicodeitplus. I hope it will be useful in testing future improvements to these tools.
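The notebook itself is not reproduced here. A comparison of this kind could be sketched roughly as below; the helper names `compare_converters` and `leftover_latex` are hypothetical, and the two lambda converters are stand-ins so that the block runs without any of the three packages installed.

```python
def compare_converters(titles, converters):
    """Apply every converter to every title and return one dict per title."""
    rows = []
    for title in titles:
        row = {"title": title}
        for name, convert in converters.items():
            try:
                row[name] = convert(title)
            except Exception:
                # Record a failure as None so one bad title does not stop the run.
                row[name] = None
        rows.append(row)
    return rows


def leftover_latex(rows, markers=("$", "\\", "{", "}")):
    """Return (title, tool) pairs whose output still contains raw LaTeX markers."""
    flagged = []
    for row in rows:
        for name, output in row.items():
            if name == "title" or output is None:
                continue
            if any(marker in output for marker in markers):
                flagged.append((row["title"], name))
    return flagged


# Stand-in converters (assumptions for illustration only).
demo_converters = {
    "identity": lambda s: s,
    "strip_dollars": lambda s: s.replace("$", ""),
}
demo_rows = compare_converters(["collisions at $13$ TeV"], demo_converters)
demo_flagged = leftover_latex(demo_rows)
```

In practice the dictionary would map tool names to the real functions, for example `unicodeit.replace` and `LatexNodes2Text().latex_to_text` from `pylatexenc.latex2text`; the corresponding unicodeitplus entry point should be checked against that package's documentation.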

HDembinski mentioned this issue Apr 5, 2023: Project dead? #71 (Closed)

GraemeWatt mentioned this issue Jun 16, 2023: twitter: replace unicodeit with unicodeitplus HEPData/hepdata#664 (Open)
