Remove unnecessary LaTeX encodings à la pylatexenc.latex2text #25
Comments
Thanks @GraemeWatt for pointing that out. I think you are actually pointing out several things to improve, all of which should be addressed. I will definitely leave this issue open until they are.
I'll try to give some more examples of problematic paper titles when I spot them, as they might be useful for future testing.
@GraemeWatt Could you have a look at https://github.com/HDembinski/unicodeitplus ? I tried to address the issues with parsing a mix of LaTeX code and normal text in unicodeitplus. Running it on your Tweet gives me "Measurement of the 𝐶𝑃 violating phase 𝜙ₛ in the Bₛ→J/𝜓 𝜙(1020)→𝜇⁺𝜇⁻ K⁺K⁻ channel in proton-proton collisions at √𝑠̅=13~TeV". I am lacking a rule for
@HDembinski: thanks,
Excellent, thanks!
I just wrote a Jupyter notebook that gets the titles of all (almost 10,000) HEPData records and compares the output from latex2text, unicodeit and unicodeitplus. I hope it will be useful in testing future improvements to these tools.
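A comparison of this kind could be sketched roughly as below. This is not the notebook itself, only a minimal illustration: `compare_converters`, `LEFTOVER`, and the two stand-in converters are hypothetical names, and the real tools (for example `unicodeit.replace`) would be passed in as callables instead of the stand-ins.

```python
import re

# Characters that suggest LaTeX markup survived the conversion (an assumed heuristic).
LEFTOVER = re.compile(r"[$\\{}~]")


def compare_converters(titles, converters):
    """Run every converter on every title and flag outputs that still contain markup.

    `converters` maps a label to a callable taking and returning a string.
    """
    rows = []
    for title in titles:
        outputs = {name: convert(title) for name, convert in converters.items()}
        flagged = sorted(name for name, out in outputs.items() if LEFTOVER.search(out))
        rows.append({"title": title, "outputs": outputs, "flagged": flagged})
    return rows


if __name__ == "__main__":
    # Stand-in converters for illustration only; swap in the real tools here.
    stand_ins = {
        "identity": lambda text: text,
        "naive_strip": lambda text: LEFTOVER.sub("", text),
    }
    for row in compare_converters([r"$\sqrt{s}$ = 13 TeV", "plain title"], stand_ins):
        print(row["flagged"], row["outputs"])
```

Flagging on leftover characters is deliberately crude: it catches unconverted markup but says nothing about whether the converted text is correct, so flagged titles would still need a manual look.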
Firstly, thanks for releasing the packages on PyPI and npm. The Python package `unicodeit` v0.7.0 is now used in the @HEPData code for tweeting titles of high-energy physics publications via https://twitter.com/HEPData.

After applying `unicodeit.replace` to each paper title, we need to apply some cleanup operations to remove characters like `$`, `{`, `}`, `~`, and encodings like `\mathrm`, `\text`, and `\rm`. Recently, we were alerted via a reply to a tweet that our code fails for `\mathrm{t}\overline{\mathrm{t}}`. We would need to first remove `\mathrm` (with appropriate matching of braces) before applying UnicodeIt, not afterwards. An alternative would be to use pylatexenc.latex2text, which applies appropriate cleanup operations (although it seems `\overline` is not supported). For the problematic paper title, UnicodeIt gives "Search for resonant $ \mathrm{t}\̅athrm{t}} $ production in proton−proton collisions at $ √{s}=13 $ TeV" and `latex2text` gives "Search for resonant tt production in proton-proton collisions at √(s)=13 TeV".

For our intended application, it would probably make sense to switch to `pylatexenc.latex2text` instead of `unicodeit`. Clemens Lange (@clelange) pointed to some code based on `pylatexenc.latex2text` used for cleaning paper titles tweeted from the @CMSpapers, @LHCb_results, and @AtlasPapers Twitter accounts.

Is there any possibility to extend UnicodeIt to appropriately remove LaTeX encodings like `$`, `{`, `}`, `~`, `\mathrm`, `\text`, `\rm`, etc., in a similar way to `pylatexenc.latex2text`? Feel free to close this issue if you think it is beyond the intended scope of UnicodeIt.