non-ascii tag is not parsed #58

Open
yqu212 opened this issue Jun 22, 2021 · 8 comments
Labels
bug Something isn't working

Comments

@yqu212
yqu212 commented Jun 22, 2021

Describe the bug
Non-ascii tag is not parsed.

To Reproduce
Steps to reproduce the behavior:

(read-str "* headline :标签:")
{:headlines
 [{:headline
   {:level 1,
    :title [[:text-normal "headline :标签:"]],
    :planning [],
    :tags []}}]}

Expected behavior

{:headlines
 [{:headline
   {:level 1,
    :title [[:text-normal "headline"]],
    :planning [],
    :tags ["标签"]}}]}

Additional context
[org-parser "0.1.24"]

@yqu212 added the bug label on Jun 22, 2021
@schoettl
Collaborator

schoettl commented Jun 22, 2021

Thanks for the report.

TAGS is made of words containing any alpha-numeric character, underscore, at sign, hash sign or percent sign, and separated with colons.

The regex for tag names is currently [a-zA-Z0-9_@#%] (see function extract-tags).

It must also include Unicode characters, but JavaScript regexes cannot do that. Only Java has such a character class.

If we invert the regex like [^ \t-.…] we would have to exclude too many characters.

Other ideas? Add unicode ranges next to a-zA-Z? That will get messy, too :/

PS: It would be interesting to know how Org mode does this. Maybe it has a special character class for Unicode characters.
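
To make the limitation concrete, here is a minimal JavaScript sketch of the ASCII-only character class quoted above. The names `asciiTagChars` and `matchAsciiTags` are hypothetical and only serve the illustration; this is not the code of `extract-tags`.

```javascript
// Hypothetical illustration; not the org-parser implementation.
// ASCII-only tag character class, as quoted in the comment above.
const asciiTagChars = /[a-zA-Z0-9_@#%]+/g;

// Return all runs of tag characters in a tag string, or [] if none match.
function matchAsciiTags(tagString) {
  return tagString.match(asciiTagChars) || [];
}

console.log(matchAsciiTags(":work:home:")); // the ASCII tags are found
console.log(matchAsciiTags(":标签:"));       // nothing matches, so the tag is lost
```

Because no character of the Chinese tag belongs to the class, the tag string stays part of the title, which is the behavior shown in the report.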

@yqu212
Author

yqu212 commented Jun 22, 2021

Yes. Elisp has [:multibyte:].

Chinese is also not parsed by orgajs, another parser implemented in JavaScript.
https://github.com/orgapp/orgajs/blob/eac72e62b902b79289cfacd97e9bdf5e09bc9030/packages/orga/src/tokenize/headline.ts#L61

Maybe we can make org-parser support Java only for now?

p.s. Is this one useful?
https://stackoverflow.com/questions/21109011/javascript-unicode-string-chinese-character-but-no-punctuation

@munen
Contributor

munen commented Jun 22, 2021

Maybe we can make org-parser support Java only for now?

No. This would be very much against https://github.com/200ok-ch/org-parser/#what-does-this-project-do and https://github.com/200ok-ch/org-parser/#why-is-this-project-useful--rationale.

Having said that, JavaScript has "Unicode property escapes". Maybe we can use them for the a-zA-Z part of the regexp:

> ":标签:".match(/\p{Letter}+/gu)
[ '标签' ]

@munen
Contributor

munen commented Jun 22, 2021

Looks like this also works as part of a 'regular' regular expression (pardon the pun):

> ":标签:".match(/[\p{Letter}0-9_@#%]+/gu)
[ '标签' ]
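
Building on the expression above, the following is a rough sketch of how a headline could be split into title and tags. The names `TAGS_AT_END` and `splitHeadline` are hypothetical, and the rule that tags sit at the end of the line after whitespace is an assumption made for this example; it is not the org-parser implementation.

```javascript
// Hypothetical sketch; names are illustrative, not org-parser's API.
// Tag characters: any Unicode letter plus ASCII digits and _ @ # %.
// Assumption: tags appear at the end of the headline, preceded by whitespace.
const TAGS_AT_END = /\s+:((?:[\p{Letter}0-9_@#%]+:)+)\s*$/u;

// Split a headline text into its title and its trailing tags.
function splitHeadline(text) {
  const m = TAGS_AT_END.exec(text);
  if (m === null) return { title: text, tags: [] };
  return {
    title: text.slice(0, m.index),
    // m[1] looks like "a:b:", so drop the empty piece after the last colon.
    tags: m[1].split(":").filter((t) => t.length > 0),
  };
}

console.log(splitHeadline("headline :标签:"));
console.log(splitHeadline("headline without tags"));
```

Note that `0-9` still covers ASCII digits only; whether non-ASCII digits should count as tag characters is left open here.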

@munen
Contributor

munen commented Jun 22, 2021

@yqu212 Do you want to make your first PR and include Chinese characters by employing the above regexp for CLJS and the equivalent for CLJ?

@yqu212
Author

yqu212 commented Jun 22, 2021

It's a good idea. However, it will take some time to write the test since I am not familiar with CLJS.

@schoettl
Collaborator

Looks like it pays off that we are doing tag extraction in the transformation, not in the EBNF ^^

@munen
Contributor

munen commented Jun 22, 2021

@yqu212

It's a good idea. However, it will take some time to write the test since I am not familiar with CLJS.

No worries, take your time!

Good luck and enjoy 🙏

@schoettl

Looks like it pays off that we are doing tag extraction in the transformation, not in the EBNF ^^

👍
