non-ascii tag is not parsed #58

yqu212 · 2021-06-22T07:33:27Z

Describe the bug
Non-ASCII tags are not parsed.

To Reproduce
Steps to reproduce the behavior:

(read-str "* headline :标签:")
{:headlines
 [{:headline
   {:level 1,
    :title [[:text-normal "headline :标签:"]],
    :planning [],
    :tags []}}]}

Expected behavior

{:headlines
 [{:headline
   {:level 1,
    :title [[:text-normal "headline"]],
    :planning [],
    :tags ["标签"]}}]}

Additional context
[org-parser "0.1.24"]

schoettl · 2021-06-22T07:46:04Z

Thanks for the report.

TAGS is made of words containing any alpha-numeric character, underscore, at sign, hash sign or percent sign, and separated with colons.

https://orgmode.org/worg/dev/org-syntax.html

The regex for tag names is currently [a-zA-Z0-9_@#%] (see function extract-tags).

It must also include Unicode characters, but JavaScript regexes cannot do that. Only Java has such a character class.

If we invert the regex like [^ \t-.…] we would have to exclude too many characters.

Other ideas? Add unicode ranges next to a-zA-Z? That will get messy, too :/

PS: It would be interesting to know how Org mode does this. Maybe they have a special character class for Unicode chars.
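
A minimal sketch of the limitation described above, assuming the character class quoted from extract-tags is applied to a whole tag name. The helper name `isAsciiTag` is hypothetical and is not part of org-parser; it only illustrates why a tag such as 标签 is rejected.

```javascript
// Assumption: the class below is the one quoted in this comment;
// the anchors and the helper name are added for illustration only.
const ASCII_TAG = /^[a-zA-Z0-9_@#%]+$/;

// Returns true only if every character of the name is in the ASCII class.
function isAsciiTag(name) {
  return ASCII_TAG.test(name);
}

console.log(isAsciiTag("work"));       // true
console.log(isAsciiTag("tag_1@home")); // true
console.log(isAsciiTag("标签"));        // false: CJK letters are outside a-zA-Z
```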

yqu212 · 2021-06-22T08:20:06Z

Yes. Elisp has [:multibyte:].

Chinese is also not parsed by orgajs, another parser implemented in JavaScript.
https://github.com/orgapp/orgajs/blob/eac72e62b902b79289cfacd97e9bdf5e09bc9030/packages/orga/src/tokenize/headline.ts#L61

Maybe we can make org-parser support Java only for now?

p.s. Is this one useful?
https://stackoverflow.com/questions/21109011/javascript-unicode-string-chinese-character-but-no-punctuation

munen · 2021-06-22T08:57:54Z

Maybe we can make org-parser support Java only for now?

No. This would be very much against https://github.com/200ok-ch/org-parser/#what-does-this-project-do and https://github.com/200ok-ch/org-parser/#why-is-this-project-useful--rationale.

Having said that, JavaScript has "Unicode property escapes". Maybe we can use them for the a-zA-Z part of the regexp:

> ":标签:".match(/\p{Letter}+/gu)
[ '标签' ]

munen · 2021-06-22T09:01:19Z

Looks like this also works as part of a 'regular' regular expression (pardon the pun):

> ":标签:".match(/[\p{Letter}0-9_@#%]+/gu)
[ '标签' ]
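
Building on the character class above, here is a rough sketch of how trailing tags might be split off a headline. The function name `extractTags`, the returned shape, and the surrounding pattern are assumptions made for illustration; this is not org-parser's actual extract-tags implementation.

```javascript
// Hypothetical helper. Tag characters are any Unicode letter plus the
// ASCII digits and _ @ # % from the class shown in this thread.
// The `u` flag is required for \p{...} property escapes (ES2018+).
const TAGS_AT_END = /\s+:((?:[\p{Letter}0-9_@#%]+:)+)\s*$/u;

function extractTags(headlineText) {
  const match = TAGS_AT_END.exec(headlineText);
  if (match === null) {
    // No trailing :tag: group, so the whole text is the title.
    return { title: headlineText, tags: [] };
  }
  return {
    title: headlineText.slice(0, match.index),
    // match[1] looks like "标签:work:"; drop the empty piece after the last colon.
    tags: match[1].split(":").filter((tag) => tag.length > 0),
  };
}

console.log(extractTags("headline :标签:"));
// { title: 'headline', tags: [ '标签' ] }
```

Note that `\p{Letter}` covers letters only, so digits from other scripts would still be rejected by this class.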

munen · 2021-06-22T09:02:03Z

@yqu212 Do you want to make your first PR and include Chinese characters by employing the above regexp for CLJS and the equivalent for CLJ?

yqu212 · 2021-06-22T09:13:33Z

It's a good idea. However, it will take some time to write the test since I am not familiar with CLJS.

schoettl · 2021-06-22T09:19:25Z

Looks like it pays off that we are doing tag extraction in the transformation, not in the EBNF ^^

munen · 2021-06-22T09:20:19Z

@yqu212

It's a good idea. However, it will take some time to write the test since I am not familiar with CLJS.

No worries, take your time!

Good luck and enjoy 🙏

@schoettl

Looks like it pays off that we are doing tag extraction in the transformation, not in the EBNF ^^

👍

yqu212 added the bug label on Jun 22, 2021
