Port over changes from explosion#1333 and add comments
ines committed Oct 14, 2017
1 parent a5da683 commit 09aed58
Showing 1 changed file with 9 additions and 1 deletion.
10 changes: 9 additions & 1 deletion spacy/lang/char_classes.py
@@ -29,11 +29,19 @@
'kg g mg µg t lb oz m/s km/h kmh mph hPa Pa mbar mb MB kb KB gb GB tb '
'TB T G M K %')
_currency = r'\$ £ € ¥ ฿ US\$ C\$ A\$'

# These expressions contain various unicode variations, including characters
# used in Chinese (see #1333, #1340, #1351) – unless there are cross-language
# conflicts, spaCy's base tokenizer should handle all of those by default
_punct = r'… …… , : ; \! \? ¿ ¡ \( \) \[ \] \{ \} < > _ # \* & 。 ？ ！ ， 、 ； ： ～ ·'
_quotes = r'\' \'\' " ” “ `` ` ‘ ´ ‚ , „ » «'
_quotes = r'\' \'\' " ” “ `` ` ‘ ´ ‘‘ ’’ ‚ , „ » « 「 」 『 』 （ ） 〔 〕 【 】 《 》 〈 〉'
_hyphens = '- – — -- --- —— ~'

# Various symbols like dingbats, but also emoji
# Details: https://www.compart.com/en/unicode/category/So
_other_symbols = r'[\p{So}]'


UNITS = merge_chars(_units)
CURRENCY = merge_chars(_currency)
QUOTES = merge_chars(_quotes)
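
The constants above are built by passing a space-separated string to `merge_chars`, whose definition is not part of this hunk. A minimal sketch, assuming the helper simply joins the tokens into a regex alternation (the helper body and the shortened `_quotes_subset` string are assumptions for illustration, not the file's actual contents):

```python
import re


def merge_chars(chars):
    # Assumed behavior: 'a b c' -> 'a|b|c', usable as a regex alternation.
    return chars.strip().replace(' ', '|')


# Hypothetical subset of the quote characters, including two CJK brackets.
_quotes_subset = r'\' " « » 「 」'
QUOTES_SUBSET = merge_chars(_quotes_subset)

# Each alternative matches one quote character in the text.
found = re.findall(QUOTES_SUBSET, '「hi」 «yo»')
print(QUOTES_SUBSET)
print(found)
```

Under this assumption, every token must already be a valid regex fragment, which is why characters such as `\(` and `\*` are escaped in the source strings.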

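The `_other_symbols` pattern uses the Unicode property class `\p{So}` ("Symbol, other"), which Python's built-in `re` module does not support, so the pattern presumably depends on a third-party engine such as `regex`. As a rough illustration of which characters fall into that category, the standard library's `unicodedata` can be used instead (the `is_other_symbol` helper and the sample characters are hypothetical):

```python
import unicodedata


def is_other_symbol(char):
    # True when the character's general category is "So" (Symbol, other).
    return unicodedata.category(char) == 'So'


# Dingbat, copyright sign, emoji, currency sign, math sign, letter.
samples = ['✂', '©', '😀', '$', '+', 'a']
for char in samples:
    print(repr(char), unicodedata.category(char), is_other_symbol(char))
```

Currency signs (`Sc`) and math symbols (`Sm`) are separate categories, so they are not covered by `\p{So}`.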