Hyphens and spaces omitted in scraped words #3

Open
dkalantzi opened this issue Apr 26, 2024 · 1 comment

Comments

@dkalantzi

Hello,

Thank you for this very useful resource. I've noticed two potential issues with the words in the scraped data:

  • Hyphen omission: Hyphens from the original dictionary entries seem to be missing. For example, the first word in the txt file is aalugalog, but the entry in the dictionary is aalug-alog.
  • Space omission: Spaces between words in phrases from the original dictionary also seem to be missing, so phrases are scraped as single words. For example, patay na Bulan is scraped as pataynaBulan, and patay na hayop is scraped as pataynahayop.

Kind regards,
Dimitra

@nikitimi

nikitimi commented Sep 20, 2024

Hi @dkalantzi, I'm building an application like Duolingo, but for the native dialects here in the Philippines.
I came across this Tagalog web scraper by @raymelon, so here is my response regarding your issue:

You can change the regular expression on line 161 of collect_tagalog.py. Replace:

all_words.append(re.compile('[^a-zA-Z]').sub('', word.next.next))

with:

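all_words.append(re.compile(r'[ ]?\(+?[\s\w\d\W]+\)').sub('', word.next.next).lower())

  • This pattern removes only parenthesized text (plus one optional leading space) instead of every non-letter character, so hyphens and spaces in the entries are kept.
  • '[ ]?' matches an optional single space. \s would also work, but it matches any whitespace character, not only a space.
  • \s, \w, \d and \W are character classes, not flags: \w matches word characters (letters, digits, underscore), \d matches digits, and \W matches non-word characters such as '&'. Combined, [\s\w\d\W]+ matches any run of characters up to the last closing parenthesis.
  • The r prefix makes the pattern a raw string, so Python does not treat backslashes such as \( as string escapes.
  • Lastly, all words are converted to lowercase with the lower() method.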
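
To see the difference between the two patterns, here is a minimal sketch that applies both to a few strings. The sample strings (including the parenthesized note) and the helper names clean_original and clean_modified are hypothetical and only for illustration; the real input is whatever word.next.next returns in collect_tagalog.py.

```python
import re

# Pattern from the original line 161: deletes every character that is not
# an ASCII letter, which is why hyphens and spaces disappear.
ORIGINAL = re.compile(r'[^a-zA-Z]')

# Modified pattern from this comment: deletes only parenthesized text,
# together with one optional space in front of it.
MODIFIED = re.compile(r'[ ]?\(+?[\s\w\d\W]+\)')


def clean_original(text):
    """Hypothetical helper mirroring the original substitution."""
    return ORIGINAL.sub('', text)


def clean_modified(text):
    """Hypothetical helper mirroring the modified substitution plus lower()."""
    return MODIFIED.sub('', text).lower()


# Hypothetical sample entries; the last one carries a made-up annotation.
samples = ['aalug-alog', 'patay na Bulan', 'patay na hayop (sample note)']

for entry in samples:
    print(repr(entry), '->', repr(clean_original(entry)), '|', repr(clean_modified(entry)))
```

One caveat: [\s\w\d\W]+ is greedy, so if an entry contains more than one parenthesized group, everything from the first opening parenthesis to the last closing one is removed.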

Check RegExr's RegEx reference for more information about Regular Expressions.

Here is the result of scraped words with the modified RegEx above.
tagalog_dict.txt

You can adjust the regex to get results that are to your liking.
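
As one possible adjustment (this is not what the scraper currently does, just a sketch), you could keep the original "letters only" idea but add the hyphen and the space to the allowed characters. The name clean_entry and the sample strings are hypothetical.

```python
import re

# Assumption: keep ASCII letters, hyphens and spaces; delete everything else.
# Like the original pattern, this also drops non-ASCII letters such as "ñ".
NOT_ALLOWED = re.compile(r'[^a-zA-Z\- ]')


def clean_entry(text):
    """Hypothetical helper: strip unwanted characters, then tidy up spaces."""
    cleaned = NOT_ALLOWED.sub('', text)
    # Collapse runs of spaces left behind by removed characters and trim the ends.
    return re.sub(r' {2,}', ' ', cleaned).strip()


print(clean_entry('aalug-alog'))
print(clean_entry('patay na Bulan'))
```

Unlike the modified regex above, this variant does not lowercase the result; append .lower() if you want that too.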
