Let us enrich our dictionary!
Collects Tagalog words from tagalog.pinoydictionary.com, a database of Tagalog words powered by Cyberspace.ph Web Hosting. This script uses a common web scraping technique known as HTML parsing.
See the word list at tagalog_dict.txt
The scraped words are served through GitHub Pages and are accessible as REST-style resources.
Host
https://raymelon.github.io/tagalog-dictionary-scraper/
Method
GET
Resources Available
Resource | Display | Endpoint |
---|---|---|
csv | default | /tagalog_dict.csv |
csv | with lines | /tagalog_dict_lines.csv |
json | default | /tagalog_dict.json |
json | with lines | /tagalog_dict_lines.json |
txt | default | /tagalog_dict.txt |
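As a rough illustration of how the resources above could be consumed, the sketch below builds the full URL for a resource and downloads it with a plain GET. The helper names `endpoint_url` and `fetch_text` are hypothetical and not part of this project, and the UTF-8 decoding is an assumption about how the files are served.

```python
from urllib.request import urlopen

HOST = "https://raymelon.github.io/tagalog-dictionary-scraper"

# (format, with_lines) -> endpoint path, copied from the resource table.
ENDPOINTS = {
    ("csv", False): "/tagalog_dict.csv",
    ("csv", True): "/tagalog_dict_lines.csv",
    ("json", False): "/tagalog_dict.json",
    ("json", True): "/tagalog_dict_lines.json",
    ("txt", False): "/tagalog_dict.txt",
}


def endpoint_url(fmt, with_lines=False):
    """Return the full URL of a listed resource (hypothetical helper)."""
    try:
        return HOST + ENDPOINTS[(fmt, with_lines)]
    except KeyError:
        message = "no {!r} resource with with_lines={}".format(fmt, with_lines)
        raise ValueError(message) from None


def fetch_text(url, timeout=10):
    """GET a resource and decode it as text (UTF-8 is an assumption)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Example (performs a network request, so it is left commented out):
# print(fetch_text(endpoint_url("txt"))[:200])
```

Only the combinations listed in the table are accepted; asking for a `txt` resource "with lines" raises `ValueError` because no such endpoint is listed.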
Each webpage is loaded and parsed, extracting the words enclosed in the <h2 class='word-entry'> tag.
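The extraction step can be sketched roughly as follows. The project itself relies on beautifulsoup4; this sketch swaps in the standard library's `html.parser` so it runs without extra packages. The class `WordEntryParser`, the function `extract_words`, and the sample markup are all made up for illustration and are not taken from collect_tagalog.py or from the real page.

```python
from html.parser import HTMLParser


class WordEntryParser(HTMLParser):
    """Collects the text found inside every <h2 class='word-entry'> element."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._in_entry = False  # True while inside a matching <h2>
        self._chunks = []       # text fragments of the current entry

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "h2" and "word-entry" in classes:
            self._in_entry = True
            self._chunks = []

    def handle_data(self, data):
        # Text inside nested tags (for example a link) is collected too.
        if self._in_entry:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "h2" and self._in_entry:
            word = "".join(self._chunks).strip()
            if word:
                self.words.append(word)
            self._in_entry = False


def extract_words(html):
    """Return the words of all word-entry headings, in document order."""
    parser = WordEntryParser()
    parser.feed(html)
    parser.close()
    return parser.words


# Made-up markup; the real page structure may differ.
SAMPLE = """
<h2 class='word-entry'><a href="#">aba</a></h2>
<p>definition text</p>
<h2 class='word-entry'>abaka</h2>
<h2 class='other'>not an entry</h2>
"""

words = extract_words(SAMPLE)
```

Headings with a different class are skipped, so only the two word-entry headings in the sample contribute to `words`.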
Included is a tagalog.pinoydictionary.com HTML snippet containing the source of http://tagalog.pinoydictionary.com/list/a/, to serve as a point of reference for how dictionary words are extracted from the page.
Disclaimer:
I do not own the HTML code cited above; it is owned by tagalog.pinoydictionary.com.
The main purpose of this project is to provide a Scrabble ® Tagalog dictionary database, though the word list may be put to other uses.
- Python3 v3.5+ 🐍
- beautifulsoup4 v4.5.1 🍜 📦 for parsing html pages
python -m pip install -U pip beautifulsoup4
- requests-futures v1.0.0 ⚡ for request concurrency
python -m pip install -U pip requests-futures
- Run the scraper script collect_tagalog.py
- See the output of collected words at tagalog_dict.txt
- Match the max_workers value to the CPU and network capacity of the environment. See the comment for estimated values and expected download rates.
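To show what max_workers controls, here is a minimal sketch of bounded concurrent fetching. It uses the standard library's `ThreadPoolExecutor` directly instead of requests-futures, so it is not the script's actual code. The names `fetch_all` and `fake_fetch`, the default of 8 workers, and the list URLs for letters other than "a" are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, max_workers=8):
    """Call fetch(url) for every URL with at most max_workers threads.

    Results are returned in the same order as urls.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))


def fake_fetch(url):
    # Hypothetical stand-in for a real HTTP GET so the sketch runs offline.
    return "<html>{}</html>".format(url)


# Only /list/a/ appears in this README; the other letters are assumed.
urls = [
    "http://tagalog.pinoydictionary.com/list/{}/".format(letter)
    for letter in "abk"
]
pages = fetch_all(urls, fake_fetch, max_workers=2)
```

A higher max_workers allows more requests in flight at once, which helps only as long as the CPU, the network link, and the remote server can keep up.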