Builds a Tagalog dictionary by collecting Tagalog words from tagalog.pinoydictionary.com

raymelon/tagalog-dictionary-scraper

Tagalog Dictionary Scraper 📒

Ating pag-ibayuhin ang ating talahuluganan! (Roughly: "Let us further enrich our dictionary!")

Collects Tagalog words from tagalog.pinoydictionary.com, a database of Tagalog words powered by Cyberspace.ph Web Hosting. This script uses a common web scraping technique known as HTML parsing.

42,723 words (as of Feb 19, 2023)

See the word list at tagalog_dict.txt

contributions welcome

API Resource

The scraped word lists are served through GitHub Pages and are accessible as REST resources.

Host

https://raymelon.github.io/tagalog-dictionary-scraper/

Method

GET

Resources Available

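| Resource | Display    | Endpoint                 |
|----------|------------|--------------------------|
| csv      | default    | /tagalog_dict.csv        |
| csv      | with lines | /tagalog_dict_lines.csv  |
| json     | default    | /tagalog_dict.json       |
| json     | with lines | /tagalog_dict_lines.json |
| txt      | default    | /tagalog_dict.txt        |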
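
As a rough illustration of how the host and endpoints above combine, here is a minimal sketch using only the Python standard library. The helper names `resource_url` and `fetch_text` are hypothetical and not part of this project, and the UTF-8 decoding is an assumption about how the files are encoded.

```python
from urllib.parse import urljoin
from urllib.request import urlopen

# Host taken from the "Host" section above.
HOST = "https://raymelon.github.io/tagalog-dictionary-scraper/"


def resource_url(endpoint: str) -> str:
    """Build the full URL for an endpoint such as "/tagalog_dict.txt".

    The leading slash is stripped so that urljoin appends the endpoint to
    the host path instead of replacing it.
    """
    return urljoin(HOST, endpoint.lstrip("/"))


def fetch_text(endpoint: str, timeout: float = 10.0) -> str:
    """Send a GET request and return the response body as text.

    Assumption: the resource is UTF-8 encoded.
    """
    with urlopen(resource_url(endpoint), timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling `fetch_text("/tagalog_dict.txt")` should return the raw body of the text resource. The sketch makes no assumption about the internal layout of the CSV or JSON files, so parsing them is left to the caller.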

How is it done? 💪

Each web page is loaded and parsed, and the words enclosed in <h2 class='word-entry'> tags are extracted.
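
The extraction step can be sketched roughly as follows. This is not the project's actual code: the project lists beautifulsoup4 as a tool, while this sketch swaps in the standard-library `html.parser` so it runs without extra dependencies. The class `WordEntryParser`, the function `extract_words`, and the sample HTML are hypothetical illustrations.

```python
from html.parser import HTMLParser


class WordEntryParser(HTMLParser):
    """Collects the text inside every <h2 class="word-entry"> element."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._inside = False  # True while inside a matching <h2>
        self._buffer = []     # text fragments of the current entry

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "h2" and "word-entry" in classes:
            self._inside = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "h2" and self._inside:
            self._inside = False
            word = "".join(self._buffer).strip()
            if word:
                self.words.append(word)

    def handle_data(self, data):
        # Text of nested tags (for example a link) is captured as well.
        if self._inside:
            self._buffer.append(data)


def extract_words(html: str) -> list:
    """Return the words found in <h2 class="word-entry"> tags, in page order."""
    parser = WordEntryParser()
    parser.feed(html)
    parser.close()
    return parser.words


# Hypothetical sample markup, not copied from the real site.
SAMPLE_HTML = """
<div>
  <h2 class="word-entry"><a href="#">aba</a></h2>
  <p>definition text</p>
  <h2 class="word-entry">abaka</h2>
  <h2 class="other">not a word</h2>
</div>
"""
```

With BeautifulSoup, the same selection would typically be written as `soup.find_all("h2", class_="word-entry")`.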

The repository includes an HTML snippet from tagalog.pinoydictionary.com, containing the source of http://tagalog.pinoydictionary.com/list/a/, as a point of reference for how dictionary words are extracted from a page.

Disclaimer: I do not own the HTML code cited above; it is owned by tagalog.pinoydictionary.com.

How did the project start? 💭

The main purpose of this project is to build a Tagalog dictionary database for Scrabble ®, though the word list can serve other uses as well.

Tools ✏️

  python -m pip install -U pip beautifulsoup4
  python -m pip install -U pip requests-futures

Notes 📌

License

GNU General Public License 3.0