Tool for offline Wikipedia search with WikiDumps / Archives

HappyBravo/WikiDump_Search

🔍 WIKIDUMP SEARCH

An offline tool for searching keywords in a Wikipedia archive (dump) instead of calling an online Wikipedia API.


🎯 BENEFITS

  • Useful when you need to look up many keywords in Wikipedia. Online Wikipedia API libraries may slow down after a few dozen calls.
  • The search is fully offline, so it works well on a slow internet connection.
  • Uses minimal system resources.

🛠️ REQUIREMENTS

Install the required packages using pip install -r "./requirements.txt"

You also need to download one dump (backup) from the wiki-archive page.


⚙️ SETUP

Download

  • enwiki-{date}-pages-articles-multistream.xml.bz2 (~23 GB)
  • enwiki-{date}-pages-articles-multistream-index.txt.bz2 (~250 MB)
    • Extract this file. It will contain enwiki-{date}-pages-articles-multistream-index.txt (~1.2 GB)

The paths to these files are required when initializing the offline wiki class.
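To illustrate why both files are needed, here is a minimal sketch of how a multistream dump can be queried offline. It is not necessarily how WikiDump_Search works internally, and the function names (extract_index, find_offset, read_stream, get_page_text) are hypothetical. It assumes the standard layout of Wikipedia multistream dumps: each index line has the form offset:page_id:title, and offset is the byte position of an independent bz2 stream in the .xml.bz2 file that holds a batch of pages.

```python
import bz2
import shutil
import xml.etree.ElementTree as ET


def extract_index(bz2_path, txt_path):
    """Decompress the index archive to plain text, streaming to keep memory low."""
    with bz2.open(bz2_path, "rb") as src, open(txt_path, "wb") as dst:
        shutil.copyfileobj(src, dst)


def find_offset(index_path, title):
    """Return the byte offset of the stream holding `title`, or None if absent."""
    with open(index_path, encoding="utf-8") as f:
        for line in f:
            # Each line is "offset:page_id:title"; titles may contain colons,
            # so split at most twice.
            offset, _page_id, page_title = line.rstrip("\n").split(":", 2)
            if page_title == title:
                return int(offset)
    return None


def read_stream(dump_path, offset, chunk_size=1 << 16):
    """Decompress only the single bz2 stream that starts at `offset`."""
    decompressor = bz2.BZ2Decompressor()
    parts = []
    with open(dump_path, "rb") as f:
        f.seek(offset)
        while not decompressor.eof:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            parts.append(decompressor.decompress(chunk))
    return b"".join(parts).decode("utf-8")


def get_page_text(dump_path, index_path, title):
    """Return the wikitext of the page named `title`, or None if not found."""
    offset = find_offset(index_path, title)
    if offset is None:
        return None
    # A stream is a run of <page> elements without a root, so wrap it first.
    root = ET.fromstring("<pages>" + read_stream(dump_path, offset) + "</pages>")
    for page in root.iter("page"):
        if page.findtext("title") == title:
            return page.findtext("revision/text")
    return None
```

With the files from the list above, a call might look like get_page_text("enwiki-{date}-pages-articles-multistream.xml.bz2", "enwiki-{date}-pages-articles-multistream-index.txt", "Python (programming language)"). Scanning the index line by line keeps memory use small but is slow when repeated for many keywords; loading the titles into a dictionary once would trade memory for speed.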


📝 EXAMPLE

See testing.ipynb