How to add a new wiki

Here are the steps for adding a new wiki to WikiChron (note that, by default, WikiChron already includes some wikis for testing and example purposes).

WikiChron can work with any kind of MediaWiki wiki as long as the input data is in the right format: a csv file with a specific structure and content. To produce it, we provide a script that transforms any MediaWiki XML dump into that csv input data file. Below you can find the steps to add a new wiki to your WikiChron instance.

Get the wiki history dump

First, you will need the xml file corresponding to the full revision history of the wiki.

If you have shell access to the server, the easiest way is to generate the dump manually with MediaWiki's dumpBackup.php maintenance script.
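
For instance, on a standard MediaWiki installation, something like the following should write the full-history dump to a file (the maintenance path and output filename here are placeholders):

php maintenance/dumpBackup.php --full > wiki_history.xml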

If you don't have shell access to the server, there are different options to download the dump depending on the hosting provider of your wiki:

  • Wikia wikis: Supposedly, Wikia automatically generates dumps and provides them for every wiki it hosts. However, there is a known bug in this generation that cuts off the output dumps for large wikis. The best option here is to use this nice script, which requests and downloads the complete xml dump through the Special:Export interface. Please keep in mind that this script does not download all the namespaces available when generating dumps for a wiki, but a wide subset of them (you can find more detailed info in its wiki).
  • Wikimedia project wikis: For wikis belonging to the Wikimedia projects, there is already a regularly updated repository with all the dumps: http://dumps.wikimedia.org. Select your target wiki from the list, then download and uncompress the complete edit history dump.
  • For other wikis, such as self-hosted wikis, you should use the wikiteam's dumpgenerator.py script. There is a simple tutorial in their wiki: https://github.com/WikiTeam/wikiteam/wiki/Tutorial#I_have_no_shell_access_to_server. Its usage is very straightforward and the script is well maintained. Remember to use the --xml option to download the full history dump (see the example below).
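
For example, a sketch of a dumpgenerator.py run for a self-hosted wiki (the api.php URL is a placeholder; check the tutorial linked above for the exact flags supported by your version of the script):

python dumpgenerator.py --api=https://your-wiki.example.com/api.php --xml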

If the dump comes split in several parts, remember to join them into a single file, and make sure it contains the full history of every page of the wiki you want to analyze, not just the current-revisions-only version of the dump.

Process the dump

Once you have your XML dump, you need to process it to get the corresponding .csv file. To do so, use the dump_parser script. It is listed as a requirement of WikiChron, but you can also install it standalone with pip install wiki-dump-parser. This script processes any MediaWiki dump and outputs a pre-processed, simplified csv file with all the information that WikiChron needs to draw its plots. Run the script using

python3 -m dump_parser data/<name_of_your.xml>

This will create the corresponding .csv file in your local data/ directory. If you have more than one XML file, run the script as follows:

python3 -m dump_parser data/*.xml
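
If you only need the parser itself, for example on a machine where the rest of WikiChron's requirements are not installed, the standalone installation mentioned above is simply:

pip install wiki-dump-parser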

NOTE: all this information can be found in the "Process the dump" subsection of the "XML Dumps" section of the README.

Modify the wikis.json file

As stated in the "provide some metadata of the wiki" section of the README, you need to provide some metadata about your wiki in the wikis.json file, such as the number of pages, the number of users, the user ids of the bots, etc.

You can edit this file by hand and fill in the corresponding data or, if you are using Wikia wikis or similarly compatible wikis, you can use the generate_wikis_json.py script written for this purpose.

This script takes a file called wikis.csv as input, which lists the urls of the wikis and the filenames of the csvs you want to add to WikiChron, then fetches the needed metadata and edits the wikis.json file accordingly. You can look at the wikis.csv shipped with WikiChron, which covers the wikis provided by default, as an example. You can simply append your wikis to that file, since the script won't overwrite the wikis.json data it has already fetched.
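
For illustration only, appending a new entry could look like the line below; the column layout is an assumption of this sketch, so check the bundled wikis.csv for the exact format before editing:

# hypothetical row: wiki url followed by the filename of its csv data file
echo "https://your-wiki.example.com,your_wiki_data.csv" >> wikis.csv

Once your wikis.csv file is set up, just run: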

python3 generate_wikis_json.py

Launch WikiChron and you should now see your new wikis added to the list. Note that you might need to restart WikiChron if it was already running before you added the new wiki.