Query for set information #2

kevinlul · 2022-07-26T02:49:39Z

Collecting set information is the last piece for YAML Yugi to exceed parity with other solutions. Unlike the other data collected so far, which are contained in flat categories, sets are indexed on Yugipedia in hierarchical categories. This means that instead of a target category for sets directly containing an article about a set, categories may be nested. When querying the MediaWiki API, only the immediate members of a category are returned, including the names of child categories, but the members of those child categories are not returned. Therefore, new code is required in order to download entire category hierarchies and subscribe to updates on them. Category hierarchies are allowed to contain cycles, and while this is not expected of the categories for sets, our code should be correct even if cycles are encountered and not fall into an infinite loop.

Design

Either create a new script or extend the current full download script to recursively download a targeted category without falling into infinite loops. For example, after fetching https://yugipedia.com/api.php?action=query&redirects=true&generator=categorymembers&prop=revisions&rvprop=content&format=json&formatversion=2&gcmlimit=50&gcmtitle=Category:Yu-Gi-Oh!_Master_Duel_sets, the ns=14 category items in the response should be stored in an ordered set for follow-up requests once the current category is completely downloaded.
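As a rough sketch of the first step, one batch of results could be split into ordinary pages and child categories like this. The function name `split_members` is hypothetical, and the response shape (a `query.pages` list whose items carry `ns` and `title`) is an assumption based on the `formatversion=2` request above. Whether category pages themselves should also be saved is left open here.

```python
from __future__ import annotations

from typing import Any

CATEGORY_NAMESPACE = 14  # MediaWiki namespace number for Category: pages


def split_members(response: dict[str, Any]) -> tuple[list[dict[str, Any]], list[str]]:
    """Split one API response into non-category pages and child category titles.

    Hypothetical helper: assumes response["query"]["pages"] is a list of
    page objects, as returned with formatversion=2.
    """
    pages = response.get("query", {}).get("pages", [])
    articles = [page for page in pages if page.get("ns") != CATEGORY_NAMESPACE]
    child_categories = [
        page["title"] for page in pages if page.get("ns") == CATEGORY_NAMESPACE
    ]
    return articles, child_categories
```

The child category titles returned here are what would be queued for the follow-up requests.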

To subscribe to incremental updates, the existing script can be used, but each time, it should be called with all the known descendant categories cached from the last full download, in addition to the top-level category itself. This is because the MediaWiki API only provides the immediate parent categories of an article, not all ancestor categories.
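One way to picture why the cached descendant list is needed: a changed page only reports its immediate parent categories, so membership in the hierarchy has to be checked against every known category. The sketch below is an assumption about how such a check could look, not how the existing script works. `in_tracked_hierarchy` is a hypothetical name, and the `categories` list on each page assumes the request used `prop=categories` with `formatversion=2`.

```python
from __future__ import annotations

from typing import Any, Iterable


def in_tracked_hierarchy(page: dict[str, Any], known_categories: Iterable[str]) -> bool:
    """Return True if any immediate parent category of the page is tracked.

    known_categories is the top-level category plus all descendants cached
    from the last full download (hypothetical input).
    """
    parents = {category["title"] for category in page.get("categories", [])}
    return not parents.isdisjoint(known_categories)
```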

Subtasks

Create or extend full download script to be recursive
Test full download of https://yugipedia.com/wiki/Category:Yu-Gi-Oh!_Master_Duel_sets
Set up incremental updates for Yu-Gi-Oh!_Master_Duel_sets
Set up regular full downloads and incremental updates for other game formats

kevinlul · 2022-10-18T03:54:37Z

Recursive on

Top level for exploration: https://yugipedia.com/wiki/Category:Sets

kevinlul · 2024-04-11T03:21:02Z

Notes:

Recursive full download needs to return a list of found categories (ns=14) after each page downloaded
In the main loop, this is appended to an OrderedSet
There's an additional outer loop iterating over the OrderedSet, thus fetching all categories, without infinite recursion in the case of cycles, because the category will already be in the set and have been iterated
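A minimal sketch of that outer loop, assuming a hypothetical callable `fetch_child_categories` that downloads one category and returns the titles of the child categories it found. A list plus a set stand in for the OrderedSet here; neither name comes from the repository.

```python
from __future__ import annotations

from typing import Callable, Iterable


def walk_categories(
    root: str, fetch_child_categories: Callable[[str], Iterable[str]]
) -> list[str]:
    """Visit every category reachable from root exactly once, in discovery order."""
    order = [root]  # insertion order, doubles as the work queue
    seen = {root}  # fast membership test to break cycles
    index = 0
    while index < len(order):
        current = order[index]
        index += 1
        for child in fetch_child_categories(current):
            if child not in seen:
                seen.add(child)
                order.append(child)
    return order
```

Because a category is only appended the first time it is seen, a cycle just produces titles that are already in the set and the loop terminates.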

Should I just add the second return value for the list of categories or switch to OOP?

xyj-3 · 2024-07-20T01:55:03Z

Is OrderedSet a specific thing? Also what do you mean by "add the second return value for the list of categories or switch to OOP"?

Also do you want the new downloaded files to be flat in the top level category folder or nested?

xyj-3 · 2024-07-21T03:42:06Z

How do gcmcontinue and grccontinue work? When are you using them, and what value do you give them?

kevinlul · 2024-07-22T00:53:42Z

I'm describing the changes that need to happen to the main logic in the download function in https://github.com/DawnbrandBots/yaml-yugipedia/blob/master/src/utils.py

Currently the category is specified to the MediaWiki API by the gcmtitle URL parameter in main.py. However, this only retrieves direct members of the category, so the download logic needs to keep track of child category pages that were retrieved, to be downloaded by another request to the MediaWiki API. I mentioned an OrderedSet because that is one way to keep track of the categories already downloaded and newly discovered in order to avoid infinite looping.

gcmcontinue and grccontinue are pagination tokens that the MediaWiki API includes in its response when a generator is used and the results don't fit on a single page. In the download scripts, the token is populated from the previous response so that all pages are downloaded, but it can also be provided on the command line to restart a previous set of downloads from the middle. The parameter varies by script. For main.py, the generator categorymembers is used, so the parameter is gcmcontinue. For incremental.py, the generator recentchanges is used, so the parameter is grccontinue.
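To illustrate, a pagination loop could look roughly like the sketch below. It is not the code in utils.py: `iter_batches` and the injected `get` callable (which would perform the HTTP request and return parsed JSON) are hypothetical. It assumes the response carries a `continue` object whose keys, such as gcmcontinue, are merged into the next request.

```python
from __future__ import annotations

from typing import Any, Callable, Iterator


def iter_batches(
    params: dict[str, str],
    get: Callable[[dict[str, str]], dict[str, Any]],
) -> Iterator[dict[str, Any]]:
    """Yield each page of results, following continuation tokens until done."""
    request = dict(params)
    while True:
        response = get(request)
        yield response
        token = response.get("continue")
        if not token:
            return  # no continuation object means this was the last page
        # Carry every continuation key (e.g. gcmcontinue) into the next request.
        request = {**params, **token}
```

Passing a saved gcmcontinue value in `params` would resume from the middle, since the first request is sent with whatever the caller supplied.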

https://yugipedia.com/api.php
https://www.mediawiki.org/wiki/API:Query#query:generator
https://www.mediawiki.org/w/api.php?action=help&modules=query%2Bcategorymembers
https://www.mediawiki.org/w/api.php?action=help&modules=query%2Brecentchanges

xyj-3 · 2024-07-24T21:23:26Z

I got it to work with recursion and a second return value but it doesn't look that nice right now so I'm considering restructuring it.

The biggest issue so far is actually getting an identifier for the top category for preventing loops. The generator=categorymembers doesn't return any info about the category itself.

I figure you can use pageid or title to track if there is looping. So in that case it looks like you either have to

Change your categories.txt to be a list of pageids
Make a request for every category to get the pageid/title
Do some string manipulation to transform between "TCG_Speed_Duel_Forbidden_%26_Limited_Lists" and "Category:TCG Speed Duel Forbidden & Limited Lists"
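For reference, the string manipulation option might be as small as the sketch below. `category_title` is a hypothetical helper, and it assumes entries are percent-encoded with underscores for spaces, as in the example above. It does not attempt MediaWiki's other title normalization rules, so titles returned by the API could still differ in edge cases.

```python
from urllib.parse import unquote


def category_title(entry: str) -> str:
    """Convert a URL-style category name into a 'Category:...' page title."""
    title = unquote(entry).replace("_", " ")
    if not title.startswith("Category:"):
        title = "Category:" + title
    return title


print(category_title("TCG_Speed_Duel_Forbidden_%26_Limited_Lists"))
# prints: Category:TCG Speed Duel Forbidden & Limited Lists
```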

What do you think? Do you have any preference? Otherwise I'll probably go with making another request for every category.

kevinlul · 2024-07-24T21:51:13Z

Feel free to restructure. I already anticipated it would be necessary and there's actually very little code in this repository. The only interface that needs to be respected for full downloads is the command-line interface. Everything else is an implementation detail.

kevinlul mentioned this issue Aug 31, 2022

Forbidden & Limited Lists DawnbrandBots/yaml-yugi#8

Closed

kevinlul mentioned this issue Oct 6, 2023

Add data for card set information DawnbrandBots/yaml-yugi#76

Open

kevinlul mentioned this issue Nov 10, 2023

Generate full historical card pools based on date DawnbrandBots/yaml-yugi-limit-regulation#8

Open

kevinlul mentioned this issue Aug 28, 2024

Add more filed for Master Duel card data DawnbrandBots/yaml-yugi#159

Closed
