Please add https://wizards.com sites for MTG stories #1300

Closed
Darthagnon opened this issue May 5, 2024 · 5 comments

@Darthagnon
Contributor

I'm starting work on an archival project to convert Magic: The Gathering web fiction to EPUB (here and on my HDD), as it is slowly disappearing from the website with slapdash updates. WebToEpub is the best tool for the job, and I have been using it successfully with the default parser and the following settings:

New MTGStory:
URL structure: https://magic.wizards.com/en/news/magic-story/hero-iroas-2014-03-05
Include: #article-body
Title: #article-body > div > article > header > h1
Exclude: #article-body > div > aside, #article-body > div > article > div.css-AerwF

Old MTGStory:
URL structure: https://magic.wizards.com/en/articles/archive/magic-story/zendikars-last-stand-2016-02-17
Include: #main-content > article
Include (alternative, to exclude author): #content-detail-page-of-an-article
Title: #main-content > h1 
Exclude: #content > aside

Really old Magic Uncharted Realms:
URL structure: http://www.wizards.com/Magic/Magazine/Article.aspx?x=mtg/daily/ur/263
Unselect all except for "skip to content", as that is your article
Include: #content > div.center-content
Title: #ctl00_ctl00_ContentPlaceHolder1_mainContent_Article_headerPanel > div.description > h4
Exclude: #topNav, #leftColumn, #footerWrap, #ctl00_ctl00_ContentPlaceHolder1_mainContent_Article_footerPanel, #ctl00_ctl00_ContentPlaceHolder1_MagicTopNavigation_topNavigation, #ctl00_ctl00_ContentPlaceHolder1_mainContent_Article_socialbar, #ctl00_ctl00_ContentPlaceHolder1_mainContent_Article_HeadingLinks

mtglore.com:
Include: #content > div
Title: #content > div > div

The story is spread across four different URL/article structures, half of which survive only on the Internet Archive. Different chapters can exist under different structures, and the TOCs (if they exist) are not comprehensive.

My current workflow is to collect the Archive.org links for as many chapters as possible that share one website structure, because mixing and matching Includes and Excludes doesn't seem to work very well (or maybe I should just use more commas?). I test those, then edit the chapter list and paste in the links to what I actually want to download, e.g.

<a href="https://web.archive.org/web/20230208082604/https://magic.wizards.com/en/news/making-magic/planeswalkers-guide-theros-part-1-2013-08-21">01 Planeswalker's Guide to Theros, Part 1: The Plane of Theros</a>
<a href="https://web.archive.org/web/20230330201809/https://magic.wizards.com/en/news/making-magic/planeswalkers-guide-theros-part-2-2013-08-28">02 Planeswalker's Guide to Theros, Part 2: The Poleis</a>
<!--<a href="https://web.archive.org/web/20140302084755/http://www.wizards.com/Magic/Magazine/Article.aspx?x=mtg/daily/ur/263">03 The Lost Confession</a>-->
<a href="https://web.archive.org/web/20230620224550/https://magic.wizards.com/en/news/making-magic/planeswalkers-guide-theros-part-3-2013-09-04">04 Planeswalker's Guide to Theros, Part 3: Nonhuman Creatures</a>
<a href="https://web.archive.org/web/20230128080100/https://magic.wizards.com/en/news/feature/prince-anax-part-1-2013-09-18">05 Prince Anax, Part 1</a>
<a href="https://magic.wizards.com/en/news/feature/prince-anax-part-2-2013-09-23">06 Prince Anax, Part 2</a>
<a href="https://web.archive.org/web/20231208032539/https://magic.wizards.com/en/news/feature/nymphs-theros-2013-10-02">07 Nymphs of Theros</a>
<a href="https://web.archive.org/web/20230923144610/https://magic.wizards.com/en/news/feature/consequences-attraction-2013-10-09">08 The Consequences of Attraction</a>
<a href="https://web.archive.org/web/20230929173716/https://magic.wizards.com/en/news/feature/tragedy-2013-10-23">09 Tragedy</a>
<a href="https://magic.wizards.com/en/news/making-magic/unanswered-questions-theros-2013-11-04">10 Unanswered Questions: Theros</a>
<a href="https://magic.wizards.com/en/news/feature/i-iroan-2013-11-04">11 I Iroan</a>
<a href="https://web.archive.org/web/20230203095421/https://magic.wizards.com/en/news/feature/sea-gods-labyrinth-part-1-2013-11-13">12 The Sea God's Labyrinth, Part 1</a>
<a href="https://web.archive.org/web/20230208082828/https://magic.wizards.com/en/news/feature/sea-gods-labyrinth-part-2-2013-11-20">13 The Sea God's Labyrinth, Part 2</a>
<a href="https://magic.wizards.com/en/news/feature/building-toward-dream-part-1-2013-11-27">14 Building Toward a Dream, Part 1</a>
<a href="https://web.archive.org/web/20230127023202/https://magic.wizards.com/en/news/feature/building-toward-dream-part-2-2013-12-04">15 Building Toward a Dream, Part 2</a>
<a href="https://web.archive.org/web/20230202181952/https://magic.wizards.com/en/news/feature/asphodel-2013-12-11">16 Asphodel</a>
<a href="https://web.archive.org/web/20230924005808/https://magic.wizards.com/en/news/feature/planeswalkers-guide-born-gods-2014-01-08">17 Planeswalker's Guide to Born of the Gods</a>
<a href="https://magic.wizards.com/en/news/magic-story/cowardice-hero-2014-01-22">18 Cowardice of the Hero</a>
<a href="https://magic.wizards.com/en/news/magic-story/emonberry-red-2014-02-05">19 Emonberry Red</a>
<a href="https://magic.wizards.com/en/news/magic-story/kioras-followers-2014-02-14">20 Kiora's Followers</a>
<a href="https://magic.wizards.com/en/news/magic-story/dance-flitterstep-2014-02-19">21 Dance of the Flitterstep</a>
<a href="https://magic.wizards.com/en/news/magic-story/walls-akros-2014-02-26">22 The Walls of Akros</a>
<a href="https://magic.wizards.com/en/news/magic-story/hero-iroas-2014-03-05">23 The Hero of Iroas</a>
<a href="https://magic.wizards.com/en/news/magic-story/oracle-ephara-2014-03-19">24 The Oracle of Ephara</a>
<a href="https://web.archive.org/web/20211205115729/https://magic.wizards.com/en/articles/archive/uncharted-realms/seasons-setessa-2014-03-26">25 Seasons in Setessa</a>
<a href="https://web.archive.org/web/20230922141754/https://magic.wizards.com/en/news/feature/planeswalkers-guide-journey-nyx-2014-04-02">26 Planeswalker's Guide to Journey into Nyx</a>
<!--<a href="https://magic.wizards.com/en/news/magic-story/ajani-mentor-heroes-2014-12-17">27 Ajani, Mentor of Heroes</a>-->
<a href="https://web.archive.org/web/20211023194705/https://magic.wizards.com/en/articles/archive/uncharted-realms/labyrinth-labors-2014-04-16">28 The Labyrinth of Labors</a>
<a href="https://web.archive.org/web/20231204094100/https://magic.wizards.com/en/news/feature/desperate-stand-2014-04-16-0">29 Desperate Stand</a>
<!--<a href="https://magic.wizards.com/en/news/magic-story/dreams-city-2014-04-23">30 Dreams of the City</a>-->
<a href="https://web.archive.org/web/20230226010657/https://magic.wizards.com/en/news/magic-story/thank-gods-2014-04-30">31 Thank the Gods</a>
<a href="https://web.archive.org/web/20230926130458/https://magic.wizards.com/en/news/making-magic/journeys-end-2014-05-26">32 Journey's End</a>
<a href="https://web.archive.org/web/20230130113433/https://magic.wizards.com/en/news/magic-story/kruphixs-insight-2014-06-11">33 Kruphix's Insight</a>
<a href="https://web.archive.org/web/20221204012503/https://magic.wizards.com/en/news/feature/ajanis-vengeance-2014-07-23">34 Ajani's Vengeance</a>
<a href="https://web.archive.org/web/20230205144542/https://magic.wizards.com/en/news/magic-story/drop-drop-2015-05-20">35 Drop for Drop</a>
<a href="https://web.archive.org/web/20230131203628/https://magic.wizards.com/en/news/magic-story/its-time-talk-commander-2016-edition-2016-10-26">36 It's Time to Talk Commander (2016 Edition)!</a>

I am starting work on the parser, but was wondering whether there is a way for it to target different sites, ignore the TOC, and only request a manual list. My workflow would be improved if there were just a box for URLs and it could extract the titles from those, rather than my having to write an HTML chapter list with <a href="">Title here</a>. Is there a way to force this with a new parser?

Darthagnon changed the title from "Please add site https://wizards.com" to "Please add https://wizards.com sites for MTG stories" on May 5, 2024
@dteviot
Owner

dteviot commented May 7, 2024

@Darthagnon

I'm not quite sure what you're asking for.
Is it something that walks https://magic.wizards.com/en/news/archive, treating each article as a chapter to collect?

@Darthagnon
Contributor Author

Darthagnon commented May 8, 2024

Apologies, my explanation was rather confusing.

"Edit chapter URLs" is the view I use to collect chapter lists to convert to EPUB, because

  • the Wizards website is broken/useless/missing chapters, so there is no auto-parser that could work (EDIT: without too much work). An auto-parser would need to process https://magic.wizards.com/en/news/archive (2024), https://web.archive.org/web/2023mmddetc/https://magic.wizards.com/en/articles/archive (unreliable infinite scroller) and https://web.archive.org/web/2020mmddetc/http://www.wizards.com/Magic/Magazine/Archive.aspx (paginated, mostly 404s), https://web.archive.org/web/2014mmddetc/http://www.wizards.com/Magic/Magazine/Article.aspx
  • a lot of chapters are not story-related, so less useful for EPUB.

Questions

  1. Parser: Is it possible for a single parser to target multiple domains (e.g. archive.org, multiple versions of wizards.com) with separate filters on a per-chapter basis? Or are there any example site templates that are similar to this use case?
  2. Extension: Would it be possible to (optionally) open "Edit chapter URLs" by default, rather than auto-parsing all chapters? For the default parser and manual listing on weird random websites, auto-parsing doesn't seem to work very well, gathering lots of irrelevant navbar links. It would help my workflow if I could paste story links directly, rather than fiddling with the UI to eliminate irrelevant chapters.
  3. Extension: "Edit chapter URLs" requires specifying the title to use per chapter, as well as HTML tags, e.g. <a href="">Title here</a>. Could it be changed to just take a list of URLs? E.g. instead of
<a href="https://magic.wizards.com/en/news/magic-story/cowardice-hero-2014-01-22">18 Cowardice of the Hero</a>
<a href="https://magic.wizards.com/en/news/magic-story/emonberry-red-2014-02-05">19 Emonberry Red</a>
<a href="https://magic.wizards.com/en/news/magic-story/kioras-followers-2014-02-14">20 Kiora's Followers</a>

we could have

https://magic.wizards.com/en/news/magic-story/cowardice-hero-2014-01-22
https://magic.wizards.com/en/news/magic-story/emonberry-red-2014-02-05
https://magic.wizards.com/en/news/magic-story/kioras-followers-2014-02-14

... with the titles read according to the filter template into editable fields in the chapter list:

[Screenshot: chrome_240508_57]
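
Until something like that exists, the conversion from a plain URL list to the current chapter-list format can be scripted outside the extension. The following is only a minimal sketch under stated assumptions: `escapeHtml`, `placeholderTitle` and `urlsToChapterList` are hypothetical helper names, not part of WebToEpub, and the titles are guessed from the last path segment of each URL, so they are placeholders to correct by hand rather than the real article titles.

```javascript
// Hypothetical helpers (not part of WebToEpub): turn a pasted list of URLs,
// one per line, into the <a href="...">Title</a> lines that
// "Edit chapter URLs" currently expects.

// Escape characters that are unsafe inside HTML text or attribute values.
function escapeHtml(text) {
    return text
        .replace(/&/g, "&amp;")
        .replace(/</g, "&lt;")
        .replace(/>/g, "&gt;")
        .replace(/"/g, "&quot;");
}

// Guess a placeholder title from the last path segment of the URL.
// Assumption: slugs end in a date such as "-2014-01-22", as in the
// magic.wizards.com URLs listed in this issue.
function placeholderTitle(pageUrl) {
    const segments = new URL(pageUrl).pathname
        .split("/")
        .filter((segment) => segment !== "");
    const slug = segments.length > 0 ? segments[segments.length - 1] : "";
    return slug
        .replace(/-\d{4}-\d{2}-\d{2}(-\d+)?$/, "")
        .split("-")
        .filter((word) => word !== "")
        .map((word) => word[0].toUpperCase() + word.slice(1))
        .join(" ");
}

// Build one numbered anchor line per non-empty input line.
function urlsToChapterList(text) {
    return text
        .split(/\r?\n/)
        .map((line) => line.trim())
        .filter((line) => line !== "")
        .map((url, index) => {
            const number = String(index + 1).padStart(2, "0");
            const title = escapeHtml(placeholderTitle(url));
            return `<a href="${escapeHtml(url)}">${number} ${title}</a>`;
        })
        .join("\n");
}
```

Calling `urlsToChapterList` on the three URLs above would produce three anchor lines numbered 01 to 03 that can be pasted into "Edit chapter URLs". Because the slugs drop words such as "of the", the generated titles still need manual correction.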

Many thanks for any advice or help!

@dteviot
Owner

dteviot commented Sep 13, 2024

@Darthagnon

> Parser: Is it possible for a single parser to target multiple domains (e.g. archive.org, multiple versions of wizards.com) with separate filters on a per-chapter basis? Or are there any example site templates that are similar to this use case?

Yes. See https://github.com/dteviot/WebToEpub/blob/ExperimentalTabMode/plugin/js/parsers/NoblemtlParser.js, although it's not obvious how it works. An example is:

    findCoverImageUrl(dom) {
        return util.getFirstImgSrc(dom, ".thumbook, .sertothumb");
    }

This looks for an image using two CSS selectors, ".thumbook" and ".sertothumb", and picks the first match it finds. As the two sites have different layouts, only one selector will succeed on any given page.

An alternative way to handle multiple sites: the "dom" parameter holds the URL of the page in dom.baseURI, so you could extract the hostname from the URL and then switch the logic based on that.
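
The hostname-switching idea could look roughly like the sketch below. It is an illustration under assumptions, not code from WebToEpub: `unwrapArchiveUrl`, `pickLayout` and the `LAYOUTS` table are hypothetical names, the selectors are the ones reported earlier in this issue, and the path prefixes are inferred from the example URLs in this thread. Internet Archive copies are unwrapped first so that an archived page is matched by its original address.

```javascript
// Hypothetical helpers (not part of WebToEpub). Selectors come from the
// first comment in this issue; path prefixes are inferred from its URLs.
const LAYOUTS = [
    {
        name: "new magic-story",
        matches: (host, path) =>
            host === "magic.wizards.com" && path.startsWith("/en/news/"),
        content: "#article-body",
        title: "#article-body > div > article > header > h1",
    },
    {
        name: "old magic-story",
        matches: (host, path) =>
            host === "magic.wizards.com" &&
            path.startsWith("/en/articles/archive/"),
        content: "#main-content > article",
        title: "#main-content > h1",
    },
    {
        name: "Uncharted Realms (Article.aspx)",
        matches: (host) => host === "wizards.com",
        content: "#content > div.center-content",
        title: "#ctl00_ctl00_ContentPlaceHolder1_mainContent_Article_headerPanel > div.description > h4",
    },
    {
        name: "mtglore.com",
        matches: (host) => host === "mtglore.com",
        content: "#content > div",
        title: "#content > div > div",
    },
];

// If the page is a web.archive.org snapshot, return the original URL
// embedded in its path; otherwise return the URL unchanged.
function unwrapArchiveUrl(pageUrl) {
    const url = new URL(pageUrl);
    if (url.hostname !== "web.archive.org") {
        return pageUrl;
    }
    const match = (url.pathname + url.search)
        .match(/^\/web\/[^/]+\/(https?:\/\/.+)$/);
    return match ? match[1] : pageUrl;
}

// Pick the layout entry for a page URL (e.g. dom.baseURI), or null if
// the page belongs to none of the known structures.
function pickLayout(pageUrl) {
    const url = new URL(unwrapArchiveUrl(pageUrl));
    const host = url.hostname.replace(/^www\./, "");
    return LAYOUTS.find((layout) => layout.matches(host, url.pathname)) || null;
}
```

Inside a parser method that receives `dom`, one could then call `pickLayout(dom.baseURI)` and pass the returned `content` or `title` selector to `dom.querySelector`, falling back to some default when the result is `null`.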

That said, WebToEpub is supposed to check the URL for each page, and then select the appropriate parser even if the Table of Contents is a mixture of sites. So, you might not need a combined parser. Just write one for each site.

@gamebeaker
Collaborator

fixed in #1500
@Darthagnon
Updated version (1.0.0.0) has been submitted to Firefox and Chrome stores.
The Firefox/Chrome version is available now.
