Documentation: How to develop new supported site #5750
I'm working on one as well, and as far as I can tell it's all about cherry-picking from similar extractors. I know that's not a good answer, but it's all I've got so far.
I kind of want to rework (and hopefully improve) most of the current extractor infrastructure in v2.0, so I thought writing a guide on how to develop new extractors in the old style would be somewhat of a waste, and it therefore hasn't happened until now. Look at merged PRs and commits that add new extractors / support for a new site and adapt their code.
Thank you both for your comments and feedback.

Hey mikf, I totally understand your position and don't want to detract from your efforts to rework the infrastructure. Normally I would totally agree about the time/effort; it's just that some of the sites I'm looking at are removing content, so the longer it takes me, the fewer quality downloads I can add to my archive. I know you have priorities, and I'm not asking to be one of them. I appreciate the time you've put into this project!

I'm going to keep learning and working, and I'm hoping someone will help answer a few questions so I can keep trying to figure out the current extractors. I'll also gladly migrate to the new format as soon as it comes out. With that in mind, I'm still open to any guidance and suggestions to help steer my learning.

Input that could help me:

Thanks for any and all direction!
Quick edit; I think I answered one of my own questions.

I went through a good sampling of the existing extractors, and each seems to have its own taxonomy. Some of the common ones are:

So it seems this isn't something required by the code, just a term you apply to the site. It looks like gallery-dl will support whatever you want to call it.
Yes, basically. Usually, the naming in the extractor reflects the nomenclature used by the site the extractor is written for.
Not sure what you would need Docker for here. You only need Python and git (which is already included if you use something like https://github.com/apps/desktop, which would be the simplest way to do this).
Possible, although maybe not the best example to use as a starting point, because the directlink extractor is not really similar to any other extractor.
Uh, depends? 😄 To be sure, you would have to show us an example.
Yes, you have to add
This is the necessary part, but you should also add your extractor to Not sure if
If it still defaults to the directlink extractor, your test URL still seems to match the directlink pattern. You can check this with https://pythex.org/ or https://regex101.com/ (don't forget to set the regex flavor to Python first there). A somewhat simple example to start with would maybe be this one:
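A quick way to check pattern overlap locally, without a website, is plain `re`. This is a hedged sketch: the `DIRECTLINK` regex below is a simplified stand-in (gallery-dl's real directlink pattern is more involved), and `MYSITE` is a made-up pattern for a hypothetical new extractor. The point is that gallery-dl picks the first extractor whose pattern matches, so an overly broad earlier pattern can shadow yours:

```python
import re

# Simplified stand-in for a "direct image link" pattern; NOT the real
# gallery-dl directlink regex, just close enough to show the shadowing effect.
DIRECTLINK = re.compile(
    r"(?:https?://)?\S+\.(?:jpe?g|png|gif|webm|mp4)(?:\?\S*)?$",
    re.IGNORECASE,
)

# Hypothetical pattern for a new extractor (example.org is a placeholder).
MYSITE = re.compile(r"(?:https?://)?(?:www\.)?example\.org/gallery/(\d+)")

def first_match(url, patterns):
    """Return the name of the first pattern matching *url*,
    mimicking how an extractor is selected from an ordered list."""
    for name, pat in patterns:
        if pat.match(url):
            return name
    return None

patterns = [("directlink", DIRECTLINK), ("mysite", MYSITE)]
print(first_match("https://example.org/gallery/123", patterns))        # mysite
print(first_match("https://cdn.example.org/files/photo.jpg", patterns))  # directlink
```

If your test URL prints the wrong name here, the earlier pattern is claiming it before your extractor ever gets a chance.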
Thank you for your thoughts and answers, Hrxn!
Well, this was the best way I could figure out how to develop. If I installed gallery-dl with pip, then running gallery-dl would use that install, and not my modified/forked version. And if I just cloned it to a new environment, it wasn't "installed", so I couldn't run it. Docker is how I got a setup where I can modify the code and run the modified version. If I'm missing another option that's easier, I would love to hear it!
OK, that is key information that I missed before. Thank you, thank you!
Yeah, OK... Just reading these questions is pointing me in a good direction. I'll dig in from here and see where that takes me.
Once I add them, how can I trigger the test for that one extractor? Every time I try to use the commands I'm used to, it either tries to run all of the tests or none of them.
You can run Python code from source:
See https://docs.python.org/3/using/cmdline.html#cmdoption-m for details.
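A minimal sketch of what running from source can look like (the directory name and the unittest invocation are assumptions; check the repository's own docs for the exact test commands):

```shell
# assumes you already have a checkout of your fork
cd gallery-dl
# the -m flag runs the in-tree package instead of any pip-installed copy
python -m gallery_dl --version
# run the bundled tests via standard unittest discovery
python -m unittest discover -s test
```

Because `-m` resolves `gallery_dl` from the current directory first, edits to your fork take effect immediately with no install step.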
By running
is just that: documentation. The minimum requirement for an extractor to be recognized is an entry in the module list and a class like this:

```python
from .common import Extractor, Message
from .. import text


class ExampleTestExtractor(Extractor):
    category = "example"
    subcategory = "test"
    pattern = r"(?:https?://)?example\.org"

    def items(self):
        url = "https://www.iana.org/_img/2022/iana-logo-header.svg"
        data = text.nameext_from_url(url)
        yield Message.Directory, data
        yield Message.Url, url, data
```

Some simple, albeit older, examples from PRs would be b17e2dc and 2529781.
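For context (based on the current gallery-dl layout; verify against the repository before relying on it): a new extractor module is registered by adding its file name to the `modules` list in `gallery_dl/extractor/__init__.py`. An excerpt-style sketch, where `"example"` stands for a hypothetical new `extractor/example.py`:

```python
# Sketch of the modules list in gallery_dl/extractor/__init__.py.
# The real list is much longer; only the shape matters here.
modules = [
    "2chan",
    "35photo",
    # ... many more entries ...
    "example",   # <- your new module, kept in alphabetical order
    # ... many more entries ...
    "zerochan",
]
```

Without this entry, the class definition alone is never imported, so its `pattern` is never tried against URLs.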
Outstanding! Thanks so much for your help, you've gotten me over several hurdles! I got it working with a single URL. (Here's my working code if that helps.)

Now I need to figure out how the GalleryExtractor works. I see from the Catbox example (and others) that we're returning a dictionary with details in metadata(), and a URL. But I can't tell what is required and what's extraneous for that example. How can I tell what's necessary for the Album/Gallery extractor to identify links and send to the extractor? And, like, what is the

I've started putting my lessons learned into a wiki as a draft. Like I said, if you're going to change the way extractors work, then this is just a learning exercise for me. But if that rewrite is a ways out, maybe this can help other noob/part-time developers add some functionality.
I've gotten two going so far, but both are just "single image from an image page." I still need a hand figuring out how an Album/Gallery extractor works. Open to suggestions and feedback!
I am trying to write an extractor similar to the imagechest extractor. I'm able to call the API and get post data, but it aggregates all posts before moving on to grab the data of each post. Ideally, I'd like for it to scrape data from the posts in the given range, then move on to the next set of posts using the params.
Sounds like you should be using generators (functions that yield).
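To illustrate the suggestion: a generator hands each post to the caller as soon as its page is fetched, instead of aggregating everything first. This is a hedged sketch with made-up names (`fetch_page`, the `"posts"`/`"next"` response fields); substitute the real API's fields:

```python
def posts(fetch_page):
    """Yield posts page by page instead of collecting them all up front.

    fetch_page(params) -> {"posts": [...], "next": <next page or None>}
    (hypothetical API shape; adapt to the real response)
    """
    params = {"page": 1}
    while True:
        data = fetch_page(params)       # one API call per page
        for post in data["posts"]:
            yield post                  # hand out posts immediately
        if not data.get("next"):        # no further pages -> stop
            return
        params["page"] = data["next"]   # advance to the next page

# Minimal fake two-page API to show the lazy, page-by-page behavior:
PAGES = {
    1: {"posts": [{"id": 1}, {"id": 2}], "next": 2},
    2: {"posts": [{"id": 3}], "next": None},
}
ids = [p["id"] for p in posts(lambda params: PAGES[params["page"]])]
print(ids)  # [1, 2, 3]
```

Inside an extractor, `items()` can loop over such a generator and `yield Message.Url` per post, so downloading starts before the last page is ever requested.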
See attached
I tried to clean up your code a bit: https://gist.github.com/mikf/999147ca6c381a067c2d450ac3510ae9 This currently only prints IDs of accessible and inaccessible posts. What I've noticed:
One thing I'm trying is to incorporate debug logging into my extractors so I can see what they're doing. Example here: line 38 properly outputs the results. But it doesn't seem to be working when I try it with the Gallery Extractor. Example here: lines 24 and 35, where I attempt to see what's happening but just get a Traceback error. Any ideas how I can do some debugging and see what's going on?
You need to call

```diff
diff --git a/gwm.py b/gwm2.py
index a30b76b..d8c0890 100644
--- a/gwm.py
+++ b/gwm2.py
@@ -13,11 +13,14 @@ class GirlsWithMuscleGalleryExtractor(GalleryExtractor):
     """Extractor for catbox albums"""
     category = "gwm"
     subcategory = "album"
-    pattern = BASE_PATTERN + r"/images/\?name=[\w\s%]*"
+    pattern = BASE_PATTERN + r"/images/\?name=([^&#]+)"
     filename_fmt = "(unknown).{extension}"  # Not sure if this is used?
     directory_fmt = ("{category}", "{album_name} ({album_id})")  # Not sure if this is used?
     archive_fmt = "{album_id}_(unknown)"  # Not sure if this is used?

+    def __init__(self, match):
+        url = "https://www.girlswithmuscle.com/images/?name=" + match.group(1)
+        GalleryExtractor.__init__(self, match, url)
+
     def metadata(self, page):
         extr = text.extract_from(page)
```

Instead of
Thanks again mikf! Quick update; I now have it tentatively working. I certainly need to do more testing before submitting a PR, but I did some debug stepping and understand a lot more about what's going on.

I saw in various issues that you welcome documentation submissions, so I did some rewriting of the Wiki. It didn't go through a PR-like flow and ask for permission, it just allowed me to make the change; I hope I didn't overstep. Feel free to let me know if I'm going in a bad direction, and I'll be happy to rework it.

Also, if you're OK with it, I'd like to work on a detailed docstring PR for common.Extractor() and GalleryExtractor(). I think an explanation in there could help new developers understand what's going on and make it easier to spread the load of the extractor work. Something along the lines of the extractor comments in youtube-dl.
OK, I switched to working on another extractor and it's helping me see what I did wrong with my first pass at the documentation. So things are coming along in that department, though it is slow. Where I could use help: I have the Gallery extraction working, but only on the first page. How do I get it to recognize that there's a 2nd page and keep iterating? |
I think I'm good with this now. I see a new PR for the same site I was working on (#6016), and hunter-gatherer8 got the

I will continue working through more debugging and documentation and check back later, thanks!
Nice, that's good to hear.
So where exactly would you dump JSON data for each item?
Hey all!
I'd like to work on adding a new supported site. However, it's unclear to someone at my skill level how to do that.
I can write a web crawler, so I'm comfortable using requests and BeautifulSoup, but I don't know how to take that knowledge and integrate it with the gallery-dl classes.
If someone would be willing to jot down some notes and/or answer some questions, I'd be happy to write the steps out long-form so they could be added to the wiki.
When you find a new site and want to extend, what do you do first? What info do you need from the site to create an extractor?
If this is the wrong place to ask, feel free to let me know a better place!