"Apply rules" instead of overrides #77
Comments
I'm also slightly in favor of going with Approach 1, since the API is quite direct. The idea of simply specifying the item that you want, rather than the PageObject needed to extract it, is quite attractive. ✨ I think one of the common use cases would be extracting a subset of the fields, which is a problem as you've pointed out. I'm curious how we may be able to solve it. Adding fields is quite straightforward, since a new item type can be created containing the new fields.
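For illustration, a minimal sketch of what "adding fields via a new item type" could look like (`ExtendedProduct` and the attrs-based items are assumptions for this sketch, not anything defined in the thread):

```python
import attrs

@attrs.define
class Product:
    name: str
    price: str

@attrs.define
class ExtendedProduct(Product):
    # The extra field lives on a new item type; Product stays untouched.
    color: str
```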
You mentioned the following idea above has some challenges. Could you expand on the specific issues surrounding it? I was under the impression that this "reverse registry" works almost the same as the one in Approach 2 (but with the pairs reversed). In any case, I think using …
Take my feedback with a grain of salt; I feel quite unfamiliar with the use of ….

My thoughts: …
I'm trying to recall what that was, but can't :) So, likely you're right - the same challenges as with Approach 2.
get_origin and get_args are not available in Python 3.7, so we'd need to have some backports anyway.
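A sketch of the kind of shim this would need, assuming typing_extensions is available on 3.7 (`get_item_type` is a hypothetical helper, shown only to illustrate how the item type could be read off an `ItemPage[Product]` subclass):

```python
import sys

if sys.version_info >= (3, 8):
    from typing import get_args, get_origin
else:  # Python 3.7: backported via typing_extensions
    from typing_extensions import get_args, get_origin

def get_item_type(page_cls):
    # Inspect the generic bases (e.g. ItemPage[Product]) of a page
    # object class and return the item type argument, if any.
    for base in getattr(page_cls, "__orig_bases__", ()):
        if get_origin(base) is not None:
            args = get_args(base)
            if args:
                return args[0]
    return None
```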
Most likely, it's a ProductPage subclass. So far, we've been using exact classes everywhere (scrapy-poet, andi), without allowing subclasses. So, while possible, I think there would be challenges and edge cases which we've been avoiding so far. For example, a subclass may return a different item type, so this "implicit" binding to ProductPage could be wrong. But I'm not sure.
We might need something like Annotated anyway. If a Product instance is returned by an HTTP API, and the HTTP API allows picking fields, how would you specify a Page Object dependency for that?
This "where needed" is an issue. With the common syntax, you need to use This is solvable though - if I'm not mistaken, it should be possible to use
+1.
I was thinking you would ask for the page object instead, and use page params. But I am not sure anymore. My lack of love for …
Oh, right - the page object and the spider may be defined by separate people on separate projects, which is the whole point of web-poet to begin with.
I see how this kind of defeats the benefits of Approach 1.
After re-reading it: …

So, now it's +0.75 from me for Approach 1, not +0.25 :)
We have it released.
hey! I’m exploring how to make the following work:
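The code block from the original issue body isn't preserved in this copy; a sketch of the kind of usage being described, using the names from the discussion (Product is a stand-in item class):

```python
import attrs
from web_poet import ItemPage, field, handle_urls

@attrs.define
class Product:
    name: str

# Desired: no instead_of, no base page object - just "this page
# returns Product on example.com".
@handle_urls("example.com")
class ExamplePage(ItemPage[Product]):
    @field
    def name(self) -> str:
        return "example product"
```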
In other words, how to make `@handle_urls("example.com")` work for an `ItemPage[Product]` subclass, without a need to use `instead_of` in `handle_urls`, and without a need to use a base page object for `instead_of`.

I can see 2 main approaches here.
Approach 1: support items directly
In the example, `handle_urls` doesn't really define any override rule. Instead, we have a declaration that ExamplePage can return a Product item for example.com pages. This information should be enough to allow creation of a scrapy-poet provider for items: we know the website, we know which item is needed, and we can use the Page Object registry to find the right page object, according to domain and priority.
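A rough sketch of the lookup such a provider could do (`rule.matches(url)` and `rule.priority` are hypothetical helpers; a real implementation would go through scrapy-poet's provider machinery and url-matcher):

```python
def find_page_object(item_type, url, rules):
    """Pick the page object class that returns `item_type` for `url`;
    among matching rules, the highest-priority one wins."""
    candidates = [
        rule for rule in rules
        if rule.to_return is item_type and rule.matches(url)
    ]
    if not candidates:
        raise LookupError(f"no page object returns {item_type!r} for {url}")
    return max(candidates, key=lambda rule: rule.priority).use
```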
To implement it, web-poet needs to be slightly refactored:
- In addition to `ApplyRule(for_patterns=..., use=ExampleProductPage, instead_of=GenericProductPage)`, it'd be possible to specify `ApplyRule(for_patterns=..., use=ExampleProductPage, to_return=Product)`.
- The `handle_urls` decorator would pick up `to_return=Product` from the `ItemPage[Product]` base automatically (probably unless the `instead_of` argument is defined? but that's not clear - there are use cases for supporting both at the same time).

When implementing it, we should make sure that priorities work as intended. For example, it should be possible to define and configure a Page Object which provides Product instances using the AutoExtract API, and have it behave in a sensible way (enabled for all domains; custom Page Objects should take priority over this default one).
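A sketch of the two rule forms from the list above, using the `ApplyRule` spelling proposed here (class names are placeholders; `to_return` is the proposed extension, not a given):

```python
import attrs
from web_poet import ApplyRule, ItemPage

@attrs.define
class Product:
    name: str

class GenericProductPage(ItemPage[Product]):
    ...

class ExampleProductPage(GenericProductPage):
    ...

# Today's form: an override rule replacing one page object with another.
override_rule = ApplyRule(
    "example.com", use=ExampleProductPage, instead_of=GenericProductPage
)

# Proposed form: a declaration that ExampleProductPage returns Product
# items on example.com; no overridden page object involved.
item_rule = ApplyRule(
    "example.com", use=ExampleProductPage, to_return=Product
)
```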
Approach 2: define standard generic Page Object classes
In this case, "override rules" stay override rules. There is a registry of `{item_type: generic_page_object_class}` defined somewhere, e.g. `{Product: ProductPage}`. Or, maybe, `{Product: [ProductPage, AnotherGenericProductPage]}`. There should be an API to extend and modify this registry. `handle_urls` looks up this registry, and picks the `instead_of` argument based on it.
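A sketch of what this registry and lookup might look like (`handle_urls2`, `RULES` and `DEFAULT_PAGES` are illustrative names, not an existing API):

```python
import attrs
from web_poet import ItemPage

@attrs.define
class Product:
    name: str

class ProductPage(ItemPage[Product]):
    """Standard, generic page object for Product."""

# Registry of {item_type: generic_page_object_class}; web-poet would
# expose an API to extend and modify it.
DEFAULT_PAGES = {Product: ProductPage}

RULES = []

def handle_urls2(patterns, item_type):
    # Hypothetical handle_urls variant: derives instead_of from the
    # registry instead of taking it as an explicit argument.
    def decorator(page_cls):
        RULES.append((patterns, page_cls, DEFAULT_PAGES[item_type]))
        return page_cls
    return decorator

@handle_urls2("example.com", Product)
class ExampleProductPage(ProductPage):
    ...
```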
Pros/cons

Overall, I like the approach with items more, because it seems to provide a nicer API.
A risk here is that we may decide that standardized page objects are actually very important. For example, unlike items, they allow extracting only some of the fields. They may also provide methods other than `to_item` which can be used in overrides.
A challenge with standard base classes is how much logic to put there. On one hand, there is a lot of useful logic which can be put in a base class, and which can save developer time - for example, default implementations for some of the attributes, or helper methods. But the more powerful your base class is, the more it needs to assume about the implementation. So, it might be wise to have these "standard" page objects mostly as "markers", to ensure that a wide range of page objects are compatible with each other. But, if we do so, we still need to put the "powerful" logic somewhere.
If you have a separate "marker" base class used for overrides, and a feature-full base class used in a project, usage becomes less straightforward - you'd probably need to do something like the following, and the original challenge is not solved:
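The original snippet is lost; a sketch of the kind of setup meant here (class names are illustrative, and the `instead_of` keyword follows this proposal's spelling):

```python
import attrs
from web_poet import ItemPage, handle_urls

@attrs.define
class Product:
    name: str

class ProductPage(ItemPage[Product]):
    """Standard "marker" base class, used for overrides only."""

class PowerfulProductPage(ProductPage):
    """Feature-full base class used in a project: helpers, defaults."""
    def clean_name(self, value: str) -> str:
        return value.strip()

# instead_of still has to point at the marker class explicitly,
# so the original challenge (avoiding instead_of) is not solved.
@handle_urls("example.com", instead_of=ProductPage)
class ExampleProductPage(PowerfulProductPage):
    ...
```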
An alternative is:
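Again a sketch, assuming the feature-full class is made the override target directly:

```python
import attrs
import scrapy
from web_poet import ItemPage, handle_urls

@attrs.define
class Product:
    name: str

class PowerfulProductPage(ItemPage[Product]):
    """Feature-full base class; here it is also the override target."""

@handle_urls("example.com", instead_of=PowerfulProductPage)
class ExampleProductPage(PowerfulProductPage):
    ...

class GenericSpider(scrapy.Spider):
    name = "generic"

    # The generic spider asks for the feature-full class directly,
    # so only PowerfulProductPage subclasses can be plugged in.
    def parse_product(self, response, page: PowerfulProductPage):
        ...
```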
But it defeats the purpose of defining a standard ProductPage which is not tied to a project-specific spider - the generic spider above doesn't support pages which are just ProductPage subclasses; it needs its own PowerfulProductPage subclasses. It also requires solving the case when a page object which uses `handle_urls` is not a strict subclass of PowerfulProductPage.
With "items" approach it's not an issue - as soon as page objects return the same item class, they're considered compatible, there is no need for them to share exactly same base class, besides having
ItemPage[ItemClass]
orReturns[ItemClass]
somewhere in hierarchy.Not extracting all fields with Approach 1
The main con I see with Approach 1 is the case where not all item fields are required in a generic spider.
There is no ProductPage anymore, and even if we define it, it's not stated that ExamplePage can be used instead of ProductPage. There are some possible solutions to this, e.g. using typing.Annotated:
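The original snippet is lost; one way the typing.Annotated idea could look (the annotation semantics here are hypothetical, not an existing web-poet/scrapy-poet feature):

```python
from typing import Annotated

import attrs
import scrapy
from web_poet import ItemPage

@attrs.define
class Product:
    name: str

class ProductPage(ItemPage[Product]):
    ...

class MySpider(scrapy.Spider):
    name = "example"

    # Hypothetical semantics: "give me a Product, extracted by a
    # ProductPage (or a page object registered instead of it)".
    def parse_product(self, response,
                      product: Annotated[Product, ProductPage]):
        ...
```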
There could also be a "reverse" registry, `{ProductPage: Product}`, which scrapy-poet can use to pick the right Page Object from the registry if ProductPage is requested, though this approach has some challenges.

Using custom methods
Users might want to have a Page Object with some custom methods, and use it in a crawler:
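The original example is lost; a sketch of such a page object and crawler (selectors, `links()` and the ProductList item are illustrative):

```python
import attrs
import scrapy
from web_poet import WebPage, field

@attrs.define
class ProductList:
    product_urls: list

class ProductListPage(WebPage[ProductList]):
    @field
    def product_urls(self):
        return self.css("a.product::attr(href)").getall()

    # A custom method on top of to_item(): something a crawler may
    # want to call in a callback besides extracting the item.
    def links(self):
        return self.css("a.next::attr(href)").getall()

class MySpider(scrapy.Spider):
    name = "products"

    async def parse_list(self, response, page: ProductListPage):
        yield await page.to_item()
        for url in page.links():
            yield response.follow(url, callback=self.parse_list)
```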
This is compatible with both Approach 1 and Approach 2, if ProductListPage uses ProductList as a dependency:
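A sketch of that dependency arrangement, assuming scrapy-poet learns to provide items as dependencies, as Approach 1 proposes:

```python
import attrs
from web_poet import ItemPage

@attrs.define
class ProductList:
    product_urls: list

@attrs.define
class ProductListPage(ItemPage[ProductList]):
    # The item itself is declared as a dependency; scrapy-poet can
    # build it via whatever page object is registered for ProductList
    # on the current site.
    item: ProductList

    async def to_item(self) -> ProductList:
        return self.item

    def links(self):
        return self.item.product_urls
```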
In this case, one can define POs for ProductList using Approach 1, and they will be used automatically to create the `item` dependency. It's also possible to override ProductListPage itself, if a custom `links` implementation is desired. So, both pages below would be applied properly:
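The original pair of pages is lost; a sketch of what they could look like, restating the definitions from the sketch above (domains and return values are placeholders):

```python
import attrs
from web_poet import ItemPage, field, handle_urls

@attrs.define
class ProductList:
    product_urls: list

@attrs.define
class ProductListPage(ItemPage[ProductList]):
    item: ProductList

    async def to_item(self) -> ProductList:
        return self.item

    def links(self):
        return self.item.product_urls

# 1) Approach 1 style: provides ProductList items on example.com;
# used automatically to build the `item` dependency above.
@handle_urls("example.com")
class ExampleProductListPage(ItemPage[ProductList]):
    @field
    def product_urls(self):
        return ["https://example.com/p1"]

# 2) An override of ProductListPage itself, for a site that needs
# a custom links() implementation.
@handle_urls("another.example", instead_of=ProductListPage)
class AnotherProductListPage(ProductListPage):
    def links(self):
        return ["https://another.example/products"]
```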
Conclusion

Approach 1 is tempting because it seems to provide a better API: one can receive items in callbacks, and just use them; there is no need e.g. to use `item = await ensure_deferred(page.to_item())` in every callback, and no need to use `async def parse`. It also gives full typing support, which can be hit or miss with page objects.

For me it's also much easier to understand - there are no "overrides" anymore, no semi-magical replacement of one class with another. We're just saying "this Page Object returns a Product item on the example.com website", and then, when a Product item is needed on example.com, we know how to get it.
Approach 2 looks more powerful, especially if web-poet fields are used heavily, and a spider actually needs to access only some of the fields, but not all of them.
Currently I'm +0.25 on implementing, following and recommending Approach 1, mostly because of its simplicity, even though it doesn't support some features which Approach 2 gives. But I do see advantages of Approach 2 as well, so it's not a strong opinion.
Thoughts?