Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

A collection of processors to clean-up field values #18

Closed
BurnzZ opened this issue Sep 29, 2022 · 5 comments
Closed

A collection of processors to clean-up field values #18

BurnzZ opened this issue Sep 29, 2022 · 5 comments
Labels
enhancement New feature or request

Comments

@BurnzZ
Copy link
Contributor

BurnzZ commented Sep 29, 2022

Stemming off from the discussion in #15.


We need a library of functions that can be used to preprocess the data values placed into the item. Ideally, this should be used by https://github.com/scrapinghub/web-poet fields. For example:

from web_poet import ItemPage, field
from zyte_common_items import Product
from zyte_common_items.processors import clean_str


class ProductPage(ItemPage[Product]):
    @field
    @clean_str
    def name(self):
        return "  some value \n\n\t"

page = ProductPage()
assert page.name == "some value"
assert page.name == page.to_item().name

A guideline in web-poet should be created as well to properly document the processors found in zyte-common-items.

@BurnzZ BurnzZ added the enhancement New feature or request label Sep 29, 2022
@BurnzZ BurnzZ changed the title A collection of preprocessors to clean-up field values A collection of processors to clean-up field values Sep 29, 2022
@kmike
Copy link
Contributor

kmike commented Sep 30, 2022

I think we have several options on how to implement it.

Option 1: Decorators

The idea is that all processing logic would be implemented as decorators, which you can use to decorate extraction methods (later to be decorated with @field). Example:

    @field
    @clean_str
    def name(self):
        return "  some value \n\n\t"

Pro:

  • easy to use
  • decorators with parameters can be defined
  • it's possible to combine several decorators
  • can be used on methods which are not @field

Cons:

  • more complex implementation; every processor needs to be implemented as a decorator
  • these decorators must handle async def code as well, so it's even more complicated implementation
  • code is harder to use if a decorator API is not desirable
  • almost always you'd need @field decorator
  • if several decorators are applied, it's not always obvious in which order they'll be called
  • one need to remember that @field decorator should be at the top (not a big deal though)

Option 1.1: Field decorators

Extraction decorators can be pre-decorated with @field:

    @clean_str_field
    def name(self):
        return "  some value \n\n\t"

# or
from zyte_common_items import fields

# ...
    @fields.clean_str
    def name(self):
        return "  some value \n\n\t"

# or
from zyte_common_items.product import fields

# ...
    @fields.name
    def name(self):
        return "  some value \n\n\t"

It's very similar to (1. Decorators) approach. Pros and Cons compared to (1):

Pro:

  • easier to use, no need to use @field decorator
  • no need to recall the order of @field vs other decorators

Cons:

  • harder to combine several processing decorators, not possible with naive implementation
  • more complex implementation
  • can't be used on regular methods, which are not fields

To be continued with other options :)

@kmike
Copy link
Contributor

kmike commented Sep 30, 2022

Option 2. Regular functions

One can write regular processing functions, and use them in the fields:

    @field
    def name(self):
        return clean_str("  some value \n\n\t")

Pros:

  • Implementation is very straightforward
  • It's possible to combine the functions in arbitrary way
  • Easy to reuse in all contexts
  • No need to worry about sync/async semantics

Cons:

  • If there are several "return points", more code changes might be required, to ensure cleaning is always applied
  • It might be easier to forget to call a function
  • It is less clear, from the field definition, what kind of cleaning is performed, as the cleaning code is mixed with extraction code

Option 3: @field decorator with processors support

We can modify @field decorator to support output processors:

    @field(out=clean_str)
    def name(self):
        return "  some value \n\n\t"

# or multiple processors:

    @field(out=[clean_str, str.title])
    def name(self):
        return "  some value \n\n\t"

Pro:

  • Processors are still regular functions, so all the pros of regular functions apply
  • Processing logic is not mixed with the extraction logic, similar to decorators
  • No confusion about order of @field decorator vs processing decorators

Cons:

  • To use parameters with processing functions one need to use functools.partial
  • If multiple processors are needed, a list may look cluttered (not sure)
  • Typing may be more tricky to get right (though not sure about it)
  • We'd need to think how to handle sync vs async processing functions. Probably, only support sync.

Option 4: field output processors + recommend some functional programming library

Instead of supporting a list in field processors, we can recommend using some FP library:

from toolz.functoolz import compose_left
# ...

    @field(out=compose_left(clean_str, str.title))
    def name(self):
        return "  some value \n\n\t"

Pros:

  • The same as in Option 3.
  • Instead of functools.partial we might promote currying, which makes for a nicer API for passing parameters:
from toolz.functoolz import curry

@curry
def clean_str(input, normalize_space=True):
    # ...

# page objects
from toolz.functoolz import compose_left

# ...

    @field(out=compose_left(clean_str, str.title))
    def name2(self):
        return "  some value \n\n\t"

    # no need to use functools.partial, this works
    @field(out=compose_left(clean_str(normalize_space=False), str.title))
    def name(self):
        return "  some value \n\n\t"

Cons:

  • Not sure what's the advantage of promoting compose/compose_left, over supporting a list.

To be continued.

@kmike
Copy link
Contributor

kmike commented Sep 30, 2022

Option 5: combine Option 3 and Option 1.1

We can implement Option 3, and provide some helpers for creation of combined "fields + processing" decorators.
In the simplest case, stdlib is enough:

from functools import partial
from web_poet import field

name_field = partial(field, out=[clean_str])

# ...
    # parameters work
    @name_field(cached=True)
    def name(self):
        return "  some value \n\n\t"

We may have something more advanced to support customizing out better, but naive implementation could be enough: name_field(out=...) replaces default out. An argument in favor of doing something custom with out (e.g. appending or prepending to the default processing) is that name_field(out=...) might be useless, one can just use use field(out=...). But we might add something else to the field, e.g. cache some fields by default, or pass some meta parameter.

I'm currently in favor of going with Option 5, because

  1. Writing a processor function is straightforward, it's a regular Python function. Testing it is straightforward as well. No new concepts. I think that's important, because developers shouldn't be only using the standard processing functions provided by libraries, they should also write their own processing code.
  2. Using custom processing functions is straightforward, use @field(out=...).
  3. It's possible to provide shortcuts, so common cases can be optimized from the usage point of view. We don't need to settle on the shortcuts from the day 1, it can be done iteratively.

@Gallaecio
Copy link
Contributor

+1 to 3, +0.5 to 5.

I find @field(cache=True, out=[clean_str]) more readable than @name_field(cached=True), I think it makes it more obvious that some cleanup is being done, and which one.

If a long out value is the issue, I would suggest a slightly different approach:

clean_name = [clean_str]

@field(cache=True, out=clean_name)
def name(self):
    pass

@kmike
Copy link
Contributor

kmike commented Aug 24, 2023

Closing this, as processors feature is implemented in web-poet, and zyte-common-items already gained some built-in processors.

@kmike kmike closed this as completed Aug 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants