A collection of processors to clean-up field values #18

BurnzZ · 2022-09-29T13:13:56Z

Stemming off from the discussion in #15.

We need a library of functions that can be used to preprocess the data values placed into the item. Ideally, this should be used by https://github.com/scrapinghub/web-poet fields. For example:

from web_poet import ItemPage, field
from zyte_common_items import Product
from zyte_common_items.processors import clean_str


class ProductPage(ItemPage[Product]):
    @field
    @clean_str
    def name(self):
        return "  some value \n\n\t"

page = ProductPage()
assert page.name == "some value"
assert page.name == page.to_item().name

A guideline in web-poet should be created as well to properly document the processors found in zyte-common-items.

kmike · 2022-09-30T07:55:27Z

I think we have several options on how to implement it.

Option 1: Decorators

The idea is that all processing logic would be implemented as decorators, which you can use to decorate extraction methods (later to be decorated with @field). Example:

    @field
    @clean_str
    def name(self):
        return "  some value \n\n\t"

Pro:

easy to use
decorators with parameters can be defined
it's possible to combine several decorators
can be used on methods which are not @field

Cons:

more complex implementation; every processor needs to be implemented as a decorator
these decorators must handle async def code as well, so it's even more complicated implementation
code is harder to use if a decorator API is not desirable
almost always you'd need @field decorator
if several decorators are applied, it's not always obvious in which order they'll be called
one need to remember that @field decorator should be at the top (not a big deal though)

Option 1.1: Field decorators

Extraction decorators can be pre-decorated with @field:

    @clean_str_field
    def name(self):
        return "  some value \n\n\t"

# or
from zyte_common_items import fields

# ...
    @fields.clean_str
    def name(self):
        return "  some value \n\n\t"

# or
from zyte_common_items.product import fields

# ...
    @fields.name
    def name(self):
        return "  some value \n\n\t"

It's very similar to (1. Decorators) approach. Pros and Cons compared to (1):

Pro:

easier to use, no need to use @field decorator
no need to recall the order of @field vs other decorators

Cons:

harder to combine several processing decorators, not possible with naive implementation
more complex implementation
can't be used on regular methods, which are not fields

To be continued with other options :)

kmike · 2022-09-30T12:37:03Z

Option 2. Regular functions

One can write regular processing functions, and use them in the fields:

    @field
    def name(self):
        return clean_str("  some value \n\n\t")

Pros:

Implementation is very straightforward
It's possible to combine the functions in arbitrary way
Easy to reuse in all contexts
No need to worry about sync/async semantics

Cons:

If there are several "return points", more code changes might be required, to ensure cleaning is always applied
It might be easier to forget to call a function
It is less clear, from the field definition, what kind of cleaning is performed, as the cleaning code is mixed with extraction code

Option 3: `@field` decorator with processors support

We can modify @field decorator to support output processors:

    @field(out=clean_str)
    def name(self):
        return "  some value \n\n\t"

# or multiple processors:

    @field(out=[clean_str, str.title])
    def name(self):
        return "  some value \n\n\t"

Pro:

Processors are still regular functions, so all the pros of regular functions apply
Processing logic is not mixed with the extraction logic, similar to decorators
No confusion about order of @field decorator vs processing decorators

Cons:

To use parameters with processing functions one need to use functools.partial
If multiple processors are needed, a list may look cluttered (not sure)
Typing may be more tricky to get right (though not sure about it)
We'd need to think how to handle sync vs async processing functions. Probably, only support sync.

Option 4: field output processors + recommend some functional programming library

Instead of supporting a list in field processors, we can recommend using some FP library:

from toolz.functoolz import compose_left
# ...

    @field(out=compose_left(clean_str, str.title))
    def name(self):
        return "  some value \n\n\t"

Pros:

The same as in Option 3.
Instead of functools.partial we might promote currying, which makes for a nicer API for passing parameters:

from toolz.functoolz import curry

@curry
def clean_str(input, normalize_space=True):
    # ...

# page objects
from toolz.functoolz import compose_left

# ...

    @field(out=compose_left(clean_str, str.title))
    def name2(self):
        return "  some value \n\n\t"

    # no need to use functools.partial, this works
    @field(out=compose_left(clean_str(normalize_space=False), str.title))
    def name(self):
        return "  some value \n\n\t"

Cons:

Not sure what's the advantage of promoting compose/compose_left, over supporting a list.

To be continued.

kmike · 2022-09-30T12:51:26Z

Option 5: combine Option 3 and Option 1.1

We can implement Option 3, and provide some helpers for creation of combined "fields + processing" decorators.
In the simplest case, stdlib is enough:

from functools import partial
from web_poet import field

name_field = partial(field, out=[clean_str])

# ...
    # parameters work
    @name_field(cached=True)
    def name(self):
        return "  some value \n\n\t"

We may have something more advanced to support customizing out better, but naive implementation could be enough: name_field(out=...) replaces default out. An argument in favor of doing something custom with out (e.g. appending or prepending to the default processing) is that name_field(out=...) might be useless, one can just use use field(out=...). But we might add something else to the field, e.g. cache some fields by default, or pass some meta parameter.

I'm currently in favor of going with Option 5, because

Writing a processor function is straightforward, it's a regular Python function. Testing it is straightforward as well. No new concepts. I think that's important, because developers shouldn't be only using the standard processing functions provided by libraries, they should also write their own processing code.
Using custom processing functions is straightforward, use @field(out=...).
It's possible to provide shortcuts, so common cases can be optimized from the usage point of view. We don't need to settle on the shortcuts from the day 1, it can be done iteratively.

Gallaecio · 2022-10-13T05:34:07Z

+1 to 3, +0.5 to 5.

I find @field(cache=True, out=[clean_str]) more readable than @name_field(cached=True), I think it makes it more obvious that some cleanup is being done, and which one.

If a long out value is the issue, I would suggest a slightly different approach:

clean_name = [clean_str]

@field(cache=True, out=clean_name)
def name(self):
    pass

kmike · 2023-08-24T14:43:41Z

Closing this, as processors feature is implemented in web-poet, and zyte-common-items already gained some built-in processors.

BurnzZ added the enhancement New feature or request label Sep 29, 2022

BurnzZ mentioned this issue Sep 29, 2022

create use_converter_shortcuts() as a field_transformer #15

Closed

4 tasks

BurnzZ changed the title ~~A collection of preprocessors to clean-up field values~~ A collection of processors to clean-up field values Sep 29, 2022

wRAR mentioned this issue Oct 14, 2022

Add support for field processors. scrapinghub/web-poet#85

Merged

kmike closed this as completed Aug 24, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

A collection of processors to clean-up field values #18

A collection of processors to clean-up field values #18

BurnzZ commented Sep 29, 2022 •

edited

Loading

kmike commented Sep 30, 2022 •

edited

Loading

kmike commented Sep 30, 2022 •

edited

Loading

kmike commented Sep 30, 2022 •

edited

Loading

Gallaecio commented Oct 13, 2022

kmike commented Aug 24, 2023

A collection of processors to clean-up field values #18

A collection of processors to clean-up field values #18

Comments

BurnzZ commented Sep 29, 2022 • edited Loading

kmike commented Sep 30, 2022 • edited Loading

Option 1: Decorators

Option 1.1: Field decorators

kmike commented Sep 30, 2022 • edited Loading

Option 2. Regular functions

Option 3: @field decorator with processors support

Option 4: field output processors + recommend some functional programming library

kmike commented Sep 30, 2022 • edited Loading

Option 5: combine Option 3 and Option 1.1

Gallaecio commented Oct 13, 2022

kmike commented Aug 24, 2023

BurnzZ commented Sep 29, 2022 •

edited

Loading

kmike commented Sep 30, 2022 •

edited

Loading

kmike commented Sep 30, 2022 •

edited

Loading

Option 3: `@field` decorator with processors support

kmike commented Sep 30, 2022 •

edited

Loading