run_udf: Simplified and clarified the schema for data #515 #519

Status: Open — wants to merge 1 commit into base branch `draft`.
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -25,6 +25,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html)
 - `filter_spatial`: Clarified that masking is applied using the given geometries. [#469](https://github.com/Open-EO/openeo-processes/issues/469)
 - `mod`: Clarified behavior for `y = 0`.
 - `sqrt`: Clarified that NaN is returned for negative numbers.
+- `run_udf` and `run_udf_externally`: Simplified and clarified the schema for `data` - no functional change. [#515](https://github.com/Open-EO/openeo-processes/issues/515)

## [2.0.0-rc.1] - 2023-05-25

19 changes: 4 additions & 15 deletions proposals/run_udf_externally.json
@@ -11,21 +11,10 @@
     "parameters": [
         {
             "name": "data",
-            "description": "The data to be passed to the UDF.",
-            "schema": [
-                {
-                    "title": "Array",
-                    "type": "array",
-                    "minItems": 1,
-                    "items": {
-                        "description": "Any data type."
-                    }
-                },
-                {
-                    "title": "Single Value",
-                    "description": "A single value of any data type."
-                }
-            ]
+            "description": "The data to be passed to the UDF.\n\nThe data must be given in a way that the UDF can understand it. Usually, this process is executed as part of a data cube process such as `reduce_dimension` and in this case the process would expect an array of values as provided in the parameter `data` by `reduce_dimension`. Please refer to the documentation of the UDF runtime for details on how to provide the data.",
+            "schema": {
+                "description": "A value of any data type."
+            }
         },
         {
             "name": "url",

Review comment (Member) on the new `description` line: see run_udf comment.
19 changes: 4 additions & 15 deletions run_udf.json
@@ -10,21 +10,10 @@
     "parameters": [
         {
             "name": "data",
-            "description": "The data to be passed to the UDF.",
-            "schema": [
-                {
-                    "title": "Array",
-                    "type": "array",
-                    "minItems": 1,
-                    "items": {
-                        "description": "Any data type."
-                    }
-                },
-                {
-                    "title": "Single Value",
-                    "description": "A single value of any data type."
-                }
-            ]
+            "description": "The data to be passed to the UDF.\n\nThe data must be given in a way that the UDF can understand it. Usually, this process is executed as part of a data cube process such as `reduce_dimension` and in this case the process would expect an array of values as provided in the parameter `data` by `reduce_dimension`. Please refer to the documentation of the UDF runtime for details on how to provide the data.",
Review comment (Member):
> The data must be given in a way that the UDF can understand it.

From my experience using UDFs, this statement sounds a bit backwards. When working with UDFs in practice, the data and its form are a given, and you have to tweak your UDF to match that form. Basically, the "parent" process (e.g. `reduce_dimension`) dictates the form of `data`, so a statement like "the data must be given in a way ..." might confuse a reader trying to understand how `run_udf` operates.

I know that it is possible to have processes before `run_udf` in `reduce_dimension`'s reducer, which would allow for some kind of data preparation, but I don't think there is an implementation that supports that (the geopyspark one doesn't, in any case).

Also note that there isn't necessarily a direct correspondence between the `data` argument of `run_udf` and the actual data that is available in the UDF code. For example: with `apply`, the `data` of the "child" callback is a single pixel value, but the UDF code could still receive larger multidimensional chunks for performance reasons. So I would avoid over-specifying the link between `run_udf`'s `data` parameter and the actual data container available in the UDF.

> Please refer to the documentation of the UDF runtime for details on how to provide the data.

Likewise: the UDF runtime docs can provide little help here; it's the parent process that dictates the form of `run_udf`'s `data` parameter.

Attempt to tweak the description:

Suggested change
-            "description": "The data to be passed to the UDF.\n\nThe data must be given in a way that the UDF can understand it. Usually, this process is executed as part of a data cube process such as `reduce_dimension` and in this case the process would expect an array of values as provided in the parameter `data` by `reduce_dimension`. Please refer to the documentation of the UDF runtime for details on how to provide the data.",
+            "description": "The data to be passed to the UDF.\n\nUsually, the `run_udf` process is used as a child process of a data cube process such as `reduce_dimension` or `apply`. These data cube processes establish the actual type of this data. For example, in the case of `reduce_dimension`, `data` will be an array of values, while in the case of `apply` it will be a single scalar value.",

Review comment (Member): I'm also fine to just leave the description as is ("The data to be passed to the UDF.") to avoid bike-shedding this too much.

"schema": {
"description": "A value of any data type."
}
         },
         {
             "name": "udf",
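Because the parent process dictates whether `data` arrives as an array or a single value, UDF authors sometimes write their callbacks defensively. A minimal plain-Python sketch (again, not the real openEO UDF API; `scale_udf` is a hypothetical name):

```python
# Illustrative only: a UDF body that tolerates both forms run_udf's `data`
# parameter can take, depending on the parent process it is nested in.

def scale_udf(data, factor=10):
    """Scale the input by `factor`, whether it is an array or a scalar."""
    if isinstance(data, (list, tuple)):
        # e.g. invoked under reduce_dimension: `data` is an array of values
        return [value * factor for value in data]
    # e.g. invoked under apply: `data` is a single value
    return data * factor

print(scale_udf([1, 2, 3]))  # [10, 20, 30]
print(scale_udf(4))          # 40
```

The simplified schema ("A value of any data type.") reflects this: the process specification itself cannot constrain the type, since it depends entirely on the calling context.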