Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

run_udf: Simplified and clarified the schema for data #515 #519

Open
wants to merge 1 commit into
base: draft
Choose a base branch
from

Conversation

m-mohr
Copy link
Member

@m-mohr m-mohr commented Jul 26, 2024

Fixes #515

@m-mohr m-mohr added this to the 2.0.0 milestone Jul 26, 2024
@m-mohr m-mohr requested a review from soxofaan July 26, 2024 09:40
"description": "A single value of any data type."
}
]
"description": "The data to be passed to the UDF.\n\nThe data must be given in a way that the UDF can understand it. Usually, this process is executed as part of a data cube process such as `reduce_dimension` and in this case the process would expect an array of values as provided in the parameter `data` by `reduce_dimension`. Please refer to the documentation of the UDF runtime for details on how to provide the data.",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The data must be given in a way that the UDF can understand it.

From my UDF-usage experience this statement sounds a bit backwards. When working with UDFs in practice, the data and its form are a given and you have to tweak your UDF to match that form.
Basically in practice, the "parent" process (e.g. reduce_dimension) dictates the form of data, so it might be confusing to the reader trying to understand how run_udf operates by stating something like "the data must be given in a way ..."

I know that it's possible to have processes before run_udf in reduce_dimension's reducer, which would allow for some kind of data preparation, but I don't think there is an implementation that supports that (the geopyspark one doesn't in any case).

Also note that there isn't necessarily direct correspondence between the data argument of run_udf and the actual data that is available in the UDF code. For example: with apply, the data of the "child" callback is a single pixel value, but the UDF code could still receive larger multidimensional chunks for performance reasons. So I would avoid over-specifying the link between run_udf's data parameter and the actual data container available in the UDF.

Please refer to the documentation of the UDF runtime for details on how to provide the data.

Likewise: the UDF runtime docs can provide little help here, it's the parent process that dictates the form of run_udf's data parameter

Attempt to tweak the description:

Suggested change
"description": "The data to be passed to the UDF.\n\nThe data must be given in a way that the UDF can understand it. Usually, this process is executed as part of a data cube process such as `reduce_dimension` and in this case the process would expect an array of values as provided in the parameter `data` by `reduce_dimension`. Please refer to the documentation of the UDF runtime for details on how to provide the data.",
"description": "The data to be passed to the UDF.\n\nUsually, `run_udf` process is used as a child process of a data cube process such as `reduce_dimension` or `apply`. These data cube processes establish the actual type of this data. For example, in case of `reduce_dimension`, `data` will be an array of values, while in case of `apply` it will be a single scalar value.",

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm also fine to just leave the description as is ("The data to be passed to the UDF."), to avoid bike-shedding this too much.

"description": "A single value of any data type."
}
]
"description": "The data to be passed to the UDF.\n\nThe data must be given in a way that the UDF can understand it. Usually, this process is executed as part of a data cube process such as `reduce_dimension` and in this case the process would expect an array of values as provided in the parameter `data` by `reduce_dimension`. Please refer to the documentation of the UDF runtime for details on how to provide the data.",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see run_udf comment

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

run_udf: Unclear why two data types are defined
2 participants