Check UDF params types in SignalSchema #973
Conversation
Deploying datachain-documentation with Cloudflare Pages

Latest commit: 237e3e0
Status: ✅ Deploy successful!
Preview URL: https://a6d039bc.datachain-documentation.pages.dev
Branch Preview URL: https://respect-udf-param-types.datachain-documentation.pages.dev
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main     #973      +/-   ##
==========================================
+ Coverage   88.08%   88.18%   +0.10%
==========================================
  Files         133      133
  Lines       12123    12145      +22
  Branches     1687     1695       +8
==========================================
+ Hits        10678    10710      +32
+ Misses       1025     1017       -8
+ Partials      420      418       -2
```

Flags with carried forward coverage won't be shown. View full report in Codecov by Sentry.
```diff
@@ -1003,8 +1003,9 @@ def _udf_to_obj(
     func: Optional[Union[Callable, UDFObjT]],
     params: Union[None, str, Sequence[str]],
     output: OutputType,
-    signal_map,
+    signal_map: dict[str, Callable],
```
Just improving typings here.
```python
) -> UDFObjT:
    is_batch = target_class.is_input_batched
```
To be able to check UDF function input arguments for batch UDFs.
```python
params_schema = signals_schema.slice(
    sign.params, self._setup, is_batch=is_batch
)
```
This is to be able to know whether the UDF is batching or not when we check the UDF input argument types.
```diff
@@ -130,7 +130,7 @@ def read_meta(  # noqa: C901
 #


 def parse_data(
-    file: File,
+    file: TextFile,
```
This is because `File.open` returns binary data, and we need text (`str`) below. Now `File` types in UDF will be automatically converted based on the UDF input param type.
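A minimal sketch of the reasoning above, assuming (as described in this thread) that a `TextFile` annotation makes `open()` yield text while plain `File` yields bytes; the `datachain.lib.file` import path is an assumption:

```python
from datachain.lib.file import TextFile  # assumed import path

def parse_data(file: TextFile) -> list[str]:
    # With TextFile, open() is expected to give a text-mode handle,
    # so read() returns str rather than bytes.
    with file.open() as f:
        return f.read().splitlines()
```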
```diff
 def slice(
-    self, keys: Sequence[str], setup: Optional[dict[str, Callable]] = None
+    self,
+    params: dict[str, Union[DataType, Any]],
+    setup: Optional[dict[str, Callable]] = None,
+    is_batch: bool = False,
 ) -> "SignalSchema":
```
The main changes of this PR go into this function.

`params` is now a dictionary with param names as keys and types as values, instead of just a list of input param names.

`is_batch` is needed to be able to check input param types of batching UDFs.

Let's say we have a signal `foo` of type `int`; then a regular UDF might look like this:

```python
def process(foo: int) -> int:
    ...
```

and a batching UDF looks like this:

```python
def process_batch(foo: list[int]) -> int:
    ...
```

To be able to check types here we need to know whether the UDF is batching or not.
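A small standalone sketch (not the actual implementation) of how `is_batch` changes the comparison between a UDF parameter annotation and the schema type:

```python
from typing import get_args, get_origin

def matches(param_type, schema_type, is_batch: bool) -> bool:
    """Return True if the UDF param annotation is compatible with the schema type."""
    if is_batch:
        # Batch UDFs receive list[T] (or a bare `list`) instead of T.
        if get_origin(param_type) is not list and param_type is not list:
            return False
        args = get_args(param_type)
        if not args:
            return True          # bare `list` accepts any element type
        param_type = args[0]     # unwrap and compare the element type
    return param_type == schema_type

assert matches(int, int, is_batch=False)
assert matches(list[int], int, is_batch=True)
assert matches(list, int, is_batch=True)
assert not matches(list[str], int, is_batch=True)
```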
```python
if param_origin is Union and type(None) in get_args(param_type):
    param_type = get_args(param_type)[0]

if is_batch:
```
In the case of batch UDFs both of these declarations will be treated the same:

```python
def process(foo: list[int]) -> int:
    ...
```

and

```python
def process(foo: list) -> int:
    ...
```
```python
if param_type == schema_type or (
    isclass(param_type)
    and isclass(schema_type)
    and issubclass(param_type, File)
    and issubclass(schema_type, File)
):
    schema[param] = schema_type
    continue
```
The UDF input param type should be the same as the dataset signal type. There is a special case here: we convert a `File` signal to `TextFile`, `ImageFile` or `VideoFile` based on the UDF input param type signature, and vice versa, a `TextFile` dataset signal can be used as a regular `File` type if needed.

Let's say we have a dataset with the `File` type:

```python
DataChain.from_storage("s3://bucket/")
```

If we know these files are image files, we can use them in a UDF like this:

```python
def process(file: ImageFile) -> ImageMeta:
    return file.get_info()
```

This might be useful, since the `File` class has no `get_info` method.

JFYI, another possible way to do this is:

```python
def process(file: File) -> ImageMeta:
    return file.as_image_file().get_info()
```

The same works in the opposite direction. Let's say we have a dataset with `ImageFile` files:

```python
DataChain.from_storage("s3://bucket/", type="image")
```

We can use a regular `File` in the UDF:

```python
def process(file: File) -> int:
    ...
```

The type will be converted automatically in both directions.
```diff
@@ -18,7 +18,7 @@ def __init__(self, chain: str, msg):
 @dataclass
 class UdfSignature:
     func: Union[Callable, UDFBase]
-    params: Sequence[str]
+    params: dict[str, Union[DataType, Any]]
```
UDF signature params is now a dictionary with param names as keys and param types as values, instead of just a list of param names.
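A small illustration of the shape change (the names and types here are placeholders, not taken from the PR):

```python
from typing import Any

# before: just the param names
params_before = ["name", "value"]

# after: param name -> param type (Any when the type is unknown)
params_after = {"name": str, "value": Any}
```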
```python
func_params_map_sign, func_outs_sign, is_iterator = cls._func_signature(
    chain, udf_func
)
```
Not sure why not to use `self` here 👀
```python
udf_params = (
    {params: Any} if isinstance(params, str) else dict.fromkeys(params, Any)
)
```
Use `Any` type here for params with unknown type, since we don't have an `Unknown` type in Python 👀
can it be None? what is the parameter type if the type annotation is missing ... e.g. how does mypy treat it?
This is the case when `params` is set in the `map` chain method. Here we are not checking UDF signature param types, e.g.:

```python
dc.map(lambda file: file.parse_info(), params="file")
```

or

```python
def process_signals(signal1, signal2) -> int:
    return signal1 + signal2

dc.map(process_signals, params=["signal1", "signal2"])
```

Answering your question: if the parameter type is missing, `inspect` will return it as `inspect.Parameter.empty` (see below). In this case we return its type as `Any` (since there is no `Unknown` type in Python).
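A minimal sketch of that behavior, assuming plain `inspect.signature` is used to read the UDF signature (the library may do this differently):

```python
import inspect
from typing import Any

def func_params(func) -> dict:
    """Map each parameter name to its annotation, or Any if the annotation is missing."""
    sig = inspect.signature(func)
    return {
        name: Any if p.annotation is inspect.Parameter.empty else p.annotation
        for name, p in sig.parameters.items()
    }

def process_signals(signal1, signal2) -> int:
    return signal1 + signal2

assert func_params(process_signals) == {"signal1": Any, "signal2": Any}
```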
````diff
@@ -159,6 +159,7 @@ def process(self, file) -> list[float]:
     ```
     """

+    is_input_batched = False
````
why not use the same `is_output_batch`? (maybe rename it to `is_batched`?) it seems it always has the same value
For generators the input will not be batched (a single row), but the output is batched, thus we need separate params for these.
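A tiny self-contained illustration of why the two flags can differ (generic code, not the library's UDF classes):

```python
from collections.abc import Iterator

def expand(value: int) -> Iterator[int]:
    # Generator-style UDF: a single input row (input is not batched) ...
    yield value
    yield value * 2  # ... but several output rows (output is batched)

assert list(expand(3)) == [3, 6]
```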
```python
param_type = get_args(param_type)[0]

if param_type == schema_type or (
```
do we do this only for File? does it make sense to make it general?
Good question. For files we are sure no additional fields exist in the different File type models (File, ImageFile, VideoFile, etc.). But in general, other models inherited from each other may have additional fields, and a simple type conversion may fail. We can also check this, but it will complicate things and I am not sure I see the value here.
We can do it later if needed.
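If this is ever generalized, one possible shape of that extra check could be the following (a sketch assuming pydantic v2 models, not code from this PR):

```python
from pydantic import BaseModel

def adds_no_fields(child: type[BaseModel], parent: type[BaseModel]) -> bool:
    """Allow implicit conversion only when the subclass declares no extra fields."""
    return issubclass(child, parent) and set(child.model_fields) == set(parent.model_fields)

class Base(BaseModel):
    path: str

class Same(Base):
    pass

class Extended(Base):
    width: int

assert adds_no_fields(Same, Base)
assert not adds_no_fields(Extended, Base)
```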
Haven't looked into the PR yet, but am I the only one a little bit concerned with these implicit `File` conversions?
This is because I prefer to think about it this way: different types, but converted seamlessly without any warning. Looks logical, intuitive and natural to me 😇
Yes, there is some implicit casting happening in some languages, but what I was worried about more is being inconsistent and only doing implicit casting for `File` types.
Then IMO we need to make sure no additional fields were added in subclasses. See my comment above.
Follow-up for #966

See this thread for more details: #966 (comment)

We are now checking the types of UDF arguments if they are defined. There is also a special case for the `File` signal: it will be automatically converted to `TextFile`, `ImageFile` or `VideoFile` if one of these types is used in the UDF. See also my comments about the implementation and examples below 🙏

Very basic example
Let's say we have a dataset `my-ds` with two signals: `name: str` and `value: int`. We can now define a UDF to process these signals, e.g.:
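A sketch of such a UDF (the original snippet is not shown here):

```python
def process(name: str, value: int) -> int:
    # param types match the `my-ds` schema: `name` is str, `value` is int
    return value
```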
In this PR we are adding type checking for UDF param types based on the dataset schema, so a UDF like the one sketched below will return an error, because in dataset `my-ds` the signal `name` has type `str`, not `int`.
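A sketch of such a mistyped UDF (the exact error message is not reproduced here):

```python
def process(name: int, value: int) -> int:  # `name` is declared as int, but it is str in `my-ds`
    ...
```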
Another real-life example

In this example we are creating a dataset from a Google Storage bucket with a signal `file` of type `File`. Next we are going to process these files as images to get image meta from them. Since the general `File` type has no methods to work with images, we need a type conversion (`File` -> `ImageFile`), which can be done by using `file.as_image_file()`, but it is also possible to do it on-the-fly based on the UDF signature types, which is the subject of this PR.

Query
Before
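A sketch, not the original query; it follows the `file.as_image_file().get_info()` pattern from the thread above (`ImageMeta` is taken from those examples):

```python
def process(file: File) -> ImageMeta:
    # explicit conversion, needed before this PR
    return file.as_image_file().get_info()
```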
After
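A sketch of the same step relying on the on-the-fly `File` -> `ImageFile` conversion added in this PR:

```python
def process(file: ImageFile) -> ImageMeta:
    # conversion happens automatically based on the param annotation
    return file.get_info()
```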
TODO