Allow discrimination based on a key that is not part of the model #135

mishamsk · 2023-08-03T04:00:18Z

Is your feature request related to a problem? Please describe.
I'am using mashumaro as (de)serialization backend of my AST library

What this means is that I (de)serialize huge (millions of nodes) nested dataclasses with fields usually having a general type that may have tens and sometimes hunderds subclasses. E.g. borrowing from the example in my own docs:

@dataclass
class Expr(ASTNode):
    """Base for all expressions"""
    pass

@dataclass
class Literal(Expr):
    """Base for all literals"""
    value: Any

@dataclass
class NumberLiteral(Literal):
    value: int

@dataclass
class NoneLiteral(Literal):
    value: None = field(default=None, init=False)

@dataclass
class BinExpr(Expr):
    """Base for all binary expressions"""
    left: Expr
    right: Expr

Expr can (or rather has in real life) hundreds of subclasses.

Thus it is impractical to let mashumaro try discriminating the actual type by iteratively trying every subclass during deserialization. Also, it will be incorrect in some cases because two subclasses may have the same structure, but I need exactly the same type as it was written during serialization.

I am already handling this via my own extension: see code

My custom implementation adds __type key during serialization which is read during deserialization and then dropped. If it is not available, it falls back to the default logic.

However, this way I can't use some of the other mashumaro features, such as annotations in _serialize/_deserialize, and I have to resort to "dirty" class state to pass additional options (what has been discussed in #100)

Describe the solution you'd like
I suggest adding an option to the discriminator feature to allow using keys that are not part of the data class itself.

Describe alternatives you've considered
See above my current implementation

The text was updated successfully, but these errors were encountered:

Fatal1ty · 2023-08-05T17:04:58Z

Hey @mishamsk

First of all, congratulations on your project publication! I'll consider using it when I need to do some parsing stuff.

When I was developing discriminator feature, I thought about adding an attribute to Discriminator that would keep a tag generator function. In the current implementation it's hardcoded to take the tag from a dataclass field which you specifed in Discriminator object. For example, if we have Discriminator(field="__type", include_subtypes=True), the following variant registration code will be generated:

for variant in iter_all_subclasses(__main__.Node):
    try:
        variants_map[variant.__dict__['__type']] = variant
    except KeyError:
        continue

By introducing a tag generator function we can change it. For example, if we have Discriminator(field="__type", variant_tag_getter=lambda cls: cls.__name__, include_subtypes=True) the following variant registration code will be generated:

for variant in iter_all_subclasses(__main__.Node):
    try:
        variants_map[variant_tag_getter(variant)] = variant
    except KeyError:
        continue

I'm bad at naming but I guess that variant_tag_getter will more clearly indicate the purpose instead of variant_tagger_fn which may imply the "post serialize hook" functionality.

The question is, do we need or will we need functionality of automatic adding tags to all variants on serialization, so we could avoid using __post_serialize__ hooks. We can add a bool parameter add_tag_on_serialization (just a working name) to Discriminator for this but let's keep it for a separate issue.

What do you think? Maybe I'm missing something and you have your own vision of solving this issue.

mishamsk · 2023-08-07T16:02:44Z

@Fatal1ty first thanks for the wishes! it took me way to much time;-)

rgd your design suggestion, I have a couple of thoughts.

First, rgd my specific use case. I think it's a distinct usage pattern from a "model a messaging protocol and parse". Instead it's more of a "pickle" use case. I am controlling both serialization and deserialization, so do not care about validations or trying to decode data originating from some external system. I have a hunch that only this kind of usage pattern where one can expect a literal class name in serialized data. None of the public specs, except the ones used for interoperability, would ever dictate to store a literal class name in serialized data, because class name is obviously implementation specific.

So if you think that mashumaro is commonly used for such things, it may be worth implementing explicitly, instead of a "generic" approach. But you'd know better if this is the case. My usage pattern may be rare.

Rgd your proposal for the universal solution. I'd definitely say that whatever it will be, it's going to be the harder part to grasp for users. I had to read mashumaro's code to understand where this "getter" stuff will pop-up. I am not sure if you need a different argument for the Discriminator though. Why not just inspect the field value, and allow callable?

I do think that using tag instead of field would be generally better, now that the discriminator is not necessarily a field. It would also match some other libs (e.g. msgspec), but that would break the API...

Fatal1ty · 2023-08-11T16:35:07Z

I am not sure if you need a different argument for the Discriminator though. Why not just inspect the field value, and allow callable?

Assuming we are using JSON serialization format, we need two things:

A JSON key name that holds a tag value in the serialized data. We will take this tag value on deserialization and find it in the "tag → class" mapping to determine which variant class should be deserialized.
A function that returns a tag value for a class so that we could go through the class hierarchy and create "tag → class" mapping.

Currently we can configure only first item through Discriminator "field" attribute. The second item is hardcoded:

mashumaro/mashumaro/core/meta/types/unpack.py

Lines 347 to 350 in c0a4cf8

    
           lines.append( 
        
               f"variants_map[variant.__dict__[" 
        
               f"'{discriminator.field}']] = variant" 
        
           )

Semantically this function (I called it variant_tagger_fn in my proposal) returns a value for the class level attribute with the name discriminator.field:

def variant_tagger_fn(cls):
    return cls.__dict__[discriminator.field]

In your case you don't have class level attributes containing unique literals, so you could use the following function instead to get dynamic tag values:

def variant_tagger_fn(cls):
    return cls.__name__  # or something more practical

I do think that using tag instead of field would be generally better, now that the discriminator is not necessarily a field. It would also match some other libs (e.g. msgspec), but that would break the API...

msgspec has its tag_field parameter which semantically is the same as discriminator.field in mashumaro. The difference is that by default msgspec consider the class name as a tag value, and mashumaro consider the value of a specific class level attribute as a tag value. My proposal is to allow the modification of this "tagger".

mishamsk · 2023-08-12T00:14:59Z

Oh, sorry, I wasn't thinking straight. Yes this makes total sense.

Did you consider keeping only the current field attribute, but allowing it to be a tuple? I.e.:

No attribute - assume subclass based discrimination
Plain string - current logic, implicitly assume that the provide string is a key in __dict__ (which btw apparently won't work for slotted dataclasses! see code sample below. I wasn't expecting that)
A tuple of a string & callable - the new logic, which could also solve for slotted classes

I may be biased, but the word variant doesn't "speak" to me, sounds more like an implementation detail. But, either way, having a variant_tagger_fn is better than not having anything at all;-)

sample code - slotted dataclasses do not have __dict__ on class itself too!

from dataclasses import dataclass
from typing import ClassVar

@dataclass(slots=True)
class Test:
    x: ClassVar[int] = 0
    y: int = 0

print("__dict__" in dir(Test)) # False
print("__dict__" in dir(Test())) # False

Fatal1ty · 2023-08-17T20:36:38Z

@mishamsk Sorry for delay, I was busy at work.

Did you consider keeping only the current field attribute, but allowing it to be a tuple? I.e.:

No attribute - assume subclass based discrimination

Plain string - current logic, implicitly assume that the provide string is a key in __dict__ (which btw apparently won't work for slotted dataclasses! see code sample below. I wasn't expecting that)

A tuple of a string & callable - the new logic, which could also solve for slotted classes

It will work but I don't see any point to make one argument to accept different value types instead of using two arguments. It will also be complicated to explain the use of this parameter in the documentation. Let's make it simple.

I may be biased, but the word variant doesn't "speak" to me, sounds more like an implementation detail. But, either way, having a variant_tagger_fn is better than not having anything at all;-)

I agree that it doesn't sound very well but I didn't manage to find a better word. I do think this complexity has roots in using discriminator for both subclass deserialization and for discriminated union deserialization. If it was only for subclass deserilizaiton then I would called it subclass_tagger_fn. In my defense, I found some articles where the term "variant" is used for the same things:

current logic, implicitly assume that the provide string is a key in __dict__ (which btw apparently won't work for slotted dataclasses! see code sample below. I wasn't expecting that

I grabbed my head and almost believed it, but it made me go to check. Fortunately for us, your assumption is only partially correct. The truth is slotted dataclass instances don't have __dict__ attribute but class objects do have it. Compare this:

print("__dict__" in dir(Test))  # False
print(Test.__dict__)  # {'__module__': '__main__', '__annotations__': {'x': typing.ClassVar[int], ...

You might be curious why I even rely on __dict__ attribute when retrieving a tag. The reason is that I wanted to support intermediate nodes in class tree that don't have explicit tag defined and are only used as a base class for their nodes. If we were using getattr(variant_cls, tag) instead of variant_cls.__dict__[tag] we would get a value inherited from an ancestor for those intermediate nodes and overwrite a class registered for the tag.

mishamsk · 2023-08-18T13:44:44Z

nothing to apologize for! appreciate additional insights.

Sorry for the false alarm, didn’t know that dir had completely different behavior based on what’s being inspected.

Regarding the naming, as I’ve said, I do not have a strong opinion. You’ve build & maintained this library all along, so you’ll know better what works for the community. The only thing I can add, is that I’ve been trying to find other libs that supported auto-subclass (de)serialization, and couldn’t find any. Seems odd to me, that there is nothing in-between pickle and strictly typed validation/serialization (pydantic, msgspec, cattrs etc). So having such a feature in mashumaro would further set it apart!

Also, this change seems relatively simple. Probably even simpler than #136. I am on vacation, so am away from coding for the sake of the family, but I may get some time to draft a PR when I am back if you haven’t done it yourself by then.

Fatal1ty · 2023-08-23T20:37:48Z

It was easy to implement :)

mishamsk · 2023-08-23T21:52:18Z

@Fatal1ty nice! you are definitely moving faster than I do;-)

mishamsk mentioned this issue Aug 3, 2023

Raise the inner exception when parsing nested dataclasses #136

Open

Fatal1ty added the enhancement New feature or request label Aug 5, 2023

Fatal1ty self-assigned this Aug 23, 2023

Fatal1ty mentioned this issue Aug 23, 2023

Add optional variant_tagger_fn to Discriminator #146

Merged

Fatal1ty closed this as completed in #146 Aug 23, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allow discrimination based on a key that is not part of the model #135

Allow discrimination based on a key that is not part of the model #135

mishamsk commented Aug 3, 2023

Fatal1ty commented Aug 5, 2023

mishamsk commented Aug 7, 2023

Fatal1ty commented Aug 11, 2023

mishamsk commented Aug 12, 2023

Fatal1ty commented Aug 17, 2023

mishamsk commented Aug 18, 2023

Fatal1ty commented Aug 23, 2023

mishamsk commented Aug 23, 2023

Allow discrimination based on a key that is not part of the model #135

Allow discrimination based on a key that is not part of the model #135

Comments

mishamsk commented Aug 3, 2023

Fatal1ty commented Aug 5, 2023

mishamsk commented Aug 7, 2023

Fatal1ty commented Aug 11, 2023

mishamsk commented Aug 12, 2023

Fatal1ty commented Aug 17, 2023

mishamsk commented Aug 18, 2023

Fatal1ty commented Aug 23, 2023

mishamsk commented Aug 23, 2023