Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow discrimination based on a key that is not part of the model #135

Closed
mishamsk opened this issue Aug 3, 2023 · 8 comments · Fixed by #146
Closed

Allow discrimination based on a key that is not part of the model #135

mishamsk opened this issue Aug 3, 2023 · 8 comments · Fixed by #146
Assignees
Labels
enhancement New feature or request

Comments

@mishamsk
Copy link
Contributor

mishamsk commented Aug 3, 2023

Is your feature request related to a problem? Please describe.
I'am using mashumaro as (de)serialization backend of my AST library

What this means is that I (de)serialize huge (millions of nodes) nested dataclasses with fields usually having a general type that may have tens and sometimes hunderds subclasses. E.g. borrowing from the example in my own docs:

@dataclass
class Expr(ASTNode):
    """Base for all expressions"""
    pass

@dataclass
class Literal(Expr):
    """Base for all literals"""
    value: Any

@dataclass
class NumberLiteral(Literal):
    value: int

@dataclass
class NoneLiteral(Literal):
    value: None = field(default=None, init=False)

@dataclass
class BinExpr(Expr):
    """Base for all binary expressions"""
    left: Expr
    right: Expr

Expr can (or rather has in real life) hundreds of subclasses.

Thus it is impractical to let mashumaro try discriminating the actual type by iteratively trying every subclass during deserialization. Also, it will be incorrect in some cases because two subclasses may have the same structure, but I need exactly the same type as it was written during serialization.

I am already handling this via my own extension: see code

My custom implementation adds __type key during serialization which is read during deserialization and then dropped. If it is not available, it falls back to the default logic.

However, this way I can't use some of the other mashumaro features, such as annotations in _serialize/_deserialize, and I have to resort to "dirty" class state to pass additional options (what has been discussed in #100)

Describe the solution you'd like
I suggest adding an option to the discriminator feature to allow using keys that are not part of the data class itself.

Describe alternatives you've considered
See above my current implementation

@Fatal1ty
Copy link
Owner

Fatal1ty commented Aug 5, 2023

Hey @mishamsk

First of all, congratulations on your project publication! I'll consider using it when I need to do some parsing stuff.

When I was developing discriminator feature, I thought about adding an attribute to Discriminator that would keep a tag generator function. In the current implementation it's hardcoded to take the tag from a dataclass field which you specifed in Discriminator object. For example, if we have Discriminator(field="__type", include_subtypes=True), the following variant registration code will be generated:

for variant in iter_all_subclasses(__main__.Node):
    try:
        variants_map[variant.__dict__['__type']] = variant
    except KeyError:
        continue

By introducing a tag generator function we can change it. For example, if we have Discriminator(field="__type", variant_tag_getter=lambda cls: cls.__name__, include_subtypes=True) the following variant registration code will be generated:

for variant in iter_all_subclasses(__main__.Node):
    try:
        variants_map[variant_tag_getter(variant)] = variant
    except KeyError:
        continue

I'm bad at naming but I guess that variant_tag_getter will more clearly indicate the purpose instead of variant_tagger_fn which may imply the "post serialize hook" functionality.

The question is, do we need or will we need functionality of automatic adding tags to all variants on serialization, so we could avoid using __post_serialize__ hooks. We can add a bool parameter add_tag_on_serialization (just a working name) to Discriminator for this but let's keep it for a separate issue.

What do you think? Maybe I'm missing something and you have your own vision of solving this issue.

@Fatal1ty Fatal1ty added the enhancement New feature or request label Aug 5, 2023
@mishamsk
Copy link
Contributor Author

mishamsk commented Aug 7, 2023

@Fatal1ty first thanks for the wishes! it took me way to much time;-)

rgd your design suggestion, I have a couple of thoughts.

First, rgd my specific use case. I think it's a distinct usage pattern from a "model a messaging protocol and parse". Instead it's more of a "pickle" use case. I am controlling both serialization and deserialization, so do not care about validations or trying to decode data originating from some external system. I have a hunch that only this kind of usage pattern where one can expect a literal class name in serialized data. None of the public specs, except the ones used for interoperability, would ever dictate to store a literal class name in serialized data, because class name is obviously implementation specific.

So if you think that mashumaro is commonly used for such things, it may be worth implementing explicitly, instead of a "generic" approach. But you'd know better if this is the case. My usage pattern may be rare.

Rgd your proposal for the universal solution. I'd definitely say that whatever it will be, it's going to be the harder part to grasp for users. I had to read mashumaro's code to understand where this "getter" stuff will pop-up. I am not sure if you need a different argument for the Discriminator though. Why not just inspect the field value, and allow callable?

I do think that using tag instead of field would be generally better, now that the discriminator is not necessarily a field. It would also match some other libs (e.g. msgspec), but that would break the API...

@Fatal1ty
Copy link
Owner

I am not sure if you need a different argument for the Discriminator though. Why not just inspect the field value, and allow callable?

Assuming we are using JSON serialization format, we need two things:

  1. A JSON key name that holds a tag value in the serialized data. We will take this tag value on deserialization and find it in the "tag → class" mapping to determine which variant class should be deserialized.
  2. A function that returns a tag value for a class so that we could go through the class hierarchy and create "tag → class" mapping.

Currently we can configure only first item through Discriminator "field" attribute. The second item is hardcoded:

lines.append(
f"variants_map[variant.__dict__["
f"'{discriminator.field}']] = variant"
)

Semantically this function (I called it variant_tagger_fn in my proposal) returns a value for the class level attribute with the name discriminator.field:

def variant_tagger_fn(cls):
    return cls.__dict__[discriminator.field]

In your case you don't have class level attributes containing unique literals, so you could use the following function instead to get dynamic tag values:

def variant_tagger_fn(cls):
    return cls.__name__  # or something more practical

I do think that using tag instead of field would be generally better, now that the discriminator is not necessarily a field. It would also match some other libs (e.g. msgspec), but that would break the API...

msgspec has its tag_field parameter which semantically is the same as discriminator.field in mashumaro. The difference is that by default msgspec consider the class name as a tag value, and mashumaro consider the value of a specific class level attribute as a tag value. My proposal is to allow the modification of this "tagger".

@mishamsk
Copy link
Contributor Author

Oh, sorry, I wasn't thinking straight. Yes this makes total sense.

Did you consider keeping only the current field attribute, but allowing it to be a tuple? I.e.:

  • No attribute - assume subclass based discrimination
  • Plain string - current logic, implicitly assume that the provide string is a key in __dict__ (which btw apparently won't work for slotted dataclasses! see code sample below. I wasn't expecting that)
  • A tuple of a string & callable - the new logic, which could also solve for slotted classes

I may be biased, but the word variant doesn't "speak" to me, sounds more like an implementation detail. But, either way, having a variant_tagger_fn is better than not having anything at all;-)

sample code - slotted dataclasses do not have __dict__ on class itself too!

from dataclasses import dataclass
from typing import ClassVar

@dataclass(slots=True)
class Test:
    x: ClassVar[int] = 0
    y: int = 0

print("__dict__" in dir(Test)) # False
print("__dict__" in dir(Test())) # False

@Fatal1ty
Copy link
Owner

@mishamsk Sorry for delay, I was busy at work.

Did you consider keeping only the current field attribute, but allowing it to be a tuple? I.e.:

  • No attribute - assume subclass based discrimination
  • Plain string - current logic, implicitly assume that the provide string is a key in __dict__ (which btw apparently won't work for slotted dataclasses! see code sample below. I wasn't expecting that)
  • A tuple of a string & callable - the new logic, which could also solve for slotted classes

It will work but I don't see any point to make one argument to accept different value types instead of using two arguments. It will also be complicated to explain the use of this parameter in the documentation. Let's make it simple.

I may be biased, but the word variant doesn't "speak" to me, sounds more like an implementation detail. But, either way, having a variant_tagger_fn is better than not having anything at all;-)

I agree that it doesn't sound very well but I didn't manage to find a better word. I do think this complexity has roots in using discriminator for both subclass deserialization and for discriminated union deserialization. If it was only for subclass deserilizaiton then I would called it subclass_tagger_fn. In my defense, I found some articles where the term "variant" is used for the same things:

  • current logic, implicitly assume that the provide string is a key in __dict__ (which btw apparently won't work for slotted dataclasses! see code sample below. I wasn't expecting that

I grabbed my head and almost believed it, but it made me go to check. Fortunately for us, your assumption is only partially correct. The truth is slotted dataclass instances don't have __dict__ attribute but class objects do have it. Compare this:

print("__dict__" in dir(Test))  # False
print(Test.__dict__)  # {'__module__': '__main__', '__annotations__': {'x': typing.ClassVar[int], ...

You might be curious why I even rely on __dict__ attribute when retrieving a tag. The reason is that I wanted to support intermediate nodes in class tree that don't have explicit tag defined and are only used as a base class for their nodes. If we were using getattr(variant_cls, tag) instead of variant_cls.__dict__[tag] we would get a value inherited from an ancestor for those intermediate nodes and overwrite a class registered for the tag.

@mishamsk
Copy link
Contributor Author

nothing to apologize for! appreciate additional insights.

Sorry for the false alarm, didn’t know that dir had completely different behavior based on what’s being inspected.

Regarding the naming, as I’ve said, I do not have a strong opinion. You’ve build & maintained this library all along, so you’ll know better what works for the community. The only thing I can add, is that I’ve been trying to find other libs that supported auto-subclass (de)serialization, and couldn’t find any. Seems odd to me, that there is nothing in-between pickle and strictly typed validation/serialization (pydantic, msgspec, cattrs etc). So having such a feature in mashumaro would further set it apart!

Also, this change seems relatively simple. Probably even simpler than #136. I am on vacation, so am away from coding for the sake of the family, but I may get some time to draft a PR when I am back if you haven’t done it yourself by then.

@Fatal1ty
Copy link
Owner

It was easy to implement :)

@mishamsk
Copy link
Contributor Author

@Fatal1ty nice! you are definitely moving faster than I do;-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants