Add support to deserialize using writer and reader schemas #30
On what basis will the performance significantly improve? How will the reader schema be handled by the underlying C implementation, AvroC? It is passed a schema, and a record is written and returned; are you proposing that an optional extra pass is done at the end so that only the fields given by the reader schema are returned from C to Python? If so, do you have any benchmarks that might validate this performance increase?
Hi! Thanks for your quick answer. Sorry if anything I say is wrong (I am not a Python developer at all, but I'm doing my best :)). It looks like the current implementation is using the decoder directly from C++ (https://github.com/ChrisRx/quickavro/blob/master/src/encoderobject.c#L50), not going through resolvingDecoder like here: https://github.com/apache/avro/blob/master/lang/c++/impl/Generic.cc#L47. Do you think it is possible to instantiate the decoder from the resolvingDecoder? It seems to already accept a writer/reader schema pair when producing the decoder. About the performance comments: I can prepare some big Avro schemas (we have some like those in our production system) and compare how different the deserialization is; in any case it should be faster, considering that the amount of data to deserialize and map to Python should be significantly smaller in a lot of scenarios. Anyway, I think it would be great to add this feature, especially considering that the Java/C++ Avro implementations support it. Thanks again for your support.
This implementation uses the C library (rather than the C++ one), but it looks like AvroC has both an avro_resolved_reader and an avro_resolved_writer. I'm not familiar with the resolved/resolving readers/writers, so thank you for pointing that out! I think this is a very good performance enhancement to add. In particular, I believe this will be a huge performance increase in scenarios like the one you presented, by reducing how much work is spent converting the C null-terminated strings into Python strings, which is quite costly on the Python side (not to mention the AvroC performance gained from not resolving those fields in the C library either). I want to give it a good once-over before committing it, and providing those schemas to help test would be very helpful; however, I think this will be an easy change, as I've already made some modifications to the code and it appears to be working correctly.
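For a sense of where the Python-side cost comes from, here is a minimal, purely illustrative sketch (not quickavro's actual extension code) of what converting a single decoded string field into a Python object involves in a CPython extension. Any field the reader schema drops never has to make this C-to-Python crossing.

```c
/* Illustrative only: roughly what one decoded Avro string field costs to
 * surface in Python. The helper name is hypothetical. */
#include <Python.h>
#include <avro.h>

static PyObject *
string_field_to_python(avro_value_t *field)
{
    const char *str = NULL;
    size_t size = 0;

    /* avro-c hands back a pointer to the NUL-terminated string value */
    if (avro_value_get_string(field, &str, &size) != 0) {
        PyErr_SetString(PyExc_ValueError, avro_strerror());
        return NULL;
    }

    /* allocates a new Python str and copies the bytes into it */
    return PyUnicode_FromString(str);
}
```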
I created 2 schemas. Attached are the reader_schema and the sample_schema (this is the full schema). Please let me know if these work for the testing.
Using AvroC's resolving writer/reader allows the implementation of schema resolution. When a new "reader schema" is supplied while reading Avro records, that schema is applied rather than the originating schema (confusingly referred to as the "writer schema"). When writing Avro records, the so-called "reader schema" is used to resolve to the new schema, allowing things like type promotion (where valid). This ends up fixing #30, where the need is to skip fields during resolution for performance reasons.
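For readers unfamiliar with AvroC's resolver API, below is a rough sketch of the pattern, not the actual quickavro patch: the schemas are placeholders, error handling is omitted, and the function is only an outline of how a resolved writer decodes writer-schema data into a reader-schema value while skipping writer-only fields.

```c
#include <stdint.h>
#include <avro.h>

/* Placeholder schemas for illustration: the writer has an extra "payload"
 * field that the reader schema omits, so resolution will skip it. */
#define WRITER_SCHEMA \
    "{\"type\":\"record\",\"name\":\"Example\",\"fields\":[" \
    "{\"name\":\"id\",\"type\":\"long\"},{\"name\":\"payload\",\"type\":\"string\"}]}"
#define READER_SCHEMA \
    "{\"type\":\"record\",\"name\":\"Example\",\"fields\":[" \
    "{\"name\":\"id\",\"type\":\"long\"}]}"

/* Decode one binary-encoded record written with WRITER_SCHEMA into a value
 * shaped by READER_SCHEMA. Error handling is omitted for brevity. */
int read_with_resolution(const char *buf, int64_t len)
{
    avro_schema_t wschema, rschema;
    avro_schema_from_json_literal(WRITER_SCHEMA, &wschema);
    avro_schema_from_json_literal(READER_SCHEMA, &rschema);

    /* Value interface that reads writer-schema data into reader-schema values */
    avro_value_iface_t *resolver = avro_resolved_writer_new(wschema, rschema);

    /* Destination value with the reader schema */
    avro_value_iface_t *reader_class = avro_generic_class_from_schema(rschema);
    avro_value_t dest, resolved;
    avro_generic_value_new(reader_class, &dest);

    /* The resolved value wraps dest; reading into it performs resolution,
     * skipping fields (like "payload") that the reader schema does not have. */
    avro_resolved_writer_new_value(resolver, &resolved);
    avro_resolved_writer_set_dest(&resolved, &dest);

    avro_reader_t reader = avro_reader_memory(buf, len);
    int rc = avro_value_read(reader, &resolved);
    /* dest now holds the record as seen through the reader schema */

    avro_reader_free(reader);
    avro_value_decref(&resolved);
    avro_value_decref(&dest);
    avro_value_iface_decref(reader_class);
    avro_value_iface_decref(resolver);
    avro_schema_decref(wschema);
    avro_schema_decref(rschema);
    return rc;
}
```

Skipping "payload" entirely inside the C library, instead of decoding it and converting it to Python objects, is where the decode-time savings reported later in the thread would come from.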
Not sure if you had a chance to look at the
I did, and it is working great. I tested using some schemas similar to those I provided before, and deserialization is taking around 70 to 80% less time.
It would be great to have support for a writer and a reader schema during record deserialization. It should significantly improve performance when you work with big Avro objects but only need a few fields from them. It could be something like optionally passing a reader schema in addition to the writer schema when reading records.
Thanks!