Add support to deserialize using writer and reader schemas #30

Open
hlagosp opened this issue May 1, 2018 · 6 comments

hlagosp commented May 1, 2018

It would be great to have support for writer and reader schemas during record deserialization. It should significantly improve performance when you work with big Avro objects but only need a few fields from them. It could be something like

encoder = quickavro.BinaryEncoder(schema=writerSchema, readerSchema=myReaderSchema, codec="null")

Thanks!

ChrisRx (Owner) commented May 1, 2018

On what basis will the performance significantly improve? How will the reader schema be handled by the underlying C implementation, AvroC? It is passed a schema, and a record is written and returned; are you proposing that an optional extra pass is done at the end so that only the fields given by the reader schema are returned from C to Python? If so, do you have any benchmarks that would validate this performance increase?

hlagosp (Author) commented May 2, 2018

Hi! Thanks for your quick answer. Sorry if something I say is wrong (I am not a Python developer at all, but doing my best :)). It looks like the current implementation uses the decoder directly from C++ (https://github.com/ChrisRx/quickavro/blob/master/src/encoderobject.c#L50), not going through a resolvingDecoder like here: https://github.com/apache/avro/blob/master/lang/c++/impl/Generic.cc#L47. Do you think it is possible to instantiate the decoder from the resolvingDecoder? It already seems to support a writer/reader schema pair when creating the decoder.

About the performance comments: I can prepare some big Avro schemas (we have some like that in our production system) and compare how different the deserialization is. However, it should be faster, considering that the amount of record data to deserialize and map to Python should be significantly smaller in a lot of scenarios. Anyway, I think it would be great to add this feature, especially considering that the Java/C++ Avro implementations support it.

Thanks again for your support

ChrisRx (Owner) commented May 2, 2018

This implementation uses the C library (rather than the C++ one), but it looks like AvroC has both an avro_resolved_reader and an avro_resolved_writer. I'm not familiar with the resolved/resolving readers/writers, so thank you for pointing that out! I think this is a very good performance enhancement to add. In particular, I believe this will be a huge performance increase in scenarios like the one you presented, by reducing how much work is spent converting C null-terminated strings into Python strings, which is quite costly in Python (not to mention the AvroC performance gained from not resolving them in the C library either).

I want to give it a good once-over before committing it, and providing those schemas to help test would be very helpful. However, I think this will be an easy change, as I've already made some modifications to the code and it appears to already be working correctly.
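
For reference, here is a minimal sketch of the resolved-writer pattern in plain AvroC, outside of quickavro (this is not the code in this repository; the schemas and field names are invented for illustration, assuming avro-c's resolver API). The resolved value implements the writer schema, but everything it decodes is stored into a reader-schema destination, so fields absent from the reader schema are skipped:

```c
/* resolve_sketch.c -- illustrative only, not quickavro code.
 * Build (assuming avro-c is installed): cc resolve_sketch.c -lavro
 */
#include <stdio.h>
#include <avro.h>

/* Hypothetical writer schema: the full record produced upstream. */
#define WRITER_SCHEMA \
    "{\"type\":\"record\",\"name\":\"Event\",\"fields\":[" \
    "{\"name\":\"id\",\"type\":\"long\"}," \
    "{\"name\":\"payload\",\"type\":\"string\"}," \
    "{\"name\":\"details\",\"type\":\"string\"}]}"

/* Hypothetical reader schema: only the field we actually need. */
#define READER_SCHEMA \
    "{\"type\":\"record\",\"name\":\"Event\",\"fields\":[" \
    "{\"name\":\"id\",\"type\":\"long\"}]}"

int main(void)
{
    avro_schema_t writer_schema, reader_schema;
    if (avro_schema_from_json_literal(WRITER_SCHEMA, &writer_schema) ||
        avro_schema_from_json_literal(READER_SCHEMA, &reader_schema)) {
        fprintf(stderr, "schema error: %s\n", avro_strerror());
        return 1;
    }

    /* Encode one record with the writer schema, as a stand-in for real data. */
    char buf[4096];
    avro_writer_t out = avro_writer_memory(buf, sizeof(buf));
    avro_value_iface_t *writer_class = avro_generic_class_from_schema(writer_schema);
    avro_value_t record, field;
    avro_generic_value_new(writer_class, &record);
    avro_value_get_by_name(&record, "id", &field, NULL);
    avro_value_set_long(&field, 42);
    avro_value_get_by_name(&record, "payload", &field, NULL);
    avro_value_set_string(&field, "a large field we do not care about");
    avro_value_get_by_name(&record, "details", &field, NULL);
    avro_value_set_string(&field, "another large field we do not care about");
    avro_value_write(out, &record);

    /* Decode through a resolved writer: it implements the writer schema but
     * fills a reader-schema destination, skipping the unwanted fields. */
    avro_value_iface_t *reader_class = avro_generic_class_from_schema(reader_schema);
    avro_value_t result;
    avro_generic_value_new(reader_class, &result);

    avro_value_iface_t *resolver = avro_resolved_writer_new(writer_schema, reader_schema);
    if (resolver == NULL) {
        fprintf(stderr, "schemas are not compatible: %s\n", avro_strerror());
        return 1;
    }
    avro_value_t resolved;
    avro_resolved_writer_new_value(resolver, &resolved);
    avro_resolved_writer_set_dest(&resolved, &result);

    avro_reader_t in = avro_reader_memory(buf, avro_writer_tell(out));
    avro_value_read(in, &resolved);

    /* Only "id" survives; "payload" and "details" never reach Python land. */
    int64_t id;
    avro_value_get_by_name(&result, "id", &field, NULL);
    avro_value_get_long(&field, &id);
    printf("id = %lld\n", (long long) id);

    /* Cleanup. */
    avro_reader_free(in);
    avro_writer_free(out);
    avro_value_decref(&resolved);
    avro_value_decref(&result);
    avro_value_decref(&record);
    avro_value_iface_decref(resolver);
    avro_value_iface_decref(reader_class);
    avro_value_iface_decref(writer_class);
    avro_schema_decref(writer_schema);
    avro_schema_decref(reader_schema);
    return 0;
}
```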

hlagosp (Author) commented May 2, 2018

I created two schemas. Attached are the reader_schema and the sample_schema (the latter is the full schema). Please let me know if these work for the testing.

reader_schema.txt
sample_schema.txt

ChrisRx added a commit that referenced this issue May 3, 2018
Using AvroC's resolving writer/reader allows the implementation
of schema resolution. In the case where a new "reader schema"
is used while reading avro records, this schema will be applied
rather than the originating schema (confusingly referred to as
the "writer schema"). In cases when writing avro records, the
so-called "reader schema" is used to resolve to the new schema,
allowing things like promotion of types (where valid).
This ends up fixing #30, where the need is to skip fields
during resolution for performance reasons.
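
For illustration, a short hedged sketch of the type-promotion point above, again in plain AvroC rather than the schema-resolution branch itself (the schemas are invented): avro_resolved_writer_new() returns NULL when two schemas cannot be resolved, and accepts valid promotions such as data written as int being read as long.

```c
/* promotion_sketch.c -- illustrative only, not from the schema-resolution branch.
 * Build (assuming avro-c is installed): cc promotion_sketch.c -lavro
 */
#include <stdio.h>
#include <avro.h>

/* Hypothetical old (writer) schema: "count" is an int. */
#define OLD_SCHEMA \
    "{\"type\":\"record\",\"name\":\"Counter\",\"fields\":[" \
    "{\"name\":\"count\",\"type\":\"int\"}]}"

/* Hypothetical new (reader) schema: "count" is promoted to long. */
#define NEW_SCHEMA \
    "{\"type\":\"record\",\"name\":\"Counter\",\"fields\":[" \
    "{\"name\":\"count\",\"type\":\"long\"}]}"

int main(void)
{
    avro_schema_t old_schema, new_schema;
    avro_schema_from_json_literal(OLD_SCHEMA, &old_schema);
    avro_schema_from_json_literal(NEW_SCHEMA, &new_schema);

    /* NULL means the schemas cannot be resolved; int -> long is a valid
     * promotion per the Avro spec, so this returns a usable resolver. */
    avro_value_iface_t *resolver = avro_resolved_writer_new(old_schema, new_schema);
    printf("int -> long promotion %s\n", resolver ? "resolves" : "does not resolve");

    if (resolver != NULL) {
        avro_value_iface_decref(resolver);
    }
    avro_schema_decref(old_schema);
    avro_schema_decref(new_schema);
    return 0;
}
```
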
ChrisRx (Owner) commented May 7, 2018

Not sure if you had a chance to look at the schema-resolution branch, but I would welcome any thoughts if you ended up looking at it or trying it out. Otherwise, I was going to go ahead and merge it because it appears to be sound, and I added a couple of initial unit tests for it that appear to be working as expected.

hlagosp (Author) commented May 7, 2018

I did, and it is working great. I tested using some schemas similar to those I provided before, and deserialization is taking around 70 to 80% less time.
