Add support to deserialize using writer and reader schemas #30

Open
hlagosp opened this issue May 1, 2018 · 6 comments

hlagosp commented May 1, 2018

It would be great to have support for writer and reader schemas during record deserialization. It should significantly improve performance when you work with big Avro objects but only need a few fields from them. It could be something like

encoder = quickavro.BinaryEncoder(schema=writerSchema, readerSchema=myReaderSchema, codec="null")

Thanks!

ChrisRx (Owner) commented May 1, 2018

On what basis will the performance significantly improve? How will the reader schema be handled by the underlying C implementation, AvroC? It is passed a schema, and a record is written and returned; are you proposing that an optional extra pass is done at the end so that only the fields given by the reader schema are returned from C to Python? If so, do you have any benchmarks that would validate this performance increase?

hlagosp (Author) commented May 2, 2018

Hi! Thanks for your quick answer. Sorry if something I say is wrong (I am not a Python developer at all, but doing my best :)). It looks like the current implementation uses the decoder directly from C++ (https://github.com/ChrisRx/quickavro/blob/master/src/encoderobject.c#L50), not going through a resolvingDecoder like here: https://github.com/apache/avro/blob/master/lang/c++/impl/Generic.cc#L47. Do you think it is possible to instantiate the decoder from the resolvingDecoder? It already seems to support a writer/reader schema pair when creating the decoder.

About the performance comments: I can prepare some big Avro schemas (we have some like that in our production system) and compare how different the deserialization is. However, it should be faster, considering that the amount of record data to deserialize and map to Python should be significantly smaller in a lot of scenarios. Anyway, I think it would be great to add this feature, especially considering that the Java/C++ Avro implementations support it.

Thanks again for your support

ChrisRx (Owner) commented May 2, 2018

This implementation uses the C library (rather than the C++ one), but it looks like AvroC has both an avro_resolved_reader and an avro_resolved_writer. I'm not familiar with the resolved/resolving readers/writers, so thank you for pointing that out! I think this is a very good performance enhancement to add. In particular, I believe this will be a huge performance increase in scenarios like the one you presented, by reducing how much work is spent converting C null-terminated strings into Python strings, which is quite costly in Python (not to mention the AvroC performance gained from not resolving them in the C library either).

I want to give it a good once-over before committing it, and providing those schemas to help test would be very helpful. However, I think this will be an easy change, as I've already made some modifications to the code and it appears to already be working correctly.
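
For reference, here is a minimal sketch of the resolved-writer pattern in plain AvroC, outside of quickavro (this is not the code in this repository; the schemas and field names are invented for illustration, assuming avro-c's resolver API). The resolved value implements the writer schema, but everything it decodes is stored into a reader-schema destination, so fields absent from the reader schema are skipped:

```c
/* resolve_sketch.c -- illustrative only, not quickavro code.
 * Build (assuming avro-c is installed): cc resolve_sketch.c -lavro
 */
#include <stdio.h>
#include <avro.h>

/* Hypothetical writer schema: the full record produced upstream. */
#define WRITER_SCHEMA \
    "{\"type\":\"record\",\"name\":\"Event\",\"fields\":[" \
    "{\"name\":\"id\",\"type\":\"long\"}," \
    "{\"name\":\"payload\",\"type\":\"string\"}," \
    "{\"name\":\"details\",\"type\":\"string\"}]}"

/* Hypothetical reader schema: only the field we actually need. */
#define READER_SCHEMA \
    "{\"type\":\"record\",\"name\":\"Event\",\"fields\":[" \
    "{\"name\":\"id\",\"type\":\"long\"}]}"

int main(void)
{
    avro_schema_t writer_schema, reader_schema;
    if (avro_schema_from_json_literal(WRITER_SCHEMA, &writer_schema) ||
        avro_schema_from_json_literal(READER_SCHEMA, &reader_schema)) {
        fprintf(stderr, "schema error: %s\n", avro_strerror());
        return 1;
    }

    /* Encode one record with the writer schema, as a stand-in for real data. */
    char buf[4096];
    avro_writer_t out = avro_writer_memory(buf, sizeof(buf));
    avro_value_iface_t *writer_class = avro_generic_class_from_schema(writer_schema);
    avro_value_t record, field;
    avro_generic_value_new(writer_class, &record);
    avro_value_get_by_name(&record, "id", &field, NULL);
    avro_value_set_long(&field, 42);
    avro_value_get_by_name(&record, "payload", &field, NULL);
    avro_value_set_string(&field, "a large field we do not care about");
    avro_value_get_by_name(&record, "details", &field, NULL);
    avro_value_set_string(&field, "another large field we do not care about");
    avro_value_write(out, &record);

    /* Decode through a resolved writer: it implements the writer schema but
     * fills a reader-schema destination, skipping the unwanted fields. */
    avro_value_iface_t *reader_class = avro_generic_class_from_schema(reader_schema);
    avro_value_t result;
    avro_generic_value_new(reader_class, &result);

    avro_value_iface_t *resolver = avro_resolved_writer_new(writer_schema, reader_schema);
    if (resolver == NULL) {
        fprintf(stderr, "schemas are not compatible: %s\n", avro_strerror());
        return 1;
    }
    avro_value_t resolved;
    avro_resolved_writer_new_value(resolver, &resolved);
    avro_resolved_writer_set_dest(&resolved, &result);

    avro_reader_t in = avro_reader_memory(buf, avro_writer_tell(out));
    avro_value_read(in, &resolved);

    /* Only "id" survives; "payload" and "details" never reach Python land. */
    int64_t id;
    avro_value_get_by_name(&result, "id", &field, NULL);
    avro_value_get_long(&field, &id);
    printf("id = %lld\n", (long long) id);

    /* Cleanup. */
    avro_reader_free(in);
    avro_writer_free(out);
    avro_value_decref(&resolved);
    avro_value_decref(&result);
    avro_value_decref(&record);
    avro_value_iface_decref(resolver);
    avro_value_iface_decref(reader_class);
    avro_value_iface_decref(writer_class);
    avro_schema_decref(writer_schema);
    avro_schema_decref(reader_schema);
    return 0;
}
```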

hlagosp (Author) commented May 2, 2018

I created two schemas. Attached are the reader_schema and the sample_schema (the latter is the full schema). Please let me know if these work for the testing.

reader_schema.txt
sample_schema.txt

ChrisRx added a commit that referenced this issue May 3, 2018
Using AvroC's resolving writer/reader allows the implementation
of schema resolution. In the case where a new "reader schema"
is used while reading avro records, this schema will be applied
rather than the originating schema (confusingly referred to as
the "writer schema"). In cases when writing avro records, the
so-called "reader schema" is used to resolve to the new schema,
allowing things like promotion of types (where valid).
This ends up fixing #30, where the need is to skip fields
during resolution for performance reasons.
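
For illustration, a short hedged sketch of the type-promotion point above, again in plain AvroC rather than the schema-resolution branch itself (the schemas are invented): avro_resolved_writer_new() returns NULL when two schemas cannot be resolved, and accepts valid promotions such as data written as int being read as long.

```c
/* promotion_sketch.c -- illustrative only, not from the schema-resolution branch.
 * Build (assuming avro-c is installed): cc promotion_sketch.c -lavro
 */
#include <stdio.h>
#include <avro.h>

/* Hypothetical old (writer) schema: "count" is an int. */
#define OLD_SCHEMA \
    "{\"type\":\"record\",\"name\":\"Counter\",\"fields\":[" \
    "{\"name\":\"count\",\"type\":\"int\"}]}"

/* Hypothetical new (reader) schema: "count" is promoted to long. */
#define NEW_SCHEMA \
    "{\"type\":\"record\",\"name\":\"Counter\",\"fields\":[" \
    "{\"name\":\"count\",\"type\":\"long\"}]}"

int main(void)
{
    avro_schema_t old_schema, new_schema;
    avro_schema_from_json_literal(OLD_SCHEMA, &old_schema);
    avro_schema_from_json_literal(NEW_SCHEMA, &new_schema);

    /* NULL means the schemas cannot be resolved; int -> long is a valid
     * promotion per the Avro spec, so this returns a usable resolver. */
    avro_value_iface_t *resolver = avro_resolved_writer_new(old_schema, new_schema);
    printf("int -> long promotion %s\n", resolver ? "resolves" : "does not resolve");

    if (resolver != NULL) {
        avro_value_iface_decref(resolver);
    }
    avro_schema_decref(old_schema);
    avro_schema_decref(new_schema);
    return 0;
}
```
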
ChrisRx (Owner) commented May 7, 2018

Not sure if you had a chance to look at the schema-resolution branch, but I would welcome any thoughts if you ended up looking at it or trying it out. Otherwise, I was going to go ahead and merge it because it appears to be sound, and I added a couple of initial unit tests for it that appear to be working as expected.

hlagosp (Author) commented May 7, 2018

I did, and it is working great. I tested using some schemas similar to those I provided before, and deserialization is taking around 70 to 80% less time.
