Current Implementation
The current aloha_avro_score.avdl uses Avro's built-in union type to encode co-products. As such, the description of a score looks like:
record Score {
  union { null, ModelId } model = null;
  union {
    null, boolean, int, long, float, double, string,
    array<union{boolean, int, long, float, double, string}>
  } value = null;
  union { array<Score>, null } subvalues = [];
  union { array<string>, null } errorMsgs = [];
  union { array<string>, null } missingVarNames = [];
  union { null, float } prob = null;
}
Deficiencies
There are some deficiencies with this approach. Using the avro-tools tool chain to compile this IDL to Java results in a Score.value of type Object, since Java has poor support for co-products.
Most non-default tool chains don't support union types with three or more constituent types, and only support two-type unions when one of the types is null. The latter is supported in order to encode Scala's Option type.
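To illustrate the Object problem, here is a hypothetical consumer sketch (the literal stands in for a value pulled from a generated Score; the accessor name avro-tools would generate is not shown):

```scala
// Stand-in for a value read from a generated Score: statically it is just
// an Object (AnyRef), even though an Int is inside at runtime.
val v: AnyRef = java.lang.Integer.valueOf(42)

// Every consumer has to recover the type with a runtime test and cast.
val asInt: Option[Int] = v match {
  case i: java.lang.Integer => Some(i.intValue)
  case _                    => None
}

println(asInt)  // Some(42)
```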
Additionally, union types in Avro can't contain more than one array type. Therefore, when multiple array types are wanted in a union, their element types must be mixed into a single array of a union, even if that's not desired (as it is a laxer type).
Proposal
This proposal is to make the Avro Score message more like the protobuf version, which encodes co-products via multiple nullable variables (one per desired output type) plus a type indicator variable.
While the type indicator variable isn't strictly necessary, it can speed lookups from O(T) (where T is the number of types) to O(1). Obviously, for this to be type-safe, we need type classes (one instance per possible type) to extract the value from the Score.
This approach will work better with current tool chains, which means Aloha scores will no longer need to be special-cased.
Avro IDL for Aloha Score
So, the proposed Avro description of Score would be something like:
/*
  Declared outside the record: Avro IDL doesn't allow named types to be
  nested inside a record declaration.
*/
enum ValueType {
  NO_VALUE,
  BOOLEAN, INT, LONG, FLOAT, DOUBLE, STRING,
  ARRAY_BOOLEAN, ARRAY_INT, ARRAY_LONG, ARRAY_FLOAT,
  ARRAY_DOUBLE, ARRAY_STRING
}

record Score {
  union { null, ModelId } model = null;

  /*
    Indicates which one of the *Value fields is populated. If the model
    could not produce a value, value_type should be set to NO_VALUE.
  */
  union { null, ValueType } value_type = null;

  /*
    Value fields: since this is a representation of a coproduct, either 0 or
    1 of the following should be populated. If 0 *Value fields are
    populated, the value_type field should be NO_VALUE. Otherwise,
    value_type should be set to the type whose *Value field is populated.
  */
  union { null, boolean } booleanValue = null;
  union { null, int } intValue = null;
  union { null, long } longValue = null;
  union { null, float } floatValue = null;
  union { null, double } doubleValue = null;
  union { null, string } stringValue = null;
  union { null, array<boolean> } booleanArrayValue = null;
  union { null, array<int> } intArrayValue = null;
  union { null, array<long> } longArrayValue = null;
  union { null, array<float> } floatArrayValue = null;
  union { null, array<double> } doubleArrayValue = null;
  union { null, array<string> } stringArrayValue = null;

  /*
    Previously, these three values had a default value of an empty list
    rather than null, but certain consumers of Aloha have said this is
    an issue with their Avro tooling.
  */
  union { null, array<Score> } subvalues = null;
  union { null, array<string> } errorMsgs = null;
  union { null, array<string> } missingVarNames = null;

  union { null, float } prob = null;
}
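As a sketch of the intended invariant (using hypothetical stand-ins for the enum and record avro-tools would generate, abbreviated to two types): exactly the field named by value_type is populated, and every other *Value field stays empty, giving O(1) dispatch on the indicator:

```scala
// Hypothetical stand-ins for the generated ValueType enum and Score record.
object ValueType extends Enumeration {
  val NO_VALUE, INT, DOUBLE = Value  // abbreviated for the sketch
}

final case class Score(
  valueType: ValueType.Value = ValueType.NO_VALUE,
  intValue: Option[Int] = None,
  doubleValue: Option[Double] = None
)

// A model that produced the int 42: set value_type and exactly one *Value field.
val s = Score(valueType = ValueType.INT, intValue = Some(42))

// O(1) dispatch on the indicator instead of an O(T) scan over every field.
val shown: String = s.valueType match {
  case ValueType.INT    => s"int: ${s.intValue.get}"
  case ValueType.DOUBLE => s"double: ${s.doubleValue.get}"
  case _                => "no value"
}

println(shown)  // int: 42
```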
Aloha Score Extensions
Scala Extraction Syntax
As in the protobuf case for Scores, we should have type classes to extract the value. For simplicity, it would be nice to have get and apply syntax analogous to Scala's scala.collection.GenMapLike trait. This should be very easy to accomplish and would look like this:
val s: Score = ???
val oi: Option[Int] = s.get[Int] // Safe.
val i: Int = s[Int] // May throw java.util.NoSuchElementException.
Using this syntax should require an import, but the type class instances should all live in the appropriate implicit scope so that they don't need to be pulled into scope explicitly.
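A minimal sketch of what that type-class machinery might look like (all names are hypothetical, and a simplified case class stands in for the generated Score):

```scala
// Simplified stand-in for the generated Score record.
final case class Score(
  intValue: Option[Int] = None,
  stringValue: Option[String] = None
)

// One type class instance per extractable type.
trait ScoreValue[A] { def extract(s: Score): Option[A] }

object ScoreValue {
  // Instances live in the companion, so they are found via implicit scope
  // without an explicit import.
  implicit val intScoreValue: ScoreValue[Int] =
    new ScoreValue[Int] { def extract(s: Score) = s.intValue }
  implicit val stringScoreValue: ScoreValue[String] =
    new ScoreValue[String] { def extract(s: Score) = s.stringValue }
}

// get / apply syntax analogous to scala.collection.GenMapLike.
implicit class ScoreOps(val s: Score) {
  def get[A](implicit ev: ScoreValue[A]): Option[A] = ev.extract(s)
  def apply[A](implicit ev: ScoreValue[A]): A =
    get[A].getOrElse(throw new java.util.NoSuchElementException)
}

val s = Score(intValue = Some(1))
val oi: Option[Int] = s.get[Int]  // Some(1) -- safe
val i: Int = s[Int]               // 1       -- may throw
```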
Additional Coproduct Support
Perhaps it's also useful to consider coproducts at the type level using an encoding with something like cats or iota.
Java Extraction Syntax
T.B.D.
Perhaps something like a rich score wrapper with getT methods for each output type T. These should probably throw when the incorrect type is requested. A policy for up-casting should be considered: for instance, what happens when a consumer asks for a Double but the type was a Long or a Float? The current protobuf Score code up-casts, but this should be readdressed now, and the Avro and protobuf behavior should be made consistent. Perhaps common interfaces could be extracted, with rich implementations that provide more seamless interoperability.
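One possible widening policy is sketched below (a hypothetical design, not the current protobuf behavior; all names are invented): a getDouble request succeeds when the stored type widens numerically, and throws otherwise.

```scala
// Hypothetical stored-value representation, standing in for a generated Score.
sealed trait Stored
final case class StoredLong(v: Long) extends Stored
final case class StoredFloat(v: Float) extends Stored
final case class StoredString(v: String) extends Stored

// One possible answer to the up-casting question: widen Long and Float to
// Double rather than throwing, but reject non-numeric stored types.
def getDouble(s: Stored): Double = s match {
  case StoredLong(v)  => v.toDouble
  case StoredFloat(v) => v.toDouble
  case _              => throw new IllegalArgumentException("not a numeric score")
}

println(getDouble(StoredLong(3L)))  // 3.0
```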
Implementation Details
Map types are coming soon to the Avro score for multilabel learning, and this presents a problem: quadratically many fields are required to cover the key and value type spaces when encoding this kind of score coproduct. Therefore, it might be a good time to think about generating the Avro schema programmatically. This might be possible with something like the sbt-boilerplate plugin; if not, we might want to think about writing such a plugin.
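As a sketch of the programmatic route (plain string templating rather than sbt-boilerplate; field naming is a guess, and note that real Avro maps are string-keyed, so non-string keys would need a different encoding — the map lines below only illustrate the quadratic field count):

```scala
val scalarTypes = Seq("boolean", "int", "long", "float", "double", "string")

// One nullable field per scalar and per array type, as in the IDL above.
val scalarFields = scalarTypes.map(t => s"union { null, $t } ${t}Value = null;")
val arrayFields  = scalarTypes.map(t => s"union { null, array<$t> } ${t}ArrayValue = null;")

// Map support would need a field per (key type, value type) pair:
// quadratically many fields, which motivates generating the schema.
val mapFields = for {
  k <- scalarTypes
  v <- scalarTypes
} yield s"union { null, map<$v> } ${k}To${v.capitalize}MapValue = null;"

println(scalarFields.size)  // 6
println(mapFields.size)     // 36
```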