Consider Avro Score format changes #191

@deaktator

Current Implementation

The current aloha_avro_score.avdl encodes co-products with Avro's built-in union type. As such, the description of a score looks like:

record Score {
    union { null, ModelId } model = null;

    union {
      null, boolean, int, long, float, double, string,
      array<union{boolean, int, long, float, double, string}>
    } value = null;

    union { array<Score>, null } subvalues = [];
    union { array<string>, null } errorMsgs = [];
    union { array<string>, null } missingVarNames = [];
    union { null, float } prob = null;
}

Deficiencies

There are some deficiencies with this approach. Compiling this schema to Java with the avro-tools tool chain results in a Score.value field of type Object, since Java has poor support for co-products.

Most non-default tool chains don't support unions of three or more constituent types. They support two-type unions only when one of the types is null, and that support exists to encode Option types in Scala.

Additionally, an Avro union can't contain more than one array type. The single array branch in a union must therefore hold mixed element types, even when that's not desired (as it's a more lax type).
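
To make the consumer-side cost concrete, here is a minimal sketch of reading such a field from Scala. `UntypedScore` is a hypothetical stand-in for the generated class, not the actual avro-tools output; the only thing it models is that the union collapses to `Object` (`AnyRef`).

```scala
object UntypedValueSketch {
  // Hypothetical stand-in for the generated class: the union becomes AnyRef.
  final case class UntypedScore(value: AnyRef)

  // Each consumer has to recover the type with a runtime check,
  // and the compiler cannot tell whether all cases are covered.
  def asDouble(s: UntypedScore): Option[Double] = s.value match {
    case d: java.lang.Double => Some(d.doubleValue)
    case _                   => None
  }
}
```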

Proposal

This proposal is to make the Avro Score message more like the protobuf version, which encodes co-products via multiple nullable fields, one per desired output type, plus a type indicator field.

While the type indicator variable isn't strictly necessary, it can speed lookups from O(T) (where T is the number of types) to O(1). Obviously, for this to be type-safe, we need type classes (one instance per possible type) to extract the value from the Score.
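
A minimal sketch of that idea, under stated assumptions: `Score`, `ValueType`, and `ScoreExtractor` below are hypothetical stand-ins (only three output types are modeled), not the generated Avro classes or an existing Aloha API. The point is that each type class instance checks the indicator once and then reads exactly one field.

```scala
object ScoreExtractionSketch {
  // Hypothetical stand-ins for the generated classes.
  sealed trait ValueType
  case object NoValue    extends ValueType
  case object IntType    extends ValueType
  case object DoubleType extends ValueType
  case object StringType extends ValueType

  final case class Score(
    valueType: ValueType = NoValue,
    intValue: Option[Int] = None,
    doubleValue: Option[Double] = None,
    stringValue: Option[String] = None
  )

  // One instance per supported output type.
  trait ScoreExtractor[A] {
    def tag: ValueType
    def field(s: Score): Option[A]

    // O(1): compare the indicator, then read a single field.
    final def extract(s: Score): Option[A] =
      if (s.valueType == tag) field(s) else None
  }

  object ScoreExtractor {
    private def instance[A](t: ValueType)(f: Score => Option[A]): ScoreExtractor[A] =
      new ScoreExtractor[A] {
        val tag: ValueType = t
        def field(s: Score): Option[A] = f(s)
      }

    implicit val intExtractor: ScoreExtractor[Int] =
      instance[Int](IntType)(_.intValue)
    implicit val doubleExtractor: ScoreExtractor[Double] =
      instance[Double](DoubleType)(_.doubleValue)
    implicit val stringExtractor: ScoreExtractor[String] =
      instance[String](StringType)(_.stringValue)
  }

  def extract[A](s: Score)(implicit ex: ScoreExtractor[A]): Option[A] = ex.extract(s)
}
```

Without the indicator, a generic reader would have to probe each nullable field in turn, which is where the O(T) cost comes from.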

This approach will work better with current tool chains, so Aloha scores will no longer need special-casing.

Avro IDL for Aloha Score

So, the proposed Avro description of Score would be something like:

/* Avro IDL declares named types at the protocol level, so the enum sits beside the record. */
enum ValueType {
    NO_VALUE,

    BOOLEAN, INT, LONG, FLOAT, DOUBLE, STRING,

    ARRAY_BOOLEAN, ARRAY_INT, ARRAY_LONG, ARRAY_FLOAT,
    ARRAY_DOUBLE, ARRAY_STRING
}

record Score {
    union { null, ModelId } model = null;

    /*
      Indicates which one of the *Value fields is populated.  If the model
      could not produce a value, value_type should be set to NO_VALUE.
    */
    union { null, ValueType      } value_type = null;

    /* 
      Value types:  Since this is a representation of a coproduct, either 0 or
      1 of the following should be populated.  If 0 *Value fields are 
      populated, the value_type field should be NO_VALUE.  Otherwise, 
      value_type should be set to the type whose *Value field is populated.
    */
    union { null, boolean        } booleanValue      = null;
    union { null, int            } intValue          = null;
    union { null, long           } longValue         = null;
    union { null, float          } floatValue        = null;
    union { null, double         } doubleValue       = null;
    union { null, string         } stringValue       = null;
    union { null, array<boolean> } booleanArrayValue = null;
    union { null, array<int>     } intArrayValue     = null;
    union { null, array<long>    } longArrayValue    = null;
    union { null, array<float>   } floatArrayValue   = null;
    union { null, array<double>  } doubleArrayValue  = null;
    union { null, array<string>  } stringArrayValue  = null;

    /*
      Previously, these three values had a default value of an empty list 
      rather than null, but certain consumers of Aloha have said this is 
      an issue with their Avro tooling.
    */
    union { null, array<Score>   } subvalues         = null;
    union { null, array<string>  } errorMsgs         = null;
    union { null, array<string>  } missingVarNames   = null;

    union { null, float          } prob              = null;
}
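
The invariant described in the schema comments (zero or one *Value field populated, with value_type agreeing) is not enforced by Avro itself, so a producer or test might check it explicitly. The following is only a sketch: `ScoreView` is a hypothetical reduced view of a Score (the indicator name plus the names of the non-null *Value fields), and treating a missing or unknown indicator as inconsistent is an assumption.

```scala
object ScoreInvariantSketch {
  // Hypothetical reduced view: indicator name and the set of non-null *Value fields.
  final case class ScoreView(valueType: String, populated: Set[String])

  // The single field each indicator permits, following the proposed schema.
  val fieldFor: Map[String, String] = Map(
    "BOOLEAN"       -> "booleanValue",
    "INT"           -> "intValue",
    "LONG"          -> "longValue",
    "FLOAT"         -> "floatValue",
    "DOUBLE"        -> "doubleValue",
    "STRING"        -> "stringValue",
    "ARRAY_BOOLEAN" -> "booleanArrayValue",
    "ARRAY_INT"     -> "intArrayValue",
    "ARRAY_LONG"    -> "longArrayValue",
    "ARRAY_FLOAT"   -> "floatArrayValue",
    "ARRAY_DOUBLE"  -> "doubleArrayValue",
    "ARRAY_STRING"  -> "stringArrayValue"
  )

  // NO_VALUE requires no populated field; any other indicator requires
  // exactly its own field. Unknown indicators are rejected (an assumption).
  def isConsistent(s: ScoreView): Boolean =
    if (s.valueType == "NO_VALUE") s.populated.isEmpty
    else fieldFor.get(s.valueType).exists(f => s.populated == Set(f))
}
```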

Aloha Score extensions

Scala Extraction Syntax

As with protobuf Scores, we should have type classes to extract the value. For simplicity, it would be nice to have get and apply syntax analogous to Scala's scala.collection.GenMapLike trait. This should be very easy to accomplish and would look like this:

val s: Score = ???
val oi: Option[Int] = s.get[Int] // Safe.
val i: Int = s[Int]              // May throw java.util.NoSuchElementException.

This should involve importing the syntax, but the conversions should all be in the appropriate implicit scope so that converters don't need to be explicitly pulled into scope.
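
One way this could be laid out, as a sketch rather than a settled design: the type class instances live in the companion object, so they are always in implicit scope, while the `get`/`apply` methods come from an implicit class that has to be imported. `Score`, `ScoreExtractor`, and the `syntax` object are hypothetical names, and only two output types are modeled.

```scala
object ScoreSyntaxSketch {
  // Hypothetical stand-in for the generated Avro class.
  final case class Score(
    intValue: Option[Int] = None,
    stringValue: Option[String] = None
  )

  trait ScoreExtractor[A] {
    def extract(s: Score): Option[A]
  }

  object ScoreExtractor {
    // Instances in the companion are found without any import.
    implicit val intExtractor: ScoreExtractor[Int] = new ScoreExtractor[Int] {
      def extract(s: Score): Option[Int] = s.intValue
    }
    implicit val stringExtractor: ScoreExtractor[String] = new ScoreExtractor[String] {
      def extract(s: Score): Option[String] = s.stringValue
    }
  }

  // Only the syntax has to be imported explicitly.
  object syntax {
    implicit class ScoreOps(private val s: Score) extends AnyVal {
      // Safe variant.
      def get[A](implicit ex: ScoreExtractor[A]): Option[A] = ex.extract(s)

      // Throwing variant, mirroring Map's apply.
      def apply[A](implicit ex: ScoreExtractor[A]): A =
        get[A].getOrElse(
          throw new java.util.NoSuchElementException("no value of the requested type")
        )
    }
  }
}
```

With `import ScoreSyntaxSketch.syntax._` in scope, `s.get[Int]` returns an Option and `s.apply[Int]` throws when no Int value is present.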

Additional Coproduct Support

Perhaps it's also useful to consider coproducts at the type level using an encoding with something like cats or iota.

Java Extraction Syntax

T.B.D.

Perhaps something like a rich score wrapper with getT methods, one for each output type T. These should probably throw when the requested type doesn't match the stored type. Up-casting also needs a decision: for instance, what happens when a consumer asks for a Double but the stored value is a Long or a Float? The current protobuf Score code performs this up-casting, but it should be readdressed at this time, and the Avro and protobuf behavior should be made consistent. Perhaps common interfaces could be extracted and we could have rich implementations that provide more seamless interoperability.
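
The two candidate policies can be contrasted in a few lines. This sketch is written in Scala for brevity and uses a hypothetical `Value` ADT in place of a real Score; it returns Option rather than throwing, which is itself a choice left open above.

```scala
object WideningSketch {
  // Hypothetical stand-in for the value held by a Score.
  sealed trait Value
  final case class LongV(v: Long)     extends Value
  final case class FloatV(v: Float)   extends Value
  final case class DoubleV(v: Double) extends Value
  final case class StringV(v: String) extends Value

  // Strict policy: only an exact type match succeeds.
  def getDoubleStrict(v: Value): Option[Double] = v match {
    case DoubleV(d) => Some(d)
    case _          => None
  }

  // Lenient policy: numeric values are widened to Double.
  def getDoubleWidening(v: Value): Option[Double] = v match {
    case DoubleV(d) => Some(d)
    case FloatV(f)  => Some(f.toDouble)
    case LongV(l)   => Some(l.toDouble) // can lose precision for large magnitudes
    case _          => None
  }
}
```

Whichever policy is chosen, applying the same one to the Avro and protobuf wrappers keeps the two consistent.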

Implementation Details

Map types are coming to Avro soon for multilabel learning, and this presents a problem. Encoding this kind of score coproduct will require quadratically many fields, one per combination of key type and value type. Therefore, it might be a good time to think about generating the Avro schema programmatically. This might be possible with something like the sbt-boilerplate plugin. If not, we might want to think about writing such a plugin.
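
As a rough illustration of what programmatic generation could look like, independent of any particular plugin, the following sketch produces the twelve *Value field declarations of the proposed schema from a single list of type names, and enumerates the key/value type combinations a map-valued score would have to cover. Which types would actually be allowed as keys and values is an assumption here, and no IDL is emitted for the map case.

```scala
object SchemaGenSketch {
  val scalarTypes: List[String] =
    List("boolean", "int", "long", "float", "double", "string")

  // e.g. "union { null, int } intValue = null;"
  def scalarField(t: String): String =
    s"union { null, ${t} } ${t}Value = null;"

  // e.g. "union { null, array<int> } intArrayValue = null;"
  def arrayField(t: String): String =
    s"union { null, array<${t}> } ${t}ArrayValue = null;"

  // The scalar and array value fields of the proposed Score record.
  val valueFields: List[String] =
    scalarTypes.map(scalarField) ++ scalarTypes.map(arrayField)

  // Assumption: any scalar type may appear as a key or a value.
  // The number of combinations grows with the square of the type count.
  val mapTypePairs: List[(String, String)] =
    for (k <- scalarTypes; v <- scalarTypes) yield (k, v)
}
```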
