# Avro schema references and schema repositories

* [Introduction](#Introduction)
* [Why avro references](#Whyavroreferences)
* [Using maven as a schema repository](#Usingmavenasaschemarepository)
* [Schema development with maven](#Schemadevelopmentwithmaven)
  * [The format](#Theformat)
  * [The version control](#Theversioncontrol)
  * [The project structure](#Theprojectstructure)
  * [The project build lifecycle](#Theprojectbuildlifecle)
  * [Versioning](#Versioning)
  * [Avro schema references in avro schemas](#Avroschemareferencesinavroschemas)

## Introduction

Avro is one of the many new serialization formats that have been created in the last 20 years. For a good introduction, and a comparison with probably its two most popular alternatives (Protocol Buffers and Thrift), [see](https://martin.kleppmann.com/2012/12/05/schema-evolution-in-avro-protocol-buffers-thrift.html).

## Why avro references

Since avro does not have "field tags", the schema used to serialize a message (aka the writer schema) is necessary to read the message. Avro schemas can be fairly large and add significant overhead to the [wire format](LeveragingAvro). "Schema registries" have been created as a solution to this problem ([see](https://aseigneurin.github.io/2018/08/02/kafka-tutorial-4-avro-and-schema-registry.html) for an example).

A schema registry basically maintains an ID <-> schema mapping that uniquely identifies a schema. This allows passing a relatively small ID instead of a larger schema definition. A communication participant resolves the ID to the actual schema definition using the schema repository and a local cache. Additionally, schema repositories can index your schemas, validate backwards compatibility, or run other schema quality checks.
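
The ID resolution described above (registry plus local cache) can be sketched in a few lines of Java. This is only an illustration under stated assumptions: `CachingSchemaLookup` is a hypothetical name, and the registry is modelled as a plain `Function` from ID to schema JSON rather than any real registry client.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical client-side lookup: local cache first, registry only on a cache miss. */
final class CachingSchemaLookup {

  /** Local cache of schema ID to schema JSON. */
  private final Map<String, String> cache = new ConcurrentHashMap<>();

  /** Stand-in for the remote registry call (an assumption, not a real client API). */
  private final Function<String, String> registry;

  CachingSchemaLookup(Function<String, String> registry) {
    this.registry = registry;
  }

  /** Returns the schema JSON for the given ID, asking the registry at most once per ID. */
  String resolve(String id) {
    return cache.computeIfAbsent(id, key -> {
      String schema = registry.apply(key);
      if (schema == null) {
        throw new IllegalArgumentException("Unknown schema id: " + key);
      }
      return schema;
    });
  }
}
```

Since a stamped schema never changes once released, the cache in this sketch never needs to expire entries.
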
Let's look at how a schema reference can work:

```json
{"type":"array",
 "items": {"type":"record",
           "name":"TestRecord",
           "fields":[{"name":"number","type":["long","null"],"default":0}],
           "id":"testId"}
}
```

can become:

```json
{"type":"array","items":{"$ref":"testId"}}
```

Where can this id come from? One solution is to add it during the "release" phase of the schema. I prefer using "group:artifact:version:schemaId", which makes the schema easily identifiable in a maven repo (if you choose a maven repo as a schema repo, which I describe in more detail below).

## Using maven as a schema repository

Maven is one piece of technology out there that basically does this for java artifacts (binaries, source, javadoc...). There is no reason why maven could not fit the bill for schemas, and I would argue that it is the best choice for a lot of use cases. Here are some of the advantages I see:

* No new piece of infra needed (you most likely already have a maven instance).
* Proven scalability. You will be able to share your data models with the entire world, leveraging existing CDN infra (bintray, etc...).
* Dependency management that allows schema re-use.
* Addressing + versioning model.
* Plugin architecture that allows developing custom plugins (avrodoc, avro quality checks...).

With certain maven repository implementations like [JFrog Artifactory](https://jfrog.com/artifactory/), you can easily access individual files (schemas) from within packages without the need to download the entire package.

## Schema development with maven

### The format

Although avro schemas can be written in JSON, most humans will prefer the [avro IDL](https://avro.apache.org/docs/current/idl.html).

### The version control

Like any piece of software, schemas should be developed using [version control](https://git-scm.com/book/en/v2/Getting-Started-About-Version-Control). You can also benefit from a code review workflow like Gerrit or pull requests.
([see](https://github.com/zolyfarkas/core-schema) for a sample schema project)

### The project structure

```text
/pom.xml        -- your maven project file
/src/main/avro  -- your avro schema files.
```

Your pom.xml can be as simple as:

```xml
<project>
  <modelVersion>4.0.0</modelVersion>
  <groupId>org.spf4j.avro</groupId>
  <artifactId>core-schema</artifactId>
  <packaging>jar</packaging>
  <version>1.0</version>
  <name>${project.artifactId}-${project.version}</name>
  <description>An example schema project</description>
  <parent>
    <groupId>org.spf4j.avro</groupId>
    <artifactId>schema-parent-pom</artifactId>
    <version>LATEST</version>
  </parent>
  <url>https://github.com/zolyfarkas/core-schema</url>
  <scm>
    <connection>${scm.connection}</connection>
    <developerConnection>${scm.connection}</developerConnection>
    <url>${scm.url}</url>
    <tag>core-schema-0.10</tag>
  </scm>
</project>
```

### The project build lifecycle

In addition to the standard JAR lifecycle, the following goals are executed:

```text
org.spf4j:maven-avro-schema-plugin:avro-dependencies
org.spf4j:maven-avro-schema-plugin:avro-compile
maven-antrun-plugin...
org.spf4j:maven-avro-schema-plugin:avro-validate
org.spf4j:maven-avro-schema-plugin:avro-package
```

### Versioning

Versioning is identical to the versioning of any other maven project. During the build process every named schema is "stamped" with a unique identifier. The format of the unique identifier is `[groupId]:[artifactId]:[version]:[localId]`.

An example id is "org.spf4j.demo:jaxrs-spf4j-demo-schema:0.3:2". Here org.spf4j.demo:jaxrs-spf4j-demo-schema:0.3 uniquely identifies the schema package and its version, and the localId can be resolved from the schema_index.properties file that is added to the package:

```text
#the package coordinates
_pkg=org.spf4j.demo:jaxrs-spf4j-demo-schema:0.3
#local id to schema name mapping.
0=org.spf4j.demo.avro.DemoRecord
1=org.spf4j.demo.avro.MetaData
2=org.spf4j.demo.avro.DemoRecordInfo
```

### Avro schema references in avro schemas

Currently avro avsc does not have the concept of schema references, but I really feel that this is something that will eventually need to be added to the avro spec.
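
References in this scheme are built on the stamped identifiers described under Versioning, so it may help to see how such an identifier could be taken apart and resolved against a schema_index.properties file. The sketch below is only an illustration: `SchemaRef` and its methods are hypothetical names, not part of the maven plugin or the avro fork, and the index file is passed in as a string to keep the example self-contained.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Hypothetical value class for an id of the form groupId:artifactId:version:localId. */
final class SchemaRef {
  final String groupId;
  final String artifactId;
  final String version;
  final String localId;

  private SchemaRef(String groupId, String artifactId, String version, String localId) {
    this.groupId = groupId;
    this.artifactId = artifactId;
    this.version = version;
    this.localId = localId;
  }

  /** Splits a stamped id into its four colon-separated parts. */
  static SchemaRef parse(String id) {
    String[] parts = id.split(":");
    if (parts.length != 4) {
      throw new IllegalArgumentException(
          "Expected groupId:artifactId:version:localId, got: " + id);
    }
    return new SchemaRef(parts[0], parts[1], parts[2], parts[3]);
  }

  /** The coordinates of the schema package, as stored under _pkg in the index file. */
  String packageCoordinates() {
    return groupId + ":" + artifactId + ":" + version;
  }

  /** Looks up the schema name for localId in the text of a schema_index.properties file. */
  String schemaName(String indexFileContent) {
    Properties index = new Properties();
    try {
      index.load(new StringReader(indexFileContent));
    } catch (IOException ex) {
      throw new UncheckedIOException(ex);
    }
    if (!packageCoordinates().equals(index.getProperty("_pkg"))) {
      throw new IllegalArgumentException("Index belongs to another package: "
          + index.getProperty("_pkg"));
    }
    String name = index.getProperty(localId);
    if (name == null) {
      throw new IllegalArgumentException("No schema with local id " + localId);
    }
    return name;
  }
}
```

With the index file shown above, parsing "org.spf4j.demo:jaxrs-spf4j-demo-schema:0.3:2" and calling `schemaName` would yield org.spf4j.demo.avro.DemoRecordInfo.
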
Let's say we have the following schema:

```json
{"type":"array","items":
  {
    "type": "record",
    "name": "DemoRecordInfo",
    "namespace": "org.spf4j.demo.avro",
    "doc": "A record with metadata",
    "fields": [
      {
        "name": "demoRecord",
        "type": {
          "type": "record",
          "name": "DemoRecord",
          "doc": "A demo record",
          "fields": [
            {"name": "id", "type": "string", "doc": "id", "default": ""},
            {"name": "name", "type": "string", "doc": "record name", "default": ""},
            {"name": "description", "type": "string", "doc": "record description", "default": ""}
          ],
          "sourceIdl": "target/avro-sources/demo.avdl:6:61",
          "beta": "",
          "mvnId": "org.spf4j.demo:jaxrs-spf4j-demo-schema:0.3:0"
        },
        "doc": "demo record"
      },
      {
        "name": "metaData",
        "type": {
          "type": "record",
          "name": "MetaData",
          "doc": "meta data",
          "fields": [
            {"name": "lastAccessed", "type": {"type": "string", "logicalType": "instant"}, "doc": "last accessed"},
            {"name": "lastAccessedBy", "type": "string", "doc": "user that last accessed record"},
            {"name": "lastModified", "type": {"type": "string", "logicalType": "instant"}, "doc": "last modified"},
            {"name": "lastModifiedBy", "type": "string", "doc": "user that last modified record"},
            {"name": "asOf", "type": {"type": "string", "logicalType": "instant"}, "doc": "information time"}
          ],
          "sourceIdl": "target/avro-sources/demo.avdl:18:61",
          "beta": "",
          "mvnId": "org.spf4j.demo:jaxrs-spf4j-demo-schema:0.3:1"
        },
        "doc": "record metaData"
      }
    ],
    "sourceIdl": "target/avro-sources/demo.avdl:33:61",
    "beta": "",
    "mvnId": "org.spf4j.demo:jaxrs-spf4j-demo-schema:0.3:2"
  }
}
```

As you can see, the array element type has been stamped by the build process with "org.spf4j.demo:jaxrs-spf4j-demo-schema:0.3:2". One way we could describe the above schema would be:

```json
{"type":"array",
 "items": {"$ref":"org.spf4j.demo:jaxrs-spf4j-demo-schema:0.3:2"}
}
```

This makes the schema JSON small; however, the schema parser will need to be able to resolve the "$ref":"org.spf4j.demo:jaxrs-spf4j-demo-schema:0.3:2"
reference. This size reduction makes it possible to use schemas in HTTP headers to describe the avro content schema, like:

```text
Content-Length: 220
Content-Type: application/avro;avsc={"type":"array","items":{"$ref":"org.spf4j.demo:jaxrs-spf4j-demo-schema:0.3:2"}}
```

This also makes it more efficient to implement an "any" logical type like:

```avdl
@logicalType("any")
record {
  /** the object schema */
  string schema;
  /** the avro binary serialized object */
  bytes object;
}
```

In the [avro fork](https://github.com/zolyfarkas/avro) there is an implementation for schema references. These references are resolved by pluggable "SchemaResolvers". Since we use maven as a schema repository, it is pretty easy to implement a resolver using maven aether: [spf4j-maven-schema-resolver](https://github.com/zolyfarkas/spf4j/tree/master/spf4j-maven-schema-resolver), which is as simple to use as:

```java
// The local maven repository.
File localRepo = new File(System.getProperty("user.home"), ".m2/repository");
// The remote repository that hosts the schema packages.
RemoteRepository bintray = new RemoteRepository.Builder("central", "default",
        "https://dl.bintray.com/zolyfarkas/core")
        .build();
MavenSchemaResolver resolver = new MavenSchemaResolver(Collections.singletonList(bintray),
        localRepo, null, "jar");
// Register the resolver as the default one.
SchemaResolvers.registerDefault(resolver);
```

Where maven aether is not practical to have in your dependency tree, there is also a JAX-RS client based implementation: [spf4j-jaxrs-client](https://github.com/zolyfarkas/spf4j-jaxrs/tree/master/spf4j-jaxrs-client), which is as simple to use as:

```java
SchemaClient resolver = new SchemaClient(new URI("https://dl.bintray.com/zolyfarkas/core"));
// Register the resolver as the default one.
SchemaResolvers.registerDefault(resolver);
```

Both implementations resolve schema references in pretty much the same way.
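
To make the idea of resolving a "$ref" concrete, here is a minimal sketch of how reference nodes could be expanded in a schema that has already been parsed into nested `Map` and `List` objects by any JSON library. It is not the avro fork's implementation: `RefExpander` is a hypothetical name, and the resolver is modelled as a plain `Function` from the stamped id to the parsed schema definition.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical helper: replaces {"$ref": id} nodes in a parsed schema tree. */
final class RefExpander {

  private RefExpander() { }

  /**
   * Returns a copy of the tree in which every object of the form {"$ref": "<id>"}
   * is replaced by whatever the resolver returns for that id.
   * The resolved definition is used as is; it is not expanded again.
   */
  static Object expand(Object node, Function<String, Object> resolver) {
    if (node instanceof Map) {
      Map<?, ?> map = (Map<?, ?>) node;
      Object ref = map.get("$ref");
      if (map.size() == 1 && ref instanceof String) {
        Object resolved = resolver.apply((String) ref);
        if (resolved == null) {
          throw new IllegalArgumentException("Unresolvable reference: " + ref);
        }
        return resolved;
      }
      // Not a reference: copy the object and expand its values.
      Map<String, Object> copy = new LinkedHashMap<>();
      for (Map.Entry<?, ?> entry : map.entrySet()) {
        copy.put(String.valueOf(entry.getKey()), expand(entry.getValue(), resolver));
      }
      return copy;
    }
    if (node instanceof List) {
      List<Object> copy = new ArrayList<>();
      for (Object item : (List<?>) node) {
        copy.add(expand(item, resolver));
      }
      return copy;
    }
    // Strings, numbers, booleans and null are returned unchanged.
    return node;
  }
}
```

In this sketch the resolver function plays the role that the pluggable resolvers play above: it could be backed by a maven repository, an HTTP client, or a local cache.
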