Designing Data-Intensive Applications [Book Highlights]

[Part I : Chapter IV] Encoding and Evolution

Avro

Avro uses a schema to specify the structure of the data being encoded.

It has two schema languages:

one (Avro IDL) intended for human editing

record Person {
   string                    userName;
   union { null, long }      favoriteNumber = null;
   array<string>             interests;
}

one based on JSON that is more easily machine-readable

{
    "type": "record",
    "name": "Person",
    "fields": [
        {"name": "userName", "type": "string"},
        {"name": "favoriteNumber", "type": ["null", "long"], "default": null},
        {"name": "interests", "type": {"type": "array", "items": "string"}}
    ]
}

If we encode our example record using this schema, the Avro binary encoding is just 32 bytes long, the most compact of all the encodings covered in the chapter.
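
To see where the 32 bytes come from, here is a minimal sketch of the Avro binary encoding for this one schema, written in plain Python without the Avro library. The example record is not reproduced in these highlights, so the values below (userName "Martin", favoriteNumber 1337, interests "daydreaming" and "hacking") are assumed to be the book's example record. The helper names are hypothetical, and only the types used by Person are handled.

```python
def zigzag_varint(n: int) -> bytes:
    """Encode a signed integer as Avro does for int/long: zigzag, then base-128 varint."""
    z = (n << 1) ^ (n >> 63)  # zigzag maps 0, -1, 1, -2, ... to 0, 1, 2, 3, ...
    out = bytearray()
    while True:
        byte = z & 0x7F
        z >>= 7
        if z:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_string(s: str) -> bytes:
    """A string is its UTF-8 byte length (as a long) followed by the bytes."""
    data = s.encode("utf-8")
    return zigzag_varint(len(data)) + data


def encode_person(user_name, favorite_number, interests) -> bytes:
    """Encode the fields in schema order; no field names or tags are written."""
    out = encode_string(user_name)
    # union { null, long }: zero-based branch index, then the value (nothing for null)
    if favorite_number is None:
        out += zigzag_varint(0)
    else:
        out += zigzag_varint(1) + zigzag_varint(favorite_number)
    # array<string>: one block (item count, then items), closed by a zero count
    if interests:
        out += zigzag_varint(len(interests))
        for item in interests:
            out += encode_string(item)
    out += zigzag_varint(0)
    return out


# Assumed example record (not shown in these notes)
encoded = encode_person("Martin", 1337, ["daydreaming", "hacking"])
print(len(encoded))  # 32
print(encoded.hex(" "))
```

The encoding is this short because nothing in the bytes identifies fields or types: 7 bytes for the name, 3 for the union, 22 for the array. The data can only be decoded by someone who has the exact schema it was written with.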

When an application wants to encode (write) data, it uses the version of the schema it knows about, for example one compiled into the application. This is called the writer's schema.
When an application wants to decode (read) data, it expects the data to be in some schema. This is called the reader's schema.
The key idea with Avro is that the writer’s schema and the reader’s schema don’t have to be the same—they only need to be compatible.
The Avro library resolves the differences by looking at the writer’s schema and the reader’s schema side by side and translating the data from the writer’s schema into the reader’s schema.
How differences between the reader’s schema and the writer’s schema are resolved:
- Fields in a different order are fine, because schema resolution matches fields by name.
- A field that appears in the writer’s schema but not in the reader’s schema is ignored.
- If the reader expects a field that the writer’s schema does not contain, it is filled in with the default value declared in the reader’s schema.
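
The three rules above can be sketched in plain Python. This is a simplified illustration, not the Avro library's implementation: it works on an already-decoded record (a dict) instead of bytes, and it skips type promotion, aliases and nested records. The reader's schema, its `photoUrl` field, and the function name `resolve_record` are hypothetical; the record values are assumed.

```python
WRITER_SCHEMA = {
    "type": "record",
    "name": "Person",
    "fields": [
        {"name": "userName", "type": "string"},
        {"name": "favoriteNumber", "type": ["null", "long"], "default": None},
        {"name": "interests", "type": {"type": "array", "items": "string"}},
    ],
}

# Hypothetical reader's schema: different field order, no "interests",
# and an extra "photoUrl" field that the writer never knew about.
READER_SCHEMA = {
    "type": "record",
    "name": "Person",
    "fields": [
        {"name": "favoriteNumber", "type": ["null", "long"], "default": None},
        {"name": "userName", "type": "string"},
        {"name": "photoUrl", "type": ["null", "string"], "default": None},
    ],
}


def resolve_record(writer_schema, reader_schema, writer_record):
    """Translate a record decoded with the writer's schema into the reader's view."""
    writer_fields = {f["name"] for f in writer_schema["fields"]}
    resolved = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in writer_fields:
            # Matched by name, so field order does not matter.
            resolved[name] = writer_record[name]
        elif "default" in field:
            # Reader expects it, writer did not write it: use the reader's default.
            resolved[name] = field["default"]
        else:
            raise ValueError(
                f"field {name!r} is missing from the writer's schema and has no default"
            )
    # Writer-only fields (here "interests") are never copied, so they are ignored.
    return resolved


record = {"userName": "Martin", "favoriteNumber": 1337, "interests": ["hacking"]}
print(resolve_record(WRITER_SCHEMA, READER_SCHEMA, record))
# {'favoriteNumber': 1337, 'userName': 'Martin', 'photoUrl': None}
```

The error branch shows why defaults matter: a reader field with no default cannot be filled in when the writer's schema lacks it, so the two schemas are not compatible.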

Day 35/100
