Day 35/100

Designing Data-Intensive Applications [Book Highlights]

[Part I: Chapter IV] Encoding and Evolution

Avro

  • Avro uses a schema to specify the structure of the data being encoded.
  • It has two schema languages:
    • one (Avro IDL) intended for human editing
      record Person {
         string                    userName;
         union { null, long }      favoriteNumber = null;
         array<string>             interests;
      }
      
    • one based on JSON that is more easily machine-readable
      {
          "type": "record",
          "name": "Person",
          "fields": [
               {"name": "userName", "type": "string"},
               {"name": "favoriteNumber", "type": ["null", "long"], "default": null},
               {"name": "interests", "type": {"type": "array", "items": "string"}}
            ]
      }
      
  • If we encode our example record using this schema, the Avro binary encoding is just 32 bytes long, the most compact of all the encodings seen so far (a small encoding sketch follows the figure below)

[Figure: the example record encoded using Avro's binary encoding]
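
  • A minimal encoding sketch (not from the book), using the third-party fastavro Python library; the record values are just example data:
      import io

      from fastavro import parse_schema, schemaless_writer

      # The same Person schema as above, expressed as a Python dict.
      PERSON_SCHEMA = parse_schema({
          "type": "record",
          "name": "Person",
          "fields": [
              {"name": "userName", "type": "string"},
              {"name": "favoriteNumber", "type": ["null", "long"], "default": None},
              {"name": "interests", "type": {"type": "array", "items": "string"}},
          ],
      })

      record = {
          "userName": "Martin",
          "favoriteNumber": 1337,
          "interests": ["daydreaming", "hacking"],
      }

      # Write only the encoded record (no container-file header).
      buffer = io.BytesIO()
      schemaless_writer(buffer, PERSON_SCHEMA, record)
      encoded = buffer.getvalue()
      print(len(encoded), "bytes")  # compact binary encoding of the record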

  • When an application wants to encode/write data, it uses whatever version of the schema it was compiled with; that schema is called the writer's schema
  • When an application wants to decode/read data, it expects the data to be in some schema, which is called the reader's schema
  • The key idea with Avro is that the writer’s schema and the reader’s schema don’t have to be the same—they only need to be compatible.
  • The Avro library resolves the differences by looking at the writer's schema and the reader's schema side by side
  • Differences between the reader's schema and the writer's schema are resolved as follows (see the sketch after the figure below):
    • fields in a different order are fine, since schema resolution matches up fields by name
    • a field that appears in the writer's schema but not in the reader's schema is ignored
    • if the reader's schema expects some field that the writer's schema does not contain, it is filled in with the default value declared in the reader's schema

[Figure: an Avro reader resolving the writer's schema against the reader's schema]
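
  • A minimal schema-resolution sketch (not from the book), again using fastavro: data written with the Person schema above is read back with a hypothetical newer reader's schema that drops interests and adds a photoURL field with a default:
      import io

      from fastavro import parse_schema, schemaless_reader, schemaless_writer

      WRITER_SCHEMA = parse_schema({
          "type": "record",
          "name": "Person",
          "fields": [
              {"name": "userName", "type": "string"},
              {"name": "favoriteNumber", "type": ["null", "long"], "default": None},
              {"name": "interests", "type": {"type": "array", "items": "string"}},
          ],
      })

      # Hypothetical reader's schema: 'interests' removed, 'photoURL' added with a default.
      READER_SCHEMA = parse_schema({
          "type": "record",
          "name": "Person",
          "fields": [
              {"name": "userName", "type": "string"},
              {"name": "favoriteNumber", "type": ["null", "long"], "default": None},
              {"name": "photoURL", "type": ["null", "string"], "default": None},
          ],
      })

      # Encode with the writer's schema.
      buffer = io.BytesIO()
      schemaless_writer(buffer, WRITER_SCHEMA, {
          "userName": "Martin",
          "favoriteNumber": 1337,
          "interests": ["daydreaming", "hacking"],
      })
      buffer.seek(0)

      # Decode with both schemas side by side: fields are matched by name,
      # 'interests' is ignored, and 'photoURL' is filled in from the reader's default.
      decoded = schemaless_reader(buffer, WRITER_SCHEMA, READER_SCHEMA)
      print(decoded)  # {'userName': 'Martin', 'favoriteNumber': 1337, 'photoURL': None}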