Day 32/100

Designing Data-Intensive Applications [Book Highlights]

[Part I : Chapter IV] Encoding and Evolution

  • Evolvability: we should aim to build systems that make it easy to adapt to change [Making Change Easy]
  • In order for the system to continue running smoothly, we need to maintain compatibility in both directions:
    • Backward compatibility - Newer code can read data that was written by older code.
    • Forward compatibility - Older code can read data that was written by newer code.
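
A minimal sketch of the two directions, assuming a hypothetical record that gains an optional "email" field in version 2 (the field names and helper functions are made up for illustration, not taken from the book):

```python
import json

# Hypothetical schema evolution: version 2 adds an optional "email" field.

def encode_v1(name):
    return json.dumps({"name": name})

def encode_v2(name, email):
    return json.dumps({"name": name, "email": email})

def decode_v1(payload):
    """Older reader: knows only "name" and ignores fields it does not recognise."""
    record = json.loads(payload)
    return {"name": record["name"]}

def decode_v2(payload):
    """Newer reader: falls back to None when "email" is missing."""
    record = json.loads(payload)
    return {"name": record["name"], "email": record.get("email")}

# Backward compatibility: newer code reads data written by older code.
backward = decode_v2(encode_v1("Ada"))

# Forward compatibility: older code reads data written by newer code.
forward = decode_v1(encode_v2("Ada", "ada@example.com"))
```

Here the new reader stays backward compatible by defaulting the missing field, and the old reader stays forward compatible by ignoring the field it does not know about.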

Language-Specific Formats

  • The translation from the in-memory representation to a byte sequence is called encoding (also known as serialization or marshalling), and the reverse is called decoding (also known as parsing, deserialization, or unmarshalling).
  • Many programming languages come with built-in support for encoding in-memory objects into byte sequences. For example,

    • Java has java.io.Serializable,
    • Ruby has Marshal,
    • Python has pickle,
    • and so on. Many third-party libraries also exist, such as Kryo for Java.
  • The encoding is often tied to a particular programming language, and reading the data in another language is very difficult.
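
As a rough illustration using Python's pickle, the sketch below round-trips a record containing Python-specific types. The record contents are invented for the example. The same record cannot go straight into JSON, which hints at why such encodings are hard to read from another language:

```python
import json
import pickle
from datetime import date

# Hypothetical record that uses Python-specific types (date, set).
record = {
    "name": "Martin",
    "joined": date(2017, 3, 16),
    "interests": {"daydreaming", "hacking"},
}

encoded = pickle.dumps(record)   # in-memory object -> byte sequence
decoded = pickle.loads(encoded)  # byte sequence -> object (trusted input only)

# A language-neutral format has no notion of these types.
try:
    json.dumps(record)
    json_ok = True
except TypeError:
    json_ok = False
```

pickle restores the date and the set exactly, but only another Python process can make sense of those bytes.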

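  • Security: to restore objects, the decoder must be able to instantiate arbitrary classes. If an attacker can get your application to decode an arbitrary byte sequence, they can instantiate arbitrary classes, which often leads to remote code execution.
  • Versioning data is often an afterthought in these libraries; they tend to neglect the inconvenient problems of forward and backward compatibility.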
  • Efficiency (CPU time taken to encode or decode, and the size of the encoded structure) is also often an afterthought.
  • Textual formats such as JSON, XML, and CSV leave a lot of ambiguity around the encoding of numbers. In XML and CSV, you cannot distinguish between a number and a string that happens to consist of digits (except by referring to an external schema).
  • JSON distinguishes strings and numbers, but it doesn’t distinguish integers and floating-point numbers, and it doesn’t specify a precision.
  • JSON and XML have good support for Unicode character strings (i.e., human-readable text), but they don’t support binary strings (sequences of bytes without a character encoding).
  • The usual workaround is to encode the binary data as text using Base64. The schema is then used to indicate that the value should be interpreted as Base64-encoded. This works, but it’s somewhat hacky and increases the data size by 33%.
  • There is optional schema support for both XML and JSON.
  • JSON and XML use a lot of space compared to binary formats.
  • This observation led to the development of a profusion of binary encodings for JSON, such as MessagePack and BSON.
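
The number and Base64 points above can be checked with a short sketch. The values are arbitrary, and the float conversion stands in for a parser that stores every JSON number as an IEEE 754 double (as JavaScript does):

```python
import base64
import json

# JSON has a single number type. Python's decoder guesses int or float from
# the spelling, but a double-based parser cannot hold every integer above 2**53.
big = 2**53 + 1
as_double = float(big)               # what a double-based parser would store
lost_precision = int(as_double) != big

# Binary data has to be wrapped as text, for example with Base64.
raw = bytes(range(256)) * 3          # 768 arbitrary bytes
encoded = base64.b64encode(raw).decode("ascii")
document = json.dumps({"blob": encoded})

# Base64 emits 4 characters for every 3 input bytes.
overhead = len(encoded) / len(raw) - 1

restored = base64.b64decode(json.loads(document)["blob"])
```

With 768 input bytes the Base64 text is 1024 characters long, which is the 33% growth mentioned above.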

  • Example record (Example 4-1) encoded using MessagePack.
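
Since the figure is not reproduced here, the sketch below hand-encodes the record. It assumes the record is the one from the book's Example 4-1 (userName, favoriteNumber, interests) and implements only the small subset of MessagePack that this record needs. The pack function is a hypothetical helper, not the API of any MessagePack library:

```python
import json
import struct

def pack(value):
    """Encode a tiny subset of MessagePack: short strings, small unsigned
    integers, and short lists and dicts."""
    if isinstance(value, bool):
        raise TypeError("booleans are outside this sketch")
    if isinstance(value, str):
        data = value.encode("utf-8")
        if len(data) > 31:
            raise ValueError("string too long for fixstr")
        return bytes([0xA0 | len(data)]) + data            # fixstr
    if isinstance(value, int):
        if 0 <= value <= 0x7F:
            return bytes([value])                          # positive fixint
        if 0 <= value <= 0xFF:
            return b"\xcc" + bytes([value])                # uint 8
        if 0 <= value <= 0xFFFF:
            return b"\xcd" + struct.pack(">H", value)      # uint 16
        raise ValueError("integer outside this sketch")
    if isinstance(value, list):
        if len(value) > 15:
            raise ValueError("list too long for fixarray")
        return bytes([0x90 | len(value)]) + b"".join(pack(v) for v in value)
    if isinstance(value, dict):
        if len(value) > 15:
            raise ValueError("dict too long for fixmap")
        body = b"".join(pack(k) + pack(v) for k, v in value.items())
        return bytes([0x80 | len(value)]) + body           # fixmap
    raise TypeError(f"unsupported type: {type(value).__name__}")

record = {
    "userName": "Martin",
    "favoriteNumber": 1337,
    "interests": ["daydreaming", "hacking"],
}

packed = pack(record)
compact_json = json.dumps(record, separators=(",", ":")).encode("utf-8")

print(len(packed), len(compact_json))  # prints "66 81"
```

The binary form saves only 15 bytes because the field names are still written out in full. The savings come from dropping quotes, colons, and commas in favour of one-byte type and length prefixes.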