Designing Data-Intensive Applications [Book Highlights]
[Part I : Chapter IV] Encoding and Evolution
- Evolvability: we should aim to build systems that make it easy to adapt to change [Making Change Easy]
- In order for the system to continue running smoothly, we need to maintain compatibility in both directions:
- Backward compatibility - Newer code can read data that was written by older code.
- Forward compatibility - Older code can read data that was written by newer code.
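A minimal sketch of both directions, assuming a hypothetical user record that gained an "email" field in its second version. The function names and fields are illustrative only and do not come from the book. The old reader ignores fields it does not know, and the new reader tolerates a missing field.

```python
import json

def encode_v1(user):
    # Old writer: only knows about "name".
    return json.dumps({"name": user["name"]})

def encode_v2(user):
    # New writer: adds the (hypothetical) "email" field.
    return json.dumps({"name": user["name"], "email": user["email"]})

def decode_v1(payload):
    # Old reader: picks the fields it knows and ignores anything else.
    record = json.loads(payload)
    return {"name": record["name"]}

def decode_v2(payload):
    # New reader: falls back to None when "email" was never written.
    record = json.loads(payload)
    return {"name": record["name"], "email": record.get("email")}

old_data = encode_v1({"name": "Ada"})
new_data = encode_v2({"name": "Ada", "email": "ada@example.com"})

backward = decode_v2(old_data)  # newer code reads older data
forward = decode_v1(new_data)   # older code reads newer data
```

Forward compatibility is usually the harder direction, because the old reader has to be written to skip additions it has never seen.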
Language-Specific Formats
- The translation from the in-memory representation to a byte sequence is called encoding (also known as serialization or marshalling), and the reverse is called decoding (also known as parsing, deserialization, or unmarshalling).
Many programming languages come with built-in support for encoding in-memory objects into byte sequences. For example,
- Java has java.io.Serializable,
- Ruby has Marshal,
- Python has pickle,
- and so on. Many third-party libraries also exist, such as Kryo for Java.
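As a small illustration of such built-in support, here is a round trip through Python's pickle. The record contents are made up for the example. Pickle restores Python-specific types such as tuples and sets, which also hints at why the bytes are hard to read from another language.

```python
import json
import pickle

# An in-memory structure using Python-specific types (tuple, set).
original = {"userName": "Martin", "tags": ("a", "b"), "seen": {1, 2}}

encoded = pickle.dumps(original)   # in-memory object -> byte sequence
decoded = pickle.loads(encoded)    # byte sequence -> in-memory object

# For comparison: JSON has no tuple type, so a tuple comes back as a list.
json_tags = json.loads(json.dumps({"tags": ("a", "b")}))["tags"]
```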
The encoding is often tied to a particular programming language, and reading the data in another language is very difficult.
- If an attacker can get your application to decode an arbitrary byte sequence, they can instantiate arbitrary classes, which often allows them to execute arbitrary code remotely.
- Versioning data is often an afterthought in these libraries, so they often neglect the inconvenient problems of forward and backward compatibility.
- Efficiency (CPU time taken to encode or decode, and the size of the encoded structure) is also often an afterthought.
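The security problem noted above can be sketched with pickle. In this sketch, the Payload class is hypothetical and collections.Counter is a harmless stand-in for whatever an attacker would actually choose. The point is that the byte sequence, not the decoding application, decides which callable runs.

```python
import collections
import pickle

class Payload:
    # __reduce__ tells pickle how to rebuild the object: "call this
    # callable with these arguments". The choice is stored in the bytes.
    def __reduce__(self):
        return (collections.Counter, ("aab",))

untrusted_bytes = pickle.dumps(Payload())

# The decoder never asked for a Counter, yet it instantiates one.
result = pickle.loads(untrusted_bytes)
```

Swapping the stand-in for a function with side effects is all it would take, which is why pickle data from untrusted sources should never be decoded.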
JSON, XML, and Binary Variants
- There is a lot of ambiguity around the encoding of numbers. In XML and CSV, you cannot distinguish between a number and a string that happens to consist of digits.
- JSON distinguishes strings and numbers, but it doesn’t distinguish integers and floating-point numbers, and it doesn’t specify a precision.
- JSON and XML have good support for Unicode character strings (i.e., human-readable text), but they don’t support binary strings (sequences of bytes without a character encoding).
- A common workaround is to encode the binary data as text using Base64. The schema is then used to indicate that the value should be interpreted as Base64-encoded. This works, but it’s somewhat hacky and increases the data size by 33%.
- There is optional schema support for both XML and JSON.
- JSON and XML use a lot of space compared to binary formats.
- This observation led to the development of a profusion of binary encodings for JSON and XML, such as MessagePack.
- Example record (Example 4-1) encoded using MessagePack.
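To make the size difference concrete, here is a rough sketch that hand-encodes a record in a small subset of MessagePack and compares it with compact JSON. It assumes the record is the one from the book's Example 4-1 (userName, favoriteNumber, interests). The pack function is a hypothetical helper written for this note, not the API of any MessagePack library, and it covers only short strings, small unsigned integers, and short lists and maps.

```python
import json
import struct

def pack(obj):
    """Encode a tiny subset of MessagePack (illustrative helper only)."""
    if isinstance(obj, bool):
        raise TypeError("booleans are not handled in this sketch")
    if isinstance(obj, str):
        raw = obj.encode("utf-8")
        if len(raw) > 31:
            raise ValueError("only fixstr (up to 31 bytes) is handled")
        return bytes([0xA0 | len(raw)]) + raw          # fixstr
    if isinstance(obj, int):
        if 0 <= obj <= 0x7F:
            return bytes([obj])                        # positive fixint
        if 0 <= obj <= 0xFF:
            return b"\xcc" + bytes([obj])              # uint 8
        if 0 <= obj <= 0xFFFF:
            return b"\xcd" + struct.pack(">H", obj)    # uint 16
        raise ValueError("integer out of range for this sketch")
    if isinstance(obj, list):
        if len(obj) > 15:
            raise ValueError("only fixarray (up to 15 items) is handled")
        return bytes([0x90 | len(obj)]) + b"".join(pack(x) for x in obj)
    if isinstance(obj, dict):
        if len(obj) > 15:
            raise ValueError("only fixmap (up to 15 pairs) is handled")
        body = b"".join(pack(k) + pack(v) for k, v in obj.items())
        return bytes([0x80 | len(obj)]) + body         # fixmap
    raise TypeError(f"unsupported type: {type(obj).__name__}")

record = {
    "userName": "Martin",
    "favoriteNumber": 1337,
    "interests": ["daydreaming", "hacking"],
}

as_json = json.dumps(record, separators=(",", ":")).encode("utf-8")
as_msgpack = pack(record)

print(len(as_json), len(as_msgpack))  # prints: 81 66
```

The saving is modest because the field names are still written out in full. Only the quotes, separators, and the textual number are replaced by compact type and length prefixes.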