Day 39/100

Designing Data-Intensive Applications [Book Highlights]

[ Part II : Chapter V ] Distributed Data

Implementation of Replication Logs

Statement-based replication

  • In the simplest case, the leader logs every write request (statement) that it executes and sends that statement log to its followers.
  • Problems arise in the following cases:
    • Any statement that calls a nondeterministic function, such as NOW() or RAND(), is likely to generate a different value on each replica.
    • If statements use an autoincrementing column, or if they depend on the existing data in the database (e.g., UPDATE … WHERE <some condition>), they must be executed in exactly the same order on each replica, or else they may have a different effect.
    • Statements that have side effects (e.g., triggers, stored procedures, user-defined functions) may result in different side effects occurring on each replica.
  • Workaround: these issues can be handled case by case. For example, the leader can replace any nondeterministic function call with a fixed return value when the statement is logged, so that the followers all get the same value.
  • However, because there are so many edge cases, other replication methods are now generally preferred.
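
The workaround above can be sketched in a few lines. This is only an illustration, not how any particular database does it: the function name make_deterministic, the events table, and the fixed values are all made up for the example, and the regex rewrite is deliberately naive (it would also rewrite NOW() inside a string literal).

```python
import re


def make_deterministic(statement: str, now: float, rand: float) -> str:
    """Replace NOW() and RAND() calls with literal values chosen by the leader.

    Hypothetical helper: a naive textual rewrite, for illustration only.
    """
    statement = re.sub(r"\bNOW\(\)", repr(now), statement, flags=re.IGNORECASE)
    statement = re.sub(r"\bRAND\(\)", repr(rand), statement, flags=re.IGNORECASE)
    return statement


# The leader evaluates the nondeterministic functions once (values are
# hard-coded here so the example is reproducible) and logs the rewritten
# statement instead of the original one.
logged = make_deterministic(
    "INSERT INTO events (created_at, sample) VALUES (NOW(), RAND())",
    now=1700000000.0,
    rand=0.25,
)
print(logged)
```

Every follower that replays the logged statement now inserts the same values. Ordering problems and side effects are not addressed by this trick, which is part of why the edge cases pile up.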

Logical (row-based) log replication

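  • An alternative is to use different log formats for replication and for the storage engine, which allows the replication log to be decoupled from the storage engine internals. For a relational database, such a logical log is usually a sequence of records describing writes at the granularity of a row: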
    • For an inserted row, the log contains the new values of all columns.
    • For a deleted row, the log contains enough information to uniquely identify the row that was deleted.
    • For an updated row, the log contains enough information to uniquely identify the updated row, and the new values of all columns.
  • A transaction that modifies several rows generates several such log records, followed by a record indicating that the transaction was committed.
  • Because a logical log is decoupled from the storage engine internals, it can more easily be kept backward compatible, allowing the leader and the followers to run different versions of the database software, or even different storage engines.
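
To make the record types above concrete, here is a minimal sketch of a follower applying a row-based log. The record classes, the apply_log function, and the dict-of-dicts "database" are all assumptions made for this example; real systems use their own binary formats. Changes are buffered until the commit record arrives.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Union


@dataclass
class Insert:
    table: str
    key: int                 # primary key of the new row
    values: Dict[str, Any]   # new values of all columns


@dataclass
class Update:
    table: str
    key: int                 # identifies the updated row
    values: Dict[str, Any]   # new values of all columns


@dataclass
class Delete:
    table: str
    key: int                 # enough to identify the deleted row


@dataclass
class Commit:
    txid: int


Change = Union[Insert, Update, Delete]
Record = Union[Change, Commit]
Database = Dict[str, Dict[int, Dict[str, Any]]]


def apply_log(log: List[Record], db: Database) -> None:
    """Apply row changes to db, but only once their commit record is seen."""
    pending: List[Change] = []
    for record in log:
        if isinstance(record, Commit):
            for change in pending:
                rows = db.setdefault(change.table, {})
                if isinstance(change, (Insert, Update)):
                    rows[change.key] = dict(change.values)
                else:  # Delete
                    del rows[change.key]
            pending.clear()
        else:
            pending.append(record)


log: List[Record] = [
    Insert("users", 1, {"name": "alice", "visits": 0}),
    Insert("users", 2, {"name": "bob", "visits": 0}),
    Update("users", 1, {"name": "alice", "visits": 1}),
    Delete("users", 2),
    Commit(txid=1),
    Insert("users", 3, {"name": "carol", "visits": 0}),  # no commit record yet
]

follower: Database = {}
apply_log(log, follower)
print(follower)
```

The follower only needs to understand these records, not the leader's on-disk layout, which is what makes running different versions or storage engines feasible.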

Trigger-based replication

  • If you want to replicate only a subset of the data, or want to replicate from one kind of database to another, or if you need conflict resolution logic, then you may need to move replication up to the application layer.
  • A trigger lets you register custom application code that is automatically executed when a data change (write transaction) occurs in a database system.
  • Trigger-based replication typically has greater overhead than other replication methods, and is more prone to bugs and limitations than the database’s built-in replication.
  • However, it can nevertheless be useful due to its flexibility.
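
As a rough illustration of the idea, the sketch below uses SQLite triggers (through Python's built-in sqlite3 module) to record every change to a users table in a separate changelog table, which a small replicate function then replays against a second database. The table names, trigger names, and the replicate function are invented for this example, and the approach is one possible arrangement rather than a recommended design. It assumes the primary key never changes in an UPDATE.

```python
import sqlite3

# Source database: triggers copy every change on users into changelog.
source = sqlite3.connect(":memory:")
source.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE changelog (
    seq    INTEGER PRIMARY KEY AUTOINCREMENT,
    op     TEXT,
    row_id INTEGER,
    name   TEXT
);
CREATE TRIGGER users_ins AFTER INSERT ON users BEGIN
    INSERT INTO changelog (op, row_id, name) VALUES ('I', NEW.id, NEW.name);
END;
CREATE TRIGGER users_upd AFTER UPDATE ON users BEGIN
    INSERT INTO changelog (op, row_id, name) VALUES ('U', NEW.id, NEW.name);
END;
CREATE TRIGGER users_del AFTER DELETE ON users BEGIN
    INSERT INTO changelog (op, row_id, name) VALUES ('D', OLD.id, NULL);
END;
""")

source.execute("INSERT INTO users VALUES (1, 'alice')")
source.execute("INSERT INTO users VALUES (2, 'bob')")
source.execute("UPDATE users SET name = 'alicia' WHERE id = 1")
source.execute("DELETE FROM users WHERE id = 2")
source.commit()

# Target database: could be a different system entirely; SQLite is used
# here only to keep the example self-contained.
target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")


def replicate(src: sqlite3.Connection, dst: sqlite3.Connection) -> None:
    """Replay the changelog, in order, against the target database."""
    changes = src.execute(
        "SELECT op, row_id, name FROM changelog ORDER BY seq"
    ).fetchall()
    for op, row_id, name in changes:
        if op == "I":
            dst.execute("INSERT INTO users VALUES (?, ?)", (row_id, name))
        elif op == "U":
            dst.execute("UPDATE users SET name = ? WHERE id = ?", (name, row_id))
        elif op == "D":
            dst.execute("DELETE FROM users WHERE id = ?", (row_id,))
    dst.commit()


replicate(source, target)
rows = target.execute("SELECT id, name FROM users ORDER BY id").fetchall()
print(rows)
```

Because replicate is ordinary application code, it is the natural place to filter rows, translate between schemas, or add conflict resolution, at the cost of the extra write per change that the triggers introduce.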