Day 12/100

Scylla Compaction Fundamentals

Log Structured Writes

  • Changes to the data are first written to memory (the memtable) and then flushed to an SSTable
  • Updates accumulate over time in different SSTables
  • Keeping several versions of the same cell on disk is called space amplification
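
The write path above can be sketched as a toy model. This is only an illustration, not how Scylla is implemented: the class ToyStore, its flush_threshold parameter and the versions_on_disk helper are all hypothetical names, and a logical counter stands in for write timestamps.

```python
# Toy model of log-structured writes (all names are hypothetical).
# The memtable is a dict; an "SSTable" is an immutable, sorted tuple of
# (key, timestamp, value) entries.

class ToyStore:
    def __init__(self, flush_threshold=2):
        self.memtable = {}
        self.sstables = []
        self.flush_threshold = flush_threshold
        self.clock = 0  # logical clock standing in for write timestamps

    def write(self, key, value):
        self.clock += 1
        self.memtable[key] = (self.clock, value)
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Write the memtable out as a new sorted, immutable SSTable."""
        if not self.memtable:
            return
        entries = sorted((k, ts, v) for k, (ts, v) in self.memtable.items())
        self.sstables.append(tuple(entries))
        self.memtable = {}

    def versions_on_disk(self, key):
        """Count how many SSTables hold a version of this key."""
        return sum(1 for sst in self.sstables for k, _, _ in sst if k == key)


store = ToyStore(flush_threshold=2)
store.write("a", 1)
store.write("b", 2)  # memtable is full: flushed as the first SSTable
store.write("a", 3)  # update to "a" goes to a fresh memtable
store.write("c", 4)  # flushed as the second SSTable
print(store.versions_on_disk("a"))  # 2: the same cell now lives in two SSTables
```

The old value of "a" is never modified in place; it just sits in the older SSTable until something cleans it up.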

SSTables

  • Immutable
  • Contain changes to the data (aka mutations)
  • Sorted (Sorted String Table)
  • Have metadata such as index, filters, statistics

Why Compaction is needed

  • SSTables are immutable, so we can't just keep writing new ones forever
  • Obsolete data needs to be deleted
  • Space amplification needs to be reduced
  • Data for the same partition might be scattered across many SSTables, and we want to consolidate it

Compaction Fundamentals

  • Compaction first selects a set of SSTables to process, based on the compaction strategy
  • It then reads those SSTables and writes their contents back out in compacted form, eliminating overwritten, deleted and expired data
  • Once the output SSTables are sealed and written to storage, the input SSTables can be deleted
  • Only these mutations can be eliminated: overwritten data, expired data (by TTL), deleted data (by tombstone), and droppable tombstones
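
The steps above can be sketched as a merge over sorted inputs. This is a simplified sketch under stated assumptions, not Scylla's algorithm: the entry layout (key, write_ts, value, expires_at), the TOMBSTONE marker and the gc_before cutoff are hypothetical, each input is assumed to hold at most one entry per key, and a tombstone is treated as droppable purely by age (a real system also has to check that it no longer shadows data outside the compaction).

```python
import heapq

TOMBSTONE = object()  # hypothetical marker for a deleted value


def compact(sstables, now, gc_before):
    """Merge sorted SSTables and keep only the newest live version per key.

    Each entry is (key, write_ts, value, expires_at); expires_at may be None.
    Each input must be sorted by key with at most one entry per key.
    """
    # Order the merged stream by key, newest write first within a key.
    merged = heapq.merge(*sstables, key=lambda e: (e[0], -e[1]))
    output = []
    last_key = object()  # sentinel that equals no real key
    for key, ts, value, expires_at in merged:
        if key == last_key:
            continue  # overwritten (or shadowed) by a newer entry
        last_key = key
        if expires_at is not None and expires_at <= now:
            continue  # expired by TTL
        if value is TOMBSTONE and ts < gc_before:
            continue  # droppable tombstone
        output.append((key, ts, value, expires_at))
    return output


older = [("a", 1, "v1", None), ("b", 1, "v1", None), ("c", 1, "v1", 5)]
newer = [("a", 2, "v2", None), ("b", 2, TOMBSTONE, None)]

compacted = compact([older, newer], now=10, gc_before=3)
print(compacted)  # [('a', 2, 'v2', None)]
```

Here "a" keeps only its newest version, "b" disappears along with its tombstone, and "c" is dropped because its TTL has passed. With gc_before=0 the tombstone for "b" would be carried into the output instead.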

Compaction Deep Dive

  • Bloom filters are a read-path optimisation. They can give false positives but never false negatives: if the filter says no, the record is definitely not present; if it says yes, the data does not necessarily exist.
  • The technique of keeping sorted files and merging them is called a Log-Structured Merge tree (LSM tree).
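
The "no means no, yes means maybe" behaviour of a Bloom filter can be seen in a minimal sketch. The class below is hypothetical and not Scylla's implementation: the sizes num_bits and num_hashes are arbitrary, SHA-256 is used only because it ships with Python, and one byte is spent per bit for readability.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter sketch (hypothetical, sizes chosen arbitrarily)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for simplicity

    def _positions(self, key):
        # Derive num_hashes bit positions by salting the key with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False: the key was definitely never added.
        # True: the key was probably added, but it may be a false positive.
        return all(self.bits[pos] for pos in self._positions(key))


bf = BloomFilter()
for k in ["alice", "bob", "carol"]:
    bf.add(k)

print(bf.might_contain("alice"))  # True: added keys are never reported absent
```

On the read path, an SSTable whose filter answers False can be skipped without touching its data; a True answer still requires checking the SSTable itself.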

Compaction Strategies

  • Size-Tiered Compaction Strategy [STCS] - SSTables are grouped into buckets of similar size, and a bucket is compacted once it holds enough SSTables
  • Time Window Compaction Strategy [TWCS] - targeted at time-series data; SSTables are grouped into time-window buckets, and each bucket is compacted using size-tiered compaction
  • Leveled Compaction Strategy [LCS] - places strict upper bounds on the number of SSTables
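
The size-bucket idea behind STCS can be sketched roughly as follows. This is an assumption-laden illustration, not the actual selection code: the functions bucket_by_size and pick_compaction are hypothetical, and the parameters bucket_low, bucket_high and min_threshold are only loosely modelled on the strategy options of the same names.

```python
def bucket_by_size(sizes, bucket_low=0.5, bucket_high=1.5):
    """Group SSTable sizes so each bucket holds sizes close to its average."""
    buckets = []
    for size in sorted(sizes):
        for bucket in buckets:
            avg = sum(bucket) / len(bucket)
            if avg * bucket_low <= size <= avg * bucket_high:
                bucket.append(size)
                break
        else:
            buckets.append([size])  # no similar bucket found: start a new one
    return buckets


def pick_compaction(sizes, min_threshold=4):
    """Return the first bucket with enough SSTables to compact, else None."""
    for bucket in bucket_by_size(sizes):
        if len(bucket) >= min_threshold:
            return bucket
    return None


sizes_mb = [10, 11, 9, 12, 100, 110, 1000]
print(bucket_by_size(sizes_mb))   # [[9, 10, 11, 12], [100, 110], [1000]]
print(pick_compaction(sizes_mb))  # [9, 10, 11, 12]
```

The four small SSTables form a full bucket and would be merged into one larger SSTable, which later lands in the bucket of its new size tier.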