Changes to the data are first written to memory and then flushed into an SSTable
Updates accumulate over time in different SSTables
Having several versions of the same cell is called space amplification
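A minimal sketch of the write path described above, to make space amplification concrete. The class `TinyStore`, its `flush_threshold` parameter and the logical `clock` are hypothetical names invented for illustration; a real engine flushes on memory pressure, not on a fixed entry count.

```python
class TinyStore:
    """Toy write path: writes land in memory, then flush to immutable sorted runs."""

    def __init__(self, flush_threshold=2):
        self.memtable = {}                      # key -> (timestamp, value)
        self.sstables = []                      # list of immutable, sorted tuples
        self.flush_threshold = flush_threshold  # hypothetical: flush after N keys
        self.clock = 0                          # logical write timestamp

    def write(self, key, value):
        self.clock += 1
        self.memtable[key] = (self.clock, value)
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        if not self.memtable:
            return
        # An SSTable is sorted by key and never modified after it is written.
        run = tuple(sorted((k, ts, v) for k, (ts, v) in self.memtable.items()))
        self.sstables.append(run)
        self.memtable = {}

    def versions_on_disk(self, key):
        # How many SSTables hold some version of this key.
        return sum(1 for run in self.sstables for (k, _, _) in run if k == key)


store = TinyStore(flush_threshold=2)
store.write("a", 1)
store.write("b", 2)   # triggers the first flush
store.write("a", 3)
store.write("c", 4)   # triggers the second flush
print(store.versions_on_disk("a"))  # "a" now lives in two SSTables
```

Both versions of "a" stay on disk until something merges the two runs, which is the job of compaction.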
SSTables
Immutable
Contain changes to the data (aka mutations)
Sorted (Sorted String Table)
Have metadata such as index, filters, statistics
Why Compaction is needed
SSTables are immutable, so we can't update them in place, and we can't just keep writing new ones forever
Obsolete data needs to be deleted
Reduce space amplification
Data might be scattered across SSTables, we want to consolidate it
Compaction Fundamentals
Compaction first selects a set of SSTables to process, based on the compaction strategy
It then reads the input SSTables and writes them back out compacted, eliminating overwritten, deleted and expired data
Eventually, when the output SSTables are sealed and written to storage, the input SSTables can be deleted
Overwritten data, expired data (by TTL), deleted data (by tombstone) and droppable tombstones are the only mutations that can be eliminated
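The steps above can be sketched roughly as a merge over sorted runs. This is a simplified illustration, not how any particular database implements it: `Cell`, `compact`, `now` and `gc_before` are hypothetical names, each input run is assumed to be sorted by key with at most one cell per key, and the check that a tombstone does not shadow data in SSTables outside the compaction is left out.

```python
import heapq
from typing import NamedTuple, Optional


class Cell(NamedTuple):
    key: str
    ts: int                           # write timestamp, the highest one wins
    value: Optional[str]              # None marks a tombstone (a delete)
    expires_at: Optional[int] = None  # absolute expiry time, None means no TTL


def compact(inputs, now, gc_before):
    """Merge sorted input runs into one sorted output run.

    Drops overwritten cells, data shadowed by a tombstone, expired cells,
    and tombstones older than gc_before (treated here as droppable).
    """
    output = []
    current_key = None
    # Merge by key, newest version first within each key.
    merged = heapq.merge(*inputs, key=lambda c: (c.key, -c.ts))
    for cell in merged:
        if cell.key == current_key:
            continue                  # older version: overwritten or deleted
        current_key = cell.key
        if cell.value is None:
            if cell.ts < gc_before:
                continue              # droppable tombstone
            output.append(cell)       # tombstone must be kept for now
            continue
        if cell.expires_at is not None and cell.expires_at <= now:
            continue                  # expired by TTL
        output.append(cell)
    return output


run1 = [Cell("a", 1, "old"), Cell("b", 2, "x"), Cell("c", 3, "y", expires_at=50)]
run2 = [Cell("a", 5, "new"), Cell("b", 6, None), Cell("d", 7, None)]
result = compact([run1, run2], now=100, gc_before=7)
for cell in result:
    print(cell)
```

Here "a" keeps only its newest value, "b" disappears together with its droppable tombstone, "c" has expired, and the tombstone for "d" survives because it is not yet old enough to drop.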
Compaction Deep Dive
Bloom filters are a read path optimisation. They can give false positives but never false negatives, i.e. if the filter says no, the record is definitely not present, but if it says yes the data does not necessarily exist.
The technique of keeping sorted files and merging them is called a Log-Structured Merge tree (LSM tree).
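To illustrate the Bloom filter behaviour noted above, here is a small sketch. The class and its parameters (`num_bits`, `num_hashes`) are made up for the example, and hashing with SHA-256 is just a convenient choice, not what a production SSTable filter uses.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for simplicity

    def _positions(self, key):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means definitely absent; True means "maybe present".
        return all(self.bits[pos] for pos in self._positions(key))


bf = BloomFilter()
for k in ["alice", "bob", "carol"]:
    bf.add(k)

print(bf.might_contain("alice"))    # always True for a key that was added
print(bf.might_contain("mallory"))  # usually False, but True is possible
```

On the read path, a "no" lets the reader skip the SSTable entirely; a "yes" still requires looking into the SSTable to confirm.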
Compaction Strategies
Size Tiered Compaction Strategy [STCS] - compaction is triggered based on SSTable size buckets
Time Window Compaction Strategy [TWCS] - targeted at time series data, groups SSTables into time window buckets and compacts each bucket using size tiered
Leveled Compaction Strategy [LCS] - places strict upper bounds on the number of SSTables
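As a rough sketch of the size bucket idea behind STCS: SSTables of similar size are grouped together, and a bucket becomes a compaction candidate once it holds enough of them. The function names are hypothetical, and the parameters `bucket_low`, `bucket_high` and `min_threshold` with these default values are assumptions chosen for illustration; check the documentation of your database for the actual options.

```python
def size_tiered_buckets(sizes, bucket_low=0.5, bucket_high=1.5):
    """Group SSTable sizes into buckets of similar size."""
    buckets = []
    for size in sorted(sizes):
        for bucket in buckets:
            avg = sum(bucket) / len(bucket)
            # Join a bucket if the size is close to the bucket's average.
            if bucket_low * avg <= size <= bucket_high * avg:
                bucket.append(size)
                break
        else:
            buckets.append([size])   # no similar bucket found, start a new one
    return buckets


def pick_compaction(sizes, min_threshold=4):
    """Return the first bucket with enough SSTables to compact, or None."""
    for bucket in size_tiered_buckets(sizes):
        if len(bucket) >= min_threshold:
            return bucket
    return None


sstable_sizes_mb = [10, 11, 9, 12, 100, 110, 1000]
print(size_tiered_buckets(sstable_sizes_mb))
print(pick_compaction(sstable_sizes_mb))
```

With these sizes the four small SSTables share a bucket and are selected, while the two mid-sized ones and the large one wait until more SSTables of similar size appear.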