Day 10/100

Scylla Operations Course [Part 3: Scylla Monitoring]

Hot Partition Check

  • Use nodetool cfhistograms to find the keyspace/table with unusually high latency.
  • Use nodetool toppartitions to identify the specific hot partition in that table.
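
The two steps above can be sketched as a small Python helper that builds the nodetool command lines. This is a minimal sketch, not the course's own tooling: the helper name, the keyspace/table names, and the 5000 ms sampling window are hypothetical, and the positional form `toppartitions <keyspace> <table> <duration_ms>` is an assumption that may differ between Scylla versions (check `nodetool help toppartitions` on your node).

```python
import shlex


def hot_partition_commands(keyspace, table, duration_ms=5000):
    """Build the two nodetool invocations used for a hot partition check.

    Returns argv lists only; nothing is executed here.
    The argument layout is an assumption, verify it with `nodetool help`.
    """
    return [
        # Step 1: latency histograms for the suspected table.
        ["nodetool", "cfhistograms", keyspace, table],
        # Step 2: sample the table for duration_ms to list its top partitions.
        ["nodetool", "toppartitions", keyspace, table, str(duration_ms)],
    ]


# Hypothetical keyspace and table names, for illustration only.
for argv in hot_partition_commands("my_keyspace", "my_table"):
    print(shlex.join(argv))
```

Each returned list can be passed to subprocess.run on a node where nodetool is installed.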

Large Partition-Row-Cell Check

  • Large partitions, rows, and cells are detected at compaction time
  • Search the system logs for Scylla messages reporting large partitions/rows/cells
  • Check the local system tables:
    • select * from system.large_partitions;
    • select * from system.large_rows;
    • select * from system.large_cells;
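
To run the three queries in one go, or narrow them to a single table, something like the following Python sketch could help. The helper names and the host address are hypothetical. Filtering on keyspace_name and table_name assumes those columns form the partition key of all three system tables, which is worth confirming with DESCRIBE on your Scylla version. Names are interpolated directly into the CQL for illustration only.

```python
LARGE_DATA_TABLES = ("large_partitions", "large_rows", "large_cells")


def large_data_queries(keyspace=None, table=None):
    """Return one CQL query per large-data system table.

    With both keyspace and table given, each query is narrowed to that
    table (assumes keyspace_name/table_name are the partition key columns).
    """
    queries = []
    for name in LARGE_DATA_TABLES:
        cql = f"SELECT * FROM system.{name}"
        if keyspace and table:
            cql += (
                f" WHERE keyspace_name = '{keyspace}'"
                f" AND table_name = '{table}'"
            )
        queries.append(cql + ";")
    return queries


def cqlsh_commands(host, keyspace=None, table=None):
    """Wrap each query in a cqlsh invocation (argv list, not executed)."""
    return [["cqlsh", host, "-e", q] for q in large_data_queries(keyspace, table)]


# Hypothetical node address, for illustration only.
for argv in cqlsh_commands("127.0.0.1"):
    print(argv)
```

These tables are local to each node, so the check has to be repeated per node.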

Single Node Check

  • Monitor CPU / OS / I/O
  • Check the system logs for Scylla reporting:
    • Errors
    • Stalls
    • Large allocations / bad_alloc
  • Check the system log for OS-level errors (OOM killer / disk errors)
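
The log checks above can be roughed out as a keyword scan in Python. This is only a sketch: the category names are hypothetical, and the search patterns are assumptions about typical Scylla and kernel message wording, so adjust them to what your own logs actually contain.

```python
import re
from collections import Counter

# Assumed message fragments; tune these to match your own logs.
PATTERNS = {
    "stall": re.compile(r"Reactor stalled", re.IGNORECASE),
    "large_allocation": re.compile(r"oversized allocation", re.IGNORECASE),
    "bad_alloc": re.compile(r"bad_alloc"),
    "oom_killer": re.compile(r"Out of memory|oom-killer", re.IGNORECASE),
    "disk_error": re.compile(r"I/O error", re.IGNORECASE),
}


def classify(line):
    """Return the names of all categories whose pattern matches the line."""
    return [name for name, pattern in PATTERNS.items() if pattern.search(line)]


def summarize(lines):
    """Count how many log lines fall into each category."""
    counts = Counter()
    for line in lines:
        counts.update(classify(line))
    return counts
```

Feed it lines from the node's journal (for example the output of journalctl -u scylla-server) and treat a non-zero count as a pointer for manual inspection, not as a diagnosis.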

Memory Management

Healthy System

  • Usually most of the memory is LSA (log-structured allocator): cache and memtables
  • When LSA memory drops, it usually means cached data had to be evicted to make room for other allocations

Large Allocations

  • Scylla tries to optimise memory usage; large (contiguous) allocations are bad:
    • They are costly to allocate. Often it involves freeing many items until a large enough contiguous region is available (in the worst case, all of LSA needs to be evicted)
  • Large allocations are reported to the journal (like stalls)

Bad Allocs

  • Sometimes Scylla is not able to allocate memory (especially for a large allocation)
  • In some cases it is a transient issue; in other cases we need to analyse further