Day 10/100

Akshay Habbu

Scylla Operations Course [Part 3: Scylla Monitoring]

Hot Partition Check

Use nodetool cfhistograms to find the keyspace/table with the highest latency.
Use nodetool toppartitions to identify the specific hot partition in that table.
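
The two steps above can be sketched as a small Python helper that builds the nodetool invocations. This is only an illustration: the function names, the example keyspace and table, and the 5000 ms sampling window are assumptions, and the positional form of toppartitions (keyspace, table, duration in milliseconds) should be checked against your nodetool version.

```python
import subprocess


def hot_partition_commands(keyspace, table, duration_ms=5000):
    """Return the two nodetool invocations as argv lists (hypothetical helper)."""
    return [
        # Step 1: latency histograms for the suspect table.
        ["nodetool", "cfhistograms", keyspace, table],
        # Step 2: sample the most active partitions for duration_ms milliseconds.
        ["nodetool", "toppartitions", keyspace, table, str(duration_ms)],
    ]


def run_all(commands):
    """Run each command in order and return its stdout; raises if one fails."""
    outputs = []
    for argv in commands:
        result = subprocess.run(argv, capture_output=True, text=True, check=True)
        outputs.append(result.stdout)
    return outputs
```

On a node where nodetool is on the PATH, run_all(hot_partition_commands("my_keyspace", "my_table")) would return the raw output of both commands for inspection.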

Large Partition-Row-Cell Check

Large partitions, rows, and cells are detected at compaction time.
Search the system logs for Scylla messages reporting large partitions/rows/cells.
Check the local system tables:
- select * from system.large_partitions;
- select * from system.large_rows;
- select * from system.large_cells;
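
Because these tables are local to each node, the same three queries usually have to be repeated per node. A minimal sketch of generating them as cqlsh commands follows. The helper names and the default host are assumptions, and any authentication options are left out.

```python
# Table names taken from the queries listed above.
LARGE_TABLES = ("large_partitions", "large_rows", "large_cells")


def large_entry_queries():
    """Build one SELECT per local system table that records large entries."""
    return [f"SELECT * FROM system.{name};" for name in LARGE_TABLES]


def cqlsh_command(query, host="127.0.0.1"):
    """Return an argv list that runs one query through cqlsh's -e option."""
    return ["cqlsh", host, "-e", query]


def commands_for_nodes(hosts):
    """Pair every node address with every query (hypothetical helper)."""
    return [cqlsh_command(q, host) for host in hosts for q in large_entry_queries()]
```

Each returned list can be passed to subprocess.run on a machine that has cqlsh installed and can reach the nodes.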

Single Node Check

Monitor CPU, OS, and I/O metrics.
Check the system logs for Scylla reports of:
- Errors
- Stalls
- Large allocations / bad_alloc
Check the system log for OS-level errors (OOM killer / disk errors).
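
The log checks above can be roughed out as a small classifier that counts matching lines per category. The search patterns are assumptions about typical message wording (for example "Reactor stalled" for stalls and "oversized allocation" for large allocations), so adjust them to what your Scylla version and kernel actually print.

```python
import re
from collections import Counter

# Assumed patterns; the first matching category wins, so "error" is checked last.
PATTERNS = {
    "stall": re.compile(r"reactor stalled", re.IGNORECASE),
    "large_allocation": re.compile(r"oversized allocation", re.IGNORECASE),
    "bad_alloc": re.compile(r"bad_alloc"),
    "oom_killer": re.compile(r"out of memory|oom-kill", re.IGNORECASE),
    "error": re.compile(r"\berror\b", re.IGNORECASE),
}


def classify(lines):
    """Count log lines per category; lines matching no pattern are ignored."""
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
                break
    return counts
```

One way to feed it is to pipe the output of journalctl -u scylla-server (plus the kernel log for OOM killer messages) into a script that calls classify(sys.stdin) and prints the counts.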

Memory Management

Healthy System

Usually most of the memory is LSA (log-structured allocator) memory: cache and memtables.
When LSA memory drops, it usually means LSA data had to be evicted to make room for other allocations.

Large Allocations

Scylla tries to optimise memory usage, and large (contiguous) allocations are bad:
- They are costly to allocate: often many items must be freed before a large enough contiguous region becomes available (in the worst case, all of LSA must be evicted).
Large allocations are reported to the journal (like stalls).

Bad Allocs

Sometimes Scylla is unable to allocate memory (especially for a large allocation).
In some cases this is a transient issue; in other cases it needs further analysis.
