Brendan Gregg¶

vmstat output description

Procs (Processes)

r: The number of runnable processes (running or waiting for run time).
b: The number of processes in uninterruptible sleep. These are processes that are waiting for some I/O access, like disk or network activity.

Memory

Swap

I/O

System

CPU

us: User time. The time the CPU has spent running users’ processes that are not niced.
sy: System time. The time the CPU has spent running the kernel and its processes.
id: Idle time. The time the CPU has spent doing nothing.
wa: Wait time. The time the CPU has spent waiting for I/O to complete.
st: Steal time. The amount of CPU ‘stolen’ from this virtual machine by the hypervisor for other tasks (such as running another virtual machine).

sar -q output description

runq-sz: This is the run queue size (the number of tasks waiting for run time, the number of processes that are currently executing or waiting to execute).
plist-sz: This is the number of tasks in the task list: the total number of tasks managed by the Linux scheduler. This includes running tasks, sleeping tasks, stopped tasks, and zombies.

USE Method & Checklist¶

Terminology definitions:

resource: all physical server functional components (CPUs, disks, busses, …) [1]
utilization: the average time that the resource was busy servicing work [2]
saturation: the degree to which the resource has extra work which it can’t service, often queued
errors: the count of error events

utilizationsaturationerrors

system-wide:
- vmstat 1, sum: us + sy + st
- sar -u, sum fields except %idle and %iowait
- dstat -c, sum fields except idl and wai;
per-cpu:
- mpstat -P ALL 1, sum fields except %idle and %iowait
- sar -P ALL, same as mpstat
per-process:
- top, check: %CPU
- htop, check: CPU%
- ps -o pcpu
- pidstat 1, check: %CPU
per-kernel-thread:
- top/htop (K to toggle), where VIRT == 0 (heuristic)

system-wide:
- vmstat 1, where r > CPU count
- sar -q, where runq-sz > CPU count
- dstat -p, where run > CPU count
per-process:
- /proc/PID/schedstat 2nd field (sched_info.run_delay);
- perf sched latency (shows “Average” and “Maximum” delay per-schedule)
- dynamic tracing, eg, SystemTap schedtimes.stp “queued(us)

perf (LPE) if processor specific error events (CPC) are available eg, AMD64’s “04Ah Single-bit ECC Errors Recorded by Scrubber”

utilizationsaturationerrors

system-wide:
- free -m, check: Mem: (main memory), Swap: (virtual memory)
- vmstat 1, check: free (main memory), swap (virtual memory)
- sar -r, check: %memused
- dstat -m, check: free
- slabtop -s c for kmem slab usage
per-process:
- top/htop, RES (resident main memory), VIRT (virtual memory), Mem for system-wide summary

system-wide:
- vmstat 1, si/so (swapping)
- sar -B, pgscank + pgscand (scanning)
- sar -W
per-process:
- 10th field (min_flt) from /proc/PID/stat for minor-fault rate, or dynamic tracing [5]
- OOM killer: dmesg | grep killed

utilizationsaturationerrors

utilizationsaturationerrors

system-wide:
- iostat -xz 1, %util
- sar -d, %util
per-process:
- iotop
- pidstat -d
- /proc/PID/sched se.statistics.iowait_sum