Brendan Gregg¶
vmstat output description
Procs (Processes)
r
: The number of runnable processes (running or waiting for run time).b
: The number of processes in uninterruptible sleep. These are processes that are waiting for some I/O access, like disk or network activity.
Memory
swpd
: The amount of virtual memory used (in KB).free
: The amount of idle memory (in KB).buff
: The amount of memory used as buffers (in KB).cache
: The amount of memory used as cache (in KB).
Swap
si
: Amount of memory swapped in from disk (in KB per second).so
: Amount of memory swapped out to disk (in KB per second).
I/O
bi
: Blocks received from a block device (blocks input per second).bo
: Blocks sent to a block device (blocks output per second).
System
in
: The number of interrupts per second, including the clock.cs
: The number of context switches per second.
CPU
us
: User time. The time the CPU has spent running users’ processes that are not niced.sy
: System time. The time the CPU has spent running the kernel and its processes.id
: Idle time. The time the CPU has spent doing nothing.wa
: Wait time. The time the CPU has spent waiting for I/O to complete.st
: Steal time. The amount of CPU ‘stolen’ from this virtual machine by the hypervisor for other tasks (such as running another virtual machine).
sar -q output description
runq-sz
: This is the run queue size (the number of tasks waiting for run time, the number of processes that are currently executing or waiting to execute).plist-sz
: This is the number of tasks in the task list: the total number of tasks managed by the Linux scheduler. This includes running tasks, sleeping tasks, stopped tasks, and zombies.
USE Method & Checklist¶
Terminology definitions:
- resource: all physical server functional components (CPUs, disks, busses, …) [1]
- utilization: the average time that the resource was busy servicing work [2]
- saturation: the degree to which the resource has extra work which it can’t service, often queued
- errors: the count of error events
CPU¶
- system-wide:
vmstat 1
, sum:us
+sy
+st
sar -u
, sum fields except%idle
and%iowait
dstat -c
, sum fields exceptidl
andwai
;
- per-cpu:
mpstat -P ALL 1
, sum fields except%idle
and%iowait
sar -P ALL
, same as mpstat
- per-process:
top
, check:%CPU
htop
, check:CPU%
ps -o pcpu
pidstat 1
, check:%CPU
- per-kernel-thread:
top
/htop
(K
to toggle), whereVIRT == 0
(heuristic)
- system-wide:
vmstat 1
, wherer
> CPU countsar -q
, whererunq-sz
> CPU countdstat -p
, whererun
> CPU count
- per-process:
/proc/PID/schedstat
2nd field (sched_info.run_delay);perf sched latency
(shows “Average” and “Maximum” delay per-schedule)- dynamic tracing, eg, SystemTap schedtimes.stp “queued(us)
- perf (LPE) if processor specific error events (CPC) are available eg, AMD64’s “04Ah Single-bit ECC Errors Recorded by Scrubber”
Memory¶
- system-wide:
free -m
, check:Mem:
(main memory),Swap:
(virtual memory)vmstat 1
, check:free
(main memory),swap
(virtual memory)sar -r
, check:%memused
dstat -m
, check:free
slabtop -s c
for kmem slab usage
- per-process:
top
/htop
,RES
(resident main memory),VIRT
(virtual memory),Mem
for system-wide summary
- system-wide:
vmstat 1
,si
/so
(swapping)sar -B
,pgscank
+pgscand
(scanning)sar -W
- per-process:
- 10th field (
min_flt
) from/proc/PID/stat
for minor-fault rate, or dynamic tracing [5] - OOM killer:
dmesg | grep killed
- 10th field (
dmesg
for physical failures- dynamic tracing, eg, SystemTap uprobes for failed malloc()s
Network¶
sar -n DEV 1
,rxKB/s
/maxtxKB/s
/maxip -s link
, RX/TX tput / max bandwidth/proc/net/dev
,bytes
RX/TX tput/maxnicstat
,%Util
[6]
ifconfig
,overruns
,dropped
netstat -s
,segments retransmited
sar -n EDEV
, drop and fifo metrics/proc/net/dev
, RX/TXdrop
nicstat
Sat
[6]- dynamic tracing for other TCP/IP stack queueing [7]
ifconfig
,errors
,dropped
netstat -i
,RX-ERR
/TX-ERR
ip -s link
,errors
sar -n EDEV
,rxerr/s
txerr/s
/proc/net/dev
,errs
,drop
- extra counters may be under /sys/class/net/…
- dynamic tracing of driver function returns 76]
Storage device I/O¶
- system-wide:
iostat -xz 1
,%util
sar -d
,%util
- per-process:
iotop
pidstat -d
/proc/PID/sched
se.statistics.iowait_sum
iostat -xnz 1
,avgqu-sz
> 1, or highawait
sar -d
same- LPE block probes for queue length/latency
- dynamic/static tracing of I/O subsystem (incl. LPE block probes)
- smartctl
- dynamic/static tracing of I/O subsystem response codes [8]