
Isolating Linux High System Load

 

 

There are typically two reasons that a server will show high load and become unresponsive: CPU and disc utilization. On rare occasions it's something else, such as a hardware error causing a disc to become unresponsive. There are some great tools for tracking and isolating these issues, as long as you know how to interpret the results.

Uptime

The first thing I'll do when I can get logged into a non-responsive system is run "uptime". If the load is high, I know I need to start digging into things with the other tools. Uptime gives you three numbers: the 1, 5, and 15 minute load averages. From these you can tell if the load is trending up, staying neutral, or going down:

 

guin:~$ uptime
 13:26:32 up 1 day, 16:52, 21 users,  load average: 0.00, 0.14, 0.15
guin:~$

On my laptop the load is low and trending down (a 1 minute average of 0.00, 5 minute of 0.14, and 15 minute of 0.15).

 

database:~ # uptime
 12:29pm  up 1 day 13:29,  1 user,  load average: 0.84, 0.82, 0.80
database:~ #

On this database server the load is somewhat low (it's a quad CPU box, so I wouldn't consider it saturated until it was around 4).
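The way I read those three numbers can be written down as a small Python sketch. The function names and the 0.05 tolerance are my own inventions for illustration, not anything uptime defines, and "around 1.0 per CPU" is just the rule of thumb from the quad CPU example.

```python
def load_trend(loads, tolerance=0.05):
    """Classify the trend from a (1 min, 5 min, 15 min) load average tuple.

    Only the 1 and 15 minute values are compared; `tolerance` is an
    arbitrary cut-off chosen for this sketch.
    """
    one, _five, fifteen = loads
    if one - fifteen > tolerance:
        return "up"
    if fifteen - one > tolerance:
        return "down"
    return "neutral"


def load_per_cpu(one_minute_load, cpus):
    """Scale the load by CPU count; a value around 1.0 suggests saturation."""
    return one_minute_load / cpus


# Numbers taken from the two uptime outputs above.
print(load_trend((0.00, 0.14, 0.15)))  # laptop: prints "down"
print(load_trend((0.84, 0.82, 0.80)))  # database server: prints "neutral"
print(load_per_cpu(0.84, 4))           # quad CPU box, well below 1.0
```

On a live Linux system the tuple can come from Python's os.getloadavg() and the CPU count from os.cpu_count().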

Dmesg

It's also useful to look at the bottom of the "dmesg" output. Usually it isn't particularly revealing, but in the case of hardware errors or the out of memory killer it can very quickly reveal a problem.
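If you would rather not eyeball the output, a rough filter like the following can pull out the interesting lines. This is only a sketch: the pattern list is my assumption about what such kernel messages usually contain, the exact wording varies between kernel versions, and the function name is hypothetical.

```python
import re

# Substrings that typically appear in kernel messages for the problems
# mentioned above. This list is an assumption, not an exhaustive catalogue.
PATTERNS = re.compile(r"out of memory|oom-killer|i/o error", re.IGNORECASE)


def suspicious_lines(lines, tail=50):
    """Return the lines among the last `tail` lines that match PATTERNS."""
    return [line for line in lines[-tail:] if PATTERNS.search(line)]
```

To use it, capture the dmesg output yourself (for example by redirecting it to a file), split it into lines, and pass the list in.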

Vmstat

Next I will often run "vmstat 1", which prints system utilization statistics every second. The first line of output is the average since the system was last booted:

 

denver-database:~ # vmstat 1
procs ---------memory---------- --swap-- --io--- -system-- -----cpu-----
 r  b swpd   free   buff  cache  si  so  bi  bo   in   cs  us sy id wa st
 0  0  116 158096 259308 3083748   0   0  47  39   30   58 11  8 76  5  0
 2  0  116 158220 259308 3083748   0   0   0   0 1706 4899 22 14 64  0  0
 1  0  116 158220 259308 3083748   0   0   0 276 1435 1490  4  2 93  0  0
 0  0  116 158220 259308 3083748   0   0   0   0 1502 1569  5  3 92  0  0
 0  0  116 158220 259308 3083748   0   0   0 892 1394 1529  2  1 97  0  0
 1  0  116 158592 259308 3083748   0   0   0 216 1702 1825  8  7 84  1  0
 0  0  116 158344 259308 3083748   0   0   0 368 1465 1461  8  7 84  0  0
 0  0  116 158344 259308 3083748   0   0   0 940 1992 2115  2  2 95  0  0
 0  0  116 158344 259308 3083748   0   0   0 240 1906 1982  6  7 87  0  0

The first thing I'll look at here is the "wa" column: the amount of CPU time spent waiting for I/O. If this is high you almost certainly have something hitting the disc hard.

If the "wa" is high, the next thing I'd look at is the "swap" columns "si" and "so". If these are much above 0 on a regular basis, it probably means you're out of memory and the system is swapping. Since RAM is around a million times faster than a hard drive (10ns instead of 10ms), heavy swapping can cause the system to really grind to a halt. Note however that some swapping, particularly swapping out, is normal.

Next I'd look at the "id" column under "cpu" for the amount of idle CPU time. If this is around 0, it means the CPU is heavily used. If it is, the "sy" and "us" columns tell us how much time is being used by the kernel and user-space processes.

If CPU "sy" time is high, this can often indicate that there are some large directories (say a user's "spam" mail directory) with hundreds of thousands or millions of entries, or other large directory trees. Another common cause of high "sy" CPU time is the system firewall: iptables. There are other causes of course, but these seem to be the primary ones.

If CPU "us" is high, that's easy to track down with "top".
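The order in which I read the vmstat columns can be sketched in Python. The thresholds (20% for "wa", 10% for "id") and the function names are assumptions I picked for illustration; tune them to your own systems.

```python
HEADER = "r b swpd free buff cache si so bi bo in cs us sy id wa st"


def parse_vmstat_row(row, header=HEADER):
    """Turn one vmstat data line into a dict keyed by column name."""
    names = header.split()
    values = [int(v) for v in row.split()]
    if len(values) != len(names):
        raise ValueError("row does not match the vmstat header")
    return dict(zip(names, values))


def triage(stats, wa_high=20, idle_low=10):
    """Apply the checks in the order described above (thresholds assumed)."""
    if stats["wa"] >= wa_high:
        if stats["si"] > 0 or stats["so"] > 0:
            return "swapping"    # look at the big processes next
        return "disc I/O"        # look at iostat next
    if stats["id"] <= idle_low:
        if stats["sy"] > stats["us"]:
            return "kernel CPU"  # huge directories, iptables, ...
        return "user CPU"        # find the process with top
    return "ok"


# Second data line from the vmstat output above.
row = " 2  0  116 158220 259308 3083748   0   0   0   0 1706 4899 22 14 64  0  0"
print(triage(parse_vmstat_row(row)))  # prints "ok"
```

A single sample proves little; the checks are only meaningful when the same result keeps coming back over many lines of output.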

ps awwlx --sort=vsz

If there is swapping going on I like to look at the big processes via "ps awwlx --sort=vsz". This shows processes sorted by virtual size, in ascending order, so the largest processes appear at the bottom of the output. Virtual size does include shared libraries, but it also counts blocks swapped out to disc.
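When I only want the few largest processes, the same idea can be sketched in Python. This assumes the text comes from "ps -eo pid,vsz,comm" rather than the "awwlx" format above (three columns, VSZ in KiB), and the function name is hypothetical.

```python
def largest_by_vsz(ps_output, count=5):
    """Return (vsz_kib, pid, command) tuples for the largest processes.

    Expects the output of `ps -eo pid,vsz,comm`, header line included.
    """
    rows = []
    for line in ps_output.splitlines()[1:]:  # skip the header line
        fields = line.strip().split(None, 2)
        if len(fields) == 3:
            pid, vsz, command = fields
            rows.append((int(vsz), int(pid), command))
    return sorted(rows, reverse=True)[:count]
```

Sorting in descending order here puts the biggest process first, which is the opposite of what "--sort=vsz" does on the command line.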

Iostat

For systems where there is a lot of I/O activity (shown via "bi" and "bo" being high, but "si" and "so" being low), iostat can tell you more about which hard drives the activity is happening on, and what the utilization is. Normally I will run "iostat -x 5", which causes it to print out updated stats every 5 seconds:

 

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           5.64    0.00    3.95    0.30    0.00   90.11

Device: rrqm/s wrqm/s   r/s   w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda       0.00   9.60  0.60  2.40   6.40  97.60    34.67     0.01  4.80  4.80  1.44

I'll first look at the "%util" column. If it's approaching 100% then that device is being hit hard. In this case we only have one device, so I can't use this to isolate where the heavy activity might be happening, but if the database were on its own partition that could help track it down.

"await" is a very useful column. It tells us how long, on average, a request takes to be served, including the time it spends waiting in the queue. If this gets high, it probably indicates saturation.

Other information iostat gives can tell us whether the activity is read- or write-oriented, and whether the requests are small or large (based on the sectors per second rates, "rsec/s" and "wsec/s", and the number of reads and writes per second, "r/s" and "w/s").
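As a worked example of that last point, here is a Python sketch that derives the average request size and the write share from the "sda" line above. The function names are mine, and the column layout is assumed to match this particular iostat version.

```python
HEADER = ("Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s "
          "avgrq-sz avgqu-sz await svctm %util")
ROW = "sda 0.00 9.60 0.60 2.40 6.40 97.60 34.67 0.01 4.80 4.80 1.44"


def parse_iostat_row(row, header=HEADER):
    """Turn one `iostat -x` device line into a dict keyed by column name."""
    names = header.split()[1:]  # drop the "Device:" label
    fields = row.split()
    values = [float(v) for v in fields[1:]]
    if len(values) != len(names):
        raise ValueError("row does not match the iostat header")
    stats = dict(zip(names, values))
    stats["device"] = fields[0]
    return stats


def request_profile(stats):
    """Return (average sectors per request, fraction of requests that write)."""
    requests = stats["r/s"] + stats["w/s"]
    if requests == 0:
        return 0.0, 0.0
    sectors = stats["rsec/s"] + stats["wsec/s"]
    return sectors / requests, stats["w/s"] / requests


avg_sectors, write_share = request_profile(parse_iostat_row(ROW))
print(round(avg_sectors, 2), round(write_share, 2))
```

For this sample the result is 104 sectors over 3 requests per second, about 34.67 sectors per request, which agrees with the "avgrq-sz" column, and 80% of the requests are writes.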

Iotop

This requires a very recent kernel (2.6.20 or newer), so it isn't something I tend to run very often: most of the systems I maintain are enterprise distros, so they have older kernels. RHEL/CentOS 3/4/5 are too old and Ubuntu Hardy doesn't have iotop, but Lucid does support it.

iotop is like top but it will show processes that are doing heavy I/O. However, often this may be a kernel process so you still may not be able to tell exactly what process is causing the I/O load. It's much better than what we had in the past though.

Top

In the case of high user CPU time, "top" is a great tool for telling you what is using the most CPU.

Munin

Munin is a great tool that tracks long-term system performance trends. However, it's not something you can start using when you have a performance problem. It's the sort of thing you should set up on all your systems so that you can build up the historic usage and have it available when you need it.

It will give you extensive stats about CPU, disc, RAM, network, and other resources, and allow you to see trends so you can determine that an upgrade will be needed in the coming months, rather than finding out that you needed one a few months ago. :-)

Conclusions

When performance problems hit, there are many great tools for helping to isolate and resolve them. Using these techniques I've been able to quickly and accurately identify and mitigate performance issues.
