Nagios Installation and Configuration

truemylife

nagios, operations monitoring, check_http, nrpe

The operating system used in this article is CentOS.

I. Install the Nagios dependencies

Nagios depends on php, gcc, glibc, glibc-common, gd, and gd-devel.

yum install php

yum install gcc glibc glibc-common

yum install gd gd-devel

If Apache is not installed yet, install it as well:

yum install httpd

II. Create the user and group

/usr/sbin/useradd -m nagios

passwd nagios

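The official quickstart creates a separate nagcmd group so that external commands can be submitted through the web interface. This article uses the nagios group for that purpose instead, so add both the nagios user and the apache user to it. On CentOS, useradd normally creates a nagios group already; if groupadd reports that the group exists, skip that command.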

/usr/sbin/groupadd nagios

/usr/sbin/usermod -a -G nagios nagios

/usr/sbin/usermod -a -G nagios apache

III. Download and install Nagios Core

wget http://prdownloads.sourceforge.net/sourceforge/nagios/nagios-3.2.3.tar.gz

tar xzf nagios-3.2.3.tar.gz

cd nagios-3.2.3

Run the Nagios configure script, passing the name of the group you created earlier like so:

./configure --with-command-group=nagios

Compile the Nagios source code.

make all

Install binaries, init script, sample config files and set permissions on the external command directory.

make install

make install-init

make install-config

make install-commandmode

Don't start Nagios yet - there's still more that needs to be done...

Sample configuration files have now been installed in the /usr/local/nagios/etc directory. These sample files should work fine for getting started with Nagios. You'll need to make just one change before you proceed...

Edit the /usr/local/nagios/etc/objects/contacts.cfg config file with your favorite editor and change the email address associated with the nagiosadmin contact definition to the address you'd like to use for receiving alerts.

vi /usr/local/nagios/etc/objects/contacts.cfg

IV. Configure the web interface

Install the Nagios web config file in the Apache conf.d directory.

make install-webconf

Run this command while still in the nagios-3.2.3 directory. Afterwards, nagios.conf appears in /etc/httpd/conf.d, the configuration directory of the installed httpd.

Create a nagiosadmin account for logging into the Nagios web interface. Remember the password you assign to this account - you'll need it later.

htpasswd -c /usr/local/nagios/etc/htpasswd.users nagiosadmin

Restart Apache to make the new settings take effect.

service httpd restart

V. Install the plugins

wget https://www.nagios-plugins.org/download/nagios-plugins-1.5.tar.gz

tar zxvf nagios-plugins-1.5.tar.gz

cd nagios-plugins-1.5

Compile and install the plugins.

./configure --with-nagios-user=nagios --with-nagios-group=nagios

make

make install

VI. Start Nagios

Add Nagios to the list of system services and have it automatically start when the system boots.

chkconfig --add nagios

chkconfig nagios on

Verify the sample Nagios configuration files.

/usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg

If there are no errors, start Nagios.

service nagios start

VII. Modify SELinux settings

Fedora ships with SELinux (Security Enhanced Linux) installed and in Enforcing mode by default. This can result in "Internal Server Error" messages when you attempt to access the Nagios CGIs.

See if SELinux is in Enforcing mode.

getenforce

Put SELinux into Permissive mode.

setenforce 0

To make this change permanent, you'll have to modify the settings in /etc/selinux/config and reboot.

Instead of disabling SELinux or setting it to permissive mode, you can use the following command to run the CGIs under SELinux enforcing/targeted mode:

chcon -R -t httpd_sys_content_t /usr/local/nagios/sbin/

chcon -R -t httpd_sys_content_t /usr/local/nagios/share/

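VIII. Open the web interface

http://yourip:port/nagios

At this point you will only see the simplest default configuration, which monitors the local machine.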

IX. Install NRPE

Official documentation: http://nagios.sourceforge.net/docs/nrpe/NRPE.pdf

NRPE is only one of the ways for the Nagios server to monitor other hosts. Nagios can also use other Nagios addon projects for this, for example:

DNX

NRDP

NSCA

NSClient++

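See http://www.nagios.org/download/addons

Client installation (assuming the client IP is 192.168.90.111)

1. useradd nagios

passwd nagios (set a password)

2. Install xinetd

CentOS 6.3 may not have xinetd installed by default, although, somewhat oddly, the /etc/xinetd.d directory exists anyway.

yum install -y xinetd

After installation, the /etc/xinetd.conf configuration file is present.

3. Install nagios-plugins

Download nagios-plugins from the official site.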

tar -zxvf nagios-plugins-1.5.tar.gz

cd nagios-plugins-1.5

./configure

make

make install

chown nagios:nagios /usr/local/nagios

chown -R nagios:nagios /usr/local/nagios/libexec

4. Download NRPE from the official site

tar -zxvf nrpe-2.15.tar.gz

cd nrpe-2.15

./configure

make all

make install-plugin

make install-daemon

make install-daemon-config

make install-xinetd

While installing NRPE, ./configure failed with:

checking for SSL headers... configure: error: Cannot find ssl headers

yum install -y openssl openssl-devel

Installing these packages resolved the problem.

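5. NRPE configuration and system settings

vi /etc/xinetd.d/nrpe

Change this line (entries are separated by spaces):

only_from = 127.0.0.1 nagios_server_ip

Save and exit.

vi /usr/local/nagios/etc/nrpe.cfg

Set this line (entries are separated by ASCII commas):

allowed_hosts=127.0.0.1,nagios_server_ip

Save and exit.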

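vi /etc/services

Add the following:

nrpe 5666/tcp # NRPE

Save and exit.

vi /etc/sysconfig/iptables

Add the following (use the chain name that the file already uses; a default CentOS 6 file uses INPUT rather than RH-Firewall-1-INPUT):

-A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 5666 -j ACCEPT

Save and exit.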

service iptables restart

service xinetd restart

6. Check that NRPE is installed correctly on the client

/usr/local/nagios/libexec/check_nrpe -H 127.0.0.1

NRPE v2.15

/usr/local/nagios/libexec/check_nrpe -H localhost

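This fails with:

CHECK_NRPE: Error - Could not complete SSL handshake.

This is usually caused by the /etc/hosts configuration. Opening it with vi /etc/hosts shows this line:

::1 localhost localhost.localdomain localhost6 localhost6.localdomain6

With this line present, localhost can resolve to the IPv6 address ::1, which is most likely rejected because only_from lists only 127.0.0.1. After commenting out or deleting the line, the check works.

7. To add more monitoring commands or other settings on the client, edit /usr/local/nagios/etc/nrpe.cfg.

Server-side NRPE installation

1. The server does not need nagios-plugins installed again at this step (it was installed earlier); install NRPE directly.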

yum install openssl openssl-devel

tar -zxvf nrpe-2.15.tar.gz

cd nrpe-2.15

./configure

make all

make install-plugin

2. Check the client that was just set up

/usr/local/nagios/libexec/check_nrpe -H 192.168.90.111

NRPE v2.15

This shows that the server can reach the client.

3. Change the server configuration to monitor the client

vi /usr/local/nagios/etc/objects/commands.cfg

define command{

command_name check_nrpe

command_line $USER1$/check_nrpe -H $HOSTADDRESS$ -c $ARG1$

}

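Save and exit.

Using localhost.cfg as a template, create a new file remotehost.cfg. The linux-box host template used below comes from the NRPE documentation and has to be defined before it can be used; the sample templates.cfg provides linux-server instead.

define host{

use linux-box ; Inherit default values from a template

host_name remotehost ; The name we're giving to this server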

alias CentOS 6.3 ; A longer name for the server

address 192.168.90.111 ; IP address of the server

}

define service{

use generic-service

host_name remotehost

service_description CPU Load

check_command check_nrpe!check_load

}

The following service will monitor the number of currently logged in users on the remote host.

define service{

use generic-service

host_name remotehost

service_description Current Users

check_command check_nrpe!check_users

}

The following service will monitor the free drive space on /dev/hda1 on the remote host.

define service{

use generic-service

host_name remotehost

service_description /dev/hda1 Free Space

check_command check_nrpe!check_hda1

}

The following service will monitor the total number of processes on the remote host.

define service{

use generic-service

host_name remotehost

service_description Total Processes

check_command check_nrpe!check_total_procs

}

The following service will monitor the number of zombie processes on the remote host.

define service{

use generic-service

host_name remotehost

service_description Zombie Processes

check_command check_nrpe!check_zombie_procs

}

Save and exit.

vi /usr/local/nagios/etc/nagios.cfg

Add one line so that the new configuration is loaded:

cfg_file=/usr/local/nagios/etc/objects/remotehost.cfg

Save and exit.
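If this step is scripted rather than done in vi, it helps to add the cfg_file line only when it is not already there. The following is a minimal shell sketch of that idea; the function name add_cfg_line is made up for this example and is not part of Nagios.

```shell
# add_cfg_line FILE LINE
# Appends LINE to FILE unless an identical line is already present.
# (Hypothetical helper for illustration; not a Nagios tool.)
add_cfg_line() {
    if grep -qxF -- "$2" "$1" 2>/dev/null; then
        echo "already present"
    else
        printf '%s\n' "$2" >> "$1"
        echo "added"
    fi
}
```

With the paths from this article, the call would be add_cfg_line /usr/local/nagios/etc/nagios.cfg 'cfg_file=/usr/local/nagios/etc/objects/remotehost.cfg'. Running it a second time leaves the file unchanged.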

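4. /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg

This checks whether the configuration is correct. If there are problems, it reports them; fix them and rerun the check until it passes.

5. Restart the service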

service nagios restart

Open http://192.168.90.101:8080/nagios/ again.

The new host and the services under it now appear.
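The verify and restart steps above can be combined so that Nagios is restarted only when the configuration check passes. This is a small sketch under the assumption that nagios -v exits with a non-zero status when it finds errors; the function name restart_if_valid is made up for this example.

```shell
# restart_if_valid VERIFY_CMD RESTART_CMD
# Runs VERIFY_CMD; RESTART_CMD is run only when VERIFY_CMD exits with status 0.
# (Hypothetical helper for illustration; both commands are passed as strings.)
restart_if_valid() {
    if sh -c "$1"; then
        sh -c "$2"
    else
        echo "configuration check failed; not restarting" >&2
        return 1
    fi
}

# Intended use with the paths from this article (run as root):
# restart_if_valid "/usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg" "service nagios restart"
```

The output of the check is left visible on purpose, so that any reported errors can be fixed before trying again.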

X. Nagios configuration in detail. Official documentation: http://nagios.sourceforge.net/docs/nagioscore/3/en/config.html

1. List of configuration files

Under the /usr/local/nagios/etc directory:

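cgi.cfg // configuration file that controls CGI access

nagios.cfg // main Nagios configuration file

resource.cfg // variable definitions (the resource file); macros defined here, such as $USER1$, can be referenced from the other configuration files

objects // directory holding configuration files and templates that define Nagios objects

objects/commands.cfg // command definitions; the commands defined here can be referenced from other configuration files

objects/contacts.cfg // contacts and contact groups

objects/localhost.cfg // monitoring of the local host

objects/printer.cfg // template for monitoring printers; not enabled by default

objects/switch.cfg // template for monitoring routers and switches; not enabled by default

objects/templates.cfg // host and service templates that other configuration files can reference

objects/timeperiods.cfg // time periods for monitoring

objects/windows.cfg // template for monitoring Windows hosts; not enabled by default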

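2. Relationships between configuration files

Configuring Nagios involves several kinds of definitions: hosts and host groups, services and service groups, contacts and contact groups, monitoring time periods, monitoring commands, and so on.

As these definitions suggest, the Nagios configuration files are interrelated and reference one another.

To set up a working Nagios monitoring system, you need to understand how the configuration files depend on each other. Four points matter most:

First: define which hosts [host] [hostescalation] [hostdependency], host groups [hostgroup], services [service] [serviceescalation] [servicedependency], and service groups [servicegroup] to monitor.

Second: define which command [command] performs each check.

Third: define the time period [timeperiod] for monitoring.

Fourth: define the contacts [contact] and contact groups [contactgroup] to notify when a host or service has a problem.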

All Nagios object types:

Services

Service Groups

Hosts

Host Groups

Contacts

Contact Groups

Commands

Time Periods

Notification Escalations

Notification and Execution Dependencies

See the documentation: http://nagios.sourceforge.net/docs/nagioscore/3/en/configobject.html

http://nagios.sourceforge.net/docs/nagioscore/3/en/objectdefinitions.html#retain_state_information

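3. To keep the objects easy to follow, it is recommended to put each kind of Nagios object definition in its own configuration file:

Create hosts.cfg to define hosts and host groups

Create services.cfg to define services

Use the default contacts.cfg to define contacts and contact groups

Use the default commands.cfg to define commands

Use the default timeperiods.cfg to define monitoring time periods

Use the default templates.cfg as the shared template file

Nagios is mainly used to monitor host resources and services, which the Nagios configuration calls objects. To avoid defining the same settings over and over, Nagios provides template definitions: common attributes are defined once in a template and then referenced many times. For clarity, these shared definitions are kept in one file, templates.cfg. Other configuration files can then inherit from them when defining concrete objects, using the use directive.

XI. Command syntax

The check plugins installed earlier are in /usr/local/nagios/libexec, for example check_ping. Run check_ping -h to see the detailed parameter description of a plugin. Commands are then usually defined on top of these plugins in commands.cfg, for example: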

define command{

command_name check_ping

command_line $USER1$/check_ping -H $HOSTADDRESS$ -w $ARG1$ -c $ARG2$ -p 5

}

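The $HOSTADDRESS$ macro is filled in at run time from the address of the host named by host_name, so a service that uses this command only needs to pass the two remaining arguments: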

define service{

use local-service ; Name of service template to use

host_name localhost

service_description PING

check_command check_ping!100.0,20%!500.0,60%

}

The data after the first ! is the first argument ($ARG1$).

The data after the second ! is the second argument ($ARG2$).
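To make the splitting on ! concrete, here is a small shell sketch that breaks a check_command value into the command name and its numbered arguments. It only illustrates the rule described above and is not how Nagios itself is implemented; the function name split_check_command is made up, and it assumes no argument contains an escaped \!.

```shell
# split_check_command VALUE
# Prints the command name and the numbered arguments of a check_command value.
# (Hypothetical helper for illustration; assumes no escaped "\!" in arguments.)
split_check_command() {
    old_ifs=$IFS
    IFS='!'
    set -f                 # no glob expansion while splitting
    set -- $1              # split the value on "!"
    set +f
    IFS=$old_ifs
    echo "command_name=$1"
    shift
    n=1
    for arg in "$@"; do
        echo "ARG$n=$arg"
        n=$((n + 1))
    done
}

split_check_command 'check_ping!100.0,20%!500.0,60%'
# prints:
# command_name=check_ping
# ARG1=100.0,20%
# ARG2=500.0,60%
```

ARG1 and ARG2 here correspond to the $ARG1$ and $ARG2$ macros in the command_line of the check_ping command definition.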

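XII. Basic service monitoring

1. Whether the machine is reachable on the internal network (if it is not, something serious has happened, though the cause may be a network device or the machine itself)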

define host{

use linux-server

host_name hxdbmaster

address ip

hostgroups linux-servers

}

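Define the machines to monitor on the server. If the client does not have nagios-plugins installed but you still want to monitor its network status, you can check whether it answers ping: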

define command{

command_name check-remotehost-alive

command_line $USER1$/check_ping -H $ARG1$ -w 3000.0,80% -c 5000.0,100% -p 5

}

define service{

use local-service ; Name of service template to use

host_name localhost

service_description PING192.168.90.167

check_command check-remotehost-alive!192.168.90.167

}

2. Disk space

define service{

use local-service ; Name of service template to use

host_name localhost

service_description Root Partition

check_command check_local_disk!20%!10%!/

}

define service{

use local-service ; Name of service template to use

host_name localhost

service_description HOME Partition

check_command check_local_disk!20%!10%!/home

}

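The above means: for the / root partition (a logical partition), warn when free space drops to 20% and raise a critical alert at 10%.

For the /home partition (another logical partition), warn at 20% free space and raise a critical alert at 10%.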
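As a rough illustration of the 20% and 10% thresholds above, the following shell sketch reads the free percentage of a filesystem with df and classifies it. The function names free_pct and classify_free are made up for this example; check_disk does its own calculation, which may differ in detail, and the sketch assumes whole-number percentages and mount points without spaces.

```shell
# free_pct PATH
# Prints the percentage of free space on the filesystem that holds PATH.
free_pct() {
    # df -P prints one POSIX-format line per filesystem; field 5 is the used percentage
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print 100 - $5 }'
}

# classify_free FREE_PCT WARN_PCT CRIT_PCT
# Prints CRITICAL, WARNING or OK depending on how FREE_PCT compares to the thresholds.
classify_free() {
    if [ "$1" -lt "$3" ]; then
        echo CRITICAL
    elif [ "$1" -lt "$2" ]; then
        echo WARNING
    else
        echo OK
    fi
}
```

For example, classify_free "$(free_pct /home)" 20 10 mirrors the check_local_disk!20%!10%!/home service definition.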

3. CPU load

# 'check_local_load' command definition

define command{

command_name check_local_load

command_line $USER1$/check_load -w $ARG1$ -c $ARG2$

}

# Define a service to check the load on the local machine.

define service{

use local-service ; Name of service template to use

host_name localhost

service_description Current Load

check_command check_local_load!5.0,4.0,3.0!10.0,6.0,4.0

}

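When the load average (roughly, the number of processes waiting to run) exceeds 5 over 1 minute, 4 over 5 minutes, or 3 over 15 minutes, the state is raised to WARNING.

When it exceeds 10 over 1 minute, 6 over 5 minutes, or 4 over 15 minutes, the state is raised to CRITICAL.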
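The comparison described above can be sketched in shell as follows. This is only an illustration of the threshold logic, assuming that any one of the three averages exceeding its threshold is enough to change the state; the function name classify_load is made up and this is not the source of check_load.

```shell
# classify_load LOAD1,LOAD5,LOAD15 WARN1,WARN5,WARN15 CRIT1,CRIT5,CRIT15
# Prints OK, WARNING or CRITICAL for the given load averages and thresholds.
# (Hypothetical helper for illustration.)
classify_load() {
    awk -v load="$1" -v warn="$2" -v crit="$3" 'BEGIN {
        split(load, l, ","); split(warn, w, ","); split(crit, c, ",")
        state = "OK"
        for (i = 1; i <= 3; i++) {
            if (l[i] + 0 > c[i] + 0) { state = "CRITICAL"; break }
            if (l[i] + 0 > w[i] + 0) state = "WARNING"
        }
        print state
    }'
}
```

With the thresholds from the service definition, classify_load 6.0,1.0,1.0 5.0,4.0,3.0 10.0,6.0,4.0 prints WARNING, because the 1-minute average is above 5 but not above 10. On Linux, the current averages can be taken from the first three fields of /proc/loadavg.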

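4. Disk I/O

The packages that ship with Nagios do not include a plugin for checking disk I/O directly; check_iostat fills that gap.

It can be downloaded from the official site. The download page is:

http://exchange.nagios.org/directory/Plugins/Operating-Systems/Linux/check_iostat--2D-I-2FO-statistics/details

After downloading it, upload it to the /usr/local/nagios/libexec/ directory on both the monitoring server and the monitored host.

Make it executable:

chmod +x check_iostat

View its help: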

[root@localhost libexec]# ./check_iostat -help

This plugin shows the I/O usage of the specified disk, using the iostat external program.

It prints three statistics: Transactions per second (tps), Kilobytes per second

read from the disk (KB_read/s) and and written to the disk (KB_written/s)

./check_iostat:

-d Device to be checked (without the full path, eg. sda)

-c Sets the CRITICAL level for tps, KB_read/s and KB_written/s, respectively

-w Sets the WARNING level for tps, KB_read/s and KB_written/s, respectively

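As the help shows, the plugin checks how much data is read from and written to the disk per second.

The parameters are:

-d: name of the device to check; the full path is not needed

-c: the tps, KB_read/s and KB_written/s levels at which a CRITICAL alert is raised

-w: the tps, KB_read/s and KB_written/s levels at which a WARNING alert is raised

Check the disks on this machine: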

[root@localhost libexec]# df -h

Filesystem Size Used Avail Use% Mounted on

/dev/mapper/VolGroup00-LogVol00

128G 27G 95G 22% /

/dev/sda1 99M 13M 82M 14% /boot

tmpfs 4.0G 0 4.0G 0% /dev/shm

The output above shows sda1, so pass sda after -d.

The device may also be something other than sda, for example:

[root@li387-161 ~]# df -h

Filesystem Size Used Avail Use% Mounted on

/dev/xvda 79G 38G 40G 49% /

tmpfs 1009M 108K 1009M 1% /dev/shm

In that case, pass xvda after -d.

Check that it runs:

[root@localhost libexec]# ./check_iostat -d sda -w 1000 -c 2000

OK - I/O stats tps=1.71 KB_read/s=2.77 KB_written/s=26.77 | 'tps'=1.71; 'KB_read/s'=2.77; 'KB_written/s'=26.77;

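If it fails with an error, first install sysstat on the machine:

[root@localhost libexec]# yum install sysstat

If it still fails, work through the error messages one at a time.

For example, I ran into "bc: command not found"; the fix was: yum install bc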

Original article: http://blog.sina.com.cn/s/blog_5f54f0be0101ch4p.html
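In the check_iostat output shown earlier, the part after the | is performance data in label=value form. When testing by hand, it can be handy to pull a single value out of that line. The following shell sketch does this; the function name perfdata_value is made up for this example, and it assumes labels contain no spaces.

```shell
# perfdata_value OUTPUT_LINE LABEL
# Prints the value recorded for LABEL in the performance data after the "|".
# (Hypothetical helper for illustration; assumes labels contain no spaces.)
perfdata_value() {
    case $1 in
        *"|"*) ;;            # performance data present
        *) return 1 ;;       # no "|" in the line, nothing to parse
    esac
    printf '%s\n' "$1" |
        sed 's/^[^|]*|//' |
        tr ' ' '\n' |
        awk -F'=' -v want="$2" -v quote="'" '
            {
                label = $1
                gsub(quote, "", label)        # labels may be wrapped in single quotes
            }
            label == want {
                split($2, fields, ";")        # the value comes before the first ";"
                print fields[1]
                exit
            }
        '
}
```

Called with the sample output line and the label tps, it prints 1.71. A unit suffix attached to a value, if any, is printed unchanged.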

For example, suppose a machine has two disks, sda and sdb:

# 'check_iostat' command definition

define command{

command_name check_local_iostat

command_line $USER1$/check_iostat -d $ARG1$ -w $ARG2$ -c $ARG3$

}

define service{

use local-service ; Name of service template to use

host_name localhost

service_description sda io

check_command check_local_iostat!sda!1000!2000

}

define service{

use local-service ; Name of service template to use

host_name localhost

service_description sdb io

check_command check_local_iostat!sdb!1000!2000

}

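Note that although the configuration above looks correct, the service unexpectedly failed with return code 255 once it was started. It appears that check_iostat only works through NRPE.

Configuring it through NRPE

Edit nrpe.cfg and add two command definitions: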

command[check_iostatsda]=/usr/local/nagios/libexec/check_iostat -d sda -w 1000 -c 2000

command[check_iostatsdb]=/usr/local/nagios/libexec/check_iostat -d sdb -w 1000 -c 2000

Change the service definitions above to go through NRPE:

define service{

use generic-service

host_name localhost

service_description sda io[localhost]

check_command check_nrpe!check_iostatsda

}

define service{

use generic-service

host_name localhost

service_description sdb io[localhost]

check_command check_nrpe!check_iostatsdb

}

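5. Memory usage

Nagios does not ship a plugin that checks memory directly, but it does include check_swap for swap space. Swap usage has some correlation with memory usage, so check_swap can serve as a rough memory check. To check memory directly, download a plugin from the official site:

http://exchange.nagios.org/directory/Plugins/Operating-Systems/Linux/check_mem/details

Download check_mem.zip. After unpacking it there is a check_mem.pl script; place it in the Nagios libexec directory.

chown nagios:nagios check_mem.pl

chmod +x check_mem.pl

./check_mem.pl -w 90,25 -c 95,50

Here the warning thresholds are memory usage above 90% and swap usage above 25%; the critical thresholds are memory usage above 95% and swap usage above 50%.

The configuration is as follows: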

# 'check_memory' command definition

define command{

command_name check_memory

command_line $USER1$/check_mem.pl -w $ARG1$ -c $ARG2$

}

define service{

use local-service ; Name of service template to use

host_name localhost

service_description Memory Usage

check_command check_memory!90,80!95,90

}

A simple swap usage check:

# 'check_local_swap' command definition

define command{

command_name check_local_swap

command_line $USER1$/check_swap -w $ARG1$ -c $ARG2$

}

# Define a service to check the swap usage the local machine.

# Critical if less than 10% of swap is free, warning if less than 20% is free

# (these arguments are the remaining free space, the opposite of check_memory, which takes usage)

define service{

use local-service ; Name of service template to use

host_name localhost

service_description Swap Usage

check_command check_local_swap!20!10

}

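6. Whether a web page is up, and its response time

Use check_http.

8. Traffic monitoring

This refers to network interface traffic. Download the plugin from the Nagios site:

http://exchange.nagios.org/directory/Plugins/Network-Connections,-Stats-and-Bandwidth/check_traffic-2Esh/details

Before installing it, check whether SNMP is already installed:

rpm -qa |grep snmp

If it is not, install SNMP first:

yum -y install net-snmp*

Download check_traffic.sh into the /usr/local/nagios/libexec/ directory: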

chown nagios:nagios check_traffic.sh

chmod +x check_traffic.sh

Start snmpd:

service snmpd restart

./check_traffic.sh -V 2c -C public -H 10.2.112.xx -L

This produced the error:

List Interface for host 127.0.0.1.

Interface index = No Such Object available on this agent at this OID

or the error:

Timeout: No Response from 127.0.0.1

The cause is that /etc/snmp/snmpd.conf is not configured properly.

Delete the existing contents and paste the following into snmpd.conf:

com2sec notConfigUser 127.0.0.1 public

com2sec notConfigUser 192.168.90.xx public

# Second, map the security name into a group name:

# groupName securityModel securityName

group notConfigGroup v1 notConfigUser

group notConfigGroup v2c notConfigUser

# Third, create a view for us to let the group have rights to:

# Make at least snmpwalk -v 1 localhost -c public system fast again.

# name incl/excl subtree mask(optional)

view systemview included .1.3.6.1.2.1.1

view systemview included .1.3.6.1.2.1.2

view systemview included .1.3.6.1.2.1.25.1.1

view all included .1

# Finally, grant the group read-only access to the systemview view.

# group context sec.model sec.level prefix read write notif

#access notConfigGroup "" any noauth exact mib2 none none

access notConfigGroup "" any noauth exact all none none

## sec.name source community

#com2sec local localhost COMMUNITY

#com2sec mynetwork NETWORK/24 COMMUNITY

com2sec notConfigUser default public

com2sec *.*.*.0 192.168.90.0/24 public

Replace 192.168.90. with your own network prefix and xx with your actual IP.

service snmpd restart

First test directly with snmpwalk:

snmpwalk -v 2c -c public 192.168.90.xx interfaces

The output looks like:

IF-MIB::ifNumber.0 = INTEGER: 2

IF-MIB::ifIndex.1 = INTEGER: 1

IF-MIB::ifIndex.2 = INTEGER: 2

IF-MIB::ifDescr.1 = STRING: lo

IF-MIB::ifDescr.2 = STRING: eth0

...

SNMP is now working.

Then run:

./check_traffic.sh -V 2c -C public -H localhost -L

List Interface for host localhost.

Interface index 1 orresponding to lo

Interface index 2 orresponding to eth0

Problem solved. For troubleshooting SNMP, see http://blog.renren.com/share/222193096/10751098321

# 'check_traffic' command definition

define command{

command_name check_traffic

command_line $USER1$/check_traffic.sh -V 2c -C public -H $HOSTADDRESS$ -I $ARG1$ -w $ARG2$ -c $ARG3$ -K -B

}

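-I is the index of the network interface.

/usr/local/nagios/libexec/check_traffic.sh -V 2c -C public -H localhost -L

This lists the interface indexes. On this machine, for example, index 2 is the first network card, eth0.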

define service{

use local-service

host_name localhost

service_description net traffic eth0

check_command check_traffic!2!700,600!1000,900

}

9. Email alerts

10. SMS alerts (not configured)

11. Other

check_procs usage:

Usage: check_procs -w <range> -c <range> [-m metric] [-s state] [-p ppid] [-u user] [-r rss] [-z vsz] [-P %cpu] [-a argument-array]

[-C command] [-t timeout] [-v]

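The parameters are explained one by one below.

-w and -c set the warning and critical ranges. Usually each is a single number, in which case an alert is raised only when the process count exceeds that number. For example: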

[root@udb151k libexec]# ./check_procs -w 84 -c 90

PROCS OK: 83 processes

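There is also another notation:

[root@udb151k libexec]# ./check_procs -w 84: -c :90

PROCS WARNING: 83 processes

The colon marks which side of the number is the bound: 84: alerts when the count is below 84, and :90 alerts when it is above 90. Here 83 is below 84, so the result is WARNING.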

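-m selects the metric that the thresholds apply to. Possible values:

PROCS - number of processes (default)

VSZ - virtual memory size

RSS - resident set memory size (physical memory in use)

CPU - percentage CPU

-s filters by process state. There are many process states; run ps -exX to see them.

-p parent process ID of the process

-u UID of the process owner

-r physical memory actually used (RSS)

-z virtual memory size (VSZ)

-P CPU usage percentage

-a string to match in the process arguments

-C command name of the process

-t timeout

The drawback of -a: to check that a process is running, many people like to use -a with a string from the process's arguments. This easily causes unnecessary alerts, because it matches every process whose arguments contain the string, for example when someone is editing a file with the same name in vi or listing files in that directory: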

[root@udb151k libexec]# ./check_procs -w 1: -c :2 -a mysqld

PROCS CRITICAL: 3 processes with args 'mysqld'

In this case -C is more accurate:

[root@udb151k libexec]# ./check_procs -w 1: -c :2 -C mysqld

PROCS OK: 1 process with command name 'mysqld'

Original article: http://hi.baidu.com/zjx416/item/44474b1004b33038b831802f
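The -w and -c range notation described above can be sketched as a small shell function. This is a simplified illustration, not the plugin's own code: it handles only the whole-number forms N, N:, :N and N:M, and the function name outside_range is made up for this example.

```shell
# outside_range VALUE RANGE
# Returns 0 (true) when VALUE lies outside RANGE, i.e. when an alert would fire.
# Forms handled: "N" (0..N), "N:" (N and above), ":N" (0..N), "N:M" (N..M).
# (Hypothetical helper for illustration; "~" and "@" ranges are not handled.)
outside_range() {
    value=$1
    case $2 in
        *:*) low=${2%%:*}; high=${2#*:} ;;
        *)   low=0;        high=$2 ;;
    esac
    [ -n "$low" ] || low=0
    if [ "$value" -lt "$low" ]; then
        return 0
    fi
    if [ -n "$high" ] && [ "$value" -gt "$high" ]; then
        return 0
    fi
    return 1
}
```

For the examples in this section, outside_range 83 84: is true (83 is below 84, hence the WARNING), outside_range 83 :90 is false, and outside_range 3 :2 is true, which matches the CRITICAL result for three matching processes.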

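XIII. Open questions

1. Chinese text in Status Information is garbled. Not solved yet.

2. When defining a host, check_command can be left out. What is then used to determine the host's status?

The answer is that, when it is not set explicitly, the check falls back to check_ping (in the sample configuration, the linux-server template sets check_command to check-host-alive, which runs check_ping). The command can also be set explicitly, for example check_command check_http, after which the Status showed DOWN here.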

2014-02-09 14:14
Category: open-source software