Prometheus monitoring notes (non-Docker installation)

yjph83


prometheus node_exporter grafana alertmanager cadvisor

https://github.com/1046102779/prometheus (unofficial Chinese Prometheus handbook)

http://www.bubuko.com/infodetail-2004088.html (monitoring a k8s cluster with Prometheus)

http://www.cnblogs.com/sfnz/p/6566951.html (installing Prometheus + Grafana to monitor MySQL, Redis, Kubernetes and more, without Docker)

http://blog.csdn.net/wenwst/article/details/76624019 (deploying Prometheus and Grafana on Kubernetes 1.6 with persistent data)

https://github.com/jason-riddle/monitor-k8s-with-prom (Prometheus monitoring on Kubernetes)

https://github.com/kayrus/prometheus-kubernetes (prometheus-kubernetes)

https://github.com/prometheus/node_exporter (prometheus/node_exporter)

http://dockone.io/article/2579 (Prometheus monitoring in practice on Kubernetes)

http://www.ywnds.com/?p=9656 (monitoring MySQL with Prometheus + Grafana)

https://github.com/prometheus/prometheus/releases (Prometheus downloads)

https://github.com/prometheus/node_exporter/releases/ (node_exporter downloads)

https://laily.net/article/Prometheus%20%E5%88%9D%E4%BD%93%E9%AA%8C%281%29%20-%20%E5%AE%89%E8%A3%85 (First look at Prometheus (1): installation)

http://blog.csdn.net/u010871982/article/details/77838592?locationNum=2&fps=1 (a simple introduction to Prometheus)

https://www.robustperception.io/scaling-and-federating-prometheus/ (Prometheus federation)

http://dbaplus.cn/news-72-1462-1.html (online service monitoring with Prometheus at 360)

1. Installing Prometheus

[root@localhost prometheus]# wget https://github.com/prometheus/prometheus/releases/download/v1.7.1/prometheus-1.7.1.linux-amd64.tar.gz

[root@localhost prometheus]# mkdir /opt/prometheus

[root@localhost prometheus]# tar -zxvf prometheus-1.7.1.linux-amd64.tar.gz -C /opt/prometheus --strip-components=1

[root@localhost prometheus]# cd /opt/prometheus/

[root@localhost prometheus]# cp prometheus.yml prometheus.yml.back

[root@localhost prometheus]# vim prometheus.yml # Note: YAML files must not contain tab characters; indent with spaces only

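# Global configuration
global:
  scrape_interval: 15s  # scrape targets every 15 seconds by default

  # These labels are attached to every time series from this server. They are
  # mainly used with federation, remote storage, and Alertmanager.
  external_labels:
    monitor: 'codelab-monitor'

# Scrape target configuration
# Here Prometheus scrapes its own metrics
scrape_configs:
  # The job name is added as the label job="prometheus" to every time series
  # scraped by this configuration.
  - job_name: 'prometheus'

    # Override the global scrape interval: 5 seconds instead of 15.
    scrape_interval: 5s

    static_configs:
      - targets: ['localhost:9090']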
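The no-tabs rule noted above is easy to check mechanically before starting Prometheus. The following is a minimal Python sketch and not part of the original setup; the helper name find_tab_lines is made up for illustration, and it only looks for tab characters rather than validating the YAML.

```python
from typing import List


def find_tab_lines(text: str) -> List[int]:
    """Return the 1-based numbers of lines that contain a tab character."""
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if "\t" in line
    ]


def check_file(path: str) -> List[int]:
    """Read a config file (for example prometheus.yml) and report tab lines."""
    with open(path, encoding="utf-8") as handle:
        return find_tab_lines(handle.read())
```

Calling check_file("/opt/prometheus/prometheus.yml") should return an empty list when the file is indented with spaces only.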

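Start it:

nohup ./prometheus --config.file=prometheus.yml &

or, using the full path to the binary unpacked above:

nohup /opt/prometheus/prometheus &

Opening http://localhost:9090/ in a browser should now show the Prometheus graph page.

http://www.cnblogs.com/vovlie/p/Prometheus_install.html (reference)

Prometheus can reload its configuration without stopping the service. While debugging, running this after each change is the most convenient way to apply it:

# curl -X POST http://localhost:9090/-/reload

# netstat -antl|grep 9090 # check that it started successfully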
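The netstat check above can also be scripted. This is a minimal Python sketch, assuming Prometheus listens on localhost:9090 as configured earlier; is_port_open is a hypothetical helper name. A successful TCP connect only shows that something is listening on the port, not that it is Prometheus.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

For the setup in this post the call would be is_port_open("localhost", 9090); the same helper works for Grafana on port 3000 and Alertmanager on port 9093.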

To manage Prometheus as a service, create a systemd unit file.

First create a dedicated user to run it:

[root@localhost config]# useradd prometheus

[root@localhost ~]# vim /etc/systemd/system/prometheus.service

[Unit]
Description=Prometheus Server
Documentation=https://prometheus.io/docs/introduction/overview/
After=network.target

[Service]
Type=simple
User=prometheus
# /usr/local/prometheus is the Prometheus install directory and prometheus.yml
# is the configuration file inside it; adjust both paths to your installation.
ExecStart=/usr/local/prometheus/prometheus \
    -config.file=/usr/local/prometheus/prometheus.yml \
    -storage.local.path=/home/prometheusdata
Restart=on-failure

[Install]
WantedBy=multi-user.target

Note: the storage directory given by -storage.local.path=/home/prometheusdata must be writable by the prometheus user created above.

After saving the file, the service can be started with systemctl start prometheus. Reload systemd first so that it picks up the new unit:

# systemctl daemon-reload

# systemctl enable prometheus.service

# systemctl restart prometheus.service

2. Installing Grafana

[root@localhost prometheus]# wget https://s3-us-west-2.amazonaws.com/grafana-releases/release/grafana-4.5.0-1.x86_64.rpm

[root@localhost prometheus]# yum install initscripts fontconfig -y

[root@localhost prometheus]# rpm -Uvh grafana-4.5.0-1.x86_64.rpm

warning: grafana-4.5.0-1.x86_64.rpm: Header V4 RSA/SHA1 Signature, key ID 24098cb6: NOKEY

error: Failed dependencies:

urw-fonts is needed by grafana-4.5.0-1.x86_64

The install fails with the dependency error above, so install it with the following command instead:

[root@localhost prometheus]# yum localinstall grafana-4.5.0-1.x86_64.rpm

[root@localhost prometheus]# service grafana-server start # start the service

Starting grafana-server (via systemctl): [ OK ]

[root@localhost prometheus]# netstat -anp|grep 3000

Port 3000 is now listening.

Open http://localhost:3000 in a browser; the default username and password are admin/admin.

http://docs.grafana.org/installation/rpm/ (official Grafana documentation)

Grafana can also be set up as a system service:

# mkdir -p /var/run/grafana

# chown grafana:grafana /var/run/grafana

# vim /etc/sysconfig/grafana-server

Add: PID_FILE_DIR=/var/run/grafana

# vim /etc/systemd/system/grafana.service

[Unit]
Description=Grafana Services
Documentation=https://github.com/grafana/grafana
After=network.target

[Service]
EnvironmentFile=/etc/sysconfig/grafana-server
User=grafana
Group=grafana
Type=simple
WorkingDirectory=/usr/share/grafana
RuntimeDirectory=grafana
RuntimeDirectoryMode=0750
ExecStart=/usr/sbin/grafana-server \
    --config=${CONF_FILE} \
    --pidfile=${PID_FILE_DIR}/grafana-server.pid \
    cfg:default.paths.logs=${LOG_DIR} \
    cfg:default.paths.data=${DATA_DIR} \
    cfg:default.paths.plugins=${PLUGINS_DIR}
LimitNOFILE=10000
TimeoutStopSec=20
UMask=0027

[Install]
WantedBy=multi-user.target

# Variables such as ${CONF_FILE} in the unit above are read from /etc/sysconfig/grafana-server

# After changing a unit file, systemd must be reloaded first

# systemctl daemon-reload

# systemctl restart grafana.service

# systemctl enable grafana.service

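Connecting Prometheus to Grafana:

https://prometheus.io/docs/visualization/grafana/ (documentation on using Grafana with Prometheus)

Replacing Grafana's dashboards

Grafana ships with few ready-made dashboard templates. Apart from the ones open-sourced by Percona, most dashboards have to be built by hand.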

[root@localhost prometheus]# yum install git -y

[root@localhost prometheus]# git clone https://github.com/percona/grafana-dashboards.git

Cloning into 'grafana-dashboards'...

remote: Counting objects: 1308, done.

remote: Compressing objects: 100% (31/31), done.

remote: Total 1308 (delta 32), reused 40 (delta 21), pack-reused 1256

Receiving objects: 100% (1308/1308), 6.39 MiB | 1.67 MiB/s, done.

Resolving deltas: 100% (982/982), done.

[root@localhost prometheus]# cp -r grafana-dashboards/dashboards /var/lib/grafana/

[root@localhost prometheus]# vim /etc/grafana/grafana.ini

Edit it as follows:

[dashboards.json]
enabled = true
path = /var/lib/grafana/dashboards

[root@localhost prometheus]# service grafana-server restart

Or restart with:

[root@localhost prometheus]# systemctl restart grafana-server
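Before restarting Grafana it can help to confirm that the copied dashboard files are at least valid JSON. The following is a minimal Python sketch, assuming the dashboards were copied to /var/lib/grafana/dashboards as above; check_dashboards is a hypothetical helper, and it checks JSON syntax only, not whether Grafana will accept the dashboard.

```python
import json
from pathlib import Path
from typing import Dict


def check_dashboards(directory: str) -> Dict[str, str]:
    """Map each *.json file name in directory to "ok" or a parse error."""
    results: Dict[str, str] = {}
    for path in sorted(Path(directory).glob("*.json")):
        try:
            json.loads(path.read_text(encoding="utf-8"))
            results[path.name] = "ok"
        except ValueError as exc:
            # json.JSONDecodeError and UnicodeDecodeError are both ValueErrors.
            results[path.name] = f"invalid JSON: {exc}"
    return results
```

For this post the call would be check_dashboards("/var/lib/grafana/dashboards"); any entry that is not "ok" points at a file worth inspecting before the restart.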

3. Installing node_exporter

[root@localhost prometheus]# wget https://github.com/prometheus/node_exporter/releases/download/v0.14.0/node_exporter-0.14.0.linux-amd64.tar.gz

[root@localhost prometheus]# tar -zxvf node_exporter-0.14.0.linux-amd64.tar.gz

[root@localhost local]# mv /home/prometheus/node_exporter-0.14.0.linux-amd64 ./node_exporter-0.14.0

[root@localhost local]# cd node_exporter-0.14.0/

[root@localhost node_exporter-0.14.0]# nohup ./node_exporter &

Check that the process is running:

[root@localhost node_exporter-0.14.0]# ps -ef|grep node_exporter

root 24760 24106 0 14:39 pts/1 00:00:00 ./node_exporter

root 24766 24106 0 14:39 pts/1 00:00:00 grep --color=auto node_exporter

node_exporter can also be run as a service:

[root@localhost ~]# vim /etc/systemd/system/node_exporter.service

The systemd unit for node_exporter is as follows:

[Unit]
Description=Prometheus node exporter
After=local-fs.target network-online.target network.target
Wants=local-fs.target network-online.target network.target

[Service]
Type=simple
# Run as the prometheus user
User=prometheus
ExecStart=/usr/local/prometheus/node_exporter/node_exporter
Restart=on-failure

[Install]
WantedBy=multi-user.target

# systemctl enable node_exporter.service

# systemctl restart node_exporter.service

4. Installing Alertmanager

http://blog.csdn.net/y_xiao_/article/details/50818451

Prometheus Alertmanager alerting component:

http://www.jianshu.com/p/239b145e2acc (Prometheus Alertmanager alerting component)

Alertmanager alerting module:

https://github.com/prometheus/alertmanager (Alertmanager on GitHub)

Alert templates:

https://prometheus.io/blog/2016/03/03/custom-alertmanager-templates/ (custom Alertmanager templates)

Sending alert notifications to multiple destinations:

https://www.robustperception.io/sending-alert-notifications-to-multiple-destinations/ (sending alerts to multiple destinations)

Alert routing tree:

https://prometheus.io/webtools/alerting/routing-tree-editor/ (routing tree editor)

[root@localhost prometheus]# wget https://github.com/prometheus/alertmanager/releases/download/v0.9.1/alertmanager-0.9.1.linux-amd64.tar.gz

[root@localhost prometheus]# tar -zxvf alertmanager-0.9.1.linux-amd64.tar.gz

[root@localhost prometheus]# mv alertmanager-0.9.1.linux-amd64 /opt/alertmanager

[root@localhost prometheus]# cd /opt/alertmanager

[root@localhost prometheus]# nohup ./alertmanager -config.file=simple.yml &

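Restart Prometheus so that it points at Alertmanager:

# ./prometheus -config.file=prometheus.yml -alertmanager.url http://localhost:9093

After changing the Alertmanager configuration, it can be reloaded without restarting the service:

# curl -XPOST http://localhost:9093/-/reload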
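The same reload call can be issued from a script. This is a rough sketch using only Python's standard library; the function names are made up for illustration, and it assumes Alertmanager is reachable at http://localhost:9093 as in the curl command above. The same /-/reload path is used for Prometheus on port 9090 in section 1.

```python
import urllib.request


def build_reload_request(base_url: str) -> urllib.request.Request:
    """Build (but do not send) an empty POST request to <base_url>/-/reload."""
    url = base_url.rstrip("/") + "/-/reload"
    return urllib.request.Request(url, data=b"", method="POST")


def reload_config(base_url: str, timeout: float = 5.0) -> int:
    """Send the reload request and return the HTTP status code.

    urlopen raises urllib.error.URLError or HTTPError when the service is
    unreachable or rejects the reload; callers should handle those.
    """
    request = build_reload_request(base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Calling reload_config("http://localhost:9093") is then the scripted equivalent of the curl command.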

# Set up Alertmanager as a system service

# vim /etc/systemd/system/alertmanager.service

[Unit]
Description=Prometheus Alertmanager.
Documentation=https://github.com/prometheus/alertmanager
After=network.target

[Service]
EnvironmentFile=-/etc/alertmanager/template
User=root
ExecStart=/opt/alertmanager/alertmanager \
    -config.file=/opt/alertmanager/simple.yml \
    -storage.path=/home/alertmanager \
    $ALERTMANAGER_OPTS
ExecReload=/bin/kill -HUP $MAINPID
Restart=on-failure

[Install]
WantedBy=multi-user.target

Finally, run:

# systemctl enable alertmanager.service

# systemctl restart alertmanager.service

Alertmanager web UI: http://ip:9093/#/alerts

Configuring Alertmanager

Alerting is configured in two places. The alert rules live in a file named alert.rules, by default in the Prometheus install directory. The notification details, such as email addresses and recipients, go in simple.yml in the Alertmanager install directory. The rules below are a basic, commonly used set; adjust the thresholds and durations to your own needs.

# alert.rules:

ALERT node_down

IF up{job="node"} == 0

FOR 5m

ANNOTATIONS {

summary = "Node is down",

description = "Node has been unreachable for more than 5 minutes.",

severity = "warning"

}

ALERT snmp_down

IF up{job="snmp"} == 0

FOR 5m

ANNOTATIONS {

summary = "SNMP is down",

description = "SNMP has been unreachable for more than 5 minutes.",

severity = "warning"

}

ALERT fs_at_80_percent

IF hrStorageUsed{hrStorageDescr=~"/.+"} / hrStorageSize >= 0.8

FOR 15m

ANNOTATIONS {

summary = "File system {{$labels.hrStorageDescr}} is at 80%",

description = "{{$labels.hrStorageDescr}} has been at 80% for more than 15 Minutes.",

severity = "warning"

}

ALERT fs_at_90_percent

IF hrStorageUsed{hrStorageDescr=~"/.+"} / hrStorageSize >= 0.9

FOR 15m

ANNOTATIONS {

summary = "File system {{$labels.hrStorageDescr}} is at 90%",

description = "{{$labels.hrStorageDescr}} has been at 90% for more than 15 Minutes.",

severity = "average"

}

ALERT disk_load_mostly_random_reads

IF rate(diskIOReads{diskIODevice=~"sd[a-z]+"}[5m]) > 20 AND

rate(diskIONReadX{diskIODevice=~"sd[a-z]+"}[5m]) / rate(diskIOReads{diskIODevice=~"sd[a-z]+"}[5m]) < 10000

FOR 15m

ANNOTATIONS {

summary = "Disk {{$labels.diskIODevice}} reads are mostly random.",

description = "{{$labels.diskIODevice}} reads have been mostly random for the past 15 Minutes.",

severity = "info"

}

ALERT disk_load_mostly_random_writes

IF rate(diskIOWrites{diskIODevice=~"sd[a-z]+"}[5m]) > 20 AND

rate(diskIONWrittenX{diskIODevice=~"sd[a-z]+"}[5m]) / rate(diskIOWrites{diskIODevice=~"sd[a-z]+"}[5m]) < 10000

FOR 15m

ANNOTATIONS {

summary = "Disk {{$labels.diskIODevice}} writes are mostly random.",

description = "{{$labels.diskIODevice}} writes have been mostly random for the past 15 Minutes.",

severity = "info"

}

ALERT disk_load_high

IF diskIOLA1{diskIODevice=~"[sv]d[a-z]+"} > 30

FOR 15m

ANNOTATIONS {

summary = "Disk {{$labels.diskIODevice}} is at 30%",

description = "{{$labels.diskIODevice}} Load has exceeded 30% over the past 15 Minutes.",

severity = "warning"

}

ALERT cpu_load_high

IF ssCpuIdle < 70

FOR 15m

ANNOTATIONS {

summary = "CPU is at 30%",

description = "CPU Load has constantly exceeded 30% over the past 15 Minutes.",

severity = "warning"

}

ALERT linux_load_high

IF laLoad1 > 50

FOR 15m

ANNOTATIONS {

summary = "Linux Load is at 50",

description = "Linux Load has constantly exceeded 50 over the past 15 Minutes.",

severity = "average"

}

ALERT if_operstatus_changed

IF delta(ifOperStatus[15m]) != 0

ANNOTATIONS {

summary = "Port {{$labels.ifDescr}} changed status",

description = "Port {{$labels.ifDescr}} went up or down in the past 15 Minutes",

severity = "info"

}

ALERT if_traffic_at_30_percent

IF ifSpeed > 10000000 AND

ifOperStatus == 1 AND

rate(ifInOctets[5m]) > ifSpeed * 0.3

FOR 15m

ANNOTATIONS {

summary = "Port {{$labels.ifDescr}} is at 30%",

description = "Port {{$labels.ifDescr}} has had at least 30% traffic over the past 15 Minutes.",

severity = "warning"

}

ALERT if_traffic_at_70_percent

IF ifSpeed > 10000000 AND

ifOperStatus == 1 AND rate(ifInOctets[5m]) > ifSpeed * 0.7

FOR 15m

ANNOTATIONS {

summary = "Port {{$labels.ifDescr}} is at 70%",

description = "Port {{$labels.ifDescr}} has had at least 70% traffic over the past 15 Minutes.",

severity = "average"

}

# CPU alert

ALERT cpu_overload

IF node_load1 >= 0.8

FOR 3m

LABELS { severity = "all" }

ANNOTATIONS {

summary = "Instance {{ $labels.instance }} cpu_load1 over 80% for 3 minutes",

description = "{{ $labels.instance }} of job {{ $labels.job }} cpu_load1 over 80% for 3 minutes.",

}

# Memory alert

ALERT memory_overload

IF (node_memory_MemTotal-node_memory_MemFree)/node_memory_MemTotal >= 0.8

FOR 3m

LABELS { severity = "all" }

ANNOTATIONS {

summary = "Instance {{ $labels.instance }} memory_load over 80% for 3 minutes",

description = "{{ $labels.instance }} of job {{ $labels.job }} memory_load over 80% for 3 minutes.",

}
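To make the memory_overload condition concrete, the arithmetic can be worked through by hand. The following Python sketch mirrors the expression (node_memory_MemTotal - node_memory_MemFree) / node_memory_MemTotal >= 0.8 for a single sample; the function names and the byte values in the usage note are illustrative assumptions, and the FOR 3m clause (the condition must hold continuously for three minutes) is not modelled.

```python
def memory_used_ratio(mem_total: float, mem_free: float) -> float:
    """Compute (MemTotal - MemFree) / MemTotal, as in the memory_overload rule."""
    if mem_total <= 0:
        raise ValueError("mem_total must be positive")
    return (mem_total - mem_free) / mem_total


def should_alert(mem_total: float, mem_free: float, threshold: float = 0.8) -> bool:
    """True when the used-memory ratio is at or above the threshold."""
    return memory_used_ratio(mem_total, mem_free) >= threshold
```

With 8 GB total and 2 GB free the ratio is 0.75, so the rule would stay quiet; with 1 GB free it is 0.875 and the condition holds. Note that MemFree excludes buffers and page cache, so this expression tends to overstate memory pressure on Linux.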

---------------------------------------------------

# simple.yml

The file has three main parts: global sets the outgoing mail server, route sets the routing rule and the notification intervals, and receivers sets the recipients.

global:
  # Sender address and SMTP settings
  smtp_smarthost: 'smtp.abc.com'
  smtp_from: 'prometheus@abc.com'
  smtp_auth_username: 'prometheus'
  smtp_auth_password: 'abcd'

route:
  receiver: 'team-X-mails'
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 6h

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    # Apply inhibition if the alertname is the same.
    equal: ['alertname']

receivers:
  - name: 'team-X-mails'
    email_configs:
      - to: 'support@abc.com'
        send_resolved: true

# After editing, the configuration file must be reloaded
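The inhibit_rules entry is the least obvious part of this file: a warning alert is muted while a critical alert with the same alertname is firing. The following Python sketch is a simplified model of that behaviour, not Alertmanager's actual implementation; the names RULE, matches, and is_inhibited and the sample alerts in the usage note are invented for illustration.

```python
from typing import Dict, List

Labels = Dict[str, str]

# Mirrors the single inhibit rule from the configuration above.
RULE = {
    "source_match": {"severity": "critical"},
    "target_match": {"severity": "warning"},
    "equal": ["alertname"],
}


def matches(labels: Labels, wanted: Labels) -> bool:
    """True if every wanted label is present in labels with the same value."""
    return all(labels.get(key) == value for key, value in wanted.items())


def is_inhibited(target: Labels, firing: List[Labels], rule: dict = RULE) -> bool:
    """Simplified model: a target alert is muted when a firing source alert
    matches source_match and agrees with it on every label listed in equal."""
    if not matches(target, rule["target_match"]):
        return False
    for source in firing:
        same_on_equal = all(
            source.get(name) == target.get(name) for name in rule["equal"]
        )
        if matches(source, rule["source_match"]) and same_on_equal:
            return True
    return False
```

Under this model, a warning-level node_down alert is suppressed while a critical node_down alert is firing, but a warning-level cpu_overload alert is still delivered because its alertname differs.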

5. Installing and configuring cAdvisor

docker run -d --restart=always --volume=/:/rootfs:ro --volume=/var/run:/var/run:rw --volume=/sys:/sys:ro --volume=/var/lib/docker/:/var/lib/docker:ro --volume=/dev/disk/:/dev/disk:ro --publish=8090:8080 --detach=true --name=cadvisor google/cadvisor:latest

cAdvisor can then be reached in a browser at http://ip:8090

# Alert conditions for monitoring cAdvisor:

# vim containers.rules

ALERT cAdvisor_down

IF absent(container_memory_usage_bytes{name="cadvisor"})

FOR 1m

LABELS { severity = "critical" }

ANNOTATIONS {

summary= "cAdvisor containers down",

description= "cAdvisor container is down for more than 1 minutes."

}

ALERT cAdvisor_high_cpu

IF sum(rate(container_cpu_usage_seconds_total{name="cadvisor"}[1m])) / count(node_cpu{mode="system"}) * 100 > 10

FOR 5m

LABELS { severity = "warning" }

ANNOTATIONS {

summary= "cAdvisor high CPU usage",

description= "cAdvisor CPU usage is {{ humanize $value}}%."

}

ALERT cAdvisor_high_memory

IF sum(container_memory_usage_bytes{name="cadvisor"}) > 1200000000

FOR 5m

LABELS { severity = "warning" }

ANNOTATIONS {

summary = "cAdvisor high memory usage",

description = "cAdvisor memory consumption is at {{ humanize $value}}.",

}
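The cAdvisor_high_cpu expression divides the container's CPU usage rate by the number of cores and scales it to a percentage. The following Python sketch works through that arithmetic for two counter samples; it is a simplification under stated assumptions (Prometheus's rate() also handles counter resets and extrapolation, which this ignores), and the function name and sample numbers are illustrative.

```python
def cpu_percent(
    cpu_seconds_start: float,
    cpu_seconds_end: float,
    elapsed_seconds: float,
    cores: int,
) -> float:
    """Approximate sum(rate(cpu_seconds[window])) / cores * 100."""
    if elapsed_seconds <= 0 or cores <= 0:
        raise ValueError("elapsed_seconds and cores must be positive")
    # CPU-seconds consumed per wall-clock second over the window.
    rate = (cpu_seconds_end - cpu_seconds_start) / elapsed_seconds
    return rate / cores * 100
```

For example, a container that consumes 12 CPU-seconds in a 60-second window on a 4-core host is at roughly 5% of total CPU, below the 10% threshold in the rule.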

2017-09-25 13:41