Monitor and Alarm 2019(1)Prometheus Grafana Alertmanager

sillycat


Install Prometheus
Download it from here
https://prometheus.io/download/
I chose Operating System Linux, Architecture amd64
> wget https://github.com/prometheus/prometheus/releases/download/v2.14.0/prometheus-2.14.0.linux-amd64.tar.gz
> tar zxvf prometheus-2.14.0.linux-amd64.tar.gz
> mv prometheus-2.14.0.linux-amd64 ~/tool/prometheus-2.14.0
> sudo ln -s /home/carl/tool/prometheus-2.14.0 /opt/prometheus-2.14.0
> sudo ln -s /opt/prometheus-2.14.0 /opt/prometheus

> vi ~/.bash_profile
PATH=$PATH:/opt/prometheus
> . ~/.bash_profile

Check version
> prometheus --version
prometheus, version 2.14.0 (branch: HEAD, revision: edeb7a44cbf745f1d8be4ea6f215e79e651bfe19)
build user:       root@df2327081015
build date:       20191111-14:27:12
go version:       go1.13.4

Keep the default configuration file
> cat prometheus.yml
# my global config
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
    - targets: ['localhost:9090']

Start the Service
> prometheus --config.file=prometheus.yml

Visit the console
http://rancher-home:9090/graph
Metrics
http://rancher-home:9090/metrics

The console shows a warning
Warning! Detected 48737.85 seconds time difference between your browser and the server. Prometheus relies on accurate time and time drift might cause unexpected query results.

Solution:
Sync the clock with NTP
> sudo yum install ntp ntpdate
> sudo systemctl start ntpd
> sudo systemctl enable ntpd
> sudo systemctl status ntpd

The warning is gone after that.
On the console, we can search [prometheus_http_requests_total{code="200"}]

Or

We can count the matching time series using this expression [count(prometheus_http_requests_total{code="200"})]

More examples of querying Prometheus
https://prometheus.io/docs/prometheus/latest/querying/basics/
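
The same expressions can also be sent to the Prometheus HTTP API instead of the web console. Here is a minimal Python sketch, assuming Prometheus listens on localhost:9090. The helper names build_query_url and run_query are made up for illustration, and run_query only works against a running server.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url, expr):
    """Build the instant-query URL for a PromQL expression."""
    # urlencode escapes the braces and quotes in the expression
    params = urllib.parse.urlencode({"query": expr})
    return base_url.rstrip("/") + "/api/v1/query?" + params


def run_query(base_url, expr, timeout=5):
    """Send the query and return the decoded JSON response (needs a running server)."""
    url = build_query_url(base_url, expr)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Only prints the URL, so it is safe to run without a server
    print(build_query_url("http://localhost:9090",
                          'prometheus_http_requests_total{code="200"}'))
```

Calling run_query("http://localhost:9090", "up") should return the same data that the console shows for the expression up.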

Node Exporter
> wget https://github.com/prometheus/node_exporter/releases/download/v0.18.1/node_exporter-0.18.1.linux-amd64.tar.gz
> tar zxvf node_exporter-0.18.1.linux-amd64.tar.gz
> mv node_exporter-0.18.1.linux-amd64 ~/tool/node_exporter-0.18.1
> sudo ln -s /home/carl/tool/node_exporter-0.18.1 /opt/node_exporter-0.18.1
> sudo ln -s /opt/node_exporter-0.18.1 /opt/node_exporter

Add this to the PATH
PATH=$PATH:/opt/node_exporter

Start the service
> node_exporter

The metrics are here
http://rancher-home:9100/metrics
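
The endpoint serves plain text: one sample per line in the form name{labels} value, plus comment lines starting with #. Here is a rough Python sketch of reading that format. The sample text is hand-written for illustration and is not real output. parse_samples is a hypothetical helper, and it is simplified: it ignores timestamps and escaped characters inside label values.

```python
import re

# name, optional {labels}, then the value; timestamps and escapes are not handled
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')

# Hand-written example in the text exposition format (values are made up)
EXAMPLE = """# HELP node_cpu_seconds_total Seconds the cpus spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 1200.5
node_cpu_seconds_total{cpu="0",mode="user"} 30.25
node_load1 0.5
"""


def parse_samples(text):
    """Return a list of (name, labels, value) tuples, skipping comments."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # not a line this simplified parser understands
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


if __name__ == "__main__":
    for sample in parse_samples(EXAMPLE):
        print(sample)
```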
Add the Node Exporter to the scrape_configs section of prometheus.yml
  - job_name: 'node'
    static_configs:
    - targets: ['localhost:9100']

Restart Prometheus with the lifecycle endpoints enabled, so that the configuration can be reloaded over HTTP
> prometheus --config.file=prometheus.yml --web.enable-lifecycle

Reload the configuration with curl
> curl -X POST http://localhost:9090/-/reload

We can now see many metrics starting with node_, and when we query up, there are 2 jobs running
job="prometheus"
job="node"
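
The up check can be scripted too. The Python sketch below takes a response body in the shape returned by /api/v1/query?query=up and reports which targets are up. The sample JSON is hand-written for illustration (the timestamps are made up), and jobs_up is a hypothetical helper.

```python
import json

# Hand-written sample in the shape of an instant-query response for `up`
SAMPLE_RESPONSE = """
{"status": "success",
 "data": {"resultType": "vector",
          "result": [
            {"metric": {"__name__": "up", "instance": "localhost:9090", "job": "prometheus"},
             "value": [1574000000.0, "1"]},
            {"metric": {"__name__": "up", "instance": "localhost:9100", "job": "node"},
             "value": [1574000000.0, "1"]}
          ]}}
"""


def jobs_up(response_text):
    """Map each (job, instance) pair to True when its `up` value is 1."""
    body = json.loads(response_text)
    if body.get("status") != "success":
        raise ValueError("query failed: %r" % body.get("error"))
    status = {}
    for series in body["data"]["result"]:
        metric = series["metric"]
        _timestamp, value = series["value"]  # the value arrives as a string
        status[(metric["job"], metric["instance"])] = float(value) == 1.0
    return status


if __name__ == "__main__":
    for target, is_up in jobs_up(SAMPLE_RESPONSE).items():
        print(target, "up" if is_up else "down")
```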

To expose metrics from our own Go applications, we can use this library
https://github.com/prometheus/client_golang

Install Grafana
https://www.jianshu.com/p/6ebbb7fe35aa
https://www.jianshu.com/p/e475fab6e41a
https://blog.csdn.net/wzygis/article/details/52727067

Here is the download page
https://grafana.com/grafana/download
> wget https://dl.grafana.com/oss/release/grafana-6.4.4.linux-amd64.tar.gz
> tar zxvf grafana-6.4.4.linux-amd64.tar.gz
> mv grafana-6.4.4 ~/tool/
> sudo ln -s /home/carl/tool/grafana-6.4.4 /opt/grafana-6.4.4
> sudo ln -s /opt/grafana-6.4.4 /opt/grafana
Add to the PATH
PATH=$PATH:/opt/grafana/bin

Start the server with the sample configuration, which is applied on top of conf/defaults.ini (run this from /opt/grafana)
> grafana-server --config conf/sample.ini

Visit the console page, username admin, password admin
http://rancher-home:3000/login

Grafana and Prometheus Settings
https://www.jianshu.com/p/82abd86ef447
https://learnku.com/articles/22193
[Add Data Source] -> [Prometheus] -> URL http://rancher-home:9090 -> Save and Test -> [New Dashboard] -> [Prometheus 2.0 Stats]

We can find dashboard templates here https://grafana.com/grafana/dashboards?dataSource=prometheus, enter the template ID on the import page and click [Load]
I imported the template https://grafana.com/grafana/dashboards/11074 for Node Exporter, and it works well.

Install AlertManager
https://www.jianshu.com/p/655cb5f85a33
https://www.cnblogs.com/longcnblogs/p/9620733.html
https://yunlzheng.gitbook.io/prometheus-book/parti-prometheus-ji-chu/alert/install-alert-manager

> wget https://github.com/prometheus/alertmanager/releases/download/v0.19.0/alertmanager-0.19.0.linux-amd64.tar.gz
> tar zxvf alertmanager-0.19.0.linux-amd64.tar.gz
> mv alertmanager-0.19.0.linux-amd64 ~/tool/alertmanager-0.19.0
> sudo ln -s /home/carl/tool/alertmanager-0.19.0 /opt/alertmanager-0.19.0
> sudo ln -s /opt/alertmanager-0.19.0 /opt/alertmanager

Add this to PATH
PATH=$PATH:/opt/alertmanager

Check the default configuration file
> vi alertmanager.yml

Create the data directory under /opt/alertmanager
> mkdir data

Start the Service
> alertmanager --config.file=alertmanager.yml --storage.path=/opt/alertmanager/data

Visit the console page
http://rancher-home:9093/#/alerts

Configure Prometheus to send alerts to AlertManager in prometheus.yml
alerting:
  alertmanagers:
  - static_configs:
    - targets: ['localhost:9093']

Reload the configuration
> curl -X POST http://localhost:9090/-/reload

Define the alert rules in Prometheus
https://yunlzheng.gitbook.io/prometheus-book/parti-prometheus-ji-chu/alert/prometheus-alert-rule

This document seems great
https://yunlzheng.gitbook.io/prometheus-book/parti-prometheus-ji-chu/alert/prometheus-alert-manager-overview

Change the prometheus configuration
rule_files:
- /opt/prometheus/rules/*.rules

Create the rule files
> mkdir rules
> vi rules/hoststats-alert.rules
> cat rules/hoststats-alert.rules
groups:
- name: hostStatsAlert
  rules:
  - alert: hostCpuUsageAlert
    expr: sum(avg without (cpu)(irate(node_cpu{mode!='idle'}[5m]))) by (instance) > 0.85
    for: 1m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} CPU usage high"
      description: "{{ $labels.instance }} CPU usage above 85% (current value: {{ $value }})"
  - alert: hostMemUsageAlert
    expr: (node_memory_MemTotal - node_memory_MemAvailable)/node_memory_MemTotal > 0.85
    for: 1m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} MEM usage high"
      description: "{{ $labels.instance }} MEM usage above 85% (current value: {{ $value }})"

Reload the configuration
> curl -X POST http://localhost:9090/-/reload

We can see the rules we configured
http://rancher-home:9090/rules

We can also see that no alerts are active yet
http://rancher-home:9090/alerts

Manually keep the CPU busy for more than 1 minute
> cat /dev/zero > /dev/null

This does not work as expected. First of all, there are 2 cores, so one cat command can only drive one core to 100%.
Second, I am using the latest Node Exporter, and its metric names have changed: node_cpu is now node_cpu_seconds_total, and the memory metrics have a _bytes suffix. For example
rate(node_cpu_seconds_total{mode="system"}[1m])

So the expressions should be
sum(avg without (cpu)(irate(node_cpu_seconds_total{mode!='idle'}[5m]))) by (instance)
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)/node_memory_MemTotal_bytes
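
To see why a single cat process cannot cross the 85% threshold on a 2-core machine, the aggregation can be reproduced by hand. Here is a minimal Python sketch with made-up per-core rates (one core fully busy, the other idle). cpu_usage and memory_usage are hypothetical helpers that mirror the two expressions for a single instance.

```python
from collections import defaultdict


def cpu_usage(rates):
    """rates maps (cpu, mode) -> per-second rate of non-idle CPU time for one instance."""
    per_mode = defaultdict(list)
    for (_cpu, mode), rate in rates.items():
        per_mode[mode].append(rate)
    # avg without (cpu): average each mode across the cores
    averaged = {mode: sum(vals) / len(vals) for mode, vals in per_mode.items()}
    # sum by (instance): add the per-mode averages together
    return sum(averaged.values())


def memory_usage(total_bytes, available_bytes):
    """(MemTotal - MemAvailable) / MemTotal as a fraction."""
    return (total_bytes - available_bytes) / total_bytes


# Made-up rates: core 0 is fully busy, core 1 is idle
one_busy_core = {
    ("0", "system"): 0.75, ("0", "user"): 0.25,
    ("1", "system"): 0.0, ("1", "user"): 0.0,
}

if __name__ == "__main__":
    print(cpu_usage(one_busy_core))
    print(memory_usage(8 * 1024 ** 3, 2 * 1024 ** 3))
```

With these numbers the CPU value is 0.5, well below 0.85, so both cores have to be busy (for example with two cat processes) before the CPU alert can fire.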

> cat rules/hoststats-alert.rules
groups:
- name: hostStatsAlert
  rules:
  - alert: hostCpuUsageAlert
    expr: sum(avg without (cpu)(irate(node_cpu_seconds_total{mode!='idle'}[5m]))) by (instance) > 0.85
    for: 1m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} CPU usage high"
      description: "{{ $labels.instance }} CPU usage above 85% (current value: {{ $value }})"
  - alert: hostMemUsageAlert
    expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)/node_memory_MemTotal_bytes > 0.85
    for: 1m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} MEM usage high"
      description: "{{ $labels.instance }} MEM usage above 85% (current value: {{ $value }})"
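
The for: 1m clause means the expression has to keep matching for at least one minute before the alert moves from pending to firing. Here is a simplified Python model of that behavior, assuming the 15s evaluation_interval from prometheus.yml. alert_states is a hypothetical helper written for illustration, not Prometheus code.

```python
def alert_states(results, interval=15, hold=60):
    """Return the alert state after each rule evaluation.

    results: one boolean per evaluation, True when the expression matched.
    interval: seconds between evaluations.
    hold: the `for` duration in seconds.
    """
    states = []
    active_since = None  # time of the first match in the current run of matches
    for i, matched in enumerate(results):
        now = i * interval
        if not matched:
            active_since = None  # condition cleared, so the timer resets
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        states.append("firing" if now - active_since >= hold else "pending")
    return states


if __name__ == "__main__":
    print(alert_states([True, True, True, True, True]))
    print(alert_states([True, True, False, True]))
```

In this model a short CPU spike that clears before the minute is over never reaches firing, which is the reason the load has to be kept up for more than 1 minute.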

Now it works well, and we can check the result in
http://rancher-home:9090/alerts
http://rancher-home:3000/d/hb7fSE0Zz/1-node-exporter-for-prometheus-dashboard-english-version-update-1102?orgId=1&var-job=node&var-hostname=rancher-home&var-node=All&var-maxmount=%2F&var-env=&var-name=
http://rancher-home:9093/#/alerts



2019-11-17 14:23