prometheus

Source repository

Documentation

Architecture diagram

prometheus-server

Files: setup.sh, config/prometheus.yml

setup.sh:
#!/bin/bash
# Record the host timezone so the container can mount it read-only.
echo "Asia/Shanghai" > /etc/timezone

# The prom/prometheus image runs as nobody (65534), which must own the data directory.
mkdir -p "$(pwd)/data"
chown -R 65534:65534 "$(pwd)/data"

port=9090

# Remove any previous container; errors here are harmless on a first run.
docker stop prometheus
docker rm prometheus
docker run -d --net host \
--name prometheus \
-v /etc/timezone:/etc/timezone:ro \
-v /usr/share/zoneinfo/Asia/Shanghai:/etc/localtime:ro \
-v "$(pwd)/config/":/etc/prometheus/ \
-v "$(pwd)/data/":/prometheus/ \
-v "$(pwd)/groups/":/usr/local/prometheus/groups/ \
-v "$(pwd)/rules/":/usr/local/prometheus/rules/ \
prom/prometheus:v2.46.0 \
--config.file=/etc/prometheus/prometheus.yml \
--web.listen-address=0.0.0.0:${port} \
--web.console.libraries=/etc/prometheus/console_libraries \
--web.console.templates=/etc/prometheus/consoles \
--storage.tsdb.path=/prometheus \
--storage.tsdb.retention.time=60d \
--web.enable-admin-api
TIP

To see request logs while debugging, add the --log.level=debug flag to the startup script.
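
After starting the container it can be useful to wait until the server actually answers. The sketch below polls the /-/ready endpoint that Prometheus exposes; the function name wait_ready and the default of 30 attempts are assumptions made for illustration, not part of the original script.

```shell
#!/bin/bash
# wait_ready URL [TRIES]: poll URL/-/ready once per second until it returns 2xx.
# The name and the retry count are illustrative assumptions.
wait_ready() {
  local url="$1" tries="${2:-30}" i=1
  while [ "$i" -le "$tries" ]; do
    # -f makes curl exit non-zero on HTTP errors such as 503 (not ready yet).
    if curl -sf -o /dev/null "$url/-/ready"; then
      echo "ready after $i attempt(s)"
      return 0
    fi
    sleep 1
    i=$((i + 1))
  done
  echo "not ready after $tries attempt(s)" >&2
  return 1
}
```

Usage, assuming the default port from setup.sh: wait_ready http://127.0.0.1:9090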

exporter

Aggregating data from different nodes

Main configuration

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ["127.0.0.1:9090"]

  - job_name: 'federate'
    scrape_interval: 5s
    scrape_timeout: 3s
    metrics_path: '/federate'
    honor_labels: true
    params:
      'match[]':
        - '{job!=""}'
    file_sd_configs:
      - files: ['/usr/local/prometheus/groups/federate/*.json','/usr/local/prometheus/groups/federate/*.yml']
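
To check what a federated node would hand over before wiring it into the job above, the same request can be made by hand. This is a minimal sketch: federate_query is a hypothetical helper name, and it assumes the node is reachable over plain HTTP.

```shell
#!/bin/bash
# federate_query HOST:PORT MATCHER: fetch the series a node exposes on /federate.
# -G turns the --data-urlencode value into a URL-encoded query parameter.
federate_query() {
  curl -sG --data-urlencode "match[]=$2" "http://$1/federate"
}
```

Usage with the matcher from the job above: federate_query 127.0.0.1:9090 '{job!=""}' | head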

Group configuration

- targets: [ "127.0.0.1:9090"]
  labels:
    job_name: test
    instance: test
    comment: "test"
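
Since the federate job also picks up *.json files, target groups can be generated from a script. The helper below is a sketch under assumptions: target_group_json is a hypothetical name, it only sets the comment label, and it does no JSON escaping, so the values must not contain quotes or backslashes.

```shell
#!/bin/bash
# target_group_json COMMENT TARGET...: print a one-group file_sd document as JSON.
target_group_json() {
  local comment="$1" targets="" t
  shift
  for t in "$@"; do
    # Join the targets as a comma-separated list of JSON strings.
    targets="${targets:+$targets, }\"$t\""
  done
  printf '[{"targets": [%s], "labels": {"comment": "%s"}}]\n' "$targets" "$comment"
}
```

Usage: target_group_json test 127.0.0.1:9090 > groups/federate/test.json. Prometheus watches file_sd files and picks up changes without a restart.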

pushgateway

Startup

#!/bin/bash

docker kill pushgateway
docker rm pushgateway
docker run -d --net host \
--name pushgateway \
--restart=always \
-v /etc/localtime:/etc/localtime:ro \
-v /etc/timezone:/etc/timezone:ro \
prom/pushgateway:v1.7.0 \
--web.listen-address=0.0.0.0:9091 \
--web.enable-admin-api \
--web.telemetry-path=/metrics

prometheus

Add to prometheus.yml:

- job_name: 'aliyun-server-info'
  scrape_interval: 60s
  scrape_timeout: 35s
  static_configs:
    - targets: ['192.168.1.1:9091']
      labels:
        instance: aliyun-server-info

Pushing data

The URI generally has the form /metrics/job/<job_name>/instance/<instance_name>

cat <<EOF | curl --data-binary @- http://192.168.1.1:9091/metrics/job/balanceinfo/instance/${account}
${balance}
${balancetime}
EOF
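
The request body has to be in the Prometheus text exposition format, one sample per line and ending with a newline. The sketch below wraps the URL layout and the body in small helpers; the function names and the metric name account_balance are assumptions for illustration, and job or instance values containing a slash are not handled.

```shell
#!/bin/bash
# push_url BASE JOB INSTANCE: build the grouping-key URL for a push.
push_url() {
  printf '%s/metrics/job/%s/instance/%s\n' "$1" "$2" "$3"
}

# gauge_lines NAME VALUE: emit one gauge sample in the text exposition format.
gauge_lines() {
  printf '# TYPE %s gauge\n%s %s\n' "$1" "$1" "$2"
}

# push_gauge BASE JOB INSTANCE NAME VALUE: POST the sample to the Pushgateway.
push_gauge() {
  gauge_lines "$4" "$5" | curl -sf --data-binary @- "$(push_url "$1" "$2" "$3")"
}
```

Usage: push_gauge http://192.168.1.1:9091 balanceinfo "${account}" account_balance 12.5. A pushed group stays in the Pushgateway until it is deleted, for example with curl -X DELETE on the same URL.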

grafana

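Data source

Add a Prometheus data source and set the URL in the HTTP section to http://10.0.18.2:9090

Dashboards

Look for templates at https://grafana.com/grafana/dashboards/

Recommended templates:

In the MySQL dashboard you can add a panel with this query:

mysql_version_info{instance="$host"}

In Options, select Instant and set Legend to {{version}}

On the right, choose the Stat visualization and set Text mode to Name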

Alert notifications

prometheus-webhook-dingtalk

Files: config/config.yml, templates/alertmanager-dingtalk.tmpl, setup.sh

config/config.yml:
# Request timeout
timeout: 5s

# Customizable templates path
templates:
  - templates/alertmanager-dingtalk.tmpl

targets:
  webhook:
    # Robot for the internal ops group
    url: https://oapi.dingtalk.com/robot/send?access_token=1adaa314f6d04b7
    # secret for signature
    secret: SEC9e23
    message:
      text: '{{ template "dingtalk.to.message" . }}'

alertmanager

Files: config/alertmanager.yml, templates/alertmanager-email.tmpl, setup.sh

config/alertmanager.yml:
global:
  # Alerts that carry no end time are treated as resolved if not updated within 2 minutes
  resolve_timeout: 2m
# route defines how alerts are dispatched

route:
  receiver: 'ops-dingtalk'
  group_by: ['...']
  group_wait: 3s
  group_interval: 1m
  repeat_interval: 5m
  routes:
    - receiver: 'ops-dingtalk'
      group_by: ['...']
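      # Initial wait before the first notification for a group, so alerts that arrive together are merged into one message
      group_wait: 3s
      # A still-firing alert is re-sent on the first group_interval tick after repeat_interval has elapsed,
      # so the gap between repeats can approach group_interval + repeat_interval
      # group_interval: how long to wait before notifying about new alerts added to an already-notified group
      group_interval: 1m
      # repeat_interval: how long to wait before re-sending a notification that was already sent successfully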
      repeat_interval: 5m
      matchers:
        - severity=~"^信息$|^警告$|^一般严重$|^严重$|^灾难$|^测试模板$"
receivers:
  - name: 'ops-dingtalk'
    webhook_configs:
      - send_resolved: true
        url: 'http://10.0.18.2:8060/dingtalk/webhook/send'
        max_alerts: 0
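
To exercise the route and the DingTalk receiver without waiting for a real rule to fire, a hand-made alert can be posted to the Alertmanager v2 API. This is a sketch under assumptions: the helper names and the alertname ManualTest are hypothetical, and the severity value must be one that the matcher above accepts.

```shell
#!/bin/bash
# test_alert_json NAME SEVERITY: build a one-alert payload for POST /api/v2/alerts.
# No JSON escaping is done, so the values must not contain quotes or backslashes.
test_alert_json() {
  printf '[{"labels": {"alertname": "%s", "severity": "%s"}, "annotations": {"summary": "manual test alert"}}]\n' "$1" "$2"
}

# send_test_alert BASE NAME SEVERITY: post the payload to Alertmanager.
send_test_alert() {
  test_alert_json "$2" "$3" |
    curl -sf -H 'Content-Type: application/json' --data-binary @- "$1/api/v2/alerts"
}
```

Usage: send_test_alert http://10.0.18.2:9093 ManualTest 测试模板. With group_wait set to 3s, the message should reach the receiver a few seconds later.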

Add to the Prometheus configuration

alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - '10.0.18.2:9093'