Monitoring Kubernetes with Prometheus + Grafana

ops · 2024-10-17 16:58:00

Contents

  • I. Prometheus overview
  • II. Prometheus components and monitoring
  • III. Prometheus basics: how monitoring works
  • IV. Kubernetes monitoring metrics
  • V. Prometheus basics: deployment
    • 1. Deploying Prometheus + Grafana with Docker
    • 2. Reviewing the Prometheus configuration file
    • 3. Monitoring a Linux server
      • 3.1 Download the package for your platform
      • 3.2 Configure Prometheus to scrape the node metrics
      • 3.3 Visualize the collected data in Grafana
  • VI. Deploying the Prometheus stack on Kubernetes
  • VII. Prometheus basics: querying data


I. Prometheus overview

Prometheus is a monitoring system originally built at SoundCloud. It has been a community open-source project since 2012 and has a very active developer and user community. To emphasize its open-source, independently maintained nature, Prometheus joined the Cloud Native Computing Foundation (CNCF) in 2016 as its second hosted project, after Kubernetes.
Project links: https://prometheus.io/
https://github.com/prometheus


II. Prometheus components and monitoring

  • Prometheus Server: scrapes metrics, stores the time series data, and exposes a query API;
  • Client Library: client instrumentation libraries;
  • Push Gateway: short-lived metric storage, mainly for ephemeral jobs;
  • Exporters: collect metrics from existing third-party services and expose them on a /metrics endpoint;
  • Alertmanager: alerting;
  • Web UI: a simple web console;

III. Prometheus basics: how monitoring works

To monitor anything, you must first be able to obtain metrics from the monitored endpoint, and those metrics must follow Prometheus's data model so that Prometheus can recognize and scrape them. The metrics are usually exposed by an exporter.
An exporter collects metrics from an application and exposes them in the Prometheus format; the Prometheus server then scrapes and processes them. The exporter is the adapter between the application and Prometheus's data model.

Exporter list:
https://prometheus.io/docs/instrumenting/exporters/
This page lists exporters for many applications: nginx, databases, message queues, and so on.
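To make the exposition-format idea concrete, here is a minimal stdlib sketch of what an exporter produces (not a real exporter; in practice you would use an official client library). The metric names and values are invented for illustration:

```python
# Sketch: render samples in Prometheus's text exposition format,
# i.e. one `name{label="value"} value` line per sample.
# Metric names and values below are illustrative only.

def render_metrics(samples):
    """Render (name, labels, value) tuples as Prometheus exposition text."""
    lines = []
    for name, labels, value in samples:
        if labels:
            label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"

if __name__ == "__main__":
    body = render_metrics([
        ("app_requests_total", {"method": "GET"}, 1027),
        ("app_up", {}, 1),
    ])
    print(body, end="")
```

A real exporter serves this text over HTTP so Prometheus can scrape it on an interval.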


IV. Kubernetes monitoring metrics

Kubernetes itself:

  • Node resource utilization
  • Node count
  • Number of Pods running on each Node
  • Resource object status (Deployments, Services, Pods, etc.)

Pods:

  • Total Pod count and the desired count of each controller
  • Pod status (e.g. whether it is Running)
  • Container resource utilization: CPU, memory, network

Kubernetes monitoring approach:

  • Pods: the kubelet on each node exposes a cAdvisor-backed metrics endpoint with performance metrics for all Pods and containers on that node;
    Endpoint: https://NodeIP:10250/metrics/cadvisor
    curl -k https://NodeIP:10250/metrics/cadvisor returns "Unauthorized", because the endpoint uses the cluster's self-signed certificates and requires a token
  • Nodes
    Node resource utilization is collected by node_exporter (deployed as a DaemonSet);
    Project: https://github.com/prometheus/node_exporter
  • Kubernetes resource objects
    kube-state-metrics collects the state of the various Kubernetes resource objects;
    Project: https://github.com/kubernetes/kube-state-metrics

V. Prometheus basics: deployment

Prometheus installation docs: https://prometheus.io/docs/prometheus/latest/installation/

Grafana installation docs (Docker): https://grafana.com/docs/grafana/latest/setup-grafana/installation/docker/

Grafana dashboards: https://grafana.com/grafana/dashboards/?pg=graf&plcmt=dashboard-below-text

1. Deploying Prometheus + Grafana with Docker

Prometheus
In the Docker image, Prometheus stores its data under /prometheus.

mkdir /home/prometheus-data
chown -R 65534:65534 /home/prometheus-data

# -v /home/prometheus.yml:... mounts your prepared config file into the container (Prometheus needs it at startup)
# -v /home/prometheus-data:/prometheus persists the data directory on the host
docker run -itd \
  -p 9090:9090 \
  -v /home/prometheus.yml:/etc/prometheus/prometheus.yml \
  -v /home/prometheus-data:/prometheus \
  prom/prometheus


Grafana
In the Docker image, Grafana stores its data under /var/lib/grafana.

mkdir /home/grafana-data
chmod 777 /home/grafana-data
docker run -d -p 3000:3000 --name=grafana \
  --volume /home/grafana-data:/var/lib/grafana \
  grafana/grafana:8.4.0
# Username/password: admin/admin; you are asked to change the password on first login


2. Reviewing the Prometheus configuration file

# Global configuration
global:
  # How often to scrape targets; default 1m
  scrape_interval: 15s
  # How often to evaluate alerting rules; default 1m
  evaluation_interval: 15s
  # Scrape timeout; default 10s
  scrape_timeout: 5s

# Alerting configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093

# Files containing alerting rules
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# Scrape configuration: each monitored endpoint becomes a target; targets are grouped
# and managed per job (job_name), defined either statically or via service discovery.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ["localhost:9090"]
  • Target: an endpoint being monitored
  • Instance: each scraped endpoint is called an instance
  • Job: a collection (group) of instances with the same purpose is called a job
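How these three terms fit together can be sketched in a few lines. Every series scraped from a target inherits a job label (from job_name) and an instance label (the host:port address). The job names and addresses mirror this article's configuration; the helper function itself is illustrative, not Prometheus code:

```python
# Sketch: expand static scrape configs into the label set
# (job, instance) that every scraped series inherits.

def expand_targets(scrape_configs):
    """Yield the target-derived labels for each scraped endpoint."""
    out = []
    for job in scrape_configs:
        for static in job["static_configs"]:
            for addr in static["targets"]:
                out.append({"job": job["job_name"], "instance": addr})
    return out

targets = expand_targets([
    {"job_name": "prometheus", "static_configs": [{"targets": ["localhost:9090"]}]},
    {"job_name": "linux_server", "static_configs": [{"targets": ["192.168.1.2:9100"]}]},
])
print(targets)
```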

3. Monitoring a Linux server

node_exporter is the metrics collector for Linux systems.
Commonly used metrics:

  • CPU
  • Memory
  • Disk
  • Network traffic
  • File descriptors
  • System load
  • System services

Metrics endpoint: http://IP:9100/metrics

Guide: https://prometheus.io/docs/guides/node-exporter/
GitHub: https://github.com/prometheus/node_exporter

3.1 Download the package for your platform

https://github.com/prometheus/node_exporter/releases/download/v1.8.2/node_exporter-1.8.2.linux-amd64.tar.gz

On the Linux host, download the archive, extract it, and run the binary directly:

[root@k8s-node1 fands]# tar -zxf node_exporter-1.8.2.linux-amd64.tar.gz 
[root@k8s-node1 fands]# cd node_exporter-1.8.2.linux-amd64
[root@k8s-node1 node_exporter-1.8.2.linux-amd64]# ./node_exporter 

Once it is running, open IP:9100/metrics in a browser; node_exporter publishes all the node metrics it collects there.
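Each line on that page is one sample in the text exposition format, `name{labels} value`. A rough stdlib sketch of reading that format (the SAMPLE payload below is illustrative, not real node_exporter output):

```python
# Sketch: parse Prometheus text-format metrics into a dict,
# skipping # HELP / # TYPE comment lines.
import re

SAMPLE = """\
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.42
node_cpu_seconds_total{cpu="0",mode="idle"} 12345.6
"""

def parse_metrics(text):
    """Return {(name, labels_str): float_value} for each sample line."""
    result = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        m = re.match(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{.*\})?\s+(\S+)$', line)
        if m:
            name, labels, value = m.groups()
            result[(name, labels or "")] = float(value)
    return result

print(parse_metrics(SAMPLE))
```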

3.2 Configure Prometheus to scrape the node metrics

[root@k8s-master prometheus]# docker ps -a | grep prometheus 
6b25be4431be   prom/prometheus                                     "/bin/prometheus --c…"   4 weeks ago   Up 6 minutes               0.0.0.0:9090->9090/tcp   prometheus
[root@k8s-master prometheus]# docker exec -it 6b25be4431be /bin/sh
/prometheus $ vi /etc/prometheus/prometheus.yml 
scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]
  # Add a new scrape job
  - job_name: "linux_server"
    static_configs:
      - targets: ["192.168.1.2:9100"]

Then restart the Prometheus container:

[root@k8s-master prometheus]# docker restart 6b25be4431be
6b25be4431be

The Linux server's metrics now appear in Prometheus.
Typing "node" in the query box lists everything node_exporter collects, for example the node memory metrics.
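Many of these metrics, such as node_cpu_seconds_total, are counters, so dashboards usually graph their per-second rate rather than the raw value. A toy sketch of the idea behind PromQL's rate() function (the samples are made up, and real rate() additionally handles counter resets and extrapolation):

```python
# Sketch: per-second rate of a monotonically increasing counter,
# computed from (timestamp, value) samples.

def simple_rate(samples):
    """Per-second increase between the first and last (ts, value) sample."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)

# 60 s apart, counter grew by 6 CPU-seconds -> 0.1 core of usage
print(simple_rate([(0, 100.0), (60, 106.0)]))  # 0.1
```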

3.3 Visualize the collected data in Grafana

In Grafana, add a Prometheus data source and import dashboard template "9276".



VI. Deploying the Prometheus stack on Kubernetes

1. Deploy the components

1.1 prometheus-configmap

# prometheus-config.yaml: the main configuration, covering Kubernetes service discovery and alerting

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: ops
data:
  prometheus.yml: |
    rule_files:
    - /etc/config/rules/*.rules
    scrape_configs:
    - job_name: prometheus
      static_configs:
      - targets:
        - localhost:9090
    # Scrape kube-apiserver metrics
    - job_name: kubernetes-apiservers
      # Kubernetes service discovery
      kubernetes_sd_configs:
      - role: endpoints
      relabel_configs:
      - action: keep
        # Keep only the endpoint named kubernetes, port https, in the default namespace (kubectl get ep)
        regex: default;kubernetes;https
        source_labels:
        - __meta_kubernetes_namespace
        - __meta_kubernetes_service_name
        - __meta_kubernetes_endpoint_port_name
      scheme: https
      # Authorization: ca.crt is the cluster root CA (kubeadm installs keep it at /etc/kubernetes/pki/ca.crt);
      # the ServiceAccount token is used to authenticate against kube-apiserver
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        insecure_skip_verify: true
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    # Scrape metrics of every node in the cluster
    - job_name: 'kubernetes-nodes-kubelet'
      kubernetes_sd_configs:
      - role: node
      relabel_configs:
      - source_labels: [__meta_kubernetes_node_name]
        regex: (.+)
        target_label: __metrics_path__
        # Actual metrics endpoint: https://NodeIP:10250/metrics
        replacement: /metrics
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        insecure_skip_verify: true
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    # Scrape the Pod/container metrics exposed by each node
    - job_name: 'kubernetes-nodes-cadvisor'
      kubernetes_sd_configs:
      - role: node
      relabel_configs:
      - source_labels: [__meta_kubernetes_node_name]
        regex: (.+)
        target_label: __metrics_path__
        # Actual metrics endpoint: https://NodeIP:10250/metrics/cadvisor (overrides the default metrics path)
        replacement: /metrics/cadvisor
      metric_relabel_configs:
      # Copy the instance label value into a node label
      - source_labels: [instance]
        separator: ;
        regex: (.+)
        target_label: node
        replacement: $1
        action: replace
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        insecure_skip_verify: true
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    # Scrape the Pods behind each Service's Endpoints
    - job_name: kubernetes-service-endpoints
      kubernetes_sd_configs:
      - role: endpoints  # discover targets from the Endpoints of each Service
      relabel_configs:
      # Skip Services without the prometheus.io/scrape annotation
      - action: keep
        regex: true
        source_labels:
        - __meta_kubernetes_service_annotation_prometheus_io_scrape
      # Override the scrape scheme
      - action: replace
        regex: (https?)
        source_labels:
        - __meta_kubernetes_service_annotation_prometheus_io_scheme
        target_label: __scheme__
      # Override the metrics URL path
      - action: replace
        regex: (.+)
        source_labels:
        - __meta_kubernetes_service_annotation_prometheus_io_path
        target_label: __metrics_path__
      # Override the target address
      - action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        source_labels:
        - __address__
        - __meta_kubernetes_service_annotation_prometheus_io_port
        target_label: __address__
      # Map Kubernetes labels onto new label names, keeping their values
      - action: labelmap
        regex: __meta_kubernetes_service_label_(.+)
      # Add a namespace label
      - action: replace
        source_labels:
        - __meta_kubernetes_namespace
        target_label: kubernetes_namespace
      # Add a Service name label
      - action: replace
        source_labels:
        - __meta_kubernetes_service_name
        target_label: kubernetes_name
    # Scrape Kubernetes object state metrics (Deployments, Services, Ingresses, Secrets, ...)
    - job_name: kube-state-metrics
      kubernetes_sd_configs:
      - role: endpoints
        namespaces:
          names:
          - ops
      relabel_configs:
      - source_labels: [__meta_kubernetes_service_label_app_kubernetes_io_name]
        regex: kube-state-metrics
        replacement: $1
        action: keep
    # Scrape Pod metrics (Pods annotated with scrape: "true")
    - job_name: kubernetes-pods
      kubernetes_sd_configs:
      - role: pod  # discover Pods as targets
      relabel_configs:
      # Skip Pods without the prometheus.io/scrape annotation
      - action: keep
        regex: true
        source_labels:
        - __meta_kubernetes_pod_annotation_prometheus_io_scrape
      # Override the metrics URL path
      - action: replace
        regex: (.+)
        source_labels:
        - __meta_kubernetes_pod_annotation_prometheus_io_path
        target_label: __metrics_path__
      # Override the target address
      - action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        source_labels:
        - __address__
        - __meta_kubernetes_pod_annotation_prometheus_io_port
        target_label: __address__
      # Map Kubernetes labels onto new label names, keeping their values
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      # Add a namespace label
      - action: replace
        source_labels:
        - __meta_kubernetes_namespace
        target_label: kubernetes_namespace
      # Add a Pod name label
      - action: replace
        source_labels:
        - __meta_kubernetes_pod_name
        target_label: kubernetes_pod_name
    alerting:
      alertmanagers:
      - static_configs:
        - targets: ["alertmanager:80"]
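The `__address__` relabel rule above joins the discovered address with the port taken from the prometheus.io/port annotation (source labels are concatenated with `;` before matching). A quick check of that regex in isolation; the addresses and ports are made up:

```python
# Sketch: the __address__ rewrite regex from the config,
# ([^:]+)(?::\d+)?;(\d+)  ->  $1:$2
import re

def rewrite_address(address, annotation_port):
    """Replace any existing port on `address` with the annotated port."""
    joined = f"{address};{annotation_port}"  # source_labels joined by ';'
    m = re.match(r"([^:]+)(?::\d+)?;(\d+)$", joined)
    return f"{m.group(1)}:{m.group(2)}" if m else address

print(rewrite_address("10.244.1.5:8080", "9100"))  # 10.244.1.5:9100
print(rewrite_address("10.244.1.5", "9100"))       # 10.244.1.5:9100
```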

1.2 prometheus-rules

# prometheus-rules.yaml: the alerting rules, with thresholds for each metric

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-rules
  namespace: ops
data:
  general.rules: |
    groups:
    - name: general.rules
      rules:
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: error
          alertinstance: '{{ $labels.job }}/{{ $labels.instance }}'
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."
  node.rules: |
    groups:
    - name: node.rules
      rules:
      - alert: NodeFilesystemUsage
        expr: 100 - (node_filesystem_free_bytes{fstype=~"ext4|xfs|ext3"} / node_filesystem_size_bytes{fstype=~"ext4|xfs|ext3"} * 100) > 80
        for: 1m
        labels:
          severity: warning
          alertinstance: '{{ $labels.instance }}:{{ $labels.device }}'
        annotations:
          summary: "Instance {{ $labels.instance }}: partition {{ $labels.mountpoint }} usage is high"
          description: "{{ $labels.instance }}: partition {{ $labels.mountpoint }} usage is above 80% (current value: {{ $value }})"
      - alert: NodeMemoryUsage
        expr: 100 - (node_memory_MemFree_bytes+node_memory_Cached_bytes+node_memory_Buffers_bytes) / node_memory_MemTotal_bytes * 100 > 80
        for: 1m
        labels:
          severity: warning
          alertinstance: '{{ $labels.instance }}'
        annotations:
          summary: "Instance {{ $labels.instance }} memory usage is high"
          description: "{{ $labels.instance }} memory usage is above 80% (current value: {{ $value }})"
      - alert: NodeCPUUsage
        expr: 100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100) > 60
        for: 1m
        labels:
          severity: warning
          alertinstance: '{{ $labels.instance }}'
        annotations:
          summary: "Instance {{ $labels.instance }} CPU usage is high"
          description: "{{ $labels.instance }} CPU usage is above 60% (current value: {{ $value }})"
      - alert: KubeNodeNotReady
        expr: kube_node_status_condition{condition="Ready",status="true"} == 0
        for: 1m
        labels:
          severity: error
          alertinstance: '{{ $labels.node }}/{{ $labels.instance }}'
        annotations:
          description: "Node {{ $labels.node }} has been offline / NotReady for more than 10 minutes"
  pod.rules: |
    groups:
    - name: pod.rules
      rules:
      - alert: PodCPUUsage
        expr: (sum(rate(container_cpu_usage_seconds_total{image!=""}[3m])) by (pod,namespace)) / (sum(container_spec_cpu_quota{image!=""}) by (pod,namespace) /100000) *100 > 80
        for: 5m
        labels:
          severity: warning
          alertinstance: '{{ $labels.namespace }}/{{ $labels.pod }}'
        annotations:
          summary: "Namespace: {{ $labels.namespace }} | Pod: {{ $labels.pod }} CPU usage above 80% (current value: {{ $value }})"
          description: "{{ $labels.namespace }}/{{ $labels.pod }} CPU usage above 80% (current value: {{ $value }})"
      - alert: PodMemoryUsage
        expr: sum(container_memory_rss{image!=""}) by(pod, namespace) / sum(container_spec_memory_limit_bytes{image!=""}) by(pod, namespace) * 100 != +inf > 80
        for: 5m
        labels:
          severity: warning
          alertinstance: '{{ $labels.namespace }}/{{ $labels.pod }}'
        annotations:
          summary: "Namespace: {{ $labels.namespace }} | Pod: {{ $labels.pod }} memory usage above 80% (current value: {{ $value }})"
          description: "{{ $labels.namespace }}/{{ $labels.pod }} memory usage above 80% (current value: {{ $value }})"
      - alert: PodNetworkReceive
        expr: sum(rate(container_network_receive_bytes_total{image!="",name=~"^k8s_.*"}[5m]) /1000) by (pod,namespace) > 30000
        for: 5m
        labels:
          severity: warning
          alertinstance: '{{ $labels.namespace }}/{{ $labels.pod }}'
        annotations:
          summary: "Namespace: {{ $labels.namespace }} | Pod: {{ $labels.pod }} inbound traffic above 30MB/s (current value: {{ $value }}K/s)"
          description: "{{ $labels.namespace }}/{{ $labels.pod }}:{{ $labels.interface }} inbound traffic above 30MB/s (current value: {{ $value }}K/s)"
      - alert: PodNetworkTransmit
        expr: sum(rate(container_network_transmit_bytes_total{image!="",name=~"^k8s_.*"}[5m]) /1000) by (pod,namespace) > 30000
        for: 5m
        labels:
          severity: warning
          alertinstance: '{{ $labels.namespace }}/{{ $labels.pod }}'
        annotations:
          summary: "Namespace: {{ $labels.namespace }} | Pod: {{ $labels.pod }} outbound traffic above 30MB/s (current value: {{ $value }}K/s)"
          description: "{{ $labels.namespace }}/{{ $labels.pod }}:{{ $labels.interface }} outbound traffic above 30MB/s (current value: {{ $value }}K/s)"
      - alert: PodRestart
        expr: sum(changes(kube_pod_container_status_restarts_total[1m])) by (pod,namespace) > 0
        for: 1m
        labels:
          severity: warning
          alertinstance: '{{ $labels.namespace }}/{{ $labels.pod }}'
        annotations:
          summary: "Namespace: {{ $labels.namespace }} | Pod: {{ $labels.pod }} restarted (current value: {{ $value }})"
          description: "{{ $labels.namespace }}/{{ $labels.pod }} restarted (current value: {{ $value }})"
      - alert: PodNotHealthy
        expr: sum by (namespace, pod, phase) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed"}) > 0
        for: 5m
        labels:
          severity: error
          alertinstance: '{{ $labels.namespace }}/{{ $labels.pod }}:{{ $labels.phase }}'
        annotations:
          summary: "Namespace: {{ $labels.namespace }} | Pod: {{ $labels.pod }} is not healthy (current value: {{ $value }})"
          description: "{{ $labels.namespace }}/{{ $labels.pod }} is not healthy (current phase: {{ $labels.phase }})"
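The NodeMemoryUsage rule above computes 100 - (MemFree + Cached + Buffers) / MemTotal * 100 and fires when the result stays above 80 for one minute. The same arithmetic as a sketch (the byte figures are invented):

```python
# Sketch: the NodeMemoryUsage expression's arithmetic,
# with node_memory_* values passed in as plain numbers.

def memory_usage_percent(free, cached, buffers, total):
    """Used-memory percentage as computed by the alert expression."""
    return 100 - (free + cached + buffers) / total * 100

usage = memory_usage_percent(free=1e9, cached=0.5e9, buffers=0.1e9, total=8e9)
print(round(usage, 1), usage > 80)  # exactly at the 80 threshold -> no alert
```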

1.3 prometheus-deployment

# prometheus-deployment.yaml: deploys Prometheus itself

apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
  namespace: ops
  labels:
    k8s-app: prometheus
spec:
  replicas: 1
  selector:
    matchLabels:
      k8s-app: prometheus
  template:
    metadata:
      labels:
        k8s-app: prometheus
    spec:
      serviceAccountName: prometheus
      initContainers:
      - name: "init-chown-data"
        image: "busybox:latest"
        imagePullPolicy: "IfNotPresent"
        command: ["chown", "-R", "65534:65534", "/data"]
        volumeMounts:
        - name: prometheus-data
          mountPath: /data
          subPath: ""
      containers:
      # Sidecar that hot-reloads the ConfigMap-backed configuration
      - name: prometheus-server-configmap-reload
        image: "jimmidyson/configmap-reload:v0.1"
        imagePullPolicy: "IfNotPresent"
        args:
        - --volume-dir=/etc/config
        - --webhook-url=http://localhost:9090/-/reload
        volumeMounts:
        - name: config-volume
          mountPath: /etc/config
          readOnly: true
        resources:
          limits:
            cpu: 10m
            memory: 10Mi
          requests:
            cpu: 10m
            memory: 10Mi
      - name: prometheus-server
        image: "prom/prometheus:v2.45.4"
        imagePullPolicy: "IfNotPresent"
        args:
        - --config.file=/etc/config/prometheus.yml
        - --storage.tsdb.path=/data
        # Data retention period
        - --storage.tsdb.retention=3d
        - --web.console.libraries=/etc/prometheus/console_libraries
        - --web.console.templates=/etc/prometheus/consoles
        - --web.enable-lifecycle
        ports:
        - containerPort: 9090
        readinessProbe:
          httpGet:
            path: /-/ready
            port: 9090
          initialDelaySeconds: 30
          timeoutSeconds: 30
        livenessProbe:
          httpGet:
            path: /-/healthy
            port: 9090
          initialDelaySeconds: 30
          timeoutSeconds: 30
        resources:
          limits:
            cpu: 500m
            memory: 1500Mi
          requests:
            cpu: 200m
            memory: 1000Mi
        volumeMounts:
        - name: config-volume
          mountPath: /etc/config
        - name: prometheus-data
          mountPath: /data
          subPath: ""
        - name: prometheus-rules
          mountPath: /etc/config/rules
        - name: prometheus-etcd
          mountPath: /var/run/secrets/kubernetes.io/etcd-certs
        - name: timezone
          mountPath: /etc/localtime
      volumes:
      - name: config-volume
        configMap:
          name: prometheus-config
      - name: prometheus-rules
        configMap:
          name: prometheus-rules
      - name: prometheus-data
        persistentVolumeClaim:
          claimName: prometheus
      - name: prometheus-etcd
        secret:
          secretName: etcd-certs
      - name: timezone
        hostPath:
          path: /usr/share/zoneinfo/Asia/Shanghai
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: prometheus
  namespace: ops
spec:
  storageClassName: "managed-nfs-storage"
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: Service
metadata:
  name: prometheus
  namespace: ops
spec:
  type: NodePort
  ports:
  - name: http
    port: 9090
    protocol: TCP
    targetPort: 9090
    nodePort: 30090
  selector:
    k8s-app: prometheus
---
# RBAC: the credentials that grant Prometheus access to Kubernetes resources
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: ops
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
- apiGroups:
  - ""
  resources:
  - nodes
  - nodes/metrics
  - services
  - endpoints
  - pods
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - configmaps
  verbs:
  - get
- nonResourceURLs:
  - "/metrics"
  verbs:
  - get
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: ops

1.4 node_exporter

# node-exporter.yaml: collects node metrics; deployed as a DaemonSet and annotated so that Prometheus scrapes it

apiVersion: apps/v1 
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: ops
spec:
  selector:
    matchLabels:
      app: node-exporter
  template:
    metadata:
      labels:
        app: node-exporter
      annotations:
        # This annotation tells Prometheus to scrape the Pod
        prometheus.io/scrape: "true"
        #prometheus.io/scheme: "http"
        #prometheus.io/path: "/metrics"
        prometheus.io/port: "9100"
    spec:
      tolerations:
      - effect: NoSchedule
        operator: Exists
      hostNetwork: true
      hostPID: true
      containers:
      - name: node-exporter
        image: "prom/node-exporter:latest"
        args:
        - --path.rootfs=/host
        - --web.listen-address=:9100
        ports:
        - name: metrics
          containerPort: 9100
        volumeMounts:
        - name: rootfs
          mountPath: /host
          readOnly: true
        resources:
          requests:
            cpu: 100m
            memory: 256Mi
          limits:
            cpu: 500m
            memory: 512Mi
      volumes:
      - name: rootfs
        hostPath:
          path: /
---
apiVersion: v1
kind: Service
metadata:
  name: node-exporter
  namespace: ops
  annotations:
    prometheus.io/scrape: "true"
spec:
  clusterIP: None
  ports:
  - name: metrics
    port: 9100
    protocol: TCP
    targetPort: 9100
  selector:
    # must match the DaemonSet's pod label (the original used k8s-app: node-exporter, which selects nothing)
    app: node-exporter

1.5 kube-state-metrics (Kubernetes resource objects such as Deployments, DaemonSets, etc.)

# kube-state-metrics.yaml: collects Kubernetes resource state (e.g. Deployments, Services), annotated for Prometheus scraping

apiVersion: apps/v1
kind: Deployment
metadata:
  name: kube-state-metrics
  namespace: ops
spec:
  selector:
    matchLabels:
      app: kube-state-metrics
  replicas: 1
  template:
    metadata:
      labels:
        app: kube-state-metrics
      annotations:
        prometheus.io/scrape: "true"
        ##prometheus.io/scheme: "http"
        ##prometheus.io/path: "/metrics"
        prometheus.io/port: "8080"
    spec:
      serviceAccountName: kube-state-metrics
      containers:
      - name: kube-state-metrics
        image: registry.cn-shenzhen.aliyuncs.com/starsl/kube-state-metrics:v2.3.0
        ports:
        - name: http-metrics
          containerPort: 8080
        - name: telemetry
          containerPort: 8081
        readinessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 5
          timeoutSeconds: 5
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 5
          timeoutSeconds: 5
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            cpu: 500m
            memory: 512Mi
        securityContext:
          readOnlyRootFilesystem: true
          runAsNonRoot: true
          runAsUser: 65534
        volumeMounts:
        - name: timezone
          mountPath: /etc/localtime
      volumes:
      - name: timezone
        hostPath:
          path: /usr/share/zoneinfo/Asia/Shanghai
---
apiVersion: v1
kind: Service
metadata:
  name: kube-state-metrics
  namespace: ops
spec:
  ports:
  - name: http-metrics
    port: 8080
    targetPort: http-metrics
    protocol: TCP
  - name: telemetry
    port: 8081
    targetPort: telemetry
    protocol: TCP
  selector:
    app: kube-state-metrics
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: kube-state-metrics
  namespace: ops
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kube-state-metrics
rules:
- apiGroups: [""]
  resources:
  - configmaps
  - secrets
  - nodes
  - pods
  - services
  - serviceaccounts
  - resourcequotas
  - replicationcontrollers
  - limitranges
  - persistentvolumeclaims
  - persistentvolumes
  - namespaces
  - endpoints
  verbs: [list, watch]
- apiGroups: [apps]
  resources:
  - statefulsets
  - daemonsets
  - deployments
  - replicasets
  verbs: [list, watch]
- apiGroups: [batch]
  resources: [cronjobs, jobs]
  verbs: [list, watch]
- apiGroups: [autoscaling]
  resources: [horizontalpodautoscalers]
  verbs: [list, watch]
- apiGroups: [authentication.k8s.io]
  resources: [tokenreviews]
  verbs: [create]
- apiGroups: [authorization.k8s.io]
  resources: [subjectaccessreviews]
  verbs: [create]
- apiGroups: [policy]
  resources: [poddisruptionbudgets]
  verbs: [list, watch]
- apiGroups: [certificates.k8s.io]
  resources: [certificatesigningrequests]
  verbs: [list, watch]
- apiGroups: [discovery.k8s.io]
  resources: [endpointslices]
  verbs: [list, watch]
- apiGroups: [storage.k8s.io]
  resources: [storageclasses, volumeattachments]
  verbs: [list, watch]
- apiGroups: [admissionregistration.k8s.io]
  resources: [mutatingwebhookconfigurations, validatingwebhookconfigurations]
  verbs: [list, watch]
- apiGroups: [networking.k8s.io]
  resources: [networkpolicies, ingressclasses, ingresses]
  verbs: [list, watch]
- apiGroups: [coordination.k8s.io]
  resources: [leases]
  verbs: [list, watch]
- apiGroups: [rbac.authorization.k8s.io]
  resources:
  - clusterrolebindings
  - clusterroles
  - rolebindings
  - roles
  verbs: [list, watch]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kube-state-metrics
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kube-state-metrics
subjects:
- kind: ServiceAccount
  name: kube-state-metrics
  namespace: ops

1.6 grafana

# grafana.yaml: visualizes the collected data

apiVersion: apps/v1 
kind: Deployment 
metadata:
  name: grafana
  namespace: ops
spec:
  replicas: 1
  selector:
    matchLabels:
      app: grafana
  template:
    metadata:
      labels:
        app: grafana
    spec:
      containers:
      - name: grafana
        image: grafana/grafana:8.4.0
        ports:
        - containerPort: 3000
          protocol: TCP
        resources:
          limits:
            cpu: 100m
            memory: 256Mi
          requests:
            cpu: 100m
            memory: 256Mi
        volumeMounts:
        - name: grafana-data
          mountPath: /var/lib/grafana
          subPath: grafana
      securityContext:
        fsGroup: 472
        runAsUser: 472
      volumes:
      - name: grafana-data
        persistentVolumeClaim:
          claimName: grafana
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: grafana
  namespace: ops
spec:
  storageClassName: "managed-nfs-storage"
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 5Gi
---
apiVersion: v1
kind: Service
metadata:
  name: grafana
  namespace: ops
spec:
  type: NodePort
  ports:
  - port: 80
    targetPort: 3000
    nodePort: 30030
  selector:
    app: grafana
# Apply the manifests above in order
[root@k8s-master prometheus]# kubectl create ns ops
[root@k8s-master prometheus]# kubectl apply -f prometheus-configmap.yaml 
[root@k8s-master prometheus]# kubectl apply -f prometheus-rules.yaml 
[root@k8s-master prometheus]# kubectl apply -f prometheus-deployment.yaml 
[root@k8s-master prometheus]# kubectl apply -f grafana.yaml 
[root@k8s-master prometheus]# kubectl get pod,svc -n ops 
NAME                              READY   STATUS    RESTARTS   AGE
pod/grafana-79c5bfb955-lxq6w      1/1     Running   0          58s
pod/prometheus-5ccf96b898-zd8b2   2/2     Running   0          5m6s

NAME                 TYPE       CLUSTER-IP       EXTERNAL-IP   PORT(S)          AGE
service/grafana      NodePort   10.111.187.83    <none>        80:30030/TCP     57s
service/prometheus   NodePort   10.111.181.145   <none>        9090:30090/TCP   5m12s

Once the Pods are up, access Prometheus at NodeIP:30090 and Grafana at NodeIP:30030.
The Prometheus targets page shows the scrape targets defined in the prometheus-config ConfigMap, and the rules page shows the alerting rules from the prometheus-rules ConfigMap.
Log in to Grafana with admin/admin; you are asked to change the password on first login.


2. Add the Prometheus data source in Grafana


3. Import a Grafana dashboard

Dashboards --> Manage --> Import --> Upload (ID 13105)

Alternatively, use my modified version, which adds JVM monitoring: the application side exposes the metrics, and each service is then scraped individually.
The application's Deployment must have metric collection enabled (the prometheus.io annotations).
K8S microservice monitoring dashboard template



VII. Prometheus basics: querying data

PromQL (Prometheus Query Language) is Prometheus's own query DSL. It is highly expressive, supports filter conditions and operators, and ships with a large set of built-in functions for querying monitoring data across many dimensions.

Data model:

  • Prometheus stores all data as time series (in its built-in TSDB);
  • Series with the same metric name and label set belong to the same metric;
  • Each time series is uniquely identified by its metric name and a set of key/value pairs called labels; labels are used to select specific series.

Metric format:
<metric name>{<label name>=<label value>, ...}
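A sketch of that identity rule — metric name plus the full, sorted label set acts as the key for a series — with illustrative label values:

```python
# Sketch: a series is identified by its metric name plus its label set,
# modeled here as a dictionary key over sorted label pairs.

def series_key(name, labels):
    """Metric name + sorted labels uniquely identify a time series."""
    return (name,) + tuple(sorted(labels.items()))

tsdb = {}
tsdb[series_key("node_cpu_seconds_total", {"cpu": "0", "mode": "idle"})] = [(1000, 1.0)]
tsdb[series_key("node_cpu_seconds_total", {"cpu": "0", "mode": "user"})] = [(1000, 2.0)]

# Same metric name, different labels -> two distinct series
print(len(tsdb))  # 2
```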
Examples:
Query the latest sample of a metric (an instant vector, i.e. the most recent data):
node_cpu_seconds_total
Narrow the query further by attaching labels:
node_cpu_seconds_total{job="linux_server"} # each "key=value" pair inside { } is a filter condition

Query the samples from the last 5 minutes (a range vector, i.e. historical data; time units: s, m, h, d, w, y):
node_cpu_seconds_total{job="linux_server"}[5m]
node_cpu_seconds_total{job="linux_server"}[1h]
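The difference between the two query forms can be sketched as follows: an instant vector returns the latest sample per series, while a range vector like [5m] returns every sample in the window. Timestamps are in seconds and all samples are made up:

```python
# Sketch: instant vs. range selection over one series' (timestamp, value) samples.

def instant_query(samples, now):
    """Latest sample at or before `now` (instant vector)."""
    eligible = [s for s in samples if s[0] <= now]
    return eligible[-1] if eligible else None

def range_query(samples, now, window):
    """All samples within [now - window, now] (range vector)."""
    return [s for s in samples if now - window <= s[0] <= now]

samples = [(0, 1.0), (60, 2.0), (120, 3.0), (480, 4.0)]
print(instant_query(samples, now=480))            # (480, 4.0)
print(range_query(samples, now=480, window=300))  # [(480, 4.0)]
```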


