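Monitoring and Visualizing SLIs/SLOs with Prometheus

What Are SLIs and SLOs

SLI stands for Service Level Indicator. It is a metric used to measure the stability of a system.

SLO stands for Service Level Objective. It is the stability target we set for the system, such as "four nines" or "five nines".

SREs typically use these two concepts together to assess system stability. The basic idea is to judge the SLO through SLIs: a set of indicators tells us whether the target of "so many nines" has actually been met.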
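To make "so many nines" concrete, the following minimal sketch converts a number of nines into an availability target and into the downtime that target allows. The function names and the 30-day window are assumptions chosen for illustration; they do not come from any tool discussed in this article.

```python
def availability_for_nines(nines: int) -> float:
    """Return the availability target for a number of nines, e.g. 4 -> 0.9999."""
    return 1 - 10 ** (-nines)


def allowed_downtime_minutes(availability: float, window_days: int = 30) -> float:
    """Return how many minutes of full outage the target allows in the window.

    The 30-day default window is an assumption made for this example.
    """
    return (1 - availability) * window_days * 24 * 60


if __name__ == "__main__":
    for nines in (3, 4, 5):
        target = availability_for_nines(nines)
        minutes = allowed_downtime_minutes(target)
        print(f"{nines} nines: {target:.5%} available, "
              f"about {minutes:.2f} minutes of downtime per 30 days")
```

With a 30-day window, four nines leaves roughly 4.3 minutes of downtime, which shows why each additional nine is so much harder to reach.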

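How to Choose SLIs

A system exposes many kinds of metrics, for example:

With so many metrics available, how should you choose? Following two principles is enough:

In most cases, you can simply adopt Google's VALET method.

This is the sample Google provides for the VALET method.

The above is only a brief introduction to SLIs and SLOs. To learn more, see the book 《SRE：Google运维解密》 (the Chinese edition of "Site Reliability Engineering: How Google Runs Production Systems") and Zhao Cheng's Geek Time (极客时间) course 《SRE实践手册》. The rest of this article briefly shows how to monitor SLIs and SLOs with Prometheus.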

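Service Level Operator measures the service level of applications running in Kubernetes by means of SLI/SLO metrics, and the results can be displayed in Grafana.

The Operator reads SLO definitions and generates new metrics from them. For example: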

apiVersion: monitoring.spotahome.com/v1alpha1
kind: ServiceLevel
metadata:
  name: awesome-service
spec:
  serviceLevelObjectives:
    - name: "9999_http_request_lt_500"
      description: 99.99% of requests must be served with <500 status code.
      disable: false
      availabilityObjectivePercent: 99.99
      serviceLevelIndicator:
        prometheus:
          address: http://myprometheus:9090
          totalQuery: sum(increase(http_request_total{host="awesome_service_io"}[2m]))
          errorQuery: sum(increase(http_request_total{host="awesome_service_io", code=~"5.."}[2m]))
      output:
        prometheus:
          labels:
            team: a-team
            iteration: "3"

From totalQuery and errorQuery, the Operator can compute the SLO metrics.
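The article does not show how the Operator combines the two query results internally. As a rough illustration only, the usual way to turn a total count and an error count into an availability figure, and to compare it with availabilityObjectivePercent, can be sketched as follows. The function names and the handling of zero traffic are assumptions for this example, not the Operator's actual code.

```python
def sli_availability(total: float, errors: float) -> float:
    """Return the fraction of good events: 1 - errors / total.

    `total` and `errors` stand for the values returned by totalQuery and
    errorQuery. Treating "no traffic" as fully available is an assumption.
    """
    if total <= 0:
        return 1.0
    return 1 - errors / total


def meets_objective(total: float, errors: float, objective_percent: float) -> bool:
    """Return True when the measured availability reaches the objective."""
    return sli_availability(total, errors) * 100 >= objective_percent


if __name__ == "__main__":
    # Hypothetical sample values for a two-minute window.
    print(meets_objective(total=100_000, errors=5, objective_percent=99.99))
    print(meets_objective(total=10_000, errors=5, objective_percent=99.99))
```

In the first call 5 errors out of 100,000 requests is 99.995% availability, which is above the 99.99% objective; in the second call the same 5 errors out of 10,000 requests is only 99.95%.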

(1) First, create the RBAC resources

(2) Then create the Deployment

(3) Create the Service

(4) Create the Prometheus ServiceMonitor

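At this point the Service Level Operator is deployed, and the corresponding target is visible in Prometheus, as shown below:

Next, create the service-level definition for the application. The following creates an example.

The definition above sets a "four nines" SLO for the grafana application.

The resulting metrics can then be seen in Prometheus, as shown below.

Next, import the dashboard with ID 8793 into Grafana to generate the following charts.

The upper part shows the SLI; the lower part shows the total error budget and the errors consumed so far.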
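The relationship between an SLO, the total error budget, and the consumed budget can be sketched with a little arithmetic. This is a simplified illustration under assumed request counts; the function names are hypothetical and this is not how the dashboard computes its panels.

```python
def error_budget(total_requests: float, objective_percent: float) -> float:
    """Return how many requests may fail before the objective is violated."""
    return total_requests * (1 - objective_percent / 100)


def budget_consumed_fraction(total_requests: float,
                             error_requests: float,
                             objective_percent: float) -> float:
    """Return the share of the error budget already used (1.0 means all of it)."""
    budget = error_budget(total_requests, objective_percent)
    if budget <= 0:
        # Assumption: with no budget at all, any error exhausts it.
        return 0.0 if error_requests == 0 else float("inf")
    return error_requests / budget


if __name__ == "__main__":
    # Hypothetical month: one million requests against a 99.99% objective.
    budget = error_budget(1_000_000, 99.99)
    used = budget_consumed_fraction(1_000_000, 25, 99.99)
    print(f"budget: about {budget:.0f} failed requests, consumed: {used:.0%}")
```

For one million requests, a 99.99% objective leaves a budget of about 100 failed requests, so 25 errors consume about a quarter of it.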

Next, define alerting rules so that you are notified as soon as the SLO degrades, for example:

groups:
  - name: slo.rules
    rules:
      - alert: SLOErrorRateTooFast1h
        expr: |
          (
            increase(service_level_sli_result_error_ratio_total[1h])
            /
            increase(service_level_sli_result_count_total[1h])
          ) > (1 - service_level_slo_objective_ratio) * 14.6
        labels:
          severity: critical
          team: a-team
        annotations:
          summary: The monthly SLO error budget consumed for 1h is greater than 2%
          description: The error rate for 1h in the {{$labels.service_level}}/{{$labels.slo}} SLO error budget is being consumed too fast, is greater than 2% monthly budget.
      - alert: SLOErrorRateTooFast6h
        expr: |
          (
            increase(service_level_sli_result_error_ratio_total[6h])
            /
            increase(service_level_sli_result_count_total[6h])
          ) > (1 - service_level_slo_objective_ratio) * 6
        labels:
          severity: critical
          team: a-team
        annotations:
          summary: The monthly SLO error budget consumed for 6h is greater than 5%
          description: The error rate for 6h in the {{$labels.service_level}}/{{$labels.slo}} SLO error budget is being consumed too fast, is greater than 5% monthly budget.
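The rules do not explain where the factors 14.6 and 6 come from. One plausible reading, assuming a month of 730 hours (365 * 24 / 12), is that each factor is the burn rate at which the stated share of the monthly error budget is used up within the alert window. The sketch below reproduces that arithmetic; the function names and the 730-hour month are assumptions for illustration.

```python
HOURS_PER_MONTH = 730  # assumption: 365 * 24 / 12


def burn_rate_threshold(budget_fraction: float,
                        window_hours: float,
                        period_hours: float = HOURS_PER_MONTH) -> float:
    """Return the burn rate that consumes `budget_fraction` of the period's
    error budget within `window_hours`."""
    return budget_fraction * period_hours / window_hours


def is_burning_too_fast(error_ratio: float,
                        objective_ratio: float,
                        factor: float) -> bool:
    """Mirror the alert condition: error ratio > (1 - objective) * factor."""
    return error_ratio > (1 - objective_ratio) * factor


if __name__ == "__main__":
    print(f"1h window, 2% of budget: {burn_rate_threshold(0.02, 1):.2f}")
    print(f"6h window, 5% of budget: {burn_rate_threshold(0.05, 6):.2f}")
    # Hypothetical error ratio of 0.2% against a 99.99% objective.
    print(is_burning_too_fast(0.002, 0.9999, 14.6))
```

Under this assumption the one-hour factor comes out at 14.6 and the six-hour factor at roughly 6.08, which the rule appears to round down to 6.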