
Introduction

PrometheusMissingRuleEvaluations is an alert coming from the Prometheus Monitoring Mixin. A Monitoring Mixin is a bundle of Grafana dashboards, Prometheus alerts, and recording rules. Check out my Getting Started With Monitoring Mixins blog post to learn more about Monitoring Mixins. You typically get this alert automatically if you use the kube-prometheus-stack or kube-prometheus solutions.

Typically, the configuration for the PrometheusMissingRuleEvaluations alert looks like this:

alert: PrometheusMissingRuleEvaluations
annotations:
  description: Prometheus {{$labels.instance}} has missed {{ printf "%.0f" $value
    }} rule group evaluations in the last 5m.
  summary: Prometheus is missing rule evaluations due to slow rule group evaluation.
expr: |
  increase(prometheus_rule_group_iterations_missed_total{job="prometheus"}[5m]) > 0
for: 15m
labels:
  severity: warning

It fires when Prometheus cannot evaluate rules in time. For example, suppose you have a complicated Prometheus recording rule that is evaluated every 15 seconds. If the query inside the recording rule takes longer than that to run, Prometheus will miss the rule evaluation.
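Prometheus exposes per-group timing metrics that let you see how close a rule group is to this limit. As a rough sketch, assuming your Prometheus version exposes the prometheus_rule_group_last_duration_seconds and prometheus_rule_group_interval_seconds metrics, you can compare the last evaluation duration against the configured interval:

# Rule groups whose last evaluation took more than 80% of their interval
prometheus_rule_group_last_duration_seconds
  / prometheus_rule_group_interval_seconds
> 0.8

Note: any rule group returned by this query is at risk of missing evaluations.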

Identifying the problem

So what do you do when the PrometheusMissingRuleEvaluations alert fires?


Well, the first thing you need to do is identify the rule group that is failing. Go into Prometheus, find the firing alert, and look at the rule_group label. For example, I had the rule_group label equal to:

alertname=PrometheusMissingRuleEvaluations, instance=blablah:9090, job=kubernetes-service-endpoints, kubernetes_name=prometheus, kubernetes_namespace=monitoring, kubernetes_pod_name=prometheus-0, name=prometheus, rule_group=/alerts/repo/output/k8s-rules.yml;kube-apiserver-burnrate.rules, severity=warning

Note: example labels for a firing PrometheusMissingRuleEvaluations alert.

Once you know the name of the rule group, you can go to the Prometheus rules page at http://host:9090/rules to see how long the rules in the rule group took to evaluate. The rules page is instrumental in investigating which rule is problematic.
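You can also rank rule groups by their evaluation time straight from PromQL. A small sketch, assuming prometheus_rule_group_last_duration_seconds is exposed by your Prometheus:

# Five rule groups with the longest last evaluation time
topk(5, prometheus_rule_group_last_duration_seconds)

Note: this narrows down the slow group; the rules page then shows which individual rule inside it is slow.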

Once you have identified the problematic rule group, it’s time to analyze your potential options to fix it.

Solutions

1. Check Prometheus CPU resource usage

Sometimes, giving Prometheus more CPU is the only way to fix it. If you run Prometheus in Kubernetes or another containerized environment, your CPU limits might be too low. Check whether Prometheus is being CPU throttled. In Kubernetes, you can check the container_cpu_cfs_throttled_seconds_total metric.
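As a hedged sketch of such a check, assuming cAdvisor metrics are scraped and your Prometheus pods match a prometheus-.* name pattern (adjust the selector to your setup), the period-based throttling metrics give an easier-to-read ratio than container_cpu_cfs_throttled_seconds_total alone:

# Fraction of CPU periods in which the Prometheus container was throttled
sum by (pod) (rate(container_cpu_cfs_throttled_periods_total{pod=~"prometheus-.*"}[5m]))
/
sum by (pod) (rate(container_cpu_cfs_periods_total{pod=~"prometheus-.*"}[5m]))

Note: a ratio close to 1 means Prometheus spends most CPU periods throttled, so raising the CPU limit (or removing it) should help.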

2. Split rules into different Rule Groups

In Prometheus, recording and alerting rules live in rule groups. Rules within a group are evaluated sequentially, while different rule groups are evaluated in parallel. So Prometheus might be missing rule evaluations because a single rule group has grown too large.

groups:
- name: example
  rules:
  - record: code:prometheus_http_requests_total:sum
    expr: sum by (code) (prometheus_http_requests_total)

Note: Example of Prometheus rule group and rules.

The solution is to split rules that don’t depend on each other into different groups to be evaluated in parallel.
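As a minimal sketch of that idea, the single example group from above could be turned into two independent groups (the group names and the second recording rule here are purely illustrative):

groups:
- name: example-by-code
  rules:
  - record: code:prometheus_http_requests_total:sum
    expr: sum by (code) (prometheus_http_requests_total)
- name: example-by-handler
  rules:
  - record: handler:prometheus_http_requests_total:sum
    expr: sum by (handler) (prometheus_http_requests_total)

Note: the two groups now evaluate in parallel, while the rules inside each group still run sequentially.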

For example, the Kubernetes Monitoring Mixin had this issue in its API server SLO rules. These rules all lived in a single rule group, which triggered missing rule evaluations. The fix was splitting “kube-apiserver-availability.rules” into several groups. See the pull request in the Kubernetes Monitoring Mixin repository for the change.

3. Change Evaluation Interval

Prometheus evaluates rules periodically. Each rule group can have its own evaluation interval; if it’s not set, it defaults to global.evaluation_interval, which is 1 minute by default.

groups:
- name: "kube-apiserver-burnrate.rules"
  interval: "1m"
  rules:
  - expr: |
      (
        (
          # too slow
          sum by (cluster) (rate(apiserver_request_slo_duration_seconds_count{job="kubernetes-apiservers",verb=~"LIST|GET",subresource!~"proxy|attach|log|exec|portforward"}[1d]))
          - ...

Note: example of a rule group with an evaluation interval set.

So, the solution might be to evaluate the failing rule group less often.
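As a minimal sketch, reusing the small example group from earlier and assuming its consumers can tolerate a coarser resolution, you would set a longer per-group interval (the 5m value is just an illustration):

groups:
- name: example
  # Evaluate this group every 5 minutes instead of the global default.
  interval: "5m"
  rules:
  - record: code:prometheus_http_requests_total:sum
    expr: sum by (code) (prometheus_http_requests_total)

Note: rules evaluated less often also react to changes more slowly, so be careful with alerting rules.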



About the Author

I'm Povilas Versockas, a software engineer, blogger, Certified Kubernetes Administrator, CNCF Ambassador, and a computer geek.
