PromCon 2018 Notes

Overview

  • Prometheus graduated within the CNCF.
  • OpenMetrics joined the incubator stage of CNCF.
  • Prometheus will offer built-in TLS and auth in HTTP endpoints, as well as implement subqueries (e.g. max_over_time(rate(foo[5m])[1h:1m]))

Leveraging Prometheus to Build, Test, and Release New Products at DigitalOcean.

Talk by Sneha Inguva from DigitalOcean Observability team.
Slides

The talk was about leveraging Prometheus to build and test new products, showing an example of setting it up on DigitalOcean’s VPC and DHCP products.
They instrument code using the Four Golden Signals (from the Google SRE book) or Brendan Gregg’s USE method, working in build -> instrument -> test -> iterate loops. They use metrics, tracing and logging.
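For context, here is a minimal sketch (not DigitalOcean’s actual code; the handler name, metric names and handleLease() are made up) of what instrumenting latency and errors from the Four Golden Signals with client_golang can look like:

package main

import (
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

// Latency and errors; traffic falls out of the histogram's _count series,
// and saturation usually comes from node/container level metrics.
var (
    requestDuration = prometheus.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "http_request_duration_seconds",
            Help:    "Request latency.",
            Buckets: prometheus.DefBuckets,
        },
        []string{"handler"},
    )
    requestErrors = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "http_request_errors_total",
            Help: "Total request errors.",
        },
        []string{"handler"},
    )
)

// handleLease stands in for the real business logic.
func handleLease() error { return nil }

func main() {
    prometheus.MustRegister(requestDuration, requestErrors)

    http.HandleFunc("/leases", func(w http.ResponseWriter, r *http.Request) {
        timer := prometheus.NewTimer(requestDuration.WithLabelValues("leases"))
        defer timer.ObserveDuration()

        if err := handleLease(); err != nil {
            requestErrors.WithLabelValues("leases").Inc()
            http.Error(w, err.Error(), http.StatusInternalServerError)
            return
        }
        w.Write([]byte("ok"))
    })

    http.Handle("/metrics", promhttp.Handler())
    http.ListenAndServe(":8080", nil)
}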

I really liked how they have a master dashboard which links to smaller, service-specific dashboards. It makes sense: when things break, you want a general view of what is affected and then to drill down into a specific service view. Also really cool dashboard names, which reiterate the pattern:

  • [production] RNS Architecture – Master
  • [production] RNS Architecture – Northd
  • [production] RNS Architecture – RnsDB – MySQL Overview
  • [production] RNS Architecture – RabbitMQ

Testing:

  • easy to load test with a generated gRPC client + multiple goroutines (see the sketch after this list).
  • chaos testing – take down a component in the system.
  • integration testing.
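As a rough illustration of the load-testing point above (not DigitalOcean’s code; callService is a made-up placeholder for a generated gRPC client call), hammering a service from multiple goroutines is only a few lines of Go:

package main

import (
    "context"
    "fmt"
    "sync"
    "sync/atomic"
    "time"
)

// callService stands in for a generated gRPC client call, e.g. client.AssignIP(ctx, req).
func callService(ctx context.Context) error {
    time.Sleep(5 * time.Millisecond) // pretend work
    return nil
}

func main() {
    const workers = 50
    var total, failed uint64

    ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
    defer cancel()

    var wg sync.WaitGroup
    for i := 0; i < workers; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for ctx.Err() == nil {
                atomic.AddUint64(&total, 1)
                if err := callService(ctx); err != nil {
                    atomic.AddUint64(&failed, 1)
                }
            }
        }()
    }
    wg.Wait()
    fmt.Printf("requests=%d failed=%d\n", total, failed)
}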

Life of an alert

Talk by Stuart Nelson from SoundCloud, a maintainer of Alertmanager.
[Slides]

The talk was a deep dive into Alertmanager and how an alert goes through all the stages of alerting.

There is an interesting feature in Alertmanager called inhibit rules: once an inhibiting alert fires, it can mute other, less important alerts. Imagine a datacenter on fire – you would get tons of alerts, which doesn’t help you figure out what is happening, so you would write an Alertmanager inhibit rule to mute all the non-important alerts. Something like:

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'

But be careful: check that you have no circular inhibitions, and remember that an inhibiting alert still mutes its target alerts even when it is silenced itself.

There is an alert visualizer UI, which can draw your alert tree.

Alert grouping is another interesting feature, with a couple of options to tweak: group_wait – how long to wait for the first group of alerts to form before sending the initial notification – and group_interval – how long to wait before sending a notification about new alerts that are added to an existing group.

In the end the alert goes through the notify pipeline, which has at-least-once delivery semantics.
In clustered mode there is a wait and dedup stage before sending, in order to avoid duplicate notifications. An Alertmanager instance waits (HA mode position * 15s) and then checks the gossiped alert notification log; if the alert is not there, it sends the external notification. The HA mode position starts from 0, so the first instance doesn’t wait.
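A minimal sketch of that wait-and-dedup idea (not Alertmanager’s actual implementation; hasAlreadyBeenSent is a made-up placeholder for checking the gossiped notification log):

package main

import (
    "fmt"
    "time"
)

// hasAlreadyBeenSent would check the gossiped notification log for this alert group.
func hasAlreadyBeenSent(groupKey string) bool { return false }

// maybeNotify waits position*15s and only sends if no peer has sent already.
func maybeNotify(position int, groupKey string) {
    wait := time.Duration(position) * 15 * time.Second // position 0 doesn't wait
    time.Sleep(wait)

    if hasAlreadyBeenSent(groupKey) {
        return // a peer with a lower position already notified
    }
    fmt.Println("sending notification for", groupKey)
}

func main() {
    maybeNotify(0, "{alertname=\"DiskFull\"}")
}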

Original diagram: https://github.com/prometheus/alertmanager/blob/master/doc/arch.svg

Q/A

Does Alertmanager store history of alerts?
There is no history of alerts in Alertmanager; you can write a Kafka sink for your notifications.

How to test inhibitions?

There is no other way apart from the visual editor.

Alertmanager HA is still in development?

SoundCloud has been using it for a couple of months; it works fine.

How alert grouping works?

When a new alert arrives and joins a group, it will be notified after group_interval if the group has already fired; if the group hasn’t formed yet, then after group_wait.

How do SoundCloud templates look for grouped alerts?

SoundCloud’s template shows the common labels, the alert name and a link to the runbook.
You can look at the source of the alert to know how many of them are firing,
but in their Slack template they actually show the firing count.

Hidden linux metrics with ebpf_exporter

Talk by Ivan Babrou from Cloudflare.
Slides
Blogpost

You can use DNS over TLS with Cloudflare, so that your internet provider doesn’t know what you are browsing 🙂

Current exporters don’t allow you to answer questions like:
How many fast IO operations is the disk doing?
How many slow IO operations?

The current issue with node_exporter is that it doesn’t give you histograms, as the Linux kernel only provides counters in /proc/diskstats. There is a way around that using eBPF. It turns out eBPF is already in the kernel, used for example by seccomp filters. It’s an in-kernel sandboxed virtual machine, which allows you to access kernel data and put it into a memory region; after that, the exporter just needs to map that memory region and turn it into metrics.
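As a rough, conceptual sketch of that last step (not ebpf_exporter’s actual code): assume the kernel-side eBPF program has already aggregated block IO latency into per-bucket counters, and readKernelBuckets below is a made-up placeholder for reading that shared map from user space. The exporter side is then just a custom collector emitting a const histogram:

package main

import (
    "log"
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

var bioLatencyDesc = prometheus.NewDesc(
    "bio_latency_seconds",
    "Block IO latency histogram, aggregated in kernel space.",
    nil, nil,
)

// readKernelBuckets is a made-up placeholder for reading the shared eBPF map;
// the kernel side has already counted IO completions per latency bucket.
func readKernelBuckets() (count uint64, sum float64, buckets map[float64]uint64) {
    return 1500, 3.2, map[float64]uint64{
        0.001: 900,  // <= 1ms
        0.01:  1400, // <= 10ms
        0.1:   1500, // <= 100ms
    }
}

type bioCollector struct{}

func (bioCollector) Describe(ch chan<- *prometheus.Desc) { ch <- bioLatencyDesc }

func (bioCollector) Collect(ch chan<- prometheus.Metric) {
    count, sum, buckets := readKernelBuckets()
    ch <- prometheus.MustNewConstHistogram(bioLatencyDesc, count, sum, buckets)
}

func main() {
    prometheus.MustRegister(bioCollector{})
    http.Handle("/metrics", promhttp.Handler())
    log.Fatal(http.ListenAndServe(":9435", nil))
}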

So if you need more detailed metrics from Linux definitely look into ebpf_exporter.

Prometheus Monitoring Mixins: Using Jsonnet to Package Together Dashboards, Alerts, and Exporters

Talk by Tom Wilkie from Grafana.
Slides

The talk is mostly the same as the one at KubeCon.

It showed the basics of jsonnet and how to template out Grafana dashboards, alerts and recording rules.

There is also a jsonnet-bundler package manager to get the mixins.

Currently there are mixins for:

Ongoing:

Autoscaling All Things Kubernetes with Prometheus

Talk by Michael Hausenblas and Frederic Branczyk.
Slides

There are 2 types of autoscaling, vertical and horizontal.

Not all apps are horizontally scalable.
Right now vertical autoscaling requires a pod restart.

If you want to use Prometheus metrics for autoscaling, you need to launch:

API aggregation -> k8s-prometheus-adapter -> Prometheus

Right now the Custom Metrics API doesn’t provide historic data, only current metrics, but sig-instrumentation & sig-autoscaling are working on that, as it is needed for the VerticalPodAutoscaler.

VPA is pre-alpha.

Taking Advantage of Relabeling

Talk by Julien Pivotto from Inuits.
Slides

You can get top 10 biggest cardinality metrics:

topk(
  10,
  count({job="JOB_NAME"}) by (__name__)
)

The talk showed really cool tricks on how you can use relabeling:

  • A config that alerts on HTTPS certificate issues using the Blackbox exporter, based on Traefik ingress controller information.
  • Relabeling to spread the load between Prometheus servers by dropping some targets on one server and keeping them on another.
  • Adding default label (for example priority: P1) to alerts.

Bomb Squad: Containing the Cardinality Explosion

Talk by Cody ‘Chowny’ Boggs from FreshTracks.
Slides

Turns out Prometheus cannot handle too much cardinality.

In Prometheus cardinality is the number of discrete label/value pairs associated with a particular metric.

A simple example of a metric with big cardinality is one with a unique request id in a label.

Bomb Squad is a Prometheus sidecar which watches the cardinality numbers and adds a relabeling rule to all Prometheus jobs to silence the metric that causes a cardinality explosion.

Bomb Squad comes with a CLI tool that lets you view all the silences and remove them.

You can view cardinality of your Prometheus by executing:

sort_desc(
  label_replace(
    count by(__name__) ({__name__!="", __name__!="card_count"}),
    "metric_name", "$1", "__name__", "(.+)"
  )
)

The speaker also promoted a cool Kubernetes CLI tool for viewing logs called stern.

Thanos – Prometheus at Scale

Talk by Bartłomiej Płotka from Improbable and Fabian Reinartz from Google.
Slides
Blogpost

  • Thanos is a new approach for solving Global view and Long Term Storage problems in Prometheus.
  • Thanos querier and Thanos sidecar can solve the global view and HA problem.
  • Thanos sidecar is run as a Prometheus sidecar, which has GRPC endpoint, called Store API.
  • Thanos querier calls all the gRPC endpoints and returns the Prometheus data.
  • Instead of adding session stickiness to Prometheus HA pairs, the Thanos querier has a deduplication mechanism.
  • Deduplication switches to a different replica once it finds a couple of missing scrapes.
  • You can solve long-term storage by adding S3-compatible storage.
  • Thanos sidecar then uploads 2h uncompacted Prometheus blocks to S3.
  • Thanos store gateway gives a Store API for blocks in S3.
  • You also need to compact & downsample blocks in S3, which is done via Thanos compactor.
  • You need to think about writing queries to hit downsampled data.

Using the Flux Query Engine to Query Multiple Prometheus Servers

Talk by Paul Dix from Influx Data.
Slides

Influx Data is creating a new language called “flux”, which is aimed at data analysis and data science applications. It should be something like Pandas / R / Matlab / Spark and support loading Prometheus data.

Right now flux is very very alpha.

Improving Reliability Through Engineering an Easy-to-use Prometheus-Based Monitoring and Alerting Stack: Introducing Our Reliability Toolkit

Talk by Robin van Zijll from ING bank.
Slides

Talk was about the struggles of adopting Prometheus in a bank.
Internally the team has developed a PromQL workshop, which hopefully they will open source. Also they created Model Builder, which generates thresholds for alerts.

Explore Your Prometheus Data in Grafana

Talk by David Kaltschmidt from Grafana.
Slides

Really cool features are coming to Grafana:

  • Better step alignment, so you get less “jumpy” graphs on refresh.
  • $__range – a variable which holds the selected time range; finally you can do something like go_goroutines[$__range].
  • An Explore window is coming, where you can explore metrics with good autocompletion support and tabs. It will be released in 6.0.
  • Grafana is working on some kind of git integration to support dashboard changes.
  • Grafana has been secretly working on log aggregation.
  • Support for mixins.

RPC at Google

Talk by JBD from Google
Slides

  • Google is working on a new monitoring system called Monarch.
  • Google has statz/rpcz pages in their HTTP servers; similar pages are available in OpenCensus: https://opencensus.io/core-concepts/z-pages/.
  • Google is working on OpenCensus, a set of vendor-agnostic libraries for metrics and tracing.
  • Exemplar traces are a new kind of thing which will be reported in the metrics endpoint. The idea is to link latency histogram data to OpenTracing spans.

OpenMetrics – Transforming the Prometheus Exposition Format into a Global Standard

Talk By Richard Hartmann.
Slides

  • Going beyond metrics, it’s about monitoring.
  • We need a better official standard, as currently SNMP is the only available monitoring standard.
  • The Prometheus format is going to be the base for OpenMetrics.
  • It’s going to change a bit to enable Exemplar traces.
  • Every single data point in a time series can point to one single event.
  • Prometheus will ignore the stuff it doesn’t care about.

What Prometheus Means for Monitoring Vendors

Talk By Jorge Salamero Sanz from Sysdig.
[Slides]

  • The cool thing about Prometheus is that vendors don’t need to re-implement all the metrics libraries and can just use Prometheus as a standard way of exposing metrics.

Implementing a Cooperative Multi-Tenant Capable Prometheus

Talk By Jonas Große Sundrup.
Slides

  • The talk was about how to share a single Prometheus server with multiple people.
  • The speaker developed promauthproxy, a proxy which injects user labels into the config.
  • The proxy uses the Prometheus Go packages to parse the submitted PromQL and add a user label to Prometheus queries (see the sketch after this list).
  • The same approach is used for alerts and blackbox exporter config.
  • Config is validated via promtool.
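A tiny sketch of the PromQL-parsing part (hedged: the package layout has changed across Prometheus versions; around 2018 ParseExpr lived in github.com/prometheus/prometheus/promql). The proxy would then walk the returned AST and add a user matcher to every vector selector before forwarding the query; here we only show parsing and re-serializing:

package main

import (
    "fmt"
    "log"

    "github.com/prometheus/prometheus/promql"
)

func main() {
    expr, err := promql.ParseExpr(`sum(rate(http_requests_total{job="app"}[5m]))`)
    if err != nil {
        log.Fatal(err)
    }
    // A promauthproxy-style rewrite would walk this AST and append a
    // user="alice" matcher to each vector selector before proxying.
    fmt.Println(expr.String())
}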

Monitoring at Scale: Migrating to Prometheus at Fastly

Talk By Marcus Barczak from Fastly.
Slides

  • Moving away from Ganglia + Icinga.
  • They are running one HA Prometheus pair per datacenter.
  • They encrypt metrics traffic via ghostunnel.
  • They have a custom Service Discovery and Proxy component for configuring Prometheus.
  • Fastly is looking into Thanos for LTS and global view.

Automated Prometheus Benchmarking – Pushing It to Its Limits until It Breaks!

Talk By Krasi Georgiev from Red Hat and Harsh Agarwal.
Slides

  • Project doc
  • Source
  • Speakers created automated Prometheus benchmarking during Google Summer Of Code 2018.
  • It leverages the Kubernetes Prow CI and creates a Kubernetes environment with one Prometheus instance running your changes and one instance running master or some other branch.
  • A meta Prometheus collects the benchmarks and allows you to view them in Grafana.
  • Benchmarking can be started via a comment on a GitHub PR.
  • Red Hat is also working on Kiali, a UI for Istio.

Anatomy of a Prometheus Client Library

Talk By Brian Brazil from Robust Perception.
Slides

  • To use a counter, make a reference to it and then use it; don’t look it up in a global thread-safe map (see the sketch after this list).
  • Put metrics in different files with different metric names.
  • Client libraries are done in a way that you don’t need to worry about concurrency.
  • Labels need to be known in advance and initialized to 0.
  • There are Registries, which check for duplicate metrics, and Collectors, which provide the Prometheus data format.
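A small sketch of that first point: define the counter once, keep a reference to it, and call Inc() on the reference directly rather than looking anything up per request.

package main

import (
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

// One package-level reference, registered once; no global map lookups on the hot path.
var pageViews = prometheus.NewCounter(prometheus.CounterOpts{
    Name: "page_views_total",
    Help: "Total page views.",
})

func init() { prometheus.MustRegister(pageViews) }

func handler(w http.ResponseWriter, r *http.Request) {
    pageViews.Inc() // the counter itself is safe for concurrent use
    w.Write([]byte("hello"))
}

func main() {
    http.HandleFunc("/", handler)
    http.Handle("/metrics", promhttp.Handler())
    http.ListenAndServe(":8080", nil)
}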

Advanced Standalone Kubelet tutorial for Raspberry Pi

So recently I adapted Kelsey Hightower’s Standalone Kubelet Tutorial for Raspberry Pi.
Standalone Kubelet Tutorial for Raspberry Pi is a prerequisite for this tutorial, as I’m going to skip Linux installation and all the other parts.

In this tutorial I will show you how to cross compile Kubernetes Kubelet to ARM architecture and we will run the amazing Prometheus, Node Exporter and Grafana using static pods.

I am running Ubuntu, so this should generally work for people running Ubuntu or other Linux distributions. If you are running Windows I have no idea whether this will work, but you can try 🙂 I think there could be issues in building Kubelet and/or cross compiling it for ARM.

You should substitute IP with your Raspberry Pi’s IP address or hostname.

Compiling Kubernetes Kubelet for arm

Clone the Kubernetes project:

git clone git@github.com:kubernetes/kubernetes.git

You can check out a specific version branch if you like (e.g. git checkout release-1.10); I will be running master in this tutorial.

Change into the Kubernetes source directory:

cd kubernetes

Build Kubernetes for your current platform:

make

To cross compile you will need to have gcc-arm-linux-gnueabihf package installed.
On Ubuntu/Debian you can install via:

sudo apt-get install gcc-arm-linux-gnueabihf

Cross compile Kubelet for ARM:

make all WHAT=cmd/kubelet KUBE_BUILD_PLATFORMS=linux/arm

Copy Kubelet binary to your Raspberry Pi:

scp ./_output/local/bin/linux/arm/kubelet IP:~/

Installing Kubelet

Connect to Raspberry Pi:

ssh alarm@IP

Move Kubelet to /usr/bin:

sudo mv kubelet /usr/bin/kubelet

Download Kubelet’s config file and create /etc/kubernetes as in the previous tutorial, if you haven’t already.

Download the Kubelet systemd file:

wget -q --show-progress --https-only --timestamping \
  https://raw.githubusercontent.com/povilasv/advanced-raspberrypi-standalone-kubelet/master/kubelet.service

Move the kubelet.service unit file to the systemd configuration directory:

sudo mv kubelet.service /etc/systemd/system/

Start the kubelet service:

sudo systemctl daemon-reload

sudo systemctl enable kubelet

sudo systemctl start kubelet

Verification

Verify the kubelet is running:

sudo systemctl status kubelet

View logs for Kubelet service:

journalctl -u kubelet

Check that no containers are running:

docker ps

Installing Prometheus and friends

Prometheus

Create Prometheus data dir:

sudo mkdir /root/prometheus-data/

Download Prometheus config:

wget -q --show-progress --https-only --timestamping \
  https://raw.githubusercontent.com/povilasv/advanced-raspberrypi-standalone-kubelet/master/config/prometheus.yml

Move prometheus.yml to /root/prometheus-data/:

sudo mv prometheus.yml /root/prometheus-data/prometheus.yml

Download Prometheus manifest:

wget -q --show-progress --https-only --timestamping \
  https://raw.githubusercontent.com/povilasv/advanced-raspberrypi-standalone-kubelet/master/pods/prometheus.yaml

Move the prometheus.yaml pod manifest to the Kubelet manifest directory:

sudo mv prometheus.yaml /etc/kubernetes/manifests/prometheus.yaml

Wait for image to download and list containers:

docker ps

You should see two containers running which represent the prometheus pod and a kubelet container. Docker does not understand pods so the containers are listed as individual containers following the Kubernetes naming convention.

Open up browser and go to:

http://IP:9090/status

You should see the Prometheus web UI.

Open up targets page:

http://IP:9090/targets

You should see 1 job as UP, and 2 jobs as DOWN. This is expected as we haven’t launched node exporter and grafana yet.

Node Exporter

Download Prometheus Node Exporter manifest:

wget -q --show-progress --https-only --timestamping \
  https://raw.githubusercontent.com/povilasv/advanced-raspberrypi-standalone-kubelet/master/pods/nodeexporter.yaml

Move the nodeexporter.yaml pod manifest to the Kubelet manifest directory:

sudo mv nodeexporter.yaml /etc/kubernetes/manifests/nodeexporter.yaml

Wait for image to download and list containers:

docker ps

You should see additional two containers running which represent the node exporter pod.

Open up Prometheus targets page:

http://IP:9090/targets

You should see 2 jobs as UP, and 1 job as DOWN.

Grafana

Download Grafana manifest:

wget -q --show-progress --https-only --timestamping \
  https://raw.githubusercontent.com/povilasv/advanced-raspberrypi-standalone-kubelet/master/pods/grafana.yaml

Move the grafana.yaml pod manifest to the Kubelet manifest directory:

sudo mv grafana.yaml /etc/kubernetes/manifests/grafana.yaml

Wait for image to download and verify it is running:

docker ps

You should see additional two containers running which represent the grafana pod.

Open up Prometheus targets page:

http://IP:9090/targets

You should see 3 jobs as UP.

Go to Grafana page:

http://IP:3000/login

Enter `admin`/`admin`.

Add data source:

Click on Add Source and enter:

Name: prom
URL: http://127.0.0.1:9090
Access: proxy
Scrape interval: 30s

Adding dashboards:

Click on Dashboards tab in `add data source page`:

Import `Prometheus 2.0 Stats` and `Grafana metrics` dashboards.

Hover on a left `+` button and click `Import`.

Enter `5573`.

Set the `prometheus` data source to `prom`.

Take a look at those 3 dashboards in Grafana UI.

You should see something like:

Conclusion

Here we are at the end of the journey.

We went through a lot together: modifying Arch Linux kernel parameters, checking cgroup support via lxc-checkconfig, cross-compiling Kubelet for ARM and actually running Prometheus and friends on a Raspberry Pi.

Hope you enjoyed the journey and see you next time!

Running Prometheus Node exporter on a router

So at home I have an Asus RT-N14U router, which has a 600 MHz MIPS CPU, 16 MB of flash storage and 64 MB of RAM. On the software side, I’m running Andy Padavan’s RT-N56U firmware (thank you, Andy!). This firmware gives you all the router features plus raw Linux, SSH and some basic tools.

Node exporter is written in Go, so why not just go build? What I did is simply cross compile it for the MIPS little-endian architecture:

git clone git@github.com:prometheus/node_exporter.git

cd node_exporter
GOARCH='mipsle' GOOS=linux go build

Success! We have a node_exporter executable. By the way, you can check out a specific version (e.g. git checkout release-0.16) if you like; I usually run master for the latest and greatest 😉
So now, let’s just copy the built executable onto the router:

scp node_exporter router:~/
node_exporter                                               89%   13MB   2.0MB/s   00:00 ETA
scp: /home/admin//node_exporter: No space left on device

Turns out that I only have 1MB of space in my home dir:

Filesystem                Size      Used Available Use% Mounted on
rootfs                    8.5M      8.5M         0 100% /
/dev/root                 8.5M      8.5M         0 100% /
tmpfs                     8.0K         0      8.0K   0% /dev
tmpfs                     2.0M    180.0K      1.8M   9% /etc
tmpfs                     1.0M      8.0K   1016.0K   1% /home
tmpfs                     8.0K         0      8.0K   0% /media
tmpfs                     8.0K         0      8.0K   0% /mnt
tmpfs                    24.0M     76.0K     23.9M   0% /tmp
tmpfs                     4.0M    192.0K      3.8M   5% /var

and Node exporter’s executable is around ~15 MB in size. So I attached a USB stick, copied the file onto it (scp node_exporter router:/media/UBUNTU_17_0/) and tried to launch it:

time="2018-04-14T15:15:00Z" level=info msg="Starting node_exporter (version=, branch=, revision=)" source="node_exporter.go:82"
time="2018-04-14T15:15:00Z" level=info msg="Build context (go=go1.10.1, user=, date=)" source="node_exporter.go:83"
time="2018-04-14T15:15:00Z" level=info msg="Enabled collectors:" source="node_exporter.go:90"
time="2018-04-14T15:15:00Z" level=info msg=" - arp" source="node_exporter.go:97"
time="2018-04-14T15:15:00Z" level=info msg=" - bcache" source="node_exporter.go:97"
time="2018-04-14T15:15:00Z" level=info msg=" - bonding" source="node_exporter.go:97"
time="2018-04-14T15:15:00Z" level=info msg=" - conntrack" source="node_exporter.go:97"
time="2018-04-14T15:15:00Z" level=info msg=" - cpu" source="node_exporter.go:97"


time="2018-04-14T15:15:00Z" level=info msg=" - vmstat" source="node_exporter.go:97"
time="2018-04-14T15:15:00Z" level=info msg=" - wifi" source="node_exporter.go:97"
time="2018-04-14T15:15:00Z" level=info msg=" - xfs" source="node_exporter.go:97"
time="2018-04-14T15:15:00Z" level=info msg=" - zfs" source="node_exporter.go:97"
time="2018-04-14T15:15:00Z" level=info msg="Listening on :9100" source="node_exporter.go:111"

Woohoo it’s working!

Let’s add a Grafana dashboard, to look at those exported metrics:

I’ve used the Host Stats – Prometheus Node Exporter Grafana dashboard; I just had to change it, as all of the labels changed in a recent release of the exporter. You can find the fixed dashboard at https://grafana.com/dashboards/5573.

Getting back to the Node exporter: now that we have it running, how do we start it on router boot?
This Linux is very limited and doesn’t support systemd. Looking at what other options are available, I found that there is crontab (you need to enable it via the UI). So the first thing I tried:

@reboot /media/UBUNTU_17_0/node_exporter >> /media/UBUNTU_17_0/node_exporter.log 2>&1

I restarted the router and nope, @reboot doesn’t seem to work. No surprises here, as this router is very limited.

So this leads us to doing things “the old way”: running a cron script every minute or so that checks whether the process exists and starts it if it’s not there.

Here is the script:

#!/bin/bash

# Check whether node_exporter is already running.
pidof node_exporter > /dev/null

# If not, start it in the background and append its output to a log file.
if [[ $? -ne 0 ]] ; then
        /media/UBUNTU_17_0/node_exporter >> /media/UBUNTU_17_0/node_exporter.log 2>&1 &
fi

And the crontab:

* * * * * /media/UBUNTU_17_0/ne.sh         

Thats all, thanks for reading!

Exploring Prometheus Go client metrics

In this post I want to explore the Go metrics which are exported by client_golang via the promhttp.Handler() call.
Here is a sample program registering the Prometheus handler and listening on port 8080:

package main

import (
    "log"
    "net/http"

    "github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
    http.Handle("/metrics", promhttp.Handler())
    log.Fatal(http.ListenAndServe(":8080", nil))
}

When you hit your metrics endpoint, you will get something like:

# HELP go_gc_duration_seconds A summary of the GC invocation durations.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 3.5101e-05
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 6
...
process_open_fds 12
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 1.1272192e+07
# HELP process_virtual_memory_bytes Virtual memory size in bytes.
# TYPE process_virtual_memory_bytes gauge
process_virtual_memory_bytes 4.74484736e+08

On initialisation client_golang registers 2 Prometheus collectors (see the sketch after this list):

  • Process Collector – which collects basic Linux process information like CPU, memory, file descriptor usage and start time.
  • Go Collector – which collects information about Go’s runtime, like details about GC, the number of goroutines and OS threads.
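For illustration, here is roughly how you could register those two collectors yourself on a custom registry (a sketch; the process collector constructor has changed between client_golang versions, so check your version’s API):

package main

import (
    "log"
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
    // Start from an empty registry instead of the default one.
    reg := prometheus.NewRegistry()
    reg.MustRegister(
        prometheus.NewGoCollector(),
        prometheus.NewProcessCollector(prometheus.ProcessCollectorOpts{}),
    )

    // Serve only what we registered.
    http.Handle("/metrics", promhttp.HandlerFor(reg, promhttp.HandlerOpts{}))
    log.Fatal(http.ListenAndServe(":8080", nil))
}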

Process Collector

What this collector does is read the proc file system. The proc file system exposes internal kernel data structures, which are used to obtain information about the system.1

So the Prometheus client reads the /proc/PID/stat file, which looks like this:

1 (sh) S 0 1 1 34816 8 4194560 674 43 9 1 5 0 0 0 20 0 1 0 89724 1581056 209 18446744073709551615 94672542621696 94672543427732 140730737801568 0 0 0 0 2637828 65538 1 0 0 17 3 0 0 0 0 0 94672545527192 94672545542787 94672557428736 140730737807231 140730737807234 140730737807234 140730737807344 0

You can get a human readable variant of this information using cat /proc/PID/status.

process_cpu_seconds_total – uses utime – the number of ticks spent executing code in user mode, measured in jiffies – together with stime – jiffies spent in system mode, executing code on behalf of the process (like doing system calls). A jiffy is the time between two ticks of the system timer interrupt. 2

process_cpu_seconds_total equals the sum of utime and stime, divided by USER_HZ. This makes sense, as dividing the number of scheduler ticks by Hz (ticks per second) produces the total time in seconds the operating system has been running the process. 3
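To make that concrete, here is a small sketch that computes the same value by hand from /proc/self/stat, assuming USER_HZ is 100 (the same assumption the Prometheus procfs code makes, see footnote 3):

package main

import (
    "fmt"
    "log"
    "os"
    "strconv"
    "strings"
)

func main() {
    const userHz = 100 // assumption: USER_HZ is 100 on virtually all platforms

    data, err := os.ReadFile("/proc/self/stat")
    if err != nil {
        log.Fatal(err)
    }

    // The comm field (field 2) is wrapped in parentheses and may contain spaces,
    // so cut everything up to the last ')'. The first remaining field is then
    // field 3 (state), which makes utime field 14 -> index 11, stime field 15 -> index 12.
    s := string(data)
    fields := strings.Fields(s[strings.LastIndexByte(s, ')')+2:])

    utime, _ := strconv.ParseFloat(fields[11], 64)
    stime, _ := strconv.ParseFloat(fields[12], 64)

    fmt.Printf("process_cpu_seconds_total %.2f\n", (utime+stime)/userHz)
}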

process_virtual_memory_bytes – uses vsize – virtual memory size is the amount of address space that a process is managing. This includes all types of memory, both in RAM and swapped out.

process_resident_memory_bytes – multiplies rss – the resident set size, i.e. the number of memory pages the process has in real memory – by the page size 4. This results in the amount of memory that belongs specifically to that process, in bytes. It excludes swapped out memory pages.

process_start_time_seconds – uses start_time – the time the process started after system boot, expressed in jiffies – and btime from /proc/stat, which is the time at which the system booted, in seconds since the Unix epoch. start_time is divided by USER_HZ in order to get the value in seconds.

process_open_fds – counts the number of entries in the /proc/PID/fd directory. This shows the currently open regular files, sockets, pseudo terminals, etc.
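A minimal equivalent of that check by hand (note that the listing includes the file descriptor ReadDir itself opens for the directory, so it can be off by one):

package main

import (
    "fmt"
    "log"
    "os"
)

func main() {
    // Each entry in /proc/self/fd is one open file descriptor:
    // regular files, sockets, pipes, pseudo terminals, ...
    entries, err := os.ReadDir("/proc/self/fd")
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println("process_open_fds", len(entries))
}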

process_max_fds – reads /proc/PID/limits and uses the soft limit from the “Max open files” row. The interesting bit here is that /limits lists soft and hard limits.
As it turns out, the soft limit is the value that the kernel enforces for the corresponding resource and the hard limit acts as a ceiling for the soft limit.
An unprivileged process may only set its soft limit to a value up to the hard limit, and may (irreversibly) lower its hard limit. 5

In Go you can use err = syscall.Setrlimit(syscall.RLIMIT_NOFILE, &syscall.Rlimit{Cur: 9, Max: 10}) to set your limit.
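Expanded into a runnable snippet (Linux only), together with reading the current limits back:

package main

import (
    "fmt"
    "log"
    "syscall"
)

func main() {
    var rl syscall.Rlimit
    if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
        log.Fatal(err)
    }
    fmt.Printf("soft=%d hard=%d\n", rl.Cur, rl.Max)

    // Lower the soft limit; the hard limit (rl.Max) stays the ceiling.
    rl.Cur = 64
    if err := syscall.Setrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
        log.Fatal(err)
    }
}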

Go Collector

Most of the Go Collector’s metrics are taken from the runtime and runtime/debug packages.

go_goroutines – calls out to runtime.NumGoroutine(), which computes the value based off the scheduler struct and the global allglen variable. As all the values in the sched struct can be changed concurrently, there is this funny check where, if the computed value is less than 1, it becomes 1.

go_threads – calls out to runtime.ThreadCreateProfile(), which reads off the global allm variable. If you don’t know what M’s or G’s are, you can read my blog post about it.

go_gc_duration_seconds – calls out to debug.ReadGCStats() with PauseQuantiles set to 5, which returns the minimum, 25%, 50%, 75%, and maximum pause times. Then it manually creates a Summary type from the pause quantiles, the NumGC var and PauseTotal seconds. It’s cool how well the GCStats struct fits prom’s Summary type. 6
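Here is what that call looks like directly (the five quantiles come back in PauseQuantiles: min, 25%, 50%, 75%, max):

package main

import (
    "fmt"
    "runtime"
    "runtime/debug"
    "time"
)

func main() {
    runtime.GC() // force at least one GC so there is something to report

    stats := debug.GCStats{PauseQuantiles: make([]time.Duration, 5)}
    debug.ReadGCStats(&stats)

    fmt.Println("NumGC:", stats.NumGC)
    fmt.Println("PauseTotal:", stats.PauseTotal)
    fmt.Println("min/25/50/75/max:", stats.PauseQuantiles)
}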

go_info – this provides us with the Go version. It’s pretty clever: it calls out to runtime.Version(), sets that as a version label and then always returns a value of 1 for this gauge metric.

Memory

The Go Collector provides us with a lot of metrics about memory and GC.
All those metrics come from runtime.ReadMemStats(), which gives us the MemStats struct.
One thing that worries me is that runtime.ReadMemStats() makes an explicit call to trigger a stop-the-world pause 7.
So I wonder how much this pause actually costs,
as during a stop-the-world pause all goroutines are paused, the same as when GC runs. I’ll probably do a comparison of an app with and without instrumentation in a later post.
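If you want to poke at the raw numbers yourself, the call is a one-liner (and each call pays that stop-the-world cost mentioned above):

package main

import (
    "fmt"
    "runtime"
)

func main() {
    var m runtime.MemStats
    runtime.ReadMemStats(&m) // stops the world while it copies the stats

    fmt.Println("HeapAlloc:", m.HeapAlloc) // go_memstats_heap_alloc_bytes
    fmt.Println("HeapInuse:", m.HeapInuse) // go_memstats_heap_inuse_bytes
    fmt.Println("HeapIdle:", m.HeapIdle)   // go_memstats_heap_idle_bytes
    fmt.Println("Sys:", m.Sys)             // go_memstats_sys_bytes
    fmt.Println("NextGC:", m.NextGC)       // go_memstats_next_gc_bytes
    fmt.Println("NumGC:", m.NumGC)
}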

We’ve already seen that Linux provides us with rss/vsize metrics for memory stats, so naturally the question arises: which metrics to use, the ones provided in MemStats or rss/vsize?

The good part about resident set size and virtual memory size is that they are based on Linux primitives and are programming-language agnostic.
So in theory you could instrument any program and you would know how much memory it consumes (as long as you name your metrics consistently, i.e. process_virtual_memory_bytes and process_resident_memory_bytes).
In practice, however, when a Go process starts up it takes a lot of virtual memory up front; a simple program like the one above takes up to 544 MiB of vsize on my machine (x86_64 Ubuntu), which is a bit confusing. RSS shows around 7 MiB, which is closer to the actual usage.

On the other hand, using Go runtime based metrics gives more fine-grained information on what is happening in your running application.
You should be able to find out more easily whether your program has a memory leak, how long GC took and how much memory it reclaimed.
Also, it should point you in the right direction when you are optimizing your program’s memory allocations.

I haven’t looked in detail at how the Go GC and memory model work, apart from its concurrency model 8, so this bit is still new to me.

So let’s take a look at those metrics:

go_memstats_alloc_bytes – a metric which shows how many bytes of memory are allocated on the heap for objects. The value is the same as go_memstats_heap_alloc_bytes. This metric counts all reachable heap objects plus unreachable objects that the GC has not yet freed.

go_memstats_alloc_bytes_total – this metric increases as objects are allocated on the heap, but doesn’t decrease when they are freed. I think it is immensely useful, as it is an only-increasing number and has the same nice properties that a Prometheus Counter has. Doing rate() on it shows how many bytes/s of memory the app allocates, and it is “durable” across restarts and scrape misses.

go_memstats_sys_bytes – a metric which measures how many bytes of memory in total Go has taken from the system. It sums all the sys metrics described below.

go_memstats_lookups_total – counts how many pointer dereferences happened. This is a counter value, so you can use rate() to get lookups/s.

go_memstats_mallocs_total – shows how many heap objects have been allocated. This is a counter value, so you can use rate() to get objects allocated/s.

go_memstats_frees_total – shows how many heap objects have been freed. This is a counter value, so you can use rate() to get objects freed/s. Note you can get the number of live objects with go_memstats_mallocs_total - go_memstats_frees_total.

It turns out that Go organizes memory in spans, which are contiguous regions of memory of 8K or larger. There are 3 types of spans:
1) idle – a span that has no objects and can be released back to the OS, or reused for heap allocation, or reused for stack memory.
2) in use – a span that has at least one heap object and may have space for more.
3) stack – a span which is used for a goroutine stack. A span can be used either for the stack or for the heap, but not for both.

Heap memory metrics

go_memstats_heap_alloc_bytes – same as go_memstats_alloc_bytes.

go_memstats_heap_sys_bytes – bytes of memory obtained for the heap from the OS. This includes virtual address space that has been reserved but not yet used, and virtual address space which was returned to the OS after it became unused. This metric estimates the largest size the heap has had.

go_memstats_heap_idle_bytes – shows how many bytes are in idle spans.

go_memstats_heap_idle_bytes minus go_memstats_heap_released_bytes estimates how many bytes of memory could be released, but are kept by the runtime so that it can allocate objects on the heap without asking the OS for more memory.

go_memstats_heap_inuse_bytes – shows how many bytes are in in-use spans.

go_memstats_heap_inuse_bytes minus go_memstats_heap_alloc_bytes shows how many bytes of memory have been allocated for the heap but are not currently used.

go_memstats_heap_released_bytes – shows how many bytes of idle spans were returned to the OS.

go_memstats_heap_objects – shows how many objects are allocated on the heap. This changes as GC is performed and new objects are allocated.

Stack memory metrics

go_memstats_stack_inuse_bytes – shows how many bytes of memory are used by stack memory spans which have at least one object in them. The Go docs say that stack memory spans can only be reused for other stack spans, i.e. there is no mixing of heap objects and stack objects in one memory span.

go_memstats_stack_sys_bytes – shows how many bytes of stack memory are obtained from the OS. It’s go_memstats_stack_inuse_bytes plus any memory obtained for OS thread stacks.

There is no go_memstats_stack_idle_bytes, as unused stack spans are counted towards go_memstats_heap_idle_bytes.

Off-heap memory metrics

These metrics count bytes allocated for runtime internal structures that are not allocated on the heap, because they implement the heap.

go_memstats_mspan_inuse_bytes – shows how many bytes are in use by mspan structures.

go_memstats_mspan_sys_bytes – shows how many bytes are obtained from OS for mspan structures.

go_memstats_mcache_inuse_bytes – shows how many bytes are in use by mcache structures.

go_memstats_mcache_sys_bytes – shows how many bytes are obtained from OS for mcache structures.

go_memstats_buck_hash_sys_bytes – shows how many bytes of memory are in bucket hash tables, which are used for profiling.

go_memstats_gc_sys_bytes – shows how many bytes are in garbage collection metadata.

go_memstats_other_sys_bytes – shows how many bytes of memory are used for other runtime allocations.

go_memstats_next_gc_bytes – shows the target heap size of the next GC cycle. GC’s goal is to keep go_memstats_heap_alloc_bytes less than this value.

go_memstats_last_gc_time_seconds – contains the Unix timestamp of when the last GC finished.

go_memstats_gc_cpu_fraction – shows the fraction of this program’s available CPU time used by GC since the program started.
This metric is also provided in GODEBUG=gctrace=1.

Playing around with numbers

That’s a lot of metrics and a lot of information.
I think the best way to learn is to just play around with them, so in this part I’ll do just that, using the same program as above.
Here is the dump from /metrics (edited for space) which I’m going to use:

process_resident_memory_bytes 1.09568e+07

process_virtual_memory_bytes 6.46668288e+08

go_memstats_heap_alloc_bytes 2.24344e+06

go_memstats_heap_idle_bytes 6.3643648e+07

go_memstats_heap_inuse_bytes 3.039232e+06

go_memstats_heap_objects 6498

go_memstats_heap_released_bytes 0

go_memstats_heap_sys_bytes 6.668288e+07

go_memstats_lookups_total 0

go_memstats_frees_total 12209

go_memstats_mallocs_total 18707

go_memstats_buck_hash_sys_bytes 1.443899e+06

go_memstats_mcache_inuse_bytes 6912

go_memstats_mcache_sys_bytes 16384

go_memstats_mspan_inuse_bytes 25840

go_memstats_mspan_sys_bytes 32768

go_memstats_other_sys_bytes 1.310909e+06

go_memstats_stack_inuse_bytes 425984

go_memstats_stack_sys_bytes 425984

go_memstats_sys_bytes 7.2284408e+07

go_memstats_next_gc_bytes 4.194304e+06

go_memstats_gc_cpu_fraction 1.421928536233557e-06

go_memstats_gc_sys_bytes 2.371584e+06

go_memstats_last_gc_time_seconds 1.5235057190167596e+09

rss = 1.09568e+07 = 10956800 bytes = 10700 KiB = 10.4 MiB

vsize = 6.46668288e+08 = 646668288 bytes = 631512 KiB = 616.7 MiB

heap_alloc_bytes = 2.24344e+06 = 2243440 = 2190 KiB = 2.1 MiB

heap_inuse_bytes = 3.039232e+06 = 3039232 = 2968 KiB = 2.9 MiB

heap_idle_bytes = 6.3643648e+07 = 63643648 = 62152 KiB = 60.6 MiB

heap_released_bytes = 0

heap_sys_bytes = 6.668288e+07 = 66682880 = 65120 KiB = 63.6 MiB

frees_total = 12209

mallocs_total = 18707

mspan_inuse_bytes = 25840 = 25.2 KiB

mspan_sys_bytes = 32768 = 32 KiB

mcache_inuse_bytes = 6912 = 6.8 KiB

mcache_sys_bytes = 16384 = 16 KiB

buck_hash_sys_bytes = 1.443899e+06 = 1443899 = 1410 KiB = 1.4 MiB

gc_sys_bytes = 2.371584e+06 = 2371584 = 2316 KiB = 2.3 MiB

other_sys_bytes = 1.310909e+06 = 1310909 = 1280.2 KiB = 1.3 MiB

stack_inuse_bytes = 425984 = 416 KiB

stack_sys_bytes = 425984 = 416 KiB

sys_bytes = 7.2284408e+07 = 72284408 = 70590.2 KiB = 68.9 MiB

next_gc_bytes = 4.194304e+06 = 4194304 = 4096 KiB = 4 MiB

gc_cpu_fraction = 1.421928536233557e-06 = 0.000001

last_gc_time_seconds = 1.5235057190167596e+09 = Thu, 12 Apr 2018 04:01:59 GMT

An interesting bit is that heap_inuse_bytes is bigger than heap_alloc_bytes.
I think heap_alloc_bytes shows how many bytes are used in terms of objects, while heap_inuse_bytes shows bytes of memory in terms of spans.
Dividing heap_inuse_bytes by the size of a span gives: 3039232 / 8192 = 371 spans.

heap_inuse_bytes minus heap_alloc_bytes should show the amount of free space that we have in in-use spans, which is 2.9 MiB – 2.1 MiB = 0.8 MiB.
This roughly means that we can allocate 0.8 MiB of objects on the heap without using new memory spans.
But we should keep memory fragmentation in mind.
Imagine you allocate a new byte slice of 10K bytes: the memory could be in a state where it doesn’t have a contiguous block of 10K bytes + slice header, so it would need to use a new span instead of reusing an existing one.

heap_idle_bytes minus heap_released_bytes shows that we have around 60.6 MiB of unused spans, which are reserved from the OS and could be returned to it. That’s 63643648/8192 = 7769 spans.

heap_sys_bytes, which is 63.6 MiB, estimates the largest size the heap has had. That’s 66682880/8192 = 8140 spans.

mallocs_total shows that we allocated 18707 objects and freed 12209. So currently we have 18707-12209 = 6498 live objects. We can find the average size of an object by dividing heap_alloc_bytes by the number of live objects, which is 6498. The result is 2243440 / 6498 = 345.3 bytes.
(This is probably a silly metric, as objects vary a lot in size and we should use histograms instead.)

sys_bytes should be the sum of all *sys metrics, so let’s check that:
sys_bytes == mspan_sys_bytes + mcache_sys_bytes + buck_hash_sys_bytes + gc_sys_bytes + other_sys_bytes + stack_sys_bytes + heap_sys_bytes.
We have 72284408 == 32768 + 16384 + 1443899 + 2371584 + 1310909 + 425984 + 66682880, which is 72284408 == 72284408, so it checks out.

The interesting detail about sys_bytes is that, at 68.9 MiB, it is how much memory in total Go has taken from the OS. Meanwhile, the OS’s vsize gives you 616.7 MiB and rss 10.4 MiB. So all these numbers don’t really match up.

As I understand it, part of our memory could be in OS memory pages which are in swap or in the filesystem (not in RAM), which would explain why rss is smaller than sys_bytes.

And vsize contains a lot of things, like the mapped libc, pthreads libs, etc. You can explore the /proc/PID/maps and /proc/PID/smaps files to see what is currently mapped.

gc_cpu_fraction is running crazy low: only 0.000001 of CPU time is used for GC. That’s really, really cool. (Although this program doesn’t produce much garbage.)

next_gc_bytes shows that the GC’s target is to keep heap_alloc_bytes under 4 MiB; as heap_alloc_bytes is currently at 2.1 MiB, the target is achieved.

Conclusion

I love Go and the fact that it exposes so much useful information in its packages, so that simple users like you and me can just call a function and get the information.

It was really cool playing around and reading about Linux & Go, so I’m thinking of doing a part 2 of this post. Maybe I’ll look into the metrics provided by cAdvisor, or show how to use some of the metrics described here in dashboards/alerts with Prometheus.

Also, once vgo gets integrated (and I really, really hope it does, because it’s like the best package manager I’ve ever used), we should be able to inspect dependencies from some Go runtime package, which would be really cool! Imagine writing a custom prom collector which would go through all your dependencies, check for new versions and give you back the number of outdated pkgs, something like a go_num_outdated_pkgs metric.
This way you could write an alert if your service gets terribly outdated. Or check whether your live dependency hashes match the current hashes?

If you like the post, hit the up arrow button on Reddit, and see you soon.

  1. You can read more about what data the kernel exposes at http://tldp.org/LDP/Linux-Filesystem-Hierarchy/html/proc.html, or a more technical version here: https://www.kernel.org/doc/Documentation/filesystems/proc.txt
  2. Take a look at http://man7.org/linux/man-pages/man7/time.7.html, which has a great description for jiffies, user time and system time.
  3. The formula is in https://github.com/prometheus/procfs/blob/master/proc_stat.go#L187 . Also, there is an interesting work around in prometheus client definition of USER_HZ in https://github.com/prometheus/procfs/blob/master/proc_stat.go#L10-L25 .
  4. http://man7.org/linux/man-pages/man2/getpagesize.2.html
  5. More docs in https://linux.die.net/man/2/setrlimit
  6. https://golang.org/pkg/runtime/debug/#ReadGCStats
  7. https://github.com/golang/go/blob/master/src/runtime/mstats.go#L458
  8. Weirdly, the article “The Go Memory Model” (https://golang.org/ref/mem) talks about accessing memory from different goroutines rather than how memory is allocated, etc.