Exposing Go Modules to Prometheus


Recently Go has released Go 1.12, which added support to retrieve build module information via runtime/debug package. Therefore, I decided to build a Go module to expose this information for Prometheus. I’m happy to announce the release of prommod, which you can find here:

go get github.com/povilasv/prommod


I think it’s a really cool idea to put Go Module version information into Prometheus, as I had previously mentioned. This provides us with a global view of software packages we actually have running in production, as well as an up to date answer – considering this information is updated every minute or so (based on Prometheus scrape period).

Imagine you just learned that Go Module X with version v1.2.0 had just announced a critical security vulnerability. Now you would like to know which of your applications are currently being affected. This can become a hard question to answer, especially if you are utilizing microservice architecture. Considering you may have thousands of them in use, residing in multiple version control repositories, scattered all over the place, it may be nearly impossible to resolve on your own.

However, by using prommod package, it becomes pretty effortless to know what has been affected. Simply query Prometheus, by issuing this PromQL query:


Maybe you want to know how many different versions of package X we currently have running in production?

There is a PromQL for that:

count(count by (version)(go_mod_info{name="X"}))

Or, maybe you would like to get a list of different versions?

count by (version)(go_mod_info{name="X"})

Hell, you could even write an alert using similar PromQL language. Maybe you don’t like having too much of a version spread? Let’s say you allow 3 versions to be supported by library X, then PromQL for this would look like:

count(count by (version)(go_mod_info{name="X"})) > 3

There are tons of other interesting questions you could ask, for example, how long did it take us to update our packages? Or, how many terribly outdated packages do we have in production?

I think it also has a case for your company’s internal packages. Imagine you are a maintainer of a package and almost everyone in your company uses it. Now, as a maintainer, you would probably like to see what package version various teams are on, what the actual version spread is, or how many services are affected by the bug in Y release. By using prommod, you can do just that.

In practice, I use this approach all the time. Where I work, each team runs Prometheus & Thanos for our monitoring pipeline. From time to time I check for Thanos updates and make a Pull Request with the update for the teams. Once the teams approve, I merge it and expect the teams to actually deploy the changes. Obviously, not all of the teams have the time to do so, but, for me, it’s important that this deployment is not forgotten. After a couple of days or so, I check what our version spread is by issuing this PromQL:

count(thanos_build_info) by (revision)

From this query, I get a list of teams that did not apply the update and send them a quick reminder.


Prometheus Version package

This package was totally inspired by Prometheus version package, which has a very interesting API. In order to use it, you have to set Version, Revision, Branch, BuildUser,BuildDate in the global variables of that package. Once you have done that, you need to call version.NewExporter("application_name") to retrieve a Prometheus metrics from it.

Here is a simple code sample for it:

import (

func main(){
    version.Version = "v1.11"
    version.Revision = "sha"
    version.Branch = "master"
    version.BuildUser = "povilas"
    version.BuildDate = "2019-03-08"

But wait, isn’t package scoped global state bad?

The reason version package uses global variables, instead of a function argument, is that you can populate these variables during build time. Go allows you to do that, using the -ldflags argument. This is how my typical Makefile looks:

DATE := $(shell date +%FT%T%z)
USER := $(shell whoami)
GIT_HASH := $(shell git --no-pager describe --tags --always)
BRANCH := $(shell git branch | grep * | cut -d ' ' -f2)

go build -a -ldflags '-s
-X github.com/prometheus/common/version.Version=$(GIT_HASH)
-X github.com/prometheus/common/version.BuildDate="$(DATE)"
-X github.com/prometheus/common/version.Branch=$(BRANCH)
-X github.com/prometheus/common/version.Revision=$(GIT_HASH)
-X github.com/prometheus/common/version.BuildUser=$(USER)'

To explain this code briefly, global variables like User in package github.com/prometheus/common/version gets populated with user information during build time. You can read more about this approach here.

The actual produced metric by this library looks like this:

# HELP app_name_build_info A metric with a constant '1' value labeled by version, revision, branch, and goversion from which app_name was built.
# TYPE app_name_build_info gauge

app_name_build_info{branch="master",goversion="devel +5126feadd6 Sat Mar 2 18:28:10 2019 +0000",revision="32eb8c1",version="32eb8c1"} 1

As you can see, it just puts a “fake” metric with version information in the labels.

Printing version information

Another cool thing about version package is that it has a nice way of plugging version information with other tools. Let’s say you use the ever popular kingpin package, and would like to show version information to your users. Turns out, it’s really easy – take a look at this example:

app := kingpin.New("app_name", "My super awesome application")

This will make a --version flag of your application and display current and relevant version information. This is how the output looks:

app_name, version 32eb8c1 (branch: master, revision: 32eb8c1)
  build user:       povilasv
  build date:       "2019-03-08T17:59:47+0200"
  go version:       devel +5126feadd6 Sat Mar 2 18:28:10 2019 +0000

Many Open Source projects like Prometheus, Kubernetes, Thanos or Grafana use this package or have a very similar approach. Let’s take a look at Kubernetes, for a moment. Although Kubernetes may not use this package directly, it utilizes its own little package, which adds a bit more content. For example, querying Prometheus with kubernetes_build_inforesults in:




Let’s get back to my newly created prommod package, one I built to have a very similar feel to version package. In order to use it, you just call prommod.NewCollector("app_name")and register your metric with Prometheus. Take a look at this code snippet:

import (

func main(){

prommod will produce a metric for each go module used. Here is how exported metrics look:

# HELP go_mod_info A metric with a constant '1' value labeled by dependency name, version, from which app_name was built.
# TYPE go_mod_info gauge
go_mod_info{name="github.com/beorn7/perks",program="app_name",version="v0.0.0-20180321164747-3a771d992973"} 1
go_mod_info{name="github.com/golang/protobuf",program="app_name",version="v1.2.0"} 1
go_mod_info{name="github.com/matttproud/golang_protobuf_extensions",program="app_name",version="v1.0.1"} 1
go_mod_info{name="github.com/povilasv/prommod",program="app_name",version="v0.0.11-0.20190309143328-e661980fc053"} 1
go_mod_info{name="github.com/prometheus/client_golang",program="app_name",version="v0.9.2"} 1
go_mod_info{name="github.com/prometheus/client_model",program="app_name",version="v0.0.0-20180712105110-5c3871d89910"} 1
go_mod_info{name="github.com/prometheus/common",program="app_name",version="v0.0.0-20181126121408-4724e9255275"} 1
go_mod_info{name="github.com/prometheus/procfs",program="app_name",version="v0.0.0-20181204211112-1dc9a6cbc91a"} 1

As you can see, we create go_mod_info metric, which stores your project dependency version information in Prometheus labels.

Also there is prommod.Print("app_name") and prommod.Info() calls to get a human-readable representation of dependency information, just like in the Prometheus version package. You can use prommod alone, or you can combine it with version package in order to get nice combined output. For example:

func main() {

Gives you this nice output:

app_name, version 32eb8c1 (branch: master, revision: 32eb8c1)
  build user:       povilasv
  build date:       2019-03-09T06:22:18+0200
  go version:       devel +5126feadd6 Sat Mar 2 18:28:10 2019 +0000
  github.com/beorn7/perks:  v0.0.0-20180321164747-3a771d992973
  github.com/golang/protobuf:  v1.2.0
  github.com/matttproud/golang_protobuf_extensions:  v1.0.1
  github.com/povilasv/prommod:  v0.0.10
  github.com/prometheus/client_golang:  v0.9.2
  github.com/prometheus/client_model:  v0.0.0-20180712105110-5c3871d89910
  github.com/prometheus/common:  v0.0.0-20181126121408-4724e9255275
  github.com/prometheus/procfs:  v0.0.0-20181204211112-1dc9a6cbc91a

Personally, I find this very useful for development. You can ask your users to submit output of app --version, when submitting bug reports, which then allows you to quickly see all of the dependencies, as well as their versions.

Building prommod

Although this package is small and pretty trivial to implement, writing it was not so easy as it may seem, and I ended up using a lot of Go’s “hidden” features.

Build tags

I really wanted to make this package work on Go with versions lower than 1.12. Although Go modules were introduced in Go 1.11, the debug.ReadBuildInfocall, which is used by prommod, was only introduced in Go 1.12. Therefore, this wouldn’t work for users with older Go versions.

In order to solve this, I used Go build tags. For Go >=1.12 implementation I added Package comment // +build go1.12, which calls debug.ReadBuildInfo. For older Go versions I added // +build !go1.12 comment, and provided the same functions, with identical API. But, instead of calling debug.ReadBuildInfo, I would just return empty strings in prommod.Print("app_name") & prommod.Info() calls, and an empty Prometheus metric.

All in all, if you are using prommod with an older Go version, it will still compile, but won’t produce any useful output.

In order to continuously test this with different versions, I used Travis CI. And I think it’s the best tool for the job. My .travis.yml looks like this:

language: go

  - "1.x"
  - "master"
  - "1.12"
  - "1.11"
  - "1.10"

As you can see it’s just a list of Go versions, which I build and test against, and it all just works ūüôā

Go examples are tests

When testing this library, I particularly liked writing Go Examples.
Not a lot of Go developers know, but Go examples are tests. In order to use Example test features, you simply need to print the resulting variable to Standard Output and check it with // Output: comment.

Take a look at this code snippet:

func ExampleInfo() {
    // Output: ()

In this case, Go will run this code as a test expecting prommod.Info() to be equal to (). And, if the output is not equal, go test will give you a nice error:

$ go test
--- FAIL: ExampleInfo (0.00s)

I encountered weird Go 1.12 behavior, as Go behaves like a non Go module program in testing. What happens is runtime/debug.ReadBuildInfo() doesn’t return any of the Go module information, although compiled with Go 1.12, which makes this library a bit harder to unit test.

Using init() functions

Another lesser known feature are the Go init() functions. Basically, init() functions are called once per package, and they’re used to set up state. What I wanted to do is to always return the computed information from a global variable, instead of actually computing everything on Print()/Info() call.

I think it makes a lot of sense for this library, as this information never changes after the initial build. This also has performance benefits, as this library is mostly just returning constant global strings.

Semantic Versions

Lastly, and most importantly, this package tags versions using Semantic Versioning. Go Modules depend on the community using Semantic Versioning. Please use it – especially if you are an author of a Go package! Pretty please? We are all going to benefit from it. ūüôā


Although I really like this approach of using the Prometheus monitoring system for storing dependency versions, I do think I might be hammering a nail with a screwdriver here. It feels like we are slightly abusing Prometheus, as metrics are “fake” (i.e. metric value never changes) and we just use labels to store information. Maybe it’s worth investigating other solutions, like Grafeas, for storing this type of information‚Äč.

On the other hand, I really like the Pull approach here. I like that it’s sort of an application telling you, “hey, I’m this thing and I use these dependencies to perform my function”, just like application in health checks tells you “hey I think I’m fine”, or “hey I can’t connect to DB”. Also, Prometheus information is as close to real-time as possible. This means that you can monitor application & library updates, live.

Most importantly, what do you think? I would really like to hear more opinions on this topic. And, by the way, if you want to support my work you can buy me a book here.

Thanks for reading! Quality open source is really important to me. prommod is available using MIT license, so feel free to use it. Just remember to star it on github if you really like the package. And, if you are missing something, please create an issue and then a PR if you feel like it.

PromCon 2018 Notes


  • Prometheus graduated CNCF.
  • Openmetrics joined the incubator stage of CNCF.
  • Prometheus will offer built-in TLS and auth in HTTP endpoints, as well as implement subqueries (e.g. max_over_time(rate(foo[5m])[1h:1m]))

Leveraging Prometheus to Build, Test, and Release New Products at DigitalOcean.

Talk by Sneha Inguva from DigitalOcean Observability team.

Talks about leveraging Prometheus to build and test, showed and example setting it up on DigitalOcean’s VPC and DHCP products.
Instrumenting code using Four Golden Signals (from Google SRE book) or Brendan Gregg’s USE method, working in build -> instrument -> test -> iterate loops. They use metrics, tracing, logging.

I really liked how they have a master dashboard, which links to a smaller, service specific dashboard. It sort of makes sense, as when things break, you want to see general view of what is affected and deep down into specific service view. Also really cool dashboard names, which reiterate the pattern:

  • [production] RNS Architecture – Master
  • [production] RNS Architecture – Northd
  • [production] RNS Architecture – RnsDB – MySQL Overview
  • [production] RNS Architecture – RabbitMQ


  • easy to load test with generated GRPC client + multiple goroutines.
  • chaos testing – take down a component in the system.
  • integration testing

Life of an alert

Talk by Stuart Nelson from SoundCloud, maintainer of alertmanager.

Talk went deep dive into alertmanager and how it goes thru all the stages of alerting.

There is this interesting feature in Alertmanager called inhibit rules, which once inhibiting alert fire, it can mute other non important alerts. Imagine if datacenter is on fire and you would probably get tons of alertsm which doesn’t help you figure out what is happening, so you would write an alertmanager inhibit rule, which would mute all the non important alerts. Something like:

    - source_match:
        severity: 'critical'
        severity: 'warning'

But be careful check for no circular inhibitions and remember that silenced alert which is inhibiting alert will still mute alerts

There is alert visualizer UI, which can draw your alert tree.

Alert grouping is another interesting feature, which has a couple of options to tweak such as group_wait – how long to wait for first group of alerts to form, group_interval – how long to wait before sending a notification about new alerts that are added to a group of alerts.

In the end alert goes thru notify pipeline, which has send at least one semantics.
In clustered mode before sending alert there is a wait and dedup stage in order to avoid duplicate notification. Alertmanager instance will wait HA mode position * 15s and then checks gossiped alert notification log and if alert is not there, then send external notification. HA mode position start’s from 0, so the first instance doesn’t wait.

Original https://github.com/prometheus/alertmanager/blob/master/doc/arch.svg


Does Alertmanager store history of alerts?
There is no history of alerts in alertmanager, you can write a kafka sink for your notifications.

How to test inhibitions?

There is no other way apart from visual editor.

Alertmanager HA is still in development?

Soundcloud have been using for a couple of months, work fine.

How alert grouping works?

When new alert comes that will join the group, it will be notified after group_interval if group was already fired, if group hasn’t formed yet then after group_wait.

How do SoundCloud templates look for grouped alerts?

Sound cloud does the common labels, alert name and a link to the runbook.
You can look at the source of the alert and know how many of them are firing.
But in their slack template they actually show how many are firing.

Hidden linux metrics with ebpf_exporter

Talk by Ivan Babrou from Cloudflare.

You can use DNS over tls using Cloudflare, so that your internet provider doesn’t know what you are browsing ūüôā

Current exporters don’t allow you to answer questions like:
How many fast io operations is disk doing?
How many slow io operations?

Current issue with node_exporter is that it doesn’t give you histograms, as Linux Kernel only provides counters in /proc/diskstats. There is a way around that using ebpf. Turns out ebpf is already in kernel, used by seccomp filters. It’s a in-kernel sandboxed virtual machine, which allows you to access kernel data and put in some memory region, after that exporter just needs to map the memory region and give you metrics out of that.

So if you need more detailed metrics from Linux definitely look into ebpf_exporter.

Prometheus Monitoring Mixins: Using Jsonnet to Package Together Dashboards, Alerts, and Exporters

Talk by Tom Wilkie from Grafana.

The talk is mostly the same as the one in KubeCon

Showed basics of jsonnet and how to template out grafana dashboards, alerts and recording rules.

There is also a jsonnet-bundler package manager to get the mixins.

Currently there are mixins for:


Autoscaling All Things Kubernetes with Prometheus

Talk by Michael Hausenblas and Frederic Branczyk.

There are 2 types of autoscaling, vertical and horizontal.

Not all apps are horizontally scaleable.
Right now vertical autoscaling requires pod restart.

If you want to use Prometheus metrics for autoscaling, you need to launch:

API aggregation -> k8s-prometheus-adaptor -> Prometheus

Right now the Custom Metrics API doesn’t provide historic data, only current metrics. But sig-instrumentation & sig-autoscaling is working on that as it is needed for VerticalPodAutoScaler.

VPA is pre-alpha.

Taking Advantage of Relabeling

Talk by Julien Pivotto from Inuits.

You can get top 10 biggest cardinality metrics:

count({job="JOB_NAME"}) by (__name__)

The talk showed really cool tricks on how you can use relabeling:

  • A config will alert on https certificate issues using Blackbox exporter based from Traefik ingress controller information.
  • Relabeling to spread the load of Prometheus server by droping some targets on one server and keeping on another.
  • Adding default label (for example priority: P1) to alerts.

Bomb Squad: Containing the Cardinality Explosion

Talk by Cody ‘Chowny’ Boggs from FreshTracks.

Turns out Prometheus cannot handle too much cardinality.

In Prometheus cardinality is the number of discrete label/value pairs associated with a particular metric.

A simple example of metric with big cardinality is having unique request id in a label.

Bombsquad is Prometheus sidecar which looks at a cardinality number and adds relabeling rule to all Prometheus jobs to silence the metric that causes cardinality explosion.

bombsquad comes with CLI tool that let’s you view all the silences and remove them.

You can view cardinality of your Prometheus by executing:

sort_desc(label_replace( count by(__name__) ({__name__!="", __name__!="card_count"}), "metric_name", "$1", "__name__", "(.+)" ))

Speaker also promoted cool Kubernetes CLI tool for viewing logs called stern.

Thanos – Prometheus at Scale

Talk by BartŇāomiej PŇāotka from Improbable and Fabian Reinartz from Google.

  • Thanos is a new approach for solving Global view and Long Term Storage problems in Prometheus.
  • Thanos querier and Thanos sidecar can solve the global view and HA problem.
  • Thanos sidecar is run as a Prometheus sidecar, which has GRPC endpoint, called Store API.
  • Thanos querier calls to all the GRPC endpoints and returns the Prometheus data.
  • Instead of adding session stickiness to Prometheus HA pairs, Thanos queries has a deduplication mechnanism.
  • Deduplication switches to different replica once it finds a couple of of missing scrapes
  • You can solve Long Term storage, by adding S3 compatible storage.
  • Thanos sidecar then uploads 2h uncompacted Prometheus blocks to S3.
  • Thanos store gateway gives a Store API for blocks in S3.
  • You also need to compact & downsample blocks in S3, which is done via Thanos compactor.
  • You need to think about writing queries to hit downsampled data

Using the Flux Query Engine to Query Multiple Prometheus Servers

Talk by Paul Dix from Influx Data.

Influx Data is creating a new language called “flux”, which is used for data analysis, data science applications. It should be something like Pandas / R / Matlab / Spark and support loading prometheus data.

Right now flux is very very alpha.

Improving Reliability Through Engineering an Easy-to-use Prometheus-Based Monitoring and Alerting Stack: Introducing Our Reliability Toolkit

Talk by Robin van Zijll from ING bank.

Talk was about the struggles of adopting Prometheus in a bank.
Internally the team has developed a PromQL workshop, which hopefully they will open source. Also they created Model Builder, which generates thresholds for alerts.

Explore Your Prometheus Data in Grafana

Talk by David Kaltschmidt from Grafana.

Really cool features are coming to Grafana:

  • Better step alignment, so you get less “jumpy” graphs on refresh.
  • $__range – variable, which shows selected time range, finally you can do something like go_goroutines[$__range].
  • Explore window is coming, where you can explore metrics with good autocompletion support and tabs. Will be released in 6.0
  • Grafana is working on some kind of git integration to support dashboard changes.
  • Grafana has been secretly working on log aggregation
  • Support for mixins.

RPC at Google

Talk by JBD from Google

  • Google works on new monitoring system called Monarch.
  • Google has statz/rpcz pages in their http servers, similiar available in OpenCensus https://opencensus.io/core-concepts/z-pages/.
  • Google works on OpenCensus library, which is vendor-agnostic libraries for metrics and tracing.
  • Exemplar Traces a new kind of thing, which will be reported in metrics endpoint. The idea is to link latency histogram data to OpenTracing spans.

OpenMetrics – Transforming the Prometheus Exposition Format into a Global Standard

Talk By Richard Hartmann.

  • Going beyond metrics, it’s about monitoring.
  • We need a better official standard, as currently SNMP is the only available monitoring standard.
  • Prometheus format is going to a base for OpenMetrics.
  • It’s going to change a bit to enable Exemplar traces.
  • Every single data point in a time series can point to one single event.
  • Prometheus will ignore the stuff it doesn’t care about.

What Prometheus Means for Monitoring Vendors

Talk By Jorge Salamero Sanz from Sysdig.

  • The cool thing about Prometheus is that vendors don’t need to re-implement all the metrics libraries and can just Prometheus as a standard way of exposing metrics.

Implementing a Cooperative Multi-Tenant Capable Prometheus

Talk By Jonas Große Sundrup.

  • Talk about how to share single Prometheus server with multiple people.
  • Speaker developed promauthproxy, proxy which injects user labels into config.
  • The proxy uses Prometheus go packages to parse sent PromQL and add user label to Prometheus queries.
  • The same approach is used for alerts and blackbox exporter config.
  • Config is validated via promtool.

Monitoring at Scale: Migrating to Prometheus at Fastly

Talk By Marcus Barczak from Fastly.

  • Moving away from Ganglia + Icinga.
  • They are running one HA Prometheus pair per datacenter.
  • They encrypt metrics traffic via Ghost Tunnel.
  • They have a custom Service Discovery and Proxy component for configuring Prometheus.
  • Fastly is looking into Thanos for LTS and global view.

Automated Prometheus Benchmarking – Pushing It to Its Limits until It Breaks!

Talk By Krasi Georgiev from Red Hat and Harsh Agarwal.

  • Project doc
  • Source
  • Speakers created automated Prometheus benchmarking during Google Summer Of Code 2018.
  • It leverages Kubernetes Prow CI, creates Kubernetes environment one Prometheus instance running with your changes and one instance with master or some other branch.
  • Meta Prometheus collects benchmarks and allows to view them in Grafana.
  • Benchmarking can be started via comment on github PR.
  • Red Hat is also working on Kiali, UI for istio.

Anatomy of a Prometheus Client Library

Talk By Brian Brazil from Robust Perception.

  • To use a counter make a reference to it and then use it, don’t look it up in a global thread-safe map.
  • Put metrics in different files with different metric names.
  • Client libraries are done in a way that you don’t need to worry about concurrency.
  • Labels need to be known in advance and set to 0.
  • There are Registries, which check for duplicate metrics and Collectors, which provide Prometheus data format.

Go Memory Management

This is a blog post version of a talk I gave at Vilnius Go Meetup. If you’re ever in Vilnius and enjoy Go come join us and consider speaking ūüôā

So, in this post we will explore Go memory management. Let’s begin with a following little program:

func main() {
http.HandleFunc("/bar", func(w http.ResponseWriter, r *http.Request) {
fmt.Fprintf(w, "Hello, %q", html.EscapeString(r.URL.Path))

http.ListenAndServe(":8080", nil)

Let’s compile and run it:

go build main.go

Now let’s find the running process via ps:

ps -u --pid 16609
povilasv 16609 0.0 0.0 388496 5236 pts/9 Sl+ 17:21 0:00 ./main

We can see that this program consumes 379.39 MiB of virtual memory and resident size is 5.11 mb. Wait what? Why ~380 MiB?

A little reminder:

Virtual Memory Size(VSZ) is all memory that the process can access, including memory that is swapped out, memory that is allocated, but not used, and memory that is from shared libraries. (Edited, good explanation in stackoverflow.)

Resident Set Size(RSS) is number of memory pages the process has in real memory multiplied by pagesize. This excludes swapped out memory pages.

Before deep diving into this problem, let’s go thru some basics of computer architecture and memory management in computers.

Memory Basics

Wikipedia defines RAM as:

Random-access memory (RAM /ræm/) is a form of computer data storage that stores data and machine code currently being used.
A random-access memory device allows data items to be read or written in almost the same amount of time irrespective of the physical location of data inside the memory.

I view physical memory as this array of slots/units, where slot can hold 8 bits of information 1. Each memory slot has an address and in your programs you tell CPU “yo CPU, could you grab that byte of information out of memory at address 0“, or “yo CPU, could you put this byte of information at address 1“.

As computers typically are running multiple tasks, reading and writing directly from/to physical memory is a bad idea. Imagine how easy is to write a program, which reads all the stuff (including your passwords) out of memory or a program, which would write at different program’s memory addresses. That would be wild wild west.

So, instead of doing things with Physical Memory we have a concept of Virtual Memory. When your program runs, it only sees it’s memory and it thinks that it’s the only one in here 2. Also, not all of your program’s stored memory bytes could be in RAM. If you don’t access specific memory block often enough, Operating System can put some block of memory into slower storage (like disk) saving precious RAM. And OS won’t even admit to your application that OS did it. But we all know that OS did it.

Virtual memory can be implemented using Segmentation or Page tables based on your CPU architecture and OS. I’m not going to go into detail about Segmentation as Page tables are way more common, but you can read more about Segmentation in 3.

In Paged Virtual Memory, we divide virtual memory into blocks, called Pages. Pages can vary in size based on hardware, but usually pages are 4-64 KB, often with the capability to use huge pages from 2 MB to 1 GB. The division into blocks is useful as it would require a lot more memory to manage each memory slot individually and would slow down performance of your computer.

In order to implement Paged Virtual Memory, there is a chip called Memory Management Unit (MMU) 4, which sits between CPU and your memory. MMU holds mapping from virtual address to physical address in a table (which it stores in memory) called Page Table, containing one Page Table Entry (PTE) per page. Also MMU has a physical cache called Translation Lookaside Buffer (TLB), which store recent translations from Virtual Memory to Physical. Schematically it looks like this:

So let’s say OS decides to put some virtual memory page into disk and your program tries to access it. This process looks like this:

  1. CPU issues a command to access the virtual address, MMU check’s it in it’s Page Table and prohibits access, because no Physical RAM has been allocated to that virtual page.
  2. Then MMU sends a Page Fault to the CPU.
  3. The Operating System then handles the Page fault, by finding a spare memory block of RAM (called frame) and setting up new PTE to map it.
  4. If no RAM is free, it may choose an existing page, using some replacement algorithm, and save it to disk (this process is called paging).
  5. With some Memory Management Units, there can also be a shortage of Page Table Entrys, in which case the OS will have to free one for the new mapping.

Operating systems usually manages multiple applications(processes) so the whole memory management bit looks like this:

Each process has one linear virtual address space, with addresses running from 0 to some huge maximum. Virtual address space doesn’t need to be contiguous, so that not all of these virtual addresses are actually used to store data and they don’t consume space in RAM or disk. What’s really cool about this is that the same frame of real memory can back multiple virtual pages belonging to multiple processes. This is normally the case, with virtual memory occupied by GNU C Library code (libc), which is included by default if you compile with go build. You can compile go code without libc via ldflags setting 5:

go build -ldflags '-libgcc=none'

So this is a high level overview of what memory is and how it’s implemented using hardware and OS talking to each other. Now let’s take a look at what happens in OS, when you try to run your program and how programs can allocate memory.

Operating system

In order to run programs Operating System has a module, which loads programs and needed libraries, called program loader. In Linux you can call program loader from your program using execve() system call.

When loader runs it goes thru these steps 6:

  1. validates the program image (permissions, memory requirements etc.);
  2. copies the program image from the disk into main memory;
  3. passes the command-line arguments on the stack;
  4. initializes registers (like the stack pointer);

Once loading is complete, the operating system starts the program by passing control to the loaded program code (executing a jump instruction to program’s entrypoint (_start)).

So what is a program anyway?

Usually you write programs in high-level languages like Go, which are compiled into either an executable machine code file or a non-executable machine-code object file (a library). These Object files, executable or not, are typically in a container format, such as Executable and Linkable Format (ELF) (typically in Linux), Portable Executable (typically in Windows). Sometimes you don’t have a luxury of writing everything in beloved Go. In that case, one option is to hand craft your own ELF binary and put machine code into correct ELF structure. Another option is to develop a program in assembly language, which remains human-readable while being more closely associated with machine code instructions.

Object files are binary representations of program intended to execute directly on to processor. Those object files contain not only machine code, but also carry metadata about application, like OS architecture, debug information. Also object files carry application data, like global variables or constants. Typically object files are structured into sections such as the .text (executable code), .data (global variables), and .rodata (global constants) 7.

So I’m running Linux (Ubuntu) and my compiled programs (output file of go build) are in ELF 8. In Go, we can easily write a program, which reads an ELF executable file, because Go has a debug/elf package in standard library. Here is an example:

package main

import (

func main() {
f, err := elf.Open("main")

if err != nil {

for _, section := range f.Sections {

And the output:

2018/05/06 14:26:08 &{{ SHT_NULL 0x0 0 0 0 0 0 0 0 0} 0xc4200803f0 0xc4200803f0 0 0}
2018/05/06 14:26:08 &{{.text SHT_PROGBITS SHF_ALLOC+SHF_EXECINSTR 4198400 4096 3373637 0 0 16 0 3373637} 0xc420080420 0xc420080420 0 0}
2018/05/06 14:26:08 &{{.plt SHT_PROGBITS SHF_ALLOC+SHF_EXECINSTR 7572064 3377760 560 0 0 16 16 560} 0xc420080450 0xc420080450 0 0}
2018/05/06 14:26:08 &{{.rodata SHT_PROGBITS SHF_ALLOC 7573504 3379200 1227675 0 0 32 0 1227675} 0xc420080480 0xc420080480 0 0}
2018/05/06 14:26:08 &{{.rela SHT_RELA SHF_ALLOC 8801184 4606880 24 11 0 8 24 24} 0xc4200804b0 0xc4200804b0 0 0}
2018/05/06 14:26:08 &{{.rela.plt SHT_RELA SHF_ALLOC 8801208 4606904 816 11 2 8 24 816} 0xc4200804e0 0xc4200804e0 0 0}
2018/05/06 14:26:08 &{{.gnu.version SHT_GNU_VERSYM SHF_ALLOC 8802048 4607744 78 11 0 2 2 78} 0xc420080510 0xc420080510 0 0}
2018/05/06 14:26:08 &{{.gnu.version_r SHT_GNU_VERNEED SHF_ALLOC 8802144 4607840 112 10 2 8 0 112} 0xc420080540 0xc420080540 0 0}
2018/05/06 14:26:08 &{{.hash SHT_HASH SHF_ALLOC 8802272 4607968 192 11 0 8 4 192} 0xc420080570 0xc420080570 0 0}
2018/05/06 14:26:08 &{{.shstrtab SHT_STRTAB 0x0 0 4608160 375 0 0 1 0 375} 0xc4200805a0 0xc4200805a0 0 0}
2018/05/06 14:26:08 &{{.dynstr SHT_STRTAB SHF_ALLOC 8802848 4608544 594 0 0 1 0 594} 0xc4200805d0 0xc4200805d0 0 0}
2018/05/06 14:26:08 &{{.dynsym SHT_DYNSYM SHF_ALLOC 8803456 4609152 936 10 0 8 24 936} 0xc420080600 0xc420080600 0 0}
2018/05/06 14:26:08 &{{.typelink SHT_PROGBITS SHF_ALLOC 8804416 4610112 12904 0 0 32 0 12904} 0xc420080630 0xc420080630 0 0}
2018/05/06 14:26:08 &{{.itablink SHT_PROGBITS SHF_ALLOC 8817320 4623016 3176 0 0 8 0 3176} 0xc420080660 0xc420080660 0 0}
2018/05/06 14:26:08 &{{.gosymtab SHT_PROGBITS SHF_ALLOC 8820496 4626192 0 0 0 1 0 0} 0xc420080690 0xc420080690 0 0}
2018/05/06 14:26:08 &{{.gopclntab SHT_PROGBITS SHF_ALLOC 8820512 4626208 1694491 0 0 32 0 1694491} 0xc4200806c0 0xc4200806c0 0 0}
2018/05/06 14:26:08 &{{.got.plt SHT_PROGBITS SHF_WRITE+SHF_ALLOC 10518528 6324224 296 0 0 8 8 296} 0xc4200806f0 0xc4200806f0 0 0}
2018/05/06 14:26:08 &{{.dynamic SHT_DYNAMIC SHF_WRITE+SHF_ALLOC 10518848 6324544 304 10 0 8 16 304} 0xc420080720 0xc420080720 0 0}
2018/05/06 14:26:08 &{{.got SHT_PROGBITS SHF_WRITE+SHF_ALLOC 10519152 6324848 8 0 0 8 8 8} 0xc420080750 0xc420080750 0 0}
2018/05/06 14:26:08 &{{.noptrdata SHT_PROGBITS SHF_WRITE+SHF_ALLOC 10519168 6324864 183489 0 0 32 0 183489} 0xc420080780 0xc420080780 0 0}
2018/05/06 14:26:08 &{{.data SHT_PROGBITS SHF_WRITE+SHF_ALLOC 10702688 6508384 46736 0 0 32 0 46736} 0xc4200807b0 0xc4200807b0 0 0}
2018/05/06 14:26:08 &{{.bss SHT_NOBITS SHF_WRITE+SHF_ALLOC 10749440 6555136 127016 0 0 32 0 127016} 0xc4200807e0 0xc4200807e0 0 0}
2018/05/06 14:26:08 &{{.noptrbss SHT_NOBITS SHF_WRITE+SHF_ALLOC 10876480 6682176 12984 0 0 32 0 12984} 0xc420080810 0xc420080810 0 0}
2018/05/06 14:26:08 &{{.tbss SHT_NOBITS SHF_WRITE+SHF_ALLOC+SHF_TLS 0 0 8 0 0 8 0 8} 0xc420080840 0xc420080840 0 0}
2018/05/06 14:26:08 &{{.debug_abbrev SHT_PROGBITS 0x0 10891264 6557696 437 0 0 1 0 437} 0xc420080870 0xc420080870 0 0}
2018/05/06 14:26:08 &{{.debug_line SHT_PROGBITS 0x0 10891701 6558133 350698 0 0 1 0 350698} 0xc4200808a0 0xc4200808a0 0 0}
2018/05/06 14:26:08 &{{.debug_frame SHT_PROGBITS 0x0 11242399 6908831 381068 0 0 1 0 381068} 0xc4200808d0 0xc4200808d0 0 0}
2018/05/06 14:26:08 &{{.debug_pubnames SHT_PROGBITS 0x0 11623467 7289899 121435 0 0 1 0 121435} 0xc420080900 0xc420080900 0 0}
2018/05/06 14:26:08 &{{.debug_pubtypes SHT_PROGBITS 0x0 11744902 7411334 225106 0 0 1 0 225106} 0xc420080930 0xc420080930 0 0}
2018/05/06 14:26:08 &{{.debug_gdb_scripts SHT_PROGBITS 0x0 11970008 7636440 53 0 0 1 0 53} 0xc420080960 0xc420080960 0 0}
2018/05/06 14:26:08 &{{.debug_info SHT_PROGBITS 0x0 11970061 7636493 1847750 0 0 1 0 1847750} 0xc420080990 0xc420080990 0 0}
2018/05/06 14:26:08 &{{.debug_ranges SHT_PROGBITS 0x0 13817811 9484243 167568 0 0 1 0 167568} 0xc4200809c0 0xc4200809c0 0 0}
2018/05/06 14:26:08 &{{.interp SHT_PROGBITS SHF_ALLOC 4198372 4068 28 0 0 1 0 28} 0xc4200809f0 0xc4200809f0 0 0}
2018/05/06 14:26:08 &{{.note.go.buildid SHT_NOTE SHF_ALLOC 4198272 3968 100 0 0 4 0 100} 0xc420080a20 0xc420080a20 0 0}
2018/05/06 14:26:08 &{{.symtab SHT_SYMTAB 0x0 0 9654272 290112 35 377 8 24 290112} 0xc420080a50 0xc420080a50 0 0}
2018/05/06 14:26:08 &{{.strtab SHT_STRTAB 0x0 0 9944384 446735 0 0 1 0 446735} 0xc420080a80 0xc420080a80 0 0}

You can also inspect ELF files using Linux tools like: size --format=sysv main or readelf -l main (here main is the output binary).

As you can see that executable is just a file with some predefined format. Usually executable formats have segments, which are blocks of data memory mapped before running image. Here is a common view of segements, that a process has:

The text segment contains a program’s instructions and literals and static constants.

The data segment is working storage for the program. It can be preallocated and preloaded by exec and the process can extend or shrink it.

The stack segment contains a program stack. It grows as the stack grows, but doesn’t shrink when the stack shrinks.

The heap area commonly begins at the end of the .bss and .data segments and grows to larger addresses from there.

Let’s take look at how processes can allocate memory.

Libc manual says 9, that programs can allocate in two major ways using exec family functions and programmatically. exec calls out program loader to launch a program, thus creating a virtual address space for a process, loading its program into it and running it. And the programmatic ways are:

  • Static allocation is what happens when you declare a global variable. Each global variable defines one block of space, of a fixed size. The space is allocated once, when your program is started (part of the exec operation), and is never freed.
  • Automatic allocation happens when you declare an automatic variable, such as a function argument or a local variable. The space for an automatic variable is allocated when the compound statement containing the declaration is entered, and is freed when that compound statement is exited.
  • Dynamic Allocation – is a technique in which programs determine as they are running where to store some information. You need dynamic allocation when the amount of memory you need, or how long you continue to need it, depends on factors that are not known before the program runs.

To dynamically allocate memory you have a couple of options. One of the options is to make a call to operating system (syscall or via libc). OS provides various functions like:

  • mmap/munmap – allocates/deallocates fixed block memory page.
  • brk/sbrk – changes/gets data segement size
  • madvise – gives advise to Operating System how to manage memory
  • set_thread_area/get_thread_area – works with thread local storage.

I think Go runtime only uses mmap, madvise, munmap and sbrk and it calls it directly to underlying OS via assembly or cgo, i.e. it’s not calling out to libc 10. These memory allocation are low level and typically programers don’t use them. More common is to use libc’s malloc family functions, where you ask for n bytes of memory and libc just returns it to you, and you need to call free to return it back.

Here is a basic C example of using malloc:

#include /* printf, scanf, NULL */
#include /* malloc, free, rand */
int main (){
int i,n;
char * buffer;

printf ("How long do you want the string? ");
scanf ("%d", &i);

buffer = (char*) malloc (i+1);
if (buffer==NULL) exit (1);

for (n=0; n<i; n++)

printf ("Random string: %s\n",buffer);
free (buffer);

return 0;

This example illustrates the need to dynamically allocate data, as we ask the user to enter string length, then allocate bytes based on that and generate random string. Also, note the explicit call to free()

Memory Allocators

As Go doesn’t use malloc to get memory, but asks OS directly (via mmap), it has to implement memory allocation and deallocation on it’s own (like malloc does). Go’s memory allocator is originally based off TCMalloc: Thread-Caching Malloc.

Some interesting facts about TCMalloc:

  • TCMalloc is faster than the glibc 2.3 malloc (available as a separate library called ptmalloc2).
  • ptmalloc2 takes approximately 300 nanoseconds to execute a malloc.
  • The TCMalloc implementation takes approximately 50 nanoseconds for the same operation pair.

TCMalloc also reduces lock contention for multi-threaded programs:

  • For small objects, there is virtually zero contention.
  • For large objects, TCMalloc tries to use fine grained and efficient spinlocks.


The secret behind TCMalloc performance is that it uses thread-local cache to store some preallocated memory “objects”, so that small allocations are satisfied from the thread-local cache 11. Once thread-local cache is out of space, memory objects are moved from central data structures into thread-local cache.

TCMalloc treats small object (size <= 32K) allocation differently from large. Large objects are allocated directly from the central heap using a page-level allocator. While, small objects are maped to one of approximately 170 allocatable size-classes.

So here is how it works for small objects:

When allocating a small object:

  1. We map its size to the corresponding size-class.
  2. Look in the corresponding free list in the thread cache for the current thread.
  3. If the free list is not empty, we remove the first object from the list and return it.

If the free list is empty:

  1. We fetch a bunch of objects from a central free list for this size-class (the central free list is shared by all threads).
  2. Place them in the thread-local free list.
  3. Return one of the newly fetched objects to the applications.

If the central free list is also empty:

  1. We allocate a run of pages from the central page allocator.
  2. Split the run into a set of objects of this size-class.
  3. Place the new objects on the central free list.
  4. As before, move some of these objects to the thread-local free list.

Large Objects (size > 32K) is rounded up to a page size (4K) and is handled by a central page heap. The central page heap is again an array of free lists:

For i < 256, the kth entry is a free list of runs that consist of k pages. The 256th entry is a free list of runs that have length >= 256 pages.

So here is how it works for large objects:

An allocation for k pages is satisfied:

  1. We look in the k-th free list.
  2. If that free list is empty, we look in the next free list, and so forth.
  3. Eventually, we look in the last free list if necessary.
  4. If that fails, we fetch memory from the system.
  5. If an allocation for k pages is satisfied by a run of pages of length > k, the remainder of the run is re-inserted back into the appropriate free list in the page heap.

The memory is managed in term of runs of contiguous pages, which are called Spans (this is important as Go also manages memory in terms of Spans).

In TCMalloc a span can either be allocated, or free:

  • If free, the span is one of the entries in a page heap linked-list.
  • If allocated, it is either a large object that has been handed off to the application, or a run of pages that have been split up into a sequence of small objects.

In this example, span 1 occupies 2 pages, span 2 occupies 4 pages, span 3 occupies 1 page. A central array indexed by page number can be used to find the span to which a page belongs.

Go Memory Allocator

Go allocator is similiar to TCMalloc, it works in runs of pages (spans / mspan objects), uses thread-local cache and divides allocations based on size. Spans are contiguous regions of memory of 8K or larger. You can see Span for yourself in in runtime/mheap.go there is mspan struct. There are 3 types of Spans:

  1. idle ‚Äď span, that has no objects and can be released back to the OS, or reused for heap allocation, or reused for stack memory.
  2. in use ‚Äď span, that has atleast one heap object and may have space for more.
  3. stack ‚Äď span, which is used for goroutine stack. This span can live either in stack or in heap, but not in both.

When allocation happens we map objects into 3 size classes: Tiny class for objects <16 bytes, Small class for objects up to 32 kB and Large class for other objects. Small allocation sizes are rounded to one of about 70 size classes, each of which has its own free set of objects of exactly that size. I found some interesting comments in runtime/malloc.go about Tiny allocator, and why it was introduced: > The main targets of tiny allocator are small strings and standalone escaping variables.

On a json benchmark the allocator reduces number of allocations by ~12% and reduces heap size by ~20%.
Tiny allocator combines several tiny allocation requests into a single memory block of 16 bytes.
The resulting memory block is freed when all subobjects are unreachable.
The subobjects can’t have pointers.

So here is how it works for Tiny objects:

When allocating Tiny object:

  1. Look in to corresponding tiny slot object in this P’s mcache.
  2. Round the size of existing subobject (if exists) into 8, 4 or 2 bytes based on the new object’s size.
  3. If the object fits together with existing subobjects, place it there.

If it doesn’t fit in the tiny block:

  1. Look in the corresponding mspan in this P’s mcache.
  2. Obtain a new mspan from mcache.
  3. Scan the mspan‘s free bitmap to find a free slot.
  4. If there is a free slot, allocate it and use it as a new tiny slot object. (This can all be done without acquiring a lock.)

If the mspan has no free slots:

  1. Obtain a new mspan from the mcentral‘s list of mspans of the required size class that have free space.
  2. Obtaining a whole span amortizes the cost of locking the mcentral.

If the mspan’s list is empty:

  1. Obtain a run of pages from the mheap to use for the mspan.

If the mheap is empty or has no page runs large enough:

  1. Allocate a new group of pages (at least 1MB) from the OS.
  2. Allocating a large run of pages amortizes the cost of talking to the OS.

For Small objects it’s very similiar, but we skip the first part:

When allocating small object:

  1. Round the size up to one of the small size classes.
  2. Look in the corresponding mspan in this P’s mcache.
  3. Scan the mspan‘s free bitmap to find a free slot.
  4. If there is a free slot, allocate it. (This can all be done without acquiring a lock.)

If the mspan has no free slots:

  1. Obtain a new mspan from the mcentral‘s list of mspans of the required size class that have free space.
  2. Obtaining a whole span amortizes the cost of locking the mcentral.

If the mspan’s list is empty:

  1. Obtain a run of pages from the mheap to use for the mspan.

If the mheap is empty or has no page runs large enough:

  1. Allocate a new group of pages (at least 1MB) from the OS.
  2. Allocating a large run of pages amortizes the cost of talking to the OS.

Allocating and freeing a large object uses the mheap directly, bypassing the mcache and mcentral. mheap is managed like in TCMalloc, where we have an array of free lists. Large objects are rounded up to page size (8K) and we look for kth entry in a free list, which consists of k pages, if it’s empty we go down. Rinse and repeat, until the 128 array entry. If we don’t find empty page in 127, we look for a span in leftover large pages (mspan.freelarge field) and if that fails we take from OS.

So that’s all about Go Memory allocation, after digging thru this code runtime.MemStats makes way more sense to me. You can see all the reporting for size classes, you can view how many bytes object’s that implement memory management (like MCache, MSpan) take, etc. You can read more about memstats in exploring Prometheus Go metrics.

Getting back to the problem

So to remind you, we are investigating:

func main() {
http.HandleFunc("/bar", func(w http.ResponseWriter, r *http.Request) {
fmt.Fprintf(w, "Hello, %q", html.EscapeString(r.URL.Path))

http.ListenAndServe(":8080", nil)
go build main.go
ps -u --pid 16609
povilasv 16609 0.0 0.0 388496 5236 pts/9 Sl+ 17:21 0:00 ./main

Which gives us ~380 MiB of virtual memory size.

So maybe it’s runtime?

Let’s read memstats:

func main() {
http.HandleFunc("/bar", func(w http.ResponseWriter, r *http.Request) {
fmt.Fprintf(w, "Hello, %q", html.EscapeString(r.URL.Path))

go func() {
for {
var m runtime.MemStats

log.Println(float64(m.Sys) / 1024 / 1024)
log.Println(float64(m.HeapAlloc) / 1024 / 1024)
time.Sleep(10 * time.Second)
http.ListenAndServe(":8080", nil)

MemStats.Sys is the total bytes of memory obtained from the OS. Sys measures the virtual address space reserved by the Go runtime for the heap, stacks, and other internal data structures.
MemStats.HeapAlloc is bytes of allocated for heap objects.

Nope, it doesn’t look like it:

2018/05/08 18:00:34 4.064689636230469
2018/05/08 18:00:34 0.5109481811523438

Maybe it’s normal?

Let’s try this C program:

#include /* printf, scanf, NULL */

int main (){
int i,n;
printf ("Enter a number:");
scanf ("%d", &i);

return 0;
gcc main.c

Nope, ~10MiB:

ps -u --pid 25074

povilasv 25074 0.0 0.0 10832 908 pts/6 S+ 17:48 0:00 ./a.out

Let’s try looking into /proc

cat /proc/30376/status
Name: main
State: S (sleeping)
Pid: 30376
FDSize: 64
VmPeak: 386576 kB
VmSize: 386576 kB
VmLck: 0 kB
VmPin: 0 kB
VmHWM: 5116 kB
VmRSS: 5116 kB
RssAnon: 972 kB
RssFile: 4144 kB
RssShmem: 0 kB
VmData: 44936 kB
VmStk: 136 kB
VmExe: 2104 kB
VmLib: 2252 kB
VmPTE: 132 kB
VmSwap: 0 kB
HugetlbPages: 0 kB
CoreDumping: 0
Threads: 6

Doesn’t help as segment sizes are normal, only VmSize is huge.

Let’s take a look at /proc/maps

cat /proc/31086/maps
00400000-0060e000 r-xp 00000000 fd:01 1217120 /main
0060e000-007e5000 r--p 0020e000 fd:01 1217120 /main
007e5000-0081b000 rw-p 003e5000 fd:01 1217120 /main
0081b000-0083d000 rw-p 00000000 00:00 0
0275d000-0277e000 rw-p 00000000 00:00 0 [heap]
c000000000-c000001000 rw-p 00000000 00:00 0
c41fff0000-c420200000 rw-p 00000000 00:00 0
7face8000000-7face8021000 rw-p 00000000 00:00 0
7face8021000-7facec000000 ---p 00000000 00:00 0
7facec000000-7facec021000 rw-p 00000000 00:00 0
7facf4021000-7facf8000000 ---p 00000000 00:00 0
7facf8000000-7facf8021000 rw-p 00000000 00:00 0
7facf8021000-7facfc000000 ---p 00000000 00:00 0
7facfd323000-7facfd324000 ---p 00000000 00:00 0
7facfd324000-7facfdb24000 rw-p 00000000 00:00 0
7facfdb24000-7facfdb25000 ---p 00000000 00:00 0
7facfeb27000-7facff327000 rw-p 00000000 00:00 0
7facff327000-7facff328000 ---p 00000000 00:00 0
7facff328000-7facffb28000 rw-p 00000000 00:00 0
7fddc2798000-7fddc2f98000 rw-p 00000000 00:00 0
7fddc2f98000-7fddc2f9b000 r-xp 00000000 fd:01 2363785 libdl-2.27.so
7fddc319c000-7fddc3383000 r-xp 00000000 fd:01 2363779 libc-2.27.so
7fddc3587000-7fddc3589000 rw-p 001eb000 fd:01 2363779 libc-2.27.so
7fddc3589000-7fddc358d000 rw-p 00000000 00:00 0
7fddc358d000-7fddc35a7000 r-xp 00000000 fd:01 2363826 libpthread-2.27.so
7fddc37a8000-7fddc37ac000 rw-p 00000000 00:00 0
7fddc37ac000-7fddc37b2000 r-xp 00000000 fd:01 724559 libgtk3-nocsd.so.0
7fddc39b2000-7fddc39b3000 rw-p 00006000 fd:01 724559 libgtk3-nocsd.so.0
7fddc39b3000-7fddc39da000 r-xp 00000000 fd:01 2363771 ld-2.27.so
7fddc3af4000-7fddc3bb8000 rw-p 00000000 00:00 0
7fddc3bda000-7fddc3bdb000 r--p 00027000 fd:01 2363771 ld-2.27.so
7fddc3bdc000-7fddc3bdd000 rw-p 00000000 00:00 0
7fff472e9000-7fff4730b000 rw-p 00000000 00:00 0 [stack]
7fff473a7000-7fff473aa000 r--p 00000000 00:00 0 [vvar]
7fff473aa000-7fff473ac000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 [vsyscall]

Summing all those addresses will probably leave me with the same ~380Mib. I am too lazy to sum it up. But it’s cool, scroll to the right you will see libc and other shared libraries mapped to your process.

Let’s try a simpler program

func main() {
go func() {
for {
var m runtime.MemStats

log.Println(float64(m.Sys) / 1024 / 1024)
log.Println(float64(m.HeapAlloc) / 1024 / 1024)
time.Sleep(10 * time.Second)

time.Sleep(1 * time.Hour)
go build main.go
ps -u --pid 3642
povilasv 3642 0.0 0.0 4900 948 pts/10 Sl+ 09:07 0:00 ./main

Hmm, that’s interesting, it’s only ~4Mib.

To be continued …

Thank you for reading this. As always I look forward to your comments. And please don’t spoil my search in the comments ūüėÄ


  • https://samypesse.gitbooks.io/how-to-create-an-operating-system/content/Chapter-8/
  • https://chortle.ccsu.edu/AssemblyTutorial/Chapter-04/ass04_1.html
  • http://brokenthorn.com/Resources/OSDev18.html
  • Libc man pages on memory: https://www.gnu.org/software/libc/manual/html_node/Memory.html#Memory
  • TcMalloc: http://goog-perftools.sourceforge.net/doc/tcmalloc.html
  • https://mcuoneclipse.com/2013/04/14/text-data-and-bss-code-and-data-size-explained/
  • https://stackoverflow.com/questions/24973973/is-exe-made-of-pure-machine-code-only
  • http://brokenthorn.com/Resources/OSDev18.html
  • https://www.win.tue.nl/~aeb/linux/lk/lk.html
  • https://samypesse.gitbooks.io/how-to-create-an-operating-system/content/chapter9/
  • https://wiki.osdev.org/ELF_Tutorial
  • https://stackoverflow.com/questions/610682/do-bss-section-zero-initialized-variables-occupy-space-in-elf-file
  1. Actually saying that each memory slot holds 8 bits is not really true, as there are architectures, where you can store less or more than 8 bits, you can read more in https://www.reddit.com/r/askscience/comments/3b6lkz/why_is_it_that_the_de_facto_standard_for_the/ and in https://softwareengineering.stackexchange.com/questions/91230/addressable-memory-unit#91263
  2. Actually there is a concept called shared memory, so that multiple applications can access the same memory. Read more in http://www.csl.mtu.edu/cs4411.ck/www/NOTES/process/shm/what-is-shm.html
  3. https://littleosbook.github.io/#a-short-introduction-to-virtual-memory and https://en.wikipedia.org/wiki/Virtual_memory
  4. You can read more about it in https://en.wikipedia.org/wiki/Memory_management_unit and https://wiki.osdev.org/Paging
  5. Libc man pages are a really great read: http://www.gnu.org/software/libc/manual/html_node/Memory.html
  6. The steps are for Linux, taken from https://en.wikipedia.org/wiki/Loader_(computing)
  7. You can read more about executable files in: https://en.wikipedia.org/wiki/Portable_Executable, https://wiki.osdev.org/ELF_Tutorial, https://www.quora.com/Where-does-elf-file-sits-inside-a-microcontroller
  8. You can verify that your program is in ELF, by opening the binary in text editor and seeing text ELF string in the beginning
  9. https://www.gnu.org/software/libc/manual/html_node/Memory-Concepts.html#Memory-Concepts
  10. Function definitions are in https://github.com/golang/go/blob/master/src/runtime/mem_linux.go and assembly in https://github.com/golang/go/blob/master/src/runtime/sys_linux_amd64.s#L449. There is really cool post about a bug, which happened because Go does not use libc wrappers: https://marcan.st/2017/12/debugging-an-evil-go-runtime-bug/.
  11. Read about tcmalloc design in http://goog-perftools.sourceforge.net/doc/tcmalloc.html

Advanced Standalone Kubelet tutorial for Raspberry Pi

So recently I adapted Kelsey Hightower’s Standalone Kubelet Tutorial for Raspberry Pi.
Standalone Kubelet Tutorial for Raspberry Pi is a prerequisite for this tutorial, as I’m going to skip Linux installation and all the other parts.

In this tutorial I will show you how to cross compile Kubernetes Kubelet to ARM architecture and we will run the amazing Prometheus, Node Exporter and Grafana using static pods.

I am running Ubuntu, so this should generally work for people running Ubuntu or other Linux distributions. If you are running Windows I have no idea whether this will work, but you can try ūüôā I think there could be issues in building Kubelet and/or cross compiling it for ARM.

You should subsitute IP for your Raspberry Pi’s IP address or hostname.

Compiling Kubernetes Kubelet for arm

Clone the Kubernetes project:

git clone git@github.com:kubernetes/kubernetes.git

You can checkout specific version branch if you like (e.g. git checkout release-1.10), I will be running master in this tutorial.

Compile Kubernetes for ARM:

cd kubernetes

Build Kubernetes for your current platform:


To cross compile you will need to have gcc-arm-linux-gnueabihf package installed.
On Ubuntu/Debian you can install via:

sudo apt-get install gcc-arm-linux-gnueabihf

Cross compile Kubelet for ARM:

make all WHAT=cmd/kubelet KUBE_BUILD_PLATFORMS=linux/arm

Copy Kubelet binary to your Raspberry Pi:

scp ./_output/local/bin/linux/arm/kubelet IP:~/

Installing Kubelet

Connect to Raspberry Pi:

ssh alarm@IP

Move Kubelet to /usr/bin:

sudo mv kubelet /usr/bin/kubelet

Download Kubelet’s config file and create /etc/kubernetes from previous tutorial, if you haven’t already.

Download the Kubelet systemd file:

wget -q --show-progress --https-only --timestamping \

Move the kubelet.service unit file to the systemd configuration directory:

sudo mv kubelet.service /etc/systemd/system/

Start the kubelet service:

sudo systemctl daemon-reload

sudo systemctl enable kubelet

sudo systemctl start kubelet


Verify the kubelet is running:

sudo systemctl status kubelet

View logs for Kubelet service:

journalctl -u kubelet

Check that no containers are running:

docker ps

Installing Prometheus and friends


Create Prometheus data dir:

sudo mkdir /root/prometheus-data/

Download Prometheus config:

wget -q --show-progress --https-only --timestamping \

Move prometheus.yml to /root/prometheus-data/:

sudo mv prometheus.yml /root/prometheus-data/prometheus.yml

Download Prometheus manifest:

wget -q --show-progress --https-only --timestamping \

Move the prometheus.yaml pod manifest to the Kubelet manifest directory:

sudo mv prometheus.yaml /etc/kubernetes/manifests/prometheus.yaml

Wait for image to download and list containers:

docker ps

You should see two containers running which represent the prometheus pod and a kubelet container. Docker does not understand pods so the containers are listed as individual containers following the Kubernetes naming convention.

Open up browser and go to:


You should see Prometheus WEB UI.

Open up targets page:


You should see 1 job as UP, and 2 jobs as DOWN. This is expected as we haven’t launched node exporter and grafana yet.

Node Exporter

Download Prometheus Node Exporter manifest:

wget -q --show-progress --https-only --timestamping \

Move the nodeexporter.yaml pod manifest to the Kubelet manifest directory:

sudo mv nodeexporter.yaml /etc/kubernetes/manifests/nodeexporter.yaml

Wait for image to download and list containers:

docker ps

You should see additional two containers running which represent the node exporter pod.

Open up Prometheus targets page:


You should see 2 jobs as UP, and 1 job as DOWN.


Download Grafana manifest:

wget -q --show-progress --https-only --timestamping \

Move the grafana.yaml pod manifest to the Kubelet manifest directory:

sudo mv grafana.yaml /etc/kubernetes/manifests/grafana.yaml

Wait for image to download and verify it is running:

docker ps

You should see additional two containers running which represent the grafana pod.

Open up Prometheus targets page:


You should see 3 jobs as UP.

Go to Grafana page:


Enter `admin`/`admin`.

Add data source:

Click on Add Source and enter:

Name: prom
Access: proxy
Scrape interval: 30s

Adding dashboards:

Click on Dashboards tab in `add data source page`:

Import `Prometheus 2.0 Stats` and `Grafana metrics` dashboards.

Hover on a left `+` button and click `Import`.

Enter `5573`.

Set `prometheus` as `prom`.

Take a look at those 3 dashboards in Grafana UI.

You should see something like:


Here we are at the end of the journey.

We went through a lot together: modifying archlinux kernel parameters, checking cgroup stats via lxc-checkconfig, crosscompiling Kubelet for ARM and actually running Prometheus and friends on a Raspberry Pi.

Hope you enjoyed the journey and see you next time!

Running Prometheus Node exporter on a router

So at home I have Asus RT-N14U router, which has 600 Mhz MIPS architecture CPU, 16mb of flash storage and 64mb of RAM. On the software side, I’m running Andy Padavan’s RT-N56U firmware (Thank you, Andy!), this firmware gives you all the router features, I think raw Linux, ssh and some basic tools.

So Node exporter is written in Go, why not just go build ? So what I did is just simply cross compile into MIPS little endian architecture:

git clone git@github.com:prometheus/node_exporter.git

cd node_exporter
GOARCH='mipsle' GOOS=linux go build

Success! We have a node_exporter executable. By the way, you can checkout specific version (eg git checkout release-0.16) if you like, I usually run master for the latest and greatest ūüėČ
So now, let’s just copy the built executable into the router

scp node_exporter router:~/
node_exporter                                               89%   13MB   2.0MB/s   00:00
ETAscp: /home/admin//node_exporter: No space left on device

Turns out that I only have 1MB of space in my home dir:

Filesystem                Size      Used Available Use% Mounted on
rootfs                    8.5M      8.5M         0 100% /
/dev/root                 8.5M      8.5M         0 100% /
tmpfs                     8.0K         0      8.0K   0% /dev
tmpfs                     2.0M    180.0K      1.8M   9% /etc
tmpfs                     1.0M      8.0K   1016.0K   1% /home
tmpfs                     8.0K         0      8.0K   0% /media
tmpfs                     8.0K         0      8.0K   0% /mnt
tmpfs                    24.0M     76.0K     23.9M   0% /tmp
tmpfs                     4.0M    192.0K      3.8M   5% /var

and Node exporter’s executable size is around ~15mb, So, I attached usb, copied file scp node_exporter router:/media/UBUNTU_17_0/ into mounted usb and tried to launch it:

time="2018-04-14T15:15:00Z" level=info msg="Starting node_exporter (version=, branch=, revision=)" source="node_exporter.go:82"
time="2018-04-14T15:15:00Z" level=info msg="Build context (go=go1.10.1, user=, date=)" source="node_exporter.go:83"
time="2018-04-14T15:15:00Z" level=info msg="Enabled collectors:" source="node_exporter.go:90"
time="2018-04-14T15:15:00Z" level=info msg=" - arp" source="node_exporter.go:97"
time="2018-04-14T15:15:00Z" level=info msg=" - bcache" source="node_exporter.go:97"
time="2018-04-14T15:15:00Z" level=info msg=" - bonding" source="node_exporter.go:97"
time="2018-04-14T15:15:00Z" level=info msg=" - conntrack" source="node_exporter.go:97"
time="2018-04-14T15:15:00Z" level=info msg=" - cpu" source="node_exporter.go:97"

time=”2018-04-14T15:15:00Z” level=info msg=” – vmstat” source=”node_exporter.go:97″
time=”2018-04-14T15:15:00Z” level=info msg=” – wifi” source=”node_exporter.go:97″
time=”2018-04-14T15:15:00Z” level=info msg=” – xfs” source=”node_exporter.go:97″
time=”2018-04-14T15:15:00Z” level=info msg=” – zfs” source=”node_exporter.go:97″
time=”2018-04-14T15:15:00Z” level=info msg=”Listening on :9100″ source=”node_exporter.go:111″

Woohoo it’s working!

Let’s add a Grafana dashboard, to look at those exported metrics:

I’ve used Host Stats – Prometheus Node Exporter grafana dashboard, just had to change it, as all of the labels have changed in recent release of exporter. You can find this fixed dashboard in https://grafana.com/dashboards/5573.

Getting back to the Node exporter, so now that we have node exporter running, how do we start it on router boot?
This Linux is very limited and doesn’t support systemd. So, looking at what other options are available I found that there is crontab (you need to enable it via UI). So first thing I tried:

@reboot /media/UBUNTU_17_0/node_exporter >> /media/UBUNTU_17_0/node_exporter.log 2>&1

Restarted router and nope, @reboot doesn’t seem to work, no surprises here, as this router is very limited.

So this leads us doing things “the old way”: running a cron script every minute or so checking for process existence and starting if it’s not there.

Here is the script:


pidof node_exporter

if [[ $? -ne 0 ]] ; then
        /media/UBUNTU_17_0/node_exporter >> /media/UBUNTU_17_0/node_exporter.log 2>&1 & 

And the crontab:

* * * * * /media/UBUNTU_17_0/ne.sh         

Thats all, thanks for reading!

Exploring Prometheus Go client metrics

In this post I want to explore go metrics, which are exported by client_golang via promhttp.Handler() call.
Here is a sample program, registering prom handler and listening on 8080 port:

package main

import (


func main() {
    http.Handle("/metrics", promhttp.Handler())
    log.Fatal(http.ListenAndServe(":8080", nil))

When you hit your metrics endpoint, you will get something like:

# HELP go_gc_duration_seconds A summary of the GC invocation durations.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 3.5101e-05
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 6
process_open_fds 12
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 1.1272192e+07
# HELP process_virtual_memory_bytes Virtual memory size in bytes.
# TYPE process_virtual_memory_bytes gauge
process_virtual_memory_bytes 4.74484736e+08

On initialisation client_golang registers 2 Prometheus collectors:

  • Process Collector – which collects basic Linux process information like CPU, memory,¬†file descriptor usage and start time.
  • Go Collector – which collects information about Go’s runtime like details about GC, number of gouroutines and OS threads.

Process Collector

What this collector does is reads proc file system. proc file system exposes internal kernel data structures, which is used to obtain information about the system.1

So Prometheus client reads /proc/PID/stat file, which looks like this:

1 (sh) S 0 1 1 34816 8 4194560 674 43 9 1 5 0 0 0 20 0 1 0 89724 1581056 209 18446744073709551615 94672542621696 94672543427732 140730737801568 0 0 0 0 2637828 65538 1 0 0 17 3 0 0 0 0 0 94672545527192 94672545542787 94672557428736 140730737807231 140730737807234 140730737807234 140730737807344 0

You can get human readable variant of this information using cat /proc/PID/status.

process_cpu_seconds_total – it uses utime – number of ticks executing code in user mode, measured in jiffies, with stime – jiffies spent in the system mode, executing code on behalf of the process (like doing system calls). A jiffy is the time between two ticks of the system timer interrupt. 2

process_cpu_seconds_total equals to sum of utime and stime and divide by USER_HZ. This makes sense, as dividing number of scheduler ticks by Hz(ticks per second) produces total time in seconds operating system has been running the process. 3

process_virtual_memory_bytes – uses vsize – virtual memory size is the amount of address space that a process is managing. This includes all types of memory, both in RAM and swapped out.

process_resident_memory_bytes – multiplies rss – resident set memory size is number of memory pages the process has in real memory, with pagesize 4. This results in the amount of memory that belongs specifically to that process in bytes. This excludes swapped out memory pages.

process_start_time_seconds – uses start_time – time the process started after system boot, which is expressed in jiffies and btime from /proc/stat which shows the time at which the system booted in seconds since the Unix epoch. start_time is divided by USER_HZ in order to get the value in seconds.

process_open_fds – counts the number of files in /proc/PID/fd directory. This shows currently open regular files, sockets, pseudo terminals, etc.

process_max_fds – reads /proc/{PID}/limits and uses Soft Limit from “Max Open Files” row. The interesting bit here is that /limits lists Soft and Hard limits.
As it turns out, the soft limit is the value that the kernel enforces for the corresponding resource and the Hard limit acts as a ceiling for the soft limit.
An unprivileged process may only set its soft limit to a value up to the hard limit and (irreversibly) lower its Hard limit. 5

In Go you can use err = syscall.Setrlimit(syscall.RLIMIT_NOFILE, &syscall.Rlimit{Cur: 9, Max: 10}) to set your limit.

Go Collector

Go Collector’s most of the metrics are taken from runtime, runtime/debug packages.

go_goroutines – calls out to runtime.NumGoroutine(), which computes the value based off the scheduler struct and global allglen variable. As all the values in sched struct can be changed concurently there is this funny check where if computed value is less than 1 it becomes 1.

go_threads – calls out to runtime.CreateThreadProfile(), which reads off global allm variable. If you don’t know what M’s or G’s you can read my blogpost about it.

go_gc_duration_seconds – calls out to debug.ReadGCStats() with PauseQuantile set to 5, which returns us the minimum, 25%, 50%, 75%, and maximum pause times. Then it manualy creates a Summary type from pause quantiles, NumGC var and PauseTotal seconds. It’s cool how well GCStats struct fits the prom’s Summary type. 6

go_info – this provides us with Go version. It’s pretty clever, it calls out to runtime.Version() and set’s that as a version label and then always returns value of 1 for this gauge metric.


Go Collector provides us with a lot of metrics about memory and GC.
All those metrics are from runtime.ReadMemStats(), which gives us metrics from MemStats struct.
One thing that worries me, is that runtime.ReadMemStats() has a explicit call to make a stop-the-world pause 7.
So I wonder how much actual cost this pause introduces?
As during stop-the-world pause, all goroutines are paused, so that GC can run. I’ll probably do a comparison of an app with and without instrumentation in a later post.

We already seen that Linux provides us with rss/vsize metrics for memory stats, so naturally the question arises which metrics to use, the ones provided in MemStats or rss/vsize?

The good part about resident set size and virtual memory size is that it’s based off Linux primitives and is programming language agnostic.
So in theory you could instrument any program and you would know how much memory it consumes (as long as you name your metrics consistently, ie process_virtual_memory_bytes and process_resident_memory_bytes.).
In practice, however when Go process starts up it takes a lot of virtual memory beforehand, such a simple program like the one above takes up to 544MiB of vsize on my machine (x86_64 Ubuntu), which is a bit confusing. RSS shows around 7mib, which is closer to the actual usage.

On the other hand using Go runtime based metrics gives more fined grained information on what is happening in your running application.
You should be able to find out more easily whether your program has a memory leak, how long GC took, how much it reclaimed.
Also, it should point you into right direction when you are optimizing program’s memory allocations.

I haven’t looked in detail how Go GC and memory model works, a part of it’s concurrency model 8, so this bit is still new to me.

So let’s take a look at those metrics:

go_memstats_alloc_bytes – a metric which shows how much bytes of memory is allocated on the Heap for the Objects. The value is same as go_memstats_heap_alloc_bytes. This metric counts all reachable heap objects plus unreachable objects, GC has not yet freed.

go_memstats_alloc_bytes_total – this metric increases as objects are allocated in the Heap, but doesn‚Äôt decrease when they are freed. I think it is immensly useful, as it is only increasing number and has same nice properties that Prometheus Counter has. Doing rate() on it should show us how many bytes/s of memory app consumes and is ‚Äúdurable‚ÄĚ across restarts and scrape misses.

go_memstats_sys_bytes – it’s a metric, which measures how many bytes of memory in total is taken from system by Go. It sums all the sys metrics described below.

go_memstats_lookups_total – counts how many pointer dereferences happened. This is a counter value so you can use rate() to lookups/s.

go_memstats_mallocs_total – shows how many heap objects are allocated. This is a counter value so you can use rate() to objects allocated/s.

go_memstats_frees_total – shows how many heap objects are freed. This is a counter value so you can use rate() to objects allocated/s. Note you can get number of live objects with go_memstats_mallocs_totalgo_memstats_frees_total.

Turns out that Go organizes memory in spans, which are contiguous regions of memory of 8K or larger. There are 3 types of Spans:
1) idle – span, that has no objects and can be released back to the OS, or reused for heap allocation, or reused for stack memory.
2) in use – span, that has atleast one heap object and may have space for more.
3) stack – span, which is used for goroutine stack. This span can live either in stack or in heap, but not in both.

Heap memory metrics

go_memstats_heap_alloc_bytes – same as go_memstats_alloc_bytes.

go_memstats_heap_sys_bytes – bytes of memory obtained for the heap from OS. This includes virtual address space that has been resevered, but not yet used and virtual address space which was returned to OS after it became unused. This metric estimates the largest size the heap.

go_memstats_heap_idle_bytes – shows how many bytes are in idle spans.

go_memstats_heap_idle_bytes minus go_memstats_heap_released_bytes estimates how many bytes of memory could be released, but is kept by runtime, so that runtime can allocate objects on the heap without asking OS for more memory.

go_memstats_heap_inuse_bytes – shows how many bytes in in-use spans.

go_memstats_heap_inuse_bytes minus go_memstats_heap_alloc_bytes shows how many bytes of memory has been allocated for the heap, but not currently used.

go_memstats_heap_released_bytes – shows how many bytes of idle spans were returned to the OS.

go_memstats_heap_objects – shows how many objects are allocated on the heap. This changes as GC is performed and new objects are allocated.

Stack memory metrics

go_memstats_stack_inuse_bytes – shows how many bytes of memory is used by stack memory spans, which have atleast one object in them. Go doc says, that stack memory spans can only be used for other stack spans, i.e. there is no mixing of heap objects and stack objects in one memory span.

go_memstats_stack_sys_bytes – shows how many bytes of stack memory is obtained from OS. It’s go_memstats_stack_inuse_bytes plus any memory obtained for OS thread stack.

There is no go_memstats_stack_idle_bytes, as unused stack spans are counted towards go_memstats_heap_idle_bytes.

Off-heap memory metrics

These metrics are bytes allocated for runtime internal structures, that are not allocated on the heap, because they implement the heap.

go_memstats_mspan_inuse_bytes – shows how many bytes are in use by mspan structures.

go_memstats_mspan_sys_bytes – shows how many bytes are obtained from OS for mspan structures.

go_memstats_mcache_inuse_bytes – shows how many bytes are in use by mcache structures.

go_memstats_mcache_sys_bytes – shows how many bytes are obtained from OS for mcache structures.

go_memstats_buck_hash_sys_bytes – shows how many bytes of memory are in bucket hash tables, which are used for profiling.

go_memstats_gc_sys_bytes – shows how many in garbage collection metadata.

go_memstats_other_sys_bytesgo_memstats_other_sys_bytes shows how many bytes of memory are used for other runtime allocations.

go_memstats_next_gc_bytes – shows the target heap size of the next GC cycle. GC’s goal is to keep go_memstats_heap_alloc_bytes less than this value.

go_memstats_last_gc_time_seconds – contains unix timestamp when last GC finished.

go_memstats_last_gc_cpu_fraction – shows the fraction of this program’s available CPU time used by GC since the program started.
This metric is also provided in GODEBUG=gctrace=1.

Playing around with numbers

So it’s a lot of metrics and a lot of information.
I think the best way to learn is to just play around with it, so in this part I’ll do just that.
So I’ll be using the same program that is above.
Here are the dump from /metrics (edited for space), which I’m going to use:

process_resident_memory_bytes 1.09568e+07

process_virtual_memory_bytes 6.46668288e+08

go_memstats_heap_alloc_bytes 2.24344e+06

go_memstats_heap_idle_bytes 6.3643648e+07

go_memstats_heap_inuse_bytes 3.039232e+06

go_memstats_heap_objects 6498

go_memstats_heap_released_bytes 0

go_memstats_heap_sys_bytes 6.668288e+07

go_memstats_lookups_total 0

go_memstats_frees_total 12209

go_memstats_mallocs_total 18707

go_memstats_buck_hash_sys_bytes 1.443899e+06

go_memstats_mcache_inuse_bytes 6912

go_memstats_mcache_sys_bytes 16384

go_memstats_mspan_inuse_bytes 25840

go_memstats_mspan_sys_bytes 32768

go_memstats_other_sys_bytes 1.310909e+06

go_memstats_stack_inuse_bytes 425984

go_memstats_stack_sys_bytes 425984

go_memstats_sys_bytes 7.2284408e+07

go_memstats_next_gc_bytes 4.194304e+06

go_memstats_gc_cpu_fraction 1.421928536233557e-06

go_memstats_gc_sys_bytes 2.371584e+06

go_memstats_last_gc_time_seconds 1.5235057190167596e+09

rss = 1.09568e+07 = 10956800 bytes = 10700 KiB = 10.4 MiB

vsize = 6.46668288e+08 = 646668288 bytes = 631512 KiB = 616.7 MiB

heap_alloc_bytes = 2.24344e+06 = 2243440 = 2190 KiB = 2.1 MiB

heap_inuse_bytes = 3.039232e+06 = 3039232 = 2968 KiB = 2,9 MiB

heap_idle_bytes = 6.3643648e+07 = 63643648 = 62152 KiB = 60.6 MiB

heap_released_bytes = 0

heap_sys_bytes = 6.668288e+07 = 66682880 = 65120 KiB = 63.6 MiB

frees_total = 12209

mallocs_total = 18707

mspan_inuse_bytes = 25840 = 25.2 KiB

mspan_sys_bytes = 32768 = 32 KiB

mcache_inuse_bytes = 6912 = 6.8 KiB

mcache_sys_bytes = 16384 = 12 KiB

buck_hash_sys_bytes = 1.443899e+06 = 1443899 = 1410 KiB = 1.4 MiB

gc_sys_bytes = 2.371584e+06 = 2371584 = 2316 KiB = 2.3 MiB

other_sys_bytes = 1.310909e+06 = 1310909 = 1280,2 KiB = 1.3MiB

stack_inuse_bytes = 425984 = 416 KiB

stack_sys_bytes = 425984 = 416 KiB

sys_bytes = 7.2284408e+07 = 72284408 = 70590.2 KiB = 68.9 MiB

next_gc_bytes = 4.194304e+06 = 4194304 = 4096 KiB = 4 MiB

gc_cpu_fraction = 1.421928536233557e-06 = 0.000001

last_gc_time_seconds = 1.5235057190167596e+09 = Thu, 12 Apr 2018 05:47:59 GMT

Interesting bit is that heap_inuse_bytes is more than heap_alloc_bytes.
I think heap_alloc_bytes shows how many bytes are in terms of objects and heap_inuse_bytes shows bytes of memory in terms of spans.
Dividing heap_inuse_bytes by size of the span gives: 3039232 / 8192 = 371 span.

heap_inuse_bytes minus heap_alloc_bytes, should show the amount of free space that we have in in-use spans, which is 2,9 MiB – 2.1 MiB = 0.8 MiB.
This roughly means that we can allocate 0.8 MiB of objects on the heap without using new memory spans.
But we should keep in mind memory fragmentation.
Imagine if you have a new bytes slice of 10K bytes, the memory could be in the position, where it doesn’t have a contiguous block of 10K bytes + slice header, so it would need use a new span, instead of reusing

heap_idle_bytes minus heap_released_byte shows that we have around 60.6 MiB of unused spans, which are reserved from OS and could be returned to OS. It’s 63643648/8192 = 7769 spans.

heap_sys_bytes, which is 63.6MiB estimates the largest size the heap has had. It’s 66682880/8192 = 8140 spans.

mallocs_total shows that we allocated 18707 objects and freed 12209. So currently, we have 18707-12209 = 6498 objects. We can find the average size of the object dividing heap_alloc_bytes over live objects, which is 6498. The result is 2243440 / 6498 = 345.3 bytes.
(This is probably a stupid metric, as object vary a lot in size and we should do histograms instead)

So sys_bytes should be a sum of all *sys metrics. So let’s check that:
sys_bytes == mspan_sys_bytes + mcache_sys_bytes + buck_hash_sys_bytes + gc_sys_bytes + other_sys_bytes + stack_sys_bytes + heap_sys_bytes.
So, we have 72284408 == 32768 + 16384 + 1443899 + 2371584 + 1310909 + 425984 + 66682880, which is 72284408 == 72284408, which is correct.

The interesting detail about sys_bytes, is that it’s 68,9 MiB it’s how many bytes of memory in total taken from OS. Meanwhile, OS’es vsize gives you 616,7MiB and rss 10.4 MiB. So all these numbers don’t really match up.

As I understand it so part of our memory could be in OS’s memory pages which are in swap or in filesystem (not in RAM), so this would explain why rss is smaller than sys_bytes.

And vsize contains a lot of things, like mapped libc, pthreads libs, etc. You can explore /proc/PID/maps and /proc/PID/smaps file, to see what is being currently mapped.

gc_cpu_fraction is running crazy low, 0.000001 of CPU time is used for GC. That’s really really cool. (Although this program doesn’t produce much garbage)

next_gc_bytes shows that the target for GC is to keep heap_alloc_bytes under 4 MiB, as heap_alloc_bytes is currently at 2.1 MiB the target is achieved.


I love Go and the fact that it exposes so much useful information in it’s packages and simple users like you and me can just call a function and get the information.

It was really cool playing around and reading about linux & Go, so I’m thinking of doing part 2 of this post. Maybe go look into metrics provided by cAdvisor or show how to use some of the metrics described here in dashboards/alerts with Prometheus.

Also, once vgo get’s integrated (and I really really hope it does, cause it’s like the best package manager I ever used). Then we should be able to inspect dependencies from some go runtime package, which would be really cool! Imagine writing a custom prom collector, which would go through all your dependencies, check for new versions and if found wouldgive you back a number of outdated pkgs, something like go_num_outdated_pkgs metric.
This way you could write an alert if your service get’s terribly outdated. Or check that your live dependency hashes don’t match current hashes?

If you like the post, hit the up arrow button on the reddit and see you soon.

  1. You can read more about what data kernel exposes http://tldp.org/LDP/Linux-Filesystem-Hierarchy/html/proc.html or more technical version here: https://www.kernel.org/doc/Documentation/filesystems/proc.txt
  2. Take a look at http://man7.org/linux/man-pages/man7/time.7.html, which has a great description for jiffies, user time and system time.
  3. The formula is in https://github.com/prometheus/procfs/blob/master/proc_stat.go#L187 . Also, there is an interesting work around in prometheus client definition of USER_HZ in https://github.com/prometheus/procfs/blob/master/proc_stat.go#L10-L25 .
  4. http://man7.org/linux/man-pages/man2/getpagesize.2.html
  5. More docs in https://linux.die.net/man/2/setrlimit
  6. https://golang.org/pkg/runtime/debug/#ReadGCStats
  7. https://github.com/golang/go/blob/master/src/runtime/mstats.go#L458
  8. Weirdly article “The Go Memory Model” talks about accessing memory from different goroutines https://golang.org/ref/mem rather than how memory is allocated etc

Tracking request duration with Prometheus

I recently started using Prometheus for instrumenting and I really like it! It has a cool concept of labels,  a functional query language & a bunch of very useful functions like rate(), increase() & histogram_quantile().

Adding metrics to your app is easy,  just import prometheus client and register metrics HTTP handler http.Handle("/metrics", promhttp.Handler())

This one-liner adds HTTP /metrics endpoint to HTTP router. By default client exports memory usage, number of goroutines, gc information and other runtime information. Exporting metrics as HTTP endpoint makes the whole dev/test lifecycle easy, as it is really trivial to check whether your newly added metric is now exposed.

One thing I struggled on is how to track request duration.

Prometheus doesn’t have a built in Timer metric type, which is often available in other monitoring systems. It has only 4 metric types: Counter, Gauge, Histogram and Summary.

It turns out that client library allows you to create a timer using: prometheus.NewTimer(o Observer) and record duration using ObserveDuration() method. Provided Observer can be either Summary, Histogram or a Gauge. So, which one to use?


In Prometheus Histogram is really a cumulative histogram (cumulative frequency). Histogram is made of a counter, which counts number of events that happened, a counter for a sum of event values and another counter for each of a bucket. Buckets count how many times event value was less than or equal to the bucket’s value.

Let’s take a look at the example:

Imagine that you create a histogram with 5 buckets with values:¬†0.5, 1, 2, 3, 5. Let’s call this histogram¬†http_request_duration_seconds¬†and 3 requests come in with durations 1s, 2s, 3s. Then you would see that /metrics¬†endpoint contains:

# HELP http_request_duration_seconds request duration histogram
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.5"} 0
http_request_duration_seconds_bucket{le="1"} 1
http_request_duration_seconds_bucket{le="2"} 2
http_request_duration_seconds_bucket{le="3"} 3
http_request_duration_seconds_bucket{le="5"} 3
http_request_duration_seconds_bucket{le="+Inf"} 3
http_request_duration_seconds_sum 6
http_request_duration_seconds_count 3

Here we can see that:

sum is 1s + 2s + 3s = 6,

count is 3, because of 3 requests

bucket {le=”0.5″} ¬†is 0, because none of the requests where <= 0.5 seconds

bucket {le=”1″} ¬†is 1, because one of the requests where <= 1¬†seconds

bucket {le=”2″} ¬†is 2, because two of the requests where <= 2¬†seconds

bucket {le=”3″} ¬†is 3, because all of the requests where <= 3¬†seconds

So what can we do with it?

Thirst thing to note is that when using Histogram we don’t need to have a separate counter to count total HTTP requests, as it creates one for us.

We could calculate average request time by dividing sum over count. In PromQL it would be:

http_request_duration_seconds_sum / http_request_duration_seconds_count

Also we could calculate percentiles from it. Prometheus comes with a handy histogram_quantile function for it. For example calculating 50% percentile (second quartile) for last 10 minutes in PromQL would be:

histogram_quantile(0.5, rate(http_request_duration_seconds_bucket[10m])

Which results in 1.5.

Wait, 1.5? Shouldn’t it be 2?¬†(50th percentile is supposed to be the median, the number in the middle)

I even computed the 50th percentile using cumulative frequency table (what I thought prometheus is doing) and still ended up with 2.

As it turns out, this value is only an approximation of computed quantile. You can find more information on what type of approximations prometheus is doing in histogram_quantile doc.

By the way, be warned that percentiles can be easily misinterpreted.

It’s important to understand that creating a new histogram requires you to specify bucket boundaries up front. The default values, which are 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10¬†are tailored to broadly measure the response time in seconds and probably won’t fit your app’s behavior.

This creates a bit of a chicken or the egg problem, because you cannot know bucket boundaries until you launched the app and collected latency data and you cannot make a new Histogram without specifying (implicitly or explicitly) the bucket values. So I guess the best way to move forward is launch your app with default bucket boundaries, let it spin for a while and later tune those values based on what you see.


A Summary is like a histogram_quantile() function, but percentiles are computed in the client. Summary is made of a count and sum counters (like in Histogram type) and resulting quantile values.

For example, we want to find 0.5, 0.9, 0.99 quantiles and the same 3 requests with 1s, 2s, 3s durations come in.

Then /metrics endpoint should contain:

http_request_duration_seconds{quantile="0.5"} 2
http_request_duration_seconds{quantile="0.9"} 3
http_request_duration_seconds{quantile="0.99"} 3
http_request_duration_seconds_sum 6
http_request_duration_seconds_count 3

So we can see that:

sum is 1s + 2s + 3s = 6,

count is 3, because of 3 requests.

{quantile=”0.5″} is 2, meaning 50th percentile is 2.

{quantile=”0.9″} is 3, meaning 90th percentile is 3.

{quantile=”0.99″} is 3, meaning 99th percentile is 3.¬†

Summaries are great if¬†you already know what quantiles you want. You just specify them in¬†SummaryOpts¬†objectives map with it’s error window. For example:

map[float64]float64{0.5: 0.05}, which will compute 50th percentile with error window of 0.05.

Of course there are a couple of other parameters you could tune (like MaxAge, AgeBuckets or BufCap), but defaults should be good enough.

Although, there are a couple of problems with this approach.

First, you really need to know what percentiles you want. Because if you want to compute a different percentile, you will have to make changes in your code.

The other problem is that you cannot aggregate Summary types, i.e. if you have more than one replica of your app running you won’t be able to compute quantiles across all of the instances.

I usually don’t really know what I want, so I prefer to use Histograms.

By the way, the default go_gc_duration_seconds, which measures how long garbage collection took is implemented using Summary type.


Although Gauge doesn’t really implement¬†Observer¬†interface, you can make it using¬†prometheus.ObserverFunc(gauge.Set).

If we had the same 3 requests with 1s, 2s, 3s durations. The /metrics would contain:

# HELP http_request_duration_seconds Duration of the last http request
# TYPE http_request_duration_seconds gauge
http_request_duration_seconds 3

So we can see that:

http_request_duration_seconds is 3, meaning that last observed duration was 3.

At first I thought, this is great, I’ll just record all my request durations this way and aggregate/average out them later.

Well, no.

The mistake here is that Prometheus scrapes /metrics data only once in a while (by default every 1 min), which is configured by scrap_interval for your target.

So if you don’t have a lot of requests you could try to configure scrape_interval¬†to align with your requests and then you would see how long each request took. But I don’t think it’s a good idea, in this case I would rather push¬†the Gauge metrics to Prometheus.

I think this could be useful¬†for “job” type problems . For example, you could push how long backup, or data aggregating job has took.


Thanks for reading. Hopefully by now you and I know a bit more about Histograms, Summaries and tracking request duration. My plan for now is to track latency using Histograms, play around with histogram_quantile  and make some beautiful dashboards.

Oh and I forgot to mention, if you are instrumenting HTTP server or client, prometheus library has some helpers around it in promhttp package.

Go scheduler: Ms, Ps & Gs

I decided to learn a bit more about Go internals and since it was long time that somebody wrote about Go scheduler, I though it would be an interesting post. So, let’s get to it!


The Go runtime manages scheduling, garbage collection, and the runtime environment for goroutines. Here, I will focus only on the scheduler.

Runtime scheduler runs goroutines by mapping them onto operating system threads. Goroutines are lightweight version of threads, with very low cost of starting up. Each goroutine is described by a struct called G, which contains fields necessary to keep track of its stack and current status. So, G = goroutine.

Runtime keeps track of each G and maps them onto Logical Processors, named P. P can be seen as a abstract resource or a context, which needs to be acquired, so that OS thread (called M, or Machine) can execute G .

You can control how many logical processor your runtime has by calling runtime.GOMAXPROCS(numLogicalProcessors), if you are planning to tweak this parameter¬†(you probably shouldn’t), set it once and forget it, because it needs “stop the world” GC pause.

Essentially, Operating System runs¬†threads, which runs your¬†code. The Go’s trick is that compiler inserts calls into Go runtime in various places, e.g. sending a value thru the channel, making a call to¬†runtime package), so that Go can notify scheduler and take action.

This is how go program, runtime & operating system fits together

Note: Idea for figure below taken from Analysis of the Go runtime scheduler.

The dance between Ms, Ps & Gs

The interaction between Ms, Ps & Gs is a bit complicated. Take a look at this amazing workflow graph, which is from  go runtime scheduler slides by Gao Chao.

This is what happens when you create a new goroutine



Here we can see, that there are two types of queues for¬†G: a global queue¬†in the schedt struct¬†(which is rarely used)¬†and each¬†P¬†maintains a queue¬†of runnable¬†G‘s.

In order to execute a goroutine,¬†M¬†needs to be holding context P. Machine, then just pops it’s¬†P queue of goroutines and executes code.

When you schedule a new goroutine (do a go func() call)¬†it is placed into P‘s¬†queue. There is this interesting work-stealing scheduling algorithm, which runs when¬†M¬†finishes executing some¬†G and then¬†it¬†tries to take another G out of a queue, which is empty, then it randomly chooses another P and tries to steal a half of runnable G’s¬†from it!

Interesting things happen when your goroutine makes a blocking syscall. Blocking syscall will be intercepted, if there are Gs to run, runtime¬†will detach the thread from the P¬†and create a new OS thread (if idle thread doesn’t exist) to service that processor.

When a system calls resumes, the goroutine is placed back into a local run queue and the thread will park itself (meaning thread won’t¬†be¬†running) and insert itself in the list of idle¬†threads.

If a goroutine makes a network call, runtime will do a similar action. The call will be intercepted, but because Go has integrated network poller, which has its own thread, it will be assigned to it.

Essentially Go runtime will run a different goroutine, if current goroutine is blocked on:

  • blocking syscall (for example opening a file),
  • network input,
  • channel operations,
  • primitives in the sync package.

Scheduler Tracing

Go allows to trace runtime scheduler. This is done via GODEBUG environment variable:

$ GODEBUG=scheddetail=1,schedtrace=1000 ./program

Here is an example of output it gives:

SCHED 0ms: gomaxprocs=8 idleprocs=7 threads=2 spinningthreads=0 idlethreads=0 runqueue=0 gcwaiting=0 nmidlelocked=0 stopwait=0 sysmonwait=0
  P0: status=1 schedtick=0 syscalltick=0 m=0 runqsize=0 gfreecnt=0
  P1: status=0 schedtick=0 syscalltick=0 m=-1 runqsize=0 gfreecnt=0
  P2: status=0 schedtick=0 syscalltick=0 m=-1 runqsize=0 gfreecnt=0
  P3: status=0 schedtick=0 syscalltick=0 m=-1 runqsize=0 gfreecnt=0
  P4: status=0 schedtick=0 syscalltick=0 m=-1 runqsize=0 gfreecnt=0
  P5: status=0 schedtick=0 syscalltick=0 m=-1 runqsize=0 gfreecnt=0
  P6: status=0 schedtick=0 syscalltick=0 m=-1 runqsize=0 gfreecnt=0
  P7: status=0 schedtick=0 syscalltick=0 m=-1 runqsize=0 gfreecnt=0
  M1: p=-1 curg=-1 mallocing=0 throwing=0 preemptoff= locks=1 dying=0 helpgc=0 spinning=false blocked=false lockedg=-1
  M0: p=0 curg=1 mallocing=0 throwing=0 preemptoff= locks=1 dying=0 helpgc=0 spinning=false blocked=false lockedg=1
  G1: status=8() m=0 lockedm=0

Note that it uses the same concepts of G, M & P and their state, such as P’s queue size. Usually, you don’t need that much detail, so you can just use:

$ GODEBUG=schedtrace=1000 ./program

There is this great article by William Kennedy, which explains how to interpret this and the detailed version of the trace.

Also, there is an advanced tool called go tool trace, which has an UI and allows you to explore, what your program and runtime is doing. You can read how to use it in Pusher article.


yaml url.URL parsing

I was using gopkg.in/yaml.v2 to unmarshal a .yaml file into a struct and couldn’t find a solution to unmarshal an url string into url.URL.

Trying to just use url.URL  fails with:

Error initializing configuration: yaml: 
unmarshal errors:\n  line 54:
cannot unmarshal !!str `http://...` into url.URL

This is because, url.URL doesn’t adhere to yaml.v2.Unmarshaler interface.

So here is my solution, feel free to copy paste it!

type YAMLURL struct {

func (j *YAMLURL) UnmarshalYAML(unmarshal func(interface{}) error) error {
    var s string
    err := unmarshal(&s)
    if err != nil {
        return err
    url, err := url.Parse(s)
    j.URL = url
    return err

And here is how to use it:

type Config struct{
    url YAMLURL `yaml:"url"`

Thanks for reading!

Think about your dependencies

A little copying is better than a little dependency, says Go proverb.

Go has a huge standard library, which has support for many, many things. I think many developers, don’t utilize it fully. I mean I have¬†seen many new gophers¬†try to use packages for everything. For example, do you really need a dependency on github.com/stretchr/testify ? ¬†If you are just using assert,¬†maybe you¬†can just can copy the assert function from benbjohnson/testing? Ben Johnson has ¬†published these convenience functions (MIT license):

func assert(tb testing.TB, condition bool, msg string) 
func ok(tb testing.TB, err error) 
func equals(tb testing.TB, exp, act interface{}) 

If you are using the TestSuite¬†stuff, maybe you can just avoid it entirely? The same goes for all the cool HTTP muxes. It’s cool, but if you don’t have a lot of endpoints, maybe you can just avoid that dependency and use standard library’s¬†http¬†package.

But why am I against dependencies? Well, for new developers learning a new language is a lot, if you complicate this by adding custom packages and frameworks it will be harder. And it will be harder for you too. If you get back to a project after a break, you will notice that you have to relearn the dependencies you have used.

With less dependencies you will spend less time managing them. Once, I spent 3 hours fixing a dependency problem with Glide. The problem was that one of my dependency (let’s call this dependency A) had a dependency on an older version of a package C¬†and another dependency (let’s call it B) had a dependency on a newer version of a package C.¬†The biggest problem was that package¬†C¬†hadn’t used semantic versioning, so I had to try random library C commits and try to get this thing working… After doing that for an hour, I took a break and¬†realized, that I can easily rewrite the code and remove a dependency on B.

But aren’t packages from standard library are dependencies too? Well yes, but by utilizing the standard library, you are learning the language. You¬†will start to remember public functions, structs and interfaces, after awhile. Also, Go has version 1 compatibility guarantee, that means that standard library’s API won’t break down on you, for go 1.* versions.

Of course don’t go and reimplement Go MySQL driver or anything crazy like that. Just think before adding another dependency!

That’s all from me. Do you enjoy these articles? Follow me on twitter @PofkeVe¬†and join my mailing list¬†ūüôā