Recap

Previously on povilasv.me, we explored Go Memory Management and we left with 2 small Go programs having different virtual memory size.

Firstly, let’s take a look at ex1 program which consumes a lot of virtual memory. The code for ex1 looks like this:


func main() {
  http.HandleFunc("/bar", 
      func(w http.ResponseWriter, r *http.Request) {
        fmt.Fprintf(w, "Hello, %q", 
           html.EscapeString(r.URL.Path))
    })

  http.ListenAndServe(":8080", nil)
}

In order to view virtual memory size, I executed ps. Take a look at the output showing large virtual memory size below. Note that ps output is in kibibytes. 388496 KiB is around 379.390625 MiB.

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT 
povilasv 16609  0.0  0.0 388496  5236 pts/9    Sl+

Secondly, let’s take a look at ex2 program, which doesn’t consume a lot of virtual memory:


func main() {
    go func() {
        for {
            var m runtime.MemStats
            runtime.ReadMemStats(&m)

            log.Println(float64(m.Sys) / 1024 / 1024)
            log.Println(float64(m.HeapAlloc) / 1024 / 1024)
            time.Sleep(10 * time.Second)
        }
    }()

    fmt.Println("hello")
    time.Sleep(1 * time.Hour)
}

Finally, let’s take a look at ps output for this program. As you can see this program runs with small virtual memory size. Note that 4900 KiB is around 4.79 MiB.

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT 
povilasv  3642  0.0  0.0   4900   948 pts/10   Sl+ 

It’s important to note that these programs are compiled with a bit dated Go 1.10 version. There are differences on newer versions of Go.
For example, compiling ex1 program with Go 1.11 produces virtual memory size of 466MiB and resident set size of 3.22MiB. Additionally, doing the same with Go 1.12 provides us with 100.37MiB of virtual memory and 1.44MiB of resident set size.

As we can see, the difference between HTTP server and a simple command line application made all the difference to virtual memory size.

Aha moment

Since then I had an aha moment. Maybe I can debug this interesting behavior with strace. Take a look at strace description:

strace is a diagnostic, debugging and instructional userspace utility for Linux. It is used to monitor and tamper with interactions between processes and the Linux kernel, which include system calls, signal deliveries, and changes of process state.

So the plan is to run both programs with strace, in order to compare what is happening on the operating system level. Running and using strace is really simple, you just add strace in front of compiled program. For instance, in order to trace ex1 program I executed:


strace ./ex1

Which produces the following output:

execve("./ex1", ["./ex1"], 0x7fffe12acd60 /* 97 vars */) = 0
brk(NULL)                               = 0x573000
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/local/lib/tls/haswell/x86_64/libpthread.so.0", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
stat("/usr/local/lib/tls/haswell/x86_64", 0x7ffdaa923fa0) = -1 ENOENT (No such file or directory)
...
stat("/lib/x86_64", 0x7ffdaa923fa0)     = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/lib/libpthread.so.0", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\340b\0\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=146152, ...}) = 0
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fc8a8d11000
mmap(NULL, 2225248, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7fc8a88cd000
mprotect(0x7fc8a88e8000, 2093056, PROT_NONE) = 0
mmap(0x7fc8a8ae7000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1a000) = 0x7fc8a8ae7000
mmap(0x7fc8a8ae9000, 13408, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7fc8a8ae9000
close(3)                                = 0
openat(AT_FDCWD, "/usr/local/lib/libc.so.6", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/lib/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\0\34\2\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=1857312, ...}) = 0
mmap(NULL, 3963464, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7fc8a8505000
mprotect(0x7fc8a86c3000, 2097152, PROT_NONE) = 0
mmap(0x7fc8a88c3000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1be000) = 0x7fc8a88c3000
mmap(0x7fc8a88c9000, 14920, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7fc8a88c9000
close(3)                                = 0
mmap(NULL, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fc8a8d0e000
arch_prctl(ARCH_SET_FS, 0x7fc8a8d0e740) = 0
mprotect(0x7fc8a88c3000, 16384, PROT_READ) = 0
mprotect(0x7fc8a8ae7000, 4096, PROT_READ) = 0
mprotect(0x7fc8a8d13000, 4096, PROT_READ) = 0
set_tid_address(0x7fc8a8d0ea10)         = 2109
set_robust_list(0x7fc8a8d0ea20, 24)     = 0
rt_sigaction(SIGRTMIN, {sa_handler=0x7fc8a88d2ca0, sa_mask=[], sa_flags=SA_RESTORER|SA_SIGINFO, sa_restorer=0x7fc8a88e1140}, NULL, 8) = 0
rt_sigaction(SIGRT_1, {sa_handler=0x7fc8a88d2d50, sa_mask=[], sa_flags=SA_RESTORER|SA_RESTART|SA_SIGINFO, sa_restorer=0x7fc8a88e1140}, NULL, 8) = 0
rt_sigprocmask(SIG_UNBLOCK, [RTMIN RT_1], NULL, 8) = 0
prlimit64(0, RLIMIT_STACK, NULL, {rlim_cur=8192*1024, rlim_max=RLIM64_INFINITY}) = 0
brk(NULL)                               = 0x573000
brk(0x594000)                           = 0x594000
sched_getaffinity(0, 8192, [0, 1, 2, 3]) = 8
mmap(0xc000000000, 65536, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xc000000000
munmap(0xc000000000, 65536)             = 0
mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fc8a8cce000
mmap(0xc420000000, 1048576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xc420000000
mmap(0xc41fff8000, 32768, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xc41fff8000
mmap(0xc000000000, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xc000000000
mmap(NULL, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fc8a8cbe000
mmap(NULL, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fc8a8cae000
rt_sigprocmask(SIG_SETMASK, NULL, [], 8) = 0
sigaltstack(NULL, {ss_sp=NULL, ss_flags=SS_DISABLE, ss_size=0}) = 0
sigaltstack({ss_sp=0xc420002000, ss_flags=0, ss_size=32768}, NULL) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
gettid()                                = 2109
...

Similarly, let’s trace the ex2 program and take a look at it’s output:


strace ./ex2

Output:

execve("./ex2", ["./ex2"], 0x7ffc2965ca40 /* 97 vars */) = 0
arch_prctl(ARCH_SET_FS, 0x5397b0)       = 0
sched_getaffinity(0, 8192, [0, 1, 2, 3]) = 8
mmap(0xc000000000, 65536, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xc000000000
munmap(0xc000000000, 65536)             = 0
mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7ff1c637b000
mmap(0xc420000000, 1048576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xc420000000
mmap(0xc41fff8000, 32768, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xc41fff8000
mmap(0xc000000000, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xc000000000
mmap(NULL, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7ff1c636b000
mmap(NULL, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7ff1c635b000
rt_sigprocmask(SIG_SETMASK, NULL, [], 8) = 0
sigaltstack(NULL, {ss_sp=NULL, ss_flags=SS_DISABLE, ss_size=0}) = 0
sigaltstack({ss_sp=0xc420002000, ss_flags=0, ss_size=32768}, NULL) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
gettid()                                = 22982

Note that actual output is longer. For readability sake, I just copied output for strace up to the point, where both programs call gettid(). This point was selected for the reason that it appeared only once in both strace dumps.

Let’s compare the outputs. First of all, ex1 trace output is way longer than ex2. ex1 looks for some .so libraries and then loads them into memory. For instance, this is how trace output for loading libpthread.so.0 looks:

...
openat(AT_FDCWD, "/lib/libpthread.so.0", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\340b\0\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=146152, ...}) = 0
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fc8a8d11000
mmap(NULL, 2225248, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7fc8a88cd000
mprotect(0x7fc8a88e8000, 2093056, PROT_NONE) = 0
mmap(0x7fc8a8ae7000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1a000) = 0x7fc8a8ae7000
mmap(0x7fc8a8ae9000, 13408, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7fc8a8ae9000
close(3)                                = 0

In this example, we can see that file is being opened, read into memory and finally closed. Some of those memory regions are mapped with PROTO_EXEC flag, which allows our program to execute code located in that memory region. We can see the same thing happening for libc.so.6:

...
openat(AT_FDCWD, "/lib/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\0\34\2\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=1857312, ...}) = 0
mmap(NULL, 3963464, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7fc8a8505000
mprotect(0x7fc8a86c3000, 2097152, PROT_NONE) = 0
mmap(0x7fc8a88c3000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1be000) = 0x7fc8a88c3000
mmap(0x7fc8a88c9000, 14920, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7fc8a88c9000
close(3)                                = 0

After loading these libraries, both programs start to behave similarly. We can see that strace outputs after loading, show mapping of the same memory regions and contain traces of running similar commands up to the point, where programs execute gettid().

This is interesting. So ex1 has loaded libpthread & libc, but ex2 didn’t.

cgo must be involved.

CGO

Let’s explore cgo and how it works. As godoc explains:

Cgo enables the creation of Go packages that call C code.

Turns out in order to call C code, you need to add a special comment and use special C package. Let’s take a look at this mini example:


package main

// #include 
import "C"
import "fmt"

func main() {
    char := C.getchar()
    fmt.Printf("%T %#v", char, char)
}

This program includes stdio.h from C standard library and then calls getchar() and prints resulting variable. getchar() gets a character (an unsigned char) from stdin. Let’s try it:


go build
./ex3

When executing this program, it will ask you to enter a character and simply print it. Here is an example of this interaction:

a
main._Ctype_int 97

As we can see, it behaves just like Go. Interestingly it compiles just like native Go code, as you just go build and that’s it. I bet If you didn’t look at the code, you wouldn’t notice a difference.

Obviously it has a lot more interesting features. For instance, if you place your .c and .h files in the same directory as native go code, go build will compile and link it together with your native code.

I suggest you checkout godoc and C? Go? Cgo! blog post if you want to learn more. Now let’s, get back to our interesting problem. Why did ex1 program use cgo and ex2 didn’t?

Exploring difference

The difference between ex1 and ex2 programs is import path. ex1 imports net/http, while ex2 doesn’t. Interestingly, Grepping thru net/http package doesn’t show any signs of C statements. But it’s enough to look one level up and volia, it’s the net package.

Take a look at these net package files:

For example, net/cgo_linux.go contains:

// +build !android,cgo,!netgo

package net

/*
#include <netdb.h>
*/
import "C"

// NOTE(rsc): In theory there are approximately balanced
// arguments for and against including AI_ADDRCONFIG
// in the flags (it includes IPv4 results only on IPv4 systems,
// and similarly for IPv6), but in practice setting it causes
// getaddrinfo to return the wrong canonical name on Linux.
// So definitely leave it out.
const cgoAddrInfoFlags = C.AI_CANONNAME | C.AI_V4MAPPED | C.AI_ALL

As we can see net package includes C header file netdb.h and uses a couple of variables from there. But why do we need this? Let’s explore it.

What is netdb.h ?

If you look up netdb.h documentation, you will see that netdb.h is part of libc. It’s documentation states:

netdb.h it provides definitions for network database operations.

Additionally, documentation explains these constants. Let’s have a look:

  • AI_CANONNAME – Request for canonical name.
  • AI_V4MAPPED – If no IPv6 addresses are found, query for IPv4 addresses and return them to the caller as IPv4-mapped IPv6 addresses
  • AI_ALL – Query for both IPv4 and IPv6 addresses.

Looking at how these flags are actually used, it turns out that they are passed to getaddrinfo(), which is a call to resolve DNS name using libc. So simply those flags, control how DNS name resolution is going to take place.

Likewise if you open net/cgo_bsd.go, you will see a bit diffferent version of cgoAddrInfoFlags constant. Let’s take a look:

// +build cgo,!netgo
// +build darwin dragonfly freebsd

package net

/*
#include <netdb.h>
*/
import "C"

const cgoAddrInfoFlags = (C.AI_CANONNAME | C.AI_V4MAPPED |
 C.AI_ALL) & C.AI_MASK

So this hints us, that there is a mechanism to set OS specific flags for DNS resolution amd we are using cgo to do DNS queries correctly. That’s cool. Let’s explore net package a little more.

Have a read thru net package documentation:

Name Resolution

The method for resolving domain names, whether indirectly with functions like Dial or directly with functions like LookupHost and LookupAddr, varies by operating system.

On Unix systems, the resolver has two options for resolving names. It can use a pure Go resolver that sends DNS requests directly to the servers listed in /etc/resolv.conf, or it can use a cgo-based resolver that calls C library routines such as getaddrinfo and getnameinfo.

By default the pure Go resolver is used, because a blocked DNS request consumes only a goroutine, while a blocked C call consumes an operating system thread. When cgo is available, the cgo-based resolver is used instead under a variety of conditions: on systems that do not let programs make direct DNS requests (OS X), when the LOCALDOMAIN environment variable is present (even if empty), when the RES_OPTIONS or HOSTALIASES environment variable is non-empty, when the ASR_CONFIG environment variable is non-empty (OpenBSD only), when /etc/resolv.conf or /etc/nsswitch.conf specify the use of features that the Go resolver does not implement, and when the name being looked up ends in .local or is an mDNS name.

The resolver decision can be overridden by setting the netdns value of the GODEBUG environment variable (see package runtime) to go or cgo, as in:

export GODEBUG=netdns=go    # force pure Go resolver
export GODEBUG=netdns=cgo   # force cgo resolver

The decision can also be forced while building the Go source tree by setting the netgo or netcgo build tag.

A numeric netdns setting, as in GODEBUG=netdns=1, causes the resolver to print debugging information about its decisions.

https://golang.org/pkg/net/#hdr-Name_Resolution

Enough reading, let’s play with it and try to use different DNS client implementations.

Build tags

So as per documentation, we can use environment variables to use one or the other DNS client implementation. This is quite neat as you don’t have to recompile your go code, in order to switch implementations.

Additionally, looking thru the code I found that we can use Go build tags to compile Go binary with different DNS client. In addition we can check, which implementation is actually used, by setting GODEBUG=netdns=1 environment variable and doing an actual DNS call.

By looking thru net package source files, I figured that there is 3 build modes. The build modes are enabled using different build tags. Here is the list:

  1. !cgo – no cgo means that we are forced to use Go’s resolver.
  2. netcgo or cgo – means that we are using libc DNS resolution.
  3. netgo + cgo – means that we are using go native DNS implementation, but we can still include C code.

Let’s try all of these combinations to see how this actually works out.
We need to use different program as none of our programs execute DNS queries. Therefore, we will use this code:


func main() {
    addr, err := net.LookupHost("povilasv.me")
    fmt.Println(addr, err)
}

Let’s build it:


export CGO_ENABLED=0
export GODEBUG=netdns=1
go build -tags netgo

And run it:


./testnetgo

Output:

go package net: built with netgo build tag; using Go's DNS resolver
104.28.1.75 104.28.0.75 2606:4700:30::681c:4b 2606:4700:30::681c:14b <nil>

Now let’s build it with libc resolver:


export GODEBUG=netdns=1
go build -tags netcgo

And run it:


./testnetgo

Output:

go package net: using cgo DNS resolver
104.28.0.75 104.28.1.75 2606:4700:30::681c:14b 2606:4700:30::681c:4b <nil>

And finally let’s use netgo cgo:


export GODEBUG=netdns=1
go build -tags 'netgo cgo' .

And lastly run it:


./testnetgo

Output:

go package net: built with netgo build tag; using Go's DNS resolver
104.28.0.75 104.28.1.75 2606:4700:30::681c:14b 2606:4700:30::681c:4b <nil>

We can see that build tags actually work. Now let’s get back to our virtual memory problem.

Back to Virtual Memory

Now I would like to see how virtual memory behaves with those 3 combinations of build tags with our ex1 program, which is a simple HTTP web server.

Let’s build it in netgo mode:


export CGO_ENABLED=0
go build -tags netgo

Process stats:

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT 
povilasv  3524  0.0  0.0   7216  4076 pts/17   Sl+

As we can see virtual memory is low in this case.

Now let’s do the same with netcgo on:


go build -tags netcgo

Process stats:

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT 
povilasv  6361  0.0  0.0 382296  4988 pts/17   Sl+

As we can see this suffers form large virtual memory size (382296 KiB).

Lastly, let’s try netgo cgo mode:


go build -tags 'netgo cgo' .
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT 
povilasv  8175  0.0  0.0   7216  3968 pts/17   Sl+

And we can see that virtual memory is low in this case (7216 KiB).

So We can definitely roll out the netgo, as it doesn’t use a lot of virtual memory. On the other hand, we can’t yet blame cgo for bloated virtual memory. As our ex1 example, doesn’t include any additional C code, so netgo would actually compile the same thing as netgo cgo, skipping the whole cgo process of building and linking C files.

Therefore, we need to try netcgo and netgo cgo with additional C code included. This would check how program behaves in cgo mode with disabled libc DNS client and cgo mode with enabled libc DNS client.

Let’s try this example:


package main

// #include 
import "C"
import "fmt"

func main() {
  char := C.getchar()
  fmt.Printf("%T %#v", char, char)

  http.HandleFunc("/bar", 
    func(w http.ResponseWriter, r *http.Request) {
        fmt.Fprintf(w, "Hello, %q", 
          html.EscapeString(r.URL.Path))
  })

   http.ListenAndServe(":8080", nil)
}

As we can see this example should work well for us. Because, it uses both cgo and would use netgo or libc DNS client implementations based on build tags.

Let’s try it out:


go build -tags netcgo .

Process stats:

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT 
povilasv 12594  0.0  0.0 382208  4824 pts/17   Sl+

We can see the same virtual memory behavior. Now let’s try netgo cgo:


go build -tags 'netgo cgo' .

Process stats:

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT 
povilasv  1026  0.0  0.0 382208  4824 pts/17   Sl+ 

Finally, we can rule out libc DNS client implementation, as disabling it didn’t help. We can clearly see it is related to cgo.

To explore this further, let’s simplify our program. ex1 program starts an HTTP server, which would be way harder to debug than a simple command line application. So take a look this code:


package main

// #include 
// #include 
import "C"
import (
    "time"
    "unsafe"
)

func main() {
    cs := C.CString("Hello from stdio")
    C.puts(cs)

    time.Sleep(1 * time.Second)

    C.free(unsafe.Pointer(cs))
}

Let’s run it and check it’s memory:


go build .
./ex6

Process stats:

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT 
povilasv 15972  0.0  0.0 378228  2476 pts/17   Sl+

Cool virtual memory is big, this means that we need to investigate cgo.

Checkout Go Memory Management Part 3 for deeper investigation.

That’s it for the day. If you are interested to get my blog posts first, join the newsletter. If you want to support my writing, I have a public wish list, you can buy me a book or a whatever 😉.

Thanks for reading & see you next time!

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.