First time writing here, just wanted to share this story of how my stubbornness led me into looking at everything but where I should have.

Context

I maintain the backing infrastructure of a small company. As part of our software development process, developers push code to GitLab, and for each push a container image is built and deployed to the correct environment.

The build step happens inside the company's Kubernetes cluster and is achieved via a procedure called "Docker-in-Docker":

  1. A temporary dockerd instance is started
  2. In the same Kubernetes pod, a docker client uses the local Docker daemon (containers inside the pod share the network namespace) to perform the build, in a way that is completely analogous to a local docker build invocation
  3. The built image is then pushed to the company's Docker registry
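
Steps 2 and 3 can be sketched roughly as follows. This is only an illustration, not the actual CI job: the image name, the DOCKER override variable and the plain-TCP port 2375 are assumptions made for the example.

```shell
#!/bin/sh
# Sketch of steps 2 and 3: build against the dockerd running in the same pod,
# then push the result to the registry.
# DOCKER is a hypothetical override hook (handy for dry runs); it defaults to
# the real docker client.
build_and_push() {
  image="$1"          # e.g. registry.example.com/app:tag (hypothetical name)
  context="${2:-.}"   # build context, defaults to the current directory
  # Containers in a pod share the network namespace, so the daemon is
  # reachable on localhost. Port 2375 is an assumption about its setup.
  DOCKER_HOST="${DOCKER_HOST:-tcp://localhost:2375}"
  export DOCKER_HOST
  "${DOCKER:-docker}" build -t "$image" "$context" || return 1
  "${DOCKER:-docker}" push "$image"
}
```

Calling it with DOCKER=echo prints the two commands instead of running them, which is enough to check the wiring without a daemon.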

Some of these container images have been failing their build step in unexpected ways, in seemingly unrelated places and without any consistency between the various failures.

Initial assessment

The most flaky builds were the ones that either

  • Downloaded packages off the internet (commands like composer install or npm install)
  • Required long-lived connections

Taking these factors into account, I realized that the problem was probably network related.

Misleading error messages

My troubleshooting ran into many red herrings: symptoms that were ultimately caused by the problem I was looking for, but that drew my attention to unrelated components.

For instance, in a PHP image's Dockerfile, pecl (a PHP extension package installer) was indicating failures with the following message

unable to connect to ssl://pecl.php.net:443 (Unknown error) in Proxy.php on line 183

I took a look at the source code

$fp = @fsockopen($host, $port, $errno, $errstr);
if (!$fp) {
  return PEAR::raiseError("Connection to `$host:$port' failed: $errstr", $errno);
}

And I was even more confused: the fsockopen function is pretty much a wrapper around OS libraries, so the Unknown Error could lie in either

  • OpenSSL
  • The operating system itself

Unluckily, I started from the wrong pick in this list: permuting OpenSSL versions and PHP configuration options did not fix the problem.

Frustrated by the PHP situation, and without fully realizing the cause of the problem, I decided to tackle the other problematic build: an npm install taking 2 hours to complete.

A quick speedtest was clocking the builder node's download speed at 984 Mbps so I quickly ruled out the idea of blaming my provider's connection.

The npm install was failing with a weird behaviour: after some packages, it got stuck forever downloading.

The Node process's event loop was completely stuck with many async functions waiting for their data and never receiving any "read available" event from the OS.

This second failure made me finally realize where the problem was: the network interface of the pod running inside Kubernetes.

Reproducing locally

To rule out flaws in the affected Dockerfiles, I built the images myself using the Docker daemon installed on my machine, and it worked every single time.

To be super cautious I also performed the same tests with Podman, yielding the same results.

I then started a local Kubernetes cluster leveraging the minikube utility and spawned a dockerd daemon pod in it. All of the builds performed in this environment worked flawlessly.

Despite knowing that the issue was network related, I stubbornly refused to acknowledge reality and insisted on reproducing the GitLab instance and GitLab runner configuration locally. In my mind, despite knowing how GitLab runner was setting up pods (and understanding that it had no possible effect on networking), this step was necessary to confirm 100% that the issue was not with GitLab.

I decided to bring out the big guns: using Vagrant I deployed a 5-node cluster (with the same CNI as the production one), installed GitLab, installed GitLab runner, watched my laptop burst into flames a couple of times and, to my great surprise, everything worked reliably as intended.

Inspecting production

At this point you might be wondering: "Why didn't you take a look at the network traffic before?". I did: I looked at 2.5GB of PCAPs in Wireshark, but there was nothing in there except correct-looking traffic. You could see the traffic leaving the pod, NAT in action and cross-node routing working correctly. This is what prompted me to perform all of the troubleshooting steps mentioned above.

The eureka moment

Do you know how Docker default networking works?

  1. Docker sets up a default bridge
  2. Containers are attached with Virtual ETHernets to that bridge
  3. NAT shenanigans are set up to allow container <-> world interaction

How is the MTU ("Biggest size of a packet on an interface") set for this bridge? It's configurable and defaults to 1500.

So where is the problem here?

In my production environment some nodes are not in the same subnet and my CNI plugin (Calico) connects subnets by establishing IP-in-IP encapsulation between the affected hosts.

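So the real MTU of the link is 1500 (for instance), but the effective MTU is the real one minus 20 bytes (the IP-in-IP header overhead). This MTU of 1480 is correctly set on the network interface of the pods, but it is not set on the default bridge that dockerd creates. Containers on that bridge therefore emit packets of up to 1500 bytes, which are too large for the pod's 1480-byte interface.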
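
The arithmetic is small enough to write down. The helpers below are a sketch with made-up function names; the only facts they encode are the 20-byte IP-in-IP overhead mentioned above and the 20-byte IPv4 (no options) plus 8-byte ICMP headers used to size a ping probe.

```shell
#!/bin/sh
# effective_mtu LINK_MTU OVERHEAD
# Bytes left for the inner packet once the encapsulation header is added.
effective_mtu() {
  echo $(( $1 - $2 ))
}

# ping_payload MTU
# Largest ICMP echo payload that fits in a single packet of that MTU:
# MTU minus the IPv4 header (20 bytes, no options) and ICMP header (8 bytes).
ping_payload() {
  echo $(( $1 - 20 - 8 ))
}

# fits PACKET_SIZE MTU
# Succeeds when a packet of PACKET_SIZE bytes needs no fragmentation.
fits() {
  [ "$1" -le "$2" ]
}
```

With the numbers from this story, effective_mtu 1500 20 gives 1480 and ping_payload 1480 gives 1452. On Linux with iputils ping, something like ping -M do -s 1452 <some host> sends a probe of exactly that size with fragmentation prohibited, which is one way to check whether full-size packets make it across a path.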

This also explains why my local testing did not exhibit any problems: all of the nodes there were L2 and / or subnet adjacent, so no IP-in-IP encapsulation was involved.

Solving the problem

So, now that I discovered the error it's time to solve it. Two approaches came to my mind:

  1. Proposing auto MTU detection to the upstream moby/moby project
  2. Creating an ugly hack

Option 1 would be the cleanest but might break others' workflows, so I went with 2:

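# Start from the Docker default and lower it to the smallest MTU
# found on any interface visible in the pod.
MTU=1500
for card in /sys/class/net/*; do
  [ -f "$card/mtu" ] || continue
  card_mtu=$(cat "$card/mtu")
  if [ "$card_mtu" -lt "$MTU" ]; then
    MTU=$card_mtu
  fi
done
exec dockerd --mtu="$MTU"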

This has fixed the problem for me (please take into account that this will set the MTU to the lowest value found among the available interfaces, which is not necessarily the correct one).
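
A slightly more targeted variant would be to use the MTU of the interface that carries the default route instead of the lowest one overall. The sketch below is untested in my setup and rests on two assumptions: the traffic is IPv4, and the pod's default-route interface is the one affected by the encapsulation. The function names are made up; the optional file and directory arguments exist only so the logic can be tried against fake data.

```shell
#!/bin/sh
# default_iface [ROUTE_FILE]
# Name of the interface carrying the IPv4 default route. In /proc/net/route
# the default route is the entry whose Destination column is 00000000.
default_iface() {
  awk 'NR > 1 && $2 == "00000000" { print $1; exit }' "${1:-/proc/net/route}"
}

# iface_mtu IFACE [SYSFS_NET_DIR]
# MTU of a single interface as reported by sysfs.
iface_mtu() {
  cat "${2:-/sys/class/net}/$1/mtu"
}

# default_route_mtu [ROUTE_FILE] [SYSFS_NET_DIR]
# MTU of the default-route interface; fails if there is no default route.
default_route_mtu() {
  iface="$(default_iface "${1:-/proc/net/route}")"
  [ -n "$iface" ] || return 1
  iface_mtu "$iface" "${2:-/sys/class/net}"
}
```

It would then be used as exec dockerd --mtu="$(default_route_mtu)", keeping the lowest-MTU loop as a fallback for pods without a default route.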

Lessons learned

  1. Try as hard as you can to understand all of the components involved in a system. Here there were MANY layers of networking to pass through (build container =VETH=> docker bridge =NAT=> CNI interface =...=> ... =...=> Node interface)
  2. Try to focus on one problem at a time. If you encounter a different issue, take note of it but troubleshoot it AFTER resolving the initial problem: if the two were correlated, it will probably be solved already.
  3. Try to really work ON the problem and not AROUND it: if the problem is with Docker + dockerd in Kubernetes, installing GitLab locally probably isn't the troubleshooting step that's needed.
  4. While it's cool to throw things at the wall and see what sticks, if you feel like you aren't making progress in diagnosing an issue, stop whatever you're doing and draw all of the components of the affected system on a piece of paper. Your problem will be in one or more of the boxes you have drawn. For each of these components, think about its role in the problem, if any.
  5. Understanding why something is not working by using your knowledge is much more effective and satisfying than looking at gigabytes of logs / thousands of metrics / millions of network packets.