"We're getting severe Drone outages ... builds just hang"

Slack message about builds hanging in Drone

A rather vague issue with our CI system takes us down through its source code and into the bowels of how the Docker Engine handles json-file logging. The following issue was found in v1.12.1 and fixed in v1.13.0.

What's a Drone?

Drone is the CI/CD service that we use at uSwitch. It uses Docker to provide the tools and services required by jobs, which removes the problem of managing the union of dependencies that usually plagues traditional CI systems.

A Drone agent receives a YAML file that describes the pipeline of the job to be executed. It in turn talks to a Docker daemon to spin up containers, mount the job's workspace, run the requested steps and then retrieve the logs to be stored.

Down the rabbit hole

Frustratingly, the builds were not hanging every time we ran the job, which made initial investigation a little tricky. We could see that all the commands Drone was executing had completed, yet the step never finished.

We attached strace to the Drone agent process that coordinates the execution of a job and could see that it was stuck waiting on the Docker daemon to close a chunked HTTP response that was streaming out the logs. However, the trace for the container itself looked identical whether the build hung or not.

Our Drone agents are running on CoreOS. We lost a bunch of time trying to run GDB and SystemTap through /bin/toolbox, but GDB kept receiving SIGKILL every time it attached and SystemTap wouldn't run due to kernel version issues (UTS_VERSION from compile.h was different from uname -v. Here's the offending comparison in SystemTap and some code to demonstrate the issue).

Eventually we distilled down a failing case we could run locally:

#!/bin/bash

while true
do
        container_id="$(docker run -d ubuntu:latest /bin/bash -e -c "echo -n \"foo\"")"
        docker logs -f "${container_id}"
done

When running this script, after an indeterminate number of container executions the logs command would hang, mirroring the behaviour we saw in Drone. This meant that we could now focus on how Docker streams logs out of a container.

A client that wants to follow the logs of a container makes an HTTP request to the /containers/{name}/logs API endpoint. The handler for the logs asks the back-end (the daemon in this case) for the logs, passing in the http.ResponseWriter as the place to write them out.
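For illustration, here is a minimal sketch in Go of what a client like docker logs -f does against that endpoint, assuming the daemon is listening on the default unix socket at /var/run/docker.sock (the socket path and the use of the raw API are our assumptions, not Drone's code):

package main

import (
	"context"
	"fmt"
	"io"
	"net"
	"net/http"
	"os"
)

func main() {
	containerID := os.Args[1]

	// Talk to the Docker daemon over its unix socket rather than TCP.
	client := &http.Client{
		Transport: &http.Transport{
			DialContext: func(ctx context.Context, network, addr string) (net.Conn, error) {
				return net.Dial("unix", "/var/run/docker.sock")
			},
		},
	}

	// The host in the URL is ignored because we dial the socket ourselves.
	url := fmt.Sprintf("http://docker/containers/%s/logs?follow=1&stdout=1&stderr=1", containerID)
	resp, err := client.Get(url)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// The chunked response only ends when the daemon decides the logs are done,
	// which is exactly what never happened in the hanging builds. (For non-TTY
	// containers the stream is multiplexed with small frame headers; we just
	// dump the raw bytes here.)
	if _, err := io.Copy(os.Stdout, resp.Body); err != nil {
		panic(err)
	}
}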

The daemon then gets the logger associated with that particular container and starts reading its logs. The logs are then written back out until either the client closes its request or the logger signals that it is done.
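The shape of that loop is roughly the following (a simplified sketch with hypothetical channel names, not the daemon's actual code):

package logstream

import "net/http"

// streamLogs pumps log messages to the HTTP response until either the client
// goes away or the logger reports that there is nothing more to send.
func streamLogs(w http.ResponseWriter, r *http.Request, msgs <-chan []byte, loggerDone <-chan struct{}) {
	flusher, _ := w.(http.Flusher)
	for {
		select {
		case <-r.Context().Done(): // the client closed its request
			return
		case <-loggerDone: // the logger says it is done
			return
		case msg := <-msgs:
			w.Write(msg)
			if flusher != nil {
				flusher.Flush()
			}
		}
	}
}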

From this we can see that, since we aren't closing our connection, the logger must for some reason never consider itself finished, even though the container has exited. We then moved on to exploring how the logger worked. In our case we were using the json-file logger.

The logger first gets instantiated when the daemon is asked by libcontainerd to attach to the container's streams. The daemon creates the logger and then starts a copier that copies STDOUT and STDERR from the container into it.
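Conceptually that copier is very simple; something along these lines (an assumed shape for illustration, not the daemon's code), with one goroutine per stream:

package logcopy

import (
	"bufio"
	"io"
)

// copyStream reads a container stream line by line and hands each line to
// the logger. The daemon would run one of these per stream, e.g.
// go copyStream("stdout", stdoutPipe, logLine) and the same for stderr.
func copyStream(stream string, src io.Reader, logLine func(line, stream string)) error {
	scanner := bufio.NewScanner(src)
	for scanner.Scan() {
		logLine(scanner.Text(), stream)
	}
	return scanner.Err()
}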

In the case of the json-file logger, the copier writes log messages into it, which are then written out to a log file in the container's metadata directory. It will happily sit there writing logs into the file until the logger is closed.
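Each line in that file is a small JSON document. The field names below are the real json-file format; the writer itself is just an illustration:

package main

import (
	"encoding/json"
	"os"
	"time"
)

// jsonLogEntry mirrors the shape of one line in
// /var/lib/docker/containers/<id>/<id>-json.log.
type jsonLogEntry struct {
	Log    string    `json:"log"`    // one line of output, trailing newline included
	Stream string    `json:"stream"` // "stdout" or "stderr"
	Time   time.Time `json:"time"`
}

func main() {
	enc := json.NewEncoder(os.Stdout)
	// Prints something like:
	// {"log":"foo\n","stream":"stdout","time":"2017-01-17T10:00:00.00000000Z"}
	enc.Encode(jsonLogEntry{Log: "foo\n", Stream: "stdout", Time: time.Now().UTC()})
}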

Whenever anything asks to read the logs, the logger will open the latest JSON log file and then start following the file. While it is following the file, it will read as many lines as it can before receiving an EOF. It will then wait for an inotify watcher to tell it the file has changed and it should start reading again.

select {
    case <-fileWatcher.Events():
        // the file has changed: go back to reading it
        dec = json.NewDecoder(f)
        continue
    case <-fileWatcher.Errors():
        logWatcher.Err <- err
        return
    case <-logWatcher.WatchClose():
        // the logger is being closed: stop watching the file
        // (this Remove is where the deadlock described below occurs)
        fileWatcher.Remove(name)
        return
    case <-notifyRotate:
        // the log file has been rotated: reopen it and keep following
        f.Close()
        fileWatcher.Remove(name)

        // retry when the file doesn't exist
        for retries := 0; retries <= 5; retries++ {
            f, err = os.Open(name)
            if err == nil || !os.IsNotExist(err) {
                break
            }
        }

        if err = fileWatcher.Add(name); err != nil {
            logWatcher.Err <- err
            return
        }
        if err != nil {
            logWatcher.Err <- err
            return
        }

        dec = json.NewDecoder(f)
        continue
}

It is in the above select statement that the deadlock occurs, causing the logger to never close.

When Docker receives a state change event saying that a container has exited, it resets the container. It then closes the logger, which in turn closes all of its readers. When a reader is closed, it sends an event that followLogs listens for, and followLogs then tries to remove the file from the fileWatcher.
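In simplified terms, each reader holds a close-notification channel, and closing the logger closes those channels; that is what makes the case <-logWatcher.WatchClose(): branch in the select above fire. A sketch of the idea (not Docker's exact types):

package logwatch

// LogWatcher is a sketch of the handle each log reader gets.
type LogWatcher struct {
	closeNotifier chan struct{}
}

// WatchClose is what the `case <-logWatcher.WatchClose():` branch selects on.
func (w *LogWatcher) WatchClose() <-chan struct{} {
	return w.closeNotifier
}

// Close is called when the logger shuts down; closing the channel wakes up
// every follower, which then tries to remove its fileWatcher watch.
func (w *LogWatcher) Close() {
	close(w.closeNotifier)
}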

This is where the fun starts.

When the fileWatcher removes the inotify watch from the file, it waits to receive an IN_IGNORED event. That wait is only released by the goroutine that processes incoming inotify events, and any event that isn't an IN_IGNORED gets pushed by that same goroutine down an unbuffered channel. At this point, though, nothing is reading from that channel, because followLogs is blocked inside the Remove call. So if an inotify event that isn't an IN_IGNORED arrives between the container exiting and the IN_IGNORED being delivered, the event-processing goroutine blocks on the unbuffered channel, the removal never sees its IN_IGNORED, and the two sides deadlock.
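The whole deadlock boils down to a toy version of the two goroutines involved (the names are ours and this is a deliberately simplified model, not Docker's code):

package main

import (
	"fmt"
	"time"
)

// watcher is a toy model of the file watcher: one goroutine dispatches
// "inotify" events over an unbuffered channel and also delivers the
// IN_IGNORED acknowledgement that Remove waits for.
type watcher struct {
	events  chan string   // unbuffered, read by the follower (followLogs in Docker)
	ignored chan struct{} // signalled when IN_IGNORED arrives
}

// dispatch plays the role of the goroutine that processes raw inotify events.
func (w *watcher) dispatch(raw <-chan string) {
	for ev := range raw {
		if ev == "IN_IGNORED" {
			w.ignored <- struct{}{} // wake up Remove
			continue
		}
		w.events <- ev // blocks until the follower reads it
	}
}

// Remove drops the watch and then waits for the IN_IGNORED acknowledgement,
// just like the real removal does.
func (w *watcher) Remove() {
	<-w.ignored
}

func main() {
	w := &watcher{events: make(chan string), ignored: make(chan struct{})}
	raw := make(chan string, 2)
	go w.dispatch(raw)

	// An ordinary event sneaks in just before the acknowledgement.
	raw <- "IN_MODIFY"
	raw <- "IN_IGNORED"

	done := make(chan struct{})
	go func() {
		// The follower has been told to close, so instead of draining
		// w.events it calls Remove, exactly like followLogs does.
		w.Remove()
		close(done)
	}()

	select {
	case <-done:
		fmt.Println("removed cleanly")
	case <-time.After(2 * time.Second):
		fmt.Println("deadlocked: the dispatcher is stuck sending IN_MODIFY, so Remove never sees IN_IGNORED")
	}
}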

The fix

We hoisted the code that handles terminating the fileWatcher out of the processing of inotify events, which removed the potential deadlock.
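In terms of the toy model above, the effect of that change is that the acknowledgement can no longer get stuck behind an event that nobody is going to read. The sketch below shows one way to get that property in the toy model; it is ours, not the exact patch that went upstream, and it reuses the types from the previous snippet:

// fixedWatcher extends the toy model above: Remove announces that a removal
// is in flight, and the dispatcher stops insisting on delivering ordinary
// events once it sees that, so the IN_IGNORED handling is no longer trapped
// behind a blocking send.
type fixedWatcher struct {
	events   chan string
	ignored  chan struct{} // closed when IN_IGNORED arrives
	removing chan struct{} // closed by Remove before it starts waiting
}

func (w *fixedWatcher) dispatch(raw <-chan string) {
	for ev := range raw {
		if ev == "IN_IGNORED" {
			close(w.ignored)
			continue
		}
		select {
		case w.events <- ev:
		case <-w.removing: // a removal is in progress: drop the event rather than deadlock
		}
	}
}

func (w *fixedWatcher) Remove() {
	close(w.removing) // tell the dispatcher it must not block on sends any more
	<-w.ignored       // guaranteed to be reachable now
}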

We raised a pull request against Docker and the fix was released in v1.13.0.