"We're getting severe Drone outages ... builds just hang"
A rather vague issue with our CI system takes us down through its source code and into the bowels of how the Docker Engine handles json-file logging. The following issue was found in v1.12.1 and fixed in v1.13.0.
What's a Drone?
Drone is the CI/CD service that we use at uSwitch. It uses Docker in order to provide any tools and services required by jobs. This removes the issue of managing the union of dependencies that usually plagues traditional CI systems.
A Drone agent receives a YAML file that describes the pipeline of the job to be executed. It in turn talks to a Docker daemon to spin up containers, mount the job's workspace, run the requested steps and then retrieve the logs to be stored.
Down the rabbit hole
Frustratingly, the builds were not hanging every time we ran the job, which made initial investigation a little tricky. We could see that all the commands Drone was executing had completed, yet the step itself never finished.
We attached strace to the Drone agent process that coordinates the execution of a job and could see that it was stuck waiting on the Docker daemon to close a chunked HTTP response that was streaming out the logs. But when we looked at the trace for the container that was hanging, it looked identical whether the build hung or not.
Our Drone agents are running on CoreOS. We lost a bunch of time trying to run GDB and SystemTap through /bin/toolbox, but GDB kept receiving SIGKILL every time it attached, and SystemTap wouldn't run due to kernel version issues (UTS_VERSION from compile.h differed from uname -v; here's the offending comparison in SystemTap and some code to demonstrate the issue).
Eventually we distilled it down to a failing case we could run locally:
#!/bin/bash
while true
do
    container_id="$(docker run -d ubuntu:latest /bin/bash -e -c "echo -n \"foo\"")"
    docker logs -f "${container_id}"
done
When running this script, after an indeterminate number of container executions the logs command would hang, mirroring the behaviour we saw in Drone. This meant we could now focus on how Docker streams logs out of a container.
A client that wants to follow the logs of a container makes an HTTP request to the API at /containers/{name}/logs. The handler for the logs asks the back-end (the daemon in this case) for the logs, passing in the http.ResponseWriter as the place to write them out.
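As a rough illustration (this is not Drone's or Docker's actual code), following a container's logs from a client boils down to something like this. The endpoint and query parameters are the Engine API's; everything else is a sketch:

// A minimal sketch of following a container's logs over the Engine API on the
// local unix socket.
package main

import (
    "context"
    "io"
    "net"
    "net/http"
    "os"
)

func main() {
    containerID := os.Args[1] // assumed: container ID or name passed as the first argument

    // Speak HTTP over /var/run/docker.sock rather than TCP.
    client := &http.Client{
        Transport: &http.Transport{
            DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
                return net.Dial("unix", "/var/run/docker.sock")
            },
        },
    }

    // follow=1 keeps the chunked response open until the logger says it is
    // done - this is the stream our Drone agent was stuck waiting on.
    resp, err := client.Get("http://docker/containers/" + containerID + "/logs?follow=1&stdout=1&stderr=1")
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    // The body is a multiplexed stdout/stderr stream; copying it verbatim is
    // enough for this sketch.
    io.Copy(os.Stdout, resp.Body)
}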
The daemon then gets the logger associated with that particular container and starts reading its logs. The logs are written back out until either the client closes the request or the logger sends a message saying it is done.
From this we can see that, as we're not closing our connection, the logger must, for some reason, never consider itself done even though the container has exited. We then moved on to exploring the logger and how it worked. In our case we were using the json-file logger.
The logger is first instantiated when libcontainerd asks the daemon to attach to the container's streams. The daemon creates the logger and then starts to copy STDOUT and STDERR from the container into it.
In the case of the json-file logger, the copier writes in log messages, which then get written out to a log file in the container's metadata directory. The logger will happily sit there writing into that file until it is closed.
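Each line in that log file is a single JSON object. Here's a small sketch of decoding one; the field names match what the json-file driver writes on disk, but the struct itself is illustrative rather than Docker's own type:

package main

import (
    "encoding/json"
    "fmt"
    "time"
)

// jsonLogLine mirrors the fields the json-file driver writes for each message.
type jsonLogLine struct {
    Log    string    `json:"log"`
    Stream string    `json:"stream"`
    Time   time.Time `json:"time"`
}

func main() {
    // What a single entry in the container's log file looks like.
    line := `{"log":"foo","stream":"stdout","time":"2016-10-24T10:00:00.000000000Z"}`

    var entry jsonLogLine
    if err := json.Unmarshal([]byte(line), &entry); err != nil {
        panic(err)
    }
    fmt.Printf("%s [%s] %q\n", entry.Time.Format(time.RFC3339), entry.Stream, entry.Log)
}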
Whenever anything asks to read the logs, the logger will open the latest JSON log file and then start following the file. While it is following the file, it will read as many lines as it can before receiving an EOF. It will then wait for an inotify watcher to tell it the file has changed and it should start reading again.
select {
case <-fileWatcher.Events():
    dec = json.NewDecoder(f)
    continue
case <-fileWatcher.Errors():
    logWatcher.Err <- err
    return
case <-logWatcher.WatchClose():
    fileWatcher.Remove(name)
    return
case <-notifyRotate:
    f.Close()
    fileWatcher.Remove(name)
    // retry when the file doesn't exist
    for retries := 0; retries <= 5; retries++ {
        f, err = os.Open(name)
        if err == nil || !os.IsNotExist(err) {
            break
        }
    }
    if err = fileWatcher.Add(name); err != nil {
        logWatcher.Err <- err
        return
    }
    if err != nil {
        logWatcher.Err <- err
        return
    }
    dec = json.NewDecoder(f)
    continue
}
It is in the above select statement that the deadlock occurs, causing the logger to never close.
When Docker receives a state change event saying that a container has exited, it will reset the container. It will then close the logger, which in turn will close all readers. When a reader is closed, it sends an event that followLogs listens for; followLogs then tries to remove the file from the fileWatcher.
This is where the fun starts.
When the fileWatcher removes the inotify watch from the file, it waits to receive an IN_IGNORED event. The condition the removal waits on is only released while processing a new inotify event, and any event that isn't an IN_IGNORED gets pushed down an unbuffered channel instead. So if an inotify event that isn't an IN_IGNORED arrives between the container exiting and the IN_IGNORED being sent, the event-processing goroutine blocks on that unbuffered channel (followLogs has stopped reading it; it's stuck inside the removal), the IN_IGNORED is never processed, and the removal never returns. Deadlock.
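Stripped of the inotify plumbing, the shape of that deadlock looks something like the sketch below; the channel and event names are illustrative, not the real watcher API:

package main

import "fmt"

func main() {
    events := make(chan string)    // unbuffered, like the watcher's event channel
    removed := make(chan struct{}) // closed once IN_IGNORED has been seen

    // The watcher goroutine processes raw inotify-style events one at a time.
    go func() {
        for _, ev := range []string{"IN_MODIFY", "IN_IGNORED"} {
            if ev == "IN_IGNORED" {
                close(removed) // releases whoever is waiting for the removal
                continue
            }
            // Blocks forever: the only reader of events is busy waiting for
            // the removal below, so IN_IGNORED is never reached.
            events <- ev
        }
    }()

    // followLogs: the container has exited, so remove the watch and wait for
    // IN_IGNORED - but the watcher is stuck sending IN_MODIFY to us, and we
    // are no longer reading from events.
    <-removed
    fmt.Println("watch removed") // never printed; the Go runtime reports a deadlock
}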
The fix
We hoisted the code that handled terminating the fileWatcher out of the processing of inotify events, which removed the potential deadlock.
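In the spirit of that change, though not the actual patch, the toy example above stops deadlocking once handing an event to the follower no longer blocks when the follower has already gone away:

package main

import "fmt"

func main() {
    events := make(chan string)
    removed := make(chan struct{})
    closed := make(chan struct{}) // signalled when the follower shuts down

    go func() {
        for _, ev := range []string{"IN_MODIFY", "IN_IGNORED"} {
            if ev == "IN_IGNORED" {
                close(removed)
                continue
            }
            select {
            case events <- ev: // normal delivery
            case <-closed: // the follower has gone away; drop the event instead of blocking
            }
        }
    }()

    close(closed) // the container exited; stop following the logs
    <-removed     // now guaranteed to be released
    fmt.Println("watch removed")
}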
We raised a pull request against Docker and the fix was released in v1.13.0.