On Thursday, October 23rd, after the rollout of a planned platform update at 13:30 UTC, Semaphore experienced issues that delayed builds and deploys and degraded performance. First, we want to apologize for that. We know it messed up your workdays. That’s not how we want to do things, and we will do better next time. Second, I’d like to take a moment to explain what happened.
Build errors after platform update
More than a few projects experienced unexpected build failures related to the mysql2 gem. The cause was that this platform update migrated all projects to a new OS version, Ubuntu 14.04, which ships different system libraries. Since Semaphore caches your project’s git repository and installed dependencies between builds, in some cases cached dependencies that rely on system libraries, such as Ruby gems, no longer worked.
While we did our best to help everyone as soon as they raised the issue, whether on support, live chat or Twitter, this was not the way we intended things to go. Our goal is that you don’t have to be aware of such details, and that you never hit unexpected failures which require action on our end.
For this reason, we immediately shipped a small tool to clear a project’s dependency cache, now available in project settings. And at 17:02 UTC we cleared the cache for all projects. This resulted in new git clones and dependency installations for all, but without unexpected failures. We will be evaluating how to do this more granularly when a similar major update comes along next time.
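One way to make this kind of cache invalidation more granular is to include the platform version in the cache key, so that a platform update simply misses the old cache instead of reusing it. The following is only a minimal sketch of that idea, not a description of how Semaphore's cache works; the function `cache_key`, the key layout, and the example project and platform names are all hypothetical.

```python
import hashlib


def cache_key(project_id: str, platform_version: str, lockfile_contents: str) -> str:
    """Build a dependency cache key (hypothetical layout).

    The key changes whenever the platform version or the dependency
    lockfile changes, so stale dependencies are never restored onto a
    platform they were not installed on.
    """
    lock_digest = hashlib.sha256(lockfile_contents.encode("utf-8")).hexdigest()[:12]
    return f"{project_id}/{platform_version}/{lock_digest}"


# Placeholder values, for illustration only.
lockfile = "mysql2\n"
old_key = cache_key("my-project", "previous-platform", lockfile)
new_key = cache_key("my-project", "ubuntu-14.04", lockfile)

# Same project and lockfile, different platform: the keys differ,
# so the first build on the new platform starts from a clean cache.
print(old_key != new_key)
```

With a scheme like this, only the caches tied to the old platform would be skipped, and no global cache clear would be needed.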
Slow build queue
At the same time, our infrastructure was experiencing a larger issue where machines were going down at a very high rate. While we have a system that automatically detects this situation and reschedules any jobs that were running on an affected machine, it couldn’t solve the problem completely because the failures were happening too fast, adding more and more jobs to our build queue. We considered increasing our capacity but realized that it would not remedy the problem.
At 18:50 UTC we identified a memory leak in builds using Java-based services, such as Solr and Cassandra, which was causing build servers to fail. After some consideration we settled on a first guess that it was caused by the platform update’s switch to Oracle JVM as the default JVM, and by 21:46 UTC we shipped a global revert back to OpenJDK.
It took some more time to become evident that this was not the solution. Eventually we realized that the only remaining change was a memory limiter on the LXC containers that run the builds, which caused unexpected behaviour when certain Java processes hit the limit. We reverted this implementation of memory limiting at 22:10 UTC, and all builds were able to start and finish normally. At 22:59 UTC the build queue was clear (as we announced on Twitter) and new builds were starting at normal speed.
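To illustrate the rescheduling behaviour described above, here is a minimal sketch, assuming a simple in-memory model. The function `reschedule_failed`, the data structures, and the machine and job names are hypothetical and are not taken from Semaphore's actual scheduler.

```python
from collections import deque


def reschedule_failed(machine_up: dict, running: dict, queue: deque) -> list:
    """Put jobs from machines that are down back at the front of the queue.

    machine_up maps machine name -> True if the machine is healthy.
    running maps machine name -> the job currently running on it.
    Returns the list of jobs that were rescheduled.
    """
    rescheduled = []
    for machine, job in list(running.items()):
        if not machine_up[machine]:
            del running[machine]      # the job is no longer running anywhere
            queue.appendleft(job)     # interrupted jobs go ahead of waiting ones
            rescheduled.append(job)
    return rescheduled


machine_up = {"m1": True, "m2": False}
running = {"m1": "job-a", "m2": "job-b"}
queue = deque(["job-c"])

moved = reschedule_failed(machine_up, running, queue)
print(moved, list(queue))
```

A mechanism like this only keeps the queue short if jobs finish faster than machines fail. When machines go down before their jobs complete, every pass returns work to the queue while new jobs keep arriving, which is why the queue kept growing.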
What we’ll do to avoid this in the future
While we do extensively test every platform update before the final release, it is not possible to recreate every possible scenario that comes from real-world usage of the service. For this reason, our current plan is to extend our infrastructure to securely run copies of a fraction of jobs with a new version of the platform.
Once again we would like to apologize to you for an interruption in your work. We know how important Semaphore’s CI service is to you, and while this is our first situation worthy of a post-mortem in more than two years, we see failures as inevitable. It is our imperative, however, to shield you from them as much as we can.
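As a rough illustration of how a fraction of jobs could be selected for such a copy run, here is a minimal sketch that picks jobs deterministically by hashing their IDs. This is an assumption about one possible approach, not the planned implementation; the function `in_shadow_sample` and the job ID format are hypothetical.

```python
import hashlib


def in_shadow_sample(job_id: str, fraction: float) -> bool:
    """Decide whether a copy of this job should also run on the new platform.

    The job ID is hashed to a 64-bit bucket, so the same job ID always
    gets the same answer and roughly `fraction` of all jobs are selected.
    """
    digest = hashlib.sha256(job_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < fraction * 2**64


# Select about 10% of 10,000 hypothetical jobs.
job_ids = [f"job-{n}" for n in range(10_000)]
selected = [job_id for job_id in job_ids if in_shadow_sample(job_id, 0.10)]
print(len(selected))
```

The copies would run alongside the real builds and only their results would be compared, so a failure on the new platform would surface before the rollout without affecting anyone's build.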