From Tuesday, December 12 until Sunday, December 17, the Semaphore build cluster suffered from sporadic network instabilities due to a faulty device at a Tier 1 network provider.
We know how much you rely on Semaphore, and our top priority is to provide a reliable service. To the customers who were affected: we are deeply sorry for the instabilities you faced during these days. In this post, we will share the technical details and challenges we faced, the steps we took to resolve the issues, and what we plan to do to prevent similar issues from happening in the future.
Day 1 – Tuesday, December 12, 2017
The first report of network instabilities arrived at 01:51 UTC. A customer reported that network latency to GitHub was unusually high, and that as a result, dependency installation would sometimes fail. For the next 10 hours we didn’t receive any other support requests related to this.
At 14:20 UTC, after similar support requests began to arrive, we opened a ticket with our hosting provider and simultaneously started tracing routes, hoping to discover an obvious source of packet loss. We did not find one. Initial communication with our hosting provider revolved around OS-level IPv6 settings and packet size configuration, but this did not result in a solution.
By the end of the day, we knew that sporadic issues were present when establishing Git network connections and pulling Docker images. The failures surfaced as “TLS handshake timeout” errors. We were not sure whether the situation should be escalated further as an incident.
Day 2 – Wednesday, December 13, 2017
Further reports from users made it clear that the issue should be escalated, and we formed an engineering team dedicated to resolving it. By 11:00 UTC, we had developed a way to reproduce the “TLS handshake timeout” errors. To reproduce the error once, we needed to perform a docker pull about 1,000 times.
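Our actual reproduction script is not shown in this post; the sketch below only captures the idea of running an operation many times and counting failures. The function name, command, and iteration count are all illustrative.

```shell
#!/usr/bin/env bash
# Sketch of a reproduction loop: run a command many times and report how
# often it fails. With a failure rate around 0.1%, roughly 1,000 runs are
# needed to see the error once.
run_and_count_failures() {
  local cmd="$1" runs="$2" failures=0
  for ((i = 1; i <= runs; i++)); do
    if ! $cmd >/dev/null 2>&1; then
      failures=$((failures + 1))
    fi
  done
  echo "$failures/$runs failed"
}

# During the incident this was roughly (image name illustrative):
#   run_and_count_failures "docker pull alpine" 1000
```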
We shared this information with the networking department of our hosting provider. However, we made a mistake: our reproduction script exercised only docker pull, which got us caught up in a conversation about possible Docker-specific causes of TLS handshake timeouts, many of which are documented on the web.
At 13:57 UTC, our hosting provider changed its transit route toward a part of AWS, AS14618. We continued to run our new network tests, but because the failure rate was still non-zero, we concluded that the transit route change had no effect. At 14:36 UTC, we sent that conclusion, along with the steps to reproduce the issue, to our hosting provider. This was a major mistake on our end since, in hindsight, we drew our conclusion from a statistically insignificant sample: it was too small, and the time frame too short.
In a response from our hosting provider, we learned that the default transit routing is based on a Tier 1 network link operated by Telia. By 15:25 UTC, our network transit was back on the Telia-based network.
At 17:15 UTC, we received an email from GitHub support. That day, they had received two additional reports of similar issues from customers who also connect via Telia.
We passed this information on to our hosting provider, who once again, at 17:19 UTC, changed routing to avoid Telia. We ran our network tests, which at this point originated from a bigger slice of our build cluster, and while the error rate was smaller, it was still non-zero.
We continued to suspect that the fault was somewhere in our hosting provider’s infrastructure. Meanwhile, we expanded our network tests so they would run continuously on all servers and target multiple endpoints.
Day 3 – Thursday, December 14, 2017
At 8:48 UTC, we requested that our provider revert the routing change back to Telia, with the goal of reducing the number of variables in the problem, as we continued to detect sporadic network issues. We got some tips from our provider’s networking department about limiting bandwidth and the number of concurrent TCP connections. We tried them out, but nothing changed. At this point we decided to bring in an external networking expert: we felt we were deep in a rabbit hole, and another set of eyes would help.
At 12:26 UTC, after consulting with a networking expert, we concluded that the problems were in a part of the networking stack that is out of our control, and that the expert might be of little further help. Working together, however, we figured out that in less than 1% of cases, the packets responsible for the TLS handshake either did not arrive or arrived with a significant delay. When we reproduced this with curl, which does not enforce a TLS handshake timeout by default, the effect was that in problematic cases the program hung for a minute or two before establishing a connection, or in some cases waited forever. The TLS implementations used in Git and Docker clients, by contrast, set a timeout which would expire and raise an exception.
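curl can make this handshake delay visible directly: its `--write-out` variable `time_appconnect` reports the time from the start of the request until the TLS handshake completed, and `--max-time` bounds the probe so a stuck handshake cannot hang forever. A minimal probe along those lines (the endpoint is illustrative, and these are not the exact commands we ran):

```shell
# Print the time (in seconds) it took to complete the TLS handshake for
# one request. --connect-timeout bounds the TCP connect, --max-time bounds
# the whole probe, so a hanging handshake shows up as an error instead of
# blocking for minutes.
tls_handshake_time() {
  local url="$1"
  curl --silent --output /dev/null \
       --connect-timeout 10 --max-time 30 \
       --write-out '%{time_appconnect}\n' \
       "$url"
}

# Example (endpoint illustrative):
#   tls_handshake_time https://github.com
```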
At 14:29 UTC, we received a confirmation that our provider’s networking department had made another routing change. We continued running tests from different parts of our cluster, and results were promising, but it turned out that test results varied depending on the subnet in our provider’s network. This was a valuable insight, as it would help us draw conclusions from a statistically relevant sample. At this point we started running TLS handshake tests from all our servers in parallel, and discovered that there were subnets which were much more stable than the others.
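The aggregation step can be sketched as follows, assuming each server appends one `<subnet> <ok|fail>` line per test into a shared results file. The input format and function name are our assumptions for this sketch, not the exact tooling we used:

```shell
# Compute a per-subnet failure rate from test results collected across all
# servers. Input: a file of "<subnet> <ok|fail>" lines. Output: one
# "<subnet> <rate>%" line per subnet, which makes unstable subnets stand out.
subnet_failure_rates() {
  awk '{
    total[$1]++
    if ($2 == "fail") failed[$1]++
  }
  END {
    for (s in total)
      printf "%s %.1f%%\n", s, 100 * failed[s] / total[s]
  }' "$1"
}
```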
At 15:00 UTC, we established contact with Stripe’s priority support, using a point of contact we received from one of our customers. Stripe’s API was one of the endpoints affected by the network issues.
At 18:51 UTC, we deployed a pair of Squid proxy servers, plus a system-level configuration to use them for all HTTP traffic in the Semaphore Platform. We also deployed a configuration to reroute traffic to Docker Hub, which eliminated TLS handshake errors.
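The post does not include our exact configuration; a conventional system-level way to point HTTP(S) traffic at a Squid pair is via the standard proxy environment variables, which most CLI tools (curl, git over HTTPS, package managers) honor. Hostnames and port below are placeholders:

```shell
# Placeholder system-wide proxy settings (e.g. in a profile script):
# route outbound HTTP(S) through the internal Squid proxy.
export http_proxy="http://squid-1.internal:3128"
export https_proxy="http://squid-1.internal:3128"
# Keep local and internal traffic direct.
export no_proxy="localhost,127.0.0.1,.internal"
```

Note that the Docker daemon does not read these shell variables and needs its own proxy configuration (for example, via a systemd drop-in), which is one reason Docker Hub traffic required a separate configuration.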
At 20:12 UTC, we passed the information about stable vs. unstable subnets to the networking department of our provider, but it did not help them isolate the fault. We requested escalation of the issue, but that was not something they could do. At this point, things heated up a bit between our team and our hosting provider. Looking back, we might have benefited from more transparency into what they were doing, and from better communication between us and their upstream provider. Their readiness to help was not an issue.
Day 4 – Friday, December 15, 2017
At 00:53 UTC, we decided to split our further efforts into three branches. First, we would develop a system-wide retry command in the Platform, and run continuous network scanning and testing across the entire build cluster. Second, we would set up a network tunnel from our hosting provider to our US-based networks, and develop a definitive test of whether the root cause was in our hosting provider’s network or a network beyond it. Third, we would create a new build cluster based in another region (US).
At 8:58 UTC, we received a request from our hosting provider to send them additional information about the timeframes when problems were observed, and the IP addresses that our customers connected to during those timeframes. We shared that information and continued our development efforts.
At 15:16 UTC, we received a notice from our hosting provider that they had made some routing changes in cooperation with GitHub. After running our tests, we confirmed that all communication with GitHub was operating normally.
At 15:54 UTC, we forwarded examples of IP addresses that were still affected to our hosting provider. We didn’t hear any good news from them regarding AWS-based systems for the remainder of the incident, as they were unable to get input from the Amazon NOC team, despite being a direct contact partner.
At 19:34 UTC, we shipped the retry command in the Semaphore Platform. We also cancelled the plan to set up network tunnelling in the build cluster, in order to focus our efforts on launching a build cluster in another region.
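The post does not document the retry command’s exact interface, so the following is only a hedged sketch of the idea: run a command up to N times, pausing between attempts, and succeed as soon as the command does.

```shell
# Sketch of a retry wrapper (interface is illustrative, not the shipped
# command): retry <attempts> <pause-seconds> <command...>
retry() {
  local attempts="$1" pause="$2"
  shift 2
  local i
  for ((i = 1; i <= attempts; i++)); do
    if "$@"; then
      return 0          # command succeeded; stop retrying
    fi
    [ "$i" -lt "$attempts" ] && sleep "$pause"
  done
  return 1              # all attempts failed
}

# Example: retry 3 5 git pull origin master
```

Wrapping flaky network steps like dependency installation or image pulls this way turns a sporadic sub-1% failure into a near-certain success across a few attempts.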
At 21:49 UTC, we were informed that the routing change on GitHub’s end had to be reverted due to high traffic. At 23:08 UTC, we had evidence that communication issues with GitHub were back, and we shared that information with our provider.
Day 5 – Saturday, December 16, 2017
At 00:51 UTC, we received a note from Stripe support: according to their traces, the network issues began at the edge of our provider’s network. They also advised us to spin up an AWS EC2 instance in us-west-2 to help with the diagnosis on our side, as that should allow us to approximate the route back from Stripe to the Semaphore build cluster.
At 11:25 UTC, GitHub support informed us that they were working on the issue with Telia directly. GitHub had reverted the routing fix during morning PST (their high-traffic time), but routed around Telia again for the duration of the weekend while they worked the issue out. Communication between Semaphore and GitHub was back to normal.
This was the critical point, as it appears that collaboration with GitHub finally drove Telia to detect and replace the faulty device in their network. This was confirmed in an email that we received from GitHub on Monday, December 18, which stated that Telia had replaced a line card in a router that was throwing errors.
Day 6 – Sunday, December 17, 2017
At 02:51 UTC, we successfully provisioned the first servers in the new build cluster, and continued working on it in the morning.
At 16:08 UTC, our hosting provider mentioned a BGP hijacking incident in their outer network. Its initial timestamp matched the time when our network issues began, which we mentioned in a public status update. Subsequent investigation showed that it was unrelated.
At 17:23 UTC, all network tests — toward GitHub, Stripe, DockerHub, and some AWS endpoints — reported 100% success in the build cluster. At 20:21 UTC, we completed a Grafana dashboard which continuously collects network tests from the build cluster. All reported 100%.
At 21:51 UTC, we successfully launched the first servers in the new build cluster to production. The next day, we expanded the new build cluster and moved about a third of all jobs there, even though network issues were no longer present.
This was a hard lesson in how things far from our circle of influence can go wrong and impact the Semaphore user experience. To make sure that you don’t suffer from similar problems in the future, we plan to do the following:
- Build the capability to provision a new build cluster in another region, on demand, within 1 hour of detecting a similar network outage. That new build cluster would then completely replace the default build cluster.
- We have already developed a system that continuously monitors the networks that might be affected. Our current engineering efforts are focused on completing the on-demand multi-region provisioning project. As you might be aware, Semaphore’s build cluster is based on bare-metal servers, which is why Semaphore can provide 2x faster CI/CD workers compared to competitors.
- Develop continuous data collection from network tracing. We can use this data to escalate any similar issues with our hosting provider and the Tier 1 providers.
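As an illustration of that last point, continuous tracing data collection can be as simple as a cron entry that periodically records per-hop loss and latency with mtr. The endpoint, schedule, and log path below are placeholders, not our actual setup:

```
# /etc/cron.d/net-trace (placeholders throughout)
# Every 5 minutes, record a 10-cycle per-hop loss/latency report; these
# logs can later back up escalations with the hosting or Tier 1 provider.
*/5 * * * * root mtr --report --report-cycles 10 --no-dns github.com >> /var/log/net-trace/github.log 2>&1
```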
We realize how important Semaphore is to the workflows that enable your projects and businesses to succeed. All of us at Semaphore would like to apologize for the impact of this outage. We will continue the work needed to prevent similar issues from blocking your development pipelines in the future. We’d like to thank those of you who reached out to us during the incident for your patience, support, and assistance in identifying the issues.