28 Apr 2023 · Semaphore News

    Service Outage Postmortem: April 28

    3 min read

    On April 28th, 2023, Semaphore experienced an incident that impacted its cloud CI/CD services. The incident started at 22:23 UTC and lasted until 01:43 UTC, during which time job triggering on Semaphore was severely delayed, leading to a partial loss of service for many users.

    Our on-call SRE team was alerted by automated monitoring systems and initiated an investigation into the issue. 

    The root cause was determined to be poorly performing database queries that, under specific conditions, caused the production database CPU usage to spike to 100%. The problem was resolved by running a manual vacuum on the affected table and cleaning up the jobs queue.

    Timeline (all times in UTC)

    • 22:23: Incident start
    • 22:34: On-call SRE team alerted by automated monitoring systems
    • 23:05: Problem identified as database performance issue; issue escalated to additional engineers
    • 00:29: Root cause found: poorly performing queries causing production database CPU usage to spike to 100% under specific conditions
    • 01:05: Manual vacuum completed, resolving the database performance issue
    • 01:24: Job processing queue cleaned up and job processing resumed, returning service to operational state
    • 01:43: Incident end

    Root Cause Analysis

    The drop in database performance was primarily due to the degraded state of the DB table responsible for managing job states. Although this table is relatively small, it experiences a high read/write rate, leading to the rapid accumulation of dead tuples.

    The autovacuum process, which is responsible for cleaning up these dead tuples, was not executing as expected due to increased traffic in other parts of the DB cluster. Because the dead tuples were not being cleaned up, database performance degraded further and load kept increasing.
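
    For readers who want to reproduce this kind of diagnosis, the sketch below shows one way to surface the tables with the most dead tuples and their last autovacuum runs. It assumes the database is PostgreSQL (where dead tuple counts and autovacuum history live in pg_stat_user_tables) and uses an illustrative connection string rather than any real production credentials.

        import psycopg2

        # Connection string is a placeholder; substitute real credentials.
        conn = psycopg2.connect("dbname=example user=postgres")

        with conn.cursor() as cur:
            # pg_stat_user_tables tracks live/dead tuple counts and autovacuum
            # history per table, which is enough to spot a table that autovacuum
            # has been skipping.
            cur.execute("""
                SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
                FROM pg_stat_user_tables
                ORDER BY n_dead_tup DESC
                LIMIT 10;
            """)
            for relname, live, dead, last_autovacuum in cur.fetchall():
                print(f"{relname}: {dead} dead / {live} live tuples, "
                      f"last autovacuum: {last_autovacuum}")

        conn.close()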

    Resolution and Recovery

    Upon identifying the issue, our engineers paused the subsystem responsible for job scheduling and performed a manual vacuum operation on the jobs database table. This operation was completed at 01:05 UTC, at which point the service loss was mitigated, and the system returned to normal. 

    An additional 19 minutes were required for the job queue, which had built up since the start of the incident, to clear. By 01:24 UTC, the system was fully operational again.
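
    For illustration, a manual vacuum of a single table can be issued as in the minimal sketch below. It assumes PostgreSQL and uses a hypothetical table name ("jobs"); it is not the exact procedure run during the incident, which also involved pausing the job-scheduling subsystem. VACUUM cannot run inside a transaction block, so autocommit is enabled first.

        import psycopg2

        # Placeholder connection string, not production credentials.
        conn = psycopg2.connect("dbname=example user=postgres")
        conn.autocommit = True  # VACUUM cannot run inside a transaction block

        with conn.cursor() as cur:
            # Reclaim space held by dead tuples and refresh planner statistics.
            # "jobs" is an illustrative table name, not necessarily the real one.
            cur.execute("VACUUM ANALYZE jobs;")

        conn.close()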

    Corrective Actions and Preventative Measures

    To prevent similar incidents in the future, we will:

    1. Conduct a comprehensive review of our database queries and performance metrics to identify any other potential bottlenecks or areas for improvement.
    2. Implement regular performance testing and monitoring to proactively identify and address performance issues before they escalate to incidents.
    3. Enhance our automated monitoring systems to better detect and alert on potential performance-related incidents (a sketch of one such check follows this list).
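
    As an example of the kind of automated check item 3 refers to, the sketch below flags tables whose dead-tuple ratio exceeds a threshold. The threshold, connection string, and print-based alerting are assumptions made for the example; a real deployment would feed the result into an existing monitoring and paging pipeline.

        import psycopg2

        DEAD_TUPLE_RATIO_THRESHOLD = 0.2  # illustrative threshold, not a recommendation

        conn = psycopg2.connect("dbname=example user=postgres")
        with conn.cursor() as cur:
            cur.execute("""
                SELECT relname, n_live_tup, n_dead_tup
                FROM pg_stat_user_tables
                WHERE n_live_tup + n_dead_tup > 0;
            """)
            for relname, live, dead in cur.fetchall():
                ratio = dead / (live + dead)
                if ratio > DEAD_TUPLE_RATIO_THRESHOLD:
                    # In a real setup this would page the on-call team or emit a metric.
                    print(f"ALERT: {relname} dead-tuple ratio {ratio:.0%} "
                          f"({dead} dead / {live} live)")
        conn.close()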

    We apologize for the inconvenience this incident caused our customers and are committed to continuously improving our service reliability and performance.

    Incident Response Policy Breach

    During the investigation of this incident, we also identified an internal breach of our incident response policy. Our policy mandates that any service disruption impacting multiple users must be communicated through our public status page. Unfortunately, this communication protocol was not followed in this instance.

    To prevent this oversight from happening again, we will:

    1. Re-evaluate and update our internal incident response policy to ensure it is comprehensive, clear, and up-to-date with our current processes.
    2. Conduct additional training sessions with our engineering and support teams to reinforce the importance of adhering to the incident response policy and effectively communicating with our customers during incidents.
    3. Ensure that the SRE team always has a dedicated support team member on standby during incidents. This will guarantee that any progress made in resolving the issue is promptly and accurately communicated to the impacted customers, maintaining transparency and keeping them informed throughout the process.

    We recognize that timely and accurate communication is crucial during service disruptions, and we are committed to improving our processes to ensure transparency and accountability.

    Written by: Aleksandar Mitrovic
    Project and product manager with 10+ years of experience in developing great products and services from idea to market. Always excited to work with excellence-seeking teams.