28 Apr 2023 · Semaphore News

    Service Outage Postmortem: April 28

    3 min read

    On April 28th, 2023, Semaphore experienced an incident that impacted its cloud CI/CD services. The incident started at 22:23 UTC and lasted until 01:43 UTC, during which time job triggering on Semaphore was severely delayed, leading to a partial loss of service for many users.

    Our on-call SRE team was alerted by automated monitoring systems and initiated an investigation into the issue. 

    The root cause was determined to be poorly performing database queries that, under specific conditions, caused the production database CPU usage to spike to 100%. The problem was resolved by running a manual vacuum on the affected table and cleaning up the jobs queue.

    Timeline (all times in UTC)

    • 22:23: Incident start
    • 22:34: On-call SRE team alerted by automated monitoring systems
    • 23:05: Problem identified as database performance issue; issue escalated to additional engineers
    • 00:29: Root cause found: poorly performing queries causing production database CPU usage to spike to 100% under specific conditions
    • 01:05: Manual vacuum completed, resolving the database performance issue
    • 01:24: Job processing queue cleaned up and job processing resumed, returning service to operational state
    • 01:43: Incident end

    Root Cause Analysis

    The drop in database performance was primarily due to the degraded state of the DB table responsible for managing job states. Although this table is relatively small, it experiences a high read/write rate, leading to the rapid accumulation of dead tuples.

    The autovacuum process, which is responsible for cleaning up these dead tuples, was not executing as expected due to increased traffic in other parts of the DB cluster. Because the dead tuples were not being cleaned up, database performance degraded further and load kept increasing.
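
    For readers who want to reproduce this kind of diagnosis, the sketch below shows one way to surface the tables with the most dead tuples and their last autovacuum runs. It assumes the database is PostgreSQL (where dead tuple counts and autovacuum history live in pg_stat_user_tables) and uses an illustrative connection string rather than any real production credentials.

        import psycopg2

        # Connection string is a placeholder; substitute real credentials.
        conn = psycopg2.connect("dbname=example user=postgres")

        with conn.cursor() as cur:
            # pg_stat_user_tables tracks live/dead tuple counts and autovacuum
            # history per table, which is enough to spot a table that autovacuum
            # has been skipping.
            cur.execute("""
                SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
                FROM pg_stat_user_tables
                ORDER BY n_dead_tup DESC
                LIMIT 10;
            """)
            for relname, live, dead, last_autovacuum in cur.fetchall():
                print(f"{relname}: {dead} dead / {live} live tuples, "
                      f"last autovacuum: {last_autovacuum}")

        conn.close()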

    Resolution and Recovery

    Upon identifying the issue, our engineers paused the subsystem responsible for job scheduling and performed a manual vacuum operation on the jobs database table. This operation was completed at 01:05 UTC, at which point the service loss was mitigated, and the system returned to normal. 

    An additional 19 minutes were required for the job queue, which had built up since the start of the incident, to clear. By 01:24 UTC, the system was fully operational again.
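
    For illustration, a manual vacuum of a single table can be issued as in the minimal sketch below. It assumes PostgreSQL and uses a hypothetical table name ("jobs"); it is not the exact procedure run during the incident, which also involved pausing the job-scheduling subsystem. VACUUM cannot run inside a transaction block, so autocommit is enabled first.

        import psycopg2

        # Placeholder connection string, not production credentials.
        conn = psycopg2.connect("dbname=example user=postgres")
        conn.autocommit = True  # VACUUM cannot run inside a transaction block

        with conn.cursor() as cur:
            # Reclaim space held by dead tuples and refresh planner statistics.
            # "jobs" is an illustrative table name, not necessarily the real one.
            cur.execute("VACUUM ANALYZE jobs;")

        conn.close()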

    Corrective Actions and Preventative Measures

    To prevent similar incidents in the future, we will:

    1. Conduct a comprehensive review of our database queries and performance metrics to identify any other potential bottlenecks or areas for improvement.
    2. Implement regular performance testing and monitoring to proactively identify and address performance issues before they escalate to incidents.
    3. Enhance our automated monitoring systems to better detect and alert on potential performance-related incidents (a sketch of one such check follows this list).
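
    As an example of the kind of automated check item 3 refers to, the sketch below flags tables whose dead-tuple ratio exceeds a threshold. The threshold, connection string, and print-based alerting are assumptions made for the example; a real deployment would feed the result into an existing monitoring and paging pipeline.

        import psycopg2

        DEAD_TUPLE_RATIO_THRESHOLD = 0.2  # illustrative threshold, not a recommendation

        conn = psycopg2.connect("dbname=example user=postgres")
        with conn.cursor() as cur:
            cur.execute("""
                SELECT relname, n_live_tup, n_dead_tup
                FROM pg_stat_user_tables
                WHERE n_live_tup + n_dead_tup > 0;
            """)
            for relname, live, dead in cur.fetchall():
                ratio = dead / (live + dead)
                if ratio > DEAD_TUPLE_RATIO_THRESHOLD:
                    # In a real setup this would page the on-call team or emit a metric.
                    print(f"ALERT: {relname} dead-tuple ratio {ratio:.0%} "
                          f"({dead} dead / {live} live)")
        conn.close()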

    We apologize for the inconvenience this incident caused our customers and are committed to continuously improving our service reliability and performance.

    Incident Response Policy Breach

    During the investigation of this incident, we also identified an internal breach of our incident response policy. Our policy mandates that any service disruption impacting multiple users must be communicated through our public status page. Unfortunately, this communication protocol was not followed in this instance.

    To prevent this oversight from happening again, we will:

    1. Re-evaluate and update our internal incident response policy to ensure it is comprehensive, clear, and up-to-date with our current processes.
    2. Conduct additional training sessions with our engineering and support teams to reinforce the importance of adhering to the incident response policy and effectively communicating with our customers during incidents.
    3. Ensure that the SRE team always has a dedicated support team member on standby during incidents. This will guarantee that any progress made in resolving the issue is promptly and accurately communicated to the impacted customers, maintaining transparency and keeping them informed throughout the process.

    We recognize that timely and accurate communication is crucial during service disruptions, and we are committed to improving our processes to ensure transparency and accountability.

    Written by: Aleksandar Mitrovic
    Project and product manager with 10+ years of experience in developing great products and services from idea to market. Always excited to work with excellence-seeking teams.