Semaphore Blog

News and updates from your friendly continuous integration and deployment service.

Faster Rails: Indexing Large Database Tables Without Downtime

This article is part of our Faster Rails series. Check out the previous article about proper database indexing.

As the scope and size of a Rails project grows, actions that were blazingly fast can become slow, and even downright unacceptable. The cause is often the rapid growth of your database tables, which makes lookups and updates much slower. If this is the case, adding missing indexes to your database is a cheap and easy way to drastically improve the performance of your application.

However, adding a new index to a database table that’s already big can be dangerous. Don’t forget, index creation on a database table is a synchronous action that prevents INSERT, UPDATE, and DELETE operations until the full index is created. If the system is a live production database, this can have severe effects. Indexing very large tables can take many hours. For a system like Semaphore, even short periods are unacceptable. If this happens during deployment, we can potentially cause an unwanted downtime for the whole system.

note: There might be a database vendor that doesn’t lock the table by default. We are mostly familiar with PostgreSQL and MySQL. Both of them lock write access on your table while the index is being created.

Building Indexes Concurrently

PostgreSQL – our database of choice while developing Semaphore – has a handy option that enables us to build indexes concurrently without locking up our database.

For example, let’s build an index concurrently for branches on the build model:

CREATE INDEX CONCURRENTLY idx_builds_branch ON builds USING btree (branch_id);

The main benefit of concurrent index creation is that it does not take a lock that blocks writes to the table while the index tree is being built, so we can avoid accidental downtime.

Keep in mind that while concurrent index building is a safe option for your production system, the build itself takes up to several times longer to complete. The database must perform two scans of the table, and it must wait for all existing transactions that could modify or use the index to terminate. The concurrent index build also imposes extra CPU and I/O load that might slow down other database operations.

Concurrent Index Creation in Rails

In Rails Migrations, you can use the algorithm option to trigger a concurrent index build on your database table.

For example, we recently noticed that we were missing a database index for accessing our build_metrics database table from our build models, which in a snowball effect slowed down job creation on Semaphore.

Our build_metrics table is huge, counting many millions of elements, and it’s also accessed very frequently. We could not risk introducing a migration that would lock this table and potentially block build processing on Semaphore.

We used the safe route, and triggered a concurrent index build:

def change
  add_index :builds, :build_metric_id, :algorithm => :concurrently
end

However, we immediately learned that you can’t run the above from inside of a transaction. Active Record creates a transaction around every migration step. To avoid this, we used the disable_ddl_transaction! method introduced in Rails 4 to run this one migration without a transaction wrapper:

class AddIndexToBuildMetricIdOnBuilds < ActiveRecord::Migration
  disable_ddl_transaction!

  def change
    add_index :builds, :build_metric_id, :algorithm => :concurrently
  end
end

The results were phenomenal. With this simple little tweak, our job processing capabilities got around 2.5 times faster.

2.5x Job Processing

Small tweaks can sometimes bring great improvements. Premature optimization can be a huge anti-pattern, but investing in metrics and gaining a deep understanding of your system never is.

Keep building and tweaking!

At Semaphore, we’re all about speed. We’re on a mission to make continuous integration fast and easy. Driven by numerous conversations with our customers and our own experiences, we’ve built a new CI feature that can automatically parallelize any test suite and cut its runtime to just a few minutes - Semaphore Boosters. Learn more and try it out.

Speeding Up Rendering Rails Pages with render_async

Adding new code to Rails controllers can bring a couple of problems with it. Sometimes controller actions get really big, and they tend to do a lot of things. Another common problem is an increase in data over time, which can lead to slow page loading time. Adding new code to controller actions can also sometimes block the rendering of some actions if it fails, breaking user experience and user happiness.

Here at Semaphore, we came across these types of problems a couple of times. We usually resolved them by splitting controller actions into smaller actions, and rendering them asynchronously using plain JavaScript.

After some time, we saw that this can be extracted to render_async, a gem that speeds up Rails pages for you: it loads content into your HTML asynchronously by making an AJAX call to your Rails server.

Problem no. 1: Slowness accumulates over time

As new code gets added, Rails controller actions can get “fat”. If we’re not careful, page load time slowly increases as the amount of code and data rises.

Problem no. 2: Dealing with code that blocks your actions

As we add new code to our controllers, we sometimes need to load extra data in the controller action in order to render the complete view.

Let’s take a look at an example of code that blocks the rendering of an action.

Let’s say we have a movies_controller.rb, and in the show action we want to fetch a movie from the database, but we also want to get the movie rating from IMDB.

class MoviesController < ApplicationController
  def show
    @movie = Movie.find_by_id(params[:id])

    @movie_rating = IMDB.movie_rating(@movie)
  end
end

Fetching the movie with find_by_id is an ordinary lookup in our own database, which we control.

However, the line where we fetch the movie rating makes an external request to an IMDB service that is expected to return the answer. The problem starts when the external service is not available or is experiencing downtime. Now our MoviesController#show is down and cannot be loaded for the user who wants to see the movie.

The solution

Both problems can be solved or relieved by splitting your code using the render_async gem. render_async loads content asynchronously to your Rails pages after they’ve rendered.

Why choose render_async over traditional JavaScript code that does async requests and adds HTML to the page? Because render_async does the boring JavaScript fetch and replace for you.

How render_async works

Let’s say you have an app/views/movies/show.html.erb file that shows details about a specified movie and ratings it fetches from an external service.

Here’s the code before using render_async:

# app/views/movies/show.html.erb

<h1>Information about <%= @movie.title %></h1>

<p><%= @movie.description %></p>

<p>Movie rating on IMDB: <%= @movie_rating %></p>

And this is what it looks like after using render_async:

# app/views/movies/show.html.erb

<h1>Information about <%= @movie.title %></h1>

<p><%= @movie.description %></p>

<%= render_async movie_rating_path(@movie.id) %>

With render_async, the section with the movie rating is loaded after show.html.erb loads. The page makes an AJAX request using jQuery to movie_rating_path, and it renders the contents of the AJAX response in the HTML of the page.

Since render_async makes a request to the specified path, we need to add it to config/routes.rb:

# config/routes.rb

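# The :id segment lets movie_rating_path(@movie.id) pass the movie id as params[:id]
get "movies/:id/rating", :to => "movies#movie_rating", :as => :movie_rating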

We also need to add a proper action in the controller we set in the routes, so we’ll add the movie_rating action inside movies_controller.rb:

# app/controllers/movies_controller.rb

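def movie_rating
  # This action is a separate request, so the movie has to be loaded again.
  # Assumes the route passes the movie id as params[:id].
  @movie = Movie.find_by_id(params[:id])
  @movie_rating = IMDB.movie_rating(@movie)

  render :partial => "movie_rating"
end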

Since our movie_rating is rendering a partial, we need to create a partial for it to render:

# app/views/movies/_movie_rating.html.erb

<p>Movie rating on IMDB: <%= @movie_rating %></p>

The most important part is to add the content_for tag to your application layout, because render_async will put the code for fetching AJAX responses there. It’s best to put it just before the footer in your layout.

# app/views/layouts/application.html.erb

<%= content_for :render_async %>

If the IMDB service is down or not responding, our show page will load without the movie rating. The movie page now cannot be broken by an external service, leaving the rest of the page fully usable.
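The isolation described above can be illustrated in plain Ruby, without Rails. This is only a sketch under stated assumptions: FakeIMDB, ServiceUnavailable, and the three methods are hypothetical stand-ins for the external service and the controller actions, and the HTML strings stand in for rendered views.

```ruby
class ServiceUnavailable < StandardError; end

# Hypothetical stand-in for the external ratings client; it is always down here.
module FakeIMDB
  def self.movie_rating(_title)
    raise ServiceUnavailable, "ratings service is down"
  end
end

# Before: one action does everything, so the outage takes the whole page down.
def show_blocking(title)
  rating = FakeIMDB.movie_rating(title)
  "<h1>#{title}</h1><p>Rating: #{rating}</p>"
end

# After: the page renders on its own and leaves a placeholder for the rating.
def show(title)
  "<h1>#{title}</h1><div id=\"rating\"></div>"
end

# After: the rating is fetched by a separate request.
def movie_rating(title)
  "<p>Rating: #{FakeIMDB.movie_rating(title)}</p>"
end

# The main page still renders while the service is down.
page = show("Alien")

# Only the separate rating request fails, so the placeholder stays empty.
rating_fragment =
  begin
    movie_rating("Alien")
  rescue ServiceUnavailable
    ""
  end
```

In the real setup the second request is made by the browser, so its failure never reaches the show action at all.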

Wrapping Up

In this example we managed to speed up rendering Rails pages with render_async by:

  1. Simplifying the MoviesController show action, thus making it easier to test and load.
  2. Splitting our movie rating markup into a partial, which is a good pattern in the Rails world, and
  3. Freeing up show action from being blocked by an external service.

If you have ideas on what else could be added or improved in render_async, you can submit a pull request or an issue at render_async.

Also, feel free to leave any comments or questions you may have in the section below. If you find this article useful, or think someone else would find it useful, please share it with the world.

Introducing Boosters: Move Faster with Automatic Parallel Testing

As any application grows in features, running all automated tests in the continuous integration (CI) environment begins to take a significant amount of time. A slow CI build — anything longer than 10 minutes — takes a toll on everyone’s focus, flow and productivity. How do you move fast when even a trivial update or hotfix takes 15 minutes to reach production? Half an hour? Forty-five minutes?

Today, we’re announcing Semaphore Boosters, a new CI feature that can cut a build’s runtime from over an hour down to just a few minutes by automatically parallelizing your test suite. It drastically speeds up testing, and helps your team get faster feedback, save time, be more productive, and deliver updates to users much more frequently.

What can you expect from using Semaphore Boosters?

Early adopter customers have seen some amazing improvements in productivity:

“Before we started working with Boosters, test times were one of our biggest development bottlenecks. Running our test suite took over 90 minutes, and forced us to try a whole host of clunky workarounds just to try to get quicker feedback. Semaphore Boosters have given us a tremendous time savings — our full suite now runs in just under 16 minutes. Our productivity has increased tremendously, almost doubling our output in terms of tickets we close each week. Using Boosters has really helped our large project feel a lot more nimble to develop!” says Bryce Senz, CIO at Credda.

How does it work?

Watch the video below to see Semaphore Boosters in action:

In the video, we started with an application whose test suite took 8 minutes to run. In only a few clicks, without any change in source code, we configured a CI build that runs in only a minute and a half.

Semaphore Boosters monitor your test suite and dynamically distribute your test files across parallel jobs. This ensures best possible performance regardless of how your code changes over time. The only thing that you as a user need to do is select the number of parallel jobs you’d like to run.
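To give a feel for what distributing test files across parallel jobs can involve, here is a minimal sketch of one common approach: a greedy split based on previously recorded durations. It is not Semaphore’s actual algorithm, which the post does not describe; the file names, the timings, and the distribute helper are all hypothetical.

```ruby
# Hypothetical timings in seconds, as might be recorded from a previous run.
durations = {
  "spec/models/user_spec.rb" => 40,
  "spec/features/checkout_spec.rb" => 120,
  "spec/models/build_spec.rb" => 30,
  "spec/features/signup_spec.rb" => 90,
  "spec/controllers/projects_spec.rb" => 60,
}

# Greedy split: go through the files from slowest to fastest and always hand
# the next file to the job with the smallest total time so far.
def distribute(durations, job_count)
  jobs = Array.new(job_count) { { files: [], total: 0 } }
  durations.sort_by { |_file, seconds| -seconds }.each do |file, seconds|
    job = jobs.min_by { |candidate| candidate[:total] }
    job[:files] << file
    job[:total] += seconds
  end
  jobs
end

jobs = distribute(durations, 2)
jobs.each_with_index do |job, i|
  puts "job #{i + 1}: #{job[:total]}s #{job[:files].inspect}"
end
```

With these sample timings, the 340 seconds of tests end up in two jobs of 160 and 180 seconds, so the build is bounded by the slower job rather than by the whole suite.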

Semaphore Boosters currently support Ruby on Rails applications via RSpec and Cucumber. We plan to support other languages and frameworks, which we’ll announce here.

Book a Semaphore Boosters demo to help your team improve productivity and deliver new features faster with automatic parallel testing.

Happy building!

How the Team at 500px Moves Faster with Semaphore

500px has been our customer since 2014, and they have been growing and evolving along with Semaphore.

Moving fast is crucial to the 500px team. The less time they spend on testing, the more value they can create for their users. They put new code into production several times per day, and automated testing allows them to ensure that new features work, while spending less time reviewing previously-tested functionality. In order to accomplish this, they rely on Semaphore to automatically run their tests in parallel and speed up their test suite.

In the past, Devon Noel de Tilly from the 500px QA team explained to us what their development workflow looks like, and Artem Lypiy, their QA Lead, shared how moving from Jenkins to Semaphore helped 500px build up a continuous integration process that scales.

Recently, Shreya Khasnis wrote a post on what happens behind the scenes and how they maintain 500px on their blog:

“Pull requests are reviewed by developers, but also reviewed by machines! We have a large suite of automated tests, which run when new pull requests are opened. These tests are a great way to ensure that new features work as expected and verify that these new changes do not break existing functionality. Given the variety of features on our site, it would be time-consuming to test all aspects by hand on every code change. Currently, we have about 4,000 automated tests that are separated into threads which run simultaneously. We use a continuous integration framework called Semaphore CI that runs these tests on every proposed change. The tests are randomly executed, which encourages the development of independent tests to ensure the order of execution does not impact the expected result. This helps us parallelize the test suite into different threads. Semaphore can also be integrated with Slack to inform developers about tests that have passed or failed. From this, developers are able to triage through and fix the code that broke things.”

Read the entire blog post to learn how 500px moves fast, while ensuring that everything works.

Happy building!

Faster Rails: Is Your Database Properly Indexed?

This article is part of our Faster Rails series. Check out the previous article about fast existence checks.

My Rails app used to be fast and snappy, and everything was working just fine for several months. Then, slowly, as my product grew and users started to flock in, web requests became slow and my database’s CPU usage started hitting the roof. I hadn’t changed anything, so why was my app getting slower?

Is there any cure for the issues I’m having with my application, or is Rails simply not able to scale?

What makes your Rails application slow?

While there can be many reasons behind an application’s slowness, database queries usually play the biggest role in an application’s performance footprint. Loading too much data into memory, N+1 queries, lack of cached values, and the lack of proper database indexes are the biggest culprits that can cause slow requests.

Missing database indexes on foreign keys and commonly searched columns or values that need to be sorted can make a huge difference. The missing index is an issue that is not even noticeable for tables with several thousand records. However, when you start hitting millions of records, the lookups in the table become painfully slow.

The role of database indexes

When you create a database column, it’s vital to consider if you will need to find and retrieve records based on that column.

For example, let’s take a look at the internals of Semaphore. We have a Project model, and every project has a name attribute. When someone visits a project on Semaphore, e.g. https://semaphoreci.com/renderedtext/test-boosters, the first thing we need to do in the projects controller is to find the project based on its name — test-boosters.

project = Project.find_by_name(params[:name])

Without an index, the database engine would need to check every record in the projects table, one by one, until a match is found.

However, if we introduce an index on the ‘projects’ table, as in the following example, the lookup will be much, much faster.

class IndexProjectsOnName < ActiveRecord::Migration
  def change
    add_index :projects, :name
  end
end

A good way to think about indexes is to imagine them as the index section at the end of a book. If you want to find a word in a book, you can either read the whole book and find the word, or you can open the index section that contains an alphabetically sorted list of important words with a locator that points to the page that defines the word.
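The book analogy can be sketched in plain Ruby. This is only a toy model: the Project struct, the sample rows, and the helpers full_scan, build_index, and index_lookup are made up for illustration, and a real database index is a more elaborate on-disk structure than a sorted array.

```ruby
# Toy model of a table: an array of rows in insertion order.
Project = Struct.new(:id, :name)

projects = [
  Project.new(1, "semaphore"),
  Project.new(2, "test-boosters"),
  Project.new(3, "render_async"),
]

# Without an index: compare rows one by one until a match is found.
def full_scan(rows, name)
  rows.find { |row| row.name == name }
end

# The "index": names sorted alphabetically, each with the position of its row.
def build_index(rows)
  rows.each_with_index.map { |row, position| [row.name, position] }.sort
end

# With an index: binary search in the sorted list, then jump straight to the row.
def index_lookup(rows, index, name)
  entry = index.bsearch { |(key, _position)| name <=> key }
  entry && rows[entry[1]]
end

index = build_index(projects)

full_scan(projects, "test-boosters").id           # => 2
index_lookup(projects, index, "test-boosters").id # => 2
index_lookup(projects, index, "missing")          # => nil
```

Both lookups return the same row. The difference is the amount of work: the scan grows with the number of rows, while the binary search needs only a handful of comparisons even for millions of entries.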

What needs to be indexed?

A good rule of thumb is to create database indexes for everything that is referenced in the WHERE, HAVING and ORDER BY parts of your SQL queries.

Indexes for unique lookups

Any lookup based on a unique column value should have an index.

For example, the following queries:

User.find_by_username("shiroyasha")
User.find_by_email("support@semaphoreci.com")

will benefit from an index of the username and email fields:

add_index :users, :username
add_index :users, :email

Indexes for foreign keys

If you have belongs_to or has_many relationships, you will need to index the foreign keys to optimize for fast lookup.

For example, we have the branches that belong to projects:

class Project < ActiveRecord::Base
  has_many :branches
end

class Branch < ActiveRecord::Base
  belongs_to :project
end

For fast lookup, we need to add the following index:

add_index :branches, :project_id

For polymorphic associations, the owner of the project can either be a User or an Organization:

class Organization < ActiveRecord::Base
  has_many :projects, :as => :owner
end

class User < ActiveRecord::Base
  has_many :projects, :as => :owner
end

class Project < ActiveRecord::Base
  belongs_to :owner, :polymorphic => true
end

We need to make sure that we create a single composite (multi-column) index:

# Bad: Two single-column indexes are a poor fit for a lookup on both columns

add_index :projects, :owner_id
add_index :projects, :owner_type

# Good: This will create the proper index

add_index :projects, [:owner_id, :owner_type]
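Why both columns belong in one index can be seen in a small plain-Ruby sketch. The Project struct and the sample data are hypothetical, and Ruby hashes only stand in for database indexes here.

```ruby
Project = Struct.new(:name, :owner_id, :owner_type)

projects = [
  Project.new("blog", 1, "User"),
  Project.new("billing", 1, "Organization"),
  Project.new("docs", 2, "User"),
]

# Single-column "index" on owner_id: one key can point at unrelated owners,
# because a User and an Organization may share the same id.
by_owner_id = projects.group_by(&:owner_id)

# Composite "index" on [owner_id, owner_type]: one key per actual owner.
by_owner = projects.group_by { |project| [project.owner_id, project.owner_type] }

by_owner_id[1].map(&:name)      # => ["blog", "billing"], still needs filtering by type
by_owner[[1, "User"]].map(&:name) # => ["blog"]
```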

Indexes for ordered values

Any frequently used sorting can be improved by using a dedicated index.

For example:

Build.order(:updated_at).take(10)

can be improved with a dedicated index:

add_index :builds, :updated_at

Should I always use indexes?

While using indexes for important fields can immensely improve the performance of your application, sometimes the effect can be negligible, or it can even make your application slower.

For example, indexes on tables whose records are frequently deleted can negatively impact the performance of your database, because every index has to be updated along with the table. Indexes on huge tables with many millions of records also require more storage.

Always be conscious about the changes you introduce in your database, and if in doubt, be sure to base your decisions on real world data and measurements.

Perfection is Useless

One of the most important things we teach the junior programmers who join the Semaphore team is the mindset of shipping in small iterations. This is a simple concept, but there’s an inevitable misunderstanding that stems from subjective ideas of what “small” means. Thus, in practice we need to teach by example what we really mean by small.

When you’re inexperienced, the desire to do and show your best work often leads to perfectionism. In programming, perfectionism manifests itself as “I haven’t submitted my pull request because I haven’t completed everything yet”.

Perfectionism is at odds with the goals of developing business software — giving something useful to users, preferably sooner rather than later. Perfectionists create imaginary obstacles and never end up building anything.

Recently, a pair of junior programmers was building a new reporting screen for our marketing team. The screen needed to combine two sources of data for a given time range and present a paginated view of results. The team that needed the report had never seen the data this screen would provide. Would it hurt if the first version of the report did not include a date picker and pagination of results beyond the top 25? Hell no. So, we encouraged them to ship the screen without the date range and pagination. The initial results provided more than enough value and ideas for improvement. The marketing team had some data they could work with while the developers continued working on the remaining tasks.

The crux of the matter lies in decomposing a task into minimal useful pieces. Next, you estimate the complexity of each piece and communicate expectations with the “stakeholder” (customer, client, product manager, or feature user).

Say a designer has recently updated several details that affect four distinct screens. Would it be best to integrate these changes in four separate pull requests, or one? This is where complexity, i.e. the time it would take to complete each one, needs to be considered. If they would take a day each, four separate pull requests are probably best. If all of them together would take you less than an hour to complete, go ahead and combine them all into one pull request. Are three tasks really easy, but the fourth one requires additional input from the designer who’s having a day off, as well as more time than all others combined? Best to please your users with what you can finish soon, and then do the last thing separately.

Shipping early will often provide you with surprising feedback. Perhaps the initial version is so good that nobody really needs the stuff that’s “missing”. Or, the whole idea didn’t really deliver what was expected and needs to be reconsidered. The goal is to learn and help others. Just keep moving.

Fast Failing: Introducing Faster Feedback on Failing Builds

Fast feedback on the work we’ve done minimizes developer context switching and keeps us in the state of flow. Waiting for all the jobs to finish in order to see that a job has failed can waste a lot of time. If a job fails, the developer should have the option to be notified right away, rather than wait until all the tests are run.

In order to make you more productive when building on Semaphore, we bring you fast failing as a feature.

The fast failing approach

On Semaphore, fast failing means that the developers get instant feedback when a job fails. All the running jobs of a build are stopped as soon as one of the jobs fails. This means that you don’t need to wait for all the other jobs to finish in order to get build feedback.

For example, say a full build takes 10 minutes, a failing job reports its failure after 1 minute, and fixing the issue takes 1 minute. With fast failing, the entire process, including the rebuild, takes 12 minutes in total. Without fast failing, the entire process would take 21 minutes.
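The two totals can be checked with a few lines of Ruby. The variable names are illustrative; the numbers are the ones from the example above.

```ruby
build_minutes = 10        # duration of a full build
fail_feedback_minutes = 1 # time until the failing job reports with fast failing
fix_minutes = 1           # time needed to fix the issue

# With fast failing: early feedback, the fix, then one full rebuild.
with_fast_failing = fail_feedback_minutes + fix_minutes + build_minutes
# => 12

# Without fast failing: wait for the whole failed build, fix, then rebuild.
without_fast_failing = build_minutes + fix_minutes + build_minutes
# => 21
```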

The fast failing approach also minimizes developer context switching due to faster feedback cycles.

How to enable fast failing on Semaphore

You can select the type of fast failing in the branch settings of your project. You can either enable it for all branches, or for all branches except the default one.

Branch settings for Fast Failing


Happy fast failing!

Flaky Tests: Are You Sure You Want to Rerun Them?

Different teams have different approaches to dealing with flaky tests. Some even go as far as using the “Let’s run each test 10 times and if it passes at least once, it passed” approach.

I personally think that rerunning failed tests is poisonous — it legitimizes and encourages entropy, and rots the test suite in the long run.

The half-cocked approach

Some teams see rerunning failed tests as a very convenient short-term solution. In my experience, there is unfortunately no such thing as ‘a short-term solution’. All temporary solutions tend to become permanent.

Along with some other techniques that are efficient in the short term, but are otherwise devastating, rerunning tests is very popular with a certain category of managers. It’s particularly common in corporate environments: there are company goals, and then there are personal goals (ladders to climb). In such environments, some people tend to focus only on what needs to happen until the end of the current quarter or year. What happens later is often seen as someone else’s concern. Looking from that perspective, test rerunning is both fast and efficient, which makes it a desirable and convenient solution.

Keeping flaky tests and brute-forcing them to pass defeats the purpose of testing. There is an unspoken assumption that something is wrong with the tests, and that it’s fine to just rerun them. This assumption is dangerous. Who’s to say that the race or the time-out that causes the flakiness is in the test, and not in the production code? And that it’s not affecting the customer?
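A bit of arithmetic shows how much reruns can hide. The sketch below assumes that each run of a test fails independently with the same probability, which is a simplification; the method name is made up for illustration.

```ruby
# Probability that a test passes at least once in the given number of attempts,
# assuming every attempt fails independently with probability failure_rate.
def pass_probability_with_reruns(failure_rate, attempts)
  1.0 - failure_rate**attempts
end

pass_probability_with_reruns(0.5, 10)          # => 0.9990234375
pass_probability_with_reruns(0.9, 10).round(3) # => 0.651
```

Under this assumption, a test that fails half of the time is reported green about 99.9% of the time with ten attempts, and even a test that fails nine times out of ten still passes roughly 65% of the time. If the underlying race is in the production code, the test suite has effectively stopped reporting it.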

The sustainable solution

The long-term solution is to either fix or replace the flaky tests. If one developer cannot fix them, another one should try. If a test cannot be fixed, it should be deleted and written from scratch, preferably by somebody who didn’t see the flaky one. Test coverage tools can be used as a kind of a safety net, showing if some tests have been deleted without being adequately replaced.

Not being able to develop stable tests for some part of the code usually means one of these two things — either that something is wrong with the test and/or the testing approach, or that something is wrong with the code being tested. If we are reasonably certain that the tests are fine, it’s time to take a deeper look at the code itself.

Our position on flaky tests

Deleting and fixing flaky tests is a pretty aggressive measure, and rewriting tests can be time consuming. However, not taking care of flaky tests leads to certain long-term test suite degradation.

On the other hand, there are some legitimate use-cases for flaky test reruns. For example, when time-to-market is of essential importance, and when technical debt is deliberately accumulated with the intention and a clear plan on paying it off in the near future.

As a CI/CD tool vendor, we feel that our choice whether to support rerunning failing flaky tests affects numerous customers. Not just the way they work, but, much more importantly, the way they perceive flaky tests and the testing process itself. At this point, we are choosing not to support rerunning failed tests, since our position is that this approach is harmful much more often than it is useful.

Customizable Command Timeouts

For a long time, Semaphore has been limiting your build command execution time to a fixed 60 minutes. This restriction worked great for the majority of builds on Semaphore, but there are some cases when this limit is simply not good enough.

Tasks such as compiling large binaries, provisioning big infrastructures, or running tests that are difficult to parallelize sometimes require more than 60 minutes to complete.

On the other hand, developers like to restrict build duration to prevent their test suites from getting stuck because of an accidental debug statement or a network call that will never complete.
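The idea of a command time limit can be sketched with Ruby’s standard timeout library. This only illustrates the concept and is not how Semaphore implements it; run_with_limit is a hypothetical helper, and the limits are in seconds to keep the example quick.

```ruby
require "timeout"

# Runs the given block with a time limit and reports how it ended.
def run_with_limit(seconds)
  Timeout.timeout(seconds) { yield }
  :finished
rescue Timeout::Error
  :timed_out
end

run_with_limit(1) { 1 + 1 }     # => :finished
run_with_limit(0.1) { sleep 5 } # => :timed_out, the stuck command is cut off
```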

For this reason, we have introduced a new configuration option in the admin section of your project’s configuration that allows you to choose the best-suited timeout for your project:

Customizable command timeouts on Semaphore


Happy building!

The Cracking Monolith: The Forces That Call for Microservices

The microservice architecture has recently been gaining traction, with many companies sharing their positive experiences with applying it. The early adopters have been tech behemoths such as Amazon and Netflix, or companies with huge user bases like SoundCloud. Based on the profiles of these companies and the assumption that there’s more complexity to running and deploying many things than to deploying a single application, many people understand microservices as an interesting idea that does not apply to them. It’s something that mere mortals could qualify for in the far distant future, if ever.

However, obsessing about “being ready” is rarely a good strategy in life. I think that it’s far more useful to first learn how to detect when the opposite approach — a monolithic application — is no longer optimal. The knowledge that helps us to recognize the need enables us to start taking action when the time comes for us to make the change. This and future posts on our blog will be based on our experience of scaling up Semaphore to manage tens of thousands of private CI jobs on a daily basis.

Overweight monoliths exhibit two classes of problems: degrading system performance and stability, and slow development cycles. So, whatever we do comes from the desire to escape these technical and consequently social challenges.

The single point of fragility

Today’s typical large monolithic systems started off as web applications written in an MVC framework, such as Ruby on Rails. These systems are characterized by either being a single point of failure, or having severe bottlenecks under pressure.

Of course, having potential bottlenecks, or having an entire system that is a single point of failure is not inherently a problem. When you’re in month 3 of your MVP, this is fine. When you’re working in a team of a few developers on a client project which serves 100 customers, this is fine. When most of your app’s functionality are well-designed CRUD operations based on human input with a linear increase of load, things are probably going to be fine for a long time.

Also, there’s nothing inherently wrong about big apps. If you have one and you’re not experiencing any of these issues, there’s absolutely no reason to change your approach. You shouldn’t build microservices solely in the service of making the app smaller — it makes no sense to replace the parts that are doing their job well.

Problems begin to arise after your single point of failure has actually started failing under heavy load.

At that point, having a large attack surface can start keeping the team in a perpetual state of emergency. For example:

  • An outage in non-critical data processing brings down your entire website. With Semaphore, we had events where the monolith was handling callbacks from many CI servers, and when that part of the system failed, it brought the entire service down.
  • You moved all time-intensive tasks to one huge group of background workers, and keeping them stable gradually becomes a full-time job for a small team.
  • Changing one part of the system unexpectedly affects some other parts even though they’re logically unrelated, which leads to some nasty surprises.

As a consequence, your team spends more time solving technical issues than building cool and useful stuff for your users.

Slow development cycles

The second big problem is when making any change happen begins to take too much time.

There are some technical factors that are not difficult to measure. A good question to consider is how much time it takes your team to ship a hotfix to production. Not having a fast delivery pipeline is painfully obvious to your users in the case of an outage.

What’s less obvious is how much the slow development cycles are affecting your company over a longer period of time. How long does it take your team to get from an idea to something that customers can use in production? If the answer is weeks or months, then your company is vulnerable to being outplayed by competition.

Nobody wants that, but that’s where the compound effects of monolithic, complex code bases lead.

Slow CI builds: anything longer than a few minutes leads to too much unproductive time and task switching. As a standard for web apps, we recommend setting the bar at 10 minutes, and we actually draw the line for you. Slow CI builds are one of the first symptoms of an overweight monolith, but the good news is that a good CI tool can help you fix them. For example, on Semaphore you can split your test suite into parallel jobs, or let Semaphore do the work for you automatically, regardless of the sequential runtime of your build.
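To make the idea of parallel jobs more concrete, here is a minimal sketch of how a test suite could be split into jobs of roughly equal duration. This is not how Semaphore does it internally. The split_into_jobs helper, the file names, and the timings are all hypothetical, and the approach is a plain greedy heuristic: take the slowest files first and always hand the next file to the job that currently has the least work.

```ruby
# Hypothetical helper (not a Semaphore API): split test files into
# `job_count` groups with roughly equal total runtime.
def split_into_jobs(runtimes, job_count)
  jobs = Array.new(job_count) { { files: [], total: 0.0 } }

  # Slowest files first; each goes to the job with the smallest total so far.
  runtimes.sort_by { |file, seconds| [-seconds, file] }.each do |file, seconds|
    lightest = jobs.min_by { |job| job[:total] }
    lightest[:files] << file
    lightest[:total] += seconds
  end

  jobs
end

# Made-up timings in seconds, for illustration only.
runtimes = {
  "spec/models/build_spec.rb" => 120.0,
  "spec/features/signup_spec.rb" => 300.0,
  "spec/models/user_spec.rb" => 90.0,
  "spec/controllers/projects_spec.rb" => 150.0,
  "spec/features/billing_spec.rb" => 240.0
}

split_into_jobs(runtimes, 2).each_with_index do |job, index|
  puts "job #{index + 1}: #{job[:total]}s #{job[:files].join(' ')}"
end
```

With these example numbers, a suite that takes 900 seconds sequentially ends up in two jobs of 420 and 480 seconds. In practice, the timings would have to come from previous builds, and a greedy split like this is only an approximation of the best possible balance.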

Slow deployment: this issue is typical for monoliths that have accumulated many dependencies and assets. There are often multiple app instances, and we need to replace each one without having downtime. Moving to container-based deployment can make things even worse, by adding the time needed to build and copy the container image.

Low bus factor on the old guard, long onboarding for the newcomers: it takes months for someone new to become comfortable with making a non-trivial contribution in a large code base. And yet, all new code is just a small fraction of the code that has already been written. The idiosyncrasies of old code affect and constrain all new code that is layered on top of it. This leaves those who have watched the app grow with an ever-expanding responsibility. For example, having 5 developers that are waiting for a single person to review their pull requests is an indicator of this.

Emergency-driven context switching: we may have begun working on a new feature, but an outage has just exposed a vulnerability in our system. Fixing it becomes the top priority, and the team needs to react and switch to solving that issue. By the time they return to the initial project, internal or external circumstances can change and reduce its impact, perhaps even make it obsolete. A badly designed distributed system can make this even worse, which is why building one requires solid design skills. However, if all code is part of a single runtime hitting one database, our options for avoiding contention and downtime are very limited.

Change of technology is difficult: our current framework and tooling might not be the best match for the new use cases and the problems we face. It’s also common for monoliths to depend on outdated software. For example, GitHub upgraded to Rails 3 four years after it was released. Such latency can either limit our design choices, or generate additional maintenance work. For example, when the library version that you’re using is no longer receiving security updates, you need to find a way to patch it yourself.

Decomposition for fun and profit

While product success certainly helps, a development team that’s experiencing all of these issues won’t have the highest morale. Nor will its people be able to develop their true potential.

All this can happen regardless of code quality. Practicing behavior-driven development is not a vaccine against scaling issues.

The root cause is simple. A monolithic application grows multiple applications within itself, while at the same time having to cope with high traffic and large volumes of data.

Big problems are best solved by breaking them up into many smaller ones that are easier to handle. This basic engineering idea is what leads teams to start decomposing large monoliths into smaller services, and eventually into microservices. The ultimate goal is to go back to being creative and successful by enabling the team to develop useful products as quickly as possible.
