Semaphore Engineering Blog

Flaky Tests: Are You Sure You Want to Rerun Them?

Different teams have different approaches to dealing with flaky tests. Some even go as far as using the “Let’s run each test 10 times and if it passes at least once, it passed” approach.

I personally think that rerunning failed tests is poisonous — it legitimizes and encourages entropy, and rots the test suite in the long run.

The half-cocked approach

Some teams see rerunning failed tests as a very convenient short-term solution. In my experience, there is unfortunately no such thing as ‘a short-term solution’. All temporary solutions tend to become permanent.

Along with some other techniques that are efficient in the short term, but are otherwise devastating, rerunning tests is very popular with a certain category of managers. It’s particularly common in corporate environments: there are company goals, and then there are personal goals (ladders to climb). In such environments, some people tend to focus only on what needs to happen until the end of the current quarter or year. What happens later is often seen as someone else’s concern. Looking from that perspective, test rerunning is both fast and efficient, which makes it a desirable and convenient solution.

Keeping flaky tests and brute-forcing them to pass defeats the purpose of testing. There is an unspoken assumption that something is wrong with the tests, and that it’s fine to just rerun them. This assumption is dangerous. Who’s to say that the race or the time-out that causes the flakiness is in the test, and not in the production code? And that it’s not affecting the customer?

The sustainable solution

The long-term solution is to either fix or replace the flaky tests. If one developer cannot fix them, another one should try. If a test cannot be fixed, it should be deleted and written from scratch, preferably by somebody who didn’t see the flaky one. Test coverage tools can be used as a kind of a safety net, showing if some tests have been deleted without being adequately replaced.

Not being able to develop stable tests for some part of the code usually means one of these two things — either that something is wrong with the test and/or the testing approach, or that something is wrong with the code being tested. If we are reasonably certain that the tests are fine, it’s time to take a deeper look at the code itself.

Our position on flaky tests

Deleting and fixing flaky tests is a pretty aggressive measure, and rewriting tests can be time consuming. However, not taking care of flaky tests inevitably leads to long-term degradation of the test suite.

On the other hand, there are some legitimate use cases for flaky test reruns. For example, when time-to-market is of essential importance, and when technical debt is deliberately accumulated with a clear plan for paying it off in the near future.

As a CI/CD tool vendor, we feel that our choice whether to support rerunning failing flaky tests affects numerous customers. Not just the way they work, but, much more importantly, the way they perceive flaky tests and the testing process itself. At this point, we are choosing not to support rerunning failed tests, since our position is that this approach is harmful much more often than it is useful.

The Cracking Monolith: The Forces That Call for Microservices

The microservice architecture has recently been gaining traction, with many companies sharing their positive experiences with applying it. The early adopters have been tech behemoths such as Amazon and Netflix, or companies with huge user bases like SoundCloud. Based on the profiles of these companies and the assumption that there’s more complexity to running and deploying many things than to deploying a single application, many people understand microservices as an interesting idea that does not apply to them. It’s something that mere mortals could qualify for in the far distant future, if ever.

However, obsessing about “being ready” is rarely a good strategy in life. I think that it’s far more useful to first learn how to detect when the opposite approach — a monolithic application — is no longer optimal. Knowing how to recognize that need enables us to start taking action when the time comes to make the change. This and future posts on our blog will be based on our experience of scaling up Semaphore to manage tens of thousands of private CI jobs on a daily basis.

Overweight monoliths exhibit two classes of problems: degrading system performance and stability, and slow development cycles. So, whatever we do comes from the desire to escape these technical and consequently social challenges.

The single point of fragility

Today’s typical large monolithic systems started off as web applications written in an MVC framework, such as Ruby on Rails. These systems are characterized by either being a single point of failure, or having severe bottlenecks under pressure.

Of course, having potential bottlenecks, or having an entire system that is a single point of failure, is not inherently a problem. When you’re in month 3 of your MVP, this is fine. When you’re working in a team of a few developers on a client project which serves 100 customers, this is fine. When most of your app’s functionality consists of well-designed CRUD operations based on human input, with a linear increase in load, things are probably going to be fine for a long time.

Also, there’s nothing inherently wrong about big apps. If you have one and you’re not experiencing any of these issues, there’s absolutely no reason to change your approach. You shouldn’t build microservices solely in the service of making the app smaller — it makes no sense to replace the parts that are doing their job well.

Problems begin to arise after your single point of failure has actually started failing under heavy load.

At that point, having a large attack surface can start keeping the team in a perpetual state of emergency. For example:

  • An outage in non-critical data processing brings down your entire website. With Semaphore, we had events where the monolith was handling callbacks from many CI servers, and when that part of the system failed, it brought the entire service down.
  • You moved all time-intensive tasks to one huge group of background workers, and keeping them stable gradually becomes a full-time job for a small team.
  • Changing one part of the system unexpectedly affects some other parts even though they’re logically unrelated, which leads to some nasty surprises.

As a consequence, your team spends more time solving technical issues than building cool and useful stuff for your users.

Slow development cycles

The second big problem is when making any change happen begins to take too much time.

There are some technical factors that are not difficult to measure. A good question to consider is how much time it takes your team to ship a hotfix to production. Not having a fast delivery pipeline is painfully obvious to your users in the case of an outage.

What’s less obvious is how much the slow development cycles are affecting your company over a longer period of time. How long does it take your team to get from an idea to something that customers can use in production? If the answer is weeks or months, then your company is vulnerable to being outplayed by competition.

Nobody wants that, but that’s where the compound effects of monolithic, complex code bases lead.

Slow CI builds: anything longer than a few minutes leads to too much unproductive time and task switching. As a standard for web apps, we recommend setting the bar at 10 minutes, and Semaphore actually draws that line for you. Slow CI builds are one of the first symptoms of an overweight monolith, but the good news is that a good CI tool can help you fix it. For example, on Semaphore you can split your test suite into parallel jobs, or let Semaphore do the work for you automatically, regardless of the sequential runtime of your build.

Slow deployment: this issue is typical for monoliths that have accumulated many dependencies and assets. There are often multiple app instances, and we need to replace each one without having downtime. Moving to container-based deployment can make things even worse, by adding the time needed to build and copy the container image.

High bus factor on the old guard, long onboarding for the newcomers: it takes months for someone new to become comfortable with making a non-trivial contribution in a large code base. And yet, all new code is just a small percentage of the code that has already been written. The idiosyncrasies of old code affect and constrain all new code that is layered on top of it. This leaves those who have watched the app grow with an ever-expanding responsibility. For example, having 5 developers waiting for a single person to review their pull requests is an indicator of this.

Emergency-driven context switching: we may have begun working on a new feature, but an outage has just exposed a vulnerability in our system. So, healing it becomes a top priority, and the team needs to react and switch to solving that issue. By the time they return to the initial project, internal or external circumstances can change and reduce its impact, perhaps even make it obsolete. A badly designed distributed system can make this even worse — hence one of the requirements for making one is having solid design skills. However, if all code is part of a single runtime hitting one database, our options for avoiding contention and downtime are very limited.

Change of technology is difficult: our current framework and tooling might not be the best match for the new use cases and the problems we face. It’s also common for monoliths to depend on outdated software. For example, GitHub upgraded to Rails 3 four years after it was released. Such latency can either limit our design choices, or generate additional maintenance work. For example, when the library version that you’re using is no longer receiving security updates, you need to find a way to patch it yourself.

Decomposition for fun and profit

While product success certainly helps, a development team that’s experiencing all of these issues won’t have the highest morale. Nor will its people be able to develop their true potential.

All this can happen regardless of code quality. Practicing behavior-driven development is not a vaccine against scaling issues.

The root cause is simple: a monolithic application grows multiple applications within itself, while facing high traffic and large volumes of data.

Big problems are best solved by breaking them up into many smaller ones that are easier to handle. This basic engineering idea is what leads teams to start decomposing large monoliths into smaller services, and eventually into microservices. The ultimate goal is to go back to being creative and successful by enabling the team to develop useful products as quickly as possible.

Faster Rails: How to Check if a Record Exists

Ruby and Rails are slow — this argument is often used to downplay the worth of the language and the framework. This statement in itself is not false. Generally speaking, Ruby is slower than its direct competitors such as Node.js and Python. Yet, many businesses, from small startups to platforms with millions of users, use it as the backbone of their operations. How can we explain this contradiction?

What makes your application slow?

While there can be many reasons why your application is slow, database queries usually play the biggest role in an application’s performance footprint. Loading too much data into memory, N+1 queries, lack of cached values, and lack of proper database indexes are the biggest culprits behind slow requests.

There are some legitimate domains where Ruby is simply too slow. However, most slow responses in our applications boil down to unoptimized database calls and a lack of proper caching.
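
To make the N+1 case concrete, here is a minimal sketch, assuming a Project model with a has_many :builds association (the same models we use later in this post):

# N+1: one query for the projects, plus one builds query per project
Project.all.each do |project|
  puts project.builds.size
end

# Eager loading: two queries in total, regardless of the number of projects
Project.includes(:builds).each do |project|
  puts project.builds.size
end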

Even if your application is blazing fast today, it can become much slower in only several months. API calls that worked just fine can suddenly start killing your service with a dreaded HTTP 502 response. After all, working with a database table with several hundred records is very different from working with a table that has millions of records.

Existence checks in Rails

Existence checks are probably the most common calls that you send to your database. Every request handler in your application probably starts with a lookup, followed by a policy check that uses multiple dependent lookups in the database.

However, there are multiple ways to check the existence of a database record in Rails. We have present?, empty?, any?, exists?, and various other counting-based approaches, and they all have vastly different performance implications.

In general, when working on Semaphore, I always prefer to use .exists?.

I’ll use our production database to illustrate why I prefer .exists? over the alternatives. We will check whether there has been a passed build in the last 7 days.

Let’s observe the database queries produced by each of these calls.

Build.where(:created_at => 7.days.ago..1.day.ago).passed.present?

# SELECT "builds".* FROM "builds" WHERE ("builds"."created_at" BETWEEN
# '2017-02-22 21:22:27.133402' AND '2017-02-28 21:22:27.133529') AND
# "builds"."result" = $1  [["result", "passed"]]


Build.where(:created_at => 7.days.ago..1.day.ago).passed.any?

# SELECT COUNT(*) FROM "builds" WHERE ("builds"."created_at" BETWEEN
# '2017-02-22 21:22:16.885942' AND '2017-02-28 21:22:16.886077') AND
# "builds"."result" = $1  [["result", "passed"]]


Build.where(:created_at => 7.days.ago..1.day.ago).passed.empty?

# SELECT COUNT(*) FROM "builds" WHERE ("builds"."created_at" BETWEEN
# '2017-02-22 21:22:16.885942' AND '2017-02-28 21:22:16.886077') AND
# "builds"."result" = $1  [["result", "passed"]]


Build.where(:created_at => 7.days.ago..1.day.ago).passed.exists?

# SELECT 1 AS one FROM "builds" WHERE ("builds"."created_at" BETWEEN
# '2017-02-22 21:23:04.066301' AND '2017-02-28 21:23:04.066443') AND
# "builds"."result" = $1 LIMIT 1  [["result", "passed"]]

The first call that uses .present? is very inefficient. It loads all the records from the database into memory, constructs the Active Record objects, and then finds out if the array is empty or not. On a huge database table, this can cause havoc and potentially load millions of records, which can even lead to downtime in your service.

The second and third approaches, any? and empty?, are optimized in Rails to issue a COUNT(*) query instead of loading records into memory. COUNT(*) queries are usually efficient, and you can use them even on semi-large tables without any dangerous side effects.

The fourth approach, exists?, is even more optimized, and it should be your first choice when checking the existence of a record. It uses the SELECT 1 ... LIMIT 1 approach, which is very fast.

Here are some numbers from our production database for the above queries:

present? =>  2892.7 ms
any?     =>   400.9 ms
empty?   =>   403.9 ms
exists?  =>     1.1 ms

This small tweak can make your code up to 400 times faster in some cases.

If you take into account that 200 ms is considered the upper limit for an acceptable response time, you will realize that this tweak can make the difference between a good user experience and a sluggish, bad one.
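
Proper indexes matter here as well: even the SELECT 1 ... LIMIT 1 query can be slow if the database has to scan the table to find a matching row. Here is a sketch of a migration adding a composite index that covers the columns used in the queries above; whether you need it, and on which columns, depends on your own data and query patterns:

# On Rails 5+, inherit from ActiveRecord::Migration[5.0];
# on Rails 4, plain ActiveRecord::Migration works.
class AddIndexOnBuildsResultAndCreatedAt < ActiveRecord::Migration[5.0]
  def change
    # Lets the existence check be answered from the index
    # instead of scanning the whole builds table.
    add_index :builds, [:result, :created_at]
  end
end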

Should I always use exists?

I consider exists? a good sane default that usually has the best performance footprint. However, there are some exceptions.

For example, if we are checking for the existence of an association record without any scope, any? and empty? will also produce a very optimized query that uses the SELECT 1 FROM ... LIMIT 1 form, and, unlike exists?, any? will not hit the database again if the records are already loaded into memory.

This makes any? faster by one whole database call when the records are already loaded into memory:

project = Project.find_by_name("semaphore")

project.builds.load    # eager loads all the builds into the association cache

project.builds.any?    # no database hit
project.builds.exists? # hits the database

# if we bust the association cache
project.builds(true).any?    # hits the database
project.builds(true).exists? # hits the database

In conclusion, my general advice is to use exists? by default, and to improve the code based on metrics.
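
When gathering those metrics, a quick timing comparison in a Rails console is often enough. Here is a minimal sketch using Ruby's standard Benchmark module and the same scope as above:

require "benchmark"

range = 7.days.ago..1.day.ago

Benchmark.bm(10) do |x|
  x.report("present?") { Build.where(:created_at => range).passed.present? }
  x.report("exists?")  { Build.where(:created_at => range).passed.exists? }
end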

Making a Mailing Microservice with Elixir and RabbitMQ

At Rendered Text, we like to decompose our applications into microservices. These days, when we have an idea, we think of its implementation as a composition of multiple small, self-sustaining services, rather than an addition to a big monolith.

In a recent Semaphore product experiment, we wanted to create a service that gathers information from multiple sources, and then emails a report that summarizes the data in a useful way to our customers. It’s a good use case for a microservice, so let’s dive into how we did it.

Our microservices stack includes Elixir, RabbitMQ for communication, and Apache Thrift for message serialization.

We designed a mailing system that consists of three parts:

  1. the main user-facing application that contains the data,
  2. a data-processing microservice, and
  3. a service for composing and sending the emails.

Asynchronous messaging with RabbitMQ

For asynchronous messaging, we decided to use RabbitMQ, a platform for sending and receiving messages.

First, we run a cron job inside our main application that gathers the data. Then, we encode that data and send it to the Elixir mailing microservice using RabbitMQ.

Every RabbitMQ communication pipeline consists of a producer and a consumer. For publishing messages from our main application, we have a Ruby producer that sends a message to a predefined channel and queue.

def self.publish(message)
  options = {
    :exchange => "exchange_name",
    :routing_key => "routing_key",
    :url => "amqp_url"
  }
  Tackle.publish(message, options)
end

In the last line above, we are using our open source tool called Tackle to publish a message. Tackle tackles the problem of processing asynchronous jobs in a reliable manner by relying on RabbitMQ. It serves as an abstraction around RabbitMQ’s API.

The consumer is a microservice written in Elixir, so we use ex-tackle, which is an Elixir port of Tackle:

defmodule MailingService.Consumer do
  use Tackle.Consumer,
    url: "amqp_url",
    exchange: "exchange_name",
    routing_key: "routing_key",
    service: "service_name",
    retry_limit: 10,
    retry_delay: 10

  def handle_message(message) do
    message
    |> Poison.decode!
    |> Mail.compose
  end
end

We connect to the specified exchange and wait for encoded messages to arrive. The retry_limit option specifies how many times we want to retry message handling before the message is sent to the dead letter queue, while retry_delay sets the timespan between retries. This ensures the stability and reliability of our publish-subscribe system.

In our case, we use the decoded message to request data from the data processing microservice, and later use that response to send a message to the mailing service.

HTML template rendering in Elixir with EEx

Once our data is received and successfully decoded, we start composing the email by inserting the received data into the email templates. Similar to Ruby’s ERB and Java’s JSPs, Elixir has EEx, or Embedded Elixir. EEx allows us to embed and evaluate Elixir inside strings.

While using EEx, we go through three main phases: the first one is evaluation, the second is definition, and the third is compilation. EEx rules apply when a filename ends with the .eex extension (in our case, .html.eex).
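
As a quick illustration of the evaluation phase, EEx can evaluate a template string directly, with values passed in as a keyword list of bindings:

iex> EEx.eval_string "<p>Hello, <%= name %>!</p>", [name: "Semaphore"]
"<p>Hello, Semaphore!</p>"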

Our email consists of multiple sections, all of which are using different datasets. Because of this, we divided our HTML email into partials for easier composing and improved code readability.

In order to evaluate the data inside a partial, we call the eval_file function and pass the data to partials:

<div>
  <%= EEx.eval_file "data/_introduction.html.eex",
                    [data1: template_data.data1,
                     data2: template_data.data2] %>

  <%= EEx.eval_file "data/_information.html.eex",
                    [data2: template_data.data2,
                     data3: template_data.data3] %>
</div>

Once we have all partials in place, we can combine them by evaluating them inside an entry point HTML template.

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
 <%= EEx.eval_file "data/_header.html.eex" %>
 <%= EEx.eval_file "data/_content.html.eex", [template_data: template_data] %>
</html>

Sending email with the SparkPost API

For email delivery, we rely on SparkPost. It provides email delivery services for apps, along with useful email analytics. In our case, we used the Elixir SparkPost API client and wrapped it in a Mailer module that makes sending email straightforward.

defmodule MailingService.Mailer do
  alias SparkPost.{Content, Recipient, Transmission}

  @return_path "semaphore+notifications@renderedtext.com"

  def send_message(recipient_email, content) do
    Transmission.send(%Transmission{
                      recipients: [ recipient_email ],
                      return_path: @return_path,
                      content: content,
                      campaign_id: "Campaign Name"
                    })
  end
end

Once we’ve defined this module, we can easily use it anywhere, as long as we pass the correct data structure. For example, we have a function that creates a data structure for the email template and passes it to the send_message function along with the desired recipients.

def compose_and_deliver(data1, data2) do
  mail = %Content.Inline{
    subject: "[Semaphore] #{TimeFormatter.today} Mail subject",
    from: "Semaphore <semaphore+notifications@renderedtext.com>",
    text: template_data(data1, data2) |> text,
    html: template_data(data1, data2) |> html
  }
  @mailer.send_message(data2["email"], mail)
end

SparkPost also enables us to send to multiple recipients at the same time, as well as send both HTML and plain text versions of an email. Bear in mind that you also need to provide .txt templates in order to send a plain text email.
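
The text/1 and html/1 helpers used in compose_and_deliver above are not shown in this post; here is a minimal sketch of what they might look like, reusing EEx the same way as before (the template paths are hypothetical):

defp html(template_data) do
  EEx.eval_file "data/mail.html.eex", [template_data: template_data]
end

defp text(template_data) do
  EEx.eval_file "data/mail.txt.eex", [template_data: template_data]
end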

As a final step in this iteration, we have developed a preview email that service owners receive a few hours before the production reports go out to customers.

Wrapping up

We now have a mailing microservice, made using Elixir, SparkPost, and RabbitMQ. Combining these three has allowed us to create a microservice that takes less than 4 seconds to gather data, send it, receive it on the other end, compose the emails, and dispatch them to customers.

What is Proper Continuous Integration?

Continuous integration (CI) is confusing. As with all ideas, everybody does their own version of it in practice.

CI is a solution to the problems we face while writing, testing and delivering software to end users. Its core promise is reliability.

A prerequisite for continuous integration is having an automated test suite. This is not a light requirement. Learning to write automated tests and mastering test-driven development takes years of practice. And yet, in a growing app, the tests we’ve developed can become an impediment to our productivity.

Are We Doing CI?

Let’s take two development teams, both writing tests, as an example. The first one’s CI build runs for about 3 minutes. The second team’s build clocks in at 45 minutes. They both use a CI server or a hosted CI service like Semaphore that runs tests on feature branches. They both release reliable software in predictable cycles. But are they both doing proper continuous integration?

Martin Fowler recently shared a description of an informal CI certification process performed by Jez Humble:

He usually begins the certification process by asking his [conference] audience to raise their hands if they do Continuous Integration. Usually most of the audience raise their hands.

He then asks them to keep their hands up if everyone on their team commits and pushes to a shared mainline (usually shared master in git) at least daily.

Over half the hands go down.

He then asks them to keep their hands up if each such commit causes an automated build and test. Half the remaining hands are lowered.

Finally he asks if, when the build fails, it’s usually back to green within ten minutes.

With that last question only a few hands remain. Those are the people who pass his certification test.

Software Development or a Sword Fight?

If a CI build takes long enough for us to have time to go practice swordsmanship while we wait, we approach our work defensively. We tend to keep branches on the local computer longer, and thus every developer’s code is in a significantly different state. Merges are rarer, and they become big and risky events. Refactoring becomes hard to do on the scale that the system needs to stay healthy.

With a slow build, every “git push” sends us to Limbo. We either wait, or look for something else to do to avoid being completely idle. And if we context-switch to something else, we know that we’ll need to switch back again when the build is finished. The catch is that every task switch in programming is hard and it sucks up our energy.

The point of continuous in continuous integration is speed. Speed drives high productivity: we want feedback as soon as possible. Fast feedback loops keep us in a state of flow, which is the source of our happiness at work.

So, it’s helpful to establish criteria for what proper continuous integration really means and how it’s done.

The 10-Minute Test

It’s simple: does it take you less than 10 minutes from pushing new code to getting results? If so, congratulations. Your team is equipped for high performance. If not, your workflow only has elements of a CI process, for lack of a better term. Such slowness breeds bad habits and hurts the productivity of all developers on a team, which ultimately inhibits the performance of the company as a whole.

Nobody sets out to build an unproductive delivery pipeline. Yet, we get so busy writing code that, like the proverbial boiling frog, we don’t notice the change until we accept it as the way things just are. Of course our build takes long, we have over 10,000 lines of code!

The Light at the End of the Tunnel

But, things don’t have to be this way. Regardless of how big your test suite is, parallelizing tests can cut waiting time down to just a couple of minutes or less. A fast hosted CI service that allows you to easily configure jobs to run in parallel and run as many jobs as you need can be a good solution. By parallelizing tests, you’ll reduce the time you spend deciding what to do while you wait, and keep your team in a state of flow.

We are building Semaphore, which has been proven to be the fastest hosted CI service on the market. We’re on a mission to make CI fast and easy. And we’re getting ready to take CI speed to the next level.

Lightweight Docker Images in 5 Steps

Lightweight Docker Images Speed Up Deployment

Deploying your services packaged in lightweight Docker images has many practical benefits. In a container, your service usually comes with all the dependencies it needs to run, it’s isolated from the rest of the system, and deployment is as simple as running a docker run command on the target system.

However, most of the benefits of dockerized services can be negated if your Docker images are several gigabytes in size and/or they take several minutes to boot up. Caching Docker layers can help, but ideally you want to have small and fast containers that can be deployed and booted in a matter of minutes, or even seconds.

The first time we used Docker at Rendered Text to package one of Semaphore’s services, we made many mistakes that resulted in a huge Docker image that was painful to deploy and maintain. However, we didn’t give up, and, step by step, we improved our images.

We’ve managed to make a lot of improvements since our first encounter with Docker, and we’ve successfully reduced the footprint of our images from several gigabytes to around 20 megabytes for our latest microservices, with boot times that are always under 3 seconds.

Our First Docker Service

You might be wondering how a Docker image can possibly be larger than a gigabyte. When you take a standard Rails application — with gems, assets, background workers and cron jobs — and package it using a base image that comes with everything but the kitchen sink preinstalled, you will surely cross the 1 GB threshold.

We started our Docker journey with a service that used Capistrano for deployment. To make our transition easy, we started out with a base Docker image that resembled our old workflow. The phusion/passenger-full image was a great candidate, and we managed to package up our application very quickly.

A big downside of using passenger-full was that it’s around 300 MB in size. When you add all of your application’s dependency gems, which can easily be around 300 MB in size, you are already starting at around 600 MB.

The deployment of that image took around 20 minutes, which is an unacceptable time frame if you want to be happy with your continuous delivery pipeline. However, this was a good first step.

We knew that we could do better.

Step 1: Use Fewer Layers

One of the first things you learn when building your Docker images is that you should squash multiple Docker layers into one big layer.

Let’s take a look at the following Dockerfile, and demonstrate why it’s better to use fewer layers in a Docker image:

FROM ubuntu:14.04

RUN apt-get update -y

# Install packages
RUN apt-get install -y curl
RUN apt-get install -y postgresql
RUN apt-get install -y postgresql-client

# Remove apt cache to make the image smaller
RUN rm -rf /var/lib/apt/lists/*

CMD bash

When we build the image with docker build -t my-image ., we get an image that is 279 MB in size. With docker history my-image we can list the layers of our Docker image:

$ docker history my-image

IMAGE               CREATED             CREATED BY                                      SIZE
47f6bd778b89        7 minutes ago       /bin/sh -c #(nop)  CMD ["/bin/sh" "-c" "bash"   0 B
3650b449ca91        7 minutes ago       /bin/sh -c rm -rf /var/lib/apt/lists/*          0 B
0c43b2bf2d13        7 minutes ago       /bin/sh -c apt-get install -y postgresql-client 1.101 MB
ce8e5465213b        7 minutes ago       /bin/sh -c apt-get install -y postgresql        56.72 MB
b3061ed9d53a        7 minutes ago       /bin/sh -c apt-get install -y curl              11.38 MB
ee62ceeafb06        8 minutes ago       /bin/sh -c apt-get update -y                    22.16 MB
ff6011336327        3 weeks ago         /bin/sh -c #(nop) CMD ["/bin/bash"]             0 B
<missing>           3 weeks ago         /bin/sh -c sed -i 's/^#\s*\(deb.*universe\)$/   1.895 kB
<missing>           3 weeks ago         /bin/sh -c rm -rf /var/lib/apt/lists/*          0 B
<missing>           3 weeks ago         /bin/sh -c set -xe   && echo '#!/bin/sh' > /u   194.6 kB
<missing>           3 weeks ago         /bin/sh -c #(nop) ADD file:4f5a660d3f5141588d   187.8 MB

There are several things to note in the output above:

  1. Every RUN command creates a new Docker layer
  2. The apt-get update command increases the image size by about 22 MB
  3. The rm -rf /var/lib/apt/lists/* command doesn’t reduce the size of the image

When working with Docker, we need to keep in mind that any layer added to the image is never removed. In other words, it’s smarter to update the apt cache, install some packages, and remove the cache in a single Docker RUN command.

Let’s see if we can reduce the size of our image with this technique:

FROM ubuntu:14.04

RUN apt-get update -y && \
    apt-get install -y curl postgresql postgresql-client && \
    rm -rf /var/lib/apt/lists/*

CMD bash

Hooray! After the successful build, the size of our image dropped to 250 megabytes. We’ve just reduced the size by 29 MB simply by joining the installation commands in our Dockerfile.

Step 2: Make Container Boot Time Predictable

This step describes an anti-pattern that you should avoid in your deployment pipeline.

When working on a Rails-based application, the biggest portion of your Docker images will be gems and assets. To circumvent this, you can try to be clever and place your gems outside of the container.

For example, you can run the Docker image by mounting a directory from the host machine, and cache the gems between two subsequent runs of your Docker image.

FROM ruby

WORKDIR /home/app
ADD . /home/app

CMD bundle install --path vendor/bundle && bundle exec rails server

Let’s build such an image. Notice that we use the CMD keyword, which means that our gems will be installed every time we run our Docker image. The build step only pushes the source code into the container.

docker build -t rails-image .

When we start our image this time around, it will first install the gems, and then start our Rails server.

docker run -tdi rails-image

Now, let’s use a volume to cache the gems between each run of our image. We will achieve this by mounting an external folder into our Docker image with the -v /tmp/gems:/home/app/vendor/bundle option.

docker run -v /tmp/gems:/home/app/vendor/bundle -tdi rails-image

Hooray! Or is it?

The technique above looks promising, but in practice, it turns out to be a bad idea. Here are some reasons why:

  1. Your Docker images are not stateless. If you run the image twice, you can experience different behaviour. This is not ideal because it makes your deployment cycle more exciting than it should be.

  2. Your boot time can differ vastly depending on the content of your cache directory. For bigger Rails projects, the boot time can range from several seconds up to 20 minutes.

We have tried to build our images with this technique, but we ultimately had to drop this idea because of the above drawbacks. As a rule of thumb, predictable boot time and immutability of your images outweigh any speed improvement you may gain by extracting dependencies from your containers.

Step 3: Understand and Use Docker Cache Effectively

When creating your first Docker image, the most obvious choice is to use the same commands you would use in your development environment.

For example, if you’re working on a Rails project, you would probably want to use the following:

FROM ruby

WORKDIR /home/app
ADD . /home/app

RUN bundle install --path vendor/bundle
RUN bundle exec rake assets:precompile

CMD bundle exec rails server

However, by doing this, you will effectively wipe every cached layer, and start from scratch on every build.

New Docker layers are created for every ADD, RUN and COPY command. When you build a new image, Docker first checks if a layer with the same content and history exists on your machine. If it already exists, Docker reuses it. If it doesn’t exist, Docker needs to create a new layer.

In the above example, ADD . /home/app creates a new layer even if you make the smallest change in your source code. Then, the next command RUN bundle install --path vendor/bundle always needs to do a fresh install of every gem because the history of your cached layers has changed.

To avoid this, it’s better to just add the Gemfile first, since it changes rarely compared to the source code. Then, you should install all the gems, and add your source code on top of it.

FROM ruby

WORKDIR /tmp/gems
ADD Gemfile /tmp/gems/Gemfile
RUN bundle install --path vendor/bundle

WORKDIR /home/app
ADD . /home/app
RUN bundle exec rake assets:precompile

RUN mv /tmp/gems/vendor/bundle vendor/bundle

CMD bundle exec rails server

With the above technique, you can shorten the build time of your image and reduce the number of layers that need to be uploaded on every deploy.

Step 4: Use a Small Base Image

Big base images are great when you’re starting out with Docker, but you’ll eventually want to move on to smaller images that contain only the packages that are essential for your application.

For example, if you start with phusion/passenger-full, a logical next step would be to try out phusion/baseimage-docker and enable only the packages that are necessary. We followed this path too, and we successfully reduced the size of our Docker images by 200 megabytes.

But why stop there? You can also try to run your image on a base ubuntu image. Then, as a next step, try out debian, which is only around 80 MB in size.

You will notice that every time you reduce the image size, some of the dependencies will be missing, and you will probably need to spend some time on figuring out how to install them manually. However, this is only a one-time issue, and once you’ve resolved it, you can most likely enjoy faster deployment for several months to follow.

We used ubuntu for several months too. However, as we moved away from Ruby as our primary language and started using Elixir, we knew that we could go even lighter.

Being a compiled language, Elixir has a nice property that the resulting compiled binaries are self-contained, and can pretty much run on any Linux distribution. This is when the alpine image becomes an awesome candidate.

The base image is only 5 MB, and if you compile your Elixir application, you can achieve images that are only around 25 MB in size. This is awesome compared to the 1.5 GB beast from the beginning of our Docker journey.

With Go, which we also use occasionally, we could go even further and build an image FROM scratch, and achieve 5 MB-sized images.
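
As a rough sketch, such an image boils down to copying a statically linked binary into an otherwise empty image; the binary name below is just an illustration:

# Build a statically linked binary first, for example:
#   CGO_ENABLED=0 GOOS=linux go build -o my-service .

FROM scratch
COPY my-service /my-service
CMD ["/my-service"]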

Step 5: Build Your Own Base Image

Using lightweight images can be a great way to improve build and deployment performance, but bootstrapping a new microservice can be painful if you need to remember to install a bunch of packages before you can use it.

If you create new services frequently, building your own customized base Docker image can bring great improvement. By building your own custom Docker image and publishing it on DockerHub, you can have small images that are also easy to use.
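
For example, a custom base image can preinstall the handful of packages that every service needs, and each new service then builds on top of it. Here is a minimal sketch; the image and package names are just an illustration:

# Dockerfile.base
FROM alpine:3.4
RUN apk add --no-cache bash ca-certificates openssl

# Build and publish it once:
#   docker build -t yourorg/base-image:1.0 -f Dockerfile.base .
#   docker push yourorg/base-image:1.0

# New services then start from it:
#   FROM yourorg/base-image:1.0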

Keep Learning, Docker is Great

Switching to a Docker-based development and deployment environment can be tricky at first. You can even get frustrated and think that Docker is not for you. However, if you persist and do your best to learn some good practices including how to make and keep your Docker images lightweight, the Docker ecosystem will reward you with speed, stability and reliability you’ve never experienced before.

P.S. If you’re looking for a continuous integration and deployment solution that works great with Docker — including fully-featured toolchain support, registry integrations, image caching, and fast image builds — try Semaphore for free.

How to Capture All Errors Returned by a Function Call in Elixir

If there is an Elixir library function that needs to be called, how can we be sure that all possible errors coming from it will be captured?

Elixir/Erlang vs. Mainstream Languages

Elixir is an unusual language because it functions as a kind of a wrapper around another language. It utilizes Erlang and its rock solid libraries to build new concepts on top of it. Erlang is also different compared to what one may call usual or mainstream languages, e.g. Java, C++, Python, Ruby, etc. in that it’s a functional programming language, designed with distributed computing in mind.

Process and Concurrency

The languages we are used to working with (at least here at Rendered Text) are imperative programming languages, and they’re quite different from Erlang and Elixir. None of them provides any significant abstraction over the operating system process model. In contrast, Erlang implements its threads of execution (i.e. units of scheduling) as lightweight user-space processes.

Also, these mainstream programming languages do not support concurrency natively. They support it through libraries, which are usually based on OS capabilities. At best, there is minimal support that stays backward compatible with the language’s initial sequential model. Consequently, mainstream languages have no error handling mechanisms designed for concurrent or distributed processing.

The Consequence

When a new language is introduced, we search for familiar concepts. In the context of error handling, we look for exceptions, and try to use them the way we are used to. And, with Elixir — we fail. Miserably.

Error Capturing

Why is error capturing a challenge? Isn’t it trivial? Well, in Erlang/Elixir, errors can be propagated using different mechanisms. When a library function is called, it’s sometimes unclear which mechanism it’s using for error propagation. Also, it might happen that an error is generated in some other process (created by an invoked function) and propagated to the invoking function/process using some of the language mechanisms.

Let’s consider the foo/0 library function. All we know is that it returns a numeric value on successful execution. It can also fail for different reasons and notify the caller in non-obvious ways. Here’s a trivial example of foo/0:

defmodule Library do
  def foo, do: 1..4 |> Enum.random |> choice

  defp choice(1), do: 1/3
  defp choice(2), do: 1/0
  defp choice(3), do: Process.exit(self, :unknown_error)
  defp choice(4), do: throw :overflow
end

What can we do to capture all possible errors generated by foo/0?

Error Propagation

In mainstream languages, there is only one way to interrupt processing and propagate errors up the call stack — exceptions. Nothing else. Exceptions operate within the boundaries of a single operating system thread. They cannot reach outside the thread scope, because the language does not recognize anything beyond that scope.

In Elixir, error handling works differently. There are multiple mechanisms for error notification, and this can be quite confusing to novice users.

The error condition can be propagated as an exception, or as an exit signal. There are two mutually exclusive flavors of exceptions: raised and thrown. When it comes to signals, a process can send an exit signal either to itself or to other processes. The reaction to receiving an exit signal differs based on the state of the receiving process and the signal value. Also, a process can choose to terminate itself because of an error by calling the exit/1 function.

Exceptions

There are two mechanisms to create an exception, and two mechanisms to handle them. As previously mentioned, these mechanisms are mutually exclusive!

A raised exception can only be rescued, and a thrown exception can only be caught. So, the following exceptions will be captured:

try do raise "error notification" rescue e -> e end
try do throw :error_notification  catch  e -> e end

But these won’t:

try do raise "error notification" catch  e -> e end
try do throw :error_notification  rescue e -> e end

This will do the job:

try do raise "error notification" rescue e -> e catch e -> e end
try do throw :error_notification  rescue e -> e catch e -> e end

However, this is still not good enough because neither rescue nor catch will handle the exit signal sent from this or any other process:

try do Process.exit self, :error_notification rescue e -> e catch e -> e end

From a single-process (and try block mechanism) perspective, there are just too many moving parts to get it right. And at the end of the day, we cannot cover all possible scenarios anyway.

Something is obviously wrong with this approach. Let’s look for a different solution.

The Erlang Way

Now, let’s move one step back and look at Erlang again. It’s an intrinsically concurrent language. Everything in it is designed to support distributed computing. Not Google scale distributed, but still distributed. Elixir is built on top of that.

An Elixir application, no matter how simple, should not be perceived as a single entity. It’s a distributed system on its own, consisting of tens, and often even hundreds of processes.

The try block is useful in the scope of a single process, and that’s where it should be used: to capture errors generated in the same process. However, if we need to handle all errors that might affect a process while a particular function is being executed (possibly originating in some other process), we’ll need to use some other mechanism. The try block cannot take care of that. This is a mindset change we’ll have to accept.

When in Rome, Do as the Romans Do

The Erlang philosophy is “fail fast”. In theory, this is a sound fault-tolerance approach. It basically means that you shouldn’t try to fix the unexpected! This makes much more sense than the alternative, since the unexpected is difficult to test. Instead, you should let the process or the entire process group die, and start over, from a known state. This can be easily tested.

So, what happens when an error notification is propagated above a process’s initial function? The process is terminated, and a notification is sent to all interested parties — all the processes that need to be notified. This is done consistently for all processes, and for all termination reasons, including a normal exit of the initial function.

If you want to capture all errors, you will need to engage an interprocess notification mechanism. This cannot be done using an intraprocess mechanism like the try block, at least not in Elixir.

Now, let’s discuss some approaches to capturing errors.

Approach 1: Exit Signals

Erlang’s “fail fast” mechanism is built on exit signals combined with Erlang messages. When a process terminates for any reason (whether it’s a normal exit or an error), it sends an exit signal to all processes it is linked with.

When a process receives an exit signal, it usually dies, unless it’s trapping exit signals. In that case, the signal is transformed into a message and delivered to the process message box.

So, to capture all errors from a function, we can:

  • enable exit signal trapping in the calling process,
  • execute the function in a separate but linked process, and
  • wait for the exit signal message and determine whether the process/function has finished successfully or failed, and if it failed, for what reason.

def capture_link(callback) do
  Process.flag(:trap_exit, true)
  pid = spawn_link(callback)
  receive do
    {:EXIT, ^pid, :normal} -> :ok
    {:EXIT, ^pid, reason}  -> {:error, reason}
  end
end

This approach is acceptable, but it’s a little intrusive, since capture_link/1 changes the invoking process state by calling the Process.flag/2 function. A non-intrusive approach (with no side effects involving the running process) is preferable.

Approach 2: Process Monitoring

Instead of linking (and possibly dying) with the process whose lifecycle is to be monitored, a process can be simply monitored. The process that requested monitoring will be informed when the monitored process terminates for any reason. The algorithm becomes as follows:

  • execute the function in a separate process that is monitored, but not linked to,
  • wait for the process termination message delivered by the monitor, and determine if the process/function has successfully completed or failed, and if it has failed, what is the reason behind the failure.

Here’s an example of a successfully completed monitored process:

iex> spawn_monitor fn -> :a end
{#PID<0.88.0>, #Reference<0.0.2.114>}
iex> flush
{:DOWN, #Reference<0.0.2.114>, :process, #PID<0.88.0>, :normal}

When a monitored process terminates, the process that requested monitoring receives a message in the following form: {:DOWN, MonitorRef, Type, Object, Info}.

Here’s a non-intrusive example of capturing all errors:

def capture_monitor do
  {pid, monitor} = spawn_monitor(&Library.foo/0)
  receive do
    {:DOWN, ^monitor, :process, ^pid, :normal} -> :ok
    {:DOWN, ^monitor, :process, ^pid, reason}  -> {:error, reason}
  end
end

Let’s take a look at an example implementation of the described capturing mechanism that can:

  • invoke any function and capture whatever output the invoked function generates (a return value or the reason behind the error) and
  • transfer it to the caller in a uniform way:
    • {:ok, state} or
    • {:error, reason}

The example implementation is as follows:

def capture(callback, timeout_ms) do
  {pid, monitor} = callback |> propagate_return_value_wrapper |> spawn_monitor
  receive do
    {:DOWN, ^monitor, :process, ^pid, :normal} ->
      receive do
        {__MODULE__, :response, response} -> {:ok, response}
      end
    {:DOWN, ^monitor, :process, ^pid, reason}  ->
      Logger.error "#{__MODULE__}: Error in handled function: #{inspect reason}"
      {:error, reason}
  after timeout_ms ->
    pid |> Process.exit(:kill)
    Logger.error "#{__MODULE__}: Timeout..."
    {:error, {:timeout, timeout_ms}}
  end
end

defp propagate_return_value_wrapper(callback) do
  caller_pid = self
  fn-> caller_pid |> send( {__MODULE__, :response, callback.()}) end
end

Approach 3: The Wormhole

We’ve covered some possible approaches to ensuring that all errors coming from an Elixir function are captured. To simplify error capturing, we created the Wormhole module, a production-ready callback wrapper. You can find it here, feel free to use it!

In Wormhole, we used Task.Supervisor to monitor the callback lifecycle. Here is the most important part of the code:

def capture(callback, timeout_ms) do
  Task.Supervisor.start_link
  |> callback_exec_and_response(callback, timeout_ms)
end

defp callback_exec_and_response({:ok, sup}, callback, timeout_ms) do
  Task.Supervisor.async_nolink(sup, callback)
  |> Task.yield(timeout_ms)
  |> supervisor_stop(sup)
  |> response_format(timeout_ms)
end
defp callback_exec_and_response(start_link_response, _callback, _timeout_ms) do
  {:error, {:failed_to_start_supervisor, start_link_response}}
end

defp supervisor_stop(response, sup) do
  Process.unlink(sup)
  Process.exit(sup, :kill)

  response
end

defp response_format({:ok,   state},  _),          do: {:ok,    state}
defp response_format({:exit, reason}, _),          do: {:error, reason}
defp response_format(nil,             timeout_ms), do: {:error, {:timeout, timeout_ms}}

Wormhole.capture starts Task.Supervisor, executes callback under it, waits for the response at most timeout_ms milliseconds, stops the supervisor, and returns a response in the :ok/:error tuple form.
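
Based on the capture/2 signature above, a call site might look like the following sketch; note that the published library's options may differ slightly:

case Wormhole.capture(&Library.foo/0, 5_000) do
  {:ok, value}     -> value            # foo/0 returned normally
  {:error, reason} -> {:error, reason} # raise, throw, exit signal, or timeout
end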

Takeaways

Elixir is inherently a concurrent language designed for developing highly distributed, fault-tolerant applications. Elixir provides multiple mechanisms for error handling. A user needs to be precise about what kinds of errors are to be handled and where they are coming from. If our intention is to handle errors originating from the same process they are being handled in, we can use common mechanisms found in mainstream, sequential languages, like the try block.

When capturing errors originating from a nontrivial logical unit (involving multiple processes), well known, sequential mechanisms will not be appropriate. In these types of situations, process monitoring mechanisms and a supervisor-like approach are in order.

A logical unit entry function (callback) needs to be executed in a separate process, in which it can succeed or fail without affecting the function-invoking process. In such a scenario, the function-invoking process spawns a supervisor. Then, it engages the language mechanism to transport the pass or fail information from the callback-executing process to the supervisor. All of this can be achieved without making any changes in the code from which errors are being captured, which makes this approach generally applicable.

Semaphore is described by Elixir developers as the only CI which supports Elixir out of the box. To make testing in Elixir even easier, we regularly publish tutorials on TDD, BDD, and using Docker with Elixir. Read our tutorials and subscribe here.

Sending ECS Service Logs to the ELK Stack

Having comprehensive logs is a massive life-saver when debugging or investigating issues. Still, logging too much data can have the adverse effect of hiding the actual information you are looking for. Because of these issues, storing your log data in a structured format becomes very helpful. Also, being able to track the number of occurrences and monitor their rate of change can be quite indicative of the underlying causes. This is why we decided to include the ELK stack in our architecture.

I’ll introduce you to one possible solution for sending logs from various separate applications to a central ELK stack and storing the data in a structured way. A big chunk of Semaphore’s architecture is made up of microservices living inside Docker containers and hosted by Amazon Web Services (AWS), so ELK joins this system as the point to which all of these separate services send their logs, which it then processes and visualizes.

In this article, I’ll cover how to configure the client side (the microservices) and the server side (the ELK stack).

Client-side Configuration

Before developing anything, the first decision we needed to make was to pick a logging format. We decided on syslog, which is a widely accepted logging standard. It allows for a client-server architecture for log collection, with a central log server receiving logs from various client machines. This is the structure that we’re looking for.

Our clients are applications sitting inside Docker containers, which themselves are parts of AWS ECS services. In order to connect them to ELK, we started by setting up the ELK stack locally, and then redirecting the logs from a Docker container located on the same machine. This setup was useful for both development and debugging. All we needed to do on the client side was start the Docker container as follows:

docker run --log-driver=syslog --log-opt syslog-address=udp://localhost:2233 <image_id>

We’ll assume here that Logstash is listening to UDP traffic on port 2233.

Once we were done developing locally, we moved on to updating our ECS services. For this, all we needed to do was update our task definition by changing the log configuration of the container:

{
  "containerDefinitions": [
    {
      "logConfiguration": {
        "logDriver": "syslog",
        "options": {
          "syslog-address": "udp://<logstash_url>:2233"
        }
      }
    },
    ...
  ],
  ...
}

This started our Docker containers with the same settings we previously used locally.

Server-side Configuration

On the server side, we started with the Dockerized version of the ELK stack. We decided to modify it so that it accepts syslog messages, and enable it to read the custom attributes embedded inside our messages (more on that in the ‘Processing’ section below). For both of these, we needed to configure Logstash. In order to do that, we needed to look into the config/logstash.conf file.

A Logstash pipeline consists of input, filter, and output sections. Inputs and outputs describe the means for Logstash to receive and send data, whereas filters describe the data transformations that Logstash performs. This is the basic structure of a logstash.conf file:

# logstash/config/logstash.conf

input {
  tcp {
    port => 5000
  }
}

## Add your filters / logstash plugins configuration here

output {
  elasticsearch {
    hosts => "elasticsearch:9200"
  }
}

Receiving Messages

In order to receive syslog messages, we expanded the input section:

input {
  udp {
    port => 2233
    type => inline_attributes
  }
}

This allowed Logstash to listen for UDP packets on the specified port. type is just a tag that we added to the received input in order to be able to recognize it later on.

Now, since our ELK stack components are sitting inside Docker containers, we needed to make the required port accessible. In order to do that, we modified our docker-compose.yml by adding port “2233:2233/udp” to Logstash:

services:
  logstash:
    ports:
      - "2233:2233/udp"
      ...
    ...
  ...
...

Since we’re hosting our ELK stack on AWS, we also needed to update our task definition to open the required port. We added the following to the portMappings section of our containerDefinition:

{
  "containerDefinitions": [
    {
      "portMappings": [
        {
          "hostPort": 2233,
          "containerPort": 2233,
          "protocol": "udp"
        },
        ...
      ]
    },
    ...
  ],
  ...
}

Processing

For processing, we decided to add the ability to extract key=value pairs from our message strings and add them as attributes to the structure that is produced by processing of a message. For example, the message ... environment=staging ... would produce a structure containing the key environment with the value staging.

We implemented this by adding the following piece of code into the filters section:

  ruby {
    code => "
      return unless event.get('message')
      event.get('message').split(' ').each do |token|
        key, value = token.strip.split('=', 2)
        next if !key || !value || key == ''
        event.set(key, value)
      end
    "
  }

The Ruby plugin allowed us to embed the Ruby code inside the configuration file, which came in very handy. Another useful thing we did at this point was to enable the outputting of processed messages to the console by adding the following to the output section:

stdout { codec => rubydebug }

Finally, we logged the following string in a client application:

“Some text service=service1 stage=stage1”

This produced the following event in the console debug, showing us the structure that gets persisted once the message has been processed:

{
     "message" => "<30>Nov  2 15:01:42  [998]: 14:01:42.671 [info] Some text service=service1 stage=stage1",
    "@version" => "1,
  "@timestamp" => "2016-11-02T14:01:42.672Z",
        "type" => "inline_attributes",
        "host" => "172.18.0.1",
     "service" => "service1",
       "stage" => "stage1"
}

Note that service=service1 and stage=stage1 were added as attributes to the final structure. This final structure is then available for searching and inspection through Kibana’s GUI.

Wrap-up

This sums up the logging setup. The result is a centralized logging system that can include new service logs with minimal setup on the client side. This allows us to analyse logs quickly and effortlessly, as well as visualise the logs of various separate services through Kibana’s GUI.

This is the first post on our brand new engineering blog. We hope that you’ll find it useful in setting up your logging architecture. Speaking of useful, we also hope that you’ll trust Semaphore to run the tests and deploy your applications for you.

Happy building!
