Faster Rails: How to Check if a Record Exists
Ruby and Rails are slow — this argument is often used to downplay the worth of the language and the framework. This statement in itself is not false. Generally speaking, Ruby is slower than its direct competitors such as Node.js and Python. Yet, many businesses from small startups to platforms with millions of users use it as the backbone of their operations. How can we explain these contradictions?
What makes your application slow?
While there can be many reasons behind making your application slow, database queries usually play the biggest role in an application's performance footprint. Loading too much data into memory, N+1 queries, lack of cached values, and the lack of proper databases indexes are the biggest culprits that cause slow requests.
There are some legitimate domains where Ruby is simply too slow. However, most of the slow responses in our applications usually boil down to unoptimized database calls and the lack of proper caching.
Even if your application is blazing fast today, it can become much slower in only several months. API calls that worked just fine can suddenly start killing your service with a dreaded HTTP 502 response. After all, working with a database table with several hundred records is very different from working with a table that has millions of records.
Existence checks in Rails
Existence checks are probably the most common calls that you send to your database. Every request handler in your application probably starts with a lookup, followed by a policy check that uses multiple dependent lookups in the database.
However, there are multiple ways to check the existence of a database record in Rails. We have present?, empty?, any?, exists?, and various other counting-based approaches, and they all have vastly different performance implications.
In general, when working on Semaphore, I always prefer to use .exists?.
I'll use our production database to illustrate why I prefer .exists? over the alternatives. We will try to look up if there has been a passed build in the last 7 days.
Let's observe the database calls produced by our calls.
Build.where(:created_at => 7.days.ago..1.day.ago).passed.present? # SELECT "builds".* FROM "builds" WHERE ("builds"."created_at" BETWEEN # '2017-02-22 21:22:27.133402' AND '2017-02-28 21:22:27.133529') AND # "builds"."result" = $1 [["result", "passed"]] Build.where(:created_at => 7.days.ago..1.day.ago).passed.any? # SELECT COUNT(*) FROM "builds" WHERE ("builds"."created_at" BETWEEN # '2017-02-22 21:22:16.885942' AND '2017-02-28 21:22:16.886077') AND # "builds"."result" = $1 [["result", "passed"]] Build.where(:created_at => 7.days.ago..1.day.ago).passed.empty? # SELECT COUNT(*) FROM "builds" WHERE ("builds"."created_at" BETWEEN # '2017-02-22 21:22:16.885942' AND '2017-02-28 21:22:16.886077') AND # "builds"."result" = $1 [["result", "passed"]] Build.where(:created_at => 7.days.ago..1.day.ago).passed.exists? # SELECT 1 AS one FROM "builds" WHERE ("builds"."created_at" BETWEEN # '2017-02-22 21:23:04.066301' AND '2017-02-28 21:23:04.066443') AND # "builds"."result" = $1 LIMIT 1 [["result", "passed"]]
The first call that uses .present? is very inefficient. It loads all the records from the database into memory, constructs the Active Record objects, and then finds out if the array is empty or not. In a huge database table, this can cause havoc and potentially load millions of records, that can even lead to downtimes in your service.
The second and third approaches, any? and empty?, are optimized in Rails and load only COUNT(*) into the memory. COUNT(*) queries are usually efficient, and you can use them even on semi-large tables without any dangerous side effects.
The third approach, exists?, is even more optimized, and it should be your first choice when checking the existence of a record. It uses the SELECT 1 ... LIMIT 1 approach, which is very fast.
Here are some numbers from our production database for the above queries:
present? => 2892.7 ms any? => 400.9 ms empty? => 403.9 ms exists => 1.1 ms
This small tweak can make your code up to 400 times faster in some cases.
If you take into account that 200ms is considered the upper limit for an acceptable response time, you will realize that this tweak can spell the difference between a good, sluggish, and bad user experience.
Should I always use exists?
I consider exists? a good sane default that usually has the best performance footprint. However, there are some exceptions.
For example, if we are checking for the existence of an association record without any scope, any? and empty? will also produce a very optimized query that uses SELECT 1 FROM ... LIMIT 1 form, but any? fill not hit the database again if the records are already loaded into memory.
This makes any? faster by one whole database call when the records are already loaded into memory:
project = Project.find_by_name("semaphore") project.builds.load # eager loads all the builds into the association cache project.builds.any? # no database hit project.builds.exists? # hits the database # if we bust the association cache project.builds(true).any? # hits the database project.builds(true).exists? # hits the database
As a conclusion, my general advice is to always use exists? and improve the code based on metrics.
At Semaphore, we're all about speed. We're on a mission to make continuous integration fast and easy. Driven by numerous conversations with our customers and our own experiences, we've built a new CI feature that can automatically parallelize any test suite and cut its runtime to just a few minutes - Semaphore Boosters. Learn more and try it out.