Keith Smiley of Lyft on How to Scale Code with Bazel

In this episode, I welcome Keith Smiley, Principal Engineer and Lead Maintainer of Bazel’s iOS support at Lyft. We talk about how the Lyft team is using Bazel and what the advantages are of using this build tool. Keith also shares his team’s experience of adopting Bazel.

Listen to our entire conversation above, and check out my favorite parts in the episode highlights.

Key takeaways:

How the Lyft team is using Bazel and monorepos
How Lyft adopted Bazel
How the Lyft team maintains its build configuration
What is Bazel
How to get into Bazel
Handling flaky tests in Bazel

You can also get Semaphore Uncut on Apple Podcasts, Spotify, Google Podcasts, Stitcher, and more.


Edited Transcript

Darko Fabijan: Hello, and welcome to Semaphore Uncut, a podcast for developers about building great products. Today I’m excited to welcome Keith Smiley. Keith, thank you so much for joining us. Please go ahead and introduce yourself.

Keith Smiley: My name’s Keith Smiley. I’m a principal engineer at Lyft where I mostly work on developer infrastructure.

When I was in college, I got involved in the CocoaPods project. It’s a dependency manager for iOS, similar to npm and RubyGems. Through that, I made a lot of friends in the iOS community and also got some great experience.

After college, I jumped straight into an iOS consulting job and worked there for a little while before moving to Lyft. I’ve been at Lyft for almost seven years now, mostly working on infrastructure.

Darko Fabijan: You mentioned that one of the main things that you are working on is building infrastructure for your iOS teams and that you’re using Bazel. There’s a lot of talk about Bazel and monorepos these days. Some people are brave and jumping into it straight away; others are more cautious. Can you give us a bit of history, how you ended up adopting Bazel and how it is working for Lyft now?


How the Lyft Team Is Using Bazel and Monorepos

Keith Smiley: At Lyft, for a long time, we had just one app. It was a combined rider and driver app, and we had a relatively small iOS team. But then we started to grow and ended up splitting the apps because the experience and the people working on them were very different.

Then our team exploded and we went from 15-20 engineers to 60. At that time, we knew we were starting to hit some difficult technical challenges. So we started looking around at some alternatives, and Bazel looked pretty promising.

We had a lot of incremental things along the way. Speaking of monorepos, at some point, we had our apps in different repos and then we had shared repos for dependencies. Anybody who’s managed a setup like that knows that that is challenging.

A monorepo felt natural at that point, especially because we were sharing so much code between all of the apps. Adopting a monorepo greatly reduced the operational burden of managing those shared repos and opened the door to Bazel.

When using Bazel, it’s preferable to work in a monorepo rather than across many different repos. The developer experience of a monorepo is also much nicer. Bazel still works really well for us. For instance, we can remotely cache build artifacts. If I build something on my machine and then you build it on yours, you can just download what I built instead of rebuilding it yourself. Now, instead of taking 15 minutes, a build of our rider app takes two.
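The remote caching idea Keith describes can be illustrated with a small sketch. This is not Bazel's implementation or Lyft's setup: the `RemoteCache` class, the `input_digest` and `build` functions, and the fake compiler are all hypothetical names invented for illustration. The only point is that an artifact is stored under a digest of its inputs, so a second machine with identical inputs can fetch it instead of rebuilding.

```python
import hashlib


class RemoteCache:
    """Toy stand-in for a shared artifact store (hypothetical, in-memory)."""

    def __init__(self):
        self._store = {}

    def get(self, key):
        return self._store.get(key)

    def put(self, key, artifact):
        self._store[key] = artifact


def input_digest(sources):
    """Hash file names and contents in a stable order."""
    h = hashlib.sha256()
    for name in sorted(sources):
        h.update(name.encode())
        h.update(b"\0")
        h.update(sources[name].encode())
        h.update(b"\0")
    return h.hexdigest()


def build(sources, cache, compile_fn):
    """Return (artifact, cache_hit); compile only if nobody built these inputs yet."""
    key = input_digest(sources)
    cached = cache.get(key)
    if cached is not None:
        return cached, True
    artifact = compile_fn(sources)
    cache.put(key, artifact)
    return artifact, False


def fake_compile(sources):
    # Hypothetical compiler: joins the file names to stand in for real work.
    return "binary(" + "+".join(sorted(sources)) + ")"


shared_cache = RemoteCache()
sources = {"Rider.swift": "struct Rider {}", "Map.swift": "struct Map {}"}

first = build(sources, shared_cache, fake_compile)   # first machine compiles
second = build(sources, shared_cache, fake_compile)  # second machine gets a cache hit
```

In this sketch, changing any file's contents changes the digest, so the next build misses the cache and compiles again.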

Bazel has been a really great tool for us to continue scaling our code base, team, number of apps and everything else. I feel like we’ve had one of the ideal experiences with Bazel so far.
-Keith Smiley, Principal Engineer at Lyft

How Lyft Adopted Bazel

Darko Fabijan: You mentioned that it’s been a couple of years since you embraced Bazel. Do you remember how the introduction of Bazel worked with the whole team? Usually, you can’t just say, “Okay, on Friday this week, we are going to start using Bazel. That thing that you worked on, you don’t work that way anymore.” How did you handle that?

Keith Smiley: There were a lot of incremental steps in that migration for us. One thing especially about iOS and that community is that most folks are using the exact same IDE: everyone uses Xcode from Apple. They expect some basic functionality in that to work, like using certain keyboard shortcuts.

We knew that we had to mirror that experience as much as possible. So we had a few different steps.

Step 1.

We made folks define their modules, the unit of code separation we were using, in a text-based configuration file. We ended up using that to generate an Xcode project even before Bazel, and that was a good analogy to Bazel’s build files.

Step 2.

Then we ended up replacing those with build files. There was a bit of a change for developers there, but realistically they’re mostly just adding one line here or there to add different dependencies. So that wasn’t a huge change.

Step 3.

After that, we started using that to build our project on CI and saw immediate benefits. So we swapped the local developer experience over but still kept maintaining the normal Xcode experience. Unless you know where to look, you can’t tell that it’s Bazel.

That was a great path for us because it was much more approachable for folks who are joining and who’d never heard of Bazel.

How the Lyft Team Maintains Their Build Configuration

Darko Fabijan: Over time, did developers become comfortable with Bazel and start maintaining their own Bazel builds, or is there a dedicated maintainer of the build configuration?

Keith Smiley: There are a few pieces. First, there are the build files that folks actually define their modules in. Those are completely maintained by developers, but they’re also very simple. It’s like: I have this module. It has these dependencies. I have this test target. It has these dependencies. That’s pretty much it.

Then there’s the layer that we have on top of Bazel. Bazel allows you to define what they call macros. You can also define full-on rules. There’s some subtle technical differences between those two things. But either way, the benefit is that you don’t have to use exactly what Bazel provides out of the box. And if you want to, you can abstract it such that the interface that your developers use is more specific to you. So we put in some work to do that, to make it more ergonomic for our developers and hide some of the technical complexities of Bazel. The tooling team who maintains CI maintains those. But those mostly don’t have to change, except when there’s a new version of Bazel or new features in Xcode.

The benefit of using Bazel is that you don’t have to use exactly what Bazel provides out of the box. You can abstract it so that the interface that your developers use is more specific to you.
-Keith Smiley, Principal Engineer at Lyft

It works really well for us. In our case, a mobile monorepo where one central team owns all the build tooling makes things easier for developers. The team can create a consistent development environment for everybody and make sure people don’t accidentally shoot themselves in the foot with some feature that may not work on iOS the way that Bazel intends.

What Is Bazel

Darko Fabijan: Let’s rewind a little bit and share some tips on how to start with Bazel for folks who don’t know what it is yet.

Keith Smiley: To start with Bazel, it depends on what community you work in. If you’re familiar with C++, you might have used CMake, and Bazel is a nice replacement for that.

If you’re from Java land, you might use Maven, and that’s also a good analogy. But mainly Bazel is a language-agnostic build tool. It knows how to take some source files and produce some output binary.

One of Bazel’s top line goals is to be hermetic. With the same inputs, you’ll always get the same outputs. That’s a hard thing to guarantee because you have to define the entire set of inputs, which includes what version of the tools you have on your computer, what environment variables you have set, etc.
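To make the "entire set of inputs" idea concrete, here is a minimal sketch of a digest that covers source contents, tool versions, and environment variables together. The function name `action_key` and all the sample values are hypothetical, and Bazel's real bookkeeping is more involved. The sketch only shows why a tool upgrade or a different environment variable has to count as a changed input.

```python
import hashlib
import json


def action_key(source_digests, tool_versions, env):
    """Digest over everything assumed to influence the output (illustrative only)."""
    payload = json.dumps(
        {"sources": source_digests, "tools": tool_versions, "env": env},
        sort_keys=True,  # stable ordering so equal inputs always hash equally
    )
    return hashlib.sha256(payload.encode()).hexdigest()


# All values below are made up for the example.
base = action_key({"App.swift": "abc123"}, {"swiftc": "5.5"}, {"SDKROOT": "/sdk/15.0"})
same = action_key({"App.swift": "abc123"}, {"swiftc": "5.5"}, {"SDKROOT": "/sdk/15.0"})
new_tool = action_key({"App.swift": "abc123"}, {"swiftc": "5.6"}, {"SDKROOT": "/sdk/15.0"})
new_env = action_key({"App.swift": "abc123"}, {"swiftc": "5.5"}, {"SDKROOT": "/sdk/15.2"})
```

Here `base` and `same` are equal, while `new_tool` and `new_env` differ from `base` even though no source file changed.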

Building software that includes a mix of programming languages? Maybe you've looked at Bazel, which is a fairly popular build tool.

This @semaphoreci post includes a great look at what Bazel is all about, and how to use it. https://t.co/mgBRVG2Q9g
— Richard Seroter (@rseroter) October 7, 2021

There’s also a separate concept in Bazel called “rules”. Written against a public API that Bazel provides, they define how to build code in each language. For example, there are rules that know how to compile Go code, and there’s rules_apple, which knows how to package iOS apps and things like that.

One nice thing about that architecture is that those rules are entirely separate from Bazel. You can create and maintain them yourself. This is also nice for Google because they don’t have to know how to compile every single language in the world or maintain any of that. It’s also nice for users because it gives them a lot of flexibility to make things work the way they want.

How to Get into Bazel

Keith Smiley: Getting into it is an interesting question. There’s a lot of resources out there in the Bazel documentation. There’s also a conference called BazelCon that Google hosted.

💚 Get Ready! #BazelCon 2021 is on November 17-18 and the agenda is LIVE 👉 https://t.co/Rplal7XsbR

Connect with the Bazel team for a live Q&A, hear exciting announcements, and more! pic.twitter.com/IfH6woxkAJ
— bazelbuild (@bazelbuild) October 12, 2021

If you’re familiar with some other types of build systems, Bazel should be pretty approachable. It might be a little more strict than you’re used to. But once you get into it, it’s quite nice and has a lot of features that allow you to introspect what’s going on.

Also, Bazel has a powerful query language where you can do SQL-type queries against your build graph, which can give you some really nice insights into what’s happening. Once you get used to some of these tools, it’s actually a very usable build tool, especially compared to some of the alternatives out there.

Darko Fabijan: You mentioned that it depends on the language and the background. How would you define the sweet spot when it makes sense for teams to adopt Bazel?

Keith Smiley: This is definitely one of the hardest things to put a specific timeframe on. While I think Bazel can be approachable, it can still require a bit of maintenance. This depends on the communities and how fast Bazel itself is changing with what you’re trying to do.

Some communities are more stable than others. We have definitely found that maintaining the Bazel setup takes a decent amount of time. You definitely have to have at least one person who can dedicate most of their time to working on that, which is a bar for when you might want to adopt it.

Another thing is that there are a lot of other options on how to solve your problems before moving to Bazel. Bazel is a pretty big hammer for this. In your language’s community, you probably have a lot of tools that you can use or a lot of tweaks that you can make to the default tools to help with the specific problems that you see. But once you’ve exhausted those options and feel like you can take on that maintenance burden, that’s when it’s worth starting with Bazel.

Darko Fabijan: In terms of testing, what benefits do you think Bazel provides?

Keith Smiley: Someone from Google once said that Bazel isn’t a build tool, it’s a test tool. Testing has a different set of concerns like whether or not your tests are written in the same language as your actual code or what environment you need to set for your tests.

Bazel isn’t a build tool, it’s a test tool.
-Keith Smiley, Principal Engineer at Lyft

Bazel makes some of those complexities way easier. Test rules look very similar to build rules in Bazel. You can define a test, it can have some sources, dependencies, etc. Then Bazel can also apply its smarts to only run those tests if something that is an input to them has changed.

If you have hundreds of test bundles and you just say, ‘Run all the tests’, Bazel could end up running literally nothing, depending on whether those have been run before with the same configuration. It also gives you a really nice way to interoperate between languages for different types of tests.

For example, we have a lot of internal tools that assist with the building of our apps. Those are also built, run and tested with Bazel. It’s a common pattern for us to have some Python command line tool that has some Python unit tests but then it also has some integration tests that you might orchestrate with some tiny shell script that runs the thing and verifies the output. Bazel does a great job of treating all of those things the same in a way that’s very transparent. You can run any types of tests together.

Darko Fabijan: So, the iOS code is in the same repository as those build tools that you mentioned. All those helper tools also live in the same repository, right?

Keith Smiley: Yeah, that’s how we do it, but you can use other repositories, too. We do that for open source dependencies that we pull in or dependencies we pull in through native package managers like PyPI for Python.

If we really needed to, we could extract our tools into a separate repo and then vendor those through Bazel. It’s just been a little bit easier for us this way.

Also, since we have all of our iOS code in one repo, we don’t technically need those tools outside of the repo anyways, so that makes it a little bit easier. But Bazel does fully support that if you want to.

Darko Fabijan: One thing that you mentioned is that you save a lot of CPU time by not running everything and that your Mac build cluster would have to be much bigger otherwise. Do you have a gut feeling for the difference, in terms of the developer experience, between building everything all the time and using this dependency management, where you can decide what needs to be built and what doesn’t?

Keith Smiley: There’s two pieces to that. One is the local developers and how much they’re rebuilding when they don’t need to. The other one is CI.

I think for local developers, the prime use case is they’re pulling the main branch after finishing working on a feature and that also includes potentially hundreds of other commits. Then they want to start working on a new feature, so they rebuild the app.

At that point without Bazel, you’d end up pretty much doing a clean build every single time because there’s just a lot of churn in the code with a lot of engineers. But instead, they can download everything and get going much faster.

Then there’s CI. For iOS builds, we have to run CI on macOS. We have our own Mac mini fleet that we manage. One thing about that is obviously we can’t auto-scale that, so we need to be very careful about the CPU time on those machines, especially in a monorepo where we build 10 apps.

If we have a hundred machines, which is about what we have today, then 10 pull requests, theoretically, would consume all of those if you were just building the apps. But then realistically, we also do a bunch of test jobs and a bunch of lint jobs and other things like that. You can definitely see how you could very quickly get to a point where CI is just constantly consumed.

We avoid that as much as we can by using Bazel’s query that I mentioned earlier. We only build and test the things that we know might have changed. We even avoid scheduling CI machines for work that we know ahead of time isn’t needed. That preserves a lot of the parallelism that we have on CI.
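The selection step described here amounts to a reverse-dependency walk over the build graph: start from what changed and collect everything that depends on it. Bazel's query language offers an `rdeps` function for that kind of lookup, but the conversation does not say which queries Lyft runs, so the following is only a sketch of the idea. The target labels and the `affected_targets` helper are hypothetical.

```python
from collections import defaultdict, deque


def affected_targets(deps, changed):
    """Return the changed targets plus everything that transitively depends on them.

    deps maps each target to the list of targets it depends on directly.
    """
    # Invert the edges so we can walk from a dependency to its dependents.
    reverse = defaultdict(set)
    for target, direct_deps in deps.items():
        for dep in direct_deps:
            reverse[dep].add(target)

    seen = set(changed)
    queue = deque(changed)
    while queue:
        current = queue.popleft()
        for dependent in reverse[current]:
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen


# Hypothetical build graph with two apps sharing some modules.
deps = {
    "//apps/rider": ["//modules/maps", "//modules/payments"],
    "//apps/driver": ["//modules/maps"],
    "//modules/maps": ["//modules/core"],
    "//modules/payments": ["//modules/core"],
    "//modules/core": [],
}

only_payments = affected_targets(deps, ["//modules/payments"])
everything = affected_targets(deps, ["//modules/core"])
```

With this graph, a change to `//modules/payments` affects only that module and the rider app, so a CI job for the driver app would not need to be scheduled. A change to `//modules/core` affects every target.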

How Does Bazel Work?

Darko Fabijan: A question coming from someone who has zero experience with Bazel. Earlier, you mentioned that SQL-like language where you can query various information about your build. Does that mean that somewhere there’s a kind of database which is supporting Bazel where you can query data about all the previous builds?

Keith Smiley: There’s not. When you do a Bazel query, it queries against Bazel’s current memory of what the build graph looks like. Bazel is kind of two pieces: a client and a server. The first time you run a client, it forks a server. Then that server lives for a while and it’ll take your build requests over time.

One of the things that can make Bazel be fast is that it’ll load all of your build files and the entire build graph of whatever you’re trying to interact with the first time it needs it, and then it’ll keep that in memory. Then if you change a build file, maybe it’ll invalidate that, but otherwise it’ll keep that around.

So when you run a query, you’re running against that current in-memory version of the build graph. It doesn’t know about any previous builds in that case.

Today Bazel is just the current state and you have to roll that yourself if you want to store previous states.

Getting data out of Bazel

Darko Fabijan: When the engineering team is growing from 10 to 60 people or more, more code is coming in, and all those tests need to be maintained. At some point, you have to start instrumenting everything to make sure that builds stay as fast as two minutes for developers. People start wrapping pieces of their build pipeline in blocks that do some kind of instrumentation, submit the data to something like Prometheus or InfluxDB, then chart it in Grafana, and so on. Can you maybe share some of your experiences in that area?

Keith Smiley: As far as data collection, Bazel provides a lot of its internal data and you can use it however you want. So they have a lot of different types of logs you can dump.

You can get a Chrome trace, for example, so that you can see how long each step in the build took. You can see that broken down by core and by what the bottleneck is. That can be a really nice bird’s-eye view to start from: “Okay, the build is slow. Where do we start?”
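As a rough illustration of working with such a trace, the sketch below sums time per category. It assumes the file follows the Chrome trace event format, where a top-level `traceEvents` list holds events and complete events carry `"ph": "X"` with a duration in `dur` (microseconds). The sample events, category names, and the `slowest_categories` helper are made up for the example and are not taken from a real Bazel profile.

```python
import json
from collections import defaultdict


def slowest_categories(trace_json, top=3):
    """Sum durations of complete ("X") events per category, largest first."""
    events = json.loads(trace_json).get("traceEvents", [])
    totals = defaultdict(int)
    for event in events:
        if event.get("ph") == "X":  # skip metadata and other event types
            totals[event.get("cat", "unknown")] += event.get("dur", 0)
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:top]


# Hypothetical trace contents, invented for this example.
sample = json.dumps({
    "traceEvents": [
        {"name": "Compiling Rider.swift", "cat": "action processing",
         "ph": "X", "ts": 0, "dur": 4000000},
        {"name": "Linking Rider", "cat": "action processing",
         "ph": "X", "ts": 4000000, "dur": 1500000},
        {"name": "Loading //modules/maps", "cat": "package creation",
         "ph": "X", "ts": 0, "dur": 200000},
        {"name": "thread_name", "ph": "M"},
    ]
})

summary = slowest_categories(sample)
```

The same aggregated numbers could be pushed to whatever metrics system a team already uses.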

You can get more granular with that, too. Bazel dumps some sort of JSON file that talks about every single action it did and how long it took, etc. You can parse that data and use it however you want.

The other cool feature that Bazel has is tied to dumping that JSON log. You can also stream that directly to a web service and then the web service can do whatever it wants with that. There are some startups in this space working on dashboards for that. You can go to these dashboards and click through different tabs and see all the information that will help you to understand why CI is slow, etc.

Handling flaky tests

Darko Fabijan: You mentioned flaky tests. We’ve had hundreds of conversations with customers about how they can minimize flaky tests. There are different technologies to minimize flaky tests. How are you guys handling that?

Keith Smiley: Bazel has a few features for this that are pretty cool. For every type of test target you can set a boolean, like flaky = True. Then every time you run it, Bazel will run it multiple times and only pass if it passes some percentage of those times.

Then there are some knobs you can use to change how many attempts you want to make and stuff like that. But that’s a really nice way to just say, “Okay, this test is flaky. We know it.” Then you can file it in Jira, etc. But at least you don’t end up immediately disabling it.

You can also do that globally and say that every test should be run multiple times, to try to reduce the flakiness.
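The retry behavior can be sketched in a few lines. Bazel exposes a `flaky` attribute on test targets and a `--flaky_test_attempts` flag, but the conversation does not spell out the exact pass criterion, so this sketch assumes the simplest policy: rerun up to a fixed number of attempts and accept the test if any attempt passes. The `run_with_retries` and `scripted` helpers are hypothetical.

```python
def run_with_retries(test_fn, max_attempts=3):
    """Run test_fn up to max_attempts times (assumed policy: any pass is enough).

    Returns "PASSED" if the first attempt passes, "FLAKY" if a later attempt
    passes, and "FAILED" if every attempt fails.
    """
    for attempt in range(1, max_attempts + 1):
        if test_fn():
            return "PASSED" if attempt == 1 else "FLAKY"
    return "FAILED"


def scripted(outcomes):
    """Hypothetical test double that returns the given outcomes in order."""
    iterator = iter(outcomes)
    return lambda: next(iterator)


stable = run_with_retries(scripted([True]))
flaky = run_with_retries(scripted([False, True]))
broken = run_with_retries(scripted([False, False, False]))
```

Reporting "FLAKY" separately from "PASSED" keeps the signal visible in stats, which matters because automatic reruns can otherwise hide a test that should never have passed.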

At Lyft, we do a combination of those things and disabling flaky tests. We use our stats that we get on that to look at ones that are consistently flaky and then actually try to trace those down. So we do some sort of that filtering as opposed to filing a JIRA every single time we see it. Probably a lot of our tests are flakier than we’d like, but those approaches combine to work for us pretty well.

We’ve also had the classic case where the test is flaky in the passing way: you realize your flaky test isn’t just occasionally failing, it shouldn’t ever be passing. That’s the scariest part of having auto reruns. We’ve definitely had that case before, so we do try to stay on top of it.

But obviously, on a large team, it can be difficult to track everyone down and get folks to prioritize it and all of those more human problems. Since that is always going to be asynchronous, these other knobs help us in the meantime to make sure that no one else is affected by their test’s flakiness.

But yeah, I guess I’d say that the worst case is when someone actually finally goes to investigate it and they realize, “Ooh, my logic was wrong the whole time.” That’s always a bummer.

Darko Fabijan: Thanks for sharing all these insights! It will be super valuable for people whose teams are going to reach the size and complexity of yours. They’ll be able to use these experiences to help their teams stay fast.