Lyft Software Engineer Matt Klein on the Future of Envoy

In today’s Semaphore Uncut episode, host Darko Fabijan chats with Matt Klein, a software engineer at Lyft. Matt is the architect behind Envoy, one of the most popular open-source service proxies, which is shipping out a new mobile version soon.

What prompted Matt to create Envoy, and what problems does it solve? What’s the secret sauce that made it such a wild hit? Listen to the episode to learn the answers to these questions and more.

Highlights from this episode

Darko Fabijan (00:02): Welcome to Semaphore Uncut. Today, we have Matt Klein with us. Matt, thank you so much for joining us. Please go ahead and introduce yourself.

Matt Klein: Thank you for having me. My name is Matt Klein. I’m a software engineer at Lyft, where I have worked for the last four and a half years. I’ve spent pretty much my entire career working on low-level technology like hypervisors, operating systems, distributed systems, and networking performance.

During my time at Lyft, I’ve been working on an open-source networking technology called Envoy, which has become quite popular. These days, I spend about half of my time working on Lyft-related problems around reliability and networking and distributed systems. I spend the other 50% of my time doing industry work around Envoy, open-source management—those types of things.

What is Envoy?

Darko: Okay, great. And can you give us an explanation about Envoy?

Matt: Envoy is a software load balancer or a software network proxy. At a high level, that’s a piece of software that will take network requests, which might be Transmission Control Protocol (TCP) connections or HTTP requests. It will do various things to those connections or requests as they transit that proxy. That might be observability, load balancing, or things like timeouts and circuit breakers.

There’s a wide variety of things that a network proxy can do. For those who have heard of other software proxies such as NGINX or HAProxy, Envoy is similar.
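To make the "circuit breakers" mentioned above a little more concrete, here is a minimal sketch of the generic circuit-breaker pattern. It is not Envoy code and does not reflect how Envoy implements the feature; the class name, the parameters `max_failures` and `reset_after`, and the `flaky_upstream` function are all hypothetical names chosen for illustration.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without contacting the upstream."""


class CircuitBreaker:
    """Generic circuit breaker (illustrative only, not Envoy's implementation)."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # seconds to wait before a trial call
        self.clock = clock
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open: request rejected")
            # Cool-down elapsed: fall through and allow one trial request.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success: close the circuit and reset the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def flaky_upstream():
    # Hypothetical upstream that always fails, to exercise the breaker.
    raise ConnectionError("upstream unavailable")


breaker = CircuitBreaker(max_failures=2, reset_after=60.0)
outcomes = []
for _ in range(3):
    try:
        breaker.call(flaky_upstream)
    except ConnectionError:
        outcomes.append("upstream error")
    except CircuitOpenError:
        outcomes.append("rejected by breaker")

print(outcomes)
```

After two consecutive failures the third call is rejected locally, which is the point of the pattern: a struggling upstream stops receiving traffic it cannot serve.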

Why did you create Envoy?

Darko: When you initially kicked off the project, was there a concrete pain point that you wanted to solve? Because Envoy’s solving a whole class of problems.

Matt: The primary challenge we were trying to solve with Envoy was around observability. Four and a half years ago, Lyft was entirely based out of Amazon Web Services. We were using Elastic Load Balancers (ELBs) for load balancing, and at that time ELBs did not support percentile latency metrics.

At the time, ELB and CloudWatch didn’t support the ability to observe P50 latency or P99 latency, which is something that most developers today would really take for granted from a network observability system.

In general, the ELB was a black box. It was very difficult to understand what was going on. Envoy’s initial use case was as an edge proxy, and it was to give richer logs, stats and metrics so developers could understand what was actually going on.

And that was based on my experience at Twitter. There, I had worked on a proprietary edge proxy system that was providing similar types of features, like advanced observability. Most of the solutions that had existed at the time did not support these types of features.

You are correct. Today, Envoy supports an absolutely incredible number of features, but at that time observability was mostly the focus.
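As a rough illustration of why the P50 and P99 latency metrics mentioned above matter, the sketch below computes them with the nearest-rank method over a made-up list of request latencies. The sample values and the `percentile` helper are assumptions for illustration; real metrics systems typically use histograms or streaming estimators rather than sorting raw samples.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))  # 1-based rank
    return ordered[rank - 1]


# Hypothetical request latencies in milliseconds, with one slow outlier.
latencies_ms = [12, 15, 11, 14, 13, 16, 12, 15, 14, 950]

p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
mean = sum(latencies_ms) / len(latencies_ms)

print(f"P50={p50} ms, P99={p99} ms, mean={mean} ms")
```

With these samples the median is 14 ms while the P99 is 950 ms, and the mean of 107.2 ms describes neither the typical request nor the slow one. That gap is what averages hide and percentile metrics expose.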

From Monolith to Microservices

Darko: (05:21) Yeah, as we were migrating from our monoliths to microservices architecture, observability was something that struck us out of nowhere. Previously, we were able to learn a lot about our monolith. However, when moving to microservices, the network layer is suddenly everywhere.

Matt: It’s a super interesting thing. I think what you’re seeing right now is there has been a lot of buzz around service mesh. And then there’s a lot of backlash around service mesh complexity. Honestly, if an organization is committed to moving to a microservices architecture, they have to solve a set of problems. They’re complicated, and they mostly revolve around networking and observability. Potentially, you can solve them by writing a bunch of code, putting it into a library, and writing it into every service. Or, you can try looking at the sidecar pattern.

Darko: Yeah, and maybe in a monolith, you have to solve it once in terms of simple things like timeouts, retries, and all of that. So, it can be a single library with a single code base. And with microservices, you need to solve that in various languages and places and update that library in various places.
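The kind of library code being described here might look like the following sketch: a retry helper with exponential backoff and jitter. The function name, its parameters, and the `sometimes_fails` stand-in are hypothetical. The point is only that logic like this has to be written, tested, and kept in sync once per language when it lives in a library rather than in a proxy.

```python
import random
import time


def call_with_retries(fn, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call fn(), retrying on any exception up to `attempts` total tries.

    Waits a random delay between 0 and base_delay * 2**attempt seconds
    before each retry (exponential backoff with full jitter).
    """
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(random.uniform(0, base_delay * 2 ** attempt))


# Hypothetical flaky call: fails twice with a timeout, then succeeds.
calls = {"count": 0}


def sometimes_fails():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("simulated timeout")
    return "ok"


result = call_with_retries(sometimes_fails, attempts=3, base_delay=0.01)
print(result, calls["count"])
```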

Matt: Yeah, and if you look historically at companies that have spearheaded some of the microservice architecture work, it would probably include Amazon, Twitter, Netflix, and other companies that grew during a time in which they basically used Java.

But most modern companies, for better or worse, are what I call polyglot. They have multiple languages in flight, such as Java, Go, Rust, and C++, and then they have to choose between reimplementing and trying to normalize all of these things in every language, or they can try to use some kind of client-side load-balancing proxy.

But at the end of the day, the problems have to be solved. There’s no easy way out. One option is to stay with the monolith, and I tell everyone to stay with the monolith as long as possible.

When the monolith doesn’t work (typically for human scalability reasons) it’s time to go with a microservices architecture, which can present its own set of problems.

Envoy Mobile and the Mobile-First Architecture

Darko: (08:55) Based on what I’ve seen on your Twitter, it appears that you’ve just shipped something called Envoy Mobile?

Matt: For those who don’t know, Envoy Mobile is the idea that we’re going to take the Envoy proxy that we run server-side and actually run it on phones.

And for those who might think this is absolutely insane, there’s quite a bit of precedent set by large companies.

Facebook has done this for years. They actually use that same code on their mobile clients, primarily to give consistency. And Google has done this for a long time also. They have a library called Cronet, which essentially bundles the Chrome networking code across Android and iOS. These are two major companies that have really valued mobile-first architectures.

They’re doing this for the same reason that you might investigate a sidecar service mesh pattern. This code is incredibly complicated, so why would you want to reimplement it twice when 99% of the code is the same?

It doesn’t make a lot of sense to reimplement it multiple times when you could instead pay a few people to be experts in that code, who can then work across server and mobile. We felt that there was a real opportunity here to have a true end-to-end solution.

Get Your EnvoyCon Tickets Before They Sell Out

Darko: (14:06) EnvoyCon is happening soon, isn’t it? I haven’t written down the exact date, so maybe you can share that with us.

Matt: Yeah, it is the day before KubeCon—November 18th. We published the schedule, and I think it’s going to be a fantastic conference. For anyone that is coming to KubeCon and wants to learn more about Envoy, we did our first conference last year and it was fantastic.

I’m very excited about this year’s conference. I’m pretty confident it’s going to sell out, so I would grab your ticket as soon as possible.

What’s Next on Envoy’s Roadmap?

Darko: (25:50) In this mature state of an open-source project, how do you decide which features should be merged into the core next, and how are you managing that?

Matt: So far, we’ve been able to stay fairly nimble. And for an infrastructure software project like Envoy, we are very odd these days in the sense that we’re a true community-driven project. People show up and build features, and that’s how they get merged.

People are always asking me, “What is the Envoy roadmap? What did people do in the last year?” And, honestly, I tell them I don’t know. We don’t have a community manager, and we don’t have a product manager, so we move forward.

In the history of the project so far, we have not had a major disagreement among the maintainers. I’m sure it’ll happen eventually. But I think it hasn’t happened yet because we’ve been very careful about who becomes a maintainer, making sure that we see eye-to-eye on philosophy.

The other reason why we haven’t had a lot of disagreements is what I talked about before: the extensibility. It’s a way of allowing people to push their site-specific needs into extensions so we can focus on the core.

Darko: Thank you for joining us, Matt. Good luck with EnvoyCon, and good luck with the project.

Matt: Great, thank you so much for having me.