Charity Majors on what is observability & how to measure the quality of microservices

Honeycomb’s CTO and coauthor of Database Reliability Engineering Charity Majors joins me on this episode of Semaphore Uncut to share her insights on observability and going beyond logs and dashboards to better understand the systems we build.

It’s hard to trace down problems in modern distributed systems. For events that we’re able to foresee, we have to implement logging, metrics and performance monitoring. The data ends up scattered across several services, which doesn’t help when you get a call that your service is down. For unforeseen events, it’s even worse, as we often have no data to reason with.

Is it possible that the system provides us with enough information to diagnose unknown unknowns from a single origin? What’s the future of measuring the quality of microservices in production? Listen to this episode or watch it on YouTube below.

Also, connect with me and Charity on Twitter @darkofabijan @mipsytipsy @semaphoreci @honeycombio.

Watch this Episode on Youtube

Edited Transcript

Darko: (00:16) Hello everyone. Welcome to Semaphore Uncut, a show where we talk about engineering topics, products and people behind those products. My name is Darko Fabijan and I’m your host today. I’m a co-founder of Semaphore. Today with us we have Charity Majors who is joining us live. So hello Charity, nice to have you on the show.

Charity: (00:35) Hi. Thanks for having me.

Darko: (00:38) Yeah. Please go ahead and introduce yourself.

Charity: (00:40) I am the co-founder of Honeycomb, currently CTO. I was CEO for three long years until recently. We are a company building an observability product that helps you understand what’s actually happening in these crazy complex systems that we keep inflicting on the universe, without having to ship new code to handle things you didn’t know about in advance.

The origins of Honeycomb: an undebuggable system at Facebook

Darko: (01:09) Okay. And maybe before diving deep in the technical topics, can you tell us a bit more about Honeycomb and the product?

Charity: (01:12) Yeah. My co-founder Christine and I were both early engineers at Parse, the mobile backend as a service. Love Parse, rest in peace. We were acquired by Facebook in 2013. Around the time we got acquired by Facebook, I was coming to the horrified realization that we had built a system that was basically undebuggable, by some of the best engineers in the world doing all of the right things. Yet every day people were coming to us, “Parse is down.” We’d be like, “Parse is not down. Like behold my wall full of dashboards. They’re all great. Everything’s cool, right?” Because baby, we’re doing a hundred thousand requests per second. A single mobile app’s traffic isn’t huge. Maybe they’re doing like 50 requests per second, or four. They’d never even show up in my time series graphs. So I’d have to dispatch an engineer or go debug myself exactly what had gone wrong, or whether it was their fault or our fault, or a combination of the two.

Charity: (02:03) It would take a day sometimes or more to figure out what was actually going wrong in each case. Our productivity ground to a halt. We stopped shipping, and we were just trying to understand our product. I tried everything out there. The problem with logs is that you have to know what to search for beforehand, and if it’s a new problem you don’t know what to search for. The problem with metrics is they aggregate at write time, and you can’t break down by high cardinality dimensions like, say, user ID. So it was a very manual and awful process. The first thing that helped us start to dig our way out of this pit was this tool at Facebook called Scuba, which is not a pretty tool. I would go so far as to say it’s actively hostile to users.

Charity: (02:49) But Scuba did one thing really well, which was it let you slice and dice in basically real time, on dimensions of arbitrarily high cardinality. Cardinality meaning the number of unique elements in a set. So like the highest possible cardinality will always be a unique ID. First name, last name, high cardinality. Gender is low cardinality and species is very low I assume. Other solutions didn’t support that. We started getting our data into Scuba, and our time to understand these complex scenarios just dropped like a rock, from days to seconds, maybe a minute. But it was like a support problem, not even an engineering problem. That made a huge impact on me. To the point that, when I was leaving Facebook, planning to go be an engineering manager at Stripe or Slack, I suddenly realized: “Oh shit. I don’t even know how to engineer anymore without this stuff we’ve written.”
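To make the cardinality idea concrete, here is a small editorial sketch (not from the conversation). The events and field names are made up for illustration; the only point is that a unique ID has one distinct value per record, while a field like platform has only a handful.

```python
# Hypothetical request events; the field names are illustrative only.
events = [
    {"user_id": "u1", "platform": "ios", "status": 200},
    {"user_id": "u2", "platform": "android", "status": 200},
    {"user_id": "u3", "platform": "ios", "status": 500},
    {"user_id": "u4", "platform": "ios", "status": 200},
]


def cardinality(events, field):
    """Number of distinct values observed for `field` across the events."""
    return len({e[field] for e in events if field in e})


for field in ("user_id", "platform", "status"):
    print(field, cardinality(events, field))
```

In this toy data set `user_id` has as many distinct values as there are events, which is exactly the kind of dimension that is hard to break down by once data has been rolled up.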

Charity: (03:31) We built around Scuba because it’s not just about incident response, it’s like my five senses. It’s how I decide what to build: by instrumenting something, looking at the impact, what it’s going to affect, and then I write it. I’m in this constant conversation with my code, just like: is it doing what I thought it would do? Is it behaving as I expected? Does anything else look weird? The idea of going back to metrics and logs was unthinkable, like using ed instead of an actual editor. But at the time we thought that this was a platform problem. Christine and I worked on this for a year, and we really thought that this was a platform problem, because platforms have this characteristic where it’s one of many thousands of apps to me. But to you, the customer who wrote a big check expecting our solution to solve your needs, it’s your world. It’s everything.

Charity: (04:56) It’s like some of this is self-induced, but whether it’s containers or schedulers, or polyglot persistence, or the proliferation of mobile devices, all of these things are high cardinality problems and everybody needs a different solution. According to the control theory definition, observability is just the ability to understand what’s going on in the inner workings of a system by observing it from the outside. Not by knowing in advance and writing custom code to handle it. Not by any of these things that work for known unknowns. It’s really about being able to ask any question of your systems without having to ship custom code to handle that.

Charity: (05:43) This was a mind-blowing thing to me. Because it really spoke to the shift from known unknowns (which we’d had in the days of the LAMP stack), to unknown unknowns (which we have with distributed systems of today). It’s like, the problems we have to deal with are like this infinitely long thin tail of things that almost never happen, except one time they do. And it’s not a good use of our time and effort to invest in a dashboard for it that will help us find that problem immediately the next time, or monitor and check for it. We’re handling all these things as one-offs like there’s some end in sight, and there’s just not. So that was the original insight that led to Honeycomb and also to us taking a pretty aggressive stance that this was something different. And that observability is something that the industry needs to know and respect as a technical term, not just as a generic synonym for telemetry.

Does your job end when you push to master?

Charity: (06:35) I think of observability as a technical term because you could look at a tool and just say, “Does this give me observability or does it not?” And if it does pre-aggregation, it doesn’t give you observability because you’ve gathered your data in a way that prohibits you from asking a new question. Same with indexes. You need to be able to do read-time aggregation of the raw data in order to have that flexibility. So anybody who’s not offering that is not doing observability. So I think that the reason it’s taken off is that so many people have seen themselves, and their problems, reflected in this distinction.
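The difference between aggregating at write time and at read time can be sketched in a few lines. This is an editorial illustration with made-up events and field names, not how any particular product stores data: once only per-endpoint counters are kept, the per-user question can no longer be answered, whereas the raw events can be grouped by any field after the fact.

```python
from collections import Counter, defaultdict

# Hypothetical raw events, one per request.
raw_events = [
    {"endpoint": "/login", "user_id": "u1", "duration_ms": 40},
    {"endpoint": "/login", "user_id": "u2", "duration_ms": 45},
    {"endpoint": "/login", "user_id": "u3", "duration_ms": 2900},
    {"endpoint": "/cart", "user_id": "u3", "duration_ms": 3100},
]

# Write-time aggregation: only per-endpoint totals survive.
# The user_id dimension is gone, so "which user is slow?" cannot be asked later.
request_count = Counter()
total_ms = Counter()
for e in raw_events:
    request_count[e["endpoint"]] += 1
    total_ms[e["endpoint"]] += e["duration_ms"]


# Read-time aggregation: keep the raw events and group by any field on demand.
def group_avg(events, key):
    """Average duration_ms per distinct value of `key`."""
    buckets = defaultdict(list)
    for e in events:
        buckets[e[key]].append(e["duration_ms"])
    return {k: sum(v) / len(v) for k, v in buckets.items()}


print(group_avg(raw_events, "endpoint"))  # the question the counters could answer
print(group_avg(raw_events, "user_id"))   # a new question, asked after the fact
```

Grouping by `user_id` shows that one user accounts for all of the slow requests, which the per-endpoint averages hide.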

Darko: (07:05) Yeah. It’s an interesting journey and definitely in the area of scratching your own itch.

Charity: (07:10) Oh God, yes.

Darko: (07:11) That’s when you get really motivated.

Charity: (07:16) Yeah. The thing is that, this comes at the right time I think. It’s just in the past three years I feel like we really arrived at a consensus that software engineers need to be on call for their own systems. This was not an accepted answer three years ago, but we’ve learned as an industry that this is the way to build systems and support them in a way that scales, in a way that is not miserable for the humans who have to tend them. And so the person who has the original intent for what they’re trying to build, in their head, goes and watches it all the way out to where your code is interacting in real time with users. You’re the only person who knows really what you’re expecting to see.

Charity: (07:58) You have to take it all that way. You can’t just lob it over the wall. You can’t just say, “My job is done when I’ve merged to master.” The ops team doesn’t have your original intent. You don’t have necessarily their skill sets. So I feel like this is kind of the second coming of DevOps in a way. The first wave of DevOps was all about, “Ops people must learn to write code.” Like, “Yeah, absolutely. Message received.” And we do now. But the second wave of DevOps is very much about, “Okay software engineers, it’s your turn. It’s time to learn to write operable services, and it’s time to learn to run them.” I’m not saying that all roles are going to dissolve and go away, but ops is increasingly almost an internal consulting area of expertise, where we’re here to help you as software engineers run your own services using our expertise, not to do it for you.

Charity: (08:52) Because that is the direction in which lies misery and pages, waking up every night. A lot of people are really afraid that being on call means that, that’s what I’m asking them to do. I want to be clear that it’s not. I’m over 30, I don’t want to get woken up in the middle of the night either. But the thing is that we can make it so that no one has to get woken up, if that person with original intent is babysitting all the way to the end. If we just raise our standards for what we accept in terms of the abuse that we’re willing to sign up for as engineers.

Positioning observability between logs, metrics and APM

Darko: (09:19) Great. To explain to myself some of the things that you’ve shared and hopefully for some of our viewers and listeners too. So the problem that we have with metrics and logs is that we must decide to implement them. We must benchmark certain parts of our code and decide, “Okay, this was not benchmarked, let’s introduce these metrics. Then let’s put it on some dashboard somewhere and wait for enough data to arrive.” With logs, it’s a very similar process. “We have a bug, what’s the best thing to do when something is really complicated? Let’s add a couple of lines of logs, and wait for the next situation.” We would want to get away from that problem. Because as you said, we cannot figure out in advance.

Charity: (10:05) It’s fundamentally reactive.

Darko: (10:06) Exactly.

Charity: (10:06) You’re always reacting to something.

Darko: (10:09) Okay. To solve this challenge in practical terms, there are some frameworks and tools like Istio. That’s maybe the only one that I know. Apart from maybe tools like New Relic, where you add some library into your application and they gather everything over time. So maybe just in that area where New Relic still is, what are some other options that you have?

Charity: (10:39) New Relic is an APM, Application Performance Monitoring. And Istio is a service mesh. And then there’s tracing. Tracing is incredibly important if you’re using microservices because ordering is so important. So there’s two things here. First of all, I see observability as sitting like right smack in the middle of monitoring and metrics, logs, and APM. Honestly, I believe in the next couple of years you’re going to see all three of those categories go away, because they were all premature optimizations. Hardware was very expensive, so they had to optimize for something up front, when the data was being written.

Charity: (11:11) What you want is to not have to write that data out to three different places, because then you, as a human, are sitting there in the middle, copy/pasting IDs from tool to tool, trying to track down a single problem. That’s just nuts. It’s expensive. It’s unwieldy. It relies on humans. You want there to be one source of truth, and you want to be able to go from a very high level, like the dashboards monitoring has, to a very low level of, like, the logs, without jumping between tools.

Charity: (11:35) So I think that observability is ultimately going to make all of those categories disappear, or become one. APM, you’re absolutely right that tools of the future will have to come from your code. You’re going to need to install a library or something, and you’re going to need to do some amount of manual effort, not zero. Because magic is never going to give you insights into your code. You know your code. I don’t know your code. I can do a lot of guessing, and that’s going to get you a long way. It’s going to give you your great top ten graphs, which is what New Relic gets you, right? Those beautiful top ten graphs. But then you hit a wall. You’re like, okay, cool. I care about this graph, but for this user, you can’t do it. Right?

A new way of capturing runtime data in the age of microservices

Charity: (12:11) So the Honeycomb way, and I think that this is becoming the industry standard way, which I’m stoked about, is when the request enters a service, we initialize an empty, arbitrarily wide row of structured data. And then we pre-populate it with everything that we know about that request, or can infer from the environment, from the language internals, in the request parameters that were handed in, everything that we know.

Charity: (12:34) Then, throughout the life of that request in that service, you, as the developer, can basically do a printf() of anything that you know is going to be interesting: shopping cart IDs, user IDs, anything that you’re like, “This is going to be useful to me for debugging in the future,” you just stash it into that blob. And then, at the end, when it’s ready to exit or error, it ships off to Honeycomb as one, single, very wide, usually hundreds of dimensions, structured data blob. And then if you have, like, 12 microservices, you’re going to have one of those blobs for the edge, one per service, and maybe one for each database call.
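The pattern described here (start an empty wide record when the request enters, pre-populate it, stash fields along the way, ship one blob on exit or error) can be sketched roughly as follows. This is an editorial illustration, not Honeycomb’s library: `WideEvent`, `send`, `handle_checkout`, and every field name are hypothetical, and `send` just appends to an in-memory list where a real service would ship the record to a backend.

```python
import json
import time
import uuid

sent = []  # stand-in for the observability backend (assumption for this sketch)


def send(fields):
    """Hypothetical shipper: serialise the wide record and 'send' it."""
    sent.append(json.dumps(fields, sort_keys=True))


class WideEvent:
    """One wide, structured record per request per service (hypothetical helper)."""

    def __init__(self, service, request):
        self.start = time.monotonic()
        # Pre-populate with everything known when the request enters the service.
        self.fields = {
            "service": service,
            "request_id": request.get("request_id") or uuid.uuid4().hex,
            "method": request.get("method"),
            "path": request.get("path"),
        }

    def add(self, **fields):
        """The 'printf' step: stash anything that might help debugging later."""
        self.fields.update(fields)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # On exit or error, ship exactly one record for this request.
        elapsed = time.monotonic() - self.start
        self.fields["duration_ms"] = round(elapsed * 1000, 3)
        self.fields["error"] = repr(exc) if exc else None
        send(self.fields)
        return False  # never swallow the exception


def handle_checkout(request):
    with WideEvent("checkout", request) as ev:
        ev.add(user_id=request["user_id"], cart_id=request["cart_id"])
        ev.add(cart_items=3)
        return "ok"


handle_checkout({"method": "POST", "path": "/checkout",
                 "user_id": "u42", "cart_id": "c7"})
print(sent[0])
```

Using a context manager means the record is shipped whether the handler returns normally or raises, which matches the “exit or error” behaviour described above.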

Charity: (13:07) That gives you a really powerful amount of context. So when you’re debugging these systems, it turns out that the hardest part is almost never debugging the code. It’s figuring out which part of the system the code that you need to debug lives in. And if you have this rich context for the entire path of your request, it allows you to zero in and pinpoint that just immediately. Say, like, which five things have to go wrong in order for this bug or error to happen, right? You’ve got all the data packaged in the right way for you to get that really rapid wisdom out of it.

Charity: (13:37) And, it turns out, that since tracing is so important, well, traces are just events with some ordering, right? So you basically can get that for free. If you’re using the Honeycomb library, you get all of the span IDs and everything emitted, so you just switch visualizations. You’re slicing and dicing, trying to isolate an error. Oh, I found it! Cool. Let me trace it. Oh, there’s a problem in the trace. Okay, now let me zoom out and see who else is impacted by this.
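The claim that a trace is just events plus ordering can be shown with a short editorial sketch. The field names `trace_id`, `span_id`, and `parent_id` are common tracing conventions assumed here for illustration, not a documented schema from the conversation; given those three fields, the same events that were sliced and diced can be re-rendered as a tree.

```python
from collections import defaultdict

# Hypothetical events from one request crossing several services.
events = [
    {"trace_id": "t1", "span_id": "a", "parent_id": None,
     "name": "edge", "start_ms": 0, "duration_ms": 120},
    {"trace_id": "t1", "span_id": "c", "parent_id": "a",
     "name": "cart", "start_ms": 30, "duration_ms": 85},
    {"trace_id": "t1", "span_id": "b", "parent_id": "a",
     "name": "auth", "start_ms": 5, "duration_ms": 20},
    {"trace_id": "t1", "span_id": "d", "parent_id": "c",
     "name": "db.query", "start_ms": 35, "duration_ms": 70},
]


def render_trace(events, trace_id):
    """Arrange the events of one trace into an indented tree, ordered by start time."""
    spans = [e for e in events if e["trace_id"] == trace_id]
    children = defaultdict(list)
    for s in spans:
        children[s["parent_id"]].append(s)

    lines = []

    def walk(parent_id, depth):
        for s in sorted(children[parent_id], key=lambda s: s["start_ms"]):
            lines.append(f"{'  ' * depth}{s['name']} ({s['duration_ms']} ms)")
            walk(s["span_id"], depth + 1)

    walk(None, 0)  # root spans have no parent
    return lines


print("\n".join(render_trace(events, "t1")))
```

Nothing extra had to be collected for the trace view: the parent links and start times carried on each event are enough to switch from an aggregate query to a waterfall.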

Instead of opening five tabs

Charity: (14:01) So you’ve gotten away from that thing where you’re storing it in four different places, and the human is hopping between tools. When there’s just one tool, it just gives you observability, and traces are included. But it really does start with that library that you build into your code that gives you the insights from the inside out. You’ve got the software explaining itself back out to you, the developer.

Charity: (14:20) And then, once you’ve found where in the system the problem is, then you can go debugging, like GDB. Stepping through functions is out of scope for this kind of thing. Way out of scope. But it tells you where the problem is happening, and you have all of the context of the request at that point, so you can feed that into your local debugger and find the actual problem.

Darko: (14:39) Okay. So when you said the request is coming in, for example “Give me the sign in page”, you have something which is on the level of a process running in whatever programming language, and those two talk together?

Charity: (15:02) Yeah. It’s just a library in your code, right? We provide all the helpers. And other people have done this, not using Honeycomb. They’ve implemented the same thing, where they initialize an empty data blob at the beginning. They pre-populate it. Then they stuff things in through the life of the request in the service, and then they fire it off. This is just the best way to find where in your system problems are, like, full stop, that we have discovered, as an industry.

Darko: (15:22) Okay. Yeah. Sounds very powerful. I mean, what you said, I can totally relate to that. There are five tools. There is a PagerDuty call coming in, you open five tabs…

Charity: (15:34) And you have to pay to store it so many times! It’s not a good use of money, either. I believe that observability should be a dollar well spent. I think it should generally be, like, 10 to 30 percent of your infra costs. You should spend that much on observability. But not on every single tool! Like, total, right? So you really want something that can bundle up as many functions as possible. Right now, you’ve got all of these people who are charging you like they’re your only tool. But, in fact, you need all these different tools. It’s kind of painful. But I believe that the industry is headed in the right direction.

Every developer is now a service reliability engineer

Darko: (16:07) I can share a war story. In the first version of Semaphore, there was a single Rails application. At the end, it was close to a hundred thousand lines of code, using lots of memory, and all that. And when we were creating the second version, we used Elixir as our main language, and we have, like, 20 services running. We were getting close to launching, and we used Kubernetes in production for the first time. We delayed our launch by maybe a month and a half, at least, until we installed and learned to use Istio in our Kubernetes cluster.

Charity: (16:45) Yeah, yeah. Yep.

Darko: (16:46) It’s probably possible to use Kubernetes without Istio, but I would rather not.

Charity: (16:52) Yeah. Agree.

Darko: (16:54) There’s another thing that you mentioned that I wanted to ask. For a monolithic application, the line is relatively sharp between when it’s working and when it’s not. For instance, it’s not booting at all or the queue is full.

Charity: (17:12) Yes, yes.

Darko: (17:13) Who’s going to tackle that incident? And you have that huge code base with all the features, any of those can make a problem.

Charity: (17:22) Right.

Darko: (17:22) In our case it was clear who was on call, there was a group of people, and then there were other groups of people who were just not on call. Another thing that was not surprising: a developer is developing a new service, and in the end it’s just an operating system process. And when it’s time to ship it, engineers pretty much have no clue. Does it require four vCPUs, or eight? Does our application need 16 GB of RAM, or 4? Like, no clue.

Charity: (17:53) Yes! Yeah.

Darko: (17:56) Now with Kubernetes and containers, you pretty much have to reserve your capacity.

Charity: (18:03) Yes, you do.

Darko: (18:04) At least in my view of the world right now, that’s the main influence, that every developer is now a system reliability engineer.

Charity: (18:15) The abstractions have gotten very leaky, right? Now you have to care about those things. You have to think about them, or you’re just going to be screwed. Absolutely agree.

Charity: (18:23) I think that part of the reason it’s taken us this long to agree that software engineers should be on call is because, in the past, we’ve asked them to be on call, and then we’ve given them ops tools to debug their code with. Ops tools speak the language of free memory and uptime. Translating that to the world of variables and endpoints takes work, it’s a different language. And you were basically asking them to do two jobs, right? Do your job! Also, learn this other job and do it at the same time.

Charity: (18:53) Some exceptional engineers did it, and do it well. Most engineers, and I don’t blame them, were just like, “Hell, no.” Right? Which is why Honeycomb is very much designed to speak to engineers in the language of variables, endpoints, the things that they spend every day thinking about. But it’s definitely true that, like I was saying, doing DevOps is kind of like saying to software engineers, ops is now part of your job. And I would argue that that’s a good thing, because ops has always been the engineering org most aligned with user happiness. It can be very easy for software engineers to construct an ivory tower, where they don’t feel the pain or the consequences of what they’ve shipped. That tower’s being torn down. It’s going away. And I think this is, overall, a very good thing. But there’s definitely some pain in the meantime.

Charity: (20:05) You mentioned developers on call, and the lines between roles. This is a very hard problem. If you’ve got a monolith, and you’ve got 20 developers, you can’t have a rotation with 20 people on call. That rotation is so long everyone’s on call like twice a year. They’re going to forget everything in between those times, right?

Charity: (20:21) There is a case study that I found last night, of how they took a monolith and three teams of software engineers with three SREs supporting them, and they divided up the types of alerts, and they’re like, okay, I own these. You own these. And it being a monolith, there was kind of no way to protect each other. You’re all going to get the top level alerts, the app isn’t performing well. But I’m going to take ElasticSearch ones. You’re going to take the MongoDB ones. And that kind of works, because you’ve got three people on call at any given time.

Defining on-call duty for microservices

Charity: (20:56) But most people are starting to look at the shift to microservices now. Microservices give you more tools for doing on-call differently. If you’ve done it correctly, you’ll have this service in front of every data store. Data stores are the number one cause of infection seeping throughout the layers, because if a data store goes down, everything starts queuing up, waiting for that data store, everybody’s getting paged, right? Which is why you have to take that and put it in a service that’s a level up, and make it so the only people who are responsible for that data store get paged, and you can start to separate out who’s responsible for the app, who’s responsible for the data store. And if you’ve done it Uber-style, and you have a shit ton of tiny, little services, you can start to group up. Like, okay, this team is going to own these four or five services.

Charity: (21:39) I feel like an upper limit of one to two services per team member is the absolute max. And, really, that’s talking, like, two or three in active development, and the rest have to be pretty stable if you’re going to go beyond that. It’s definitely possible to take the philosophy a little too far. But I think, all in all, it’s the right direction for us to be taking steps in, and learning how to isolate these services cleanly from each other so that we can craft on-call policies that only impact the people responsible. Because the key to designing an on-call rotation that is effective and doesn’t burn people out is making sure that every single alert that you get is actionable, that you can fix it and make it so that it never happens again. Right? Because every time you get paged, you should be going, “Huh. This is new. I don’t understand this.” It’s the death of on-call if you’re like, “Oh, that again. Oh, that again.” That will kill your team. You have to pay that down. If it’s, “Oh, that again, and I can’t fix it because it’s somebody else’s problem,” that is 10 times worse. That will burn people out like nothing.

The transition to shared responsibility

Darko: (22:42) I agree completely. Do you have any predictions how this will play out? There are so many developers in the world that haven’t been on call. I’ve developed these features, shipped it. Not something that I’m going to worry about.

Charity: (22:56) It’s a cycle. I feel like there’s an understandable period where people are just repelled by the idea because it’s so bad for ops teams. We’re working on climbing out of that pit and talking about it and telling people, “No, a better world is possible.” And it is possible. I’ve seen teams who never get into that pit. My teams, we consider it a crisis if someone gets woken up. We post-mortem it. We make it so it doesn’t happen again. We respect their time and their sleep.

Charity: (23:23) I’ve also seen teams who are way deep in the pits of terribleness and they’ve clawed their way out, and it’s been better. Because the amazing thing is that once you get out of that hole, you have so many more cycles to think about what’s best for your users. You can spend your time more efficiently. Like firefighting is just like lost time of your life.

Charity: (23:43) I feel like there are three legs to this stool. There are ops teams. We have to stop being gatekeepers. We have to stop blocking people. We have to stop building a glass castle. We have to start building a playground, which is why I say test in prod. We have to get used to being up to our elbows. Every engineer who is shipping to prod should be looking at prod every single day so they know what feels normal and they know what wrong feels like and they know how to debug it and they know how to get to a known good state. That’s the bar of operations that every developer who’s shipping to prod should have. Everyone should know how to debug, how to get to a known good state, how to deploy.

Charity: (24:17) Ops people need to stop being gatekeepers and we need to start inviting people in. We need to start sharing our knowledge and educating and stop seeing ourselves as the people who do things and start seeing ourselves as the people who empower people to do things.

Charity: (24:37) Software engineers need to be willing to be hurt again. Take a risk on love, right? I know you’ve been hurt before, but I swear to you, you’ll get hooked on it. The dopamine hit of, “Oh, I found it. I fixed it. I made it better for that user,” and you’re seeing the impact of your work, that is addictive.

Charity: (24:55) What I’ve seen is that once people have experienced that level of control and power and empathy with their users, they find it very hard to go back. They don’t want to go back to a place where they’re insulated from it ever again, because it’s so much more visceral and real and they can see the impact of what they’re doing. That’s very motivating to every engineer that I know of. Some are so scarred, like, “I’ve been woken up so many times, never again. Just please.” You have to be willing to try again. It’s up to you, too. We need you and the original intent in your head to help us dig ourselves out of this pit.

Charity: (25:25) The third part is management. There’s no on-call situation that will ever work if management is not carving out enough project development time, like continuous development time, for things to actually get fixed. I know that interferes with product shipping cycles in the short run. You just have to get aggressive about it. You have to shield your team. You have to carve out that time. Let them dig themselves out of that hole so that you’ll have so many more cycles freed up to spend on product and so many fewer cycles going down the toilet to debugging production problems.

Charity: (25:57) This is the job for line managers. It is not reasonable for you to expect your team to be on call if you are not carving out the time for them to fix their shit. Then you’re just asking them to go to the salt mines every day. If any engineers are working under those conditions, I would encourage them to quit their jobs and to go to somewhere where they do have air cover. It takes all three, but it’s doable and it’s a better world.

Darko: (26:18) You presented this very nicely, and I agree with you that it will bring a better future.

Service Level Objectives: agreed metrics of quality

Darko: (26:51) As the last thing I wanted to ask, what are some of the features that you are planning to ship in Honeycomb that you are most excited about?

Charity: (26:58) Oh, boy. Yes. Two things. We’re shipping as a beta now a tool for SLOs. We’re always talking about being able to go from the big picture, like what’s happening at a high level, down into the weeds to see individual requests. A lot of people get stuck on the question of how much is too much? Or like management is very concerned about this one user, and engineers are like, “It doesn’t matter.” You need to have some common language there where you all agree this is what matters, and below this line engineers are responsible for delivering and above this line managers are responsible for making sure that it’s the right line. That is what SLOs are.

Charity: (27:31) SLOs are service level objectives: a few service level indicators where you all agree this is the quality of service that you agree to provide for your users. As long as you are hitting that line, anything that you do in engineering is fine. Solve it as you want. This is how we create that crisp level of abstraction so that everyone gets what they need and nobody feels micromanaged and nobody feels completely abandoned. You agree on this number and then engineering can go and build it however they need to.

Charity: (27:57) SLOs, that sounds deceptively simple. It is not. It quickly devolves into arguments about what does good actually mean? Over what window of time? So we’re building this into Honeycomb so that you can just pick. Any query that you run you can say, “This is an SLI. This is something I care about.” And then out of your SLIs, we will compute this is your SLO. So you can see if you’re hitting it or not or if you’re on track to run out of your budget before your time is up.
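The arithmetic behind “are we hitting it, and are we on track to run out of budget” can be sketched as follows. This is an editorial illustration, not Honeycomb’s SLO feature: the SLI predicate `is_good`, the 99.5% target, the 30-day window, and the event fields are all assumptions. The burn rate is the observed error rate divided by the error rate the target allows; above 1, the budget runs out before the window ends if traffic stays steady.

```python
def is_good(event):
    # Hypothetical SLI: the request succeeded and finished within one second.
    return event["status"] < 500 and event["duration_ms"] < 1000


def slo_report(events, target, window_days=30):
    """Summarise SLO compliance and error-budget burn for a set of events."""
    total = len(events)
    if total == 0:
        raise ValueError("no events to evaluate")
    good = sum(1 for e in events if is_good(e))
    error_rate = (total - good) / total
    allowed_error_rate = 1 - target          # the error budget, as a rate
    burn_rate = error_rate / allowed_error_rate
    return {
        "total": total,
        "good": good,
        "compliance": good / total,
        "burn_rate": burn_rate,
        # At a steady burn rate the whole budget lasts window_days / burn_rate.
        "days_until_budget_exhausted": (
            window_days / burn_rate if burn_rate > 0 else float("inf")
        ),
        "on_track": burn_rate <= 1,
    }


# Made-up traffic: mostly fine, a few errors, a few slow requests.
events = (
    [{"status": 200, "duration_ms": 80}] * 9970
    + [{"status": 500, "duration_ms": 30}] * 20
    + [{"status": 200, "duration_ms": 4000}] * 10
)
report = slo_report(events, target=0.995)
print(report)
```

With these numbers 30 of 10,000 requests miss the SLI against an allowance of 50, so the budget is being spent slower than it accrues. Most of the real argument is in choosing `is_good`, the target, and the window, which is the “what does good actually mean” debate mentioned above.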

Charity: (28:35) This is the most powerful tool that you could have in your arsenal for allocating time correctly. When I was saying that managers have to carve out the time for their engineers to fix things, how much? How good is good enough? Engineers are always going to want to spend time refactoring, making things better, more elegant. How much is enough? That’s where SLOs come in.

Charity: (28:53) SLOs are the number that you have agreed upon. So if the quality of service has been brutally bumpy for the past 30 days and you’re running out of budget, that’s what your manager uses as the hammer to be like, “Okay. Sorry, product development. We’re about to stop. My team needs time to fix things.” When things are going pretty well and engineers are agitating because they really want to do this thing that isn’t directly tied to product development, that’s when managers can go, “You’re going to have to wait for that, team, because actually we’re meeting our objectives and it’s time for us to make progress on the roadmap.” This is the only way that I know of to make this relationship not fraught and painful. You agree on the number. You ship it. You build to it. You’re done.

Charity: (29:32) Nobody in the industry has actually done this well yet. Liz Fong-Jones just joined our team a month or two ago from Google, and she had SLOs there. So she has been leading the product development process for us. We’re building an SLO product that I would be proud to run myself in production.

You can throw data away

Charity: (29:54) The second thing that we’re building that I’ll say a little bit less about is there’s this myth in this industry that you can not throw away data. The log vendors who are like, “Keep every log line,” and the metrics vendors who are like, “We keep every metric”. Bullshit. Either you’re throwing away data at ingestion time by aggregating or you’re throwing away data after by sampling. There is no other way. No company in this world is going to pay for an observability stack that is as large or larger than their production stack. It’s just not going to happen.

Charity: (30:22) We’ve kind of lost the muscles and the language, because our vendors have been monopolizing this conversation, and it’s been mostly the pre-aggregation types plus some logging vendors. We want to reintroduce sampling. We want to do it in a way that helps elevate the level of discourse and speaks to people like they’re engineers. Like, it is not outside of your scope to comprehend why sampling matters.

Charity: (30:44) We don’t have the libraries. We don’t have the language. We obviously are a tool where if you don’t sample at some level, it’s going to be absolutely unaffordable. But that’s fine, because what percentage of the 200s to your root domain do you actually care about? Almost none. You care about trends. You care about errors and outliers. All of this can be done incredibly cost effectively with intelligent sampling that weights common things like 200s as less important than things like 500s. So this is not a hard product to develop from an engineering perspective. It’s a difficult thing to develop from a language and marketing and educational perspective. It’s three years in the making, so I’m really pretty excited about it.
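One common way to do this kind of weighted sampling is to keep every error, keep a small fraction of successes, and record the sample rate on each kept event so counts can be re-weighted at query time. The sketch below is an editorial illustration of that general technique under assumed rates (1 in 1 for 5xx, 1 in 100 for everything else); it is not a description of the product being discussed.

```python
import random


def sample_rate_for(event):
    # Hypothetical policy: keep every server error, 1 in 100 of everything else.
    return 1 if event["status"] >= 500 else 100


def sample(events, rng):
    """Keep each event with probability 1/rate and record the rate on it."""
    kept = []
    for e in events:
        rate = sample_rate_for(e)
        if rng.random() < 1 / rate:
            kept.append({**e, "sample_rate": rate})
    return kept


def estimated_count(kept, predicate):
    """Each kept event stands in for `sample_rate` original events."""
    return sum(e["sample_rate"] for e in kept if predicate(e))


rng = random.Random(0)  # seeded so the sketch is repeatable
traffic = [{"status": 200}] * 100_000 + [{"status": 500}] * 50
kept = sample(traffic, rng)

print("stored events:", len(kept), "of", len(traffic))
print("estimated 200s:", estimated_count(kept, lambda e: e["status"] == 200))
print("estimated 500s:", estimated_count(kept, lambda e: e["status"] == 500))
```

The error count is exact because errors are never dropped, while the success count is an estimate that is good enough for trends at a small fraction of the storage.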

Darko: (31:27) Okay. So what’s the ETA?

Charity: (31:33) SLOs are being beta tested right now by some customers, and if anyone wants to try it, they should hit us up. The sampling stuff, maybe a month. We’re a startup. We move pretty fast when we decide to do something.

Darko: (31:51) Great. Sounds all very exciting.

Charity: (31:54) Thanks for having me on. This has been really fun.

Darko: (31:56) Yeah, for me, also. I learned a lot and it was great to hear your thinking process and how you see infrastructure ops, where it’s going. It looks very promising to me.

Charity: (32:11) Thanks.

Darko: (32:11) Thank you very much. See you!

Charity: (32:13) Bye!