3 Mar 2022 · Software Engineering

    On the Importance of Observability-Driven Software Development

    18 min read

    Charity Majors is the CEO and co-founder of Honeycomb, a tool for software engineers to explore their code on production systems. Prior to co-founding Honeycomb, Majors worked as a production engineering manager at Facebook, was the first infrastructure engineer on Parse, and spent several years at Linden Lab as an operations and database engineer working on the infrastructure that powers “Second Life”.

    In this interview, we spoke with Majors about the importance of putting developers on call, the concept of observability-driven development and software ownership within teams.

    You always underline the fact that you’re coming from the “ops” world rather than the “dev” one. In one of your talks you mention that “software engineers spend too much time looking at code in elaborately falsified environments, and not enough time observing it in the real world.” On the other side of the wall, ops people need to deal with real outages and errors, but they can’t really own what they receive from the devs. You have a lot of experience in scaling software and leading teams — so what would you suggest as a best practice for actually fostering software ownership in both dev and ops teams?

    There are a couple of core principles to grasp; after that, you can get as creative as you want with the implementation details. First, you need to empower your engineers to own the full lifecycle of their code. This means that everyone who writes code must have the ability to deploy and roll back their code, and the ability to watch and debug it live in production (through the lens of their instrumentation). Second, you must try to align incentives so that any effects or consequences of an action or change are experienced first by the people who performed them, as soon as possible after that action or change.

    It’s obvious why, if you think about it. The person who made the change or took the action knows exactly what they just did or shipped, and why, and should have a good mental model of the intended changes. That model is fresh in their head for a few seconds or minutes, but it rapidly decays. Hours later, they’ll have only slightly more of a clue than their teammates who never had the model in their heads. Days later, it’s gone. Someone who is completely unfamiliar with the code (e.g. your ops team) is likely to have even less of an idea of what the change was and what should have happened. Therefore, the best time to spot an unintended consequence or bug is right after it was shipped, and the best person to spot it is the person who wrote it. It’s all downhill from there.

    We’ve all dug up bugs that we realized must have shipped months before, and nobody knows why. How much time and agony would have been spared if that engineer had just used their muscle memory to go and look and compare what actually happened to what they expected to happen, every time they deployed?

    The brute force implementation for the second part is also the best one: put your software engineers on call for their own code. They all need to learn to experience consequences. A little pain is Nature’s greatest teacher; it’s how you develop good intuition and build up reflexes.

    You touch a stove, it burns your finger, you don’t want to touch it again. You ship a query you have an iffy feeling about, your pager goes off — same thing. By exposing yourself to the consequences part of the cause-effect process, you’re developing your intuitive muscles.  Developers who never expose themselves to what comes out of these systems are famously and dangerously ignorant of real-world effects.

    When I started talking about this two or three years ago everybody was like, “oh, you can’t put software engineers on call. They’ll quit!” A few might, but the good ones won’t. We don’t have a choice anymore. Our systems have gotten so sprawling and complex that if one person writes and deploys the code and a second person gets paged for its errors and tries to debug it, that debugging could take days. It could literally be impossible. Your best bet for delivering a quality experience to your users is to identify problems swiftly. So, you need to construct these tight, virtuous feedback loops where the person writing the code has the ability to watch it running live in production and fix their own bugs. But if we’re going to put everyone on call, we have a responsibility to the people in the rotation — on call must not suck. It cannot be something seriously life-impacting that you have to plan around or sacrifice your sleep, your health or personal life for on a regular basis.

    Ops people have a terrible history of self-abusive behavior. I completely understand why software engineers freak out about being asked to join the ops people on call. But the answer isn’t to make twice as many people miserable and sleepless wrecks — it’s to utilize virtuous feedback loops to make the system fairer and better for everyone.

    Nobody should have to suffer through the bad old kind of on-call rotations. I’m in my 30s now, and I can’t do that anymore either — but no one should have to. Systems don’t have to be like that. They only become like that after years and years of running into this dreadful state where nobody has the context or is empowered to fix the real problems, so everybody is just duct taping and monkey patching with whatever tools they have, and you’re all just trying to stuff your hands in the leaks. That’s not “fixing” anything, and I don’t want anyone to endure that treatment — not developers, not ops. Keep your system clean and orderly and well-understood, pay down technical debt when it becomes cumbersome. Utilize SLAs and design your systems to degrade gracefully under most types of failures. Be ruthless about pruning alerts. Consider it an emergency if anyone gets woken up — ever — and post-mortem that.

    So it’s not only about “the dialogue” and communication between the development and operations teams?

    Talking doesn’t do much unless you act on it. No, I’m talking about real-world cause and effect machines, the incentives and consequences that drive human behavior. We develop habits through repetition, and change them when we feel pain. Pain doesn’t have to be excruciating… just a pinprick here or there. You don’t have to leave your finger resting on the hot stove to learn the lesson. This feedback loop is exactly what we want. We want to expose people to just enough negative consequences that they quickly learn what to do and what not to do, and over time this matures into rich layers of intuition about their systems.

    But that also includes some extra learning time on both sides, right?

    Of course, absolutely. That’s why managers have an absolute responsibility to remember that operational work and infrastructure work are just as “real” as feature work, and the skill sets are just as difficult and as valuable. Reliability isn’t something you just fit in the couch cushions, you know, between features — you have to dedicate actual focus, concentration, learning and development to it.

    Your ops people are your best allies and resources here, if you’re lucky enough to have them. They’re usually delighted to be helpful if you are learning about operability because they’re so stoked to see you taking an interest in their field. Everyone has a stake in this: managers, directors, product people, ops and dev.

    In one of your talks you mention that “the health of the system doesn’t tell you a lot about the health of the request,” and you advise “invest in quality as far as it makes business sense for you.” When we add these two together we might say that you’re really user- and business-centric in terms of what software is for. It’s not only software and infrastructure for its own sake. Do you think that nowadays, with a lot of layers of deploying and maintaining software, we focus too much on tooling rather than on the outcome and the purposefulness of the software?

    The thing I hear from our (Honeycomb) users over and over again is, “All that stuff that vendors always promise, you guys actually do.” People are so used to being lied to by vendors. Vendors will insist to your face that you can just “insert hundreds of thousands of dollars and get a reliable system out.”

    That’s never how it works. Magic is always dumb and the lowest common denominator; the stuff that makes a tool powerful is locked up in your head. Always be suspicious of magic; it’s always too good to be true.

    The health of the system is irrelevant to you as an app developer. You don’t care if a third of the infra is technically down — all you care about is “can my code execute? Can every request succeed?” It’s not your job to understand the infrastructure. It’s only your job to handle things at the request level like timeouts or retries.
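
    As a minimal, hypothetical sketch of what “handling things at the request level” can look like (the do_request() stub, the error type and the numbers below are invented for illustration): bounded retries with backoff, without ever needing to know which piece of infrastructure misbehaved.

        # Illustrative only: retry a flaky remote call a bounded number of times.
        import random
        import time

        class TransientError(Exception):
            pass

        def do_request(payload: str) -> str:
            # Stand-in for a real network call that sometimes times out.
            if random.random() < 0.3:
                raise TransientError("upstream timed out")
            return "ok: " + payload

        def call_with_retries(payload: str, attempts: int = 3, base_delay: float = 0.2) -> str:
            for attempt in range(1, attempts + 1):
                try:
                    return do_request(payload)
                except TransientError:
                    if attempt == attempts:
                        raise  # give up and let the caller decide what to do
                    delay = base_delay * (2 ** (attempt - 1)) + random.uniform(0, 0.1)
                    time.sleep(delay)  # exponential backoff plus a little jitter

        print(call_with_retries("checkout"))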

    People get all hung up on lots of nines, but nines don’t matter to your customer. You can have five nines of reliability — and still have deeply unhappy customers for whom your product doesn’t work. You can also have only two nines of reliability, and have very happy customers. The point is to understand your use case and your requirements, and set an SLO for your user experience that’s reasonable. Quality, latency and error rates are rough proxies for user happiness, but they are not the same thing as user happiness — and that’s an important distinction. If you don’t have to care about high availability, you shouldn’t, because it comes with added costs and complexities. Just try very hard to care about as few things as you absolutely have to, because caring is hard.
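
    To put rough numbers on those nines, here is a minimal sketch (in Python, purely illustrative; not a Honeycomb feature or a recommendation) of how an availability SLO translates into a downtime budget:

        # Illustrative arithmetic only: how much downtime a given availability SLO allows.

        def downtime_minutes_per_year(slo: float) -> float:
            """Allowed downtime in minutes per year for an availability SLO such as 0.999."""
            minutes_per_year = 365.25 * 24 * 60  # ~525,960 minutes
            return minutes_per_year * (1 - slo)

        for slo in (0.99, 0.999, 0.9999, 0.99999):
            print("{:.3%} -> {:7.1f} minutes of downtime/year".format(slo, downtime_minutes_per_year(slo)))

        # 99.000% ->  5259.6 minutes/year (~3.7 days)
        # 99.900% ->   526.0 minutes/year (~8.8 hours)
        # 99.990% ->    52.6 minutes/year
        # 99.999% ->     5.3 minutes/year

    Whether 5.3 minutes or 3.7 days is acceptable is exactly the business question described above: it depends on your users and use case, not on the number of nines.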

    The software stack is a total disaster at the moment, and I’m not here to make excuses for that. Ignore the shiny new tools; instead, use the simplest, best-known thing you can think of, and try to reuse tools instead of introducing new ones. Make the software serve the purpose, not the other way around.

    But do you think that there are still ideas for tools that are missing?

    Of course. We are going to need a tool to help us keep track of all our tools, aren’t we? Kidding, but only kind of. The world is changing, and that always demands new tools. There’s an interesting shift right now towards an ecosystem of systems whose most salient characteristic is their distributedness, and that demands a whole new way of life.

    That’s part of what exposed the cracks in existing monitoring and metrics and logging solutions, and drives the need for the newer philosophy of observability. Observability is about instrumenting and building a view of the world from the perspective of the event, so you can reason about the insides of these vast sprawling distributed systems without having to log in or run ssh or strace — you can simply inquire from the outside, and understand what’s happening on the inside. It’s key for a world where most of the questions you need to ask are new questions and unknown-unknowns. Your dashboards are only ever going to answer the questions you predicted and pre-aggregated and configured in advance, whereas you need to be asking brand new questions about unknown-unknowns constantly.

    Honeycomb is the first tool of its kind, fully native to distributed systems and inhabiting the observability worldview. Many technical ramifications flow from that definition of observability. You must have arbitrarily wide, structured events, and you must be able to run flexible read-time aggregation queries on your raw events. Pre-aggregation and indexes are verboten — they lock you into a set of known unknowns. Tracing is just a visualization on top of your events, because the ordering is there naturally. You’re going to see more systems like these, because this approach is just so much friendlier, simpler and more powerful than traditional monitoring.
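
    To make “arbitrarily wide, structured events” concrete, here is a minimal sketch in Python; the field names and the emit() helper are hypothetical stand-ins rather than any particular vendor’s SDK. The idea is one event per unit of work, enriched with every field that might be useful later, and aggregated only at query time.

        # Illustrative sketch: one wide, structured event per request.
        import json
        import time
        import uuid
        from typing import Optional

        def emit(event: dict) -> None:
            # Stand-in for shipping the event to an observability backend;
            # here we just print one JSON object per event.
            print(json.dumps(event))

        def handle_request(user_id: str, endpoint: str, trace_id: Optional[str] = None) -> None:
            # Attach anything you might conceivably want to query later.
            event = {
                "timestamp": time.time(),
                "trace.trace_id": trace_id or uuid.uuid4().hex,  # ordering / trace awareness
                "trace.span_id": uuid.uuid4().hex,
                "user_id": user_id,
                "endpoint": endpoint,
                "build_id": "2022-03-03.1",  # which deploy served this request?
            }
            start = time.monotonic()
            try:
                # ... do the actual work, recording whatever might matter later ...
                event["cache_hit"] = False
                event["db.rows_examined"] = 4127
                event["status_code"] = 200
            except Exception as exc:
                event["status_code"] = 500
                event["error"] = repr(exc)
                raise
            finally:
                event["duration_ms"] = (time.monotonic() - start) * 1000
                emit(event)

        handle_request("user-42", "/checkout")

    Because each event carries its own trace and span IDs, a trace view is just another way of slicing the same raw data, which is the point about tracing being a visualization on top of events.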

    When we started this company, we were told over and over that this was a solved problem, there was nothing new to be done in the space and that everything was under control. We knew this had to be false because everybody hates their monitoring tools, including us — and you don’t hate a tool that’s doing its job well. Now, you look at the space and everybody is imitating our language, claiming that they do observability too, and they are tacking on features and doing acquisitions to try to build up the single source of truth that we started with: that original arbitrarily wide structured event, with ordering for trace awareness.

    I’m not some remarkable genius. I was at Facebook and got to apply its tools at scale to our problems on the Parse platform, and I saw how transformative it was for us every day. It dropped our time to debug problems by orders of magnitude. Companies like Google and Facebook used to be light years ahead of people like me and you, but that delta is dropping like a rock. More and more people are hitting distributed-systems problems earlier and earlier in their development.

    You mention these two big brands that operate on a very large scale. Do you think there will be more and more companies coming to the table that will need solutions to address issues that the Googles and Facebooks of this world face?

    Yes, I do, because the situation where the complexity of distributed systems is the most salient characteristic is true of more and more systems, and it’s happening earlier and earlier in a company’s development, when they’re smaller and smaller. These paradigms are responding to real needs on the product side, but they’re creating a lot of complexity and debt on the infrastructure side. We’re just going to need better tools to deal with it.

    You focus on observability and the business value that observable systems bring. You mentioned a concept called ODD (observability-driven development). I know it started as a joke, but if we could take a closer look at it: when does observability kick in in the software development cycle? How can we embed it there?

    I started saying it as a joke, but then I noticed more and more people starting to talk about observability and write articles about it, and I think that speaks to a real need. For example, unit tests run on your laptop and… that’s it. There are so many things they don’t test and physically cannot test, like network hops, partial degradations, third-party services, etc. Nobody has adequate observability or monitoring connected to their tests while they’re running, so they don’t catch these things. Testing is such a microscopic view, and we have to move the tools, processes and techniques of interacting with the production environment to as early as we possibly can. We have to relinquish this sense of control over our code and accept that there’s so much we don’t know or don’t understand, and all we can do is put it out there in a realistic environment and watch it.

    Observability-driven development is when you’d never accept a pull request unless you can answer the question: “How will I know if this is okay?” It’s adding operational context and sense from the minute you’ve written your code for the first time. It’s measure twice, cut once. Use observability to decide whether a thing is worth building, whether you shipped what you think you shipped, and to validate your assumptions and guesses about production at every step and stage.
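
    As a purely hypothetical illustration of what answering “How will I know if this is okay?” can look like in code (the feature flag, field names and pricing functions below are invented for the example): ship the change behind a flag and record enough per-request fields to compare the new path against the old one once it is live.

        # Illustrative sketch: instrument a change so old and new behavior can be compared in production.
        import random
        import time

        def legacy_price(item_id: str) -> float:
            return 9.99  # stand-in for the existing behavior

        def new_price(item_id: str) -> float:
            return 9.49  # stand-in for the change being shipped

        def lookup_price(item_id: str, event: dict, use_new_pricing: bool) -> float:
            # Tag the event with the code path so you can break down latency,
            # errors and results by old vs. new after deploying.
            event["feature.new_pricing"] = use_new_pricing
            start = time.monotonic()
            price = new_price(item_id) if use_new_pricing else legacy_price(item_id)
            event["pricing.duration_ms"] = (time.monotonic() - start) * 1000
            event["pricing.value"] = price
            return price

        # Example: roll out to ~10% of requests and watch the two populations
        # diverge (or not) in your observability tool.
        event = {}
        lookup_price("sku-123", event, use_new_pricing=random.random() < 0.10)
        print(event)

    The instrumentation ships in the same pull request as the change, which is what makes the question answerable before the merge.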

    In your blog post entitled “Shipping Software Should Not Be Scary,” you write: “deploy software is the most important software you have! In other words, make deploy tools a first-class citizen of your technical toolset.” At Semaphore we get a lot of feedback from our customers about how enabling it is for them to do CI/CD quickly and in a flexible manner. On the other hand, I’ve seen companies where deployment was done once a month on Fridays, manually. What do you think still prevents companies from investing properly in how their software is shipped and maintained? Is it a matter of culture? Of the time they have on their hands? Or too big a focus on what software features do for the business over what software reliability does?

    It’s the essence of ops: how you deliver value to your users. It is the beating heart of your software team. If you’re not shipping, it doesn’t matter what you’re writing — it isn’t out there; it’s not doing anyone any good on your laptop. I really enjoy Jez Humble, Gene Kim, and Nicole Forsgren’s book (“Accelerate: The Science of Lean Software and DevOps: Building and Scaling High Performing Technology Organizations”), which showed that the speed at which you ship code and its error rate reinforce one another: as speed goes up, errors go down. There are so many dysfunctions of engineering teams and engineers that you can solve just by greasing the wheels and getting everyone to move faster; but you can’t do that without confidence. Right now, people use the amount of time they’ve gone without getting paged as a proxy for confidence — because that’s the best they’ve got. So you naturally wouldn’t want to ship a bunch of things straight in a row, because you haven’t been given enough time to build confidence in each one. If you have other ways of gaining confidence, if you can look at what you know you’ve changed and compare it side by side with the old ways, then you gain a lot more confidence to keep that motor humming.

    How do you increase this consciousness within the industry?

    It starts by just showing people that the pain they have can be solved. It’s drawing connections. Everybody out there will sit there and rattle off all of these annoyingly painful things about their jobs, and then you just have to show them what can be addressed by what. You just have to tell that story again and again, and you have to get other people to tell that story too, because everybody’s going to resonate with a different story. I’m not a very abstract thinker. I don’t understand unless you get very specific, so I’m a big fan of storytelling.

    You know a lot about being on call and have an ambiguous relationship with the word “serverless,” so I couldn’t resist asking you this question: Who should be on call when you’re running a serverless infrastructure?

    You’re always responsible for the availability of your software. You can try to give that responsibility away to someone else, but it doesn’t work. It’s up to me, even if Amazon is down. This is really about making people aware that operations isn’t about any particular place in the stack. It’s about responsibility. It’s the “how” of shipping software. If you’re shipping software, you do ops. That has lots of different manifestations, but getting too bogged down in the details distracts from the fact that if you’re shipping software you should respect ops. You should know enough about it to make your users happy, and being on call should not be miserable — it should not be life-impacting.

    I’ve been on call since I was 17. Now at Honeycomb we have zero dedicated ops people. We have two people who have an operational background, but they ship product now and they just consult, and they help all the other engineers to come up to speed. We have one rotation, which all the engineers participate in, and nobody gets woken up more than maybe once in a quarter, and we’re growing very rapidly.

    Perhaps it’s an empathy-driven thing, once developers get more direct contact with clients?

    Exactly — that’s a huge thing, which is why developers need to be put on call.

    We’ve come full circle now.

    Yes — we’ve ended up exactly where we started, and I love it.

    Article originally published on The New Stack.
