Jeff Smith on DevOps Antipatterns

In this episode, Jeff Smith, DevOps advocate, director of Ops at Centro, and author of the book Operations Anti-patterns, DevOps Solutions, talks about adapting DevOps patterns and avoiding pitfalls as companies grow from startup to full size. We talk about the types of engineers companies need as they evolve and how to strike the right balance between letting your engineers experiment and keeping things sane.

Key takeaways:

The usual way companies implement change management doesn’t work.
There are three types of engineers. Companies need the right mix as they scale.
Self-contained teams make better decisions about DevOps patterns they adopt.
Engineers need the freedom to experiment and fail.
At the same time, adding new technologies has a cost. Companies need to balance innovation with standardization.

Edited transcript

Darko (00:02): Hello, and welcome to Semaphore Uncut, a podcast for developers about building great products. Today, I’m excited to welcome Jeff Smith. Jeff, thank you so much for joining us. Please just go on ahead and introduce yourself.

Jeff Smith: My name is Jeff Smith. I’ve been in the production operations space for my entire career over the last 20 years. I currently work for a company called Centro, which is an advertising software platform based out of Chicago. I’ve been with them for about four and a half years, and I just recently finished up a book, Operations Anti-patterns, DevOps Solutions, which I think we’ll probably talk a little bit about today. The book helps individual contributors and engineering leads jumpstart DevOps transformations without necessarily having top-down support or buy-in.

DevOps Antipatterns

Darko (05:14): How did the idea of the book come up?

Jeff Smith: I started speaking at a bunch of different events. And the more I attended these events, the more I realized we were hearing a lot from these unicorn companies. They didn’t worry about profit, only about revenue. The dynamics are just completely different. So, I wanted to write a book that spoke to those people, but framed it in the context of their reality.

The genesis of the book was the idea that constantly comparing yourself to technological giants like Netflix can be cancerous. It’s great for inspiration, but you may not be solving your own problems. You might be solving someone else’s.

Darko (07:49): Can you give us an overview of what you are presenting in the book and what people can get out of it?

Jeff Smith: The book is centered around operational patterns that lead to organizational complexity and less operational efficiency. And a lot of them are rooted in people problems, right? We’re so quick to grab a tool to solve something, but in my experience, the order of solving these issues has always been people first, then process, then tools. Once your people are on board, you can start to work on your process.

I could deliver the perfect process, but just by instinct, you’re kind of like, “I wasn’t part of this, so I’m not interested because you didn’t think about my viewpoints.” How many times have you heard someone say, “oh, are you guys using Kubernetes? You should be doing Kubernetes because then you can do this, this, this, and this.” And then in your mind, you’re thinking of all this cool stuff, but you’re also like, “is this really solving an actual pain point I have, or is this something that is just cool?”

Use the special code podsemaphore19 to get a 40% discount on the Operations Anti-patterns, DevOps Solutions book or any other product from Manning Publications.

Change management is broken

The book deals a lot with the dynamics of that thought process. How do we get people on board? How do we build consensus? One of the things that we talk about a lot is change management. How do you leverage technology to enhance your process? For developers, change management is really about some sort of review, an approval process prior to something getting deployed. That’s really what a pull request is.

A pull request is just another form of change management. So we don’t have to have this knee-jerk reaction to change management as a process, we just have to think about how it fits into our workflows and how we can leverage the processes that we’re already doing today.

In an organization with change management, typically you’re presenting this highly technical change to a bunch of people that have no idea what you’re talking about. And then they ask you like, “well, do you think this is safe?” And it’s like, “do you think I’d be here if I didn’t?”

There’s very little value that gets added to those processes. But if I can pass this on to another technical engineer, they can comment and evaluate. Not only does it satisfy audit requirements, but it’s faster, it’s better collaboration. It’s just sort of win-win all around.

Jeff Smith: One of the tipping points will be more self-contained teams. Today, a tool choice reverberates throughout the organization. The person that makes that decision isn’t necessarily the person that feels the pain or consequences of those trade-offs. Now that we’re moving to these contained teams and you’ve got engineering people being responsible for the entire life cycle, those consequences move closer and closer to the decision-makers. Eventually, people say, “why do we keep running into this issue?” Or “why do we keep running into this resistance?” These things will naturally manifest themselves as we start to do that postmortem. When they see the problem recurring, and they feel that pain, they’re going to want to break it down and look at it.

Getting the right mix of engineers

Darko (16:08): I wanted to ask if you can relate what you have been saying to the size of the company and the journey that the company is on. Can you share your experience with team size and freedom versus structure?

Jeff Smith: That is something we experienced at GrubHub and are starting to experience at Centro. The people that are attracted to startups are very different than the people that are attracted to established companies, right? Startup engineers love to be involved with everything. They love the idea of figuring something out and getting it launched.

There was a blog post a while ago about V1 and V2 engineers. The V1 engineer is a person that likes to start stuff. And the V2 engineer comes in and sort of tidies it up. At startups, you sort of go through this phase where once you get to a particular size, just for sanity’s sake, you have to standardize. And that’s when you begin to see the shift in personnel. Those early engineers start to move on. And then you start to attract these V2 engineers that are coming in, and they’re saying, “oh my goodness, we’re using 30 different database technologies. We got to standardize on a thing.”

Balancing freedom to experiment with standardization

Darko (18:25): This wasn’t exactly the lines that I heard. Their biggest problem is multiple database stores.

Jeff Smith: Everyone’s got a preferred database store for whatever reason. And at the end of the day, it’s usually like one feature or something like that. So I think this is important, especially from an ops perspective: there are reasons to standardize and say, “hey, this is the way it’s going to be and we just have to deal with it.”

But if you want to still encourage that freedom, the thing that we used to do is what we call the Yellow Brick Road, where we build up a lot of tooling and a lot of automation around a particular set of technology choices. And we say these are the things that have to be implemented no matter what. You have to be able to instrument everything. You have to be able to automate a restart. You have to be able to accept these sorts of signals.

You can step off that road and choose something else, but you still have to meet all of these standards. When faced with those sorts of trade-offs, you’re like, “do I really need MongoDB that badly?” Because if you do, then it’s worth that effort, right? Whereas if you’re like, “well, I really don’t know if we want to do all that,” then maybe we’re talking about a preference as opposed to a better technology choice.
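As an illustration of that kind of gate (not something described in the conversation), here is a minimal Python sketch that checks a proposed technology against the three requirements Jeff lists: instrumentation, automated restart, and signal handling. The names `TechnologyProposal`, `REQUIRED_CAPABILITIES`, and `missing_capabilities` are hypothetical, and a real check would inspect actual tooling rather than boolean flags.

```python
from dataclasses import dataclass

# The three non-negotiable standards mentioned in the conversation.
# The identifiers are assumptions made for this sketch.
REQUIRED_CAPABILITIES = ("instrumented", "automated_restart", "handles_signals")


@dataclass
class TechnologyProposal:
    """A technology someone wants to run outside the paved road."""

    name: str
    instrumented: bool = False
    automated_restart: bool = False
    handles_signals: bool = False


def missing_capabilities(proposal: TechnologyProposal) -> list:
    """Return the standards the proposal does not meet yet."""
    return [cap for cap in REQUIRED_CAPABILITIES if not getattr(proposal, cap)]


# Hypothetical proposal: metrics are wired up, the rest is still to do.
proposal = TechnologyProposal(name="new-datastore", instrumented=True)
gaps = missing_capabilities(proposal)
print(f"{proposal.name} still needs: {gaps}")
```

The list of gaps makes the cost of leaving the paved road visible, which is the point of the trade-off question above.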

So that path served us very well at GrubHub and we’re looking to implement something very similar at Centro with the advent of Kubernetes. We want people to have flexibility, but at the same time, we want to make sure we’re making intelligent choices. Because the unspoken truth is you got to offer your top engineers some flexibility or they’ll leave. So you want to make sure that you’re giving people an area to grow. Because some people just get bored coding Ruby after 10 years. They want to try something new.

Coming up with a structure where people can embrace that polyglot mentality is good, but you still have to protect the organization. One of the things that we do at Centro is that we scale the ops team based on the number of unique technologies, not on the number of services or servers. The idea is that a person can only hold, let’s say, seven technologies in their head. That puts pressure on the organization: “well, if you choose this brand new database store amongst the other database stores we already have, we have to scale the ops team. Because we don’t have anyone with the mental bandwidth to be able to support whatever new tool you’re choosing.” And then that puts the organizational pressure to say, “do we need to do this?”
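The arithmetic behind that pressure can be sketched in a few lines of Python. This is a deliberate simplification that assumes each ops engineer covers at most seven technologies and that coverage never overlaps. The function name and the technology names are hypothetical and are not Centro’s actual stack or formula.

```python
import math

# "Let's say seven": the rough per-person limit quoted in the conversation.
TECHNOLOGIES_PER_ENGINEER = 7


def ops_engineers_needed(technologies: set) -> int:
    """Estimate ops headcount from the number of unique technologies."""
    return math.ceil(len(technologies) / TECHNOLOGIES_PER_ENGINEER)


# Hypothetical stack that exactly fills one person's capacity.
stack = {"postgresql", "redis", "ruby", "nginx", "kubernetes", "kafka", "elasticsearch"}
before = ops_engineers_needed(stack)

# Adding one more database store tips the team over the limit.
after = ops_engineers_needed(stack | {"mongodb"})
print(f"ops engineers before: {before}, after: {after}")
```

Under these assumptions, adding servers or services does not change the estimate, but a single new technology can.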

Engineers need psychological safety to fail

Darko (22:19): Okay. I checked out one of your talks from a couple of years ago. I think it was “The Good, The Bad, and Ugly about DevOps” or something like that. So I’m going in the bad and the ugly direction here. Did you ever experience a technology that turned out to be a bad choice? Have you had the experience of managing that or cleaning that up?

Jeff Smith: Yeah, it’s messy, right? Because what ends up happening is the person that really loved that technology leaves, then it becomes that thing everyone is afraid to touch because it’s dangerous. You need the right team to approach this, and that is going to be a mix of like the V2 and the V1 person. You need a combination of those two types of engineers when you tackle those problems.

Having that psychological safety gets rid of so many of the issues encountered while re-architecting. It’s about understanding all the things that this thing does, communicating with all of the stakeholders. Nothing is worse than that feeling you get when you change something and then there’s a group that you didn’t even know existed that’s like, “hey, we do X, Y, Z off this thing and you just broke it.” That’s why it’s important to institute some sort of psychological safety to give the team the freedom to experiment and fail.

Because you never know when you’re going to implement something that just isn’t quite right, I might use the strangler pattern for this kind of migration. The strangler pattern is a huge thing for migrating. You pull features slowly off of the old architecture into the new architecture. When you inevitably make a mistake, you just flop it back over to the old one. To slowly migrate functionality from one to the other, you stick a load balancer in front of them and route particular traffic, assuming it’s a web app, to the new nodes or the old nodes. You might have to figure out how to do data migration if you’re changing data stores or things like that, but all in all, it’s better than having to stumble through the process with a broken server, not knowing the true impact.
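The routing decision at the heart of that approach can be sketched in Python. This is a minimal illustration under assumptions of my own: traffic is split by URL path prefix, and `StranglerRouter` and the backend hostnames are hypothetical. In practice this logic would live in the load balancer configuration rather than in application code.

```python
from dataclasses import dataclass, field


@dataclass
class StranglerRouter:
    """Decides, per request path, whether the old or the new system serves it."""

    old_backend: str
    new_backend: str
    migrated_prefixes: set = field(default_factory=set)

    def migrate(self, prefix: str) -> None:
        """Start sending traffic for this feature to the new architecture."""
        self.migrated_prefixes.add(prefix)

    def roll_back(self, prefix: str) -> None:
        """Flop the feature back over to the old architecture."""
        self.migrated_prefixes.discard(prefix)

    def route(self, path: str) -> str:
        """Return the backend that should handle the given request path."""
        if any(path.startswith(prefix) for prefix in self.migrated_prefixes):
            return self.new_backend
        return self.old_backend


# Hypothetical hostnames, used only for illustration.
router = StranglerRouter(
    old_backend="http://legacy.internal",
    new_backend="http://next.internal",
)
router.migrate("/invoices")
print(router.route("/invoices/42"))
print(router.route("/reports/2020"))
```

Because only the migrated prefixes reach the new nodes, a bad migration is undone with a single `roll_back` call while everything else keeps running on the old system.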

The V3 engineer

Darko (26:33): There is a V1, there is V2, is there a V3 engineer?

Jeff Smith: I think there is. The V3 engineer is the engineer that everyone hopes they need, and that’s the one you need when you’re wildly successful and you are hitting scaling challenges you weren’t anticipating. Everyone wants to have Twitter-sized growth. Everyone wants to be getting that number of transactions. But truth be told, as an engineer, you’re trading off the issues that you’ve got today with a future state you may never make, right?

You get to a point where a bigger database isn’t enough anymore. Now you’ve got to scale and shard across multiple databases. That’s a different set of solutions and patterns. In my mind, that’s the V3 engineer. That’s the one that’s coming in and saying, “okay, we need to transform this thing into something that is highly scalable.” But it’s a good problem to have.
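For readers unfamiliar with sharding, here is a minimal Python sketch of one common approach, hash-based shard selection. It is not taken from the conversation. The function name `shard_for`, the key format, and the shard count are assumptions made for illustration.

```python
import hashlib


def shard_for(key: str, shard_count: int) -> int:
    """Map a record key to a shard index using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Use the first 8 bytes as an integer so the result is the same
    # on every machine and every run.
    return int.from_bytes(digest[:8], "big") % shard_count


# Hypothetical customer key spread over four databases.
shard = shard_for("customer-42", shard_count=4)
print(f"customer-42 lives on shard {shard}")
```

With this simple modulo scheme, changing `shard_count` reassigns most keys, which is one reason sharding brings its own set of patterns compared with simply buying a bigger database.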

It’s interesting to be able to identify those V3 engineers because if you don’t have that scale, they get bored very quickly, right? And then they start inventing problems to solve.

We have this conversation with my engineers all the time. When we see a poor SQL query running on a database server, our instinct is like, “this is so sloppy, you need to get better.” But then when we talk with the engineers and realize the effort that it takes, we’re like, “it’s probably easier for us to click this check box in AWS and just make the DB bigger than to spend three months retooling all the SQL queries.” Yes, it’s bad. We understand it’s bad, but the effort involved, does it make sense? If you’re a true team, you recognize that sometimes the trade-off may mean more pain for you in the short term, but in the long term, it’s helping the organization out better.

The most important thing you can do in a pull request or a commit message is provide context for the decision that you made in a particular change. And I think context, even in these cases for architecture, is a huge bit. When the V3 engineer finally does show up they can look and understand why a particular decision was made that day. These were the trade-offs we were facing, and this is the choice that we made.

Darko (30:43): Okay. So thank you, Jeff. We are going to link to your book. Are there any talks that you’re going to give soon at some of these virtual events?

Jeff Smith: Yeah. There’s a few that we’re working on. Actually today I’m doing something with the DevOps London exchange. So the best place to check is my website, attainabledevops.com. I’ll link to a bunch of stuff there.

Darko (31:22): Great. Thank you so much. Good luck with those talks and thanks again.