In this podcast episode, I welcome Gleb Bahmutov, Senior Director of Engineering at Mercari. We talk about Gleb’s engineering experience at Cypress.io and Mercari US, discuss the testing pyramid and why it makes little sense, and talk about what we can use instead.
- Gleb’s Story with Cypress.io
- Testing Pyramid Makes Little Sense, What Can You Use Instead?
- Why Is the Testing Pyramid Shaped This Way?
- Testing Matrix
- The Cost of Bugs
- Setting Boundaries within Teams
Listen to our insightful conversation or read the edited transcript.
Like this episode? Be sure to leave a ⭐️⭐️⭐️⭐️⭐️ review on the podcast player of your choice and share it with your friends.
Hello and welcome to Semaphore Uncut, a podcast for developers about building great products. Today I’m excited to welcome Gleb Bahmutov. Gleb, thank you so much for joining us. Please go ahead and introduce yourself.
Gleb Bahmutov (00:17):
I’m Gleb Bahmutov. I’m a Senior Director of Engineering at Mercari US, an online marketplace available in the United States and Japan. If you have something that you no longer need or use, you can take a picture, create a listing and sell it very quickly. Mercari makes it so much simpler to reuse it by selling to someone else.
Yeah and it’s also good for the planet! You also have a history in testing, with your previous position at Cypress. Can you please talk about that?
Gleb’s Story with Cypress.io
Gleb Bahmutov (01:13):
I always loved writing software that works, and I loved writing tests. I’m a strong believer that you build value by writing code once, and then using it for a long time.
To be able to use code in my projects for a long time, I have to make sure that when I come back to them years later, I can understand it, upgrade dependencies and add features. To be able to do that, you need good quality code and tests.
So five years ago, I joined a company called Cypress.io where I worked on writing end-to-end tests for web applications. There are so many tools out there, like Selenium, PhantomJS, Nightmare, WebDriver. But when I saw that Cypress.io doesn’t use Selenium WebDriver, I thought it was the best thing. I’m excited to see my website being loaded, I’m excited to be able to see the commands and be able to debug them.
I used Cypress for a year before joining the company. When I joined, I was excited about writing web application tests. I worked for four years at Cypress.io as VP of Engineering, building the test run tools and integrations with CI providers.
Gleb Bahmutov (03:28):
And it’s kind of funny that I joined Mercari US, because the chief technical officer is someone I worked with before, who really wanted me to help them write end-to-end tests. So I found myself on the other side of the fence and turned into a 100% Cypress.io user myself. I can see the big picture and apply everything that we suggested users do when writing tests. I’m also teaching others and making sure the CI infrastructure works. All the things that go into day-to-day testing, I’m now doing as part of Mercari US.
Testing Pyramid Makes Little Sense, What Can You Use Instead?
There is an upcoming TestJS summit. We’ve known each other for a long time but we spotted you there giving a talk on a very interesting topic – “Testing pyramid makes little sense, what can you use instead?”
This is a great thing to talk about. There is a lot of enthusiasm that drove you through your career around testing, making great software. So I would love to hear from all your experiences how you ended up with this talk and what are the ideas?
Gleb Bahmutov (05:49):
Absolutely. So the TestJS summit is an online conference held on November 18th and 19th. Everyone can join from their room, right from work, or watch the talk recordings later.
My friend Roman Sandler, a developer from Fiverr, and I are making the presentation together. We always looked at the pieces of advice that everyone gives in the industry. One of them is that when you pick what tests to write, you have to follow the pyramid approach.
When you pick what tests to write, you have to follow the pyramid approach. - Gleb Bahmutov, Senior Director of Engineering at Mercari US
At the base of the pyramid, there is unit testing: the tests that you write for the smallest pieces of your code, like a single function or a single class.
In the middle of the pyramid are integration tests, which are fewer than unit tests. You start putting pieces of code together and you can test your database layer, but there is no database; you mock and stub the database itself.
And then at the very top of the pyramid, you have end-to-end tests, and that’s where you test the whole API or the whole web application: you visit it via its URL and operate like a real user.
Gleb Bahmutov (07:02):
And that’s where you would use Cypress. The advice was always to look at this testing pyramid and check the speed of execution for tests. If unit tests are fast, you can write a lot of them. As you go up the pyramid, you have slower tests, but they give you more confidence because they test bigger parts of the system.
But if you look at all the factors associated with testing day-to-day work, it’s not just running tests. When it comes to the speed of testing, everything matters, from installing the tool to writing tests, to running the tests.
Gleb Bahmutov (08:10):
Then if a test fails, you have to debug it, usually on CI, right? How much effort does it take to debug a failing test? A unit test versus an integration test, versus an end-to-end test. This is a very significant chunk of the effort. If you only need the tests to run faster, you go to Semaphore, buy a bigger plan, and spin up more boxes to run tests in parallel, and then they run faster. So to me, that’s not where you spend the effort. To me, debugging a failing test is where your human time is spent.
To me, debugging a failed test is where your human time is spent. - Gleb Bahmutov, Senior Director of Engineering at Mercari US
Gleb Bahmutov (08:47):
You should also think about how much effort it takes to maintain the test. If you change the feature, you have to change the test. So, is it hard to change your test or is it easy?
If you write a lot of unit tests, you’re most likely testing implementation. If you change what the code is supposed to do, all of a sudden you have to update all these tests. If you write a lot of integration tests where you mock everything, then when you change a feature or add something, often everything has to be redone and you pay this giant maintenance cost.
Gleb Bahmutov (09:30):
As for end-to-end tests, they operate like a user. By definition, you test through the public interface of your website. You can change the implementation under the hood, even swap your whole backend, and the tests should not be affected. The maintenance cost should actually be much lower.
The testing pyramid to me was like one axis up and down, expensive, cheap. It became a dogma: write more unit tests and few end-to-end tests.
Why Is the Testing Pyramid Shaped This Way?
Gleb Bahmutov (10:11):
Once Roman Sandler and I were talking about what we should replace the testing pyramid with. It got me thinking, “Why is the pyramid that shape in the first place? Why do we have a lot of unit tests but fewer end-to-end tests?”
From the point of view of the user, it should be the reverse: you should write more end-to-end tests to make sure everything they use works. When you develop software, while writing code, you can be writing unit tests – and most developers do that, because it’s easy.
So you write tests alongside putting the web application together. Then we deploy it and then we start testing it end-to-end, usually manually.
Gleb Bahmutov (11:19):
But typically, when you plan out how long a project should take, by the time you get to deploying and testing, you’re already out of time. By the time you need to write end-to-end tests, you’re out of time, and that’s why you write very few of them. To me, that’s why people write so many unit tests and so few end-to-end tests: they just simply run out of time.
People write so many unit tests and so few end-to-end tests because by the time they get to end-to-end tests, they just run out of time. - Gleb Bahmutov, Senior Director of Engineering at Mercari US
Gleb Bahmutov (12:07):
Now let’s talk about what we can replace the testing pyramid with. Roman Sandler and I decided to look at the whole picture, why we are doing testing in the first place.
Let’s talk about what we can replace the testing pyramid with. Because if we don’t come up with an alternative, what value is there in complaining that something’s bad? - Gleb Bahmutov, Senior Director of Engineering at Mercari US
You want to be confident that when you release or deploy something and the user uses it, it works. So you can actually think about the whole testing picture as a table: the rows go from low confidence to high confidence, and the columns show the effort you spend.
Gleb Bahmutov (13:02):
So you can spend very little effort, medium effort or a lot of effort. So let’s say you only do manual testing. You do it yourself, or you hire a team to do manual testing. Well, that would put you in the high effort column, because manual tests require a lot of time.
Now if you do everything manually, how confident are you that you can release the new feature or deploy software? I would say you cannot be very confident. It’s probably medium, because when humans have to do the same thing again and again, they miss stuff, because it’s a very boring job, honestly.
Gleb Bahmutov (13:58):
You want to spend very little effort, meaning everything is automated, but make sure your tests catch everything. You want to be in the top left corner – low effort, high confidence. Think about how much effort it takes to write whatever type of test you’re writing versus how much confidence you get at the end of the day. Then think – what can you do to lower effort and increase confidence?
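The effort-versus-confidence grid described above can be sketched as a small data model. This is only an illustration: the type names, the `score` function and the placement of automated end-to-end tests are assumptions made for the example, not something defined in the talk. Only the placement of manual testing (high effort, medium confidence) comes from the conversation.

```typescript
// Hypothetical model of the effort-versus-confidence matrix.
type Level = "low" | "medium" | "high";

interface TestingApproach {
  name: string;
  effort: Level; // human time spent writing, debugging and maintaining tests
  confidence: Level; // how sure you are that a release works for the user
}

const rank: Record<Level, number> = { low: 0, medium: 1, high: 2 };

// Higher score means closer to the "low effort, high confidence" corner.
// Subtracting the ranks is an assumed, deliberately simple scoring rule.
function score(approach: TestingApproach): number {
  return rank[approach.confidence] - rank[approach.effort];
}

// Returns a new array ordered from best payoff to worst.
function sortByPayoff(approaches: TestingApproach[]): TestingApproach[] {
  return [...approaches].sort((a, b) => score(b) - score(a));
}

const approaches: TestingApproach[] = [
  // Placement taken from the conversation: a lot of effort, medium confidence.
  { name: "manual testing only", effort: "high", confidence: "medium" },
  // Assumed placement, for illustration only.
  { name: "automated end-to-end tests", effort: "medium", confidence: "high" },
];

const ranked = sortByPayoff(approaches);
console.log(ranked.map((a) => `${a.name}: ${score(a)}`));
```

Each team would have to fill in its own rows honestly; the point of the sketch is only that every approach gets placed on both axes rather than on a single cheap-to-expensive scale.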
The Testing Matrix
Gleb Bahmutov (14:48):
Roman calls this the “testing matrix”: you apply all your professional knowledge to use this matrix as you plan your testing. You apply what you’ve done before, what tools you have experience with, and what tools you think are appropriate for the types of tests you have to write: unit tests, API tests, security reviews, etc.
Then be honest with yourself and think: I want to increase confidence. What kind of tests should I write? From my experience, it would be more end-to-end tests. It would probably take you longer to run them but the effort would be worth it.
I really hope that this dogma, “End-to-end tests are expensive, so let’s write few of them. Unit tests are cheap, so let’s write a lot of them”, will change. And hopefully the testing matrix will be a better way.
There is a lot of talk about the testing pyramid. Whenever we are helping our customers to iron out some of the bottlenecks that they have in their teams, those bottlenecks happen for very different reasons.
For instance, the team is growing from 15 to 115 engineers. They all need to keep being productive and decide how they’re going to interact and structure the teams. The more successful the company, the faster they need to move.
Another problem is having to run a test suite across 100 runners when they really want to give their developers a fast feedback loop.
Then there is also that element of flaky tests and test suites becoming slower over time. There are also those inverted pyramids where companies have close to zero unit tests. But as you were talking about the cycle of developing a feature, maybe at each cycle of improvement, in each iteration of the feature, we can rethink: for this concrete feature, how are we going to structure that pyramid or matrix?
I would say that for teams, just having that in place as a checklist item helps: okay, one of the things that we are going to do now is rethink our testing strategy for this feature. It sounds simple, but I would say it’s pretty profound, and a lot of teams should be doing that.
Gleb Bahmutov (19:36):
Darko, you hit the nail on the head here. We usually iterate on developing software. You plan the first feature, you implement it, you deliver it to the customer. If everything goes well, you have to deliver a second feature. And you plan it, you code it up, you test it, deliver it to the customer and so on.
So this whole process repeats, with, let’s say, four or five major steps: from planning to coding, to testing, to delivering or deploying. At each step, you can think about how you are doing the testing. During planning, you can ask yourself: how are we going to test this? The simplest thing, how are we going to write tests? Do we have to write tests at all? Maybe it’s a prototype. Maybe it’s a feature where you are not sure whether it’s going to stick. And you’re like, forget it, if we need to, we’ll manually check it.
When you are planning a new feature, ask yourself: how are we going to test this, how are we going to write tests? Do we have to write tests at all? - Gleb Bahmutov, Senior Director of Engineering at Mercari US
But you have to ask these questions during the planning phase.
The Cost of Bugs
Gleb Bahmutov (20:34):
And you also have to ask the question: how much would a bug in this feature cost us? If you are working on a payment system, you probably have to say that anything that slips by during planning and coding, all the way to the customer in production, will be very expensive, in our reputation, in revenue, and potentially in financial mistakes.
So some features are very costly if a bug slips in. Some features, not so much. Then, while you code the feature, ask yourself: do I have to write tests? What kind of tests should I write? Are the tests that I’m writing readable?
And this is where sometimes you look at the tests you’re writing and find that you’re fighting against the code: you write all these tests that just look bad and complex. So maybe you picked the wrong testing tool. Maybe you are writing the wrong type of test.
Gleb Bahmutov (21:35):
For example, if you have to stub a lot of code, maybe you are writing an integration test or unit test and you should be writing an end-to-end test instead. Ask yourself: do I have to test only how the feature works, or do I also have to confirm implementation details?
Here’s an example. Let’s say you’re doing payment tests and you decide, okay, it’s very important, I’ll write an end-to-end test that goes through the payment flow, enters a credit card and calls the payment service.
Now if you decide that you want to confirm implementation details, then when you write the end-to-end test, you will spy on the network call that sends that payment. And you confirm that the request it sends has the credit card, the name, the item ID, whatever you want, and that the response has certain fields and doesn’t have errors.
So that would be kind of confirming the implementation details of a network call.
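As a rough sketch of what confirming a network call could look like, the check below takes a request body and a response body and lists anything missing. The field names and the `checkPaymentCall` helper are hypothetical and chosen only for illustration; they are not Mercari’s API. In a Cypress test, the two objects would typically come from a spied request, for example one registered with `cy.intercept` and read back with `cy.wait` on its alias.

```typescript
// Hypothetical payload shapes; the field names are illustrative assumptions.
interface PaymentRequest {
  cardNumber?: string;
  cardholderName?: string;
  itemId?: string;
  securityCode?: string;
}

interface PaymentResponse {
  status?: string;
  errors?: string[];
}

// Returns a list of problems; an empty list means the call looks as expected.
function checkPaymentCall(req: PaymentRequest, res: PaymentResponse): string[] {
  const problems: string[] = [];
  const required: (keyof PaymentRequest)[] = [
    "cardNumber",
    "cardholderName",
    "itemId",
    "securityCode",
  ];
  for (const field of required) {
    if (!req[field]) problems.push(`request is missing ${field}`);
  }
  // A card security code is three or four digits, depending on the card.
  if (req.securityCode && !/^\d{3,4}$/.test(req.securityCode)) {
    problems.push("securityCode must be 3 or 4 digits");
  }
  if (!res.status) problems.push("response is missing status");
  if (res.errors && res.errors.length > 0) {
    problems.push("response contains errors");
  }
  return problems;
}

// A complete call, and one where a UI refactoring dropped the security code.
const complete = checkPaymentCall(
  { cardNumber: "4111111111111111", cardholderName: "Test User", itemId: "item-1", securityCode: "1234" },
  { status: "ok" }
);
const missingCode = checkPaymentCall(
  { cardNumber: "4111111111111111", cardholderName: "Test User", itemId: "item-1" },
  { status: "ok" }
);
console.log(complete, missingCode);
```

Keeping the check as a plain function means the same assertion can be reused from an end-to-end test or an API test without depending on either tool.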
Gleb Bahmutov (22:26):
Then consider this: payment and shipping are very important to any service. So you want to make sure they work. Now if you just test the UI through a page, then when you refactor the UI, you can accidentally forget some field in the payment call. For example, you forget that the credit card can have a three- or four-digit confirmation code when you enter it. If you don’t confirm the network call, you might get an error. And you’re like, can I debug this? And then you’re like, oh, I should have confirmed the implementation details. So you have to decide what you want to test.
Gleb Bahmutov (23:21):
Finally, when you do the deployment and release, that’s a good opportunity to write end-to-end tests for both happy paths and unhappy error paths. This is very important, but usually, when you deploy, you’re very short on time. People rush to meet the deadline, and you end up writing fewer end-to-end tests than you actually need.
But I personally think that iteration is very good, because when you iterate and you say, okay, now I have to plan the next feature, before you do it, that’s your perfect opportunity to say, okay, I’m going to be refactoring a lot of code, let me add end-to-end tests. Now during planning, I can start with them to make sure that when I add a new feature, I will not break existing ones.
Gleb Bahmutov (24:18):
So I think the iteration is very important. And at each step you can ask yourself questions regarding your testing strategy, the confidence and the effort you spend on testing. If you look back at your testing matrix, effort versus confidence, you should be sure that you know where you are and that you’re moving in the right direction (that is, lower effort, more confidence).
Setting Boundaries within Teams
That’s a great example that you gave with the payment option, where you potentially want to verify that details of the implementation are correct.
As software matures over the years, people usually start setting boundaries, if for nothing else then just to make sure that people can work together, be happy and not step on each other’s toes. You have to draw the line somewhere and make some cuts. And those borders are different. You can embrace a certain architecture for this or that reason, but you need to make those cuts. And potentially some team just ends up developing an API that some other team will use, and maybe a third team would implement the UI there. Do you have any experiences in that area or tips that you can share maybe?
Gleb Bahmutov (25:57):
Yes. For the last five months, as I’ve been looking to improve the quality of our web application and mobile application at Mercari US, we have had this situation multiplied by 100.
If you take any complex web application like Amazon or Mercari, what you see when you, for example, enter a new item to sell, or when you buy an item, or when you search for an item and it suggests items, is a facade. It’s not a single application. It’s almost as if you see a playground, and on the playground you have all these components that show different information. Now that information comes from all the different teams, like you said, that have their own boundaries, their own APIs.
Gleb Bahmutov (26:48):
For example, when you search for an item, there is a search API. So the widget you see is owned probably by one team, but the API it calls and all the data that goes into the analysis are owned probably by even more people than the people who actually make the widget.
At Mercari, we have at least ten, if not 20 or 30, teams, each responsible for APIs and microservices, and they all go into the single page that you see. So when we write end-to-end tests, we indirectly exercise the work of every team. When I joined and I was doing prototype end-to-end tests using Cypress, we found errors that were ultimately assigned to each team. Common searches did not work. We traced it back to a team that changed something in the API, and suddenly suggested searches would not work.
Gleb Bahmutov (27:49):
Pricing suggestions, where you list an item and suggest a price of, let’s say, $100, and it comes back and says, oh, items like this are usually sold for $50, did not work. Shipping, taxes: all the areas from different teams can be tested the way a user would use them, but here’s where things get interesting.
Yes, you can write the test, but the confidence in your test comes not when you can catch the regressions and bugs later on, it’s when you prevent them. And so that’s where organization of your development life cycle matters, and your CI matters so much.
The confidence in your test comes not when you can catch the regressions and bugs later on, it’s when you prevent them. - Gleb Bahmutov, Senior Director of Engineering, Mercari US
My first advice to anyone doing testing is to make sure you test in the dev environment first, then make sure you have dev preview environments where each pull request gets its own preview that you can run tests against.
Gleb Bahmutov (28:45):
And now, Darko, here is what we are doing with all the different teams, the teams that own the different APIs and the web team: when each team has pull requests, we have a web UI that uses their URL, and we run tests against that. So we are trying to catch a problem for each team before we merge code, before we have this problem and have to investigate and figure out who caused it and when. But this is the easiest part. That’s where you catch logical errors in a pull request. But there is so much more that requires running end-to-end tests.
Gleb Bahmutov (29:30):
For example, we use feature flags a lot. Many organizations have feature flags that control parts of a page or which API it calls. Every time you change a feature flag, usually using a dashboard somewhere, you want to run the tests, because we found that someone flipped a switch and all of a sudden some part of the site did not work correctly. It required so much investigation to trace it back to this feature flag.
If you change a feature flag’s value, you have to retest. If you redeploy something after upgrading the Go language version or a library, you have to run the tests again. If you do database migrations, you want to run some of the tests again to confirm, because you want to quickly trace a failure to the change that broke things.
Like you said, the boundaries could be set across teams. Bugs can cross boundaries easily. It’s your organization structure that has problems jumping over the boundary and finding who’s responsible for the change that introduced the bug.
Gleb Bahmutov (30:36):
So all I’m saying is that end-to-end tests are extremely effective, but it becomes an organizational question: can you organize running end-to-end tests, or whatever you think effective tests are, whenever something that can potentially cause an issue happens?
On all those occasions you want to run your tests, and when they finish, you want to be confident that nothing went wrong if they are green. If they are red, you want to immediately know why the problem happened and where it happened, so you can revert or fix it for real.
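The occasions mentioned in this part of the conversation (pull requests, feature flag changes, redeploys after an upgrade, database migrations) can be written down as an explicit policy so that nobody has to remember them. The sketch below is only one possible shape: the event names, the `TestRun` categories and the `docs_update` entry are assumptions added for illustration, and whether a given event deserves the full suite or a subset is a decision each organization would make for itself.

```typescript
// Hypothetical event names for changes that can break what the user sees.
type ChangeEvent =
  | "pull_request"
  | "feature_flag_change"
  | "redeploy_after_upgrade"
  | "database_migration"
  | "docs_update";

type TestRun = "full_e2e" | "subset_e2e" | "none";

// Assumed policy: which end-to-end run each kind of change should trigger.
const policy: Record<ChangeEvent, TestRun> = {
  pull_request: "full_e2e", // run against the preview environment before merging
  feature_flag_change: "full_e2e", // a flipped switch can break part of the site
  redeploy_after_upgrade: "full_e2e", // language or library version changed
  database_migration: "subset_e2e", // rerun some of the tests to confirm
  docs_update: "none", // assumed example of a change that needs no run
};

function testsToRun(event: ChangeEvent): TestRun {
  return policy[event];
}

console.log(testsToRun("feature_flag_change"), testsToRun("database_migration"));
```

Because `policy` is typed as a `Record` over every `ChangeEvent`, adding a new kind of change without deciding what to run for it becomes a compile-time error rather than a gap discovered in production.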
Gleb Bahmutov (31:26):
So being on the other side of the testing fence is really, really interesting. And I can definitely sympathize with every large company, medium company or small company when sometimes the website doesn’t work or they see an error in the DevTools console.
I used to make fun of everyone. I would open the DevTools and say, look, even Google, or whatever company, cannot fix that stuff. You go to any website and you find problems, but I do sympathize. It’s because of this huge organization with ownership and responsibilities that are very diffused, but the final product is the thing the user sees. So whatever you do, you have to organize the testing better.
Yeah, exactly. And as you said, the level of confidence that it brings you, as someone who is going to deploy a feature, is an amazing KPI, almost, to track on a team level. I’m not sure that I have ideas on how easy it is to implement, but I would say that teams, on a granular basis, based on their experience and their own domain, should define and tweak how they are going to approach it.
Gleb Bahmutov (32:35):
Yes. Roman suggested a very good measure of confidence: what do you still test manually after you deploy software? That tells you the confidence, even if it’s just opening the website to see that it still loads. And he’s like, why couldn’t you automate it? Whatever you test manually, go and automate it, so you don’t have to test it manually. So, that’s one measure of confidence.
Yeah, like the number of times I open the site when I deploy and go, okay, it works. Yeah, that’s great. Thank you, Gleb, there are a lot of insightful things here. We are going to make sure to link to the slides, and also to the upcoming conference talk. And also to share these experiences with our current customers and other engineering teams that we are talking to. Thank you so much and good luck with your new journey.
Gleb Bahmutov (33:26):
Thank you Darko and thank you for inviting me.