Key Points From This Episode:
- Introducing today’s topic — dreaded flaky tests.
- Sharing laughs about croissants and flaky tests.
- The correlation between heisenbugs and flaky tests.
- Programming banter about heisenbugs and Breaking Bad’s antagonist, Heisenberg.
- Hear about David’s favourite date-time flaky test.
- The importance of flushing your database and clearing cache before test runs.
- David talks about his most confusing tests.
- Find out about the ins and outs of parallel testing.
- What you should and shouldn’t do with one-hour builds.
- Why ‘sleep 5’ is not a good component to have in your code.
- The one trap people fall into using the page object model.
- How assertions can lead to tricky behaviors or values.
Transcript for Episode 185. Flaky Tests
[0:00:00.3] MN: Hey, Dave.
[0:00:01.2] DA: Hey, Mike.
[0:00:02.1] MN: How’s it going? We’re reaching the end of the year. Can you believe it?
[0:00:05.5] DA: End of the year, 2020, what a year. What did you think of it? I have a quick survey for you.
[0:00:11.8] MN: I’d be more than happy to fill out any survey that I have pertaining to 2020, but this is not just about 2020, right?
[0:00:18.9] DA: Well, I happen to have a survey right here that you can fill out for 2020. And we really appreciate everyone listening to the podcast. It’s really kept us going through the year, having you guys listen and comment on everything that’s going on and we want to keep talking to you, we want to hear more.
[0:00:43.6] MN: Right. I think that one of the things that we will make an extra effort for in 2021, I feel, is the customer participation. I want to be able to interact with the listeners who are listening and be available for any questions or comments and stuff like that.
[0:01:02.5] DA: Yeah, trash talking, if you think that Ruby is fine and that I should get over Python, then you can let us know as well.
[0:01:10.9] MN: We need to be more responsive for those hot takes too, whenever we dish them out. But yeah, Dave, as you mentioned, you have a survey about the Rabbit Hole?
[0:01:21.5] DA: Yeah, let me get you the link. It is bit.ly/rabbitholesurveykebabcase. If you know what it means, then you should take the survey. But if you don’t know what it means, you should also take the survey, kebabcase means it’s a dash. So, rabbit-hole-survey.
[0:01:45.5] MN: There you go. Upon completing the survey, I’ll probably have your email associated to the survey, that way, we’re giving out a prize. A random selection to an individual –
[0:01:58.6] DA: A fabulous prize?
[0:02:00.4] MN: It’s a fabulous prize, yes. We are planning to give out a fabulous prize and the prize is going to be a cool gift. It’s going to be a Raspberry Pi kit.
[0:02:10.7] DA: Man, I am kind of jealous. I feel like I should get this on my Christmas list as well. I know you have one yourself.
[0:02:18.5] MN: I do. But I’m definitely going to fill out the survey like five times, so don’t worry about it. You should fill out the survey for sure and with your email, we’ll ensure that we contact you if you are the selected winner. We would probably need your address to send this over, note that you may need to live in the United States for us to send it to you, that’s probably some logistics that we have to deal with.
[0:02:40.4] DA: There’s some legal stuff, maybe. Maybe there’s fine print about us not entering, but maybe they’ll fill out the survey anyway. Mike says he’s going to fill it out five times, so I got to get in there too.
[0:02:51.1] MN: Hit us up on the survey and that is bit.ly/rabbit-hole-survey.
[0:02:57.8] DA: Awesome. On to the show.
[0:03:00.4] MN: Hello and welcome to The Rabbit Hole, the definitive developer’s podcast. Live from the boogie down Bronx. I’m your host, Michael Nunez. Our co-host today.
[0:03:08.3] DA: Dave Anderson.
[0:03:10.9] MN: And our producer.
[0:03:11.2] WJ: William Jeffries.
[0:03:12.6] MN: And today, we’re going to talk about flaky tests. Probably the thing we hate running into the most, that flakiness.
[0:03:20.1] DA: Yeah, just warm flaky tests like fresh croissant.
[0:03:24.3] WJ: You're making it sound so tasty.
[0:03:26.4] DA: How can they be so bad?
[0:03:29.3] MN: No, they’re definitely bad, they’re the worst, the absolute worst thing to run into is the flaky test. We’ll talk about flaky tests, we’ll talk about the types of flaky tests we’re running into. Some things we’ve seen as solutions out there.
[0:03:41.1] DA: Yeah, things that you can smash, things that you shouldn’t smash when you're filled with anger.
[0:03:48.7] MN: There you go. I’m hungry now, you mentioned that flaky test sounds delicious.
[0:03:55.9] WJ: A box of cereal.
[0:03:57.9] MN: Yeah, William was talking about that like Korean breakfast treat and it just got me thinking about pastries and stuff. But we’re not talking about pastries unfortunately, we’re talking about issues we are running into all the time when we’re running tests. Would anyone say where they would, in what part of the triangle of testing, right?
I forgot who created it, refresh my memory.
[0:04:23.2] WJ: The testing pyramid with the unit test at the bottom.
[0:04:26.4] MN: Yeah, probably pyramid, yes, William knows what I mean. The testing pyramid, where would you think flaky tests exist the most?
[0:04:35.9] WJ: The tippy top for sure.
[0:04:39.2] DA: Those tippy top, flaky test meaning, a test that you run it once then maybe it passes or maybe like works on your computer and you run it on someone else’s computer and it fails or you run it 20 times and it fails once, just inconsistent.
[0:04:58.5] MN: Yeah, I mean, I always use the excuse that it works on my machine and I try to ignore it because I really don’t want to deal with the flaky test. But sometimes we need to squash those flaky tests. I used to refer to them as heisenbugs. You know, you run a test and it’s fine and then all of a sudden, it comes up as a bug again or something like that, I imagine those are – what appear as bugs more than like a flaky test. I call them all heisenbugs.
[0:05:26.3] DA: I guess it is a bug, it’s a bug in your test, if you think about it, you shouldn’t have a test that fails sometimes, it should be deterministic like if you have some given inputs that should give you the outputs that you’re asking for in a test, so – I think heisenbug works.
I’m curious if like – if I look at Google Trends for heisenbugs if it like peaked around when Breaking Bad was popular.
[0:05:53.5] MN: I think we should definitely look that up just to see because I imagine that that probably was the time where it peaked the most. I’m trying to think yeah, I don’t remember, and then if I look up heisenbug and then Heisenberg, I’m sure the two will probably correlate and go up.
[0:06:14.3] DA: Interest in England, only England cares about heisenbugs.
[0:06:19.8] MN: I guess I’m wrong because throughout time, I guess the past 12 months, I probably have to look at the 2004 to present but no one uses heisenbug, you would think I made it up to be honest. I’m pretty sure I did that.
[0:06:33.0] DA: Maybe you did make it up.
[0:06:35.9] MN: I don’t know, there’s an xkcd on it, I’m sure. If you Google it, you’re going to find the heisenbug for sure.
[0:06:41.9] MN: Where would we normally see, there’s many different places we would find a – you know, a flaky test and there are like some common pitfalls that one may run into, I was reading an article on Hacker Noon. I’ll put it on the show notes but they have a couple of common places to find a flaky test and I definitely agree with all of that so I don’t know if Dave, you want to start up with one that you may have in the pocket.
[0:07:08.1] DA: My god, my favorite is always the date time flaky test. You know, just time lord, powers start tingling.
[0:07:19.8] MN: Yeah, the time lord strikes again.
[0:07:22.8] DA: You have this test that will pass at any time of the day unless the hour is like one AM or something or like, if it strikes midnight then this test will fail.
[0:07:37.6] MN: Man, that is truly the worst, actually, we have – there’s a team member that I’m working with right now, she is in the Philippines and she cannot run end-to-end tests on her machine, it was just all off by a lot.
[0:07:53.2] DA: Oh no, that’s really bad.
[0:07:55.7] WJ: That is brutal.
[0:07:56.1] MN: We’ve tried to spend some time to fix those tests for sure.
[0:07:59.9] DA: Man, that’s definitely a flakiness. I mean, it’s like consistent I guess, if you’re running it on your CI server then I guess it works still but –
[0:08:11.6] MN: Yeah, that’s the idea. If the CI server is running okay for the most part but we ran into this issue like we’re slowly converting all the things to like, UTC, it doesn’t matter about who is where and what part of the planet you're in in the first place.
[0:08:27.4] DA: No, there’s so many date-time sins. I forgot about not using UTC, definitely don’t – I think you could talk for a whole episode about things that you can just really mess up with dates.
[0:08:40.9] WJ: Not using UTC?
[0:08:43.3] DA: Yeah, if you’re kind of creating a reference time and like you’re storing a time somewhere and you don’t use UTC then it can get really confusing when you're trying to localize it and also if you’re using whatever time.now or datetime.now is in your given date time library in your test, maybe it will work but you don’t know.
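The UTC point can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: `record_event` is a hypothetical helper, and the two fixed offsets are picked for the example rather than taken from the team's setup. The same instant falls on different calendar dates in the two zones, but its stored UTC form is identical.

```python
from datetime import datetime, timedelta, timezone


def record_event(now: datetime) -> datetime:
    """Hypothetical helper: normalize an aware datetime to UTC before storing it."""
    if now.tzinfo is None:
        raise ValueError("naive datetimes are ambiguous; pass an aware datetime")
    return now.astimezone(timezone.utc)


# Fixed offsets chosen for illustration only (no daylight-saving handling).
manila = timezone(timedelta(hours=8))
new_york_winter = timezone(timedelta(hours=-5))

# One instant, expressed in two local zones.
instant_in_manila = datetime(2020, 12, 15, 7, 30, tzinfo=manila)
instant_in_new_york = instant_in_manila.astimezone(new_york_winter)

# The local calendar dates differ, which is what trips up date-based assertions.
local_dates_differ = instant_in_manila.date() != instant_in_new_york.date()

# The stored UTC values are the same no matter where the test ran.
stored_a = record_event(instant_in_manila)
stored_b = record_event(instant_in_new_york)
```

Asserting against the stored UTC value, rather than a local date, keeps the test independent of the machine's location.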
[0:09:09.0] MN: You never know because the time lord may strike.
[0:09:12.2] DA: Yeah, good to like patch out that dependency and maybe like easier or harder depending on how you're doing it like in Python, I think Ruby too, you often use like some kind of dependency that patches out the time like I think it was like Timecop or something like that or –
[0:09:36.3] MN: Yeah, Timecop is the one that we would use in Ruby where you can freeze time for the sake of testing or you could set the time to be something so you can – on your test.
[0:09:45.2] WJ: Freezegun?
[0:09:46.8] DA: Python has freezegun and some people, you can roll your own kind of thing too, just mock out the calls that are offensive but I remember facing a challenge with Go because you can’t really easily mock out modules. We ended up having to use dependency injection to inject a fake clock that would always respond with the date that we asked it to, it would just have a method like now and we would inject it into our service class and then it would ask for the time from the clock and the clock would lie to our service in the test and it would work fine.
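The injected fake clock can be sketched roughly as follows. This is a Python sketch of the idea, not the Go code from the project being described; `SystemClock`, `FakeClock`, and `GreetingService` are hypothetical names made up for illustration.

```python
from datetime import datetime, timezone


class SystemClock:
    """Production clock: reports the real current time in UTC."""

    def now(self) -> datetime:
        return datetime.now(timezone.utc)


class FakeClock:
    """Test clock: always reports the instant it was constructed with."""

    def __init__(self, fixed: datetime) -> None:
        self._fixed = fixed

    def now(self) -> datetime:
        return self._fixed


class GreetingService:
    """Hypothetical service that asks its injected clock for the time."""

    def __init__(self, clock) -> None:
        self._clock = clock

    def greeting(self) -> str:
        # The service never calls datetime.now() itself, so a test controls time.
        return "Good morning" if self._clock.now().hour < 12 else "Good afternoon"


# In a test, the clock "lies" to the service with whatever instant is needed.
midnight = FakeClock(datetime(2020, 12, 15, 0, 0, tzinfo=timezone.utc))
late_night = FakeClock(datetime(2020, 12, 15, 23, 59, tzinfo=timezone.utc))
```

Because the clock is passed in, edge cases such as midnight can be tested at any hour of the day.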
When I run the before-each here and there and sometimes I may get like a test that would fail that fails for some reason and I run it by itself and it passes and it gets really weird so –
[0:11:14.1] DA: Like if you have like nested levels of before and state getting set up.
[0:11:20.9] MN: Yeah, exactly. The idea of that is just be mindful of the thing you have to do to set up your test is make sure you have that down in a way that you’re not causing it to have some internal race condition with the before and the before each blocks aspects of that.
[0:11:39.1] DA: Right, it’s like you say, it’s best if you don’t have any shared state between the tests like we’re setting it up all beforehand but then you know, you get to a certain point and you need to – you want to refactor it like factor out that setup so it’s cleaner but when you start like getting too far down that path where you have many different setups happening at the same time, it can get a little confusing about keeping track of it.
[0:12:28.1] DA: Yeah, Jest has like the before setup.
[0:12:32.5] WJ: They have the before but they don’t have a let block, right? You just have a let variable and then you have to remember to clean it up.
[0:12:39.8] DA: Yeah, I’ve just forgotten so much Ruby, almost as much as I ever knew but my gosh, that is so useful. I forgot about how nice RSpec setup is.
[0:12:58.9] DA: Right, you often see people just setting up an object in the body of the describe or something and if you mutate that object, it’s just a reference so guess what, you’re taking the whole test suite along for the ride when you mutate that object.
[0:13:17.3] WJ: Yeah, same with arrays.
[0:13:18.8] MN: It’s the absolute worst. I always run into the – it never fails, every so often I’ll run into like yeah, you got to be mindful and worry about that thing, let me clean that up.
[0:13:29.0] DA: Yeah, you can have a case where maybe everything works but like you were saying, everything works fine when you run all the test, if you run a single test then it doesn’t work because some other mutation had happened to the state that’s shared across the tests and then boom, it’s out of order.
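The shared-reference problem can be reproduced in a few lines of Python. This is a minimal sketch with hypothetical names (`check_add_item`, `check_starts_empty`, `run_shared`, `run_isolated`): the same two checks pass or fail depending on run order when they share one list, and always pass when each gets a fresh one.

```python
def check_add_item(cart):
    """Mutates the cart it is given, then asserts on its size."""
    cart.append("croissant")
    return len(cart) == 1


def check_starts_empty(cart):
    """Only passes if nothing has touched the cart before it runs."""
    return cart == []


def run_shared(checks):
    """Anti-pattern: every check receives a reference to the same list."""
    cart = []
    return {check.__name__: check(cart) for check in checks}


def run_isolated(checks):
    """Each check receives its own fresh list, so order cannot matter."""
    return {check.__name__: check([]) for check in checks}


shared_forward = run_shared([check_add_item, check_starts_empty])
shared_reversed = run_shared([check_starts_empty, check_add_item])
isolated_forward = run_isolated([check_add_item, check_starts_empty])
isolated_reversed = run_isolated([check_starts_empty, check_add_item])
```

With the shared list, `check_starts_empty` fails only when it runs after the mutating check, which is the order-dependent behavior described above.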
[0:13:49.1] WJ: This is another one where the Ruby community just pays way more attention to testing than any other communities. I think Ruby’s the only one that has like an RSpec bisect to automatically do the bisect breakdown of your test suite to try and find order-dependent tests if you provide a seed.
[0:14:11.8] DA: So much power.
[0:14:12.5] MN: Yeah, RSpec bisect, it’s pretty dope and that’s definitely a way to deal with flaky tests in Ruby. I wish JavaScript had something like that.
[0:14:20.9] DA: I feel like it seems more default to shuffle your tests in the Ruby community as well. I don’t feel like I often see like tests being shuffled in CI or like on local runs by default, that often. As much as I did when I was working more with Ruby. But that can also expose these kind of state dependencies in between tests, but in order – to do your bisect and figure out what exactly the problem was, you have to make sure that you’re using the same seed so that your randomization, your order is deterministic.
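The role of the seed can be sketched in Python as follows. `shuffled_order` is a hypothetical helper, not part of any test runner: it shows only that a shuffle driven by a fixed seed yields the same order every time, which is what makes a failing order reproducible.

```python
import random


def shuffled_order(test_names, seed):
    """Return the test names in a random order that is fully determined by seed."""
    rng = random.Random(seed)  # private generator, so global random state is untouched
    order = list(test_names)
    rng.shuffle(order)
    return order


names = ["test_login", "test_logout", "test_checkout", "test_search", "test_profile"]

# Re-running with the seed from a failed CI run reproduces that exact order.
first_run = shuffled_order(names, seed=1234)
replayed_run = shuffled_order(names, seed=1234)
```

Logging the seed with each run is what lets a failing order be replayed later.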
[0:15:00.5] MN: Yeah, has anyone else have any other thoughts on where they would normally see their flaky tests like some –
[0:15:07.7] DA: Yeah, you can also see it like, we’re talking about like test setup and the interactions of the test data in memory but like sometimes you can also see that same kind of weirdness happening with caching if you’re using some kind of API layer, caching and you’re hitting the API if you’re not cleaning that out then that can cause weirdness and same thing with like database, state, if you’re not flushing that in between test runs, like a lot of things like Django and Rails, the test suite tool will flush the database in between runs.
So you don’t have to worry about it but if you're setting it up yourself, you definitely have to be mindful to clear that.
[0:15:58.9] MN: Yeah, because the DB could be in a certain state that maybe one of your other tests depends on that state to exist in and for whatever reason, if that – if you dropped the information in the database and all your tests blow up for whatever reason then it shows that you have tests that require specific state to exist for them to run and that’s like something you want addressed.
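When setting this up by hand, one simple way to get a clean database per test is to create a fresh one each time. The sketch below uses Python's built-in `sqlite3` with an in-memory database; the `windows` table and the helper names are hypothetical, and a real project would more likely rely on its framework's transaction rollback or flush.

```python
import sqlite3


def fresh_db():
    """Create a brand-new in-memory database with the schema the tests expect."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE windows (id INTEGER PRIMARY KEY, name TEXT)")
    return conn


def count_windows(conn):
    return conn.execute("SELECT COUNT(*) FROM windows").fetchone()[0]


def check_insert_one(conn):
    """Passes only if the table was empty before this check ran."""
    conn.execute("INSERT INTO windows (name) VALUES (?)", ("bay window",))
    return count_windows(conn) == 1


# Leaky setup: both runs share one connection, so the second sees leftover rows.
leaky = fresh_db()
leaky_results = [check_insert_one(leaky), check_insert_one(leaky)]
leaky.close()

# Clean setup: each run gets its own database, so the results do not interact.
clean_results = []
for _ in range(2):
    conn = fresh_db()
    clean_results.append(check_insert_one(conn))
    conn.close()
```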
[0:16:22.3] DA: Yeah, something that surprised me, one bug that I had that really confused the heck out of me was like that a specific set of tests would fail every single time that someone added a new test that used some certain database model. It turned out the assertions in some of these tests were like very broad and they were testing for equality and we were using like factories like a certain Python factory that would automatically create values for different attributes in the model so for example, the name of the object would automatically be generated to be like –
Say it was a window object and it would be like the name would be unique so it would be window one and then it would be window two when you made the second one and it would be window three and then what would happen was the order was deterministic but like when you made a new test file then it would maybe run before one of the other ones and so it would be like all of a sudden, everything after your window two test would start failing because you added like another thing that took the window three.
It was kind of nuts because I didn’t realize that our factories that we were using had like state that was persisting across all the tests and that just ruined my day. You know, if you’re like making an assertion on a value, maybe try not to use that auto-generated value because that could make you sad.
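The factory behavior can be sketched in Python like this. `WindowFactory` is a hypothetical stand-in written for illustration, not the factory library the team used: its counter lives on the class, so every test in the process advances it, and an auto-generated name depends on how many objects were built earlier.

```python
import itertools


class WindowFactory:
    """Hypothetical factory whose sequence counter is shared by the whole process."""

    _counter = itertools.count(1)  # class-level state persists across tests

    @classmethod
    def build(cls, name=None):
        n = next(cls._counter)
        # The generated name depends on how many objects were built before this one.
        return {"name": name if name is not None else f"window {n}"}


# Fragile: the generated name changes if any earlier test builds a window first.
generated = WindowFactory.build()

# Robust: the test pins the value it is going to assert on.
explicit = WindowFactory.build(name="kitchen window")
```

Asserting on `explicit["name"]` stays stable when new tests are added, while an assertion on `generated["name"]` would break as soon as another test built a window first.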
[0:18:04.4] MN: It’s like using Faker or something like that, you know, RSpec has Faker.
[0:18:08.3] DA: Yeah, using Faker basically. Exactly, it was a Faker implementation.
[0:18:13.7] WJ: The other one that I see sometimes is concurrency issues like if you’re parallelizing your test suite, which I think is worth it even though it can be a pain to set up, if you don’t parallelize properly, if you aren’t running each set of tests in a totally isolated environment, you can have problems like if two parallel test runs are using the same database connection for example or if they’re both hitting the same instance of the app then you can have like app state or database state
that has been modified by one test causing another test to fail. So you know if you are running your test and it never fails in isolation, it only ever fails on CI where you are doing your parallelization like that is something to look into, maybe try running the tests in parallel on your local machine.
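One common way to keep parallel workers apart is to give each its own database. The Python sketch below assumes the test runner exposes a worker identifier in an environment variable; `TEST_WORKER_ID` is a hypothetical name chosen for illustration, so check what your runner actually provides.

```python
import os


def database_name(base="app_test", environ=None):
    """Derive a per-worker database name so parallel runs never share state."""
    env = os.environ if environ is None else environ
    worker = env.get("TEST_WORKER_ID")  # hypothetical variable, not a real runner's API
    return f"{base}_{worker}" if worker else base


# Two workers end up with two different databases; a serial run keeps the base name.
worker_three = database_name(environ={"TEST_WORKER_ID": "3"})
worker_four = database_name(environ={"TEST_WORKER_ID": "4"})
serial_run = database_name(environ={})
```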
[0:19:10.3] MN: Yeah just I know that there is a –
[0:19:11.2] WJ: Or if you are not mocking out your third party dependencies like if you are actually making real API calls to some third party service. Sometimes the third party service has state that’s not getting cleaned up.
[0:19:24.6] DA: Right or like it could actually be down, which I guess will be a more dramatic failure like that may not be a flaky thing on the day to day but you may go to try to do a deployment or something or make a critical change and then you know that test is failing. So your CI is red and you can’t merge until you beg the CTO to merge it for you using their super user powers or whatever.
[0:19:54.6] WJ: Yeah, it could be that the third party service isn’t even down. It just dropped one request, maybe you got super unlucky on that where it’s like the one out of 100,000 requests that gets dropped.
[0:20:04.8] MN: Yeah that would be really unlucky but it can happen and that like I imagine you know people may run into the idea of all of these unfortunate events add up in terms of your flakiness where some people start to disregard their own tests as a form of like, “Oh yeah that test fails a lot of the time” like oh, let’s just keep moving, deploy to production, there is no problems about that.
[0:20:33.8] DA: Right, nothing to see here.
[0:20:35.1] MN: Yeah and everything’s got – this is a ship it, let it happen when in reality it is possible that you could run into something that is actually broken.
[0:20:44.1] WJ: Yeah, it is really dangerous when people don’t trust the test suite. It makes the test suite effectively useless. You know people commenting out the test just so they can get an urgent deploy out of the door or just like rerunning a test until it passes because sometimes you could have a test that is legitimately failing because something is broken but it fails flakily and so you rerun it and then the sixth time it passes and so you merge and actually the test was warning you of a real problem.
[0:21:17.5] DA: Right, that’s no bueno. I feel bad when I do that or like I’ve also seen the case where maybe there are many different groupings of tests that run against a PR and you know some of those groups of tests are considered optional because maybe they are flaky or they are slower or what have you and if I had a PR and some optional checks are failing, I am still going to feel just as bad as if something mandatory is failing but eventually I’m just going to learn to ignore red builds and just be like, “Okay, you know it’s fine.”
[0:22:07.3] WJ: I think it is worth it to either invest the time in de-flaking your tests or if that is too painful, converting them into synthetic monitors. Then you know you can, assuming they’re safe to run in production, you just run them in production or even in the lower testing environment if they are not safe, then you can look at the statistics afterwards and say, “Okay, well these tests generally pass 80% of the time.” So if we see that the failure rate spikes and now the tests are passing not at all or only 10% of the time you can set up an alert and so somebody gets paged if that happens in production.
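The alerting rule described here can be sketched in a few lines of Python. The function names and the 50% threshold are assumptions made for illustration; a real setup would use whatever the monitoring tool provides.

```python
def pass_rate(results):
    """Fraction of recent synthetic-monitor runs that passed."""
    if not results:
        raise ValueError("need at least one run to compute a pass rate")
    return sum(1 for passed in results if passed) / len(results)


def should_alert(recent_results, alert_below=0.5):
    """Page someone when the pass rate drops well under the usual baseline.

    alert_below is an illustrative threshold, picked to sit under a monitor
    that normally passes about 80% of the time.
    """
    return pass_rate(recent_results) < alert_below


usual_day = [True] * 8 + [False] * 2   # roughly the 80% baseline
bad_day = [True] * 1 + [False] * 9     # down to 10%
```

The point of the threshold is that a known-flaky check still carries signal: a sharp drop below its normal pass rate is worth a page even if individual failures are not.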
[0:22:53.7] DA: Right, it definitely like when you get to a certain point it definitely helps to start gathering metrics and being smart about those things. I saw a pretty nice Spotify article about flaky tests where they are talking about different tools that their development I mean like dev ops teams have provided for flaky tests and it seems like it makes it a bit easier to identify those things like when you’re gathering metrics and you can look at that pretty easily.
[0:23:27.1] WJ: Also using some synthetic monitors in addition to or perhaps instead of some of your browser tests can be a really positive change. There are some – so it gives you a different kind of comfort. There are just some things that your tests are just not going to catch. There are some problems that your test suite will never catch like problems that arise when your database changes in some way like maybe it starts to fill up or maybe you get a particular configuration of data that causes a bug to arise.
And that is never going to be reproduced in your test suite unless you are doing some kind of crazy generative testing like that’s not something you can test for as part of your deployment process and so you get a lot of comfort out of just running those things in production on a cron job and alerting on them if they fail and if you start doing that instead of like for some of the really critical work flows like can somebody place an order, you should have a synthetic monitor for that anyway.
[0:24:32.6] DA: Right, observability.
[0:24:34.3] WJ: And if you can shift some of these test that you really need because you need comfort over that particular scenario into a synthetic monitor you can make the number of tests that you have to keep not flaky smaller and you can also make your build faster because all of these browser tests you know they’re slow.
[0:24:55.5] MN: Super slow.
[0:24:57.2] DA: Right like the true like end to end standing up all of your micro services and your front ends and poking the front end and –
[0:25:04.0] WJ: And if you – one of your flaky tests fails and you have to rerun the test suite, that can double your test runtime.
[0:25:12.2] DA: Yeah like one of the things that really kills me also is the other broken windows that often go along with a flaky test. Like a slow build, if you have a flaky build and it takes a minute to run or five minutes or maybe even 10 minutes that’s manageable to a degree but if it takes you 20 minutes or half an hour or 40 minutes to run a build or an hour then you’re just playing craps with hours of your life. If it fails like 40% of the time or 30% of the time, then that could be hours or days of your life that you’ve lost and you know I may have lost that amount of time. I may never get that time back.
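The cost of a slow, flaky build can be estimated with a little arithmetic. Assuming each run fails independently with the same probability p (a simplifying assumption, since real flakiness is often correlated), the expected number of runs until the first green build is 1 / (1 - p). The function name below is hypothetical.

```python
def expected_minutes_until_green(build_minutes, failure_rate):
    """Expected total build time until the first passing run.

    Assumes independent runs with a constant failure rate, so the number of
    runs follows a geometric distribution with mean 1 / (1 - failure_rate).
    """
    if not 0 <= failure_rate < 1:
        raise ValueError("failure_rate must be in the range [0, 1)")
    return build_minutes / (1 - failure_rate)


# A 5-minute build at 40% flakiness versus a 60-minute build at the same rate.
quick_build = expected_minutes_until_green(5, 0.4)
hour_build = expected_minutes_until_green(60, 0.4)
```

Under those assumptions the one-hour build costs about 100 minutes on average per green run, against a little over 8 minutes for the five-minute build.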
[0:26:08.2] MN: Yeah, it is definitely something that you want to take time to hack down when it gets that large, right? Like hour builds, two-hour builds. That is just a lot of time.
[0:26:21.4] WJ: Yeah, you got to parallelize. Got to pay the money for the wonderful nodes.
[0:26:26.6] MN: One thing you should not do for sure is, as we mentioned with the one-hour build, if a test passes because you wait a little bit, don’t add more time to wait for that test to pass and then that test is still flaky. Sleep five is not cool, whoever is out there writing Capybara tests and you got sleep five in your code just try and fix that up, just clean that up a little bit. Don’t sleep five.
[0:26:51.4] DA: Sleep 10.
[0:26:52.5] MN: No, don’t sleep.
[0:26:54.2] WJ: No, no sleeping especially if they are using Capybara or some type of –
[0:26:57.9] DA: Sleep five wasn’t enough, you got to sleep 10.
[0:27:00.7] MN: No, no sleep 10, no sleep.
[0:27:02.2] DA: Right.
[0:27:03.0] WJ: These testing frameworks, they have, you know, waits built in. So if the result that you are looking for is not on the page, it will automatically check every X number of milliseconds until it shows up, for up to whatever the default wait time is set to.
[0:27:18.6] DA: So you tell it like, “Okay, look for this thing and give up after sometime.”
[0:27:26.0] WJ: Right. It is really easy to customize the wait time. So just use that instead of sleeping.
[0:27:33.0] MN: Right, the idea is that it will do it for five like say you can set up so that you check for a particular text on your page and it will do that in the incremental five seconds, which is equivalent to the same amount of time by sleep five let’s say but if it finds it in three seconds then you save the additional two seconds rather than sleeping five then checking for the page. You will just save more time.
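The difference between sleeping and waiting can be sketched in Python. `wait_until` is a hypothetical helper written for illustration, not Capybara's implementation: it polls a condition and returns as soon as the condition holds, giving up only when the timeout is reached.

```python
import time


def wait_until(condition, timeout=5.0, interval=0.05,
               clock=time.monotonic, sleep=time.sleep):
    """Poll condition() until it is true or timeout seconds have elapsed.

    Returns True as soon as the condition holds, so a result that appears
    after three seconds costs about three seconds, not the full timeout.
    clock and sleep are injectable so tests of this helper need no real waiting.
    """
    deadline = clock() + timeout
    while True:
        if condition():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

A test would call something like `wait_until(lambda: "Welcome" in page_text())`, where `page_text` is whatever hypothetical accessor the test already has, instead of `time.sleep(5)` followed by a single check.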
[0:27:57.7] DA: Yeah, I definitely have some tests that, like, enter my nightmares that I was working on with actual concurrent code that was using threads or like in the case of Go, goroutines where you’re actually doing like network requests or things that are asynchronous but you have to build your own hook into it like you basically have to build in your own thing that will allow you to await because it is not like a first-class thing. I mean it has channels for communication between things but it took some thinking like a little bit of smart test setup and it ended up with I think better designed code as a result of it but definitely a challenge.
[0:28:48.1] WJ: Yeah, I think also one trap that people fall into if you’re using the page object model where you create an object that represents the page and it has a bunch of methods on it and you can call those methods and either you make assertions about the page or you make changes to the page as a way of like adding some more domain language to your test suite. So sometimes people will have negative assertions without realizing it, which can cause Capybara to wait for the full timeout.
So like for example, if you’re checking to see if you’re signed in, you know maybe your page object has a login page “signed in?” method and so if you are asserting that the username appears on the page then that’s how you know you’re signed in then in cases where the user is signed in and that thing actually appears on the page, the tests will run super-fast but in cases where the person is not signed in, if you are asserting that they are not signed in using like a page object.
It could be like that particular test is going to take a really long time because Capybara is going to wait a full five seconds or however long your default max wait time is, checking to see if all of a sudden the username is going to appear in case it is just taking a minute for you to finish logging in and so if you can –
[0:30:15.8] DA: Hopefully you would have run it and then realized as you are sitting there waiting for it to finish that there might be a better way rather than just pushing it out to CI. The time that you would not notice it is if you are not running it locally. You are not seeing it live locally on your machine running it and you’re just pushing it to CI and you’re going to get a coffee.
[0:30:43.0] WJ: Or if you don’t realize that there is a better way because there are negative assertions which will flip it and then it will wait for it to disappear or if it is already not there then it will succeed immediately.
[0:30:55.6] DA: That’s fair, yeah.
[0:30:56.9] WJ: You use those negative assertions, it is a lot faster. So you can pass in, like, an expected argument to the method for whether you are expecting it to be true or false and then in the method you can use either a positive or a negative assertion accordingly. Page objects are totally worth it if you do them right. I don’t want to discourage people from using page objects because of this one gotcha, like they do make your test suite cleaner.
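The page-object gotcha can be sketched in Python as follows. Everything here is hypothetical and written for illustration (`poll`, `FakePage`, `LoginPage`, `has_signed_in_state`); it mimics the waiting behavior described for Capybara rather than using Capybara's API. The method takes an `expected` flag and picks a positive or negative wait accordingly.

```python
import time


def poll(predicate, timeout, interval, clock, sleep):
    """Return True as soon as predicate() holds, or False once timeout elapses."""
    deadline = clock() + timeout
    while True:
        if predicate():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


class FakePage:
    """Hypothetical stand-in for a browser page; counts how often it is queried."""

    def __init__(self, text):
        self.text = text
        self.lookups = 0

    def has_text_now(self, needle):
        self.lookups += 1
        return needle in self.text


class LoginPage:
    """Hypothetical page object wrapping a page and the username to look for."""

    def __init__(self, page, username, timeout=5.0, interval=0.05,
                 clock=time.monotonic, sleep=time.sleep):
        self.page = page
        self.username = username
        self._wait = dict(timeout=timeout, interval=interval, clock=clock, sleep=sleep)

    def has_signed_in_state(self, expected=True):
        """True if the page reaches the expected signed-in state within the wait."""
        if expected:
            # Positive assertion: wait for the username to appear.
            return poll(lambda: self.page.has_text_now(self.username), **self._wait)
        # Negative assertion: wait for the username to be absent, which
        # succeeds on the first lookup when it was never on the page.
        return poll(lambda: not self.page.has_text_now(self.username), **self._wait)
```

On a signed-out page, `not login.has_signed_in_state(True)` burns the whole timeout before giving an answer, while `login.has_signed_in_state(False)` answers on the first lookup.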
[0:31:22.7] MN: Yeah, I mean if you – yeah and this is definitely something like as long as you are aware of this particular gotcha, you program to that and then surely you are not falling into that trap.
[0:31:32.9] DA: Yeah. So flaky tests hurt but sometimes the pain teaches us things, you just have to listen to it and respond to it. You can’t just ignore it. You got to fix those tests.
[0:31:46.6] MN: Right and you know treat a flaky test as you would a failure, you know trying to do your best to clean that particular issue because you want to make sure that your tests are giving you the confidence that you can sleep at night without getting a phone call at 3 AM and when you do have a flaky test, it may end up having to call you at 3 AM, you got to fix it, and if it is flaky you might as well fix it to make sure that that phone call doesn’t come in again for the flaky tests.
[0:32:16.8] WJ: Yeah, I think you know it is so important to keep your test deterministic if you really can’t de-flake a test, I think it is better to delete it.
[0:32:26.6] MN: Yeah, I mean try really hard and getting ironing out that test but if you can’t iron down the flakiness, too flaky still got wrinkles, got heisenbugs, got to crush it, you might as well get rid of it. Figure out a different way to test this, that particular part of the code.
[END OF INTERVIEW]
[0:32:46.8] MN: Follow us now on Twitter @radiofreerabbit so we can keep the conversation going. Like what you hear? Give us a five star review and help developers like you find their way into The Rabbit Hole and never miss an episode, subscribe now however you listen to your favorite podcast. On behalf of our producer extraordinaire, William Jeffries and my amazing co-host, Dave Anderson and me, your host, Michael Nunez, thanks for listening to The Rabbit Hole.
Links and Resources: