Today the whole team is back talking about monitoring and what to do to mitigate failure. Crashes are not only inevitable, but they look different based on different systems and requirements, so it is important to know how to prevent them from happening in the first place. Not all problems are equal, so it is very important from the outset to be explicit about the performance criteria of the system. With these criteria established, the level and urgency of the intervention is understood when a problem occurs. It is not efficient for the engineer to respond to every problem, as some, such as internet infrastructure issues, are not in their control. The team also talks about some tools which they have used on projects to gather metrics and calculate availability. These can help ascertain what the problem is and provide insight into the potential solution. To find out more about these tools, join us today!
Key Points From This Episode:
- Crashes can happen in various ways.
- What can be put in place to measure failure.
- How SLA and SLO are useful for measuring potential problems.
- What the correct number of people to work on a problem is.
- Some tools to calculate your availability.
- Calculating latency goals that are tolerable is important to measure metrics against.
- And much more!
Transcript for Episode 118. Monitoring
[0:00:01.9] MN: Hello and welcome to The Rabbit Hole, the definitive developer’s podcast in boogie down Bronx. I’m your host, Michael Nunez. Our co-host today.
[0:00:09.8] DA: Dave Anderson.
[0:00:10.8] MN: Our producer.
[0:00:12.0] WJ: William Jeffries.
[0:00:13.5] MN: And today, we’re here to discuss how to know before your system crashes.
[0:00:19.0] DA: My god. Is it crashing right now? I don’t know.
[0:00:23.3] MN: If you have systems that you need to manage, you should check if they’re crashing right now. We should definitely look into that. Yeah, we’ll talk about systems that you can have in place to know when crashes happen and how to mitigate crashes from happening in the first place. I mean, I’m sure we all don’t like to be woken up in the middle of the night because things are constantly going down, but these applications will help you in mitigating some of those calls that happen at two, three, four in the morning.
[0:00:55.6] DA: Yeah, I guess like, to a degree like, you write your system and you understand it as much as you can, but it’s still a black box once you put it out into production. You can see what the outputs are, you know, you can see that the result is in the database or if you’re creating a file or something like that.
But as far as like what is coming in and what’s happening inside of it, that’s up to you to better instrument it and we talked about that a little bit in our last episode about logging. You can go a little deeper than that.
[0:01:29.3] WJ: Yeah, I mean, there are different ways that your system can ‘crash,’ right? There’s straight up hard down, the whole website is just not returning anything or returning 500s or –
[0:01:44.7] DA: Which you might say is like availability, right? You're just down for the count.
[0:01:50.8] WJ: Or things might just take a kind of forever and that seems like that’s also down. If the webpage just hangs indefinitely until you give up and close the browser or you hit some kind of a browser level time out. That might not show up as an availability issue because your system is up, it’s just not responding.
[0:02:12.5] DA: Yeah, that sounds like bad for conversion, if you're running a commercial client or you know, consumer facing website.
[0:02:20.8] MN: Yeah, that check out button takes way too long, that’s going to be really concerning to your customers.
[0:02:27.3] WJ: Or, you might be up for some percentage of your users, but it just can’t handle the throughput and so above a certain threshold, everybody is just getting dropped. Or maybe something is getting returned but it’s wrong, you know, it’s wrong. I think those are the main categories of unreliability. Things that could go wrong that make your system unreliable.
[0:02:48.5] MN: Right, you want to be alerted of these – or you want to be able to capture those problems that arise on your application in a way where you can collect the data and study it and kind of know what’s the next move or the next piece of functionality that needs to be in place in order for these things to work. To be available, to have – to handle the throughput and whatnot.
[0:03:14.6] WJ: Right, yeah. So that you could have some dashboards that you can check in on to see how things are going and probably so that you can make some budget, some targets, decide how much latency you’re willing to tolerate or how much availability or lack of availability you're willing to put up with.
[0:03:33.2] DA: Right, yeah. Like I said, that’s important in anything. Understanding the true importance of the system in the nature of things. And like, how critical is an error? How quickly do you need to know and be able to respond to it? Like, to have like a good balance for your work life and also, the profitability of your business and whatnot. Because it could be that the system is just used internally and you know, it gets a lot of requests, like once a week, but they’re all done in batch and you know, if it fails, you just run it again and it’s not really a huge problem if there’s an issue.
Versus something real time that needs to have immediate resolution, and if you don’t fix it then it’s going to be a big headache for you.
[0:04:24.4] WJ: Right, you might have customers that have hard requirements. Like you can be contractually obligated to somebody with a service level agreement or SLA that says, “you know, you have to respond to our API requests within 500 milliseconds or you owe us money and we don’t have to pay you.”
[0:04:43.8] DA: Yeah, absolutely. It could be like financial or reputational problems if you don’t meet those expectations, so it’s good to hash out exactly what those things are.
[0:04:55.8] WJ: Yeah, if you are in a situation like that, you probably want some service level objectives or SLO’s that are much more conservative than your SLA’s. Just to make sure that you don’t get close to violating those agreements or if you’re getting close that you know ahead of time.
[0:05:13.1] MN: Right, like an SLA could be, as you mentioned, you know, you have a finance application, things need to respond with a longest latency of 500 milliseconds. Would the service level objective be to have those responses trigger an alert above 350 milliseconds? Is that like the idea of the SLO in this context?
[0:05:37.4] WJ: Yeah, your service level indicator or SLI would just be how long are requests taking.
[0:05:46.3] DA: I see, let’s just log the total request time and then you can graph that, you can do averages, you know, you can do some statistical analysis and figure out what your SLO is and what you’re actually tracking to hit, or if you’re going to violate your SLO or SLA.
[0:06:05.1] MN: Right, the SLI in the example would be just to track every single transaction that happened, and if you find that the average is 170, then you know that that’s definitely below the SLO of 350 and the SLA agreement of 500 milliseconds. How do we bring it down? If you could bring that number down further, you could work on that, but it’s still, you know, less than half the amount of time allowed for those requests, so I think that’s fine. That’s up to the organization, right? If they want to be faster than the average, they would have to figure that out and put development work behind making it faster.
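The comparison described in this exchange can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `classify_latency` and the status labels are hypothetical, and the 350 ms and 500 ms thresholds are simply the example numbers from the conversation, not recommended values.

```python
from statistics import mean

# Example thresholds taken from the conversation; real values come from
# your own objectives and contracts.
SLO_MS = 350  # internal service level objective
SLA_MS = 500  # contractual service level agreement


def classify_latency(samples_ms, slo_ms=SLO_MS, sla_ms=SLA_MS):
    """Compare the average request time (the SLI) against the SLO and SLA.

    Returns a tuple of (average, status). The status labels are
    hypothetical names chosen for this sketch.
    """
    avg = mean(samples_ms)
    if avg > sla_ms:
        return avg, "sla_violation"
    if avg > slo_ms:
        return avg, "slo_warning"
    return avg, "ok"


if __name__ == "__main__":
    print(classify_latency([150, 170, 190]))
```

Keeping the SLO stricter than the SLA means the "slo_warning" state shows up before any contractual limit is crossed.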
[0:06:48.5] WJ: Yeah, I think you would probably want more than just averages, you probably want to know, okay, what’s my 50th percentile, what’s my 95th percentile, what’s my 99th percentile for latency? Or whatever metric you’re tracking. If your average, your 50th percentile response time is 150 milliseconds, but you know, your 70th percentile response time is two seconds, then it could be that actually, you have a lot of violations of your SLO or your SLA.
You probably have a threshold beyond which you don’t really care. If 1% or one tenth of one percent of your transactions are extremely slow then it’s probably not cause for concern, that could just be problems with internet infrastructure in general.
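To make the point about averages versus percentiles concrete, here is a minimal sketch using the nearest-rank definition of a percentile. The `percentile` helper and the sample data are assumptions made for illustration; monitoring tools typically compute percentiles for you and may use a different interpolation method.

```python
import math
from statistics import mean


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


if __name__ == "__main__":
    # Hypothetical data: 90 fast requests and 10 very slow ones (in ms).
    latencies_ms = [150] * 90 + [2000] * 10
    print("mean:", mean(latencies_ms))
    for pct in (50, 95, 99):
        print(f"p{pct}:", percentile(latencies_ms, pct))
```

With this made-up data the mean stays under a 350 ms objective while the 95th percentile sits at two seconds, which is the kind of hidden violation an average alone would not reveal.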
[0:07:40.1] DA: Although, there could be like some specific issue that’s impacting a certain user or a client that you have. It could be that, due to the volume or the kind of data that they have in the system, their performance is just really bad. And that could be especially bad if, you know, when you look at who that person is, you realize they’re your most important customer and the reason why they’re having problems is because they have too much data, they use the system too much.
So, again, kind of going back to what we were talking about in the previous episode about like having context for your logs and you know, driving that towards these different metrics for monitoring, to know like what might be the contributing factor? If it is just like kind of internet noise or if it is more internal, something that’s more controllable.
[0:08:31.5] MN: Your application is on fire and the monitors are all going off. This happens during the day, let’s make the situation at least nicer, it’s not at four in the morning, it’s 10:00, and your monitors are triggering for all these different situations that could be happening, whether it’s throughput, availability, latency. Do you stop all your engineers to put out the fire?
Let’s talk, have you seen that in like your previous employments where everyone had to stop what they’re doing to fix things or is there like a dedicated team, do they have bat men or bat women in this situation? What are your thoughts on that?
[0:09:11.8] WJ: I think that if you have a monitoring system in place, that’s probably going to be a much more pleasant scenario for you. Logging as well so you can search to figure out what is going on, you could check some dashboards and see the severity. I think in terms of who deals with the issue, you probably don’t want your entire team debugging that because you’re just going to – it’s a too many cooks situation. You’re just going to step on each other. You probably want a primary and a secondary.
[0:09:38.5] DA: Especially if you’re like communicating asynchronously through Slack or something like that, it can get really messy when you have all those communications kind of crisscrossing.
[0:09:46.5] WJ: Yeah, I think it’s good to have a secondary who is sort of in charge of comms and have like an incident response plan so people know how to communicate and what to communicate and to whom they need to communicate like are we going to notify customers, after how long do we wake up the CTO?
[0:10:06.6] MN: Yeah, that’s important. Hopefully never but –
[0:10:10.4] DA: Is the CTO sleeping in the afternoon in this scenario? That’s the question. What a sweet life he has. Maybe we should wake him up.
[0:10:21.8] WJ: There’s also a question of how much error is too much error, right? I mean, a certain amount of your requests are going to be failing at all times and then, at what point do you need to respond? That’s not just like for an individual response, right? That’s over a longer period of time. I mean, if one out of a hundred thousand of the requests to the website fail, I don’t care.
If all of the responses are failing then I immediately care, right? But then, what about over the period of one month, right? If 1% of all responses are failing for a full month, that’s a lot. It may not be a lot for one hour. But for a whole month, that’s a lot of people who are affected.
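One common way to turn "how much error is too much" into a number is an error budget: the count of failures a success-rate objective permits over a window. The sketch below is an assumption-laden illustration; the function name `error_budget`, the returned keys, and the example objectives are all hypothetical.

```python
def error_budget(total_requests, failed_requests, slo_success_pct):
    """Report how many failures the objective allows over a window and
    whether the observed failures have used that allowance up.

    slo_success_pct is the target success rate in percent, e.g. 99.9.
    """
    allowed = total_requests * (100 - slo_success_pct) / 100
    return {
        "allowed_failures": allowed,
        "failed": failed_requests,
        "remaining": allowed - failed_requests,
        "exhausted": failed_requests > allowed,
    }


if __name__ == "__main__":
    # One failure in a hundred thousand requests against a 99% objective.
    print(error_budget(100_000, 1, 99))
    # 1% of a month's requests failing against a 99.9% objective.
    print(error_budget(1_000_000, 10_000, 99.9))
```

Evaluating the same function over an hour and over a month gives the two different views described above: a brief spike may fit inside the monthly allowance, while a steady 1% failure rate will not.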
[0:11:12.7] DA: Right, that’s not really like – it’s not an abnormality, that is just the normal operation of your system at this time. It kind of stinks or it has some like kind of flaws that need to be ironed out. Yeah, again, I guess going back to what we talked about before. Figuring out what the standards are and sticking to them, making sure that you’re like understanding what the requirements are for your system and putting the tools in place to make sure that you actually are meeting those goals, which I’m kind of curious, maybe we could go back to the categories of reliability and talk about like some tools that you could use in order to ensure that you can meet those goals.
[0:11:58.3] WJ: Yeah, I think for availability, something I see a lot is the number of nines. How many nines of uptime do you have? Is it 99% uptime, like two nines? 99.999% of the time, like five nines? Which seems to be kind of the gold standard, although maybe an unattainable one.
[0:12:18.3] MN: What, five nines?
[0:12:20.0] WJ: Yeah, five nines of availability.
[0:12:21.7] DA: All right, I found a website which actually helps calculate what your uptime would be and your downtime. So, with uptime.is, if your uptime is five nines, then that means daily you’re down for less than a second, 0.9 seconds, and over the course of a month, it is less than 30 seconds. It is 26.3 seconds, and yearly you only have five minutes and 15 seconds of downtime.
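The arithmetic behind those figures is short enough to sketch directly. The names below are hypothetical, and the period lengths are an assumption: an average year of 365.25 days, with a month taken as one twelfth of that year.

```python
SECONDS_PER_DAY = 24 * 60 * 60
# Assumption: an average year of 365.25 days; a month is one twelfth of it.
SECONDS_PER_YEAR = 365.25 * SECONDS_PER_DAY
SECONDS_PER_MONTH = SECONDS_PER_YEAR / 12


def allowed_downtime_seconds(availability_pct, period_seconds):
    """Seconds of downtime permitted in a period at a given availability."""
    return period_seconds * (100 - availability_pct) / 100


def downtime_table(availability_pct):
    """Allowed downtime per day, month and year, in seconds."""
    return {
        "daily": allowed_downtime_seconds(availability_pct, SECONDS_PER_DAY),
        "monthly": allowed_downtime_seconds(availability_pct, SECONDS_PER_MONTH),
        "yearly": allowed_downtime_seconds(availability_pct, SECONDS_PER_YEAR),
    }


if __name__ == "__main__":
    for pct in (99, 99.9, 99.99, 99.999):
        row = downtime_table(pct)
        print(pct, {name: round(secs, 1) for name, secs in row.items()})
```

For 99.999% this works out to roughly 0.86 seconds per day, 26.3 seconds per month and about 5 minutes 15 seconds per year, in line with the numbers quoted in the conversation.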
[0:12:51.6] WJ: Yeah that is not enough time to even Google a problem.
[0:12:55.8] DA: Yeah, it just took us longer to Google or figure out what that calculation of the uptime was.
[0:13:03.7] MN: Oh man, I mean there is a lot of – I mean, if you think about it, Google needs to be up like in those five nines, or you know, whenever GitHub is down, it causes like this huge issue, and I imagine they have all sorts of monitoring tools to capture that information and help them figure out what the problem is.
[0:13:27.1] WJ: Well I think it also depends on like –
[0:13:28.8] DA: If you’re down for an hour then you have already lost one of those nines, like you’re in four nines territory, and if you’re down for half a day then you’re in like three nines territory.
[0:13:40.0] WJ: Yeah, I think it also depends on how large of a percentage of your user base you care about, right? I mean like Google is down for someone pretty often, right? And I don’t think Google is down for everyone, ever, but if Google is down for just one guy or like even one region, I think it’s hard to know. Is it really because of Google or is it because of some other piece of infrastructure? A lot of people are going to write that off as like, “well Google is never down. So, it must be my browser or my WiFi or my ISP.” People blame everything else.
[0:14:18.2] MN: Damn you Comcast. Comcast down again.
[0:14:22.9] DA: So, then what are some tools that you use to measure availability?
[0:14:27.1] WJ: Datadog.
[0:14:28.8] MN: Datadog.
[0:14:29.4] DA: It’s pretty solid, yeah. You can set up some metrics there. I think we were talking about Pingdom.
[0:14:35.5] WJ: Pingdom is good, yeah, they’re very specialized for availability.
[0:14:38.4] DA: Yeah, like that is basically the main thing that it does, just basically poking your site and seeing if it is up and how it’s responding.
[0:14:47.5] WJ: Right, health checks. Has anybody worked in an environment where there were hard requirements on meeting SLOs and like where those requirements were enforced? Where somebody actually set up a proper error budget and said, “you know, if you exceed your availability SLO then you have to stop development work and everybody focuses on increasing availability until we are back in compliance with our own service level objective?”
[0:15:16.0] MN: So, I think I was on the opposite side of that. We used a tool where the SLA agreement was that the availability needed to be five nines. I mean yeah, like the five nines principle. So, we used this application where the application itself needed to be up in the five nines, essentially, and so like the organization I was working for ensured that, for their third party application, they had the agreement set up for that. So, like, if they’d already done the development work, it wasn’t that our thing was down.
It was because there was a problem for us getting the information from them, and then they would have to ensure that their SLOs were back, essentially. You just had to outsource it all.
[0:16:02.0] WJ: To somebody else, and then if they failed to meet the SLA then you would identify that and then probably go hit them up for money.
[0:16:11.4] MN: Yeah essentially, well not me personally but that was the idea, yeah. Like, “yo five-nine’s, what’s up?” But yeah I think –
[0:16:20.4] WJ: I want to find five zeros on that check that you are going to write me.
[0:16:25.7] MN: Exactly, I think that – I mean in regards to that, we also wanted to ensure that the service that we were providing was up at five nines. So if there was any issue, there was like, I think we mentioned before, like a primary and a secondary person that would go and sift through metrics and monitors to ensure that we get back up to speed as soon as possible. But I never sat down in a meeting where we were told what the agreements were. It’s just like, “Hey guys keep it up all the time,” like no downtime ever kind of thing.
[0:16:59.0] WJ: Yeah, I think that’s a really rampant problem in the industry where people don’t have goals for how much availability or how much latency or you know, how much unreliability in general they are willing to tolerate and so anytime things are unreliable at all, people sort of freak out and then yell at the engineers and like, “why did this happen?” But there is no plan for resolution.
[0:17:24.5] DA: That someone’s identified. And also, if no one has identified that it is important, then it’s pretty likely that you are not going to be measuring it. You are not going to be looking at the latency or availability, or you may not be looking at it if it hasn’t been something where someone has said, “yes, this is something that we need,” and maybe it is a safe bet that you probably don’t need it, but better to have that conversation explicitly.
[0:17:47.5] MN: Has anyone else ever worked on SLA, SLO, SLI?
[0:17:50.5] DA: I’ve worked on it, not in like the high throughput kind of sense, but where there’s a hard stop for reporting requirements or things like that, where it is not a lot of requests, but of the requests that are there, if even one of them fails then you need to drop everything and figure out how to resolve that. So you know, that’s like the completely opposite end of the spectrum where it’s like it is not five nines. There are no nines, it’s just the one and two zeros, basically.
[0:18:23.5] MN: 100%.
[0:18:28.1] WJ: I’ve worked in places with SLIs, SLOs and SLAs and I’ve seen them exceed those and freak out about it, but I don’t think I have ever seen an organization have the guts to actually stop development work until they were back in compliance. I think everybody wants the best of both worlds. They want reliability and they also want to be constantly shipping features that make the system less stable, and so when you get out of compliance, management has the tendency to get mad without being willing to make the sacrifice that I think like the DevOps world would probably recommend for avoiding those kinds of incidents in the future.
[0:19:11.2] DA: Right, battening down the hatches and all of that.
[0:19:13.4] WJ: Yeah, nobody is ever willing to stop feature work.
[0:19:16.0] MN: Yeah, you got to keep pumping the features out. Just keep on pushing them out.
[0:19:20.8] DA: Right, but then you don’t know like if this next feature is going to be the one that really breaks the back of the performance of that application or whatnot unless you’re specifically putting time into measuring, like measuring latency with something like, we talked about, Datadog and StatsD or Prometheus. If you don’t spend the time to actually collect that data and measure it against what the expected value should be, then you’ll just be caught off guard when someone breaks down your door and starts yelling and screaming.
So, are there any good tools for monitoring like scheduled jobs that are periodic that you guys are aware of?
[0:20:02.0] WJ: I’ve heard of Dead Man’s Snitch. You know, I think like, with jobs, the challenge there is that you can’t really do like a status check because you know, the issue is that the job doesn’t run in the future. Not that it isn’t running now. Most of the time that job’s not supposed to be running.
So, with Dead Man’s Snitch, it will – I mean, you set up an extra job and it will hit Dead Man’s Snitch regularly, like on a cron. Then, if it ever doesn’t check in, then Dead Man’s Snitch will notify you. It’s sort of like, you know, a dead man’s switch for like a bomb, you know, if you take your thumb off of the button then the bomb explodes.
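The dead man's switch idea can be sketched generically: a job checks in on a schedule, and the absence of a check-in is what raises the alarm. The class below is a hypothetical in-memory illustration and does not use or represent Dead Man’s Snitch’s actual API; the name `HeartbeatMonitor` and its methods are assumptions.

```python
import time


class HeartbeatMonitor:
    """Flags a job as overdue when it has not checked in recently enough.

    The clock is injectable so the behavior can be exercised without
    waiting in real time.
    """

    def __init__(self, max_silence_seconds, clock=time.monotonic):
        self.max_silence_seconds = max_silence_seconds
        self._clock = clock
        self._last_check_in = clock()

    def check_in(self):
        """Called by the scheduled job each time it completes a run."""
        self._last_check_in = self._clock()

    def is_overdue(self):
        """True once the silence has lasted longer than allowed."""
        silence = self._clock() - self._last_check_in
        return silence > self.max_silence_seconds


if __name__ == "__main__":
    monitor = HeartbeatMonitor(max_silence_seconds=3600)
    monitor.check_in()
    print("overdue:", monitor.is_overdue())
```

In practice the check-in would be a network call made at the end of the cron job, and a separate process would poll `is_overdue` and send the notification, so that a job that silently stops running still gets noticed.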
[0:20:44.4] MN: Right. I was afraid when I was searching for Dead Man’s Snitch that I was going to be put on a list or something. This is like an actual service that does that, gets more out of your cron jobs, that’s what it says on the front page.
[0:20:58.4] DA: Says on the tin.
[0:21:00.3] MN: Yeah. I think yeah. So, there’s definitely a ton of different tools to use to monitor the types of production issues that you’ll see on a day to day basis.
[0:21:12.1] DA: Yeah, we’ve only really scratched the surface here. There’s other tools that have like more robust out of the box metrics, like New Relic. Tools for error reporting like Rollbar. There’s all kinds of things you can compose together or you know, you can also consider maintaining your own system or building your own system from log messages that you’ve been implementing.
[0:21:33.7] WJ: And, do you want to use the SaaS application where somebody else is keeping the monitoring tool up or do you want to maintain your own monitoring server and then be responsible in case your monitor goes down?
[0:21:47.8] DA: Right. Who do you want to be responsible for the five nines of your own monitoring solution?
[0:21:53.5] WJ: Who monitors the monitoring tool?
[0:21:56.9] DA: Exactly.
[END OF INTERVIEW]
[0:21:58.8] MN: Follow us now on Twitter @radiofreerabbit so we can keep the conversation going. Like what you hear? Give us a five-star review and help developers like you find their way into The Rabbit Hole and never miss an episode; subscribe now however you listen to your favorite podcast. On behalf of our producer extraordinaire, William Jeffries, and my amazing co-host, Dave Anderson, and me, your host, Michael Nunez, thanks for listening to The Rabbit Hole.