Stride Tech Talk: Lisa van Gelder on Microservices

Stride CEO Debbie Madden interviews Stride VP of Engineering Lisa van Gelder

Let's start at the beginning. What's a microservice?

No two people will give you the same answer! In my opinion, size isn't important and the name is misleading. It's really about clear responsibilities: you should be able to describe what your service does in a couple of words. This is the Comment Service, this is the Search Service, etc. A microservice needs to be deployed and hosted separately - if your services aren't, you have a distributed monolith. A microservice also needs to have clear data ownership. It's OK for another service to read from a datastore, but one service should be responsible for the schema and the data. Having another service change the schema or data you rely on causes problems.
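That data-ownership boundary can be sketched in a few lines. This is a minimal illustration, not anything from the interview - the service names echo the examples above, but the classes and methods are hypothetical:

```python
# A sketch of data ownership: CommentService owns its datastore and
# schema; SearchService reads comments through the public interface
# instead of querying the store directly. All names are hypothetical.

class CommentService:
    """Owns the comment data and its schema."""

    def __init__(self):
        # Private datastore - only this service writes to it or
        # changes its shape.
        self._comments = {}  # article_id -> list of comment dicts
        self._next_id = 1

    def add_comment(self, article_id, author, body):
        comment = {"id": self._next_id, "author": author, "body": body}
        self._next_id += 1
        self._comments.setdefault(article_id, []).append(comment)
        return comment["id"]

    def comments_for(self, article_id):
        # The public read API: callers never reach into _comments,
        # so the schema underneath can change freely.
        return list(self._comments.get(article_id, []))


class SearchService:
    """Reads comment data via CommentService's API, never its store."""

    def __init__(self, comment_service):
        self._comments = comment_service

    def find(self, article_id, term):
        return [c for c in self._comments.comments_for(article_id)
                if term in c["body"]]
```

Because SearchService only ever calls `comments_for`, the Comment Service team can restructure its storage without co-ordinating a schema change across teams.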

How can a team decide when it's time to move from a monolith to microservices?

You should think about moving to microservices when you start to feel pain. It's all about moving fast - in the beginning, having all your code in one application and deploying it together is the fastest way to go. At some point having a lot of teams all contributing code to the same application starts to get difficult. Everyone is treading on everyone else's toes. It's hard to do big refactors or schema changes without lots of co-ordination between teams. Releasing is complicated and needs lots of co-ordination too - and it's hard to know which of the many changes that just got released to production caused problems.

How can a team tell where to start? What should be pulled into a microservice first?

I'd focus on the area of the code that is causing the most pain. Is there one part in particular that has performance problems or needs to scale? Is there one part that is going to change very rapidly and require a lot of deployments? Does one part always cause issues when you deploy it to production? You want to have as few dependencies as possible between microservices to ensure you can make changes to one without affecting others, so I'd start by moving code into modules with distinct responsibilities.

Moving from a monolith to microservices can be a saving grace for many unwieldy code bases. What are the top 3 advantages?

Anti-fragile systems: loose coupling means performance problems in one part of the system can be isolated and the system can be more resilient.

Scaling: It's possible to scale only the parts of the system that really need to scale.

Empowered teams: teams can take control of their own architecture and technology stack and choose the right tool for the right job.

As you highlighted in your XP NYC Meetup talk, moving to microservices isn't without risk. What are the top 3 risks for teams to be aware of?

Tight coupling: tightly coupled services are harder to test and maintain than monoliths! Making changes to a lot of services to get a feature out is painful and requires careful co-ordination. Dependencies between services are really hard to test and hard to debug in production when things go wrong.

Silos: different technology choices by different teams can make cross-team communication and collaboration difficult. If your team has a great solution to a problem in Scala, is it useful to my Ruby team? Do I need to learn a new programming language to come join your team? When all the Guardian's teams used Java we used to have people moving teams every couple of weeks. When we switched to different tech stacks, people stayed on the same team for years. This can impact company culture.

Operational overhead: someone still needs to monitor and maintain all these technologies in production. Developers don't necessarily have the knowledge to implement replication, backups, monitoring, etc. - but operations also don't want to have to support ten different languages and ten different datastores.

What 3 things do you wish you knew 5 years ago that you know now about the tricks for an effective microservices undertaking?

Shared ownership is key: as well as taking ownership of their own service, developers need to take ownership of the system as a whole. The user doesn't care if the search service has 100% uptime when the shopping cart app is down and they can't buy the product they want. Everyone needs to pull together to investigate outages and fix problems without assigning blame.

Clear SLOs, escalation points and owners: it should be clear what the expectations are when a service goes down. Does someone get paged? If so, who? How critical is each service and what is the expected response time if it goes down?
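One lightweight way teams capture those expectations is a per-service ownership file kept alongside the code. The shape below is purely illustrative - the field names, team names, and targets are hypothetical, not something prescribed in the interview:

```yaml
# service-ownership.yaml (hypothetical example)
service: search
owner_team: search-team
criticality: high          # high services page someone immediately
slo:
  availability: "99.9%"    # monthly uptime target
  latency_p95: "300ms"
on_call:
  pager: search-team-oncall
  response_time: "15m"     # expected time to acknowledge a page
escalation:
  - search-team-oncall
  - platform-oncall        # escalate here if no response
```

Whatever the format, the point is that "who gets paged, and how fast" is written down per service rather than rediscovered during an outage.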

Shared logging and user identifiers: debugging incidents across services is hard. Having a shared user identifier to track a request between systems is invaluable, as is exporting logs to something like Logstash so it's possible to track an incident between logs without having to log into a hundred machines.
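The shared-identifier idea can be shown in miniature. This is a hedged sketch, not the Guardian's implementation: the function names and the `X-Request-Id` header choice are assumptions, and in a real system the ID would travel in an HTTP header and the log lines would ship to Logstash or similar:

```python
# A sketch of request-ID propagation: the edge service generates an
# identifier, passes it downstream with every call, and every service
# includes it in structured log lines so one query finds the whole
# request. All names here are hypothetical.

import json
import uuid

LOG_LINES = []  # stand-in for a centralised log store


def log(service, request_id, message):
    # Structured log entry tagged with the shared request_id.
    entry = {"service": service, "request_id": request_id,
             "message": message}
    LOG_LINES.append(entry)
    print(json.dumps(entry))


def search_handle(request):
    # Downstream service: reuses the ID it was given.
    request_id = request["X-Request-Id"]
    log("search", request_id, "running query: %s" % request["query"])


def frontend_handle(request):
    # Edge service: generate an ID if the caller didn't supply one.
    request_id = request.get("X-Request-Id") or str(uuid.uuid4())
    log("frontend", request_id, "received request")
    # Propagate the same ID with the downstream call.
    search_handle({"X-Request-Id": request_id,
                   "query": request["query"]})
    return request_id
```

After a call to `frontend_handle`, filtering the log store on one `request_id` returns the entries from both services - which is exactly the cross-service trace that is so painful to reconstruct by logging into machines one at a time.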