17 February 2018

Microservices Go-live checklist

More and more organisations are breaking up monoliths (formerly known as “the app”) into smaller services and adopting the microservices architecture pattern, with the services communicating asynchronously via Event Sourcing. The main benefits are:

the ability to scale each service independently, as needed,
the freedom to use the right programming language for the domain and to design an architecture specific to that service,
reduced risk, because when one service fails the entire platform/product doesn’t go down,
service-oriented teams that have better knowledge of a specific domain and a clear view of how the different services come together.

Go Live Checklist

Review the architecture, get rid of Single Points of Failure

Make sure everyone is happy with the current architecture. If you need to deliver fast and some fixes have to wait, plan when those fixes will be made, make sure everyone is on the same page, and document those decisions.

Additionally, you want your service to be highly available:

deploy in multiple availability zones and regions,
keep multiple replicas of your data,
design with failure as a real scenario and add retry logic with exponential backoff.
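
To make the last point concrete, here is a minimal sketch of retry logic with exponential backoff. The function name, the default delays, and the choice of which exceptions count as retriable are my own assumptions for illustration; adjust them to the failures your dependencies actually produce. The random jitter is an addition on top of plain exponential backoff, so that many clients don't retry at exactly the same moment.

```python
import random
import time


def retry_with_backoff(operation, max_attempts=5, base_delay=0.5, max_delay=30.0,
                       retriable=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call operation(), retrying retriable errors with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retriable:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            # The delay cap doubles on every attempt: base, 2*base, 4*base, ...
            # and never exceeds max_delay.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Jitter: sleep a random amount up to the cap (an assumption of
            # this sketch, not something the checklist prescribes).
            sleep(random.uniform(0, cap))
```

You would wrap a call to a flaky dependency like this: retry_with_backoff(lambda: fetch_report()), where fetch_report is a hypothetical function of yours. Only retry operations that are safe to repeat.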

And secure (security is a huge topic and the list below is by no means exhaustive):

keep the resources that don’t need to be publicly accessible in a private network,
protect the resources that are exposed to the public internet.

Documentation

Ensure your documentation (README/Wiki entries) provides enough information about your app:

what it does, with enough context about the problems it solves,
how to install it,
how tests are run,
how it is used (--help output for binaries, with the available flags).

Your microservice should be fairly simple and do very specific tasks. If you find it does too many things, consider breaking it down into more microservices.

Containerize your app

There are many benefits in containerizing your application, described here.

In a nutshell, your artifact is a Docker image that is deployed and managed by a container orchestration framework, such as Kubernetes.

Healthcheck and readiness endpoints

These endpoints are used by the container orchestration framework to determine whether the app is healthy and ready to accept connections.

Possible healthcheck values:

healthy (app and its dependencies are healthy)
degraded (a dependency has degraded status)
unhealthy (app or a dependency is unhealthy)

A degraded status means the app is still running, but some operations will fail.
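
One way to compute the three values above is to take the worst status across the app and its dependencies. The sketch below assumes exactly that rule; the names Status, overall_status and health_response are made up for illustration, and returning 503 only for unhealthy is my assumption about how you might want the orchestrator to react.

```python
from enum import IntEnum


class Status(IntEnum):
    # Ordered from best to worst so that max() picks the worst status.
    HEALTHY = 0
    DEGRADED = 1
    UNHEALTHY = 2


def overall_status(app_status, dependency_statuses):
    """Return the worst status across the app and its dependencies."""
    return max([app_status, *dependency_statuses.values()])


def health_response(app_status, dependency_statuses):
    """Build an (HTTP status code, JSON-serialisable body) pair."""
    status = overall_status(app_status, dependency_statuses)
    # Assumption: only "unhealthy" maps to 503, so a degraded app keeps
    # receiving traffic while still reporting the problem in the body.
    http_code = 503 if status == Status.UNHEALTHY else 200
    body = {
        "status": status.name.lower(),
        "dependencies": {
            name: dep.name.lower() for name, dep in dependency_statuses.items()
        },
    }
    return http_code, body
```

Listing each dependency in the response body makes it much quicker to see which one dragged the overall status down.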

Metrics endpoint

You can create meaningful dashboards (for example in Grafana) by leveraging the power of metrics. This is useful to see how your app is behaving, for example memory consumption, GC pauses, error rate, HTTP status codes, and whatever else is useful when troubleshooting your application.

A popular tool for metrics collection is Prometheus.
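
To give an idea of what a metrics endpoint returns, here is a small sketch that counts HTTP responses by status code and renders them in the Prometheus text exposition format. The class and the metric name http_responses_total are hypothetical, and in a real service you would most likely use an official Prometheus client library rather than formatting the text yourself.

```python
from collections import Counter


class RequestMetrics:
    """Counts HTTP responses by status code (hypothetical helper)."""

    def __init__(self):
        self._responses = Counter()

    def observe(self, status_code):
        self._responses[str(status_code)] += 1

    def render(self):
        """Render the counters as Prometheus text exposition format."""
        lines = [
            "# HELP http_responses_total Number of HTTP responses by status code.",
            "# TYPE http_responses_total counter",
        ]
        for code, count in sorted(self._responses.items()):
            # One sample per label value, e.g. http_responses_total{code="200"} 2
            lines.append(f'http_responses_total{{code="{code}"}} {count}')
        return "\n".join(lines) + "\n"
```

Your HTTP framework would call observe() once per response and serve the output of render() as plain text on the metrics endpoint.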

Logging

Structured logging provides useful insights into what your app does, and the logs can be aggregated and parsed in a service like Splunk or Sumo Logic.
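
As a rough illustration, structured logging can be as simple as emitting one JSON object per line. The sketch below uses only Python's standard logging module; the JsonFormatter and build_logger names and the convention of passing fields under a "context" key are my own assumptions, not a standard.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Format each log record as a single JSON object per line."""

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Fields passed as extra={"context": {...}} become an attribute
        # on the record; merge them into the top-level object.
        payload.update(getattr(record, "context", {}))
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        # default=str keeps logging from failing on non-JSON types.
        return json.dumps(payload, default=str)


def build_logger(name, stream=None):
    """Return a logger that writes JSON lines (stderr when stream is None)."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate lines via the root logger
    return logger
```

A call such as build_logger("orders").info("order created", extra={"context": {"order_id": 42}}) then produces a line that a log aggregator can filter by order_id without any regex parsing.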

Set up alerts

Create alerts based on business and operational logic. Here are some examples:

no files received from a 3rd party provider within 6 hours,
disk usage is above 75%,
error rate is above a certain threshold for 15 minutes,
service is unhealthy.

Note: Finding the right value for a threshold sometimes takes time, and the value should be adjusted as you learn. You don’t want to be alerted too often, because you’ll end up ignoring alerts or, even worse, missing them due to excessive noise.

Think about what threshold is acceptable for your service and investigate why an alert was triggered, and if necessary adjust the threshold.
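
Requiring a condition to hold for a while before alerting, as in the error rate example, is one way to cut down on noise. Here is a minimal sketch of that idea; the class name is hypothetical, and in practice your monitoring tool would normally do this evaluation for you.

```python
from datetime import datetime, timedelta


class SustainedThresholdAlert:
    """Fires only when a value stays above a threshold for a whole window."""

    def __init__(self, threshold, window=timedelta(minutes=15)):
        self.threshold = threshold
        self.window = window
        self._breach_started = None  # when the value first exceeded the threshold

    def observe(self, value, at):
        """Record a sample taken at time `at`; return True if the alert fires."""
        if value <= self.threshold:
            self._breach_started = None  # back to normal: reset the timer
            return False
        if self._breach_started is None:
            self._breach_started = at
        return at - self._breach_started >= self.window
```

A short spike resets the timer as soon as the value drops back below the threshold, so only a sustained problem ends up paging someone.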

Backups and restore plan

Have a backup strategy as well as a way to restore from previous backups. It is very important that your teammates are familiar with these processes and that you have tested that they work.

You don’t want restoring from a backup to fail in a critical moment.

You also want an alert if backups are not being taken. This can be achieved by having your backup service expose relevant metrics and by setting up alerts for every app that has its data backed up.
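
As a sketch of what such an alert could check, assume the backup service exposes the time of the last successful backup per app. The function name, the shape of the input, and the 24 hour limit are all assumptions for illustration.

```python
from datetime import datetime, timedelta


def stale_backups(last_success_by_app, now, max_age=timedelta(hours=24)):
    """Return apps whose latest successful backup is missing or too old."""
    stale = []
    for app, last_success in sorted(last_success_by_app.items()):
        # None means the app has never had a successful backup.
        if last_success is None or now - last_success > max_age:
            stale.append(app)
    return stale
```

Treating "never backed up" the same as "backup too old" means a newly added app can't silently slip through without backups.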

Resources

For more on distributed systems, reliable microservices, and event sourcing, I recommend the following resources:

Distributed Systems by Maarten van Steen and Andrew S. Tanenbaum (also freely available here),
Production-Ready Microservices by Susan J. Fowler,
eBook download for Event Sourcing and CQRS by Microsoft.

Your experience

I’d love to hear your thoughts on, and your experience in, the microservices world. Feel free to reach out to me on LinkedIn!

tags: microservices - go-live