I’ve had this on my mind for quite a while, but a couple of recent triggers finally made me put it out.

It all started when my friend Jonathan Crossland asked this question on LinkedIn

One day later, a post from Ben Sigelman, for whom I have deep respect, popped up in my reading list

So, now’s the time.

Agile is evolving. Agile 2.0 is out. It is backed by a nice group of people, and mainly focused on DevOps. If you want to know more, follow Cliff Berg on LinkedIn.

Everybody’s doing DevOps. Right? … RIGHT???

It’s a no-brainer proposition. Repeatable builds, CI/CD pipelines, infrastructure as code, observability, etc. Somebody stop me before I fill a page with buzzwords. You can read/watch tutorials, buy some books, learn a lot, and just do it. Or not. Let me try to fill in some gaps.

The “normal” flow is something like this:

  • you implement a new feature/fix a bug, check in your code, and issue a PR
  • your automation kicks in and spins up a build VM/Docker container/whatever
  • it checks out your branch, runs some static checks, maybe code linting, maybe security scans, and the unit/functional tests (I hope you have some; that’s a topic for a totally different post)
  • if they pass, your artifact is checked into a repository management system (usually a Docker container in a Docker registry), tagged with some versioning, for example a part of the PR hash
  • your CI/CD deploys the newly built container to a testing area and runs an integration test suite; if it passes, the container is promoted to production
  • the CI/CD system deploys the container to production
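The gating logic of that flow can be sketched roughly like this. This is a minimal, hedged sketch: the script names, the registry URL, and the tagging scheme are placeholder assumptions, not any particular CI system’s API.

```python
# Minimal sketch of the pipeline gates described above.
# All script names and the registry URL are hypothetical placeholders.
import subprocess

def run(cmd: list) -> bool:
    """Run a pipeline step; True means the gate passed."""
    try:
        return subprocess.run(cmd).returncode == 0
    except OSError:          # e.g. the tool is missing on the build machine
        return False

def pipeline(pr_hash: str) -> str:
    checks = [
        ["./lint.sh"],           # static checks / code linting
        ["./security-scan.sh"],  # security scans
        ["./unit-tests.sh"],     # unit / functional tests
    ]
    if not all(run(c) for c in checks):
        return "rejected"
    tag = f"myapp:{pr_hash[:8]}"             # version tag from part of the PR hash
    run(["docker", "build", "-t", tag, "."])
    run(["docker", "push", f"registry.example.com/{tag}"])
    if not run(["./integration-tests.sh", tag]):  # deploy to test area, run suite
        return "rejected"
    run(["./deploy-prod.sh", tag])           # promote to production
    return "promoted"
```

Any failed gate short-circuits the rest, which is exactly the property the flow above relies on.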

Simple and clear… Not. Let’s talk a little bit about the last two steps.

Consider this presumably simple microservices architecture

System diagram

You have 6 microservices in a relatively simple service graph. Let’s assume, for simplicity, that you have 6 teams of 5 members each, and that each team owns one of the services. Those teams have different velocities, different deployment strategies, and different lifecycles for their respective components.

Now, you have a decision to make. CI/CD is all about short feedback cycles. I want to deploy my new shiny feature to production as fast as possible, and as safely as possible. How do you handle that? You have several options for the integration testing part:

  • each developer gets their own testing environment, because you definitely want isolation. Maybe you have a dedicated Kubernetes cluster where each dev gets their own namespace, in which CI/CD can spin up a full environment and run the integration tests. It might be an option, but most likely not, because the numbers grow large rapidly.
  • you spin up a new environment for each PR, in your deployment environment of choice (K8s, AWS, GCP, etc.). Also a possible option, but time- and resource-consuming.
  • you have one environment, where you try to test everything.

I would argue that for a medium-sized application these days, the choice should be the last one. Think of the above diagram as a zoomed-in part of a system of 40–50 microservices. The first two options wear thin pretty quickly. I want to deploy just my pod/container, test it, and promote it, and I don’t want to spin up a full environment every time, because of time and resource constraints.

The third option comes with its own warts too. Although it might look simpler, let’s look at a new diagram.

Single test environment

So what happens here?

While the system is running the test suite for service A (test suite 1), one dev issues a PR for component D. You want to move as fast as possible, so you deploy the new version D2 and start the second test suite. Now you need a system that can discern which requests are coming from test suite 1 and which from test suite 2, and route them accordingly. You might do that via a smart programmable proxy, like Envoy, with your own control plane, tagging the test-suite requests with a special header that your proxy can discriminate on. Not an easy task. Maybe you could use something like Istio. But you will need to integrate your CI/CD with the control plane for the proxies, so it can push the right routing configuration to them. Again, not easy.
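A hedged sketch of the routing decision such a control plane would push to the proxies. The `x-test-suite` header name and the routing table are illustrative assumptions, not Envoy’s or Istio’s actual configuration format.

```python
# Requests tagged with a test-suite header go to that suite's candidate
# version of a service; everything else hits the baseline version.
# Header name and routing table are illustrative assumptions.

ROUTES = {
    # service -> {test-suite id -> candidate deployment}
    "D": {"suite-2": "D2"},
}
BASELINE = {"A": "A1", "B": "B1", "C": "C1", "D": "D1", "E": "E1", "F": "F1"}

def pick_upstream(service: str, headers: dict) -> str:
    suite = headers.get("x-test-suite")
    return ROUTES.get(service, {}).get(suite, BASELINE[service])

# Test suite 2's request to D reaches the new version...
assert pick_upstream("D", {"x-test-suite": "suite-2"}) == "D2"
# ...while test suite 1's request to D still hits the baseline.
assert pick_upstream("D", {"x-test-suite": "suite-1"}) == "D1"
```

The hard part in real life is not this lookup; it is keeping the table in sync with what CI/CD has actually deployed, per PR, per suite.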

As if that weren’t complicated enough, your microservices should probably each own their own storage engine (SQL database, NoSQL database, etc.). I certainly hope they do. I absolutely love it when I hear everybody discussing how bad global variables are, and then, when you look at their architecture, you find a big fat global variable right in the core: a shared database, which is also used for synchronization. Lovely.

So you need a repository of DB engines, some with test data, some completely clean, from which you can check out and deploy a fresh engine when needed. Oh, by the way, somebody needs to maintain and curate this repo.

And, if the above doesn’t seem complicated enough, let me add some perks:

  • your 50-ish microservices testbed is constantly running 5–10 PRs in parallel, with a complicated and ever-moving service call graph
  • you have inter-PR dependencies (my PR on service C needs my colleague’s PR on service D)
  • you have two different PRs on the same service, but one requires a breaking change at the database structure level (don’t tell me that doesn’t happen)
  • integration tests are notoriously flaky, for various reasons. You run your test suite once, it fails; you run it a second time, it succeeds. What do you do, declare it passed or not?
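On that last question, one common policy (by no means the only defensible one) is to rerun the suite a few times, require a majority of passes, and flag anything that flip-flops so it can be quarantined and fixed. A sketch, with arbitrary thresholds:

```python
# Hedged sketch of a flaky-test verdict policy: majority vote over reruns,
# plus a "flaky" flag for anything that both passed and failed.
# The 0.6 pass ratio is an arbitrary assumption.

def verdict(results: list, min_pass_ratio: float = 0.6):
    passes = sum(results)
    ratio = passes / len(results)
    flaky = 0 < passes < len(results)   # passed at least once AND failed at least once
    return ("pass" if ratio >= min_pass_ratio else "fail", flaky)

assert verdict([True, True, True]) == ("pass", False)
assert verdict([False, True, True]) == ("pass", True)   # flaky but promoted
assert verdict([False, False, True]) == ("fail", True)  # flaky and blocked
```

The flag matters more than the verdict: a test that is "flaky but promoted" today is a production incident you haven’t scheduled yet.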

Alrighty. Let’s assume you solved all of the above, one way or another. The next thing to worry about: did I introduce a regression or not? Not a fault regression, which would presumably have been caught in the previous step, but a latency/throughput one. How do you test that?

Well, I suggest you have a few separate deployments for the critical-path components in your system, where you have all the dependencies set up and over-provisioned (i.e. I want to assume infinite resources downstream, because I’m testing the limits of this component only). You deploy your new component there and hit it with a load tester. Best in show would be production traffic, so for each component you need something that logs live traffic and then plays it back at a variable requests-per-second rate. And you ramp up until it starts breaking. You now have the telemetry data (you have a monitoring system here too, right?), and you compare it with saved data from previous runs, to make sure you didn’t break something. For that, you check just a few golden metrics, like CPU load, storage performance, network saturation, latency, and throughput.
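The comparison against saved runs can be sketched as a simple gate over the golden metrics. The metric names and the 10% tolerance here are assumptions for illustration, not a recommendation.

```python
# Hedged sketch of a performance-regression gate: compare this run's
# golden metrics against the saved baseline and flag any metric that
# degraded beyond a tolerance. Metric names and tolerance are assumptions.

GOLDEN = {"p99_latency_ms": False, "throughput_rps": True}  # True = higher is better

def regressions(baseline: dict, current: dict, tolerance: float = 0.10) -> list:
    bad = []
    for metric, higher_is_better in GOLDEN.items():
        b, c = baseline[metric], current[metric]
        # positive delta = improvement, negative = degradation
        delta = (c - b) / b if higher_is_better else (b - c) / b
        if delta < -tolerance:
            bad.append(metric)
    return bad

baseline = {"p99_latency_ms": 40.0, "throughput_rps": 1200.0}
# small wobble within tolerance: promotion allowed
assert regressions(baseline, {"p99_latency_ms": 42.0, "throughput_rps": 1180.0}) == []
# p99 latency up 50%: promotion blocked
assert regressions(baseline, {"p99_latency_ms": 60.0, "throughput_rps": 1190.0}) == ["p99_latency_ms"]
```
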

Short detour: you’re probably measuring these wrong. For latency, search YouTube for Gil Tene’s talk “How NOT to Measure Latency”. For the rest of the metrics, just think how often your dashboards show an average of the 95th percentile. Now try to put that formula on paper and see how much sense it makes, mathematically speaking, to average a percentile. Anyway, I digress.
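A quick synthetic demonstration of the point: the mean of per-node p95s can sit far below the p95 of the combined traffic, because one degraded node dominates the true tail. The numbers are made up purely to make the gap obvious.

```python
# Why "average of the 95th percentile" is not a meaningful number.
# Synthetic latency samples (ms), chosen only to make the gap obvious.

def p95(samples):
    s = sorted(samples)
    return s[int(0.95 * len(s))]   # nearest-rank flavour, good enough for this demo

node_a = [10] * 95 + [20] * 5        # healthy node:  p95 = 20 ms
node_b = [10] * 50 + [500] * 50      # degraded node: p95 = 500 ms

avg_of_p95s = (p95(node_a) + p95(node_b)) / 2   # 260.0 -- looks "fine-ish"
true_p95 = p95(node_a + node_b)                 # 500   -- the real tail
```

The dashboard showing 260 ms hides the fact that one in twenty requests actually takes half a second.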

Or maybe your change is pretty big, so you need to take a performance hit. But then you also need to communicate this to your automated load-testing tool, which is in charge of approving/denying the promotion of your component to production. Easy-peasy.

OK, now that you’re sure you don’t have errors, you have good enough performance data, and you don’t break any dependencies, it’s time to deploy to production.

Let’s look at one more diagram.

Single service deployment

Presumably, you’ll want to do a canary deployment. Send 5% of the traffic to your new component. Now, to be on the safe side, I would argue that you want a canary router which mirrors traffic between your new component and the live (previous) version of the same component. Why? Because you want to compare exactly how they behave, old vs. new. So your router will send two requests, return only the response from the new one, but compare traffic data between the two (mostly latency, throughput, and error rate), so you can be sure they behave more or less the same. Don’t forget to deploy them on exactly the same underlying hardware, otherwise you’re comparing apples to oranges. Oh, you noticed deltas between the two? Well, is it something that your performance regression testing didn’t catch, is your canary wrong, or is it a noisy-neighbours issue? If you haven’t met that last one yet, count yourself lucky.
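A sketch of such a mirroring canary router. The handler lambdas stand in for real upstream calls; as described above, the new version’s response is the one returned, and the previous version’s mirrored response is used purely for comparison.

```python
# Hedged sketch of a mirroring canary router: each request goes to both
# versions, the caller gets the new version's response, and per-request
# latency deltas are recorded for old-vs-new comparison.
import time

class MirrorRouter:
    def __init__(self, previous, canary):
        self.previous, self.canary = previous, canary
        self.deltas = []   # canary latency minus previous-version latency

    def handle(self, request):
        t0 = time.perf_counter()
        self.previous(request)           # mirrored call; response only compared, then discarded
        t_prev = time.perf_counter() - t0

        t0 = time.perf_counter()
        canary_resp = self.canary(request)
        t_canary = time.perf_counter() - t0

        self.deltas.append(t_canary - t_prev)
        return canary_resp               # the caller sees the new version's answer

router = MirrorRouter(previous=lambda r: f"v1:{r}", canary=lambda r: f"v2:{r}")
assert router.handle("GET /") == "v2:GET /"
assert len(router.deltas) == 1
```

In production you would aggregate `deltas` into percentiles rather than eyeball raw numbers, and you would also diff response bodies and error rates, not just timings.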

OK, this shall pass too… And now you can finally roll out your container and replace everything in live. And cross your fingers that your automation is properly set up, as Ben advises in the article quoted at the top, so at least you can roll back fast in case something bad still slipped through the cracks.

Phew… Quite a journey, right ?

Now, if you reached this point, take one step back and think: can your CI/CD automation and monitoring/observability systems handle all that? How much work do you need to set all this up? What is the size of the DevOps tooling team that handles it? Wanna go deeper and add security to the mix?

Good. Now, you might ask, what’s your point?

Actually, I have several:

  • DevOps, done right, is hard. The stuff you find in tutorials barely scratches the surface. What I did here is basically the same: just scratching the surface of some really hairy issues that you will face as long as you’re not deploying a Hello World
  • you need a team handling the tooling, with a lot of multi-disciplinary knowledge coverage (networking, hardware, cloud, monitoring, data science, etc.). Show some love to this team, because as long as it does its job properly, it is as good as invisible. When something’s wrong, they get a lot of heat. So give them some praise.
  • as Fred Brooks said a long time ago, there is no silver bullet. And it still stands. DevOps is not the universal panacea that will fix all your issues. It’s a permanent, ongoing process. And it’s not an easy road.
  • last but not least, why the “H” in the title? Because I would like to see a more “human in the middle” approach in DevOps. Automation is not everything. We have the most performant computing engine, capable of both frequentist and Bayesian inference, better than any AI. It’s called the “brain”. Use it.

The end… I’m sure I got some things mixed up and/or wrong, so please feel free to let me know in the comments.

Thank you!
