Dev H Ops


I’ve had this on my mind for quite a while, but a couple of recent triggers mean I have to put it out there.

It all started when my friend Jonathan Crossland asked this question on LinkedIn

One day later, a post from Ben Sigelman, for whom I have deep respect, popped up in my reading list

So, now’s the time.

Agile is evolving. Agile 2.0 is out. It is backed by a nice group of people, and mainly focused on DevOps. If you want to know more, follow Cliff Berg on LinkedIn.

Everybody’s doing devops. Right ? … RIGHT ???

It’s a no-brainer proposition. Repeatable builds, CI/CD pipelines, infrastructure as code, observability, etc. Somebody stop me before I fill a page with buzzwords. You can read or watch tutorials, buy some books, learn a lot, and just do it. Or not. Let me try to fill in some gaps.

The “normal” flow is something like this:

Simple and clear… Not. Let’s talk a little bit about the last two steps.

Consider this presumably simple micro-services architecture

System diagram

You have 6 microservices, in a relatively simple service graph. Let’s assume for simplicity that you have 6 teams, each team owns one of the services. Those teams have different velocities, different deployment strategies, different lifecycles for their respective components. Let’s assume, for simplicity, that each team has 5 members.

Now, you have a decision to make. CI/CD is all about short feedback cycles. I want to deploy my shiny new feature to production as quickly and as safely as possible. How do you handle that? You have several options for the integration testing part:

I would argue that for a medium-sized application these days, the choice should be the last one. Think of the above diagram as a zoomed-in part of a system of 40–50 micro-services. The first two options wear out pretty quickly. I want to deploy just my pod/container, test it and promote it, and I don’t want to spin up a full environment every time, because of time and resource constraints.

The third option comes with its own warts too. Although it might look simpler, let’s look at a new diagram.

Single test environment

So what happens here ?

While the system is running the test suite for service A (Test suite 1), one dev issues a PR for component D. You want to move as fast as possible, so you deploy the new version D2 and start the second test suite. Now you need a system that can tell which requests come from test suite 1 and which from test suite 2, and route them accordingly. You might do that with a smart programmable proxy, like Envoy, driven by your own control plane, and tag the test suite requests with a special header that your proxy can match on. Not an easy task. Maybe you could use something like Istio. But you will still need to integrate your CI/CD with the control plane for the proxies, so it can push the right routing configuration to them. Again, not easy.
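To make the routing decision concrete, here is a minimal sketch in plain Python of what such a proxy has to decide for every request. It is not Envoy or Istio configuration. The header name `x-test-suite`, the `RoutingTable` class and the version labels are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical header name; any header your proxy can match on would do.
TEST_SUITE_HEADER = "x-test-suite"


@dataclass
class RoutingTable:
    # Version of each service that untagged traffic is routed to.
    baseline: dict
    # Per-suite overrides: suite id -> {service name: version under test}.
    overrides: dict = field(default_factory=dict)

    def register_suite(self, suite_id, service, version):
        """Called by the CI/CD pipeline when it deploys a version under test."""
        self.overrides.setdefault(suite_id, {})[service] = version

    def resolve(self, service, headers):
        """Pick the version of `service` that should receive this request."""
        suite_id = headers.get(TEST_SUITE_HEADER)
        suite_overrides = self.overrides.get(suite_id, {})
        # Fall back to the baseline version when the suite did not override it.
        return suite_overrides.get(service, self.baseline[service])


table = RoutingTable(baseline={"A": "A1", "D": "D1"})
table.register_suite("suite-1", "A", "A2")  # test suite 1 exercises the new A
table.register_suite("suite-2", "D", "D2")  # test suite 2 exercises the new D

# Suite 1 must keep seeing the old D even though D2 is now deployed.
print(table.resolve("D", {TEST_SUITE_HEADER: "suite-1"}))
print(table.resolve("D", {TEST_SUITE_HEADER: "suite-2"}))
```

Note that this only works if every service forwards the header on its outgoing calls; otherwise the tag is lost after the first hop.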

As if that did not look complicated enough, your microservices should probably each own their storage engine (SQL database, NoSQL database, etc.). I certainly hope so. I absolutely love it when I hear everybody discussing how bad global variables are, and then you look at their architecture and find a big fat global variable right at the core, i.e. a shared database, which is also used for synchronization. Lovely.

So, you need a repository of DB engines, some with test data, some completely clean, from which you can check out and deploy a fresh engine when needed. Oh, btw, somebody needs to maintain and curate this repo.

And, if the above doesn’t seem complicated enough, let me add some perks:

Alrighty. Let’s assume you solved all of the above, one way or another. The next thing to worry about: did I introduce a regression or not? Not a fault regression, which would presumably be caught in the previous step, but a latency/throughput one. How do you test that?

Well, I suggest you have a few separate deployments for the critical path components in your system, where all the dependencies are set up and over-provisioned (i.e. I want to assume infinite resources downstream, because I’m testing the limits of this component only). You deploy your new component in there and hit it with a load tester. Best in show would be production traffic, so for each component you need something to log live traffic and then play it back at a variable requests-per-second rate. And you go up until it starts breaking. You now have the telemetry data (you have a monitoring system here too, right?), and you compare it with saved data from previous runs, to make sure you didn’t break something. For that, you check just a few golden metrics, like CPU load, storage performance, network saturation, latency and throughput.
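The "go up until it starts breaking" part can be sketched as a simple ramp. This is a toy illustration under stated assumptions: `run_at_rps` stands in for whatever replays your logged traffic at a given rate and reports the observed error rate, and the step sizes and the 1% threshold are made-up numbers, not recommendations.

```python
from typing import Callable, Optional


def find_breaking_point(
    run_at_rps: Callable[[int], float],
    start_rps: int = 100,
    step_rps: int = 100,
    max_rps: int = 10_000,
    max_error_rate: float = 0.01,
) -> Optional[int]:
    """Return the first request rate whose error rate exceeds the threshold.

    Returns None if the component survives every step up to max_rps.
    """
    rps = start_rps
    while rps <= max_rps:
        if run_at_rps(rps) > max_error_rate:
            return rps
        rps += step_rps
    return None


def fake_replay(rps: int) -> float:
    """Stand-in for a real traffic replay: healthy up to 500 rps, then 5% errors."""
    return 0.0 if rps <= 500 else 0.05


breaking_point = find_breaking_point(fake_replay)
print(breaking_point)
```

In a real setup the breaking point found here would be stored next to the other telemetry for the run, so that the next run has something to be compared against.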

Short detour: you’re probably measuring these wrong. For latency, search YouTube for Gil Tene’s talk “How NOT to Measure Latency”. For the rest of the measures, just think how often your dashboards show the average of the 95th percentile. Now try to put that formula on paper and see how much sense it makes, mathematically speaking, to average a percentile. Anyways. I digress.
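A small worked example of why averaging a percentile is meaningless. The latency numbers below are invented for illustration: one host serves many fast requests, another serves fewer slow ones. The `percentile` helper uses the nearest-rank definition; other definitions exist, but the conclusion does not depend on the choice.

```python
def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct * n / 100
    return ordered[max(rank, 1) - 1]


# Invented latencies in milliseconds.
host_a = [10] * 1000   # 1000 fast requests
host_b = [500] * 100   # 100 slow requests

p95_a = percentile(host_a, 95)
p95_b = percentile(host_b, 95)

# What the dashboard shows: the average of the per-host p95 values.
averaged_p95 = (p95_a + p95_b) / 2

# What users actually experience: the p95 over all requests together.
true_p95 = percentile(host_a + host_b, 95)

print(averaged_p95, true_p95)
```

With these numbers the averaged value is 255 ms while the real 95th percentile across all requests is 500 ms. A percentile can only be computed from the underlying samples (or a mergeable histogram of them), not from other percentiles.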

Or maybe your change is pretty big, so you need to take a performance hit. But you also need to communicate this to your automated load testing tool, which is in charge of approving/denying the promotion of your component to production. Easy-peasy.
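One way to picture that communication is a promotion gate that accepts explicit, per-metric allowances declared alongside the change. This is only a sketch: `check_promotion`, the metric names and the 5% default tolerance are hypothetical, and for simplicity it handles only metrics where higher is worse (latency, CPU load).

```python
from dataclasses import dataclass


@dataclass
class GateResult:
    approved: bool
    violations: list


def check_promotion(baseline, candidate, default_tolerance=0.05, allowances=None):
    """Compare candidate metrics with the baseline run.

    `allowances` maps a metric name to the relative increase that was
    explicitly accepted for this change (e.g. 0.30 means up to +30%).
    Only metrics where a higher value is worse are supported here.
    """
    allowances = allowances or {}
    violations = []
    for metric, base_value in baseline.items():
        allowed = allowances.get(metric, default_tolerance)
        if candidate[metric] > base_value * (1 + allowed):
            violations.append(metric)
    return GateResult(approved=not violations, violations=violations)


baseline = {"p99_latency_ms": 120.0, "cpu_load": 0.60}
candidate = {"p99_latency_ms": 150.0, "cpu_load": 0.61}

# Without a declared allowance, the latency increase blocks the promotion.
without_allowance = check_promotion(baseline, candidate)

# The team declares that this change may cost up to 30% in p99 latency.
with_allowance = check_promotion(
    baseline, candidate, allowances={"p99_latency_ms": 0.30}
)

print(without_allowance, with_allowance)
```

The interesting part is not the comparison itself but where the allowance lives: it has to travel with the change, be reviewed like code, and expire once the new numbers become the baseline.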

OK, now that you’re sure that you don’t have errors, you have good enough performance data, and you don’t break any dependencies, time to deploy to production.

Let’s look at one more diagram.

Single service deployment

Presumably, you’ll want to do a canary deployment. Send 5% of the traffic to your new component. Now, to be on the safe side, I would argue that you want a canary router which mirrors traffic between your component and the live (previous) version of the same component. Why? Because you want to compare exactly how it behaves, old vs new. So your router will send two requests, return only the response from the new version, and compare traffic data between the two (mostly latency, throughput and error rate), so you are sure that they behave more or less the same. Don’t forget to deploy them on exactly the same underlying hardware, otherwise you’re comparing apples to oranges. Oh, you noticed deltas between the two? Well, is it something that your performance regression testing didn’t catch, is your canary wrong, or is it a noisy neighbours issue? If you haven’t met the last one yet, here’s a hint:
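The mirroring behaviour described above can be sketched in a few lines of Python. This is an in-process toy, not a real proxy: `MirrorRouter` and the two handler functions are hypothetical, the calls are made one after the other instead of in parallel, and only request counts, errors and total time are recorded.

```python
import time


class MirrorRouter:
    """Sends each request to both versions and returns the new version's response."""

    def __init__(self, old, new):
        self.handlers = {"old": old, "new": new}
        self.stats = {
            name: {"count": 0, "errors": 0, "total_seconds": 0.0}
            for name in self.handlers
        }

    def _timed_call(self, name, request):
        stats = self.stats[name]
        start = time.perf_counter()
        try:
            return self.handlers[name](request)
        except Exception:
            stats["errors"] += 1
            return None
        finally:
            stats["count"] += 1
            stats["total_seconds"] += time.perf_counter() - start

    def handle(self, request):
        self._timed_call("old", request)          # response discarded, stats kept
        return self._timed_call("new", request)   # this is what the caller sees

    def error_rate(self, name):
        stats = self.stats[name]
        return stats["errors"] / stats["count"] if stats["count"] else 0.0


def old_version(request):
    return f"old:{request}"


def new_version(request):
    if request == "bad":
        raise ValueError("simulated failure in the new version")
    return f"new:{request}"


router = MirrorRouter(old_version, new_version)
responses = [router.handle("ok"), router.handle("bad")]
print(responses, router.error_rate("old"), router.error_rate("new"))
```

Only the slice of traffic selected for the canary would go through a router like this; the rest keeps going straight to the live version.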

OK, this too shall pass… And now you can finally roll out your container and replace everything in live. And cross your fingers that your automation is properly set up, like Ben advises in the article quoted at the top, so at least you can roll back fast in case something bad still slipped through the cracks.

Phew… Quite a journey, right ?

Now, if you reached this point, take one step back and think: can your CI/CD automation and monitoring/observability systems handle that? How much work do you need to set up all these? What is the size of your devops tooling team that handles all these? Wanna go deeper and add security in the mix?

Good. Now, you might ask, what’s your point ?

Actually, I have several:

The end … I’m sure I got things mixed up and/or wrong, so please feel free to let me know in the comments.

Thank you!

