Roll Over Beethoven

April 28 2024

no matter how good your idea is about the next shape a system should have, the difficult and most creative part is actually getting there from the current shape. in other words, how to roll it out “gradually”.

this sounds like a second take on the “what I learned in the past few years about Continuous Delivery” series, which I kicked off earlier this year. today’s focus is on some techniques I was able to test, with my past team, for incrementally collecting feedback before releasing features into the hands of customers.

and sorry Chuck, but my cents are for the Brit ones, today!

Migration: From the Bottom-up

let’s start with the easiest take: ensuring your system is always deployable, even when in the middle of a technical migration. for example, given an online reservation service, consider the initiative of moving its main integration point from an external 3rd party system to a shiny new corporate platform. one day this system will have to switch the service used for notifying transactions. in the meanwhile, we need to keep deploying, at least daily, to release other features.

among all the options available, this sounds like the perfect match for Branch by Abstraction, one of the tenets of trunk-based development: isolating a new implementation from the main path, behind a proper abstraction, until it’s ready to go. during this isolated stage, feedback can still be collected, even if the code is not attached to any visible UI or API, by running automated tests on inner entry points, at different levels:

  • many focused tests, hitting an in-memory, programmatically prepared system. any failure here implies either work in progress, or a regression
  • very few integration tests, hitting a real external system (eg: a sandbox test version). as we don’t control the counterpart (which can be unavailable sometimes), they’re temporarily allowed to fail, so some human judgment is required here

they can all run in-process, so they’re still part of the Continuous Integration (CI) stage (the one targeting artefact creation, with its quality assurance). no need to test any real deployed environment, yet.
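to make the idea more concrete, here’s a minimal sketch of what that abstraction could look like, in TypeScript. all names (NotificationService, the two implementations, the domain types) are hypothetical, just to illustrate the shape of Branch by Abstraction applied to the reservation example:

// hypothetical domain types, just enough to show the shape
type Transaction = { id: string; amountCents: number; currency: string };
type NotificationResult = { accepted: boolean; reference?: string };

// the abstraction both integrations hide behind
interface NotificationService {
  notifyTransaction(tx: Transaction): Promise<NotificationResult>;
}

// AS-IS: the existing integration with the external 3rd party system
class ThirdPartyNotificationService implements NotificationService {
  async notifyTransaction(tx: Transaction): Promise<NotificationResult> {
    // the real call to the legacy provider would go here
    return { accepted: true, reference: `3p-${tx.id}` };
  }
}

// TO-BE: the new integration with the internal platform, developed in isolation;
// focused tests can hit it directly, even before any UI or API uses it
class InternalPlatformNotificationService implements NotificationService {
  async notifyTransaction(tx: Transaction): Promise<NotificationResult> {
    // the real call to the new corporate platform would go here
    return { accepted: true, reference: `int-${tx.id}` };
  }
}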

in addition to that, we’ve also experimented with Probes, as part of (or in addition to) liveness checks. something like a “ping” query towards the target service, linked to a /status endpoint or similar. that way we could perform a sanity or smoke test, checking that all the required infrastructure was properly set up, all the way to production, way before that integration was actually used by user transactions. only caveat: in this early stage, make sure any failure does not mark instances as unhealthy, which would get them removed from the load balancer pool.
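as a sketch of that idea (the client interface and the response shape are assumptions, not a real health-check library):

// hypothetical client towards the new platform, exposing a lightweight ping
interface InternalPlatformClient {
  ping(): Promise<void>;
}

// probe behind a /status endpoint: it reports whether the new integration is
// reachable, but it never fails liveness, so instances stay in the pool
async function statusProbe(platform: InternalPlatformClient) {
  let internalPlatform = 'ok';
  try {
    await platform.ping();
  } catch (e) {
    internalPlatform = 'unreachable'; // surfaced for humans, not for the load balancer
  }
  return { status: 'UP', checks: { internalPlatform } };
}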

then, once we’re ready to connect the paths, that’s when Feature Toggles come in. they’re the enabling points in Michael Feathers’s model of seams. an early strategy would be enabling a global toggle in the dev environment. eg:

TOGGLE_INTERNAL_PLATFORM_ENABLED: true
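building on the earlier abstraction sketch, a hypothetical wiring of that toggle could simply decide which implementation sits behind the abstraction:

// the global toggle (from the configuration above) picks the implementation
function notificationService(): NotificationService {
  const enabled = process.env.TOGGLE_INTERNAL_PLATFORM_ENABLED === 'true';
  return enabled
    ? new InternalPlatformNotificationService() // TO-BE path
    : new ThirdPartyNotificationService();      // AS-IS path
}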

this also allows writing even fewer end-to-end tests: using the public API or the visible UI as the entry point, going through the TO-BE path, thanks to the enabled toggle. these would be out-of-process tests, run right after deployment to the dev environment is completed, during the Continuous Delivery (CD) stage (the one targeting increasing confidence in the “deployability” of the current version, up to production).
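a minimal sketch of such an out-of-process test, assuming a hypothetical public reservation API on the dev environment and a Jest-like test runner:

// with the global toggle enabled on dev, this exercises the TO-BE path end to end
test('reservation is accepted through the internal platform integration', async () => {
  const response = await fetch('https://reservations.dev.example.com/api/reservations', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ store: 'uk-001', amountCents: 1245 }),
  });
  expect(response.status).toBe(201);
});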

the closer to production, the more we’d probably need to support both scenarios, to easily compare TO-BE with AS-IS, and not to alter feedback from other features being tested. that’s when we can consider finer-grained toggles. in my experience, two strategies are worth exploring here.

if there’s any gradual rollout planned, toggles can be set conditionally around rollout stages, eg: by country, by language, by channel, etc., or a combination of those. for example, the pre-production environment can be set so that transactions from a few countries and languages are handled by the TO-BE integration, while all other countries are kept on the legacy AS-IS integration. or the other way around. eg:

TOGGLE_INTERNAL_PLATFORM_COUNTRIES: ['us', 'uk']
TOGGLE_INTERNAL_PLATFORM_CHANNELS: ['en']
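a possible evaluation of those conditional toggles, as a sketch (the configuration is assumed to be already loaded into a small object; '*' is treated as “match everything”, which is also how the full-rollout configuration further below works):

type ToggleConfig = { countries: string[]; channels: string[] };

// whitelist semantics: a value matches if listed explicitly, or if '*' is present
function matches(whitelist: string[], value: string): boolean {
  return whitelist.includes('*') || whitelist.includes(value);
}

// the TO-BE integration handles the transaction only when all whitelists match
function useInternalPlatform(cfg: ToggleConfig, tx: { country: string; channel: string }): boolean {
  return matches(cfg.countries, tx.country) && matches(cfg.channels, tx.channel);
}

// eg: with the configuration above
useInternalPlatform({ countries: ['us', 'uk'], channels: ['en'] }, { country: 'us', channel: 'en' }); // true
useInternalPlatform({ countries: ['us', 'uk'], channels: ['en'] }, { country: 'it', channel: 'en' }); // false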

on the other hand, toggles can also be enabled on individual requests, by relying on HTTP test parameters or headers, prepared during test execution. while performing manual exploratory tests, browser extensions such as ModHeader for Chrome can be used. eg:

X-test-config: {TOGGLE_INTERNAL_PLATFORM_ENABLED:true}
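a sketch of how such a per-request override could be evaluated server-side, with the header taking precedence over the environment toggle (the header format simply mirrors the example above, and is an assumption):

// returns true when the request explicitly asks for the TO-BE path,
// otherwise falls back to the environment-level toggle
function enabledForRequest(headers: Record<string, string | undefined>): boolean {
  const testConfig = headers['x-test-config'];
  if (testConfig?.includes('TOGGLE_INTERNAL_PLATFORM_ENABLED:true')) return true;
  return process.env.TOGGLE_INTERNAL_PLATFORM_ENABLED === 'true';
}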

once ready for a full rollout in production, a global toggle set to true can still be achieved with conditional toggles, by using '*', given the semantics are still whitelist-based. eg:

TOGGLE_INTERNAL_PLATFORM_COUNTRIES: ['*']
TOGGLE_INTERNAL_PLATFORM_CHANNELS: ['*']

whatever strategy is chosen, please don’t forget to remove the toggles afterwards!

Re-platforming: Don’t tell Anybody

raising the bar, sometimes you’ll probably need multiple moving parts. for example, when re-platforming a legacy distributed system from the ground up. you’ll still have upstream (clients) and downstream (3rd parties) systems to be integrated, while you take one piece at a time out of multiple existing applications and move it into new bounded contexts (for a comprehensive reference, see Patterns of Legacy Displacement).

still on toggles, one thing I’ve learned. given this distributed system has to be coordinated, one approach is relying on distributed feature toggles, shared and reloaded in sync by multiple applications. the issue here is with that sync: can we ensure the caches are flushed at the very same time?

a better approach is keeping toggles local to one upstream system only, and letting the toggle values flow along the request chain to the downstream ones: evaluate the toggle once, and let the decision be propagated. this requires a little more instrumentation, which acts as a dedicated test-bus for our applications. for example, the “individual requests” toggle from the previous section can be reused for that.
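as a sketch of that propagation (the downstream URL and payload are hypothetical): the upstream service evaluates the toggle once, then forwards the decision on every downstream call, reusing the same test-bus header:

async function callDownstream(tx: Transaction, internalPlatformEnabled: boolean) {
  await fetch('https://downstream.internal.example.com/notifications', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      // the decision travels with the request; no shared toggle store to keep in sync
      'X-test-config': `{TOGGLE_INTERNAL_PLATFORM_ENABLED:${internalPlatformEnabled}}`,
    },
    body: JSON.stringify(tx),
  });
}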

finally, no matter how much testing you can arrange, the only true feedback is from production. is there any safe way to gather that feedback, and mitigate the uncertainty? one technique I successfully tested recently is Dark Launching. it can basically be achieved with two enabling mechanisms for processing transactions: a “dry-run” mode and a “fire and forget” execution strategy, probably run in parallel with the main one.

dry-run means performing as much of the actual chain as possible, while avoiding committing any operational change. as in the previous example, going through the whole new integration, but not actually sending any data to the remote party (maybe just logging payloads).
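building again on the abstraction sketch from the first section, a dry-run flavour of the TO-BE integration could look like this (payload and log format are hypothetical):

// walks the whole new chain, but stops short of the remote call:
// it only logs the payload it would have sent
class DryRunInternalPlatformNotificationService implements NotificationService {
  async notifyTransaction(tx: Transaction): Promise<NotificationResult> {
    const payload = { transactionId: tx.id, amount: tx.amountCents, currency: tx.currency };
    console.log(`[DRY-RUN|REPLAT] would send ${JSON.stringify(payload)}`);
    return { accepted: true }; // no operational change has been committed
  }
}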

then, the “fire and forget” execution strategy means transactions are also processed by the TO-BE chain (in dry-run mode), even if we don’t need to (and probably won’t) wait for its results to be available. even more, we don’t want any error on the TO-BE side to impact the AS-IS execution. results returned to users are still collected from AS-IS only, while both are “persisted” for later analysis. of course, they have to be comparable.
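a minimal sketch of that execution strategy, under the same assumptions as the previous snippets:

// the AS-IS result is what users get; the TO-BE dry-run runs in the background,
// is never awaited, and can never fail the request
async function processTransaction(
  tx: Transaction,
  asIs: NotificationService,
  toBeDryRun: NotificationService,
): Promise<NotificationResult> {
  toBeDryRun.notifyTransaction(tx).catch(err => console.warn(`[REPLAT] dark launch failed: ${err}`));
  return asIs.notifyTransaction(tx);
}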

the simplest implementation for this is making both chains add logging information in the very same format: key info from the request data, to match transactions, and key info from the results data, to be automatically diff-ed. then a script can frequently grab logs from production, match requests by key, compare results and warn about differences. warn, not alarm, as in this setup lots of false negatives (and false positives as well) will be found, while the “comparing” logic gets improved over time (trained, as we’d say today for AI). eg:

[itsme@mail.com|us|en|web|LEGACY] processed [12.45USD|store-pickup|Mr.]
[itsme@mail.com|us|en|web|REPLAT] processed [12.46USD|store-pickup|mr.]

from the above example, differences in decimal rounding and title case would then be found, before actually becoming an issue: tested in production, on real users, with no impact on real transactions.
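as a sketch of that comparison script, assuming the log format shown above (pipe-separated request key, source tag, and result fields):

type ParsedLine = { key: string; source: 'LEGACY' | 'REPLAT'; result: string };

// eg: "[itsme@mail.com|us|en|web|LEGACY] processed [12.45USD|store-pickup|Mr.]"
function parseLine(line: string): ParsedLine | undefined {
  const match = line.match(/^\[(.+)\|(LEGACY|REPLAT)\] processed \[(.+)\]$/);
  if (!match) return undefined;
  return { key: match[1], source: match[2] as 'LEGACY' | 'REPLAT', result: match[3] };
}

// match LEGACY and REPLAT lines by request key, and warn (not alarm) on differences
function compareLogs(lines: string[]): string[] {
  const legacy = new Map<string, string>();
  const replat = new Map<string, string>();
  for (const line of lines) {
    const parsed = parseLine(line);
    if (!parsed) continue;
    (parsed.source === 'LEGACY' ? legacy : replat).set(parsed.key, parsed.result);
  }
  const warnings: string[] = [];
  legacy.forEach((legacyResult, key) => {
    const replatResult = replat.get(key);
    if (replatResult !== undefined && replatResult !== legacyResult) {
      warnings.push(`key=${key}: '${legacyResult}' vs '${replatResult}'`);
    }
  });
  return warnings;
}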

Roll your dice

indeed, the difficult and most creative part is actually getting your system to the desired target shape, through many individual steps.

difficult because a big-bang switch from AS-IS to TO-BE would not require all the machinery we just discussed, all that scaffolding that we’ll get rid of once done (also known as Transitional Architecture). it’s like a 15 puzzle game, where we need to move individual squares, when we’d really like to just fill an empty board from scratch instead.

but it’s also an incredibly creative process, the kind where constraints drive innovation, and team work pays off the most. the key tooling here is not the technology we can find, but a whiteboard we bring into the team space.

I argue that’s also one of the essential skills for modern software architecture.
