Design change-tolerant software with cloud native patterns

February 13, 2021 | 13 minute read

By applying cloud native designs, you can build resilient, easily adaptable, web-scale distributed applications that handle massive user traffic and data loads.


[The following is an excerpt from Cloud Native Patterns by Cornelia Davis (Manning, May 2019), reprinted with permission. You can download a larger excerpt at no charge or order the full 400-page book from Manning. —Ed.]

Let’s talk about the detrimental effects of variability, or as we often call the results, snowflakes, on the workings of IT. Snowflakes make things hard to deploy because you must constantly adjust to differences both in the environments into which you’re deploying and in the artifacts you’re deploying. That same inconsistency makes it extremely difficult to keep things running well once in production, because every environment and piece of software gets special treatment anytime something changes. Drift from a known configuration is a constant threat to stability when you can’t reliably re-create the configuration that was working before a crash.

When you turn that negative into a positive in your enabling system, the key concept is repeatability. It’s analogous to the steps in an assembly line: each time you attach a steering wheel to a car, you repeat the same process. If the conditions are the same within some parameters (I’ll elaborate on this more in a moment) and the same process is executed, the outcome is predictable.

The benefits of repeatability for our two goals—getting things deployed and maintaining stability—are great. Iterative cycles are essential to frequent releases, and by removing the variability from the dev/test process that happens with each turn of the crank, the time to deliver a new capability within the iteration is compressed. And once you’re running in production, whether you’re responding to a failure or increasing capacity to handle greater volumes, the ability to stamp out deployments with complete predictability relieves tremendous stress from the system.

How do we then achieve this sought-after repeatability? One of the advantages of software is that it’s easy to change, and those changes can be made quickly. But that same malleability is exactly what has invited us to create snowflakes in the past. To achieve the needed repeatability, you must be disciplined. In particular, you need to do the following:

  • Control the environments into which you’ll deploy the software
  • Control the software that you’re deploying—also known as the deployable artifact
  • Control the deployment processes

Control the environment

In an assembly line, you control the environment by laying out the parts being assembled and the tools used for assembly in exactly the same way—no need to search for the three-quarter-inch socket wrench each time you need it, because it’s always in the same place. In software, you use two primary mechanisms to consistently lay out the context in which the implementation runs.

First, you must begin with standardized machine images. In building up environments, you must consistently begin with a known starting point. Second, changes applied to that base image to establish the context into which your software is deployed must be coded.

For example, if you begin with a base Ubuntu image and your software requires the Java Development Kit (JDK), you’ll script the installation of the JDK into the base image. The term often used for this latter concept is infrastructure as code. When you need a new instance of an environment, you begin with the base image and apply the script, and you’re guaranteed to have the same environment each time.

Once established, any changes to an environment must also be equally controlled. If operations staff routinely SSH into machines and make configuration changes, the rigor you’ve applied to setting up the systems is for naught. Numerous techniques can be used to ensure control after initial deployment. You might disallow SSH access into running environments altogether, or, if you do allow it, automatically take a machine offline as soon as someone has SSHed in. The latter is a useful pattern in that it lets someone go into a box to investigate a problem but doesn’t allow any changes they make to color the running environment.

If a change needs to be made to running environments, the only way for this to happen is by updating the standard machine image as well as the code that applies the runtime environment to it—both of which are controlled in a source code control system or something equivalent.

Responsibility for creating the standardized machine images and the infrastructure as code varies from organization to organization, but it’s essential that you, as an application developer, use such a system. Practices that you apply (or don’t) early in the software development lifecycle have a marked effect on the organization’s ability to efficiently deploy and manage that software in production.

Control the deployable artifact

Let’s take a moment to acknowledge the obvious: there are always differences in environments. In production, your software connects to your live customer database, found at a URL such as http://prod.example.com/customerDB; in staging, it connects to a copy of that database that has been cleansed of personally identifiable information and is found at http://staging.example.com/cleansedDB; and during initial development, there may be a mock database that’s accessed at http://localhost/mockDB. Obviously, credentials differ from one environment to the next. How do you account for such differences in the code you’re creating?

I know you aren’t hardcoding such strings directly into your code (right?). Likely, you’re parameterizing your code and putting these values into some type of property file. This is a good first step, but a problem often remains: the property files, and hence the parameter values for the different environments, are compiled into the deployable artifact.

For example, in a Java setting, the application.properties file is often included in the JAR or WAR file, which is then deployed into one of the environments. And therein lies the problem. When the environment-specific settings are compiled in, the JAR file that you deploy in the test environment is different from the JAR file that you deploy into production; see Figure 1.
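
To make the problem concrete, here is a minimal sketch of that anti-pattern (the file name and property key are illustrative, not from the book): the properties file is read from the classpath, which means its contents were fixed when the artifact was built.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

public class CustomerDbConfig {
    // application.properties is packaged inside the JAR, so its contents
    // (for example, customerdb.url=http://prod.example.com/customerDB)
    // were fixed at build time; a different environment needs a different build.
    public static Properties load() throws IOException {
        Properties props = new Properties();
        try (InputStream in = CustomerDbConfig.class
                .getResourceAsStream("/application.properties")) {
            props.load(in); // whatever was compiled in is what you get
        }
        return props;
    }
}
```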


Figure 1. Even when environment-specific values are organized into property files, by including property files in the deployable artifact, you’ll have different artifacts throughout the software development lifecycle.

As soon as you build different artifacts for different stages in the SDLC [software development lifecycle], repeatability may be compromised. The discipline for controlling the variability of that software artifact, ensuring that the only difference between the artifacts is the contents of the property files, must now be embedded in the build process itself.

Unfortunately, because the JAR files are different, you can no longer compare file hashes to verify that the artifact that you’ve deployed into the staging environment is exactly the same as that which you’ve deployed into production. And if something changes in one of the environments, and one of the property values changes, you must update the property file, which means a new deployable artifact and a new deployment.

For efficient, safe, and repeatable production operations, it’s essential that a single deployable artifact is used through the entire SDLC. The JAR file you build and run through regression tests during development is the exact JAR file deployed into the test, staging, and production environments.

To make this happen, the code needs to be structured in the right way. For example, property files don’t carry environment-specific values but instead define a set of parameters for which values may later be injected. You can then bind values to these parameters at the appropriate time, drawing values from the right sources. It’s up to you as the developer to create implementations that properly abstract the environmental variability. Doing this allows you to create a single deployable artifact that can be carried through the entire SDLC, bringing with it agility and reliability.
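
One common way to structure this (a sketch only; the environment variable name is hypothetical) is to leave the value out of the artifact entirely and have the code resolve it from the runtime environment at startup:

```java
public class CustomerDbConfig {
    // The artifact carries only the parameter name; the value is bound at
    // deployment time, so the same JAR runs unchanged in development,
    // test, staging, and production.
    public static String customerDbUrl() {
        String url = System.getenv("CUSTOMER_DB_URL"); // injected per environment
        if (url == null || url.isEmpty()) {
            throw new IllegalStateException(
                    "CUSTOMER_DB_URL must be provided by the runtime environment");
        }
        return url;
    }
}
```

Frameworks such as Spring Boot formalize the same idea as externalized configuration, but the principle holds regardless of tooling: the artifact declares the parameter, and the environment supplies the value.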

Control the process

Having established environment consistency and the discipline of creating a single deployable artifact to carry through the entire software development lifecycle, what’s left is ensuring that these pieces come together in a controlled, repeatable manner. Figure 2 depicts the desired outcome: in all stages of the SDLC, you can reliably stamp out exact copies of as many running units as needed.


Figure 2. The desired outcome is to be able to consistently establish apps running in standardized environments. Note that the app is the same across all environments; the runtime environment is standardized within an SDLC stage.

Figure 2 has no snowflakes. The deployable artifact, the app, is exactly the same across all deployments and environments. The runtime environment has variation across the different stages, but (as indicated by the different shades of the same gray coloring) the base is the same and has only different configurations applied, such as database bindings.

Within a lifecycle stage, all the configurations are the same; they have exactly the same shade of gray. Those anti-snowflake boxes are assembled from the two controlled entities I’ve been talking about: standardized runtime environments and single deployable artifacts, as seen in Figure 3.


Figure 3. The assembly of standardized base images, controlled environment configurations, and single deployable artifacts is automated.

A whole lot is under the surface of this simple figure. What makes a good base image, and how is it made available to developers and operators? What is the source of the environment configuration, and when is it brought into the application context? Exactly when is the app “installed” into the runtime context? At this juncture my main point is this: The only way to draw the pieces together in a manner that ensures consistency is to automate.

Although the use of continuous integration tools and practices is fairly ubiquitous in the development phase of writing software (for example, a build pipeline compiles checked-in code and runs some tests), its use in driving the entire SDLC isn’t as widely adopted. But the automation must carry all the way from code check-in, through deployments, into test and production environments.

And when I say it’s all automated, I mean everything. Even when you aren’t responsible for the creation of the various bits and pieces, the assembly must be controlled in this manner. For example, users of Pivotal Cloud Foundry, a popular cloud native platform, use an API to download new “stem cells,” the base images into which apps are deployed, from a software distribution site, and they use pipelines to complete the assembly of the runtime environment and the application artifact. Another pipeline does the final deployment into production. In fact, when deployments into production also happen via pipelines, servers aren’t touched directly by humans, something that’ll make your chief security officer (and other control-related personnel) happy.

But if you’ve totally automated things all the way to deployment, how do you ensure that these deployments are safe? This is another area that requires a new philosophy.

Safe deployments

Earlier I talked about risky deployments and that the most common mechanism that organizations use as an attempt to control the risk is to put in place expansive and expensive testing environments with complex and slow processes to govern their use. Initially, you might think that there’s no alternative because the only way to know that something works when deployed into production is to test it first. But I suggest that it’s more a symptom of what Grace Hopper said was the most dangerous phrase: “We’ve always done it this way.”

The born-in-the-cloud-era software companies have shown us a new way: They experiment in production. Egad! What am I talking about?! Let me add one word: They safely experiment in production.

Let’s first look at what I mean by safe experimentation and then look at the impact it has on our goals of easy deployments and production stability.

When trapeze artists let go of one ring, spin through the air, and grasp another, they most often achieve their goal and entertain spectators. No question about it, their success depends on the right training and tooling, and a whole load of practice. But acrobats aren’t fools; they know that things sometimes go wrong, so they perform over a safety net.

When you experiment in production, you do it with the right safety nets in place. Both operational practices and software design patterns come together to weave that net. Add in solid software-engineering practices such as test-driven development, and you can minimize the chance of failure. But eliminating it entirely isn’t the goal. Expecting failure (and failure will happen) greatly lessens the chances of it being catastrophic. Perhaps a small handful of users will receive an error message and need to refresh, but overall, the system remains up and running.

Here’s the key: Everything about the software design and the operational practices allows you to easily and quickly pull back the experiment and return to a known working state (or advance to the next one) when necessary.

This is the fundamental difference between the old and the new mindset. In the former, you tested extensively before going to production, believing you’d worked out all the kinks. When that delusion proved incorrect, you were left scrambling. With the new, you plan for failure, intentionally creating a retreat path to make failures a nonevent. This is empowering! And the impact on your goals, easier and faster deployments, and stability after you’re up and running is obvious and immediate.

First, if you eliminate traditional complex and time-consuming testing processes and instead go straight to production following basic integration testing, a great deal of time is cut from the cycle and, clearly, releases can occur more frequently. The release process is intentionally designed to encourage its use, requiring little ceremony to get started. And having the right safety nets in place allows you not only to avert disaster but also to return to a fully functional system in a matter of seconds.

When deployments come without ceremony and with greater frequency, you’re better able to address the failings of what you’re currently running in production, allowing you to maintain a more stable system as a whole.

Let’s talk a bit more about what that safety net looks like and, in particular, the role that the developer, architect, and application operators play in constructing it. There are three inextricably linked patterns:

  • Parallel deployments and versioned services
  • Generation of necessary telemetry
  • Flexible routing

In the past, a deployment of version n of some software was almost always a replacement of version n-1. In addition, the things we deployed were large pieces of software encompassing a wide range of capabilities, so when the unexpected happened, the results could be catastrophic. An entire mission-critical application could experience significant downtime, for example.

At the core of your safe deployment practices is parallel deployment. Instead of completely replacing one version of running software with a new version, you keep the known working version running as you add a new version to run alongside it. You start out with only a small portion of traffic routed to the new implementation, and you watch what happens. You can control which traffic is routed to the new implementation based on a variety of available criteria, such as where the requests are coming from (either geographically or what the referring page is, for example) or who the user is.
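
As an illustration only (the service names, the hashing scheme, and the 10% starting weight are my assumptions, not a prescription from the book), a sticky percentage-based split between the two running versions can be as simple as this:

```java
public class CanaryRouter {
    private final double newVersionWeight; // fraction of traffic sent to the new version

    public CanaryRouter(double newVersionWeight) {
        this.newVersionWeight = newVersionWeight;
    }

    /** Decide which deployed version should handle this request. */
    public String route(String userId) {
        // Hashing the user ID keeps the assignment sticky: the same user
        // consistently sees the same version for the duration of the experiment.
        int bucket = Math.floorMod(userId.hashCode(), 100);
        return bucket < newVersionWeight * 100
                ? "catalog-service-v2"   // the experimental deployment
                : "catalog-service-v1";  // the known working version
    }
}
```

You might start with `new CanaryRouter(0.10)` to send roughly 10% of users to the new version, raise the weight as the data builds confidence, or drop it back to zero to retreat.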

To assess whether the experiment is yielding positive results, you look at data. Is the implementation running without crashing? Has new latency been introduced? Have click-through rates increased or decreased?

If things are going well, you can continue to increase the load directed at the new implementation. If at any time things aren’t happy, you can shift all the traffic back to the previous version. This is the retreat path that allows you to experiment in production.


Figure 4. Data tells you how parallel deployments of multiple versions of your apps are operating. You use that data to program control flows to those apps, supporting safe rollouts of new software in production.

None of this can be done if proper software engineering disciplines are ignored or applications don’t embody the right architectural patterns. Some of the keys to enable this form of A/B testing are as follows:

  • Software artifacts must be versioned, and the versions must be visible to the routing mechanism to allow it to appropriately direct traffic. Further, because you’ll be analyzing data to determine whether the new deployment is stable and achieving the desired outcomes, all data must be associated with the appropriate version of the software in order to make the proper comparisons.
  • The data used to analyze how the new version is functioning takes a variety of forms. Some metrics are completely independent of any details of the implementation, for example, the latency between a request and a response. Other metrics begin to peer into the running processes, reporting on things such as the number of threads or the memory being consumed. And finally, domain-specific metrics, such as the average total purchase amount of an online transaction, may also be used to drive deployment decisions. Some of this data may be provided automatically by the environment in which the implementation is running, so you won’t have to write code to produce it, but the availability of data metrics is nonetheless a first-class concern: I want you to think about producing data that supports experimentation in production (see the telemetry sketch after this list).
  • Clearly, routing is a key enabler of parallel deployments, and the routing algorithms are pieces of software. Sometimes the algorithm is simple, such as sending a percentage of all the traffic to the new version, and the routing software “implementation” can be realized by configuring some of the components of your infrastructure. Other times you may want more-sophisticated routing logic and need to write code to realize it. For example, you may want to test some geographically localized optimizations and want to send requests only from within the same geography to the new version. Or perhaps you wish to expose a new feature only to your premium customers. Whether the responsibility for implementing the routing logic falls to the developer or is achieved via configuration of the execution environment, routing is a first-class concern for the developer.
  • Finally, something I’ve already hinted at is creating smaller units of deployment. Rather than a deployment encompassing a huge portion of your ecommerce system—for example, the catalog, search engine, image service, recommendation engine, shopping cart, and payment-processing module all in one—deployments should have a far smaller scope. You can easily imagine that a new release of the image service poses far less risk to the business than something that involves payment processing. Proper componentization of your applications—or as many would call it today, a microservices-based architecture—is directly linked to the operability of digital solutions.
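
Below is a minimal sketch of version-tagged telemetry (the metric name, version label, and output format are illustrative; in practice you would feed a metrics library or your platform’s monitoring pipeline rather than standard output). The point is simply that every data point carries the version that produced it, so the old and new deployments can be compared side by side.

```java
import java.util.function.Supplier;

public class RequestTelemetry {
    private final String appVersion; // the deployed version this process is running

    public RequestTelemetry(String appVersion) {
        this.appVersion = appVersion;
    }

    /** Times a unit of work and emits a latency metric tagged with the app version. */
    public <T> T timed(String operation, Supplier<T> work) {
        long start = System.nanoTime();
        try {
            return work.get();
        } finally {
            long micros = (System.nanoTime() - start) / 1_000;
            // Tagging every measurement with the version is what makes it possible
            // to compare the parallel deployments during a safe rollout.
            System.out.printf("metric=request.latency op=%s version=%s value_us=%d%n",
                    operation, appVersion, micros);
        }
    }
}

// Usage (illustrative):
// RequestTelemetry telemetry = new RequestTelemetry("catalog-service-v2");
// Catalog result = telemetry.timed("searchCatalog", () -> catalog.search(query));
```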

Although the platform your applications run on provides some of the necessary support for safe deployments, all four of these factors—versioning, metrics, routing, and componentization—are things that you, as a developer, must consider when you design and build your cloud native application. There’s more to cloud native software than these things (for example, designing bulkheads into your architecture to keep failures from cascading through the entire system), but these are some of the key enablers of safe deployments.

Cornelia Davis

Cornelia Davis is CTO of Weaveworks and an industry veteran with almost three decades of experience in image processing, scientific visualization, distributed systems and web application architectures, and cloud native platforms. She cut her teeth in the space of modern application platforms at Pivotal where she was on the teams that brought Pivotal Cloud, various data products, and Pivotal Container Service to market. Davis is the author of the book Cloud Native Patterns: Design Change-Tolerant Software. Follow her @cdavisafc.

