In a world where data loss is unforgivable and even 99.9% system availability is considered low in many cases, the demands on resiliency services are increasing non-linearly. The meaning of resiliency is evolving, keeping pace with the digital transformation being witnessed across industries. Let us examine how resiliency technology is advancing in response to rising expectations.
The main goal of a resiliency strategy is to keep a business operational when IT systems fail. That can mean restoring business applications if a disaster strikes the primary data centre (via disaster recovery), recovering from software or hardware component failures (through high availability engineering) or retrieving lost data (through backup and restore). Let us look at the contemporary challenges and ensuing innovations in these three areas in turn.
Advances in Disaster Recovery: Two parameters are usually used to measure disaster recovery mechanisms – the time to fail over when primary operations are impacted (commonly called the recovery time objective, or RTO), and the amount of data loss suffered during an outage (the recovery point objective, or RPO). The goal of recovery design is to build a failover workflow that meets targets on these parameters at minimum cost.
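The trade-off between the two recovery parameters and cost can be sketched in a few lines of Python. This is only an illustration: the `RecoveryDesign` type, the design names and every figure in the usage note are hypothetical and do not come from any real product or price list.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class RecoveryDesign:
    """A candidate disaster recovery design (all values are illustrative)."""
    name: str
    failover_minutes: float   # expected time to fail over
    data_loss_minutes: float  # expected window of data lost in an outage
    annual_cost: float        # cost in arbitrary units


def cheapest_design(
    designs: Iterable[RecoveryDesign],
    max_failover_minutes: float,
    max_data_loss_minutes: float,
) -> Optional[RecoveryDesign]:
    """Return the lowest-cost design that meets both targets, or None."""
    feasible = [
        d for d in designs
        if d.failover_minutes <= max_failover_minutes
        and d.data_loss_minutes <= max_data_loss_minutes
    ]
    return min(feasible, key=lambda d: d.annual_cost, default=None)
```

For example, given a hypothetical cold standby (24 hours to fail over, 24 hours of data loss, cost 10), a warm standby (60 minutes, 15 minutes, cost 40) and an active-active design (1 minute, 0 minutes, cost 100), a target of 2 hours to fail over and 30 minutes of data loss selects the warm standby: the cold standby misses the targets and the active-active design over-delivers at a higher cost.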
One of the new challenges that resiliency technology is encountering pertains to a specific type of disaster: cyber attack. While most industries are sensitive to cyber attacks today, financial services and the government are especially vulnerable; they endeavour to build cyber protection and recovery into their systems by design. Cyber recovery technology differs from classical disaster recovery mechanisms mainly because the victim may discover a breach only days or even months after a successful attack, by which time both the primary and the replicated copies of the data may already have been compromised.
Cyber resiliency solutions adopt a three-pronged approach. First, rather than continuously replicating data to a secondary system, which is the classic strategy for engineering disaster recovery, cyber recovery solutions save copies of data at regular points in time. These periodic snapshots are stored in “immutable storage” that is “locked” to ensure tamper proofing; the snapshot frequency is determined by the amount of data loss that the business can tolerate in the event of a breach. Second, the backup environment is “air gapped” relative to the primary system; connectivity and access to the immutable secondary storage are enabled only when needed. Third, artificial intelligence based anomaly scanning techniques are used to detect any compromise and to trigger recovery to a clean copy in the event of a cyber intrusion.
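The recovery step of this approach can be illustrated with a minimal Python sketch. It assumes each snapshot already carries a clean-or-compromised verdict from an anomaly scan; the `Snapshot` type and both function names are hypothetical, and a real product would expose its own interfaces.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, Optional


@dataclass(frozen=True)
class Snapshot:
    """A point-in-time copy held in immutable storage (illustrative)."""
    taken_at: datetime
    clean: bool  # verdict of a hypothetical anomaly scan


def latest_clean_snapshot(
    snapshots: Iterable[Snapshot], detected_at: datetime
) -> Optional[Snapshot]:
    """Return the most recent clean snapshot taken before detection."""
    candidates = [
        s for s in snapshots if s.clean and s.taken_at <= detected_at
    ]
    return max(candidates, key=lambda s: s.taken_at, default=None)


def rollback_window(
    snapshots: Iterable[Snapshot], detected_at: datetime
) -> Optional[timedelta]:
    """Span of writes discarded by recovering to the latest clean copy."""
    snapshot = latest_clean_snapshot(snapshots, detected_at)
    if snapshot is None:
        return None  # no clean copy exists to recover to
    return detected_at - snapshot.taken_at
```

With snapshots every six hours, a compromise that starts at 13:00 and is detected at 09:00 the next day leaves the 12:00 snapshot as the latest clean copy, so 21 hours of writes are rolled back. A shorter snapshot interval moves the clean copy closer to the moment of compromise, but it cannot shorten the delay before the breach is detected.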
A related technology known as disaster recovery orchestration (DRO) is becoming mainstream today in response to the growing need to automate the recovery of complex environments within recovery objectives. Analytics is used to predict, at a granular level, the recovery metrics that can be achieved at any given time; workflows are used to automate test drills, both for regulatory compliance and to give the client confidence that the recovery technology will indeed ensure business continuity.
Modern Availability Engineering: Another challenge that a resiliency architect has to contend with is supporting system uptime guarantees regardless of software or hardware component failures. Building high availability is a function of assembling redundancies, sometimes called clustering. Factoring surplus componentry into your design to enable fallback during times of failure might be easy, especially in the world of cloud and infrastructure-as-code, but it is operationally expensive. So it is not viable to introduce redundancies in an ad hoc manner; they have to be carefully architected.
System uptime has to be modelled taking into account two major considerations – the time to fail over during recoverable faults and the time to repair unrecoverable outages. Under-engineering system availability is harmful both to corporate reputation and to profitability, since uptime slippage below the level committed in the service level agreement (SLA) can incur penalties. For example, stock trading systems are generally designed to operate at 99.99+% availability even in the face of market volatility and transaction surges. On the other hand, over-engineering availability (for example, designing for 99.99% uptime when the system requires 99.9%) results in wasted cost. This is because investment in availability yields diminishing returns – if it takes one unit of effort to create 90% system availability (about 36.5 days of downtime in a year), 2 units of effort are required to get to 99% (about 3.7 days of downtime in a year), 3 units to reach 99.9% (about 8.8 hours of downtime in a year), and so forth, with each further unit buying a smaller absolute reduction in downtime. It is thus important for architects to mathematically map their system’s redundancy model to the required uptime SLA and to any expected slippage penalty payout.
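The arithmetic behind these figures, and a first-cut mapping from a redundancy model to an uptime target, can be sketched in Python as follows. The redundancy formula is the textbook one for parallel components and rests on two loud assumptions: replicas fail independently of each other, and failover between them is instantaneous. Real clusters violate both to some degree, so treat the result as an upper bound. The function names are illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def annual_downtime_hours(availability: float) -> float:
    """Expected downtime per year for an availability such as 0.999."""
    return (1.0 - availability) * HOURS_PER_YEAR


def parallel_availability(component_availability: float, replicas: int) -> float:
    """Availability of redundant replicas when any single one suffices.

    Assumes independent failures and instantaneous failover.
    """
    return 1.0 - (1.0 - component_availability) ** replicas


def replicas_needed(component_availability: float, target: float) -> int:
    """Smallest replica count whose combined availability meets the target."""
    if not (0.0 < component_availability < 1.0 and 0.0 < target < 1.0):
        raise ValueError("availabilities must lie strictly between 0 and 1")
    count = 1
    while parallel_availability(component_availability, count) < target:
        count += 1
    return count
```

Under these assumptions, `annual_downtime_hours(0.999)` comes to roughly 8.76 hours, and two replicas of a 99% available component are enough to clear a 99.9% target. Because every added replica carries an operating cost, the smallest count that meets the SLA is the natural starting point for the design.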
Developments in Data Backup Solutions: Whenever a new hosting technology arrives, resiliency services are called upon to adapt and keep up. For example, container adoption is accelerating today, and so are the matching demands on resiliency. Legacy backup and restore mechanisms do not work well with containerized applications. Application-consistent backup and recovery across multiple containers, spanning stateless and stateful microservices, presents challenges that call for container-native backup solutions.
Before concluding, let us look at how cloud computing has injected software-definition into resiliency services. Hyperscalers often support multiple cloud regions within geographies, and multiple zones connected via low-latency network links within regions. Cloud services such as container platforms often have configurable multi-zone awareness built into them that guarantees high system availability; some such as data stores also support cross-region replicas that enable continuity in the face of a disaster that causes a region-wide outage.
Can a resilient design also lead to higher performance? With an innovative blueprint, you can achieve better performance as a by-product of resiliency. Consider an e-governance application on cloud where the data store has been deployed for region safety, with read replicas in two regions. While this provides resiliency when confronted with a regional data centre failure, it can also improve application response times during steady state if read requests are directed to the replica hosted in the region nearest to the requester.
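A minimal Python sketch of such read routing follows. It assumes the application can observe a latency estimate per region and knows which replicas are currently healthy; the function name and the region names in the usage note are hypothetical, and managed data stores typically offer their own routing features.

```python
from typing import AbstractSet, Mapping


def pick_read_replica(
    latency_ms: Mapping[str, float], healthy: AbstractSet[str]
) -> str:
    """Choose the healthy replica region with the lowest observed latency."""
    candidates = {
        region: latency
        for region, latency in latency_ms.items()
        if region in healthy
    }
    if not candidates:
        raise RuntimeError("no healthy read replica available")
    return min(candidates, key=lambda region: candidates[region])
```

With latencies of 12 ms to `region-a` and 48 ms to `region-b`, reads go to `region-a` during steady state. If `region-a` drops out of the healthy set, the same call returns `region-b`, so one routing rule serves both performance and resiliency.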
Resiliency is a function of what it seeks to protect, so it has been adapting along with the evolution of hosting technologies – from virtualization and cloud, to containers and edge computing. But resiliency is also a function of what it seeks to protect against – from natural disasters and human errors, to hacking and cyber attacks. Resiliency technology will continue to see paradigm shifts as it keeps up with the times.
Sreekrishnan Venkiteswaran, CTO, Kyndryl India