From Infrastructure to Application Availability: A Shift in Perspective

Relying on Infrastructure Instead of Application-Focused Availability [link to blog https://lnkd.in/gk5NQvpB]

From 1996 to 2009, application availability was widely treated as an infrastructure problem, so improving reliability meant making the infrastructure more resilient. Tandem NonStop symbolized the ideal of reliable infrastructure, and systems like the SGI Origin 2000, with its single system image and NUMA architecture, represented the peak of scalable computing. Yet in the mid-1990s, Gregory Pfister's book "In Search of Clusters" argued that building such infrastructure was too complex and advocated clustering instead. At the time, as distributed systems were first being deployed, the idea seemed absurd, and infrastructure vendors continued to focus on making single systems more resilient.

When the cloud emerged, infrastructure architects like myself viewed it skeptically because it offered no guaranteed availability. How would applications run on it? What we didn't appreciate was that software naturally migrates toward cheaper hardware, and new technologies arise to make that migration easier. For me, the pivotal moment came in 2009, when Cafeville was running on what was effectively a 1000-node cluster. The team combined various components with some critical innovations. This marked the beginning of an era in which availability shifted from a purely infrastructure concern to an application problem, because the infrastructure itself became less reliable.

My critique of vSphere and similar systems that aren't natively clustered is that they are inherently less reliable than what applications require. Consequently, application teams must write code that assumes infrastructure instability rather than depending on the system's reliability (a pattern sketched in code at the end of this post).

What do I mean by infrastructure instability? In the pre-cloud era, infrastructure was assumed to either work or fail outright. In the cloud era, uncertainty became acceptable as long as the application, its developers, and the operations team could identify what went wrong. The problem was that this tolerance increased the cost of maintaining and supporting applications and slowed development, because teams spent more time on infrastructure issues than on the applications themselves.

At Zynga, when my team provided a reliable infrastructure, team sizes decreased and the game teams' productivity increased. We ensured there was no ambiguity about how the infrastructure was performing; we provided guarantees. By saying that infrastructure needs to be more robust, I mean that it must ensure the application and its components are operational, that data is available, that no unexpected infrastructure changes have occurred, and that the system can recover if necessary. Nutanix's AHV is unique in doing exactly that.
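To make "coding for instability" concrete, here is a minimal sketch of the retry-with-backoff pattern application teams end up writing when the infrastructure offers no guarantees. The flaky dependency, the `fetch_balance` function, and the failure mode are illustrative assumptions, not any specific API:

```python
import random
import time

class TransientInfraError(Exception):
    """Raised when the infrastructure layer fails in a retryable way."""

def fetch_balance(account_id: str) -> int:
    """Stand-in for a call to a service on unreliable infrastructure."""
    if random.random() < 0.3:  # simulate a transient infrastructure fault
        raise TransientInfraError("upstream timeout")
    return 42

def call_with_retries(fn, *args, attempts=5, base_delay=0.1):
    """Retry a flaky call with exponential backoff and jitter.

    This is the burden described above: the application, not the
    infrastructure, absorbs the uncertainty about whether a call succeeded.
    """
    for attempt in range(attempts):
        try:
            return fn(*args)
        except TransientInfraError:
            if attempt == attempts - 1:
                raise  # give up; surface the failure to the caller
            # exponential backoff plus jitter to avoid thundering herds
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))

print(call_with_retries(fetch_balance, "acct-123"))
```

Every application team sitting on unreliable infrastructure rewrites some variant of this, which is exactly the duplicated effort a robust infrastructure layer eliminates.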

Nicholas Pace

CEO/CTO @ Latchfield | AI, Enterprise Technology

2w

Nicely said! Whether resiliency is built into the infrastructure or the application, the cost to achieve it must be paid *somewhere*. But investing in infrastructure resiliency can yield better ROI at scale: development teams can focus more of their time on delivering business value instead of operational stability, and separate application teams do not need to solve the same reliability engineering challenges over and over, since they have already been solved once for everyone at the infrastructure layer. While I do think approaches like microservice architecture have their place, it's all about using the right tool for the job. If one adopts microservices simply because the storage layer is unreliable, then microservices are being abused as a workaround for a flaw in the storage system, and all workarounds are intrinsically a form of technical debt.

Well put, Kostadis Roussos! The definition of infrastructure has also moved up the stack into infrastructure software like AHV and NKP. Slowly and steadily, even the most demanding infrastructure software, such as OLTP databases, is becoming ultra-robust, which enables application developers to demand ACID guarantees across multiple geographies: availability zones and regions.
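For a concrete feel of what that enables, here is a minimal sketch assuming a distributed SQL database (e.g. CockroachDB) that speaks the PostgreSQL wire protocol and replicates across AZs before acknowledging a commit; the DSN and schema below are hypothetical:

```python
import psycopg2

# Hypothetical connection to a distributed SQL cluster whose replicas
# span multiple AZs; the host, port, and database name are assumptions.
conn = psycopg2.connect("postgresql://app@db.example.internal:26257/bank")
conn.autocommit = False

try:
    with conn.cursor() as cur:
        # An ordinary ACID transaction; the database, not the application,
        # handles cross-AZ replication and consensus before the commit returns.
        cur.execute(
            "UPDATE accounts SET balance = balance - 100 WHERE id = %s",
            ("alice",),
        )
        cur.execute(
            "UPDATE accounts SET balance = balance + 100 WHERE id = %s",
            ("bob",),
        )
    conn.commit()
except Exception:
    conn.rollback()  # serialization conflicts are retryable; a real app would retry
    raise
finally:
    conn.close()
```

The point is that the application code looks like plain single-node SQL; the geographic durability lives below it.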
