From Infrastructure to Application Availability: A Shift in Perspective

Relying on Infrastructure Instead of Application-Focused Availability [link to blog https://lnkd.in/gk5NQvpB]

From 1996 to 2009, application availability was widely treated as an infrastructure problem, so improving reliability meant making the infrastructure more resilient. Tandem NonStop symbolized the ideal of reliable infrastructure, and systems like the SGI Origin 2000, with its single system image and NUMA architecture, represented the peak of scalable computing. Yet in the mid-1990s, Gregory Pfister's book "In Search of Clusters" argued that building such infrastructure was too complex and advocated clustering instead. At the time, as distributed systems were first being deployed, the idea seemed absurd, and infrastructure vendors continued to focus on making single systems more resilient.

When the cloud emerged, infrastructure architects like myself viewed it skeptically because it offered no guaranteed availability. How would applications run on it? What we didn't appreciate was that software naturally migrates toward cheaper hardware, and new technologies arise to make that migration easier. For me, the pivotal moment came in 2009, when Cafeville was running on what was effectively a 1000-node cluster. The team combined various components with some critical innovations. This marked the beginning of an era in which availability shifted from a purely infrastructure concern to an application problem, because the infrastructure itself became less reliable.

My critique of vSphere and similar systems that aren't natively clustered is that they are inherently less reliable than what applications require. Consequently, application teams must write code that assumes infrastructure instability rather than depending on the system's reliability (a pattern sketched in code at the end of this post).

What do I mean by infrastructure instability? In the pre-cloud era, infrastructure was assumed to either work or fail outright. In the cloud era, uncertainty became acceptable as long as the application, its developers, and the operations team could identify what went wrong. The problem was that this tolerance increased the cost of maintaining and supporting applications and slowed development, because teams spent more time on infrastructure issues than on the applications themselves.

At Zynga, when my team provided a reliable infrastructure, team sizes decreased and the game teams' productivity increased. We ensured there was no ambiguity about how the infrastructure was performing; we provided guarantees. By saying that infrastructure needs to be more robust, I mean that it must ensure the application and its components are operational, that data is available, that no unexpected infrastructure changes have occurred, and that the system can recover if necessary. Nutanix's AHV is unique in doing exactly that.
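To make "coding for instability" concrete, here is a minimal sketch of the retry-with-backoff pattern application teams end up writing when the infrastructure offers no guarantees. The flaky dependency, the `fetch_balance` function, and the failure mode are illustrative assumptions, not any specific API:

```python
import random
import time

class TransientInfraError(Exception):
    """Raised when the infrastructure layer fails in a retryable way."""

def fetch_balance(account_id: str) -> int:
    """Stand-in for a call to a service on unreliable infrastructure."""
    if random.random() < 0.3:  # simulate a transient infrastructure fault
        raise TransientInfraError("upstream timeout")
    return 42

def call_with_retries(fn, *args, attempts=5, base_delay=0.1):
    """Retry a flaky call with exponential backoff and jitter.

    This is the burden described above: the application, not the
    infrastructure, absorbs the uncertainty about whether a call succeeded.
    """
    for attempt in range(attempts):
        try:
            return fn(*args)
        except TransientInfraError:
            if attempt == attempts - 1:
                raise  # give up; surface the failure to the caller
            # exponential backoff plus jitter to avoid thundering herds
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))

print(call_with_retries(fetch_balance, "acct-123"))
```

Every application team sitting on unreliable infrastructure rewrites some variant of this, which is exactly the duplicated effort a robust infrastructure layer eliminates.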

Nicholas Pace

CEO/CTO @ Latchfield | AI, Enterprise Technology

2w

Nicely said! Whether resiliency is built into the infrastructure or the application, the cost to achieve it must be paid *somewhere*. But investing in infrastructure resiliency can yield better ROI at scale: development teams can focus more of their time on delivering business value instead of operational stability, and separate application teams do not need to solve the same reliability engineering challenges over and over, since they have already been solved once for everyone at the infrastructure layer. While I do think approaches like microservice architecture have their place, it's all about using the right tool for the job. If one adopts microservices simply because the storage layer is unreliable, then microservices are being abused as a workaround for a flaw in the storage system, and all workarounds are intrinsically a form of technical debt.

Well put, Kostadis Roussos! The definition of infrastructure has also moved up the stack into infrastructure software like AHV and NKP. Slowly and steadily, even the most demanding infrastructure software, such as OLTP databases, is becoming ultra-robust, which enables application developers to demand ACID guarantees across multiple geographies: availability zones and regions.
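For a concrete feel of what that enables, here is a minimal sketch assuming a distributed SQL database (e.g. CockroachDB) that speaks the PostgreSQL wire protocol and replicates across AZs before acknowledging a commit; the DSN and schema below are hypothetical:

```python
import psycopg2

# Hypothetical connection to a distributed SQL cluster whose replicas
# span multiple AZs; the host, port, and database name are assumptions.
conn = psycopg2.connect("postgresql://app@db.example.internal:26257/bank")
conn.autocommit = False

try:
    with conn.cursor() as cur:
        # An ordinary ACID transaction; the database, not the application,
        # handles cross-AZ replication and consensus before the commit returns.
        cur.execute(
            "UPDATE accounts SET balance = balance - 100 WHERE id = %s",
            ("alice",),
        )
        cur.execute(
            "UPDATE accounts SET balance = balance + 100 WHERE id = %s",
            ("bob",),
        )
    conn.commit()
except Exception:
    conn.rollback()  # serialization conflicts are retryable; a real app would retry
    raise
finally:
    conn.close()
```

The point is that the application code looks like plain single-node SQL; the geographic durability lives below it.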
