Part 6/7: Leadership in SRE: Building and Mentoring Resilient Teams
visual credit: Vipin Sharma (https://www.linkedin.com/in/vipinsharma-autopia/)

Ever wonder why some teams thrive under pressure while others struggle to hold it together? It’s not luck; it’s leadership.

In Site Reliability Engineering (SRE), resilient teams are the driving force behind reliable systems and scalable operations. For CXOs, this isn’t just about uptime; it’s about creating teams that drive business continuity, agility, and client trust. In this sixth installment of my SRE leadership series, I share how I mentored global teams to deliver a 248% increase in backup clients and achieve zero change failures over five years. Here’s what 16 years of leading infrastructure operations have taught me about building teams that excel under pressure.


Why Leadership Matters in SRE

SRE is as much about people as it is about systems. Resilient teams strike a balance between toil, innovation, and reliability, enabling organizations to scale without sacrificing stability. Leadership in this context means more than managing incidents. It’s about creating psychological safety, enabling ownership, and ensuring alignment with business goals.

As highlighted in Google’s SRE Book, psychological safety is critical to high-performing teams. My approach has been consistent: mentor proactively, empower teams to make decisions, and ensure every initiative maps to real business outcomes.


Building Resilient Teams

  • Foster a Blameless Culture: Outages are inevitable, but blame doesn’t fix them. I introduced blameless post-mortems using structured approaches like Six Thinking Hats, which shifted the focus from fault to solutions. This fostered trust and reduced repeat incidents.
  • Empower Through Automation: Toil is the enemy of innovation. By automating alerts and operational workflows, our teams saved over 100 man-hours each month. That capacity was redirected toward scaling operations, resulting in a 248% growth in backup clients within one year.
  • Set Clear Objectives: Every engineer knew how their work contributed to business outcomes. We aligned daily responsibilities with service-level objectives (e.g., 99.95% uptime for critical infrastructure), ensuring operational priorities translated to customer impact.
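
To make an availability objective like the 99.95% figure above concrete for engineers, it can help to translate it into an error budget, meaning the downtime the target still allows. The sketch below is illustrative only: the 30-day window and the `ErrorBudget` name are assumptions, not details from the work described in this article.

```python
from dataclasses import dataclass

MINUTES_PER_DAY = 24 * 60


@dataclass(frozen=True)
class ErrorBudget:
    """Hypothetical helper: converts an availability SLO into allowed downtime."""

    slo_percent: float  # availability target, e.g. 99.95
    window_days: int    # evaluation window (30 days is an assumption)

    @property
    def allowed_downtime_minutes(self) -> float:
        # Total minutes in the window times the fraction allowed to be "down".
        total_minutes = self.window_days * MINUTES_PER_DAY
        return total_minutes * (1 - self.slo_percent / 100)

    def remaining_minutes(self, downtime_minutes: float) -> float:
        # A negative result means the budget for this window is exhausted.
        return self.allowed_downtime_minutes - downtime_minutes


budget = ErrorBudget(slo_percent=99.95, window_days=30)
print(f"Allowed downtime: {budget.allowed_downtime_minutes:.1f} min")
print(f"Remaining after a 10-minute outage: {budget.remaining_minutes(10):.1f} min")
```

Under these assumptions, a 99.95% target leaves roughly 21.6 minutes of downtime per 30 days, a number a team can weigh against planned changes and recent incidents.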


Mentoring for Long-Term Success

  • Develop Technical Depth: I mentored cross-functional teams across India in scripting (PowerShell, Bash), capacity planning, and disaster recovery readiness. This helped us execute 20+ DR tests across 10 business units with zero escalations.
  • Cultivate Leadership Skills: Through 1:1 coaching, I supported engineers in resolving conflicts, leading projects, and stepping into decision-making roles. The result? Increased service quality and faster problem-solving across time zones.
  • Promote a Growth Mindset: Our post-mortems became continuous learning labs. Engineers came to expect—not fear—constructive feedback, which fostered accountability and strengthened team cohesion.


From Leadership to Business Impact

  • Zero Change Failures: Over five years, our team achieved a 100% success rate in change management—no failed or rolled-back changes—thanks to strong mentoring, rigorous planning, and peer reviews.
  • Operational Scalability: The 248% growth in backup clients was driven not by headcount but by aligning automation, SLOs, and talent development with evolving client needs.
  • Cost Optimization: Empowered engineers built dashboards to monitor backup success and storage trends, saving $40K annually through better resource utilization.
  • Global Synergy: By fostering collaboration across three continents, we scaled operations without compromising reliability, uptime, or team morale.
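
A change success rate like the one above is only meaningful if "success" is defined the same way every time. The sketch below counts a change as successful only when it neither failed nor was rolled back, matching the definition used in this article. The `ChangeRecord` fields and the sample data are hypothetical and are not taken from any real change-management tool.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ChangeRecord:
    """Hypothetical change ticket; field names are illustrative."""

    change_id: str
    failed: bool = False
    rolled_back: bool = False


def change_success_rate(records: Sequence[ChangeRecord]) -> float:
    """Return the percentage of changes that neither failed nor were rolled back."""
    if not records:
        raise ValueError("no change records to evaluate")
    successful = sum(1 for r in records if not (r.failed or r.rolled_back))
    return 100.0 * successful / len(records)


# Made-up sample data: one of four changes was rolled back.
sample = [
    ChangeRecord("CHG-001"),
    ChangeRecord("CHG-002"),
    ChangeRecord("CHG-003", rolled_back=True),
    ChangeRecord("CHG-004"),
]
print(f"Change success rate: {change_success_rate(sample):.1f}%")
```

Treating a rollback as a failure is the stricter choice; a team that counts clean rollbacks as successes would report a higher number for the same history.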


Key Takeaways for SRE Leaders

  • Lead by Example: Model blamelessness and accountability to build psychological safety.
  • Invest in People: Mentoring builds both confidence and capability.
  • Align with Business: Make sure every team metric supports a client or business outcome.
  • Celebrate Small Wins: Recognizing automation breakthroughs or DR test successes boosts morale and momentum.


Leadership in SRE is about building teams that outlast outages, scale with confidence, and deliver tangible business value.

What’s your top tip for building resilient teams? Share your thoughts in the comments—I’d love to hear from you.

🔁 Join me every Monday at 8:06 AM IST for more SRE leadership insights.

#SRE #Leadership #Reliability #TechLeadership #TeamBuilding #Resilience #SiteReliability

ROSHAAN MAHBUBANI

Private Banking Leader • Financial Strategist focused on Private Banking and Wealth Management

1mo

Thanks for sharing, Harsh

Harsh Ved

Strategic IT Operations Leader | Driving Digital Transformation, Automation & Resilience | $600K Revenue Growth | 16+ Years in Banking & Oil & Gas | Thought Leader in Infra & Cloud.

1mo

Thank you everyone for your feedback and DMs on the #SRE series. Please stay tuned for the concluding article next Monday, 30th June at 0806 hours IST.

Shreenivasa KM

🚀 Product Leader | AI & IoT Leader in Manufacturing | Driving 30% Faster Time-to-Market, 25% Uptime via Predictive Maintenance & Industry 4.0 Innovation🔸Director of Product | Connect for Insights 👇

1mo

Inspiring post, Harsh Ved! Resilient teams thrive when diplomatic leadership fosters trust and empowerment, like mentoring my community to join a subgroup, sparking 10x engagement. Reflecting on your 248% growth, how do you balance blameless cultures with high-stakes chaos? Let’s share insights to elevate resilience!

Subrata S.

Driving Business Impact through Data | Analytics Strategy & Engineering | Scalable Pipelines • Actionable Insights | PySpark • Snowflake • Power BI

1mo

Incredible insight, Harsh Ved! 👏 Leadership truly is the backbone of resilience, especially when navigating complexity at scale. Your results (248% client growth + 100% change success) speak volumes about the power of culture and coaching. Psychological safety isn’t optional; it’s foundational. When teams know they can take risks without blame, innovation and ownership thrive. Looking forward to reading your full article. Thanks for leading from the front in SRE and beyond! #TechLeadership #TeamBuilding #Resilience #SRE
