The Hard Work of Cloud Security

Travis McPeak

Secure AI Coding

Published Apr 18, 2025

What cloud security engineers actually do, and how to do it the right way

In any discipline of security, visibility is step 1. You can’t secure what you can’t see. Today, cloud visibility is largely a solved problem: cloud-native tools like AWS Config and Google Cloud Asset Inventory, and third-party solutions like Wiz, give detailed insight into deployed resources and how well they align with security requirements and best practices.

What do Cloud Security teams do after adopting a visibility solution? Their job is to effectively reduce risk and ensure compliance with the requirements of common standards such as CIS. Security teams have traditionally hesitated to enforce gates on deployment, so instead, they’re left with retroactive clean-up work.

Unfortunately, most of this clean-up work is laborious, slow, and risky. Projects to document public infrastructure, migrate away from IAM users, downsize roles, apply SCPs, or add backup/encryption to datastores take quarters or years. Often, the underlying issues aren’t cleaned up in a timely fashion, leading to yet another public cloud data breach. If not a breach, they frequently result in an embarrassing audit finding. Still other times, security teams trying to do the right thing end up breaking something (IMDSv2, anyone?) and losing trust with developers.

It is possible to broadly improve cloud infrastructure, but the work requires careful execution and tooling. In this post, we’ll analyze a few of these projects, the pitfalls, and how to make them successful.

Tagging

Resource tagging is generally helpful in a few cases:

Ownership – assign vulnerability tickets to the responsible team and coordinate necessary changes
Public exceptions – document anything that is supposed to be public, and why
FinOps – tracking and categorizing cloud spend

Starting our list is tagging. Tagging is a generally benign change, with less risk involved than others we’ll discuss in this post. The “hard” part of tagging projects ends up being reaching out to the suspected owners, collecting the necessary information from them, and getting them to actually make the changes. This project ends up being a lot of:

Send a Slack message/email to somebody and confirm if they’re the owner
Send the owner instructions about how to apply tags and what the tag standard is
Follow up repeatedly until the tags are applied
Track the progress and try to rally the hold-outs

The typical tools of the trade here are Slack, email, Jira, and Confluence. While tags are valuable for much other security and compliance work, they’re painful to roll out. Even successful organizations take many quarters to get this done, and only for a subset of their infrastructure.
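The bookkeeping side of a tagging project can be sketched in a few lines. The example below is a minimal illustration, not a real tool: the tag keys in `REQUIRED_TAGS`, the `suspected_owner` field, and the inventory records are all hypothetical stand-ins for whatever your visibility tool exports. It builds a per-contact outreach list and a compliance percentage for progress tracking.

```python
from collections import defaultdict

# Hypothetical tag standard; substitute your organization's required keys.
REQUIRED_TAGS = {"owner", "cost-center"}


def missing_tags(resource):
    """Return the required tag keys that are absent or empty on a resource."""
    tags = resource.get("tags", {})
    return sorted(key for key in REQUIRED_TAGS if not tags.get(key, "").strip())


def outreach_worklist(resources):
    """Group non-compliant resources by suspected owner for follow-up."""
    worklist = defaultdict(list)
    for resource in resources:
        gaps = missing_tags(resource)
        if gaps:
            # Resources with no suspected owner land in an "unclaimed" bucket.
            contact = resource.get("suspected_owner") or "unclaimed"
            worklist[contact].append((resource["id"], gaps))
    return dict(worklist)


def percent_compliant(resources):
    """Share of resources carrying every required tag, for progress reports."""
    if not resources:
        return 100.0
    done = sum(1 for r in resources if not missing_tags(r))
    return round(100.0 * done / len(resources), 1)


# Made-up inventory records for illustration only.
inventory = [
    {"id": "bucket-a", "tags": {"owner": "payments", "cost-center": "cc-12"}},
    {"id": "queue-b", "tags": {"owner": "search"}, "suspected_owner": "search"},
    {"id": "vm-c", "tags": {}, "suspected_owner": None},
]
print(outreach_worklist(inventory))
print(percent_compliant(inventory))
```

The "unclaimed" bucket is the list of infrastructure that nobody has claimed yet, which is where the harder decisions start.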

Sometimes, you’ll end up with a list of infrastructure that nobody claims ownership of. At this point, you can:

Give up
Escalate to a leader and let them delegate an owner
Start making (restorable) changes to shut it off, and wait for somebody to come screaming: they’re either the owner or know who is

Deploying SCP/Org Policies

We’ll cover SCPs in this blog, but Google’s Org Policies are very similar. Unlike tags, deploying SCPs can be a destructive change: if you get it wrong, you might break something. This project should involve careful coordination between the security team and the appropriate owner (hopefully you got them tagged previously!).

The process for a good SCP rollout involves:

Make a list of accounts where desired SCPs are not applied
Reach out to the owner of that account and inform them of the change and potential impact
Document any known carveouts (conditions that will apply to the SCP to limit what it applies to)
Craft the SCP (with conditions)
Coordinate a good time to deploy the SCP with the owner
Deploy it, and be ready to roll it back if needed
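As an illustration of crafting an SCP with a carveout, here is a minimal sketch that builds the policy document as a Python dictionary. It assumes the goal is to require IMDSv2 on new EC2 instances, using the `ec2:MetadataHttpTokens` condition key, and it exempts a list of role ARN patterns through `aws:PrincipalARN`. The function name and the example role ARN are hypothetical; your own carveouts will come from the conversations with account owners.

```python
import json


def build_imdsv2_scp(exempt_role_arns=()):
    """Build an SCP that denies launching instances without IMDSv2.

    Principals matching exempt_role_arns are carved out of the deny.
    """
    # Deny applies when the launch does not require IMDSv2 session tokens.
    condition = {"StringNotEquals": {"ec2:MetadataHttpTokens": "required"}}
    if exempt_role_arns:
        # Condition operators are ANDed, so exempt principals skip the deny.
        condition["ArnNotLike"] = {"aws:PrincipalARN": sorted(exempt_role_arns)}
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "RequireImdsV2",
                "Effect": "Deny",
                "Action": "ec2:RunInstances",
                "Resource": "arn:aws:ec2:*:*:instance/*",
                "Condition": condition,
            }
        ],
    }


# Hypothetical carveout for a legacy role that still needs IMDSv1.
policy = build_imdsv2_scp(["arn:aws:iam::*:role/legacy-batch"])
print(json.dumps(policy, indent=2))
```

Generating the document from a list of documented carveouts keeps the exceptions reviewable in one place instead of scattered across hand-edited JSON.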

Security teams should typically roll out SCPs themselves (vs. giving users instructions) because of the type of permissions required to deploy an SCP. If you’re updating an existing SCP, you should store the previous version so you can quickly roll back if required. Coordinating the timing with the user is important. You’ll both want to do this at a time when people are around to roll back AND you aren’t at peak time for whatever your business does.
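Storing previous versions for rollback can be as simple as keeping an ordered history per policy. The sketch below is an in-memory illustration under that assumption; the `PolicyHistory` class and the policy ID are hypothetical, and a real rollout would persist the history somewhere durable and push the restored document back through the AWS Organizations API.

```python
import copy


class PolicyHistory:
    """Keep every deployed version of a policy so the last one can be restored."""

    def __init__(self):
        self._versions = {}

    def record(self, policy_id, document):
        """Store a copy of the document at deploy time."""
        self._versions.setdefault(policy_id, []).append(copy.deepcopy(document))

    def current(self, policy_id):
        """Return the most recently recorded version."""
        return copy.deepcopy(self._versions[policy_id][-1])

    def roll_back(self, policy_id):
        """Discard the latest version and return the one before it."""
        versions = self._versions[policy_id]
        if len(versions) < 2:
            raise ValueError("no earlier version to roll back to")
        versions.pop()
        return copy.deepcopy(versions[-1])


history = PolicyHistory()
history.record("p-example", {"Statement": []})  # version before the change
history.record("p-example", {"Statement": [{"Effect": "Deny"}]})  # new version
restored = history.roll_back("p-example")
print(restored)
```

Recording the document before every update means a rollback is a lookup, not a reconstruction from memory during an incident.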

Removing IAM users

Rounding out our list is removing IAM users. IAM users (and their often associated static keys) present a huge risk to organizations using the cloud, and these findings often top lists of cloud breach vectors. Unfortunately, some legacy vendor tooling still doesn’t support IAM role assumption, so some IAM users and keys will remain necessary and will need an exception granted.

A successful project to clean up IAM users involves:

Making the list of users and associated owners
For each owner, document a reason why the IAM user exists
Select a time to make changes with lower business impact
Select a strategy to neutralize the key while allowing for rollback
Apply the changes gradually and document the project
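One way to neutralize a key while allowing for rollback is to deactivate it rather than delete it, since an inactive key can be switched back on. The sketch below only produces the plan; it does not call AWS. The record fields, the 90-day idle threshold, and the exception set are assumptions for illustration, and the actual status change would go through IAM (for example, the `update_access_key` call in boto3).

```python
from datetime import date, timedelta


def plan_key_actions(keys, today, idle_days=90, exceptions=frozenset()):
    """Decide what to do with each access key, preferring reversible steps."""
    cutoff = today - timedelta(days=idle_days)
    plan = {}
    for key in keys:
        key_id = key["access_key_id"]
        if key["user"] in exceptions:
            # Documented exception, e.g. vendor tooling without role support.
            plan[key_id] = "keep (documented exception)"
        elif key["last_used"] is None or key["last_used"] < cutoff:
            # Idle keys are deactivated, not deleted, so they can be restored.
            plan[key_id] = "deactivate"
        else:
            # Recently used keys need a migration conversation first.
            plan[key_id] = "contact owner"
    return plan


# Made-up key records for illustration only.
keys = [
    {"access_key_id": "key-1", "user": "ci-deploy", "last_used": date(2024, 11, 1)},
    {"access_key_id": "key-2", "user": "reporting", "last_used": date(2025, 4, 10)},
    {"access_key_id": "key-3", "user": "vendor-sync", "last_used": date(2025, 4, 15)},
    {"access_key_id": "key-4", "user": "old-script", "last_used": None},
]
plan = plan_key_actions(keys, today=date(2025, 4, 18), exceptions={"vendor-sync"})
print(plan)
```

Working through the idle keys first gives early, low-risk progress while the owners of active keys plan their migration.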

Lessons learned from Repokid

At Netflix, I ran a project and tool called Repokid, which used IAM Access Advisor and CloudTrail data to downsize IAM roles to only the used permissions. The project was effective in removing 59% of permissions across our AWS accounts, and is still running today.

Our goals were:

Limit developer action required – developers shouldn’t have to spend time on IAM least privilege; they just get the role permissions they need
Limit disruption to environments – we didn’t want to cause outages
Enable fast restoration – sometimes we’d remove permissions from a role that hadn’t been used in a while, but the owner would need those permissions back later. We had to account for this.
Tell developers what we were doing and when
Allow developers to opt-out at any point
Reporting – we wanted to automatically gather progress metrics regularly so we could report them to our stakeholders
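To make these goals concrete, here is a minimal sketch of the decision logic they imply. It is not Repokid's actual implementation: the role record, the `opt_out` flag, the 90-day threshold, and the service-level last-used dates (loosely modeled on what Access Advisor reports) are assumptions for illustration.

```python
from datetime import date, timedelta


def plan_role_downsize(role, service_last_used, today, unused_days=90):
    """Split a role's permissions into keep and remove, honoring opt-outs."""
    if role.get("opt_out"):
        # Opted-out roles are left untouched.
        return {"keep": sorted(role["permissions"]), "remove": [], "snapshot": None}
    cutoff = today - timedelta(days=unused_days)
    keep, remove = [], []
    for permission in sorted(role["permissions"]):
        service = permission.split(":", 1)[0]
        last_used = service_last_used.get(service)
        if last_used is None or last_used < cutoff:
            remove.append(permission)
        else:
            keep.append(permission)
    # The snapshot of the original permissions enables fast restoration.
    return {"keep": keep, "remove": remove, "snapshot": sorted(role["permissions"])}


# Made-up role and usage data for illustration only.
role = {
    "name": "example-service-role",
    "permissions": ["s3:GetObject", "s3:PutObject", "sqs:SendMessage", "dynamodb:GetItem"],
}
usage = {"s3": date(2025, 4, 1), "sqs": date(2024, 10, 1)}
result = plan_role_downsize(role, usage, today=date(2025, 4, 18))
percent_removed = 100 * len(result["remove"]) / len(role["permissions"])
print(result["remove"], percent_removed)
```

Returning the snapshot alongside the plan keeps restoration and reporting tied to the same record as the removal itself.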

We built all of these functions into Repokid, which enabled the project to succeed. Building these features was time-consuming, and we had a few advantages that many organizations wouldn’t for the projects I mentioned above:

It only had to work for IAM role permissions
It made direct cloud API calls; it didn’t have to work with IaC
We didn’t have to collect user input – users got an email notifying them we were making changes and giving them an option to opt-out
We didn’t have to coordinate changes at a specific time, because the permissions we were removing had been unused for at least a quarter

Most organizations have to build this kind of tooling and process themselves, and also do the hard work of manual outreach and coordination, which is why not enough of these projects get completed. In the rare cases where they succeed, they take a long time and visible progress is slow.

At Resourcely, we’re building the suite of tools I wish I had to help Repokid be successful, including:

Coordinating with developers, in their preferred communication method
Managing the pipeline of changes
Applying changes at the right location (IaC, cloud API)
Enabling fast and effective rollback
Continuous progress tracking/metrics
Opt-outs/exception tracking

Thanks for reading this far! I would love to compare notes if you’re working on these projects. You can reach me on LinkedIn or Twitter, or by email.

