Security Fundamentals for Scalable Systems
Junior and mid-level engineers often encounter a gap between knowing basic security concepts and applying them to complex distributed systems. This article bridges that gap with practical mental models and real-world examples, focusing on authentication, authorisation, and security practices at scale. We’ll cover common protocols and tools, highlight trade-offs (there are no silver bullets!), and give you clear takeaways to apply at work right away.
Authentication vs Authorisation
Authentication is about verifying who you are, while authorisation is about what you’re allowed to do. It’s a classic mix-up: authentication checks identity (e.g. login with a password or key), and authorisation checks permissions (e.g. are you allowed to access a resource). A quick analogy: hiring a pet sitter. You give them a house key to prove their identity and let them in (authentication), but you only permit them to feed the pets and maybe enter the kitchen – not snoop through bedrooms (authorisation). In other words, having the “key” to get in doesn’t automatically grant unlimited freedom inside.
Common protocols and standards help implement authentication and authorisation in distributed systems:
OAuth 2.0: An authorisation framework widely used for granting third-party apps limited access to an API on a user’s behalf. (For example, “Login with Google” uses OAuth 2.0 under the hood – together with OpenID Connect, described next – to let you grant another app access to your Google identity).
OpenID Connect (OIDC): An identity layer on top of OAuth 2.0 that provides authentication. OIDC issues ID tokens (JSON Web Tokens) containing user identity – this is how you actually “log in” using OAuth protocols.
Understanding the difference between authentication and authorisation is crucial before diving into implementation. Always authenticate first, then authorise – e.g. verify the user’s identity, then check their roles/permissions for each action.
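To make the ordering concrete, here is a minimal sketch of a request handler that authenticates first and then authorises. The in-memory token and permission stores are hypothetical stand-ins, not any real framework’s API:

```python
# Hypothetical stores: token -> user (authentication), user -> allowed actions (authorisation).
TOKENS = {"tok-abc": "alice"}
PERMISSIONS = {"alice": {"feed_pets"}}

def handle_request(token: str, action: str) -> int:
    user = TOKENS.get(token)
    if user is None:
        return 401  # authentication failed: we don't know who you are
    if action not in PERMISSIONS.get(user, set()):
        return 403  # authenticated, but this identity may not perform this action
    return 200      # identity verified AND action permitted
```

The 401 vs 403 status codes mirror the distinction exactly: unknown identity versus known identity with insufficient permissions.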
Token-Based Authentication (Stateless Sessions)
Modern distributed systems favour token-based authentication over traditional server sessions. Instead of storing session state on the server (and using “sticky sessions” to tie a user to one server), the server issues a token to the client after login. This token (often a JWT or a random opaque token) is included with each request, and it identifies and authenticates the user in a stateless way.
Why tokens? Scalability. Any server instance can validate a token without needing in-memory session data, avoiding the need for users to always hit the same server. A JSON Web Token (JWT), for example, is self-contained: it includes user info and a signature. As a result, services can verify the token using a public key, with no database lookup, making distributed architectures more efficient. In other words, the token itself carries the session info.
JWT Structure: A JWT is composed of three parts – a header, payload, and signature. The header identifies the token type and algorithm; the payload contains claims (like user ID and expiry); the signature (created with the auth server’s secret or private key) lets any resource server verify the token’s integrity. Because the token is self-contained and signed, any microservice can trust it without a central session store. However, note that JWTs are not encrypted by default – the payload is simply Base64-encoded, so if it contains sensitive data, an attacker who steals the token can read it. Always use HTTPS to protect tokens in transit, and avoid putting secrets or PII in a JWT payload.
Security risks with tokens: Be aware of token leakage and replay attacks. Leakage can occur if tokens are stored insecurely (e.g. in localStorage where malicious scripts can snatch them). Always store tokens securely (for web apps, many prefer HTTP-only cookies to mitigate XSS). Replay attacks are when an attacker intercepts or steals a valid token and reuses it to impersonate the user. Because tokens are bearer credentials, possession of a valid token = access. Even a signed JWT is not immune: if intercepted, it can be replayed until it expires. Defences include enforcing TLS everywhere (to prevent eavesdropping) and using short token lifetimes or one-time tokens. In high-security scenarios, you might add token binding to detect replays, though this adds complexity. A practical approach is to keep access tokens short-lived (minutes) and refresh them with additional checks (refresh tokens can be stored more securely and revoked if misuse is detected).
API Security at Scale
Exposing APIs at scale (whether publicly or for internal microservice use) introduces new challenges. You must secure every request and assume clients (or even internal services) could be hostile if compromised.
Public vs Internal APIs: Public-facing APIs (e.g. an open web service) are at the frontline – they must handle internet traffic, which means strict validation, authentication, rate limiting, and abuse detection. Internal service-to-service APIs (within your microservice architecture) often operate in a trusted network zone, but don’t get complacent. Adopting a Zero Trust mindset (more on that later) means treating internal calls with healthy scepticism too – authenticate and authorise everything, even behind the firewall. For instance, internal APIs might use mutual TLS or signed tokens for service auth, so that one compromised service can’t freely call others.
API Gateway and Rate Limiting: As systems grow, an API gateway is commonly used as the single entry point for external requests. The gateway can offload cross-cutting concerns like request routing, authentication, quotas, and monitoring. At Uber, for example, a central API gateway handles routing, rate limiting, header propagation, and even blocking abusive clients, acting as a first line of defence. Rate limiting protects against brute-force attacks and abuse by capping how many requests a client can make in a time window. This is crucial for public APIs to prevent denial-of-service or credential stuffing attacks. Internal APIs may also need rate limits to contain malfunctioning clients or stop cascading failures. An API gateway or service mesh can enforce these policies globally.
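A common rate-limiting strategy is the token bucket: clients can burst up to a capacity, then are throttled to a steady refill rate. This single-process sketch shows the core logic; a real gateway would keep the bucket state in a shared store (e.g. Redis) so every gateway instance enforces the same limit:

```python
import time

class TokenBucket:
    """Allow bursts up to `capacity` requests, refilling at `rate` tokens/second."""

    def __init__(self, rate: float, capacity: int):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # over the limit -> respond with HTTP 429
```

In practice you keep one bucket per client key (API key, user ID, or IP), and reject over-limit requests with HTTP 429 Too Many Requests.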
Common attack vectors: Be mindful of the OWASP Top 10 (and OWASP API Security Top 10) vulnerabilities. For instance, injection attacks (SQL injection, command injection) occur when malicious data is sent as part of inputs (query params, JSON bodies, etc.), tricking the system into executing unintended commands. Prevent this by sanitising and validating all inputs server-side – never trust client data, even if it’s an internal service call. Broken authentication is another risk – if you fail to properly verify tokens or passwords, attackers can slip in. Always validate JWT signatures and token expirations, and use strong password policies for any direct logins. Broken authorisation (or “broken access control”) is when users can act beyond their privileges – e.g., accessing others’ data by tweaking a URL. Guard against this by rigorously checking user scopes/roles on every request. Other issues include misconfigured security headers, unrestricted file uploads, and so on – but the key is to build a habit of defensive programming and regular security reviews for your APIs. Use libraries and frameworks that provide security features out of the box (for example, many web frameworks and ORMs parameterise SQL queries for you, or ship authentication middleware – use them!).
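The standard defence against SQL injection is parameterized queries: the value travels separately from the SQL text, so the database driver never interprets user input as code. A minimal sketch with Python’s built-in sqlite3 (the table and data are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "alice"), (2, "bob")])

def find_user(name: str):
    # UNSAFE alternative: f"SELECT ... WHERE name = '{name}'" -- an input like
    # "' OR '1'='1" would match every row. The ? placeholder passes the value
    # out-of-band, so it is always treated as data, never as SQL.
    return conn.execute(
        "SELECT id, name FROM users WHERE name = ?", (name,)
    ).fetchall()
```

The classic injection payload simply matches nothing, because it is compared literally against the name column.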
Secure coding and testing: At scale, even a small vulnerability can be magnified. Incorporate security testing (like pen-tests or automated scans) into your development cycle. For public APIs, consider an external bug bounty program to catch issues early. Internally, run threat modelling sessions: think like an attacker about how your API or microservice could be misused, and put mitigations in place. Remember, defence in depth – don’t rely on just one barrier. Even if your API gateway authenticates a user, your downstream service should still enforce authorisation checks, for example, in case someone finds a way around the gateway.
Secure Secrets Management
In a distributed system, there are many “secrets” – database passwords, API keys, encryption keys, tokens for third-party services, etc. Never hardcode secrets in code or config files. Hardcoding secrets (or embedding them in container images) is like hiding a house key under the doormat – it’s the first place attackers will check. Those secrets can end up in source control, logs, or container registries and leak to adversaries. Instead, use dedicated secret management solutions and environment variables. In fact, AWS best practices explicitly state: don’t hardcode secrets in source – use AWS Secrets Manager or Parameter Store to inject them securely at runtime.
Popular tools for secrets management include:
AWS Secrets Manager / Parameter Store: Managed services that store secrets encrypted, and provide them to your applications on request (with IAM access control). They can auto-rotate certain secrets (like database credentials) seamlessly.
HashiCorp Vault: A powerful open-source tool that acts as a central vault for secrets. Vault can dynamically generate secrets (e.g. create a temporary DB user on the fly) so that credentials are short-lived. It has a robust access control policy framework and audit logging.
Kubernetes Secrets: If you’re using k8s, it has a built-in secrets object. These are base64-encoded by default (not true encryption), but you can integrate KMS plugins for encryption at rest. K8s Secrets make it easy to decouple secret data from your container images and inject via environment vars or volumes.
Good secrets hygiene also means rotation and least privilege. Rotation ensures that even if a secret leaks, it’s only valid for a limited time. For example, rotate API keys periodically or immediately if someone who knew the secret leaves the team. Least privilege means each service or component should only have the secrets it truly needs – e.g. your payment microservice doesn’t need access to the marketing API keys. By limiting exposure, a breached component won’t expose everything.
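A small sketch of the “inject at runtime, never hardcode” pattern. Here an environment variable stands in for the secret backend; in production the same function would call the AWS Secrets Manager or Vault client instead, and failing loudly on a missing secret is deliberate:

```python
import os

def get_secret(name: str) -> str:
    """Fetch a secret injected at runtime.

    Env var used here as a stand-in; swap the lookup for a Secrets Manager /
    Vault client call in production. Never fall back to a hardcoded default.
    """
    value = os.environ.get(name)
    if value is None:
        raise RuntimeError(f"secret {name} is not configured")
    return value

# Usage: db_password = get_secret("DB_PASSWORD")
```

Centralising access behind one function also gives you a single place to add caching, audit logging, or rotation-aware refresh later.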
Scaling Session Management
When your user base and server pool scale up, session management can become a headache. Traditional session management (storing a user’s login session in memory on one server) doesn’t scale well – if that server dies or the user’s next request goes to a different server, the session is lost. One naive fix is “sticky sessions,” where the load balancer always sends a given user to the same server that holds their session. This introduces statefulness that hurts resilience and flexibility: if that server is overwhelmed or down, the user is out of luck, and scaling out isn’t as effective because sessions are siloed.
A better approach is to externalise or eliminate session state:
Distributed session store: Use an external in-memory datastore like Redis or Memcached to hold session data, accessible by all server instances. For example, many web apps use Redis to store session objects; it’s fast and allows horizontal scaling because any server can fetch/update session info from Redis, rather than keeping it locally. This way, you can add servers at will – they all consult the same session store. (Make sure to replicate or persist the store as needed so you don’t lose sessions on a single node failure.)
Stateless tokens: As discussed earlier, you can avoid server-stored sessions entirely by using JWTs or similar tokens that the client holds. The server just validates the token each time. This is the stateless architecture approach, which significantly simplifies scaling. In fact, modern microservice and cloud architectures favour stateless services because they make scaling and failover almost seamless. As one illustration, consider an e-commerce site on Black Friday: if it kept sessions in memory, each server would hold thousands of sessions and a crash could drop those users. In a stateless design, any server can handle any user’s next request without warm-up, because the session is not stored locally at all. Externalising state (either in a token or a distributed cache) dramatically improves reliability and scalability.
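The distributed-session-store approach boils down to a shared, TTL-aware key-value interface that every server instance talks to. In this sketch a plain dict stands in for Redis; with Redis you would use SETEX/GET against a shared server and get expiry for free:

```python
import time

class SessionStore:
    """Minimal external session store. A dict stands in for Redis here --
    the point is that ALL app instances read and write the same store,
    so any server can handle any user's next request."""

    def __init__(self):
        self._data = {}  # session_id -> (user, expiry timestamp)

    def put(self, session_id: str, user: str, ttl: int = 3600) -> None:
        self._data[session_id] = (user, time.time() + ttl)

    def get(self, session_id: str):
        entry = self._data.get(session_id)
        if entry is None:
            return None
        user, expires = entry
        if time.time() > expires:      # lazily expire stale sessions
            del self._data[session_id]
            return None
        return user

store = SessionStore()  # shared by every server instance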
Be mindful of sticky session pitfalls if you’re using them. They can cause uneven load (some servers tied to heavy users become hot) and complicate rolling deployments (you can’t freely shift users around during an upgrade). If you must maintain state, lean towards a distributed cache or database approach rather than affinity. Many frameworks have drop-in support to use Redis for session backing – use those rather than reinventing.
Finally, consider the stateless philosophy even for internal service interactions – for example, rather than one service storing data expecting the same instance to handle the next request, design services to be idempotent or to store any needed context in a shared DB or cache. This makes horizontal scaling and auto-recovery much easier.
Distributed System Attack Surfaces
A distributed system (hundreds of microservices, multiple data stores, etc.) greatly expands the attack surface compared to a monolith. With many moving parts, you have more things to secure. Some considerations:
More servers and containers: Each service runs on its own host or container. That means more OS instances and images to keep patched and up-to-date. An unpatched vulnerability on any one machine could be a way in for attackers.
Service-to-service communication: Calls between microservices (over APIs, message queues, etc.) are happening constantly. Without proper protections, attackers could eavesdrop or tamper with these messages. Encrypt internal traffic (e.g. use HTTPS or secure gRPC for all service calls) so that even behind the firewall, data isn’t in the clear. This also helps prevent man-in-the-middle attacks if someone breaches your internal network.
Authentication everywhere: In a sprawling microservice system, a common mistake is to authenticate at the edge (user-facing service) but then assume everything internal can be trusted. If an attacker penetrates one service, they might move laterally. Adopting a Zero Trust model helps: “Trust nothing, verify everything”. Concretely, this could mean using mutual TLS between services, or having internal services require signed tokens or API keys from their callers, even if they’re sister services.
Network segmentation and firewalls: Use cloud VPCs, security groups, or Kubernetes network policies to limit which services can talk to which. If Service A never needs to call Service B, enforce that at the network level. This way, if A is compromised, it can’t freely scan or attack others. Sam Newman describes using tools like Project Calico or similar to enforce that only expected traffic flows are allowed – an attacker finds it much harder to pivot across your system.
Defence in depth: No single security measure is foolproof. Embrace multiple layers – defence in depth – so that if one layer fails, others still protect you. For example, you might rely on OAuth tokens for user auth (first layer), but also have each microservice check a service-level ACL or role (second layer), and also ensure the database only returns rows that belong to that user (third layer). An attacker would need to defeat all layers to get data they shouldn’t. Microservices actually enable this: you can harden critical services separately (extra auth checks, stricter network rules) while maybe relaxing on low-risk services. Prioritise securing the most sensitive assets with multiple controls.
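The three layers described above can be sketched as independent checks in one request path. All names here are hypothetical; the point is that bypassing the token check alone still leaves the role check and the row-ownership filter standing:

```python
# Hypothetical stores for illustration only.
VALID_TOKENS = {"tok-1": "alice"}                      # layer 1: who is calling?
ROLES = {"alice": "member"}                            # layer 2: what role do they hold?
ORDERS = [{"id": 1, "owner": "alice"},                 # layer 3: which rows are theirs?
          {"id": 2, "owner": "bob"}]

def get_orders(token: str):
    user = VALID_TOKENS.get(token)                     # layer 1: authenticate the token
    if user is None:
        raise PermissionError("unauthenticated")
    if ROLES.get(user) not in ("member", "admin"):     # layer 2: service-level role check
        raise PermissionError("forbidden")
    # layer 3: the data layer only ever returns rows owned by the caller
    return [order for order in ORDERS if order["owner"] == user]
```

Even if an attacker reached this service with a forged identity, they would still only see rows whose owner matches that identity, never the whole table.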
Finally, plan for failure. Assume that someday an attack will get through one layer. Have an incident response plan (isolate affected services, rotate keys, etc.) and practice it. Security at scale is as much about resilience as prevention.
Auditing and Monitoring
Security isn’t just about prevention – it’s also about detection and response. This is where observability (logs, metrics, tracing) becomes a security asset. Aim for a system where you can answer “Who did what, when, and where?” easily from your logs. We discussed this in more depth in last week’s article, but here are some quick tips for completeness.
Audit logs: An audit trail records user actions and important system events. For example, whenever an admin changes a setting or a user requests sensitive data, log it (with user ID, timestamp, IP, etc.). These logs should be stored centrally and securely (to prevent tampering) and retained for a long period. Unlike debugging logs that you might rotate out after a week, audit logs might need to be kept for months or years. Why? A breach might only be discovered long after it happened, and you’ll want the historical logs to investigate. As Sam Newman notes, developers typically keep logs for a few weeks for debugging, but intrusion detection might require months-old logs. Plan your log retention and storage accordingly (solutions like the ELK stack or cloud log archives are common for this).
Monitoring and alerts: In a large system, you should have automated monitoring to catch weird behaviour. For example, if one API endpoint that normally gets 100 requests/hour suddenly gets 10,000/hour, that could be an attack (or a bug) – raise an alert. Many organisations integrate security alerts into their monitoring: failed login spikes, unusual IP geolocations, heavy use of an admin API, etc., should trigger notifications. There are specialised tools (SIEMs, intrusion detection systems) that watch network and application logs for suspicious patterns. Even simple anomaly detection helps – e.g. if a microservice that rarely calls the user database suddenly starts making thousands of queries, it could signal a compromise.
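Even the simple threshold-style anomaly detection described above fits in a few lines: count events in a sliding window and flag when the count exceeds a baseline. A minimal sketch (the threshold and window values are illustrative; real systems derive baselines from historical traffic):

```python
from collections import deque

class RateAnomalyDetector:
    """Flag when more than `threshold` events occur within the last `window` seconds."""

    def __init__(self, threshold: int, window: float = 3600.0):
        self.threshold = threshold
        self.window = window
        self.events = deque()  # timestamps of recent events

    def record(self, timestamp: float) -> bool:
        self.events.append(timestamp)
        # Drop events that have fallen outside the sliding window.
        while self.events and self.events[0] < timestamp - self.window:
            self.events.popleft()
        return len(self.events) > self.threshold  # True -> raise an alert
```

In production this logic usually lives in your metrics pipeline (Prometheus alert rules, a SIEM, etc.), but the principle is the same: compare current rate against an expected baseline.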
Who did what and when: Audit logs shine here. You want to know, for example, which user account performed a sensitive action and when. This is vital for forensic analysis after an incident and for compliance in regulated industries. Many cloud platforms provide audit trails (e.g. AWS CloudTrail logs every API call in your cloud environment). Leverage these. Ensure that admin actions in your system (like granting roles, deleting data, etc.) leave an audit record. It might seem tedious, but when something goes wrong, you’ll be glad to have a clear timeline of events. As a best practice, regularly review audit logs for anomalies – or better, feed them into an automated system that flags anomalies (like an admin account doing something at 3 AM from an unusual IP).
In summary, monitoring is a security requirement at scale. Build strong observability into your system design. It’s not just about keeping systems running; it’s about spotting evil behaviour in a sea of normal operations. And when an incident occurs, good logs and metrics turn a wild goose chase into a straightforward investigation.
Conclusion
Security and authentication in large-scale systems can feel like juggling many balls – user auth, service auth, network security, data protection, etc. The key is to establish a clear mental model for each aspect and apply best practices consistently. We’ve seen that each tool or pattern comes with trade-offs: OAuth tokens enable scalability but require careful handling to avoid leaks; microservices let you compartmentalise but increase the surface you need to defend. There is truly no silver bullet – the art of building secure systems lies in layering multiple defences and making informed trade-offs.
As you design systems on the “road to 10x,” remember to think like an attacker as well as an engineer. Use the wisdom from those who’ve done it and don’t shy away from established solutions – you’re not the first to face these problems. Defence in depth, least privilege, and zero trust aren’t just buzzwords; they’re guiding principles that will serve you well when your system grows beyond one simple service.
Most of all, foster a culture of security in your team: code reviews that catch security issues, regular knowledge sharing about new threats, and openness to improve. A truly scalable system design is one that scales safely. With the concepts and examples above, you’re equipped to design systems that not only handle millions of users, but do so while keeping the bad guys out and the auditors happy.