The core principle
The best architecture is the simplest one that meets your current needs, with a clear path to evolve when requirements change. Premature optimization wastes engineering time and introduces complexity before you understand where your actual bottlenecks are. Scale incrementally, measure constantly, and resist the urge to build for 10 million users when you have 100.
Here's how that progression looks in practice across seven stages.
Stage 1: Single server (0–100 users)
Everything runs on one machine — web application, database, and background jobs.
Why it works at this scale:
- Fast deployment and iteration
- Low cost ($20–50/month)
- Centralized logs make debugging straightforward
- End-to-end request visibility with no network hops
When to move on: CPU or memory consistently exceeds 70–80%, deployments cause downtime, or a crash takes down the entire service.
Stage 2: Separate the database (100–1K users)
Split your application server and database onto independent machines.
What you gain:
- Resource isolation — the database no longer competes with the app for CPU and memory
- Independent scaling — you can right-size each machine
- Better security through network isolation
- Backup processes no longer impact application performance
Key optimization at this stage: Connection pooling with a tool like PgBouncer lets you reuse 20–30 database connections across 100+ application connections, improving throughput 3–5x without changing a line of application code.
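A minimal PgBouncer configuration for this setup might look like the following. The hostname, database name, and exact limits are illustrative assumptions; only the option names are real PgBouncer settings.

```ini
; pgbouncer.ini — values are examples, tune to your workload
[databases]
appdb = host=10.0.1.5 port=5432 dbname=appdb

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
pool_mode = transaction      ; return each connection to the pool after the transaction ends
max_client_conn = 200        ; application-side connections PgBouncer will accept
default_pool_size = 25       ; actual Postgres connections per database/user pair
```

The application connects to port 6432 instead of 5432; `transaction` pooling is what lets ~200 client connections share ~25 real ones.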
Stage 3: Load balancer + horizontal scaling (1K–10K users)
Add a load balancer and run multiple application server instances behind it.
The critical challenge: Your servers must become stateless. Session data stored in server memory breaks as soon as requests can land on different servers. You have two options:
- Sticky sessions — route each user to the same server. Simpler to implement, but less resilient. If that server dies, the session is gone.
- External session store — move sessions to Redis or a similar store. Every server can serve every request. This is the right long-term approach.
The moment you move sessions out of server memory, you've unlocked true horizontal scaling.
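The external-store approach can be sketched as follows. In production the store would wrap Redis (e.g. `SETEX`/`GET` via redis-py); here a plain dict stands in so the sketch is self-contained, and the class name is my own.

```python
import json
import uuid

class SessionStore:
    """External session store shared by all app servers.
    A dict stands in for Redis so the example runs standalone."""

    def __init__(self):
        self._data = {}

    def create(self, user_id):
        session_id = str(uuid.uuid4())
        self._data[session_id] = json.dumps({"user_id": user_id})
        return session_id

    def get(self, session_id):
        raw = self._data.get(session_id)
        return json.loads(raw) if raw else None

# Any server instance can now resolve any session:
store = SessionStore()
sid = store.create(user_id=42)
assert store.get(sid)["user_id"] == 42
```

Because no server holds session state in memory, the load balancer can route any request to any instance.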
Stage 4: Caching + read replicas + CDN (10K–100K users)
At this scale, the database becomes the bottleneck. The solution is to hit it less often.
Caching
Most applications follow an 80/20 pattern: 80% of requests touch 20% of the data. Cache that 20%.
The cache-aside pattern works well here:
- Check the cache for the requested data
- On a miss, query the database
- Store the result in cache with a TTL
- Return the data
Cache invalidation strategy matters. TTL-based expiration (5–60 minutes) works for data that can tolerate slight staleness. Explicit invalidation on write is needed for anything requiring stronger consistency.
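The four steps above, plus explicit invalidation on write, can be sketched like this. The dict-backed "database" and class name are stand-ins; a real deployment would use Redis or Memcached for the cache layer.

```python
import time

class CacheAside:
    """Cache-aside with TTL expiration. A dict stands in for the
    database; the cache maps key -> (value, expires_at)."""

    def __init__(self, db, ttl_seconds=300, clock=time.monotonic):
        self.db = db
        self.ttl = ttl_seconds
        self.clock = clock
        self._cache = {}

    def get(self, key):
        entry = self._cache.get(key)
        if entry and entry[1] > self.clock():            # 1. check the cache
            return entry[0]
        value = self.db[key]                             # 2. on a miss, query the database
        self._cache[key] = (value, self.clock() + self.ttl)  # 3. store with a TTL
        return value                                     # 4. return the data

    def invalidate(self, key):
        """Explicit invalidation on write, for stronger consistency."""
        self._cache.pop(key, None)
```

Until the TTL expires or `invalidate` is called, repeat reads never touch the database.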
Read replicas
Distribute read queries across database replicas while keeping all writes on the primary. This directly reduces contention on your most constrained resource.
One gotcha: asynchronous replication introduces millisecond-to-second lag. If a user writes data and immediately reads it back, they might hit a replica that hasn't caught up. The fix is to route reads to the primary for a short window after a write — this is called read-your-writes consistency.
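One way to implement that routing rule is a small per-user tracker; the 5-second window below is an assumption you would tune to your observed replication lag.

```python
import time

class ReplicaRouter:
    """Route a user's reads to the primary for a short window after
    their last write, so they always read their own writes."""

    def __init__(self, window_seconds=5.0, clock=time.monotonic):
        self.window = window_seconds
        self.clock = clock
        self._last_write = {}   # user_id -> timestamp of most recent write

    def record_write(self, user_id):
        self._last_write[user_id] = self.clock()

    def choose(self, user_id):
        last = self._last_write.get(user_id)
        if last is not None and self.clock() - last < self.window:
            return "primary"
        return "replica"
```

Users who haven't written recently still get the full benefit of replica offloading.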
CDN
A CDN caches your static assets (images, JS, CSS) at edge servers globally. For a user in Tokyo hitting your US-based origin, latency drops from ~300ms to ~50ms. This is one of the highest-leverage, lowest-effort improvements you can make.
Stage 5: Auto-scaling + stateless design (100K–500K users)
Manual scaling doesn't work at this volume. Traffic patterns are unpredictable, and over-provisioning is expensive. Auto-scaling adjusts your fleet based on real-time load.
Sensible starting parameters:
- Minimum instances: 2 (always maintain redundancy)
- Maximum instances: set a cost ceiling
- Scale-up threshold: 70% CPU
- Scale-down threshold: 30% CPU
- Scale-up cooldown: 3 minutes
- Scale-down cooldown: 10 minutes (scale down conservatively to avoid thrashing)
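The parameters above translate into a simple decision function like the one below. The `max_instances=20` ceiling is an assumed value; real autoscalers (e.g. an AWS Auto Scaling group) evaluate this logic for you, but the shape is the same.

```python
def scaling_decision(cpu_percent, current, now, last_action_at,
                     min_instances=2, max_instances=20,
                     up_threshold=70, down_threshold=30,
                     up_cooldown=180, down_cooldown=600):
    """Return the desired instance count. Cooldowns (in seconds) prevent
    thrashing; the asymmetry makes scale-down deliberately slower."""
    if cpu_percent > up_threshold and now - last_action_at >= up_cooldown:
        return min(current + 1, max_instances)
    if cpu_percent < down_threshold and now - last_action_at >= down_cooldown:
        return max(current - 1, min_instances)
    return current
```

Note the floor of 2 instances: even at zero load, redundancy is maintained.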
JWT for stateless authentication
Server-side sessions work at this scale but add shared-state complexity when scaling horizontally. JWTs shift authentication state to the client — servers verify the token's signature without a database lookup.
The trade-off: You can't revoke a JWT before it expires.
The solution: Short-lived access tokens (15 minutes) combined with long-lived refresh tokens (7 days). You can revoke refresh tokens in a small database, and access tokens expire quickly enough that the attack window is narrow.
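A stdlib-only sketch of the access-token half follows; the secret, claim names, and helper functions are my own, and a production system would use a maintained library (e.g. PyJWT) plus a refresh-token table rather than hand-rolled signing.

```python
import base64
import hashlib
import hmac
import json
import time

SECRET = b"rotate-me"   # assumption: shared HMAC signing key

def _b64(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def issue_token(user_id, ttl_seconds=900, now=None):
    """Issue a short-lived HS256-style token (900 s = 15 minutes)."""
    now = int(time.time()) if now is None else now
    header = _b64(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64(json.dumps({"sub": user_id, "exp": now + ttl_seconds}).encode())
    signing_input = f"{header}.{payload}".encode()
    sig = _b64(hmac.new(SECRET, signing_input, hashlib.sha256).digest())
    return f"{header}.{payload}.{sig}"

def verify_token(token, now=None):
    """Verify signature and expiry locally, with no database lookup.
    Returns the claims dict, or None if invalid or expired."""
    now = int(time.time()) if now is None else now
    header, payload, sig = token.split(".")
    signing_input = f"{header}.{payload}".encode()
    expected = _b64(hmac.new(SECRET, signing_input, hashlib.sha256).digest())
    if not hmac.compare_digest(sig, expected):
        return None
    claims = json.loads(base64.urlsafe_b64decode(payload + "=" * (-len(payload) % 4)))
    return claims if claims["exp"] > now else None
```

Verification is pure CPU work, which is exactly what makes it stateless: any server with the key can validate any token.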
Stage 6: Sharding + microservices + message queues (500K–1M users)
You've optimized, cached, and scaled horizontally. Now your single database is the ceiling.
Database sharding
Split data across multiple databases using a shard key (commonly user ID modulo shard count). Each database handles a fraction of the total data and traffic.
Important caveat: Only do this after exhausting all other options. Sharding is a one-way architectural decision. Queries that cross shard boundaries become application-level joins. Transactions spanning multiple shards require distributed coordination. It's expensive to undo.
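The modulo routing itself is only a few lines; the cost lives everywhere else. The shard count and hostnames below are assumptions (and note that changing `SHARD_COUNT` later means migrating data, which is part of why sharding is hard to undo).

```python
SHARD_COUNT = 4   # assumption: four shards; changing this requires resharding

def shard_for(user_id: int) -> int:
    """User-ID-modulo routing: all of one user's rows live on one shard,
    so single-user queries never cross shard boundaries."""
    return user_id % SHARD_COUNT

# Hypothetical connection strings, one per shard:
DSNS = [f"postgres://shard{i}.internal/app" for i in range(SHARD_COUNT)]

def dsn_for(user_id: int) -> str:
    return DSNS[shard_for(user_id)]
```

Queries keyed by anything other than `user_id` (say, "all orders over $100") now have to fan out to every shard and merge results in the application.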
Microservices
A single monolith gets harder to deploy safely as teams grow. Microservices let independent services own their data, deploy on their own schedules, and scale independently.
The cost is operational complexity: service discovery, inter-service communication, distributed tracing, and more failure modes to reason about.
Message queues
Not everything needs to happen synchronously in the request path. Sending an email, generating a report, or updating a search index can all happen asynchronously.
Tools like Kafka or RabbitMQ let producers and consumers operate independently. Downstream failures don't block primary operations. Consumers can be scaled separately from producers. New consumers can be added without changing existing producers.
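The decoupling can be sketched with the standard library's thread-safe queue standing in for a Kafka topic or RabbitMQ queue; the function names and the `None` shutdown sentinel are my own conventions.

```python
import queue
import threading

email_queue = queue.Queue()   # stand-in for a Kafka topic / RabbitMQ queue

def handle_signup(user_email, results):
    """Producer (request path): enqueue the email job and return immediately,
    without waiting for delivery."""
    email_queue.put({"to": user_email, "template": "welcome"})
    results.append("signup ok")

def email_worker(sent):
    """Consumer: drains the queue on its own schedule, independently of
    the producers. A slow mail provider never blocks signups."""
    while True:
        job = email_queue.get()
        if job is None:          # sentinel: stop the worker
            break
        sent.append(job["to"])   # pretend to send the email
        email_queue.task_done()
```

Scaling consumers means starting more `email_worker` threads (or processes); the producers never change.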
Stage 7: Multi-region + advanced patterns (1M–10M+ users)
Single-region deployments have geographic latency limitations and single points of failure. At this scale you need global distribution.
Active-passive vs. active-active
Active-passive: One region handles all writes; others serve reads. Simpler to reason about, but users far from the write region experience higher write latency. Failover is straightforward.
Active-active: All regions handle both reads and writes. Lower global latency, but requires conflict resolution when the same data is written in two regions simultaneously. You're now operating in the world of eventual consistency.
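One simple conflict-resolution policy is last-write-wins, sketched below. The record shape is my own; LWW silently discards the losing write, so CRDTs or application-level merges are often preferred for data that must not be lost.

```python
def lww_merge(a, b):
    """Last-write-wins merge for a concurrent write to the same key.
    Each record is (value, timestamp, region); ties on timestamp are
    broken by region name so every region converges on the same answer."""
    return max(a, b, key=lambda rec: (rec[1], rec[2]))

us = ("alice@new.example", 1700000005, "us-east")
eu = ("alice@old.example", 1700000003, "eu-west")
assert lww_merge(us, eu) == us == lww_merge(eu, us)   # order-independent
```

The key property is that the merge is deterministic and commutative: both regions apply it and arrive at the same state, which is what "eventual consistency" requires.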
CAP theorem in practice
During a network partition, you must choose between consistency and availability. Most user-facing systems choose availability — accept that some users might see slightly stale data rather than showing an error page.
CQRS
Separate your write path (normalized schema, ACID transactions, optimized for correctness) from your read path (denormalized views, optimized for query speed). Different databases can back each side, each tuned for its workload.
Advanced caching
- Multi-tier caching: local in-process cache → regional cache cluster → global cache
- Cache warming: pre-populate caches before traffic arrives (critical after deployments)
- Write-behind caching: write to cache first, flush to the database asynchronously for write-heavy workloads
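Write-behind is the least intuitive of the three, so here is a sketch under stated assumptions: a dict stands in for the database, `flush()` is called explicitly where production systems flush on a timer or via a queue, and real deployments must handle the failure mode where a cache node dies with unflushed writes.

```python
class WriteBehindCache:
    """Write-behind: writes land in the cache immediately and are
    flushed to the database later in batches."""

    def __init__(self, db):
        self.db = db            # dict stands in for the database
        self.cache = {}
        self._dirty = set()     # keys written but not yet persisted

    def put(self, key, value):
        self.cache[key] = value   # fast path: cache only, no DB round trip
        self._dirty.add(key)

    def get(self, key):
        return self.cache.get(key, self.db.get(key))

    def flush(self):
        """Persist dirty keys; asynchronous on a timer in production."""
        for key in self._dirty:
            self.db[key] = self.cache[key]
        self._dirty.clear()
```

Batching many `put` calls into one `flush` is where the write-heavy win comes from; the trade-off is a window where acknowledged writes exist only in cache.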
Beyond 10 million users
At this point general-purpose solutions start hitting their limits.
Polyglot persistence — use the right database for each access pattern. PostgreSQL for transactions, Elasticsearch for full-text search, a time-series database for metrics, a graph database for relationship traversal.
Edge computing — push computation to CDN edges (Cloudflare Workers, AWS Lambda@Edge) to serve requests without ever hitting your origin servers.
Custom infrastructure — Netflix, Discord, and Uber all eventually built proprietary systems when off-the-shelf solutions couldn't meet their requirements. This is the exception, not the rule.
The principles that hold at every stage
- Start simple; optimize only when you can measure a bottleneck
- Stateless servers unlock horizontal scaling and auto-scaling
- Aggressive caching provides 10–100x performance gains with relatively low complexity
- Queue non-critical work asynchronously to decouple services
- Shard reluctantly — it's a last resort, not a starting point
- Accept that consistency and availability trade off against each other
- Every component you add is a component you have to operate
Scale is earned, not planned for in advance.
