One common reason companies seem to like hiring Big Tech engineers (at least before the recent changes in the hiring market) is that these engineers are assumed to be good at scaling large systems. Big Tech today may be too big to assume that every engineer there is good at scaling systems, but it’s still important to understand what system scalability is and how to scale a system properly.
Systems tend to slow down as they grow unless proactively adjusted to handle the increased demands.
Scalability is the ability to handle more load by adding resources.
A truly scalable system can adapt and evolve to consistently manage a growing workload.
This article will examine various dimensions of system growth and explore common strategies for achieving scalability.
How can a system grow?
A system can grow in different ways. Here are the most common:
1. More Users: A larger user base creates a greater number of requests.
- Example: A social media platform experiencing a surge in new users.
2. More Features: Adding new features to the system increases its capabilities.
- Example: An e-commerce website adding support for a new payment method.
3. More Data: The system stores and manages more data because of user activity or logging.
- Example: A video streaming platform like YouTube storing more video content over time.
4. More Complexity: The system’s architecture evolves to handle new features and scale, adding more parts and connections.
- Example: A system that started as a simple application is broken into smaller, independent systems.
5. More Locations: The system serves users in new regions or countries.
- Example: An e-commerce company launching websites and distribution in new international markets.
How to Scale a Software System
Here are 10 common ways to make a system scalable:
1. Vertical Scaling (Scaling Up)
This means adding more power to your existing machines by upgrading a server with more RAM, faster (or more) CPUs, or additional storage.
It’s a good approach for simpler architectures but has limitations in how far you can go.
graph LR
subgraph After ["After Vertical Scaling"]
direction TB
CPU2[8 CPU Cores]
RAM2[32GB RAM]
SSD2[500GB SSD]
end
subgraph Before ["Before Scaling"]
direction TB
CPU1[2 CPU Cores]
RAM1[8GB RAM]
SSD1[100GB SSD]
end
Before --> After
style CPU1 fill:#f9f,stroke:#333
style RAM1 fill:#bbf,stroke:#333
style SSD1 fill:#bfb,stroke:#333
style CPU2 fill:#f9f,stroke:#333
style RAM2 fill:#bbf,stroke:#333
style SSD2 fill:#bfb,stroke:#333
classDef default fill:#fff,stroke:#333,stroke-width:2px
2. Horizontal Scaling (Scaling Out)
This means adding more machines to your system to spread the workload across multiple servers.
This is usually the most practical way to scale beyond the limits of a single machine, although it only works when the workload can be divided across servers.
graph LR
subgraph Before ["Before Scaling"]
direction TB
SERVER1[Server<br/>2 CPU Cores<br/>8GB RAM]
end
subgraph After ["After Horizontal Scaling"]
direction TB
SERVER2[Server<br/>2 CPU Cores<br/>8GB RAM]
SERVER3[Server<br/>2 CPU Cores<br/>8GB RAM]
SERVER4[Server<br/>2 CPU Cores<br/>8GB RAM]
end
Before --> After
style SERVER1 fill:#bbf,stroke:#333
style SERVER2 fill:#bbf,stroke:#333
style SERVER3 fill:#bbf,stroke:#333
style SERVER4 fill:#bbf,stroke:#333
classDef default fill:#fff,stroke:#333,stroke-width:2px
Example: Netflix uses horizontal scaling for its streaming service, adding more servers to its clusters to handle the growing number of users and data traffic.
3. Load Balancing
Load balancing is the process of distributing traffic across multiple servers to ensure no single server becomes overwhelmed.
graph LR
subgraph Before ["Before Scaling"]
direction TB
LB1[Load Balancer]
SERVER1[Server 1<br/>2 CPU Cores<br/>8GB RAM]
end
subgraph After ["After Horizontal Scaling"]
direction TB
LB2[Load Balancer]
SERVER2[Server 1<br/>2 CPU Cores<br/>8GB RAM]
SERVER3[Server 2<br/>2 CPU Cores<br/>8GB RAM]
SERVER4[Server 3<br/>2 CPU Cores<br/>8GB RAM]
LB2 --> SERVER2
LB2 --> SERVER3
LB2 --> SERVER4
end
LB1 --> SERVER1
Before --> After
style LB1 fill:#f96,stroke:#333
style LB2 fill:#f96,stroke:#333
style SERVER1 fill:#bbf,stroke:#333
style SERVER2 fill:#bbf,stroke:#333
style SERVER3 fill:#bbf,stroke:#333
style SERVER4 fill:#bbf,stroke:#333
classDef default fill:#fff,stroke:#333,stroke-width:2px
Example: Google employs load balancing extensively across its global infrastructure to distribute search queries and traffic evenly across its massive server farms.
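One simple way to distribute traffic is round-robin rotation, where each incoming request goes to the next server in a fixed order. The sketch below is illustrative only: the `RoundRobinBalancer` class and the server names are hypothetical, and real load balancers also account for health checks, connection counts, and server weights.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hypothetical balancer that hands out servers in a rotating order."""

    def __init__(self, servers):
        if not servers:
            raise ValueError("at least one server is required")
        self._servers = list(servers)
        self._rotation = cycle(self._servers)

    def next_server(self):
        # Each call returns the next server in the rotation,
        # wrapping around to the first one after the last.
        return next(self._rotation)


# Server names are placeholders for illustration.
balancer = RoundRobinBalancer(["server-1", "server-2", "server-3"])
assignments = [balancer.next_server() for _ in range(6)]
print(assignments)
```

With three servers and six requests, each server receives two requests, which is the even spread the diagram above depicts.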
4. Caching
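Caching is a technique for storing frequently accessed data in a faster layer, typically in memory (RAM), to reduce the load on the server or database. Serving a request from the cache is usually far faster than querying the database, so caching can improve response times significantly.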
graph TD
C[Client]
CACHE[Cache Layer<br/>Response: ~1ms]
DB[(Database<br/>Response: ~100ms)]
C --> |Request Data| CACHE
CACHE --> |Cache Hit| C
CACHE --> |Cache Miss| DB
DB --> |Fetch & Store| CACHE
CACHE --> |Return Data| C
style C fill:#f9f,stroke:#333
style CACHE fill:#bbf,stroke:#333
style DB fill:#bfb,stroke:#333
classDef default fill:#fff,stroke:#333,stroke-width:2px
Example: Reddit uses caching to store frequently accessed content like hot posts and comments so that they can be served quickly without querying the database each time.
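The hit and miss flow in the diagram is often called the cache-aside pattern. Here is a minimal in-process sketch, assuming a hypothetical `slow_db_lookup` function standing in for the database and a fixed time-to-live for entries. A production setup would more likely use a shared cache service rather than a Python dictionary.

```python
import time


class CacheAside:
    """Hypothetical in-memory cache that falls back to a loader on a miss."""

    def __init__(self, load_from_db, ttl_seconds=60.0):
        self._load_from_db = load_from_db
        self._ttl = ttl_seconds
        self._entries = {}  # key -> (value, expires_at)

    def get(self, key):
        now = time.monotonic()
        entry = self._entries.get(key)
        if entry is not None and entry[1] > now:
            return entry[0]  # cache hit: no database call
        value = self._load_from_db(key)  # cache miss: fetch from the database
        self._entries[key] = (value, now + self._ttl)  # store for next time
        return value


db_calls = []  # records every simulated database query


def slow_db_lookup(key):
    # Stand-in for a real database query (assumption for this sketch).
    db_calls.append(key)
    return f"post-body-for-{key}"


cache = CacheAside(slow_db_lookup, ttl_seconds=60.0)
first = cache.get("hot-post-1")   # miss: goes to the database
second = cache.get("hot-post-1")  # hit: served from memory
print(len(db_calls))
```

The second read never reaches the database, which is where the latency and load savings come from. The trade-off is staleness: until an entry expires, readers may see an older value.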
5. Content Delivery Networks (CDNs)
A CDN distributes static assets (images, videos, etc.) to servers located closer to users. This reduces latency and results in faster load times.
Example: Cloudflare provides CDN services, speeding up website access for users worldwide by caching content in servers located close to users.
graph TD
OS[Origin Server<br/>New York]
EDGE1[Edge Server<br/>London]
EDGE2[Edge Server<br/>Tokyo]
EDGE3[Edge Server<br/>Sydney]
U1[User<br/>Europe]
U2[User<br/>Asia]
U3[User<br/>Australia]
OS --> EDGE1
OS --> EDGE2
OS --> EDGE3
U1 --> EDGE1
U2 --> EDGE2
U3 --> EDGE3
style OS fill:#f96,stroke:#333
style EDGE1 fill:#bbf,stroke:#333
style EDGE2 fill:#bbf,stroke:#333
style EDGE3 fill:#bbf,stroke:#333
style U1 fill:#bfb,stroke:#333
style U2 fill:#bfb,stroke:#333
style U3 fill:#bfb,stroke:#333
classDef default fill:#fff,stroke:#333,stroke-width:2px
6. Sharding/Partitioning
Partitioning means splitting data or functionality across multiple nodes/servers to distribute workload and avoid bottlenecks.
graph TD
APP[Application]
subgraph "Shard Key: User ID"
KEY1[ID: 1-1000]
KEY2[ID: 1001-2000]
KEY3[ID: 2001-3000]
end
subgraph "Database Shards"
DB1[(Shard 1<br/>Users 1-1000)]
DB2[(Shard 2<br/>Users 1001-2000)]
DB3[(Shard 3<br/>Users 2001-3000)]
end
APP --> KEY1
APP --> KEY2
APP --> KEY3
KEY1 --> DB1
KEY2 --> DB2
KEY3 --> DB3
style APP fill:#f9f,stroke:#333
style KEY1 fill:#bfb,stroke:#333
style KEY2 fill:#bfb,stroke:#333
style KEY3 fill:#bfb,stroke:#333
style DB1 fill:#bbf,stroke:#333
style DB2 fill:#bbf,stroke:#333
style DB3 fill:#bbf,stroke:#333
classDef default fill:#fff,stroke:#333,stroke-width:2px
Example: Amazon DynamoDB uses partitioning to distribute data and traffic for its NoSQL database service across many servers, ensuring fast performance and scalability.
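The range-based routing in the diagram can be sketched in a few lines. The shard names and the `shard_for_user` helper are hypothetical, and the ranges simply mirror the diagram. Many systems use hash-based partitioning instead, which spreads keys more evenly but makes range queries harder.

```python
from bisect import bisect_left

# Inclusive upper bound of each shard's user ID range, in ascending order.
# These values mirror the diagram and are illustrative only.
SHARD_UPPER_BOUNDS = [1000, 2000, 3000]
SHARD_NAMES = ["shard-1", "shard-2", "shard-3"]


def shard_for_user(user_id: int) -> str:
    """Return the shard responsible for the given user ID."""
    if user_id < 1 or user_id > SHARD_UPPER_BOUNDS[-1]:
        raise ValueError(f"user ID {user_id} is outside every shard range")
    # bisect_left finds the first range whose upper bound is >= user_id.
    index = bisect_left(SHARD_UPPER_BOUNDS, user_id)
    return SHARD_NAMES[index]


for uid in (1, 1000, 1001, 2500):
    print(uid, shard_for_user(uid))
```

Keeping the bounds in a sorted list means adding a shard is a matter of appending a new upper bound and name, although moving existing data between shards is a separate and harder problem.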
7. Asynchronous Communication
Asynchronous communication means deferring long-running or non-critical tasks to background queues or message brokers.
This ensures your main application remains responsive to users.
graph LR
A[App Server<br/>Instant Response]
Q[Message Queue]
W1[Worker 1]
W2[Worker 2]
W3[Worker 3]
U1[User 1<br/>Sends Message] --> A
U2[User 2<br/>Continues Using App] --> A
A -->|1 Store Task| Q
A -->|2 Return Success| U1
Q -->|3a. Process Task| W1
Q -->|3b. Process Task| W2
Q -->|3c. Process Task| W3
style A fill:#bbf,stroke:#333
style Q fill:#f96,stroke:#333
style W1 fill:#bfb,stroke:#333
style W2 fill:#bfb,stroke:#333
style W3 fill:#bfb,stroke:#333
style U1 fill:#f9f,stroke:#333
style U2 fill:#f9f,stroke:#333
classDef default fill:#fff,stroke:#333,stroke-width:2px
Example: Slack uses asynchronous communication for messaging. When a message is sent, the sender’s interface doesn’t freeze; it continues to be responsive while the message is processed and delivered in the background.
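The queue and worker flow above can be approximated with Python's standard library. This is only an in-process sketch: `handle_request` and the message names are hypothetical, and a real deployment would typically put a separate message broker between the app server and the workers.

```python
import queue
import threading

task_queue = queue.Queue()
results = []
results_lock = threading.Lock()


def worker():
    """Pull tasks off the queue until a None sentinel arrives."""
    while True:
        task = task_queue.get()
        if task is None:  # sentinel: shut this worker down
            task_queue.task_done()
            break
        with results_lock:
            results.append(f"delivered:{task}")  # stand-in for slow work
        task_queue.task_done()


def handle_request(message):
    # The request handler only enqueues the task and returns immediately;
    # it does not wait for the work to finish.
    task_queue.put(message)
    return "accepted"


workers = [threading.Thread(target=worker) for _ in range(3)]
for thread in workers:
    thread.start()

responses = [handle_request(f"msg-{i}") for i in range(5)]

task_queue.join()  # wait until every queued task has been processed
for _ in workers:
    task_queue.put(None)  # one sentinel per worker
for thread in workers:
    thread.join()

print(responses)
print(sorted(results))
```

The handler's response time no longer depends on how long the task takes, only on how long it takes to enqueue it. The cost is that the caller learns the task was accepted, not that it succeeded.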
8. Microservices Architecture
Microservices architecture breaks an application down into smaller, independent services that can each be scaled independently.
This improves resilience and allows teams to work on specific components in parallel.
graph LR
subgraph "Monolithic"
M[Monolithic App<br/>Auth + Orders<br/>Products + Cart<br/>Notifications]
end
subgraph "Microservices"
A[Auth Service]
O[Orders Service]
P[Products Service]
C[Cart Service]
N[Notifications Service]
A --> O
O --> P
P --> C
O --> N
end
Monolithic --> Microservices
style M fill:#f96,stroke:#333
style A fill:#bbf,stroke:#333
style O fill:#bbf,stroke:#333
style P fill:#bbf,stroke:#333
style C fill:#bbf,stroke:#333
style N fill:#bbf,stroke:#333
classDef default fill:#fff,stroke:#333,stroke-width:2px
Example: Uber has evolved its architecture into microservices to handle different functions like billing, notifications, and ride matching independently, allowing for efficient scaling and rapid development.
9. Auto-Scaling
Auto-Scaling means automatically adjusting the number of active servers based on the current load.
This ensures that the system can handle spikes in traffic without manual intervention.
graph TD
subgraph Low ["Low Load: 20% CPU"]
L1[Server 1]
L2[Server 2]
end
subgraph Medium ["Medium Load: 60% CPU"]
M1[Server 1]
M2[Server 2]
M3[Server 3]
M4[Server 4]
end
subgraph High ["High Load: 80% CPU"]
H1[Server 1]
H2[Server 2]
H3[Server 3]
H4[Server 4]
H5[Server 5]
H6[Server 6]
end
Low --> Medium
Medium --> High
style L1 fill:#bfb,stroke:#333
style L2 fill:#bfb,stroke:#333
style M1 fill:#f96,stroke:#333
style M2 fill:#f96,stroke:#333
style M3 fill:#f96,stroke:#333
style M4 fill:#f96,stroke:#333
style H1 fill:#f9f,stroke:#333
style H2 fill:#f9f,stroke:#333
style H3 fill:#f9f,stroke:#333
style H4 fill:#f9f,stroke:#333
style H5 fill:#f9f,stroke:#333
style H6 fill:#f9f,stroke:#333
classDef default fill:#fff,stroke:#333,stroke-width:2px
Example: AWS Auto Scaling monitors applications and automatically adjusts capacity to maintain steady, predictable performance at the lowest possible cost.
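A common way to decide how many servers to run is a proportional rule: scale the current server count by the ratio of observed load to target load, then clamp the result to a minimum and maximum. The sketch below assumes average CPU utilization as the metric. The function name, the 50% target, and the bounds are all illustrative choices, not the algorithm of any particular cloud provider.

```python
import math


def desired_server_count(
    current_servers: int,
    avg_cpu_percent: float,
    target_cpu_percent: float = 50.0,  # assumed target utilization
    min_servers: int = 2,              # assumed lower bound
    max_servers: int = 6,              # assumed upper bound
) -> int:
    """Return how many servers are needed to bring CPU back toward the target."""
    # Proportional rule: more load than the target means more servers.
    raw = math.ceil(current_servers * avg_cpu_percent / target_cpu_percent)
    # Never drop below the minimum or exceed the maximum.
    return max(min_servers, min(max_servers, raw))


print(desired_server_count(2, 20.0))  # low load: stays at the minimum
print(desired_server_count(2, 80.0))  # high load: scales out
print(desired_server_count(4, 80.0))  # very high load: capped at the maximum
```

Real auto-scalers usually add cooldown periods and averaging windows on top of a rule like this so that short spikes do not cause servers to be added and removed repeatedly.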
10. Multi-Region Deployment
Deploy the application in multiple data centers or cloud regions to reduce latency and improve redundancy.
graph TD
subgraph "US Region"
US_APP[App Servers]
US_DB[(Database)]
US_APP --> US_DB
end
subgraph "EU Region"
EU_APP[App Servers]
EU_DB[(Database)]
EU_APP --> EU_DB
end
subgraph "Asia Region"
ASIA_APP[App Servers]
ASIA_DB[(Database)]
ASIA_APP --> ASIA_DB
end
US_DB <-->|Sync| EU_DB
EU_DB <-->|Sync| ASIA_DB
ASIA_DB <-->|Sync| US_DB
US_USER[US Users] --> US_APP
EU_USER[EU Users] --> EU_APP
ASIA_USER[Asia Users] --> ASIA_APP
style US_APP fill:#bbf,stroke:#333
style EU_APP fill:#bbf,stroke:#333
style ASIA_APP fill:#bbf,stroke:#333
style US_DB fill:#f96,stroke:#333
style EU_DB fill:#f96,stroke:#333
style ASIA_DB fill:#f96,stroke:#333
style US_USER fill:#bfb,stroke:#333
style EU_USER fill:#bfb,stroke:#333
style ASIA_USER fill:#bfb,stroke:#333
classDef default fill:#fff,stroke:#333,stroke-width:2px
Example: Spotify uses multi-region deployments to ensure their music streaming service remains highly available and responsive to users all over the world, regardless of where they are located.
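The latency and redundancy benefits both come from how requests are routed: users go to their nearest region when it is healthy and fall back to another region when it is not. Here is a small sketch of that decision, assuming a hypothetical static preference table and `pick_region` helper. In practice this logic usually lives in DNS or a global load balancer rather than in application code.

```python
# Ordered region preference per user location (nearest first).
# The table and region names are illustrative assumptions.
REGION_PREFERENCE = {
    "US": ["us", "eu", "asia"],
    "EU": ["eu", "us", "asia"],
    "ASIA": ["asia", "us", "eu"],
}


def pick_region(user_location: str, healthy_regions: set) -> str:
    """Return the closest healthy region for a user, falling back in order."""
    for region in REGION_PREFERENCE[user_location]:
        if region in healthy_regions:
            return region
    raise RuntimeError("no healthy region available")


all_up = {"us", "eu", "asia"}
print(pick_region("EU", all_up))          # nearest region serves the user
print(pick_region("EU", {"us", "asia"}))  # EU is down: fail over to US
```

Failover like this only works if the fallback region has a reasonably current copy of the data, which is why the diagram shows the databases syncing with each other.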