# Designing Data-Intensive Applications: Reliable, Scalable, and Maintainable Applications
Every application is data-intensive to some degree. Raw CPU power isn't usually the bottleneck—it's how you handle the data that matters. And the landscape is sprawling: relational databases, document stores, graph databases, message queues, stream processors. One tool rarely fits all.
This is a summary of the first chapter of Martin Kleppmann's Designing Data-Intensive Applications—a book I keep recommending to anyone building distributed systems.
## Three Things That Matter
When you're evaluating databases or designing systems, Kleppmann distills everything into three concerns:
- Reliability — The system keeps working correctly even when things break.
- Scalability — The system can handle growth.
- Maintainability — The system can evolve over time.
These aren't abstract ideals. They're concrete engineering decisions.
## Reliability
"The system should continue to work correctly... even if things go wrong."
Things will go wrong. Hard disks fail. Networks partition. Someone deploys a buggy release at 4 PM on Friday. Reliability is about building systems that survive these failures gracefully.
### Types of Faults
Hardware faults are inevitable. Disks die. RAM degrades. The mean time to failure (MTTF) of a single disk might be 10 to 50 years, but in a cluster with 10,000 disks you should expect roughly one disk to die every day. The answer is redundancy: RAID arrays, replication, multi-AZ deployments.
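The back-of-the-envelope arithmetic is worth making concrete. A minimal sketch, using the chapter's 10,000-disk cluster; picking the midpoint of the 10-50 year MTTF range is my assumption:

```python
# Back-of-envelope estimate: at scale, individual disk failures
# become a routine daily event rather than a rarity.
DISKS = 10_000     # disks in the storage cluster
MTTF_YEARS = 25    # assumed midpoint of the 10-50 year MTTF range

failures_per_year = DISKS / MTTF_YEARS
failures_per_day = failures_per_year / 365

print(f"Expected failures: {failures_per_year:.0f}/year, "
      f"about {failures_per_day:.1f}/day")
# → Expected failures: 400/year, about 1.1/day
```

This is why redundancy has to be the default assumption, not an optimization.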
Software faults are nastier. A hardware fault is usually isolated to one machine; a software bug can bring down every instance at once. Memory leaks, edge cases in libraries, cascading timeouts. Because these faults are correlated, they're harder to predict and recover from. Thorough testing, feature flags, and gradual rollouts help.
Human faults are the most common cause of outages. We make mistakes. The solution isn't to blame people; it's to build systems that catch mistakes before they reach production: sandboxed staging environments, automated tests, easy rollback, and telemetry that turns incidents into learning opportunities.
### Fault Tolerance vs. Prevention
You can't prevent all faults. If a meteor strike takes out your data center, there's no software fix for that. Kleppmann distinguishes a fault (one component deviating from its spec) from a failure (the system as a whole stops providing service). The goal is fault tolerance: designing the system so that faults don't escalate into failures.
## Scalability
Reliability is about "does it work?" Scalability is about "does it keep working when load increases?"
### Describing Load
Before you can scale, you need to understand what you're actually measuring. Twitter is a classic example:
| Operation | Frequency |
|---|---|
| Home timeline reads | ~300k req/sec |
| New tweet writes | ~5k req/sec (peaks over 12k) |
Reads vastly outnumber writes. This asymmetry drives architectural decisions—caching, denormalization, read replicas.
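To make the denormalization idea concrete, here is a minimal sketch of the approach Kleppmann describes for this example: fan-out on write, where posting a tweet pushes it into a precomputed home-timeline cache for every follower, so the heavy read path becomes a cheap lookup. The names and data structures are illustrative, not Twitter's actual design:

```python
from collections import defaultdict

# Fan-out on write: do the expensive work on the rare operation
# (posting) so the frequent operation (reading) stays cheap.
followers = defaultdict(set)        # author -> set of follower ids
home_timelines = defaultdict(list)  # user -> cached list of tweets

def follow(follower, author):
    followers[author].add(follower)

def post_tweet(author, text):
    # Write path: O(number of followers) cache insertions per tweet.
    for user in followers[author]:
        home_timelines[user].append((author, text))

def read_home_timeline(user):
    # Read path: a single cache lookup per request.
    return home_timelines[user]

follow("alice", "bob")
follow("carol", "bob")
post_tweet("bob", "hello")
print(read_home_timeline("alice"))  # → [('bob', 'hello')]
```

The trade-off: a celebrity with millions of followers turns one write into millions of cache insertions, which is why (per the book) Twitter ended up with a hybrid of fan-out on write and query on read.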
Load parameters aren't always obvious. It might be:
- Requests per second
- Read-to-write ratio
- Concurrent users
- Size of working set
### Describing Performance
Once you understand load, you measure how the system responds:
Throughput — How many requests can you handle per second?
Response time — the time the client actually observes: service time plus network and queueing delays. It's a distribution, not a single number.
That's why you look at percentiles:
| Percentile | Meaning |
|---|---|
| p50 (median) | Half of requests are faster |
| p95 | 5% of requests are slower |
| p99 | 1% of requests are slow outliers |
| p99.9 | 0.1% of requests are very slow |
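Percentiles are easy to compute directly from a sample of response times. A minimal sketch using the nearest-rank method; the timing data is made up for illustration:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the value below which p% of samples fall."""
    ranked = sorted(samples)
    k = math.ceil(p / 100 * len(ranked)) - 1  # 0-based rank
    return ranked[max(k, 0)]

# Hypothetical response times (ms) for 20 requests, one bad outlier.
times_ms = [12, 14, 15, 15, 16, 17, 18, 18, 19, 20,
            21, 22, 23, 25, 28, 31, 40, 55, 120, 850]

for p in (50, 95, 99):
    print(f"p{p}: {percentile(times_ms, p)} ms")
# → p50: 20 ms
# → p95: 120 ms
# → p99: 850 ms
```

Note how the single 850 ms outlier leaves the median untouched but dominates p99. That's exactly why averages hide what your slowest users experience.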
Netflix and Google optimize for p99.9+ because even a tiny fraction of slow requests affects millions of users. But there's a cost—tail latency is expensive to eliminate.
### Scaling Strategies
Vertical scaling (bigger machines) has limits and gets pricey fast.
Horizontal scaling (more machines) is the usual path, but it requires your architecture to support it. Stateless services are easy to scale. Distributed databases are harder.
The key is elasticity—the ability to scale up during peaks and down during quiet periods. Cloud infrastructure makes this easier but doesn't solve the architectural challenges.
## Maintainability
Software isn't finished. It's maintained. And over time, many people will work on it—adding features, fixing bugs, adapting to new requirements.
Kleppmann breaks maintainability into three dimensions:
### Operability
Good operations teams can keep the system running smoothly. This means:
- Visibility — Logs, metrics, and traces that tell you what's happening
- Predictability — Consistent behavior that doesn't surprise you at 2 AM
- Interoperability — Understanding how your system interacts with others
- Automation — Reducing manual toil through scripting and tooling
If your on-call engineer can't diagnose an incident from the dashboards, the system isn't maintainable.
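Visibility in particular is cheap to build in from day one. A minimal sketch of structured (JSON) logging, the kind of output dashboards and alerts can actually query; the event and field names here are illustrative:

```python
import json
import logging
import time

# Structured logs: each line is a JSON object, so operators can filter
# and aggregate on fields instead of writing fragile regexes.
logging.basicConfig(level=logging.INFO, format="%(message)s")

def log_event(event, **fields):
    record = {"ts": time.time(), "event": event, **fields}
    logging.info(json.dumps(record))

log_event("request_completed", route="/timeline", status=200, duration_ms=42)
```

Compare that with a free-form `print("request done in 42ms")`: the structured version can feed a p99 latency dashboard directly.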
### Simplicity
Complexity is the enemy. As systems grow, they accumulate:
- Inconsistent terminology
- Undocumented dependencies
- Hacks to work around previous hacks
- State spaces that explode combinatorially
Good abstractions hide complexity behind clean interfaces. The goal isn't to make things simple for the computer—it's to make them simple for the humans who have to work with the code.
"Abstractions should make common things easy and rare things possible."
### Evolvability
Systems change. New use cases emerge. New technologies appear. The question isn't if your architecture will need to change, but when.
TDD, good test coverage, and clean interfaces make change safer. Microservices let teams work independently. Event-driven architectures decouple producers from consumers.
The systems that age well are the ones designed for change, not stability.
## The Practical Takeaway
These three properties—reliability, scalability, maintainability—aren't independent. A system that's hard to maintain won't stay reliable. A system that can't scale will eventually fail. They're different facets of the same problem: building software that stands the test of time.
The book goes deep on trade-offs. You'll see how every choice—relational vs. NoSQL, synchronous vs. asynchronous, monolith vs. microservices—affects these properties.
Next up: Data Models and Query Languages—how we structure the data that powers these systems.
