# Designing Data-Intensive Applications: Reliable, Scalable, and Maintainable Applications
Every application is data-intensive to some degree. Raw CPU power isn't usually the bottleneck—it's how you handle the data that matters. And the landscape is sprawling: relational databases, document stores, graph databases, message queues, stream processors. One tool rarely fits all.
This is a summary of the first chapter of Martin Kleppmann's Designing Data-Intensive Applications—a book I keep recommending to anyone building distributed systems.
## Three Things That Matter
When you're evaluating databases or designing systems, Kleppmann distills everything into three concerns:
- Reliability — The system keeps working correctly even when things break.
- Scalability — The system can handle growth.
- Maintainability — The system can evolve over time.
These aren't abstract ideals. They're concrete engineering decisions.
## Reliability
"The system should continue to work correctly... even if things go wrong."
Things will go wrong. Hard disks fail. Networks partition. Someone deploys a buggy release at 4 PM on Friday. Reliability is about building systems that survive these failures gracefully.
### Types of Faults
Hardware faults are inevitable. Disks die. RAM degrades. The mean time to failure (MTTF) of a single disk might be 10 to 50 years, but in a cluster with 10,000 disks you should expect roughly one disk to die every day. The answer is redundancy: RAID arrays, replication, multi-AZ deployments.
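The back-of-the-envelope arithmetic is worth making concrete. A minimal sketch, using the chapter's 10,000-disk cluster; picking the midpoint of the 10-50 year MTTF range is my assumption:

```python
# Back-of-envelope estimate: at scale, individual disk failures
# become a routine daily event rather than a rarity.
DISKS = 10_000     # disks in the storage cluster
MTTF_YEARS = 25    # assumed midpoint of the 10-50 year MTTF range

failures_per_year = DISKS / MTTF_YEARS
failures_per_day = failures_per_year / 365

print(f"Expected failures: {failures_per_year:.0f}/year, "
      f"about {failures_per_day:.1f}/day")
# → Expected failures: 400/year, about 1.1/day
```

This is why redundancy has to be the default assumption, not an optimization.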
Software faults are nastier. A hardware fault is usually isolated to one machine; a software bug can bring down every instance at once. Memory leaks, edge cases in libraries, cascading timeouts. Because these faults are correlated, they're harder to predict and recover from. Thorough testing, feature flags, and gradual rollouts help.
Human faults are the most common cause of outages. We make mistakes. The solution isn't to blame people; it's to build systems that catch mistakes before they reach production: sandboxed staging environments, automated tests, easy rollback, and telemetry that turns incidents into learning opportunities.
### Fault Tolerance vs. Prevention
You can't prevent all faults. If a meteor strike takes out your data center, there's no software fix for that. Kleppmann distinguishes a fault (one component deviating from its spec) from a failure (the system as a whole stops providing service). The goal is fault tolerance: designing the system so that faults don't escalate into failures.
## Scalability
Reliability is about "does it work?" Scalability is about "does it keep working when load increases?"
### Describing Load
Before you can scale, you need to understand what you're actually measuring. Twitter is a classic example:
| Operation | Frequency |
|---|---|
| Home timeline reads | ~300k req/sec |
| New tweet writes | ~5k req/sec (peaks over 12k) |
Reads vastly outnumber writes. This asymmetry drives architectural decisions—caching, denormalization, read replicas.
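To make the denormalization idea concrete, here is a minimal sketch of the approach Kleppmann describes for this example: fan-out on write, where posting a tweet pushes it into a precomputed home-timeline cache for every follower, so the heavy read path becomes a cheap lookup. The names and data structures are illustrative, not Twitter's actual design:

```python
from collections import defaultdict

# Fan-out on write: do the expensive work on the rare operation
# (posting) so the frequent operation (reading) stays cheap.
followers = defaultdict(set)        # author -> set of follower ids
home_timelines = defaultdict(list)  # user -> cached list of tweets

def follow(follower, author):
    followers[author].add(follower)

def post_tweet(author, text):
    # Write path: O(number of followers) cache insertions per tweet.
    for user in followers[author]:
        home_timelines[user].append((author, text))

def read_home_timeline(user):
    # Read path: a single cache lookup per request.
    return home_timelines[user]

follow("alice", "bob")
follow("carol", "bob")
post_tweet("bob", "hello")
print(read_home_timeline("alice"))  # → [('bob', 'hello')]
```

The trade-off: a celebrity with millions of followers turns one write into millions of cache insertions, which is why (per the book) Twitter ended up with a hybrid of fan-out on write and query on read.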
Load parameters aren't always obvious. It might be:
- Requests per second
- Read-to-write ratio
- Concurrent users
- Size of working set
### Describing Performance
Once you understand load, you measure how the system responds:
Throughput — How many requests can you handle per second?
Response time — the time the client actually observes: service time plus network and queueing delays. It's a distribution, not a single number.
That's why you look at percentiles:
| Percentile | Meaning |
|---|---|
| p50 (median) | Half of requests are faster |
| p95 | 5% of requests are slower |
| p99 | 1% of requests are slow outliers |
| p99.9 | 0.1% of requests are very slow |
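Percentiles are easy to compute directly from a sample of response times. A minimal sketch using the nearest-rank method; the timing data is made up for illustration:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the value below which p% of samples fall."""
    ranked = sorted(samples)
    k = math.ceil(p / 100 * len(ranked)) - 1  # 0-based rank
    return ranked[max(k, 0)]

# Hypothetical response times (ms) for 20 requests, one bad outlier.
times_ms = [12, 14, 15, 15, 16, 17, 18, 18, 19, 20,
            21, 22, 23, 25, 28, 31, 40, 55, 120, 850]

for p in (50, 95, 99):
    print(f"p{p}: {percentile(times_ms, p)} ms")
# → p50: 20 ms
# → p95: 120 ms
# → p99: 850 ms
```

Note how the single 850 ms outlier leaves the median untouched but dominates p99. That's exactly why averages hide what your slowest users experience.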
Netflix and Google optimize for p99.9+ because even a tiny fraction of slow requests affects millions of users. But there's a cost—tail latency is expensive to eliminate.
### Scaling Strategies
Vertical scaling (bigger machines) has limits and gets pricey fast.
Horizontal scaling (more machines) is the usual path, but it requires your architecture to support it. Stateless services are easy to scale. Distributed databases are harder.
The key is elasticity—the ability to scale up during peaks and down during quiet periods. Cloud infrastructure makes this easier but doesn't solve the architectural challenges.
## Maintainability
Software isn't finished. It's maintained. And over time, many people will work on it—adding features, fixing bugs, adapting to new requirements.
Kleppmann breaks maintainability into three dimensions:
### Operability
Good operations teams can keep the system running smoothly. This means:
- Visibility — Logs, metrics, and traces that tell you what's happening
- Predictability — Consistent behavior that doesn't surprise you at 2 AM
- Interoperability — Understanding how your system interacts with others
- Automation — Reducing manual toil through scripting and tooling
If your on-call engineer can't diagnose an incident from the dashboards, the system isn't maintainable.
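Visibility in particular is cheap to build in from day one. A minimal sketch of structured (JSON) logging, the kind of output dashboards and alerts can actually query; the event and field names here are illustrative:

```python
import json
import logging
import time

# Structured logs: each line is a JSON object, so operators can filter
# and aggregate on fields instead of writing fragile regexes.
logging.basicConfig(level=logging.INFO, format="%(message)s")

def log_event(event, **fields):
    record = {"ts": time.time(), "event": event, **fields}
    logging.info(json.dumps(record))

log_event("request_completed", route="/timeline", status=200, duration_ms=42)
```

Compare that with a free-form `print("request done in 42ms")`: the structured version can feed a p99 latency dashboard directly.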
### Simplicity
Complexity is the enemy. As systems grow, they accumulate:
- Inconsistent terminology
- Undocumented dependencies
- Hacks to work around previous hacks
- State spaces that explode combinatorially
Good abstractions hide complexity behind clean interfaces. The goal isn't to make things simple for the computer—it's to make them simple for the humans who have to work with the code.
"Abstractions should make common things easy and rare things possible."
### Evolvability
Systems change. New use cases emerge. New technologies appear. The question isn't if your architecture will need to change, but when.
TDD, good test coverage, and clean interfaces make change safer. Microservices let teams work independently. Event-driven architectures decouple producers from consumers.
The systems that age well are the ones designed for change, not stability.
## The Practical Takeaway
These three properties—reliability, scalability, maintainability—aren't independent. A system that's hard to maintain won't stay reliable. A system that can't scale will eventually fail. They're different facets of the same problem: building software that stands the test of time.
The book goes deep on trade-offs. You'll see how every choice—relational vs. NoSQL, synchronous vs. asynchronous, monolith vs. microservices—affects these properties.
Next up: Data Models and Query Languages—how we structure the data that powers these systems.
