Book Summary: Designing Data-Intensive Applications
TLDR: Data-intensive applications should be reliable, scalable, and maintainable. A big challenge is understanding the tradeoffs so you can pick the right tool for the job; often many different tools need to be combined to build a complex system.
Reliable, Scalable, and Maintainable Systems
Hardware can fail, application usage can grow, and stakeholder needs can change. Data is extremely important to companies today. We need data systems that are reliable, scalable, and maintainable.
Reliable systems are able to tolerate human error and survive hardware or software faults. To increase reliability, look to:
- Minimize opportunities for human errors by using restrictive interfaces
- Utilize a sandbox environment
- Have automated testing
- Make use of telemetry and alerts around monitoring
Scalable systems are able to handle growth in data, traffic, and complexity. To remain scalable, your system should:
- Be designed around the operations that are most common in your workload (its key load parameters)
- Measure response time with percentiles: the median (p50) shows the typical request, and high percentiles (p95, p99) show the outliers (see the sketch after this list)
- Understand how adding resources can improve performance
- Vertical scaling – get a bigger, more expensive machine. Horizontal scaling – add nodes
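To make the percentile point concrete, here is a minimal sketch in plain Python (with made-up response times); the median shows the typical request while p99 shows what the slowest users experience, and a single outlier can drag the mean far away from both:

```python
# Minimal sketch: summarizing response times with percentiles instead of the mean.
# The sample data below is made up for illustration.

def percentile(sorted_values, p):
    """Return the value at percentile p (0-100) using the nearest-rank method."""
    index = max(0, int(round(p / 100 * len(sorted_values))) - 1)
    return sorted_values[index]

response_times_ms = sorted([12, 14, 15, 15, 16, 18, 20, 22, 25, 950])  # one slow outlier

mean = sum(response_times_ms) / len(response_times_ms)
p50 = percentile(response_times_ms, 50)
p99 = percentile(response_times_ms, 99)

print(f"mean={mean:.1f}ms p50={p50}ms p99={p99}ms")
# The mean (~110ms) is dragged up by one slow request, while the median (p50)
# reflects the typical user and p99 reflects the slowest requests.
```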
Maintainable systems are simple and able to evolve. To have an easy-to-maintain system:
- Use abstractions to hide implementation details as much as possible. Smaller code files are easier to reason about.
- Make it easy for engineers to work productively by implementing CI and having organized documentation.
Technical Implementation Details
There are always tradeoffs when choosing a specific tool over another. The goal of a system designer is to understand the tradeoffs so they can pick the right tool for the job. Often, many different tools will be needed and these will be combined into a functional data system.
Here are some technical highlights that stood out to me while reading this book:
Dimensional Model
- Anything that is meaningful to humans may need to change at some point in the future – using an id to represent a human-meaningful value makes that change easier, since only the id-to-value mapping is updated, not every record that references it.
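As a small illustration (hypothetical tables sketched as Python dicts, not from the book), a fact table can reference a dimension row by surrogate id so that renaming the human-meaningful value touches only one place:

```python
# Hypothetical dimension and fact "tables" showing why facts store surrogate ids
# rather than human-meaningful names.

product_dim = {
    1: {"name": "Espresso Machine", "category": "Kitchen"},
    2: {"name": "Standing Desk", "category": "Office"},
}

sales_facts = [
    {"product_id": 1, "quantity": 2, "price_usd": 199.00},
    {"product_id": 2, "quantity": 1, "price_usd": 449.00},
    {"product_id": 1, "quantity": 1, "price_usd": 189.00},
]

# If product 1 is renamed, only the dimension row changes; every sales fact
# that references id 1 stays untouched.
product_dim[1]["name"] = "Espresso Machine Pro"

for sale in sales_facts:
    name = product_dim[sale["product_id"]]["name"]
    print(name, sale["quantity"], sale["price_usd"])
```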
Two general purposes for a Database
- OLTP – Online Transactional Processing
- This is the database your user-facing application should use. It is optimized for a high volume of transactions, each reading or writing only a small number of records, typically looked up by key.
- MySQL, PostgreSQL
- OLAP – Online Analytical Processing
- This is the data warehouse where analytic queries from a BI tool run. It should use column-oriented storage, which optimizes aggregation queries that scan a few columns across a very large number of records (see the sketch below).
- Redshift
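To make the row-oriented vs. column-oriented distinction concrete, here is a minimal sketch in plain Python (made-up sales data, not any particular database engine); the aggregation only needs to touch one column in the columnar layout:

```python
# Minimal sketch: the same data in a row-oriented layout (OLTP-style)
# and a column-oriented layout (OLAP-style).

# Row-oriented: each record is stored together, good for reading or writing
# a single record at a time.
rows = [
    {"order_id": 1, "customer": "alice", "amount": 30.0},
    {"order_id": 2, "customer": "bob",   "amount": 12.5},
    {"order_id": 3, "customer": "alice", "amount": 99.0},
]

# Column-oriented: each column is stored together, good for scanning one
# column across many records (and the columns compress well).
columns = {
    "order_id": [1, 2, 3],
    "customer": ["alice", "bob", "alice"],
    "amount":   [30.0, 12.5, 99.0],
}

# An aggregation like SUM(amount) only reads the "amount" column in the
# columnar layout, instead of loading every full row.
total_from_rows = sum(r["amount"] for r in rows)
total_from_columns = sum(columns["amount"])
assert total_from_rows == total_from_columns
```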
Encoding Data
- Data should be encoded to provide forward AND backward compatibility. You can’t upgrade all nodes at once, so some will need to run an older version of the code temporarily during a rolling upgrade.
- Avro offers an easy way to maintain forward and backward compatibility. Writers and readers can use different schemas because fields are matched by name: the reader ignores fields that aren’t in its schema and fills in fields missing from the writer’s data using the default values declared in its own schema.
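As a rough sketch of that resolution rule (plain Python, not the actual Avro library or its wire format), the reader matches fields by name, drops fields it doesn’t know about, and fills in missing fields from its own defaults:

```python
# Minimal sketch of Avro-style schema resolution (not the real Avro library).
# The reader matches fields by name, ignores unknown fields, and fills in
# fields missing from the writer's data using defaults from its own schema.

# Older writer schema: no "email" field yet, still has "legacy_flag".
writer_record = {"id": 42, "name": "Ada", "legacy_flag": True}

# Newer reader schema: expects "email" (with a default) and dropped "legacy_flag".
reader_schema = {
    "id": None,          # required field, no default
    "name": None,        # required field, no default
    "email": "unknown",  # new field with a default value
}

def resolve(record, schema):
    resolved = {}
    for field, default in schema.items():
        # Use the writer's value if present, otherwise fall back to the default.
        resolved[field] = record.get(field, default)
    # Fields the reader does not know about ("legacy_flag") are simply dropped.
    return resolved

print(resolve(writer_record, reader_schema))
# {'id': 42, 'name': 'Ada', 'email': 'unknown'}
```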
Distributed Data
- Replicate data across nodes in a cluster (horizontal scaling). The hard part is keeping the replicas in sync when the data changes. Different techniques for keeping data in sync:
- Leader/follower replication – clients write to the leader node and followers replicate those writes. Asynchronous replication can lead to the problem of replication lag
- Replication Logs – log of changes sent to followers
- Partition data by splitting it into chunks and storing those partitions on different nodes. This helps scale by spreading data and query load across nodes (see the sketch after this list).
- Partial failures in distributed systems can make it hard to tell what failed. Unreliable networks, slightly inaccurate system clocks, and asynchronous communication make it difficult to know for certain the status of each node. Distributed systems use consensus algorithms so nodes can agree on the state.
- ZooKeeper is a coordination service that can provide consensus.
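As a minimal sketch of the partitioning idea mentioned above (plain Python with hypothetical node names, ignoring rebalancing and replication), records can be assigned to partitions by hashing their key, which spreads data and query load across nodes:

```python
# Minimal sketch: assigning records to partitions by hashing the key.
# Real systems use more sophisticated schemes (key ranges, consistent hashing,
# rebalancing), but the core idea is the same.
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical cluster

def partition_for(key: str, num_partitions: int) -> int:
    # Use a stable hash (unlike Python's randomized built-in hash()) so that
    # every client computes the same partition for the same key.
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_partitions

def node_for(key: str) -> str:
    return NODES[partition_for(key, len(NODES))]

for user_id in ["user-1", "user-2", "user-3", "user-4"]:
    print(user_id, "->", node_for(user_id))
```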
Derived Data
- The first place data is written, where each fact is stored exactly once, should be your source of truth and remain immutable.
- Derived data is the result of processing this immutable “source” data and can be recreated at any time from this input data.
- Batch Processing
- Fixed-size chunks of data, typically one day’s worth. “Bounded” data.
- Input is immutable to these jobs. The output of one job can easily be “piped” as the input to another job in the pipeline.
- Complex data-processing problems are broken up into small jobs that each do one thing really well.
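In that spirit, here is a minimal sketch (plain Python with made-up log lines, in the style of a Unix pipeline rather than a real MapReduce framework) of small batch jobs chained together, each stage’s output feeding the next:

```python
# Minimal sketch: a batch pipeline of small jobs that each do one thing,
# with the output of one stage fed as input to the next (Unix-pipe style).
from collections import Counter

# "Bounded" input: a fixed chunk of (made-up) web-server log lines for one day.
log_lines = [
    "GET /home 200",
    "GET /about 200",
    "GET /home 200",
    "GET /missing 404",
]

def extract_paths(lines):   # job 1: parse out the requested path
    return (line.split()[1] for line in lines)

def count_paths(paths):     # job 2: aggregate a count per path
    return Counter(paths)

def top_n(counts, n):       # job 3: sort and keep the top results
    return counts.most_common(n)

# The jobs compose like `cat access.log | awk | sort | uniq -c | head`.
print(top_n(count_paths(extract_paths(log_lines)), 2))
# [('/home', 2), ('/about', 1)]
```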
- Stream Processing
- Processing every event as it happens. “Unbounded” data.
- Input is a single event or record. A record is a self-contained, immutable object with details of what happened at some point in time.
- Events are immutable and the event log is your source of truth where other data can be derived from.
- Log-based message brokers, like Kafka, add durability to a stream processing pipeline. The logs of events can be backed up to someplace like S3.
- A log is an append-only sequence of records
- Partitions of a single stream live on different machines
- Old messages can be “replayed” to re-run them through the pipeline
- Kafka guarantees the ordering of messages within a partition. Adding more partitions and consumers allows for horizontal scaling.
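Finally, here is a minimal sketch (plain Python, not Kafka’s actual API) of the log-based broker idea: an append-only log per partition, consumer-managed offsets, and the ability to replay from the start:

```python
# Minimal sketch of a log-based message broker (not Kafka's real API):
# append-only logs, one per partition, with consumer-managed offsets.

class PartitionedLog:
    def __init__(self, num_partitions: int):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key: str, event: dict) -> int:
        """Append an event to the partition chosen by its key; ordering is
        only guaranteed within a single partition."""
        partition = hash(key) % len(self.partitions)
        self.partitions[partition].append(event)
        return partition

    def read(self, partition: int, offset: int):
        """Return events at and after `offset`; nothing is deleted on read,
        so a consumer can replay from offset 0 at any time."""
        return self.partitions[partition][offset:]

log = PartitionedLog(num_partitions=2)
p = log.append("user-1", {"type": "page_view", "path": "/home"})
log.append("user-1", {"type": "page_view", "path": "/about"})

offset = 0                          # each consumer tracks its own offset
for event in log.read(p, offset):
    print(event)
offset = len(log.partitions[p])     # the consumer "commits" how far it has read
```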
What’s next?
The book ends by thinking about the future of data. The big takeaway is to treat data with respect. Think of your users as humans, not as metrics to be optimized. Organizations should give users as much control over their data as possible.