Day 20/100

Designing Data-Intensive Applications [Book Highlights]

[Part I: Chapter 1]

Reliable, Scalable and Maintainable Applications

  • Many applications today are data-intensive, as opposed to compute-intensive.
  • Most applications need some combination of the following: databases, caches, search indexes, stream processing, and batch processing.

Reliability

  • The application performs the function that the user expected.
  • It can tolerate the user making mistakes or using the software in unexpected ways.
  • Its performance is good enough, under the expected load and data volume.
  • The system prevents any unauthorised access and abuse.
  • In short, reliability means "continuing to work correctly, even when things go wrong". Systems that anticipate faults and can cope with them are called fault-tolerant or resilient.
  • It is generally best to design fault-tolerance mechanisms that prevent faults from causing failures.

Hardware Faults

  • Hard disks crash, RAM becomes faulty, the power grid has a blackout, someone unplugs the wrong network cable, etc.
  • Hard disks are reported as having a mean time to failure (MTTF) of about 10 to 50 years, thus on a storage cluster with 10,000 disks, we should expect on average one disk to die per day.
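
For a rough sense of that estimate, here is a back-of-the-envelope sketch (the 10,000-disk cluster size and the 10 to 50 year MTTF figures come from the highlight above; assuming failures are independent and spread evenly over the MTTF window):

```python
# Back-of-the-envelope: expected disk failures per day in a large cluster,
# assuming independent failures spread evenly over the MTTF window.
def expected_failures_per_day(num_disks: int, mttf_years: float) -> float:
    mttf_days = mttf_years * 365
    return num_disks / mttf_days

for mttf in (10, 50):
    rate = expected_failures_per_day(10_000, mttf)
    print(f"MTTF {mttf} years: ~{rate:.1f} disk failures/day")

# MTTF 10 years: ~2.7 disk failures/day
# MTTF 50 years: ~0.5 disk failures/day
# => "roughly one disk per day" is a reasonable ballpark for a 10,000-disk cluster.
```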

Software Faults

  • Software faults are another class of fault: systematic errors within the system.
  • There is no quick solution to the problem of systematic faults in software.

Human Errors

  • Design systems in a way that minimizes opportunities for error.
  • For example, well-designed abstractions, APIs, and admin interfaces make it easy to do “the right thing” and discourage “the wrong thing.”
  • Decouple the places where people make the most mistakes from the places where they can cause failures.
  • Provide fully featured non-production sandbox environments where people can explore and experiment safely.
  • Test thoroughly at all levels, from unit tests to whole-system integration tests and manual tests.
  • Allow quick and easy recovery from human errors, to minimize the impact in the case of a failure. For example, make it fast to roll back configuration changes.
  • Set up detailed and clear monitoring, such as performance metrics and error rates.

There are situations in which we may choose to sacrifice reliability in order to reduce development cost or operational cost, but we should be very conscious of when we are cutting corners.

Scalability

Scalability is the term we use to describe a system’s ability to cope with increased load. You can look at it in two ways:

  • When you increase a load parameter and keep the system resources (CPU, memory, network bandwidth, etc.) unchanged, how is the performance of your system affected?
  • When you increase a load parameter, how much do you need to increase the resources if you want to keep performance unchanged?

  • For batch jobs we usually care about throughput: the number of records we can process per second. For online systems, the service's response time usually matters more.

  • Latency and response time are often used synonymously, but they are not the same.
  • Response time is what the client sees: besides the actual time to process the request, it includes network and queueing delays. Latency is the duration that a request is waiting to be handled, during which it is latent, awaiting service.
  • It's common to see the average response time of a service reported, but it is usually better to use percentiles than averages.
  • If you take your list of response times and sort it from fastest to slowest, the median is the halfway point.
  • If the 95th percentile response time is 1.5 seconds, that means 95 out of 100 requests take less than 1.5 seconds, and 5 out of 100 requests take 1.5 seconds or more (a short sketch follows below).
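
To make the percentile idea concrete, here is a minimal sketch (the `percentile` helper and the sample response times are hypothetical, for illustration only; in practice your monitoring system computes these for you):

```python
import math

def percentile(response_times_ms, p):
    """Nearest-rank percentile: the value below which roughly p% of samples fall."""
    ordered = sorted(response_times_ms)                 # sort from fastest to slowest
    rank = max(1, math.ceil(p / 100 * len(ordered)))    # 1-based rank
    return ordered[rank - 1]

# Hypothetical response times in milliseconds (illustrative sample only).
samples = [12, 15, 20, 22, 25, 30, 35, 40, 120, 1500]

print("p50 (median):", percentile(samples, 50))   # -> 25
print("p95:", percentile(samples, 95))            # -> 1500
print("mean:", sum(samples) / len(samples))       # -> 181.9
```

The mean (181.9 ms) is dragged up by the single slow request, while the median shows that a typical request finishes in about 25 ms; this is why percentiles describe what users actually experience better than averages do.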