Three Concerns in Software Systems

As we start the system design journey, I would first really recommend checking out Designing Data-Intensive Applications, or as people in the tech community call it, the big red pig book :D Most of the posts in the fundamentals were actually inspired by this book and its great author, Martin Kleppmann.
First of all, it is important to note that many applications are data-intensive, as opposed to compute-intensive. There are a few types of building blocks we use to create data-intensive applications: databases (to store data so it can be found again later), caches (to remember the results of expensive operations), search indexes (to let users search and filter data), stream processing (to send messages between processes asynchronously), and batch processing (to periodically crunch large amounts of accumulated data).
For software systems, there are three concerns that are important: reliability, scalability, and maintainability.
“Anything that can go wrong will go wrong.” (Murphy’s law)
The straightforward definition of reliability is the property of a system to continue working correctly even when faced with faults. Faults are things that can go wrong, and we call systems that can handle them fault-tolerant, resilient, or reliable. It’s important to note the difference between failure and fault: a fault is one system component’s deviation from its spec, while a failure is a situation where the whole system stops providing its service to the user. How do we test a system for resiliency? One of the utilities I found amazing is Netflix’s Chaos Monkey, which randomly terminates instances in production to ensure that engineers implement their services to be resilient to instance failures (a toy sketch of the idea follows this paragraph). Faults can be categorized into three categories: hardware faults (disk crashes, faulty RAM, power outages), software errors (bugs, runaway processes, cascading failures), and human errors (e.g. configuration mistakes, one of the leading causes of outages).
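Here is a minimal chaos-testing sketch in Python, just to illustrate the principle; the Instance class and its terminate method are made up for this example and are not Netflix’s actual implementation, which works against real cloud instances.

```python
import random

class Instance:
    """Stand-in for a running service instance (hypothetical)."""
    def __init__(self, name: str):
        self.name = name
        self.alive = True

    def terminate(self) -> None:
        # In a real setup this would call your cloud provider's API.
        self.alive = False

def chaos_monkey(pool: list[Instance]) -> None:
    """Randomly kill one healthy instance to test fault tolerance."""
    healthy = [i for i in pool if i.alive]
    if healthy:
        victim = random.choice(healthy)
        victim.terminate()
        print(f"Chaos Monkey terminated {victim.name}")

# Run it against a small pool, then verify the service still responds.
pool = [Instance(f"node-{n}") for n in range(3)]
chaos_monkey(pool)
print("still alive:", [i.name for i in pool if i.alive])
```

The point of running something like this continuously, and in production, is that fault tolerance stops being a theoretical property and becomes something exercised every day.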
We use the term scalability to describe a system’s ability to handle increased load. To describe load we use load parameters, which can be anything from cache hit rate or database read/write ratio to the number of requests to a web server per unit of time. Let’s go over a quick example of how we choose a load parameter, using Twitter user timeline generation. When a user posts a tweet, we get the list of all the users who follow the original poster and insert the new tweet into each of their cached timelines. This moves work from the read path to the write path; in other words, we spend more time on writing to make the read operation faster. One of the important load parameters for this scenario is the number of followers per user, since it often happens that we have celebrity/hot users, and inserting a tweet into the timelines of all of their followers would take too long. One approach is to detect celebrity users and skip the fan-out for them: during timeline generation for any user, we merge that user’s cached timeline (containing tweets from non-celebrity accounts) with separately fetched tweets from the celebrities they follow. A sketch of this hybrid approach follows.
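Here is a minimal sketch of that hybrid approach in Python. The in-memory dicts (standing in for the timeline cache and tweet store) and the CELEBRITY_THRESHOLD value are assumptions for illustration, not Twitter’s actual design.

```python
from collections import defaultdict

CELEBRITY_THRESHOLD = 1_000_000  # follower count above which we skip fan-out

followers = defaultdict(set)          # author -> set of follower ids
following = defaultdict(set)          # user -> set of accounts they follow
timelines = defaultdict(list)         # user -> cached (timestamp, tweet) pairs
celebrity_tweets = defaultdict(list)  # celebrity author -> their own tweets

def post_tweet(author: str, ts: int, text: str) -> None:
    if len(followers[author]) >= CELEBRITY_THRESHOLD:
        # Celebrity: store the tweet once; it is merged in at read time.
        celebrity_tweets[author].append((ts, text))
    else:
        # Regular user: fan out the tweet to every follower's cached timeline.
        for follower in followers[author]:
            timelines[follower].append((ts, text))

def read_timeline(user: str) -> list[tuple[int, str]]:
    # Merge the precomputed timeline with celebrity tweets fetched on read.
    result = list(timelines[user])
    for account in following[user]:
        result.extend(celebrity_tweets.get(account, []))
    return sorted(result, reverse=True)  # newest first
```

The threshold-based split bounds the write amplification for celebrity accounts while keeping reads cheap for everyone else, whose timelines remain fully precomputed.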
We can describe performance by asking ourselves two questions (two ways of looking at the same trade-off): when we increase a load parameter and keep the system resources (CPU, memory, network bandwidth) unchanged, how is the performance affected? And when we increase a load parameter, how much do we need to increase the resources to keep performance unchanged?
It’s important to note the difference between latency and response time. The response time is what the client sees: besides the actual time to process the request (the service time), it includes network delays and queueing delays. Latency is the duration a request is waiting to be handled, during which it is latent. To describe the performance and availability of a service (e.g. in service level objectives, SLOs, and service level agreements, SLAs) we often use percentiles: if the 95th percentile response time is 1.5 seconds, that means 95 out of 100 requests take less than 1.5 seconds. During load testing it is important to account for tail latency amplification: when serving one user request requires several parallel backend calls, the slowest call makes the whole request slow, since it sits on the critical path. To measure this correctly, the load-generating client should keep sending requests independently of the response time, rather than waiting for each response before sending the next one; otherwise the measured queueing delays will be artificially short. Histograms are the right way of aggregating response time data.
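As a quick sketch of working with percentiles, the following Python snippet computes p50/p95 from raw response times using only the standard library (the sample data is made up):

```python
import statistics

# Made-up response times in seconds for a batch of requests.
response_times = [0.12, 0.15, 0.11, 0.20, 1.50, 0.13,
                  0.18, 0.16, 0.14, 2.30, 0.17, 0.19]

# statistics.quantiles with n=100 returns 99 cut points;
# index 49 is the 50th percentile, index 94 the 95th.
cuts = statistics.quantiles(response_times, n=100)
print(f"p50 = {cuts[49]:.2f}s")  # median response time
print(f"p95 = {cuts[94]:.2f}s")  # tail latency that SLOs usually target
```

In production you would not keep raw samples like this but aggregate them into histograms (HdrHistogram is a popular choice), which can be merged across machines and time windows.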
You often hear about the difference between scaling up (vertical scaling: moving to a more powerful machine) and scaling out (horizontal scaling: distributing the load across multiple smaller machines). Distributing load across multiple machines is also known as a shared-nothing architecture. Today it is common to have elastic systems, which automatically detect load increases or decreases and add or remove machines (nodes) accordingly.
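As a toy illustration of elasticity (not any particular cloud provider’s algorithm), here is a sketch of a scaling rule that derives a node count from a measured load parameter; the capacity_per_node and min_nodes values are made up:

```python
import math

def desired_nodes(requests_per_sec: float,
                  capacity_per_node: float = 500.0,
                  min_nodes: int = 2) -> int:
    """Pick a node count so each node stays under its capacity."""
    needed = math.ceil(requests_per_sec / capacity_per_node)
    return max(min_nodes, needed)

print(desired_nodes(120))   # light load -> stays at the floor of 2 nodes
print(desired_nodes(4200))  # heavier load -> scales out to 9 nodes
```

Real autoscalers add hysteresis and cooldown periods on top of a rule like this so the system doesn’t flap between scaling out and scaling back in.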
Maintainability can be broken down into three concepts: operability (making it easy for operations teams to keep the system running smoothly), simplicity (making it easy for new engineers to understand the system by removing accidental complexity), and evolvability (making it easy to adapt the system to new, unanticipated requirements).
A project with high complexity is sometimes called a big ball of mud.
Reducing complexity greatly improves the maintainability of software, so simplicity should be a key goal when building systems. Implementing good abstractions can help reduce complexity and make the system easier to evolve.
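To make the abstraction point concrete, here is a small hypothetical Python example: callers depend only on a simple KeyValueStore interface, so the implementation details stay hidden and the backend can evolve without touching calling code.

```python
from abc import ABC, abstractmethod

class KeyValueStore(ABC):
    """The abstraction: callers only ever see get/put."""
    @abstractmethod
    def get(self, key: str) -> str | None: ...

    @abstractmethod
    def put(self, key: str, value: str) -> None: ...

class InMemoryStore(KeyValueStore):
    """One implementation; it could be swapped for a disk- or
    network-backed store without changing any calling code."""
    def __init__(self) -> None:
        self._data: dict[str, str] = {}

    def get(self, key: str) -> str | None:
        return self._data.get(key)

    def put(self, key: str, value: str) -> None:
        self._data[key] = value

store: KeyValueStore = InMemoryStore()
store.put("user:1", "alice")
print(store.get("user:1"))
```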
We can define complexity as accidental if it is not inherent in the problem that the software solves (as seen by the users) but arises only from the implementation. It’s also important to note the difference between functional and nonfunctional requirements as our system evolves: functional requirements describe what the system should do (e.g. allowing data to be stored, retrieved, and searched), while nonfunctional requirements describe general properties like security, reliability, compliance, scalability, compatibility, and maintainability.