Encoding and Evolution

When an application’s features change, the data it stores usually needs to change as well. Relational databases enforce a single schema: the schema can be changed (for example via migrations), but exactly one schema is in effect at any given time. Schema-on-read (or “schemaless”) databases, on the other hand, can contain a mixture of older and newer data formats.

In large applications, changes are not made instantly. Instead, a rolling upgrade is typically used, where a new version of the code is deployed to a few nodes at a time, gradually working through all the nodes without causing any downtime in the service. This means that old and new versions of the code, and old and new data formats, may potentially all coexist. In order to ensure that the application continues to function properly, it is necessary to maintain compatibility in both directions: backward compatibility, where newer code can read data that was written by older code, and forward compatibility, where older code can read data that was written by newer code.

Formats for encoding data

There are two different representations of data: in memory and as a byte sequence. In order to write data to a file or send it over a network, it must be encoded into a byte sequence. This process of converting the in-memory representation of data into a byte sequence is called encoding, serialization, or marshalling, and the reverse process of converting a byte sequence into an in-memory representation is called decoding, parsing, deserialization, or unmarshalling.
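
A minimal illustration of the round trip in Python, using JSON as the target format (the particular format does not matter here, only the conversion between the two representations):

    import json

    # In-memory representation: a Python dict of objects and references
    record = {"user_id": 42, "name": "Ada", "interests": ["databases", "chess"]}

    # Encoding (serialization): in-memory object -> self-contained byte sequence
    encoded = json.dumps(record).encode("utf-8")

    # Decoding (deserialization): byte sequence -> in-memory object
    decoded = json.loads(encoded.decode("utf-8"))
    assert decoded == record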

Programming languages typically include built-in support for encoding and decoding objects, but it is generally a bad idea to use these built-in functions. They are tied to a particular programming language, and the decoding process can be a security hole because it allows the instantiation of arbitrary classes. In addition, versioning and efficiency are usually afterthoughts in these built-in encodings.
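
Python’s pickle module is one example of such a language-specific encoding; this small sketch shows why the resulting bytes are hard to share across languages and why decoding untrusted input is dangerous:

    import pickle

    class User:
        def __init__(self, user_id, name):
            self.user_id = user_id
            self.name = name

    data = pickle.dumps(User(42, "Ada"))
    # The byte stream embeds the Python module and class names, so only a
    # Python process with a compatible User class can decode it -- and
    # unpickling untrusted bytes can instantiate arbitrary classes and run
    # code, which is why pickle should never be used on untrusted data.
    restored = pickle.loads(data)
    assert restored.user_id == 42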

Instead, it is better to use a standardized encoding that can be read and written by many different programming languages. JSON, XML, and CSV are human-readable and popular, especially as data interchange formats, but they have some subtle issues. For example, they are ambiguous about how numbers are encoded: JSON does not distinguish integers from floating-point numbers, and large integers can silently lose precision when parsed as floats. They also support Unicode character strings but not binary strings; people often work around this limitation by encoding binary data as Base64, which increases the data size by about 33%. Additionally, XML and JSON have optional schema support, while CSV has none.
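
Both issues are easy to demonstrate in Python (the exact byte counts depend on the payload, but the roughly one-third Base64 overhead is inherent to the encoding):

    import base64, json

    # Integers above 2**53 cannot be represented exactly as 64-bit floats,
    # so a JSON parser that treats all numbers as floats (as JavaScript's
    # does) silently corrupts them, even though the JSON text is exact.
    big = 2**53 + 1
    print(json.dumps({"id": big}))       # the textual form is exact
    print(float(big) == float(big - 1))  # True: as doubles the two values collide

    # Binary data has to be smuggled through as Base64 text,
    # which inflates it by roughly 33%.
    blob = bytes(300)
    print(len(base64.b64encode(blob)))   # 400 bytes for a 300-byte payload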

Binary Encoding

Binary encoding is more efficient than text-based encoding formats like JSON and XML. There are binary encodings for JSON (such as MessagePack, BSON, BJSON, UBJSON, BISON, and Smile), as well as for XML (such as WBXML and Fast Infoset).
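
As a rough comparison, assuming the third-party msgpack package is installed (pip install msgpack), the size difference can be measured directly; the savings are real but modest, because the field names are still included in the encoded data:

    import json
    import msgpack  # third-party MessagePack implementation

    record = {"userName": "Martin", "favoriteNumber": 1337,
              "interests": ["daydreaming", "hacking"]}

    as_json = json.dumps(record).encode("utf-8")
    as_msgpack = msgpack.packb(record)

    print(len(as_json), len(as_msgpack))       # the binary encoding is smaller
    assert msgpack.unpackb(as_msgpack) == record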

Apache Thrift and Protocol Buffers (protobuf) are binary encoding libraries that offer compact and efficient encoding and decoding of data. Thrift offers two protocols:

  1. BinaryProtocol, which uses field tags (numbers such as 1, 2) instead of field names.
  2. CompactProtocol, which packs the same information as BinaryProtocol into less space by packing the field type and the tag number into the same byte.

Protocol Buffers are similar to Thrift’s CompactProtocol, but use a different approach to bit packing that may result in smaller compressed sizes.
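
The essence of that approach can be sketched in a few lines of Python. This is a deliberately simplified illustration of the idea (just the varint and the tag/wire-type key), not the complete Protocol Buffers wire format:

    def encode_varint(n):
        # Protobuf-style varint: 7 bits per byte, high bit set on all but the last.
        out = bytearray()
        while True:
            byte = n & 0x7F
            n >>= 7
            if n:
                out.append(byte | 0x80)
            else:
                out.append(byte)
                return bytes(out)

    def encode_field_key(tag, wire_type):
        # The field tag and the wire type are packed into a single varint:
        # (tag << 3) | wire_type. Field names never appear on the wire.
        return encode_varint((tag << 3) | wire_type)

    # Field tag 1 with wire type 0 (varint), followed by the value 1337:
    print((encode_field_key(1, 0) + encode_varint(1337)).hex())  # "08b90a"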

Schemas inevitably need to change over time (a process known as schema evolution), so it is important for binary encoding libraries to support both backward and forward compatibility. Thrift and Protocol Buffers handle schema changes in the following ways:

  • Forward compatibility: new fields are added with new tag numbers, so that old code that tries to read new data can simply ignore unrecognized tags.
  • Backward compatibility: as long as each field has a unique tag number, new code can always read old data. However, every field added after the initial deployment of the schema must be optional or have a default value.
  • Removing fields is similar to adding fields, but with the backward and forward compatibility concerns reversed. A field can only be removed if it is optional, and its tag number must never be used again.

Changing the data type of a field can be risky, as there is a potential for values to lose precision or be truncated. Thrift and Protocol Buffers do not provide specific support for changing the data type of a field, so care must be taken to ensure that the new data type is compatible with the old one and can represent all the values that were stored in the old data type.
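
To make the compatibility rules above concrete, here is a toy decoder in the same simplified format as the earlier sketch: old code that only knows some of the tags consumes and discards the fields it does not recognize, which is exactly what gives Thrift and Protocol Buffers their forward compatibility. (The format and field names are illustrative, not the real wire format.)

    def decode_varint(buf, pos):
        # Decode a varint starting at pos; return (value, new_pos).
        shift = value = 0
        while True:
            byte = buf[pos]
            pos += 1
            value |= (byte & 0x7F) << shift
            if not byte & 0x80:
                return value, pos
            shift += 7

    def decode_record(buf, known_fields):
        # Keep only the tags this (old) version of the code knows about,
        # skipping the rest: the essence of forward compatibility.
        result, pos = {}, 0
        while pos < len(buf):
            key, pos = decode_varint(buf, pos)
            tag, wire_type = key >> 3, key & 0x07
            if wire_type == 0:                       # varint value
                value, pos = decode_varint(buf, pos)
            else:                                    # length-prefixed bytes
                length, pos = decode_varint(buf, pos)
                value, pos = buf[pos:pos + length], pos + length
            if tag in known_fields:
                result[known_fields[tag]] = value
            # unknown tag: the value has been consumed, then simply ignored
        return result

    # A record written by *new* code: tag 1 (known) plus tag 3 (added later).
    encoded = bytes.fromhex("08b90a") + bytes.fromhex("1803")
    print(decode_record(encoded, {1: "favorite_number"}))  # {'favorite_number': 1337}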

Avro

Apache Avro is another binary format that uses a schema to encode data compactly and efficiently. Avro has two schema languages: Avro IDL, which is intended for human editing, and a JSON-based schema language that is more easily machine-readable. When encoding data, the Avro library uses the writer’s schema to determine how each field is laid out in the byte sequence. When decoding, the library needs to know the exact writer’s schema that the data was encoded with, and it translates the data into the reader’s schema that the application expects. If the decoder assumes the wrong writer’s schema, the data will be decoded incorrectly.

Avro supports schema evolution: the writer’s schema and the reader’s schema do not have to be identical, only compatible. When they differ, the Avro library resolves the differences by comparing the two schemas field by field and applying a set of resolution rules. This gives Avro both forward compatibility, where a new version of the schema is used as the writer and an old version as the reader, and backward compatibility, where a new version of the schema is used as the reader and an old version as the writer.

To maintain compatibility, Avro allows new fields to be added or old fields to be removed, but only if the field has a default value. If a field is added without a default value, new readers will not be able to read data written by old writers. Avro also allows the data type of a field to be changed, provided that Avro can convert the old type to the new type. However, changing the name of a field is trickier and may not be forward or backward compatible.
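
The following sketch shows this resolution in action, assuming the third-party fastavro package (pip install fastavro) and illustrative schemas: data written with an old schema is read with a newer reader’s schema, and the added field is filled in from its default value:

    import io
    from fastavro import schemaless_writer, schemaless_reader

    # Writer's schema: the old version of the record.
    writer_schema = {
        "type": "record", "name": "User",
        "fields": [{"name": "name", "type": "string"}],
    }

    # Reader's schema: a newer version that added a field with a default.
    reader_schema = {
        "type": "record", "name": "User",
        "fields": [
            {"name": "name", "type": "string"},
            {"name": "favorite_number", "type": "int", "default": -1},
        ],
    }

    buf = io.BytesIO()
    schemaless_writer(buf, writer_schema, {"name": "Ada"})
    buf.seek(0)

    # Avro resolves the two schemas; the missing field gets its default value.
    print(schemaless_reader(buf, writer_schema, reader_schema))
    # {'name': 'Ada', 'favorite_number': -1}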

The reader needs some way of knowing the writer’s schema, and how it is communicated depends on the context in which the data is used. In a large file with many records, the writer can include the schema once at the beginning of the file. In a database of individually written records, different records may have been written with different schema versions, so a schema version number is included at the beginning of each encoded record and the full schemas are kept in a registry. When sending records over a network connection, the two processes can negotiate the schema version during the connection setup.
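
A minimal sketch of the per-record version number approach, with a hypothetical in-process registry and plain JSON standing in for the actual Avro encoding:

    import json
    import struct

    # Hypothetical registry mapping a schema version to the writer's schema.
    SCHEMA_REGISTRY = {
        1: {"fields": ["name"]},
        2: {"fields": ["name", "favorite_number"]},
    }

    def encode_record(version, record):
        body = json.dumps(record).encode("utf-8")  # stand-in for the Avro encoding
        return struct.pack(">B", version) + body   # 1-byte schema version prefix

    def decode_record(data):
        version = data[0]
        writer_schema = SCHEMA_REGISTRY[version]   # look up the writer's schema
        return writer_schema, json.loads(data[1:].decode("utf-8"))

    blob = encode_record(2, {"name": "Ada", "favorite_number": 1337})
    print(decode_record(blob))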

Avro is well-suited to dynamically generated schemas, such as those produced by exporting data from a database. It is easy to generate an Avro schema in JSON, and if the database schema changes, a new Avro schema can be generated and used to export the updated data. In contrast, with Thrift and Protocol Buffers, every time the database schema changes, the mappings from database column names to field tags must be manually updated.
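
Generating such a schema is mechanical. The sketch below (with an illustrative, incomplete type mapping) turns column metadata of the kind you might read from a database’s information_schema into an Avro record schema:

    # Illustrative mapping from SQL column types to Avro types.
    AVRO_TYPES = {"integer": "int", "bigint": "long", "text": "string", "boolean": "boolean"}

    def avro_schema_for_table(table_name, columns):
        # columns: list of (column_name, sql_type, nullable) tuples
        fields = []
        for name, sql_type, nullable in columns:
            avro_type = AVRO_TYPES.get(sql_type, "string")
            if nullable:
                # Nullable columns become a union with null and a null default.
                fields.append({"name": name, "type": ["null", avro_type], "default": None})
            else:
                fields.append({"name": name, "type": avro_type})
        return {"type": "record", "name": table_name, "fields": fields}

    print(avro_schema_for_table("users", [("id", "bigint", False), ("name", "text", True)]))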

Textual formats like JSON, XML, and CSV are widely used for encoding and decoding data. However, binary encodings that are based on schemas are also a viable option. They have several advantages over textual formats:

  • They can be much more compact, because they can omit field names from the encoded data.
  • The schema is a valuable form of documentation, and because it is required for decoding the data, you can be sure that it is up to date.
  • A database of schemas allows you to check for compatibility when making changes to the schema.
  • Generating code from the schema is useful, because it enables type checking at compile time.

Overall, binary encodings based on schemas offer many benefits over textual formats, and are worth considering for applications that need to encode and decode data in a compact, efficient, and well-documented way.

Modes of dataflow

There are several ways that data can flow between processes, each with its own advantages and disadvantages.

Via databases: in this approach, the process that writes to the database encodes the data, and the process that reads from the database decodes it. This allows data to be shared between different versions of an application, but it can be expensive and time-consuming to migrate or rewrite old data.

Via service calls: when processes need to communicate over a network, they can use remote procedure calls (RPC) or RESTful APIs to send and receive data. RPC frameworks try to make network requests look like local function calls, but the two are fundamentally different: a network request can fail, time out, or return late, and these failures are outside the application’s control. RESTful APIs do not hide the fact that a network is involved; they are better suited to experimentation and debugging, and are often used for public APIs.

Via asynchronous message passing: in this approach, a client’s request (message) is delivered to another process via a message broker, which stores the message temporarily and ensures that it is delivered to the recipient. This decouples the sender from the recipient and allows messages to be delivered even if the recipient is unavailable or overloaded. It also allows messages to be sent to multiple recipients and avoids the sender needing to know the IP address and port number of the recipient. Open source implementations of message brokers include RabbitMQ, ActiveMQ, HornetQ, NATS, and Apache Kafka.
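
As a sketch of the sender’s side, using the third-party pika client and assuming a RabbitMQ broker running on localhost (the queue name and message contents are illustrative):

    import json
    import pika  # third-party RabbitMQ client

    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    channel.queue_declare(queue="user_events", durable=True)

    # The sender only needs the queue name; it knows nothing about which
    # process will consume the message, where it runs, or when it is online.
    message = json.dumps({"user_id": 42, "event": "signed_up"}).encode("utf-8")
    channel.basic_publish(exchange="", routing_key="user_events", body=message)
    connection.close()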

The actor model is a programming model that uses asynchronous message passing to communicate between processes. It is based on the idea of actors, which are independent, concurrent entities that encapsulate their own state and communicate with each other only by sending and receiving messages. The actor model provides a scalable and resilient approach to building distributed systems, and is implemented in languages and frameworks such as Erlang, Elixir, and Akka (on the JVM).
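
A bare-bones sketch of the idea in Python (a toy, not a real actor framework): each actor owns its state and a mailbox, and processes one message at a time on its own thread, so senders never touch the actor’s state directly:

    import queue
    import threading

    class Actor:
        def __init__(self):
            self.mailbox = queue.Queue()
            threading.Thread(target=self._run, daemon=True).start()

        def send(self, message):
            self.mailbox.put(message)  # asynchronous: does not wait for the receiver

        def _run(self):
            while True:
                self.receive(self.mailbox.get())  # one message at a time

    class Counter(Actor):
        def __init__(self):
            self.count = 0
            super().__init__()

        def receive(self, message):
            if message == "increment":
                self.count += 1

    counter = Counter()
    for _ in range(3):
        counter.send("increment")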

The need for distributed systems

If a database is distributed across multiple machines, it means that the storage and retrieval of data is handled by more than one machine. This can provide several benefits, including increased scalability, improved fault tolerance and high availability, and reduced latency.

Scalability refers to the ability of a system to handle increasing workloads without reducing performance. By distributing a database across multiple machines, you can add more machines to the system as needed to handle the increased workload, which can help the system to scale more easily and efficiently.

Fault tolerance and high availability refer to the ability of a system to continue operating even if one or more of its components fail. By distributing a database across multiple machines, you can reduce the impact of a single machine failure on the overall system, which can improve its reliability and availability.

Latency refers to the time it takes for data to travel from one location to another. If a database is distributed across machines in multiple geographic locations, data can be served from a machine close to each user, which reduces latency and improves the perceived performance of the system.

Summary

We discussed the importance of data encoding in the architecture of applications. We explored how the choice of data encoding format can affect the efficiency and evolvability of an application, and we looked at several different encoding formats, including programming language-specific encodings, textual formats like JSON and XML, and binary schema-driven formats like Thrift and Protocol Buffers. We also discussed different modes of dataflow, including databases, RPC and REST APIs, and asynchronous message passing, and how these modes can influence the use of different encoding formats. Overall, we concluded that careful consideration of data encoding can greatly improve the evolvability and reliability of an application.
