Storage and Retrieval
It’s an oversimplification, but in essence a database has to do only two things: store the data when you give it some, and give the data back when you ask for it again.
So we could build the simplest database ever by using only two Bash functions:
```bash
#!/bin/bash

# Append the key-value pair as a new line at the end of the log file.
db_set () {
    echo "$1,$2" >> dummydatabase
}

# Grep all lines for the key, strip the key prefix, and keep only the
# most recent (i.e., last) occurrence.
db_get () {
    grep "^$1," dummydatabase | sed -e "s/^$1,//" | tail -n 1
}
```
Of course, real database systems have to deal with many more things, such as handling errors, reclaiming disk space, and controlling concurrent access.
But from a fundamental point of view, these two functions maintain a log: they append key-value pairs to it and retrieve the most recent value for a given key. In general, you should think of a log as an append-only sequence of records. The `db_set` function performs well, since appending to a file is cheap, but `db_get` is terrible at retrieving stored data: it scans every record in the log to find the last occurrence of the key, making it an O(n) operation, which is of course not good enough for real database systems. The next section goes into detail on how to use an index to speed up data retrieval. You can think of an index as an auxiliary structure, derived from the primary data, that speeds up read operations. It’s important to note that an index is a simple example of shifting the boundary between the read and write paths: we gain on read queries, but lose on writes, since the index must also be updated every time data is written.
A key-value store is similar to a dictionary, in that it uses a hash map to associate keys with values. Data is stored in a file, and an in-memory hash map keeps track of the byte offset at which each key’s value can be found in the file. When a new key-value pair is appended to the store, the hash map is updated to point at the new location of the data in the file. Storage engines like Bitcask (the default storage engine in Riak) take this approach; it is well suited to situations where the value for each key is updated frequently. To prevent the file from growing without bound, the log can be broken into segments when they reach a certain size, and compaction can be performed to throw away duplicate keys and keep only the most recent update for each key. Merging and compaction can be done in a background thread, and once the process is complete, the old segment files can be deleted. Each segment has its own in-memory hash map, so to find the value for a key, the most recent segment’s hash map is checked first, then the second-most-recent, and so on. The merging process keeps the number of segments small, so lookups stay efficient. A simple example of merging and compaction is shown in the image below.
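To make this concrete in code, here is a minimal Python sketch of the idea (not Bitcask’s actual format): an append-only data file plus an in-memory hash map from each key to the byte offset of its most recent value. The `LogKVStore` name and the comma-separated record layout are assumptions made for illustration, and the sketch assumes keys and values contain no commas or newlines.

```python
import os

class LogKVStore:
    """Append-only log file with an in-memory hash index (key -> byte offset)."""

    def __init__(self, path):
        # "a+" appends on write but still allows seeking back to read.
        self.f = open(path, "a+", encoding="utf-8")
        self.index = {}  # key -> offset of the latest record for that key

    def set(self, key, value):
        self.f.seek(0, os.SEEK_END)      # records are only ever appended
        offset = self.f.tell()
        self.f.write(f"{key},{value}\n")
        self.f.flush()
        self.index[key] = offset         # hash map now points at the new record

    def get(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None                  # key was never written
        self.f.seek(offset)              # one seek, one read
        _, _, value = self.f.readline().rstrip("\n").partition(",")
        return value

store = LogKVStore("dummydatabase")
store.set("42", "hello")
store.set("42", "world")
print(store.get("42"))                   # world: only the latest value is read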
When implementing such a key-value store, there are several important issues to consider. One is the file format: a binary format is typically simpler and faster than something like CSV. Another is how to delete records: this is done by appending a special “tombstone” record to the data file, which tells the merging process to discard all previous values for that key. Crash recovery also matters, since the in-memory hash maps are lost if the database is restarted; Bitcask speeds up recovery by storing a snapshot of each segment’s hash map on disk. Partially written records, which can occur if the database crashes halfway through an append, must be handled as well: Bitcask files include checksums, allowing corrupted parts of the log to be detected and ignored. Finally, concurrency control is necessary, as writes must be appended to the log in strictly sequential order. A common implementation is to use a single writer thread, while segments can be read concurrently by multiple threads because they are otherwise immutable.
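Sticking with the toy store above, deletion and compaction might look roughly like this. The `TOMBSTONE` marker and the `compact` helper are invented for the example; a real engine uses a dedicated record type and merges segments rather than rewriting one file.

```python
TOMBSTONE = "__tombstone__"  # made-up marker standing in for a special record type

def db_delete(store, key):
    # Deleting is just another append: the tombstone shadows older values.
    # A full implementation would also make get() treat it as "not found".
    store.set(key, TOMBSTONE)

def compact(in_path, out_path):
    """Rewrite a segment, keeping only the latest value for each key."""
    latest = {}
    with open(in_path, encoding="utf-8") as f:
        for line in f:
            key, _, value = line.rstrip("\n").partition(",")
            latest[key] = value          # later records overwrite earlier ones
    with open(out_path, "w", encoding="utf-8") as f:
        for key, value in latest.items():
            if value != TOMBSTONE:       # deleted keys are dropped entirely
                f.write(f"{key},{value}\n")
```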
Using an append-only design for a key-value store has several benefits:

- Appending and segment merging are sequential write operations, which are generally much faster than random writes, especially on magnetic disks.
- Concurrency and crash recovery are much simpler if segment files are append-only or immutable: you don’t have to worry about a crash leaving a value half old and half new.
- Merging old segments avoids the problem of data files becoming fragmented over time.
While a hash table works well for storing and accessing data in a key-value store, it has limitations. The main one is that the hash table must fit in memory, which becomes a problem when there is a very large number of keys, and it is difficult to make an on-disk hash map perform well. Additionally, range queries are not efficient: you cannot easily scan over all keys in a given range without looking up each key individually.
To address these limitations, we can introduce a new requirement for our segment files: the sequence of key-value pairs must be sorted by key. This type of segment file is called a Sorted String Table, or SSTable. We also require that each key appears only once within each merged segment file (the compaction process already ensures that). SSTables have several advantages over log segments with hash indexes:
- Merging segments is simple and efficient, even when the files are bigger than the available memory. The approach is like the one used in the mergesort algorithm: read the input files side by side and copy the lowest key to the output file.
- You no longer need to keep an index of all the keys in memory; a sparse index, with one key every few kilobytes of segment file, is enough. If you are looking for key b, and the index contains offsets for keys a and c such that b is between a and c, you can jump to the offset of a and scan from there until you either find b or hit c (a minimal sketch of this lookup appears after this list).
- Since read requests need to scan over several key-value pairs in the requested range anyway, records can be grouped into blocks and compressed before being written to disk, saving disk space and I/O bandwidth.
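Here is a minimal sketch of that sparse-index lookup, assuming the segment is a sorted file of `key,value` lines; an in-memory `StringIO` stands in for the file, and the every-other-record indexing interval is an arbitrary choice for the example.

```python
import bisect, io

# Build a tiny sorted "segment file" in memory and a sparse index over it,
# keeping one (key, offset) entry for every second record.
records = [("apple", "1"), ("banana", "2"), ("cherry", "3"), ("damson", "4")]
segment, sparse_keys, sparse_offsets = io.StringIO(), [], []
for n, (k, v) in enumerate(records):
    if n % 2 == 0:                      # index only every 2nd key (sparse)
        sparse_keys.append(k)
        sparse_offsets.append(segment.tell())
    segment.write(f"{k},{v}\n")

def sstable_lookup(key):
    i = bisect.bisect_right(sparse_keys, key) - 1
    if i < 0:
        return None                     # key sorts before the whole segment
    segment.seek(sparse_offsets[i])     # jump to the nearest indexed key...
    for line in segment:
        k, _, v = line.rstrip("\n").partition(",")
        if k == key:
            return v
        if k > key:                     # ...and stop once we have passed it
            return None
    return None

print(sstable_lookup("banana"))         # 2 (found by scanning from "apple")
```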
To keep the data sorted, we can use a balanced tree data structure, such as a red-black tree or AVL tree. With these structures, we can insert keys in any order and read them back in sorted order. When a write request arrives, the key and value are added to an in-memory balanced tree, called a memtable. When the memtable reaches a certain size (for example, a few megabytes), it is written out to disk as an SSTable file, and writes continue to a fresh memtable instance. On a read request, the key is first looked up in the memtable; if it is not found there, the most recent on-disk segment is searched, then the next-older segment, and so on. From time to time, merging and compaction run in the background to discard overwritten and deleted values, which keeps the number of on-disk segments small and ensures that read requests stay efficient.
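A toy version of this write and read path might look as follows. `MiniLSM` is a made-up name, plain Python dicts stand in for the balanced tree and the on-disk SSTable files, and the size limit is absurdly small just to force flushes.

```python
class MiniLSM:
    """Toy write/read path: memtable first, then segments newest to oldest."""

    MEMTABLE_LIMIT = 4                    # tiny on purpose, to force flushes

    def __init__(self):
        self.memtable = {}                # stand-in for a balanced tree
        self.segments = []                # flushed "SSTables", newest last

    def set(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.MEMTABLE_LIMIT:
            # Flush: write out in sorted order, then start a fresh memtable.
            self.segments.append(dict(sorted(self.memtable.items())))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:          # 1. check the memtable
            return self.memtable[key]
        for seg in reversed(self.segments):   # 2. then newest segment first
            if key in seg:
                return seg[key]
        return None

db = MiniLSM()
for i in range(10):
    db.set(f"key{i}", i)
db.set("key3", "updated")
print(db.get("key3"))   # "updated": found in the newest place it exists
```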
A key-value store based on the LSM-tree algorithm stores data in sorted files that are periodically merged and compacted. To avoid losing data in a crash, the database keeps a separate log on disk to which every write is immediately appended. This log is not in sorted order, but that doesn’t matter: its only purpose is to restore the memtable after a crash, and whenever the memtable is written out to an SSTable, the corresponding log can be discarded. Storage engines built on this principle are often called LSM storage engines; Lucene, an indexing engine for full-text search, uses a similar method for storing its term dictionary. One weakness is that the LSM-tree algorithm can be slow when looking up keys that do not exist in the database, since the memtable and every segment must be checked before concluding the key is absent. To speed this up, storage engines often use Bloom filters, a memory-efficient data structure for approximating the contents of a set. There are also different strategies for determining the order and timing of SSTable compaction and merging; two common ones are size-tiered and leveled compaction. LevelDB and RocksDB use leveled compaction, HBase uses size-tiered compaction, and Cassandra supports both. In size-tiered compaction, newer and smaller SSTables are successively merged into older and larger ones, while in leveled compaction, the key range is split up into smaller SSTables and older data is moved into separate “levels”, which reduces the amount of disk space used.
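To illustrate the idea, here is a minimal Bloom filter sketch. Production engines use tuned bit arrays and much cheaper hash functions, but the behaviour is the same: the filter can answer “definitely absent” or “possibly present”, and it never produces a false negative.

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter: may say 'maybe present', never misses a real member."""

    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes = size, hashes
        self.bits = [False] * size

    def _positions(self, key):
        # Derive several bit positions from one cryptographic hash per index.
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))

bloom = BloomFilter()
bloom.add("apple")
print(bloom.might_contain("apple"))   # True
print(bloom.might_contain("mango"))   # False (with high probability)
```

Before reading a segment from disk, the engine consults its filter; a negative answer is guaranteed correct, so the disk read can be skipped entirely.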
B-trees are the most widely used indexing structure for key-value data. Like SSTables, they keep key-value pairs sorted by key, which allows efficient key-value lookups and range queries. But unlike log-structured indexes, which break the database down into variable-size segments, B-trees break it down into fixed-size blocks or pages, traditionally 4 KB, and read or write one page at a time. One page is designated as the root, and every lookup starts there. Each page contains several keys and references to child pages. When updating the value for an existing key, the leaf page containing the key is located, the value is changed, and the page is written back to disk. To add a new key, the appropriate page is found and the key is added to it; if there is not enough free space in the page, it is split into two half-full pages, and the parent page is updated to reflect the new subdivision of key ranges. B-trees remain balanced, so a B-tree with n keys always has a depth of O(log n). The basic underlying write operation of a B-tree is to overwrite a page on disk with new data. However, some operations, such as page splits, require several pages to be overwritten, which can leave the index corrupted if the database crashes partway through. To prevent this, it is common to keep a write-ahead log (WAL), an append-only file to which every modification is written before it is applied to the pages themselves. Concurrency control is also important when multiple threads access the B-tree, and this is typically achieved with latches (lightweight locks) protecting the tree’s internal data structures.
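As a rough illustration of the lookup path, here is a simplified in-memory stand-in; a real B-tree reads fixed-size pages from disk and has a branching factor of several hundred, and the `Page` layout here is invented for the example.

```python
import bisect

class Page:
    """Simplified B-tree page: sorted keys plus either children or values."""

    def __init__(self, keys, children=None, values=None):
        self.keys = keys            # sorted keys within this page
        self.children = children    # child pages (interior page) ...
        self.values = values        # ... or values (leaf page)

def btree_lookup(page, key):
    """Walk from the root page down to a leaf, one page per level."""
    while page.children is not None:                 # interior page:
        i = bisect.bisect_right(page.keys, key)      # pick the child whose
        page = page.children[i]                      # key range contains `key`
    i = bisect.bisect_left(page.keys, key)           # leaf page: exact match?
    if i < len(page.keys) and page.keys[i] == key:
        return page.values[i]
    return None

root = Page(
    keys=["m"],                                      # keys < "m" go left
    children=[
        Page(keys=["a", "f"], values=[1, 2]),
        Page(keys=["m", "t"], values=[3, 4]),
    ],
)
print(btree_lookup(root, "t"))  # 4
```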
B-trees are a well-established indexing structure, and over the years, many optimizations have been developed to improve their performance. For example, instead of overwriting pages and using a write-ahead log (WAL) for crash recovery, some databases use a copy-on-write scheme. This involves writing modified pages to a different location and creating new versions of the parent pages that point to the new location. This approach is also useful for concurrency control. To save space in pages, keys can be abbreviated, especially in pages on the interior of the tree. Additionally, B-tree implementations often try to lay out the tree so that leaf pages are in sequential order on disk, which can improve performance for queries that need to scan over a large part of the key range. Other B-tree variants, such as fractal trees, borrow some log-structured ideas to reduce disk seeks.
As a rule of thumb, LSM-trees are typically faster for writes, while B-trees are thought to be faster for reads. Reads are slower on LSM-trees because they may have to check several different data structures and SSTables at different stages of compaction. LSM-trees have several advantages over B-trees, including higher write throughput, better compression, and smaller files on disk. They also have downsides: compaction can interfere with the performance of ongoing reads and writes, and the bigger the database gets, the more disk bandwidth compaction requires. B-trees, in contrast, offer stronger transactional semantics, since each key exists in exactly one place in the index, which makes it easy to attach locks to ranges of keys.
So far we have only discussed key-value indexes, which act like a primary key index. There are also secondary indexes, which are much like key-value indexes except that the indexed values are not unique. This can be handled either by making each value in the index a list of matching row identifiers, or by appending a row identifier to each entry to make it unique. Either way, secondary indexes are easily constructed from key-value indexes.
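As a toy illustration of the first approach, assuming some hypothetical rows keyed by a primary id:

```python
from collections import defaultdict

# Primary index: unique key -> row. Secondary index: column value -> row keys.
rows = {
    1: {"name": "Alice", "city": "Berlin"},
    2: {"name": "Bob",   "city": "Berlin"},
    3: {"name": "Cara",  "city": "Oslo"},
}

city_index = defaultdict(list)          # non-unique: one value, many rows
for row_id, row in rows.items():
    city_index[row["city"]].append(row_id)

print(city_index["Berlin"])             # [1, 2]: a list of matching row ids
```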
Disks are durable and have a lower cost per gigabyte than RAM, which is why databases have traditionally kept their data on disk. But as RAM has become cheaper, in-memory databases, which keep the entire database in memory, have become practical. They can improve performance by avoiding the overhead of encoding in-memory data structures into a form that can be written to disk. In-memory databases can be used for both durable and non-durable storage; when durability is required, the database may write a log of changes to disk or replicate its state to other machines for recovery. In-memory databases can also provide data models that are difficult to implement with disk-based indexes. Products such as VoltDB, MemSQL, and Oracle TimesTen are examples of in-memory databases.
The indexes discussed so far do not allow searching for similar keys, such as misspelled words. This kind of fuzzy querying requires different techniques, such as those used in full-text search engines, which can handle synonyms, grammatical variations, and occurrences of words near each other. Lucene uses an SSTable-like structure for its term dictionary, and its in-memory index is a finite state automaton over the characters in the keys, similar to a trie.
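The automaton itself is beyond this post, but a plain character trie shows the underlying idea of sharing prefixes between keys, which is what makes efficient fuzzy traversal possible. This sketch uses nested dicts and a made-up `"$"` end-of-word marker.

```python
def build_trie(words):
    """Character trie built from nested dicts; '$' marks the end of a word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})   # shared prefixes share nodes
        node["$"] = True
    return root

def trie_contains(root, word):
    node = root
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return "$" in node

trie = build_trie(["store", "storage", "storm"])
print(trie_contains(trie, "storm"))   # True
print(trie_contains(trie, "stored"))  # False
```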
A transaction is a group of related reads and writes that form a logical unit. Interactively looking up a small number of records by key and updating them is the access pattern known as online transaction processing (OLTP). Data analytics has a very different access pattern: scanning over a huge number of records while reading only a few columns per record, in order to calculate aggregate statistics. Such queries are often written by business analysts and fed into reports, and this pattern is known as online analytics processing (OLAP).
In a data warehouse, data is often organized into a star schema, in which the fact table sits in the center, surrounded by its dimension tables like the rays of a star. This schema is used because it allows great flexibility in analysis. The fact table contains individual events and can be enormous, while the dimension tables contain details about the events, such as the who, what, where, when, how, and why. Dimension tables can also be very wide, sometimes with over 100 columns. Data warehouses serve online analytics processing (OLAP) and are kept separate from online transaction processing (OLTP) databases; they are typically populated using an Extract-Transform-Load (ETL) process.
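As a tiny, hypothetical illustration (table names and columns invented for the example), a star schema in miniature might look like this:

```python
# Dimension tables: the who/what/where of each event.
dim_product = {1: {"name": "apple", "category": "fruit"}}
dim_store   = {7: {"city": "Berlin"}}

# Fact table: one row per event, mostly foreign keys into the dimensions.
fact_sales = [
    {"product_id": 1, "store_id": 7, "date": "2024-01-01", "qty": 3},
    {"product_id": 1, "store_id": 7, "date": "2024-01-02", "qty": 5},
]

# "How many items of fruit were sold?": join facts to a dimension, aggregate.
total = sum(
    row["qty"] for row in fact_sales
    if dim_product[row["product_id"]]["category"] == "fruit"
)
print(total)  # 8
```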
Data warehouses often use column-oriented storage engines to make large read-only queries fast. Instead of storing all the values from one row together, each column is stored in a separate file, so a query only has to read and parse the columns it actually uses, greatly reducing the amount of work. Column-oriented storage also lends itself well to compression, because the sequence of values in a single column is often repetitive; together with keeping the rows sorted, this makes read queries in data warehouses faster still. Writes are harder, though: an update-in-place approach is not possible with compressed, sorted columns, so writes are typically handled with an LSM-style approach, buffering new rows in memory and periodically merging them into the column files in bulk. A further read optimization is the materialized view, a pre-computed, denormalized copy of the results of a query. Materialized views can be updated automatically when the underlying data changes, which makes common aggregate queries cheap at the cost of more expensive writes.
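Here is a minimal sketch of the row-versus-column contrast, with a toy run-length encoder standing in for the more sophisticated compression schemes (such as bitmap encoding) used in real column stores.

```python
from itertools import groupby

# Row-oriented: all the values of each record are stored together.
rows = [
    {"date": "2024-01-01", "product": "apple", "qty": 3},
    {"date": "2024-01-01", "product": "apple", "qty": 5},
    {"date": "2024-01-02", "product": "pear",  "qty": 2},
]

# Column-oriented: one sequence per column; a query that only touches `qty`
# never has to read the other columns at all.
columns = {name: [row[name] for row in rows] for name in rows[0]}

def run_length_encode(values):
    """Compress repetitive column values as (value, run_length) pairs."""
    return [(v, len(list(g))) for v, g in groupby(values)]

print(sum(columns["qty"]))                     # aggregate over one column: 10
print(run_length_encode(columns["product"]))   # [('apple', 2), ('pear', 1)]
```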
In summary, this post discussed the differences between storage engines optimized for online transaction processing (OLTP) and those optimized for online analytics processing (OLAP). It covered the main indexing structures used by storage engines, including hash indexes, LSM-trees, B-trees, and column-oriented storage, and briefly touched on in-memory databases and data warehouses. Overall, the chapter aims to provide a high-level understanding of how databases handle storage and retrieval, and how different storage engines are optimized for different workloads.
Did you find it useful? Let me know in the comments. Keep on crushing it!