King Of Interviews

Data Models and Query Language

For the most application it is true that they are built by layering data models on top of each other. These abstractions increase team’s productivity by hiding the complexity of the layer below.

Relational model Versus Document model

There are three foundations for all relational databases:

Business data processing
Transaction processing
Batch processing.

NoSql (Not only SQL) on the other hand focuses on:

Scalability
Free and open source software
Query optimizations
Dynamic and expressive data model

Most application development today is done in object-oriented programming languages, which leads to the following scenario: if data is stored in relational tables, there has to exist an additional layer converting application code to relational database model, row, columns and vice-versa. Discrepancy between these two model sets is called an impedance mismatch. On the other hand one of the advantages for JSON (document) models is reduction in that mismatch and a also locality of data (everything we need is in one place). Relational databases use IDs (foreign keys) to point from one table to the other making the joins seem natural and easy while in the document models we don’t really have a great support for joins (since most of the time they are not needed because of data locality). For document models it’s however possible to simulate joins in application code by doing multiple queries (of course if you find yourself doing it over and over again, the relational database might have been a better choice).

Network Model

Standardized by Conference on Data Systems Languages (CODASYL) the network model presented generalization of hierarchical (imagine tree structure model in which every record has only one parent) with the main difference being multi-parent relation (one record could have multiple parents). The links between records behaved like pointers in popular programming languages and it was only possible to get a record by following path from root record called access path. Query in CODASYL was done my moving cursor while iterating over list of records. If you found yourself in a situation where you didn’t have path to the data, you would be up for a crazy ride since the changes to application code were difficult to do.

The Relational Model

The SQL alleviate those problems by grouping data into tables which are simply tuple (rows) collections with a smart query optimiser which makes decision on what part of the query to execute in what order. If we were to compare the pros and cons between document and relation models we would point out that document models offer schema flexibility, locality, and close(r) data structure to application code, while relation model has better support for joins, many-to-(one or many) relationships. The choice on which one to use is made by developers by analyzing their business needs/goals, so for example in case of data that resembles document like structure we would go with the document model, on the contrary in the scenario where we would need joins a relational model would be a preference.

Document Models Schema flexibility

Document model database are famous for their schema flexibility in other words the possibility of adding any key/value pair during any moment during the write to database, which also leads to clients having no guarantees if certain key/value is presented in the database. Document model databases are often called schemaless, but a better term would be schema-on-read (in contrast to schema-on-write) since there is a implicit “type structure” determined by the client. Often a parallel is draw between schema-on-ready/schema-on-write and dynamic (runtime)/static (compile-time) type checking. Schema-on-read approach comes in handy when we have heterogeneous (different structure) records in the same collection.

As of today it seems like document models are taking lessons learned from the relational models and vice versa by using the complementary approaches and they are starting to looking similar. So the future might be filled with hybrid based models taking the best from both of the data models.

Query languages for data

First we have to know the difference between declarative and imperative languages/approaches. In declarative query language we specify the data we want, but not how to achieve the goal (it’s up to database system to decide how to best optimize the query and increase performance), while in the imperative we create the list and exact order of command execution. There are many advantages using declarative way, one of them being parallelization over multiple cores where imperative approach mostly fails because of the total order of command execution. A good example to show the difference between declarative and imperative approach is actually the DOM manipulation/CSS styling on the web:

<ul>
  <li class="selected">
    <p>Fruits</p>
    <ul>
      <li>Apple</li>
      <li>Ananas</li>
      <li>Orange</li>
    </ul>
  </li>
  <li>
    <p>Vegetables</p>
    <ul>
      <li>Potato</li>
      <li>Carrot</li>
      <li>Tomato</li>
    </ul>
  </li>
</ul>

If we want to make the title of selected group red we would do it with css:

li.selected > p {
  background-color: red;
}

This is the declerative approach since we’re only declaring we want to do (the title of selected group to be certain color) without specifying the underlying algorithm on how to achieve that, which is exactly shown in the imperative approach using DOM manipulation:

const elements = document.getElementsByTagName("li");
for (let i = 0; i < elements.length; ++i) {
  if (elements[i].className === "selected") {
    const children = elements[i].childNodes;
    for (let j = 0; j < children.length; ++j) {
      const child = children[j];
      if (child.nodeType === Node.ELEMENT_NODE && child.tagName === "P") {
        child.setAttribute("style", "background-color: red");
      }
    }
  }
}

I don’t know about you but I prefer the declarative approach (centering div is hard using either approach).

MapReduce querying

Another good example where declarative approach comes in handy is the MapReduce programming model made for batch processing of large amount of data.

db.observations.mapReduce(
  function map() {
    var year = this.observationTimestamp.getFullYear();
    var month = this.observationTimestamp.getMonth() + 1;
    emit(year + "-" + month, this.numFruits);
  },
  function reduce(key, values) {
    return Array.sum(values);
  },
  {
    query: { family: "Fruits" },
    out: "monthlyFruitReport",
  }
);

Let’s break the snippet above:

query is used as filter to enable us to only search for fruits category
the result will be written to collection monthlyFruitReport
Map and Reduce function have to be pure function (pure in functional programming language context, taking input and giving output without any side effects)
Map emits key/value pairs, in this case year-month and the number of fruits
Reduce aggregates (in this case sums) the values over same keys

Graph-like data models

When your application starts to use many many-to-many (how many manies?), the natural choice for data model is graph. A graph consists of two kinds of objects: vertices (also known as nodes or entities) and edges (also known as relationships or arcs). Let’s give some example on what kind of application can be modeled by graphs:

Social graphs Vertices are people, and edges indicate which people know each other.
The web graph Vertices are web pages, and edges indicate HTML links to other pages.
Road or rail networks Vertices are junctions, and edges represent the roads or railway lines between them.

There are several ways of structuring and querying the data:

The property graph model (Neo4j, Titan, and Infinite Graph)
Triple-store model (Datomic, AllegroGraph)

Property graphs

Each vertex consists of:

Unique identifier
Outgoing edges
Incoming edges
Collection of properties (key-value pairs)

Each edge consists of:

Unique identifier
Vertex at which the edge starts (tail vertex)
Vertex at which the edge ends (head vertex)
Label to describe the kind of relationship between the two vertices
A collection of properties (key-value pairs)

Data Models and Query Language