Data-intensive system design: data models and query languages (relational, document and graph storage data models)

A complex application may have more intermediate layers, such as APIs built upon other APIs, but the basic idea remains the same: each layer hides the complexity of the layers below it by providing a clean data model. These abstractions allow different groups of people to work together effectively.

There are many kinds of data models, each embodying assumptions about how it will be used. Some kinds of usage are easy and some are not; some operations are fast and some perform poorly; some data transformations feel natural and some are cumbersome.

Mastering even one data model takes a lot of effort; building software is hard enough with just one, without also worrying about its inner workings. But since the data model has such a profound effect on what the software above it can and cannot do, it is important to choose one that is appropriate to the application.

1. Relational model and document model

The relational model began as a theoretical proposal, and many people at the time doubted that it could be implemented efficiently. Yet by the mid-1980s, relational database management systems (RDBMSes) and SQL had become the tools of choice for most people who needed to store and query data with a fairly regular structure. Relational databases have continued to dominate for roughly 25-30 years, an extremely long time in the history of computing.

In the 2010s, NoSQL began the latest round of attempts to overthrow the dominance of the relational model. The name "NoSQL" is unfortunate, since it doesn't actually refer to any particular technology. It was originally used as a catchy Twitter hashtag for a 2009 open source meetup on distributed, non-relational databases. Nevertheless, the term struck a nerve and quickly spread through and beyond the web startup community. A number of interesting database systems are now associated with the #NoSQL hashtag, and NoSQL has been retroactively reinterpreted as Not Only SQL.

There are several drivers behind the adoption of NoSQL databases, including:

  • A need for greater scalability than relational databases can easily achieve, including very large datasets or very high write throughput
  • A widespread preference for free and open source software over commercial database products
  • Specialized query operations that are not well supported by the relational model
  • Frustration with the restrictiveness of relational schemas, and a desire for a more dynamic and expressive data model

Different applications have different requirements, and the best technology for one use case may not be the best for another. It therefore seems likely that relational databases will continue to be used alongside a broad variety of non-relational datastores, an idea sometimes called polyglot persistence.

1.1. How objects are stored

Most application development today is done in object-oriented programming languages, which leads to a common criticism of the SQL data model: if data is stored in relational tables, an awkward translation layer is required between the objects in the application code and the database model of tables, rows, and columns.

Object-relational mapping (ORM) frameworks like ActiveRecord and Hibernate reduce the amount of boilerplate code required for this translation layer, but they cannot completely hide the differences between the two models.
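To make the mismatch concrete, here is a minimal sketch in plain JavaScript of what such a translation layer does: reassembling one user's rows from hypothetical users and positions tables into the nested object the application code works with. The row shapes and the toProfile helper are illustrative, not any ORM's actual API.

```javascript
// Hypothetical rows as they might come back from the users and
// positions tables of a relational store.
const userRow = { user_id: 251, first_name: "Bill", last_name: "Gates" };
const positionRows = [
  { user_id: 251, job_title: "Co-chair", organization: "Bill & Melinda Gates Foundation" },
  { user_id: 251, job_title: "Co-founder, Chairman", organization: "Microsoft" }
];

// The "translation layer": fold the flat rows back into the nested
// object shape that the application code actually works with.
function toProfile(userRow, positionRows) {
  return {
    user_id: userRow.user_id,
    first_name: userRow.first_name,
    last_name: userRow.last_name,
    positions: positionRows
      .filter(function (row) { return row.user_id === userRow.user_id; })
      .map(function (row) {
        return { job_title: row.job_title, organization: row.organization };
      })
  };
}

console.log(toProfile(userRow, positionRows).positions.length); // 2
```

An ORM generates roughly this kind of code for you, but the two representations still have to be kept in sync by somebody.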
[Figure: a resume (LinkedIn profile) represented in a relational schema]

The diagram above shows how a resume (a LinkedIn profile) could be expressed in a relational schema. The profile as a whole is identified by a unique identifier, user_id. Fields like first_name and last_name appear exactly once per user, so they can be modeled as columns on the User table. However, most people have had more than one job in their career, varying amounts of education, and any number of pieces of contact information. There is a one-to-many relationship from the user to these items, which can be represented in several ways:

  • In the traditional SQL model, the most common normalized representation is to put positions, education, and contact information in separate tables, with foreign key references to the User table.
  • Later versions of the SQL standard added support for structured data types and XML data; this allows multi-valued data to be stored within a single row, with support for querying and indexing inside those documents. These features are supported to varying degrees by Oracle, IBM DB2, MS SQL Server, and PostgreSQL. A JSON data type is also supported by several databases, including IBM DB2, MySQL, and PostgreSQL.
  • A third option is to encode positions, education, and contact information as a JSON or XML document, store it in a text column in the database, and let the application interpret its structure and content. In this setup, you typically cannot use the database to query for values inside the encoded column.

For a self-contained data structure like a resume, a JSON representation is quite appropriate, and document-oriented databases such as MongoDB support this data model well:

{
  "user_id": 251,
  "first_name": "Bill",
  "last_name": "Gates",
  "summary": "Co-chair of the Bill & Melinda Gates... Active blogger.",
  "region_id": "us:91",
  "industry_id": 131,
  "photo_url": "/p/7/000/253/05b/308dd6e.jpg",
  "positions": [
    {
      "job_title": "Co-chair",
      "organization": "Bill & Melinda Gates Foundation"
    },
    {
      "job_title": "Co-founder, Chairman",
      "organization": "Microsoft"
    }
  ],
  "education": [
    {
      "school_name": "Harvard University",
      "start": 1973,
      "end": 1975
    },
    {
      "school_name": "Lakeside School, Seattle",
      "start": null,
      "end": null
    }
  ],
  "contact_info": {
    "blog": "http://thegatesnotes.com",
    "twitter": "http://twitter.com/BillGates"
  }
}

From the user profile to positions, education history, and contact information, these one-to-many relationships imply a tree structure in the data, and the JSON representation makes this tree structure explicit:
[Figure: the one-to-many relationships of the profile form a tree]

1.2. Schema-on-read and schema-on-write

The essence of a document database is schema-on-read: the structure of the data is implicit, and is only interpreted when the data is read. The corresponding approach in relational databases is schema-on-write: the schema is explicit, and the database ensures that all data conforms to it at the time of writing.
Schema-on-read is analogous to dynamic (runtime) type checking in programming languages, whereas schema-on-write is analogous to static (compile-time) type checking.
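A minimal schema-on-read sketch in plain JavaScript (the documents and the firstName helper are hypothetical): documents written at different times have different shapes, and the reading code, not the database, interprets the structure.

```javascript
// Two documents written at different times: the older one stores the
// full name in a single field, the newer one splits it in two.
const docs = [
  { user_id: 1, name: "Ada Lovelace" },
  { user_id: 2, first_name: "Grace", last_name: "Hopper" }
];

// With schema-on-read, the application must handle both shapes,
// because the database did not enforce one at write time.
function firstName(doc) {
  if (doc.first_name !== undefined) {
    return doc.first_name;
  }
  return doc.name.split(" ")[0]; // fall back to the old single-field shape
}

console.log(docs.map(firstName)); // [ 'Ada', 'Grace' ]
```

Under schema-on-write, the same change would instead be handled once, by a schema migration that rewrites all existing rows.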

1.3. Data locality for document queries

A document is usually stored as a single continuous string, encoded as JSON, XML, or a binary variant thereof (such as MongoDB's BSON). If the application often needs to access the entire document (for example, to render it on a web page), this storage locality gives a performance advantage, since no table joins are needed.
If the data were instead split across multiple tables, multiple index lookups would be required to retrieve it all, which may require more disk seeks and take more time.

1.4. Convergence of document and relational databases

Most relational database systems (other than MySQL) have long supported XML. This includes the ability to make local modifications to XML documents and to index and query inside them, which allows applications to use data models very similar to those of a document database.
PostgreSQL since version 9.3, MySQL since version 5.7, and IBM DB2 since version 10.5 provide a similar level of support for JSON documents. Given the popularity of JSON in web APIs, it is likely that other relational databases will follow suit and add JSON support.
Relational and document databases seem to be becoming more similar over time, and that is a good thing: the data models complement each other. If a database can handle document-like data and also perform relational queries on it, applications can use the combination of features that best fits their needs.

2. Data query language

2.1. Imperative and Declarative Query Languages

When the relational model was introduced, it included a new way of querying data: SQL is a declarative query language, whereas IMS and CODASYL queried the database with imperative code.

  • Many commonly used programming languages are imperative. For example, given a list of animal species, returning only the sharks in the list could be written as:
function getSharks(animals) {
    var sharks = [];
    for (var i = 0; i < animals.length; i++) {
        if (animals[i].family === "Sharks") {
            sharks.push(animals[i]);
        }
    }
    return sharks;
}
  • SQL is a declarative query language, hiding the details of the query engine:
    SELECT * FROM animals WHERE family = 'Sharks';

An imperative language tells the computer to perform certain operations in a certain order. Imagine stepping through the code line by line, evaluating conditions, updating variables, and deciding whether to go around the loop one more time.
In a declarative query language like SQL or relational algebra, you just specify the pattern of the data you want: what conditions the results must meet, and how you want the data to be transformed (for example sorted, grouped, and aggregated), but not how to achieve that goal. It is up to the database system's query optimizer to decide which indexes and which join methods to use, and in which order to execute the various parts of the query.

2.2. MapReduce query


MapReduce is a programming model popularized by Google for batch processing of large-scale data sets on multiple machines. Some NoSQL data stores (including MongoDB and CouchDB) support a limited form of MapReduce as a mechanism for performing read-only queries across multiple documents.
MapReduce is neither a declarative query language nor a fully imperative query API, but somewhere in between: the query logic is represented by code snippets, which are repeatedly called by the processing framework. It is based on the map (also known as collect) and reduce (also known as fold or inject) functions, two functions present in many functional programming languages.
Say you're a marine biologist, and every time you see an animal in the ocean, you add an observation to your database. Now you want to generate a report of how many sharks you saw each month.
This query first filters the observations to show only species of the shark family, then groups the observations by the calendar month in which they occur, and finally sums the number of animals seen across all observations for that month.
The same query expressed using MongoDB's MapReduce feature:

db.observations.mapReduce(
    function map() {
        var year = this.observationTimestamp.getFullYear();
        var month = this.observationTimestamp.getMonth() + 1;
        emit(year + "-" + month, this.numAnimals);
    },
    function reduce(key, values) {
        return Array.sum(values);
    },
    {
        query: { family: "Sharks" },
        out: "monthlySharkReport"
    }
);
  • The filter to consider only shark species can be specified declaratively (this is a MongoDB-specific extension to MapReduce).
  • The JavaScript function map is called once for each document that matches the query, setting this to the document object.
  • The map function emits a key (a string including the year and month, such as "2013-12" or "2014-1") and a value (the number of animals in that observation).
  • The key-value pairs emitted by the map are grouped by key. The reduce function is called once for all key-value pairs with the same key (i.e., the same month and year).
  • The reduce function sums the number of animals in all observation records for a particular month.
  • Write the final output to the monthlySharkReport collection.
Suppose the observations collection contains these two documents:
{
  observationTimestamp: Date.parse("Mon, 25 Dec 1995 12:34:56 GMT"),
  family: "Sharks",
  species: "Carcharodon carcharias",
  numAnimals: 3
}
{
  observationTimestamp: Date.parse("Tue, 12 Dec 1995 16:17:18 GMT"),
  family: "Sharks",
  species: "Carcharias taurus",
  numAnimals: 4
}
  • The map function is called once for each document, resulting in emit("1995-12", 3) and emit("1995-12", 4).
  • The reduce function is then called as reduce("1995-12", [3, 4]) and returns 7.
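The grouping-and-reducing mechanics described above can be sketched in plain JavaScript. This is a toy model of the idea, not MongoDB's actual engine; the mapReduce helper below is illustrative, and unlike MongoDB it passes emit as an argument rather than making it a global.

```javascript
// A toy MapReduce runner: call map once per document (with `this`
// bound to the document), group the emitted key-value pairs by key,
// then call reduce once per distinct key.
function mapReduce(docs, map, reduce) {
  const groups = new Map();
  const emit = function (key, value) {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  };
  docs.forEach(function (doc) { map.call(doc, emit); });
  const out = {};
  groups.forEach(function (values, key) { out[key] = reduce(key, values); });
  return out;
}

const observations = [
  { observationTimestamp: new Date("1995-12-25T12:34:56Z"), family: "Sharks", numAnimals: 3 },
  { observationTimestamp: new Date("1995-12-12T16:17:18Z"), family: "Sharks", numAnimals: 4 }
];

const report = mapReduce(
  observations,
  function (emit) {
    const year = this.observationTimestamp.getUTCFullYear();
    const month = this.observationTimestamp.getUTCMonth() + 1;
    emit(year + "-" + month, this.numAnimals);
  },
  function (key, values) {
    return values.reduce(function (a, b) { return a + b; }, 0);
  }
);

console.log(report); // { '1995-12': 7 }
```

The restriction that map and reduce must be pure functions is what lets a real framework rerun them on any machine, in any order, whenever needed.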

2.3. Graph data query


If most of the relationships in your application are one-to-many (tree-structured data), or if most records have no relationships at all, the document model is a good fit.
But what if many-to-many relationships are common in your data? The relational model can handle simple cases of many-to-many relationships, but as the connections within your data become more complex, it becomes more natural to model your data as a graph.
A graph consists of two kinds of objects: vertices (also known as nodes or entities) and edges (also known as relationships). Many kinds of data can be modeled as a graph. Typical examples include:

  • Social graphs: vertices are people, and edges indicate the relationships between them (friends, relatives, managers and subordinates, and so on)
  • The web graph: vertices are web pages, and edges indicate links from one page to another
  • Road or rail networks: vertices are junctions, and edges represent the roads or railway lines between them

2.3.1. Property graphs

In the property graph model, each vertex consists of:

  • A unique identifier
  • A set of outgoing edges
  • A set of incoming edges
  • A collection of properties (key-value pairs)

Each edge consists of:

  • A unique identifier
  • The vertex at which the edge starts (the tail vertex)
  • The vertex at which the edge ends (the head vertex)
  • A label describing the kind of relationship between the two vertices
  • A collection of properties (key-value pairs)

A graph store can be thought of as consisting of two relational tables: one storing vertices and the other storing edges. To fetch the incoming or outgoing edges of a set of vertices, you query the edges table by head_vertex or tail_vertex respectively:
-- The vertices table records vertex information
CREATE TABLE vertices (
  vertex_id  INTEGER PRIMARY KEY,
  properties JSON
);

-- The edges table records each edge and the vertices it connects
CREATE TABLE edges (
  edge_id     INTEGER PRIMARY KEY,
  tail_vertex INTEGER REFERENCES vertices (vertex_id),
  head_vertex INTEGER REFERENCES vertices (vertex_id),
  label       TEXT,
  properties  JSON
);

2.3.2. Cypher query language

Cypher is a declarative query language for property graphs, invented for the Neo4j graph database.
1. Declaring vertices and edges:

  • Vertex declaration: (vertexName:vertexType {key: value, ...})
  • Edge declaration: (vertexName) -[:edgeLabel]-> (vertexName) ...

For example, Xiao Ming (小明) was born in Guangzhou (广州), Guangzhou is located in Guangdong Province (广东省), and Guangdong Province is located in China (中国). This can be abstracted as:

  • Two vertex types: person (Xiao Ming) and location (Guangzhou, Guangdong Province, China)
  • Four vertices: Xiao Ming, Guangzhou, Guangdong Province, China
  • Three edges: Xiao Ming born in Guangzhou, Guangzhou located in Guangdong Province, Guangdong Province located in China
CREATE
  (xiaoMing:person {name: '小明'}),
  (guangZhou:location {name: '广州'}),
  (guangDong:location {name: '广东省'}),
  (china:location {name: '中国'}),
  (xiaoMing) -[:bornIn]-> (guangZhou),
  (guangZhou) -[:in]-> (guangDong),
  (guangDong) -[:in]-> (china)

2. Querying for all of Xiao Ming's friends located in Guangdong Province:

MATCH
  (person) -[:friend]-> (:person {name: '小明'}),                    // first, find 小明's friend vertices
  (person) -[:bornIn]-> () -[:in*0..]-> (:location {name: '广东省'}) // then, keep those located in 广东省
RETURN person.name
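The variable-length *0.. pattern can be modeled over the edges table shown earlier. Here is a sketch in plain JavaScript with an illustrative in-memory edge list (not Neo4j's API), following zero or more in-edges from a starting vertex:

```javascript
// Edges of the example graph, mirroring rows of the edges table:
// each entry records a tail vertex, a head vertex, and a label.
const edges = [
  { tail: "小明", head: "广州", label: "bornIn" },
  { tail: "广州", head: "广东省", label: "in" },
  { tail: "广东省", head: "中国", label: "in" }
];

// Can `start` reach `target` by following zero or more "in" edges?
// This is what the -[:in*0..]-> pattern expresses.
function within(start, target) {
  if (start === target) return true; // zero hops
  return edges
    .filter(function (e) { return e.tail === start && e.label === "in"; })
    .some(function (e) { return within(e.head, target); });
}

console.log(within("广州", "中国"));   // true: 广州 -> 广东省 -> 中国
console.log(within("广东省", "广州")); // false: edges only point outward
```

A graph database evaluates such recursive traversals natively, whereas expressing them in SQL requires recursive common table expressions.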

3. Summary

Historically, data started out being represented as one big tree (the hierarchical data model), but that was not good for representing many-to-many relationships, so the relational model was invented to solve that problem. More recently, developers found that some applications don't fit well into the relational model either. The new non-relational "NoSQL" datastores have diverged in two main directions:

  1. Document databases target use cases where data comes in self-contained documents and relationships between documents are rare.
  2. Graph databases go in the opposite direction, targeting use cases where anything is potentially related to everything.

All three models (document, relational, and graph) are widely used today, and each works well in its respective domain. One model can be emulated in terms of another (for example, graph data can be represented in a relational database), but the result is often awkward. That's why we have different systems for different purposes, not a single one-size-fits-all solution.

One thing document and graph databases have in common is that they typically don't enforce a schema for the data they store, which can make it easier to adapt applications to changing requirements. However, the application most likely still assumes that the data has a certain structure; it's just a question of whether the schema is explicit (enforced on write) or implicit (handled on read).

Origin blog.csdn.net/pbrlovejava/article/details/124918535