Getting Started with NoSQL Databases

1. Overview of NoSQL databases

        NoSQL is a general term for non-relational databases, a design approach to database management that departs from the relational model. Instead of the relational model used by traditional relational databases, NoSQL databases adopt non-relational data models such as key/value, column family, and document models. NoSQL databases have no fixed table structure, usually provide no join operations, and do not strictly adhere to the ACID constraints. As a result, compared with relational databases, NoSQL databases offer flexible horizontal scalability and can support massive data storage. In addition, NoSQL databases support MapReduce-style programming and are well suited to the diverse data management needs of the big data era. The emergence of NoSQL databases, on the one hand, makes up for the shortcomings of relational databases in current commercial applications and, on the other hand, shakes the traditional dominance of relational databases.

2. NoSQL database characteristics

        When an application requires a simple data model, a flexible IT system, high database performance, and only low database consistency, a NoSQL database is a good choice. In general, NoSQL databases have the following three characteristics.

1. Flexible scalability

        Due to their design, traditional relational databases are usually difficult to scale horizontally. When database load grows dramatically, hardware upgrades are often required to achieve vertical scaling. However, computer hardware manufacturing is approaching its physical limits, and the pace of performance improvement has begun to slow, falling far behind the growth in database system load. Moreover, high-end, high-performance servers are expensive, so relying on vertical scaling to meet actual business needs has become increasingly unrealistic. In contrast, horizontal scaling only requires common, inexpensive, standardized blade servers, which are not only cost-effective but also provide theoretically almost unlimited room for expansion. NoSQL databases were designed from the start to meet the need for horizontal scaling, so they are inherently capable of scaling out well.

2. Flexible data model

        The relational model is the cornerstone of the relational database. It is grounded in complete relational algebra theory, has standardized definitions, and abides by strict constraints. Although this approach ensures the data consistency that business systems require, the overly rigid data model also means that it cannot meet various emerging business needs. In contrast, NoSQL databases were designed precisely to escape the constraints of relational databases: they abandon the relational data model that has been dominant for many years and instead adopt non-relational models such as key/value and column family, allowing different types of data to be stored in the same data element.

3. Close integration with cloud computing

        Cloud computing has good horizontal scaling capabilities: resources can be scaled freely according to usage, and various resources can be dynamically added or removed. Thanks to their own strong horizontal scaling capabilities, NoSQL databases can fully exploit cloud computing infrastructure, integrate well into the cloud environment, and serve as the basis for NoSQL-based cloud database services.

3. Comparison between NoSQL and relational databases

        The following table gives a simple comparison between NoSQL and relational databases (Relational DataBase Management System, RDBMS). The comparison covers database principles, data scale, database schema, query efficiency, consistency, data integrity, scalability, availability, standardization, technical support, and maintainability. As the table shows, the outstanding advantages of relational databases are that they are based on complete relational algebra theory, have strict standards, support the four ACID transaction properties, achieve efficient queries with the help of indexing mechanisms, and are mature technologies backed by professional companies providing technical support. Their disadvantages are poor scalability, inability to properly support massive data storage, a data model too rigid to properly support Web 2.0 applications, and a transaction mechanism that affects overall system performance. The obvious advantages of NoSQL databases are that they can support ultra-large-scale data storage, their flexible data model supports Web 2.0 applications well, and they have strong horizontal scaling capabilities. Their disadvantages are that they lack a mathematical theoretical foundation, complex query performance is low, strong transactional consistency generally cannot be achieved, data integrity is hard to guarantee, the technology is not yet mature, professional technical support is lacking, and maintenance is difficult.

        From this series of comparisons between NoSQL databases and relational databases, we can see that each has its own advantages and disadvantages at different levels. Therefore, in practical applications, each has its own target user groups and market space, and there is no question of one completely replacing the other. In some specific application fields, the status and role of relational databases cannot be replaced: business systems in banking, retail, and similar domains still need to rely heavily on relational databases to ensure data consistency. In addition, for some complex query and analysis applications, data warehouse products based on relational databases can still achieve better performance than NoSQL databases.

        In practical applications, some companies also use a hybrid approach to build database applications. For example, Amazon uses different types of databases to support its e-commerce applications: for temporary data such as the shopping basket, key-value storage is more efficient; current product and order information is suitable for storage in a relational database; and large volumes of historical order information are suitable for storage in a document database such as MongoDB.

4. Four major types of NoSQL

        In recent years, NoSQL databases have developed rapidly. In just four or five years, the NoSQL field exploded with 50 to 150 new databases (http://nosql-database.org/). According to an online survey, the top ten developer skills most in demand in the industry are HTML5, MongoDB, iOS, Android, Mobile Apps, Puppet, Hadoop, jQuery, PaaS, and Social Media. The fact that MongoDB, a NoSQL document database, ranks even ahead of iOS shows just how popular NoSQL has become. Interested readers can refer to the book "Seven Databases in Seven Weeks" to learn how to use NoSQL databases such as Riak, Apache HBase, MongoDB, Apache CouchDB, Neo4j, and Redis.

        Although there are many NoSQL databases, they typically boil down to four types: key-value databases, column family databases, document databases, and graph databases, as shown in the figure.

4.1. Key-value database

        A key-value database uses a hash table that maps a specific key to a pointer to a specific value. The key is used to locate the value, that is, to store and retrieve it. The value is opaque to the database: the database does not interpret its contents, so values cannot be indexed or queried directly and can only be accessed through their keys. The value can store any type of data, including integers, strings, arrays, and objects. Key-value databases can achieve significantly better performance than relational databases under write-heavy workloads, because relational databases need to maintain indexes to speed up queries; when there are many write operations, the indexes are updated frequently, resulting in high index maintenance costs. Relational databases are usually difficult to scale horizontally, whereas key-value databases are inherently scalable and can, in theory, expand their data volume almost without limit. Key-value databases can be further divided into in-memory key-value databases and persistent key-value databases. In-memory key-value databases, such as Memcached and Redis, keep data in memory; persistent key-value databases, such as BerkeleyDB, Voldemort, and Riak, store data on disk.
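
        The following minimal sketch illustrates this key-value access pattern using the redis-py client. It assumes a Redis server running locally; the host, port, and key names are illustrative assumptions, not prescriptions.

```python
import redis  # assumes the redis-py package and a local Redis server

# Connect to a (hypothetical) local Redis instance.
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Store and retrieve values purely by key; Redis does not interpret the value.
r.set("user:1001:name", "Alice")
r.set("user:1001:cart", '{"items": ["book", "pen"]}')  # value is an opaque string

print(r.get("user:1001:name"))   # -> "Alice"
print(r.get("user:1001:cart"))   # -> the raw JSON string; any filtering happens in the application
```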

        Of course, key-value databases also have their own limitations, and conditional (value-based) queries are their weakness: if only some of the values need to be queried or updated, a key-value database is inefficient. When using a key-value database, multi-table join queries should be avoided as far as possible; relationships can instead be stored redundantly in both directions, decomposing join operations into single-key operations. In addition, key-value databases cannot roll back operations in the event of a failure, and therefore cannot support transactions. The related products, data models, typical applications, advantages and disadvantages, and users of key-value databases are shown in the following table:
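
        As a sketch of the two-way redundant storage mentioned above (a plain in-memory dict stands in for a key-value store, and all key names are hypothetical), the relationship between users and orders is written under both a user-keyed entry and an order-keyed entry, so either direction can be answered with a single key lookup instead of a join:

```python
# A plain dict stands in for a key-value store in this sketch.
kv = {}

def add_order(user_id: str, order_id: str) -> None:
    # Forward mapping: user -> list of orders.
    kv.setdefault(f"user:{user_id}:orders", []).append(order_id)
    # Redundant reverse mapping: order -> user, so no join is ever needed.
    kv[f"order:{order_id}:user"] = user_id

add_order("u1", "o100")
add_order("u1", "o101")

print(kv["user:u1:orders"])   # ['o100', 'o101'] -- one lookup, no join
print(kv["order:o100:user"])  # 'u1'             -- one lookup in the other direction
```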

4.2. Column family database

        Column family databases generally use the column family data model. A database consists of multiple rows, each row contains multiple column families, and different rows can have different sets of column families. Data belonging to the same column family is stored together. Each row is located by a row key; from this perspective, a column family database can also be regarded as a kind of key-value database. Column families can be configured to support different access patterns, and a column family can also be set to reside in memory, trading memory consumption for better response times. The related products, data models, typical applications, advantages and disadvantages, and users of column family databases are shown in the following table:
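
        The nested structure below is a conceptual sketch of this layout (row key, then column family, then column), mirroring how systems such as HBase organize data; it is not any product's actual API, and all names are made up.

```python
# Conceptual sketch of the column family layout: row key -> column family -> column -> value.
table = {
    "user#1001": {                       # row key
        "info": {                        # column family "info"
            "name": "Alice",
            "city": "Beijing",
        },
        "stats": {                       # column family "stats"; other rows may omit it entirely
            "logins": 42,
        },
    },
    "user#1002": {
        "info": {"name": "Bob"},         # a row with fewer columns/families is perfectly valid
    },
}

# Reading is always anchored on the row key, then narrowed by family and column.
print(table["user#1001"]["info"]["name"])   # -> "Alice"
```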

4.3. Document database

        In a document database, the document is the smallest unit of storage. Although every document database deployment differs, most assume that documents are encapsulated in some standardized format and are encoded and decoded in formats such as XML, YAML, JSON, and BSON, or alternatively stored in binary formats (such as PDF or Microsoft Office documents). A document database locates a document by key, so it can be regarded as a derivative of the key-value database, but with higher query efficiency. Document databases are ideal for applications whose input data can naturally be represented as documents. A document can contain very complex data structures, such as nested objects, and no fixed schema is required: each document may have a completely different structure. Document databases can build indexes on keys or on document content. In particular, the ability to index and query by document content is what distinguishes a document database from a key-value database, because in a key-value database the value is opaque to the database and cannot be indexed. Document databases are mainly used to store and retrieve document data. When many relationships and normalization constraints need to be considered and transaction support is required, a traditional relational database is the better choice. The related products, data models, typical applications, advantages and disadvantages, and users of document databases are shown in the following table:
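
        Here is a minimal sketch of content-based indexing and querying in a document database, using the pymongo client. A MongoDB server running locally is assumed, and the database, collection, and field names are illustrative.

```python
from pymongo import MongoClient  # assumes the pymongo package and a local MongoDB server

client = MongoClient("mongodb://localhost:27017/")
db = client["shopdb"]            # hypothetical database name

# Documents with different structures can live in the same collection.
db.orders.insert_one({"order_id": 1, "customer": {"name": "Alice"},
                      "items": [{"sku": "book", "qty": 2}]})
db.orders.insert_one({"order_id": 2, "customer": {"name": "Bob"}, "note": "gift wrap"})

# Index on document content (a nested field), something a pure key-value store cannot do.
db.orders.create_index("customer.name")

# Query by content rather than by key.
for doc in db.orders.find({"customer.name": "Alice"}):
    print(doc["order_id"])
```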

4.4. Graph database

        Graph databases are based on graph theory. A graph is a mathematical structure used to represent a collection of objects, consisting of vertices and the edges that connect them. Graph databases use the graph as their data model, which is completely different from the key-value, column family, and document models, and can efficiently store the relationships between vertices. Graph databases are specifically designed to handle highly interconnected data and can process relationships between entities efficiently; they are well suited to problems such as social networks, pattern recognition, dependency analysis, recommendation systems, and path finding. Some graph databases (such as Neo4j) are fully ACID compliant. However, outside of application areas such as processing graphs and relationships, the performance of graph databases is not as good as that of other NoSQL databases. The related products, data models, typical applications, advantages and disadvantages, and users of graph databases are shown in the following table:
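
        To illustrate the kind of vertex-and-edge data a graph database stores and the path-finding queries it is good at, here is a minimal pure-Python sketch of a small social graph; it is only a conceptual model, not how a product such as Neo4j is actually implemented, and the names are made up.

```python
from collections import deque

# Vertices are people; edges are "knows" relationships (stored as adjacency lists).
knows = {
    "Alice": ["Bob", "Carol"],
    "Bob":   ["Dave"],
    "Carol": ["Dave"],
    "Dave":  [],
}

def shortest_path(start: str, goal: str) -> list[str] | None:
    """Breadth-first search: the kind of relationship traversal graph databases optimize for."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbor in knows.get(path[-1], []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None

print(shortest_path("Alice", "Dave"))  # -> ['Alice', 'Bob', 'Dave']
```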

5. Three cornerstones of NoSQL

The three cornerstones of NoSQL include CAP, BASE, and eventual consistency.

5.1. CAP

        In 2000, the renowned American computer scientist Eric Brewer, a professor at the University of California, Berkeley, proposed the famous CAP theorem. Later, two scientists from the Massachusetts Institute of Technology (MIT), Seth Gilbert and Nancy Lynch, proved it formally. CAP refers to the following:

  • C (Consistency): Consistency. Any read operation always returns the result of the most recently completed write operation; that is, in a distributed environment, the data on all nodes is consistent.
  • A (Availability): Availability. Data can be obtained quickly: the result of an operation is returned within an acceptable period of time.
  • P (Tolerance of Network Partition): Partition tolerance. When a network partition occurs (that is, some nodes in the system cannot communicate with other nodes), the separated parts of the system can still operate normally.

        The CAP theorem (see the figure below) tells us that a distributed system cannot simultaneously satisfy all three of consistency, availability, and partition tolerance; at most two can be satisfied at the same time. If you pursue consistency, you have to sacrifice availability and handle the write failures caused by the system being unavailable; if you pursue availability, you have to accept that data inconsistencies may occur, for example that a read operation may not return the latest value written by a preceding write operation.

        An example of sacrificing consistency for availability is given below. Assume a distributed environment with two nodes, M1 and M2, which store two replicas of a data item V, namely V1 and V2; both replicas initially hold the value val0. Now assume two processes, P1 and P2, operate on the two replicas: process P1 writes a new value val1 to replica V1 on node M1, and process P2 reads the value of replica V2 from node M2.

        When the whole process executes normally, it follows the steps below (see the figure below).

        (1) Process P1 writes the new value val1 to replica V1 on node M1.

        (2) Node M1 sends a message MSG to node M2 to update replica V2, and the value of V2 is updated to val1.
        (3) Process P2 reads the new value val1 from replica V2 on node M2.
        However, when the network fails, the message MSG from node M1 may not reach node M2. In that case, the value of replica V2 that process P2 reads on node M2 is still the old value val0, which creates an inconsistency.
        This example shows that if we want both processes P1 and P2 to enjoy high availability, that is, to be able to access the data they need quickly, then data consistency has to be sacrificed.
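
        The scenario above can be mimicked with a tiny sketch (purely illustrative, with hypothetical names): two replica objects, an update that may or may not propagate, and a read that returns a stale value when the "network" drops the update message.

```python
class Replica:
    def __init__(self, value: str) -> None:
        self.value = value

def write(replica: Replica, new_value: str, propagate_to: Replica, network_ok: bool) -> None:
    replica.value = new_value           # P1's write succeeds locally on M1
    if network_ok:
        propagate_to.value = new_value  # MSG delivered: V2 is updated too
    # If the network is partitioned, the update message is simply lost.

# Healthy network: P2 reads the new value.
v1, v2 = Replica("val0"), Replica("val0")
write(v1, "val1", propagate_to=v2, network_ok=True)
print(v2.value)  # -> "val1"

# Partitioned network: the system stays available, but P2 reads a stale value.
v1, v2 = Replica("val0"), Replica("val0")
write(v1, "val1", propagate_to=v2, network_ok=False)
print(v2.value)  # -> "val0"  (availability preserved, consistency sacrificed)
```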

        When dealing with CAP problems, there are several obvious choices (see Figure 5-4).

        (1) CA. That is, emphasize consistency (C) and availability (A) and give up partition tolerance (P); the simplest approach is to put everything related to a transaction on the same machine. Obviously, this severely limits the scalability of the system. Traditional relational databases (MySQL, SQL Server, PostgreSQL, and so on) adopt this design principle, so their scalability is relatively poor.

        (2) CP. That is, emphasize consistency (C) and partition tolerance (P) and give up availability (A). When a network partition occurs, the affected services must wait until the data is consistent again and therefore cannot serve requests during that period. NoSQL databases such as Neo4j, BigTable, and HBase use the CP design principle.

        (3) AP. That is, emphasize availability (A) and partition tolerance (P) and give up consistency (C), allowing the system to return inconsistent data. This is acceptable for many Web 2.0 websites, whose users care first and foremost about whether the service is available. When users post a microblog entry, they must be able to publish it immediately, otherwise they will give up on the service; exactly when that post becomes visible to other users is much less important and does not noticeably affect the user experience. Therefore, for Web 2.0 websites, availability and partition tolerance take priority over data consistency, and such websites generally lean toward an AP design. Of course, adopting AP does not mean giving up consistency completely; instead, eventual consistency can be adopted. NoSQL databases such as Dynamo, Riak, CouchDB, and Cassandra use the AP design principle.

5.2. BASE

        To talk about BASE (Basically Available, Soft-state, Eventual consistency), we must first talk about ACID. A database transaction has the four ACID properties.

  • A (Atomicity): Atomicity. It means that a transaction must be an atomic unit of work, and either all of its data modifications are executed or none of them are executed.
  • C (Consistency): Consistency. It means that when the transaction is completed, all data must be kept in a consistent state.
  • I (Isolation): Isolation. It means that modifications made by concurrent transactions must be isolated from modifications made by any other concurrent transactions.
  • D (Durability): Durability. It means that after the transaction is completed, its impact on the system is permanent, and the modification will always be maintained even if a fatal system failure occurs.

        Relational database systems implement a complex transaction management mechanism to ensure that transactions strictly satisfy the four ACID requirements during execution. The transaction mechanism of relational databases meets the data consistency requirements of banking and similar domains and has therefore been widely used commercially. NoSQL databases, however, are typically used in scenarios such as Web 2.0 websites, where the requirements on data consistency are not very high but high availability of the system is essential. Therefore, to obtain high system availability, one can consider appropriately sacrificing consistency or partition tolerance. The basic idea of BASE developed on this basis. It is quite different from the ACID model: BASE sacrifices strong consistency in exchange for availability or reliability, and the Cassandra system is a good example. Interestingly, the names alone suggest that the two are somewhat incompatible: in English, a base is an alkali, while an acid is, of course, an acid.

        The basic meaning of BASE is basically available (Basically Available), soft state (Soft-state) and eventual consistency (Eventual consistency).

5.2.1. Basically available

        Basically available means that when part of a distributed system fails and becomes unavailable, the remaining parts can still be used normally; in other words, partition failures are allowed. For example, suppose a distributed data storage system consists of 10 nodes. When one node is damaged and becomes unavailable, the other 9 nodes can still serve data normally: only 10% of the data is unavailable while the remaining 90% can still be accessed, so the system can be considered "basically available".
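
        A small sketch of this 10-node example (the node count and naive hashing scheme are hypothetical): keys are hash-partitioned across nodes, one node is marked as failed, and roughly 90% of the keys remain readable.

```python
NODES = 10
failed = {3}  # suppose node 3 is down

def node_for(key: str) -> int:
    return hash(key) % NODES  # naive hash partitioning, for illustration only

keys = [f"item-{i}" for i in range(1000)]
available = [k for k in keys if node_for(k) not in failed]

print(f"{len(available) / len(keys):.0%} of keys still available")  # roughly 90%
```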

5.2.2. Soft state

        "Soft-state" is a term corresponding to "hard-state". When the data stored in the database is in a "hard state", data consistency can be guaranteed, that is, the data is always correct. "Soft state" means that the state can be out of sync for a period of time, with a certain hysteresis. Assume that a user A in a bank transfers funds to another user B. Assume that this operation is decoupled through the message queue, that is, user A puts funds into the sending queue, and after the funds arrive in the receiving queue, user B is notified to withdraw the funds. . Due to the delay in message transmission, there may be a short-term inconsistency in this process, that is, user A has put funds in the queue, but the funds have not yet arrived in the receiving queue, and user B has not yet received the funds, which will cause data Inconsistent state, that is, user A's money has been reduced, but user B's money has not increased accordingly. That is to say, there is a lag time between the start and end of the transfer. During this lag time, the two users' money has not increased accordingly. Funds appear to have disappeared, with brief inconsistencies. Although there is a lag for users, this lag is tolerated by users and may not even be perceived by users, because users on both sides do not actually know when the funds will arrive. When the funds arrive in the receiving queue after a short delay, user B can be notified to withdraw the funds, and the status is finally consistent.

5.2.3. Eventual consistency

        Consistency can be divided into strong consistency and weak consistency. The main difference between the two is whether, under highly concurrent data access, subsequent operations are guaranteed to see the latest data. With strong consistency, once an update operation completes, every subsequent read is guaranteed to return the updated data; if subsequent accesses cannot be guaranteed to read the updated data, the consistency is weak. Eventual consistency is a special case of weak consistency: it allows subsequent accesses to temporarily miss the updated data, but guarantees that after some period of time the updated data will eventually be read. In a sense, eventual consistency still pursues the same end goal as ACID's consistency: what matters is that the data ends up consistent, not that it stays consistent at every moment.

5.3. Eventual consistency

        When discussing consistency, it needs to be considered from both the server and the client perspective. From the server's perspective, consistency concerns how updates are replicated and propagated throughout the system so that the data eventually becomes consistent. From the client's perspective, consistency mainly concerns whether, under highly concurrent data access, subsequent operations can see the latest data. Relational databases usually implement strong consistency: once an update completes, subsequent accesses can immediately read the updated data. With weak consistency, there is no guarantee that subsequent accesses will read the updated data.

        Eventual consistency has a weaker requirement: the updated data only needs to become visible after some period of time. That is, if an operation OP writes a value to the distributed storage system, an eventually consistent system guarantees that, provided no other write updates that value before subsequent accesses occur, all subsequent accesses will eventually read the latest value written by OP. The time interval between the completion of OP and the point at which subsequent accesses are guaranteed to read the value it wrote is called the "inconsistency window". If no system failure occurs, the size of this window depends on factors such as communication delay, system load, and the number of replicas.

        Depending on how and when processes access the data after it has been updated, eventual consistency can be further distinguished into the following variants.

  • Causal consistency. If process A notifies process B that it has updated a data item, then process B's subsequent accesses will see the latest value written by process A. Process C, which has no causal relationship with process A, still follows the general eventual consistency rules.
  • "Read your own writes" consistency. This can be regarded as a special case of causal consistency: after process A performs an update, it always sees the updated value itself and never sees the old value (see the sketch after this list).
  • Session consistency. Accesses to the storage system are placed in the context of a session: as long as the session exists, the system guarantees "read your own writes" consistency. If the session is terminated by some failure, a new session must be established, and the guarantee does not carry over into the new session.
  • Monotonic read consistency. If a process has seen a particular value of a data object, then no subsequent access will return any value older than that one.
  • Monotonic write consistency. The system guarantees that write operations from the same process are executed in order. The system must provide this level of consistency; otherwise programming becomes very difficult.
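
        Here is a minimal sketch of "read your own writes" consistency under the assumption of one primary copy and one lagging replica (all names are hypothetical): a client that has just written routes its own reads to the primary, while other clients may still read the stale replica.

```python
class Store:
    """Sketch: one primary copy plus one asynchronously updated replica."""

    def __init__(self) -> None:
        self.primary: dict[str, str] = {}   # receives writes immediately
        self.replica: dict[str, str] = {}   # lags behind until replication runs
        self.recent_writers: set[str] = set()

    def write(self, client: str, key: str, value: str) -> None:
        self.primary[key] = value
        self.recent_writers.add(client)      # remember who has unreplicated writes

    def read(self, client: str, key: str) -> str | None:
        # Read-your-writes: a client with pending writes is served from the primary;
        # everyone else may still read the stale replica.
        source = self.primary if client in self.recent_writers else self.replica
        return source.get(key)

store = Store()
store.write("A", "x", "new")
print(store.read("A", "x"))  # -> 'new'  (process A sees its own write)
print(store.read("C", "x"))  # -> None   (process C may still see stale data)
```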

6. From NoSQL to NewSQL database

        NoSQL databases provide good scalability and flexibility, compensate well for the shortcomings of traditional relational databases, and better satisfy the needs of Web 2.0 applications. However, NoSQL databases also have inherent shortcomings: because they use non-relational data models, they lack features such as highly structured queries; their query efficiency, especially for complex queries, is inferior to that of relational databases; and they do not support the four ACID transaction properties.

        Against this background, NewSQL databases have gradually gained momentum in recent years. NewSQL is a general term for a class of new, scalable, high-performance databases. These databases not only have NoSQL's ability to store and manage massive data, but also retain traditional database features such as ACID transactions and SQL. The internal architectures of different NewSQL databases vary greatly, but they share two significant features: they all support the relational data model, and they all use SQL as their primary interface.

        Currently, representative NewSQL databases include Spanner, Clustrix, GenieDB, ScalArc, Schooner, VoltDB, RethinkDB, ScaleDB, Akiban, CodeFutures, ScaleBase, Translattice, NimbusDB, Drizzle, Tokutek, and JustOne DB, among others. In addition, there are NewSQL databases provided in the cloud (also known as cloud databases), including Amazon RDS, Microsoft SQL Azure, Database.com, Xeround, and FathomDB. Among the many NewSQL databases, Spanner has attracted particular attention. It is a scalable, multi-version, globally distributed database that supports synchronous replication; it is Google's first database that can scale globally and support external consistency. Spanner could not do this without a time API implemented with GPS and atomic clocks, which keeps time synchronized across data centers to within 10 ms. As a result, Spanner offers several attractive features: lock-free read transactions, atomic schema changes, and non-blocking reads of historical data.

        Some NewSQL databases have significant performance advantages over traditional relational databases. For example, the VoltDB system uses an innovative NewSQL architecture: by running the database in main memory, it eliminates the buffer pool that consumes system resources, and it can execute transactions 45 times faster than a traditional relational database. VoltDB scales to 39 servers and can process 1.6 million transactions per second (on 300 CPU cores), whereas a Hadoop cluster with the same processing capability would require more servers.

        Taken together, the arrival of the big data era has triggered a change in data processing architectures. In the past, industry and academia pursued a single architecture that could support multiple types of applications ("One Size Fits All"), as shown in Figure 5-5, including transactional applications (OLTP systems), analytical applications (OLAP and data warehouses), and Internet applications (Web 2.0). However, practice has shown that this ideal cannot be realized: the data management requirements of different application scenarios are completely different, and no single database architecture can satisfy them all. Therefore, in the big data era, database architectures have begun to diversify, forming three camps: traditional relational databases (OldSQL), NoSQL databases, and NewSQL databases, each with its own application scenarios and room for development.

        In particular, traditional relational databases have not been completely replaced by the other two camps. While keeping their basic architecture unchanged, many relational database products have begun to introduce in-memory computing and database appliance technology to improve processing performance. In the future, the three camps will continue to coexist and prosper together, but one thing is certain: the era in which traditional relational databases reigned alone has passed.

        To give a clearer picture of the products related to traditional relational databases, NoSQL databases, and NewSQL databases, Figure 5-6 shows a classification of the three categories of database products.

Origin blog.csdn.net/java_faep/article/details/132753743