Go Must-Know Series: Data Storage and NoSQL

Author: Zen and the Art of Computer Programming

1. Background introduction

Introduction to data storage

As the Internet and mobile Internet have grown, traditional relational databases have struggled to meet every storage need. With the rise of cloud computing and big data, a single relational database finds it increasingly hard to handle the volume of data that enterprise applications produce. Large-scale distributed database systems such as HBase and MongoDB have therefore come into wide use. These NoSQL systems store data in non-relational models, and many of them can also stand in for a relational database in simple query scenarios. This article introduces the core concepts and features of NoSQL data storage.

NoSQL overview

NoSQL is usually read as "Not Only SQL". It is a collective name for non-relational databases, which are mainly used to store and access large amounts of data quickly in very large-scale, high-concurrency scenarios. NoSQL databases usually store data as key-value pairs, documents, column families, or graphs. Typical characteristics include the following, although not every product has all of them:

  1. Distributed storage: data is spread across multiple nodes that communicate over the network;
  2. Automatic horizontal expansion: when the amount of data grows, nodes can be added dynamically;
  3. Elastic expansion: many systems can add or remove nodes without downtime;
  4. No fixed schema: the storage structure can be adjusted flexibly as needs change;
  5. Multiple data models: key-value, column family, document, graph, and others;
  6. Flexible query languages: many products offer their own query language, such as CQL (Cassandra), N1QL (Couchbase), and other SQL-like dialects;
  7. High availability: replication lets the system keep serving when individual nodes fail; concrete figures such as 99.99% depend on the product and the deployment;
  8. Easy to use: most products provide mature SDKs and tooling;
  9. Often open source: many popular NoSQL databases are open source, some under the Apache 2.0 license (for example Cassandra and HBase), but licenses vary by product.
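
Horizontal expansion (items 2 and 3) depends on how keys are assigned to nodes. One common technique is consistent hashing. The following is a minimal sketch under stated assumptions: the `Ring` type, the node names, and the choice of FNV hashing are illustrative and not taken from any particular database, and real systems usually add virtual nodes and replication.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// Ring is a minimal consistent-hash ring without virtual nodes.
type Ring struct {
	hashes []uint32          // sorted node positions on the ring
	nodes  map[uint32]string // position -> node name
}

func hashKey(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

func NewRing(nodes ...string) *Ring {
	r := &Ring{nodes: map[uint32]string{}}
	for _, n := range nodes {
		r.Add(n)
	}
	return r
}

// Add places a node on the ring. Only keys that fall between the new
// node's predecessor and the new node itself change owner.
func (r *Ring) Add(node string) {
	h := hashKey(node)
	if _, exists := r.nodes[h]; exists {
		return // same position already taken; ignored in this sketch
	}
	r.nodes[h] = node
	r.hashes = append(r.hashes, h)
	sort.Slice(r.hashes, func(i, j int) bool { return r.hashes[i] < r.hashes[j] })
}

// Locate returns the first node clockwise from the key's position,
// or "" if the ring is empty.
func (r *Ring) Locate(key string) string {
	if len(r.hashes) == 0 {
		return ""
	}
	h := hashKey(key)
	i := sort.Search(len(r.hashes), func(i int) bool { return r.hashes[i] >= h })
	if i == len(r.hashes) {
		i = 0 // wrap around to the start of the ring
	}
	return r.nodes[r.hashes[i]]
}

func main() {
	ring := NewRing("node-a", "node-b", "node-c")
	keys := []string{"user:1", "user:2", "user:3", "user:4", "user:5"}
	before := map[string]string{}
	for _, k := range keys {
		before[k] = ring.Locate(k)
	}
	ring.Add("node-d") // scale out while the other nodes keep running
	for _, k := range keys {
		fmt.Printf("%s: %s -> %s\n", k, before[k], ring.Locate(k))
	}
}
```

A key either stays where it was or moves to the newly added node, which is why nodes can be added without reshuffling the whole data set.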

NoSQL database classification

At present, NoSQL databases can be roughly divided into the following three categories:

  1. Key-value stores: Redis, Riak, Memcached, etc. These systems are simple, fast, and low-latency, and suit scenarios such as caching, counters, and leaderboards.
  2. Document databases: MongoDB, Couchbase, Firebase Realtime Database, etc. These databases store JSON-like documents (MongoDB uses the binary BSON format), offer flexible queries through a rich query syntax, and provide client libraries for many platforms and languages.
  3. Column-family (wide-column) databases: Cassandra, HBase, etc. These databases organize data by column families and columns, which can speed up queries that touch only a few columns and can save disk space.
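
The three categories differ mainly in the shape of a record. As a rough illustration, the shapes can be modeled with plain Go types. This is an in-memory sketch only: the type names `KeyValueStore`, `Document`, and `ColumnFamilyStore` are made up for this example and do not correspond to any driver API.

```go
package main

import "fmt"

// KeyValueStore models a key-value database: one opaque value per key.
type KeyValueStore map[string][]byte

// Document models a schemaless document: fields may differ per record.
type Document map[string]any

// ColumnFamilyStore models a wide-column layout:
// row key -> column family -> column name -> value.
type ColumnFamilyStore map[string]map[string]map[string]string

// PutColumn writes one cell, creating the row and family on demand.
func (s ColumnFamilyStore) PutColumn(row, family, column, value string) {
	if s[row] == nil {
		s[row] = map[string]map[string]string{}
	}
	if s[row][family] == nil {
		s[row][family] = map[string]string{}
	}
	s[row][family][column] = value
}

func main() {
	kv := KeyValueStore{}
	kv["session:42"] = []byte("alice")

	// Two documents in the same collection with different fields.
	docs := []Document{
		{"_id": 1, "name": "alice", "tags": []string{"admin"}},
		{"_id": 2, "name": "bob", "email": "bob@example.com"},
	}

	cf := ColumnFamilyStore{}
	cf.PutColumn("user:1", "profile", "name", "alice")
	cf.PutColumn("user:1", "stats", "logins", "7")

	fmt.Println(string(kv["session:42"]), len(docs), cf["user:1"]["stats"]["logins"])
}
```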

NoSQL advantages

Due to the various characteristics of NoSQL databases, they can reduce costs, improve performance, and implement more flexible data models, thereby gaining advantages in many scenarios. Here are some typical scenarios for using NoSQL:

  1. Large-scale caching: Many NoSQL databases can also be used as caching layers. Because of their high performance, simplicity and ease of use, they are especially suitable for scenarios with very high read-write ratios. For example, Redis is a good in-memory cache database.
  2. Massive data analysis: the flexible query languages of NoSQL databases, some of which are SQL-like, make them useful tools for processing massive data. For example, MongoDB provides an aggregation pipeline, as well as map-reduce (deprecated in recent versions), for data aggregation, summarization, filtering, and similar operations.
  3. Real-time data updates: many NoSQL databases are optimized for fast, often asynchronous, writes, which suits real-time update scenarios. For example, Cassandra is designed for high write throughput, so new data can be written and become visible within seconds or even milliseconds.
  4. Complex relational-style queries: many NoSQL databases have their own query syntax and support several query methods, so fairly complex queries over related data are possible. For example, Couchbase's N1QL is a SQL-like language for JSON documents that supports conditional queries, sorting, and joins.
  5. Time series data storage: NoSQL databases can also be used to store time series data. For example, time series databases InfluxDB and Druid support efficient storage, retrieval, and analysis of time series data.
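
The caching scenario in item 1 is usually implemented with the cache-aside pattern: read from the cache first and fall back to the slower store on a miss. The sketch below uses an in-process map with a time-to-live as a stand-in for a key-value cache such as Redis. `TTLCache`, `GetOrLoad`, and the `load` callback are hypothetical names for this example, not part of any client library.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type entry struct {
	value     string
	expiresAt time.Time
}

// TTLCache is a hypothetical in-process stand-in for a key-value cache.
type TTLCache struct {
	mu    sync.Mutex
	items map[string]entry
}

func NewTTLCache() *TTLCache { return &TTLCache{items: map[string]entry{}} }

// Set stores a value that expires after ttl.
func (c *TTLCache) Set(key, value string, ttl time.Duration) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.items[key] = entry{value: value, expiresAt: time.Now().Add(ttl)}
}

// Get returns the value if it is present and not yet expired.
func (c *TTLCache) Get(key string) (string, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.items[key]
	if !ok {
		return "", false
	}
	if !time.Now().Before(e.expiresAt) {
		delete(c.items, key) // lazily evict expired entries
		return "", false
	}
	return e.value, true
}

// GetOrLoad implements cache-aside: on a miss it calls load and caches
// the result. Two concurrent misses may both call load in this sketch.
func (c *TTLCache) GetOrLoad(key string, ttl time.Duration, load func(string) (string, error)) (string, error) {
	if v, ok := c.Get(key); ok {
		return v, nil
	}
	v, err := load(key)
	if err != nil {
		return "", err
	}
	c.Set(key, v, ttl)
	return v, nil
}

func main() {
	cache := NewTTLCache()
	loads := 0
	load := func(key string) (string, error) {
		loads++ // stands in for a slow database query
		return "profile-of-" + key, nil
	}
	for i := 0; i < 3; i++ {
		v, _ := cache.GetOrLoad("user:1", time.Minute, load)
		fmt.Println(v)
	}
	fmt.Println("backend loads:", loads)
}
```

Only the first of the three reads reaches the backing store, which is where the benefit for read-heavy workloads comes from.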

2. Core concepts and connections

Consistency and availability

In a distributed system, consistency and availability are two basic goals. Consistency requires that all replicas hold the same data; availability requires that the service can always respond. To keep data highly available, NoSQL databases usually rely on redundancy: the same data is stored on more than one replica, which trades some consistency for reliability and availability. For specific business scenarios, some NoSQL databases can also provide stronger consistency guarantees, for example through configurable consistency levels or distributed transactions. None of them can sidestep the CAP theorem, however; each product makes its own trade-off among consistency, availability, and partition tolerance.
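
One widely used way to balance the two goals over replicated data is quorum replication: with N replicas, a write needs W acknowledgements and a read needs R responses, and choosing R + W > N makes every read overlap with the latest successful write. The following is a simplified in-memory sketch of that idea. The `Cluster` and `Replica` types are assumptions for illustration, versions are supplied by the caller, and a real system would contact replicas over the network and handle partial failures.

```go
package main

import (
	"errors"
	"fmt"
)

type versioned struct {
	value   string
	version int
}

// Replica is an in-memory stand-in for one storage node.
type Replica struct {
	up   bool
	data map[string]versioned
}

// Cluster holds N replicas and the write/read quorum sizes W and R.
type Cluster struct {
	replicas []*Replica
	W, R     int
}

var ErrNoQuorum = errors.New("not enough replicas reachable")

func NewCluster(n, w, r int) *Cluster {
	c := &Cluster{W: w, R: r}
	for i := 0; i < n; i++ {
		c.replicas = append(c.replicas, &Replica{up: true, data: map[string]versioned{}})
	}
	return c
}

func (c *Cluster) reachable() []*Replica {
	var up []*Replica
	for _, r := range c.replicas {
		if r.up {
			up = append(up, r)
		}
	}
	return up
}

// Write stores the value on all reachable replicas, but only if at
// least W of them are reachable. Versions must start at 1 and increase.
func (c *Cluster) Write(key, value string, version int) error {
	up := c.reachable()
	if len(up) < c.W {
		return ErrNoQuorum
	}
	for _, r := range up {
		r.data[key] = versioned{value, version}
	}
	return nil
}

// Read needs at least R reachable replicas and returns the value with
// the highest version among them.
func (c *Cluster) Read(key string) (string, error) {
	up := c.reachable()
	if len(up) < c.R {
		return "", ErrNoQuorum
	}
	var newest versioned
	for _, r := range up {
		if v, ok := r.data[key]; ok && v.version > newest.version {
			newest = v
		}
	}
	return newest.value, nil
}

func main() {
	c := NewCluster(3, 2, 2) // N=3, W=2, R=2, so R+W > N
	fmt.Println(c.Write("k", "v1", 1))
	c.replicas[0].up = false // replica 0 misses the next write
	fmt.Println(c.Write("k", "v2", 2))
	c.replicas[0].up = true // replica 0 is back but stale
	c.replicas[1].up = false
	v, err := c.Read("k")
	fmt.Println(v, err)
}
```

Lowering W or R improves availability and latency at the cost of possibly stale reads; raising them does the opposite.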

CAP theorem

The CAP theorem (Consistency, Availability, Partition tolerance), also known as Brewer's theorem, states that a distributed system cannot guarantee all three properties at the same time. Consistency means that every read sees the most recent write or returns an error. Availability means that every request to a non-failing node receives a response. Partition tolerance means that the system keeps operating even when network failures drop or delay messages between nodes. Because network partitions cannot be ruled out in practice, the real choice arises when a partition occurs: the system must give up either consistency or availability.
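
The choice that CAP forces during a partition can be shown with a toy pair of replicas. This is a sketch under simplifying assumptions: the `Node` type, the `CP` and `AP` modes, and the helper functions are invented for this example, replication is synchronous while the link is up, and a CP node refuses both reads and writes while partitioned.

```go
package main

import (
	"errors"
	"fmt"
)

type Mode int

const (
	CP Mode = iota // prefer consistency: refuse requests while partitioned
	AP             // prefer availability: answer from local state
)

var ErrUnavailable = errors.New("unavailable: cannot reach peer replica")

// Node is one of two replicas that mirror each other.
type Node struct {
	mode        Mode
	value       string
	peer        *Node
	partitioned bool // true while the link to peer is down
}

func (n *Node) Write(v string) error {
	if n.partitioned {
		if n.mode == CP {
			return ErrUnavailable
		}
		n.value = v // AP: accept locally; the peer is now stale
		return nil
	}
	n.value = v
	n.peer.value = v // synchronous replication while the link is up
	return nil
}

func (n *Node) Read() (string, error) {
	if n.partitioned && n.mode == CP {
		return "", ErrUnavailable
	}
	return n.value, nil
}

func newPair(mode Mode) (*Node, *Node) {
	a, b := &Node{mode: mode}, &Node{mode: mode}
	a.peer, b.peer = b, a
	return a, b
}

func setPartition(a, b *Node, down bool) {
	a.partitioned, b.partitioned = down, down
}

func main() {
	for _, mode := range []Mode{CP, AP} {
		a, b := newPair(mode)
		a.Write("v1")
		setPartition(a, b, true)
		writeErr := a.Write("v2")
		got, readErr := b.Read()
		fmt.Printf("mode=%d write err=%v, read from b=%q err=%v\n", mode, writeErr, got, readErr)
	}
}
```

In CP mode the clients see errors but never stale data; in AP mode every request is answered, but the second replica keeps returning the old value until the partition heals.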

NoSQL databases also have to deal with data consistency, and the best-known approach is the BASE theory: Basically Available, Soft state, and Eventually consistent. BASE holds that a system should stay basically available at all times and may give up strong consistency to do so. Soft state means that data may lag between replicas, so the system's state can change over time as updates propagate. Eventual consistency means that replicas may disagree for a while, for example because of a network partition, but converge to the same value once updates stop and replication catches up. Where an operation does need strong consistency, additional mechanisms such as distributed locks can be used.

BASE theory

BASE theory was proposed by eBay architect Dan Pritchett in 2008. Its idea is to strike a balance rather than insist on either extreme: instead of requiring the strong consistency and isolation of ACID transactions, it relaxes them in exchange for availability. BASE holds that even if strong consistency cannot be achieved, user requests should still receive a response. To follow this approach, a NoSQL database aims for:

  1. Basically available: the system keeps responding to requests, possibly with higher latency or reduced functionality, even when some nodes fail;
  2. Soft state: data may lag between replicas, so the state of the system can change over time as updates propagate;
  3. Eventual consistency: once updates stop, all replicas reach the same state after a period of time.

BA: Basically available

Basically available means that the distributed system keeps serving requests even when some components fail, although response times may rise or some functions may be degraded. This requires avoiding single points of failure, so that the service stays up as long as enough individual components keep running. Multiple copies of each component can also be deployed, according to business needs, to keep availability high.

High availability

High Availability (HA) is a related concept rather than a separate letter of BASE. It means that a system keeps running with little or no downtime. A reliable distributed system achieves high availability through redundancy and clustering, so that it keeps serving when individual nodes fail. For business scenarios that depend heavily on the data store, the NoSQL database should therefore support high availability. Common ways to achieve it include master-slave replication and sharding combined with replication.

S: soft state

Soft state means that replicas may lag behind one another, so the system's state can change over time as updates propagate, even without new writes. Soft state is comparatively easy for a distributed system to offer, because keeping every node synchronized over the network at all times is hard, and allowing temporary divergence removes that requirement. The cost is that data is not strongly consistent and a read may return a stale value.

E: eventual consistency

Eventual consistency means that, once updates stop, all replicas reach the same state after a period of time. It is more demanding to implement than it sounds, because it involves replication and communication among multiple nodes, and the strategy has to be chosen according to the actual environment, business characteristics, and data volume. Compared with other forms of weak consistency, eventual consistency guarantees convergence; compared with strong consistency, it avoids much of the synchronization overhead.
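
Soft state and eventual convergence can be illustrated with a last-write-wins merge between replicas, one of several possible reconciliation strategies. This is a minimal sketch: the `Replica` type, the caller-supplied timestamps, and the `Sync` helper are assumptions for this example, and records with equal timestamps but different values are not reconciled here.

```go
package main

import "fmt"

// Record carries a logical timestamp used to pick a winner.
type Record struct {
	Value string
	TS    int64
}

// Replica is an in-memory copy of the data set.
type Replica map[string]Record

// Put applies a local write with a caller-supplied timestamp.
func (r Replica) Put(key, value string, ts int64) {
	r[key] = Record{Value: value, TS: ts}
}

// Merge copies every record from other that is missing locally or has
// a newer timestamp (last write wins).
func (r Replica) Merge(other Replica) {
	for k, theirs := range other {
		if mine, ok := r[k]; !ok || theirs.TS > mine.TS {
			r[k] = theirs
		}
	}
}

// Sync runs one anti-entropy round in both directions.
func Sync(a, b Replica) {
	a.Merge(b)
	b.Merge(a)
}

func main() {
	a, b, c := Replica{}, Replica{}, Replica{}
	a.Put("cart", "book", 1)
	b.Put("cart", "book,pen", 2)
	c.Put("theme", "dark", 1)

	// Soft state: the replicas currently disagree.
	fmt.Printf("before: a=%v b=%v c=%v\n", a, b, c)

	// Background synchronization, with no new writes arriving.
	Sync(a, b)
	Sync(b, c)
	Sync(a, c)

	fmt.Printf("after:  a=%v b=%v c=%v\n", a, b, c)
}
```

After the three rounds all replicas hold the same records, which is the convergence that eventual consistency promises; between the writes and the last round, a read may return an older value.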

Distributed transactions

A distributed transaction is one whose participants, transaction manager (coordinator), and resource servers are deployed on different nodes of a distributed system. It requires data exchange and coordination among multiple nodes and is a complex technology. Support varies among NoSQL databases: some offer multi-record or cross-shard transactions, while others guarantee atomicity only for a single key or document, so the implementation mechanism and applicable scenarios need to be evaluated case by case.
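
A classic coordination protocol for such transactions is two-phase commit (2PC). The sketch below shows only the control flow, under stated assumptions: the `Participant` interface, the toy `account` type, and `RunTransaction` are hypothetical, everything runs in one process, and coordinator crashes, timeouts, and durable logging are left out.

```go
package main

import (
	"errors"
	"fmt"
)

// Participant is a hypothetical resource manager in a transaction.
type Participant interface {
	Prepare() error // vote: nil means "yes, I can commit"
	Commit()
	Rollback()
}

// account is a toy participant that checks its balance during Prepare.
type account struct {
	name    string
	balance int
	delta   int  // change requested by the transaction
	pending bool // true between a successful Prepare and Commit/Rollback
}

func (a *account) Prepare() error {
	if a.balance+a.delta < 0 {
		return errors.New(a.name + ": insufficient funds")
	}
	a.pending = true
	return nil
}

func (a *account) Commit() {
	if a.pending {
		a.balance += a.delta
		a.pending = false
	}
}

func (a *account) Rollback() { a.pending = false }

// RunTransaction acts as the coordinator. Phase 1 collects votes;
// phase 2 commits only if every participant voted yes.
func RunTransaction(ps ...Participant) error {
	for i, p := range ps {
		if err := p.Prepare(); err != nil {
			for _, prepared := range ps[:i] {
				prepared.Rollback() // undo earlier yes-votes
			}
			return fmt.Errorf("transaction aborted: %w", err)
		}
	}
	for _, p := range ps {
		p.Commit()
	}
	return nil
}

func main() {
	alice := &account{name: "alice", balance: 100, delta: -30}
	bob := &account{name: "bob", balance: 0, delta: 30}
	fmt.Println(RunTransaction(alice, bob), alice.balance, bob.balance)

	alice.delta, bob.delta = -100, 100 // more than alice has left
	fmt.Println(RunTransaction(bob, alice), alice.balance, bob.balance)
}
```

The second transfer is aborted as a whole: bob's successful vote is rolled back and neither balance changes, which is the atomicity that a distributed transaction is meant to provide.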

The choice between CAP theory and BASE theory

CAP theory and BASE theory do not provide ready-made solutions; they describe trade-offs. To judge whether a NoSQL database is suitable for a complex distributed system in practice, both need to be considered together. Generally speaking, when selecting a NoSQL database, an eventually consistent solution is often preferred where the business can tolerate it, because it avoids the performance cost of strong consistency. If strong consistency is required, take appropriate measures to prevent data inconsistency, such as distributed locks. In addition, choosing a highly available NoSQL database helps avoid service interruptions caused by hardware failures.

Origin blog.csdn.net/universsky2015/article/details/133594776