What is a database? What are the basic classifications and main features of databases?

A database is a collection of data stored in an organized manner. Starting from the basic concepts, this article explains the main categories and basic characteristics of databases, and takes a closer look at vector databases, a type of database that has attracted much attention in the era of large models. It is intended as a reference for readers who are starting to learn the basic concepts of the database field.

1. What is a database?

Let's first look at Wikipedia's definition of a database: a collection of data that is stored together in a certain way, can be shared by multiple users, has as little redundancy as possible, and is independent of the application. A database consists of multiple table spaces.

Baidu Encyclopedia defines a database as follows: A database is a "warehouse that organizes, stores, and manages data according to a data structure." It is a large collection of data that is stored in a computer over the long term and is organized, shareable, and managed in a unified way.

A database (DB) is a collection of data stored in an organized manner. It can be understood as a warehouse for computer data. This warehouse organizes and stores data according to a certain data structure (that is, the organizational form of the data or the relationships between data items), and the data in it can be managed through the various methods the database provides.


2. Classification of databases

According to the data model used to organize and store data, databases can be divided into relational databases (managed by a Relational Database Management System, RDBMS for short) and non-relational databases (NoSQL).

Relational database: stores data in tabular form and uses SQL as the operating language. Examples include MySQL, Oracle, and SQL Server. Its characteristic is that data is highly organized and can be classified, filtered, and searched by matching values. It is suitable for application scenarios that require strong data consistency, handle large transaction volumes, and see few structural changes.
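As a minimal sketch of the tabular model and SQL, the following uses Python's built-in sqlite3 module as a stand-in for MySQL, Oracle, or SQL Server. The users and orders tables, their columns, and the helper function names are assumptions made up for illustration.

```python
import sqlite3


def make_shop_db():
    """Build a small in-memory database with a hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL, city TEXT)"
    )
    conn.execute(
        "CREATE TABLE orders ("
        "id INTEGER PRIMARY KEY, "
        "user_id INTEGER REFERENCES users(id), "
        "amount REAL)"
    )
    conn.executemany(
        "INSERT INTO users (id, name, city) VALUES (?, ?, ?)",
        [(1, "Alice", "Paris"), (2, "Bob", "Berlin")],
    )
    conn.executemany(
        "INSERT INTO orders (user_id, amount) VALUES (?, ?)",
        [(1, 30.0), (1, 12.5), (2, 8.0)],
    )
    conn.commit()
    return conn


def total_spent_by_city(conn, city):
    """Filter rows by an exact value, join two tables, and aggregate."""
    query = (
        "SELECT u.name, SUM(o.amount) "
        "FROM users u JOIN orders o ON o.user_id = u.id "
        "WHERE u.city = ? "
        "GROUP BY u.name ORDER BY u.name"
    )
    return conn.execute(query, (city,)).fetchall()


if __name__ == "__main__":
    print(total_spent_by_city(make_shop_db(), "Paris"))
```

The JOIN across users and orders is the kind of multi-table operation that the relational model handles directly.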

Non-relational database (NoSQL): does not use tables for data storage, but uses different data models such as documents, column families, and key-value pairs. The advantage is that it can handle massive data, supports highly concurrent reads and writes, and is highly scalable. However, it typically does not support JOIN operations and is not well suited to application scenarios that require multi-table operations and have high transaction requirements.


Tips: The birth history of non-relational databases

· Non-relational databases are also called NoSQL databases. NoSQL is usually read as "Not Only SQL" rather than "No SQL". NoSQL databases did not emerge to reject or replace relational databases, but to serve as an effective supplement to traditional relational databases.

· With the rise of Web 2.0 and Web 3.0 websites on the Internet, traditional relational databases have encountered many problems and challenges in handling ever-expanding massive data and ultra-large-scale, highly concurrent dynamic websites. For example, it is difficult for them to break through the I/O and performance bottlenecks of the database.

· As a result, a large number of functionally specialized database products have emerged for specific scenarios, aiming at high performance, high concurrency, and ease of use. Non-relational databases were born in this context and developed very rapidly. In these specific scenarios, NoSQL databases can achieve much higher efficiency and performance. NoSQL is a broad term for non-relational databases, and it breaks the long-standing dominance of relational databases and ACID theory.

· NoSQL databases do not require a fixed table structure for data storage, and there is usually no join operation. In big data access they have performance advantages that relational databases find hard to match, and they meet the needs of enterprise applications for horizontally scalable data storage.

At present, there is still no unified standard for NoSQL databases. There are now four major classifications: key-value databases, column storage databases, document databases, and graph databases.

· Key-Value storage database

The key-value database is similar to the hash table found in most programming languages. Data can be added, queried, or deleted by key. Because data is accessed by its primary key, this model achieves high performance and scalability.

Key-value databases primarily use a hash table with a specific key and a pointer (pointing to specific data). For IT systems, the advantages of the key/value model are simplicity, ease of deployment, and high concurrency.
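The key-value model can be sketched with an ordinary Python dictionary, which is itself a hash table. The KeyValueStore class and its method names are hypothetical and only illustrate add, query, and delete by key. A real key-value database adds persistence, networking, and concurrency control on top of this idea.

```python
class KeyValueStore:
    """Toy in-memory key-value store backed by a hash table (dict)."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        """Add a new entry or overwrite the value stored under key."""
        self._data[key] = value

    def get(self, key, default=None):
        """Return the value stored under key, or default if it is absent."""
        return self._data.get(key, default)

    def delete(self, key):
        """Remove key and report whether it was present."""
        if key in self._data:
            del self._data[key]
            return True
        return False


if __name__ == "__main__":
    store = KeyValueStore()
    store.put("user:1", {"name": "Alice"})
    print(store.get("user:1"))
    print(store.delete("user:1"))
    print(store.get("user:1"))
```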

· Column store database

This type of database is usually used to handle massive data stored in a distributed manner. Keys still exist, but each key points to multiple columns, and the columns are grouped into column families. Examples: Cassandra and HBase. Riak, which is sometimes listed in this category, is better described as a key-value store.
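The idea of one row key pointing to columns grouped into column families can be sketched with nested dictionaries. This illustrates the data model only, not how Cassandra or HBase are implemented. The ColumnFamilyTable class and the family and column names are assumptions for illustration.

```python
class ColumnFamilyTable:
    """Toy model: row key -> column family -> column -> value."""

    def __init__(self, families):
        self.families = tuple(families)
        self.rows = {}

    def put(self, row_key, family, column, value):
        """Store one cell; the column family must be declared up front."""
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        row = self.rows.setdefault(row_key, {name: {} for name in self.families})
        row[family][column] = value

    def get_family(self, row_key, family):
        """Return a copy of all columns of one family for one row."""
        row = self.rows.get(row_key)
        if row is None:
            return {}
        return dict(row.get(family, {}))


if __name__ == "__main__":
    table = ColumnFamilyTable(["profile", "activity"])
    table.put("user1", "profile", "name", "Alice")
    table.put("user1", "profile", "city", "Paris")
    table.put("user1", "activity", "last_login", "2023-11-07")
    print(table.get_family("user1", "profile"))
```

Rows do not need to share the same columns: a row can leave a family empty or use columns that no other row has.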

· Document database

The document database was inspired by the Lotus Notes office software and is similar to the key-value store described above. Its data model is the versioned document: a semi-structured document stored in a specific format, such as JSON. Document databases can be seen as an upgraded version of key-value databases that allows values to be nested. When processing complex data such as web pages, document databases offer higher query efficiency than traditional key-value databases. Examples: CouchDB, MongoDB.
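To illustrate nested, JSON-like documents and a query on a nested field, here is a small sketch in plain Python. It does not use the CouchDB or MongoDB APIs. The helper names get_path and find, the dotted-path convention, and the sample documents are all assumptions.

```python
def get_path(document, dotted_path):
    """Follow a dotted path such as 'author.city' through nested dicts."""
    current = document
    for part in dotted_path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def find(collection, dotted_path, expected):
    """Return every document whose nested field equals the expected value."""
    return [doc for doc in collection if get_path(doc, dotted_path) == expected]


# Documents in one collection may have different shapes.
PAGES = [
    {"_id": 1, "title": "Intro", "author": {"name": "Alice", "city": "Paris"}},
    {"_id": 2, "title": "Indexes", "author": {"name": "Bob", "city": "Berlin"}},
    {"_id": 3, "title": "Draft"},
]

if __name__ == "__main__":
    print(find(PAGES, "author.city", "Paris"))
```

A plain key-value store would have to fetch and decode each value before it could inspect a nested field such as author.city.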

· Graph database

As the name suggests, the graph database is a database that stores graph relationships, and the graph model is its central concept. The graph model consists of two elements: nodes and edges. Each node represents an entity (person, place, thing, etc.), and each edge represents the connection between two nodes. This general structure can be used to model various scenarios, such as social networks and anything else defined by relationships. Its characteristic is that it can handle densely connected and sparse data and supports fast complex queries. GraphDB and Neo4j are commonly used graph databases.
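The node-and-edge model can be sketched with an adjacency structure. The Graph class below and the "friends of friends" query are hypothetical examples of a relationship query on a social network. They do not use the GraphDB or Neo4j APIs.

```python
from collections import defaultdict


class Graph:
    """Toy undirected graph: each node maps to the set of its neighbors."""

    def __init__(self):
        self.edges = defaultdict(set)

    def add_edge(self, a, b):
        """Connect two nodes in both directions."""
        self.edges[a].add(b)
        self.edges[b].add(a)

    def neighbors(self, node):
        """Return a copy of the direct neighbors of node."""
        return set(self.edges.get(node, ()))

    def friends_of_friends(self, node):
        """Nodes exactly two hops away that are not already direct neighbors."""
        direct = self.neighbors(node)
        two_hops = set()
        for friend in direct:
            two_hops |= self.neighbors(friend)
        return two_hops - direct - {node}


if __name__ == "__main__":
    g = Graph()
    g.add_edge("alice", "bob")
    g.add_edge("bob", "carol")
    g.add_edge("carol", "dave")
    g.add_edge("alice", "erin")
    print(g.friends_of_friends("alice"))
```

The query walks edges directly instead of joining tables, which is why relationship-heavy data fits this model well.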

In addition, with the explosive growth of large model technology, a new type of database, the vector database (sometimes called the "hippocampus of large models"), has become one of the hottest technology topics at present.

Vector database:

A vector database is a database that stores and manages data in the form of vector embeddings. In a vector database, each vector has a unique identifier and is stored in a continuous vector space.

Unstructured data such as images, text, audio and video can be converted into vector data through some kind of transformation or embedding learning and stored in a vector database, thereby enabling similarity search and retrieval of images, text, audio and video. This means that a vector database can be used to find the most similar or relevant data based on semantic or contextual meaning, rather than using traditional methods of querying a database based on exact matches or predefined criteria.
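A minimal sketch of similarity search follows, assuming the embeddings have already been computed. The three-dimensional vectors are made-up values, not the output of a real embedding model, and the function names are hypothetical. The search is a brute-force scan using cosine similarity, whereas real vector databases rely on indexes to avoid comparing against every vector.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, vectors, k=2):
    """Return the ids of the k stored vectors most similar to the query."""
    scored = [(cosine_similarity(query, vec), vec_id) for vec_id, vec in vectors.items()]
    scored.sort(reverse=True)
    return [vec_id for _, vec_id in scored[:k]]


# Made-up embeddings keyed by a unique identifier.
EMBEDDINGS = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "car": [0.0, 0.1, 0.9],
}

if __name__ == "__main__":
    print(top_k([1.0, 0.0, 0.0], EMBEDDINGS, k=2))
```

The query matches "cat" and "kitten" because their vectors point in a similar direction, not because any stored value equals the query exactly.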

Vector databases have the following characteristics when processing vector data:

· Efficient storage and query: The vector database adopts a specific storage structure and index algorithm to efficiently store and query vector data, reduce data redundancy, and improve query efficiency.

· Multi-dimensional query: The vector database supports multi-dimensional query, which can be queried based on multiple attributes of the vector, such as similarity query, range query, etc.

· Vector similarity calculation: The vector database can compute the similarity between vectors to find the most similar vector data. This is often used in recommendation systems, image search, and other applications.

· High concurrency processing: The vector database has strong concurrent processing capabilities and can handle a large number of vector data query requests at the same time.

· Vector index support: The vector database supports various vector indexing techniques, such as inverted indexes, KD-Trees, and LSH (locality-sensitive hashing), to accelerate queries over vector data.

· Distributed storage: Some vector databases support distributed storage and computing, can be horizontally expanded, and are suitable for processing large-scale vector data.
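Of the index techniques listed above, LSH is perhaps the easiest to sketch. The following assumes the random-hyperplane variant, which is one common form of LSH for cosine similarity. The class name, parameters, and seed are hypothetical, and production systems typically combine several hash tables to improve recall.

```python
import random
from collections import defaultdict


class RandomHyperplaneLSH:
    """Toy LSH index: vectors with the same sign pattern share a bucket."""

    def __init__(self, dim, num_planes=8, seed=0):
        rng = random.Random(seed)
        # Each random hyperplane contributes one bit to the signature.
        self.planes = [
            [rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_planes)
        ]
        self.buckets = defaultdict(list)

    def signature(self, vector):
        """One bit per hyperplane: which side of the plane the vector lies on."""
        return tuple(
            1 if sum(p * x for p, x in zip(plane, vector)) >= 0 else 0
            for plane in self.planes
        )

    def add(self, vector_id, vector):
        """Place the vector id in the bucket named by its signature."""
        self.buckets[self.signature(vector)].append(vector_id)

    def candidates(self, query):
        """Return ids in the query's bucket; only these need exact comparison."""
        return list(self.buckets.get(self.signature(query), []))


if __name__ == "__main__":
    index = RandomHyperplaneLSH(dim=3)
    index.add("a", [1.0, 2.0, 3.0])
    # Same direction as "a", so it falls into the same bucket.
    print(index.candidates([2.0, 4.0, 6.0]))
```

Vectors pointing in similar directions tend to land in the same bucket, so a query only has to be compared against a small candidate set rather than the whole collection.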

Vector databases achieve efficient storage and retrieval by using indexing technology and vector retrieval algorithms to respond quickly to queries over large volumes of high-dimensional data. The vector database is also a type of non-relational database. In addition to managing vector data, it also supports the management of traditional structured data. In actual use, there are many scenarios where vector fields and structured fields must be filtered and retrieved at the same time, which is also a challenge for vector databases.
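Combined retrieval over a structured field and a vector field can be sketched as "filter first, then rank by distance". The record layout, the category field, and the filtered_search function are assumptions for illustration. Real systems may instead filter during or after the index lookup, and choosing among those strategies is part of the challenge mentioned above.

```python
import math


def filtered_search(records, query, category, k=1):
    """Keep records matching the structured filter, then rank by distance."""
    matching = [r for r in records if r["category"] == category]
    matching.sort(key=lambda r: math.dist(r["vector"], query))
    return [r["id"] for r in matching[:k]]


# Hypothetical records mixing a structured field with a vector field.
PRODUCTS = [
    {"id": 1, "category": "shoes", "vector": [0.1, 0.9]},
    {"id": 2, "category": "shoes", "vector": [0.7, 0.3]},
    {"id": 3, "category": "hats", "vector": [0.8, 0.2]},
]

if __name__ == "__main__":
    print(filtered_search(PRODUCTS, [0.8, 0.2], "shoes"))
```

Product 3 is the closest vector overall, but the structured filter excludes it, so the search returns the nearest match within the requested category.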

3. Characteristics of databases

· Enable data sharing

Data sharing means that all users can access the data in the database at the same time, and that users can use the database in various ways through the interfaces it provides.

· Reduce data redundancy

Compared with a file system, a database realizes data sharing, so users do not need to create their own application files individually. This eliminates a large amount of duplicate data, reduces data redundancy, and helps maintain data consistency.

· Data independence

Data independence includes logical independence (the logical structure of the database and the application programs that use it are independent of each other) and physical independence (changes in the physical structure of the data do not affect the logical structure of the data).
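One common way to obtain logical independence is to let applications read through a view. The sketch below uses Python's sqlite3 module, and the table, view, and function names are hypothetical. The application query stays the same while the underlying table is split in two.

```python
import sqlite3


def make_customer_db():
    """Original layout: a single table exposed through a view."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
        INSERT INTO customer VALUES (1, 'Alice', 'Paris');
        CREATE VIEW customer_view AS SELECT id, name, city FROM customer;
        """
    )
    return conn


def restructure(conn):
    """Split the table in two and redefine the view over the new layout."""
    conn.executescript(
        """
        CREATE TABLE customer_core (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE customer_address (customer_id INTEGER, city TEXT);
        INSERT INTO customer_core SELECT id, name FROM customer;
        INSERT INTO customer_address SELECT id, city FROM customer;
        DROP VIEW customer_view;
        DROP TABLE customer;
        CREATE VIEW customer_view AS
            SELECT c.id, c.name, a.city
            FROM customer_core c
            JOIN customer_address a ON a.customer_id = c.id;
        """
    )


def app_query(conn):
    """The application only knows the view, never the base tables."""
    return conn.execute(
        "SELECT name, city FROM customer_view ORDER BY id"
    ).fetchall()


if __name__ == "__main__":
    conn = make_customer_db()
    before = app_query(conn)
    restructure(conn)
    print(before == app_query(conn))
```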

· Centralized control of data

In the file management method, the data is in a scattered state, and there is no relationship between the files of different users or the same user in different processes. The database can be used to centrally control and manage data, and the data model can be used to represent the organization of various data and the connections between data.

· Data consistency and maintainability to ensure data security and reliability

· Security controls: to prevent data loss, incorrect updates and unauthorized use;

· Integrity control: ensure the correctness, validity and compatibility of data;

· Concurrency control: allows multiple accesses to data within the same time period and prevents abnormal interactions between users.

· Fault recovery: The database management system provides a set of methods to detect and repair faults in time to prevent data from being damaged.
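Integrity control and the rollback of a failed update can be sketched with sqlite3. The account table, the CHECK constraint, and the transfer function are assumptions chosen for illustration. The constraint rejects a negative balance, and the surrounding transaction undoes the half-finished transfer.

```python
import sqlite3


def make_bank_db():
    """Two accounts; the CHECK constraint forbids negative balances."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE account ("
        "id INTEGER PRIMARY KEY, "
        "balance INTEGER NOT NULL CHECK (balance >= 0))"
    )
    conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 100), (2, 50)])
    conn.commit()
    return conn


def transfer(conn, source, target, amount):
    """Move money atomically; return False if a constraint is violated."""
    try:
        # The connection context manager commits on success and
        # rolls back if an exception is raised inside the block.
        with conn:
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE id = ?",
                (amount, target),
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE id = ?",
                (amount, source),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    """Return {account id: balance} for all accounts."""
    return dict(conn.execute("SELECT id, balance FROM account ORDER BY id").fetchall())


if __name__ == "__main__":
    conn = make_bank_db()
    print(transfer(conn, 1, 2, 30), balances(conn))
    print(transfer(conn, 1, 2, 500), balances(conn))
```

In the second transfer the credit to the target account succeeds first, then the debit violates the constraint, and the rollback removes the credit as well, so the data is left unchanged.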


Origin blog.csdn.net/weixin_46880696/article/details/134264178