AI large models accelerate the evolution of database storage models: breaking new ground with hybrid row-column storage

[Figure: data storage model]

Motto: As heaven moves with unceasing vigor, the gentleman strives ceaselessly for self-improvement; as the earth is broad and receptive, the gentleman carries the world with generous virtue.

Overview

In the history of database development, the relational database is a milestone, and relational databases still occupy an important position today. In a relational database, each table is a relation, and each row of data is a record of that relation. When stored, the fields of a row are placed in contiguous positions, and the rows themselves are stored one after another as consecutive records.

Types of workloads handled

With the rise of the Internet, the growth of storage capacity, and the leap in computing power, more and more smart devices have entered our lives, generating an endless stream of information.
The scale of this information has exceeded what a single system can handle, so workloads are continually being classified. Database processing models are commonly divided into:

  • The online transaction processing (OLTP) model, which focuses on transactional consistency over relational data;
  • The online analytical processing (OLAP) model, which focuses on analysis and statistics, typically extracting data along a few dimensions from very large data sets.

However, such a division falls far short of the needs created by the information explosion. It is not a black-and-white classification with clear boundaries: a large amount of data and many workloads exhibit characteristics of both OLTP and OLAP. This is where hybrid database storage models come in.

Data storage model principle

What it is

The data inserted through SQL is ultimately stored by the database on disk. At that point we need to consider write efficiency, read efficiency, how to incur fewer IO operations, and what format to organize the data in. How can such goals be achieved?

The file system we rely on reads and writes physical storage devices in units of blocks; common block sizes are 2 KB, 4 KB, and so on. To improve performance, the database likewise organizes data in units of blocks and reads and writes its data files one block at a time.
Each data block is further divided into a block header area, the starting offsets of the data area, and the data area itself, where records are stored contiguously following the rows of the logical table.
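As a rough illustration of block-at-a-time access, here is a minimal Python sketch; the 4 KiB block size and the helper names are assumptions made for this example, not taken from any particular database engine.

```python
import os

BLOCK_SIZE = 4096  # an assumed 4 KiB block, a common choice

def read_block(path: str, block_no: int) -> bytes:
    # Read exactly one block; the seek position is always a multiple of the block size.
    with open(path, "rb") as f:
        f.seek(block_no * BLOCK_SIZE)
        return f.read(BLOCK_SIZE)

def write_block(path: str, block_no: int, payload: bytes) -> None:
    # Write one block, padding the payload so the file stays block-aligned.
    assert len(payload) <= BLOCK_SIZE
    mode = "r+b" if os.path.exists(path) else "w+b"
    with open(path, mode) as f:
        f.seek(block_no * BLOCK_SIZE)
        f.write(payload.ljust(BLOCK_SIZE, b"\x00"))
```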

Row data can be organized in two different ways: fixed length or variable length. With fixed length, every data type has a fixed size, so the length of a row is also determined. Variable-length types, such as character and text types, vary in size, so their length must be recorded when they are stored.
The biggest difference between the two shows up on update: a fixed-length value can simply be overwritten in place, whereas a variable-length value may need extra work when it is updated, for example when the new value no longer fits in the old space.
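To make the difference concrete, a small sketch might encode rows like this; the two-column schemas below are hypothetical and chosen only for illustration.

```python
import struct

# Fixed-length row: id (4-byte int) + amount (8-byte float); every row has the same size,
# so an update can simply overwrite the old bytes in place.
FIXED_FMT = "<id"

def encode_fixed(row_id: int, amount: float) -> bytes:
    return struct.pack(FIXED_FMT, row_id, amount)

# Variable-length row: the text field's length must be written alongside the value,
# and an update to a longer value may no longer fit in the old slot.
def encode_varlen(row_id: int, name: str) -> bytes:
    payload = name.encode("utf-8")
    return struct.pack("<iH", row_id, len(payload)) + payload

def decode_varlen(buf: bytes):
    row_id, length = struct.unpack_from("<iH", buf, 0)
    return row_id, buf[6:6 + length].decode("utf-8")
```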

Why is the storage model so important

Because the data stored in the database is persisted to disk and must be read back from disk when we query it, a large data volume still produces a great deal of disk IO, even though both the database and the operating system maintain caches; moreover, database access is largely random, so the caches cannot guarantee a hit every time.

Compared with memory, disk speed is extremely low, yet memory is usually limited, which is why the storage model matters so much. Converting random writes into sequential writes, locating data precisely with less IO, and reducing traversal all cut down the number of IO operations and improve performance.

Data storage model types

NSM model (N-ary Storage Model)

As the name implies, the data is laid out row by row, like an array: the physical structure of the data matches its logical structure. This is what we usually call the row storage model, and it is the storage model adopted by most relational databases.

Physical storage structure

The disk is made up of data blocks one after another, so contiguous data is likewise divided across contiguous data blocks.
Each data block consists of a block header, which records the starting offset of the data within the block; the per-row offset entries, stored contiguously right after the block header; and the actual row data, which is stored contiguously from the end of the block back toward the head. Growing from both ends makes it easy to manage the free space in between.
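A minimal slotted-page sketch in Python might look like the following; the header layout, slot format, and 4 KiB page size are assumptions made for the example, not the on-disk layout of any specific database.

```python
import struct

BLOCK_SIZE = 4096
HEADER_FMT = "<HH"                        # (slot_count, free_space_end) -- assumed header
HEADER_SIZE = struct.calcsize(HEADER_FMT)
SLOT_FMT = "<HH"                          # each slot stores (row_offset, row_length)
SLOT_SIZE = struct.calcsize(SLOT_FMT)

def new_page() -> bytearray:
    page = bytearray(BLOCK_SIZE)
    struct.pack_into(HEADER_FMT, page, 0, 0, BLOCK_SIZE)
    return page

def insert_row(page: bytearray, row: bytes) -> int:
    # Slot entries grow forward from the header; row data grows backward from the end,
    # so the free space stays in one contiguous piece in the middle.
    slot_count, free_end = struct.unpack_from(HEADER_FMT, page, 0)
    start = free_end - len(row)
    assert start >= HEADER_SIZE + (slot_count + 1) * SLOT_SIZE, "page is full"
    page[start:free_end] = row
    struct.pack_into(SLOT_FMT, page, HEADER_SIZE + slot_count * SLOT_SIZE, start, len(row))
    struct.pack_into(HEADER_FMT, page, 0, slot_count + 1, start)
    return slot_count                      # the new row's slot number

def read_row(page: bytearray, slot_no: int) -> bytes:
    start, length = struct.unpack_from(SLOT_FMT, page, HEADER_SIZE + slot_no * SLOT_SIZE)
    return bytes(page[start:start + length])
```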

Table data corresponds to the physical storage structure as shown in the following figure:

[Figure: physical storage structure]

Application Scenario

  • Its advantage is that queries for related data within a row are very fast. For example, given an ID card number, a series of fields such as name and address can be read out in one go.
    On top of that, it is well suited to complex nested joins, because all the columns of a row are stored together.

Unsuitable scenarios

  • For workloads that only need some of the column attributes, the IO cost increases, because the entire row must still be read. Whether the schema is designed to 3NF or uses a large wide table, a drop in cache efficiency cannot be avoided.

DSM model (Decomposition Storage Model)

The decomposition storage model stores each field of a row in a separate data unit. When only certain columns are needed, only part of the data is loaded from disk; when the whole row is needed, all of its data is loaded and the row is reassembled.

Each column can be stored separately, or columns can be grouped flexibly according to business needs. For example, if three columns are often queried together, those three columns can be stored together and the remaining columns stored on their own.
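An in-memory toy version of this decomposition could look like the sketch below; the class and method names are invented for illustration, and real column stores keep each column in compressed, block-organized files on disk.

```python
class ToyColumnStore:
    """Each column is kept in its own list, standing in for a separate column file."""

    def __init__(self, columns):
        self.columns = list(columns)
        self.data = {c: [] for c in self.columns}

    def insert(self, row: dict) -> None:
        # Decompose the logical row: every field goes into its own column container.
        for col in self.columns:
            self.data[col].append(row.get(col))

    def scan(self, column: str) -> list:
        # A single-column query touches only this column's data.
        return self.data[column]

    def assemble_row(self, rowid: int) -> dict:
        # Rebuild a logical row by picking the same position from every column.
        return {col: self.data[col][rowid] for col in self.columns}
```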

Physical storage structure

Common formats are:

  • PAX
  • RCFile (Record Columnar File)
  • Apache ORC
  • Parquet (An Open Columnar Storage for Hadoop)

Among them, some lean toward purely analytical columnar storage, which can handle large volumes of time-series and streaming data, while others lean toward a hybrid of rows and columns; every one of these formats has mature products using it.

Application Scenario

Their scenarios are mostly analytical, for example the Hadoop ecosystem, which uses ORC and Parquet.
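As a small hands-on example (assuming the pyarrow package is available; the file name and schema are made up), Parquet lets a reader project only the columns it needs:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Build a small table and persist it in the columnar Parquet format.
table = pa.table({
    "user_id": [1, 2, 3],
    "name": ["alice", "bob", "carol"],
    "amount": [10.5, 20.0, 7.25],
})
pq.write_table(table, "sales.parquet")

# Column projection: only the 'amount' column is read back from the file.
amounts = pq.read_table("sales.parquet", columns=["amount"])
print(amounts.to_pydict())  # {'amount': [10.5, 20.0, 7.25]}
```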

Hybrid Data Storage Model

To combine the advantages of the NSM and DSM models described above so that they complement each other, some databases have adopted hybrid storage models.

Common hybrid model practices

  • Data redundancy

When data is stored, it is simply kept in two formats at the same time: one copy stored by row and another stored column by column. This avoids the complexity that conversion brings and trades space for performance; the optimizer can then pick whichever format offers the better access path.

  • Data conversion

Because row storage inevitably brings IO amplification, columnar storage is used for the actual storage, and logical rows are assembled from the columns when they are needed. The difficulty of this model lies in locating every field of a logical row accurately; most implementations rely on the kind of grouping used in PAX, as sketched below.
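A rough sketch of such grouping follows; the group size, function names, and dict-based layout are assumptions made for illustration, not the actual PAX format.

```python
GROUP_SIZE = 1024  # rows per group -- an assumed tuning knob

def rows_to_groups(rows: list, columns: list) -> list:
    # Within each group the values of one column sit together, so a column scan reads
    # contiguous data while a logical row stays entirely inside a single group.
    groups = []
    for start in range(0, len(rows), GROUP_SIZE):
        chunk = rows[start:start + GROUP_SIZE]
        groups.append({col: [r[col] for r in chunk] for col in columns})
    return groups

def fetch_row(groups: list, columns: list, rowid: int) -> dict:
    # Locate the group, then pick the same in-group position from every column
    # to reassemble the logical row.
    group_no, offset = divmod(rowid, GROUP_SIZE)
    return {col: groups[group_no][col][offset] for col in columns}
```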

Challenges

Big data processing is no longer limited to relational data; there is more and more non-relational data, such as text and JSON. How to convert such data into columnar form that can be searched quickly will be a challenge facing the hybrid storage model.

The vector data that has emerged recently, whose width corresponds to the dimensions of large models, also requires the underlying database storage to store this kind of data separately as its own type.
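As a toy illustration of that last point (the dimension, file layout, and helper names are assumptions for this sketch), fixed-dimension embedding vectors can be stored as fixed-length binary records, which makes direct offset addressing possible:

```python
import struct

DIM = 768                      # an assumed embedding dimension
VEC_FMT = f"<{DIM}f"           # one vector = DIM contiguous 32-bit floats
VEC_SIZE = struct.calcsize(VEC_FMT)

def append_vector(path: str, vec: list) -> None:
    assert len(vec) == DIM
    with open(path, "ab") as f:
        f.write(struct.pack(VEC_FMT, *vec))

def read_vector(path: str, idx: int) -> list:
    # Fixed-length records let the i-th vector be located by simple arithmetic.
    with open(path, "rb") as f:
        f.seek(idx * VEC_SIZE)
        return list(struct.unpack(VEC_FMT, f.read(VEC_SIZE)))
```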

The end

Thank you very much for your support. Don't forget to leave your valuable comments as you browse. If you think the article deserves encouragement, please like and bookmark it; I will work even harder!

Author email: [email protected]
If there are any mistakes or omissions, please point them out so we can learn from each other.

Note: Do not reprint without consent!

Origin: blog.csdn.net/senllang/article/details/132394482