Primary key index and secondary index

First of all, what is an index?
I don't know where the term originally came from. I looked it up on Baidu, which describes it as a database term.
If a database table is very large and you want to look up a value, you have to scan the entire table, which is very slow.
With an index the lookup is much faster. Why?
To answer that, we first have to look at what an index looks like.

Baidu says that an index is like a table of contents, but an index is clearly not the same thing as a table of contents.
An index lists the important words, concepts, and materials of a book or document. Any document can be indexed.
A table of contents (TOC) is generally expected for documents of more than 10 pages and is part of the book or document. It follows the order of the chapters in the book and gives their page numbers. Generally, a table of contents should not exceed 2 pages.
An index covers the important concepts and can run to many pages. It can include article references, explanations of concepts, and so on. It is generally sorted alphabetically, not in the order in which things appear in the book. It usually sits at the end and lists specific words with their page numbers, pointing to the important concepts of the text. Important points are generally indexed.
Going by the English meaning of the word, an index is really a set of pointers to important concepts. Suppose you have a very thick book and want to find a single keyword. Without an index at the back, you have to go through the whole book. With one, you look the keyword up in the index and then follow the page number to the corresponding content in the book.

But the index itself may take up 5 pages. In other words, the index is extra material on top of the book's content.

Back to the database: the index is a structure separate from the table itself.
That is, you build the index in addition to the table.
For insert and update operations, you not only write to the table but also rewrite and re-sort the index. With two indexes on a table, every write touches three structures. Those indexes come in two kinds: primary key indexes and non-primary-key indexes (secondary indexes).
If the table and its indexes sit on the same disk, these writes take even more time.

Why do you need an index

Data is stored on disk in blocks (assuming it is on a disk at all). These blocks are accessed much like a linked list: each block holds its own data plus a pointer to the next block (roughly speaking). Because the blocks are chained by pointers, they do not need to be stored contiguously.
A table can be physically sorted on only one field. If we search on an unsorted field and the table occupies N blocks, a linear search has to read N/2 blocks on average and N blocks at most, because every value has to be compared in turn. That average applies to a key field (unique values), where the search can stop at the first match. For a non-key field, whose values may repeat, all N blocks have to be read.
If the field is sorted, binary search can be used, and only about log2 N blocks need to be read.
If the sorted field is a non-key field, the search can also stop as soon as a higher value is found, since no more duplicates can follow, so the performance gain is substantial.
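To make the N versus log2 N difference concrete, here is a minimal sketch that counts how many blocks each search reads. The block layout (2000 blocks of 5 integer keys) and both function names are my own assumptions for illustration, not how any particular database stores data.

```python
def linear_search_blocks(blocks, target):
    """Scan blocks in order; return (found, blocks_read)."""
    reads = 0
    for block in blocks:
        reads += 1
        if target in block:
            return True, reads
    return False, reads


def binary_search_blocks(blocks, target):
    """Binary search over blocks whose values are globally sorted."""
    lo, hi = 0, len(blocks) - 1
    reads = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        block = blocks[mid]
        reads += 1
        if target < block[0]:
            hi = mid - 1      # target can only be in an earlier block
        elif target > block[-1]:
            lo = mid + 1      # target can only be in a later block
        else:
            return target in block, reads
    return False, reads


# Hypothetical data: 2000 blocks of 5 sorted integer keys each (keys 0..9999).
blocks = [list(range(i, i + 5)) for i in range(0, 10_000, 5)]

print(linear_search_blocks(blocks, 9_999))   # (True, 2000)
print(binary_search_blocks(blocks, 9_999))   # (True, 11)
```

The last key is the worst case for the linear scan, which reads every block, while the binary search never needs more than 11 reads on 2000 blocks.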

What is a database index

As the above suggests, database indexing is a way of keeping the records sorted on more than one field. When you create an index on a field (or several fields) of a table, you create a separate structure that holds the field values together with pointers to the positions of the corresponding records in the table. This structure is kept sorted, which allows binary search on it.
The disadvantage is that the index takes up extra storage space.
And whenever the table changes, the index must be updated too.
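The "field value plus pointer, kept sorted" idea can be sketched in a few lines. This is only an in-memory illustration: the table contents, `name_index`, and `find_by_name` are made-up names, and the "pointer" is simply a list position standing in for a disk address.

```python
import bisect

# Hypothetical table: rows are stored in ID order, so the Name column is unsorted.
table = [
    (1, "Wang"),
    (2, "Chen"),
    (3, "Zhao"),
    (4, "Li"),
    (5, "Sun"),
]

# The index: (field value, pointer to row position), sorted by field value.
name_index = sorted((name, pos) for pos, (_id, name) in enumerate(table))
index_keys = [name for name, _pos in name_index]


def find_by_name(name):
    """Binary-search the index, then follow the pointer into the table."""
    i = bisect.bisect_left(index_keys, name)
    if i < len(index_keys) and index_keys[i] == name:
        _name, pos = name_index[i]
        return table[pos]
    return None


print(find_by_name("Li"))    # (4, 'Li')
print(find_by_name("Zhou"))  # None
```

The table itself is never re-sorted; only the small index is, which is why the same table can carry several indexes on different fields.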

Let's look at a concrete example. Suppose the table has the following fields:

field name | data type | disk space
-------- | ----- | -----
ID (primary key) | INT | 4 bytes
Name | char(50) | 50 bytes
Sex | char(50) | 50 bytes
Address | char(100) | 100 bytes

Assume that our table has 10,000 fixed-size records.
Each record is 4 + 50 + 50 + 100 = 204 bytes long, and the default disk block size is B = 1024 bytes, so one disk block holds 1024/204 = 5 whole records (rounded down). Storing the table therefore requires 10000/5 = 2000 disk blocks.

A linear search on ID requires 2000/2 = 1000 block accesses on average. But ID is the primary key and the table is sorted on it, so binary search can be used instead, which needs only log2(2000) ≈ 10.97, that is 11, block accesses.
A search on the non-key field Name, which is not sorted, has to go through all 2000 blocks.

Build an index

So let's build an index on Name.

field name | data type | disk space
-------- | ----- | -----
Name | char(50) | 50 bytes
pointer | special | 4 bytes

So each index entry is 54 bytes. With the default disk block of 1024 bytes, each block holds 1024/54 = 18 whole entries (rounded down). The index occupies 10000/18 = 555.6, that is 556, disk blocks.
Searching for a Name in this index is a binary search, which takes log2(556) ≈ 9.1, that is about 10, block accesses. Fetching the Address for that Name then takes one more block access to read the actual record, for about 11 block accesses in total.

Compared with scanning all 2000 blocks (10,000 records) of the original table, this is much faster.
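The arithmetic of this example can be re-checked with a short script. It is only a sketch under the assumptions stated above (fixed-size records, 1024-byte blocks, no block overhead); the constant and function names are mine. Every partial block and every fractional search step is rounded up, so a figure can come out one higher than a quick mental estimate.

```python
import math

BLOCK_SIZE = 1024                 # bytes per disk block
NUM_RECORDS = 10_000
RECORD_SIZE = 4 + 50 + 50 + 100   # ID + Name + Sex + Address = 204 bytes
INDEX_ENTRY_SIZE = 50 + 4         # Name + pointer = 54 bytes


def blocks_needed(num_records, record_size, block_size=BLOCK_SIZE):
    """Whole records per block (rounded down), then blocks (rounded up)."""
    per_block = block_size // record_size
    return math.ceil(num_records / per_block)


table_blocks = blocks_needed(NUM_RECORDS, RECORD_SIZE)
index_blocks = blocks_needed(NUM_RECORDS, INDEX_ENTRY_SIZE)

scan_cost = table_blocks                                # unsorted, non-key field
pk_cost = math.ceil(math.log2(table_blocks))            # binary search on ID
indexed_cost = math.ceil(math.log2(index_blocks)) + 1   # index search + 1 record fetch

print(table_blocks, index_blocks)         # 2000 556
print(scan_cost, pk_cost, indexed_cost)   # 2000 11 11
```

Changing `NUM_RECORDS` shows how slowly the indexed cost grows: the scan cost grows linearly with the table, the indexed cost only logarithmically.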

When to build an index

Seen this way, an index looks like a pure win. But don't forget that the index occupies about 556 disk blocks. Compared with the 2000 blocks used by the table itself, that is more than 25% extra space.
So an index takes up a good deal of space. And although it greatly speeds up searches on the indexed field, the drawback shows up on inserts and deletes, because the index has to be updated as well.
So you need to think carefully about which field to index, because binary search only pays off when the values are highly unique. This is where cardinality comes in: how many distinct values the column holds. At this point I had to go back and look things up, because I no longer remembered what cardinality was or what degree was. According to Wikipedia, cardinality refers to the uniqueness of the data values contained in a particular column (attribute) of a database table.
Think of a set: the set {a, b, c} contains 3 elements, and the cardinality of a set is its number of elements, so this set has cardinality 3.
Extending that a little, the cardinality of a table is the number of rows in it.
The degree is the number of attribute columns in the table.
For a column's cardinality, what matters is how unique the data is. If the column holds many repeated values, such as 1, 2, 1, 2, 1, 2, its cardinality is 2. If every value in the column is unique, its cardinality equals the number of rows.

Back to indexes: if the cardinality of your column is 2, the index only splits the table in two, so a lookup still has to go through about half the rows. If the cardinality is 1000, the table is split into 1000 groups and a lookup touches only a small fraction of it.
This means that if the column cardinality is very low, there is no real need to build an index; indexes belong on columns with high cardinality. With high cardinality the index takes up space, but it helps searches a great deal. With low cardinality the index is not worth having, even if it takes up less space. As a rule of thumb, if the cardinality is below 30% of the number of rows (fewer than 3 distinct values in 10 rows), there is no need for an index.
So in our database tables, columns whose values repeat a lot, such as year, month, or value, are generally not indexed. Indexes go on columns such as IDs, serial numbers and the like.
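The cardinality check described here is easy to sketch. The function names, the term "selectivity" for the ratio of distinct values to rows, and the two sample columns are my own choices for illustration; the 30% threshold is the rule of thumb quoted above, not a setting of any particular database.

```python
def cardinality(values):
    """Number of distinct values in a column."""
    return len(set(values))


def selectivity(values):
    """Distinct values divided by row count; 1.0 means every value is unique."""
    return cardinality(values) / len(values) if values else 0.0


def worth_indexing(values, threshold=0.30):
    """Rule of thumb from the text: skip the index below ~30% distinct values."""
    return selectivity(values) >= threshold


# Hypothetical columns of a 10-row table.
ids = list(range(1, 11))                  # all unique
flags = [1, 2, 1, 2, 1, 2, 1, 2, 1, 2]    # only two distinct values

print(cardinality(ids), worth_indexing(ids))      # 10 True
print(cardinality(flags), worth_indexing(flags))  # 2 False
```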

Other issues to consider

As mentioned before, updating a table that has an index really means two write operations, and the index has to be re-sorted. If the table and the index are on the same hard disk, this takes more time, so putting them on different disks can save time.
Another problem is that continuous inserts fragment the index storage, and at that point the index needs to be reorganized. I honestly don't know how that is done.

An index costs space and extra write work. Whether to build one depends on whether the table is mostly written to or mostly read, so weigh that before creating it.
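The extra write can be made visible with a toy model. In this sketch, which assumes an in-memory list in place of disk storage and uses made-up names, every insert appends the row to the table and then places a new entry at the right spot in the sorted index.

```python
import bisect

table = []        # rows, appended in arrival order
name_index = []   # sorted list of (name, row position)


def insert_row(row_id, name):
    """One logical insert = one table write + one index write."""
    table.append((row_id, name))              # write 1: the table itself
    pos = len(table) - 1
    bisect.insort(name_index, (name, pos))    # write 2: keep the index sorted
    return pos


for row_id, name in [(1, "Wang"), (2, "Chen"), (3, "Li")]:
    insert_row(row_id, name)

print(table)       # [(1, 'Wang'), (2, 'Chen'), (3, 'Li')]
print(name_index)  # [('Chen', 1), ('Li', 2), ('Wang', 0)]
```

Each additional index on the table would add one more `insort` per insert, which is the cost to weigh against faster reads.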

Back to topic: primary key index and secondary index

A primary key index is an index built on the primary key column(s), so it contains no duplicate values.
Secondary indexes are non-primary-key indexes and can contain duplicate values.

An index entry has two parts: the search key and the data reference.
The search key is the primary key value or another column value, kept in sorted order. The data reference is a pointer holding the address of the disk block where the record with that key value is stored.

Primary key index

A primary key index does not have to cover just a single column. It can be made up of several columns, as long as their combined value is unique.
The data is kept sorted on the index key, and a table can have only one primary key index.

Secondary index

A non-primary-key index can contain duplicate values.
It has little effect on how the rows are organized in the data blocks, and a table can have many of them.
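The difference between the two kinds of index can be sketched as follows. This is an in-memory illustration with hypothetical rows and function names: the rows are stored in primary key order, so a primary key lookup returns at most one row, while the secondary index on Name may point to several rows for the same value.

```python
import bisect

# Hypothetical rows, stored in primary-key (ID) order.
rows = [
    (1, "Wang", "F"),
    (2, "Chen", "M"),
    (3, "Wang", "M"),
    (4, "Li", "F"),
]

# Primary key index: the keys are unique and follow the row order.
pk_keys = [row[0] for row in rows]

# Secondary index on Name: sorted (name, position) pairs, duplicates allowed.
name_index = sorted((row[1], pos) for pos, row in enumerate(rows))


def find_by_id(row_id):
    """Unique key: at most one matching row."""
    i = bisect.bisect_left(pk_keys, row_id)
    if i < len(pk_keys) and pk_keys[i] == row_id:
        return rows[i]
    return None


def find_by_name(name):
    """Non-unique key: return every row whose Name matches."""
    lo = bisect.bisect_left(name_index, (name, -1))
    hi = bisect.bisect_right(name_index, (name, len(rows)))
    return [rows[pos] for _name, pos in name_index[lo:hi]]


print(find_by_id(3))         # (3, 'Wang', 'M')
print(find_by_name("Wang"))  # [(1, 'Wang', 'F'), (3, 'Wang', 'M')]
```

Note that building `name_index` leaves `rows` untouched: the secondary index does not change how the rows themselves are laid out.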

Well, I'm going to read a book. . .

Origin blog.csdn.net/weixin_45689053/article/details/113345904