Database design specifications (the three normal forms in MySQL and denormalization)

Foreword

I am Hu Yi, a blogger who loves sharing technology. As the saying goes, nothing can be accomplished without rules, and database design is no exception: a good design can improve performance. This article uses examples to explain the three normal forms in MySQL, how denormalization came about, and why it is needed.
Without further ado, here is the content:

1. Why do we need database design

Possible problems with poor database design:

  • Data redundancy and duplicated information, which waste storage space
  • Update, insertion, and deletion anomalies
  • Inability to represent information correctly
  • Loss of valid information
  • Poor program performance

Good database design has the following advantages:

  • Saves storage space
  • Guarantees data integrity
  • Makes database applications easier to develop

2. The concept of keys and related attributes

The definitions of the normal forms rely on primary keys and candidate keys. A key in a database consists of one or more attributes. The keys and attributes commonly used with data tables are defined as follows:

  • Superkey: A set of attributes that uniquely identifies a tuple (a tuple is a row of data).

  • Candidate key: A superkey that contains no redundant attributes.

  • Primary key: The candidate key that the user chooses to identify rows.

  • Foreign key: If a set of attributes in table R1 is not the primary key of R1 but is the primary key of another table R2, then that set of attributes is a foreign key of R1.

  • Prime attributes (also called primary attributes): Attributes contained in any candidate key.

  • Non-prime attributes (also called non-primary attributes): Attributes that are not contained in any candidate key.

A candidate key is also commonly called a "code" (码), and the primary key a "primary code" (主码). Because a key may consist of several attributes, the terms primary attribute and non-primary attribute let us classify each single attribute.
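To make these terms concrete, here is a minimal MySQL sketch. The department and employee tables and all of their columns are hypothetical names chosen for illustration; they do not come from the examples later in this article. It assumes that an identity card number is unique per employee, which makes id_card a second candidate key.

```sql
-- Hypothetical tables, for illustration only.
CREATE TABLE department (
    dept_id   INT         NOT NULL,
    dept_name VARCHAR(30) NOT NULL,
    PRIMARY KEY (dept_id)
);

CREATE TABLE employee (
    emp_id   INT         NOT NULL,  -- candidate key chosen as the primary key
    id_card  CHAR(18)    NOT NULL,  -- second candidate key (assumed unique per person)
    emp_name VARCHAR(30) NOT NULL,  -- non-prime attribute
    dept_id  INT         NOT NULL,  -- non-prime attribute and foreign key
    PRIMARY KEY (emp_id),
    UNIQUE KEY uk_employee_id_card (id_card),
    CONSTRAINT fk_employee_department
        FOREIGN KEY (dept_id) REFERENCES department (dept_id)
);

-- Superkeys of employee include (emp_id), (id_card), and (emp_id, emp_name).
-- Only (emp_id) and (id_card) are candidate keys, because (emp_id, emp_name)
-- still identifies a row after emp_name is removed.
-- Prime attributes: emp_id, id_card. Non-prime attributes: emp_name, dept_id.
```

In this sketch dept_id is not the primary key of employee but is the primary key of department, which is exactly the foreign key definition given above.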

3. The three normal forms in MySQL

3.1 Introduction to normal forms

In relational databases, the basic principles and rules of table design are called normal forms. A normal form can be understood as the level of a design standard that the structure of a table must meet. To design a well-structured relational database, certain normal forms must be satisfied.

Normal form is abbreviated NF. The normal forms were worked out after the British computer scientist E. F. Codd (Edgar Frank Codd) proposed the relational database model in the 1970s. They are the foundation of relational database theory, and they are the rules and guidelines we should follow when designing a database structure.

In 1981, Codd won the Turing Award for his work on relational databases. He is also known as the "father of the relational database".

3.2 First Normal Form (1NF)

First normal form ensures that the value of every field in a table is atomic: each field value is the smallest unit of data and cannot be split any further.

Example 1:

Suppose a company wants to store the names and contact details of employees. It creates a table like this:

[Figure: employee table in which some emp_mobile values hold two phone numbers]

The table does not conform to 1NF, because the rule says "every attribute of a table must hold an atomic (single) value", and the emp_mobile values of the employees lisi and zhaoliu violate this rule (each holds two phone numbers). To make the table conform to 1NF, the data should look like this:

[Figure: the employee table after conversion to 1NF]
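Since the table images are not reproduced here, the following is only a sketch of what a 1NF layout could look like in MySQL. Only the emp_mobile column and the employee name lisi come from the example; the table name, the emp_id column, and the phone numbers are made up for illustration.

```sql
-- Hypothetical 1NF layout: one phone number per row.
CREATE TABLE employee_1nf (
    emp_id     INT         NOT NULL,  -- assumed identifier column
    emp_name   VARCHAR(30) NOT NULL,
    emp_mobile VARCHAR(20) NOT NULL,  -- holds exactly one phone number
    PRIMARY KEY (emp_id, emp_mobile)
);

-- An employee with two numbers gets two rows (placeholder numbers).
INSERT INTO employee_1nf (emp_id, emp_name, emp_mobile) VALUES
    (2, 'lisi', '555-0101'),
    (2, 'lisi', '555-0102');

-- Each number can now be matched directly, without splitting strings.
SELECT emp_name
FROM employee_1nf
WHERE emp_mobile = '555-0102';
```

Note that this layout only satisfies 1NF: emp_name is repeated for every phone number, which is the kind of redundancy that second normal form deals with.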

Example 2:

The atomicity of an attribute is subjective: it depends on the business scenario and the application. If the application needs to process the province, city, and street address separately, the address must be split into separate fields. Otherwise, there is no need.

Table 1: Address field not separated

Name        Age   Address
Zhang San   20    No. 78, Sanyuanli, Guangzhou City, Guangdong Province
Li Si       24    Longhua New District, Shenzhen City, Guangdong Province

Table 2: Address field separated

Name        Age   Province    City        Address
Zhang San   20    Guangdong   Guangzhou   No. 78, Sanyuanli
Li Si       24    Guangdong   Shenzhen    Longhua New District

Therefore, normalized design sometimes involves some subjective judgment and does not have to be followed rigidly. Database design should be tied to the specific business: the table structure that best suits your business is the better one.

3.3 Second Normal Form (2NF)

Second normal form requires that, on top of satisfying first normal form, every record in the table be uniquely identifiable. In addition, every non-primary-key field must depend on the whole primary key, not on only part of it. If the values of all attributes of the primary key are known, the value of any attribute of that tuple (row) can be retrieved. (The primary key in this requirement can be generalized to any candidate key.)

In short, every attribute can be determined by the primary key, and only by the primary key as a whole.

Example 1:

In the grade relation (student number, course number, grade), the pair (student number, course number) determines the grade, but the student number alone does not, and neither does the course number alone. So "(student number, course number) → grade" is a full dependency. The relation conforms to second normal form because student number + course number form the primary key and every other column depends on the whole primary key.
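The grade relation can be sketched in MySQL with a composite primary key. The table name, column names, and column types below are assumptions, since the article only names the three attributes.

```sql
-- Hypothetical DDL for the grade relation (student number, course number, grade).
CREATE TABLE student_grade (
    student_id INT          NOT NULL,
    course_id  INT          NOT NULL,
    grade      DECIMAL(5,2),
    PRIMARY KEY (student_id, course_id)  -- the whole pair identifies a row
);

-- A grade is looked up with both parts of the key;
-- neither student_id nor course_id alone pins down a single grade.
SELECT grade
FROM student_grade
WHERE student_id = 1 AND course_id = 101;
```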

Example 2:

Consider a game table player_game, which contains attributes such as player number, name, age, game number, game time, game venue, and score. Here the candidate key and the primary key are both (player number, game number), and the candidate key (or primary key) determines the following relation:

(player number, game number) → (name, age, game time, game venue, score)

However, this table does not satisfy second normal form, because the following dependencies also hold between its fields:

# Name and age "partially depend" on the key: they depend on the player number alone.
(player number) → (name, age)
# Game time and game venue "partially depend" on the key: they depend on the game number alone. (Given the game number, we can get the game time and venue without the player number.)
(game number) → (game time, game venue)

Here the non-prime attributes do not fully depend on the candidate key. What problems does this cause?

  1. Data redundancy: If a player plays m games, the player's name and age are stored m times, m - 1 of them redundant. Likewise, if n players take part in a game, the game's time and venue are repeated n - 1 times.
  2. Insertion anomaly: If we want to add a new game but have not yet decided which players will take part, we cannot insert it, because the game information is stored together with the player information and the player number, as part of the primary key, cannot be left empty.
  3. Deletion anomaly: If we delete a player's rows and the game information is not stored in a separate table, the game information is deleted along with them.
  4. Update anomaly: If we change the time of a game, every row for that game must be updated; otherwise the same game ends up with different times.

To avoid these problems, we can split the player_game table into the following three tables.

Table name                                   Attributes (fields)
Player table player                          Player number, name, age, and so on
Game table game                              Game number, game time, game venue, and so on
Player-game relationship table player_game   Player number, game number, score, and so on

1NF tells us that field values must be atomic, while 2NF tells us that a table is an independent object that expresses only one thing.

Second normal form (2NF) requires that every attribute of an entity depend fully on the primary key. If some attributes depend on only part of the key, those attributes and that part of the key should be split off into a new entity, which has a one-to-many relationship with the original entity.
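The three-table split of player_game described above might look like this in MySQL. The table names player, game, and player_game come from the article; the column names, types, foreign keys, and sample values are assumptions made for this sketch.

```sql
CREATE TABLE player (
    player_id INT         NOT NULL,
    name      VARCHAR(30) NOT NULL,
    age       TINYINT UNSIGNED,
    PRIMARY KEY (player_id)
);

CREATE TABLE game (
    game_id   INT         NOT NULL,
    game_time DATETIME    NOT NULL,
    venue     VARCHAR(50) NOT NULL,
    PRIMARY KEY (game_id)
);

CREATE TABLE player_game (
    player_id INT NOT NULL,
    game_id   INT NOT NULL,
    score     INT,                       -- depends on the whole (player_id, game_id) key
    PRIMARY KEY (player_id, game_id),
    CONSTRAINT fk_pg_player FOREIGN KEY (player_id) REFERENCES player (player_id),
    CONSTRAINT fk_pg_game   FOREIGN KEY (game_id)   REFERENCES game (game_id)
);

-- No insertion anomaly: a game can be stored before any player is assigned.
INSERT INTO game (game_id, game_time, venue)
VALUES (1, '2024-05-01 19:30:00', 'Stadium A');

-- No update anomaly: rescheduling the game changes exactly one row.
UPDATE game
SET game_time = '2024-05-02 19:30:00'
WHERE game_id = 1;
```

With this layout, deleting a player's rows from player_game also leaves the game table untouched, so the deletion anomaly disappears as well.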

3.4 Third Normal Form (3NF)

Third normal form builds on second normal form and ensures that every non-primary-key field in the table depends directly on the primary key; that is, no non-primary-key field may depend on another non-primary-key field. (In other words, there must be no case where non-prime attribute A depends on non-prime attribute B, which in turn depends on primary key C, giving the transitive chain "C → B → A".) In plain terms, there must be no dependencies among the non-primary-key attributes; they must be independent of one another.

The primary key here can be generalized to any candidate key.

Example 1:

Department information table: Each department has a department number, department name, and department profile.

Employee information table: Each employee has an employee number, name, and department number. Once the department number is included, department-related information such as the department name and profile must not be added to the employee information table, because it would depend on the primary key only transitively, through the department number.

The above example satisfies third normal form, because there are no partial or transitive dependencies. Every non-primary-key field in the employee information table depends directly on the primary key, and the non-primary-key fields are independent of one another.

Example 2:

Field name      Field type      Primary key   Description
id              INT             yes           Product id (primary key)
category_id     INT             no            Product category id
category_name   VARCHAR(30)     no            Product category name
goods_name      VARCHAR(30)     no            Product name
price           DECIMAL(10,2)   no            Product price

The product category name depends on the product category id, which is a non-primary-key field, so the table does not conform to third normal form (it still satisfies second normal form, because the primary key is a single field).

Revised design:

Table 1: Design of the product category table goods_category

Field name      Field type      Primary key   Description
id              INT             yes           Product category id (primary key)
category_name   VARCHAR(30)     no            Product category name

Table 2: Design of the product table goods

Field name      Field type      Primary key   Description
id              INT             yes           Product id (primary key)
category_id     INT             no            Product category id
goods_name      VARCHAR(30)     no            Product name
price           DECIMAL(10,2)   no            Product price

The product table goods is associated with the product category table goods_category through the category id field (category_id).
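The revised design can be written as MySQL DDL roughly as follows. The table and column names follow the two tables above; the NOT NULL and foreign key constraints are additions assumed for this sketch, and category_id is declared INT so that it matches goods_category.id.

```sql
CREATE TABLE goods_category (
    id            INT         NOT NULL,
    category_name VARCHAR(30) NOT NULL,
    PRIMARY KEY (id)
);

CREATE TABLE goods (
    id          INT           NOT NULL,
    category_id INT           NOT NULL,  -- same type as goods_category.id
    goods_name  VARCHAR(30)   NOT NULL,
    price       DECIMAL(10,2) NOT NULL,
    PRIMARY KEY (id),
    CONSTRAINT fk_goods_category
        FOREIGN KEY (category_id) REFERENCES goods_category (id)
);

-- The category name is stored once and fetched with a join when needed.
SELECT g.id, g.goods_name, g.price, c.category_name
FROM goods g
JOIN goods_category c ON g.category_id = c.id;
```

Renaming a category now means updating a single row in goods_category, instead of every product row in that category.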

In plain terms, 2NF and 3NF are often summarized in one sentence: "every non-key attribute depends on the key, the whole key, and nothing but the key".

The key here is the primary key, but it can also be any candidate key; either way, it is a key.

3.5 Summary

When designing data tables, there are three normal forms to follow.

1) First normal form (1NF) ensures that every column is atomic. Each column holds an indivisible data item, the smallest unit of data, and cannot hold a non-atomic item such as a collection, an array, or a record.

2) Second normal form (2NF) ensures that every column depends on the whole primary key. With a composite primary key in particular, a non-primary-key column must not depend on only part of the primary key.

3) Third normal form (3NF) ensures that every column depends on the primary key directly, not indirectly.

Advantages of normal forms

Normalizing data helps eliminate data redundancy in the database, and third normal form (3NF) is generally considered the best balance of performance, scalability, and data integrity.

Disadvantages of normal forms

Normalization may reduce query efficiency. The higher the normal form, the more numerous and fine-grained the tables become and the lower the data redundancy, so a query may need to join several tables. This is expensive and may also make some indexing strategies ineffective.

Normal forms only propose design standards, and real table designs do not always meet them. During development we sometimes break the normalization rules for the sake of performance and read efficiency: adding a small amount of redundant or duplicated data improves read performance and reduces the number of tables a query has to join, trading space for time. In practice, therefore, theory must be combined with experience and applied flexibly.

Normal forms are not good or bad in themselves; they simply suit different scenarios. There is no perfect design, only a suitable one. When designing tables, we need to mix normalization and denormalization according to the requirements.

4. Denormalization

4.1 Overview

Sometimes tables cannot simply be designed by the book, because some data may look redundant yet be very important to the business. In that case we follow the principle of business first: meet the business needs first, and then minimize redundancy.

If the database holds a large amount of data, the system's UV and PV traffic is high, and the tables are designed strictly according to the three normal forms, reading data produces a large number of join queries, which hurts read performance to some extent. Denormalization is one way to optimize query efficiency: you can improve read performance by adding redundant fields to a table.

Normalization vs. performance

  1. To meet certain business goals, database performance can matter more than normalization
  2. While normalizing the data, consider the overall performance of the database
  3. Adding extra fields to a table can drastically reduce the time needed to look up information in it
  4. Adding computed columns to a table can make queries easier

Example 1:
Employee information is stored in the employees table and department information in the departments table. The two are associated through the department_id field of the employees table. To query the name of each employee's department:

SELECT e.employee_id, d.department_name
FROM employees e
JOIN departments d ON e.department_id = d.department_id;

If this query runs often, the join wastes a lot of time. You can add a redundant department_name field to the employees table so that no join is needed.
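Adding the redundant field could look roughly like this. The two CREATE TABLE statements are minimal hypothetical stand-ins for the real employees and departments tables (only the columns used in the query above are included), and the VARCHAR(30) type is an assumption.

```sql
-- Minimal stand-ins for the two tables (hypothetical column types).
CREATE TABLE departments (
    department_id   INT         NOT NULL PRIMARY KEY,
    department_name VARCHAR(30) NOT NULL
);

CREATE TABLE employees (
    employee_id   INT NOT NULL PRIMARY KEY,
    department_id INT
);

-- 1. Add the redundant column.
ALTER TABLE employees
    ADD COLUMN department_name VARCHAR(30);

-- 2. Backfill it once from departments (MySQL multi-table UPDATE syntax).
UPDATE employees e
JOIN departments d ON e.department_id = d.department_id
SET e.department_name = d.department_name;

-- 3. The frequent query no longer needs a join.
SELECT employee_id, department_name
FROM employees;
```

The price is that every later change to departments.department_name must also be applied to employees, otherwise the two copies drift apart.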

Example 2:

We have two tables: the product transaction (flow) table (atguigu.trans) and the product information table (atguigu.goodsinfo). The transaction table holds 4 million records, and the product information table holds 2,000 product records.

Product transaction table:

[Figure: the product transaction table]

Product information table:

[Figure: the product information table]

Both tables satisfy third normal form. In our project, however, the transaction table is queried very frequently, and almost every query has to join the product information table just to get the product name.

To reduce the joins between the tables, we can add the product name field directly to the transaction table, so that the product name can be read from the transaction table itself. A redundant field is added, but the join is avoided and queries become more efficient.

The new product transaction table is as follows:

[Figure: the product transaction table with the redundant product name field]

4.2 New problems introduced by denormalization

Denormalization trades space for time to improve query efficiency, but it also brings some new problems:

  • It takes more storage space
  • If a field in one table is modified, the redundant copy in the other table must be modified in sync; otherwise the data becomes inconsistent
  • If stored procedures are used to carry out the extra update and delete operations, frequent updates consume a lot of system resources
  • With a small amount of data, denormalization shows no performance advantage and may only make the database design more complicated
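One way to keep a redundant field in sync is a trigger; this is a swapped-in technique for illustration, not something the article prescribes. The sketch below uses the goodsinfo and trans table names from Example 2, but every column name (itemnumber, goodsname, transid, quantity) is hypothetical, because the real table structures are only shown in the images.

```sql
-- Hypothetical, simplified versions of the two tables.
CREATE TABLE goodsinfo (
    itemnumber INT         NOT NULL PRIMARY KEY,
    goodsname  VARCHAR(30) NOT NULL
);

CREATE TABLE trans (
    transid    INT NOT NULL PRIMARY KEY,
    itemnumber INT NOT NULL,
    goodsname  VARCHAR(30),            -- redundant copy of goodsinfo.goodsname
    quantity   INT NOT NULL
);

-- After a product is updated, copy its (possibly new) name
-- into every transaction row for that product.
CREATE TRIGGER trg_goodsinfo_sync_name
AFTER UPDATE ON goodsinfo
FOR EACH ROW
    UPDATE trans
    SET goodsname = NEW.goodsname
    WHERE itemnumber = NEW.itemnumber;

INSERT INTO goodsinfo (itemnumber, goodsname) VALUES (1, 'book');
INSERT INTO trans (transid, itemnumber, goodsname, quantity) VALUES (100, 1, 'book', 2);

-- This single update also rewrites the matching rows in trans.
UPDATE goodsinfo SET goodsname = 'notebook' WHERE itemnumber = 1;
```

The sketch also shows the cost listed above: with millions of transaction rows, one rename in goodsinfo can turn into a very large update on trans.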

4.3 When to use denormalization

We denormalize when the redundant information is valuable or can greatly improve query efficiency.

1. Suggestions for adding redundant fields

A redundant field should be considered only when both of the following conditions are met.

1) The redundant field does not need to be modified frequently.

2) The redundant field is indispensable for queries.

2. The need for historical snapshots and historical data

In real life we often need some redundant information, such as the consignee information of an order: name, phone number, and address. The consignee information of each order is a historical snapshot that must be preserved, yet users can change their own information at any time, so keeping this redundant copy is essential.
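A snapshot of this kind could be modeled as follows. The user_address and orders tables and all of their columns are hypothetical names for this sketch; the point is only that the order row stores its own copy of the consignee information instead of referencing the user's current data.

```sql
-- Hypothetical table holding each user's current, editable delivery details.
CREATE TABLE user_address (
    user_id        INT          NOT NULL PRIMARY KEY,
    consignee_name VARCHAR(30)  NOT NULL,
    phone          VARCHAR(20)  NOT NULL,
    address        VARCHAR(200) NOT NULL
);

-- The order keeps a redundant copy of the delivery details as of order time.
CREATE TABLE orders (
    order_id          BIGINT       NOT NULL PRIMARY KEY,
    user_id           INT          NOT NULL,
    consignee_name    VARCHAR(30)  NOT NULL,  -- snapshot
    consignee_phone   VARCHAR(20)  NOT NULL,  -- snapshot
    consignee_address VARCHAR(200) NOT NULL,  -- snapshot
    created_at        DATETIME     NOT NULL
);

-- Placing an order copies the user's current details into the order row.
INSERT INTO orders (order_id, user_id, consignee_name, consignee_phone,
                    consignee_address, created_at)
SELECT 1001, user_id, consignee_name, phone, address, NOW()
FROM user_address
WHERE user_id = 1;
```

If the user later edits user_address, existing rows in orders keep the values that were valid when the order was placed.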

Denormalization is also common in data warehouse design, because data warehouses usually store historical data: the real-time requirements for inserts, deletes, and updates are weak, while the demand for analyzing historical data is strong. Allowing some data redundancy in this case makes analysis more convenient.

A brief summary of how data warehouses and databases differ in use:

  1. A database is designed to capture data, while a data warehouse is designed to analyze data;
  2. A database has strong real-time requirements for inserting, deleting, and updating data and stores online user data, while a data warehouse generally stores historical data;
  3. A database design avoids redundancy as far as possible, although some redundancy is allowed to improve query efficiency, while a data warehouse design leans toward denormalization.

5. BCNF (Boyce-Codd Normal Form)

Building on 3NF, people proposed an improved form, Boyce-Codd Normal Form (BCNF). BCNF is considered to introduce no new design rules; it only tightens the requirements of third normal form, which further reduces redundancy in the database.

For this reason it is called the revised or extended third normal form, and BCNF is not counted as a fourth normal form.

If a relation is in third normal form and has only one candidate key, or every one of its candidate keys is a single attribute, then the relation is automatically in BCNF. Generally speaking, a database design should conform to 3NF or BCNF.
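A textbook-style case may help show where 3NF and BCNF differ. Everything in the sketch below is hypothetical and not from the article: it assumes that each teacher teaches exactly one course and that a student has one teacher per course. The candidate keys are then (student_id, course) and (student_id, teacher), so every attribute is prime and the table is in 3NF, but teacher → course has a determinant that is not a superkey, which violates BCNF.

```sql
-- In 3NF but not in BCNF (hypothetical example).
CREATE TABLE student_course_teacher (
    student_id INT         NOT NULL,
    course     VARCHAR(30) NOT NULL,
    teacher    VARCHAR(30) NOT NULL,
    PRIMARY KEY (student_id, course),                -- first candidate key
    UNIQUE KEY uk_student_teacher (student_id, teacher)  -- second candidate key
    -- teacher -> course holds, so the teacher's course is repeated
    -- for every student of that teacher.
);

-- A BCNF decomposition: each determinant is now a key of its own table.
CREATE TABLE teacher_course (
    teacher VARCHAR(30) NOT NULL,
    course  VARCHAR(30) NOT NULL,
    PRIMARY KEY (teacher)               -- enforces teacher -> course
);

CREATE TABLE student_teacher (
    student_id INT         NOT NULL,
    teacher    VARCHAR(30) NOT NULL,
    PRIMARY KEY (student_id, teacher),
    CONSTRAINT fk_st_teacher
        FOREIGN KEY (teacher) REFERENCES teacher_course (teacher)
);
```

The decomposition removes the repeated course values, but the rule "one teacher per student and course" can no longer be enforced by a key on a single table, which is one reason designs often stop at 3NF.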

That is all for this article. If you found it helpful, I hope you will give it a like and your support!
I'm Hu Yi, and I look forward to seeing you next time~

Origin blog.csdn.net/qq_56880706/article/details/128489225