MySQL database optimization summary

For a data-centric application, the quality of the database directly affects the performance of the program, so database performance is crucial. Generally speaking, to keep a database efficient, four things must be done well: database design, SQL statement optimization, database parameter configuration, and appropriate hardware resources and operating system. The order also reflects how much each of the four affects performance. Let's go through them one by one:
       
       1. Database design

  Denormalize moderately; the key word is moderately

  We all know the three normal forms. A model in third normal form is the most efficient way to store data and the easiest to extend. When we design a database for an application, it should conform to third normal form as far as possible; for OLTP systems in particular, third normal form is a rule that must be followed. Of course, the biggest problem with third normal form is that queries usually have to join many tables, which makes them slow. So sometimes, for performance reasons, we deliberately violate third normal form and add moderate redundancy to speed up queries. Note the word moderate: this kind of denormalization must come with a good justification. Here's a bad example:

   Here, to speed up retrieval of student activity records, the unit name was made redundant in the student activity record table. The unit table has 500 records, while student activity records accumulate about 2 million rows a year. Without the redundant unit name field, the record table contains only three int fields and one timestamp field, occupying just 16 bytes per row, a very slim table. After adding a redundant varchar(32) field it is three times the original size, and every retrieval does correspondingly more I/O. Worse, the record counts differ enormously, 500 vs. 2,000,000, so renaming one unit means updating about 4,000 redundant records on average. This redundancy is simply counterproductive.
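  A minimal sketch of the bad design described above (all table and column names here are hypothetical, not from the original system):

    -- Bad: unit_name duplicated from the 500-row unit table into the ~2,000,000-row record table
    CREATE TABLE student_activity_record (
        student_id  INT,
        activity_id INT,
        unit_id     INT,
        record_time TIMESTAMP,
        unit_name   VARCHAR(32)  -- redundant; renaming one unit now updates ~4,000 rows
    );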

  The following redundancy, however, is good:

   Here the [student exam total score] is redundant: the total can be derived by summing [score detail]. In [student exam total score] there is only one record per student per exam, while in [score detail] a student has one record per question on the test paper, a ratio of roughly 1:100. Moreover, a graded score rarely changes, so updates are infrequent. This redundancy is therefore a good one.
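  A minimal sketch of the good redundancy, again with hypothetical names:

    -- Good: one redundant total per student per exam (roughly 1:100 versus the per-question detail),
    -- and graded scores rarely change, so the redundancy is cheap to maintain
    CREATE TABLE student_exam_total (
        student_id  INT,
        exam_id     INT,
        total_score INT,  -- derivable by summing the score-detail table
        PRIMARY KEY (student_id, exam_id)
    );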

    Index properly

  When it comes to improving database performance, indexes are the best and cheapest option. No extra memory, no program changes, no SQL rewrites: execute one correct 'create index' and a query may run hundreds of times faster. Really tempting. But there is no free lunch: the gain in query speed is paid for by slower inserts, updates, and deletes, since these writes now incur a lot of extra I/O. Because an index is stored separately from the table, it is not unusual for a table's indexes to take more space than its data. That means we do a lot of extra work on every write just to make reads faster. So when we build an index, we must make sure it pays for itself. In general, follow these rules (examples after the list):

  The indexed field should be one that frequently appears in query conditions;

  For a composite index on multiple fields, queries must include the first field in their conditions; if only the second field is used as a query condition, the index will not be used (the leftmost-prefix rule);

  The indexed fields must be selective enough;

  MySQL supports prefix indexes on long fields.
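  A hedged illustration of these rules (table and column names are made up for the example):

    -- Composite index: serves conditions on (class_id) or (class_id, student_id),
    -- but a condition on student_id alone will not use it (leftmost-prefix rule)
    CREATE INDEX idx_class_student ON student_activity_record (class_id, student_id);

    -- Prefix index: index only the first 8 characters of a long varchar field
    CREATE INDEX idx_unit_name ON unit_info (unit_name(8));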

  Divide the table horizontally

   If a table has too many records, say tens of millions, and must be queried frequently, then it needs to be broken into pieces. Split it into 100 tables and each table holds only 100,000 records. Of course, this requires a logical basis for partitioning the data. A good partitioning key keeps the program simple and gets the full benefit of horizontal splitting. For example, if the system only offers querying by month, split the table into 12 by month and each query touches just one table. If instead it were split by region, every query would still have to combine all the tables, however small each one is; in that case it would be better not to split at all. A good partitioning basis is the most important thing.

  Here's a good example        

   Every question a student has done is recorded in this table, both correct and incorrect answers. Each question maps to one or more knowledge points, and we need to analyze the wrong answers to find which knowledge points a student is weak on. This table can easily reach tens of millions of rows and urgently needs splitting. On what basis? Looking at the requirements, whether for teachers or students, the focus always narrows to individual students: students care about themselves, and teachers care about the students in their own classes. And each subject has its own knowledge points. So it is natural to split this table by the combination of the subject and knowledge point fields. That leaves each table with about 20,000 rows, and retrieval is very fast.
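  A rough sketch of what such a split might look like (the template table and the naming scheme are assumptions for illustration):

    -- One table per (subject, knowledge point), e.g. subject 01, knowledge point 023;
    -- the application derives the table name from the two ids, so each query hits one small table
    CREATE TABLE wrong_question_01_023 LIKE wrong_question_template;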

     Divide the table vertically

  Some tables do not have many records, maybe 20,000 or 30,000, but their fields are very long and the table takes up a lot of space, so scanning it requires a lot of I/O and seriously hurts performance. In that case the large fields should be split out into a separate table that has a one-to-one relationship with the original.

   The two tables [test question content] and [answer information] were originally just fields of [test question information]. Both fields are very long: with 30,000 records, the table already took up 1 GB and the question list pages were very slow. Analysis showed that the system usually displays paginated question details filtered by conditions such as [book], [unit], type, category, and difficulty level, and every retrieval joined these tables, so every query depressingly scanned the 1 GB table. We can simply split the content and answer out into another table and read that large table only when the full detail is displayed, which yields the two tables [test question content] and [answer information].
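  A minimal sketch of the vertical split, with hypothetical names:

    -- The long text columns move to a 1:1 companion table keyed by the same primary key;
    -- list pages join only the slim question_info table, and this table is read
    -- only when one question's full detail is displayed
    CREATE TABLE question_content (
        question_id INT PRIMARY KEY,  -- same key as question_info (1:1)
        content     TEXT,
        answer      TEXT
    );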


       Choose the appropriate field type, especially the primary key

  The general principle for choosing field types is: small rather than large. If a type with fewer bytes will do, don't use a bigger one. For the primary key in particular, we strongly recommend an auto-increment type instead of a GUID. Why? To save space? What is space? Space is efficiency! Locating a record via a 4-byte key versus a 32-byte key makes an obvious difference, and when several tables are joined the effect is even more pronounced. datetime versus timestamp is also worth mentioning: datetime occupies 8 bytes while timestamp occupies only 4, half as much, and the range a timestamp can represent, 1970-2037, is more than enough for most applications, such as recording exam times and login times.
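  For illustration, the two primary key choices side by side (hypothetical tables):

    CREATE TABLE t_autoinc (
        id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY  -- 4 bytes per key value
    );
    CREATE TABLE t_guid (
        id CHAR(32) PRIMARY KEY  -- 32 bytes; in InnoDB it is also carried in every secondary index
    );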

  Store large files, such as documents and pictures, in the file system, not the database

  Needless to say, this is an iron law!!! The database stores only the path.

  Represent foreign keys explicitly; it makes indexing easy

  We all know that in PowerDesigner, when a relationship is created between two entities, the foreign key is automatically indexed when the physical model is generated. So don't avoid creating relationships for fear of cluttering the diagram with lines: just use a shortcut.

  Master the timing of writing to the table

  With the same schema, how the database is used also has a big effect on performance. For the same table, writing it earlier or later can make a big difference to subsequent operations. Take the example from the moderate redundancy section above:

   Our original purpose in recording candidates' total scores was to improve retrieval efficiency, that is, to write this table when scores are entered. But the requirements include: list every student's score in an exam, showing the student's name even when no score has been entered, with the total displayed as empty. That query needs [student information] left outer join [student exam total score], and everyone knows an outer join is slower than an inner join. To avoid this, we write the table when the exam is scheduled instead, inserting a row for every student with a null score, so a plain join achieves the same effect. There is a further advantage: suppose a whole class takes an exam and all scores are entered, and then a new student joins the class. Querying by class at that point would list the new student as having no score entered, which is obviously wrong. Writing the table when the exam is scheduled records exactly who actually took it. This table has quietly become more than mere redundancy.
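  A hedged sketch of the write-at-scheduling approach (table and variable names are assumptions):

    -- When the exam is scheduled, insert one row per assigned student with a NULL total,
    -- so the score list can use a plain join instead of a left outer join
    INSERT INTO student_exam_total (student_id, exam_id, total_score)
    SELECT student_id, @exam_id, NULL
      FROM exam_assignment
     WHERE exam_id = @exam_id;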

    Prefer batch operations; avoid frequent reads and writes

  The system includes points: students and teachers earn points by performing operations in the system. The rules are very complicated, with limits for each type of operation and a daily per-person cap for each point type. For example, one login earns 1 point, but no matter how many times you log in, you accumulate at most one login point per day. That one is still simple; some rules are truly perverse. One teacher point depends on the teacher grading homework: the teacher grades the homework and finds the student got it wrong; once the student corrects it, the teacher grades it again; if the student is now right, the teacher earns the point; if the student is still wrong, the cycle repeats, and only when the student gets it right and the teacher finishes grading does the teacher earn the point. Handling this in application code would mean every feature carrying a pile of extra code for these thankless points, burying the programmers' real work and putting heavy pressure on the database. After discussion with the stakeholders, it was agreed that points need not accumulate in real time, so we adopted batch processing by a background script: in the dead of night, the machine plays by itself.

  That perverse points rule is read out in the batch job like this:

      
-- SELECT feeding the nightly batch job; 301003 and the 3003xxx/3004xxx literals
-- are the system's internal status codes, and the @... variables are set by the script
select person_id, @semester_id, 301003, 0, @one_marks, assign_date, @one_marks
  from hom_assignmentinfo ha, hom_assign_class hac
 where ha.assignment_id = hac.assignment_id
   and ha.assign_date between @time_begin and @time_end
   and ha.assignment_id not in
       (
            select haa.assignment_id
              from hom_assignment_appraise haa, hom_check_assignment hca
             where haa.appraise_id = hca.appraise_id and haa.if_submit = 1
               and (
                      (hca.recheck_state = 3004001 and hca.check_result   in (3003002, 3003003))
                      or
                      (hca.recheck_state = 3004002 and hca.recheck_result in (3003002, 3003003))
                   )
       )
   and ha.assignment_id not in
       (
            select assignment_id from hom_assignment_appraise
             where if_submit = 0 and result_type = 0
       )
   and ha.assignment_id in
       (
            select haa.assignment_id
              from hom_assignment_appraise haa, hom_check_assignment hca
             where haa.appraise_id = hca.appraise_id and haa.if_submit = 1
               and hca.check_result in (3003002, 3003003)
       );

 

  This is just one intermediate step. If the program handled it in real time, even if the programmers didn't go on strike, the database would.

  Choose the right engine

   MySQL provides many engines; the three we use most are MyISAM, InnoDB, and Memory. The official manual says MyISAM reads faster than InnoDB, about 3 times faster. But books can't be trusted blindly: O'Reilly's "High Performance MySQL" compares MyISAM and InnoDB, and in its tests MyISAM did not outperform InnoDB. As for Memory, haha, it is genuinely handy. It is a good choice for temporary tables in batch jobs (if you have enough memory); in one of my batch jobs the speedup was almost 10x.
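  A small sketch of the Memory-engine trick for batch jobs (hypothetical names; assumes the intermediate result fits in memory):

    -- Scratch table held entirely in memory for the duration of the batch run
    CREATE TEMPORARY TABLE tmp_scores ENGINE=MEMORY
        AS SELECT student_id, SUM(score) AS total
             FROM score_detail
            GROUP BY student_id;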

 

    2. SQL statement optimization

 SQL statement optimization tools

  ·Slow log

  If you find the system slow but can't tell where, it's time for this tool. You only need to set a couple of MySQL parameters and MySQL will record slow SQL statements by itself. The configuration is simple; in the parameter file:

  slow_query_log = 1

  slow_query_log_file = d:/slow.txt

  long_query_time = 2

  Statements whose execution time exceeds 2 seconds will then show up in d:/slow.txt; use this file to locate the problems.

  ·mysqldumpslow.pl

Slow log files can get large and are hard to read by eye. Here we can use the analysis tool that ships with MySQL. It formats the slow log file and merges statements that differ only in their parameters: for example, select * from a where id=1 and select * from a where id=2 are collapsed into select * from a where id=N, which is much more comfortable to read. The tool can also do simple sorting, letting us focus on the worst offenders.
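  For example, a typical invocation (sorting by query time and keeping the ten worst statements) looks like:

    mysqldumpslow -s t -t 10 d:/slow.txt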

  ·Explain

  Now we know which statement is slow, but why is it slow? Let's see how MySQL executes it: use explain to view the MySQL execution plan. The following usage comes from the manual.

  EXPLAIN syntax (get SELECT related information)

  EXPLAIN [EXTENDED] SELECT select_options

  The EXPLAIN statement can be used as a synonym for DESCRIBE, or to obtain information about how MySQL executes a SELECT statement:

  · EXPLAIN tbl_name is a synonym for DESCRIBE tbl_name or SHOW COLUMNS FROM tbl_name.

  · If you put the keyword EXPLAIN before a SELECT statement, MySQL will explain how it handles the SELECT, providing information about how the tables are joined and the order in which they are joined.

  This section explains the second usage of EXPLAIN.

  With EXPLAIN, you can see where you should add indexes to tables so that SELECT statements find rows faster by using indexes.

  If a problem arises from using the wrong index, run ANALYZE TABLE to update the table statistics (such as key cardinality), which influence the optimizer's choices.

  You can also see whether the optimizer joins the tables in the best order. To force the optimizer to join the tables in the order they are named in the SELECT statement, begin the statement with SELECT STRAIGHT_JOIN rather than just SELECT.

   EXPLAIN returns one row of information for each table used in the SELECT statement. Tables are listed in the order MySQL will read them while processing the query. MySQL resolves all joins using a single-sweep multi-join method: it reads a row from the first table, finds a matching row in the second table, then the third, and so on. When all tables are processed, it outputs the selected columns and backtracks through the table list until it finds a table with more matching rows, reads the next row from that table, and continues with the next table.

  When the EXTENDED keyword is used, EXPLAIN produces extra information that can be viewed with SHOW WARNINGS. This information shows how the optimizer qualifies table and column names in the SELECT statement, what the SELECT looks like after the optimizer's rewrite rules are applied, and possibly other notes about the optimization process.
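  As a quick illustration (the query and tables are hypothetical):

    EXPLAIN SELECT s.name, t.total_score
      FROM student s JOIN student_exam_total t ON t.student_id = s.student_id
     WHERE t.exam_id = 1024;
    -- Watch the type column (ALL means a full scan), key (the index chosen),
    -- and rows (the optimizer's estimate of rows examined).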

    If nothing else helps, try a covering index

  If a statement really cannot be optimized any other way, there is one more method to try: index covering.

  If a statement can get all its data from an index, it doesn't have to go through the index to the table at all, saving a lot of I/O. Take a table like this:

  If I want to total each student's score on each question, we not only need indexes on each table's primary and foreign keys, but also an index covering the actual score field of [score detail], so the whole query can be answered from index data alone.
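  A hedged sketch of such a covering index (names are made up):

    -- With all selected columns in the index, MySQL answers the query
    -- from the index alone and never touches the table rows
    CREATE INDEX idx_score_cover ON score_detail (student_id, question_id, score);

    SELECT student_id, question_id, score
      FROM score_detail
     WHERE student_id = 42;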

 

 3. Database parameter configuration

      The most important parameter is memory. We mainly use the InnoDB engine, so the following two parameters are set very large.

  # Additional memory pool that is used by InnoDB to store metadata
  # information. If InnoDB requires more memory for this purpose it will
  # start to allocate it from the OS. As this is fast enough on most
  # recent operating systems, you normally do not need to change this
  # value. SHOW INNODB STATUS will display the current amount used.
  innodb_additional_mem_pool_size = 64M

  # InnoDB, unlike MyISAM, uses a buffer pool to cache both indexes and
  # row data. The bigger you set this the less disk I/O is needed to
  # access data in tables. On a dedicated database server you may set this
  # parameter up to 80% of the machine physical memory size. Do not set it
  # too large, though, because competition of the physical memory may
  # cause paging in the operating system. Note that on 32bit systems you
  # might be limited to 2-3.5G of user level memory per process, so do not
  # set it too high.
  innodb_buffer_pool_size = 5G

  For MyISAM, key_buffer_size needs adjusting instead.

  Of course, parameter tuning depends on the server's status: use the show status statement to view the current status and decide which parameters to adjust (see the examples after the list below).

  Created_tmp_disk_tables: if high, increase tmp_table_size.

  A high Handler_read_key means indexes are being used correctly; a high Handler_read_rnd means they are not.

  Key_reads/Key_read_requests is the key cache miss rate; it should be less than 0.01, otherwise increase key_buffer_size.

  Opened_tables: if it grows quickly, increase table_cache.

  Select_full_join: the number of joins performed without a usable index; if not 0, check your indexes.

  Select_range_check: if not 0, check the table indexes.

  Sort_merge_passes: the number of merge passes the sort algorithm has made; if large, increase sort_buffer_size.

  Table_locks_waited: the number of table locks that could not be granted immediately; if high, optimize the queries.

  Threads_created: the number of threads created to handle connections; if large, increase thread_cache_size. The thread cache miss rate can be calculated as Threads_created/Connections.
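  To read these counters, for example:

    SHOW GLOBAL STATUS LIKE 'Created_tmp%';
    SHOW GLOBAL STATUS LIKE 'Key_read%';
    SHOW GLOBAL STATUS LIKE 'Threads_created';
    SHOW GLOBAL STATUS LIKE 'Connections';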

  

      4. Reasonable hardware resources and operating system

  If your machine has more than 4 GB of memory, you should without question use a 64-bit operating system and 64-bit MySQL.

  Read-write separation

  If the database is under a lot of pressure and cannot be supported by one machine, you can use mysql replication to synchronize multiple machines to spread the pressure on the database.  

  Master -> Slave1, Slave2, Slave3 (one master replicating to three slaves)

   The master is used for writes, and slave1-slave3 are used for selects, so each database carries much less of the load.

   Implementing this requires special program design: all writes go to the master and all reads go to the slaves, which puts an extra burden on development. Of course, there is already middleware that proxies this, making it transparent to the program which database it reads or writes. There is the official mysql-proxy, but it is still an alpha version. Sina has Amoeba for MySQL, which can also achieve this purpose. The structure is as follows:

  See Amoeba's manual for how to use it.
