SQL optimization on large data volumes: a worked discussion

 Today I came across a thread on ITPUB that asks how to optimize a particular statement, and I want to discuss it here:

     Original Post Address: http://www.itpub.net/viewthread.php?tid=1015964&extra=&page=1

  First, identify the problem

   The statement to optimize:

How can the following statement be optimized?

CREATE TABLE aa_001 (ip VARCHAR2(28), name VARCHAR2(10), password VARCHAR2(30));
SELECT * FROM aa_001 WHERE ip IN (1, 2, 3) ORDER BY name DESC;
-- The table currently holds about ten million rows, and the number of values in the IN list is not fixed.

  That is the statement to optimize and the situation around it.

 

   Many people replied in the thread: some said it cannot be optimized, some said to rewrite IN as EXISTS, some said to build a composite index on (ip, name), and so on.

  Second, pose the question

     So can a case like this be optimized, and if so, how? That is the topic for today.

  Third, analyze the problem

        1, The table holds more than 10 million rows.

        2, The number of values in the IN list is not fixed.
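Because the IN list has no fixed length, the statement usually has to be assembled at run time. One common way to do that is to generate one bind placeholder per value instead of concatenating the values into the SQL text. The sketch below is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module as a stand-in for Oracle or SQL Server, and the helper name select_by_ips and the sample rows are hypothetical, not taken from the original thread.

```python
import sqlite3


def select_by_ips(conn, ips):
    """Run the article's query for a variable-length list of ip values."""
    if not ips:
        return []  # "IN ()" is not valid SQL, so short-circuit on an empty list
    # One "?" per value: the SQL text only depends on how many values there are.
    placeholders = ", ".join("?" for _ in ips)
    sql = (
        "SELECT ip, name, password FROM aa_001 "
        f"WHERE ip IN ({placeholders}) ORDER BY name DESC"
    )
    return conn.execute(sql, list(ips)).fetchall()


# Hypothetical sample data, only to make the sketch runnable.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE aa_001 (ip TEXT, name TEXT, password TEXT)")
conn.executemany(
    "INSERT INTO aa_001 VALUES (?, ?, ?)",
    [("1", "alice", "x"), ("2", "bob", "y"), ("3", "carol", "z"), ("4", "dave", "w")],
)

rows = select_by_ips(conn, ["1", "2", "3"])
for row in rows:
    print(row)
```

Most drivers cap the number of bind variables per statement, so a very long list may need to be split into batches or loaded into a temporary table and joined.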

     3.1 Analyze the data distribution

      The original poster did not say how the values in the ip column are distributed. The possibilities are:

         1, The ip values are unique or nearly unique (few duplicates).

         2, The ip values are unevenly distributed (some values repeat many times, others only a few times).

         3, The ip values are fairly evenly distributed but heavily duplicated: the table consists mostly of repeated values (for example, only a few thousand distinct ip values).

 

     Solutions:

         1, For the first distribution, build an index on the ip column. However many rows the table holds, and however many values appear in the IN list, the query will be fast.

         2, For the second distribution, an index on the ip column does not help reliably. Because the data is unevenly distributed, some queries will be fast and others slow.

         3, For the third distribution, an index on the ip column will certainly be slow.

        Note: order by name desc here sorts the rows after they have been retrieved. The data is not sorted before it is retrieved.
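The note above can be checked in a query plan. The sketch below is an illustration only: it runs against SQLite through Python's sqlite3 module rather than Oracle or SQL Server, so the plan wording differs from those engines, and the index name idx_aa_001_ip is hypothetical. It prints the plan steps for the article's query once an index on ip exists, so that the index access and the separate sort step for ORDER BY name DESC can be seen side by side.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE aa_001 (ip TEXT, name TEXT, password TEXT)")
# Hypothetical index name; the article only says "an index on ip".
conn.execute("CREATE INDEX idx_aa_001_ip ON aa_001 (ip)")

plan_rows = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT * FROM aa_001 WHERE ip IN ('1', '2', '3') ORDER BY name DESC"
).fetchall()

# The last column of each plan row holds the human-readable step.
plan = [row[-1] for row in plan_rows]
for step in plan:
    print(step)
```

The exact text varies by SQLite version, but the plan should show a search that uses idx_aa_001_ip and a separate step that sorts for the ORDER BY, which is the "retrieve first, then sort" behavior the note describes.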

 

     In cases 2 and 3 the query may have to fetch a large amount of data, so the optimizer may choose a table scan rather than an index seek, because at that volume scanning the table is more efficient than seeking through the index. The query is still very slow, and it is especially inefficient under high concurrency.

 

    So how should cases 2 and 3 be handled? Is the answer to change IN to EXISTS? In fact, in SQL Server 2005 and Oracle, when IN is followed by a list of values, the optimizer handles both forms with the same efficiency. In these cases an ordinary index performs poorly. A clustered index on the ip column, however, would be more efficient. Let us run a test in SQL Server 2005.

 

   The table [dbo].[[zping.com]]] holds about 2 million rows. Its columns include Userid, id, RuleId and others. Following the case above, we query it with a similar statement:

select * from [dbo].[[zping.com]]]
where userid in ('402881410ca47925010cb329c7670ffb',
                 '402881ba0d5dc94e010d5dced05a0008',
                 '4028814111a735e90111a77fa8e30384')
order by Ruleid desc

 

   To see how the userid values are distributed, run the following statement:

select userid, count(*) from [dbo].[[zping.com]]] group by userid order by 2

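   Now look at the distribution: there are 379 distinct userid values in total, and the second column (the row count per userid) ranges from 1 to 150,000, so the distribution is heavily skewed. The figure in the original post showed part of this result.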
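The same check can be condensed into a few summary numbers: how many distinct values there are, the smallest and largest row count per value, and the share of the table held by the most frequent value. The sketch below is a minimal illustration using Python's sqlite3 module; the table t and its deliberately skewed synthetic rows are assumptions made for the example, not the article's data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (userid TEXT, ruleid INTEGER)")

# Synthetic, deliberately skewed data: one hot userid and a few rare ones.
rows = [("hot", i) for i in range(1000)]
rows += [("warm", i) for i in range(50)]
rows += [("cold", 0)]
conn.executemany("INSERT INTO t VALUES (?, ?)", rows)

# Summarize the per-userid row counts produced by the GROUP BY.
distinct_values, min_rows, max_rows, total_rows = conn.execute(
    """
    SELECT COUNT(*), MIN(cnt), MAX(cnt), SUM(cnt)
    FROM (SELECT userid, COUNT(*) AS cnt FROM t GROUP BY userid)
    """
).fetchone()

# Share of the table owned by the most frequent value.
top_share = max_rows / total_rows
print(distinct_values, min_rows, max_rows, round(top_share, 3))
```

A large gap between the minimum and maximum count, or a single value holding a large share of the rows, points to the skewed case where an ordinary index on the column gives uneven results.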

 

 

   If we build a non-clustered index on the ip column here (userid in this test table), the query is inefficient. If we force the query to use that index, efficiency is very low: you will find that the IO count is higher than for a table scan. In this situation the only option is to build a clustered index on the column. Let us look at the result.

  With the clustered index in place, the plan uses a clustered index seek.

  Here is the output returned by the query:

(156,603 rows affected)
Table '[zping.com]'. Scan count 8, logical reads 5877, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    The query returns more than 150,000 rows with fewer than 6,000 IOs, which is very efficient. Because those 150,000 rows still have to be sorted, the sort accounts for 51% of the query cost. You could of course build a composite clustered index on (userid, RuleId) to improve performance further, but that raises the maintenance cost of DML, so it is not recommended.
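A quick back-of-the-envelope calculation shows why this read count is low. The sketch below takes the two figures from the output above and derives rows per logical read. The comparison with a non-clustered index is a rough lower bound under one assumption, namely that each row fetched through a non-clustered index costs at least one additional page read; it is not a measurement from the original test.

```python
rows_returned = 156_603   # "(156,603 rows affected)" in the output above
logical_reads = 5_877     # logical reads reported for '[zping.com]'

# Rows delivered per page read when the rows sit together in the clustered index.
rows_per_read = rows_returned / logical_reads

# Assumption: a per-row lookup through a non-clustered index costs at least
# one page read per row, so this is a lower bound, not a measured value.
min_lookup_reads = rows_returned
lookup_penalty = min_lookup_reads / logical_reads

print(f"{rows_per_read:.1f} rows per logical read with the clustered index")
print(f"at least {lookup_penalty:.1f}x more reads with per-row lookups")
```

Under that assumption, per-row lookups would need at least 156,603 reads, which is consistent with the observation that forcing the non-clustered index costs more IO than a table scan.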

 

   From the test above, the optimization approach is:

     Data distribution 1: build an index on the ip column.

     Data distributions 2 and 3: build a clustered index on the ip column.

Reproduced from: https://www.cnblogs.com/flysun0311/archive/2012/08/28/2659721.html


Origin blog.csdn.net/weixin_33816821/article/details/93694191