Points to note when dealing with millions of database records

Quote:

   A recent project needs to manage nodes at the scale of one million, and a database is needed to store intermediate data and final results; the total stored data can reach tens of millions of records. Storage of 100,000 node records has been implemented initially, but access is too slow. After consulting relevant material, we found the following reasons for the very slow node insertion time:
      1. Database connection handling: establishing and closing the connection too many times makes IO access too frequent.
       2. Use batch insertion and batch modification instead of inserting one record at a time; row-by-row inserts make actual access to the database very slow (see the sketch below).
       3. Build appropriate indexes when designing the database, such as primary key, foreign key, and unique indexes, to optimize query efficiency.
       For the full discussion, see: http://www.oschina.net/question/1859_62586?sort=default&p=3#answers
       Some of the discussion in that thread provides useful ideas.
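       As an illustration of the second point above, a minimal sketch of a multi-row insert, which sends many rows in one round trip instead of one insert per row (the table and column names here are made up for the example):
       -- one statement, one round trip, many rows
       insert into node_data (id, val)
       values (1, 'a'), (2, 'b'), (3, 'c');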
       Another reprinted article follows:
A project I worked on recently needs to operate on millions of rows of data. The efficiency of ordinary SQL queries plummets, and when there are many conditions in the where clause, the query speed becomes simply unbearable. In the past, when the amount of data was small, the quality of a query statement had no obvious impact on execution time, so many detailed issues were ignored.
    After testing, a conditional query on a table containing more than 4 million records took as long as 40 seconds. Such a delay would drive any user crazy, so improving the query efficiency of SQL statements is very important. The following are several query optimization methods widely circulated on the Internet:
    First of all, when the amount of data is large, avoid full table scans as much as possible: consider building indexes on the columns involved in where and order by clauses, which can greatly speed up data retrieval. However, in some cases an index will not be used:
1. Try to avoid using the != or <> operator in the where clause, otherwise the engine will give up the use of the index and perform a full table scan.
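     A commonly suggested rewrite (not part of the original list) is to turn the inequality into two range conditions, which can use an index on num:
     select id from t where num <> 10
     -- rewrite as two index-range scans:
     select id from t where num < 10
     union all
     select id from t where num > 10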
2. Try to avoid null-value tests on a field in the where clause, otherwise the engine will give up the use of the index and perform a full table scan, such as:
     select id from t where num is null
     You can set a default value of 0 on num to ensure that the num column contains no null values, and then query like this:
     select id from t where num=0
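     A minimal sketch of setting that default (assuming SQL Server syntax; the constraint name is made up), after first clearing out the existing nulls:
     update t set num = 0 where num is null
     alter table t add constraint DF_t_num default(0) for num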
3. Try to avoid using or in the where clause to connect conditions, otherwise the engine will give up the use of the index and perform a full table scan, such as:
     select id from t where num=10 or num=20
     can be queried like this:
     select id from t where num=10
     union all
     select id from t where num=20
4. The following query will also result in a full table scan:
    select id from t where name like '%abc%'
    To improve efficiency, consider full-text search.
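    A hedged sketch of full-text search in SQL Server (this assumes a full-text index has already been created on the name column; note that full-text matching works on word prefixes, not arbitrary substrings like '%abc%'):
     select id from t where contains(name, '"abc*"')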
5. in and not in should also be used with caution, otherwise they can lead to a full table scan, such as:
     select id from t where num in(1,2,3)
     For continuous values, if you can use between, do not use in:
     select id from t where num between 1 and 3
6. If a parameter is used in the where clause, it will also cause a full table scan. Because SQL resolves local variables only at runtime, the optimizer cannot defer the choice of an access plan to runtime; it must choose it at compile time. However, if the access plan is built at compile time, the value of the variable is unknown and cannot be used as an input for index selection. For example, the following statement will perform a full table scan:
     select id from t where num=@num
     can be changed to force the query to use the index:
     select id from t with(index(index name)) where num=@num
7. Try to avoid expression operations on fields in the where clause, which will cause the engine to give up the use of the index and perform a full table scan. For example:
     select id from t where num/2=100
     should be changed to:
     select id from t where num=100*2
8. You should try to avoid performing function operations on fields in the where clause, which will cause the engine to give up the use of the index and perform a full table scan. Such as:
     select id from t where substring(name,1,3)='abc' -- ids whose name starts with 'abc'
     select id from t where datediff(day,createdate,'2005-11-30')=0 -- ids generated on '2005-11-30'
     should be changed to:
     select id from t where name like 'abc%'
     select id from t where createdate>='2005-11-30' and createdate<'2005-12-1'
9. Do not perform functions, arithmetic operations, or other expression operations on the left side of the "=" in the where clause, otherwise the system may not be able to use the index correctly.
10. When using an indexed field as a condition, if the index is a composite index, the first field of the index must appear in the condition, otherwise the index will not be used; and the field order in the condition should be consistent with the index column order as much as possible.
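     A minimal sketch of the leading-column rule (the index name and columns are made up for illustration):
     create index ix_t_num_name on t(num, name)
     -- can use the index: the leading column num appears in the condition
     select id from t where num=10 and name='abc'
     -- cannot seek on the index: the leading column num is missing
     select id from t where name='abc'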
11. Do not write meaningless queries. For example, if you need to generate an empty table structure:
     select col1,col2 into #t from t where 1=0
     This kind of code will not return any result set, but it consumes system resources; it should be changed to:
     create table #t(…)
12. It is a good choice to replace in with exists in many cases:
     select num from a where num in(select num from b)
     Replace with the following statement:
     select num from a where exists(select 1 from b where num=a.num)

    Points to note when building an index:
1. Not all indexes are effective for queries. SQL query optimization is based on the data in the table: when a large proportion of the values in the index column are duplicated, a SQL query may not use the index. For example, if a table has a field sex whose values are roughly half male and half female, then even an index built on sex will have no effect on query efficiency.
2. Indexes are not a case of "the more the better". An index can certainly improve the efficiency of the corresponding select, but it also reduces the efficiency of insert and update, because the indexes may have to be rebuilt during insert or update, so how to build indexes needs to be considered carefully case by case. The number of indexes on one table had better not exceed 6; beyond that, consider whether indexes on infrequently used columns are really necessary.
3. Avoid updating clustered index data columns as much as possible, because the order of the clustered index columns is the physical storage order of the table's records; once a value in such a column changes, the order of the whole table's records has to be adjusted, which consumes considerable resources. If the application needs to update clustered index columns frequently, consider whether the index should be built as a clustered index at all.

    Other points to pay attention to:
1. Use numeric fields as much as possible. If a field contains only numeric information, try not to design it as a character type; this reduces query and join performance and increases storage overhead, because the engine compares each character of a string one by one during queries and joins, whereas a number needs only one comparison.
2. Do not use select * from t anywhere; replace "*" with the list of specific fields needed, and do not return any fields that are not used.
3. Try to use table variables instead of temporary tables. If a table variable will contain a lot of data, be aware that its indexing is very limited (only a primary key index).
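     A minimal sketch of a table variable with its primary key index (the names are made up for illustration):
     declare @tmp table(id int primary key, num int)
     insert into @tmp select id, num from t where num between 1 and 3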
4. Avoid frequent creation and deletion of temporary tables to reduce the consumption of system table resources.
5. Temporary tables are not unusable, and their proper use can make certain routines more efficient, for example, when a large table or a dataset in a frequently used table needs to be repeatedly referenced. However, for one-time events, it is better to use an export table.
6. When creating a new temporary table, if a large amount of data is inserted at once, use select into instead of create table to avoid generating a large amount of log and to improve speed; if the amount of data is small, then to ease pressure on the system tables, create the table first and then insert.
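     A minimal sketch of the two alternatives (assuming SQL Server temp-table syntax; names are illustrative):
     -- large amount of data: select into in one step
     select id, num into #t from t where num>100
     drop table #t
     -- small amount of data: create first, then insert
     create table #t(id int, num int)
     insert into #t select id, num from t where num>100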
7. If temporary tables are used, explicitly delete all of them at the end of the stored procedure: truncate table first, then drop table, which avoids long-term locking of the system tables.
8. Try to avoid using cursors, because cursors are inefficient; if the data a cursor operates on exceeds 10,000 rows, you should consider rewriting.
9. Before using a cursor-based or temporary-table method, look for a set-based solution to the problem first; set-based methods are usually more efficient (see the sketch after this list).
10. Like temporary tables, cursors are not inherently unusable. Using a FAST_FORWARD cursor on a small dataset is often better than other row-by-row processing methods, especially when several tables must be referenced to obtain the required data. Routines that include "totals" in the result set are usually faster than computing them with a cursor. If development time permits, try both the cursor-based and the set-based approach and see which works better.
11. Set SET NOCOUNT ON at the beginning of all stored procedures and triggers and SET NOCOUNT OFF at the end, so there is no need to send a DONE_IN_PROC message to the client after each statement of the stored procedure or trigger executes.
12. Try to avoid returning a large amount of data to the client. If the amount of data is too large, you should consider whether the corresponding demand is reasonable.
13. Try to avoid large transaction operations, to improve the system's concurrency capability.
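    A minimal sketch pulling points 7, 9, and 11 above together: a stored procedure that uses a set-based update instead of a row-by-row cursor, suppresses row-count messages, and cleans up its temporary table explicitly (the procedure name and the update itself are made up for illustration):

     create procedure usp_adjust_num
     as
     begin
         set nocount on                  -- point 11: no DONE_IN_PROC messages

         select id, num into #t from t where num between 1 and 3

         -- point 9: one set-based update instead of a cursor loop
         update #t set num = num + 1

         -- point 7: truncate first, then drop, to avoid locking system tables
         truncate table #t
         drop table #t

         set nocount off
     end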
