Processing 430 million records per day in SQL Server

First of all, I am just a programmer, not a professional DBA. This article describes the process of solving a problem rather than handing you a correct answer from the start. If anything in it is wrong, I hope the database gurus will correct me so that I can handle this kind of work better.

Background of the project

This is a project for a certain data center, and its difficulty was horrific. It really made me feel that business is a battlefield and I am just one soldier in it: too many tactics, too many contests between the higher-ups, too much insider maneuvering. I will write a separate blog post about the specifics of this project when I have time.

This project requires environmental monitoring. For now, we call the monitored equipment acquisition devices, and the attributes of the acquisition devices monitoring indicators. The project requirements: the system must support no fewer than 100,000 monitoring indicators, each monitoring indicator is updated at most every 20 seconds, and the storage delay must not exceed 120 seconds. A simple calculation then gives the ideal figures: 300,000 records per minute, 18 million per hour, that is, about 432 million per day. In reality, the data volume is about 5% larger than this. (In fact, most of it is junk that could be handled by data compression, but when others just want to mess with you, what can you do?)

Those are the targets the project requires. I imagine many students with plenty of big-data experience will sneer: is that all? Well, I had also read a lot about big-data processing, but I had never dealt with it myself; judging from how well-informed other people sound, with their distribution and their read-write separation, it really does seem easy to solve. However, the problem is not that simple. As I said above, this was a very bad project, a typical case of vicious competition in the industry.

  1. There are no additional servers. This one server hosts not only the database and the centralized collector (the program that parses, alarms on, and stores the data), but also a northbound SNMP interface serving 300,000 points. Before the program was optimized, CPU usage stayed above 80%. Because the project requires dual-machine hot standby, to save effort and avoid unnecessary trouble we put the related services together so that we could make full use of the HA features (an externally purchased HA system).
  2. The requirement on data correctness is extreme: from the bottom acquisition system all the way up to the top monitoring system, not a single record may be wrong.
    Our system architecture is as follows; you can see that the database is under heavy pressure, especially at the LevelA node:
    [System architecture diagram]
  3. The hardware configuration is as follows:
    CPU: Intel® Xeon® processor E5-2609 (4 cores, 2.40GHz, 10MB, 6.4 GT/s)
    Memory: 4GB (2x2GB) DDR3 RDIMM Memory, 1333MHz, ECC
    HDD: 500GB 7200 RPM 3.5" SATA3 hard drives, RAID 5.
  4. The database version
    is SQL Server 2012 Standard Edition, the genuine software provided by HP, which lacks many of the powerful features of the Enterprise Edition.

Write bottleneck

The first stumbling block: we found that with the existing program, SQL Server simply could not handle such a large volume of data. What was the actual situation?

Our storage structure

Generally, to store a large amount of historical data, we physically split the tables; otherwise a single table would accumulate hundreds of millions of records in no time. So our original table structure looked like this:

CREATE TABLE [dbo].[His20140822](
	[No] [bigint] IDENTITY(1,1) NOT NULL,
	[Dtime] [datetime] NOT NULL,
	[MgrObjId] [varchar](36) NOT NULL,
	[Id] [varchar](50) NOT NULL,
	[Value] [varchar](50) NOT NULL,
 CONSTRAINT [PK_His20140822] PRIMARY KEY CLUSTERED 
(
	[No] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]

No is the unique identifier. The other columns are the acquisition device Id (a GUID, MgrObjId), the monitoring indicator Id (varchar(50)), the recording time (Dtime), and the recorded value (Value). The acquisition device ID and the monitoring indicator ID were indexed for fast lookups.
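
The post does not show the DDL for those two secondary indexes; they were presumably plain single-column nonclustered indexes along these lines (the index names are illustrative, not taken from the post):

-- Presumed original secondary indexes (names are illustrative)
CREATE NONCLUSTERED INDEX IX_His20140822_MgrObjId ON [dbo].[His20140822]([MgrObjId]);
CREATE NONCLUSTERED INDEX IX_His20140822_Id ON [dbo].[His20140822]([Id]);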

Batch write

At the time, SqlBulkCopy was used for writing. Yes, that one, the class that claims to write millions of records in seconds.

    // Bulk-inserts the rows of a DataTable into the destination table via SqlBulkCopy.
    // Requires: using System.Data; using System.Data.SqlClient;
    public static int BatchInert(string connectionString, string desTable, DataTable dt, int batchSize = 500)
    {
        using (var sbc = new SqlBulkCopy(connectionString, SqlBulkCopyOptions.UseInternalTransaction)
        {
            BulkCopyTimeout = 300,           // seconds before the bulk copy times out
            NotifyAfter = dt.Rows.Count,     // raise SqlRowsCopied once, after all rows
            BatchSize = batchSize,           // rows sent to the server per round trip
            DestinationTableName = desTable
        })
        {
            // Map source columns to destination columns by name
            foreach (DataColumn column in dt.Columns)
                sbc.ColumnMappings.Add(column.ColumnName, column.ColumnName);
            sbc.WriteToServer(dt);
        }

        return dt.Rows.Count;
    }

What's the problem?

The above architecture was fine with 40 million records per day. However, once the configuration was raised to the levels described in the background, the centralized monitoring program ran out of memory. Analysis showed that too much received data was piling up in memory without being written to the database in time; data was being produced faster than it could be consumed, which caused the memory overflow and brought the program down.

Where is the bottleneck?

Was it a RAID disk issue? A data structure problem? A hardware problem? The SQL Server version? The lack of a partitioned table? Or a problem with the program?

At that time we had only one week; if we did not get it done within a week, the project supervisor would ask us to pack up and leave. So there were heroics of working 48 hours straight, and there was frantically calling everyone we could think of for help…

However, what was needed at that moment was calm, and more calm... The SQL Server version? The hardware? Neither could be replaced at that point. The RAID array? Probably not the cause. So what was it? I really could not calm down.

You probably cannot picture how tense the atmosphere was on site; even for me, after all this time, it is hard to put myself back into that situation. But I can say this: we may have all kinds of methods now, or as outsiders we can think more freely, but when a project is bearing down on you and you are about to give up, your thinking at that moment is constrained by the on-site environment and can go badly astray. It may make you think faster, or it may make you freeze. Under that kind of pressure, some colleagues made even more elementary mistakes, their thinking fell apart, and their efficiency dropped... 36 hours without sleep, or squatting on the construction site for two or three hours (mud everywhere on rainy days, turning to plaster dust once it dried), and then carrying on for another week! And keep going!

A lot of people offered a lot of ideas, but they seemed to work and also seemed not to. Wait, why did they "seem to work but also seem not to"? I vaguely felt I had caught a hint of a direction. What was it? Right: verification. We were now running in a live environment; there was no problem before, but that does not mean there is no problem under the current load. Analyzing such a small function inside a large system has too wide a scope of impact, so we should decompose it. Yes, "unit testing", testing one method at a time. We needed to verify each function: how long does each independent step take?

Step-by-step testing to verify system bottlenecks

Modify the parameters of BulkCopy
First of all, what I thought of was modifying the parameters of BulkCopy, BulkCopyTimeout and BatchSize. I kept testing and adjusting, and the results always fluctuated within a certain range with no real effect. It may have shifted the CPU numbers a little, but it fell far short of what I expected. The write speed still fluctuated between 10,000 and 20,000 records every 5 seconds, far from the requirement of writing 200,000 records within 20 seconds.

Storage by acquisition device
Yes, the structure above is one record per indicator per value. Isn't that too wasteful? So would it be feasible to use one record per acquisition device per collection time? The question is how to handle the fact that different acquisition devices have different attributes. At this point a colleague showed his talent: the monitoring indicators and monitoring values could be stored in XML format. Wow, is that even possible? Queries could then use the FOR XML form.

So we ended up with this structure: No, MgrObjId, Dtime, XMLData
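
The post does not give the DDL for this variant; a minimal sketch of what it might have looked like (the xml column type is an assumption) is:

CREATE TABLE [dbo].[His20140822Xml](
	[No] [bigint] IDENTITY(1,1) NOT NULL,
	[MgrObjId] [varchar](36) NOT NULL,
	[Dtime] [datetime] NOT NULL,
	[XMLData] [xml] NOT NULL,  -- all indicator Ids and values for this device at this collection time
 CONSTRAINT [PK_His20140822Xml] PRIMARY KEY CLUSTERED ([No] ASC)
) ON [PRIMARY]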

Verified: the result was slightly better than before, but not by much.

Data table partitioning???
At that point I had not yet learned this technique. After reading articles online it seemed quite complicated, and I did not have much time to try it.

Stop other programs
I knew this was not a real option, because the software and hardware architecture could not be changed for the time being, but I wanted to verify whether these factors had any effect. The improvement was indeed noticeable, but it still did not meet the requirements.

Is it SQL Server's bottleneck?
No way, could this really be SQL Server's bottleneck? I checked the information online: it might be an IO bottleneck. Damn, what else could I do, upgrade the server or switch databases? But would the project owner pay for that?

Wait, there was one more thing: indexes, right, indexes! The existence of indexes affects inserts and updates.

Remove the indexes

Yes, queries would certainly be slower after removing the indexes, but I first had to verify whether removing them would speed up writes. So I decisively removed the indexes on the MgrObjId and Id fields.

We ran it, and a miracle happened: each batch of 100,000 records was written completely within 7 to 9 seconds, meeting the system's requirements.
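
The post does not show the statements used; assuming index names like the illustrative ones above, dropping them would look something like this:

-- Index names are illustrative; only the clustered primary key on No is kept
DROP INDEX IX_His20140822_MgrObjId ON [dbo].[His20140822];
DROP INDEX IX_His20140822_Id ON [dbo].[His20140822];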

How to solve the query?

But a single table receives more than 400 million records a day, which is impossible to query without an index. What to do!? I thought of our old method again: physical sub-tables. Yes, we used to split tables by day, so now we split them by hour. That gives 24 tables, each holding only about 18 million records.

Then query one attribute's history over one or several hours. The result: slow! Slow!! Slow!!! Querying more than 10 million records without an index is simply unthinkable. What else could be done?

Keep splitting, I thought. We could also split the tables by the underlying collector, because the acquisition devices differ between collectors, and when we query a history curve we only ever look at the history of a single indicator, so it can be spread across different tables.

As a result, with 10 embedded collectors and tables split by the 24 hours of the day, 240 tables are generated per day (history table names look like His_001_2014112615). In the end, the problem of writing more than 400 million records per day while supporting simple queries was solved!!!
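
The post does not show how these per-collector hourly tables were created; a sketch of creating one dynamically, following the His_<collector>_<yyyyMMddHH> naming convention from the example above, might look like this:

-- Sketch: create the history table for collector 001, hour 2014-11-26 15:00
-- Note: no secondary indexes are created here; they are added only after the hour has closed
DECLARE @TableName sysname = N'His_001_2014112615';
DECLARE @Sql nvarchar(max) = N'
CREATE TABLE [dbo].[' + @TableName + N'](
	[No] [bigint] IDENTITY(1,1) NOT NULL,
	[Dtime] [datetime] NOT NULL,
	[MgrObjId] [varchar](36) NOT NULL,
	[Id] [varchar](50) NOT NULL,
	[Value] [varchar](50) NOT NULL,
 CONSTRAINT [PK_' + @TableName + N'] PRIMARY KEY CLUSTERED ([No] ASC)
) ON [PRIMARY]';
EXEC sp_executesql @Sql;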

Query optimization

After solving the problems above, half of the difficulties of this project were behind us, and the project supervisor was too embarrassed to come looking for faults. I don't know what kind of tactical arrangement that was.

After a long while, near the end of the year, the problem came up again: the tactic was to drag you down and keep you from taking on other projects at year end.

The requirement this time was as follows: the above was a simulation of 100,000 monitoring indicators, but now that the system was actually online there were only about 50,000 of them. That obviously could not satisfy the bid requirements and would not pass acceptance. So what to do? These clever people thought: since the monitoring indicators were halved, we will halve the time as well, doesn't that balance out? In other words, with the current 50,000 indicators, the data has to be stored within 10 seconds. Come on, by that logic, if we only had 500 monitoring indicators, wouldn't we have to store them within 0.1 seconds? Don't you care about the feelings of the monitored devices?

But if someone wants to play games with you, what can you do? Take it on. After the interval was reduced to 10 seconds, the problem appeared. If you analyze the logic above carefully, you can see that the tables were split by collector. Now there were fewer collectors, but each table received more data. What happened? Writes could still keep up, but each table approached 4 million records, and for some acquisition devices with many monitoring indicators it was close to 6 million. How to break through?

So the technical stakeholders met to discuss countermeasures.

How to optimize the query without adding an index?

A colleague suggested that the order of the conditions in the where clause would affect the query, because you could filter out part of the data first and then apply the next condition. It sounded reasonable, but doesn't SQL Server's query analyzer optimize that automatically? Forgive me for being a novice, I was going on gut feeling: it should be like the VS compiler, it should optimize automatically.

Whether that is actually the case, let the facts speak:

After the colleague modified the client, the test feedback was that the improvement was significant. I took a look at the code:
[Screenshot of the modified query code]

Could it really have that big an impact? Wait, did we forget to clear the cache and create a false impression?
So I had the colleague run the following statements to get more information:

-- Before optimization
DBCC FREEPROCCACHE
DBCC DROPCLEANBUFFERS

SET STATISTICS IO ON
select Dtime,Value from dbo.his20140825 WHERE  Dtime>='' AND Dtime<='' AND MgrObjId='' AND Id=''
SET STATISTICS IO OFF

-- After optimization
DBCC FREEPROCCACHE
DBCC DROPCLEANBUFFERS

SET STATISTICS IO ON
select Dtime,Value from dbo.his20140825 WHERE MgrObjId='' AND Id='' AND Dtime>='' AND Dtime<=''
SET STATISTICS IO OFF

The results were as follows:
[Screenshot of the STATISTICS IO output]

The "before optimization" version was actually better?

Looking carefully at the IO figures, the read-ahead reads were identical, meaning the data records being queried were the same, and the physical reads and table scans were also identical. Only the logical reads differed slightly, which should be down to cache hits. In other words, without an index, the order of conditions in the where clause has no obvious effect on query performance.

So the only remaining option was indexing.

Attempts at building indexes

Building indexes is not a trivial matter; it requires some basic knowledge. In this process I took quite a few detours before finally getting the indexes right.

The experiments below were validated against the following total record count:
[Screenshot of the record count]

Indexing a single field
This idea mainly came from how I had built my in-memory data structure: Dictionary<MgrObjId, Dictionary<Id, Property>>. I assumed that building an index on MgrObjId first, and then an index on Id, would make SQL Server queries faster.
[Screenshot of the estimated execution plan]

I first built an index on MgrObjId; the index was 550 MB and took 5 minutes 25 seconds to build. The result, as the estimated plan above showed, was that it did not help at all; it was actually slower.
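
The post does not show the exact statement; it was presumably a plain single-column nonclustered index along these lines (the index name is illustrative):

-- Presumed single-column index on MgrObjId (name is illustrative)
CREATE NONCLUSTERED INDEX Idx_His20141008_MgrObjId ON dbo.his20141008(MgrObjId);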

Indexing on multiple columns
OK, since the above did not work, what about building an index on multiple columns? CREATE NONCLUSTERED INDEX Idx_His20141008 ON dbo.his20141008(MgrObjId,Id,Dtime)

As a result, query speed really did double:
[Screenshot of the query statistics]

Wait, is this all the benefit an index brings? 7 minutes 25 seconds and 1.1 GB of space, just for this? Something had to be wrong, so I started digging through materials and reading some relevant books, and finally made a bigger breakthrough.

Building indexes correctly

First, we need to understand a few key points about indexes:

  • After indexing, ordering the index key by the column with the fewest duplicates gives the best effect. For our table, if a clustered index on No is created, putting No first in the where clause is best, then Id, then MgrObjId, and finally the time; if each table covers only one hour, it is best not to use a time index at all.
  • The order of the where conditions determines whether the query analyzer uses the index. For example, with an index on MgrObjId and Id, where MgrObjId='' and Id='' and Dtime='' will use an index seek, while where Dtime='' and MgrObjId='' and Id='' may not.
  • Put the non-key result columns into the index's included columns. Since our conditions are MgrObjId, Id, and Dtime, and the result only needs to return Dtime and Value, placing Dtime and Value in the included columns means the index already contains those values; there is no need to look up the base table, so optimal speed is achieved.

With the above principles in mind, we built the following index: CREATE NONCLUSTERED INDEX Idx_His20141008 ON dbo.his20141008(MgrObjId,Id) INCLUDE(Value,Dtime)

The build took a little over 6 minutes, and the index size is 903 MB.

Let's take a look at the estimated plan:
[Screenshot of the estimated execution plan]

You can see that the index is fully used here, with no extra overhead. The actual execution took less than a second: the result was filtered out of 11 million records in under one second!! Awesome!!

How are indexes applied?

Now that writing is solved and querying is solved, how do we combine them? Data from before the current hour can be indexed, while the current hour's data is left unindexed. In other words: do not create the indexes when creating the tables!!
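
The post does not show how this was scripted; a sketch of adding the covering index to a table whose hour has already closed (the table name follows the naming convention above) could be:

-- Sketch: once the hour 2014-11-26 15:00 has passed, index its now write-free table
CREATE NONCLUSTERED INDEX Idx_His_001_2014112615
ON dbo.His_001_2014112615 (MgrObjId, Id)
INCLUDE (Value, Dtime);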

How to optimize

You can try separating reads and writes into two databases: one real-time database and one read-only database. Data from the last hour is queried from the real-time database; data older than an hour is queried from the read-only database. The read-only database is loaded on a schedule and then indexed; data older than a week is analyzed and processed before being stored. That way a query over any time period can be handled correctly: within the last hour, the real-time database; from an hour to a week ago, the read-only database; older than a week, the report database.

If physical sub-tables are not required, the indexes in the read-only database can simply be rebuilt periodically.
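
A minimal sketch of such a periodic rebuild (the table name is illustrative; in practice this would run from a scheduled job over each read-only table) could be:

-- Sketch: rebuild all indexes on one read-only history table
ALTER INDEX ALL ON dbo.His20141008 REBUILD;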

Summary

Processing hundreds of millions of records of historical data in SQL Server can be handled in the following ways:

    • Remove all indexes from the table
    • Insert with SqlBulkCopy
    • Split into sub-tables or partitions to reduce the total amount of data in each table
    • Create the indexes only after a table has been completely written
    • Specify the index columns correctly
    • Put the fields you need to return into the included columns (so the index returned contains everything)
    • When querying, return only the columns you actually need
