Influence of field order and PAGE_SIZE on SQLite performance

    1. Background

A SQLite database has a table with several fields, one of which is of type BLOB, and the BLOB field is not the last field. The table structure is similar to the following (the BLOB field is col3):

    T (col1 INTEGER,col2 TEXT,col3 BLOB,col4 REAL,col5 TEXT)

    The business system traverses the contents of this table, but the query does not include the BLOB field. The SQL looks like this:

    Select col1,col2,col4,col5 from T;

    2. Problem Description

This usage pattern works acceptably when table T is small. When T is large, however (in our system the .DB file reaches 100G, most of it BLOB content), traversing the table takes a long time, up to several hours. How should this scenario be optimized?

    3. Optimization ideas

    Leaving aside physical IO optimization and operating system tuning, and considering only the database, the common approaches generally come down to the following:

  1. Indexes. For a table that has to be traversed in full, access through an index offers no room for optimization and would be even less efficient. Index optimization is therefore the first idea to drop.
  2. SQL optimization. This SQL is about as basic as a query gets, so there is no room for optimization.
  3. Parallel query. Previously one process read all records; instead, multiple processes each read a different range of records. This is worth trying.
  4. Parameter optimization. For example, set the PAGE_SIZE parameter to adjust the size of the page, the smallest storage unit.
  5. Other optimization. Optimizations based on the database file format and how the database works.

Based on the above, I ran optimization experiments on three of these: parallel query, parameter optimization, and other optimization.

    4. Optimization experiments

4.1. Parallel Query

    I split all records of T into units of 5000 rows by ROWID, then started several query processes, each reading different units. This does achieve parallelism, but the IO throughput of each process drops as the number of processes grows, so overall performance does not improve. The table below shows the IO throughput obtained by each process for different numbers of parallel processes:

    Number of parallel processes    IO throughput per process    Total time
    1                               18M/s                        About 2 hours
    4                               4.4M/s                       About 2 hours
    6                               2.9M/s                       About 2 hours
    8                               2.2M/s                       About 2 hours

    Although multiple processes can read simultaneously, this does not increase the total IO throughput for SQLite, so parallel querying provides no optimization benefit here.
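The ROWID-based split described in this section can be sketched roughly as follows. This is a minimal illustration in Python, not the code used in the original experiment: the helper names `rowid_units`, `scan_unit` and `parallel_scan` are hypothetical, and the sketch assumes T is an ordinary rowid table stored in a file at `db_path`.

```python
import sqlite3
from multiprocessing import Pool

UNIT = 5000  # rows per unit, as in the experiment above


def rowid_units(db_path, unit=UNIT):
    """Return inclusive (lo, hi) ROWID ranges that together cover table T."""
    con = sqlite3.connect(db_path)
    try:
        lo, hi = con.execute("SELECT min(rowid), max(rowid) FROM T").fetchone()
    finally:
        con.close()
    if lo is None:  # empty table
        return []
    return [(start, min(start + unit - 1, hi)) for start in range(lo, hi + 1, unit)]


def scan_unit(task):
    """Worker: traverse one ROWID range without selecting the BLOB column."""
    db_path, lo, hi = task
    con = sqlite3.connect(db_path)  # each worker opens its own connection
    try:
        cur = con.execute(
            "SELECT col1, col2, col4, col5 FROM T WHERE rowid BETWEEN ? AND ?",
            (lo, hi),
        )
        return sum(1 for _ in cur)  # number of rows read in this unit
    finally:
        con.close()


def parallel_scan(db_path, processes=4):
    """Distribute the units over a pool of worker processes; return total rows read."""
    tasks = [(db_path, lo, hi) for lo, hi in rowid_units(db_path)]
    with Pool(processes) as pool:
        return sum(pool.map(scan_unit, tasks))
```

When run as a script, `parallel_scan` should be called from under an `if __name__ == "__main__":` guard so that worker processes can import the module safely.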

4.2. Parameter optimization

    SQLite lets you set various runtime parameters through PRAGMA statements. After going through all the parameters that can be set, I concluded that PAGE_SIZE is the one most likely to affect the result significantly, based on the following analysis:

A page is SQLite's smallest storage unit and the basic unit by which a table grows and shrinks; table records are stored in pages (similar to ORACLE blocks). PAGE_SIZE specifies the size of a page. Different versions have different defaults (1024 bytes before v3.12, 4096 bytes from v3.12 on). The value can only be changed before the first table is created in the .DB file (or by running VACUUM immediately after changing it).
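The two ways of setting PAGE_SIZE mentioned above can be sketched with Python's standard `sqlite3` module. The function names are hypothetical, and the sketch assumes the database is a regular file that is not in WAL mode, where the page size cannot be changed.

```python
import sqlite3


def create_db_with_page_size(path, page_size=4096):
    """Create a new database, fixing the page size before any table exists."""
    con = sqlite3.connect(path)
    # Must run before the first table is created to take effect directly.
    con.execute(f"PRAGMA page_size = {int(page_size)}")
    con.execute("CREATE TABLE T (col1 INTEGER, col2 TEXT, col3 BLOB, col4 REAL, col5 TEXT)")
    con.commit()
    return con


def change_page_size(path, page_size):
    """Change the page size of an existing database by rebuilding it with VACUUM."""
    # isolation_level=None gives autocommit; VACUUM cannot run inside a transaction.
    con = sqlite3.connect(path, isolation_level=None)
    try:
        con.execute(f"PRAGMA page_size = {int(page_size)}")
        con.execute("VACUUM")  # rewrites the whole file using the new page size
        return con.execute("PRAGMA page_size").fetchone()[0]
    finally:
        con.close()
```

SQLite accepts only powers of two between 512 and 65536 as page sizes. VACUUM rewrites the entire file, so on a database as large as the one described here it is itself a long-running operation.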

The SQLite version used by our business system is older than v3.12, so the default PAGE_SIZE is 1KB. Analysis of the data showed that almost every record in the table is larger than 1KB, some reaching several hundred KB. When a table record (called the payload in SQLite) is stored, it first uses the space remaining in the current page. When that space is not enough, an overflow page is created and the rest of the payload continues there; if that is still not enough, further overflow pages are chained on until the whole payload is stored. The process is illustrated in the following diagram:

    Suppose a record is 3K. With a PAGE_SIZE of 1K, reading the complete record requires 3 page lookups; with a PAGE_SIZE of 4K, reading it requires only one.
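The arithmetic behind this example can be written out as a deliberately simplified model. The function name `pages_touched` is hypothetical, and the model ignores page headers and the 4-byte next-page pointer that each overflow page carries, so real page counts in SQLite are somewhat higher.

```python
import math


def pages_touched(record_bytes, page_size):
    """Simplified model: number of pages needed to hold one record."""
    return math.ceil(record_bytes / page_size)


# The 3K record from the text, with 1K and 4K pages.
for page_size in (1024, 4096):
    print(page_size, pages_touched(3 * 1024, page_size))
```

Under the same model, a 76KB record (the average size in the test table below) spans 76 pages at a PAGE_SIZE of 1K but only 5 pages at 16K.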

    I built a test table with an average record size of 76KB and 12,000 records in total, about 960M overall, and created it with several different PAGE_SIZE settings. The query times compare as follows:

    PAGE_SIZE    Query time for this scenario
    1K           4.93s
    2K           2.74s
    4K           1.67s
    8K           1.08s
    16K          0.79s

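    As the table above shows, when table records are large (larger than PAGE_SIZE), the query time for this scenario decreases as PAGE_SIZE increases. In these measurements each doubling of PAGE_SIZE cut the query time by a factor of roughly 1.4 to 1.8 (about 6.2x overall from 1K to 16K), so the speedup follows the PAGE_SIZE increase but is less than proportional to it.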

4.3. Other optimization

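    Analysis of the SQLite file format shows that the contents of all fields of a table record are laid out contiguously. This differs from databases such as ORACLE (for a LOB object, ORACLE stores only the LOB's address in the field content, not the actual LOB content). The difference is shown in the following figure: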

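    With SQLite, reading col4 and col5 requires "walking past" the entire col3 value. When the col3 value is so large that it is spread across multiple overflow pages, all of those overflow pages have to be walked as well. This walk serves no purpose, but it still incurs IO.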

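    Because the SQLite file format has this property, simply moving the BLOB field from the third position to the last position avoids the wasted IO of walking past the BLOB field.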
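For an existing database, one way to move the BLOB field to the last position is to rebuild the table. The following is a minimal sketch under stated assumptions, not the procedure used in the original experiment: the function name `move_blob_last` and the temporary table name `T_new` are hypothetical, and T is assumed to have no indexes, triggers or views that would also need to be recreated.

```python
import sqlite3


def move_blob_last(con):
    """Rebuild T so that the BLOB column col3 is stored after all other columns."""
    con.executescript(
        """
        BEGIN;
        CREATE TABLE T_new (col1 INTEGER, col2 TEXT, col4 REAL, col5 TEXT, col3 BLOB);
        -- Copy rowid explicitly so existing ROWID values are preserved.
        INSERT INTO T_new (rowid, col1, col2, col4, col5, col3)
            SELECT rowid, col1, col2, col4, col5, col3 FROM T;
        DROP TABLE T;
        ALTER TABLE T_new RENAME TO T;
        COMMIT;
        """
    )
```

The pages freed by dropping the old table stay in the file until VACUUM is run. Queries that name their columns explicitly, like the one in section 1, are unaffected by the new column order.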

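    Still using the data from 4.2, with PAGE_SIZE set to 1K, I created one database with the BLOB field in the middle and one with it in the last position. The performance comparison is as follows: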

    PAGE_SIZE    Position of BLOB field    Query time for this scenario
    1K           Middle                    4.93s
    1K           Last                      0.28s

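    After moving the BLOB field from the middle to the last position, the optimization effect is obvious: query efficiency is about 17.6 times what it was before the change.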
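A comparison of this kind can be reproduced on a small scale with a sketch like the following. It is not the original test harness: the names `LAYOUTS`, `build`, `time_scan` and `compare`, the row count and the filler values are all assumptions chosen for illustration. Measured times depend heavily on hardware and on the operating system's file cache, so no particular ratio should be expected.

```python
import os
import sqlite3
import tempfile
import time

# Column order for the two variants being compared.
LAYOUTS = {
    "middle": "col1 INTEGER, col2 TEXT, col3 BLOB, col4 REAL, col5 TEXT",
    "last": "col1 INTEGER, col2 TEXT, col4 REAL, col5 TEXT, col3 BLOB",
}


def build(path, layout, rows, blob_bytes, page_size=1024):
    """Create a database file with the given column layout and fill it with test rows."""
    con = sqlite3.connect(path)
    con.execute(f"PRAGMA page_size = {int(page_size)}")
    con.execute(f"CREATE TABLE T ({LAYOUTS[layout]})")
    blob = os.urandom(blob_bytes)  # same random payload reused for every row
    con.executemany(
        "INSERT INTO T (col1, col2, col3, col4, col5) VALUES (?, ?, ?, ?, ?)",
        ((i, "text", blob, 1.5, "more text") for i in range(rows)),
    )
    con.commit()
    con.close()


def time_scan(path):
    """Run the traversal query from section 1; return (rows read, elapsed seconds)."""
    con = sqlite3.connect(path)
    start = time.perf_counter()
    n = sum(1 for _ in con.execute("SELECT col1, col2, col4, col5 FROM T"))
    elapsed = time.perf_counter() - start
    con.close()
    return n, elapsed


def compare(rows=200, blob_bytes=76 * 1024):
    """Build both layouts in a temporary directory and time the same scan on each."""
    results = {}
    with tempfile.TemporaryDirectory() as d:
        for layout in LAYOUTS:
            path = os.path.join(d, f"{layout}.db")
            build(path, layout, rows, blob_bytes)
            results[layout] = time_scan(path)
    return results
```

Calling `compare()` returns a dictionary keyed by layout name; increasing `rows` toward the 12,000 used in the text makes the difference between the two layouts easier to observe.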

    5. Conclusion

    For the optimization scenario described in section 1 (Background), there are two optimization approaches:

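  1. Move the BLOB field from the middle to the last position. This approach has a very pronounced effect.
  2. Set a suitable PAGE_SIZE based on the size of the table records, to minimize overflow pages and thereby reduce the number of IO operations. This approach gives a reasonable improvement, but not as pronounced as the first.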

Origin www.cnblogs.com/6yuhang/p/11444351.html