A pitfall in the Python standard library difflib? Two ways to quickly improve data-comparison efficiency

I. Introduction

I recently developed a data-matching program based on the Python standard library difflib. It worked largely as expected, until it met its nemesis: a file of just 100 rows by 77 columns that took a full 61s to compare. That is unacceptable, because the downstream extraction pipeline involves many more such row-by-row comparisons; at this rate the job would never finish. How to break through?

 

II. Flashback

Here is difflib comparing that file (100 rows × 77 columns), taking 61s, as follows:

[1.png: timing output for the 100-row × 77-column comparison, 61s]

Next, reduce the data to 5 rows × 77 columns and see the effect: it takes only 0.05s, as follows:

[2.png: timing output for the 5-row × 77-column comparison, 0.05s]


From these timings it is not hard to see that difflib's cost on this file does not grow linearly with the amount of data: 20× more rows (5 → 100) costs roughly 1220× more time (0.05s → 61s). That is the scariest part; if the data volume keeps growing, the comparison time will become unbearable.
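For context, here is a minimal sketch of how such a timing can be reproduced. The synthetic data below is a hypothetical stand-in for the real file; the comparison uses difflib.HtmlDiff().make_file(), which is what the program calls (see the implementation in section IV):

import difflib
import time

# Hypothetical stand-in for the problem file: 100 rows, 77 comma-separated fields
source = [",".join(f"r{r}c{c}" for c in range(77)) for r in range(100)]
# Perturb one field per row so the two sides genuinely differ
target = [row.replace("c38", "X") for row in source]

start = time.perf_counter()
html = difflib.HtmlDiff().make_file(source, target)
print(f"full comparison took {time.perf_counter() - start:.2f}s")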

 

III. Optimization ideas

difflib is a standard-library module; its job is simply to compare the data it is given and to render the result in various formats. Faced with such a serious time cost, the first place to look is our own data. I had two optimization ideas:

First, filter out identical rows, reducing the amount of data to be compared;

Second, slice the data into chunks.

For the first idea: split each file into a list by line, then remove entries that are identical at the same position in both lists, keeping only the rows that differ. The benefits of doing so are clear: it reduces the amount of data to be compared, improving efficiency, and the resulting report is also cleaner, since identical rows need not be shown at all.

For the second idea: split the data to be compared into relatively small chunks and match them chunk by chunk. The feasibility of this approach follows from the relationship between data volume and time cost observed above.
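To make that concrete: if comparing a fixed-size chunk of k rows costs roughly a constant c, then comparing n rows as n/k chunks costs about (n/k) × c, which grows linearly with n for a fixed chunk size, instead of the far steeper growth measured above. For example, 100 rows compared in 10-row chunks should cost roughly 10 × (the cost of one 10-row comparison).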

 

IV. Implementation

The first strategy, filtering out identical rows, is implemented as follows:

import operator

# Filter out identical rows
source_length = len(source)  # source: the original data split into a list of lines
target_length = len(target)  # target: the target data split into a list of lines
min_length = source_length if source_length < target_length else target_length
pos_list = []  # row numbers of identical rows; row 0 (the column header) is always kept
for index in range(1, min_length):
    # note: iterating in order preserves row order
    if operator.eq(source[index], target[index]):
        pos_list.append(index)
# Remove the identical rows; rebuilding the lists avoids index drift
source = [source[index] for index in range(source_length) if index not in pos_list]
target = [target[index] for index in range(target_length) if index not in pos_list]
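A quick sanity check of the filter on toy data (hypothetical lists; note that the header row at index 0 is always kept):

source = ["id,name", "1,alice", "2,bob", "3,carol"]
target = ["id,name", "1,alice", "2,bobby", "3,carol"]
# After running the filter above, only the header and the differing row remain:
# source == ["id,name", "2,bob"]
# target == ["id,name", "2,bobby"]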


The second strategy, data slicing, is implemented as follows:

import difflib

# Slice the data
max_length = source_length if source_length > target_length else target_length  # upper bound for slicing
# the loop bound below guarantees the trailing partial slice is not missed
start_pos = 0
STEP = 10  # slice size, i.e. the number of rows compared per pass; 10 rows by default
end_pos = start_pos + STEP
diff = difflib.HtmlDiff()  # create an HtmlDiff instance
while end_pos < max_length + STEP:
    detail_info = diff.make_file(source[start_pos:end_pos], target[start_pos:end_pos])
    # ... processing logic for this slice's report ...
    start_pos = end_pos
    end_pos += STEP
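Note the loop condition end_pos < max_length + STEP: Python list slicing clamps out-of-range bounds, so the final pass simply compares whatever partial slice remains, and no trailing rows are lost even when the row count is not a multiple of STEP.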

 

V. Optimization results

Using only the data-slicing strategy, comparing the same 100-row × 77-column file now takes just 1.8s, versus 61s before optimization. More importantly, because the data is sliced, the cost of comparing each slice stays roughly stable, so even as the data volume keeps growing, the total time grows only linearly rather than quasi-exponentially. On top of that, stacking the first strategy of filtering identical rows would shrink the data volume further and should perform even better; however, to keep the before/after comparison intuitive at the same data volume, only the data-slicing strategy was applied here.
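For readers who want to reproduce the contrast, here is a minimal, self-contained benchmark sketch on synthetic data (the data and the 10-row slice size are hypothetical stand-ins, mirroring the code above):

import difflib
import time

def compare_full(source, target):
    # one big comparison of everything at once
    return difflib.HtmlDiff().make_file(source, target)

def compare_sliced(source, target, step=10):
    # compare the data slice by slice, as in the loop above
    diff = difflib.HtmlDiff()
    reports = []
    for start in range(0, max(len(source), len(target)), step):
        reports.append(diff.make_file(source[start:start + step],
                                      target[start:start + step]))
    return reports

source = [",".join(f"r{r}c{c}" for c in range(77)) for r in range(100)]
target = [row.replace("c38", "X") for row in source]

for fn in (compare_full, compare_sliced):
    t0 = time.perf_counter()
    fn(source, target)
    print(f"{fn.__name__}: {time.perf_counter() - t0:.2f}s")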

The timing after optimization is as follows:

[3.png: timing output after slicing, 1.8s]

 

VI. Final remarks

This article looked at the serious time cost that the Python standard library difflib can incur when comparing files, proposed two optimization strategies, and verified by testing that with the data-slicing strategy alone, the comparison time for the same file drops from the original 61s to 1.8s, while the cost now grows only linearly with the data volume. If you know a better way, feel free to leave a message and share.



Source: blog.51cto.com/2681882/2412560