Comparing Text Sequences with Python's difflib Module

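Background

We need to collect the contents of certain configuration files on a schedule, compare each snapshot with the previous one, and persist the changes to a database, one record per changed line.

The benefit is that the change history can be reviewed at any time: we can see which changes were made and when, which makes it much easier to work out which change affected the normal operation of a service.

The rest of this post uses the difflib module to implement this requirement.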

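About difflib

Official documentation: https://docs.python.org/3/library/difflib.html

Chinese version: https://docs.python.org/zh-cn/3/library/difflib.html

difflib is a module in Python's standard library. Its classes and functions compare two sequences and can report the differences either as diff text or as an HTML comparison page.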
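To make the two output forms concrete, here is a minimal sketch that produces a text diff with difflib.unified_diff and an HTML page with difflib.HtmlDiff. The sample lines and the file names old.conf and new.conf are made up for illustration and are not part of the original requirement.

```python
import difflib

# Hypothetical configuration snapshots, one list item per line.
old = ['timeout = 30\n', 'retries = 3\n']
new = ['timeout = 60\n', 'retries = 3\n']

# Text result: a unified diff, the same format that `diff -u` prints.
unified = list(difflib.unified_diff(old, new,
                                    fromfile='old.conf', tofile='new.conf'))
print(''.join(unified), end='')

# HTML result: a complete page with a side-by-side comparison table.
html_page = difflib.HtmlDiff().make_file(old, new, 'old.conf', 'new.conf')
print(len(html_page), 'characters of HTML')
```

Both forms are meant for people to read. The sections below look for a result that is easier for a program to consume.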

Using the Differ class

We start by comparing two text sequences with the Differ class.

Code example

from difflib import Differ

text1 = '''  1. Beautiful is better than ugly.
  2. Explicit is better than implicit.
  3. Simple is better than complex.
  4. Complex is better than complicated.
'''.splitlines(keepends=True)
text2 = '''  1. Beautiful is better than ugly.
  3. Simple is better than complex.
  4. Complicated is better than complicated.
  5. Flat is better than nested.
'''.splitlines(keepends=True)

differ = Differ()
for i in differ.compare(text1, text2):
    print(i, end='')

Output

    1. Beautiful is better than ugly.
-   2. Explicit is better than implicit.
    3. Simple is better than complex.
-   4. Complex is better than complicated.
?            ^
+   4. Complicated is better than complicated.
?           ++++ ^
+   5. Flat is better than nested.

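This output covers both the differences between lines and the differences inside a line (the lines starting with "?"). We do not actually care about the intraline differences, and the format is hard to parse.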
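The parsing difficulty can be sketched as follows. Every line produced by Differ.compare starts with a two-character code, so the "?" hints are easy to drop, but a modified line still shows up as an unrelated "-" entry and "+" entry. The sample data is invented for this illustration.

```python
from difflib import Differ

# Hypothetical snapshots: one line modified, one line swapped for another.
old = ['a = 1\n', 'b = 2\n', 'c = 3\n']
new = ['a = 1\n', 'b = 20\n', 'd = 4\n']

# Prefix codes: '  ' unchanged, '- ' only in old, '+ ' only in new,
# '? ' intraline hint.
changes = []
for line in Differ().compare(old, new):
    code, text = line[:2], line[2:]
    if code in ('- ', '+ '):
        changes.append((code.strip(), text))
    # '  ' and '? ' lines are ignored

for change in changes:
    print(change)
```

Nothing in this flat list says that 'b = 2' became 'b = 20'. That pairing has to be guessed from the order of the entries, which is why this format is awkward to store as change records.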

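Using the SequenceMatcher class

The get_opcodes method of the SequenceMatcher class returns a list of tuples describing how to turn a into b. Each tuple has the form (tag, i1, i2, j1, j2), where tag is one of 'replace', 'delete', 'insert' or 'equal', and the indices select the slices a[i1:i2] and b[j1:j2].

Code example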

from difflib import SequenceMatcher

matcher = SequenceMatcher(None, text1, text2)
for tag, alo, ahi, blo, bhi in matcher.get_opcodes():
    if tag == 'replace':
        print('replace\n{}\n{}'.format(text1[alo:ahi], text2[blo:bhi]))
    elif tag == 'delete':
        print('delete\n{}'.format(text1[alo:ahi]))
    elif tag == 'insert':
        print('insert\n{}'.format(text2[blo:bhi]))
    elif tag == 'equal':
        print('equal\n{}\n{}'.format(text1[alo:ahi], text2[blo:bhi]))

Output

equal
['  1. Beautiful is better than ugly.\n']
['  1. Beautiful is better than ugly.\n']
delete
['  2. Explicit is better than implicit.\n']
equal
['  3. Simple is better than complex.\n']
['  3. Simple is better than complex.\n']
replace
['  4. Complex is better than complicated.\n']
['  4. Complicated is better than complicated.\n', '  5. Flat is better than nested.\n']
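One way to read these opcodes is as an edit script: applying them in order to the first sequence rebuilds the second one. The following sketch checks that property on invented sample data.

```python
from difflib import SequenceMatcher

# Hypothetical sequences of lines.
a = ['one\n', 'two\n', 'three\n', 'four\n']
b = ['one\n', 'three\n', 'FOUR\n', 'five\n']

rebuilt = []
for tag, i1, i2, j1, j2 in SequenceMatcher(None, a, b).get_opcodes():
    if tag == 'equal':
        rebuilt.extend(a[i1:i2])   # keep the lines from a
    elif tag in ('replace', 'insert'):
        rebuilt.extend(b[j1:j2])   # take the new lines from b
    # 'delete': a[i1:i2] is simply dropped

print(rebuilt == b)
```

Note that a 'replace' opcode covers a whole block on each side, and the two blocks may have different lengths, as in the last entry of the output above.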

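Splitting change blocks into single changes

The result from SequenceMatcher is already close to what we want, but it would be even better if each change block were split into single-line changes.

Let's try writing a function to do that.

Code example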

from difflib import SequenceMatcher


def diff(text1, text2):
    """Return a flat list of single-line changes that turn text1 into text2."""
    change_list = []
    matcher = SequenceMatcher(None, text1, text2)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == 'replace':
            l1, l2 = text1[i1:i2], text2[j1:j2]
            # Pair the lines by position; zip stops at the shorter block.
            change_list.extend((tag, x, y) for x, y in zip(l1, l2))
            # Leftover lines on the longer side become deletes or inserts.
            if len(l1) > len(l2):
                change_list.extend(('delete', line) for line in l1[len(l2):])
            else:
                change_list.extend(('insert', line) for line in l2[len(l1):])
        elif tag == 'delete':
            change_list.extend(('delete', line) for line in text1[i1:i2])
        elif tag == 'insert':
            change_list.extend(('insert', line) for line in text2[j1:j2])
        # 'equal' blocks carry no change and are skipped.
    return change_list

for change in diff(text1, text2):
    print(change)

Output

('delete', '  2. Explicit is better than implicit.\n')
('replace', '  4. Complex is better than complicated.\n', '  4. Complicated is better than complicated.\n')
('insert', '  5. Flat is better than nested.\n')

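This result can already be parsed and persisted.

However, it does not produce a correct comparison in some special cases.

The problem

Testing shows that if the test data is changed to the following, the lines become misaligned: because line 3 was also edited, lines 2 to 4 fall into one replace block, and diff pairs the lines in that block purely by position.

Code example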

text1 = '''  1. Beautiful is better than ugly.
  2. Explicit is better than implicit.
  3. Simple is better than complex.
  4. Complex is better than complicated.
'''.splitlines(keepends=True)
text2 = '''  1. Beautiful is better than ugly.
  3. Simple is better than complexed.
  4. Complicated is better than complicated.
  5. Flat is better than nested.
'''.splitlines(keepends=True)

for change in diff(text1, text2):
    print(change)

Output

('replace', '  2. Explicit is better than implicit.\n', '  3. Simple is better than complexed.\n')
('replace', '  3. Simple is better than complex.\n', '  4. Complicated is better than complicated.\n')
('replace', '  4. Complex is better than complicated.\n', '  5. Flat is better than nested.\n')

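A custom CustomDiffer class

So this approach is not reliable either, and I decided to go back to the Differ class.

Internally, Differ also uses SequenceMatcher, but it breaks up each replace block by searching for the best-matching pair of lines, which neatly solves the problem we just ran into.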
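The idea of a "best-matching pair" can be sketched with SequenceMatcher.ratio, which scores the similarity of two strings between 0 and 1. This is only an illustration of the principle and not a copy of Differ's code; as far as I know, CPython's Differ only pairs lines whose score is above roughly 0.75, and that threshold is an implementation detail that may change. The helper name similarity is made up for this example.

```python
from difflib import SequenceMatcher

# Two candidate lines from the old text and one line from the new text.
old_block = [
    '  2. Explicit is better than implicit.\n',
    '  3. Simple is better than complex.\n',
]
new_line = '  3. Simple is better than complexed.\n'


def similarity(x, y):
    """Character-level similarity of two lines, from 0.0 to 1.0."""
    return SequenceMatcher(None, x, y).ratio()


scores = {line: similarity(line, new_line) for line in old_block}
best = max(scores, key=scores.get)

for line, score in scores.items():
    print('{:.2f}  {!r}'.format(score, line))
print('best match:', repr(best))
```

Pairing by similarity instead of by position is what lets the edited line 3 be matched with its own new version, so that line 2 is reported as a plain delete.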

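Next, I defined a CustomDiffer class that inherits from Differ and overrides the parent's formatting methods, mainly so that Differ's results come out in one uniform format.

Code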

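from difflib import Differ


class CustomDiffer(Differ):
    # _dump and _qformat are private helpers of Differ, so this override
    # depends on implementation details of the standard library.
    def _dump(self, tag, x, lo, hi):
        """Yield one (kind, line) tuple per inserted or deleted line."""
        if tag == '+':
            kind = 'insert'
        elif tag == '-':
            kind = 'delete'
        else:
            return  # unchanged lines are not reported
        for i in range(lo, hi):
            yield kind, x[i]

    def _qformat(self, aline, bline, atags, btags):
        """Yield a matched pair of similar lines as a single replace."""
        yield 'replace', aline, bline


# Run against the text1 and text2 from the previous section.
for change in CustomDiffer().compare(text1, text2):
    print(change)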

Output

('delete', '  2. Explicit is better than implicit.\n')
('replace', '  3. Simple is better than complex.\n', '  3. Simple is better than complexed.\n')
('replace', '  4. Complex is better than complicated.\n', '  4. Complicated is better than complicated.\n')
('insert', '  5. Flat is better than nested.\n')

Each change is now a tuple, which makes it much easier to parse and process.
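To close the loop with the original requirement, here is a minimal sketch of persisting such tuples with the standard sqlite3 module. The table name config_change, its columns, the helper to_row and the in-memory database are all assumptions made for this illustration; the original post does not say which database or schema is used.

```python
import sqlite3
from datetime import datetime, timezone

# Sample change tuples in the shape shown above.
changes = [
    ('delete', '  2. Explicit is better than implicit.\n'),
    ('replace', '  3. Simple is better than complex.\n',
                '  3. Simple is better than complexed.\n'),
    ('insert', '  5. Flat is better than nested.\n'),
]


def to_row(change, collected_at):
    """Map a change tuple to (collected_at, kind, old_line, new_line)."""
    kind = change[0]
    if kind == 'replace':
        old, new = change[1], change[2]
    elif kind == 'delete':
        old, new = change[1], None
    else:  # 'insert'
        old, new = None, change[1]
    return collected_at, kind, old, new


conn = sqlite3.connect(':memory:')  # hypothetical store; use a file in practice
conn.execute(
    'CREATE TABLE config_change ('
    ' collected_at TEXT, kind TEXT, old_line TEXT, new_line TEXT)'
)
now = datetime.now(timezone.utc).isoformat()
conn.executemany(
    'INSERT INTO config_change VALUES (?, ?, ?, ?)',
    [to_row(change, now) for change in changes],
)
conn.commit()

rows = conn.execute(
    'SELECT kind, old_line, new_line FROM config_change ORDER BY rowid'
).fetchall()
for row in rows:
    print(row)
```

Storing the old and new line in separate nullable columns keeps all three kinds of change in one table, so the history for a given collection time can be read back with a single query.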