Python Modules: difflib - Sequence Comparison

Module purpose: Compare sequences, especially lines of text.

The difflib module contains tools for computing and working with the differences between sequences. It is especially useful for comparing text.

The examples in this section all use the common test data defined in difflib_data.py:

# difflib_data.py

text1 = """Lorem ipsum dolor sit amet, consectetuer adipiscing
elit. Integer eu lacus accumsan arcu fermentum euismod. Donec
pulvinar porttitor tellus. Aliquam venenatis. Donec facilisis
pharetra tortor.  In nec mauris eget magna consequat
convalis. Nam sed sem vitae odio pellentesque interdum. Sed
consequat viverra nisl. Suspendisse arcu metus, blandit quis,
rhoncus ac, pharetra eget, velit. Mauris urna. Morbi nonummy
molestie orci. Praesent nisi elit, fringilla ac, suscipit non,
tristique vel, mauris. Curabitur vel lorem id nisl porta
adipiscing. Suspendisse eu lectus. In nunc. Duis vulputate
tristique enim. Donec quis lectus a justo imperdiet tempus."""

text1_lines = text1.splitlines()

text2 = """Lorem ipsum dolor sit amet, consectetuer adipiscing
elit. Integer eu lacus accumsan arcu fermentum euismod. Donec
pulvinar, porttitor tellus. Aliquam venenatis. Donec facilisis
pharetra tortor. In nec mauris eget magna consequat
convalis. Nam cras vitae mi vitae odio pellentesque interdum. Sed
consequat viverra nisl. Suspendisse arcu metus, blandit quis,
rhoncus ac, pharetra eget, velit. Mauris urna. Morbi nonummy
molestie orci. Praesent nisi elit, fringilla ac, suscipit non,
tristique vel, mauris. Curabitur vel lorem id nisl porta
adipiscing. Duis vulputate tristique enim. Donec quis lectus a
justo imperdiet tempus.  Suspendisse eu lectus. In nunc."""

text2_lines = text2.splitlines()

Comparing bodies of text

The Differ class works on sequences of text lines and produces human-readable deltas, or change instructions, including the differences within individual lines. The default output of Differ is similar to that of the Unix command-line tool diff. It includes the original values from both input lists, including the values they have in common, plus markers that indicate what was changed.

  • Lines prefixed with a minus sign (-) exist in the first sequence but not in the second.
  • Lines prefixed with a plus sign (+) exist in the second sequence but not in the first.
  • If a line differs incrementally between the two versions, an extra line prefixed with a question mark (?) highlights where the change occurs within the line.
  • If a line has not changed, it is prefixed with an extra space so that it lines up with the lines that have changed.
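
The prefixes described above can be seen on a very small input. The following is a minimal sketch; the two word lists are made up for illustration and are not part of the sample data. It groups the delta lines by their two-character prefix.

```python
import difflib
from collections import defaultdict

# Hypothetical inputs, chosen only to trigger the different markers.
old_lines = ["one", "two", "three"]
new_lines = ["one", "too", "three", "four"]

delta = list(difflib.Differ().compare(old_lines, new_lines))

# Every delta line starts with a two-character code: "  ", "- ", "+ " or "? ".
by_prefix = defaultdict(list)
for line in delta:
    by_prefix[line[:2]].append(line[2:])

print("unchanged  :", by_prefix["  "])
print("only in old:", by_prefix["- "])
print("only in new:", by_prefix["+ "])
```

Whether any "? " hint lines appear depends on how similar Differ judges the changed lines to be, so the sketch does not rely on them.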

Breaking the text up into a sequence of individual lines before passing it to compare() produces more readable output than passing in one large string.

# difflib_differ.py

import difflib
from difflib_data import *

d = difflib.Differ()
diff = d.compare(text1_lines, text2_lines)
print('\n'.join(diff))

The text at the beginning of the sample data is exactly the same, so the first two lines are output directly without additional markers.

  Lorem ipsum dolor sit amet, consectetuer adipiscing
  elit. Integer eu lacus accumsan arcu fermentum euismod. Donec

The third line of the data was changed to include a comma in the modified text. Both versions of the line are printed, followed by an extra ? line that marks the column where the text was modified, in this case where the , character was added.

- pulvinar porttitor tellus. Aliquam venenatis. Donec facilisis
+ pulvinar, porttitor tellus. Aliquam venenatis. Donec facilisis
?         +

The next few lines of output show that an extra space was removed.

- pharetra tortor.  In nec mauris eget magna consequat
?                 -

+ pharetra tortor. In nec mauris eget magna consequat

Next, a more complex change is marked: a few words are replaced.

- convalis. Nam sed sem vitae odio pellentesque interdum. Sed
?                 - --

+ convalis. Nam cras vitae mi vitae odio pellentesque interdum. Sed
?               +++ +++++   +

The last sentence of the paragraph was changed significantly, so the difference is represented by removing the old version and adding the new one.

  consequat viverra nisl. Suspendisse arcu metus, blandit quis,
  rhoncus ac, pharetra eget, velit. Mauris urna. Morbi nonummy
  molestie orci. Praesent nisi elit, fringilla ac, suscipit non,
  tristique vel, mauris. Curabitur vel lorem id nisl porta
- adipiscing. Suspendisse eu lectus. In nunc. Duis vulputate
- tristique enim. Donec quis lectus a justo imperdiet tempus.
+ adipiscing. Duis vulputate tristique enim. Donec quis lectus a
+ justo imperdiet tempus.  Suspendisse eu lectus. In nunc.

The ndiff() function produces essentially the same output. Its processing is specifically tailored for working with text data and eliminating "noise" in the input.
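
As a rough illustration, ndiff() can be called directly on two lists of lines. The sketch below uses made-up input, and it also calls difflib.restore(), a helper that is not covered elsewhere in this section, to show that a Differ-style delta carries enough information to recover either input.

```python
import difflib

# Hypothetical inputs for illustration only.
before = ["alpha", "beta", "gamma"]
after = ["alpha", "gamma", "delta"]

delta = list(difflib.ndiff(before, after))
print("\n".join(delta))

# restore() extracts the lines that came from sequence 1 or 2 of the delta.
recovered_before = list(difflib.restore(delta, 1))
recovered_after = list(difflib.restore(delta, 2))
print(recovered_before == before, recovered_after == after)
```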

Other output formats

While the Differ class shows all of the input lines, a unified diff includes only the modified lines and a bit of context. The unified_diff() function produces this sort of output.

# difflib_unified.py

import difflib
from difflib_data import *

diff = difflib.unified_diff(text1_lines, text2_lines, lineterm='')

print('\n'.join(diff))

The lineterm argument tells unified_diff() not to append newlines to the control lines it returns, because the input lines do not include them. Newlines are added to all of the lines when they are printed. The output should look familiar to users of many popular version control tools.

$ python3 difflib_unified.py

---
+++
@@ -1,11 +1,11 @@
 Lorem ipsum dolor sit amet, consectetuer adipiscing
 elit. Integer eu lacus accumsan arcu fermentum euismod. Donec
-pulvinar porttitor tellus. Aliquam venenatis. Donec facilisis
-pharetra tortor.  In nec mauris eget magna consequat
-convalis. Nam sed sem vitae odio pellentesque interdum. Sed
+pulvinar, porttitor tellus. Aliquam venenatis. Donec facilisis
+pharetra tortor. In nec mauris eget magna consequat
+convalis. Nam cras vitae mi vitae odio pellentesque interdum. Sed
 consequat viverra nisl. Suspendisse arcu metus, blandit quis,
 rhoncus ac, pharetra eget, velit. Mauris urna. Morbi nonummy
 molestie orci. Praesent nisi elit, fringilla ac, suscipit non,
 tristique vel, mauris. Curabitur vel lorem id nisl porta
-adipiscing. Suspendisse eu lectus. In nunc. Duis vulputate
-tristique enim. Donec quis lectus a justo imperdiet tempus.
+adipiscing. Duis vulputate tristique enim. Donec quis lectus a
+justo imperdiet tempus.  Suspendisse eu lectus. In nunc.
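
The --- and +++ header lines above are empty because no file names were passed. The following sketch, using made-up input and purely illustrative file labels, shows the fromfile, tofile, and n arguments of unified_diff(), which label the two sides and control how many context lines surround each change.

```python
import difflib

# Hypothetical ten-line input with a single changed line.
before = ["line {}".format(i) for i in range(10)]
after = list(before)
after[4] = "LINE 4"

diff = list(difflib.unified_diff(
    before, after,
    fromfile="before.txt",  # illustrative labels, not real files
    tofile="after.txt",
    n=1,                    # one line of context instead of the default three
    lineterm="",
))
print("\n".join(diff))
```

With n=1 the hunk header reads @@ -4,3 +4,3 @@, covering the changed line and one neighbor on each side.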

Using context_diff() produces similar readable output.
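
A minimal sketch of context_diff() on made-up input follows; the file labels are illustrative only. In this format the old and new versions of each hunk are listed separately, and lines that changed between them are prefixed with an exclamation mark.

```python
import difflib

# Hypothetical ten-line input with a single changed line.
before = ["line {}".format(i) for i in range(10)]
after = list(before)
after[4] = "LINE 4"

diff = list(difflib.context_diff(
    before, after,
    fromfile="before.txt",  # illustrative labels, not real files
    tofile="after.txt",
    n=1,
    lineterm="",
))
print("\n".join(diff))
```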


Junk data

All of the functions that produce difference sequences accept arguments that indicate which lines should be ignored and which characters within a line should be ignored. These parameters can be used, for example, to skip over markup or whitespace changes between two versions of a file.

# difflib_junk.py

from difflib import SequenceMatcher

def show_results(match):
    # match is the Match(a, b, size) named tuple from find_longest_match()
    print('  a    = {}'.format(match.a))
    print('  b    = {}'.format(match.b))
    print('  size = {}'.format(match.size))
    i, j, k = match
    print('  A[a:a+size] = {!r}'.format(A[i:i + k]))
    print('  B[b:b+size] = {!r}'.format(B[j:j + k]))


A = ' abcd'
B = 'abcd abcd'

print('A = {!r}'.format(A))
print('B = {!r}'.format(B))

print('\nWithout junk detection:')
s1 = SequenceMatcher(None, A, B)
match1 = s1.find_longest_match(0, len(A), 0, len(B))
show_results(match1)

print('\nTreat spaces as junk:')
s2 = SequenceMatcher(lambda x: x == ' ', A, B)
match2 = s2.find_longest_match(0, len(A), 0, len(B))
show_results(match2)

By default, Differ does not explicitly ignore any lines or characters, but relies on the ability of SequenceMatcher to detect noise. The default for ndiff() is to ignore space and tab characters.

$ python3 difflib_junk.py

A = ' abcd'
B = 'abcd abcd'

Without junk detection:
  a    = 0
  b    = 4
  size = 5
  A[a:a+size] = ' abcd'
  B[b:b+size] = ' abcd'

Treat spaces as junk:
  a    = 1
  b    = 0
  size = 4
  A[a:a+size] = 'abcd'
  B[b:b+size] = 'abcd'
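
The junk filters that difflib ships with can be called directly to see what they treat as noise. The sketch below uses made-up input lines. IS_CHARACTER_JUNK is the default charjunk filter for ndiff(), while IS_LINE_JUNK is only applied when it is passed as linejunk.

```python
import difflib

# IS_CHARACTER_JUNK returns True for a space or a tab.
for ch in (" ", "\t", "a"):
    print(repr(ch), difflib.IS_CHARACTER_JUNK(ch))

# IS_LINE_JUNK returns True for blank lines and lines holding only "#".
for line in ("\n", "  #  \n", "text\n"):
    print(repr(line), difflib.IS_LINE_JUNK(line))

# Both filters can be passed to ndiff() explicitly.
delta = list(difflib.ndiff(
    ["a b\n"], ["a  b\n"],
    linejunk=difflib.IS_LINE_JUNK,
    charjunk=difflib.IS_CHARACTER_JUNK,
))
print("".join(delta), end="")
```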

Comparing arbitrary types

The SequenceMatcher class compares two sequences of any type, as long as the values are hashable. It uses an algorithm that identifies the longest contiguous matching blocks in the sequences, eliminating "junk" values that do not contribute to the real data.
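
The matching blocks can be inspected with get_matching_blocks(). The sketch below applies it to the same two integer lists used by difflib_seq.py later in this section, and also prints ratio(), a similarity score between 0 and 1.

```python
import difflib

s1 = [1, 2, 3, 5, 6, 4]
s2 = [2, 3, 5, 4, 6, 1]

matcher = difflib.SequenceMatcher(None, s1, s2)

# Each Match(a, b, size) means s1[a:a+size] == s2[b:b+size].
# The final block always has size 0 and acts as a sentinel.
for block in matcher.get_matching_blocks():
    print(block, s1[block.a:block.a + block.size])

# ratio() is 2 * (matched elements) / (total elements in both sequences).
print(round(matcher.ratio(), 3))
```

Here the blocks cover four matched elements out of twelve in total, so the ratio is 8/12.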

The get_opcodes() method returns a list of instructions for modifying the first sequence to make it match the second. Each instruction is a five-element tuple: an instruction string (the "opcode", see the table below) followed by two pairs of start and stop indexes into the sequences, i1, i2, j1, and j2.

Opcode      Definition
'replace'   Replace a[i1:i2] with b[j1:j2].
'delete'    Remove a[i1:i2] entirely.
'insert'    Insert b[j1:j2] at a[i1:i1].
'equal'     The subsequences are already equal.

# difflib_seq.py

import difflib

s1 = [1, 2, 3, 5, 6, 4]
s2 = [2, 3, 5, 4, 6, 1]

print('Initial data:')
print('s1 =', s1)
print('s2 =', s2)
print('s1 == s2:', s1 == s2)
print()

matcher = difflib.SequenceMatcher(None, s1, s2)
for tag, i1, i2, j1, j2 in reversed(matcher.get_opcodes()):

    if tag == 'delete':
        print('Remove {} from positions [{}:{}]'.format(
            s1[i1:i2], i1, i2))
        print('  before =', s1)
        del s1[i1:i2]

    elif tag == 'equal':
        print('s1[{}:{}] and s2[{}:{}] are the same'.format(
            i1, i2, j1, j2))

    elif tag == 'insert':
        print('Insert {} from s2[{}:{}] into s1 at {}'.format(
            s2[j1:j2], j1, j2, i1))
        print('  before =', s1)
        s1[i1:i2] = s2[j1:j2]

    elif tag == 'replace':
        print(('Replace {} from s1[{}:{}] '
               'with {} from s2[{}:{}]').format(
                   s1[i1:i2], i1, i2, s2[j1:j2], j1, j2))
        print('  before =', s1)
        s1[i1:i2] = s2[j1:j2]

    print('   after =', s1, '\n')

print('s1 == s2:', s1 == s2)

This example compares two lists of integers and uses get_opcodes() to derive the instructions for converting the original list into the newer version. The modifications are applied in reverse order so that the list indexes remain accurate after items are added and removed.

$ python3 difflib_seq.py

Initial data:
s1 = [1, 2, 3, 5, 6, 4]
s2 = [2, 3, 5, 4, 6, 1]
s1 == s2: False

Replace [4] from s1[5:6] with [1] from s2[5:6]
  before = [1, 2, 3, 5, 6, 4]
   after = [1, 2, 3, 5, 6, 1]

s1[4:5] and s2[4:5] are the same
   after = [1, 2, 3, 5, 6, 1]

Insert [4] from s2[3:4] into s1 at 4
  before = [1, 2, 3, 5, 6, 1]
   after = [1, 2, 3, 5, 4, 6, 1]

s1[1:4] and s2[0:3] are the same
   after = [1, 2, 3, 5, 4, 6, 1]

Remove [1] from positions [0:1]
  before = [1, 2, 3, 5, 4, 6, 1]
   after = [2, 3, 5, 4, 6, 1]

s1 == s2: True
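
Reverse order matters only because the example edits s1 in place. As an alternative sketch, the opcodes can be followed in forward order if the result is built up as a new list; apply_opcodes() below is a hypothetical helper written for this illustration, not part of difflib.

```python
import difflib


def apply_opcodes(a, b):
    """Build a new list from a by following the opcodes in forward order."""
    result = []
    matcher = difflib.SequenceMatcher(None, a, b)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == 'equal':
            result.extend(a[i1:i2])   # keep the shared run from a
        elif tag in ('replace', 'insert'):
            result.extend(b[j1:j2])   # take the new items from b
        # 'delete' contributes nothing to the result
    return result


s1 = [1, 2, 3, 5, 6, 4]
s2 = [2, 3, 5, 4, 6, 1]
print(apply_opcodes(s1, s2) == s2)
```

Because nothing is inserted into or removed from a while iterating, the indexes in each opcode stay valid throughout.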

SequenceMatcher works with custom classes as well as built-in types, as long as they are hashable.
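
A minimal sketch with a custom type follows. The Point class is hypothetical and exists only for this illustration; it is declared as a frozen dataclass so that its instances support equality and hashing.

```python
import difflib
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen=True (with the default eq=True) makes instances hashable
class Point:
    x: int
    y: int


a = [Point(0, 0), Point(1, 1), Point(2, 2)]
b = [Point(0, 0), Point(2, 2), Point(3, 3)]

opcodes = difflib.SequenceMatcher(None, a, b).get_opcodes()
for tag, i1, i2, j1, j2 in opcodes:
    print('{:7} a[{}:{}] b[{}:{}]'.format(tag, i1, i2, j1, j2))
```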


