Capture ALL strings within a Python script with regex

iperetta :

This question was inspired by my failed attempts to adapt this answer: RegEx: Grabbing values between quotation marks

Consider the following Python script (t.py):

print("This is also an NL test")
variable = "!\n"
print('And this has an escaped quote "don\'t"  in it ', variable,
      "This has a single quote ' but doesn\'t end the quote as it" + \
      " started with double quotes")
if "Foo Bar" != '''Another Value''':
    """
    This is just nonsense
    """
    aux = '?'
    print("Did I \"failed\"?", f"{aux}")

I want to capture all strings in it, as:

  • This is also an NL test
  • !\n
  • And this has an escaped quote "don\'t" in it
  • This has a single quote ' but doesn\'t end the quote as it
  • started with double quotes
  • Foo Bar
  • Another Value
  • This is just nonsense
  • ?
  • Did I \"failed\"?
  • {aux}

I wrote another Python script using the re module and, of all my regex attempts, the one that finds the most strings is:

import re
pattern = re.compile(r"""(?<=(["']\b))(?:(?=(\\?))\2.)*?(?=\1)""")
with open('t.py', 'r') as f:
    msg = f.read()
x = pattern.finditer(msg, re.DOTALL)
for i, s in enumerate(x):
    print(f'[{i}]',s.group(0))

with the following result:

  • [0] And this has an escaped quote "don\'t" in it
  • [1] This has a single quote ' but doesn\'t end the quote as it started with double quotes
  • [2] Foo Bar
  • [3] Another Value
  • [4] Did I \"failed\"?

To add to my failures, I also couldn't fully replicate what I found with regex101.com.

I'm using Python 3.6.9, by the way, and I'm asking for more insights into regex to crack this one.

CertainPerformance :

Because you want to match ''' or """ or ' or " as the delimiter, put all of that into the first group:

('''|"""|["'])
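The order of the alternatives matters here: the triple-quote delimiters have to be tried before the single-character class, otherwise ''' is read as an empty '' string followed by a new opening quote. A minimal sketch, assuming a simplified (.*?) body; the names sample, triple_first and single_first are made up for illustration:

```python
import re

# Hypothetical one-line sample modeled on the "if" line of t.py.
sample = "if \"Foo Bar\" != '''Another Value''':"

# Triple-quote delimiters listed first, as in the group above.
triple_first = re.compile(r"""('''|\"""|["'])(.*?)\1""")
# Same alternatives with the single-character class listed first.
single_first = re.compile(r"""(["']|'''|\""")(.*?)\1""")

good = [m.group(2) for m in triple_first.finditer(sample)]
bad = [m.group(2) for m in single_first.finditer(sample)]

print(good)  # ['Foo Bar', 'Another Value']
print(bad)   # ['Foo Bar', '', 'Another Value', '']
```

With the single-character class first, the engine never gets a reason to try the longer delimiters, because the shorter one already produces a successful match.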

Don't put \b after it, because then it won't match strings when those strings start with something other than a word character.
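To see the effect of \b, here is a small sketch comparing the same pattern with and without it. It assumes a simplified (.*?) body and a three-line sample made up for illustration:

```python
import re

# Hypothetical sample: two strings start with a non-word character.
sample = r'''variable = "!\n"
aux = '?'
name = "Foo Bar"
'''

with_boundary = re.compile(r"""(["'])\b(.*?)\1""")
without_boundary = re.compile(r"""(["'])(.*?)\1""")

found_with = [m.group(2) for m in with_boundary.finditer(sample)]
found_without = [m.group(2) for m in without_boundary.finditer(sample)]

print(found_with)     # ['Foo Bar']
print(found_without)  # ['!\\n', '?', 'Foo Bar']
```

There is no word boundary between a quote and ! or ?, since neither side is a word character, so those strings are skipped as soon as \b is required.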

Because you want to make sure that the final delimiter isn't treated as a starting delimiter when the engine starts the next iteration, you'll need to fully match it (not just lookahead for it).
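A quick way to see why the closing delimiter has to be consumed is a sketch like the following, which assumes a simplified (.*?) body and a made-up one-line sample:

```python
import re

# Hypothetical sample with two adjacent string literals on one line.
sample = 'if "Foo Bar" != "Baz":'

# Closing delimiter only asserted with a lookahead, not consumed.
lookahead_only = re.compile(r"""(["'])(.*?)(?=\1)""")
# Closing delimiter consumed by the backreference.
consuming = re.compile(r"""(["'])(.*?)\1""")

leaky = [m.group(2) for m in lookahead_only.finditer(sample)]
clean = [m.group(2) for m in consuming.finditer(sample)]

print(leaky)  # ['Foo Bar', ' != ', 'Baz']
print(clean)  # ['Foo Bar', 'Baz']
```

In the lookahead version, the quote that closes "Foo Bar" is still unconsumed, so the next iteration treats it as an opening quote and captures the code between the two strings.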

The middle part to match anything but the delimiter can be:

((?:\\.|.)*?)
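The \\. alternative is what keeps an escaped quote from ending the string: a backslash and the character after it are always consumed as a pair. A minimal sketch, using the last line of t.py as the sample and a plain .*? body for comparison (the names naive and escape_aware are made up for illustration):

```python
import re

# Last line of t.py, kept raw so the backslashes stay in the text.
sample = r'''print("Did I \"failed\"?", f"{aux}")'''

# Plain lazy body: stops at the first quote, escaped or not.
naive = re.compile(r"""(["'])(.*?)\1""")
# Body from the answer: a backslash always takes the next character with it.
escape_aware = re.compile(r"""(["'])((?:\\.|.)*?)\1""")

naive_hits = [m.group(2) for m in naive.finditer(sample)]
aware_hits = [m.group(2) for m in escape_aware.finditer(sample)]

print(naive_hits)  # ['Did I \\', '?', '{aux}']
print(aware_hits)  # ['Did I \\"failed\\"?', '{aux}']
```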

Put it all together:

('''|"""|["'])((?:\\.|.)*?)\1

and the result you want will be in the second capture group:

import re

# (?s) lets . match newlines; group 1 is the delimiter, group 2 is the string body
pattern = re.compile(r"""(?s)('''|\"""|["'])((?:\\.|.)*?)\1""")
with open('t.py', 'r') as f:
    msg = f.read()
x = pattern.finditer(msg)
for i, s in enumerate(x):
    print(f'[{i}]', s.group(2))
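Note that the flag goes into the pattern here (inline as (?s), or as an argument to re.compile) and is not passed to finditer. On a compiled pattern, the second argument of finditer is the start position pos, so pattern.finditer(msg, re.DOTALL) in the question's script begins searching at index 16 and never examines the opening quote of the first string. A small sketch of the difference, using the first two lines of t.py as an inline sample (the variable names are made up for illustration):

```python
import re

# First two lines of t.py as an inline sample.
sample = 'print("This is also an NL test")\nvariable = "!\\n"\n'
PATTERN = r"""('''|\"""|["'])((?:\\.|.)*?)\1"""

# re.DOTALL is an integer flag; as a second argument to finditer it is read as pos.
dotall_value = int(re.DOTALL)

# Flag passed to finditer: the search starts at index 16 and DOTALL is NOT enabled.
as_pos = [m.group(2) for m in re.compile(PATTERN).finditer(sample, re.DOTALL)]

# Flag passed to re.compile: the search starts at 0 and . also matches newlines.
as_flag = [m.group(2) for m in re.compile(PATTERN, re.DOTALL).finditer(sample)]

print(dotall_value)  # 16
print(as_pos)        # ['!\\n']
print(as_flag)       # ['This is also an NL test', '!\\n']
```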

https://regex101.com/r/dvw0Bc/1
