Extract questions and answers with regex

Filip :

I want to extract some questions and answers from some files I'm reading but my regex isn't working for me:

from re import findall,DOTALL

text='''
category 1
1. question
a) answer
b) answer
2. question
a) answer
b) answer

category 2
3. question
a) answer
b) answer
'''

The format in the files is basically a numbered list with a variable number of indexed answers like a) b) or a. b. ... with the answers spanning several lines in places. I've tried this:

mo=findall(r"^\d\.(.+)(\w\)|\.(.+))+$",text,DOTALL)
print(mo)

I tried adding capture groups to separate the questions from the answers. Removing "^" gives the closest result, but it's still junk and I don't understand why this happens:

[(' question\na) answer\nb) answer\n2. question\na) answer\nb) answer\ncategory 2\n3', '. question\na) answer\nb) answer\n', ' question\na) answer\nb) answer\n')]

I'm considering looking for a space between the answers in order to not pick up the "category" junk as a part of the answer or controlling my input more to support a format with no space as well.

I'm trying to get an output like this (it doesn't need to be a tuple; that's just what findall groups return):

[('question', 'answer', 'answer'), 
 ('question', 'answer', 'answer'), 
 ('question', 'answer', 'answer')]

ggorlen :

Instead of writing a monster regex for a multiline requirement, I'd use plain iteration and accumulation. Split on /\n(?=[a-z\d][).] )/gs so that each question or answer, together with its continuation lines, ends up in its own chunk. Then iterate over the chunks: a question chunk starts a new Q&A block, an answer chunk is appended to the current block, and anything else is skipped.
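To see the split step on its own, here is a minimal sketch on a shortened, made-up sample (the name sample and its contents are assumptions for illustration, not taken from the real files). Continuation lines stay attached to their chunk because the newline in front of them is not followed by a marker:

```python
import re

# Hypothetical, shortened sample in the same format as the question.
sample = "category 1\n1. q1\n  more q1\na) a1\nb) a1b\n  more a1b\n"

# Split on a newline only when the next line starts with a marker such as
# "1. " or "a) "; the lookahead keeps the marker in the following chunk.
chunks = re.split(r"\n(?=[a-z\d][).] )", sample)

print(chunks)
# ['category 1', '1. q1\n  more q1', 'a) a1', 'b) a1b\n  more a1b\n']
```

Note that the first chunk is the category line, which is why the loop has to ignore chunks that are neither questions nor answers.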

import re

text = '''
category 1
1. q1
  q1 foobar
a) a1.a
b) a1.b
  some extra a1.b
2. q2
a) a2.a
b) a2.b
  some extra a2.b
c) a2.c
 blah a2.c

category 2
3. q3
a) a3.a
b) a3.b
extra a3.b
'''

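# Requires Python 3.8+ for the := operator.
qa = []      # finished Q&A tuples
block = []   # the block currently being accumulated

for chunk in re.split(r"\n(?=[a-z\d][).] )", text):
    # A question chunk closes the previous block and starts a new one.
    if m := re.match(r"\d+\. (.+)", chunk, re.S):
        qa.append(tuple(block))
        block = [m.group(1)]
    # An answer chunk is appended to the current block.
    elif m := re.match(r"[a-z]+\) (.+?)(?=\n\n|$|[a-z]+\) )", chunk, re.S):
        block.append(m.group(1))

# Drop the empty tuple appended before the first question, then add the last block.
qa = qa[1:] + [tuple(block)]

for line in qa:
    print(line)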

Gives:

('q1\n  q1 foobar', 'a1.a', 'a1.b\n  some extra a1.b')
('q2', 'a2.a', 'a2.b\n  some extra a2.b', 'a2.c\n blah a2.c')
('q3', 'a3.a', 'a3.b\nextra a3.b')
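If this needs to be reused across files, the same idea could be wrapped in a function. The following is only a sketch under a few assumptions that go beyond the code above: the name parse_qa and the three pattern constants are made up for illustration, question numbers may have several digits (the split pattern above only looks at a single character before the dot, so "10. " would not start a new chunk), and answers may be marked either a) or a. as mentioned in the question:

```python
import re

# Assumed markers: a question starts with digits and a dot ("12. "); an
# answer starts with one lowercase letter and ")" or "." ("a) " or "a. ").
SPLIT = re.compile(r"\n(?=(?:\d+\.|[a-z][).]) )")
QUESTION = re.compile(r"\d+\. (.+)", re.S)
# Stop at a blank line or the end of the chunk so that a trailing
# category header is not swallowed into the answer.
ANSWER = re.compile(r"[a-z][).] (.+?)(?=\n\n|$)", re.S)


def parse_qa(text):
    """Return one tuple per question: (question, answer, answer, ...)."""
    blocks = []
    for chunk in SPLIT.split(text):
        if m := QUESTION.match(chunk):
            # A question chunk starts a new block.
            blocks.append([m.group(1).strip()])
        elif blocks and (m := ANSWER.match(chunk)):
            # An answer chunk extends the most recent block.
            blocks[-1].append(m.group(1).strip())
    return [tuple(b) for b in blocks]


demo = "\ncategory 1\n1. q1\na) x\nb. y\n  more y\n\ncategory 2\n10. q10\na) z\n"
print(parse_qa(demo))
# [('q1', 'x', 'y\n  more y'), ('q10', 'z')]
```

Starting a fresh list for each question avoids the placeholder tuple that the loop above has to slice off afterwards; whether that is worth the extra function is a matter of taste.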

Regex explanations:

  • /\n(?=[a-z\d][).] )/gs splits on each newline whose lookahead sees one of the two marker patterns, a) or 1., at the start of the next line. This preserves the multiline chunks.
  • /\d+\. (.+)/gs identifies a 1. question chunk and captures the question body.
  • /[a-z]+\) (.+?)(?=\n\n|$|[a-z]+\) )/gs matches an a) answer chunk. It's much the same as the 1. question pattern above, but it has to trim any trailing content, such as the next category header, that the splitting regex (the first one above) leaves attached. That is what the (?=\n\n|$|[a-z]+\) ) lookahead does: the answer stops before a double newline, the end of the string, or another a) marker.
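A small check of the answer pattern's lookahead, as a sketch on made-up chunks (the variable names and the second chunk are assumptions for illustration). The first chunk shows the category header being trimmed. The second shows a side effect worth knowing about: because the [a-z]+\) alternative is not tied to the start of a line, an answer that happens to contain a word followed by ") " is cut short at that point:

```python
import re

answer = re.compile(r"[a-z]+\) (.+?)(?=\n\n|$|[a-z]+\) )", re.S)

# The last answer of a category still has the next header attached
# after the split; the \n\n alternative in the lookahead trims it.
with_header = "c) a2.c\n blah a2.c\n\ncategory 2"
trimmed = answer.match(with_header).group(1)
print(repr(trimmed))    # 'a2.c\n blah a2.c'

# Hypothetical answer text with an inline "word) " sequence: the match
# stops right before "second) ", so the rest of the answer is lost.
inline_paren = "a) first (or second) choice"
truncated = answer.match(inline_paren).group(1)
print(repr(truncated))  # 'first (or '
```

If answers in the real files can contain parentheses like this, dropping the third alternative from the lookahead may be safer, since the split has already separated the answers from each other.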
