I want to extract some questions and answers from some files I'm reading but my regex isn't working for me:
from re import findall,DOTALL
text='''
category 1
1. question
a) answer
b) answer
2. question
a) answer
b) answer
category 2
3. question
a) answer
b) answer
'''
The format in the files is basically a numbered list with a variable number of indexed answers like a) b) or a. b. ..., with the answers spanning several lines in places. I've tried this:
mo=findall(r"^\d\.(.+)(\w\)|\.(.+))+$",text,DOTALL)
print(mo)
I tried putting in capture groups to separate the questions from the answers. Removing "^" gives the closest result, but it's still junk and I don't understand why this happens:
[(' question\na) answer\nb) answer\n2. question\na) answer\nb) answer\ncategory 2\n3', '. question\na) answer\nb) answer\n', ' question\na) answer\nb) answer\n')]
I'm considering looking for a space between the answers in order to not pick up the "category" junk as a part of the answer or controlling my input more to support a format with no space as well.
I'm trying to get an output like this (it doesn't need to be a tuple, that's just what findall groups return):
[('question', 'answer', 'answer'),
('question', 'answer', 'answer'),
('question', 'answer', 'answer')]
Instead of writing a monster regex for a multiline requirement, I'd use normal iteration and accumulation, more or less. You can split on /\n(?=[a-z\d][).] )/gm to extract the Q&A content only. Then iterate over the chunks: if a chunk is a question, start a new Q&A block; otherwise append it to the current block to accumulate the result.
import re
text = '''
category 1
1. q1
 q1 foobar
a) a1.a
b) a1.b
 some extra a1.b
2. q2
a) a2.a
b) a2.b
 some extra a2.b
c) a2.c
 blah a2.c

category 2
3. q3
a) a3.a
b) a3.b
extra a3.b
'''
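qa = []
block = []
# Split only at newlines that begin a new "1. " or "a) " item (Python 3.8+ for :=).
for chunk in re.split(r"\n(?=[a-z\d][).] )", text):
    if m := re.match(r"\d+\. (.+)", chunk, re.S):
        # A question: store the previous block and start a new one.
        qa.append(tuple(block))
        block = [m.group(1)]
    elif m := re.match(r"[a-z]+\) (.+?)(?=\n\n|$|[a-z]+\) )", chunk, re.S):
        # An answer: add it to the current question's block.
        block.append(m.group(1))
# Drop the empty block stored before the first question and add the last block.
qa = qa[1:] + [tuple(block)]
for line in qa:
    print(line)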
Gives:
('q1\n q1 foobar', 'a1.a', 'a1.b\n some extra a1.b')
('q2', 'a2.a', 'a2.b\n some extra a2.b', 'a2.c\n blah a2.c')
('q3', 'a3.a', 'a3.b\nextra a3.b')
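To check this against the sample from the question, the same loop can be wrapped in a function. This is only a sketch: `extract_qa` and `sample` are names made up for illustration, and the sample assumes a blank line before "category 2", because the answer regex relies on a double newline to cut off the category header. Without that blank line, the header would be absorbed into the preceding answer.

```python
import re

def extract_qa(text):
    """Return a list of (question, answer, ...) tuples (hypothetical helper)."""
    qa = []
    block = None  # None until the first question has been seen
    for chunk in re.split(r"\n(?=[a-z\d][).] )", text):
        if m := re.match(r"\d+\. (.+)", chunk, re.S):
            if block is not None:
                qa.append(tuple(block))
            block = [m.group(1)]
        elif block is not None and (
            m := re.match(r"[a-z]+\) (.+?)(?=\n\n|$|[a-z]+\) )", chunk, re.S)
        ):
            block.append(m.group(1))
    if block is not None:
        qa.append(tuple(block))
    return qa

# Assumption: a blank line precedes each category header after the first.
sample = '''
category 1
1. question
a) answer
b) answer
2. question
a) answer
b) answer

category 2
3. question
a) answer
b) answer
'''

for row in extract_qa(sample):
    print(row)
# ('question', 'answer', 'answer')
# ('question', 'answer', 'answer')
# ('question', 'answer', 'answer')
```

Using `None` as the "no question yet" marker avoids the `qa[1:]` slice, at the cost of two extra checks.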
Regex explanations:
1. /\n(?=[a-z\d][).] )/gs does the splitting on newlines that look ahead to either of the two a) or 1. patterns. This enables us to preserve the multiline chunks.
2. /\d+\. (.+)/gs lets us identify a 1. question chunk and capture the question body.
3. /[a-z]+\) (.+?)(?=\n\n|$|[a-z]+\) )/gs matches the a) answer chunk. It's pretty much the same as the 1. question chunk above, but it has to be a bit careful to trim the next content header, which wasn't handled by regex (1) above. This is what the (?=\n\n|$|[a-z]+\) ) lookahead does: if the following is a double newline, end of string or a) , then don't include it in this answer.
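To see the three patterns in isolation, here is a small sketch on a made-up string (`snippet` and the variable names are just for illustration). It shows that the split keeps continuation lines attached to their item, and that the lookahead in the answer pattern stops the capture before the blank line and the category header.

```python
import re

snippet = "1. q1\n more q1\na) first\nb) second\n still second\n\ncategory 2"

# Split regex: break only at newlines followed by "1. " or "a) " style markers.
chunks = re.split(r"\n(?=[a-z\d][).] )", snippet)
print(chunks)
# ['1. q1\n more q1', 'a) first', 'b) second\n still second\n\ncategory 2']

# Question regex: with re.S the dot also matches newlines, so the
# continuation line is part of the captured question body.
question = re.match(r"\d+\. (.+)", chunks[0], re.S).group(1)
print(repr(question))
# 'q1\n more q1'

# Answer regex: the lazy (.+?) expands until the lookahead sees a blank line.
trimmed = re.match(r"[a-z]+\) (.+?)(?=\n\n|$|[a-z]+\) )", chunks[2], re.S).group(1)
print(repr(trimmed))
# 'second\n still second'

# For contrast, a greedy capture without the lookahead keeps the header.
untrimmed = re.match(r"[a-z]+\) (.+)", chunks[2], re.S).group(1)
print(repr(untrimmed))
# 'second\n still second\n\ncategory 2'
```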