In a pandas dataframe I need to extract text between square brackets and output that text as a new column. I need to do this at the "StudyID" level and create new rows for each bit of text extracted.
Here is a simplified example dataframe:
import pandas as pd

data = {
    "studyid": ['101', '101', '102', '103'],
    "Question": ["Q1", "Q2", "Q1", "Q3"],
    "text": ['I love [Bananas] and also [oranges], and [figs]',
             'Yesterday I ate [Apples]',
             '[Grapes] are my favorite fruit',
             '[Mandarins] taste like [oranges] to me'],
}
df2 = pd.DataFrame(data)
I worked out a solution (see the code below; running it produces the desired output), but it is long and takes many steps. Is there a much shorter way of doing this?
You will see that I used str.findall() for the regex. I originally tried str.extractall(), which returns the extracted text as a dataframe, but I didn't know how to include the "studyid" and "Question" columns in the dataframe generated by extractall(), so I resorted to str.findall().
Here is my code (I know it is clunky). How can I reduce the number of steps? Thanks in advance for your help!
# Step 1: Use regex to pull out the text between the square brackets
df3 = pd.DataFrame(df2['text'].str.findall(r"(?<=\[)([^]]+)(?=\])").tolist())
# Step 2: Merge the extracted text back with the original data
df3 = df2.merge(df3, left_index=True, right_index=True)
# Step 3: Transpose the wide file to a long file (e.g. panel)
df4 = pd.melt(df3, id_vars=['studyid', 'Question'], value_vars=[0, 1, 2])
# Step 4: Delete rows with None in the value column
indexNames = df4[df4['value'].isnull()].index
df4.drop(indexNames , inplace=True)
# Step 5: Sort the data by the StudyID and Question
df4.sort_values(by=['studyid', 'Question'], inplace=True)
# Step 6: Drop unwanted columns
df4.drop(['variable'], axis=1, inplace=True)
# Step 7: Reset the index and drop the old index
df4.reset_index(drop=True, inplace=True)
df4
If you can assign the output of Series.str.findall back to the column, use DataFrame.explode to turn each list element into its own row, then DataFrame.reset_index with drop=True to get a unique index:
df2['text'] = df2['text'].str.findall(r"(?<=\[)([^]]+)(?=\])")
df4 = df2.explode('text').reset_index(drop=True)
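One detail worth checking with the findall/explode approach is what happens to rows that contain no bracketed text. The sketch below is a minimal, self-contained illustration on a made-up two-row frame (the frame, the simplified pattern without lookarounds, and the variable names are assumptions for illustration, not part of the question's data). Series.str.findall returns an empty list for such rows, and DataFrame.explode turns an empty list into NaN, so a dropna step may be needed if those rows should disappear:

```python
import pandas as pd

# Hypothetical sample frame; the second row has no bracketed text
df = pd.DataFrame({
    "studyid": ["101", "102"],
    "Question": ["Q1", "Q2"],
    "text": ["I love [Bananas] and [figs]", "no brackets here"],
})

# findall returns one list per row; a row without matches gets an empty list
matches = df["text"].str.findall(r"\[([^\]]+)\]")

# explode gives each list element its own row; empty lists become NaN
out = df.assign(text=matches).explode("text")

# Drop rows that had no bracketed text, then rebuild a unique index
out = out.dropna(subset=["text"]).reset_index(drop=True)
print(out)
```

Using assign here leaves the original frame untouched, which may be preferable to overwriting the text column in place.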
Alternatively, use Series.str.extractall, remove the second level of the resulting MultiIndex, and then use DataFrame.join to append the matches to the original dataframe:
s = (df2.pop('text').str.extractall(r"(?<=\[)([^]]+)(?=\])")[0]
        .reset_index(level=1, drop=True)
        .rename('text'))
df4 = df2.join(s).reset_index(drop=True)
print(df4)
  studyid Question       text
0     101       Q1    Bananas
1     101       Q1    oranges
2     101       Q1       figs
3     101       Q2     Apples
4     102       Q1     Grapes
5     103       Q3  Mandarins
6     103       Q3    oranges
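To see why the extractall route needs the extra index step, it can help to look at the intermediate result. The following is a rough sketch on a small made-up frame (the frame, the named group "fruit", and the variable names are assumptions for illustration). Series.str.extractall returns a DataFrame whose index has two levels: the original row label and a level named "match" that counts matches within each row. Once that second level is removed, the index lines up with the original frame again and a join on the index brings back "studyid" and "Question":

```python
import pandas as pd

# Hypothetical sample frame for illustration
df = pd.DataFrame({
    "studyid": ["101", "102"],
    "Question": ["Q1", "Q2"],
    "text": ["I love [Bananas] and [figs]", "[Grapes] are my favorite"],
})

# A named group becomes the column name, so no rename step is needed
found = df["text"].str.extractall(r"\[(?P<fruit>[^\]]+)\]")
print(found.index.names)  # two levels: the original row label and 'match'

# Remove the per-row match counter so the index lines up with df again
found = found.droplevel("match")

# Join on the index; each original row repeats once per match
result = df.drop(columns="text").join(found).reset_index(drop=True)
print(result)
```

Because join defaults to a left join, a row with no matches would be kept with NaN in the new column; passing how="inner" to join should drop such rows instead.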