prp :
I have the following df.
data = [
['DWWWWD'],
['DWDW'],
['WDWWWWWWWWD'],
['DDW'],
['WWD'],
]
df = pd.DataFrame(data, columns=['letter_sequence'])
I want to subset the rows that contain the pattern 'D' + '[whichever number of W's]' + 'D'
. Examples of rows I want in my output df: DWD
, DWWWWWWWWWWWD
, WWWWWDWDW
...
I came up with the following, but it does not really work for 'whichever number of W's'.
df[df['letter_sequence'].str.contains(
'DWD|DWWD|DWWWD|DWWWWD|DWWWWWD|DWWWWWWD|DWWWWWWWD|DWWWWWWWWD', regex=True
)]
Desired output new_df
:
letter_sequence
0 DWWWWD
1 DWDW
2 WDWWWWWWWWD
Any alternatives?
jezrael :
Use [W]{1,}
for one or more W
, regex=True
is by default, so should be omit:
df = df[df['letter_sequence'].str.contains('D[W]{1,}D')]
print (df)
letter_sequence
0 DWWWWD
1 DWDW
2 WDWWWWWWWWD