Subset string rows that contain a 'flexible' pattern

prp :

I have the following df.

import pandas as pd

data = [
    ['DWWWWD'],
    ['DWDW'],
    ['WDWWWWWWWWD'],
    ['DDW'],
    ['WWD'],
]

df = pd.DataFrame(data, columns=['letter_sequence'])

I want to subset the rows that contain the pattern 'D' + [any number of W's, at least one] + 'D'. Examples of rows I want in my output df: DWD, DWWWWWWWWWWWD, WWWWWDWDW...

I came up with the following, but it only handles the W counts I list explicitly, not an arbitrary number of W's.

df[df['letter_sequence'].str.contains(
    'DWD|DWWD|DWWWD|DWWWWD|DWWWWWD|DWWWWWWD|DWWWWWWWD|DWWWWWWWWD', regex=True
)]

Desired output new_df:

    letter_sequence
0   DWWWWD
1   DWDW
2   WDWWWWWWWWD

Any alternatives?

jezrael :

Use [W]{1,} to match one or more W. regex=True is the default, so it can be omitted:

df = df[df['letter_sequence'].str.contains('D[W]{1,}D')]
print(df)
  letter_sequence
0          DWWWWD
1            DWDW
2     WDWWWWWWWWD
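
As a minimal, self-contained sketch of the same idea, the pattern can also be written with the + quantifier, which means "one or more" just like {1,}. The names mask and new_df are only illustrative; the data is the sample from the question.

```python
import pandas as pd

df = pd.DataFrame(
    {"letter_sequence": ["DWWWWD", "DWDW", "WDWWWWWWWWD", "DDW", "WWD"]}
)

# 'DW+D' reads as: a D, then one or more W, then another D.
# It matches the same strings as 'D[W]{1,}D'.
mask = df["letter_sequence"].str.contains(r"DW+D")

# Keep only the rows where the pattern occurs somewhere in the string.
new_df = df[mask]
print(new_df)
```

Building the boolean mask separately keeps the original df intact. If the column may contain missing values, passing na=False to str.contains should treat those rows as non-matches.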
