Extract numeric values without a label from a line using regex

sygneto :

Input:

df=pd.DataFrame({'text':['value 123* 333','122* 666','722 888*']})
print(df)
             text
0  value 123* 333
1        122* 666
2        722 888*

I need to extract only the numeric values from df['text'], but without the ones labeled with *. My code:

df.text.str.extract(r'([0-9]+|[0-9]+\.[0-9]+)')

But with this code, values with the * char on the right are returned.

Expected output:

text
333
666
722

Wiktor Stribiżew :

You may use

df['text'].str.extract(r'(?=([0-9]+(?:\.[0-9]+)?))\1(?!\*)')

Alternatively, you may also require a word boundary on the left with r'\b(?=([0-9]+(?:\.[0-9]+)?))\1(?!\*)', so that a match cannot start right after a letter, digit, or underscore.
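To see what the word boundary changes, here is a small sketch using the standard re module rather than pandas. The input string 'x12 34' is a made-up example, not taken from the question: without \b the match may start right after a letter, with \b it may not.

```python
import re

# Both patterns are the ones quoted above; only the leading \b differs.
NO_BOUNDARY = re.compile(r'(?=([0-9]+(?:\.[0-9]+)?))\1(?!\*)')
WITH_BOUNDARY = re.compile(r'\b(?=([0-9]+(?:\.[0-9]+)?))\1(?!\*)')

# Hypothetical input: the first number is glued to a letter.
text = 'x12 34'

no_boundary_result = NO_BOUNDARY.search(text).group(1)
with_boundary_result = WITH_BOUNDARY.search(text).group(1)

print(no_boundary_result)    # 12 (match starts right after the letter x)
print(with_boundary_result)  # 34 (12 is skipped: no word boundary between x and 1)
```

For the three strings in the question both patterns give the same result, so the boundary only matters if digits can be attached to other word characters.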

Regex details

  • (?=([0-9]+(?:\.[0-9]+)?)) - a positive lookahead that requires and captures into Group 1 the following sequence of patterns immediately on the right:
    • [0-9]+ - 1+ digits
    • (?:\.[0-9]+)? - an optional sequence of . and 1+ digits.
  • \1 - the value of Group 1
  • (?!\*) - a negative lookahead that fails the match if, immediately to the right, there is a * char.
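The reason for the lookahead plus \1 combination may be easier to see next to a simpler pattern. The sketch below uses plain re instead of pandas. The NAIVE pattern and the first_match helper are illustrative names introduced here, not part of the original answer. With NAIVE, the engine backtracks from 123 to 12 to satisfy (?!\*). The pattern from the answer does not, because Python does not backtrack into a lookahead once it has succeeded.

```python
import re

# Pattern from the answer: the lookahead captures the whole number into Group 1,
# \1 then consumes exactly that text, and (?!\*) rejects it if a * follows.
PATTERN = re.compile(r'(?=([0-9]+(?:\.[0-9]+)?))\1(?!\*)')

# Simpler pattern for comparison (an assumption for illustration, not from the answer).
NAIVE = re.compile(r'[0-9]+(?:\.[0-9]+)?(?!\*)')

samples = ['value 123* 333', '122* 666', '722 888*']


def first_match(regex, text):
    """Return the text of the first match, or None if nothing matches."""
    m = regex.search(text)
    return m.group(0) if m else None


answer_results = [first_match(PATTERN, s) for s in samples]
naive_results = [first_match(NAIVE, s) for s in samples]

print(answer_results)  # ['333', '666', '722']
print(naive_results)   # ['12', '12', '722'] (backtracking gives up a digit)
```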

See the Python test:

>>> import pandas as pd
>>> df = pd.DataFrame({'text': ['value 123* 333', '122* 666', '722 888*']})
>>> df['text'].str.extract(r'(?=([0-9]+(?:\.[0-9]+)?))\1(?!\*)', expand=False)
0    333
1    666
2    722
Name: text, dtype: object
