Regex: remove the text after a slash only when it is more than one word

Juan Perez :

How can I remove the text after a slash only when it contains more than one word? Specifically, consider the following string:

    0      1     2        0       1      2   3   
 CENTRAL CARE HOSPITAL/HOPITAL CENTRALE DE SOINS

All the characters after the slash should be removed because there are 4 words after it (HOPITAL, CENTRALE, DE, SOINS) and the limit is just one. The result is then: CENTRAL CARE HOSPITAL

On the other hand, we have the following string:

   0     1     2    3  0
HAPPY SPRING BREAK 20/20

This time the 20 after the slash has to be kept because it is just one word (\b[A-Za-z0-9]+\b). The slash itself should then be replaced by a space. The result should look like this: HAPPY SPRING BREAK 20 20

Suppose the following test set:

CENTRAL CARE HOSPITAL/HOPITAL CENTRALE DE SOINS
ELEMENTARY/INSTITUTION
FOUNDATION INSTITUTION/FUNDATION DEL INSTITUTO
HAPPY SPRING BREAK 20/20

The result should be the following:

CENTRAL CARE HOSPITAL
ELEMENTARY INSTITUTION
FOUNDATION INSTITUTION
HAPPY SPRING BREAK 20 20

Overall, keep the text after the slash only when it is a single word, and put a space where the slash was. Otherwise, remove the slash and everything after it.

I have tried this regex so far, but it is not working: (?:[\/])([A-Z0-9]*\b)(?!\b[A-Z]*)|[^\/]*$

Thanks

Wiktor Stribiżew :

You may use

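import re
# Group 1 is optional: it matches only when 2+ words follow the slash.
rx = r'/(\w+(?:\W+\w+)+\W*$)?'
strs = ['CENTRAL CARE HOSPITAL/HOPITAL CENTRALE DE SOINS','ELEMENTARY/INSTITUTION','FOUNDATION INSTITUTION/FUNDATION DEL INSTITUTO','HAPPY SPRING BREAK 20/20']
for s in strs:
    # Group 1 matched: drop the slash and the tail; otherwise turn the slash into a space.
    print( re.sub(rx, lambda x: "" if x.group(1) else " ", s) )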

Output:

CENTRAL CARE HOSPITAL
ELEMENTARY INSTITUTION
FOUNDATION INSTITUTION
HAPPY SPRING BREAK 20 20
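
As a cross-check on the output above, the same rule can be sketched without a regular expression. This is only an illustration, not part of the original answer: the helper name strip_after_slash is hypothetical, it looks at the first slash only, and it counts words by splitting on whitespace, so a tail such as B-C counts as one word here whereas the regex would see two.

```python
def strip_after_slash(text):
    # Hypothetical helper: applies the rule at the first slash only.
    head, slash, tail = text.partition("/")
    if not slash:
        return text  # no slash: leave the string untouched
    if len(tail.split()) > 1:
        return head  # two or more words after the slash: drop slash and tail
    return f"{head} {tail}"  # single word after the slash: slash becomes a space


tests = [
    "CENTRAL CARE HOSPITAL/HOPITAL CENTRALE DE SOINS",
    "ELEMENTARY/INSTITUTION",
    "FOUNDATION INSTITUTION/FUNDATION DEL INSTITUTO",
    "HAPPY SPRING BREAK 20/20",
]
for s in tests:
    print(strip_after_slash(s))
```

On the four test strings this should print the same lines as the regex version.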

The regex is /(\w+(?:\W+\w+)+\W*$)?. It matches:

  • / - a slash
  • (\w+(?:\W+\w+)+\W*$)? - an optional capturing group #1 that matches
    • \w+ - 1+ word chars
    • (?:\W+\w+)+ - 1+ sequences of 1+ non-word chars followed by 1+ word chars
    • \W* - zero or more non-word chars
    • $ - end of string.
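
The breakdown above can also be written directly into the pattern with re.VERBOSE, which makes it easier to see why the replacement callback tests group 1. The sketch below is illustrative: the names PATTERN and clean are made up for this example, and the pattern itself is the one from the answer, only laid out over several lines.

```python
import re

# Same pattern as in the answer, laid out with re.VERBOSE for readability.
PATTERN = re.compile(r"""
    /                 # the slash itself
    (                 # optional group 1: a tail of two or more words
        \w+           #   first word after the slash
        (?:\W+\w+)+   #   one or more further words
        \W*$          #   optional trailing non-word chars, then end of string
    )?
""", re.VERBOSE)


def clean(text):
    # Group 1 is set only when 2+ words follow the slash: remove slash and tail.
    # Otherwise only the slash itself matched: replace it with a space.
    return PATTERN.sub(lambda m: "" if m.group(1) else " ", text)


for s in ["ELEMENTARY/INSTITUTION", "FOUNDATION INSTITUTION/FUNDATION DEL INSTITUTO"]:
    m = PATTERN.search(s)
    # Show what group 1 captured (None when the tail is a single word).
    print(repr(m.group(1)), "->", clean(s))
```

For the single-word tail, group 1 does not participate in the match and m.group(1) is None, so the callback returns a space. For the multi-word tail, group 1 holds the whole tail and the callback returns an empty string.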
