How to remove Duplicate SubStrings

Codious-JR :

I have a portion of a string which duplicates itself, and I want to remove all duplicate "substrings" without losing the order of the words in the string.

For example: "12 PL DE LA HALLE BP 425 BRIVE-LA-GAILLARDE BP 425 BRIVE-LA-GAILLARDE BP 425 BRIVE-LA-GAILLARDE BP 425 BRIVE-LA-GAILLARDE"

Here "BP 425 BRIVE-LA-GAILLARDE" repeats itself 4 times.

I would like the string to finally be "12 PL DE LA HALLE BP 425 BRIVE-LA-GAILLARDE" where the duplicates have been removed.

This problem occurs when one of my generic scraper modules collects all the text from a certain HTML element. In that element the same information is repeated multiple times but hidden using CSS. This is why I am looking for a generic way of de-duplicating substrings.

More examples of duplicated substrings:

 "TOUR SOCIETE SUISSE 1 BD VIVIER MERLE 1 BD VIVIER MERLE"
      => "TOUR SOCIETE SUISSE 1 BD VIVIER MERLE"

 "2 PARC DES ERABLES 66 RTE DE SARTROUVILLE 66 RTE DE SARTROUVILLE"
      => "2 PARC DES ERABLES 66 RTE DE SARTROUVILLE"

 "CASERNE AUDEOUD 111 AV DE LA CORSE 111 AV DE LA CORSE"
      => "CASERNE AUDEOUD 111 AV DE LA CORSE"

The simple approach of never repeating the same word does not work here, because a word can legitimately occur more than once without being part of a duplicated substring. For example, in "12 PL DE LA HALLE BP 425 BRIVE LA GAILLARDE BP 425 BRIVE LA GAILLARDE", the "LA" between BRIVE and GAILLARDE would be removed.

The output would then be "12 PL DE LA HALLE BP 425 BRIVE GAILLARDE", whereas the desired output is "12 PL DE LA HALLE BP 425 BRIVE LA GAILLARDE".

My hunch is that one would need to compare sequences of words, but I am not sure exactly how.

Any help appreciated.

Tim Biegeleisen :

Here is a potentially viable regex-based solution:

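import re

inp = "12 PL DE LA HALLE  BP 425 BRIVE-LA-GAILLARDE  BP 425 BRIVE-LA-GAILLARDE  BP 425 BRIVE-LA-GAILLARDE  BP 425 BRIVE-LA-GAILLARDE"
# Repeat the substitution until a pass no longer changes the string.
while True:
    # Collapse "PHRASE PHRASE" into a single "PHRASE".
    out = re.sub(r'(?<!\S)(\S+(?:\s\S+)*)\s+\1(?!\S)', r'\1', inp)
    if out == inp:
        break
    inp = out
print(out)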

This prints:

12 PL DE LA HALLE  BP 425 BRIVE-LA-GAILLARDE

The idea here is to match any phrase which is followed by the same phrase, and then to replace with just the first captured phrase.
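The same idea can also be sketched without a regex, by comparing runs of words directly. This is an alternative technique, not the regex approach itself, and the function name collapse_adjacent_runs is made up for illustration. It also normalizes all whitespace to single spaces, which the regex version does not do.

```python
def collapse_adjacent_runs(text):
    """Remove a run of words when it immediately repeats the run before it."""
    words = text.split()
    i = 0
    while i < len(words):
        # Longest run that could still be followed by a full copy of itself.
        max_len = (len(words) - i) // 2
        for n in range(max_len, 0, -1):
            if words[i:i + n] == words[i + n:i + 2 * n]:
                # Drop the second copy and re-check the same position,
                # since another copy may follow.
                del words[i + n:i + 2 * n]
                break
        else:
            # No repeated run starts here; move on to the next word.
            i += 1
    return " ".join(words)


print(collapse_adjacent_runs(
    "12 PL DE LA HALLE BP 425 BRIVE LA GAILLARDE BP 425 BRIVE LA GAILLARDE"
))
```

Like the regex, this only removes copies that are directly adjacent, so the "LA" inside "BRIVE LA GAILLARDE" is left alone.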

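We apply re.sub repeatedly in a loop because a single pass is not always enough: re.sub only replaces non-overlapping matches, so once PHRASE PHRASE has been collapsed into a single PHRASE, that remaining PHRASE is not reconsidered in the same pass, even if yet another copy follows it. The loop stops as soon as a pass leaves the string unchanged.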
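To see why one pass may not be enough, here is a small sketch on a toy input with three copies of a phrase. The helper name collapse_repeats and the toy string are only for illustration; the pattern is the one from the snippet above.

```python
import re

# Same pattern as in the snippet above.
PATTERN = r'(?<!\S)(\S+(?:\s\S+)*)\s+\1(?!\S)'


def collapse_repeats(text):
    """Apply the substitution until the string stops changing."""
    while True:
        out = re.sub(PATTERN, r'\1', text)
        if out == text:
            return out
        text = out


three = "A B C A B C A B C"
print(re.sub(PATTERN, r'\1', three))  # single pass: A B C A B C
print(collapse_repeats(three))        # looped:      A B C
```

The single pass merges the first two copies and then continues scanning after them, so the third copy survives until the next pass.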

Here is an explanation of the regex pattern:

(?<!\S)         assert what precedes is either whitespace or the start of the string
(               match AND capture the following in \1
    \S+         match one or more non whitespace characters (i.e. a "word")
    (?:\s\S+)*  then match a single whitespace character followed by another word, zero or more times
)
\s+             match one or more whitespace characters
\1              then match the same phrase we just saw         
(?!\S)          assert that whitespace or the end of the string follows
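If you want to keep that explanation next to the code, the same pattern can be written with re.VERBOSE, which ignores literal whitespace and # comments inside the pattern (escapes such as \s still work as usual). The name PHRASE_REPEAT is just an illustrative choice.

```python
import re

PHRASE_REPEAT = re.compile(r"""
    (?<!\S)          # preceded by whitespace or the start of the string
    (                # capture group 1:
        \S+          #   one word
        (?:\s\S+)*   #   then zero or more words, each after one whitespace char
    )
    \s+              # whitespace between the two copies
    \1               # the same phrase again
    (?!\S)           # followed by whitespace or the end of the string
""", re.VERBOSE)

text = "CASERNE AUDEOUD 111 AV DE LA CORSE 111 AV DE LA CORSE"
print(PHRASE_REPEAT.sub(r"\1", text))
```

A single sub call is enough for this input because the phrase occurs only twice; for longer runs it still needs the loop.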
