I have a portion of a string which duplicates itself, and I want to remove all duplicate "substrings" without losing the order of the words in the string.
For example: "12 PL DE LA HALLE BP 425 BRIVE-LA-GAILLARDE BP 425 BRIVE-LA-GAILLARDE BP 425 BRIVE-LA-GAILLARDE BP 425 BRIVE-LA-GAILLARDE"
Here "BP 425 BRIVE-LA-GAILLARDE" repeats itself 4 times.
I would like the string to finally be "12 PL DE LA HALLE BP 425 BRIVE-LA-GAILLARDE", where the duplicates have been removed.
This problem occurs when one of my generic scraper modules collects all text elements from a certain HTML element. Inside that element the same information is repeated multiple times, with the extra copies hidden using CSS. This is why I am looking for a generic way of de-duplicating substrings.
More examples of duplicated substrings:
"TOUR SOCIETE SUISSE 1 BD VIVIER MERLE 1 BD VIVIER MERLE"
=> "TOUR SOCIETE SUISSE 1 BD VIVIER MERLE"
"2 PARC DES ERABLES 66 RTE DE SARTROUVILLE 66 RTE DE SARTROUVILLE"
=> "2 PARC DES ERABLES 66 RTE DE SARTROUVILLE"
"CASERNE AUDEOUD 111 AV DE LA CORSE 111 AV DE LA CORSE"
=> "CASERNE AUDEOUD 111 AV DE LA CORSE"
The simple approach of never repeating the same word does not work here, because a word can legitimately appear more than once without being part of a duplicated phrase. For example, in "12 PL DE LA HALLE BP 425 BRIVE LA GAILLARDE BP 425 BRIVE LA GAILLARDE", the "LA" between BRIVE and GAILLARDE would be removed,
and the output would be: "12 PL DE LA HALLE BP 425 BRIVE GAILLARDE"
whereas the actual desired output is: "12 PL DE LA HALLE BP 425 BRIVE LA GAILLARDE"
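For illustration, here is a minimal sketch of that naive word-level approach, showing the failure described above. The helper name dedupe_words is hypothetical and only used for this example.

```python
def dedupe_words(text):
    # dict.fromkeys keeps only the first occurrence of each word, in order.
    return " ".join(dict.fromkeys(text.split()))

s = "12 PL DE LA HALLE BP 425 BRIVE LA GAILLARDE BP 425 BRIVE LA GAILLARDE"
print(dedupe_words(s))
# 12 PL DE LA HALLE BP 425 BRIVE GAILLARDE   (the second "LA" is lost)
```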
My hunch is that one would need to compare sequences of words, but I am not sure exactly how.
Any help appreciated.
Here is a potentially viable regex-based solution:
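import re

inp = "12 PL DE LA HALLE BP 425 BRIVE-LA-GAILLARDE BP 425 BRIVE-LA-GAILLARDE BP 425 BRIVE-LA-GAILLARDE BP 425 BRIVE-LA-GAILLARDE"
# Repeat the substitution until a pass leaves the string unchanged.
while True:
    # Collapse "PHRASE PHRASE" into a single "PHRASE".
    out = re.sub(r'(?<!\S)(\S+(?:\s\S+)*)\s+\1(?!\S)', '\\1', inp)
    if out == inp:
        break
    inp = out
print(out)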
This prints:
12 PL DE LA HALLE BP 425 BRIVE-LA-GAILLARDE
The idea here is to match any phrase which is followed by the same phrase, and then to replace with just the first captured phrase.
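We call re.sub in a loop rather than once, because a single call only replaces non-overlapping matches in one left-to-right pass: once PHRASE PHRASE has been replaced with a single PHRASE, that remaining PHRASE is not compared against the text that follows it. Repeating the substitution until the string stops changing handles phrases that occur more than twice.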
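To make the need for repeated passes concrete, the sketch below wraps the same pattern in a function and compares one pass with the full loop on the example input. The names PATTERN and dedupe_phrases are assumptions introduced for illustration; they are not part of the original snippet.

```python
import re

# Same pattern as in the snippet above, compiled once for reuse.
PATTERN = re.compile(r'(?<!\S)(\S+(?:\s\S+)*)\s+\1(?!\S)')

def dedupe_phrases(text):
    """Apply the substitution until the string stops changing."""
    while True:
        out = PATTERN.sub(r'\1', text)
        if out == text:
            return out
        text = out

inp = "12 PL DE LA HALLE " + " ".join(["BP 425 BRIVE-LA-GAILLARDE"] * 4)

# A single pass only halves the run of four copies.
print(PATTERN.sub(r'\1', inp))
# 12 PL DE LA HALLE BP 425 BRIVE-LA-GAILLARDE BP 425 BRIVE-LA-GAILLARDE

# Looping until nothing changes removes the remaining duplicate.
print(dedupe_phrases(inp))
# 12 PL DE LA HALLE BP 425 BRIVE-LA-GAILLARDE
```

Note that this pattern also collapses phrases that are intentionally repeated back to back, so it may be too aggressive for inputs where that can happen.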
Here is an explanation of the regex pattern:
(?<!\S)         assert that what precedes is either whitespace or the start of the string
(               match AND capture the following in \1
    \S+         match one or more non-whitespace characters (i.e. a "word")
    (?:\s\S+)*  then match a whitespace character followed by another word, zero or more times
)               end of the capture group
\s+             match one or more whitespace characters
\1              then match the same phrase we just saw
(?!\S)          assert that whitespace or the end of the string follows
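As an alternative to the regex, the same idea can be sketched by comparing sequences of words directly, as the question suggests. This is a different technique from the pattern above, not a translation of it, and the function name collapse_repeated_runs is hypothetical. It assumes words are separated by whitespace and it normalizes all whitespace to single spaces in the output.

```python
def collapse_repeated_runs(text):
    words = text.split()
    i = 0
    while i < len(words):
        # Longest run starting at i that could still be followed by a copy.
        max_len = (len(words) - i) // 2
        for k in range(max_len, 0, -1):
            if words[i:i + k] == words[i + k:i + 2 * k]:
                # Drop the second copy, then re-check the same position.
                del words[i + k:i + 2 * k]
                break
        else:
            # No adjacent repeat starts here; move on to the next word.
            i += 1
    return " ".join(words)

s = "12 PL DE LA HALLE BP 425 BRIVE LA GAILLARDE BP 425 BRIVE LA GAILLARDE"
print(collapse_repeated_runs(s))
# 12 PL DE LA HALLE BP 425 BRIVE LA GAILLARDE
```

Because it only removes a run of words when an identical run follows immediately, the "LA" inside "BRIVE LA GAILLARDE" is kept.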