Pandas replace not replacing the whole string

Jacques Thibodeau :

I'm going through a text and need to replace a bunch of CIDs (characters that were not readable when I scraped them). I need to replace every "cid:###" with the correct character. The issue I'm currently running into is that some CIDs are wrapped in <s></s> tags, and there is no space between <s>(cid:131)</s> and the next word.

When I use replace, nothing happens if I try to replace <s>(cid:131)</s> with ▪. When I replace only cid:131 with ▪, I get <s>(▪)</s>. I'm trying to get rid of the <s></s> for this specific case (<s></s> is found in other places in the document and I don't want to replace those).

Doesn't change anything:

csv_of_table = csv_of_table.replace('<s>(cid:131)</s>', '▪', regex=True)

Only changes the part with cid:131:

csv_of_table = csv_of_table.replace('cid:131', '▪', regex=True)

Ben Pap :

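You can use the ? quantifier to mark a group as optional, meaning it matches zero or one time. Note that with regex=True the parentheses in your first attempt are treated as a regex group rather than as literal characters, which is why nothing matched; literal parentheses have to be escaped as \( and \). The pattern below uses \d+, which matches any CID number; put 131 in its place to target only that one.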

csv_of_table = csv_of_table.replace(r"(<s>\()?cid:\d+(\)</s>)?", "▪", regex=True)
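
To see the difference between the approaches, here is a minimal sketch. The sample DataFrame and its "text" column are made up for illustration (the real csv_of_table is not shown in the question), and the pattern is narrowed to cid:131 so that other CIDs are left alone.

```python
import re
import pandas as pd

# Hypothetical sample data standing in for csv_of_table.
csv_of_table = pd.DataFrame(
    {"text": ["<s>(cid:131)</s>Item one", "(cid:131) Item two", "<s>kept</s> text"]}
)

# Option 1: escape the literal so "(" and ")" match themselves
# instead of opening and closing a regex group.
escaped = csv_of_table.replace(re.escape("<s>(cid:131)</s>"), "▪", regex=True)

# Option 2: optional (non-capturing) groups, so the wrapper is removed
# when present and a bare cid:131 is still replaced.
pattern = r"(?:<s>\()?cid:131(?:\)</s>)?"
optional = csv_of_table.replace(pattern, "▪", regex=True)

print(escaped["text"].tolist())
print(optional["text"].tolist())
```

With option 2, a CID that has parentheses but no <s></s> wrapper keeps its parentheses, because each optional group only matches the full "<s>(" or ")</s>" sequence. Other <s></s> tags in the document are untouched by either option.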

