There is a list of sentences, sentences = ['Ask the swordsmith', 'He knows everything']. The goal is to remove those sentences that contain a word from a wordlist, lexicon = ['word', 'every', 'thing']. This can be achieved using the following list comprehension:
newlist = [sentence for sentence in sentences if not any(word in sentence.split(' ') for word in lexicon)]
Note that if not word in sentence is not a sufficient condition, as it would also remove sentences that contain words in which a word from the lexicon is embedded: e.g. 'word' is embedded in 'swordsmith', and 'every' and 'thing' are embedded in 'everything'.
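To make the difference between the two conditions concrete, here is a minimal sketch on the example data above. The names by_substring and by_word are only illustrative.

```python
sentences = ['Ask the swordsmith', 'He knows everything']
lexicon = ['word', 'every', 'thing']

# Substring test: 'word' is found inside 'swordsmith' and 'every' inside
# 'everything', so both sentences are (wrongly) removed.
by_substring = [s for s in sentences if not any(w in s for w in lexicon)]

# Whole-word test: no lexicon entry equals a complete word of either
# sentence, so both sentences are kept.
by_word = [s for s in sentences if not any(w in s.split(' ') for w in lexicon)]

print(by_substring)  # []
print(by_word)       # ['Ask the swordsmith', 'He knows everything']
```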
However, my list consists of 1,000,000 sentences and my lexicon of 200,000 words. Applying the list comprehension above takes hours! Because of that, I'm looking for a faster method to remove strings from a list that contain words from another list. Any suggestions? Maybe using regex?
Do your lookup in a set. This makes it fast, and sidesteps the containment issue because you only look up whole words in the lexicon.
lexicon = set(lexicon)
newlist = [s for s in sentences if not any(w in lexicon for w in s.split())]
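An equivalent way to write the same filter, if you prefer, is set.isdisjoint, which accepts any iterable and returns False as soon as a shared element exists. This is a sketch rather than a measured improvement; the third sentence is added here only so that something actually gets removed.

```python
sentences = ['Ask the swordsmith', 'He knows everything', 'Every word counts']
lexicon = {'word', 'every', 'thing'}

# Keep a sentence only if none of its whitespace-separated words
# appears in the lexicon set.
newlist = [s for s in sentences if lexicon.isdisjoint(s.split())]

print(newlist)  # ['Ask the swordsmith', 'He knows everything']
```

Whether this is faster than the any(...) version on your data is something you would have to time yourself.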
This is pretty efficient because w in lexicon is an O(1) operation on average, and any short-circuits. The main issue is splitting your sentence into words properly. A regular expression will likely be slower than a customized solution, but may be the best choice, depending on how robust you want to be against punctuation and the like. For example:
import re

lexicon = set(lexicon)
pattern = re.compile(r'\w+')  # runs of word characters, so punctuation is ignored
newlist = [s for s in sentences if not any(m.group() in lexicon for m in pattern.finditer(s))]
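To illustrate where the two tokenizers differ, here is a small sketch with one extra sentence ending in punctuation. The sentence 'Not a word.' and the names by_split and by_regex are made up for this example.

```python
import re

sentences = ['Ask the swordsmith', 'He knows everything', 'Not a word.']
lexicon = {'word', 'every', 'thing'}
pattern = re.compile(r'\w+')

# str.split() yields the token 'word.' (with the period), which is not in
# the lexicon, so the third sentence slips through.
by_split = [s for s in sentences if not any(w in lexicon for w in s.split())]

# The regex yields 'word' without the period, so the third sentence is removed.
by_regex = [s for s in sentences
            if not any(m.group() in lexicon for m in pattern.finditer(s))]

print(by_split)  # ['Ask the swordsmith', 'He knows everything', 'Not a word.']
print(by_regex)  # ['Ask the swordsmith', 'He knows everything']
```

Both variants are case-sensitive. If 'Word' should also match 'word', lowercasing the tokens and the lexicon before the lookup is one option.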