Is there a more efficient way to assess containment of strings?

MauroT :

I have to execute this line of code several million times; I wonder if there is a way to optimize it (maybe by precomputing something?).

a.contains(b) || b.contains(a)

Thank you

Edit: the code executed by the contains method already checks for a.length < b.length:

public static int indexOf(byte[] value, int valueCount, byte[] str, int strCount, int fromIndex) {
    byte first = str[0];
    // If str is longer than value, max is negative, the loop body never
    // runs, and -1 is returned, so the length check is implicit.
    int max = (valueCount - strCount);
    for (int i = fromIndex; i <= max; i++) {
        [...]
    }
    return -1;
}
tobias_k :

As I understand the task, you have to check whether a contains b or vice versa for each pair of a and b from a set of about 35 million words. That is roughly 6 × 10^14 pairs, far too many to check naively.

You should be able to narrow the search down considerably by precomputing which n-grams each word contains: if b contains a, then b must also contain every n-gram that a contains. You could, for example, precompute all the trigrams that each word in the list contains and, at the same time, all the words that contain a given trigram. Then you can just look up the words in those maps and, with a few set operations, get a small set of candidates to check properly.

In pseudo-code (a Java sketch follows the list):

  • select a size n for the n-grams (see below)
  • initialize a Map<String, Set<String>> ngrams_to_words
  • first iteration: for each word a in your data set
    • iterate over all the n-grams of a (e.g. using some sort of sliding window)
    • for each of those n-grams, add a to its set of words in ngrams_to_words
  • second iteration: for each word a in your data set
    • again get all the n-grams a contains
    • for each of those, get the set of words that contain that n-gram from ngrams_to_words
    • get the intersection of those sets of words
    • every word b in that intersection contains all the n-grams that a contains (though maybe in a different order or quantity), so properly check whether b contains a
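A minimal, runnable sketch of this scheme, assuming trigrams and a tiny example word list; names such as ngramToWords, ngrams and the constant N are illustrative choices, not fixed by the approach:

import java.util.*;

public class ContainmentFilter {

    static final int N = 3; // n-gram size (trigrams); tune for your data

    // Collect the distinct n-grams of a word with a sliding window.
    static Set<String> ngrams(String word) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + N <= word.length(); i++) {
            grams.add(word.substring(i, i + N));
        }
        return grams;
    }

    public static void main(String[] args) {
        List<String> words = List.of("banana", "ana", "nan", "band");

        // First iteration: map every n-gram to the set of words containing it.
        Map<String, Set<String>> ngramToWords = new HashMap<>();
        for (String w : words) {
            for (String g : ngrams(w)) {
                ngramToWords.computeIfAbsent(g, k -> new HashSet<>()).add(w);
            }
        }

        // Second iteration: for each word a, intersect the candidate sets of
        // all its n-grams, then run the expensive contains() check only on
        // the few remaining candidates.
        for (String a : words) {
            Set<String> grams = ngrams(a);
            if (grams.isEmpty()) continue; // shorter than N; see the note on n-gram length below
            Iterator<String> it = grams.iterator();
            Set<String> candidates = new HashSet<>(ngramToWords.get(it.next()));
            while (it.hasNext() && !candidates.isEmpty()) {
                candidates.retainAll(ngramToWords.get(it.next()));
            }
            for (String b : candidates) {
                if (!b.equals(a) && b.contains(a)) {
                    System.out.println(b + " contains " + a);
                }
            }
        }
    }
}

On the example list this prints that banana contains ana and nan; band shares the trigram ban with banana, but the intersection with banana's other trigrams eliminates it before any contains() call.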

Depending on the number of letters in those n-grams (bigrams, trigrams, ...), they will be more expensive to precompute, in both time and space, but the filtering effect will also be greater. In the simplest case, you could even just precompute which words contain a given letter (i.e. "1-grams"); that should be fast and already narrow down the words to check considerably. Of course, the n-grams should not be longer than the shortest word in the data set, or those short words would have no n-grams at all; alternatively, you could use two lengths of n-grams, e.g. keep two maps letter_to_words and trigrams_to_words.
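A sketch of the letter map for that two-length variant (the class and method names here are hypothetical):

import java.util.*;

class LetterIndex {
    // 1-gram index: map each character to the set of words containing it.
    static Map<Character, Set<String>> build(Collection<String> words) {
        Map<Character, Set<String>> index = new HashMap<>();
        for (String w : words) {
            for (char c : w.toCharArray()) {
                index.computeIfAbsent(c, k -> new HashSet<>()).add(w);
            }
        }
        return index;
    }
}

For a word too short to have any trigrams, you would intersect the letter sets of its characters instead, then verify the resulting candidates with contains() as before.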
