Implementing a Big Data Search Engine with Python

Search is a common requirement in the field of big data. Splunk and ELK are the leaders in this space on the closed-source and open-source sides, respectively. This article implements a basic search function in very little Python code, to illustrate the basic principles behind big data search.

Bloom Filter

The first step is to implement a Bloom filter.

A Bloom filter is a common algorithm in the big data field. Its purpose is to filter out elements that cannot be the target: if a search term was never added to my data, the filter can report "not present" very quickly.

Let's look at the code for the Bloom filter:

class Bloomfilter(object):
    """
    A Bloom filter is a probabilistic data-structure that trades space for accuracy
    when determining if a value is in a set.  It can tell you if a value was possibly
    added, or if it was definitely not added, but it can't tell you for certain that
    it was added.
    """
    def __init__(self, size):
        """Setup the BF with the appropriate size"""
        self.values = [False] * size
        self.size = size

    def hash_value(self, value):
        """Hash the value provided and scale it to fit the BF size"""
        return hash(value) % self.size

    def add_value(self, value):
        """Add a value to the BF"""
        h = self.hash_value(value)
        self.values[h] = True

    def might_contain(self, value):
        """Check if the value might be in the BF"""
        h = self.hash_value(value)
        return self.values[h]

    def print_contents(self):
        """Dump the contents of the BF for debugging purposes"""
        print(self.values)
  • The basic data structure is an array (really a bitmap, using 1/0 to record whether a value is present). At initialization nothing has been added, so every slot is set to False. In practice the array is made very large to keep the filter effective.
  • A hash function maps each value to the array index where it should be recorded
  • When a value is added to the Bloom filter, its hash is computed and the corresponding position is set to True
  • When checking whether a value has already been indexed, just check the True/False value of the bit at its hash position

Seeing this, you should notice that if the Bloom filter returns False, the data definitely has not been indexed; but if it returns True, you cannot conclude that it has. During a search, the Bloom filter lets many queries that cannot match return early, which improves efficiency.
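How often the filter lies in the True direction can be estimated. With m bits, n inserted values, and a single hash function, the probability that a never-added term lands on an already-set bit is about 1 - (1 - 1/m)^n. This is standard Bloom-filter math rather than anything in the code above; a quick check with a tiny and a realistic filter size:

```python
# Chance that a given bit is set after inserting n values with one hash
# function into an m-bit filter; that is also the false-positive rate
# for a term that was never added.
def false_positive_rate(m, n):
    return 1 - (1 - 1.0 / m) ** n

print(false_positive_rate(10, 3))     # ~0.27: a tiny 10-bit filter lies often
print(false_positive_rate(10000, 3))  # ~0.0003: a larger filter rarely does
```

This is why, in actual use, the bit array is made very large relative to the number of indexed terms.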

Let's see how the Bloomfilter works:

bf = Bloomfilter(10)
bf.add_value('dog')
bf.add_value('fish')
bf.add_value('cat')
bf.print_contents()
bf.add_value('bird')
bf.print_contents()
# Note: contents are unchanged after adding bird - it collides
for term in ['dog', 'fish', 'cat', 'bird', 'duck', 'emu']:
    print('{}: {} {}'.format(term, bf.hash_value(term), bf.might_contain(term)))

The result (the exact indices depend on Python's built-in hash(); Python 3 randomizes string hashes per process, so your positions and collisions may differ):

[False, False, False, False, True, True, False, False, False, True]
[False, False, False, False, True, True, False, False, False, True]
dog: 5 True
fish: 4 True
cat: 9 True
bird: 9 True
duck: 5 True
emu: 8 False

First we create a Bloom filter with a capacity of 10.

Then we add three values, 'dog', 'fish', and 'cat'; the first print_contents() shows the filter's state at that point.

Adding 'bird' does not change the contents, because 'bird' and 'cat' happen to hash to the same position.

Finally we check whether each of 'dog', 'fish', 'cat', 'bird', 'duck', and 'emu' might have been indexed. 'duck' returns True even though it was never added, because its hash happens to collide with 'dog'; 'emu' returns False.
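One way to drive the false-positive rate down is the trick real Bloom filters use: k different hash functions, setting k bits per value and requiring all k to be set on lookup. A minimal sketch of that variant — the class name and the md5-salting scheme are my own illustration, not part of the article:

```python
import hashlib

class KHashBloomfilter(object):
    """Bloom filter variant with several hash positions per value."""
    def __init__(self, size, num_hashes=3):
        self.values = [False] * size
        self.size = size
        self.num_hashes = num_hashes

    def _positions(self, value):
        # Derive num_hashes positions by salting a stable hash with i.
        for i in range(self.num_hashes):
            digest = hashlib.md5(('%d:%s' % (i, value)).encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add_value(self, value):
        for pos in self._positions(value):
            self.values[pos] = True

    def might_contain(self, value):
        # True only if every one of the value's positions is set.
        return all(self.values[pos] for pos in self._positions(value))
```

Because md5 is stable across processes, this variant also sidesteps Python 3's per-process hash randomization.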

Word Segmentation

In the next step we implement word segmentation, which divides our text data into the smallest searchable units: words. Here we focus on English, because Chinese word segmentation involves natural language processing and is much more complicated, while English basically only needs punctuation marks.

Let's look at the code for word segmentation:

def major_segments(s):
    """
    Perform major segmenting on a string.  Split the string by all of the major
    breaks, and return the set of everything found.  The breaks in this implementation
    are single characters, but in Splunk proper they can be multiple characters.
    A set is used because ordering doesn't matter, and duplicates are bad.
    """
    major_breaks = ' '
    last = -1
    results = set()

    # enumerate() will give us (0, s[0]), (1, s[1]), ...
    for idx, ch in enumerate(s):
        if ch in major_breaks:
            segment = s[last+1:idx]
            results.add(segment)

            last = idx

    # The last character may not be a break so always capture
    # the last segment (which may end up being "", but yolo)    
    segment = s[last+1:]
    results.add(segment)

    return results

Major segmentation

Major segmentation splits words on spaces. Real segmentation logic uses other delimiters as well; for example, Splunk's default delimiters include the following, and users can also define their own:

] < > ( ) { } | ! ; , ' " * \n \r \s \t & ? + %21 %26 %2526 %3B %7C %20 %2B %3D -- %2520 %5D %5B %3A %0A %2C %28 %29
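The loop in major_segments works with any set of single-character breaks, so making the break set a parameter is enough to get part of the way toward the table above. The function name and the sample default break set here are my own sketch:

```python
def major_segments_custom(s, major_breaks=' \t\n,;|'):
    """Split s at every break character and return the set of pieces."""
    last = -1
    results = set()
    for idx, ch in enumerate(s):
        if ch in major_breaks:
            results.add(s[last + 1:idx])
            last = idx
    results.add(s[last + 1:])
    return results

print(major_segments_custom('error;code=500|retry'))
```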

def minor_segments(s):
    """
    Perform minor segmenting on a string.  This is like major
    segmenting, except it also captures from the start of the
    input to each break.
    """
    minor_breaks = '_.'
    last = -1
    results = set()

    for idx, ch in enumerate(s):
        if ch in minor_breaks:
            segment = s[last+1:idx]
            results.add(segment)

            segment = s[:idx]
            results.add(segment)

            last = idx

    segment = s[last+1:]
    results.add(segment)
    results.add(s)

    return results

Minor segmentation

The logic of minor segmentation is similar to major segmentation, except that it also adds each prefix from the start of the string up to the current break. For example, the minor segments of "1.2.3.4" are 1, 2, 3, 4, 1.2, 1.2.3, and the full string 1.2.3.4.
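Running minor_segments on "1.2.3.4" confirms this; the function is repeated here so the snippet runs on its own:

```python
def minor_segments(s):
    """Minor segmenting: split on '_' and '.', also keeping each
    prefix from the start of the string up to every break."""
    minor_breaks = '_.'
    last = -1
    results = set()
    for idx, ch in enumerate(s):
        if ch in minor_breaks:
            results.add(s[last + 1:idx])  # the piece between breaks
            results.add(s[:idx])          # the prefix up to this break
            last = idx
    results.add(s[last + 1:])             # the trailing piece
    results.add(s)                        # the whole string
    return results

print(sorted(minor_segments('1.2.3.4')))
# ['1', '1.2', '1.2.3', '1.2.3.4', '2', '3', '4']
```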

def segments(event):
    """Simple wrapper around major_segments / minor_segments"""
    results = set()
    for major in major_segments(event):
        for minor in minor_segments(major):
            results.add(minor)
    return results

The overall segmentation logic is: first perform major segmentation on the text, then perform minor segmentation on each major segment, and return all the resulting words.

Let's see how this code works (the output order is arbitrary, since segments() returns a set):

for term in segments('src_ip = 1.2.3.4'):
    print(term)
src
1.2
1.2.3.4
src_ip
3
1
1.2.3
ip
2
=
4

Search

Well, with word segmentation and the Bloom filter in place, we can implement the search function.

Here's the code:

class Splunk(object):
    def __init__(self):
        self.bf = Bloomfilter(64)
        self.terms = {}  # Dictionary of term to set of events
        self.events = []
    
    def add_event(self, event):
        """Adds an event to this object"""

        # Generate a unique ID for the event, and save it
        event_id = len(self.events)
        self.events.append(event)

        # Add each term to the bloomfilter, and track the event by each term
        for term in segments(event):
            self.bf.add_value(term)

            if term not in self.terms:
                self.terms[term] = set()
            self.terms[term].add(event_id)

    def search(self, term):
        """Search for a single term, and yield all the events that contain it"""
        
        # In Splunk this runs in O(1), and is likely to be in filesystem cache (memory)
        if not self.bf.might_contain(term):
            return

        # In Splunk this probably runs in O(log N) where N is the number of terms in the tsidx
        if term not in self.terms:
            return

        for event_id in sorted(self.terms[term]):
            yield self.events[event_id]
  • The Splunk class represents an indexed collection with search capability
  • Each collection contains a Bloom filter, an inverted term list (a dictionary), and an array storing all events
  • When an event is added to the index, the following happens:
    • A unique id is generated for the event; here it is simply the serial number
    • The event is segmented, and each term is added to the inverted list, i.e. the mapping from each term to the ids of the events that contain it. Note that one term may correspond to multiple events, so the values of the inverted list are Sets. The inverted list is the core structure of most search engines.
  • When a term is searched, the following happens:
    • Check the Bloom filter; if it returns False, return immediately
    • Check the term dictionary; if the term is not there, return immediately
    • Look up all matching event ids in the inverted list, then yield the corresponding events
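The inverted list is the piece worth staring at: it is nothing more than a dictionary from each term to the set of ids of the events containing it. Stripped of the Bloom filter and segmentation (splitting on spaces only, for brevity), the idea fits in a few lines:

```python
# Minimal inverted index: term -> set of event ids.  Splits on spaces
# only, rather than using the full segments() logic.
events = ['src_ip = 1.2.3.4', 'src_ip = 5.6.7.8', 'dst_ip = 1.2.3.4']
index = {}
for event_id, event in enumerate(events):
    for term in event.split():
        index.setdefault(term, set()).add(event_id)

print(index['src_ip'])   # {0, 1}
print(index['1.2.3.4'])  # {0, 2}
```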

Let's run it and see:

s = Splunk()
s.add_event('src_ip = 1.2.3.4')
s.add_event('src_ip = 5.6.7.8')
s.add_event('dst_ip = 1.2.3.4')

for event in s.search('1.2.3.4'):
    print(event)
print('-')
for event in s.search('src_ip'):
    print(event)
print('-')
for event in s.search('ip'):
    print(event)
src_ip = 1.2.3.4
dst_ip = 1.2.3.4
-
src_ip = 1.2.3.4
src_ip = 5.6.7.8
-
src_ip = 1.2.3.4
src_ip = 5.6.7.8
dst_ip = 1.2.3.4

Isn't it great?

More Complex Searches

Going further, we want to support And and Or during the search, to express more complex search logic.

Here's the code:

class SplunkM(object):
    def __init__(self):
        self.bf = Bloomfilter(64)
        self.terms = {}  # Dictionary of term to set of events
        self.events = []
    
    def add_event(self, event):
        """Adds an event to this object"""

        # Generate a unique ID for the event, and save it
        event_id = len(self.events)
        self.events.append(event)

        # Add each term to the bloomfilter, and track the event by each term
        for term in segments(event):
            self.bf.add_value(term)
            if term not in self.terms:
                self.terms[term] = set()
            
            self.terms[term].add(event_id)

    def search_all(self, terms):
        """Search for an AND of all terms"""

        # Start with the universe of all events...
        results = set(range(len(self.events)))

        for term in terms:
            # If a term isn't present at all then we can stop looking
            if not self.bf.might_contain(term):
                return
            if term not in self.terms:
                return

            # Drop events that don't match from our results
            results = results.intersection(self.terms[term])

        for event_id in sorted(results):
            yield self.events[event_id]


    def search_any(self, terms):
        """Search for an OR of all terms"""
        results = set()

        for term in terms:
            # If a term isn't present, we skip it, but don't stop
            if not self.bf.might_contain(term):
                continue
            if term not in self.terms:
                continue

            # Add these events to our results
            results = results.union(self.terms[term])

        for event_id in sorted(results):
            yield self.events[event_id]

Using the intersection and union operations of Python sets, it is very easy to support And (intersection) and Or (union).
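Concretely, the two searches reduce to set operations on the posting lists; the ids below are made up for illustration:

```python
src_ip_events = {0, 1}  # pretend these events contain 'src_ip'
five_six_events = {1}   # pretend this event contains '5.6'
dst_ip_events = {2}     # pretend this event contains 'dst_ip'

print(src_ip_events & five_six_events)  # AND is intersection: {1}
print(src_ip_events | dst_ip_events)    # OR is union: {0, 1, 2}
```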

The results are as follows:

s = SplunkM()
s.add_event('src_ip = 1.2.3.4')
s.add_event('src_ip = 5.6.7.8')
s.add_event('dst_ip = 1.2.3.4')

for event in s.search_all(['src_ip', '5.6']):
    print(event)
print('-')
for event in s.search_any(['src_ip', 'dst_ip']):
    print(event)
src_ip = 5.6.7.8
-
src_ip = 1.2.3.4
src_ip = 5.6.7.8
dst_ip = 1.2.3.4

 

Summary

The code above is only meant to illustrate the basic principles of big data search: the Bloom filter, word segmentation, and the inverted list. A real search engine built on it is still a long way off. All of this material comes from Splunk Conf 2017; if you are interested, you can watch the talk video online.
