Implementing a simple search engine in Java

The implementation code comes first; the ideas and logic behind it are explained after the code.

The bean used for sorting search results

/**
 *@Description: Bean that pairs an id with its hit count, used to sort search results
 */
package cn.lulei.search.engine.model;  
  
public class SortBean {
  private String id;
  private int times;
   
  public String getId() {
    return id;
  }
  public void setId(String id) {
    this.id = id;
  }
  public int getTimes() {
    return times;
  }
  public void setTimes(int times) {
    this.times = times;
  }
}

The search data structure and a simple search algorithm

/**
 *@Description: In-memory index keyed by character, with a simple hit-count search
 */
package cn.lulei.search.engine;  
 
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
 
import cn.lulei.search.engine.model.SortBean;
  
public class SerachBase {
  // details stores the full record for each search object; the key is the unique id that identifies the object
  private HashMap<String, Object> details = new HashMap<String, Object>();
  // keySearch indexes the characters of the search keys. A sparse array is used here; a HashMap would also work:
  //private static HashMap<Integer, HashSet<String>> keySearch = new HashMap<Integer, HashSet<String>>();
  // In that case the HashMap key plays the role of the array subscript, and the HashMap value the role of the array element at that subscript
  // One slot per possible char value (0 to Character.MAX_VALUE inclusive), so the char value 65535 stays in bounds
  private final static int maxLength = Character.MAX_VALUE + 1;
  @SuppressWarnings("unchecked")
  private HashSet<String>[] keySearch = new HashSet[maxLength];
   
  /**
   *@Description: Singleton implemented with the initialization-on-demand holder idiom
   *@Version:1.1.0
   */
  private static class lazyLoadSerachBase {
    private static final SerachBase serachBase = new SerachBase();
  }
   
  /**
   * The constructor is private so that the singleton cannot be instantiated from outside
   */
  private SerachBase() {
     
  }
   
  /**
   * @return  
   * @Description: get the singleton
   */
  public static SerachBase getSerachBase() {
    return lazyLoadSerachBase.serachBase;
  }
   
  /**
   * @param id
   * @return  
   * @Description: Get details based on id
   */
  public Object getObject(String id) {
    return details.get(id);
  }
   
  /**
   * @param ids
   * @return  
   * @Description: Get the details for several ids, passed as one string separated by ","
   */
  public List<Object> getObjects(String ids) {
    if (ids == null || "".equals(ids)) {
      return null;
    }
    List<Object> objs = new ArrayList<Object>();
    String[] idArray = ids.split(",");
    for (String id : idArray) {
      objs.add(getObject(id));
    }
    return objs;
  }
   
  /**
   * @param key
   * @return  
   * @Description: Find the ids that match the search term and return them as one string separated by ","
   */
  public String getIds(String key) {
    if (key == null || "".equals(key)) {
      return null;
    }
    // look up
    // idTimes maps each id to the number of search-term characters found in that record
    HashMap<String, Integer> idTimes = new HashMap<String, Integer>();
    // ids collects every id hit by at least one character of the search term
    HashSet<String> ids = new HashSet<String>();
     
    // look up each character of the search term in the index
    for (int i = 0; i < key.length(); i++) {
      int at = key.charAt(i);
      // no record contains this character, so move on to the next character
      if (keySearch[at] == null) {
        continue;
      }
      for (String id : keySearch[at]) {
        int times = 1;
        if (ids.contains(id)) {
          times += idTimes.get(id);
          idTimes.put(id, times);
        } else {
          ids.add(id);
          idTimes.put(id, times);
        }
      }
    }
     
    // sort the hits by count, highest first
    List<SortBean> sortBeans = new ArrayList<SortBean>();
    for (String id : ids) {
      SortBean sortBean = new SortBean();
      sortBeans.add(sortBean);
      sortBean.setId(id);
      sortBean.setTimes(idTimes.get(id));
    }
    Collections.sort(sortBeans, new Comparator<SortBean>(){
      public int compare(SortBean o1, SortBean o2){
        // Integer.compare avoids the overflow risk of subtracting the two counts
        return Integer.compare(o2.getTimes(), o1.getTimes());
      }
    });
     
    // build the return string (each id is followed by ",")
    StringBuilder sb = new StringBuilder();
    for (SortBean sortBean : sortBeans) {
      sb.append(sortBean.getId());
      sb.append(",");
    }
     
    // release resources
    idTimes.clear();
    idTimes = null;
    ids.clear();
    ids = null;
    sortBeans.clear();
    sortBeans = null;
     
    //return
    return sb.toString();
  }
   
  /**
   * @param id
   * @param searchKey
   * @param obj
   * @Description: Add a record to the search index
   */
  public void add(String id, String searchKey, Object obj) {
    // Skip the record if any parameter is null
    if (id == null || searchKey == null || obj == null) {
      return;
    }
    // save the object
    details.put(id, obj);
    // save the search term
    addSearchKey(id, searchKey);
  }
   
  /**
   * @param id
   * @param searchKey
   * @Description: Add the characters of the search key to the index
   */
  private void addSearchKey(String id, String searchKey) {
    // Skip if either parameter is null.
    // add() has already checked this, so in a private method the check is redundant; it is kept as a defensive habit
    if (id == null || searchKey == null) {
      return;
    }
    // Character-level segmentation is used here; a mature word segmenter could be used instead
    for (int i = 0; i < searchKey.length(); i++) {
      // at is the array subscript; the HashSet of ids is the value stored at that subscript
      int at = searchKey.charAt(i);
      if (keySearch[at] == null) {
        HashSet<String> value = new HashSet<String>();
        keySearch[at] = value;
      }
      keySearch[at].add(id);
    }
  }
   
   
 
}

Test case:

/**
 *@Description: Simple test of SerachBase
 */
package cn.lulei.search.engine.test;  
 
import java.util.List;
 
import cn.lulei.search.engine.SerachBase;
  
public class Test {
  public static void main(String[] args) {
    SerachBase serachBase = SerachBase.getSerachBase();
    serachBase.add("1", "Hello!", "Hello!");
    serachBase.add("2", "Hello! I'm Zhang San.", "Hello! I'm Zhang San.");
    serachBase.add("3", "The weather is fine today.", "The weather is fine today.");
    serachBase.add("4", "Who are you?", "Who are you?");
    serachBase.add("5", "Advanced mathematics is difficult", "Advanced mathematics is really difficult.");
    serachBase.add("6", "Test", "The above is just a test");
    String ids = serachBase.getIds("Your high number");
    System.out.println(ids);
    List<Object> objs = serachBase.getObjects(ids);
    if (objs != null) {
      for (Object obj : objs) {
        System.out.println((String) obj);
      }
    }
  }
 
}

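The test output is as follows. The id order comes from the original Chinese test strings, which are shown here in translation; because matching is character by character, the English strings would rank differently:

5,3,2,1,4,
Advanced mathematics is really difficult.
The weather is fine today.
Hello! I'm Zhang San.
Hello!
Who are you?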

That completes a simple search engine.

Question 1: The word segmentation here is character segmentation, which is quite good for Chinese, but weak for English.

Improvement method: Use a mature word segmenter such as IKAnalyzer or StandardAnalyzer. The data structure of keySearch then needs to change, for example to private HashMap<String, String>[] keySearch = new HashMap[maxLength]; where the key stores a token produced by segmentation and the value stores the unique identifier id.
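
As a rough sketch of what a token-based index could look like: the class `TokenIndex` and its methods `tokenize`, `add`, and `idsFor` are hypothetical names, not part of the original code. The tokenizer is only a stand-in (lowercase, then split on anything that is not a letter or digit), not IKAnalyzer or StandardAnalyzer. The sketch also assumes a single map from token to a set of ids, rather than the array of HashMap<String, String> mentioned above, so that one token can point to several records.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of a token-level index; not the article's SerachBase.
final class TokenIndex {
  // key = token, value = ids of the records whose search key contains that token
  private final Map<String, Set<String>> keySearch = new HashMap<String, Set<String>>();

  // Stand-in tokenizer (an assumption): lowercase, then split on non-letter, non-digit characters.
  static List<String> tokenize(String text) {
    List<String> tokens = new ArrayList<String>();
    for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
      if (!token.isEmpty()) {
        tokens.add(token);
      }
    }
    return tokens;
  }

  // Index every token of searchKey under the given id.
  void add(String id, String searchKey) {
    for (String token : tokenize(searchKey)) {
      Set<String> ids = keySearch.get(token);
      if (ids == null) {
        ids = new HashSet<String>();
        keySearch.put(token, ids);
      }
      ids.add(id);
    }
  }

  // Return a copy of the ids indexed under the token (empty if the token is unknown).
  Set<String> idsFor(String token) {
    Set<String> ids = keySearch.get(token.toLowerCase(Locale.ROOT));
    return ids == null ? new HashSet<String>() : new HashSet<String>(ids);
  }
}
```

With this layout a query for "he" no longer matches a record containing "Hello", which is the behaviour character segmentation cannot give for English text.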

Question 2: Unlike Lucene, the search engine implemented in this article does not assign weights to tokens; it only checks whether each token appears in the object.

Improvement method: None at the moment. Adding weight processing makes the data structure more complex, so it is not processed for the time being, and weight processing will be implemented in future articles.

The following is a brief introduction to how this search engine is implemented.

The SerachBase class has two properties, details and keySearch. details stores the detailed information of each Object, and keySearch indexes the search field. details is a HashMap, and keySearch is a sparse array (it could also be a HashMap, in which case the key in the HashMap is equivalent to the subscript in the sparse array, and the value is equivalent to the value of the sparse array at that position).
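
The HashMap alternative mentioned in the parentheses could look roughly like the sketch below. The class `MapKeyIndex` and its methods are hypothetical and only mirror the shape of addSearchKey; the original code uses the array version.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: keySearch backed by a HashMap instead of a fixed-size array.
final class MapKeyIndex {
  // key = int value of a character, value = ids of the records containing that character
  private final Map<Integer, Set<String>> keySearch = new HashMap<Integer, Set<String>>();

  // Index every character of searchKey under the given id.
  void addSearchKey(String id, String searchKey) {
    if (id == null || searchKey == null) {
      return;
    }
    for (int i = 0; i < searchKey.length(); i++) {
      int at = searchKey.charAt(i);
      Set<String> ids = keySearch.get(at);
      if (ids == null) {
        ids = new HashSet<String>();
        keySearch.put(at, ids);
      }
      ids.add(id);
    }
  }

  // Return a read-only view of the ids indexed under c, or an empty set if there are none.
  Set<String> idsFor(char c) {
    // The cast matters: a bare char would be boxed as Character and never match an Integer key.
    Set<String> ids = keySearch.get((int) c);
    return ids == null ? Collections.<String>emptySet() : Collections.unmodifiableSet(ids);
  }
}
```

The map only allocates entries for characters that actually occur, at the cost of a hash lookup instead of a direct array access.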

I won't go into too much detail about details.

The array subscript in keySearch (or the key, if a HashMap is used) is computed by taking the int value of the first character of the token; because this article segments by character, each character is itself a token. That int value is the array subscript, and the array value at that subscript is the set of unique identifiers of the Objects whose search key contains the token. keySearch therefore has one slot per possible char value, and each slot is either null or a HashSet of ids.
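
As a rough illustration of that layout, the sketch below builds the array for two tiny records and prints the non-empty slots. The class `KeySearchLayout`, its methods, and the sample records are made up for the illustration; only the indexing loop follows addSearchKey.

```java
import java.util.HashSet;
import java.util.TreeSet;

// Hypothetical sketch that makes the keySearch layout visible.
final class KeySearchLayout {
  // Build the sparse array from {id, searchKey} pairs, one character per token.
  @SuppressWarnings("unchecked")
  static HashSet<String>[] build(String[][] records) {
    HashSet<String>[] keySearch = new HashSet[Character.MAX_VALUE + 1];
    for (String[] record : records) {
      String id = record[0];
      String searchKey = record[1];
      for (int i = 0; i < searchKey.length(); i++) {
        int at = searchKey.charAt(i); // the char's int value is the subscript
        if (keySearch[at] == null) {
          keySearch[at] = new HashSet<String>();
        }
        keySearch[at].add(id);
      }
    }
    return keySearch;
  }

  // List every non-empty slot as "char (subscript) -> [sorted ids]".
  static String describe(HashSet<String>[] keySearch) {
    StringBuilder sb = new StringBuilder();
    for (int at = 0; at < keySearch.length; at++) {
      if (keySearch[at] != null) {
        sb.append((char) at).append(" (").append(at).append(") -> ")
          .append(new TreeSet<String>(keySearch[at])).append('\n');
      }
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    HashSet<String>[] keySearch = build(new String[][] {{"1", "ab"}, {"2", "bc"}});
    System.out.print(describe(keySearch));
    // prints:
    // a (97) -> [1]
    // b (98) -> [1, 2]
    // c (99) -> [2]
  }
}
```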

Therefore, when you want to add a new record, you only need to call the add method.

The implementation logic for search is similar to the above keySearch. For id search, you can directly use the get method of HashMap. For a search for a search term, the overall process is also to use word segmentation first, query second, and sorting last. Of course, the word segmentation here should be consistent with the word segmentation used for creation (that is, character segmentation is used when creating, and character segmentation is also used when searching).

In the getIds method, HashMap<String, Integer> idTimes = new HashMap<String, Integer>(); declares idTimes, which records how many tokens of the search term hit each object: the key is the unique identifier id and the value is the number of hits. HashSet<String> ids = new HashSet<String>(); declares ids, which stores the ids hit by at least one token. The lookup performs one array access per token, so a search term of n tokens needs n lookups (plus the work of walking the id set found in each slot). Once the ids containing the tokens are collected, a list of SortBean objects is built and sorted in descending order of hit count. Finally, the ids are returned as a string, each id separated by ",". To get the detailed information, use the getObjects method.
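
The sorting step can be sketched on its own as follows. `RankSketch` and `rank` are hypothetical names, and breaking ties by id is an addition for the sake of a repeatable order; the original code leaves the order of ids with equal hit counts to the HashSet iteration order.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the ranking step in getIds.
final class RankSketch {
  // Sort ids by hit count, highest first; ties are broken by id (not in the original).
  static List<String> rank(final Map<String, Integer> idTimes) {
    List<String> ids = new ArrayList<String>(idTimes.keySet());
    Collections.sort(ids, new Comparator<String>() {
      public int compare(String a, String b) {
        int byTimes = Integer.compare(idTimes.get(b), idTimes.get(a));
        return byTimes != 0 ? byTimes : a.compareTo(b);
      }
    });
    return ids;
  }

  public static void main(String[] args) {
    Map<String, Integer> idTimes = new HashMap<String, Integer>();
    idTimes.put("1", 1);
    idTimes.put("4", 1);
    idTimes.put("5", 2);
    System.out.println(rank(idTimes)); // prints [5, 1, 4]
  }
}
```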

The above is only a simple search engine and does not involve much algorithmic design. I hope it is a useful starting point for your own learning.

