[Java] A hands-on, step-by-step walkthrough of implementing a Java documentation search engine

Java Documentation Search Engine

What the finished project looks like


1. A brief look at what a search engine is

Let's first ask: what is a search engine?

The Baidu search engine we use all the time is one. The page looks very simple, but the code behind it is very complicated.

When we search, the core job of the search engine is to find content matching the words or sentence the user typed in.

A word like "cake" is called the query word (查询词), and the content returned must be related to that query word.

Generally the returned results look much alike, though some entries display a little more content than others.

When we click a result, we jump to its detail page (the landing page).


2. Search engine implementation ideas

For a search engine, the first requirement is a large number of web pages; searching then means matching the user's query words against those pages.
That raises two questions:

  1. How does a search engine get the pages?
    Answer: mainly through programs called "crawlers". A crawler is really just an HTTP client program that fetches data from websites and processes it.
  2. After the user enters a query word, how do we match it against the pages we have?
    Answer: suppose we used brute-force search. With 100 million documents we would have to scan all 100 million for every query, which is hopelessly slow, while a search engine is expected to return results the moment the user hits Enter. So we need a better data structure: the inverted index, which is central to every search engine.

2.1 Introduction to Inverted Index

Let's get to know the technical terms first:

  1. Document (文档): each web page to be searched.

  2. Forward index (正排索引): the mapping from document id to document content; given a document id, you can quickly find the corresponding content.

    • Document id: when we crawl a lot of pages, we give each one an id to tell them apart, like an ID card number; ids never repeat.
  3. Inverted index (倒排索引): the mapping from a word to a list of document ids. It is the exact opposite of the forward index: given an arbitrary word, it answers "which documents contain this word?". Since a word may appear in many documents, the answer is a list.

    • Word (词): a document's content is not one indivisible blob; it consists of paragraphs and sentences, and a sentence consists of many words.

Let's take a simple example:

Now we have 2 documents:

  1. Forward index

    Document ID | Article
    1           | Lei Jun released the Xiaomi phone
    2           | Lei Jun bought two catties of millet

Given document ID=1 we can quickly find the first article, and given document ID=2 the second one; this structure is the forward index (正排索引).

  2. Inverted index

    Word          | IDs of documents containing the word
    Lei Jun       | 1, 2
    released      | 1
    bought        | 2
    le (了)       | 1, 2
    Xiaomi (小米) | 1, 2
    phone         | 1

(In Chinese, "Xiaomi" the brand and "millet" are the same word, 小米, which is why that one word maps to both documents.)

Here we record, for each word, the IDs of the documents it appears in; this structure is the inverted index (倒排索引).

Of course, the tables above are just a convention; any representation with this shape works, and the concrete implementation is up to you.
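
To make this concrete, here is a minimal, standalone sketch (illustration only, not the project code we build later) of a forward index and an inverted index over the two example documents; splitting on spaces stands in for real word segmentation, and ids start at 0 here:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class InvertedIndexDemo {
    public static void main(String[] args) {
        // Forward index: the list position is the document id (0-based here)
        List<String> forward = Arrays.asList(
                "Lei-Jun released the Xiaomi phone",
                "Lei-Jun bought two catties of millet");

        // Inverted index: word -> list of ids of the documents containing it
        Map<String, List<Integer>> inverted = new HashMap<>();
        for (int docId = 0; docId < forward.size(); docId++) {
            // crude "segmentation" by spaces; real Chinese text needs a segmenter
            for (String word : forward.get(docId).split(" ")) {
                inverted.computeIfAbsent(word, k -> new ArrayList<>()).add(docId);
            }
        }

        System.out.println(forward.get(1));          // forward lookup: id -> content
        System.out.println(inverted.get("Lei-Jun")); // inverted lookup: word -> [0, 1]
    }
}

The forward lookup is a simple list access by id; the inverted lookup is a single hash lookup by word, which is exactly why a query does not need to scan every document.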


In fact, we run into the idea behind the inverted index all the time, for example in games.

Take Honor of Kings as an example: there is a hero named Daji with three skills:

Skill 1: area damage
Skill 2: stun
Skill 3: area damage

This is like the forward index: from the hero's name to the hero's skills.

Now ask the other way around: which heroes have a stun on their skill 2?

These heroes qualify: 1. Daji, 2. Little Luban (when enemies get close)...

So this direction goes from a hero's skill to the hero's name, which is exactly the inverted index.


2.2 Project goals

Implement a search engine for Java documentation

Search engines like Baidu and Bing do "whole-web search": they search websites across the entire Internet.

There is also "site search", which only searches the content of one particular website, like the built-in search of Zhihu or Baidu Tieba. That is our goal here.

Let's take a look at the Java documentation website (Java document address).

Inconveniently, this website has no search box!!! So let's build one!!!

2.3 Obtain the Java documentation

As we said just now, to search content you must first have the web pages; only then can you build an inverted index and search.

We have two ways to get them:

  1. Fetch the documents with a crawler
  2. Download the compressed package directly from the official website

We use the second: just download it directly, no crawler needed.

Website address: click to jump to download

After downloading and unpacking it, open one of the HTML files and compare it with the official online document: the content is identical.

The key point is the relationship between their paths (right-click a link and open it in a new window to see the address): from the docs directory onward the two paths are the same; only the prefix differs, and the local prefix is simply the directory we created ourselves.

Given this relationship, we can build the index locally from the offline documents, and when the user clicks a specific result on the search results page, jump automatically to the corresponding online documentation page.


2.4 Module Division

  1. Index module

1) Scan the downloaded documents, analyze their content, build the forward index + inverted index, and save the index contents to files.
2) Load the built index and provide APIs for querying the forward and inverted indexes.

2. Search module

Calls the index module to implement a complete search flow.
Input: the user's query
Output: complete search results (many records; each record has a title, a description, and a display URL, and can be clicked to jump to the page)

3. Web module

A web program that interacts with users through web pages (front end plus back end).


2.5 Create a project

Just create a Maven project directly


2.6 Understanding word segmentation

In the search engine, the query word entered by the user may not necessarily be a word, but may also be a sentence.

Word segmentation is to divide a complete sentence into multiple words

For example, "I want to buy cabbage" is segmented into I / want / buy / cabbage.

For humans, word segmentation is easy; making a machine do it in code is much harder.

Typical example:

I hold on to the handlebars and I want to live the life I want to live

Hahaha isn't it difficult?

By contrast, segmenting English is very simple, because there are spaces between the words.

So we will use a third-party library for this.

Of course, companies like Baidu have in-house teams building their own word segmentation systems, which are far more accurate than the open-source ones we use.


2.7 The principle of word segmentation

  1. Dictionary-based

    Exhaustively enumerate as many "words" as possible and put them into a dictionary file.
    Then walk through the sentence, looking up one character, then two characters, and so on, in the dictionary each time.

    Of course, some words will still be segmented inaccurately, and newly coined Internet slang will be missed.

  2. Statistics-based

    Collect a large "corpus", essentially human-labeled data, from which you learn which characters have a high probability of forming a word together.

Word segmentation implemented this way is a typical application scenario of "artificial intelligence": training models.
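
To make the dictionary-based idea concrete, here is a minimal sketch of forward maximum matching; the tiny dictionary is invented purely for illustration, and real segmenters are far more sophisticated:

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class MaxMatchDemo {
    // toy dictionary, made up for this illustration
    private static final Set<String> DICT =
            new HashSet<>(Arrays.asList("我", "想", "买", "白菜", "想买"));

    // forward maximum matching: at each position take the longest dictionary word
    public static void segment(String s) {
        int i = 0;
        while (i < s.length()) {
            int len = Math.min(4, s.length() - i);  // cap the candidate length
            while (len > 1 && !DICT.contains(s.substring(i, i + len))) {
                len--;  // shrink the window until a dictionary word matches
            }
            System.out.print(s.substring(i, i + len) + "/");
            i += len;   // falls back to a single character if nothing matched
        }
        System.out.println();
    }

    public static void main(String[] args) {
        segment("我想买白菜");  // prints: 我/想买/白菜/
    }
}

Note how the greedy longest match prefers 想买 over 想 / 买: this is simple and fast, and it is also exactly where dictionary-based segmentation can go wrong.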


2.8 Use a third-party word segmentation library

There are quite a lot of java third-party word segmentation libraries, we use ansj this .

<!-- https://mvnrepository.com/artifact/org.ansj/ansj_seg -->
<dependency>
    <groupId>org.ansj</groupId>
    <artifactId>ansj_seg</artifactId>
    <version>5.1.6</version>
</dependency>

If the dependency shows red in IDEA, click the Maven refresh button.

Let's write a code:

import org.ansj.domain.Term;
import org.ansj.splitWord.analysis.ToAnalysis;

import java.util.List;

/**
 * Created by Lin
 * Description:
 * User: Administrator
 * Date: 2022-12-14
 * Time: 19:29
 */
public class TestAnsj {
    public static void main(String[] args) {
        // Prepare a fairly long sentence to segment
        String str ="小明毕业于清华大学,后来又去蓝翔技校和新东方去深照,擅长使用计算机控制挖掘机来炒菜。";
        ToAnalysis analysis = new ToAnalysis();
        // A Term represents one segmentation result
        List<Term> terms = analysis.parse(str).getTerms();
        for (Term term : terms) {
            System.out.println(term.getName());
        }
    }
}

We use ToAnalysis here; if IDEA marks it in red, import the package manually.

We call the parse() method, which returns the library's own Result type. We want an ordinary collection, so we call getTerms() on it, which returns a List<Term>.

Running it prints the segmented words, plus some red log output. When segmenting, ansj tries to load some dictionary files, which speed up segmentation and improve accuracy; even without those dictionary files, ansj can still segment words quickly and quite accurately.

Note that uppercase English letters are converted to lowercase.


3. Implement the index module - the Parser class

Next we implement the index module

We expect this class to read the documents we downloaded earlier and build the index from them.

We first create a Parser class as the entry point of index building.

Concretely, this class does three things:

    1. Enumerate all files under the given path (the HTML files of all subdirectories); this must collect the files of every subdirectory.
    2. For each path listed above, open the file, read its content, parse it, and build the index.
    3. Save the index structure built in memory into specified files.

The first point: the official documents live in the api folder, so we want everything inside that folder.
The second point: open and parse each file in that folder.
The third point: write the finished index to files, so the program can load the index later.

Current stage code block:

public class Parser {
    // The path the documents are loaded from; it is fixed, so we use a static final class constant
    private static final String INPUT_PATH = "D:\\gitee\\doc_searcher_index\\docs\\api";  // only the files under the api folder

    public void run() {
        // Entry point of the whole Parser class
        // 1. Enumerate all files under the given path (the HTML files of all subdirectories)
        // 2. Open each listed file, read and parse its content, and build the index
        // 3. Save the in-memory index structure into specified files
    }

    public static void main(String[] args) {
        // Drive the whole index-building process from main
        Parser parser = new Parser();
        parser.run();
    }
}


3.1 Implement the index module - recursively enumerate files

Enumerate all the files into a collection. First the code:

import java.io.File;
import java.util.ArrayList;


/**
 * Created by Lin
 * Description:
 * User: Administrator
 * Date: 2022-12-15
 * Time: 19:15
 */
public class Parser {
    // The path the documents are loaded from; it is fixed, so we use a static final class constant
    private static final String INPUT_PATH = "D:\\gitee\\doc_searcher_index\\docs\\api";  // only the files under the api folder

    public void run() {
        // Entry point of the whole Parser class
        // 1. Enumerate all files under the given path (the HTML files of all subdirectories)
        ArrayList<File> fileList = new ArrayList<>();
        enumFile(INPUT_PATH, fileList);
        // 2. Open each listed file, read and parse its content, and build the index
        // 3. Save the in-memory index structure into specified files
        System.out.println(fileList);
        // check the number of files
        System.out.println(fileList.size());
    }

    // First parameter: the directory to start traversing from; second parameter: collects the results of the recursion
    private void enumFile(String inputPath, ArrayList<File> fileList) {
        // Turn the String path into a File object, which is easier to work with
        File rootPath = new File(inputPath);
        // listFiles() is like ls on Linux: it returns the entries of the current directory
        // listFiles() only sees one level, so recursion is needed for subdirectories
        File[] files = rootPath.listFiles();
        for (File file : files) {
            // Decide whether to recurse based on the type of the current file:
            // a regular file is added to fileList;
            // a directory triggers a recursive enumFile call to collect the subdirectory's contents
            if (file.isDirectory()) {
                // the root path changes for the recursive call
                enumFile(file.getAbsolutePath(), fileList);
            } else {
                // regular file
                fileList.add(file);
            }
        }
    }

    public static void main(String[] args) {
        // Drive the whole index-building process from main
        Parser parser = new Parser();
        parser.run();
    }
}

Here we created the enumFile() method and used the listFiles() function to list the entries of the directory at the target path.

The idea for collecting all files: check whether each entry is a directory or a file. A file is added to fileList; a directory is handled by recursing inside the method. See the code comments for details.

Looking at the results, we find not only HTML files but other files too; we should filter them out and keep only the HTML files.


3.2 Exclude non-HTML files

The idea of excluding non-HTML files is very simple: just check the file's suffix, which the endsWith() function does for us.

import java.io.File;
import java.util.ArrayList;


/**
 * Created by Lin
 * Description:
 * User: Administrator
 * Date: 2022-12-15
 * Time: 19:15
 */
public class Parser {
    // The path the documents are loaded from; it is fixed, so we use a static final class constant
    private static final String INPUT_PATH = "D:\\gitee\\doc_searcher_index\\docs\\api";  // only the files under the api folder

    public void run() {
        // Entry point of the whole Parser class
        // 1. Enumerate all files under the given path (the HTML files of all subdirectories)
        ArrayList<File> fileList = new ArrayList<>();
        enumFile(INPUT_PATH, fileList);
        // 2. Open each listed file, read and parse its content, and build the index
        // 3. Save the in-memory index structure into specified files
        System.out.println(fileList);
        System.out.println(fileList.size());
    }

    // First parameter: the directory to start traversing from; second parameter: collects the results of the recursion
    private void enumFile(String inputPath, ArrayList<File> fileList) {
        // Turn the String path into a File object, which is easier to work with
        File rootPath = new File(inputPath);
        // listFiles() is like ls on Linux: it returns the entries of the current directory
        // listFiles() only sees one level, so recursion is needed for subdirectories
        File[] files = rootPath.listFiles();
        for (File file : files) {
            // Decide whether to recurse based on the type of the current file
            if (file.isDirectory()) {
                // the root path changes for the recursive call
                enumFile(file.getAbsolutePath(), fileList);
            } else {
                // only keep HTML files
                if (file.getAbsolutePath().endsWith(".html")) {
                    // regular HTML file
                    fileList.add(file);
                }
            }
        }
    }

    public static void main(String[] args) {
        // Drive the whole index-building process from main
        Parser parser = new Parser();
        parser.run();
    }
}

Just add this check.


3.3 Implement the index module - parse HTML

Why parse the HTML? Each of our search results contains a title, a description, and a display URL, and all of that information comes from the HTML we parse.

So parsing an HTML file means extracting its title, description, and URL. The key question is what the description (描述) is: we can regard the description as a summary of the body text. To produce a description we must first have the body text, so let's set the description aside for now and extract the body first.

So our current tasks are:

  1. Parse out the HTML title
  2. Parse out the URL corresponding to the HTML
  3. Parse out the body text corresponding to the HTML (the description can only come later, from the body)

import java.io.File;
import java.util.ArrayList;


/**
 * Created by Lin
 * Description:
 * User: Administrator
 * Date: 2022-12-15
 * Time: 19:15
 */
public class Parser {
    // The path the documents are loaded from; it is fixed, so we use a static final class constant
    private static final String INPUT_PATH = "D:\\gitee\\doc_searcher_index\\docs\\api";  // only the files under the api folder

    public void run() {
        // Entry point of the whole Parser class
        // 1. Enumerate all files under the given path (the HTML files of all subdirectories)
        ArrayList<File> fileList = new ArrayList<>();
        enumFile(INPUT_PATH, fileList);
        // 2. Open each listed file, read and parse its content, and build the index
        for (File f : fileList) {
            // parse a single HTML file
            System.out.println("开始解析:" + f.getAbsolutePath());
            parseHTML(f);
        }
        // 3. Save the in-memory index structure into specified files
//        System.out.println(fileList);
//        System.out.println(fileList.size());
    }

    // Parse a single HTML file
    private void parseHTML(File f) {
        // 1. Parse out the HTML title
        String title = parseTitle(f);
        // 2. Parse out the URL corresponding to the HTML
        String url = parseUrl(f);
        // 3. Parse out the body text corresponding to the HTML (the description comes later, from the body)
        String content = parseContent(f);
    }

    private String parseContent(File f) {
        return null;  // TODO: implemented in a later section
    }

    private String parseUrl(File f) {
        return null;  // TODO: implemented in a later section
    }

    private String parseTitle(File f) {
        return null;  // TODO: implemented in a later section
    }

    // First parameter: the directory to start traversing from; second parameter: collects the results of the recursion
    private void enumFile(String inputPath, ArrayList<File> fileList) {
        // Turn the String path into a File object, which is easier to work with
        File rootPath = new File(inputPath);
        // listFiles() only sees one level, so recursion is needed for subdirectories
        File[] files = rootPath.listFiles();
        for (File file : files) {
            if (file.isDirectory()) {
                // a directory: recurse with the new root path
                enumFile(file.getAbsolutePath(), fileList);
            } else {
                // only keep HTML files
                if (file.getAbsolutePath().endsWith(".html")) {
                    // regular HTML file
                    fileList.add(file);
                }
            }
        }
    }

    public static void main(String[] args) {
        // Drive the whole index-building process from main
        Parser parser = new Parser();
        parser.run();
    }
}

In this step we added the parseHTML() method, plus stubs for the content, title, and URL parsing methods that we will fill in next.


3.4 Implement the index module - parsing the title

Next we start parsing the HTML title

There are two possible approaches:

  1. Take the content of the <title> tag inside the HTML file.
  2. Use the file name: each file name is essentially the same as the HTML title.

So we simply take the file name as the title.

Here we use the getName() function, which returns the HTML file name.
The title we show in search results should not carry a suffix, so the .html part needs to be removed.

Idea for removing the suffix: use substring(). The only subtlety is how to cut off exactly ".html": substring(begin, end) includes begin and excludes end, so we take the total length minus the length of ".html" to keep everything before ".html":

  public static void main(String[] args) {
        File f = new File("D:\\gitee\\doc_searcher_index\\docs\\api\\java\\util\\ArrayList.html");
        System.out.println(f.getAbsolutePath());
        System.out.println(f.getName().substring(0, f.getName().length() - ".html".length()));
    }

".html" is itself a string, so we can call length() on it.

Implementation of parseTitle():

 private String parseTitle(File f) {
        // get the file name
        String name = f.getName();
        return name.substring(0, name.length() - ".html".length());
    }

3.5 Implement the index module - the idea of parsing the URL

In a real search engine, the displayed URL and the redirect URL are actually different: a click first goes to the search engine's own server and is then forwarded to the page. Here we don't need to bother; we use a single URL for both displaying and jumping.

Why would a click go through the search engine's server first? For these reasons:

  1. For advertising results, billing is per click
  2. For organic results, the user experience is optimized based on clicks

Our idea for building the URL:

The effect we ultimately want is that clicking a search result takes the user to the corresponding online documentation page.

We have two copies of the Java API documentation, online and offline, and their paths share a common part:

https://docs.oracle.com/javase/8/docs/api/java/util/ArrayList.html
D:/gitee/doc_searcher_index/docs/api/java/util/ArrayList.html

Before the docs directory the two differ; everything after it is the same.

The jump target we finally want is the official one: "https://docs.oracle.com/javase/8/docs/api/java/util/ArrayList.html"

So we can splice: store the trailing part of the local path, and prepend the fixed official prefix to it. That realizes the mapping between the two.


3.6 Implement the index module - parsing the URL: the code

Test the Test class:

   private static final String INPUT_PATH  ="D:\\gitee\\doc_searcher_index\\docs\\api";     // 只需要api文件夹下的文件

    public static void main(String[] args) {
    
    
        File f = new File("D:\\gitee\\doc_searcher_index\\docs\\api\\java\\util\\ArrayList.html");
        //固定的前缀
        String path = "https://docs.oracle.com/javase/8/docs/api/";
        //只放一个参数的意思是:前面一段都不需要,取后面的一段
        String path2=   f.getAbsolutePath().substring(INPUT_PATH.length());
        String result = path + path2;
        System.out.println(result);
    }

This is just splicing with substring(): INPUT_PATH is the local prefix that differs from the official path; we cut it off, keep the tail, and prepend the official prefix.

The result looks a bit twisted because the tail still contains Windows backslashes. We could normalize them with replaceAll(), or simply ignore it: once the URL is pasted into a browser, the browser normalizes it by itself. Major browsers have no problem with this.

  private String parseUrl(File f) {
        // the fixed prefix
        String path = "https://docs.oracle.com/javase/8/docs/api/";
        // one-argument substring: drop the leading part, keep the rest
        String path2 = f.getAbsolutePath().substring(INPUT_PATH.length());
        return path + path2;
    }

3.7 Implement the index module - the idea of parsing the body text

There are many ways to strip tags, for example regular expressions; we take a simple, brute-force one.

HTML tags have a very distinctive shape. We read the HTML one character at a time and inspect each character: if it is < (left angle bracket), we stop copying characters into the result from that position on, until we meet > (right angle bracket). In other words, every character outside angle brackets is copied straight into a result StringBuilder.

Demo:

<div>This is the content</div>

When we read the first < we stop copying what follows; when we read > we start copying again. So we keep a flag bit: < sets it to false (copying off) and > sets it to true (copying on).

What if the content itself contains < or >? In fact HTML requires them to be escaped in text content as &lt; and &gt;.


3.8 Implement the index module - parsing the body text: the code

public String parseContent(File f) {
        // Read one character at a time; '<' and '>' toggle the copy switch
        try (FileReader fileReader = new FileReader(f)) {
            // the copy switch
            boolean isCopy = true;
            // a StringBuilder to hold the result
            StringBuilder content = new StringBuilder();
            while (true) {
                // read() returns an int; -1 means end of file
                int ret = fileReader.read();
                if (ret == -1) {
                    // the file has been read completely
                    break;
                }
                // anything other than -1 is a valid character
                char c = (char) ret;
                if (isCopy) {
                    // switch on: we may copy
                    if (c == '<') {
                        isCopy = false;
                        continue;
                    }
                    // check for line breaks
                    if (c == '\n' || c == '\r') {
                        // turn line breaks into spaces
                        c = ' ';
                    }
                    // copy all other characters into the StringBuilder
                    content.append(c);
                } else {
                    // switch off: wait for '>'
                    if (c == '>') {
                        isCopy = true;
                    }
                }
            }
            return content.toString();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return "";
    }

Two things to note in this code: the FileReader must be closed, or resources leak (the try-with-resources handles that here); the rest is the switch logic, i.e. what to do while copying is off and what to do while it is on.


3.9 Parser class summary

We can now parse the title, body, and URL of an HTML page; all three will be used later.
Next, we need to create an index class and put the parsed information into the index, and put the index built in memory into the specified file.

Parser class code:

import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;


/**
 * Created by Lin
 * Description:
 * User: Administrator
 * Date: 2022-12-15
 * Time: 19:15
 */
public class Parser {
    // The path the documents are loaded from; it is fixed, so we use a static final class constant
    private static final String INPUT_PATH = "D:\\gitee\\doc_searcher_index\\docs\\api";  // only the files under the api folder

    public void run() {
        // Entry point of the whole Parser class
        // 1. Enumerate all files under the given path (the HTML files of all subdirectories)
        ArrayList<File> fileList = new ArrayList<>();
        enumFile(INPUT_PATH, fileList);
        // 2. Open each listed file, read and parse its content, and build the index
        for (File f : fileList) {
            // parse a single HTML file
            System.out.println("开始解析:" + f.getAbsolutePath());
            parseHTML(f);
        }
        // 3. TODO: save the in-memory index structure into specified files
    }

    // Parse a single HTML file
    private void parseHTML(File f) {
        // 1. Parse out the HTML title
        String title = parseTitle(f);
        // 2. Parse out the URL corresponding to the HTML
        String url = parseUrl(f);
        // 3. Parse out the body text corresponding to the HTML (the description comes later, from the body)
        String content = parseContent(f);
        // 4. TODO: add the parsed information to the index
    }

    public String parseContent(File f) {
        // Read one character at a time; '<' and '>' toggle the copy switch
        try (FileReader fileReader = new FileReader(f)) {
            // the copy switch
            boolean isCopy = true;
            // a StringBuilder to hold the result
            StringBuilder content = new StringBuilder();
            while (true) {
                // read() returns an int; -1 means end of file
                int ret = fileReader.read();
                if (ret == -1) {
                    // the file has been read completely
                    break;
                }
                char c = (char) ret;
                if (isCopy) {
                    // switch on: we may copy
                    if (c == '<') {
                        isCopy = false;
                        continue;
                    }
                    // turn line breaks into spaces
                    if (c == '\n' || c == '\r') {
                        c = ' ';
                    }
                    // copy all other characters into the StringBuilder
                    content.append(c);
                } else {
                    // switch off: wait for '>'
                    if (c == '>') {
                        isCopy = true;
                    }
                }
            }
            return content.toString();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return "";
    }

    private String parseUrl(File f) {
        // the fixed prefix
        String path = "https://docs.oracle.com/javase/8/docs/api/";
        // one-argument substring: drop the leading part, keep the rest
        String path2 = f.getAbsolutePath().substring(INPUT_PATH.length());
        return path + path2;
    }

    private String parseTitle(File f) {
        // get the file name
        String name = f.getName();
        return name.substring(0, name.length() - ".html".length());
    }

    // First parameter: the directory to start traversing from; second parameter: collects the results of the recursion
    private void enumFile(String inputPath, ArrayList<File> fileList) {
        // Turn the String path into a File object, which is easier to work with
        File rootPath = new File(inputPath);
        // listFiles() only sees one level, so recursion is needed for subdirectories
        File[] files = rootPath.listFiles();
        for (File file : files) {
            if (file.isDirectory()) {
                // a directory: recurse with the new root path
                enumFile(file.getAbsolutePath(), fileList);
            } else {
                // only keep HTML files
                if (file.getAbsolutePath().endsWith(".html")) {
                    fileList.add(file);
                }
            }
        }
    }

    public static void main(String[] args) {
        // Drive the whole index-building process from main
        Parser parser = new Parser();
        parser.run();
    }
}

4. Implement the index module - the Index class

We need to create an Index class; its main job is to build the index structure in memory.

// This class builds the index structure in memory
public class Index {
    // Methods this class needs to provide:
    // 1. Given a docId, look up the document's details in the forward index
    public DocInfo getDocInfo(int docId) {
        // TODO
        return null;
    }
    // 2. Given a term, look up in the inverted index which documents are associated with it
    public List<Weight> getInverted(String term) {
        // TODO
        return null;
    }
    // 3. Add a document to the index
    public void addDoc(String title, String url, String content) {
        // TODO
    }
    // 4. Save the in-memory index structure to disk
    public void save() {
        // TODO
    }
    // 5. Load the index data from disk into memory
    public void load() {
        // TODO
    }
}

Next, let's explain what each method means:

  1. public DocInfo getDocInfo(int docId)   // given a docId, look up the document's details in the forward index

    This is the forward index: given a document id, return that document's information. We need a class to represent the returned content, so we create a new one:

public class DocInfo {
    private int docId;
    private String title;
    private String url;
    private String content;

    public int getDocId() {
        return docId;
    }

    public void setDocId(int docId) {
        this.docId = docId;
    }

    public String getTitle() {
        return title;
    }

    public void setTitle(String title) {
        this.title = title;
    }

    public String getUrl() {
        return url;
    }

    public void setUrl(String url) {
        this.url = url;
    }

    public String getContent() {
        return content;
    }

    public void setContent(String content) {
        this.content = content;
    }
}

These attributes are exactly the pieces we parsed earlier; they correspond one-to-one.

  2. public List<Weight> getInverted(String term)  // given a term, look up in the inverted index which documents are associated with it

    A term may be associated with many documents, so we return a List. The parameter is a single term from the segmentation result: the user may type a whole sentence, and we search per segmented term rather than for the raw input. The Weight in List<Weight> is the document's weight: a query term is strongly relevant to some documents and weakly relevant to others.

// Wraps a document id together with the document's relevance weight
public class Weight {
    private int docId;

    // weight expresses the relevance between the document and the term;
    // the larger the value, the stronger the relevance
    private int weight;

    public int getDocId() {
        return docId;
    }

    public void setDocId(int docId) {
        this.docId = docId;
    }

    public int getWeight() {
        return weight;
    }

    public void setWeight(int weight) {
        this.weight = weight;
    }
}

This weight indicates the correlation between the document and the word. The larger the value, the stronger the correlation


  3. public void addDoc(String title, String url, String content)

    Adds a document to the index.

  4. public void save()

    Saves the in-memory index structure to disk.

  5. public void load()

    Loads the index data from disk into memory.


4.1 Implement the index module - the index structure

The concrete structure for the forward index:

Use an ArrayList, with the array subscript acting as the DocId: the document with id 0 sits at position 0, the one with id 100 at position 100. ArrayList's get() then finds the element by subscript; that is the forward index.

The concrete structure for the inverted index:

Use a hash table: the key is the search term, the value is the group of articles related to that term, stored as an ArrayList<Weight>. The element type is Weight because each entry must carry both a DocId and a relevance weight.

code:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;

/**
 * Created by Lin
 * Description:
 * User: Administrator
 * Date: 2022-12-17
 * Time: 13:01
 */

// This class builds the index structure in memory
public class Index {
    // The array subscript represents the DocId
    private ArrayList<DocInfo> forwardIndex = new ArrayList<>();

    // A hash table for the inverted index: key = term, value = the group of articles associated with it
    private HashMap<String, ArrayList<Weight>> invertedIndex = new HashMap<>();

    // Methods this class needs to provide:
    // 1. Given a docId, look up the document's details in the forward index
    public DocInfo getDocInfo(int docId) {
        return forwardIndex.get(docId);
    }
    // 2. Given a term, look up in the inverted index which documents are associated with it
    public List<Weight> getInverted(String term) {
        return invertedIndex.get(term);
    }
    // 3. Add a document to the index
    public void addDoc(String title, String url, String content) {
        // TODO
    }
    // 4. Save the in-memory index structure to disk
    public void save() {
        // TODO
    }
    // 5. Load the index data from disk into memory
    public void load() {
        // TODO
    }
}

4.2 Implement the index module - the forward index

We analyzed earlier what we need: adding a document must update both the forward index and the inverted index. Let's start with the simpler part, the forward index:

    // 3. Add a document to the index
    public void addDoc(String title, String url, String content) {
        // Adding a document must update both the forward and the inverted index
        // build the forward index entry
        DocInfo docInfo = buildForward(title, url, content);
        // build the inverted index entries
        buildInverted(docInfo);
    }

    private DocInfo buildForward(String title, String url, String content) {
        DocInfo docInfo = new DocInfo();
        docInfo.setDocId(forwardIndex.size());
        docInfo.setTitle(title);
        docInfo.setUrl(url);
        docInfo.setContent(content);
        forwardIndex.add(docInfo);
        return docInfo;
    }

Let's focus on the forward-index part:

 private DocInfo buildForward(String title, String url, String content) {
        DocInfo docInfo = new DocInfo();
        docInfo.setDocId(forwardIndex.size());
        docInfo.setTitle(title);
        docInfo.setUrl(url);
        docInfo.setContent(content);
        forwardIndex.add(docInfo);
        return docInfo;
    }

This code packs the three parsed parameters into a DocInfo. The interesting part is the DocId: how does it auto-increment 0, 1, 2, 3, ...? Notice that before any add() the ArrayList's size() is 0, so the first document gets id 0; after forwardIndex.add(docInfo) the size becomes 1, so the next document gets id 1, and so on.
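
A tiny standalone illustration of this size()-as-id trick (demo code, not part of the project):

import java.util.ArrayList;

public class SizeAsIdDemo {
    public static void main(String[] args) {
        ArrayList<String> docs = new ArrayList<>();
        for (String title : new String[]{"ArrayList", "HashMap", "String"}) {
            int docId = docs.size();  // size before add(): 0, 1, 2, ...
            docs.add(title);
            System.out.println(docId + " -> " + docs.get(docId));
        }
    }
}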


4.3 Implement the index module - building the inverted index

Let's look at the idea behind the inverted index (倒排索引):

The inverted index is the mapping from terms to document ids, so we need to know which terms occur in the current document.

Therefore we segment the current document, both the title and the body, and from the segmentation results we know under which inverted-index keys the current document id must be added.

We must remember that the inverted index is a key-value structure (HashMap): the key is a segmented term (term), and the value is the list of documents associated with that term.

So for the current document: segment it first, then for each resulting term look up the corresponding value in the inverted index, and append the current document id to that value list.

One more thing: our inverted index is not a plain hash table. Its values are ArrayList<Weight>, where Weight describes the relevance between a term and a document.

We express relevance simply: the more often a term appears in an article, the higher the relevance. That is our own simple, crude method; in a real search engine, relevance is the job of a dedicated algorithm team.

So what we do next:

  1. Segment the document title
  2. Traverse the segmentation results, counting each term's occurrences
  3. Segment the body text
  4. Traverse the segmentation results, counting each term's occurrences
  5. Collect the above results into one HashMap

The final weight of a document for a term is set to: occurrences in the title * 10 + occurrences in the body.
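
For example, a term that appears twice in a document's title and five times in its body gets the weight 2 * 10 + 5 = 25 for that document.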


4.4 How to improve the weight formula

Let's look at how this is improved in real production. To improve the formula, you first need a way to evaluate how good it is.

Real search engines often measure with "click-through rate": click-through rate = clicks / impressions.

For example, if a user searches, does nothing, and closes the browser, each result page shown still counts as one impression.

If every result we showed were clicked through, the click-through rate would be 100%; in reality it is usually only a few thousandths. That is normal: one search returns thousands of results, and nobody visits them all.

Click-through rate is one strategy; several other metrics can be used alongside it.

For a real search engine with heavy traffic, say 100 million visits per day, we can split the traffic into portions of 30%, 30%, and 40%: serve formula A to the first 30%, formula B to the second 30%, and formula C to the remaining 40%, then measure the click-through rate of each separately. Over a series of such iterations the formula grows more and more sophisticated and clicks keep improving, and finally the best formula wins. This kind of work is called a small-traffic experiment; each full rollout of a winning formula then amplifies the effect a little more.


4.5 Implement the index module - term frequency statistics

We are now going to implement a word frequency statistics function:

 private void buildInverted(DocInfo docInfo) {
        // A small inner class, so we don't need two separate hash tables
        class WordCnt {
            // number of times this term appears in the title
            public int titleCount;
            // number of times this term appears in the body
            public int contentCount;
        }
        // data structure for the term-frequency statistics
        HashMap<String, WordCnt> wordCntHashMap = new HashMap<>();

        // 1. Segment the document title. We can read from docInfo directly because
        //    the forward-index step already filled it in
        List<Term> terms = ToAnalysis.parse(docInfo.getTitle()).getTerms();
        // 2. Iterate over the segmentation results and count each term's occurrences
        for (Term term : terms) {
            // First check whether the term already exists; if not, create a new
            // entry with titleCount = 1
            // getName() returns the concrete segmented term
            String word = term.getName();
            // HashMap.get returns null if the key is absent
            WordCnt wordCnt = wordCntHashMap.get(word);
            if (wordCnt == null) {
                // term not seen yet
                WordCnt newWordCnt = new WordCnt();
                newWordCnt.titleCount = 1;
                newWordCnt.contentCount = 0;
                wordCntHashMap.put(word, newWordCnt);
            } else {
                // already seen: increment the existing titleCount
                wordCnt.titleCount += 1;
            }
        }
        // 3. Segment the body text
        terms = ToAnalysis.parse(docInfo.getContent()).getTerms();
        for (Term term : terms) {
            String word = term.getName();
            WordCnt wordCnt = wordCntHashMap.get(word);
            if (wordCnt == null) {
                WordCnt newWordCnt = new WordCnt();
                newWordCnt.titleCount = 0;
                newWordCnt.contentCount = 1;
                wordCntHashMap.put(word, newWordCnt);
            } else {
                wordCnt.contentCount += 1;
            }
        }
        // 4. Iterate over the segmentation results and count each term's occurrences
        // 5. Collect the results above into one HashMap
        //    final document weight = title occurrences * 10 + body occurrences
        // Then iterate over this HashMap and update the inverted index structure accordingly.
    }

The logic is: for each term, first check whether it already exists in the frequency table. If it does not exist, insert a new entry; if it does, update the count in place. The corresponding code:

 WordCnt wordCnt = wordCntHashMap.get(word);
            if (wordCnt == null) {
                // term not seen yet
                WordCnt newWordCnt = new WordCnt();
                newWordCnt.titleCount = 1;
                newWordCnt.contentCount = 0;
                wordCntHashMap.put(word, newWordCnt);
            } else {
                // already seen: increment the existing count
                wordCnt.titleCount += 1;
            }

A word about Weight again: it pairs a document id with how often a term appears. Suppose document id 2 is "Lei Jun bought Xiaomi's Xiaomi mobile phone", where "Xiaomi" appears twice. The inverted index entry for "Xiaomi" will then record document 2 with a count of 2; that is exactly the term-frequency statistic we just implemented.


4.6 Implement the index module - building the inverted index: the code

private void buildInverted(DocInfo docInfo) {
        // A small inner class, so we don't need two separate hash tables
        class WordCnt {
            // number of times this term appears in the title
            public int titleCount;
            // number of times this term appears in the body
            public int contentCount;
        }
        // data structure for the term-frequency statistics
        HashMap<String, WordCnt> wordCntHashMap = new HashMap<>();

        // 1. Segment the document title
        List<Term> terms = ToAnalysis.parse(docInfo.getTitle()).getTerms();
        // 2. Iterate over the segmentation results and count each term's occurrences
        for (Term term : terms) {
            // First check whether the term already exists; if not, create a new entry with titleCount = 1
            String word = term.getName();
            // HashMap.get returns null if the key is absent
            WordCnt wordCnt = wordCntHashMap.get(word);
            if (wordCnt == null) {
                // term not seen yet
                WordCnt newWordCnt = new WordCnt();
                newWordCnt.titleCount = 1;
                newWordCnt.contentCount = 0;
                wordCntHashMap.put(word, newWordCnt);
            } else {
                // already seen: increment the existing titleCount
                wordCnt.titleCount += 1;
            }
        }
        // 3. Segment the body text
        terms = ToAnalysis.parse(docInfo.getContent()).getTerms();
        // 4. Iterate over the segmentation results and count each term's occurrences
        for (Term term : terms) {
            String word = term.getName();
            WordCnt wordCnt = wordCntHashMap.get(word);
            if (wordCnt == null) {
                WordCnt newWordCnt = new WordCnt();
                newWordCnt.titleCount = 0;
                newWordCnt.contentCount = 1;
                wordCntHashMap.put(word, newWordCnt);
            } else {
                wordCnt.contentCount += 1;
            }
        }
        // 5. The results are now collected in one HashMap.
        //    Final document weight = title occurrences * 10 + body occurrences
        // 6. Iterate over this HashMap and update the inverted index structure accordingly.
        for (Map.Entry<String, WordCnt> entry : wordCntHashMap.entrySet()) {
            // Look the term up in the inverted index first
            // the posting list (倒排拉链)
            List<Weight> invertedList = invertedIndex.get(entry.getKey());
            if (invertedList == null) {
                // absent: insert a new key-value pair
                ArrayList<Weight> newInvertedList = new ArrayList<>();
                Weight weight = new Weight();
                weight.setDocId(docInfo.getDocId());
                // weight formula: title occurrences * 10 + body occurrences
                weight.setWeight(entry.getValue().titleCount * 10 + entry.getValue().contentCount);
                newInvertedList.add(weight);
                invertedIndex.put(entry.getKey(), newInvertedList);
            } else {
                // present: build a Weight for the current document and append it to the posting list
                Weight weight = new Weight();
                weight.setDocId(docInfo.getDocId());
                // weight formula: title occurrences * 10 + body occurrences
                weight.setWeight(entry.getValue().titleCount * 10 + entry.getValue().contentCount);
                invertedList.add(weight);
            }
        }
    }

Having written the code up to here, let's first figure out why we need Map.Entry and what it is:

  1. Not everything can be used in a for-each loop; only objects that are "iterable", i.e. that implement the Iterable interface.
  2. Map does not implement it: the point of a Map is to find a value by its key. Fortunately Set does implement Iterable, so a Map can be converted into a Set.
  3. A Map stores key-value pairs and finds a value quickly given its key.
  4. In the converted Set, each key-value pair is packed together into a class called Entry.
  5. After converting to a Set we lose the fast key lookup, but in exchange we can traverse the entries.

Each element is one key-value pair, an Entry.
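
A minimal standalone example of this traversal (demo code with made-up word counts, not the project code):

import java.util.HashMap;
import java.util.Map;

public class EntryDemo {
    public static void main(String[] args) {
        HashMap<String, Integer> wordCnt = new HashMap<>();
        wordCnt.put("xiaomi", 2);
        wordCnt.put("phone", 1);

        // entrySet() converts the map into a Set of Entry objects;
        // Set implements Iterable, so a for-each loop works
        for (Map.Entry<String, Integer> entry : wordCnt.entrySet()) {
            System.out.println(entry.getKey() + " -> " + entry.getValue());
        }
    }
}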

Recall the structure of our inverted index: the key is a term, the value is a group of documents.

For each term we look it up in the inverted index; the return value is that term's group of articles (the posting list).

If the list is null, we create a new key-value pair. The inverted index needs a term plus a container of articles, and no container exists yet, so we first create an ArrayList, then create a Weight object and put the article's id and weight into the container. Pay attention to how the weight is set: the Entry's key is the term (a String) and its value is the WordCnt from the frequency-counting step above, so we can read the term frequency straight out of the Entry's value and apply the weight formula. Finally we put the newly built key-value pair into the inverted index.

If the posting list is not null, we simply build a Weight, set its docId and weight, and append it to the list. This final loop is the real heart of building the inverted index.


4.7 Implement the index module - why we save and load the index

We have now built the index, but there is still a problem: the index lives in memory, yet we need it on disk, because building the index is time-consuming. It is rebuilt on every run, and we already have thousands of documents; a real search engine has hundreds of millions or even billions. Rebuilding every time would be very slow.

Therefore we should not build the index at server startup (it could slow the server's start down badly).

So we split this time-consuming work off and run it separately; afterwards, the online server simply loads the ready-made index.

How do we save it to a file? A file holds either binary or text data, and text data is, bluntly, a "string". Converting the in-memory index structure into a string and writing it to a file is called serialization; parsing that string back into structured data (classes, objects, basic data structures) is called deserialization.

There are many ready-made ways to serialize and deserialize. Here we use the JSON format via the jackson library, which makes both directions very simple.
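
As a minimal sketch of the round trip (the file name here is just for the demo; the next sections do the same thing with the real index structures):

import com.fasterxml.jackson.core.type.TypeReference;
import com.fasterxml.jackson.databind.ObjectMapper;

import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;

public class JacksonRoundTrip {
    public static void main(String[] args) throws IOException {
        ObjectMapper mapper = new ObjectMapper();
        File file = new File("demo.json");  // throwaway file for this demo

        // serialization: in-memory structure -> JSON text in a file
        ArrayList<String> data = new ArrayList<>(Arrays.asList("a", "b"));
        mapper.writeValue(file, data);

        // deserialization: JSON text -> in-memory structure;
        // TypeReference carries the generic type through type erasure
        ArrayList<String> loaded =
                mapper.readValue(file, new TypeReference<ArrayList<String>>() {});
        System.out.println(loaded);  // [a, b]
    }
}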


4.8 Implement the index module - saving the index files

Jackson's Maven address: Jackson Maven address

<!-- https://mvnrepository.com/artifact/com.fasterxml.jackson.core/jackson-databind -->
<dependency>
    <groupId>com.fasterxml.jackson.core</groupId>
    <artifactId>jackson-databind</artifactId>
    <version>2.14.1</version>
</dependency>

Refresh Maven. Then, in the Index class, create an ObjectMapper instance (a field named objectMapper) and define the path where the index files are saved, an INDEX_PATH constant (two backslashes per separator, and one at the end).

// 4. Save the in-memory index structure to disk
    public void save() {
        // Use two files, one for the forward index and one for the inverted index
        System.out.println("保存索引开始");
        // 1. Check whether the index directory exists
        File indexPathFile = new File(INDEX_PATH);
        if (!indexPathFile.exists()) {
            // the path does not exist; mkdirs() can create multi-level directories
            indexPathFile.mkdirs();
        }
        // create the forward index file
        File forwardIndexFile = new File(INDEX_PATH + "forward.txt");
        // create the inverted index file
        File invertedIndexFile = new File(INDEX_PATH + "inverted.txt");
        try {
            // writeValue can write an object straight into a file
            objectMapper.writeValue(forwardIndexFile, forwardIndex);
            objectMapper.writeValue(invertedIndexFile, invertedIndex);
        } catch (IOException e) {
            e.printStackTrace();
        }
        System.out.println("保存索引完成");
    }

4.9 Implement the index module - loading the index

Saving writes the in-memory data into files; loading writes the file data back into memory. We save during the index-building stage and load when the server program starts:

 // 5. Load the index data from disk into memory
    public void load() {
        System.out.println("加载索引开始");
        // 1. Set the paths to load the index from
        // forward index
        File forwardIndexFile = new File(INDEX_PATH + "forward.txt");
        // inverted index
        File invertedIndexFile = new File(INDEX_PATH + "inverted.txt");
        try {
            // readValue() takes two arguments: which file to read, and what type to parse the data as
            forwardIndex = objectMapper.readValue(forwardIndexFile,
                    new TypeReference<ArrayList<DocInfo>>() {});
            invertedIndex = objectMapper.readValue(invertedIndexFile,
                    new TypeReference<HashMap<String, ArrayList<Weight>>>() {});
        } catch (IOException e) {
            e.printStackTrace();
        }
        System.out.println("加载索引结束");
    }

The interesting part is reading the file contents back into memory. We again use Jackson's readValue(): the first argument is the file to read from, the second describes the type to parse into; the library provides new TypeReference<>() {}, with the target type written inside the angle brackets.


4.10 Implement the index module - timing the save and load operations

    // 4. Save the in-memory index structure to disk
    public void save() {
        long beg = System.currentTimeMillis();
        // Use two files, one for the forward index and one for the inverted index
        System.out.println("保存索引开始");
        // 1. Check whether the index directory exists
        File indexPathFile = new File(INDEX_PATH);
        if (!indexPathFile.exists()) {
            // the path does not exist; mkdirs() can create multi-level directories
            indexPathFile.mkdirs();
        }
        // forward index file
        File forwardIndexFile = new File(INDEX_PATH + "forward.txt");
        // inverted index file
        File invertedIndexFile = new File(INDEX_PATH + "inverted.txt");
        try {
            // writeValue can write an object straight into a file
            objectMapper.writeValue(forwardIndexFile, forwardIndex);
            objectMapper.writeValue(invertedIndexFile, invertedIndex);
        } catch (IOException e) {
            e.printStackTrace();
        }
        long end = System.currentTimeMillis();
        System.out.println("保存索结束 !消耗时间" + (end - beg) + "ms");
    }
    // 5. Load the index data from disk into memory
    public void load() {
        long beg = System.currentTimeMillis();
        System.out.println("加载索引开始");
        // 1. Set the paths to load the index from
        // forward index
        File forwardIndexFile = new File(INDEX_PATH + "forward.txt");
        // inverted index
        File invertedIndexFile = new File(INDEX_PATH + "inverted.txt");
        try {
            // readValue() takes two arguments: which file to read, and what type to parse the data as
            forwardIndex = objectMapper.readValue(forwardIndexFile,
                    new TypeReference<ArrayList<DocInfo>>() {});
            invertedIndex = objectMapper.readValue(invertedIndexFile,
                    new TypeReference<HashMap<String, ArrayList<Weight>>>() {});
        } catch (IOException e) {
            e.printStackTrace();
        }
        long end = System.currentTimeMillis();
        System.out.println("加载索引结束 ! 消耗时间" + (end - beg) + "ms");
    }

Timing the expensive operations gives us a much more intuitive comparison.


4.11 Implement the index module - calling Index from Parser

The core code of Index is nearly done; now we need to wire the Index class into the Parser class.

Their relationship: Parser is the entry point of index building, corresponding to an executable program; Index implements the index data structure and exposes APIs for its caller. So Parser calls Index.

In Parser we create an Index instance; in run() we save the finished index to the specified files, and in parseHTML() we add each parsed HTML file to the index.


4.12 Implement the index module - verify index building

Let's check that the code can now produce the index files. Here is the complete code of the two classes so far.

Parser class:

import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;


/**
 * Created by Lin
 * Description:
 * User: Administrator
 * Date: 2022-12-15
 * Time: 19:15
 */
public class Parser {
    // The path the documents are loaded from; it is fixed, so we use a static final class constant
    private static final String INPUT_PATH = "D:\\gitee\\doc_searcher_index\\docs\\api";  // only the files under the api folder

    // Create an Index instance
    private Index index = new Index();

    public void run() {
        // Entry point of the whole Parser class
        // 1. Enumerate all files under the given path (the HTML files of all subdirectories)
        ArrayList<File> fileList = new ArrayList<>();
        enumFile(INPUT_PATH, fileList);
        // 2. Open each listed file, read and parse its content, and build the index
        for (File f : fileList) {
            // parse a single HTML file
            System.out.println("开始解析:" + f.getAbsolutePath());
            parseHTML(f);
        }
        // 3. Save the in-memory index structure into the specified files
        index.save();
    }

    // Parse a single HTML file
    private void parseHTML(File f) {
        // 1. Parse out the HTML title
        String title = parseTitle(f);
        // 2. Parse out the URL corresponding to the HTML
        String url = parseUrl(f);
        // 3. Parse out the body text corresponding to the HTML
        String content = parseContent(f);
        // 4. Add the parsed information to the index
        index.addDoc(title, url, content);
    }

    public String parseContent(File f) {
        // Read one character at a time; '<' and '>' toggle the copy switch
        try (FileReader fileReader = new FileReader(f)) {
            // the copy switch
            boolean isCopy = true;
            // a StringBuilder to hold the result
            StringBuilder content = new StringBuilder();
            while (true) {
                // read() returns an int; -1 means end of file
                int ret = fileReader.read();
                if (ret == -1) {
                    // the file has been read completely
                    break;
                }
                char c = (char) ret;
                if (isCopy) {
                    // switch on: we may copy
                    if (c == '<') {
                        isCopy = false;
                        continue;
                    }
                    // turn line breaks into spaces
                    if (c == '\n' || c == '\r') {
                        c = ' ';
                    }
                    // copy all other characters into the StringBuilder
                    content.append(c);
                } else {
                    // switch off: wait for '>'
                    if (c == '>') {
                        isCopy = true;
                    }
                }
            }
            return content.toString();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return "";
    }

    private String parseUrl(File f) {
        // the fixed prefix
        String path = "https://docs.oracle.com/javase/8/docs/api/";
        // one-argument substring: drop the leading part, keep the rest
        String path2 = f.getAbsolutePath().substring(INPUT_PATH.length());
        return path + path2;
    }

    private String parseTitle(File f) {
        // get the file name
        String name = f.getName();
        return name.substring(0, name.length() - ".html".length());
    }

    // First parameter: the directory to start traversing from; second parameter: collects the results of the recursion
    private void enumFile(String inputPath, ArrayList<File> fileList) {
        // Turn the String path into a File object, which is easier to work with
        File rootPath = new File(inputPath);
        // listFiles() only sees one level, so recursion is needed for subdirectories
        File[] files = rootPath.listFiles();
        for (File file : files) {
            if (file.isDirectory()) {
                // a directory: recurse with the new root path
                enumFile(file.getAbsolutePath(), fileList);
            } else {
                // only keep HTML files
                if (file.getAbsolutePath().endsWith(".html")) {
    
    
                    //普通HTML文件
                    fileList.add(file);
                }

            }
        }
    }

    public static void main(String[] args) {
    
    
        //通过main方法来实现整个制作索引的过程
        Parser parser = new Parser();
        parser.run();
    }

}

Index class:

import com.fasterxml.jackson.core.type.TypeReference;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.ansj.domain.Term;
import org.ansj.splitWord.analysis.ToAnalysis;

import java.io.File;
import java.io.IOException;
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Created by Lin
 * Description:
 * User: Administrator
 * Date: 2022-12-17
 * Time: 13:01
 */

//通过这个类在内存中来构造出索引结构
public class Index {
    
    
    //保存索引文件的路径
    private static final String INDEX_PATH ="D:\\gitee\\doc_searcher_index\\";

    private ObjectMapper objectMapper = new ObjectMapper();

    //使用数组下标表示 DocId
    private ArrayList<DocInfo> forwardIndex = new ArrayList<>();


    //使用哈希表 来表示倒排索引 key就是词 value就是一组和词关联的文章
    private HashMap<String,ArrayList<Weight>> invertedIndex = new HashMap<>();



    //这个类需要提供的方法
    //1.给定一个docId ,在正排索引中,查询文档的详细信息
    public DocInfo getDocInfo(int docId){
    
    
        return forwardIndex.get(docId);
    }
    //2.给定一词,在倒排索引中,查哪些文档和这个文档词关联
    public List<Weight> getInverted(String term){
    
    
        return invertedIndex.get(term);
    }
    //3.往索引中新增一个文档
    public void addDoc(String title,String url,String content){
    
    
        //新增文档操作,需要同时给正排索引和倒排索引新增信息
        //构建正排索引
        DocInfo docInfo =  buildForward(title,url,content);
        //构建倒排索引
        buildInverted(docInfo);

    }


    private void buildInverted(DocInfo docInfo) {
    
    
        //搞一个内部类避免出现2个哈希表
        class WordCnt{
    
    
            //表示这个词在标题中 出现的次数
            public int titleCount ;
            // 表示这个词在正文出现的次数
            public int contentCount;

        }
        //统计词频的数据结构
        HashMap<String,WordCnt> wordCntHashMap =new HashMap<>();

        //1,针对文档标题进行分词
        List<Term> terms =  ToAnalysis.parse( docInfo.getTitle()).getTerms();
        //2. 遍历分词结果,统计每个词出现的比例
        for (Term term : terms){
    
    
            //先判定一个term这个词是否存在,如果不存在,就创建一个新的键值对,插入进去,titleCount 设为1
            //gameName()的分词的具体的词
            String word = term.getName();
            //哈希表的get如果不存在默认返回的是null
            WordCnt wordCnt =  wordCntHashMap.get(word);
            if (wordCnt == null){
    
    
                //词不存在
                WordCnt newWordCnt = new WordCnt();
                newWordCnt.titleCount =1;
                newWordCnt.contentCount = 0;
                wordCntHashMap.put(word,newWordCnt);
            }else{
    
    
                //存在就找到之前的值,然后加1
                wordCnt.titleCount +=1;
            }
            //如果存在,就找到之前的值,然后把对应的titleCount +1

        }
        //3. 针对正文进行分词
        terms = ToAnalysis.parse(docInfo.getContent()).getTerms();
        //4. 遍历分词结果,统计每个词出现的次数

        for (Term term : terms) {
    
    
            String word = term.getName();
            WordCnt wordCnt = wordCntHashMap.get(word);
            if (wordCnt == null){
    
    
                WordCnt newWordCnt = new WordCnt();
                newWordCnt.titleCount = 0;
                newWordCnt.contentCount = 1;
                wordCntHashMap.put(word,newWordCnt);
            }else {
    
    
                wordCnt.contentCount +=1;

            }
        }
        //5. 把上面的结果汇总到一个HasMap里面
        //  最终的文档的权重,设置为标题的出现次数 * 10 + 正文中出现的次数
        //6.遍历当前的HashMap,依次来更新倒排索引中的结构。
        //并不是全部代码都是可以for循环的,只有这个对象是”可迭代的“,实现Iterable 接口才可以
        // 但是Map并没有实现,Map存在意义,是根据key查找value,但是好在Set实现了实现Iterable,就可以把Map转换为Set
        //本来Map存在的是戒键值对,可以根据key快速找到value,
        //Set这里存的是一个把 键值对 打包在一起的类 称为Entry(条目)
        //转成Set之后,失去了快速根据key快速查找value的只这样的能力,但是换来了可以遍历
        for(Map.Entry<String,WordCnt> entry:wordCntHashMap.entrySet()){
    
    
            //先根据这里的词去倒排索引中查一查词
            //倒排拉链
            List<Weight> invertedList  =  invertedIndex.get(entry.getKey());
            if (invertedList == null){
    
    
                //如果为空,插入一个新的键值对
                ArrayList<Weight> newInvertedList =new ArrayList<>();
                Weight weight = new Weight();
                weight.setDocId(docInfo.getDocId());
                //权重计算公式:标题中出现的次数* 10 +正文出现的次数
                weight.setWeight(entry.getValue().titleCount * 10 + entry.getValue().contentCount);
                newInvertedList.add(weight);
                invertedIndex.put(entry.getKey(),newInvertedList);
            }else{
    
    
                //如果非空 ,就把当前的文档,构造出一个Weight 对象,插入到倒排拉链的后面
                Weight weight = new Weight();
                weight.setDocId(docInfo.getDocId());
                //权重计算公式:标题中出现的次数* 10 +正文出现的次数
                weight.setWeight(entry.getValue().titleCount * 10 + entry.getValue().contentCount);
                invertedList.add(weight);
            }
        }

    }
    private DocInfo buildForward(String title, String url, String content) {
    
    
        DocInfo docInfo =new DocInfo();
        docInfo.setDocId(forwardIndex.size());
        docInfo.setTitle(title);
        docInfo.setUrl(url);
        docInfo.setContent(content);
        forwardIndex.add(docInfo);
        return docInfo;
    }

    //4.把内存中的索引结构保存到磁盘中
    public void save(){
    
    
        long beg = System.currentTimeMillis();
        //使用2个文件。分别保存正排和倒排
        System.out.println("保存索引开始");
        //1.先判断一下索引对应的目录是否存在
        File indexPathFile =new File(INDEX_PATH);
        if (!indexPathFile.exists()){
    
    
            //如果路径不存在
            //mkdirs()可以创建多级目录
            indexPathFile.mkdirs();
        }
        //正排索引文件
        File forwardIndexFile = new File(INDEX_PATH+"forward.txt");
        //倒排索引文件
        File invertedIndexFile = new File(INDEX_PATH+"inverted.txt");
        try {
    
    
            //writeValue的有个参数可以把对象写到文件里
            objectMapper.writeValue(forwardIndexFile,forwardIndex);
            objectMapper.writeValue(invertedIndexFile,invertedIndex);
        }catch (IOException e){
    
    
            e.printStackTrace();
        }
        long end = System.currentTimeMillis();
        System.out.println("保存索结束 !消耗时间"+(end - beg)+"ms");
    }
    //5.把磁盘中的索引数据加载到内存中
    public void load(){
    
    
        long beg = System.currentTimeMillis();
        System.out.println("加载索引开始");
        //1.设置加载索引路径
        //正排索引
        File forwardIndexFile = new File(INDEX_PATH+"forward.txt");
        //倒排索引
        File invertedIndexFile = new File(INDEX_PATH+"inverted.txt");
        try{
    
    
            //readValue()2个参数,从那个文件读,解析是什么数据
            forwardIndex = objectMapper.readValue(forwardIndexFile, new TypeReference<ArrayList<DocInfo>>() {
    
    });
            invertedIndex = objectMapper.readValue(invertedIndexFile, new TypeReference<HashMap<String, ArrayList<Weight>>>() {
    
    });
        }catch (IOException e){
    
    
            e.printStackTrace();
        }
        long end = System.currentTimeMillis();
        System.out.println("加载索引结束 ! 消耗时间"+(end -beg)+"ms");
    }

    public static void main(String[] args) {
    
    
        Index index = new Index();
        index.load();
        System.out.println("索引加载完成");
    }

}

Run the main method of Parser:
insert image description here

Result:
insert image description here

Then look at the index files that were produced: the files were generated successfully.
insert image description here
Let's open the file with VS Code to see what's inside (Notepad opens it slowly):

insert image description here

Looking at the contents, everything we want is there; the inverted file also has the docIds and weights,
insert image description here
so we can consider our index files complete.


5. Implement the index module - optimize the index module

5.1 Realize the index module - about the index making speed

Creating the index just now still took a fair amount of time. We can compute a time difference in the run method of the Parser class so we can see clearly how long it takes, and later measure how much the speed improves after optimization.

It turns out the index took almost 30 seconds to build:
insert image description here

So where is the time going?

Is it the path enumeration? Actually this part is very fast:

insert image description here
Saving the index? That only takes about 0.8 seconds.
insert image description here
The real cost is in the parsing loop:
insert image description here
To improve performance we must first find the cause: optimizing a program means locating its "performance bottleneck" through measurement, just like going to the hospital, where you first take an X-ray to locate the problem and then treat it. How do we measure? The simplest way is to add timing around each phase and see which one dominates:

insert image description here
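The screenshot hides the code; as the full Parser listing later shows, the pattern is simply bracketing each phase with System.currentTimeMillis():

    long beg = System.currentTimeMillis();
    ArrayList<File> fileList = new ArrayList<>();
    enumFile(INPUT_PATH, fileList);                //phase 1: enumerate files
    long endEnumFile = System.currentTimeMillis();
    System.out.println("枚举文件完毕 时间" + (endEnumFile - beg) + "ms");
    //...the same pattern brackets the parse loop and index.save()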
Result: the loop takes by far the most time. insert image description here
The optimization idea for this loop is simple. The measurement just now showed that the main bottleneck is the loop over files: each iteration has to process one file, i.e. read the file + segment words + parse the content (mostly CPU-bound work). In a single thread these tasks run serially (the first file must be fully parsed before the second one starts), so we use multiple threads to build the index and thereby speed it up.


5.2 Implement the index module - implement multi-threaded indexing

We implement multi-threaded indexing through a new method:

 //Build the index with multiple threads
    public void runByThread() throws InterruptedException {
        long beg = System.currentTimeMillis();
        System.out.println("索引制作开始!");

        //1. Enumerate all files
        ArrayList<File> files = new ArrayList<>();
        enumFile(INPUT_PATH, files);
        //2. Loop over the files; to build the index with multiple threads, bring in a thread pool
        CountDownLatch latch = new CountDownLatch(files.size());
        ExecutorService executorService = Executors.newFixedThreadPool(10);
        for (File f : files){
            //submit the task to the thread pool
            executorService.submit(new Runnable() {
                @Override
                public void run() {
                    System.out.println("解析" + f.getAbsolutePath());
                    parseHTML(f);
                    //make sure all indexing work finishes before the index is saved
                    latch.countDown();
                }
            });
        }
        //latch.await() blocks until every task has called countDown()
        latch.await();
        //3. Save the index; without the latch, some tasks might not have finished yet
        index.save();
        long end = System.currentTimeMillis();
        System.out.println("索引制作结束!时间" + (end - beg) + "ms");
    }

The basic idea here is the same as before, but a few changes deserve attention:

  1. We added a thread pool: Executors.newFixedThreadPool(10) creates a pool with 10 threads. Then comes the key point.

  2. Once all the indexing is done we should call index.save(). But we are now multi-threaded, and some documents may still be in the middle of being indexed; under concurrent execution index.save() could run too early, leaving us with an incomplete index file.

  3. Our solution is the CountDownLatch class: countDown() records that one more task has finished, and await() blocks until every task is done, after which we move on to the next step (see the sketch below).
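A minimal, self-contained sketch of the CountDownLatch mechanism (a toy task count, independent of this project):

    import java.util.concurrent.CountDownLatch;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class LatchDemo {
        public static void main(String[] args) throws InterruptedException {
            CountDownLatch latch = new CountDownLatch(3);   //3 outstanding tasks
            ExecutorService pool = Executors.newFixedThreadPool(3);
            for (int i = 0; i < 3; i++) {
                pool.submit(() -> {
                    //...do some work...
                    latch.countDown();                      //one task finished
                });
            }
            latch.await();   //blocks until the count reaches 0
            System.out.println("all tasks done, safe to save");
            pool.shutdown();
        }
    }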


5.3 Realize the index module - lock the index-building code

Can we just run it after adding multithreading? No. Thread safety is now involved: whenever multiple threads modify the same shared object, problems can occur, so we need to be clear about which operations touch shared state.


We call the parseHTML method here.
insert image description here
The parseHTML method does the parsing work and then the addDoc operation.
insert image description here
We found that parsing the title, url, and body does not involve shared objects, but the addDoc method has a problem:

Inside addDoc, both indexes are built: forward and inverted.

In building the forward index, there are 2 places that operate on the shared object:
insert image description here

Building the inverted index also manipulates a shared object:
insert image description here
Let's draw a picture of the rough execution flow: several threads modify forwardIndex and invertedIndex at the same time, so there is a thread safety problem, and we fix it with locking.
insert image description here
How should the locks be added? We want everything that can run concurrently to stay concurrent; we must not serialize more than necessary, so the lock granularity should not be too coarse.

  1. Lock addDoc directly? Obviously not. If we lock here, the index can only be built serially and the multithreading we just added becomes meaningless.
    insert image description here
  2. Reorder the code and lock just the 2 lines that touch the forward index
    insert image description here
  3. Reorder the inverted-index loop and lock it

insert image description here
There is another question: which object should we lock, i.e. what goes in the parentheses of synchronized?

If we pass this, the lock object is the current Index instance, but then the forward-index code and the inverted-index code would have to execute one after the other.

In fact we are operating on two different objects (the forward index and the inverted index), so they should not compete for the same lock. It is like girl A and girl B, who both have suitors: if thread-man A takes girl A out, the other thread-men do not have to wait for girl A; they can go to girl B. A and B are 2 different objects.

So we can lock the index objects themselves,

insert image description here
You can also create two new lock objects
insert image description here
and change the code:
insert image description here

Modified code:

import com.fasterxml.jackson.core.type.TypeReference;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.ansj.domain.Term;
import org.ansj.splitWord.analysis.ToAnalysis;

import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Created by Lin
 * Description:
 * User: Administrator
 * Date: 2022-12-17
 * Time: 13:01
 */

//通过这个类在内存中来构造出索引结构
public class Index {
    
    
    //保存索引文件的路径
    private static final String INDEX_PATH ="D:\\gitee\\doc_searcher_index\\";

    private ObjectMapper objectMapper = new ObjectMapper();

    //使用数组下标表示 DocId
    private ArrayList<DocInfo> forwardIndex = new ArrayList<>();


    //使用哈希表 来表示倒排索引 key就是词 value就是一组和词关联的文章
    private HashMap<String,ArrayList<Weight>> invertedIndex = new HashMap<>();

    //新创建2个锁对象
    private Object locker1 = new Object();
    private Object locker2 = new Object();

    //这个类需要提供的方法
    //1.给定一个docId ,在正排索引中,查询文档的详细信息
    public DocInfo getDocInfo(int docId){
    
    
        return forwardIndex.get(docId);
    }
    //2.给定一词,在倒排索引中,查哪些文档和这个文档词关联
    public List<Weight> getInverted(String term){
    
    
        return invertedIndex.get(term);
    }
    //3.往索引中新增一个文档
    public  void addDoc(String title,String url,String content){
    
    
        //新增文档操作,需要同时给正排索引和倒排索引新增信息
        //构建正排索引
        DocInfo docInfo =  buildForward(title,url,content);
        //构建倒排索引
        buildInverted(docInfo);

    }


    private void buildInverted(DocInfo docInfo) {
    
    
        //搞一个内部类避免出现2个哈希表
        class WordCnt{
    
    
            //表示这个词在标题中 出现的次数
            public int titleCount ;
            // 表示这个词在正文出现的次数
            public int contentCount;

        }
        //统计词频的数据结构
        HashMap<String,WordCnt> wordCntHashMap =new HashMap<>();

        //1,针对文档标题进行分词
        List<Term> terms =  ToAnalysis.parse( docInfo.getTitle()).getTerms();
        //2. 遍历分词结果,统计每个词出现的比例
        for (Term term : terms){
    
    
            //先判定一个term这个词是否存在,如果不存在,就创建一个新的键值对,插入进去,titleCount 设为1
            //gameName()的分词的具体的词
            String word = term.getName();
            //哈希表的get如果不存在默认返回的是null
            WordCnt wordCnt =  wordCntHashMap.get(word);
            if (wordCnt == null){
    
    
                //词不存在
                WordCnt newWordCnt = new WordCnt();
                newWordCnt.titleCount =1;
                newWordCnt.contentCount = 0;
                wordCntHashMap.put(word,newWordCnt);
            }else{
    
    
                //存在就找到之前的值,然后加1
                wordCnt.titleCount +=1;
            }
            //如果存在,就找到之前的值,然后把对应的titleCount +1

        }
        //3. 针对正文进行分词
        terms = ToAnalysis.parse(docInfo.getContent()).getTerms();
        //4. 遍历分词结果,统计每个词出现的次数

        for (Term term : terms) {
    
    
            String word = term.getName();
            WordCnt wordCnt = wordCntHashMap.get(word);
            if (wordCnt == null){
    
    
                WordCnt newWordCnt = new WordCnt();
                newWordCnt.titleCount = 0;
                newWordCnt.contentCount = 1;
                wordCntHashMap.put(word,newWordCnt);
            }else {
    
    
                wordCnt.contentCount +=1;

            }
        }
        //5. 把上面的结果汇总到一个HasMap里面
        //  最终的文档的权重,设置为标题的出现次数 * 10 + 正文中出现的次数
        //6.遍历当前的HashMap,依次来更新倒排索引中的结构。
        //并不是全部代码都是可以for循环的,只有这个对象是”可迭代的“,实现Iterable 接口才可以
        // 但是Map并没有实现,Map存在意义,是根据key查找value,但是好在Set实现了实现Iterable,就可以把Map转换为Set
        //本来Map存在的是戒键值对,可以根据key快速找到value,
        //Set这里存的是一个把 键值对 打包在一起的类 称为Entry(条目)
        //转成Set之后,失去了快速根据key快速查找value的只这样的能力,但是换来了可以遍历
       synchronized (locker2){
    
    
           for(Map.Entry<String,WordCnt> entry:wordCntHashMap.entrySet()){
    
    
               //先根据这里的词去倒排索引中查一查词
               //倒排拉链
               List<Weight> invertedList  =  invertedIndex.get(entry.getKey());
               if (invertedList == null){
    
    
                   //如果为空,插入一个新的键值对
                   ArrayList<Weight> newInvertedList =new ArrayList<>();
                   Weight weight = new Weight();
                   weight.setDocId(docInfo.getDocId());
                   //权重计算公式:标题中出现的次数* 10 +正文出现的次数
                   weight.setWeight(entry.getValue().titleCount * 10 + entry.getValue().contentCount);
                   newInvertedList.add(weight);
                   invertedIndex.put(entry.getKey(),newInvertedList);
               }else{
    
    
                   //如果非空 ,就把当前的文档,构造出一个Weight 对象,插入到倒排拉链的后面
                   Weight weight = new Weight();
                   weight.setDocId(docInfo.getDocId());
                   //权重计算公式:标题中出现的次数* 10 +正文出现的次数
                   weight.setWeight(entry.getValue().titleCount * 10 + entry.getValue().contentCount);
                   invertedList.add(weight);
               }
           }
       }

    }
    private DocInfo buildForward(String title, String url, String content) {
    
    
        DocInfo docInfo =new DocInfo();
        docInfo.setTitle(title);
        docInfo.setUrl(url);
        docInfo.setContent(content);
        synchronized (locker1){
    
    
            docInfo.setDocId(forwardIndex.size());
            forwardIndex.add(docInfo);
        }

        return docInfo;
    }

    //4.把内存中的索引结构保存到磁盘中
    public void save(){
    
    
        long beg = System.currentTimeMillis();
        //使用2个文件。分别保存正排和倒排
        System.out.println("保存索引开始");
        //1.先判断一下索引对应的目录是否存在
        File indexPathFile =new File(INDEX_PATH);
        if (!indexPathFile.exists()){
    
    
            //如果路径不存在
            //mkdirs()可以创建多级目录
            indexPathFile.mkdirs();
        }
        //正排索引文件
        File forwardIndexFile = new File(INDEX_PATH+"forward.txt");
        //倒排索引文件
        File invertedIndexFile = new File(INDEX_PATH+"inverted.txt");
        try {
    
    
            //writeValue的有个参数可以把对象写到文件里
            objectMapper.writeValue(forwardIndexFile,forwardIndex);
            objectMapper.writeValue(invertedIndexFile,invertedIndex);
        }catch (IOException e){
    
    
            e.printStackTrace();
        }
        long end = System.currentTimeMillis();
        System.out.println("保存索结束 !消耗时间"+(end - beg)+"ms");
    }
    //5.把磁盘中的索引数据加载到内存中
    public void load(){
    
    
        long beg = System.currentTimeMillis();
        System.out.println("加载索引开始");
        //1.设置加载索引路径
        //正排索引
        File forwardIndexFile = new File(INDEX_PATH+"forward.txt");
        //倒排索引
        File invertedIndexFile = new File(INDEX_PATH+"inverted.txt");
        try{
    
    
            //readValue()2个参数,从那个文件读,解析是什么数据
            forwardIndex = objectMapper.readValue(forwardIndexFile, new TypeReference<ArrayList<DocInfo>>() {
    
    });
            invertedIndex = objectMapper.readValue(invertedIndexFile, new TypeReference<HashMap<String, ArrayList<Weight>>>() {
    
    });
        }catch (IOException e){
    
    
            e.printStackTrace();
        }
        long end = System.currentTimeMillis();
        System.out.println("加载索引结束 ! 消耗时间"+(end -beg)+"ms");
    }

    public static void main(String[] args) {
    
    
        Index index = new Index();
        index.load();
        System.out.println("索引加载完成");
    }

}


5.4 Realize the index module - verify the effect of multi-threading

Let's see how much faster the multithreaded code is:

Code speed before multithreading:
insert image description here

Code speed after multi-threading:
insert image description here
We can see a clear improvement. Of course, using N threads does not mean an N-fold speedup; how many threads are appropriate has to be determined by experiment. The multi-threaded code is not fully concurrent: there is lock contention in the middle, plus file-reading IO, so concurrency does not scale indefinitely and the program ends up stuck on the IO bottleneck. In short, more threads is not always better.
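One common starting point for such experiments (general practice, not something this project prescribes) is to derive the pool size from the number of CPU cores and then tune it by measuring:

    int cpu = Runtime.getRuntime().availableProcessors();
    //IO-heavy workloads often tolerate more threads than cores; confirm by measuring
    ExecutorService pool = Executors.newFixedThreadPool(cpu * 2);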

5.5 Realize the index module - solve the problem that the process does not exit

We found that after the experiment the multi-threaded process does not exit:
insert image description here
Here we need to bring up daemon threads:

If a thread is a daemon thread (a background thread), its running state does not prevent the process from ending.

If a thread is not a daemon thread, its running state does affect whether the process can end.

The threads we created above are not daemon threads, so when the main method finishes, these threads are still alive (waiting for new tasks to arrive).

insert image description here

For now we can stop the process manually; programmatic fixes are sketched after the screenshot.
insert image description here
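Programmatically there are two common fixes. The full Parser listing later calls executorService.shutdown() after latch.await(); an alternative, sketched below as an assumption rather than what the project does, is to hand the pool a ThreadFactory that marks its threads as daemons so they cannot keep the JVM alive:

    //a pool whose worker threads are daemon threads
    ExecutorService executorService = Executors.newFixedThreadPool(10, runnable -> {
        Thread t = new Thread(runnable);
        t.setDaemon(true);   //daemon threads do not block JVM exit
        return t;
    });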


5.6 Realize the index module - the problem of slow index making for the first time

We noticed that after rebooting the machine, the first index build is slower, and subsequent builds get faster. Why is this?

The content parsing around addDoc involves reading files, and reading a file is a relatively expensive operation for the computer:
insert image description here
So we suspect the slowdown is here. Could it be that reading files is especially slow on the first run?

We time the content-parsing step and the add-to-index step separately. Because parsing runs on multiple threads, we accumulate the times with AtomicLong to stay thread safe.
insert image description here
Calculate the time separately:
insert image description here
then print the totals at the end:
insert image description here
t1 is the total time of parseContent;
t2 is the total time of addDoc.

First run after a reboot: t1 is 47 seconds and t2 is 87 seconds.
insert image description here

The second run:
insert image description here
t1 drops to 30 s and t2 to 58 s. The core work of content parsing is reading files from disk, and the operating system caches "frequently read files".

On the first run, the Java documents are not yet cached in memory, so every read has to go straight to the hard disk (which is slower).

On later runs, because the documents were already read, the operating system has them cached in memory, so the second read hits the cache instead of the disk. Can we use code to take advantage of caching ourselves? The answer is yes.


5.7 Realize the index module - speed up file reading with BufferedReader

Here we can see that reading character by character goes to the disk for each read:
insert image description here


We can wrap FileReader in a BufferedReader:

BufferedReader has a built-in buffer that automatically pre-reads some of the FileReader's content into memory, reducing the number of direct disk accesses.
insert image description here
We can see that the default buffer is only 8 KB, while each of our HTML files is more than 10 KB, so we can enlarge it through the constructor parameter.
insert image description here
Since the operating system already caches files it has read, manually enlarging our own buffer may not add much on top of the OS cache, but a little help is better than nothing.
insert image description here
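The screenshots hide the code; the full Parser listing later uses a 1 MB buffer, roughly:

    //1 MB buffer instead of BufferedReader's default 8 KB
    try (BufferedReader bufferedReader =
                 new BufferedReader(new FileReader(f), 1024 * 1024)) {
        //...read character by character as before
    }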


5.8 Implement the index module - verify the index loading logic

Let's check the load logic and see whether it works.
insert image description here
We create a main method in the Index class and call it:

insert image description here
Result:
insert image description here
Let's debug to see what the index we loaded looks like.

This is the forward index:
insert image description here
this is the inverted index:
insert image description here


5.9 Implement the index module - index module summary

What have we done so far? Let's summarize.

  1. Implemented the Parser class
    • Enumerate all HTML files recursively
    • Perform parsing operations for each HTML here
      • Title: directly use the HTML file name
      • URL: simple splicing based on the path of the file (the relationship between the offline document and the online document path)
      • Text: to remove the tags, the quick-and-dirty approach uses < and > to toggle whether characters are copied
    • Put the parsed result into the index class (addDoc)

The Parser class's most important job is to drive the Index class to complete index construction.

We started with single-threaded construction and then converted to multi-threaded; in the multi-threaded version we must make sure all documents are processed before saving the index.

The other point is file-reading speed: the experiment showed the first build was slow because reads went straight to disk without hitting any cache, so we switched to a BufferedReader.



  2. Implemented the Index class, whose core properties are:
    1. Forward index: ArrayList<DocInfo>, where each DocInfo represents one document and contains id, title, url, content
    2. Inverted index: HashMap<String, ArrayList<Weight>>, where each key-value pair records which documents a word appeared in
      • Weight contains not only the document id but also the weight
      • The weight is currently computed from the word's frequency in the document (occurrences in the title * 10 + occurrences in the body)

Core method:

  1. Query the forward index: just take the element from the ArrayList by subscript
  2. Query the inverted index: just get the value from HashMap<String, ArrayList<Weight>> by key
  3. Add a document; the Parser class calls this method while building the index
    • Build the forward part: construct a DocInfo object and append it to the end of the forward index
    • Build the inverted part: segment words first, count word frequencies, then traverse the segmentation results and update the corresponding posting list (inverted zipper) of the inverted index

Shared objects must take thread safety into consideration.

  4. Save the index: write the index data to the specified files in JSON format
  5. Load the index: read the files back, parse the JSON, and restore the data into memory

6. Implement the search module

Call the index module to complete the core process of the search

  1. Word segmentation: segment the query the user typed in (it may not be a single word, but a whole sentence)
  2. Trigger: take each segmented word, search it in the inverted index, and find the related documents (call the Index class's inverted-index lookup)
  3. Sorting: sort the triggered results (by relevance, in descending order)
  4. Packaging results: for the sorted results, query the forward index one by one, obtain each document's details, and package the data into a fixed structure to return

Of course, ours is just a simple search engine, far from comparable to a real one.


6.1 Implementing the search module - creating the DocSearcher class

Complete the entire search process through this class:

import java.util.List;

/**
 * Created by Lin
 * Description:
 * User: Administrator
 * Date: 2022-12-21
 * Time: 13:22
 */
public class DocSearcher {

    private Index index = new Index();

    public DocSearcher(){
        //the index has to be loaded up front
        index.load();
    }

    //The method that completes the whole search process
    //The parameter (input) is the query the user typed
    //The return value (output) is the collection of search results
    public List<Result> search(String query){
        //1.[Segment] segment the query into words
        //2.[Trigger] look each word up in the inverted index
        //3.[Sort] sort the triggered results by weight, descending
        //4.[Package] for the sorted results, query the forward index and build the data to return
        return null;
    }
}

The return value is a collection of search results:
insert image description here
We also need to load the index; we do it directly in the constructor.


Next we create a class to represent a search result:

/**
 * Created by Lin
 * Description:
 * User: Administrator
 * Date: 2022-12-21
 * Time: 13:24
 */

//Represents one search result
public class Result {
    private String title;
    private String url;
    //the description: a summary of the body text
    private String desc;

    public String getTitle() {
        return title;
    }

    public void setTitle(String title) {
        this.title = title;
    }

    public String getUrl() {
        return url;
    }

    public void setUrl(String url) {
        this.url = url;
    }

    public String getDesc() {
        return desc;
    }

    public void setDesc(String desc) {
        this.desc = desc;
    }
}

Note that the Result class holds a title, a url, and a description (desc), since we only display an excerpt of the body.


6.2 Implement the search module - implement the search method (1)

What we want to implement in the first part are the following operations:

  1. Word segmentation: segment the query word query

  2. Trigger: look up the segmentation results in the inverted index

  3. Sorting: sort the triggered results in descending order of weight

  4. Packaging results: for the sorted results, query the forward index and construct the data to be returned

 public List<Result> search(String query) {
        //1.[Segment] segment the query into words
        List<Term> terms = ToAnalysis.parse(query).getTerms();

        //2.[Trigger] look each segmented word up in the inverted index
        List<Weight> allTermResult = new ArrayList<>();
        for (Term term : terms) {
            String word = term.getName();
            List<Weight> invertedList = index.getInverted(word);
            if (invertedList == null) {
                //the word does not exist in any document
                continue;
            }
            allTermResult.addAll(invertedList);
        }
        //3.[Sort] sort the triggered results by weight, descending
        allTermResult.sort(new Comparator<Weight>() {
            @Override
            public int compare(Weight o1, Weight o2) {
                //descending: o2 - o1; for ascending, reverse the operands
                return o2.getWeight() - o1.getWeight();
            }
        });
        //4.[Package] for the sorted results, query the forward index and build the returned data
        List<Result> results = new ArrayList<>();
        for (Weight weight : allTermResult) {
            DocInfo docInfo = index.getDocInfo(weight.getDocId());
            Result result = new Result();
            result.setTitle(docInfo.getTitle());
            result.setUrl(docInfo.getUrl());
            result.setDesc(GenDesc(docInfo.getContent(), terms));
            results.add(result);
        }
        return results;
    }

Pay attention to the fourth step: the inverted index only gives us weights and document IDs, not the detailed content; the details still have to come from the forward index.

Then there is the summary. What we fetched is the full body text, but the final result needs a short description taken from the body that contains (part of) the query words.

insert image description here

So next we're going to generate a description:


6.2 Implement the search module - implement the search method (2)

Following the idea above, here is the plan for generating the description:

  1. Get all the segmentation results and traverse them
  2. See which word appears in the body; the current document may not contain every segmented word
  3. For a word that does appear, find its position in the body
  4. Take the 60 characters before that position as the start of the description, then intercept 160 characters from that start as the whole description

insert image description here

private String GenDesc(String content, List<Term> terms) {
        //traverse the segmentation results and find the first one that exists in content
        int firstPos = -1;
        for (Term term : terms) {
            String word = term.getName();
            //the segmenter lower-cases the text, so lower-case the body before matching
            //surround with spaces so the query word matches as a standalone word
            firstPos = content.toLowerCase().indexOf(" " + word + " ");
            if (firstPos >= 0) {
                break;
            }
        }
        if (firstPos == -1) {
            //extreme case: none of the segmented words appears in the body
            if (content.length() > 160) {
                return content.substring(0, 160) + "...";
            }
            return content;
        }
        //take firstPos as the anchor and go back 60 characters for the start of the description
        String desc = "";
        //if there are fewer than 60 characters before the match, start at position 0
        int descBeg = firstPos < 60 ? 0 : firstPos - 60;
        if (descBeg + 160 > content.length()) {
            //would run past the end of the body: take everything from descBeg
            desc = content.substring(descBeg);
        } else {
            desc = content.substring(descBeg, descBeg + 160) + "...";
        }
        return desc;
    }

The core idea of this description-generating code: find the first position in the body where a query word occurs, then take the 60 characters before it as the start and intercept 160 characters from there as the document description.


6.3 Implementing the Search Module - Simple Validation

Let's first add a toString to the Result class so we can print it:
insert image description here
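The screenshot hides the actual override; a minimal version (assuming the field names of the Result class above) could look like:

    @Override
    public String toString() {
        return "Result{title='" + title + "', url='" + url + "', desc='" + desc + "'}";
    }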

The complete code of the search class so far:

import org.ansj.domain.Term;
import org.ansj.splitWord.analysis.ToAnalysis;

import java.util.*;

/**
 * Created by Lin
 * Description:
 * User: Administrator
 * Date: 2022-12-21
 * Time: 13:22
 */
public class DocSearcher {
    
    

    private Index index = new Index();

    public DocSearcher() {
    
    
        //一开始要加载
        index.load();
    }

    //完成整个搜索过程的方法
    //参数(输入部分)就是用户给出的查询词
    //返回值(输出部分)就是搜索结果的集合
    public List<Result> search(String query) {
    
    
        //1.[分词]针对query这个查询词进行分词
        List<Term> terms = ToAnalysis.parse(query).getTerms();

        //2.[触发]针对分词结果来查倒排
        List<Weight> allTermResult = new ArrayList<>();
        for (Term term : terms) {
    
    
            String word = term.getName();
            List<Weight> invertedList = index.getInverted(word);
            if (invertedList == null) {
    
    
                //说明词不存在
                continue;
            }
            allTermResult.addAll(invertedList);
        }
        // 3.[排序]针对触发的结果按照权重降序排序
        allTermResult.sort(new Comparator<Weight>() {
    
    
            @Override
            public int compare(Weight o1, Weight o2) {
    
    
                //降序排序 return o2.getWeight-01.getWeight  升序反之
                return o2.getWeight() - o1.getWeight();
            }
        });
        //4.[包装结果]针对排序的结果,去查正排,构造出要返回的数据
        List<Result> results = new ArrayList<>();
        for (Weight weight : allTermResult) {
    
    
            DocInfo docInfo = index.getDocInfo(weight.getDocId());
            Result result = new Result();
            result.setTitle(docInfo.getTitle());
            result.setUrl(docInfo.getUrl());
            result.setDesc(GenDesc(docInfo.getContent(),terms));
            results.add(result);
        }
        return results;
    }

    private String GenDesc(String content, List<Term> terms) {
        //先遍历分词结果,看看哪个结果是在content中存在
        int firstPos = -1;
        for (Term term : terms) {
            String word = term.getName();
            //因为分词结果是会把正文转成小写,所以我们要把查询词也转成小写
            //为了搜索结果独立成词 所以加" "
            firstPos = content.toLowerCase().indexOf(" " + word + " ");
            if (firstPos >= 0){
                break;
            }
        }
        if (firstPos == -1){
            //所有的分词结果都不在正文中存在 极端情况
            if (content.length() > 160){
                return content.substring(0, 160) + "...";
            }
            return content;
        }
        //从firstPos 作为基准,往前找60个字符,作为描述的起始位置
        String desc = "";
        //如果当前位置少于60个字符开始位置就是第一个 否则开始位置 在查询词前60个
        int descBeg = firstPos < 60 ? 0 : firstPos - 60;
        if (descBeg + 160 > content.length()){
            //判断是否超过正文长度 从开始位置到最后
            desc = content.substring(descBeg);
        } else {
            desc = content.substring(descBeg, descBeg + 160) + "...";
        }
        return desc;
    }

    public static void main(String[] args) {
    
    
        DocSearcher docSearcher = new DocSearcher();
        Scanner scanner = new Scanner(System.in);
        while (true) {
    
    
            System.out.print("->");
            String query = scanner.next();
            List<Result> results = docSearcher.search(query);
            for (Result result : results) {
    
    
                System.out.println("======================================");
                System.out.println(result);
            }
        }
    }
}

Result: we searched for ArrayList and found that the sorting and all the fields we need are there, but something is highlighted in the yellow box; on closer inspection it is js code.

insert image description here

Why does it appear? Because we only strip the tags from the HTML page, and some pages contain <script> tags, so the js code survives even after the tags are removed. We will solve this problem next.


6.4 Implementing the search module - using regular expressions

For the script tags above we have to remove both the tag and its content; here we use regular expressions.

Several of Java's String methods accept regular expressions, such as matches, replaceAll, replaceFirst, and split...

Let's briefly go over the regex symbols we need:

. matches a single character that is not a newline (not \n or \r)

* means the preceding character may appear any number of times

.* matches any run of non-newline characters

? after * makes the match non-greedy, matching the shortest string that satisfies the pattern (greedy matching grabs as much as possible, the longest satisfying match)

Suppose the pattern is <.*> and the document is <div>aaa</div> <div>bbb</div>; a greedy match covers all the blue area in the figure.

insert image description here

We can also try it out on a regex testing site:

insert image description here

If you add non-greedy matching <.*?>:

the effect of replacing matches with spaces: insert image description here
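We can reproduce this comparison directly in Java with a tiny sketch:

    String html = "<div>aaa</div> <div>bbb</div>";
    //greedy: <.*> runs from the first '<' to the last '>', i.e. the whole string
    System.out.println(html.replaceAll("<.*>", " "));
    //non-greedy: <.*?> stops at the first '>', so only the tags themselves match
    System.out.println(html.replaceAll("<.*?>", " "));   //keeps "aaa" and "bbb"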


If you want to remove script tags together with their content, the regex can be written as:

<script.*?>(.*?)</script> — the attributes and the inner content are treated as ordinary character matches.

To remove normal tags without removing their content:

<.*?> matches both opening tags and closing tags.


6.5 Implement the search module - replace script tags and their content

Let's replace:

 private String readFile(File f){
        try (BufferedReader bufferedReader = new BufferedReader(new FileReader(f))){
            StringBuilder content = new StringBuilder();
            while (true){
                int ret = bufferedReader.read();
                if (ret == -1){
                    break;
                }
                char c = (char) ret;
                if (c == '\n' || c == '\r'){
                    //turn newlines into spaces
                    c = ' ';
                }
                content.append(c);
            }
            return content.toString();
        } catch (IOException e){
            e.printStackTrace();
        }
        return "";
    }

    public String parseContentRegex(File f){
        //1. Read the whole file into a String first
        String content = readFile(f);

        //2. Replace script tags together with their content
        content = content.replaceAll("<script.*?>(.*?)</script>", " ");

        //3. Replace ordinary HTML tags
        content = content.replaceAll("<.*?>", " ");
        //4. Use a regex to merge consecutive whitespace into a single space
        content = content.replaceAll("\\s+", " ");
        return content;
    }

Note the order of the replacements: the script replacement must run before the general HTML-tag replacement, otherwise the script bodies will be left behind. Finally, let's compare the effect:
insert image description here

The js code is gone. There were also runs of consecutive spaces, which we merged: \s+ matches one or more whitespace characters.
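A quick sketch of that last rule:

    String s = "a   b\t\nc";
    //\s+ matches one or more whitespace characters (spaces, tabs, newlines)
    System.out.println(s.replaceAll("\\s+", " "));   //prints "a b c"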

Then we swap the new parsing method in and rebuild the index.
insert image description here

The complete code of the Parser class:


import java.io.*;
import java.util.ArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicLong;


/**
 * Created by Lin
 * Description:
 * User: Administrator
 * Date: 2022-12-15
 * Time: 19:15
 */
public class Parser {
    
    
    //先指定一个加载文档的路径 ,由于是固定路径 我们使用 static 类属性 不需要变final
    private static final String INPUT_PATH  ="D:\\gitee\\doc_searcher_index\\docs\\api";     // 只需要api文件夹下的文件

    //创建一个Index实例
    private Index index =new Index();

    //为了避免线程不安全
    private AtomicLong t1 = new AtomicLong(0);
    private AtomicLong t2 = new AtomicLong(0);

    //通过这个方法实现单线程制作索引
    public  void run(){
    
    
        long beg = System.currentTimeMillis();
        System.out.println("索引制作开始");
        //整个Parser类的入口
        //1.根据指定的路径去枚举出该路径中所有的文件(所有子目录的html文件),这个过程需要把全部子目录的文件全部获取到
        ArrayList<File> fileList = new ArrayList<>();
        enumFile(INPUT_PATH,fileList);
        //测试枚举时间
        long endEnumFile = System.currentTimeMillis();
        System.out.println("枚举文件完毕 时间"+(endEnumFile - beg));

        //2.针对上面罗列出的路径,打开文件,读取文件内容,并进行解析.并构建索引
        for (File f :fileList){
    
    
            //通过这个方法解析单个HTML文件
            System.out.println("开始解析:" + f.getAbsolutePath());
            parseHTML(f);
        }
        long endFor = System.currentTimeMillis();
        System.out.println("循环遍历文件完毕 时间"+(endFor - endEnumFile)+"ms");
        //3.  把内存中构造好的索引数据结构,保存到指定的文件中
        index.save();
        long end = System.currentTimeMillis();
        System.out.println("索引制作完毕,消耗时间:"+(end - beg) + "ms");
    }

    //通过这个方法实现多线程制作索引
    public void runByThread() throws InterruptedException {
    
    
        long beg =System.currentTimeMillis();
        System.out.println("索引制作开始!");

        //1.,枚举全部文件
        ArrayList<File> files = new ArrayList<>();
        enumFile(INPUT_PATH,files);
        //2.循环遍历文件 此处为了通过多线程制作索引,就直接引入线程池
        CountDownLatch latch = new CountDownLatch(files.size());
        ExecutorService executorService = Executors.newFixedThreadPool(8);
        for(File f:files){
    
    
            //添加任务submit到线程池
            executorService.submit(new Runnable() {
    
    
                @Override
                public void run() {
    
    
                    System.out.println("解析"+f.getAbsolutePath());
                    parseHTML(f);
                    //保证所有的索引制作完再保存索引
                    latch.countDown();
                }
            });
        }
        //latch.await()等待全部countDown完成,才阻塞结束。
        latch.await();
        executorService.shutdown();
        //3.保存索引 ,可能存在还没有执行完的情况
        index.save();
        long end =System.currentTimeMillis();
        System.out.println("索引制作结束!时间"+(end - beg)+"ms");
        System.out.println("t1:" +t1 +"t2:"+t2);
    }
    //通过这个方法解析单个HTML文件
    private void parseHTML(File f) {
    
    
//        1. 解析出HTML标题
        String title  = parseTitle(f);
//        2. 解析出HTML对应的文章
        String url = parseUrl(f);
//        3. 解析出HTML对应的正文(有正文才有后续的描述)
        //纳秒级别时间
        long beg = System.nanoTime();
        String content = parseContentRegex(f);
        long mid = System.nanoTime();
       // 4.  解析的信息加入到索引当中
        index.addDoc(title,url,content);
        long end = System.nanoTime();
        t1.addAndGet(mid -beg);
        t2.addAndGet(end - mid);

    }
    private String readFile(File f){
    
    
        try(BufferedReader bufferedReader = new BufferedReader(new FileReader(f))){
    
    
            StringBuilder content = new StringBuilder();
            while(true){
    
    
                int ret = bufferedReader.read();
                if (ret==-1){
    
    
                    break;
                }
                char c = (char) ret;
                if (c=='\n' || c == '\r'){
    
    
                    c= ' ';
                }
                content.append(c);
            }
            return content.toString();
        }catch (IOException e){
    
    
            e.printStackTrace();
        }return  "";
    }

    public String parseContentRegex(File f){
    
    
        //1.先把整个文件都读取到String里面
        String content = readFile(f);

        //2.替换script标签
        content = content.replaceAll("<script.*?>(.*?)</script>"," ");

        //3.替换普通的HTML标签
        content = content.replaceAll("<.*?>"," ");
        //4.使用正则把多个空格,合并成一个空格
        content = content.replaceAll("\\s+"," ");
        return content;
    }

    public String parseContent(File f) {
    
    
        //先按照一个字符一个字符来读取,以< 和 > 来控制拷贝数据的开关
        try(BufferedReader bufferedReader = new BufferedReader(new FileReader(f),1024 *1024)) {
    
    
            //加上一个开关
            boolean isCopy = true;
            //还准备一个保存结果的StringBuilder
            StringBuilder content  = new StringBuilder();
            while (true){
    
    
                //read int类型 读到最后返回-1
                int ret = bufferedReader.read();
                if (ret == -1){
    
    
                    //表示文件读完了
                    break;
                }
                //不是-1就是合法字符
                char c = (char) ret;
                if (isCopy){
    
    
                    //打开的状态可以拷贝
                    if (c == '<'){
    
    
                        isCopy =false;
                        continue;
                    }
                    //判断是否是换行
                    if (c == '\n' || c == '\r'){
    
    
//                        是换行就变成空格
                        c = ' ';
                    }
                    //其他字符进行拷贝到StringBuilder中
                    content.append(c);
                }else{
    
    
                    //
                    if (c=='>'){
    
    
                        isCopy= true;
                    }
                }
            }
            return content.toString();
        } catch (IOException e) {
    
    
            e.printStackTrace();
        }
        return "";
    }

    private String parseUrl(File f) {
    
    
        //固定的前缀
        String path = "https://docs.oracle.com/javase/8/docs/api/";
        //只放一个参数的意思是:前面一段都不需要,取后面的一段
        String path2=   f.getAbsolutePath().substring(INPUT_PATH.length());
        return path + path2;
    }

    private String parseTitle(File f) {
    
    
        //获取文件名
        String name =  f.getName();
        return name.substring(0,name.length()-".html".length());
    }

    //第一个参数表示从那个目录开始进行遍历,第二个目录表示递归得到的结果
    private void enumFile(String inputPath, ArrayList<File> fileList) {
    
    
        //我们需要把String类型的路径变成文件类 好操作点
        File rootPath = new File(inputPath);
        //listFiles()类似于Linux的ls把当前目录中包含的文件名获取到
        //使用listFiles只可以看见一级目录,想看到子目录需要递归操作
        File[] files = rootPath.listFiles();
        for (File file : files) {
    
    
            //根据当前的file的类型,觉得是否递归
            //如果file是普通文件就把file加入到listFile里面
            //如果file是一个目录 就递归调用enumFile这个方法,来进一步获取子目录的内容
            if (file.isDirectory()){
    
    
                //根路径要变
                enumFile(file.getAbsolutePath(),fileList);
            }else {
    
    
                //只针对HTML文件
                if(file.getAbsolutePath().endsWith(".html")){
    
    
                    //普通HTML文件
                    fileList.add(file);
                }

            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
    
    
        //通过main方法来实现整个制作索引的过程
        Parser parser = new Parser();
        parser.runByThread();
    }

}

Go back to the search class and try again:
insert image description here
The problem is solved.


6.6 Implementing the Search Module - Summary of the Search Module

Our search module mainly implements the search method of the DocSearcher class:
1. segment 2. trigger 3. sort 4. package results. It strings together the pieces we prepared earlier.

It is this simple for now because there is no business logic yet:
insert image description here
real search engines also display images, regions, and so on here, with all kinds of rich user experience on top.

In real development, technology ultimately serves the business. Next we start implementing the web module.


7. Implement the Web module - agree on the front-end and back-end interaction interfaces

Next, we need to implement the web module, provide a web interface, and present the program to the user.

Front end (HTML + CSS + JS) + back end (Java: Servlet / Spring):

Here we only need to agree on a single interface: the search interface.

insert image description here
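The screenshot holds the agreement; reconstructed from the servlet and the Result class shown below, it is roughly the following (field values are placeholders):

    Request:  GET /search?query=<query word>
    Response: HTTP 200, Content-Type: application/json;charset=utf-8
    [
        { "title": "...", "url": "...", "desc": "..." },
        ...
    ]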


7.1 Implementing the Web module - implementing the backend based on Servlet

We first implement it with a plain Servlet, and switch to Spring Boot later.

My Tomcat is 8.5.x, so I use Servlet 3.1:

insert image description here

package api;

import com.fasterxml.jackson.databind.ObjectMapper;

import searcher.DocSearcher;
import searcher.Result;

import javax.servlet.ServletException;
import javax.servlet.annotation.WebServlet;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import java.io.IOException;
import java.util.List;


/**
 * Created by Lin
 * Description:
 * User: Administrator
 * Date: 2022-12-23
 * Time: 15:15
 */
@WebServlet("/search")
public class DocSearcherServlet extends HttpServlet {
    //The DocSearcher object should be globally unique, so it is static
    private static DocSearcher docSearcher = new DocSearcher();
    private ObjectMapper objectMapper = new ObjectMapper();

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws ServletException, IOException {
        //1. Parse the request and get the query word submitted by the user
        String query = req.getParameter("query");
        if (query == null || query.equals("")) {
            String msg = "你的参数非法!没有获取到query的值";
            resp.sendError(404, msg);
            return;
        }
        //2. Log the value of query
        System.out.println("query=" + query);
        //3. Call the search module to perform the search
        List<Result> results = docSearcher.search(query);
        //4. Package the search results as JSON
        resp.setContentType("application/json;charset=utf-8");
        objectMapper.writeValue(resp.getWriter(), results);
    }
}


Our directory structure has also changed a bit:
insert image description here


7.2 Implement the Web module - verify the backend interface

Since we use the IDEA Community edition, we run it with the SmartTomcat plugin:
insert image description here
Enter this URL:
insert image description here
At this preliminary stage our server can already return data.


7.3 Realize the Web module - realize the page structure

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Java文档搜索</title>
</head>
<body>
    <!-- 1. Search box and search button -->
    <!-- 2. Display of the search results -->

    <!-- .container holds all the elements of the page -->
    <div class="container">
        <!-- Search box plus search button -->
        <div class="header">
            <input type="text">
            <button id="search-btn">搜索</button>
        </div>
        <!-- Search results -->
        <div class="result">
                <!-- Contains many records -->
                <!-- The real results are fetched from the server -->
                <div class="item">
                    <a href="#">我是标题</a>
                    <div class="desc">我是一段描述: Lorem ipsum dolor sit, amet consectetur adipisicing elit. Cumque sunt maxime eveniet ducimus error nihil quidem assumenda eius soluta esse, officiis, dolores tenetur sit temporibus. Ea aliquam culpa beatae vitae.</div>
                    <div class="url">http://www.baidu.com</div>
                </div>

                <div class="item">
                    <a href="#">我是标题</a>
                    <div class="desc">我是一段描述: Lorem ipsum dolor sit, amet consectetur adipisicing elit. Cumque sunt maxime eveniet ducimus error nihil quidem assumenda eius soluta esse, officiis, dolores tenetur sit temporibus. Ea aliquam culpa beatae vitae.</div>
                    <div class="url">http://www.baidu.com</div>
                </div>

                <div class="item">
                    <a href="#">我是标题</a>
                    <div class="desc">我是一段描述: Lorem ipsum dolor sit, amet consectetur adipisicing elit. Cumque sunt maxime eveniet ducimus error nihil quidem assumenda eius soluta esse, officiis, dolores tenetur sit temporibus. Ea aliquam culpa beatae vitae.</div>
                    <div class="url">http://www.baidu.com</div>
                </div>

                <div class="item">
                    <a href="#">我是标题</a>
                    <div class="desc">我是一段描述: Lorem ipsum dolor sit, amet consectetur adipisicing elit. Cumque sunt maxime eveniet ducimus error nihil quidem assumenda eius soluta esse, officiis, dolores tenetur sit temporibus. Ea aliquam culpa beatae vitae.</div>
                    <div class="url">http://www.baidu.com</div>
                </div>

                <div class="item">
                    <a href="#">我是标题</a>
                    <div class="desc">我是一段描述: Lorem ipsum dolor sit, amet consectetur adipisicing elit. Cumque sunt maxime eveniet ducimus error nihil quidem assumenda eius soluta esse, officiis, dolores tenetur sit temporibus. Ea aliquam culpa beatae vitae.</div>
                    <div class="url">http://www.baidu.com</div>
                </div>

        </div>
    </div>
</body>
</html>

Our most basic page is now in place:

insert image description here


7.4 Implementing Web Modules - Implementing Page Styles (CSS)

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Java文档搜索</title>
</head>
<body>
    <!-- 1. Search box and search button -->
    <!-- 2. Display of the search results -->

    <!-- .container holds all the elements of the page -->
    <div class="container">
        <!-- Search box plus search button -->
        <div class="header">
            <input type="text">
            <button id="search-btn">搜索</button>
        </div>
        <!-- Search results -->
        <div class="result">
                <!-- Contains many records -->
                <!-- The real results are fetched from the server -->
                <div class="item">
                    <a href="#">我是标题</a>
                    <div class="desc">我是一段描述: Lorem ipsum dolor sit, amet consectetur adipisicing elit. Cumque sunt maxime eveniet ducimus error nihil quidem assumenda eius soluta esse, officiis, dolores tenetur sit temporibus. Ea aliquam culpa beatae vitae.</div>
                    <div class="url">http://www.baidu.com</div>
                </div>

                <div class="item">
                    <a href="#">我是标题</a>
                    <div class="desc">我是一段描述: Lorem ipsum dolor sit, amet consectetur adipisicing elit. Cumque sunt maxime eveniet ducimus error nihil quidem assumenda eius soluta esse, officiis, dolores tenetur sit temporibus. Ea aliquam culpa beatae vitae.</div>
                    <div class="url">http://www.baidu.com</div>
                </div>

                <div class="item">
                    <a href="#">我是标题</a>
                    <div class="desc">我是一段描述: Lorem ipsum dolor sit, amet consectetur adipisicing elit. Cumque sunt maxime eveniet ducimus error nihil quidem assumenda eius soluta esse, officiis, dolores tenetur sit temporibus. Ea aliquam culpa beatae vitae.</div>
                    <div class="url">http://www.baidu.com</div>
                </div>

                <div class="item">
                    <a href="#">我是标题</a>
                    <div class="desc">我是一段描述: Lorem ipsum dolor sit, amet consectetur adipisicing elit. Cumque sunt maxime eveniet ducimus error nihil quidem assumenda eius soluta esse, officiis, dolores tenetur sit temporibus. Ea aliquam culpa beatae vitae.</div>
                    <div class="url">http://www.baidu.com</div>
                </div>

                <div class="item">
                    <a href="#">我是标题</a>
                    <div class="desc">我是一段描述: Lorem ipsum dolor sit, amet consectetur adipisicing elit. Cumque sunt maxime eveniet ducimus error nihil quidem assumenda eius soluta esse, officiis, dolores tenetur sit temporibus. Ea aliquam culpa beatae vitae.</div>
                    <div class="url">http://www.baidu.com</div>
                </div>

        </div>
    </div>

    <style>
        /* The page styles live here */
        /* First remove the browser's default styles */
        *{
            margin: 0;
            padding: 0;
            box-sizing: border-box;
        }

        /* Give the page a height (the same as the browser window) */
        html,body{
            height: 100%;
            /* Set the background image */
            background-image: url(image/bjt.jpg);
            /* Do not tile the background image */
            background-repeat: no-repeat;
            /* Size of the background image */
            background-size: cover;
            /* Position of the background image */
            background-position: center center;
        }

        /* Style .container to create the centered page core */
        .container{
            width: 1135px;
            height: 100%;
            /* Center horizontally */
            margin: 0 auto;
            /* Background color, so the page core stands out from the background image */
            background-color: rgba(255, 255, 255, 0.8);
            /* Rounded corners */
            border-radius: 10px;
            /* Padding, so the content does not press against the border */
            padding: 15px;

            /* Show a scrollbar when the content overflows */
            overflow: auto;
        }
        .header{
            width: 100%;
            height: 50px;
            display: flex;
            justify-content: space-between;
            align-items: center;
        }
        .header>input{
            height: 30px;
            width: 1000px;
            font-size: 22px;
            line-height: 50px;
            padding-left: 10px;
            border-radius: 10px;
        }

        .header>button{
            height: 30px;
            width: 100px;
            background-color: antiquewhite;
            color: black;
            border-radius: 10px;
        }
        .result .count{
            color: darkblue;
            margin-top: 10px;
        }
        .header>button:active{
            background: gray;
        }
        .item{
            width: 100%;
            margin-top: 20px;
        }

        .item a{
            display: block;
            height: 40px;
            font-size: 22px;
            line-height: 40px;
            font-weight: 700;
            color: rgb(42, 107, 205);
        }

        .item .desc{
            font-size: 18px;
        }

        .item .url{
            font-size: 18px;
            color: rgb(0, 130, 0);
        }

        .item>.desc>i {
            color: red;
            /* Remove the italics */
            font-style: normal;
        }
    </style>
</body>
</html>


7.5 Implement the Web module - get search results through ajax

The data above is hard-coded; the real data has to come back from the server. We use ajax to fetch it:

When the user clicks the search button, the browser reads the contents of the search box, constructs an ajax request, and sends it to the search server. When the browser receives the search results, it builds the page from the returned JSON data. We use jQuery for the ajax call.

<script src="http://libs.baidu.com/jquery/2.0.0/jquery.min.js"></script>

Just copy this tag in and it works.

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Java文档搜索</title>
</head>
<body>
    <!-- 1. Search box and search button -->
    <!-- 2. Display of the search results -->

    <!-- .container holds all the elements of the page -->
    <div class="container">
        <!-- Search box plus search button -->
        <div class="header">
            <input type="text">
            <button id="search-btn">搜索</button>
        </div>
        <!-- Search results -->
        <div class="result">
                <!-- Contains many records -->
                <!-- The real results are fetched from the server -->
<!--                <div class="item">-->
<!--                    <a href="#">我是标题</a>-->
<!--                    <div class="desc">我是一段描述: Lorem ipsum dolor sit, amet consectetur adipisicing elit. Cumque sunt maxime eveniet ducimus error nihil quidem assumenda eius soluta esse, officiis, dolores tenetur sit temporibus. Ea aliquam culpa beatae vitae.</div>-->
<!--                    <div class="url">http://www.baidu.com</div>-->
<!--                </div>-->

<!--                <div class="item">-->
<!--                    <a href="#">我是标题</a>-->
<!--                    <div class="desc">我是一段描述: Lorem ipsum dolor sit, amet consectetur adipisicing elit. Cumque sunt maxime eveniet ducimus error nihil quidem assumenda eius soluta esse, officiis, dolores tenetur sit temporibus. Ea aliquam culpa beatae vitae.</div>-->
<!--                    <div class="url">http://www.baidu.com</div>-->
<!--                </div>-->

<!--                <div class="item">-->
<!--                    <a href="#">我是标题</a>-->
<!--                    <div class="desc">我是一段描述: Lorem ipsum dolor sit, amet consectetur adipisicing elit. Cumque sunt maxime eveniet ducimus error nihil quidem assumenda eius soluta esse, officiis, dolores tenetur sit temporibus. Ea aliquam culpa beatae vitae.</div>-->
<!--                    <div class="url">http://www.baidu.com</div>-->
<!--                </div>-->

<!--                <div class="item">-->
<!--                    <a href="#">我是标题</a>-->
<!--                    <div class="desc">我是一段描述: Lorem ipsum dolor sit, amet consectetur adipisicing elit. Cumque sunt maxime eveniet ducimus error nihil quidem assumenda eius soluta esse, officiis, dolores tenetur sit temporibus. Ea aliquam culpa beatae vitae.</div>-->
<!--                    <div class="url">http://www.baidu.com</div>-->
<!--                </div>-->

<!--                <div class="item">-->
<!--                    <a href="#">我是标题</a>-->
<!--                    <div class="desc">我是一段描述: Lorem ipsum dolor sit, amet consectetur adipisicing elit. Cumque sunt maxime eveniet ducimus error nihil quidem assumenda eius soluta esse, officiis, dolores tenetur sit temporibus. Ea aliquam culpa beatae vitae.</div>-->
<!--                    <div class="url">http://www.baidu.com</div>-->
<!--                </div>-->

        </div>
    </div>

    <style>
        /* The page styles live here */
        /* First remove the browser's default styles */
        *{
            margin: 0;
            padding: 0;
            box-sizing: border-box;
        }

        /* Give the page a height (the same as the browser window) */
        html,body{
            height: 100%;
            /* Set the background image */
            background-image: url(image/bjt.jpg);
            /* Do not tile the background image */
            background-repeat: no-repeat;
            /* Size of the background image */
            background-size: cover;
            /* Position of the background image */
            background-position: center center;
        }

        /* Style .container to create the centered page core */
        .container{
            width: 1135px;
            height: 100%;
            /* Center horizontally */
            margin: 0 auto;
            /* Background color, so the page core stands out from the background image */
            background-color: rgba(255, 255, 255, 0.8);
            /* Rounded corners */
            border-radius: 10px;
            /* Padding, so the content does not press against the border */
            padding: 15px;

            /* Show a scrollbar when the content overflows */
            overflow: auto;
        }
        .header{
            width: 100%;
            height: 50px;
            display: flex;
            justify-content: space-between;
            align-items: center;
        }
        .header>input{
            height: 30px;
            width: 1000px;
            font-size: 22px;
            line-height: 50px;
            padding-left: 10px;
            border-radius: 10px;
        }

        .header>button{
            height: 30px;
            width: 100px;
            background-color: antiquewhite;
            color: black;
            border-radius: 10px;
        }
        .result .count{
            color: darkblue;
            margin-top: 10px;
        }
        .header>button:active{
            background: gray;
        }
        .item{
            width: 100%;
            margin-top: 20px;
        }

        .item a{
            display: block;
            height: 40px;
            font-size: 22px;
            line-height: 40px;
            font-weight: 700;
            color: rgb(42, 107, 205);
        }

        .item .desc{
            font-size: 18px;
        }

        .item .url{
            font-size: 18px;
            color: rgb(0, 130, 0);
        }

        .item>.desc>i {
            color: red;
            /* Remove the italics */
            font-style: normal;
        }
    </style>

<script src="http://libs.baidu.com/jquery/2.0.0/jquery.min.js"></script>
<script>
    // Our own js code goes here
    let button = document.querySelector("#search-btn");
    button.onclick = function(){
        // First read the contents of the input box
        let input = document.querySelector(".header input");
        let query = input.value;
        // Then construct the ajax request
        $.ajax({
            type: "GET",
            url: "search?query=" + query,
            success: function(data, status){
                // success is called when the request succeeds
                // data is the result data we got back
                // status is the HTTP status code
                // Build the page content from the received data
                // console.log(data);
                buildResult(data);
            }
        })
    }

    function buildResult(data) {
        // Turn the response data into page content:
        // iterate over the elements of data, create a div.item for each,
        // fill in the title, url and description,
        // then append the div.item to div.result.
        // All of this is done with the DOM API.

        // Get the .result element
        let result = document.querySelector('.result');

        // Clear the previous results
        result.innerHTML = ' ';

        // First build a div showing the number of results
        let countDiv = document.createElement("div");
        countDiv.innerHTML = '当前找到约' + data.length + '个结果!';
        countDiv.className = 'count';
        result.appendChild(countDiv);

        // Each item here is one element of data
        for(let item of data){
            let itemDiv = document.createElement('div');
            itemDiv.className = 'item';
            // Build the title
            let title = document.createElement('a');
            title.href = item.url;
            title.innerHTML = item.title;
            title.target = '_blank';
            itemDiv.appendChild(title);

            // Build the description
            let desc = document.createElement('div');
            desc.className = 'desc';
            desc.innerHTML = item.desc;
            itemDiv.appendChild(desc);

            // Build the url
            let url = document.createElement('div');
            url.className = 'url';
            url.innerHTML = item.url;
            itemDiv.appendChild(url);

            // Append itemDiv to result
            result.appendChild(itemDiv);
        }
    }
</script>
</body>
</html>

7.6 Implementing the Web module - highlighting query words in red

We want to achieve an effect similar to this:
insert image description here
The idea:

  1. Modify the backend: when generating a search result (specifically the description part), wrap each occurrence of a query word in a tag, e.g. an <i> tag
  2. The front end then styles that tag
    private String GenDesc(String content, List<Term> terms) {
        // First iterate over the terms and see which one occurs in content
        int firstPos = -1;
        for (Term term : terms) {
            String word = term.getName();

            // Segmentation lowercases the text, so lowercase the content before matching

            // Surround the word with spaces so that only stand-alone words match
            firstPos = content.toLowerCase().indexOf(" " + word + " ");
            if (firstPos >= 0) {
                break;
            }

            if (firstPos == -1) {
                // Extreme case: none of the segmented terms occur in the content
                return content.substring(0, 160) + "...";
            }
        }
        // Take firstPos as the anchor and go 60 characters back for the start of the description
        String desc = "";
        // If firstPos is within the first 60 characters, start at 0; otherwise start 60 characters before the query word
        int descBeg = firstPos < 60 ? 0 : firstPos - 60;
        if (descBeg + 160 > content.length()) {
            // The description would run past the end of the content,
            // so take everything from the start position to the end
            desc = content.substring(descBeg);
        } else {
            desc = content.substring(descBeg, descBeg + 160) + "...";
        }
        // Replacement step: wrap every term occurrence in the description
        // in an <i> tag, implemented with replaceAll
        for (Term term : terms) {
            String word = term.getName();
            // Only whole query words get the <i> tag; the regex flag (?i) makes the match case-insensitive
            desc = desc.replaceAll("(?i)" + word + " ", "<i> " + word + " </i>");
        }
        return desc;
    }
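
To make the 60/160 window arithmetic concrete, here is a worked example (the numbers are made up for illustration):

// Suppose the first stand-alone match is at firstPos = 100 in a 500-character content:
//   descBeg = firstPos - 60 = 40                       (since firstPos >= 60)
//   desc    = content.substring(40, 200) + "..."       (since 40 + 160 <= 500)
// If instead firstPos = 20, then descBeg = 0 and desc = content.substring(0, 160) + "..."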

The final code part:
insert image description here

The final effect:

insert image description here


7.7 Implementing the Web module - testing more complex query terms

We typed a query with a space in it into the search box, and the server returned a 500 error. Why is this?
insert image description here
The detailed stack trace:
insert image description here
It turned out to be our edge case: the content was shorter than 160 characters, yet we still cut a 160-character substring, so an exception was thrown
insert image description here

Let's fix the code:
insert image description here
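
The screenshot is hard to read; the gist of the fix is a length guard, which also appears in the full listing in the next section:

if (firstPos == -1) {
    // Extreme case: none of the segmented terms occur in the content
    if (content.length() > 160) {
        return content.substring(0, 160) + "...";
    }
    // The content is shorter than 160 characters: return it whole
    return content;
}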
With the guard in place, it works:
insert image description here

But we found another problem: many of the hits have nothing to do with the query word

insert image description here

Why is this?

Because the query we just typed is segmented not only into Array and List but also into the space between them, and the code happily looks up the inverted index with that space! So we need to drop the spaces, and more generally we introduce a special concept.


Stop words: words that occur with high frequency but carry no meaning.

For example: a, is, have (and in Chinese, words like 一 and 是). We don't want this class of words to trigger index lookups at all.


7.8 Implementing the Web module - handling stop words

We can find a ready-made stop-word list on the Internet. Note that the first entry in the list shown here is a single space:
insert image description here

Next, the search program loads this stop-word list into memory, stores the words in a HashSet, and filters the segmentation results against it: if a term appears in the stop-word list, it is simply dropped.

Here we keep the list in a txt file:insert image description here

package searcher;

import org.ansj.domain.Term;
import org.ansj.splitWord.analysis.ToAnalysis;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.*;

/**
 * Created by Lin
 * Description:
 * User: Administrator
 * Date: 2022-12-21
 * Time: 13:22
 */
public class DocSearcher {
    private static final String STOP_WORD_PATH = "D:\\gitee\\doc_searcher_index\\stop_word.txt";

    // Use this HashSet to hold the stop words
    private HashSet<String> stopWords = new HashSet<>();

    private Index index = new Index();

    public DocSearcher() {
        // Load the index and the stop words up front
        index.load();
        loadStopWords();
    }

    // Performs the whole search process.
    // Parameter (input): the query word given by the user.
    // Return value (output): the collection of search results.
    public List<Result> search(String query) {
        // 1. [Segmentation] Segment the query
        List<Term> oldTerms = ToAnalysis.parse(query).getTerms();
        List<Term> terms = new ArrayList<>();
        // Filter the segmentation results through the stop-word list
        for (Term term : oldTerms) {
            if (stopWords.contains(term.getName())) {
                // Stop words are not copied over
                continue;
            }
            terms.add(term);
        }

        // 2. [Triggering] Look up the inverted index for each term
        List<Weight> allTermResult = new ArrayList<>();
        for (Term term : terms) {
            String word = term.getName();
            List<Weight> invertedList = index.getInverted(word);
            if (invertedList == null) {
                // The word does not exist in the index
                continue;
            }
            allTermResult.addAll(invertedList);
        }
        // 3. [Sorting] Sort the triggered results by descending weight
        allTermResult.sort(new Comparator<Weight>() {
            @Override
            public int compare(Weight o1, Weight o2) {
                // Descending: return o2.getWeight() - o1.getWeight(); ascending is the reverse
                return o2.getWeight() - o1.getWeight();
            }
        });
        // 4. [Packaging] Look up the forward index for the sorted results and build the returned data
        List<Result> results = new ArrayList<>();
        for (Weight weight : allTermResult) {
            DocInfo docInfo = index.getDocInfo(weight.getDocId());
            Result result = new Result();
            result.setTitle(docInfo.getTitle());
            result.setUrl(docInfo.getUrl());
            result.setDesc(GenDesc(docInfo.getContent(), terms));
            results.add(result);
        }
        return results;
    }

    private String GenDesc(String content, List<Term> terms) {
        // First iterate over the terms and see which one occurs in content
        int firstPos = -1;
        for (Term term : terms) {
            String word = term.getName();

            // Segmentation lowercases the text, so lowercase the content before matching

            // Surround the word with spaces so that only stand-alone words match
            firstPos = content.toLowerCase().indexOf(" " + word + " ");
            if (firstPos >= 0) {
                break;
            }

            if (firstPos == -1) {
                if (content.length() > 160) {
                    return content.substring(0, 160) + "...";
                }
                // Extreme case: none of the segmented terms occur in the content
                return content;
            }
        }
        // Take firstPos as the anchor and go 60 characters back for the start of the description
        String desc = "";
        // If firstPos is within the first 60 characters, start at 0; otherwise start 60 characters before the query word
        int descBeg = firstPos < 60 ? 0 : firstPos - 60;
        if (descBeg + 160 > content.length()) {
            // The description would run past the end of the content,
            // so take everything from the start position to the end
            desc = content.substring(descBeg);
        } else {
            desc = content.substring(descBeg, descBeg + 160) + "...";
        }
        // Replacement step: wrap every term occurrence in the description
        // in an <i> tag, implemented with replaceAll
        for (Term term : terms) {
            String word = term.getName();
            // Only whole query words get the <i> tag; the regex flag (?i) makes the match case-insensitive
            desc = desc.replaceAll("(?i)" + word + " ", "<i> " + word + " </i>");
        }
        return desc;
    }

    public void loadStopWords() {
        System.out.println("加载暂停词表");
        try (BufferedReader bufferedReader = new BufferedReader(new FileReader(STOP_WORD_PATH))) {
            while (true) {
                String line = bufferedReader.readLine();
                if (line == null) {
                    // Finished reading the file
                    break;
                }
                stopWords.add(line);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) {
        DocSearcher docSearcher = new DocSearcher();
        Scanner scanner = new Scanner(System.in);
        while (true) {
            System.out.print("->");
            String query = scanner.next();
            List<Result> results = docSearcher.search(query);
            for (Result result : results) {
                System.out.println("======================================");
                System.out.println(result);
            }
        }
    }
}

Loading the stop words into the HashSet:
insert image description here
We also change the search method accordingly:insert image description here

If a term is contained in the stop-word set, it is not added to the term list


7.9 Implementing the Web module - fixing bugs in description generation

Here the description is generated from the very beginning of the document, not according to our rules.
insert image description here
We don't see the query word in descriptions like these, yet the documents still matched. What is the problem?

Is the word actually contained in the page? Let's open it and check:

insert image description here
It is indeed there, but the generated description is still wrong. Earlier we only matched stand-alone occurrences of a word:
insert image description here

The text contains the word List, and List should indeed not match as part of ArrayList. Matching a stand-alone word ("a List b") works, but some cases still slip through, e.g. "a List." where the word is followed by punctuation; those would be missed,

To solve this problem, our tool is once again the regular expression.

\b: matches a word boundary, including punctuation

insert image description here
But how do we apply this regex?
insert image description here
The trick is to first rewrite each boundary-delimited occurrence of the word so that it is surrounded by spaces, and then run the existing space-based logic on it.
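
In code, the normalization is a single replaceAll per term; this is essentially the line used in the final listing in section 7.13 (note that it also lowercases the content before matching):

// Rewrite every "\bword\b" occurrence as " word " so the existing
// space-based matching and indexOf logic keep working
content = content.toLowerCase().replaceAll("\\b" + word + "\\b", " " + word + " ");
firstPos = content.indexOf(" " + word + " ");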

insert image description here
As the description now shows, it works.


7.10 Implementing the Web module - showing the number of search results

We want to achieve an effect like this:
insert image description here
There are two options:

  1. Compute the count on the server side and return it to the browser
  2. Compute it on the browser side from the length of the received result array

We choose the second option:
insert image description here

The page is otherwise unchanged from the listing in section 7.5; the relevant part is the count element built at the top of buildResult:

        // First build a div showing the number of results
        let countDiv = document.createElement("div");
        countDiv.innerHTML = '当前找到约' + data.length + '个结果!';
        countDiv.className = 'count';
        result.appendChild(countDiv);

The .result .count style rule shown earlier takes care of its appearance.

7.11 Implementing the Web module - the problem of duplicate documents

Array: 1598
insert image description here
List: 1381
insert image description here
Array List: 2979
insert image description here
So the count for "Array List" is exactly the sum of the two individual counts. On top of that, duplicate documents appear: for example, this Collections page shows up twice

First occurrence:
insert image description here

Position of the second occurrence:
insert image description here

A document shouldn't appear twice; and a document that is triggered twice should in fact end up with a higher weight

So what is going on?

We trigger and weight each segmented term independently:

array triggers one set of docIds
list triggers another set of docIds
and now Collections is triggered once by array and once by list

Our solution:

Add the weights together: if the first trigger carries weight 10 and the second carries weight 5, the merged weight is 15.
To achieve this effect, we need to merge the triggered result lists.
Since the lists may contain the same docId, merging is also where we deduplicate and sum the weights.

So how do we deduplicate?

The core idea of deduplication:
Among the data structures we have studied there is the linked list, and a classic linked-list problem is merging two sorted linked lists. We can merge our arrays in a similar way.

Merging two sorted linked lists: keep one pointer at the head of each list, compare the two pointed-to nodes, and move the one with the smaller value into the new list.

Our general plan:

First sort each term's result list by docId ascending so that it is ordered, then start merging. During the merge, before inserting an element, check whether its docId equals the docId of the element inserted last time; if it does, the document is a duplicate, so instead of inserting it we add its weight onto the previous element. A sketch of the two-way case follows below.
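
As a warm-up, here is a minimal sketch of the two-way case (a hypothetical helper, not part of the project code; it assumes both lists are already sorted by docId, and that Weight has a no-argument constructor plus setDocId/setWeight setters):

// Merge two Weight lists sorted by docId; on equal docIds, deduplicate and sum the weights
private static List<Weight> mergeTwo(List<Weight> a, List<Weight> b) {
    List<Weight> target = new ArrayList<>();
    int i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        Weight w1 = a.get(i);
        Weight w2 = b.get(j);
        if (w1.getDocId() < w2.getDocId()) {
            target.add(w1); i++;
        } else if (w1.getDocId() > w2.getDocId()) {
            target.add(w2); j++;
        } else {
            // Same document triggered by both terms: keep one entry, add the weights
            Weight merged = new Weight();
            merged.setDocId(w1.getDocId());
            merged.setWeight(w1.getWeight() + w2.getWeight());
            target.add(merged);
            i++; j++;
        }
    }
    // Copy whatever is left of the longer list
    while (i < a.size()) target.add(a.get(i++));
    while (j < b.size()) target.add(b.get(j++));
    return target;
}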

One more wrinkle: the user's query does not necessarily segment into exactly two terms; it may produce many terms, i.e. N arrays to merge.


7.12 Implementing the Web module - the idea of a multi-way merge

Two-way array merge: the core is to compare the two currently pointed-to elements, pick the smaller one, and insert it into the result

Multi-way array merge: the core is to compare the currently pointed-to elements of all the rows, pick the smallest one, and insert it into the result

insert image description here
As a two-dimensional structure:
insert image description here

We need to find which row's current element has the smallest docId, take it out, insert it into the result array, and advance that row's subscript. To find the smallest element efficiently, we use a heap, i.e. a priority queue.


The figure shows the docIds of the Weight objects: treat each term's trigger list as a row and each Weight within it as a column. Once rows and columns are involved, we are effectively working with a two-dimensional array.

Let's look at the code



7.13 Implementing Web Modules - Implementing Multiway Merge

DocSearcher

package searcher;

import org.ansj.domain.Term;
import org.ansj.splitWord.analysis.ToAnalysis;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.*;

/**
 * Created by Lin
 * Description:
 * User: Administrator
 * Date: 2022-12-21
 * Time: 13:22
 */
public class DocSearcher {
    private static final String STOP_WORD_PATH = "D:\\gitee\\doc_searcher_index\\stop_word.txt";

    // Use this HashSet to hold the stop words
    private HashSet<String> stopWords = new HashSet<>();

    private Index index = new Index();

    public DocSearcher() {
        // Load the index and the stop words up front
        index.load();
        loadStopWords();
    }

    // Performs the whole search process.
    // Parameter (input): the query word given by the user.
    // Return value (output): the collection of search results.
    public List<Result> search(String query) {
        // 1. [Segmentation] Segment the query
        List<Term> oldTerms = ToAnalysis.parse(query).getTerms();
        List<Term> terms = new ArrayList<>();
        // Filter the segmentation results through the stop-word list
        for (Term term : oldTerms) {
            if (stopWords.contains(term.getName())) {
                // Stop words are not copied over
                continue;
            }
            terms.add(term);
        }

        // 2. [Triggering] Look up the inverted index for each term.
        // One row per term: a two-dimensional structure
        List<List<Weight>> termResult = new ArrayList<>();
        for (Term term : terms) {
            String word = term.getName();
            List<Weight> invertedList = index.getInverted(word);
            if (invertedList == null) {
                // The word does not exist in the index
                continue;
            }
            termResult.add(invertedList);
        }
        // 3. [Merging] Merge the rows triggered by the different terms:
        // deduplicate documents and sum their weights
        List<Weight> allTermResult = mergeResult(termResult);

        // 4. [Sorting] Sort the merged results by descending weight
        allTermResult.sort(new Comparator<Weight>() {
            @Override
            public int compare(Weight o1, Weight o2) {
                // Descending: return o2.getWeight() - o1.getWeight(); ascending is the reverse
                return o2.getWeight() - o1.getWeight();
            }
        });
        // 5. [Packaging] Look up the forward index for the sorted results and build the returned data
        List<Result> results = new ArrayList<>();
        for (Weight weight : allTermResult) {
            DocInfo docInfo = index.getDocInfo(weight.getDocId());
            Result result = new Result();
            result.setTitle(docInfo.getTitle());
            result.setUrl(docInfo.getUrl());
            result.setDesc(GenDesc(docInfo.getContent(), terms));
            results.add(result);
        }
        return results;
    }

    // During the merge, multiple rows are combined into a single row.
    // The merge has to address individual elements of the two-dimensional structure,
    // so we first define a (row, col) position class.
    static class Pos {
        public int row;
        public int col;

        public Pos(int row, int col) {
            this.row = row;
            this.col = col;
        }
    }

    private List<Weight> mergeResult(List<List<Weight>> source) {
        // 1. Sort each row by docId ascending; otherwise it cannot be merged
        for (List<Weight> curRow : source) {
            curRow.sort(new Comparator<Weight>() {
                @Override
                public int compare(Weight o1, Weight o2) {
                    return o1.getDocId() - o2.getDocId();
                }
            });
        }

        // 2. Merge the rows with the help of a priority queue.
        // target holds the merged result.
        List<Weight> target = new ArrayList<>();
        // The point of the priority queue: find the row whose current Weight has the
        // smallest docId, insert that object into target, and advance that row's cursor
        PriorityQueue<Pos> queue = new PriorityQueue<>(new Comparator<Pos>() {
            @Override
            public int compare(Pos o1, Pos o2) {
                // Locate the two Weight objects and order them by docId
                Weight w1 = source.get(o1.row).get(o1.col);
                Weight w2 = source.get(o2.row).get(o2.col);
                return w1.getDocId() - w2.getDocId();
            }
        });

        // Initialize the queue with the first element of every row
        for (int row = 0; row < source.size(); row++) {
            queue.offer(new Pos(row, 0));
        }
        // Keep taking the head of the queue: the smallest current element across the rows
        while (!queue.isEmpty()) {
            Pos minPos = queue.poll();
            // Get the smallest Weight object
            Weight curWeight = source.get(minPos.row).get(minPos.col);
            // Check whether it has the same docId as the element inserted into target last time;
            // if so, merge the two
            if (target.size() > 0) {
                // Take out the element inserted last time
                Weight lastWeight = target.get(target.size() - 1);
                if (lastWeight.getDocId() == curWeight.getDocId()) {
                    // The same document encountered again: add the weights
                    lastWeight.setWeight(lastWeight.getWeight() + curWeight.getWeight());
                } else {
                    // A different document: insert it
                    target.add(curWeight);
                }
            } else {
                // target is still empty: just insert
                target.add(curWeight);
            }
            // This element is handled; move the cursor of its row to the next element
            Pos newPos = new Pos(minPos.row, minPos.col + 1);
            if (newPos.col >= source.get(newPos.row).size()) {
                // The cursor ran past the last column of this row: the row is finished
                continue;
            }
            // The priority queue puts the new position where it belongs
            queue.offer(newPos);
        }
        return target;
    }

    private String GenDesc(String content, List<Term> terms) {
        // First iterate over the terms and see which one occurs in content
        int firstPos = -1;
        for (Term term : terms) {
            String word = term.getName();

            // Segmentation lowercases the text, so the content is lowercased too.
            // Rewrite "\bword\b" as " word " so that only stand-alone words match
            content = content.toLowerCase().replaceAll("\\b" + word + "\\b", " " + word + " ");
            firstPos = content.indexOf(" " + word + " ");
            if (firstPos >= 0) {
                break;
            }
        }
        if (firstPos == -1) {
            if (content.length() > 160) {
                return content.substring(0, 160) + "...";
            }
            // Extreme case: none of the segmented terms occur in the content
            return content;
        }
        // Take firstPos as the anchor and go 60 characters back for the start of the description
        String desc = "";
        // If firstPos is within the first 60 characters, start at 0; otherwise start 60 characters before the query word
        int descBeg = firstPos < 60 ? 0 : firstPos - 60;
        if (descBeg + 160 > content.length()) {
            // The description would run past the end of the content,
            // so take everything from the start position to the end
            desc = content.substring(descBeg);
        } else {
            desc = content.substring(descBeg, descBeg + 160) + "...";
        }
        // Replacement step: wrap every term occurrence in the description
        // in an <i> tag, implemented with replaceAll
        for (Term term : terms) {
            String word = term.getName();
            // Only whole query words get the <i> tag; the regex flag (?i) makes the match case-insensitive
            desc = desc.replaceAll("(?i)" + word + " ", "<i> " + word + " </i>");
        }
        return desc;
    }

    public void loadStopWords() {
        System.out.println("加载暂停词表");
        try (BufferedReader bufferedReader = new BufferedReader(new FileReader(STOP_WORD_PATH))) {
            while (true) {
                String line = bufferedReader.readLine();
                if (line == null) {
                    // Finished reading the file
                    break;
                }
                stopWords.add(line);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) {
        DocSearcher docSearcher = new DocSearcher();
        Scanner scanner = new Scanner(System.in);
        while (true) {
            System.out.print("->");
            String query = scanner.next();
            List<Result> results = docSearcher.search(query);
            for (Result result : results) {
                System.out.println("======================================");
                System.out.println(result);
            }
        }
    }
}
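
To see mergeResult at work, here is a tiny hand trace (docIds and weights are made up for illustration):

// row 0 (term "array"): (doc 1, w 10), (doc 3, w 5)
// row 1 (term "list") : (doc 2, w 8),  (doc 3, w 7)
// queue seeded with Pos(0,0) and Pos(1,0)
// poll -> doc 1:          target = [1(10)]
// poll -> doc 2:          target = [1(10), 2(8)]
// poll -> doc 3 (row 0):  target = [1(10), 2(8), 3(5)]
// poll -> doc 3 (row 1):  same docId as the last element, weights merge: 3(12)
// final result: [1(10), 2(8), 3(12)]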


The code changes are in the search method and the new mergeResult method

Let's walk through the important parts of the code.

The trigger step now builds a two-dimensional structure, because mergeResult has to operate on one row of Weight objects at a time. This is the input to mergeResult:
insert image description here
After deduplication and re-sorting:
insert image description here

Because addressing an element of the two-dimensional structure requires both a row and a column, we add the small inner class Pos:
insert image description here

The merge borrows the technique of merging two sorted linked lists. To be mergeable, the rows must be sorted first, by docId:
insert image description here
insert image description here

target is the new collection that we return once the merge is done. We use a priority queue to find which row's current Pos points at the smallest element, and append that element to our result. The point of the priority queue is: find the Weight with the smallest docId across the rows, insert that object into target, and then advance the corresponding subscript

insert image description here

First, seed the queue with the first element of each row:

insert image description here

insert image description here

Then loop, polling the head of the queue (the smallest current element across the rows): poll takes the element out, and we fetch the corresponding Weight object. If target is empty, the Weight is appended directly; otherwise we compare it with the element inserted last time, and if the docIds are equal, the weights are added together

insert image description here
Advancing the cursor stays on the same row and simply adds 1 to the column; if that runs past the last column, the row is finished. Finally we return target.
insert image description here


7.14 Implementing the Web module - verifying the weight merge

Our first search results were both duplicated and redundant.
After the code change, the duplicates and redundant results are gone and the weights have changed. You may notice a merged document's weight still looks low: that is because the weights of other documents may have grown as well, so Collections can still move down the list, but in the end this is not a big problem.
insert image description here

8. Search engine - migrating to Spring Boot - creating the project

We now convert the current Servlet version into a Spring Boot version:

Create a Spring Boot project using the online initializer: website address
insert image description here


8.1 Search engine - migrating to Spring Boot - copying the code into the new project

  1. Copy the contents of the pom.xml file: only the ansj and jackson dependencies are needed (a sketch follows after this list)
  2. Copy the code, paying attention to the package paths: copy everything under the searcher package directly
    insert image description here
  3. Copy the front-end static resources: note that they go into the static directory under resources:

insert image description here
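
For reference, the two dependencies look roughly like this in pom.xml (the version numbers below are examples; keep whichever versions the old project already used):

<!-- ansj word segmentation (version is an example) -->
<dependency>
    <groupId>org.ansj</groupId>
    <artifactId>ansj_seg</artifactId>
    <version>5.1.6</version>
</dependency>
<!-- jackson for JSON serialization (version is an example) -->
<dependency>
    <groupId>com.fasterxml.jackson.core</groupId>
    <artifactId>jackson-databind</artifactId>
    <version>2.14.1</version>
</dependency>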


8.2 Search engine - migrating to Spring Boot - implementing the controller layer

package com.one.JavaDocSearchEngine.controller;

import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.one.JavaDocSearchEngine.searcher.DocSearcher;
import com.one.JavaDocSearchEngine.searcher.Result;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;

import java.util.List;

@RestController
public class DocSearcherController {
    private static DocSearcher searcher = new DocSearcher();
    private ObjectMapper objectMapper = new ObjectMapper();

    @RequestMapping(value = "/search", produces = "application/json;charset=utf-8")
    public String search(@RequestParam("query") String query) throws JsonProcessingException {
        // The parameter is the query word; the return value is the response body.
        // query comes from the query string of the request URL.
        List<Result> results = searcher.search(query);
        return objectMapper.writeValueAsString(results);
    }
}

Just run it directly:
insert image description here


8.3 Search engine - migrating to Spring Boot - adding a path switch

Our project will be deployed to a server, but the index files and the stop-word list are currently referenced by local paths hard-coded in the code, and those files do not exist on the server. So we upload the files to a directory created on the server:
insert image description here
The server path:
insert image description here
We add a path switch so the program knows where it is running; we only need to flip the switch on or off.
insert image description here
The switch for the index file path:
insert image description here
The switch for the stop-word file path:

The principle is very simple: check the flag and pick the path accordingly, as sketched below.
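
A minimal sketch of such a switch (the isOnline flag name and the server-side paths here are illustrative; substitute the real paths from the screenshots above):

public class DocSearcher {
    // true when running on the server, false when running locally (illustrative flag)
    private static boolean isOnline = true;

    private static String INDEX_PATH;       // directory holding the index files
    private static String STOP_WORD_PATH;   // stop-word list

    static {
        if (isOnline) {
            // assumed server-side directory
            INDEX_PATH = "/root/doc_searcher_index/";
            STOP_WORD_PATH = "/root/doc_searcher_index/stop_word.txt";
        } else {
            // local development paths, as used earlier in this article
            INDEX_PATH = "D:\\gitee\\doc_searcher_index\\";
            STOP_WORD_PATH = "D:\\gitee\\doc_searcher_index\\stop_word.txt";
        }
    }
    // ...
}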


8.4 Search engine - migrating to Spring Boot - deploying to a server

First of all, you need a server: Huawei Cloud or Tencent Cloud are both fine.

Things to watch out for on the server: open the port both in the server's system firewall and in the cloud console's inbound rules.

If you are not familiar with this, installing the BT (Pagoda) panel makes things more convenient:
insert image description here
then:
insert image description here
Mine runs on port 8084; if there is a port conflict, you can change it
insert image description here
Then package the project into a jar:
insert image description here

Run it, replacing projectName with your own project name:

nohup java -jar projectName.jar &

This command keeps the program running after you log out (its output goes to nohup.out by default).

After that, you can access your project through your elastic IP and the port.

Origin blog.csdn.net/qq_46874327/article/details/128314437