2020/03/18 full-text retrieval technology

1. Full-text search

1.1 Classification of data

Structured data:

Data with a fixed format and length, e.g. MySQL rows, where each field's type and size are fixed.

Unstructured data:

Data with no fixed format or length, e.g. text documents.

Searching unstructured data is the problem full-text retrieval addresses.

1.2 Comparison of ordinary retrieval and full-text retrieval

| | Ordinary retrieval (MySQL, CRUD) | Full-text retrieval (search) |
| --- | --- | --- |
| Data types | Structured data | Structured and unstructured data |
| Process | Build an index, then query by id | Build an inverted index, then query through it |
| Query speed | Sometimes fast, sometimes slow | Consistently fast |
| Result scope | Limited | Wide |
| Transactions | Supported | Not supported |
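To illustrate the inverted index mentioned above: instead of scanning every record, the terms themselves become keys, each mapping to the ids of the documents that contain it. A minimal sketch in plain Java (the documents and the class name are made up for illustration):

```java
import java.util.*;

public class InvertedIndexDemo {

    // Build a toy inverted index: term -> sorted set of document ids.
    static Map<String, TreeSet<Integer>> build(List<String> docs) {
        Map<String, TreeSet<Integer>> index = new HashMap<>();
        for (int id = 0; id < docs.size(); id++) {
            for (String term : docs.get(id).toLowerCase().split("\\s+")) {
                index.computeIfAbsent(term, k -> new TreeSet<>()).add(id);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "java engineer wanted",
                "senior java developer",
                "frontend developer");
        Map<String, TreeSet<Integer>> index = build(docs);
        // A term lookup is a single map access, independent of corpus size,
        // which is why full-text retrieval stays consistently fast.
        System.out.println(index.get("java"));      // [0, 1]
        System.out.println(index.get("developer")); // [1, 2]
    }
}
```

Lucene's real inverted index additionally stores positions and frequencies per term, but the lookup principle is the same.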

1.3. Full-text retrieval scenarios

(1) In-site search

For example: job-recruitment sites such as Boss Zhipin.

(2) Vertical search

For example, a video search that can find results from Tencent Video as well as Sohu Video.

(3) Search engines

For example, Baidu and Google.

2. lucene (overview)

lucene: the foundation underlying all popular full-text search frameworks; a library shipped as a jar that implements full-text retrieval. Official site: https://lucene.apache.org/

solr: a framework that wraps the lucene jar library.

elastic search: also a framework wrapping the lucene library; more powerful, more specialized, and simpler to use than solr.

2.1 Full-text retrieval with lucene

(1) Create an empty project named: full-text-searching

(2) Create a new module: lucene

(3) Add the pom dependencies:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.shenyian.demo</groupId>
    <artifactId>lucene</artifactId>
    <version>1.0-SNAPSHOT</version>
    <!-- version lock -->
    <parent>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-parent</artifactId>
        <version>2.1.7.RELEASE</version>
        <relativePath/> <!-- lookup parent from repository -->
    </parent>
    <dependencies>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter</artifactId>
        </dependency>
        <!-- lucene dependency -->
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-core</artifactId>
            <version>4.10.3</version>
        </dependency>
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-analyzers-common</artifactId>
            <version>4.10.3</version>
        </dependency>
        <!-- mybatis-plus starter dependency -->
        <dependency>
            <groupId>com.baomidou</groupId>
            <artifactId>mybatis-plus-boot-starter</artifactId>
            <version>2.3</version>
        </dependency>
        <!-- mysql dependency -->
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
        </dependency>
        <!-- lombok -->
        <dependency>
            <groupId>org.projectlombok</groupId>
            <artifactId>lombok</artifactId>
        </dependency>
        <!-- unit-test starter dependency -->
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-test</artifactId>
        </dependency>
        <!-- IK analyzer -->
        <dependency>
            <groupId>com.janeluo</groupId>
            <artifactId>ikanalyzer</artifactId>
            <version>2012_u6</version>
        </dependency>

    </dependencies>

</project>

(The SQL script is in the code folder.)

(4) Edit the configuration file application.yml:

spring: 
  datasource:
    driver-class-name: com.mysql.jdbc.Driver
    url: jdbc:mysql://192.168.176.109:3306/elastic_search?useUnicode=true&characterEncoding=UTF8&useSSL=false&allowMultiQueries=true&serverTimezone=Asia/Shanghai
    username: root
    password: ****

(5) Create the startup class and add the @MapperScan annotation:

package com.shenyian;

import org.mybatis.spring.annotation.MapperScan;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;

@SpringBootApplication
@MapperScan("com.shenyian.mapper")
public class LuceneApplication {

    public static void main(String[] args) {
        SpringApplication.run(LuceneApplication.class, args);
    }
}
(6) Create the entity class JobInfo:
package com.shenyian.domain;

import com.baomidou.mybatisplus.annotations.TableId;
import com.baomidou.mybatisplus.annotations.TableName;
import lombok.Data;

@Data
@TableName("job_info")
public class JobInfo {
    @TableId
    private Long id;
    // company name
    private String companyName;
    // job title
    private String jobName;
    // minimum of the salary range
    private Integer salaryMin;
    // job-posting detail page
    private String url;
}
(7) Create the mapper:
package com.shenyian.mapper;

import com.baomidou.mybatisplus.mapper.BaseMapper;
import com.shenyian.domain.JobInfo;

public interface JobInfoMapper extends BaseMapper<JobInfo> {
}
(8) Unit test class.

Create the index library and add documents:

    @Test
    public void test() throws Exception {
        // jobInfoMapper is injected into the test class via @Autowired
        List<JobInfo> jobInfos = jobInfoMapper.selectList(null);

        //Directory d, IndexWriterConfig conf
        Directory directory = FSDirectory.open(new File("H:\\lucene\\index")); // where the index library is stored
        //Version matchVersion, Analyzer analyzer
        Analyzer analyzer = new IKAnalyzer(); // Chinese analyzer
        //Analyzer analyzer = new StandardAnalyzer(); // standard analyzer: handles English, not Chinese
        //Analyzer analyzer = new CJKAnalyzer(); // CJK analyzer: tokenization is inaccurate
        IndexWriterConfig indexWriterConfig = new IndexWriterConfig(Version.LATEST, analyzer);
        IndexWriter indexWriter = new IndexWriter(directory, indexWriterConfig); // tool for building the index library
        for (JobInfo jobInfo : jobInfos) {
            Document document = new Document();
            document.add(new TextField("companyName", jobInfo.getCompanyName(), Field.Store.YES));
            document.add(new TextField("jobName", jobInfo.getJobName(), Field.Store.YES));
            document.add(new DoubleField("salaryMin", jobInfo.getSalaryMin(), Field.Store.YES));
            document.add(new StringField("url", jobInfo.getUrl(), Field.Store.YES));
            indexWriter.addDocument(document); // add the document
        }
        indexWriter.close(); // close the IO resources
    }

View the index library with the Luke tool: select the folder containing the index and click OK to browse it.

(9) Query the index:
    @Test
    public void search() throws Exception {
        IndexReader indexReader = DirectoryReader.open(FSDirectory.open(new File("H:\\lucene\\index"))); // reads the index library
        IndexSearcher indexSearcher = new IndexSearcher(indexReader); // used for retrieval
        TopDocs topDocs = indexSearcher.search(new TermQuery(new Term("jobName", "java")), 10); // term query, at most 10 hits returned
        int totalHits = topDocs.totalHits;
        System.out.println("Number of matching documents: " + totalHits);
        ScoreDoc[] scoreDocs = topDocs.scoreDocs; // document ids found via the inverted index
        for (ScoreDoc scoreDoc : scoreDocs) {
            int doc = scoreDoc.doc; // the document id
            Document document = indexSearcher.doc(doc); // fetch the document by id
            System.out.println(document.get("companyName"));
            System.out.println(document.get("jobName"));
            System.out.println(document.get("salaryMin")); // must match the field name used at index time
            System.out.println(document.get("url"));
            System.out.println("=====================================");
        }
    }

2.2. IK analyzer extension words and stop words

Create the file IKAnalyzer.cfg.xml in resources.

Then also create ext.dic and stopword.dic in resources:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- configure your own extension dictionary here -->
    <entry key="ext_dict">ext.dic</entry>
    <!-- configure your own stop-word dictionary here -->
    <entry key="ext_stopwords">stopword.dic</entry>

</properties>

(1) Add the extension-word and stop-word configuration files.

(2) Add extension words to ext.dic.

(3) Rebuild the index library; the previous index can be cleared first.

(4) After changing the configuration, reload the documents (delete the old index before loading).

(5) The extension words can now be found in queries, while the stop words are filtered out.
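For reference, ext.dic is a plain UTF-8 text file with one custom word per line; the entry below is an illustrative assumption, not from the original:

```
好学生
```

stopword.dic follows the same one-word-per-line format (e.g. common particles such as 我, 是, 的).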

2.3. Paid ranking (key code below)

By default, results are sorted by match score; when two documents match equally well, they are ordered by id.

Paid ranking takes priority over the match score. It is implemented by boosting a field: textField.setBoost(10000); // boost score

    @Test
    public void test() throws Exception {

        List<JobInfo> jobInfos = jobInfoMapper.selectList(null);

        //Directory d, IndexWriterConfig conf
        Directory directory = FSDirectory.open(new File("H:\\lucene\\index")); // where the index library is stored
        //Version matchVersion, Analyzer analyzer
        Analyzer analyzer = new IKAnalyzer(); // Chinese analyzer
        //Analyzer analyzer = new StandardAnalyzer(); // standard analyzer: handles English, not Chinese
        //Analyzer analyzer = new CJKAnalyzer(); // CJK analyzer: tokenization is inaccurate
        IndexWriterConfig indexWriterConfig = new IndexWriterConfig(Version.LATEST, analyzer);
        IndexWriter indexWriter = new IndexWriter(directory, indexWriterConfig); // tool for building the index library
        indexWriter.deleteAll(); // clear the previous index library
        for (JobInfo jobInfo : jobInfos) {
            Document document = new Document();
            document.add(new TextField("companyName", jobInfo.getCompanyName(), Field.Store.YES));
            document.add(new TextField("jobName", jobInfo.getJobName(), Field.Store.YES));
            document.add(new DoubleField("salaryMin", jobInfo.getSalaryMin(), Field.Store.YES));
            document.add(new StringField("url", jobInfo.getUrl(), Field.Store.YES));
            indexWriter.addDocument(document); // add the document
        }
        // separately add a company that paid for ranking
        Document document = new Document();
        TextField textField = new TextField("companyName", "给钱的随便写的给了排名第一的有限公司", Field.Store.YES);
        textField.setBoost(10000); // boost score
        document.add(textField);
        document.add(new TextField("jobName", "java", Field.Store.YES));
        document.add(new DoubleField("salaryMin", 30000, Field.Store.YES));
        document.add(new StringField("url", "www.suibian.com", Field.Store.YES));
        indexWriter.addDocument(document);
        indexWriter.close(); // close the IO resources
    }

3. elastic search

es exposes two ports: 9200 (HTTP, can be accessed from a browser) and 9300 (TCP, used by es cluster nodes to talk to each other).

Installation order: Node.js → es 6.2.4 → Kibana 6.2.4 (visualization tool) → the Chrome es plugin.

1. Install Node.js:

(1) Click Next all the way through the installer.

(2) Verify in a cmd window: if the version number is printed, the installation succeeded.

2. Install es and kibana:

(1) Unzip the es and kibana archives.

(2) In the config directory of the unzipped es, make the following changes:

1) elasticsearch.yml: the data storage path and the log path

2) jvm.options: the memory the JVM occupies at startup

(3) Put the IK analyzer plugin into es's plugins folder; if it is already there, go straight to step (4).

(4) Start es.

(5) After startup, of the two ports, 9200 can be accessed from a browser and 9300 serves the cluster.

(6) Verification: if the browser shows es's status information, es is installed successfully.

(7) If es reports an error at startup, read the error message.

(8) Find the bin directory of the kibana installation and start kibana.

(9) The console prints a startup message once kibana is up.

(10) Open the kibana page in a browser at http://localhost:5601 and click Dev Tools.

(11) Install the Chrome es plugin:

1) Open the extensions page: Chrome → More tools → Extensions.

2) Unzip the plugin file.

3) If the unpacked extension cannot be loaded, turn on Developer mode first.

4) Add the extension.

5) Check the plugin in Chrome.

3.1. Verify that the IK analyzer is working

GET /_analyze
{
  "text": "我是一个好学生",
  "analyzer": "ik_smart"
}
Or:
GET /_analyze
{
  "text": "我是一个好学生",
  "analyzer": "ik_max_word" // recommended
}

kibana supports the RESTful style:

PUT: usually creates an index library or type

POST: usually adds or modifies documents

GET: retrieves data

DELETE: deletes data

3.2. Operating on the index library

Create an index library: PUT /shenyian

Query an index library: GET /shenyian

Delete an index library: DELETE /shenyian


3.3. Creating a type in an index library (not recommended)

PUT /shenyian
PUT /shenyian/_mapping/goods  // create a type inside the index library
{
  "properties": { // fixed syntax
     "goodsName":{ // a field of the type
       "type": "text", // the field's field type
       "index": true, // whether the field is indexed (searchable)
       "store": true, // whether the value is stored in the document
       "analyzer": "ik_max_word" // which analyzer to use
     }
  }
}

3.4. Creating the index library and the type at the same time (recommended)

PUT /shenyian
{
  "mappings": {
    "goods":{
      "properties": {
         "goodsName":{
          "type": "text",
          "index": true,
          "store": true,
          "analyzer": "ik_max_word"
        },
        "price":{
          "type": "double", //double的field类型
          "index": true,
          "store": true
        },
        "image":{
          "type": "keyword", //和lucene的stringField一样,保存字符串,但是不分词
          "index": true,
          "store": true
        }
      }
    }
  }
}

Creating a template (overview)

PUT /shenyian2
{
  "mappings": {
    "goods":{           
      "properties": {
        "goodsName":{ 
          "type": "text",  
          "index": true,
          "store": true,  
          "analyzer": "ik_max_word" 
        } 
    },  
    "dynamic_templates":[
        {
          "myStringTemplate":{ //自定义的模板名称
            "match_mapping_type": "string", //匹配到的字段类型
            "mapping":{
               "type": "text",//如果匹配的是字符串,那么自动textfiled类型
               "analyzer": "ik_max_word" //默认的ik_max_word分词器
            }
          }
        }
    ]
  }
 } 
}

3.5. Operating on documents

Adding documents:

POST /shenyian/goods  
{
  "goodsName": "小米9手机",
  "price": 2999,
  "image": "www.xiaomi9.com/9.jpg"
}
Or:
POST /shenyian/goods/1 // if you supply an id yourself, es uses the id you give
{
  "goodsName": "小米9手机",
  "price": 2999,
  "image": "www.xiaomi9.com/9.jpg"
}

Modify the document by id:

POST /shenyian/goods/7uHXmXAB2jTsz9zVCTTF // modify via the auto-generated id
{
  "goodsName": "小米9pro手机",
  "price": 3999,
  "image": "www.xiaomi9.com/9.jpg"
}

Query by id:

GET /shenyian/goods/7uHXmXAB2jTsz9zVCTTF

Delete by id:

DELETE /shenyian/goods/7uHXmXAB2jTsz9zVCTTF

3.7. Common queries (important)

Data preparation:

PUT /shenyian
{
  "mappings": {
    "goods":{
      "properties": {
        "goodsName":{
          "type": "text",
          "index": true,
          "store": true,
          "analyzer": "ik_max_word"
        },
        "price":{
          "type": "double",
          "index": true,
          "store": true
        },
        "image":{
          "type": "keyword",
          "store": true
        }
      }
    }
  }
}
POST /shenyian/goods/1 
{
  "goodsName": "小米9 手机",
  "price": 2999,
  "image":"www.xiaomi.9.jpg"
}
POST /shenyian/goods/2
{
  "goodsName": "华为 p30 手机",
  "price": 2999,
  "image":"www.huawei.p30.jpg"
}
POST /shenyian/goods/3
{
  "goodsName": "华为 p30 plus",
  "price": 3999,
  "image":"www.huawei.p30plus.jpg"
}
POST /shenyian/goods/4
{
  "goodsName": "苹果 iphone 11 手机",
  "price": 5999,
  "image":"www.iphone.11.jpg"
}
POST /shenyian/goods/5
{
  "goodsName": "苹果 iphone xs",
  "price": 6999,
  "image":"www.iphone.xs.jpg"
}
POST /shenyian/goods/6
{
  "goodsName": "一加7 手机",
  "price": 3999,
  "image":"www.yijia.7.jpg"
}
(1) Query all documents
POST /shenyian/goods/_search   // when not querying by id, the fixed _search suffix must be added
{
  "query": { // fixed syntax as well
    "match_all": {}
  }
}
(2) term query

(Queries the inverted index by an exact term.)

POST /shenyian/goods/_search
{
  "query": {
    "term": {
      "goodsName": "手机"
    }
  }
}
(3) match query

(Tokenizes the query string, runs a term query for each token, and merges the result sets.)

POST /shenyian/goods/_search
{
  "query": {
    "match": {
      "goodsName": "手机 小米"
    }
  }
}
(4) range query

Queries by an interval on a field.

POST /shenyian/goods/_search
{
  "query": {
    "range": {
      "price": { //通过price这个字段
        "gte": 2000,  gte:greate than equals
        "lte": 4000   lte:less than equals
      }
    }
  }
}
(5) fuzzy query

(A fault-tolerant query: allows misspelled characters, at most two.)

POST /shenyian/goods/_search
{
  "query": {
    "fuzzy": { 容错查询关键字
      "goodsName": {
        "value": "iphoww",
        "fuzziness": 2 容错率,最多是2
      }
    }
  }
}
(6) bool query

(A combination query: combines the query types above.)

POST /shenyian/goods/_search
{
  "query": {
    
    "bool": {
      "must": [    //下面的match查询的结构和range查询的结果的交集
        {
          "match": {      
            "goodsName": "手机 小米"
          }
        },
        
        {
          "range": {
            "price": {
              "gte": 2000,
              "lte": 4000
            }
          }
        }
      ]
    }
  }
}
POST /shenyian/goods/_search
{
  "query": {
    
    "bool": {
      "should": [ //下面的match查询的结构和range查询的结果的并集
        {
          "match": {
            "goodsName": "手机 小米"
          }
        },
        
        {
          "range": {
            "price": {
              "gte": 2000,
              "lte": 4000
            }
          }
        }
      ]
    }
  }
}
POST /shenyian/goods/_search
{
  "query": {
    
    "bool": {     must中查询出来的结果然后排除must_not中的结果
      "must": [   
        {
          "match": {
            "goodsName": "手机 小米"
          }
        },
        
        {
          "range": {
            "price": {
              "gte": 2000,
              "lte": 4000
            }
          }
        }
      ],
      "must_not": [
        {
          "term": {
            "goodsName": "华为"
          }
        }
      ]
    }
  }
}

Summary of the keywords:

must: the intersection of the combined term, match, range, fuzzy, etc. query results (match && range)

should: the union of the combined term, match, range, etc. query results (match || range)

must_not: from the results of must (or should), remove the documents matched by the must_not query

In general, must is combined with must_not, or should is combined with must_not.

filter: plays the same role as must, but as a pure filter condition (it does not affect relevance scoring).
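As a sketch of the filter keyword just mentioned (reusing the data prepared above): the range clause below restricts the results exactly like a must clause would, but because it runs as a filter it does not contribute to the relevance score.

```
POST /shenyian/goods/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "goodsName": "手机" } }
      ],
      "filter": [
        { "range": { "price": { "gte": 2000, "lte": 4000 } } }
      ]
    }
  }
}
```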

3.8. Filtering the returned fields

Filters which fields are displayed; fields you do not want to show can be excluded:

includes:

POST /shenyian/goods/_search
{
  "query": {
    "match": {
      "goodsName": "华为"
    }
  },
  "_source": {
    "includes": ["goodsName","price"]
  }
}

excludes:

POST /shenyian/goods/_search
{
  "query": {
    "match": {
      "goodsName": "华为"
    }
  },
  "_source": {
    "excludes": ["image"] 不想显示的字段
  }
}

3.9. Sorting and paging

POST /shenyian/goods/_search
{
  "query": {
    "match": {
      "goodsName": "手机"
    }
  },
  "sort": [   排序
    {
      "price": {
        "order": "desc"
      }
    }
  ],
  "from": 0,  分页
  "size": 2
}

3.10. Highlighting (the searched keyword changes color)

POST /shenyian/goods/_search
{
  "query": {
    "match": {
      "goodsName": "手机"
    }
  },
  "highlight": {
    "fields": {
      "goodsName": {}  //需要高亮的字段和上面的查询字段要一致
    },
    "pre_tags": "<font color=red>",  //前置html标签
    "post_tags": "</font>"          //闭合html标签
  }
}

3.11. Aggregation (grouping)

The field being aggregated on must be of type: keyword

| elastic search | mysql |
| --- | --- |
| Aggregation (grouping): bucket | GROUP BY |
| Metrics (avg, max, min, count(*)) computed after grouping | Aggregate functions |

MySQL aggregation (for comparison):
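The grouping that the es aggregation below performs corresponds to a MySQL query along these lines (assuming a car table with color and price columns; the table and query are a hypothetical sketch):

```sql
-- average price per color: the GROUP BY is the "bucket", AVG() is the metric
SELECT color, AVG(price) AS my_avg
FROM car
GROUP BY color
LIMIT 10;
```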

Prepare the data:

PUT /car
{
  "mappings": {
    "orders": {
      "properties": {
        "color": {
          "type": "keyword"
        },
        "make": {
          "type": "keyword"
        }
      }
    }
  }
}
POST /car/orders/_bulk
{ "index": {}}
{ "price" : 10000, "color" : "红", "make" : "本田", "sold" : "2014-10-28" }
{ "index": {}}
{ "price" : 20000, "color" : "红", "make" : "本田", "sold" : "2014-11-05" }
{ "index": {}}
{ "price" : 30000, "color" : "绿", "make" : "福特", "sold" : "2014-05-18" }
{ "index": {}}
{ "price" : 15000, "color" : "蓝", "make" : "丰田", "sold" : "2014-07-02" }
{ "index": {}}
{ "price" : 12000, "color" : "绿", "make" : "丰田", "sold" : "2014-08-19" }
{ "index": {}}
{ "price" : 20000, "color" : "红", "make" : "本田", "sold" : "2014-11-05" }
{ "index": {}}
{ "price" : 80000, "color" : "红", "make" : "宝马", "sold" : "2014-01-01" }
{ "index": {}}
{ "price" : 25000, "color" : "蓝", "make" : "福特", "sold" : "2014-02-12" }
GET /car/orders/_search
{
  "from": 0,
  "size": 0, // do not return search hits; show only the aggregation
  "aggs": {
    "my_aggs_color": { // a name for the aggregation
      "terms": { // fixed syntax
        "field": "color", // the field to group by
        "size": 10 // at most this many groups returned
      },
      "aggs": {
        "my_avg": { // a name for the metric
          "avg": {  // which metric to compute: avg, max, min
            "field": "price" // the field to compute over
          }
        }
      }
    }
  }
}


(Thanks to Wang Hao for inspiring this article; I particularly admire the great master, Brother Hao.)

Origin www.cnblogs.com/ShenYian/p/12519814.html