What is full-text search?

Full-text search technology is widely used in search engines, on-site search, and similar fields. Most of the search services we use on the web are built on full-text search.

Full-text search is a good fit for large volumes of data whose structure is not fixed, for example search engines such as Baidu and Google, forum search, and on-site search for e-commerce websites.

So what exactly is full-text search? Baidu Encyclopedia gives a technical definition.

To understand it better, let's look at a simple example.

Case

Implement a file search feature that finds files by keyword: any file whose name or content contains the keyword must be returned. Queries in Chinese must also work, and queries that combine multiple conditions need to be supported.

The original content in this case is a set of sample files on disk.

If the data lived in a database, searching would be easy: we would usually query it with an SQL statement and get results quickly.

Why is a database easy to search?

Because the data stored in a database is regular: it is organized into rows and columns, and each field has a defined format and a fixed length.
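To make this concrete, here is a minimal Python sketch of a structured query using the standard `sqlite3` module. The table name, column names, and rows are made up for illustration and are not part of the case above.

```python
import sqlite3

# In-memory database; the table, columns, and rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (name TEXT, size_kb INTEGER)")
conn.executemany(
    "INSERT INTO files (name, size_kb) VALUES (?, ?)",
    [("Lucene.txt", 4), ("flink.txt", 1), ("notes.txt", 12)],
)


def find_files_larger_than(min_size_kb):
    """Return the names of files above a size threshold, sorted by name."""
    rows = conn.execute(
        "SELECT name FROM files WHERE size_kb > ? ORDER BY name",
        (min_size_kb,),
    )
    return [name for (name,) in rows]


print(find_files_larger_than(2))  # ['Lucene.txt', 'notes.txt']
```

Because every row has the same columns, the condition can be stated in one line of SQL. Nothing comparable exists for the free text inside a file.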

However, the data we deal with generally falls into two types: structured data and unstructured data.

Structured data: data with a fixed format or a finite length, such as database records and metadata.

Unstructured data: data with variable length or no fixed format, such as e-mail and Word documents stored on disk.

Structured data can be queried with SQL, so what about unstructured data?

Query methods for unstructured data

There are two ways to query unstructured data:

(1) Sequential scanning (Serial Scanning)

Sequential scanning means that, to find the files containing a given string, we go through the documents one at a time and read each one from start to finish. If a document contains the string, it is one of the files we are looking for; then we move on to the next file, until every document has been scanned. Windows search can search file contents this way, but it is rather slow.
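Sequential scanning can be sketched in a few lines of Python. This is only an illustration: the function name `serial_scan` and the restriction to `.txt` files are assumptions made here, not something prescribed by the case.

```python
from pathlib import Path


def serial_scan(directory, keyword):
    """Return names of .txt files whose name or content contains keyword.

    Every file is read from start to finish on every search, so the cost
    of one query grows with the total size of the collection.
    """
    needle = keyword.lower()
    matches = []
    for path in Path(directory).glob("*.txt"):
        text = path.read_text(encoding="utf-8", errors="ignore")
        if needle in path.name.lower() or needle in text.lower():
            matches.append(path.name)
    return sorted(matches)
```

A call such as `serial_scan("D:/docs", "java")` (the directory is hypothetical) rereads every file each time it runs, which is exactly why this method is slow on large collections.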

(2) Full-text search (Full-text Search)

Part of the information in the unstructured data is extracted and reorganized so that it has some structure, and the search then runs against this structured data, which makes it comparatively fast. This extracted and reorganized information is what we call the index.

Take a dictionary as an example. The pinyin table and the radical lookup table are the dictionary's index. The explanation of each character is unstructured, so without those tables we could only find a character by scanning the whole dictionary in order. But some information about each character can be extracted and given structure. Pronunciation, for instance, is fairly structured: there are only a limited number of initials and finals, so they can be enumerated. The pronunciations are listed in a fixed order, and each one points to the page that holds the detailed explanation of the character. We search the structured pinyin table, follow it to the page number, and there we find the unstructured data, that is, the explanation of the character.

This process of building an index first and then searching the index is called full-text search (Full-text Search).

Building the index is time-consuming, but once built it can be used again and again. Since full-text search mostly serves queries, the time spent building the index is worth it.

So how do we implement full-text search?

Lucene

When it comes to full-text search, one technology that has to be mentioned is Lucene. Lucene is an open source full-text search engine toolkit under Apache. It provides a complete query engine and indexing engine, plus part of a text analysis engine. The well-known full-text search engines Solr and ES are both built on Lucene.

Working with Lucene involves two processes:

1. The indexing process builds an index over the original content to be searched and stores it in the index library. It includes:

Determine the original content to search -> Collect documents -> Create documents -> Analyze documents -> Index documents

2. The search process retrieves content from the index library. It includes:

User submits a query through the search interface -> Create query -> Execute search against the index library -> Render search results

Creating an index

Put simply, creating an index means processing the documents: the content that users will search is indexed, and the resulting index is stored in the index library (index).

For example, take the sample documents from the case above.

We analyze each document to find all the words it contains, and then map each word to the names of the documents that contain it.

(Analysis includes splitting the original document into individual words and removing stop words. This process is called segmentation, or tokenization.)
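The analysis step can be sketched as follows. This is a simplified stand-in written for this article, not Lucene's analyzer: the stop-word list is a tiny made-up sample, and the regular expression only handles ASCII words, so Chinese text would need a dedicated segmenter.

```python
import re

# A tiny illustrative stop-word list; real analyzers ship much longer ones.
STOP_WORDS = {"a", "an", "and", "be", "but", "can", "is", "not", "that", "the", "to"}


def analyze(text):
    """Lowercase the text, split it into words, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


print(analyze("Lucene is a Java full-text search engine."))
# ['lucene', 'java', 'full', 'text', 'search', 'engine']
```

Lowercasing at analysis time is what lets a query for `java` match a document that says `Java`.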

Let's analyze one of the documents, Lucene.txt.

The original content of the document:

Lucene is a Java full-text search engine.  Lucene is not a complete
application, but rather a code library and API that can easily be used
to add search capabilities to applications.

After analysis we get the following terms:

lucene, java, full, search, engine, ...

Another document, flink.txt, contains just a few words:

java flink kafka

From it we get these terms:

java, flink, kafka

This gives us a mapping between terms and documents: lucene, full, and search are in Lucene.txt; flink is not in Lucene.txt but is in flink.txt; and java is in both Lucene.txt and flink.txt.

So when we look up the word lucene, we find it in Lucene.txt, and when we look up java, we learn that it is in both files.
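The mapping from terms to documents can be built with a dictionary of sets. In this sketch the document contents are shortened stand-ins for the two sample files, and `build_inverted_index` is a name chosen here for illustration.

```python
import re
from collections import defaultdict

# Shortened stand-ins for the two sample files.
DOCUMENTS = {
    "Lucene.txt": "Lucene is a Java full-text search engine.",
    "flink.txt": "java flink kafka",
}


def build_inverted_index(documents):
    """Map each lowercase term to the set of document names containing it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            index[term].add(name)
    return index


index = build_inverted_index(DOCUMENTS)
print(sorted(index["lucene"]))  # ['Lucene.txt']
print(sorted(index["java"]))    # ['Lucene.txt', 'flink.txt']
```

The index is built once by reading every document, but each lookup afterwards is a single dictionary access and no file has to be opened.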

The index we have created is keyed by term: we look up a term to find the documents that contain it. This structure is called an inverted index.

The traditional approach starts from the files and matches the search keyword against the content of each one. That is the sequential scanning method, and with a large amount of data it is slow.

An inverted index works the other way round: it finds documents from their content (the terms).

An inverted index is also called a reverse index. It consists of two parts, the index and the documents. The index is the vocabulary, which is small, while the document collection is large.

Where there is an inverted index there is, correspondingly, a forward index. Searching a forward index is effectively a sequential scan of all the files, so it is inherently inefficient.
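The relationship between the two can be sketched in Python. The term lists are illustrative, and `search_forward` and `invert` are hypothetical helper names.

```python
# Forward index: document name -> terms it contains (illustrative data).
FORWARD_INDEX = {
    "Lucene.txt": ["lucene", "java", "full", "search", "engine"],
    "flink.txt": ["java", "flink", "kafka"],
}


def search_forward(forward_index, term):
    """Check every document's term list, so cost grows with the collection."""
    return sorted(doc for doc, terms in forward_index.items() if term in terms)


def invert(forward_index):
    """Turn document -> terms into term -> sorted document names."""
    inverted = {}
    for doc, terms in forward_index.items():
        for term in terms:
            inverted.setdefault(term, set()).add(doc)
    return {term: sorted(docs) for term, docs in inverted.items()}


inverted = invert(FORWARD_INDEX)
print(search_forward(FORWARD_INDEX, "java"))  # ['Lucene.txt', 'flink.txt']
print(inverted["java"])                       # ['Lucene.txt', 'flink.txt']
```

Both lookups return the same answer. The difference is that the forward search visits every document, while the inverted lookup goes straight to the entry for the term.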

Querying the index

Querying the index is the search process itself. The user enters a keyword and the search runs against the index (index): the keyword is looked up in the index, and the index leads to the corresponding documents, which hold the content being searched for (here, the files on disk).

In our case the query is complete once we have looked up the index and found where the document is. Other scenarios can present the results more flexibly. When we search on Baidu, for example, the results are shown as a list of related web pages.
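The case also asks for queries with multiple conditions. One simple way to support them on top of an inverted index is to combine the document sets of the individual terms. The postings below are illustrative, and `search_all` and `search_any` are names assumed for this sketch.

```python
# Term -> names of documents containing it (illustrative postings).
INDEX = {
    "lucene": {"Lucene.txt"},
    "search": {"Lucene.txt"},
    "java": {"Lucene.txt", "flink.txt"},
    "flink": {"flink.txt"},
}


def search_all(index, terms):
    """AND query: documents that contain every one of the terms."""
    postings = [index.get(term.lower(), set()) for term in terms]
    if not postings:
        return []
    return sorted(set.intersection(*postings))


def search_any(index, terms):
    """OR query: documents that contain at least one of the terms."""
    result = set()
    for term in terms:
        result |= index.get(term.lower(), set())
    return sorted(result)


print(search_all(INDEX, ["java", "flink"]))    # ['flink.txt']
print(search_any(INDEX, ["lucene", "flink"]))  # ['Lucene.txt', 'flink.txt']
```

A term that is missing from the index contributes an empty set, so an AND query that includes it returns no documents.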

Developing your own full-text search

Writing the indexing and index querying logic by hand takes a lot of work. Fortunately Lucene has already done much of it for us, and we only need to call its Java API.

However, Lucene's API is low-level and not easy to use, and it lacks enterprise-grade tools for monitoring and management. Enterprise-grade full-text search engines emerged to fill that gap, and the two most popular are Solr and ES. Both are built on top of Lucene.

Solr

Solr is an open source enterprise search platform from the Apache Lucene project. Solr is highly scalable and provides distributed search and index replication.

Solr is written in Java and runs in a Servlet container as a standalone full-text search server. Its powerful API and external configuration mean that it can be tuned for many types of applications without any coding.

In 2010 the Apache Lucene and Apache Solr projects merged, so Lucene/Solr became a single Apache project.

Solr's advantages are therefore:

It has a mature developer community; it is relatively stable; and it supports indexing content in multiple formats.

However, because of its underlying mechanism, Solr's shortcomings are also obvious:

Search efficiency drops while an index is being built, and real-time indexing and search are not very efficient.

ES

ES is Elasticsearch, a distributed real-time search and analytics engine that can be used for full-text search, structured search, and analysis.

Lucene is complex and hard to use. Elasticsearch uses Lucene as its internal engine, but when you use Elasticsearch for search you only need its unified API and do not have to understand the complex principles behind Lucene.

Elasticsearch also does more than full-text search. It has enterprise-grade capabilities such as:

  • Distributed real-time file storage;
  • A distributed real-time analytics search engine;

Elasticsearch's RESTful API is friendly, simple, and especially easy to use.

Wikipedia, Stack Overflow, GitHub, and others currently use Elasticsearch as their search engine.

A quick look at ES

Here we use ES to put together a simple full-text search.

1. Download

First, download it from the official website: https://www.elastic.co/products/elasticsearch

The download page is: https://www.elastic.co/cn/downloads/elasticsearch

Choose the package for your own system. We choose the Windows version.

We can also download Kibana, a visualization tool that works with ES.

2. Installation and deployment

Unzip the package to the D drive.

Then start ES from the command line:

C:\Users\JN>d:
D:>cd 
D:\elasticsearch-6.4.0>cd bin
D:\elasticsearch-6.4.0\bin>elasticsearch.bat

Start Kibana the same way:

C:\Users\JN>d:
D:>cd kibana-6.4.0-windows-x86_64
D:\kibana-6.4.0-windows-x86_64>cd bin
D:\kibana-6.4.0-windows-x86_64\bin>kibana.bat

After a successful deployment, ES can be accessed at localhost:9200 and Kibana at localhost:5601.

3. Basic usage

Let's try ES out. Open the Dev Tools page in Kibana.

Insert two documents, then search for them.

Calling ES from a programming language works in much the same way and is just as simple.
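As a rough sketch of calling ES from code, the Python below builds the HTTP requests for storing a document and for a match query, using only the standard library. The index name `files`, the field names, and the helper names are assumptions made for illustration. The `_doc` path segment and the `match` query follow the ES REST API, and the code assumes the local 6.4.0 node started above.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # address from the deployment step
INDEX = "files"                   # hypothetical index name


def index_request(doc_id, document):
    """Build a PUT request that stores one JSON document under doc_id."""
    return urllib.request.Request(
        url=f"{ES_URL}/{INDEX}/_doc/{doc_id}",
        data=json.dumps(document).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def search_request(field, text):
    """Build a POST request that runs a match query against one field."""
    body = {"query": {"match": {field: text}}}
    return urllib.request.Request(
        url=f"{ES_URL}/{INDEX}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send a request to a running ES node and decode the JSON reply."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With ES running, `send(index_request(1, {"name": "Lucene.txt", "content": "Lucene is a Java full-text search engine."}))` would store a document and `send(search_request("content", "java"))` would search for it. The functions only build requests until `send` is called, so nothing contacts the server on import.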

Origin www.cnblogs.com/tree1123/p/11711703.html