Getting Started with Elasticsearch Full-Text Search

I. Overview of information retrieval

1. Information overload

According to Baidu Baike, information overload is the situation in which the information a society or an individual receives exceeds what they can process or use effectively, and this leads to breakdown.

Information overload has the following three characteristics:

(1) the speed at which recipients respond to information is far lower than the speed at which information is transmitted;
(2) the amount of information in the mass media is far greater than what the audience can consume, bear, or needs;
(3) large amounts of irrelevant, useless, and redundant data seriously interfere with the audience's ability to accurately select the relevant, useful information.

2. Causes of information overload

With the spread of the Internet, sensors, and all kinds of digital terminal devices, a world in which everything is interconnected is taking shape. Meanwhile, as data grows explosively, digitalization has become a foundational force in building modern society and is driving us toward an era of profound change.

According to the IDC report "Data Age 2025", the amount of data generated worldwide each year will grow from 33 ZB in 2018 to 175 ZB by 2025, equivalent to about 491 EB of data generated per day. So how much is 175 ZB? 1 ZB is about 1.1 trillion GB. If all 175 ZB were stored on DVDs, the stack of discs would reach 23 times the distance from the Earth to the Moon (about 393,000 km at its closest), or circle the Earth 222 times (one circuit is about 40,000 km). At the current average US download speed of 25 Mb/s, one person would need about 1.8 billion years to download all 175 ZB.
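The figures above can be roughly checked with a few lines of arithmetic. This is only a sketch under stated assumptions: the per-day and GB figures appear to use binary prefixes (1 ZB = 1024 EB), while the download time appears to use decimal prefixes (1 ZB = 10^21 bytes). The variable names are illustrative and do not come from the IDC report.

```python
# Back-of-the-envelope check of the quoted figures.
# Assumption: binary prefixes for the EB/day and GB figures,
# decimal prefixes for the download-time figure.
total_zb = 175

# 1 ZB = 1024 EB (binary), spread over 365 days.
eb_per_day = total_zb * 1024 / 365

# 1 ZB = 1024**4 GB (binary).
gb_per_zb = 1024 ** 4

# 25 Mb/s is megabits per second, so convert bytes to bits first.
link_bits_per_second = 25 * 10 ** 6
total_bits = total_zb * 10 ** 21 * 8
seconds = total_bits / link_bits_per_second
years = seconds / (365 * 24 * 3600)

print(f"{eb_per_day:.0f} EB per day")
print(f"{gb_per_zb / 1e12:.2f} trillion GB per ZB")
print(f"{years / 1e9:.1f} billion years to download")
```

Under these assumptions the script gives roughly 491 EB per day, 1.1 trillion GB per ZB, and 1.8 billion years of download time.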

 

 

This rapid growth of data is the cause of data overload: the data era we live in is also an era of data overload.

3. Characteristics of big data (proposed by IBM)

The 5V characteristics of big data: Volume (large scale), Velocity (high speed), Variety (diversity), Value (low value density), and Veracity (authenticity).

One, Volume: the amount of data is huge, including the amounts collected, stored, and computed. Big data is measured starting from at least P (1,000 T), E (1 million T), or Z (1 billion T).

Two, Variety: data comes in many types and from many sources. It includes structured, semi-structured, and unstructured data, such as web logs, audio, video, images, and location information. Handling so many types places higher demands on data processing capabilities.

Three, Value: the value density of data is relatively low, and finding value is like sifting sand for gold: rare but precious. With the widespread use of the Internet and the Internet of Things, information sensing is everywhere and information is abundant, but its value density is low. How to combine business logic with powerful machine algorithms to mine the value of the data is the problem the big data era most needs to solve.

Four, Velocity: data grows quickly and must be processed quickly, with strict timeliness requirements. For example, search engines need news from a few minutes ago to be searchable by users, and personalized recommendation algorithms need to complete recommendations in as close to real time as possible. This is a notable feature that distinguishes big data from traditional data mining.

Five, Veracity: the accuracy and trustworthiness of the data, that is, data quality.

4. Definition of information retrieval

As the volume of information resources explodes, finding the information you want in the ocean of information becomes increasingly difficult. To solve the problem of information overload, many scientists and engineers have proposed ingenious solutions. The most representative are classified directories and search engines.

Classified directory: a classified directory is a system through which a website organizes information. It provides a web directory organized by category. Under each category it lists the names, links, and short summaries of the sites that belong to it, along with its subcategories, so users can browse level by level to find relevant websites. Categories often provide cross-references as well, which makes it easy to jump between related directories. Sites such as Sina, Sohu, and NetEase organize information from different sources into a uniform form, then store it and present it to users, who filter web content by information source, information type, keywords, and so on.

Search engine: a search engine is a system that automatically collects information from the Internet, organizes it to some degree, and makes it available for users to query. Examples: Baidu, 360 Search, and Sogou.

5. Common information retrieval terms

The field of information retrieval has some commonly used terms. A solid understanding of them is necessary for getting started with information retrieval.

  • User need (UN): the information the user needs to obtain, described in text, for example finding tutorials related to "elasticsearch". Sometimes also called a topic.
  • Query: the expression of a UN that is submitted to the retrieval system, such as "elasticsearch Guide". For the same UN, different people, or the same person at different times, may construct different queries; the need above could also be expressed as "elasticsearch Definitive Guide".
  • Document: the object of information retrieval. A document can be plain text, or multimedia such as images, video, audio, or speech.
  • Document collection (corpus): a collection made up of multiple documents.
  • Document ID: a unique identifier assigned to each document in the collection. Documents are distinguished by their document IDs, which makes internal processing in the search engine easier.
  • Tokenization: the process of splitting a given character sequence into a series of subsequences. Each subsequence is called a token.
    • Input: Elasticsearch is a distributed RESTful search engine built for the cloud.
    • Output (after lowercasing and removing stop words): elasticsearch distributed restful search engine built cloud
  • Term: a token after linguistic preprocessing and normalization. The term is the smallest unit of the index.
  • Term-document incidence matrix: a conceptual model that expresses the containment relationship between terms and documents, as in the table below.
    • Read vertically, along the document dimension, each column shows which terms a document contains: doc1 contains term1, doc2 contains term2, and doc3 contains term1, term2, and term3.
    • Read horizontally, along the term dimension, each row shows how a term is distributed across the documents: for example, term3 appears only in doc3 and in no other document.

            doc1  doc2  doc3
      term1  1     0     1
      term2  0     1     1
      term3  0     0     1
  • Term frequency: the number of times a given term appears in a document. For example, if the word "elasticsearch" appears 3 times in a document, its term frequency in that document is 3.
  • Document frequency: the number of documents in the collection in which a given term appears. For example, if the word "elasticsearch" appears only in documents 1 and 5 of the collection, its document frequency is 2.
  • Postings list: records the list of all documents in which a word appears, together with the positions at which the word appears in each document. Each record is called a posting. The postings list tells you which documents contain which words.
  • Inverted file: the physical file in which postings lists are stored on disk.
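The terms above can be illustrated with a small, self-contained Python sketch. It is not how Elasticsearch is implemented; the stop-word list, the function names, and documents 2 and 3 are hypothetical, chosen only to reproduce the tokenization example and to show term frequency, document frequency, a positional postings list, and one row of an incidence matrix.

```python
from collections import defaultdict

# Hypothetical stop-word list, chosen only to reproduce the example above.
STOP_WORDS = {"is", "a", "for", "the"}


def tokenize(text):
    """Lowercase the text, split on whitespace, strip punctuation, drop stop words."""
    tokens = []
    for raw in text.lower().split():
        token = raw.strip(".,!?;:\"'()")
        if token and token not in STOP_WORDS:
            tokens.append(token)
    return tokens


def build_postings(docs):
    """Map each term to {doc_id: [positions]}.

    Positions are offsets in the token list after stop-word removal.
    """
    postings = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(tokenize(text)):
            postings[term].setdefault(doc_id, []).append(position)
    return postings


def term_frequency(postings, term, doc_id):
    """Number of times the term occurs in one document."""
    return len(postings.get(term, {}).get(doc_id, []))


def document_frequency(postings, term):
    """Number of documents that contain the term."""
    return len(postings.get(term, {}))


def incidence_row(postings, term, doc_ids):
    """One row of the term-document incidence matrix (1 = term present)."""
    return [1 if doc_id in postings.get(term, {}) else 0 for doc_id in doc_ids]


# Illustrative document collection; documents 2 and 3 are made up.
docs = {
    1: "Elasticsearch is a distributed RESTful search engine built for the cloud.",
    2: "A search engine collects documents and builds an index.",
    3: "Elasticsearch stores documents; Elasticsearch searches documents.",
}
postings = build_postings(docs)

print(tokenize(docs[1]))
print(dict(postings["elasticsearch"]))
print(term_frequency(postings, "elasticsearch", 3))
print(document_frequency(postings, "elasticsearch"))
print(incidence_row(postings, "elasticsearch", sorted(docs)))
```

Storing positions per document keeps term frequency and document frequency derivable from the same structure, which is why the sketch builds only the postings list and computes the rest from it.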

6. The information retrieval system

The basic architecture of a complete information retrieval system is shown in the figure below. An information retrieval system can be divided into three parts: information collection, information organization, and user query.

 

 

Information collection: collection is mostly done automatically by web crawlers.

Information organization: the process by which an information retrieval system organizes information is called index building.

User query: the user sends a query to the information retrieval system, which accepts the query and returns the retrieved documents to the user.
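The three parts can be sketched as a toy pipeline in Python. This is a minimal illustration under stated assumptions, not a real architecture: `collect` is a hypothetical stand-in for a web crawler that returns hard-coded documents, `build_index` organizes them into a simple inverted index, and `search` answers a query by requiring every query term to be present.

```python
def collect():
    """Stand-in for a web crawler: returns (doc_id, text) pairs.

    A real system would fetch pages over the network; this in-memory
    list is a hypothetical placeholder.
    """
    return [
        (1, "Elasticsearch is a distributed search engine"),
        (2, "A web crawler collects pages automatically"),
        (3, "A search engine builds an index over collected pages"),
    ]


def build_index(documents):
    """Organize the collected documents into an inverted index (term -> doc IDs)."""
    index = {}
    for doc_id, text in documents:
        for term in set(text.lower().split()):
            index.setdefault(term, set()).add(doc_id)
    return index


def search(index, query):
    """Return the sorted IDs of documents that contain every query term."""
    terms = query.lower().split()
    if not terms:
        return []
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return sorted(result)


index = build_index(collect())
print(search(index, "search engine"))    # [1, 3]
print(search(index, "collected pages"))  # [3]
```

Keeping collection, indexing, and querying as separate functions mirrors the division described above: the index is built once from the collected documents, and each query only reads from it.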


Origin www.cnblogs.com/huanmin/p/11715250.html