Getting Started with ElasticSearch: Why Choose ES as a Search Engine?

Introduction

As data volumes continue to grow, searching and analyzing large data sets becomes increasingly important. Traditional databases often perform poorly under this kind of workload, which is where an engine specialized for search and analysis comes in. ElasticSearch (ES for short) is one such engine, and its many advantages have made it a first choice for many enterprises and developers.

To put it simply: ElasticSearch is a near-real-time distributed storage, search, and analysis engine.

In my opinion, the strongest feature of ES is its fuzzy search.
Some people will then ask: can't my database do fuzzy search too?

select * from student where name like '%宁正%'

For example, this SQL finds students whose names contain "宁正" (Ning Zheng). It does work as a fuzzy search, but a pattern with a leading wildcard such as name like '%宁正%' cannot use an ordinary index on name, so the database has to scan the whole table. If your data volume is very large, say tens or hundreds of millions of rows, the query will take seconds no matter how much you optimize the code.

There is another situation as well. Most of the time, what we type into a search box is not entirely accurate. For example, I want to find information about ElasticSearch but accidentally type ElesticSearch. A SQL LIKE search will then find nothing related to ES, because the pattern has to match literally.

This is exactly the situation ElasticSearch is made for: it is built for searching.
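To make the typo example concrete: fuzzy matching of this kind is commonly based on Levenshtein edit distance, the number of single-character edits needed to turn one string into another. The following is a minimal sketch in plain Java, not how ES implements it internally; the class and method names are made up for illustration.

```java
class EditDistanceSketch {
    // Classic Levenshtein distance: the minimum number of single-character
    // insertions, deletions, or substitutions that turn a into b.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Turning an empty prefix of a into the first j characters of b takes j insertions.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // deleting the first i characters of a
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1),
                        prev[j - 1] + cost);
            }
            // Reuse the two rows instead of allocating a full matrix.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        // The misspelling differs from the intended word by one substitution.
        System.out.println(distance("ElesticSearch", "ElasticSearch")); // prints 1
    }
}
```

A search engine that accepts terms within a small edit distance of the query can still find ElasticSearch for this input, whereas a LIKE pattern cannot.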

Below I list the advantages of ES, with a brief analysis of each:

ES is very good at fuzzy search of full text

Reason: ES is based on inverted index, which allows ES to quickly match keywords and return relevant results without the need for full table scans like traditional databases. Inverted indexes are highly efficient when storing and querying large-scale text data.

Then some friends may ask after reading it: What is an inverted index? What is the difference between inverted index and forward index? Can the database we use every day use inverted index?

Then let’s answer one by one:

What is an inverted index?

Inverted index is a keyword-based index structure commonly used in full-text search engines and information retrieval systems. It is a data structure that maps keywords in documents to corresponding document IDs.

Specifically, the inverted index maps each keyword to the IDs of the documents that contain it. For each keyword, it records a list of the documents where the keyword appears, along with details such as term frequency and position. Given a keyword, this makes it possible to quickly find the documents that contain it.

What is the difference between inverted index and forward index?

A forward index is an index structure organized by document ID: for each document, it stores the document's contents, such as the terms it contains.

I have an easy-to-understand way to express it:

The forward index is like the table of contents when we read a book. We can directly find the content of the corresponding page number through the page number.

The inverted index extracts the vocabulary of the entire book and records which pages each word appears on, forming a mapping. When I want to know which pages a word appears on, I only need to look it up in this mapping to quickly find the pages I want.
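The book analogy can be sketched in a few lines of Java. This is a toy illustration under simplifying assumptions (a naive tokenizer, everything in memory, made-up class and method names), not how ES stores its indexes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

class BookIndexSketch {
    // Forward index: page number -> words on that page.
    static Map<Integer, List<String>> forwardIndex(Map<Integer, String> pages) {
        Map<Integer, List<String>> forward = new TreeMap<>();
        for (Map.Entry<Integer, String> e : pages.entrySet()) {
            forward.put(e.getKey(), tokenize(e.getValue()));
        }
        return forward;
    }

    // Inverted index: word -> sorted set of page numbers that contain it.
    static Map<String, SortedSet<Integer>> invertedIndex(Map<Integer, List<String>> forward) {
        Map<String, SortedSet<Integer>> inverted = new TreeMap<>();
        for (Map.Entry<Integer, List<String>> e : forward.entrySet()) {
            for (String word : e.getValue()) {
                inverted.computeIfAbsent(word, k -> new TreeSet<>()).add(e.getKey());
            }
        }
        return inverted;
    }

    // Naive tokenizer: lowercase, then split on anything that is not a letter or digit.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        Map<Integer, String> pages = new HashMap<>();
        pages.put(1, "Elasticsearch is a search engine");
        pages.put(2, "A database is not a search engine");
        pages.put(3, "Elasticsearch stores documents");

        Map<String, SortedSet<Integer>> inverted = invertedIndex(forwardIndex(pages));
        // One map lookup answers "which pages mention this word?" without scanning every page.
        System.out.println(inverted.get("elasticsearch"));
        System.out.println(inverted.get("search"));
    }
}
```

Answering the same question from the forward index alone would mean reading every page, which is the full scan the inverted index avoids.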

With this explanation, everyone should understand.

Can inverted indexes be used in the databases we use every day?

In fact, a database can support inverted indexes, but compared with the traditional forward index they are relatively complicated to implement in a database. The main design goal of a database is efficient data management and transaction processing, rather than complex query needs such as full-text search.

The query syntax of ES is more flexible, which can precisely control query conditions and weights, and perform more complex fuzzy searches

Elasticsearch's query syntax is quite flexible. You can control query conditions and weights as needed, and run complex queries such as Boolean queries, range queries, fuzzy queries, and geo-location queries. Using the query syntax, you can achieve more precise searches.

Here is a demo based on a geo-location query I wrote myself.

geoDistanceQuery is a geo-location query that finds documents within a certain distance of a given geographic point. You only need to provide the latitude and longitude of the point, a distance, and a unit:

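// Imports for the Elasticsearch 7.x high-level REST client
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.common.unit.DistanceUnit;
import org.elasticsearch.index.query.GeoDistanceQueryBuilder;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.builder.SearchSourceBuilder;

// Assumes lat, lon, and distance are doubles and client is a RestHighLevelClient, all defined elsewhere
SearchSourceBuilder sourceBuilder = new SearchSourceBuilder();
GeoDistanceQueryBuilder geoQuery = QueryBuilders.geoDistanceQuery("local") // name of the geo-point field
                .point(lat, lon) // geographic coordinates of the center point
                .distance(distance, DistanceUnit.KILOMETERS); // search radius

sourceBuilder.query(geoQuery);
SearchRequest searchRequest = new SearchRequest("indexName");
searchRequest.source(sourceBuilder);

// Execute the query
SearchResponse searchResponse = client.search(searchRequest, RequestOptions.DEFAULT);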

In the demo above, we created a geoDistanceQuery to find documents whose location field (named local here) is within a certain distance of the given coordinates. The unit used is kilometers.
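For intuition about what "within a certain distance" means, the check can be sketched with the haversine formula, a common way to approximate great-circle distance on a sphere. This is an illustration only: the class, the method names, and the Earth radius constant are assumptions of this sketch, and ES may compute distances differently internally.

```java
class GeoDistanceSketch {
    // Approximate mean Earth radius in kilometers (an assumption of this sketch).
    static final double EARTH_RADIUS_KM = 6371.0;

    // Great-circle distance between two points given in degrees, using the haversine formula.
    static double haversineKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a));
    }

    // True when the document's point lies within maxKm of the center point.
    static boolean withinDistance(double centerLat, double centerLon,
                                  double docLat, double docLon, double maxKm) {
        return haversineKm(centerLat, centerLon, docLat, docLon) <= maxKm;
    }

    public static void main(String[] args) {
        // One degree of longitude along the equator is roughly 111 km.
        System.out.println(haversineKm(0.0, 0.0, 0.0, 1.0));
        System.out.println(withinDistance(0.0, 0.0, 0.0, 1.0, 100.0));
    }
}
```

A geo-distance query conceptually applies a filter like withinDistance to the indexed location of every candidate document, with index structures that avoid checking each one individually.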

ES provides rich aggregation and analysis functions

ES natively provides rich aggregation and analysis functions, which can aggregate, group, and sort results. It also offers many other analysis features, such as word frequency statistics and date histograms. These help users understand their data more deeply and build dashboards and visualizations.

Here is another fairly simple demo, just to give you an idea. I will explain aggregations in detail in a later post.

Suppose we have an index that stores movie information, including the fields: title (movie title), genre (movie type), and rating (movie rating).
Now, we want to aggregate movies of different genres and calculate the average rating for each genre.
First, we need to build an aggregation query specifying grouping by genre field and calculating the average value of each grouping.

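GET movies/_search
{
  // return no hits, only the aggregation results
  "size": 0,
  // container for the aggregations
  "aggs": {
    // name given to this aggregation
    "genres": {
      // aggregation type: group documents by field value
      "terms": {
        // field to group by
        "field": "genre"
      },
      "aggs": {
        // name given to the average aggregation
        "avg_rating": {
          // aggregation type: compute the mean
          "avg": {
            // field to average
            "field": "rating"
          }
        }
      }
    }
  }
}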

In the above query, we use a terms aggregation to group movies by the genre field, and an avg aggregation to calculate the average of the rating field within each group. (This assumes genre is mapped as a keyword field; by default, terms aggregations do not work on analyzed text fields.)
You will get results similar to the following:

"aggregations" : {
  "genres" : {
    "buckets" : [
      {
        "key" : "Action",
        "doc_count" : 100,
        "avg_rating" : {
          "value" : 4.2
        }
      },
      {
        "key" : "Drama",
        "doc_count" : 80,
        "avg_rating" : {
          "value" : 3.8
        }
      },
      ...
    ]
  }
}

There are 100 movies in the Action category with an average rating of 4.2, and 80 movies in the Drama category with an average rating of 3.8.
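If the nested terms and avg aggregations feel abstract, the same grouping can be written over an in-memory list in plain Java. This is a sketch of the semantics only: the Movie class and the sample data are hypothetical, and ES computes aggregations across shards rather than over a local list.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

class GenreAverageSketch {
    // Hypothetical in-memory stand-in for a document in the movies index.
    static final class Movie {
        final String title;
        final String genre;
        final double rating;

        Movie(String title, String genre, double rating) {
            this.title = title;
            this.genre = genre;
            this.rating = rating;
        }
    }

    // Equivalent in spirit to a terms aggregation on genre with an avg sub-aggregation on rating.
    static Map<String, Double> averageRatingByGenre(List<Movie> movies) {
        return movies.stream().collect(Collectors.groupingBy(
                (Movie m) -> m.genre,
                TreeMap::new,
                Collectors.averagingDouble((Movie m) -> m.rating)));
    }

    public static void main(String[] args) {
        List<Movie> movies = Arrays.asList(
                new Movie("Movie A", "Action", 4.0),
                new Movie("Movie B", "Action", 4.5),
                new Movie("Movie C", "Drama", 3.5));
        // Each map entry corresponds to one bucket: the genre key and its average rating.
        System.out.println(averageRatingByGenre(movies));
    }
}
```

Each key in the resulting map plays the role of a bucket key, and each value plays the role of avg_rating.value.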

This is just a simple example of aggregation and analysis. ES provides many richer aggregation operations and analysis functions, and more complex operations can be built to suit specific needs.

ES uses a distributed architecture to better handle large-scale data and highly concurrent queries.

The ES distributed architecture performs well in horizontal expansion. By sharding data and storing it on multiple nodes, ES can process large-scale data and improve query performance through parallel queries and distributed computing.
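As a rough intuition for how data gets spread across shards, a document can be assigned to a shard by hashing its ID and taking the result modulo the number of shards. The sketch below is a simplification: String.hashCode() is only a stand-in and is not the hash function ES uses, and the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ShardRoutingSketch {
    // Simplified routing: hash the document ID, then take it modulo the shard count.
    // floorMod keeps the result non-negative even when hashCode() is negative.
    static int shardFor(String docId, int numberOfShards) {
        return Math.floorMod(docId.hashCode(), numberOfShards);
    }

    // Groups document IDs by the shard they would land on.
    static Map<Integer, List<String>> assign(List<String> docIds, int numberOfShards) {
        Map<Integer, List<String>> shards = new TreeMap<>();
        for (String id : docIds) {
            shards.computeIfAbsent(shardFor(id, numberOfShards), k -> new ArrayList<>()).add(id);
        }
        return shards;
    }

    public static void main(String[] args) {
        List<String> ids = Arrays.asList("movie-1", "movie-2", "movie-3", "movie-4", "movie-5");
        // Every node can compute the same shard for a given ID, so no lookup table is needed.
        System.out.println(assign(ids, 3));
    }
}
```

Because the assignment is deterministic, a query can be sent to all shards in parallel and the partial results merged, which is where the horizontal scaling comes from.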

I won't go into detail here; I will cover it in a later post.

So should we reach for ES without thinking, or only in specific situations?

The answer to the first question is, of course, that you can't adopt ES without thinking!

  1. Although ES is powerful, it is also complex. Using it requires some learning and understanding. Without proper training or experience, you may run into configuration errors, performance issues, indexing and query errors, and more.
  2. ES is a distributed system that needs appropriate hardware and resources to function properly. Improper deployment can lead to performance issues or wasted resources.
  3. ES requires proper management and maintenance, including monitoring cluster health, backing up and restoring data, and applying updates and upgrades. Without it, you may experience data loss, performance degradation, or security risks.
  4. Cost! Cost! And once again, cost!

The second question is, when should we use ES?

In my opinion, consider the following factors:

  1. Data scale. If the data you need to process reaches millions or even hundreds of millions of records, ES can handle data sets of that size.
  2. Search complexity. If you frequently need to run complex text queries and searches over full-text data, ES is a better choice.
  3. Real-time requirements. If you need to analyze fresh data quickly, ES is a suitable choice. It supports near-real-time indexing and querying, so data can be analyzed and visualized shortly after it arrives.
  4. Distributed and high-availability requirements. If you need a scalable, highly available, and fault-tolerant data storage and analysis solution, ES is a suitable choice.

So don’t use this technology mindlessly just because it is powerful. When using technology, you should also consider the risks it brings.

Origin blog.csdn.net/qq_43649799/article/details/132637341