Deep Thoughts on ES Interview

1 Recommended article

20,000 words of detailed explanation to understand Elasticsearch thoroughly

2 What is an inverted index and why is it called that?

Inverted Index is a data structure designed for fast full-text search. It is widely used in search engines, of which Elasticsearch (ES for short) is a major user.

2.1 What is an inverted index?

Traditional indexes (such as a book's index or a database's B-Tree index) are "forward": they map from documents to terms. In other words, you look up a document and then list the terms it contains.

In contrast, an inverted index maps from terms to documents: for every unique term in the index, there is a list of the documents that contain that term.

For example, consider the following simple document set:

1. Apple is tasty.
2. Banana is tasty.
3. Apple and banana are fruits.

A simple inverted index might be:

apple -> 1, 3
banana -> 2, 3
tasty -> 1, 2
...
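
Building such an index can be sketched in a few lines of Java (a toy model: real engines also store term positions, frequencies, and more):

```java
import java.util.*;

public class InvertedIndexDemo {
    // Build a toy inverted index: term -> sorted set of doc IDs (the postings list).
    static Map<String, TreeSet<Integer>> build(Map<Integer, String> docs) {
        Map<String, TreeSet<Integer>> index = new TreeMap<>();
        docs.forEach((id, text) -> {
            // crude analysis: lowercase and split on non-word characters
            for (String term : text.toLowerCase().split("\\W+"))
                index.computeIfAbsent(term, t -> new TreeSet<>()).add(id);
        });
        return index;
    }

    public static void main(String[] args) {
        Map<Integer, String> docs = new LinkedHashMap<>();
        docs.put(1, "Apple is tasty");
        docs.put(2, "Banana is tasty");
        docs.put(3, "Apple and banana are fruits");

        Map<String, TreeSet<Integer>> index = build(docs);
        System.out.println(index.get("apple"));   // [1, 3]
        System.out.println(index.get("banana"));  // [2, 3]
        System.out.println(index.get("tasty"));   // [1, 2]
    }
}
```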

2.2 Why is it called “inverted index”?

The name comes from its contrast with "forward indexing". A forward index maps documents to the terms they contain, while an inverted index does the opposite, mapping terms to the documents that contain them.

2.3 Why use an inverted index?

Inverted indexes are particularly useful for full-text search because:

  1. Speed : they allow documents containing specific terms to be found quickly.
  2. Space efficiency : only unique terms are indexed, each associated with its documents, which avoids duplication.
  3. Ranking and relevance : with an inverted index, it is easy to rank search results by term frequency, document frequency, and so on.

As a distributed search and analysis engine, Elasticsearch makes extensive use of inverted indexes to achieve its high-speed and efficient text search capabilities.

3 ES’s internal FST index

To find a term quickly, Elasticsearch first sorts all terms and then uses binary search to locate one, which takes O(log N) time, just like looking a word up in a dictionary. This sorted list of terms is the Term Dictionary.

This looks very similar to a prefix tree in that prefixes are shared, but an FST can also share suffixes. However, if there are too many terms, the Term Dictionary becomes very large and keeping it all in memory is impractical, so there is also the Term Index.

Like the index page of a dictionary, which shows which terms start with A and which page they are on, the Term Index can be understood as a tree.

This tree does not contain all terms; it contains only some prefixes of terms. Through the Term Index you can quickly locate an offset in the Term Dictionary and then search sequentially from that position.

The Term Index is compressed in memory using an FST, which stores terms as bytes. This compression effectively reduces the storage space so that the Term Index fits in memory, but it also means that lookups consume more CPU.

Compression is also applied to the postings lists (inverted lists) stored on disk to reduce the space they occupy.
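
The Term Index -> Term Dictionary lookup described above can be sketched as follows (a deliberately simplified model: here the "term index" just maps a first letter to an offset into the sorted dictionary, whereas Lucene actually stores FST-compressed prefixes):

```java
import java.util.*;

public class TermDictionaryDemo {
    // Build a toy term index: first character -> smallest offset in the sorted dictionary.
    static Map<Character, Integer> buildTermIndex(String[] dictionary) {
        Map<Character, Integer> termIndex = new HashMap<>();
        for (int i = dictionary.length - 1; i >= 0; i--)
            termIndex.put(dictionary[i].charAt(0), i); // iterating backwards keeps the smallest offset
        return termIndex;
    }

    // Jump to the prefix's offset via the term index, then scan sequentially.
    // Returns the position of `target` in the dictionary, or -1 if absent.
    static int lookup(String[] dictionary, Map<Character, Integer> termIndex, String target) {
        int pos = termIndex.getOrDefault(target.charAt(0), dictionary.length);
        while (pos < dictionary.length && dictionary[pos].compareTo(target) < 0) pos++;
        return pos < dictionary.length && dictionary[pos].equals(target) ? pos : -1;
    }

    public static void main(String[] args) {
        String[] dictionary = {"apple", "apply", "banana", "band", "bandana", "cat"};
        Map<Character, Integer> termIndex = buildTermIndex(dictionary);
        System.out.println(lookup(dictionary, termIndex, "band"));   // 3
        System.out.println(lookup(dictionary, termIndex, "banter")); // -1
    }
}
```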

3.1 What is FST in ES

In Elasticsearch, the FST (Finite State Transducer) is a data structure used to optimize and compress the Term Dictionary. It is a special kind of finite state machine that can represent not only a key (in the context of Elasticsearch, a key is usually a term) but also the value associated with that key.

Here are some core points about FST in Elasticsearch:

  1. Compression : FST is well suited for compressing dictionaries because it can share prefixes and suffixes. This compression method is very efficient considering that many terms in the vocabulary share common prefixes and suffixes.

  2. Fast search : In addition to storage advantages, FST also provides fast search capabilities. When a user queries a term, Elasticsearch can quickly find the metadata or other related information corresponding to the term through FST.

  3. Storing keys and values : Unlike a traditional prefix tree (trie), an FST stores not only terms (keys) but also the values associated with them. In Elasticsearch, these values might be metadata about the term, such as its position in the inverted index.

  4. Reduce memory usage : Due to its compression characteristics, FST can greatly reduce the memory footprint of Term Dictionary. This allows more data to be cached in memory, thereby improving query performance.

In short, in Elasticsearch, FST is an efficient data structure used to represent and compress Term Dictionary, making the storage of Term Dictionary more efficient and querying faster.

3.2 Why the prefix tree is also a finite state machine (FSM)

To be precise, a prefix tree (Trie) is not an FST (Finite State Transducer), but both belong to the family of finite state machines. FST is an extension of the finite state machine that can represent both keys and values. Both Trie and FST are data structures based on finite state machines, and both can effectively represent sets of keys, which is where their similarity lies.

Let’s take a closer look at their differences and connections:

  1. Prefix tree (Trie) :

    • Trie is a tree data structure designed to store a collection of strings.
    • It optimizes storage space by sharing a common prefix.
    • Each node represents a character, and the path from the root node to a leaf node represents a string.
    • Tries do not directly store values associated with keys, but values can be attached to leaf nodes or to specific nodes.
  2. Finite state transducer (FST) :

    • FST is an extension of the finite state machine that can represent transitions between states (i.e. keys) and the output value associated with each transition.
    • FST can share not only prefixes but also suffixes, further compressing the data.
    • In FST, a state can represent a prefix or suffix of a string, or a complete string.
    • FST represents the value associated with a key directly in its structure.

Since FST is able to share prefixes and suffixes, and directly represent keys and values, it is more flexible and efficient than Trie, especially when representing large amounts of data.

In short, although Trie and FST are both data structures based on finite state machines and have the characteristics of shared prefixes, FST provides more functions and higher storage efficiency. In some contexts, such as Elasticsearch's implementation, FST is preferred due to its compression and representation capabilities.

3.3 Differences between prefix tree and FST

FST and prefix tree (Trie) are similar in the strategy of sharing prefixes. They both store information by establishing a path for each character of the string. In this structure, words with the same prefix share storage paths until their characters start to differ.

However, FST has the following main differences from traditional prefix trees:

  1. Suffix sharing : In addition to prefix sharing, FST also attempts to share suffixes. This means that, unlike a prefix tree, which may store similar word suffixes multiple times, FST attempts to reuse and share these suffixes, further reducing the storage space required.

  2. Storing values : FST can also store values associated with keys. This allows the FST to be viewed as a key-value map, where the key is the term and the value can be any relevant data, such as the term's position in the inverted index. In the context of Elasticsearch, this is usually the term's metadata.

  3. Compression : FST also employs a variety of compression techniques to reduce actual storage requirements.

Overall, while FST and prefix trees are similar in the basic strategy of sharing prefixes, FST includes more optimizations and extensions that make it more efficient in certain applications.

3.4 How does FST achieve suffix sharing?

Let us discuss suffix sharing in detail.

Shared suffixes are more common in complex word sets. Consider a more complex set of words, such as: ["attempt", "attempts", "attempted", "attempting", "doing", "seated"].

In these words we see the shared prefix "attempt", as well as the shared suffixes "s", "ed", and "ing".

To build an FST that takes suffix sharing into account:

  1. From the start state, we connect the characters a, t, t, e, m, p, t in sequence.
  2. So far we have represented the word "attempt".
  3. From the last character "t" of "attempt", we can branch into three different suffixes:
    • "s" for "attempts"
    • "ed" for "attempted"
    • "ing" for "attempting"
  4. When adding "doing", the nodes d -> o are created level by level, and then the o node points to the "ing" path created in step 3; at this point "ing" is shared as a suffix.
  5. In the same way, nodes are created for the entry "seated", and its "ed" ending directly reuses the "ed" suffix already built in step 3.

Each suffix is stored only once in the FST, thus enabling suffix sharing.

Sharing prefixes and suffixes is key to the efficiency of FST. In real datasets, this sharing can greatly compress storage requirements, especially when the word set contains many variations and derivatives.
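
As a rough illustration of suffix sharing (a simplified model, not Lucene's actual FST code, and with no output values attached), we can build a trie and then merge identical subtrees bottom-up, so that endings like "ed" and "ing" are stored only once:

```java
import java.util.*;

public class SuffixSharing {
    static class Node {
        boolean isWord;
        TreeMap<Character, Node> children = new TreeMap<>();
    }

    static void insert(Node root, String word) {
        Node cur = root;
        for (char c : word.toCharArray())
            cur = cur.children.computeIfAbsent(c, k -> new Node());
        cur.isWord = true;
    }

    static boolean contains(Node root, String word) {
        Node cur = root;
        for (char c : word.toCharArray()) {
            cur = cur.children.get(c);
            if (cur == null) return false;
        }
        return cur.isWord;
    }

    // Merge identical subtrees bottom-up: two nodes are equivalent when they have
    // the same word flag and the same (character -> canonical child) edges.
    static Node minimize(Node node, Map<String, Node> register, Map<Node, Integer> ids) {
        StringBuilder sig = new StringBuilder(node.isWord ? "1" : "0");
        for (Map.Entry<Character, Node> e : node.children.entrySet()) {
            Node canonical = minimize(e.getValue(), register, ids);
            e.setValue(canonical);
            sig.append(e.getKey()).append(ids.get(canonical)).append(',');
        }
        return register.computeIfAbsent(sig.toString(), k -> {
            ids.put(node, ids.size());
            return node;
        });
    }

    static int countNodes(Node root) {
        Set<Node> seen = Collections.newSetFromMap(new IdentityHashMap<>());
        Deque<Node> stack = new ArrayDeque<>();
        stack.push(root);
        while (!stack.isEmpty()) {
            Node n = stack.pop();
            if (seen.add(n)) n.children.values().forEach(stack::push);
        }
        return seen.size();
    }

    public static void main(String[] args) {
        String[] words = {"attempt", "attempts", "attempted", "attempting", "doing", "seated"};
        Node root = new Node();
        for (String w : words) insert(root, w);
        int trieNodes = countNodes(root);       // prefixes shared only
        root = minimize(root, new HashMap<>(), new HashMap<>());
        int sharedNodes = countNodes(root);     // suffixes shared too: strictly fewer nodes
        System.out.println(trieNodes + " -> " + sharedNodes);
    }
}
```

Merging equivalent subtrees preserves exactly the same set of words while reducing the node count, which is the storage win the text describes.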

(Figure: the suffix sharing in the example above.)

4 The search process in ES

4.1 First, understand ES's main data structures

(Figure: ES's main data structures.)

4.2 The search process when using ES as a search engine

Elasticsearch (ES) is a search engine based on the Lucene library. When you perform a search on ES, it goes through the following main steps:

  1. Query analysis :

    • The user issues a query request, usually a JSON structure.
    • ES parses this request and converts the query statement into an internal format suitable for Lucene processing.
    • This includes parsing the query string, identifying the query type (e.g. match, term, range, etc.) and other possible query parameters.
  2. Text analysis :

    • If the query involves a text search, the entered text is analyzed.
    • The analyzer splits the text into individual terms, also called tokens, and may perform other operations such as lowercasing, stop-word removal, or stemming.
    • This process ensures that the terms in the query and the terms in the index are in the same format.
  3. Find Term Dictionary :

    • For each term, ES will first look up the Term Dictionary to find the metadata and location information of the term. This usually uses FST (Finite State Transducer) to optimize lookup speed.
  4. Inverted index search :

    • Once a term is found, ES will consult the inverted index to find all document IDs that contain the term.
    • For multi-term queries, ES does this for each term and combines the results based on the query type (e.g. Boolean query).
  5. Scoring and sorting :

    • Using the BM25 algorithm (default) or other relevance algorithms, ES calculates a score for each matching document.
    • This score represents the document's relevance to the query.
    • All matching documents are sorted according to their scores.
  6. Fetch documents :

    • ES fetches the full versions of the top-scoring matching documents from the main index.
    • All or part of the fields can be returned according to the needs of the query.
  7. Highlighting and aggregation (if requested):

    • If the query request contains highlighting instructions, ES will mark the parts of the documents that match the query.
    • If the request contains aggregation, ES will calculate statistics based on the specified fields and aggregation type.
  8. Return results :

    • ES constructs a JSON response containing the matching documents, scores, possible highlights, and aggregation results and returns it to the client.

This is the basic search process of ES. Of course, actual operations may involve more complexity, such as sharding processing, multi-node query and result merging, application of filters, etc.
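
Step 4 above, combining per-term results for a conjunctive (AND) query, boils down to intersecting sorted postings lists; a minimal sketch:

```java
import java.util.*;

public class PostingIntersection {
    // Intersect two sorted postings lists with a linear merge,
    // similar in spirit to how a conjunction over terms is evaluated.
    static List<Integer> intersect(int[] a, int[] b) {
        List<Integer> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) { out.add(a[i]); i++; j++; }
            else if (a[i] < b[j]) i++;
            else j++;
        }
        return out;
    }

    public static void main(String[] args) {
        int[] apple = {1, 3, 7, 9};   // doc IDs containing "apple"
        int[] tasty = {1, 2, 7, 8};   // doc IDs containing "tasty"
        System.out.println(intersect(apple, tasty));  // [1, 7]
    }
}
```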

4.2.1 Do general web search engines use ES?

No. Major web search engines such as Google, Bing, and Yahoo all use their own proprietary search technology. Elasticsearch is typically used for in-site search, log analysis, real-time application monitoring, and similar scenarios; it is not designed for global web search.

4.2.2 Is the result returned by ES the metadata of the document for a doc ID, or the document content?

The results returned by ES can be either the metadata of the document or the actual content of the document. When executing a search query, you can specify which fields are required to be returned as results. If not specified, ES will return the entire document by default.

4.2.3 What is ES's BM25 algorithm?

BM25 is a modern, probability-based document ranking algorithm that is generally considered more effective than the traditional TF-IDF method. Like TF-IDF, BM25 combines term frequency (TF) and inverse document frequency (IDF) to calculate a document's relevance score, but unlike TF-IDF it caps the contribution of high-frequency terms, so that repeating a term excessively cannot inflate a document's relevance score without bound.
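
The per-term score can be sketched as follows, using k1 = 1.2 and b = 0.75 (Elasticsearch's defaults); note how the term-frequency factor saturates and can never exceed idf * (k1 + 1):

```java
public class Bm25Demo {
    // Per-term BM25 score: idf * tf*(k1+1) / (tf + k1*(1 - b + b*docLen/avgDocLen))
    static double score(double tf, double idf, double docLen, double avgDocLen) {
        double k1 = 1.2, b = 0.75;
        return idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * docLen / avgDocLen));
    }

    public static void main(String[] args) {
        double idf = 2.0;
        // Raising tf increases the score, but with diminishing returns:
        System.out.println(score(1, idf, 100, 100));
        System.out.println(score(10, idf, 100, 100));
        System.out.println(score(1000, idf, 100, 100)); // still below idf * (k1 + 1) = 4.4
    }
}
```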

4.2.4 Fetching documents: what is the main index here? What is the difference between the main index and the FST?

When we refer to Elasticsearch's "main index" here, we mean the structure in which documents themselves are stored, similar to inodes in an operating system: it stores a document's metadata and a pointer to the document's location, and documents are stored and retrieved by their ID. The FST, by contrast, is a data structure for storing and looking up the terms of the vocabulary. Simply put, the main index is concerned with whole documents, while the FST is concerned with the terms inside them.

4.2.5 Why return only some fields? What do the fields here contain?

In some scenarios, to improve efficiency and reduce network transfer, you may only be interested in certain fields of the document rather than the whole document. For example, if your document is a user profile with multiple fields but you only want the name and email address, you can specify that only those two fields are returned. When querying ES, you can use the "_source" parameter to specify which fields to return. A "field" here is any attribute within the document, such as title, author, publication date, etc.
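
For example, a query that returns only two fields of a user profile might look like this (the index name users and the field names are illustrative):

```json
GET /users/_search
{
  "_source": ["name", "email"],
  "query": {
    "match": { "name": "alice" }
  }
}
```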


5 Why ES is quasi-real-time

ES is "near real-time" because newly indexed documents are not searchable immediately: they are first written to an in-memory buffer (and the translog), and only become visible to searches after a refresh creates a new Lucene segment. By default a refresh happens every 1 second, so there is typically a delay of up to about one second between indexing a document and being able to search for it.

6 How to use ES

How is ElasticSearch used in the project?

6.1 Should Elasticsearch be built as a separate project and used as a full-text search, or should it be used directly in the project?

It depends on your needs and the complexity of the project. Many companies and projects choose to run Elasticsearch as a separate service or microservice, separate from the main application. The benefit of this is that it can be extended, maintained, and updated independently of the main application, thus achieving decoupling. But if your application is smaller, or you just want to prototype quickly, you can integrate Elasticsearch directly into the application.

6.2 When adding new data, does it need to be inserted into ES at the same time as into MySQL?

If you rely on Elasticsearch for searches or other operations that require fast reads, when you insert data into MySQL, you should also insert that data into Elasticsearch. This way, you ensure that Elasticsearch's index is up to date when users try to search for newly added content. There are various ways to do this, such as using log file synchronization or using tools and libraries like Logstash or Debezium.

When you add new data to MySQL and want the data to be searchable in Elasticsearch, you need to ensure that the data is also synchronized to Elasticsearch. There are several common methods for triggering data synchronization:

  1. Client-initiated double writes : After the application (client) inserts data in MySQL, it can also be responsible for sending the same data to Elasticsearch. The challenge with this approach is to ensure that writes on both sides are successful, otherwise data inconsistency may occur.

  2. MySQL binlog monitoring and synchronization : Use tools such as Debezium or Logstash to monitor MySQL's binlog (binary log). When new data changes are detected, the changes are automatically pushed to Elasticsearch. This method is more reliable than client-side double writing because it ensures that the data can be synchronized correctly even if there is an application problem.

  3. Periodic data synchronization : Set a scheduled task, such as checking MySQL for new data every few minutes and synchronizing it to Elasticsearch. This is suitable for scenarios where data real-time requirements are not high.

  4. Triggers : Set triggers in MySQL to trigger certain operations for synchronization when new data is inserted or changed. But this method may affect MySQL performance and requires additional work to ensure successful synchronization.

Overall, which method to choose depends on your specific needs and the architecture of your system. But usually, the method of monitoring binlog is widely adopted because of its relative reliability and low latency.

6.3 When searching, do you directly return the es search results, or do you need to go back to mysql and check again based on the id in the es results?

Again this depends on your application and needs. In some scenarios, users may only need metadata obtained from Elasticsearch, so returning Elasticsearch results directly is enough. In other cases, more detailed information may need to be returned, and the ID provided by Elasticsearch can be used to look it up in MySQL.

About using MySQL: Elasticsearch is primarily designed for search and analysis and is not intended to replace traditional relational databases. MySQL (or other RDBMS) provides transactional, integrity, backup and recovery features that Elasticsearch is not good at or does not support. Elasticsearch is mainly aimed at fast search and analysis of large amounts of data. So, generally speaking, MySQL is used as the primary data store, while Elasticsearch is used as the search and analysis tool.

7 ES application scenarios in the flash sale system

7.1 Which parts of my flash sale system can use ES?

Elasticsearch (ES) is primarily used for searching and analyzing large amounts of data. In a flash sale system, such functionality may not be core, but there are still some places that may benefit from ES:

  1. Product search : If your flash sale system has a large number of products or multiple flash sales activities, ES can help users quickly find the products they are interested in. Users can search efficiently by keywords, brands, categories, etc.

  2. Log analysis : Flash sale systems usually face huge traffic. Using ES, you can perform real-time analysis of system logs and quickly discover any potential problems, bottlenecks or attacks.

  3. User behavior analysis : By sending users' browsing, clicking and purchasing behavior data to ES, you can conduct in-depth analysis to understand users' interests and behavior patterns, and optimize the strategy of flash sale activities accordingly.

  4. Inventory monitoring : Although MySQL or other relational databases are the first choice for managing inventory, when facing huge traffic, ES can provide you with a near-real-time snapshot of inventory status, helping you monitor and adjust strategies.

  5. Recommendation system : Based on the user's search and purchase history, ES can help you implement a basic recommendation system to recommend products that may be of interest to the user.

  6. Alarm system : You can set certain thresholds or patterns in ES and send alerts when these conditions are reached. For example, when the inventory of an item falls below a certain quantity, or when abnormal traffic is detected.

Nonetheless, for the core functions of a flash sale system, such as inventory reduction, order processing, etc., a traditional relational database or a specific distributed solution is more suitable. The main value of ES is the search and analysis capabilities it provides.

7.2 How to use ES to implement product search function

7.2.1 If I want to implement this function in my personal flash sale system, how should I use it?

To implement the product search function in the flash sale system and use Elasticsearch (ES), you can follow the steps below:

a. Data synchronization:

When new products are added to the flash sale activity (usually stored in MySQL), these product data are synchronized to Elasticsearch at the same time. You can use tools such as Logstash, Debezium, etc. to synchronize data from MySQL to Elasticsearch, or write data to both at the same time in your application layer code.

b. Data structure design:

In Elasticsearch, you need to define a suitable index structure for product data, consider which fields need to be indexed, which fields may be used for filtering, etc.

c. Search interface:

In your flash sale system, provide users with a search interface. When a user enters a keyword and initiates a search request, your system should initiate a query to Elasticsearch and return the search results to the user.

7.2.2 In this system, what is the relationship between ES, mysql, and redis, and what is the data flow between the three?

a. MySQL:

  • Main function : Store core data such as product information, user information, order data, etc.
  • Relationship with Elasticsearch : As the main source of data, new or updated product data is synchronized from MySQL to Elasticsearch.

b. Elasticsearch (ES):

  • Main function : Provide high-speed, full-text product search function.
  • Data flow :
    • From MySQL : New or updated product data is synchronized to ES.
    • To the front end : When the user initiates a search request, the system queries ES and returns the results to the user.

c. Redis:

  • Main function : Due to the high concurrency characteristics of flash sale activities, Redis is often used for functions such as caching, inventory counting, and current limiting.
  • Relationship with MySQL : Product inventory, user purchase records, etc. may be cached in Redis to provide faster read and write speeds. When the inventory changes, such as the flash sale is successful, the data may be written to Redis first and synchronized back to MySQL at the appropriate time.
  • Relationship with Elasticsearch : Usually there is no direct data interaction between the two. However, in some scenarios, such as real-time product recommendation based on user behavior, user behavior data in Redis may be used to query Elasticsearch.

Summary:

  • When new products are added: Application -> MySQL -> Elasticsearch.
  • When a user searches: application queries -> Elasticsearch -> application returns results to the user.
  • Flash sale or high concurrency operation: Application -> Redis -> (delayed or batch synchronization) -> MySQL.

This architecture allows each technology to leverage its best strengths while ensuring data integrity and system high availability.

7.3 When new products are added to the flash sale activity (usually stored in MySQL), these product data are synchronized to Elasticsearch at the same time. If the inventory update in the database is successful, but the ES update fails, what should be done?

When MySQL updates succeed but Elasticsearch updates fail, this is a typical distributed system data consistency problem. Here are a few commonly used resolution strategies:

  • Retry mechanism : If the failure is due to temporary network issues or other transient reasons, you can simply retry updating Elasticsearch.
  • Logging : Record all failed update operations so that administrators can manually intervene or reprocess these failed operations through scripts.
  • Use a queue : Put the update operation into a message queue, such as RabbitMQ or Kafka. The consumer takes the operation off the queue and attempts to update Elasticsearch. If it fails, you can put it back in the queue and try again later.
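
A minimal sketch of the retry-then-fallback idea (the update and fallback hooks here are placeholders for your own ES client call and your own queue or log):

```java
import java.util.function.Supplier;

public class EsRetry {
    // Retry an ES update a few times with simple linear backoff; if it still
    // fails, hand it to a fallback (e.g. log it or enqueue it for later replay).
    static boolean withRetry(Supplier<Boolean> update, Runnable onGiveUp, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                if (update.get()) return true;
            } catch (RuntimeException e) {
                // transient failure: fall through and retry
            }
            try {
                Thread.sleep(100L * attempt);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        onGiveUp.run();
        return false;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated ES update that succeeds on the third attempt
        boolean ok = withRetry(() -> ++calls[0] >= 3,
                () -> System.out.println("queued for replay"), 5);
        System.out.println(ok + " after " + calls[0] + " attempts");  // true after 3 attempts
    }
}
```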

7.4 In Elasticsearch, you need to define a suitable index structure for product data, consider which fields need to be indexed, which fields may be used for filtering, etc. Can you help me design an index structure?

{
  "mappings": {
    "properties": {
      "product_id":   { "type": "keyword" },
      "product_name": { "type": "text", "analyzer": "standard" },
      "description":  { "type": "text", "analyzer": "standard" },
      "price":        { "type": "float" },
      "stock":        { "type": "integer" },
      "category":     { "type": "keyword" },
      "brand":        { "type": "keyword" },
      "tags":         { "type": "keyword" },
      "created_at":   { "type": "date" }
    }
  }
}

7.4.1 Which fields get analyzed into terms?

Yes, fields marked with "type": "text" and "analyzer": "standard" will be analyzed in Elasticsearch and generate terms. The "standard" analyzer performs a series of processing steps on the text, including tokenization, lowercasing, and removal of most punctuation.

For example, suppose a field has the value "ElasticSearch is Great!". Using the "standard" analyzer, this text would be broken into three terms: elasticsearch, is, and great.

When we search, Elasticsearch matches against these generated terms, thereby realizing full-text search.

Additionally, fields marked "type": "keyword" are not analyzed but are stored and indexed as a whole. This means the value of such a field is treated as a single entry, which is suitable for fields that do not require tokenization, such as tags, categories, etc.
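
A rough approximation of this behavior (the real standard analyzer uses Unicode text segmentation rules; this sketch just lowercases and splits on non-alphanumeric characters):

```java
import java.util.*;

public class StandardAnalyzerSketch {
    // Approximate what the standard analyzer does to a text field:
    // lowercase the input and split on any run of non-letter/digit characters.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase().split("[^\\p{L}\\p{N}]+"))
            if (!token.isEmpty()) terms.add(token);
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(analyze("ElasticSearch is Great!"));  // [elasticsearch, is, great]
    }
}
```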

7.5 Why queries go through the business server:

Indeed, what an Elasticsearch query returns is a structured JSON response. There are several reasons to use a business server as the middle tier for querying:

  • Security : Directly letting the front end communicate with Elasticsearch may expose Elasticsearch details, which may be exploited by malicious users.
  • Flexibility : Query logic can be easily changed or enhanced using a business server as a middle tier.
  • Load balancing and caching : The business server can perform query result caching, load balancing and other optimizations.

7.6 Example of business server sending ES request:

Using Java's Elasticsearch client library as an example:

import java.io.IOException;

import org.apache.http.HttpHost;
import org.elasticsearch.action.delete.DeleteRequest;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.xcontent.XContentType;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.builder.SearchSourceBuilder;

public class ESExample {
    public static void main(String[] args) throws IOException {
        // Create the Elasticsearch client
        RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(
                        new HttpHost("localhost", 9200, "http")));

        // Insert a document
        IndexRequest indexRequest = new IndexRequest("products");
        indexRequest.id("1");
        String jsonString = "{" +
                "\"product_name\":\"TV\"," +
                "\"price\":2999.99," +
                "\"stock\":100" +
                "}";
        indexRequest.source(jsonString, XContentType.JSON);
        client.index(indexRequest, RequestOptions.DEFAULT);

        // Query documents
        SearchRequest searchRequest = new SearchRequest("products");
        SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
        searchSourceBuilder.query(QueryBuilders.matchQuery("product_name", "TV"));
        searchRequest.source(searchSourceBuilder);
        SearchResponse searchResponse = client.search(searchRequest, RequestOptions.DEFAULT);
        searchResponse.getHits().forEach(hit -> System.out.println(hit.getSourceAsString()));

        // Delete the document
        DeleteRequest deleteRequest = new DeleteRequest("products", "1");
        client.delete(deleteRequest, RequestOptions.DEFAULT);

        client.close();
    }
}

In this example, we first connect to the Elasticsearch service, then build a simple match query to search for products that contain "TV" in the product name, and finally print the source data for each hit product.

7.6.1 If I want to insert multiple such products, should I add them under products?

Yes, if you want to insert multiple products, you add multiple documents under the same products index. Each document has a unique ID, which in this example is hardcoded to "1". In reality, each item should have its own unique ID so that you can uniquely identify and reference it.

If you want to batch insert multiple items, you can use Elasticsearch's BulkRequest to perform multiple operations at once, which is usually more efficient than sending each request individually.

Here is an example of a bulk insert:

BulkRequest bulkRequest = new BulkRequest();

// First product
IndexRequest firstProduct = new IndexRequest("products");
firstProduct.id("1");
String firstProductJson = "{" +
        "\"product_name\":\"TV\"," +
        "\"price\":2999.99," +
        "\"stock\":100" +
        "}";
firstProduct.source(firstProductJson, XContentType.JSON);
bulkRequest.add(firstProduct);

// Second product
IndexRequest secondProduct = new IndexRequest("products");
secondProduct.id("2");
String secondProductJson = "{" +
        "\"product_name\":\"Refrigerator\"," +
        "\"price\":3999.99," +
        "\"stock\":50" +
        "}";
secondProduct.source(secondProductJson, XContentType.JSON);
bulkRequest.add(secondProduct);

// ... add more products

client.bulk(bulkRequest, RequestOptions.DEFAULT);

In this way, all items are stored under the products index, and each item has its own unique ID. When you query the products index, you can get all the items stored in it.

7.6.2 Is products here equivalent to a namespace? Does each space have an FST, and do all fields marked text or keyword in the space become part of the FST?

Almost, but there are some details that need to be clarified:

  1. Index : In Elasticsearch, products is an index , not just a namespace. This index contains mappings (which define the data types and other properties of fields) and document data.

  2. FST : FST (Finite State Transducer) is a data structure Elasticsearch uses to compress and store terms, mainly in the term dictionary . But not every field has an FST: only indexed fields (such as those of type text or keyword) have entries in the term dictionary, so only they may use an FST.

  3. Mapping types : In earlier versions of Elasticsearch, an index could have multiple mapping types, which allowed the same index to store documents with different structures. But starting from Elasticsearch 7.0, indexes can only have one mapping type, which makes using the index simpler and more intuitive.

  4. Fields and FST : For fields of type text, Elasticsearch breaks the values into terms and stores those terms in the term dictionary; these terms are used to build the FST. For fields of type keyword, the entire field value is treated as a single term and stored in the term dictionary.

  5. Multi-fields : Sometimes you may want to index a field both as text (for analysis and full-text search) and as keyword (for aggregations or exact matching). Elasticsearch supports this through so-called "multi-fields": a field can be mapped primarily as text but also carry a keyword subfield.

So, in short, fields marked as text or keyword in the products index will be added to the term dictionary, and that dictionary may be compressed and stored using an FST.
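A minimal sketch of such a multi-field mapping, reusing the earlier products example (the raw subfield name is an illustrative choice):

```json
PUT /products
{
  "mappings": {
    "properties": {
      "product_name": {
        "type": "text",
        "fields": {
          "raw": { "type": "keyword" }
        }
      },
      "price": { "type": "double" },
      "stock": { "type": "integer" }
    }
  }
}
```

With this mapping, product_name is analyzed into terms for full-text search, while product_name.raw keeps the whole value as one term for exact matching and aggregations.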

7.6 What synchronization methods from MySQL to ES do big companies currently use? Let’s walk through the concrete synchronization process.

When large companies choose a synchronization method from MySQL to Elasticsearch, they usually decide based on their business needs, existing technology stack, and requirements for data stability and freshness. Here are some common synchronization methods and the reasons larger companies might adopt them:

  1. JDBC plugin for Logstash :

    • Advantages : Simple configuration, easy to start, no need for extensive development work.
    • Disadvantages : May not be suitable for highly real-time synchronization needs because it is usually polling-based.
    • Applicable scenarios : Companies that do not have high requirements for real-time data synchronization, or are initially trying Elasticsearch, may choose this method.
  2. Debezium and Kafka :

    • Advantages : Based on CDC (Change Data Capture), database changes can be captured in real time, and combined with Kafka, it can ensure high reliability and scalability of data.
    • Disadvantages : Requires additional components (such as Kafka) and configuration work, and may require more operational efforts.
    • Applicable scenarios : Companies that have real-time requirements for data synchronization and already have Kafka infrastructure may choose this method.
  3. Kafka Connect:

    • Advantages : Like the Debezium option, this approach is CDC-based, but Kafka Connect itself ships as an official component of Apache Kafka and integrates tightly with it (Debezium in fact runs as a Kafka Connect source connector).
    • Disadvantages : Also requires Kafka infrastructure.
    • Applicable scenarios : Companies that already use Kafka and want to further simplify integration may consider this approach.
  4. Custom sync tools :

    • Advantages : Completely customized to the specific needs of the company, the synchronization process can be deeply optimized.
    • Disadvantages : High development and maintenance costs.
    • Applicable scenarios : Companies with specific synchronization needs or hoping for deep integration may develop their own synchronization tools.

In general, different companies may choose different methods based on their own business scenarios and technical backgrounds. For example, companies that already use Kafka at scale might choose Debezium or Kafka Connect. Companies that don't have high real-time requirements or want to experiment quickly may first try Logstash's JDBC plug-in.

In large Internet companies, because the real-time and stability of data are critical, CDC-based methods (such as Debezium and Kafka Connect) are more popular. But at the same time, some large companies may also develop customized synchronization tools for their own specific needs.
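As a sketch of option 1 above, a minimal Logstash JDBC pipeline might look like this (hostnames, credentials, table and column names are placeholders):

```conf
input {
  jdbc {
    jdbc_connection_string => "jdbc:mysql://localhost:3306/shop"
    jdbc_user => "reader"
    jdbc_password => "secret"
    jdbc_driver_class => "com.mysql.cj.jdbc.Driver"
    # polled on a schedule -- this is why the approach is not truly real-time
    schedule => "* * * * *"
    statement => "SELECT id, product_name, price, stock FROM products WHERE updated_at > :sql_last_value"
  }
}

output {
  elasticsearch {
    hosts => ["http://localhost:9200"]
    index => "products"
    document_id => "%{id}"
  }
}
```

The schedule field uses cron syntax (here: once a minute), and :sql_last_value lets Logstash fetch only rows changed since the previous run.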

7.7 For the flash sale system, are the real-time requirements for data high?

For a flash sale system, the real-time requirements for data are very high. Flash sales usually involve limited-time, limited-quantity rushes on goods, so real-time behavior in several key areas is especially important:

  1. Real-time inventory : In a flash sale, the stock count of each item is critical. If the system cannot update and report inventory in real time, it may oversell (that is, sell more than the available stock). Overselling can lead to refunds and reduced customer satisfaction, and may even carry legal risk.

  2. Real-time order processing : Once the user clicks the "Buy" button, the system must immediately process the order and confirm whether the transaction can be completed. Any delay may result in a degraded user experience and may even cause a system crash.

  3. Real-time user feedback : During flash sale activities, users always want to know in real time whether they have successfully grabbed the product. Any delay or misleading information may lead to user dissatisfaction.

  4. Traffic monitoring and adjustment : Since flash sale activities usually trigger a large amount of traffic, the system needs to monitor traffic conditions in real time and be able to make quick adjustments to cope with possible traffic peaks.

  5. Real-time notifications and messages : Users need to be notified in real time about the start of flash sale activities, inventory changes or other related information to ensure that users do not miss any key moments.

Considering the above-mentioned real-time requirements, flash sale systems usually need to adopt high-performance, low-latency technical solutions, and are designed with high concurrency and high availability in mind.

7.8 How does Kafka Connect implement data synchronization from MySQL to ES (synchronizing products in the flash sale system)?

Kafka Connect is an extensible tool for importing data from different source systems into Apache Kafka and exporting data from Kafka to target systems. To implement data synchronization from MySQL to Elasticsearch, you can pair Kafka Connect with specific connectors. The general implementation steps are as follows:

  1. Use Debezium Connector for MySQL :

    • Debezium is an open source CDC (Change Data Capture) tool that can capture and monitor database change events.
    • For MySQL, Debezium listens to its binlog to capture data changes and publishes these change events to the Kafka topic.
  2. Set up the Kafka Connect Elasticsearch Connector :

    • Kafka Connect has an official Elasticsearch connector, which can consume data from Kafka topics and write to Elasticsearch.
  3. Steps :

    a. Set up Debezium Connector for MySQL :

    • Make sure MySQL's binlog is enabled (Debezium requires row-based binlog, i.e. binlog_format=ROW).
    • Configure Debezium and specify MySQL related parameters.
    • Start the Debezium Connector, at which point it will start listening to MySQL for data changes and send them to the specified Kafka topic.

    b. Set up the Kafka Connect Elasticsearch Connector :

    • Configure the connector and specify Elasticsearch-related parameters, such as address, index name, etc.
    • Specify which Kafka topic to read data from.
    • Start the Elasticsearch Connector, which will fetch data from the specified Kafka topic and write it to Elasticsearch.
  4. Operation :

    • Once the above steps are set up, any data changes you make in MySQL (such as inserts, updates, or deletes) will be captured by Debezium and sent to Elasticsearch via Kafka.
  5. Things to note :

    • Ensure that the Kafka cluster has sufficient resources and capacity to handle the data changes captured from MySQL.
    • Consider partitioning and replication strategies for Kafka topics to ensure high availability and durability of data.
    • In a production environment, you may also need data transformation, filtering, or other logic, which can be handled with Kafka Connect's Single Message Transforms (SMTs).

Using Kafka Connect for data synchronization from MySQL to Elasticsearch not only ensures the real-time nature of the data, but also uses Kafka as a buffer to reduce system pressure and ensure data reliability.
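The two connector configurations described in the steps above can be sketched as follows (names, hosts, and credentials are placeholders; property names follow Debezium 2.x and the Confluent Elasticsearch sink). The Debezium MySQL source, posted to Kafka Connect's REST API:

```json
{
  "name": "mysql-products-source",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.hostname": "mysql",
    "database.port": "3306",
    "database.user": "debezium",
    "database.password": "secret",
    "database.server.id": "184054",
    "topic.prefix": "shop",
    "table.include.list": "shop.products",
    "schema.history.internal.kafka.bootstrap.servers": "kafka:9092",
    "schema.history.internal.kafka.topic": "schema-changes.shop"
  }
}
```

And the Elasticsearch sink, which consumes the change topic (Debezium names it topicPrefix.database.table) and writes each row into ES, using the record key (the primary key) as the document ID:

```json
{
  "name": "es-products-sink",
  "config": {
    "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
    "connection.url": "http://elasticsearch:9200",
    "topics": "shop.shop.products",
    "key.ignore": "false",
    "schema.ignore": "true"
  }
}
```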

7.9 Use Alibaba’s open source data synchronization product canal to update mysql data practice into ES.

How does MySQL synchronize data to ES in real time? Try this open-source tool from Alibaba!


Origin blog.csdn.net/yxg520s/article/details/132787612