Use Elasticsearch as a database: Join

Articles in the "Use Elasticsearch as a database" series:

Sell Elasticsearch

Secrets of Time Series Databases (1) - Introduction
Secrets of Time Series Databases (2) - Indexing
Secrets of Time Series Databases (3) - Loading and Distributed Computing

Querying Elasticsearch with SQL

https://github.com/taowen/es-monitor

[01] Use Elasticsearch as a database: table structure definition
[02] Use Elasticsearch as a database: filtering and sorting
[03] Use Elasticsearch as a database: simple indicators
[04] Use Elasticsearch as a database: aggregate by fields
[05] Use Elasticsearch as a database: HISTOGRAM aggregation
[06] Use Elasticsearch as a database: CASE WHEN aggregation
[07] Use Elasticsearch as a database: Sort after aggregation
[08] Use Elasticsearch as a database: Calculate and then aggregate
[09] Use Elasticsearch as a database: HAVING and Pipeline Aggregation
[10] Use Elasticsearch as a database: Drill Down
[11] Use Elasticsearch as a database: Filter drill down
[12] Use Elasticsearch as a database: Calculate after aggregation
[13] Use Elasticsearch as a database: Join


Use https://github.com/taowen/es-monitor to query Elasticsearch with SQL. To truly use Elasticsearch as a database, joins are an unavoidable topic. This slide deck gives a good summary of how Elasticsearch supports joins: http://www.slideshare.net/sirensolutions/searching-relational-data-with-elasticsearch . In general, there are several approaches:

  • No join at all: merge the fields of the related table into one table. This, of course, causes data redundancy

  • Index-time join: use nested documents (nested documents are stored in the same segment as their main document, so this does not suit cases such as one symbol with tens of millions of quotes)

  • Index-time join: use siren

  • Query-time join: use parent/child (a built-in Elasticsearch feature that requires parent and child documents to live on the same shard)

  • Query-time join: use siren-joins (a filter that is evaluated on the server first, with its result then sent to every shard for a second round of matching)

  • Query-time join: assemble a second query on the client side (similar to siren-joins, but with one more round trip between client and server)

  • Query-time join: join and merge the results of two queries on the coordinating node ( https://github.com/NLPchina/elasticsearch-sql )

My personal favorites are siren-joins and client-side assembly. Both schemes run one query first and then distribute its result to every node, where the second query is filtered and aggregated in a distributed fashion. This scales better than doing the join on the coordinating node.
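
To make the scalability argument concrete, here is a toy Python sketch of the filter-then-aggregate pattern. It is not es-monitor or siren-joins code: the in-memory shard lists, the symbols, and the prices are all made up for illustration. The point is only that each shard receives the small key set, aggregates locally, and returns a single partial value to merge.

```python
# Toy model: each "shard" is a list of quote rows held in memory.
# All names and values here are hypothetical, chosen for illustration.

def shard_max(rows, symbols, field):
    """Per-shard work: filter by the broadcast symbol set, then aggregate locally."""
    values = [row[field] for row in rows if row["symbol"] in symbols]
    return max(values) if values else None


def distributed_max(shards, symbols, field):
    """Coordinator work: merge one partial result per shard."""
    partials = [shard_max(rows, symbols, field) for rows in shards]
    partials = [p for p in partials if p is not None]
    return max(partials) if partials else None


shards = [
    [{"symbol": "AAA", "adj_close": 10.0}, {"symbol": "BBB", "adj_close": 99.0}],
    [{"symbol": "AAA", "adj_close": 12.5}, {"symbol": "CCC", "adj_close": 7.0}],
]

# Result of the first query, sent to every shard as a filter.
finance_symbols = set(["AAA", "CCC"])

result = distributed_max(shards, finance_symbols, "adj_close")
print(result)
```

Only one number per shard travels back to the coordinator, whereas a join on the coordinating node would have to pull the matching rows themselves.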

Client Evaluation

First, let's see how to evaluate a result set on the client side:

$ cat << EOF | python2.6 es_query.py http://127.0.0.1:9200
    SELECT symbol FROM symbol WHERE sector='Finance' LIMIT 1000;
    SAVE RESULT AS finance_symbols;
EOF

The SAVE RESULT AS statement introduced here triggers evaluation of the preceding SQL and names the result set finance_symbols. If an intermediate result is no longer needed, we can discard it with the REMOVE RESULT command:

$ cat << EOF | python2.6 es_query.py http://127.0.0.1:9200
    SELECT symbol FROM symbol WHERE sector='Finance' LIMIT 1000;
    SAVE RESULT AS finance_symbols;
    REMOVE RESULT finance_symbols;
EOF

We can even use arbitrary Python code to modify result_map:

$ cat << EOF | python2.6 es_query.py http://127.0.0.1:9200
    SELECT symbol FROM symbol WHERE sector='Finance' LIMIT 1000;
    SAVE RESULT AS finance_symbols;
    result_map['finance_symbols'] = result_map['finance_symbols'][1:-1];
EOF
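
For readers less familiar with Python slicing, the following standalone sketch shows what the `[1:-1]` statement above does to a saved result. It uses a plain dict as a stand-in for result_map; the row format (a list of dicts with a `symbol` key) is an assumption, not something confirmed by es_query.py.

```python
# Toy stand-in for the result_map used above; the real row format is an assumption.
result_map = {}

# What "SAVE RESULT AS finance_symbols" is assumed to leave behind: a list of rows.
result_map["finance_symbols"] = [
    {"symbol": "TFSC"},
    {"symbol": "TFSCR"},
    {"symbol": "TFSCU"},
    {"symbol": "TFSCW"},
    {"symbol": "PIH"},
]

# Same statement as in the example above: [1:-1] drops the first and the last row.
result_map["finance_symbols"] = result_map["finance_symbols"][1:-1]
trimmed = [row["symbol"] for row in result_map["finance_symbols"]]
print(trimmed)

# What "REMOVE RESULT finance_symbols" is assumed to amount to.
del result_map["finance_symbols"]
```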

Client Join

Building on client-side evaluation, we can use the result set kept on the client to issue a second request:

cat << EOF | python2.6 es_query.py http://127.0.0.1:9200
    SELECT symbol FROM symbol WHERE sector='Finance' LIMIT 5;
    SAVE RESULT AS finance_symbols;
    SELECT MAX(adj_close) FROM quote 
        JOIN finance_symbols ON quote.symbol = finance_symbols.symbol;
    REMOVE RESULT finance_symbols;
EOF

This produces two Elasticsearch requests. The first one is:

{
  "query": {
    "term": {
      "sector": "Finance"
    }
  }, 
  "size": 5
}

Then, based on its response, the second request is generated:

{
  "query": {
    "bool": {
      "filter": [
        {}, 
        {
          "terms": {
            "symbol": [ "TFSC", "TFSCR", "TFSCU", "TFSCW", "PIH" ] }
        }
      ]
    }
  }, 
  "aggs": {
    "MAX(adj_close)": {
      "max": {
        "field": "adj_close"
      }
    }
  }, 
  "size": 0
}

As we can see, the so-called client join simply uses the result of the previous query to assemble the terms filter of the second query.
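
The assembly step can be sketched in a few lines of Python. This is a minimal illustration, not the code es_query.py actually runs: the function name `build_client_join_request` and the row format are hypothetical, and the empty `{}` clause that appears in the generated request above is left out.

```python
import json


def build_client_join_request(rows, join_field, agg_func, agg_field):
    """Turn the rows of the first query into the body of the second one.

    Hypothetical helper for illustration; rows is assumed to be a list of dicts.
    """
    # Collect the join keys, dropping duplicates while keeping their order.
    keys = []
    for row in rows:
        key = row[join_field]
        if key not in keys:
            keys.append(key)
    agg_name = "%s(%s)" % (agg_func.upper(), agg_field)
    return {
        "query": {"bool": {"filter": [{"terms": {join_field: keys}}]}},
        "aggs": {agg_name: {agg_func: {"field": agg_field}}},
        "size": 0,
    }


# Rows as the first query might return them (the symbols come from the request above).
rows = [{"symbol": s} for s in ["TFSC", "TFSCR", "TFSCU", "TFSCW", "PIH"]]
request = build_client_join_request(rows, "symbol", "max", "adj_close")
print(json.dumps(request, indent=2))
```

Because every key from the first query is embedded in the second request, the request grows with the size of the first result set, which is why the examples keep a LIMIT on the first query.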

Server Join

With the siren-join plugin ( https://github.com/sirensolutions/siren-join ), we can perform the same join on the server side:

cat << EOF | python2.6 es_query.py http://127.0.0.1:9200
    WITH finance_symbols AS (SELECT symbol FROM symbol WHERE sector='Finance' LIMIT 5);
    SELECT MAX(adj_close) FROM quote 
        JOIN finance_symbols ON quote.symbol = finance_symbols.symbol;
EOF

In the client join, the first query was evaluated with SAVE RESULT AS and named finance_symbols. Here we do not evaluate it; we only give it a name (WITH ... AS) so that the second query can reference it. The generated request is:

{
  "query": {
    "bool": {
      "filter": [
        {}, 
        {
          "filterjoin": {
            "symbol": { "indices": "symbol*", "path": "symbol", "query": { "term": { "sector": "Finance" } } } }
        }
      ]
    }
  }, 
  "aggs": {
    "MAX(adj_close)": {
      "max": {
        "field": "adj_close"
      }
    }
  }, 
  "size": 0
}

As we can see, the generated filterjoin merges the two steps into one. Note that a filterjoin query must be POSTed to the _coordinate_search URL instead of the _search URL.
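
As a rough sketch, the filterjoin body above could be assembled like this in Python. The helper names are hypothetical and the structure simply mirrors the JSON shown above (again without the empty `{}` clause). The URL helper assumes that _coordinate_search sits where _search normally would, and that the target index is called quote.

```python
def build_filterjoin_request(join_field, indices, path, inner_query, agg_func, agg_field):
    """Build a request body shaped like the filterjoin example above (hypothetical helper)."""
    agg_name = "%s(%s)" % (agg_func.upper(), agg_field)
    filterjoin = {
        join_field: {
            "indices": indices,    # index pattern the inner query runs against
            "path": path,          # field of the inner index that supplies the join keys
            "query": inner_query,  # the former "first query", now evaluated on the server
        }
    }
    return {
        "query": {"bool": {"filter": [{"filterjoin": filterjoin}]}},
        "aggs": {agg_name: {agg_func: {"field": agg_field}}},
        "size": 0,
    }


def coordinate_search_url(base_url, index):
    """Assumed URL layout: _coordinate_search takes the place of _search."""
    return "%s/%s/_coordinate_search" % (base_url.rstrip("/"), index)


body = build_filterjoin_request(
    "symbol", "symbol*", "symbol", {"term": {"sector": "Finance"}}, "max", "adj_close"
)
url = coordinate_search_url("http://127.0.0.1:9200", "quote")
print(url)
```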

Profile

[
  {
    "query": [
      {
        "query_type": "BoostQuery",
        "lucene": "ConstantScore(BytesFieldDataTermsQuery::[size=8272])^0.0",
        "time": "29.32334300ms",
        "breakdown": {
          "score": 0,
          "create_weight": 360426,
          "next_doc": 137906,
          "match": 0,
          "build_scorer": 15027540,
          "advance": 0
        },
        "children": [
          {
            "query_type": "BytesFieldDataTermsQuery",
            "lucene": "BytesFieldDataTermsQuery::[size=8272]",
            "time": "13.79747100ms",
            "breakdown": {
              "score": 0,
              "create_weight": 14903,
              "next_doc": 168010,
              "match": 0,
              "build_scorer": 13614558,
              "advance": 0
            }
          }
        ]
      }
    ],
    "rewrite_time": 30804,
    "collector": [
      {
        "name": "MultiCollector",
        "reason": "search_multi",
        "time": "1.529236000ms",
        "children": [
          {
            "name": "TotalHitCountCollector",
            "reason": "search_count",
            "time": "0.08967800000ms"
          },
          {
            "name": "MaxAggregator: [MAX(adj_close)]",
            "reason": "aggregation",
            "time": "0.1675550000ms"
          }
        ]
      }
    ]
  }
]

The profile shows that the underlying mechanism is still a terms filter (BytesFieldDataTermsQuery). This also means that the join is only a pseudo join. A true join not only uses the first table to filter the second table, but can also reference the results of the first stage while computing the second query. A terms filter alone cannot do that. Of course, all of these join efforts exist only to make data maintenance easier. If we really demanded that Elasticsearch joins be as powerful as joins in traditional SQL, we could not expect such complex joins to be fast, and using Elasticsearch would lose its point. The two join methods above give us a reasonable trade-off between extreme speed and extreme flexibility.
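
The difference between a pseudo join and a true join can be illustrated with a small in-memory Python example. The tables and values are made up. The first computation needs only the set of matching keys, which is all a terms filter can carry; the second reads a column of the first table (sector) during aggregation, which a terms filter cannot express.

```python
# Hypothetical toy tables, for illustration only.
symbols = [
    {"symbol": "AAA", "sector": "Finance"},
    {"symbol": "BBB", "sector": "Finance"},
    {"symbol": "CCC", "sector": "Energy"},
]
quotes = [
    {"symbol": "AAA", "adj_close": 10.0},
    {"symbol": "BBB", "adj_close": 12.5},
    {"symbol": "CCC", "adj_close": 99.0},
]

# Pseudo join: the first table only decides which rows of the second table survive.
finance = set(s["symbol"] for s in symbols if s["sector"] == "Finance")
max_finance = max(q["adj_close"] for q in quotes if q["symbol"] in finance)

# True join: the second stage reads a column of the first table while aggregating.
sector_of = dict((s["symbol"], s["sector"]) for s in symbols)
max_by_sector = {}
for q in quotes:
    sector = sector_of[q["symbol"]]
    current = max_by_sector.get(sector, q["adj_close"])
    max_by_sector[sector] = max(current, q["adj_close"])

print(max_finance)
print(max_by_sector)
```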
