Comprehensively explain the principle and use of Nebula Graph indexing

This article was first published on the Nebula Graph Community public account

"Index not found"? Can't find the index? Why do I need to create a Nebula Graph index at all? When should Nebula Graph native indexes be used? In response to these common questions from the community, this article aims to explain index usage in one place.

Nebula Graph indexes are quite similar to indexes in traditional relational databases, but there are some confusing differences. Newcomers to Nebula often wonder:

  • What exactly an index is in the Nebula Graph database;
  • When Nebula Graph indexes should be used;
  • Whether Nebula Graph indexes affect write performance, and by how much.

In this article, we answer these questions one by one, so that everyone can use Nebula Graph better.

What exactly is a Nebula Graph index?

In short, a Nebula Graph index is used, and only used, for queries that start from pure property conditions. It has the following characteristics:

  • It is not required for filtering on property conditions inside graph walk (traversal) queries;
  • It must be created for queries that start from pure property conditions (note: except in the sampling case).

Queries that start from pure property conditions

We know that in a traditional relational database, an index is one or more copies of the table data, re-sorted on specific columns. It speeds up read queries that filter on those columns, at the cost of extra data writes. In short, indexes speed queries up, but queries do not require them.

In the Nebula Graph database, an index is a re-sorted copy of specific properties of vertices and edges, and it is what makes queries that start from pure property conditions possible.
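The idea of a "re-sorted copy" can be pictured with a small Python sketch. It is a toy model, not Nebula Graph's storage code: the `vertices` dict, the `name_index` list, and the `lookup_by_name` helper are hypothetical names invented for this illustration.

```python
import bisect

# Primary data: keyed by vertex ID (VID).
vertices = {
    "player100": {"name": "Tim Duncan", "age": 42},
    "player101": {"name": "Tony Parker", "age": 36},
    "player102": {"name": "LaMarcus Aldridge", "age": 33},
}

# The "index": a second copy of one property, sorted by the property value.
name_index = sorted((props["name"], vid) for vid, props in vertices.items())


def lookup_by_name(name):
    """Return the VIDs whose name equals `name`, via binary search on the sorted copy."""
    start = bisect.bisect_left(name_index, (name, ""))
    result = []
    for value, vid in name_index[start:]:
        if value != name:
            break  # sorted order: no further matches are possible
        result.append(vid)
    return result


start_vids = lookup_by_name("Tim Duncan")
```

Without `name_index`, the only way to answer "which vertex is named Tim Duncan" in this model would be to walk every entry of `vertices`.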

Take the following queries as examples. These statements obtain graph data starting from specified vertex or edge property conditions rather than from a vertex ID:

#### Queries that require a Nebula Graph index to exist

# query 0: a query starting from pure property conditions
LOOKUP ON tag1 WHERE tag1.col1 > 1 AND tag1.col2 == "foo" \
    YIELD tag1.col1 as col1, tag1.col3 as col3;

# query 1: a query starting from pure property conditions
MATCH (v:player { name: 'Tim Duncan' })-->(v2:player) \
        RETURN v2.player.name AS Name;

Both queries above start from pure property conditions: they literally "obtain the vertices or edges themselves according to the specified property conditions". The counterexample is a query that is given vertex IDs. Refer to the following examples:

#### Queries that are not based on an index

# query 2: a walk query from a given vertex, VID: "player100"

GO FROM "player100" OVER follow REVERSELY \
        YIELD src(edge) AS id | \
    GO FROM $-.id OVER serve \
        WHERE properties($^).age > 20 \
        YIELD properties($^).name AS FriendOf, properties($$).name AS Team;

# query 3: a walk query from given vertices, VID: "player101" or "player102"

MATCH (v:player { name: 'Tim Duncan' })--(v2) \
        WHERE id(v2) IN ["player101", "player102"] \
        RETURN v2.player.name AS Name;

Let's take a closer look at query 1 and query 3 above. Both statements contain the filter condition { name: 'Tim Duncan' } on the tag player, yet one of them has to rely on an index and the other does not. The specific reasons are:

  • query 3 does not need an index because:
    • It bypasses (v:player { name: 'Tim Duncan' }), which has no given VID, and starts from v2, whose VIDs ["player101", "player102"] are given. It expands outward with GetNeighbors(), then uses GetVertices() to fetch the next hop, the vertex v at the other end of each edge, and filters the data on v.player.name;
  • query 1 is different because it has no given starting VID at all:
    • It can only start from { name: 'Tim Duncan' }, so it first finds the matching vertices in the index data sorted by name: IndexScan() gets v;
    • Then it calls GetNeighbors() from v to get v2 at the other end of the edge, and fetches the data in v2.

In fact, the key point is whether the query is given a vertex ID (VID). The following two execution plans show the difference more clearly:
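The two strategies above can be sketched in Python. This is a simplified model under stated assumptions, not Nebula Graph's executor: `index_scan`, `get_neighbors`, and `get_vertices` are hypothetical stand-ins that only borrow the operator names from the text, the sample edges are made up, and each function returns one name per matched v2 rather than one row per path.

```python
import bisect

VERTICES = {
    "player100": {"name": "Tim Duncan"},
    "player101": {"name": "Tony Parker"},
    "player102": {"name": "LaMarcus Aldridge"},
    "player125": {"name": "Manu Ginobili"},
}
EDGES = [  # (src, dst), invented sample data
    ("player100", "player101"),
    ("player100", "player125"),
    ("player101", "player100"),
    ("player102", "player100"),
    ("player102", "player101"),
]
# Sorted copy of the name property: the only structure that can answer
# "which VID has this name" without touching every vertex.
NAME_INDEX = sorted((p["name"], vid) for vid, p in VERTICES.items())


def index_scan(name):
    lo = bisect.bisect_left(NAME_INDEX, (name, ""))
    hi = bisect.bisect_right(NAME_INDEX, (name, chr(0x10FFFF)))
    return [vid for _, vid in NAME_INDEX[lo:hi]]


def get_neighbors(vid, both_directions=False):
    # A real store would do a range read keyed by VID; a list filter keeps the toy short.
    found = [dst for src, dst in EDGES if src == vid]
    if both_directions:
        found += [src for src, dst in EDGES if dst == vid]
    return found


def get_vertices(vids):
    return {vid: VERTICES[vid] for vid in vids}


def plan_query1(name):
    """No VID given: IndexScan -> GetNeighbors -> GetVertices."""
    names = []
    for v in index_scan(name):
        for props in get_vertices(get_neighbors(v)).values():
            names.append(props["name"])
    return names


def plan_query3(given_vids, name):
    """VIDs given for v2: GetNeighbors -> GetVertices -> filter on name, no index."""
    names = []
    for v2 in given_vids:
        candidates = get_vertices(get_neighbors(v2, both_directions=True))
        if any(p["name"] == name for p in candidates.values()):
            names.append(VERTICES[v2]["name"])
    return names
```

Note that `plan_query3` never touches `NAME_INDEX`: the name condition is applied as a plain filter on vertices that were reached by expansion.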

[Figure: query-based-on-index]

Legend: The execution plan of query 1 (requires an index);

[Figure: query-requires-no-index]

Legend: The execution plan of query 3 (no index is required);

Why must an index be used in a query based on pure property conditions?

Because Nebula Graph's storage structure is designed for distribution and for relationships between data. A full-scan conditional search without an index, of the kind a table-structured database allows, is actually even more expensive here, so it is intentionally prohibited by design.

However, things are different if you do not need all the data and only want to sample part of it. Since version 3.0, LIMIT <n> is supported for this case, and the following query (with LIMIT) does not need an index:

MATCH (v:player { name: 'Tim Duncan' })-->(v2:player) \
    RETURN v2.player.name AS Name LIMIT 3;

Why do only queries based on pure property conditions need an index?

Here we compare how ordinary graph queries (graph-queries) and queries based on pure property conditions (pure-prop-condition queries) are implemented:

  • graph-queries, such as query 2 and query 3, walk outward along edges to find paths that satisfy specific conditions;
  • pure-prop-condition queries, such as query 0 and query 1, find the vertices and edges that satisfy certain property conditions (or no conditions at all), using those conditions alone.

In Nebula Graph, the raw graph data that graph-queries expand over is already sorted by VID (both vertices and edges); in other words, the data itself is already indexed. This sorting gives contiguous storage (physical adjacency), so the expansion walk itself is optimized and returns results quickly.
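A minimal Python sketch of why VID-sorted storage makes expansion cheap, assuming a much simplified key of the form (source VID, edge type, destination VID). The real key layout carries more fields; `edge_keys` and `out_edges` are hypothetical names and the sample edges are invented.

```python
import bisect

# Toy key-value store: edge keys kept in sorted order, so every edge
# leaving one vertex sits next to the others.
edge_keys = sorted([
    ("player100", "follow", "player101"),
    ("player100", "follow", "player125"),
    ("player100", "serve", "team204"),
    ("player101", "follow", "player100"),
    ("player102", "serve", "team203"),
])


def out_edges(vid):
    """Return the contiguous slice of keys whose first component is `vid`."""
    lo = bisect.bisect_left(edge_keys, (vid,))
    hi = bisect.bisect_left(edge_keys, (vid + "\x00",))
    return edge_keys[lo:hi]


edges_of_duncan = out_edges("player100")
```

Expanding from a known VID is therefore a single range read, with no extra index involved.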

Summary: What is an index and what is an index not?

What is an index?

  • A Nebula Graph index is a sorted copy of property data, kept so that vertices and edges can be found from given property conditions. It makes this read pattern possible at the cost of extra writes.

What is an index not?

  • Nebula Graph indexes are not used to speed up general graph queries: queries that expand outward from a vertex (even when they filter on property conditions) do not rely on native indexes, because Nebula's own data storage is already sorted and optimized for this kind of query.

Some design details of Nebula Graph indexes

To better understand the limitations, costs, and capabilities of indexes, let's look at some more details.

  • Nebula Graph indexes are stored locally (not separately or centrally) and are sharded together with the vertex data.
  • They only support left matching
    • because the underlying layer is a RocksDB prefix scan;
  • Performance cost:
    • Write path: not only is an extra copy of the data written, but an expensive read is also performed to ensure consistency;
    • Read path: rule-based optimization selects the index, and the request fans out to all StorageD instances.
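The write-path cost can be illustrated with a toy Python model. It assumes a single indexed property and counts operations naively; `ToyStore` and its counters are hypothetical and say nothing about Nebula Graph's actual numbers.

```python
class ToyStore:
    """Primary data plus one index, with naive operation counters."""

    def __init__(self):
        self.data = {}      # vid -> property value (primary data)
        self.index = set()  # (property value, vid) entries (index data)
        self.reads = 0
        self.writes = 0

    def upsert(self, vid, value):
        # Extra read: the old value is needed to locate the stale index entry.
        self.reads += 1
        old = self.data.get(vid)
        if old is not None:
            self.index.discard((old, vid))
            self.writes += 1  # remove the stale index entry
        self.data[vid] = value
        self.index.add((value, vid))
        self.writes += 2      # one data write plus one index write


store = ToyStore()
store.upsert("player100", "Tim")
store.upsert("player100", "Tim Duncan")
```

Without the read-before-write step, the stale entry ("Tim", "player100") would remain in the index and the index would no longer agree with the data.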

This information can be found under #handdrawingsandvideos on my personal website (link: https://www.siwei.io/sketch-notes/ ); refer to the following picture:

Because of the left-matching design, the native index cannot serve more complex scenarios, such as queries based on pure property conditions that involve wildcards or REGEXP. For these, Nebula Graph provides full-text indexing, which uses a Raft Listener to asynchronously write data to an external Elasticsearch cluster and queries ES at query time. For details on using the full-text index, see the documentation: https://docs.nebula-graph.com.cn/3.0.0/4.deployment-and-installation/6.deploy-text-based-index/2.deploy-es/ .

In this hand drawing, we can also see that:

  • Write path
    • Writing index data is a synchronous operation;
  • Read path
    • This part illustrates an example of RBO (rule-based optimization). The rules assume that an equality condition on col2 matched from the left performs better than the comparison condition on col1, so the second index is selected;
    • After the index is selected, the index scan request is fanned out to the storage nodes, and some operations, such as filter conditions and TopN, can be pushed down.
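The kind of rule described above can be sketched in Python as follows. This is an illustrative assumption, not Nebula Graph's actual optimizer rules: `choose_index`, the index names, and the scoring (prefer the longest leftmost run of equality conditions) are invented for the example.

```python
def choose_index(indexes, conditions):
    """Pick an index by a simple rule.

    indexes:    index name -> tuple of indexed columns, in index order
    conditions: column -> "eq" or "range"
    """
    def score(columns):
        eq_prefix = 0
        for col in columns:
            if conditions.get(col) == "eq":
                eq_prefix += 1
            else:
                break
        # A range condition helps only on the column right after the equality prefix.
        has_range = (
            eq_prefix < len(columns)
            and conditions.get(columns[eq_prefix]) == "range"
        )
        return (eq_prefix, has_range)

    usable = {}
    for name, columns in indexes.items():
        s = score(columns)
        if s != (0, False):  # the index must match at least its leftmost column
            usable[name] = s
    if not usable:
        return None
    return max(usable, key=usable.get)


# Conditions of query 0: col1 > 1 AND col2 == "foo"
chosen = choose_index(
    {"index_on_col1": ("col1",), "index_on_col2": ("col2",)},
    {"col1": "range", "col2": "eq"},
)
```

Under this rule the index on col2 wins, because an equality match on the leftmost column narrows the scan more than a range match does.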

In conclusion:

  • Because of the write cost, create an index only when one is really required. If a sampling query meets the read requirement, LIMIT <n> can be used without creating an index.
  • Indexes are restricted to left matching:
    • The property order of the index must be designed carefully to match the queries;
    • Sometimes a full-text index is needed instead.

Use of indexes

For details, please refer to the official Nebula index documentation: https://docs.nebula-graph.io/3.0.0/3.ngql-guide/14.native-index-statements/ . Some key points are:

First, create the index on the Tag or EdgeType that owns the properties you want to use as conditions for looking up vertices and edges, using the CREATE INDEX statement (CREATE TAG INDEX or CREATE EDGE INDEX);

Second, once an index has been created, index data for newly written data is written synchronously. However, for vertex and edge data that already existed before the index was created, the index has to be built explicitly. This is an asynchronous job, started by executing the REBUILD INDEX statement (REBUILD TAG INDEX or REBUILD EDGE INDEX);

Third, after REBUILD INDEX has been issued, the job status can be queried with the SHOW INDEX STATUS statement (SHOW TAG INDEX STATUS or SHOW EDGE INDEX STATUS);

Fourth, a query that uses an index can be a LOOKUP, which can often be extended with the pipe character to expand the query further. Refer to the following example:

LOOKUP ON player \
    WHERE player.name == "Kobe Bryant"\
    YIELD id(vertex) AS VertexID, properties(vertex).name AS name |\
    GO FROM $-.VertexID OVER serve \
    YIELD $-.name, properties(edge).start_year, properties(edge).end_year, properties($$).name;

It can also be a MATCH. Here v is obtained through the index, while v2 is obtained by expanding in the data (non-index) part.

MATCH (v:player{name:"Tim Duncan"})--(v2:player) \
    RETURN v2.player.name AS Name;

Fifth, the capabilities and limitations of composite indexes. A native index on more than one property is called a composite index. Knowing that native indexes match from the left helps us understand what a composite index can and cannot do. Here are a few conclusions:

  • A composite index on multiple properties is order-dependent
    • For example, suppose we create two dual-property composite indexes, index_a:(isRisky: bool, age: int) and index_b:(age: int, isRisky: bool). When querying with the filter condition WHERE n.user.isRisky == true AND n.user.age > 18, index_a is clearly more efficient, because its leftmost property is matched by an equality condition on a short field.
  • Only filter conditions on a leftmost-matching proper subset of the composite index's properties can be served
    • For example, with index_a:(isRisky: bool, age: int) and index_b:(age: int, isRisky: bool), the query WHERE n.user.age > 18 can only be served by index_b, whose leftmost property matches.
  • Some scenarios that look up vertices and edges starting from properties, such as full-text search, cannot be served by the native index. In these cases, consider the Nebula full-text index, which is out-of-the-box support for an external Elasticsearch provided by the Nebula community. Once configured, data covered by a full-text index is asynchronously updated to the Elasticsearch cluster through the Raft listener. The query entry point of the full-text index is also LOOKUP. Please refer to the documentation for details: https://docs.nebula-graph.com.cn/3.0.1/4.deployment-and-installation/6.deploy-text-based-index/2.deploy-es/ .
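The left-matching behavior of the two composite indexes can be sketched in Python with sorted tuples. This is a toy model: the `users` sample data is invented, and sorted Python lists only stand in for the ordered key space of the real index.

```python
import bisect

# vid -> (isRisky, age); sample data invented for illustration.
users = {
    "u1": (True, 17),
    "u2": (False, 30),
    "u3": (True, 25),
    "u4": (False, 12),
    "u5": (True, 40),
}

index_a = sorted((risky, age, vid) for vid, (risky, age) in users.items())
index_b = sorted((age, risky, vid) for vid, (risky, age) in users.items())

MAX_VID = chr(0x10FFFF)  # sorts after every VID used here

# WHERE isRisky == true AND age > 18: one contiguous range of index_a.
lo_a = bisect.bisect_right(index_a, (True, 18, MAX_VID))
hi_a = bisect.bisect_right(index_a, (True, float("inf")))
risky_adults = [vid for _, _, vid in index_a[lo_a:hi_a]]

# WHERE age > 18: one contiguous range of index_b ...
lo_b = bisect.bisect_right(index_b, (18, True, MAX_VID))
adults = [vid for _, _, vid in index_b[lo_b:]]

# ... but in index_a the matching rows are scattered, so a range scan cannot serve it.
positions_in_a = [i for i, (_, age, _) in enumerate(index_a) if age > 18]
scattered = positions_in_a != list(range(positions_in_a[0], positions_in_a[-1] + 1))
```

In this sample, index_a answers the two-condition query by reading exactly the two matching rows, while index_b would have to read all three rows with age > 18 and then filter on isRisky.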

Review

  • When only property conditions are given, a Nebula Graph index finds vertices and edges by scanning a sorted copy of the properties;
  • Nebula Graph indexes are not used for graph expansion queries;
  • Nebula Graph indexes match from the left and are not meant for fuzzy full-text search;
  • Nebula Graph indexes carry a performance cost on writes;
  • Remember to rebuild the index if the corresponding vertices or edges already had data before the Nebula Graph index was created.

Happy Graphing!



Origin my.oschina.net/u/4169309/blog/5506218