An introduction to the R-tree algorithm


Introduction

Well-known tree structures include the B-tree and the B+ tree. While studying recently I came across another tree structure, the R-tree, which can be seen as an extension of the B-tree to multi-dimensional space. Because of their linear ordering, B-trees and B+ trees are mostly used to index one-dimensional data: larger keys go to one side and smaller keys to the other, but the comparison happens along a single dimension only. A B-tree is a balanced tree that divides a one-dimensional line into segments. To find a point that meets some requirement, we only need to find the segment it belongs to. The idea is to start with a large space, narrow the search space step by step, and finally find the answer inside a smallest region that is not divided any further, whose size we choose ourselves. A typical B-tree lookup looks like this:

To find a point that satisfies a condition, first find the line segment that satisfies it, then scan the points on that segment for the answer. The B-tree is a relatively complex data structure, especially in its insertion and deletion operations, because they involve splitting and merging nodes.
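As a rough illustration of that narrowing step, here is a minimal sketch of a single level of such a lookup. It is not a full B-tree: the `separators` and `segments` values are made-up example data, and `lookup` is a hypothetical helper name.

```python
from bisect import bisect_right

# Hypothetical separator keys that cut the number line into four segments:
# (-inf, 10), [10, 20), [20, 30), [30, +inf)
separators = [10, 20, 30]
# One sorted bucket of stored keys per segment (illustrative data only).
segments = [[1, 4, 7], [10, 15], [21, 22, 28], [30, 42]]

def lookup(key):
    """Return True if key is stored: pick the segment first, then scan it."""
    i = bisect_right(separators, key)  # index of the segment the key falls in
    return key in segments[i]          # only this one segment is scanned

found = lookup(15)
missing = lookup(20)
```

A real B-tree repeats this choice at every level, so each step discards all segments but one.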

R-tree

The B-tree handles low-dimensional data (usually one-dimensional, that is, a single dimension used for comparison), while the R-tree is a good solution to search problems in multi-dimensional space. It carries the B-tree idea over to multiple dimensions: where a B-tree partitions a one-dimensional line into segments, an R-tree partitions two-dimensional or higher-dimensional space into rectangles. During insertion and deletion it merges and splits nodes to keep the tree balanced. An R-tree is therefore a balanced tree for storing multi-dimensional data.

Basic structure

An R-tree is a height-balanced tree whose leaf nodes hold index records containing pointers to data objects. A leaf node contains index record entries of the form (I, tuple identifier), where the tuple identifier refers to the corresponding data and I is an n-dimensional rectangle, the bounding box of the indexed spatial object: $I = (I_1, I_2, \ldots, I_n)$. Here n is the number of dimensions and $I_i$ is a closed bounded interval $[a, b]$ describing the extent of the object along dimension i. Alternatively, $I_i$ may have one or both endpoints equal to infinity, indicating that the object extends outward indefinitely.

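Non-leaf nodes of an R-tree contain entries of the form $(I, \text{child-pointer})$, where child-pointer is the address of a lower-level node in the R-tree and I covers all rectangles in that lower node's entries. Put simply, every node holds several children (or data records, if the node is a leaf), and each entry carries a multi-dimensional rectangle I, the minimum bounding rectangle of everything beneath it.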
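To make the two entry forms concrete, here is a minimal sketch restricted to two dimensions. The names `Rect`, `Entry`, `Node`, `mbr` and `cover` are my own choices for illustration, and a rectangle is assumed to be stored as `(xmin, ymin, xmax, ymax)`.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax), 2-D only

def mbr(rects):
    """Minimum bounding rectangle of a non-empty list of rectangles."""
    return (min(r[0] for r in rects), min(r[1] for r in rects),
            max(r[2] for r in rects), max(r[3] for r in rects))

@dataclass
class Entry:
    rect: Rect                       # I: the bounding rectangle
    child: Optional["Node"] = None   # child-pointer (non-leaf entries only)
    tuple_id: Optional[int] = None   # tuple identifier (leaf entries only)

@dataclass
class Node:
    leaf: bool
    entries: List[Entry] = field(default_factory=list)

    def cover(self) -> Rect:
        """Rectangle that tightly encloses every entry of this node."""
        return mbr([e.rect for e in self.entries])

# A leaf with two data rectangles, and a root entry pointing at it.
leaf = Node(leaf=True, entries=[Entry((0, 0, 2, 2), tuple_id=1),
                                Entry((1, 1, 5, 3), tuple_id=2)])
root = Node(leaf=False, entries=[Entry(leaf.cover(), child=leaf)])
```

The rectangle stored in the root entry is computed from the leaf rather than stored independently, which is exactly the "I covers all rectangles in the lower node" relationship.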

A simple R-tree structure is shown in the figure below:

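Properties

An R-tree has two important parameters, M and m. M is the maximum number of entries in a node, and m, with m ≤ M/2, is the minimum number of entries in a node.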

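An R-tree then has the following properties:

Every leaf node contains between m and M index records, unless it is the root.
For each index record (I, tuple identifier) in a leaf node, I is the smallest rectangle that spatially contains the n-dimensional data object represented by the indicated tuple.
Every non-leaf node has between m and M children, unless it is the root.
For each entry (I, child-pointer) in a non-leaf node, I is the smallest rectangle that spatially contains the rectangles in the child node.
The root node has at least two children, unless it is a leaf.
All leaves appear on the same level.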
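These properties can be turned into a small consistency check. The sketch below is an assumption-laden illustration: nodes are plain objects holding `(rect, child_or_tuple_id)` pairs, rectangles are 2-D `(xmin, ymin, xmax, ymax)` tuples, and `check` is a hypothetical helper. It cannot verify the second property, because that requires the original data objects.

```python
def mbr(rects):
    """Minimum bounding rectangle of a non-empty list of rectangles."""
    return (min(r[0] for r in rects), min(r[1] for r in rects),
            max(r[2] for r in rects), max(r[3] for r in rects))

class Node:
    def __init__(self, leaf, entries):
        self.leaf = leaf
        self.entries = entries  # list of (rect, child node or tuple id)

def check(node, m, M, is_root=True):
    """Return the height of node; raise AssertionError on a violated property."""
    n = len(node.entries)
    if is_root:
        assert n <= M, "root holds more than M entries"
        assert node.leaf or n >= 2, "non-leaf root needs at least two children"
    else:
        assert m <= n <= M, "node must hold between m and M entries"
    if node.leaf:
        return 0
    heights = set()
    for rect, child in node.entries:
        assert rect == mbr([r for r, _ in child.entries]), "rectangle is not tight"
        heights.add(check(child, m, M, is_root=False))
    assert len(heights) == 1, "leaves are on different levels"
    return heights.pop() + 1

a = Node(True, [((0, 0, 1, 1), 1), ((2, 2, 3, 3), 2)])
b = Node(True, [((5, 5, 6, 6), 3), ((7, 5, 8, 9), 4)])
root = Node(False, [((0, 0, 3, 3), a), ((5, 5, 8, 9), b)])
height = check(root, m=2, M=4)
```

Running such a check after every update is a cheap way to catch bugs while experimenting with the algorithms that follow.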

R-tree search and update

Here EI denotes the rectangle of an index entry E, and EP denotes its tuple identifier or child-pointer.

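Algorithm Search: given an R-tree whose root node is T and a search rectangle S, find all index records whose rectangles overlap S.

[Search subtrees] If T is not a leaf, check each entry E to determine whether EI overlaps S. For every overlapping entry, invoke Search on the subtree whose root node is pointed to by EP.
[Search leaf node] If T is a leaf, check each entry E to determine whether EI overlaps S. If so, E is a qualifying record.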
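The two search steps can be sketched as one short recursive function. This is only an illustration under my own assumptions: rectangles are 2-D `(xmin, ymin, xmax, ymax)` tuples, a match means that the rectangles overlap (touching edges count), and `Node`, `intersects` and `search` are hypothetical names.

```python
def intersects(a, b):
    """True if rectangles a and b share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

class Node:
    def __init__(self, leaf, entries):
        self.leaf = leaf
        self.entries = entries  # list of (rect, child node or tuple id)

def search(t, s):
    """Return the tuple ids of all records under t whose rectangle overlaps s."""
    hits = []
    for rect, ptr in t.entries:
        if intersects(rect, s):
            if t.leaf:
                hits.append(ptr)              # ptr is a tuple identifier
            else:
                hits.extend(search(ptr, s))   # ptr is a child node
    return hits

a = Node(True, [((0, 0, 1, 1), 1), ((2, 2, 3, 3), 2)])
b = Node(True, [((5, 5, 6, 6), 3), ((7, 5, 8, 9), 4)])
root = Node(False, [((0, 0, 3, 3), a), ((5, 5, 8, 9), b)])
result = search(root, (2.5, 2.5, 5.5, 5.5))
```

Unlike a B-tree lookup, more than one subtree may have to be visited, because the covering rectangles of sibling nodes are allowed to overlap.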

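Algorithm Insert: insert a new index entry E into an R-tree.

[Find position for new record] Invoke Choose Leaf to select a leaf node L in which to place E.
[Add record to leaf node] If L has room for another entry, add E. Otherwise invoke Split Node to obtain L and LL, which together contain E and all the old entries of L.
[Propagate changes upward] Invoke Adjust Tree on L, also passing LL if a split was performed.
[Grow tree taller] If node split propagation caused the root to split, create a new root whose children are the two resulting nodes.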

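Algorithm Choose Leaf: select a leaf node in which to place a new index entry E.

[Initialize] Set N to be the root node.
[Leaf check] If N is a leaf, return N.
[Choose subtree] If N is not a leaf, let F be the entry in N whose rectangle FI needs the least enlargement to include EI. Resolve ties by choosing the entry with the rectangle of smallest area.
[Descend until a leaf is reached] Set N to be the child node pointed to by FP and repeat from step 2.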
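A minimal sketch of this descent, assuming 2-D rectangles stored as `(xmin, ymin, xmax, ymax)` and nodes that hold `(rect, child_or_tuple_id)` pairs. The helper names (`area`, `union`, `enlargement`, `choose_leaf`) are mine, not part of any library.

```python
def area(r):
    return (r[2] - r[0]) * (r[3] - r[1])

def union(a, b):
    """Smallest rectangle containing both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))

def enlargement(rect, new):
    """Extra area rect needs in order to also include new."""
    return area(union(rect, new)) - area(rect)

class Node:
    def __init__(self, leaf, entries):
        self.leaf = leaf
        self.entries = entries  # list of (rect, child node or tuple id)

def choose_leaf(root, ei):
    n = root
    while not n.leaf:
        # Least enlargement first; ties are broken by the smaller area.
        _, n = min(n.entries, key=lambda e: (enlargement(e[0], ei), area(e[0])))
    return n

a = Node(True, [((0, 0, 1, 1), 1), ((2, 2, 3, 3), 2)])
b = Node(True, [((5, 5, 6, 6), 3), ((7, 5, 8, 9), 4)])
root = Node(False, [((0, 0, 3, 3), a), ((5, 5, 8, 9), b)])
target = choose_leaf(root, (3.5, 3.5, 4, 4))
```

For the new rectangle `(3.5, 3.5, 4, 4)`, the first subtree would grow by 7 units of area and the second by 12.75, so the first leaf is chosen.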

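Algorithm Adjust Tree: ascend from a leaf node L to the root, adjusting covering rectangles and propagating node splits as necessary.

[Initialize] Set N = L. If L was split previously, set NN to be the resulting second node.
[Check if done] If N is the root, stop.
[Adjust covering rectangle in parent entry] Let P be the parent node of N, and let EN be N's entry in P. Adjust ENI so that it tightly encloses all entry rectangles in N.
[Propagate node split upward] If N has a partner NN resulting from an earlier split, create a new entry ENN with ENNP pointing to NN and ENNI enclosing all rectangles in NN. Add ENN to P if there is room. Otherwise invoke Split Node to produce P and PP, which together contain ENN and all of P's old entries.
[Move up to next level] Set N = P, and set NN = PP if a split occurred. Repeat from step 2.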
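Insert, Choose Leaf and Adjust Tree fit together roughly as in the sketch below. Several things here are assumptions of mine rather than part of the algorithm as described: rectangles are 2-D `(xmin, ymin, xmax, ymax)` tuples, `M = 4`, nodes keep a `parent` reference, and `naive_split` is a deliberately simple placeholder for Split Node (it sorts by `xmin` and cuts the entries in half).

```python
M = 4  # assumed maximum number of entries per node

def mbr(rects):
    return (min(r[0] for r in rects), min(r[1] for r in rects),
            max(r[2] for r in rects), max(r[3] for r in rects))

def area(r):
    return (r[2] - r[0]) * (r[3] - r[1])

class Node:
    def __init__(self, leaf, entries=None, parent=None):
        self.leaf = leaf
        self.entries = entries if entries is not None else []  # (rect, child or id)
        self.parent = parent

    def cover(self):
        return mbr([rect for rect, _ in self.entries])

def naive_split(node):
    """Placeholder for Split Node: sort by xmin and cut the entries in half."""
    ordered = sorted(node.entries, key=lambda e: e[0][0])
    half = len(ordered) // 2
    node.entries = ordered[:half]
    sibling = Node(node.leaf, ordered[half:], node.parent)
    if not sibling.leaf:
        for _, child in sibling.entries:
            child.parent = sibling  # moved children get their new parent
    return sibling

def choose_leaf(root, rect):
    n = root
    while not n.leaf:
        _, n = min(n.entries,
                   key=lambda e: (area(mbr([e[0], rect])) - area(e[0]), area(e[0])))
    return n

def adjust_tree(n, nn):
    """Ascend from n to the root; return (root, node split off the root or None)."""
    while n.parent is not None:
        p = n.parent
        i = next(k for k, (_, child) in enumerate(p.entries) if child is n)
        p.entries[i] = (n.cover(), n)            # tighten N's rectangle in P
        pp = None
        if nn is not None:
            nn.parent = p
            p.entries.append((nn.cover(), nn))   # add the entry for NN
            if len(p.entries) > M:
                pp = naive_split(p)              # no room: split P as well
        n, nn = p, pp
    return n, nn

def insert(root, rect, tuple_id):
    """Insert one record and return the (possibly new) root."""
    leaf = choose_leaf(root, rect)
    leaf.entries.append((rect, tuple_id))
    ll = naive_split(leaf) if len(leaf.entries) > M else None
    top, split = adjust_tree(leaf, ll)
    if split is None:
        return top
    new_root = Node(False, [(top.cover(), top), (split.cover(), split)])
    top.parent = split.parent = new_root         # the tree grows one level
    return new_root

def leaf_depths(node, depth=0):
    """Set of depths at which leaves occur (one value if the tree is balanced)."""
    if node.leaf:
        return {depth}
    return set().union(*(leaf_depths(c, depth + 1) for _, c in node.entries))

def all_ids(node):
    if node.leaf:
        return [tid for _, tid in node.entries]
    return [tid for _, c in node.entries for tid in all_ids(c)]

root = Node(leaf=True)
for i in range(20):
    root = insert(root, (i, i, i + 1, i + 1), i)
```

Because the tree only grows by adding a new root, every leaf stays at the same depth, which `leaf_depths` can be used to confirm. Swapping `naive_split` for a better split strategy does not change the rest of the code.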

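Algorithm Delete: remove an index record E from an R-tree.

[Find node containing record] Invoke Find Leaf to locate the leaf node L containing E. Stop if the record is not found.
[Delete record] Remove E from L.
[Propagate changes] Invoke Condense Tree, passing L.
[Shorten tree] If the root node has only one child after the tree has been adjusted, make that child the new root.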

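Algorithm Find Leaf: given an R-tree whose root node is T, find the leaf node containing the index entry E.

[Search subtrees] If T is not a leaf, check each entry F in T to determine whether FI covers EI. For each such entry, invoke Find Leaf on the tree whose root is pointed to by FP, until E is found or all entries have been checked.
[Search leaf node for record] If T is a leaf, check each entry to see whether it matches E. If E is found, return T.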
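A minimal sketch of Find Leaf, again assuming 2-D `(xmin, ymin, xmax, ymax)` rectangles and nodes that hold `(rect, child_or_tuple_id)` pairs. The names `covers` and `find_leaf` are hypothetical, and an entry is assumed to match when both its rectangle and its tuple identifier are equal.

```python
def covers(outer, inner):
    """True if rectangle outer completely contains rectangle inner."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])

class Node:
    def __init__(self, leaf, entries):
        self.leaf = leaf
        self.entries = entries  # list of (rect, child node or tuple id)

def find_leaf(t, rect, tuple_id):
    """Return the leaf holding the entry (rect, tuple_id), or None."""
    if t.leaf:
        return t if (rect, tuple_id) in t.entries else None
    for f_rect, child in t.entries:
        if covers(f_rect, rect):
            found = find_leaf(child, rect, tuple_id)
            if found is not None:
                return found  # stop as soon as the entry is located
    return None

a = Node(True, [((0, 0, 1, 1), 1), ((2, 2, 3, 3), 2)])
b = Node(True, [((5, 5, 6, 6), 3), ((7, 5, 8, 9), 4)])
root = Node(False, [((0, 0, 3, 3), a), ((5, 5, 8, 9), b)])
holder = find_leaf(root, (5, 5, 6, 6), 3)
```

Several subtrees may cover the same rectangle, so the loop keeps trying further candidates when one of them comes back empty.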

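Algorithm Condense Tree: given a leaf node L from which an entry has been deleted, eliminate the node if it has too few entries and relocate its entries. Propagate node elimination upward as necessary. Adjust all covering rectangles on the path to the root, making them smaller if possible.

[Initialize] Set N = L. Set Q, the set of eliminated nodes, to be empty.
[Find parent entry] If N is the root, go to step 6. Otherwise let P be the parent of N, and let EN be N's entry in P.
[Eliminate under-full node] If N has fewer than m entries, delete EN from P and add N to Q.
[Adjust covering rectangle] If N has not been eliminated, adjust ENI to tightly contain all entries in N.
[Move up one level] Set N = P and repeat from step 2.
[Re-insert orphaned entries] Re-insert all entries of the nodes in Q. Entries from eliminated leaf nodes are re-inserted into leaves of the tree as described in algorithm Insert, but entries from higher-level nodes must be placed higher in the tree, so that the leaves of their dependent subtrees end up on the same level as the leaves of the main tree.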
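The upward pass of Condense Tree can be sketched as follows. Note the simplification, which is mine: instead of re-inserting higher-level entries at their original level, this version flattens every eliminated node into its leaf-level records and returns them, leaving re-insertion to the caller. It also assumes `m = 2`, 2-D `(xmin, ymin, xmax, ymax)` rectangles, and nodes with a `parent` reference.

```python
m = 2  # assumed minimum number of entries per node

def mbr(rects):
    return (min(r[0] for r in rects), min(r[1] for r in rects),
            max(r[2] for r in rects), max(r[3] for r in rects))

class Node:
    def __init__(self, leaf, entries, parent=None):
        self.leaf = leaf
        self.entries = entries  # list of (rect, child node or tuple id)
        self.parent = parent

    def cover(self):
        return mbr([rect for rect, _ in self.entries])

def leaf_entries(node):
    """All data records stored in the subtree rooted at node."""
    if node.leaf:
        return list(node.entries)
    return [e for _, child in node.entries for e in leaf_entries(child)]

def condense_tree(leaf):
    """Walk from leaf up to the root; return (root, orphaned data records)."""
    n, eliminated = leaf, []
    while n.parent is not None:
        p = n.parent
        i = next(k for k, (_, child) in enumerate(p.entries) if child is n)
        if len(n.entries) < m:
            del p.entries[i]                 # eliminate the under-full node
            eliminated.append(n)
        else:
            p.entries[i] = (n.cover(), n)    # shrink the covering rectangle
        n = p
    orphans = [e for node in eliminated for e in leaf_entries(node)]
    return n, orphans

a = Node(True, [((0, 0, 1, 1), 1), ((2, 2, 3, 3), 2)])
b = Node(True, [((5, 5, 6, 6), 3), ((7, 5, 8, 9), 4), ((9, 9, 10, 10), 5)])
c = Node(True, [((20, 0, 21, 1), 6), ((22, 2, 23, 3), 7)])
root = Node(False, [(node.cover(), node) for node in (a, b, c)])
for node in (a, b, c):
    node.parent = root

b.entries.pop()                      # delete record 5: b keeps enough entries
_, orphans_b = condense_tree(b)
a.entries.pop(0)                     # delete record 1: a becomes under-full
_, orphans_a = condense_tree(a)
```

After the second call the root has lost its entry for `a`, and the surviving record `2` is handed back for re-insertion. A full Delete would then re-insert the orphans and shorten the tree if the root were left with a single child.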

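Node splitting

To add a new entry to a full node that already contains M entries, the set of M+1 entries must be divided between two nodes. The division should make it as unlikely as possible that both new nodes need to be examined in subsequent searches. Because the decision to visit a node depends on whether its covering rectangle overlaps the search area, the total area of the two covering rectangles after a split should be minimized. The figure below illustrates this point: the covering rectangles of a bad split have a much larger area than those of the best split.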
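The effect of a good versus a bad split can also be checked numerically. The four rectangles below are made-up example data in `(xmin, ymin, xmax, ymax)` form, and `total_cover_area` is a hypothetical helper.

```python
def area(r):
    return (r[2] - r[0]) * (r[3] - r[1])

def mbr(rects):
    return (min(r[0] for r in rects), min(r[1] for r in rects),
            max(r[2] for r in rects), max(r[3] for r in rects))

def total_cover_area(group1, group2):
    """Sum of the areas of the two covering rectangles after a split."""
    return area(mbr(group1)) + area(mbr(group2))

# Two unit squares near the origin and two near (9, 9).
rects = [(0, 0, 1, 1), (1, 0, 2, 1), (8, 8, 9, 9), (9, 8, 10, 9)]

good = total_cover_area(rects[:2], rects[2:])                           # neighbours together
bad = total_cover_area([rects[0], rects[2]], [rects[1], rects[3]])      # far-apart pairs
```

Grouping neighbours gives a total covering area of 4, while pairing far-apart rectangles gives 162, so almost any search would have to visit both nodes of the bad split.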

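Quadratic-cost algorithm

This algorithm attempts to find a small-area split but is not guaranteed to find the one with the smallest possible area. Its cost is quadratic in M and linear in the number of dimensions. The algorithm first picks two of the M+1 entries to be the first members of the two new groups, choosing the pair that would waste the most area if both were put in the same group, that is, the pair for which the area of a rectangle covering both entries minus the areas of the entries themselves is greatest. The remaining entries are then assigned to groups one at a time. At each step the area expansion required to add each remaining entry to each group is calculated, and the entry assigned is the one showing the greatest difference between the two groups.

Algorithm Quadratic Split: divide a set of M+1 index entries into two groups.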

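[Pick first entry for each group] Invoke algorithm Pick Seeds to choose two entries to be the first members of the two groups, and assign each to a group.
[Check if done] If all entries have been assigned, stop. If one group has so few entries that all the remaining entries must be assigned to it for it to reach the minimum number m, assign them and stop.
[Select entry to assign] Invoke algorithm Pick Next to choose the next entry to assign. Add it to the group whose covering rectangle has to be enlarged least to accommodate it. Resolve ties by adding the entry to the group with the smaller area, then to the one with fewer entries, then to either. Repeat from step 2.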

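Algorithm Pick Seeds: select two entries to be the first members of the groups.

[Calculate inefficiency of grouping entries together] For each pair of entries E1 and E2, compose a rectangle J that includes E1I and E2I, and calculate d = area(J) - area(E1I) - area(E2I).
[Choose the most wasteful pair] Choose and return the pair with the largest d.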
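Pick Seeds translates almost directly into code. In this sketch the rectangles are 2-D `(xmin, ymin, xmax, ymax)` tuples, and `pick_seeds` is a hypothetical helper that returns the indices of the chosen pair rather than the entries themselves.

```python
from itertools import combinations

def area(r):
    return (r[2] - r[0]) * (r[3] - r[1])

def union(a, b):
    """Rectangle J: the smallest rectangle containing both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))

def pick_seeds(rects):
    """Indices (i, j) of the pair with the largest d = area(J) - area(i) - area(j)."""
    def waste(pair):
        i, j = pair
        return area(union(rects[i], rects[j])) - area(rects[i]) - area(rects[j])
    return max(combinations(range(len(rects)), 2), key=waste)

rects = [(0, 0, 1, 1), (1, 0, 2, 1), (8, 8, 9, 9), (9, 8, 10, 9)]
seeds = pick_seeds(rects)
```

Checking every pair is what makes the cost quadratic in the number of entries. For the example data the most wasteful pair is the first and last rectangle, with d = 88.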

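Algorithm Pick Next: select one of the remaining entries to put into a group.

[Determine cost of putting each entry in each group] For each entry E not yet in a group, calculate d1, the area increase required in the covering rectangle of the first group to include EI, and calculate d2 similarly for the second group.
[Find entry with greatest preference for one group] Choose the entry with the maximum difference between d1 and d2.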
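Putting Pick Seeds, Pick Next and the Quadratic Split loop together gives roughly the following. This is a sketch under my own assumptions: rectangles are 2-D `(xmin, ymin, xmax, ymax)` tuples, the function works on bare rectangles instead of full entries, and all names are hypothetical.

```python
from itertools import combinations

def area(r):
    return (r[2] - r[0]) * (r[3] - r[1])

def union(a, b):
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))

def enlargement(cover, r):
    """Area increase needed for cover to also include r (d1 or d2)."""
    return area(union(cover, r)) - area(cover)

def pick_seeds(rects):
    def waste(pair):
        i, j = pair
        return area(union(rects[i], rects[j])) - area(rects[i]) - area(rects[j])
    return max(combinations(range(len(rects)), 2), key=waste)

def quadratic_split(rects, m):
    """Divide rects into two groups, each ending up with at least m members."""
    i, j = pick_seeds(rects)
    groups = [[rects[i]], [rects[j]]]
    covers = [rects[i], rects[j]]
    rest = [r for k, r in enumerate(rects) if k not in (i, j)]
    while rest:
        # A group that needs every remaining entry to reach m takes them all.
        for g in (0, 1):
            if len(groups[g]) + len(rest) <= m:
                groups[g].extend(rest)
                return groups
        # Pick Next: the entry with the greatest difference between d1 and d2.
        nxt = max(rest, key=lambda r: abs(enlargement(covers[0], r)
                                          - enlargement(covers[1], r)))
        rest.remove(nxt)
        # Least enlargement, then smaller area, then fewer entries.
        g = min((0, 1), key=lambda g: (enlargement(covers[g], nxt),
                                       area(covers[g]), len(groups[g])))
        groups[g].append(nxt)
        covers[g] = union(covers[g], nxt)
    return groups

rects = [(0, 0, 1, 1), (1, 0, 2, 1), (8, 8, 9, 9), (9, 8, 10, 9)]
left, right = quadratic_split(rects, m=2)
```

On the example data the two rectangles near the origin end up in one group and the two near (9, 9) in the other, which is the low-area split one would pick by eye.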

Summary

The R-tree is a data structure that supports efficient search in multi-dimensional space, and it is widely used in databases and related applications. It does have limits, though: it works best for data with roughly 2 to 6 dimensions, and in higher dimensions the structure becomes very complicated to maintain, so it is no longer a good fit. Many R-tree variants have appeared over the years, the R* tree among them. These variants improve on R-tree performance, and interested readers can consult the relevant literature. If there are mistakes in this article, I would be grateful to have them pointed out.