A Taste of Papers | Dually Distilling KGE for Faster and Cheaper Reasoning


Notes compiled by: Zhang Jinrui, a master's student at Tianjin University whose research focuses on knowledge graphs.

Link: https://dl.acm.org/doi/10.1145/3488560.3498437

Motivation

Knowledge graphs have proven useful for various AI tasks, such as semantic search, information extraction, and question answering. However, it is well known that knowledge graphs are far from complete, which has motivated a large body of work on knowledge graph completion. One of the most common and widely used approaches is Knowledge Graph Embedding (KGE), e.g., TransE, ComplEx, and RotatE. For better performance, KGE models are usually trained with higher embedding dimensions. However, the model size (number of parameters) and inference cost grow rapidly with the embedding dimension: as the dimension increases, the performance gain becomes smaller and smaller, while the model size and inference cost keep growing at an almost linear rate. Moreover, high-dimensional KGE is impractical in many real-world scenarios, especially for applications with limited computing resources or tight inference-time budgets, where low-dimensional KGE is essential. Yet directly training a small KGE model usually performs poorly. This raises a new research question: is it possible to obtain a low-dimensional KGE from a pre-trained high-dimensional KGE that achieves good performance while being faster and cheaper?

Highlights

The highlights of DualDE mainly include:

1. A novel framework is proposed to distill a low-dimensional KGE from a high-dimensional KGE while retaining good performance;

2. The dual influence between teacher and student is considered during distillation: a soft-label evaluation mechanism adaptively assigns different soft-label and hard-label weights to different triples, and a two-stage distillation method improves the student's acceptance of the teacher.

Concept and Model

The overall framework of the model is shown in Figure 1:


Figure 1 DualDE overall architecture diagram

  • Distillation Target

Prepare a pre-trained high-dimensional KGE model (the teacher) and randomly initialize a low-dimensional KGE model (the student). In DualDE, the hard-label loss for training the student is the original loss of the KGE method, usually a binary cross-entropy loss. In addition, the student is made to imitate the teacher in terms of both the overall credibility and the embedding structure of the target triples.

First, for a triple (h, r, t), both the teacher and the student can assign it a score through their scoring functions. 1) The student imitates the teacher's judgment of the triple's overall credibility by fitting the triple scores output by the two models. 2) The student imitates the teacher's embedding structure by fitting the length ratio of, and the angle between, the head-entity embedding and the tail-entity embedding in the two models. Finally, the sum of the score difference and the embedding-structure difference between teacher and student serves as the soft-label optimization objective, as sketched below.
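The following is a minimal PyTorch sketch of this student objective, not the paper's released implementation: a hard-label binary cross-entropy term plus a soft-label term that fits the teacher's triple scores and the length ratio and angle of its head and tail embeddings. The tensor names, the use of squared differences, and the per-triple weights w_soft/w_hard (explained in the next subsection) are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def structure_features(h_emb, t_emb, eps=1e-8):
    """Length ratio of, and cosine of the angle between, head and tail embeddings."""
    length_ratio = h_emb.norm(dim=-1) / (t_emb.norm(dim=-1) + eps)
    cos_angle = F.cosine_similarity(h_emb, t_emb, dim=-1)
    return length_ratio, cos_angle

def student_distill_loss(s_score, t_score,    # triple scores from student / teacher
                         s_h, s_t, t_h, t_t,  # head / tail embeddings from student / teacher
                         labels,              # 1.0 for positive triples, 0.0 for negatives
                         w_soft, w_hard):     # per-triple weights from the evaluation mechanism
    # Hard label: the KGE method's original loss, here binary cross-entropy on student scores.
    hard = F.binary_cross_entropy_with_logits(s_score, labels, reduction="none")

    # Soft label, part 1: fit the teacher's triple scores (overall credibility).
    score_diff = (s_score - t_score.detach()).pow(2)

    # Soft label, part 2: fit the teacher's embedding structure
    # (length ratio and angle of the head vs. tail entity embeddings).
    s_ratio, s_cos = structure_features(s_h, s_t)
    t_ratio, t_cos = structure_features(t_h.detach(), t_t.detach())
    struct_diff = (s_ratio - t_ratio).pow(2) + (s_cos - t_cos).pow(2)

    soft = score_diff + struct_diff
    return (w_hard * hard + w_soft * soft).mean()
```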

  • Soft Label Evaluation Mechanism

The soft-label evaluation mechanism evaluates the quality of the soft labels provided by the teacher and adaptively assigns different soft-label and hard-label weights to different triples, thereby preserving the positive effect of high-quality soft labels while avoiding the negative effect of low-quality ones.

In theory, a KGE model should give high scores to positive triples and low scores to negative triples, but the opposite happens for triples that the model finds hard to fit. Specifically, if the teacher gives a high (low) score to a negative (positive) triple, meaning that the teacher tends to judge it as positive (negative), then the soft label for that triple is unreliable and may even mislead the student. For such triples, we need to reduce the weight of the soft label and encourage the student to learn more from the hard label.
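As an illustration of this adaptive weighting, the hedged sketch below gives more weight to the soft label when the teacher's score agrees with the ground-truth label and shifts weight to the hard label otherwise. The sigmoid form and the scale k are assumptions; the paper defines its own evaluation function.

```python
import torch

def soft_hard_weights(teacher_score, labels, k=1.0):
    """Per-triple soft/hard label weights based on the teacher's score quality."""
    # sign = +1 for positive triples, -1 for negative triples
    sign = 2.0 * labels - 1.0
    # A high score on a positive triple (or a low score on a negative one) means
    # the teacher agrees with the ground truth, so its soft label is trusted more;
    # disagreement shifts the weight toward the hard label.
    w_soft = torch.sigmoid(k * sign * teacher_score)
    w_hard = 1.0 - w_soft
    return w_soft, w_hard
```

The returned w_soft and w_hard correspond to the per-triple weights used in the loss sketch above.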

  • Two-Stage Distillation

The previous section describes how the student extracts knowledge from a KGE teacher, where the student is trained with hard labels and with soft labels generated by a fixed teacher. To obtain a better student, we propose a two-stage distillation method that improves the student's acceptance of the teacher by unfreezing the teacher and letting it learn from the student in the second stage of distillation.

The first stage is similar to the traditional knowledge distillation setting: the teacher remains fixed while the student is trained. In the second stage, while adjusting the teacher, we also want to reduce the negative impact of the student's output on the teacher for those triples that the student has not yet mastered, letting the teacher learn more from hard labels so as to maintain its high accuracy. Therefore, the soft-label evaluation mechanism is also applied to the teacher's adjustment: the weights of the teacher's hard and soft labels are adaptively assigned by evaluating the scores the student gives to each triple. In this stage, the teacher and the student are optimized together.
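The simplified loop below sketches this two-stage schedule under stated assumptions: the model objects, the data loader, and the per-batch loss callables (for example, built from the sketches above) are placeholders supplied by the caller; only the freeze/unfreeze structure and the joint optimization in the second stage follow the description.

```python
import torch

def train_dualde(teacher, student, loader,
                 student_loss,   # callable(teacher, student, batch) -> scalar loss for the student
                 teacher_loss,   # callable(teacher, student, batch) -> scalar loss for the teacher
                 stage1_epochs=10, stage2_epochs=5, lr=1e-3):
    opt_s = torch.optim.Adam(student.parameters(), lr=lr)

    # Stage 1: classic distillation; the teacher stays frozen while the student is trained.
    teacher.requires_grad_(False)
    for _ in range(stage1_epochs):
        for batch in loader:
            loss = student_loss(teacher, student, batch)
            opt_s.zero_grad()
            loss.backward()
            opt_s.step()

    # Stage 2: unfreeze the teacher and optimize both models together. The teacher's
    # loss also mixes hard and soft labels, with per-triple weights assigned by
    # evaluating the scores the student gives to each triple.
    teacher.requires_grad_(True)
    opt_t = torch.optim.Adam(teacher.parameters(), lr=lr)
    for _ in range(stage2_epochs):
        for batch in loader:
            loss = student_loss(teacher, student, batch) + teacher_loss(teacher, student, batch)
            opt_s.zero_grad()
            opt_t.zero_grad()
            loss.backward()
            opt_s.step()
            opt_t.step()
```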

Experiments

We evaluate DualDE on typical KGE benchmarks and conduct experiments to explore the following questions:

(1) Can DualDE distill a good low-dimensional student from a high-dimensional teacher that outperforms a model of the same dimensionality trained from scratch without distillation or with other KD methods?

(2) How much does the inference time improve after distillation?

(3) Do the soft-label evaluation mechanism and the two-stage distillation method contribute to our approach, and by how much?

DualDE is tested on the commonly used datasets WN18RR and FB15k-237 and demonstrates superior performance compared with several state-of-the-art distillation methods. The experimental results are as follows:


Table 1 Link prediction results of WN18RR


Table 2 Link prediction results of FB15k-237

Q1: Is our method successful in distilling a good student?

First, we analyze the WN18RR results in Table 1. Table 1 shows that as the embedding dimension decreases, the performance of the "no-DS" model drops significantly. For SimplE, the 32-dimensional "no-DS" model achieves only 64.8%, 66.1% and 47.8% of the 512-dimensional teacher's MRR, Hit@3 and Hit@1. For ComplEx, MRR drops from 0.433 to 0.268 (38.1%). This shows that directly training a low-dimensional KGE yields poor results.

Compared with "no-DS", DualDE greatly improves the performance of the 32-dimensional students. On WN18RR, the MRR of TransE, SimplE, ComplEx and RotatE increases from 0.164 to 0.21 (28.0%), from 0.273 to 0.384 (40.7%), from 0.268 to 0.397 (48.1%), and from 0.421 to 0.468 (11.2%), respectively. On the basis of "no-DS", our 32-dimensional students achieve average improvements of 32.0%, 23.0%, 33.9% and 46.7%, and finally reach 92.9%, 94.8%, 93.1% and 102.3% of the teacher's level on MRR, Hit@10, Hit@3 and Hit@1. Similar results can be observed for FB15k-237 in Table 2. The experiments show that DualDE achieves a 16× (512:32) embedding compression ratio while retaining most of the teacher's performance (over 90%), which, despite some performance loss, is still much better than directly training a low-dimensional model.

Q2: Does the trained student speed up inference, and by how much?

To test inference speed, we conduct link prediction experiments on 93,003 WN18RR samples and 310,116 FB15k-237 samples. Since inference speed is not affected by the prediction mode (head prediction or tail prediction), tail-prediction times are compared uniformly. Inference is performed on a single Tesla-V100 GPU, and the test batch size is set to the total number of entities: 40,943 for WN18RR and 14,541 for FB15k-237. To avoid chance effects, we repeat the experiment three times and report the average time; the measurement loop can be sketched as below, and Table 3 then reports the inference time cost in seconds.
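A hedged sketch of this measurement loop is shown here. The score_fn callable, the format of test_triples (an iterable of (head, relation, tail) integer ids), and the CUDA device are illustrative assumptions; only the batch-of-all-entities tail prediction and the averaging over repeated runs follow the setup above.

```python
import time
import torch

@torch.no_grad()
def time_tail_prediction(score_fn, test_triples, num_entities,
                         device="cuda", repeats=3):
    all_tails = torch.arange(num_entities, device=device)
    timings = []
    for _ in range(repeats):
        if device.startswith("cuda"):
            torch.cuda.synchronize()
        start = time.time()
        for h, r, _ in test_triples:
            # One batch per test triple: score every entity as a candidate tail,
            # so the batch size equals the total number of entities.
            h_batch = torch.full((num_entities,), h, dtype=torch.long, device=device)
            r_batch = torch.full((num_entities,), r, dtype=torch.long, device=device)
            score_fn(h_batch, r_batch, all_tails)
        if device.startswith("cuda"):
            torch.cuda.synchronize()
        timings.append(time.time() - start)
    return sum(timings) / len(timings)  # average wall-clock time over the repeated runs
```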


Table 3 Inference time (seconds)

The results show that the trained students greatly speed up inference. Taking ComplEx and RotatE as examples, the inference time of the 512-dimensional teacher on WN18RR is 7.03 times and 7.81 times that of the 32-dimensional student. Compared with the teacher, the average speedup of the 64-dimensional students on TransE, SimplE, ComplEx and RotatE across the two datasets is 2.25×, 2.22×, 3.66× and 3.98×, and the average speedup of the 32-dimensional students is 3.11×, 3.35×, 5.90× and 5.76×.

Q3: Do the soft-label evaluation mechanism and the two-stage distillation method contribute, and by how much?

We conduct a series of ablation studies to evaluate the impact of two strategies in DualDE: the soft-label evaluation mechanism and the two-stage distillation method.

First, to study the impact of the soft-label evaluation mechanism, we compare our method with a variant that removes it (-SEM). Then, to study the effect of the two-stage distillation method, we compare DS with variants that remove the first stage (-S1) and the second stage (-S2). Table 4 summarizes the MRR and Hit@10 results on the WN18RR dataset.


Table 4: WN18RR ablation studies. D refers to the dimension of the student and M refers to the method.

After removing SEM (see -SEM), all students show a performance decline compared to DS. Across the four KGE methods, MRR and Hit@10 decrease by 3.7% and 2.8% on average for the 64-dimensional students, and by 7.9% and 5.4% on average for the 32-dimensional students. The results show that the soft-label evaluation module, which evaluates the soft-label quality of each triple and assigns different soft-label and hard-label weights accordingly, indeed helps the student model master the more difficult triples and achieve better performance.

Removing S1 and keeping only S2 (see -S1), the overall performance is lower than DS. Presumably, the reason is that the teacher and the student adapt to each other in S2; with a randomly initialized student, the student passes mostly useless information to the teacher, which can mislead and even collapse the teacher. Moreover, the performance of "-S1" is very unstable: under this setting, the 64-dimensional students perform only slightly worse than DS, while the 32-dimensional students perform significantly worse. For the 32-dimensional SimplE student, the MRR and Hit@10 of "-S1" decrease by 21.4% and 10.6% compared with DS, which is even worse than the most basic distillation method, BKD, showing that the first stage is necessary for DualDE.

Removing S2 and keeping only S1 (see -S2), performance degrades on almost all metrics. Compared with DS, the 64-dimensional and 32-dimensional "-S2" students decrease by 2.4% and 3.8% on average, indicating that the second stage can indeed make the teacher and the student adapt to each other and further improve the results.

These results support the effectiveness of our two-stage distillation: the student is first trained in S1 until it converges to a certain level of performance, and then the teacher and the student are jointly optimized in S2.

Summary

The huge number of embedding parameters of a knowledge graph brings serious storage and computation challenges in practical application scenarios. In this work, we propose a novel KGE distillation method, DualDE, which compresses a KGE into a low-dimensional space and effectively transfers the teacher's knowledge to the student. Considering the dual influence between teacher and student, DualDE employs two distillation strategies: a soft-label evaluation mechanism that adaptively assigns different soft-label and hard-label weights to different triples, and a two-stage distillation method that improves the student's acceptance of the teacher by encouraging the student and the teacher to learn from each other. We evaluate DualDE with link prediction tasks on several KGE methods and benchmark datasets. Experimental results show that DualDE can effectively reduce the embedding parameters and greatly improve the inference speed of high-dimensional KGE with little or no performance loss.


OpenKG

OpenKG (Chinese Open Knowledge Graph) aims to promote the openness, interconnection and crowdsourcing of knowledge graph data with Chinese as the core, and to promote the open-source release and openness of knowledge graph algorithms, tools and platforms.


