Deep Knowledge Tracing · Paper learning summary

Deep Knowledge Tracing

0 Summary

In computer-supported education, knowledge tracing (machines modeling students' knowledge as they interact with coursework) is a well-established problem. Although effective modeling of student knowledge could have high educational impact, the task carries many inherent challenges. In this paper, we explore the use of recurrent neural networks (RNNs) to model student learning. Compared with previous methods, the RNN family of models has important advantages: it does not require explicit encoding of human domain knowledge, and it can capture more complex representations of student knowledge. Using neural networks yields substantial improvements in prediction performance on a range of knowledge tracing datasets. Moreover, the learned model can be used for intelligent curriculum design and allows direct interpretation and discovery of structure in student tasks. These results suggest a promising new direction for knowledge tracing research and an exemplary application task for RNNs.

1 Introduction

Computer-assisted education promises open access to world-class instruction and a reduction in the growing cost of learning. We can realize this promise by building models from the large-scale student interaction data collected on popular education platforms such as Khan Academy, Coursera, and EdX.
The task of knowledge tracing is to model students' knowledge so that we can accurately predict how they will perform in future interactions. Improvement on this task means that resources can be suggested to students based on their individual needs, and content that is predicted to be too easy or too hard can be skipped or delayed. Hand-tuned intelligent tutoring systems that attempt to tailor content have already shown exciting results. One-on-one human tutoring can yield learning gains of about two standard deviations for the average student, and machine learning solutions could deliver the benefits of such high-quality personalized teaching to anyone in the world for free. Knowledge tracing is inherently difficult, since human learning is grounded in the complexity of both the human brain and human knowledge. The use of rich models therefore seems appropriate. However, most previous work in education has relied on first-order Markov models with restricted functional forms.
This paper proposes a formulation called Deep Knowledge Tracing (DKT), which applies flexible "deep" recurrent neural networks to the knowledge tracing task. This family of models uses large numbers of artificial "neurons" to represent the latent knowledge state and its temporal dynamics, and allows the latent-variable representation of student knowledge to be learned from data rather than hard-coded. The main contributions of this work are:
1. A new way to encode student interactions as input to a recurrent neural network.
2. On knowledge tracing benchmarks, an AUC 25% higher than the previous best result.
3. A demonstration that our knowledge tracing model does not require expert annotations.
4. The discovery of exercise influence and improved generation of exercise curricula.

Figure 1: A single student and her predicted responses as she solves 50 Khan Academy exercises. She seems to have mastered finding x- and y-intercepts, then struggles to transfer that knowledge to graphing linear equations.

The task of knowledge tracing can be formalized as follows: given observations of the interactions x_0 … x_t that a student takes on a particular learning task, predict aspects of her next interaction x_{t+1}. In the most common instantiation of knowledge tracing, an interaction takes the form of a tuple x_t = {q_t, a_t}, combining a label q_t for the exercise being answered with whether it was answered correctly. When making a prediction, the model is provided with the label q_t of the exercise to be answered and must predict whether the student will answer it correctly. Figure 1 shows a visualization of tracing knowledge for a single student learning 8th-grade math. The student first answers two square-root questions correctly, then answers an x-intercept exercise incorrectly. Over the subsequent 47 interactions, the student solves a series of x-intercept, y-intercept, and graphing exercises. Each time the student answers an exercise, we can predict whether she would answer an exercise of each type correctly on her next interaction. In the visualization, we only show predictions over time for a relevant subset of exercise types. In most previous work, exercise labels denote a single "concept" assigned to the exercise by a human expert. Our model can make use of, but does not require, such expert annotations. We show that the model can autonomously learn content substructure without annotations.

2 Related work

The task of modeling and predicting how humans learn draws on multiple fields, including education, psychology, neuroscience, and cognitive science. From a social-science perspective, learning is understood as being influenced by complex macro-level interactions including affect, motivation, and even identity. The challenges are further exposed at the micro level: learning is fundamentally a reflection of human cognition, which is a highly complex process. Two particularly relevant themes from cognitive science are theories that the human mind and its learning process are recursive and driven by analogy. Knowledge tracing was first posed, and has been studied in depth, within the intelligent tutoring community. In the face of these challenges, building models that may not capture all cognitive processes but are nonetheless useful has long been a central goal.

2.1 Bayesian knowledge tracing

Bayesian Knowledge Tracing (BKT) is the most commonly used approach for building temporal models of student learning. BKT models a learner's latent knowledge state as a set of binary variables, each representing understanding or non-understanding of a single concept. A Hidden Markov Model (HMM) is used to update the probability of each of these binary variables as the learner answers exercises on a given concept correctly or incorrectly. The original model formulation assumes that once a skill is learned, it is never forgotten. Recent extensions of the model include contextualized estimates of guessing and slipping, estimates of an individual learner's prior knowledge, and estimates of problem difficulty.
With or without such extensions, knowledge tracing runs into several difficulties. First, the binary representation of student understanding may be unrealistic. Second, the meaning of the hidden variables and their mapping onto exercises can be ambiguous, and exercises rarely meet the model's expectation of covering exactly one concept each. Several techniques have been developed to create and refine concept categories and concept-exercise mappings. The current gold standard, Cognitive Task Analysis, is an arduous and iterative process in which domain experts ask learners to talk through their thought processes while solving problems. Finally, the binary response data used to model transitions limits the kinds of exercises that can be modeled.

2.2 Other dynamic probability models

Partially Observable Markov Decision Processes (POMDPs) have been used to model a learner's behavior over time in settings where the learner follows an open-ended path to a solution. Although POMDPs provide an extremely flexible framework, they require exploring an exponentially large state space. Current implementations are also restricted to discrete state spaces with hard-coded meanings for the latent variables. This makes them intractable or inflexible in practice, though it may be possible to overcome both limitations.
Simpler models from the Performance Factors Analysis (PFA) framework and the Learning Factors Analysis (LFA) framework have shown predictive power comparable to BKT. To obtain better prediction results than either model alone, various ensemble methods have been used to combine BKT and PFA. Model combinations supported by AdaBoost, random forests, linear regression, logistic regression, and feedforward neural networks all showed better results than BKT and PFA on their own. But because of the learner models they rely on, these ensemble techniques face the same limitations, including the requirement for accurate concept labeling.
Recent work has explored combining Item Response Theory (IRT) models with switched nonlinear Kalman filters, as well as with knowledge tracing. Although these methods are promising, they are at present both more restricted in functional form and more expensive (due to inference of latent variables) than the methods we propose here.

2.3 Recurrent Neural Network

Recurrent neural networks are a family of flexible dynamic models that connect artificial neurons over time. The propagation of information is recursive, in that hidden neurons evolve based on both the input to the system and their previous activations. Unlike hidden Markov models, which also appear in education and are likewise dynamic, RNNs have a high-dimensional, continuous representation of latent state. A notable advantage of this richer representation is the ability to use information from an input in a prediction at a much later point in time. This is especially true of long short-term memory (LSTM) networks, a popular variant of RNNs.
Recurrent neural networks are competitive with or state of the art for several time-series tasks (for example, speech-to-text, translation, and image captioning) where large amounts of training data are available. These results suggest that we could trace student knowledge far more successfully if we formulated the task as a new application of temporal neural networks.

3 Deep Knowledge Tracing

We believe that human learning is governed by many diverse properties (the material, the context, the timecourse of presentation, and the individual involved), many of which are difficult to quantify relying only on first principles to assign attributes to exercises or structure a graphical model. Here we apply two different types of RNN (a vanilla RNN model with sigmoid units and a long short-term memory (LSTM) model) to the problem of predicting students' responses to exercises based on their past activity.

3.1 Model

Traditional recurrent neural networks (RNNs) map an input sequence of vectors (x_1, …, x_T) to an output sequence of vectors (y_1, …, y_T). This is achieved by computing a sequence of "hidden" states (h_1, …, h_T), which can be viewed as successive encodings of relevant information from past observations that will be useful for future predictions. See Figure 2 for an illustration. The variables are related by the simple network defined by the equations:
h_t = tanh(W_hx x_t + W_hh h_{t−1} + b_h),   (1)
y_t = σ(W_yh h_t + b_y),   (2)

Figure 2: The connections between variables in a simple recurrent neural network. The inputs (x_t) to the dynamic network are either one-hot encodings or compressed representations of a student's actions, and the predictions (y_t) are vectors representing the probability of answering each exercise in the dataset correctly.

Here both tanh and the sigmoid function σ(·) are applied elementwise. The model is parameterized by an input weight matrix W_hx, a recurrent weight matrix W_hh, an initial state h_0, and a readout weight matrix W_yh. The biases for the latent and readout units are given by b_h and b_y.
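As a concrete illustration, Equations (1) and (2) can be sketched in a few lines of NumPy. This is a minimal sketch with hypothetical shapes and names following the notation above, not the authors' implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rnn_step(x_t, h_prev, W_hx, W_hh, b_h, W_yh, b_y):
    """One step of the vanilla RNN defined by Equations (1)-(2)."""
    h_t = np.tanh(W_hx @ x_t + W_hh @ h_prev + b_h)  # Eq. (1): latent state update
    y_t = sigmoid(W_yh @ h_t + b_y)                  # Eq. (2): per-exercise correctness probabilities
    return h_t, y_t
```

Applied repeatedly over a sequence of inputs, this step carries information forward through h_t, which is what lets the model condition predictions on the student's full history.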
Long short-term memory (LSTM) networks are a more complex variant of RNNs that often prove more powerful. In an LSTM, latent units retain their values until they are explicitly cleared by the action of a "forget gate". They therefore retain information more naturally over many time steps, which is believed to make them easier to train. Additionally, hidden units are updated using multiplicative interactions, so they can perform more complicated transformations with the same number of latent units. The update equations for an LSTM are considerably more complicated than for an RNN and can be found in Appendix A.

3.2 Input and output time series

To train an RNN or LSTM on student interactions, it is necessary to convert those interactions into a sequence of fixed-length input vectors x_t. We do this in one of two ways, depending on the nature of the interactions:
For datasets with a small number M of unique exercises, we set x_t to be a one-hot encoding of the student interaction tuple h_t = {q_t, a_t}, representing the combination of which exercise was answered and whether it was answered correctly, so x_t ∈ {0, 1}^{2M}. We found that having separate representations for q_t and a_t degraded performance.
For large feature spaces, a one-hot encoding can quickly become impractically large. For datasets with a large number of unique exercises, we therefore instead assign each input tuple a random vector n_{q,a} ∼ N(0, I), where n_{q,a} ∈ R^N and N ≪ M, and set each input vector x_t to the random vector corresponding to its tuple, x_t = n_{q_t,a_t}. This random low-dimensional representation of a one-hot high-dimensional vector is motivated by compressed sensing. Compressed sensing states that a d-dimensional, k-sparse signal can be recovered exactly from k log d random linear projections (up to a scaling and additive constant). Since a one-hot encoding is a 1-sparse signal, the student interaction tuple can be exactly encoded by assigning it a fixed random Gaussian input vector of length ∼ log 2M. Although the current paper deals only with 1-sparse (one-hot) vectors, this technique can easily be extended to capture aspects of more complex student interactions in a fixed-length vector.
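Both encodings can be sketched briefly. Note the one-hot index layout (q + a·M) is an illustrative assumption; the text only specifies that the encoding is 2M-dimensional:

```python
import numpy as np

def one_hot_interaction(q, a, M):
    """One-hot x_t in {0,1}^{2M} for the interaction tuple {q_t, a_t}.
    Index convention q + a*M is an assumption of this sketch."""
    x = np.zeros(2 * M)
    x[q + a * M] = 1.0
    return x

def random_input_vectors(M, N, seed=0):
    """For large M: one fixed Gaussian vector n_{q,a} ~ N(0, I_N) per
    interaction tuple, with N << 2M, as motivated by compressed sensing."""
    rng = np.random.default_rng(seed)
    return {(q, a): rng.normal(size=N) for q in range(M) for a in (0, 1)}
```

At training time, each interaction (q_t, a_t) would be looked up in the fixed dictionary so the same tuple always maps to the same input vector.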
The output y_t is a vector of length equal to the number of exercises, where each entry represents the predicted probability that the student will answer that particular exercise correctly. The prediction of a_{t+1} can therefore be read from the entry of y_t corresponding to q_{t+1}.

3.3 Optimization

The training objective is the negative log-likelihood of the observed sequence of student responses under the model. Let δ(q_{t+1}) be the one-hot encoding of which exercise is answered at time t + 1, and let ℓ be binary cross-entropy. The loss for a given prediction is ℓ(y_t^T δ(q_{t+1}), a_{t+1}), and the loss for a single student is:
L = Σ_t ℓ(y_t^T δ(q_{t+1}), a_{t+1})
This objective is minimized using stochastic gradient descent on mini-batches. To prevent overfitting during training, dropout is applied to h_t when computing the readout y_t, but not when computing the next hidden state h_{t+1}. We prevent gradients from exploding as we backpropagate through time by clipping gradients whose norm exceeds a threshold. For all models in this paper we consistently used a hidden dimensionality of 200 and a mini-batch size of 100. To facilitate research on DKT, we have published our code and the relevant preprocessed data.
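The per-student loss has a simple form once the predictions y_t are available. A minimal sketch (the argument names are ours, not from the paper):

```python
import numpy as np

def dkt_loss(preds, next_qs, next_as):
    """Negative log-likelihood of one student's response sequence.
    preds:   (T, M) array of predicted correctness probabilities y_t;
    next_qs: exercise ids q_{t+1} answered at each next step;
    next_as: correctness labels a_{t+1} (0 or 1)."""
    total = 0.0
    for y_t, q, a in zip(preds, next_qs, next_as):
        p = y_t[q]  # y_t^T . delta(q_{t+1}) picks out the predicted probability
        total += -(a * np.log(p) + (1 - a) * np.log(1.0 - p))
    return total
```

Summing this quantity over a mini-batch of students gives the objective minimized by stochastic gradient descent.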

4 Educational applications

The training objective of knowledge tracing is to predict a student's future performance from her past activity. This is directly useful: for example, formal testing is no longer necessary if a student's ability is assessed continuously. As the experiments in Section 6 discuss, the DKT model can also power a number of other improvements.

4.1 Improve the curriculum

One of the biggest potential impacts of our model is in choosing the best sequence of learning items to present to a student. Given a student with an estimated hidden knowledge state, we can query our RNN to compute what her expected knowledge state would be if we assigned her a particular exercise. For instance, in Figure 1, after the student has answered 50 exercises, we can test every possible exercise we could show her next and compute her expected knowledge state given that choice. For this student, the predicted optimal next problem is to revisit solving for the y-intercept.
We use a trained DKT to test two classic curriculum rules from the education literature: mixing, where exercises from different topics are intermixed, and blocking, where students answer series of exercises of the same type. Since choosing an entire sequence of next exercises so as to maximize predicted accuracy can be phrased as a Markov decision problem, we can also evaluate the benefit of using the expectimax algorithm to choose an optimal sequence of problems.
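The depth-1 case of this expectimax search can be sketched as follows. The `predict` interface here is hypothetical (any function mapping an interaction history to per-exercise correctness probabilities would do); for each candidate exercise we marginalize over whether the student answers it correctly:

```python
def best_next_exercise(predict, history, M):
    """Greedy (depth-1) expectimax over the next exercise.

    predict: hypothetical model interface mapping a history [(q, a), ...]
             to a length-M sequence of predicted correctness probabilities.
    Returns the exercise maximizing expected mean predicted knowledge."""
    best_q, best_val = None, -1.0
    for q in range(M):
        p = predict(history)[q]  # probability of answering q correctly now
        val = 0.0
        for a, w in ((1, p), (0, 1.0 - p)):  # expectation over both outcomes
            future = predict(history + [(q, a)])
            val += w * (sum(future) / len(future))
        if val > best_val:
            best_q, best_val = q, val
    return best_q, best_val
```

Deeper searches (such as the MDP-8 policy evaluated in Section 6) would recurse on `history + [(q, a)]` instead of scoring it directly, at exponentially growing cost.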

4.2 Discover the training relationship

The DKT model can also be applied to the task of discovering latent structure or concepts in the data, a task typically performed by human experts. We approach this problem by assigning an influence J_ij to every directed pair of exercises i and j:
J_ij = y(j|i) / Σ_k y(j|k),   (4)
where y(j|i) is the probability of correctness that the RNN assigns to exercise j on the second time step, given that the student answered exercise i correctly on the first time step. We show that this representation of the dependencies captured by the RNN recovers the prerequisite relationships between exercises.
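Given a matrix of the conditional probabilities y(j|i), the influence values reduce to a column normalization. A minimal sketch (the orientation of `cond` is our assumption):

```python
import numpy as np

def influence_matrix(cond):
    """Compute J from Equation 4.

    cond[i, j] = y(j|i): the probability the RNN assigns to answering
    exercise j correctly at the second time step, given that exercise i
    was answered correctly at the first.
    Returns J with J[i, j] = y(j|i) / sum_k y(j|k)."""
    return cond / cond.sum(axis=0, keepdims=True)
```

Thresholding the entries of J (e.g. dropping magnitudes below 0.1, as in Figure 4) then yields the directed influence graph between exercises.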

5 data set

We tested the ability to predict student performance on three datasets: simulated data, Khan Academy data, and the Assistments benchmark dataset. On each dataset we measure the area under the curve (AUC). For the non-simulated data, we evaluate our results using 5-fold cross-validation, and in all cases hyperparameters are learned on training data. We compare the results of Deep Knowledge Tracing to standard BKT and, where possible, to optimal variations of BKT. Additionally, we compare our results to predictions made by simply computing the marginal probability of a student answering a particular exercise correctly.
Table 1: AUC results for all datasets tested. BKT is standard BKT. BKT* is the best result reported in the literature for Assistments. DKT is the result of Deep Knowledge Tracing using an LSTM.

Simulated data: We simulate virtual students learning virtual concepts and test how well we can predict responses in this controlled setting. For each run of this experiment, we generate 2000 students who answer 50 exercises drawn from k ∈ {1 … 5} concepts. For this dataset only, all students answer the same sequence of 50 exercises. Each student has a latent knowledge state ("skill") for each concept, and each exercise has both a single concept and a difficulty. The probability of a student answering an exercise correctly, given student skill α and exercise difficulty β, is modeled using classic Item Response Theory: p(correct | α, β) = c + (1 − c) / (1 + e^{β − α}), where c is the probability of a random guess (set to 0.25). Over time, students "learn" via an increase in the skill corresponding to the concept of the exercise they answered. To understand how the different models incorporate unlabeled data, we do not provide the hidden concept labels to the models (instead, the input is simply the exercise index and whether the exercise was answered correctly). We evaluate prediction performance on an additional 2000 simulated test students. For each number of concepts, we repeat the experiment 20 times with different randomly generated data to evaluate accuracy means and standard errors.
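The IRT response probability used by the simulator is a one-liner. A sketch of just that formula (parameter names follow the text; the rest of the simulator is not reproduced here):

```python
import math

def p_correct(alpha, beta, c=0.25):
    """Item Response Theory probability that a student with skill alpha
    answers an exercise of difficulty beta correctly, with guess
    probability c: p = c + (1 - c) / (1 + exp(beta - alpha))."""
    return c + (1.0 - c) / (1.0 + math.exp(beta - alpha))
```

Note the floor: even for a very weak student on a very hard exercise, p never drops below the guess rate c, and when skill exactly matches difficulty, p = c + (1 − c)/2.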
Khan Academy data: We used a sample of anonymized student interactions from the eighth-grade Common Core curriculum on Khan Academy. The dataset includes 1.4 million exercises completed by 47,495 students across 69 different exercise types. It does not contain any personal information. Only the researchers working on this paper had access to this anonymized dataset, and its use was governed by an agreement designed to protect student privacy in accordance with Khan Academy's privacy notice. Khan Academy provides a particularly relevant source of learning data because students often interact with the site over long periods of time and across a variety of content, and are typically self-directed in the topics they work on and in their trajectory through the material.
Benchmark dataset: In order to understand how our model compares to others, we evaluated it on the Assistments 2009-2010 "skill builder" public benchmark dataset. Assistments is an online tutor that simultaneously teaches and assesses students in grade-school mathematics. To the best of our knowledge, it is the largest publicly available knowledge tracing dataset.

6 results

On all three datasets, Deep Knowledge Tracing performed considerably better than previous methods. On the Khan Academy data, using an LSTM neural network model achieved an AUC of 0.85, a substantial improvement over the performance of standard BKT (AUC = 0.68), especially in comparison to the small improvement BKT provided over the marginal baseline (AUC = 0.63). See Table 1 and Figure 3(B). On the Assistments dataset, DKT produced a 25% gain over the previous best reported result (AUC = 0.86 vs. 0.69). The gain we report over the marginal baseline (0.24) is more than triple the largest gain achieved on this dataset to date (0.07).
The prediction results from the synthetic dataset provide an interesting demonstration of the capabilities of deep knowledge tracing. Both the LSTM and RNN models did as well at predicting student responses as an oracle with perfect knowledge of all model parameters (which only needs to fit the latent student-knowledge variables). See Figure 3(A). To attain accuracy comparable to the oracle, the models had to mimic a function incorporating: the latent concepts, the difficulty of each exercise, the prior distribution of student knowledge, and the increase in concept skill that occurred after each exercise.
Figure 3: Left: prediction results for (A) simulated data and (B) Khan Academy data. Right: (C) predicted knowledge on Assistments data under different exercise curricula. Error bars are standard errors of the mean.

In contrast, BKT's predictions degraded substantially as the number of hidden concepts increased, because it has no mechanism for learning unlabeled concepts.
We tested the ability to intelligently choose exercises on a subset of five concepts from the Assistments dataset. For each curriculum approach, we used our DKT model to simulate how a student would answer questions and evaluated how much the student knew after 30 exercises. We repeated each student simulation 500 times to measure the average predicted probability of the student answering future questions correctly. In the Assistments context, the blocking strategy had a notable advantage over mixing. See Figure 3(C). While blocking performed on par with solving expectimax one exercise deep (MDP-1), when we looked further into the future in choosing the next problem, we found curricula under which students reached significantly higher predicted knowledge after solving fewer problems (MDP-8).
The prediction accuracy on the synthetic dataset suggests that it may be possible to use the DKT model to extract the latent structure between the assessments in a dataset. The conditional influence graph of our model on the synthetic dataset reveals a perfect clustering of the five latent concepts (see Figure 4), with directed edges set using the influence function in Equation 4. An interesting observation is that some exercises from the same concept occur far apart in time. For example, in the synthetic dataset, where node numbers describe sequence order, the 5th exercise comes from hidden concept 1; although no other exercise from the same concept is asked until the 22nd exercise, we still learn a strong conditional dependence between the two. We analyzed the Khan Academy data using the same technique. The resulting graph is a compelling articulation of how the concepts in the eighth-grade Common Core are related to each other (see Figure 4; node numbers describe exercise labels). We restricted the analysis to ordered pairs of exercises {A, B} such that after A appeared, B appeared in more than 1% of the remainder of the sequence. To determine whether the resulting conditional relationships are merely a product of obvious underlying trends in the data, we compared our results to two baseline measures: (1) the transition probability of students answering B given they had just answered A, and (2) the probability in the data (without using the DKT model) of answering B correctly given that a student had previously answered A correctly. Both baseline methods resulted in incoherent graphs, which are shown in the appendix. While many of the relationships we uncovered may be unsurprising to education experts, their discovery affirms that the DKT network learned a coherent model.

7 Conclusion

In this paper we apply RNNs to the problem of knowledge tracing in education, showing improvement over prior state of the art on the Assistments benchmark and the Khan Academy dataset. Our new model has two particularly interesting novel properties:
(1) It does not require expert annotations (it can learn conceptual patterns by itself)
(2) It can operate on any student input that can be vectorized. One disadvantage of RNNs relative to simple hidden Markov methods is that they require large amounts of training data, and so they are well suited to the online education environment but not to small classroom settings.
Figure 4: Graphs of conditional influence between exercises in DKT models. Top: we observe a perfect clustering of latent concepts in the synthetic data. Bottom: a compelling depiction of how eighth-grade math Common Core exercises influence one another. Arrow size indicates connection strength. Note that nodes may be connected in both directions. Edges with magnitude smaller than 0.1 have been thresholded out. Cluster labels are added by hand, but are fully consistent with the exercises in each cluster.

The application of RNNs to knowledge tracing provides many directions for future research. Further investigations could incorporate other features as inputs (such as time taken), explore other educational impacts (such as hint generation or dropout prediction), and validate hypotheses posed in the education literature (such as spaced repetition, or modeling of how students forget). Because DKT takes vector inputs, it should be possible to track knowledge over more complex learning activities. An especially interesting extension is to trace student knowledge as they solve open-ended programming tasks. Using recently developed methods for vectorizing programs, we hope to be able to intelligently model student knowledge over time as they learn to program.
In ongoing collaboration with Khan Academy, we plan to test the efficacy of DKT for curriculum planning in a controlled experiment by using it to propose exercises on the site.

Paper related information

Deep Knowledge Tracing (original paper)
Paper Source Data Set

Origin blog.csdn.net/zjlwdqca/article/details/112212828