Attack detection based on the t-distribution over dynamic time intervals (DTI)

Paper:
Detection of Shilling Attack Based on T-distribution on the Dynamic Time Intervals in Recommendation Systems

In the previous article, we talked about defending against recommender attacks by making sure that user behavior characteristics are credible. Another approach is to eliminate the fake data, as this paper does. The method is based on the t-distribution over dynamic time intervals, and its assumption is that an attack must inject a large amount of fake data within a short time interval.
To make the algorithm below easier to follow, some basic concepts are introduced first. I is the item set and U is the user set. H' is a user's rating behavior on an item. A rating record is the record of a single rating of an item, for example that a certain user rated the item at a certain time. The item rating time interval sequence is the time series of rating behaviors on an item. A gap is the time interval between two rating behaviors, and midT is half of the gap. The time windows are the series of windows generated from the IRTGS, and the window size is the number of ratings contained in a time window.
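As a small illustration of the gap and midT definitions above, here is a minimal sketch. The function name `rating_gaps` and the use of plain numeric timestamps are my own assumptions, and midT is taken literally as half of each gap.

```python
def rating_gaps(times):
    """Return the gaps between consecutive rating times and half of each gap.

    `times` is a list of numeric timestamps for one item (an assumed format).
    """
    ordered = sorted(times)
    # gap: time interval between two consecutive rating behaviors
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    # midT: half of each gap, following the definition in the text
    mid_ts = [gap / 2 for gap in gaps]
    return gaps, mid_ts


gaps, mid_ts = rating_gaps([5, 1, 11])
print(gaps, mid_ts)  # [4, 6] [2.0, 3.0]
```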
DTI is a method of dividing the rating timeline into time windows. Each window is repeatedly split at its middle until the stopping condition is met, and the middle position is recorded for every window that is split. The algorithm is shown in the figure:
[Figure: the DTI algorithm]
The result of the final division is shown in the figure; the fake data in the figure all fall into the same time window.
[Figure: result of the DTI division]
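The midpoint splitting described above can be sketched roughly as follows. The paper's actual stopping condition is not reproduced in this text, so the sketch substitutes a hypothetical one: a window is no longer split once it spans at most `max_span` time units. The function name `split_windows` and the numeric timestamps are assumptions as well, so treat this as an illustration of the idea rather than the paper's algorithm.

```python
def split_windows(times, max_span):
    """Recursively split rating timestamps at the time midpoint of each window.

    `max_span` is a hypothetical stopping rule (not the paper's): a window is
    left alone once it covers at most `max_span` time units.
    Returns (windows, midpoints), with windows in chronological order.
    """
    windows, midpoints = [], []

    def recurse(chunk):
        if not chunk:
            return
        if chunk[-1] - chunk[0] <= max_span:
            windows.append(chunk)
            return
        mid = (chunk[0] + chunk[-1]) / 2
        left = [t for t in chunk if t <= mid]
        right = [t for t in chunk if t > mid]
        if not left or not right:  # cannot be split any further
            windows.append(chunk)
            return
        midpoints.append(mid)  # record the middle position of each split
        recurse(left)
        recurse(right)

    recurse(sorted(times))
    return windows, midpoints


# A burst of ratings at times 21..24 sits among sparse ratings.
times = [1, 10, 20, 21, 22, 23, 24, 40, 60, 80]
windows, midpoints = split_windows(times, max_span=10)
print(windows)  # [[1, 10], [20], [21, 22, 23, 24], [40], [60], [80]]
```

With this stand-in rule the burst of ratings ends up in one window of its own, which is the kind of outcome the figure above describes for the fake data.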
Next comes the key part, the t-distribution. The t-distribution is used to estimate the mean of a normally distributed population with unknown variance from a small sample. The sample population in the paper refers to the distribution of the number of item ratings in two time windows. Formulas (1) and (2) compute the mean rating; formulas (3) and (4) correct the mean, that is, they compute the mean rating over the whole window; formulas (5) and (6) compute the variance within a time window. The degrees of freedom of the t-distribution are the number of random variables that are free to vary, and the random variable here is the rating score. Because computing each mean imposes one constraint, the degrees of freedom are m+n-2, where m and n are the sample sizes of the two windows.
[Figure: formulas (1)-(6)]
The figure below gives the boundary values at 95% confidence for different degrees of freedom. After the t statistic has been computed for each pair of time windows, the matrix in Figure 10 is obtained. Every position in the matrix that is above the boundary value for the corresponding degrees of freedom is set to 1, and the rest are set to 0. If the total number of 1s for a window is higher than the average number of 1s over all windows, that window is a possible attack window. For such a window, the average rating time interval inside the window is then compared with the average rating time interval overall. If ratings in the window arrive too frequently and the scores are all high, the window is judged to be an attack window, and the rating records in it with scores below the average score, and those with scores above it, are eliminated.
[Figure: boundary values at 95% confidence for different degrees of freedom]
The figure shows the pseudocode of the entire t-distribution procedure. The paper defends against attacks by eliminating fake data.
[Figure: pseudocode of the t-distribution procedure]
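To make the t-test and the 0/1 matrix step more concrete, here is a minimal sketch using only the Python standard library. Several things are my assumptions rather than the paper's method: the samples are taken to be the rating scores in each window, the statistic is the standard pooled two-sample t statistic with m+n-2 degrees of freedom, the boundary values are the usual two-sided 95% table values, and the names `pooled_t`, `suspicious_windows` and `T_CRIT_95` are made up for illustration. The mean correction of formulas (3) and (4) and the later time-interval check are left out.

```python
from statistics import mean, variance

# Two-sided 95% critical values of Student's t for a few degrees of freedom.
# Standard table values; whether the paper uses these exact bounds is an assumption.
T_CRIT_95 = {2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571, 6: 2.447, 7: 2.365, 8: 2.306}


def pooled_t(a, b):
    """Pooled two-sample t statistic and its degrees of freedom (m + n - 2).

    Each sample needs at least two ratings so that a variance exists.
    """
    m, n = len(a), len(b)
    df = m + n - 2
    pooled_var = ((m - 1) * variance(a) + (n - 1) * variance(b)) / df
    if pooled_var == 0:
        # identical constant samples give 0; different constants give infinity
        return (0.0 if mean(a) == mean(b) else float("inf")), df
    t = abs(mean(a) - mean(b)) / (pooled_var * (1 / m + 1 / n)) ** 0.5
    return t, df


def suspicious_windows(windows):
    """Return indices of windows whose number of 1s exceeds the average.

    Entry (i, j) of the matrix is 1 when the t statistic between windows i
    and j is above the boundary value for their degrees of freedom.
    Only degrees of freedom listed in T_CRIT_95 are supported in this sketch.
    """
    k = len(windows)
    flags = [[0] * k for _ in range(k)]
    for i in range(k):
        for j in range(k):
            if i != j:
                t, df = pooled_t(windows[i], windows[j])
                flags[i][j] = int(t > T_CRIT_95[df])
    ones = [sum(row) for row in flags]
    average = mean(ones)
    return [i for i, count in enumerate(ones) if count > average]


# Rating scores per time window; the third window is unusually high.
windows = [[3, 4, 2, 3], [4, 3, 3, 2], [5, 5, 5, 4], [2, 3, 4, 3]]
print(suspicious_windows(windows))  # [2]
```

In this toy data only the third window differs significantly from the others, so it is the only one flagged as a possible attack window; a real implementation would then apply the rating-frequency and high-score checks before removing any records.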

Origin blog.csdn.net/qq_42316533/article/details/109300415