These notes record how to measure the similarity between the probability distributions of a source domain and a target domain.
1. Use PCA to reduce the source-domain and target-domain datasets to two or three features, then visualize them in 2D or 3D space, where each sample is drawn as a scatter point. By comparing the scatter clouds of the two domains, we can judge whether their distributions differ:
from mpl_toolkits.mplot3d import Axes3D
from numpy import *
import pandas as pd
import matplotlib.pyplot as plt

# PCA dimensionality reduction
def pca(dataMat, topNfeat=9999999):
    meanVals = mean(dataMat, axis=0)
    meanRemoved = dataMat - meanVals            # center the data
    covMat = cov(meanRemoved, rowvar=0)         # covariance matrix
    eigVals, eigVects = linalg.eig(mat(covMat))
    eigValInd = argsort(eigVals)
    eigValInd = eigValInd[:-(topNfeat + 1):-1]  # indices of the top-N eigenvalues
    redEigVects = eigVects[:, eigValInd]        # corresponding eigenvectors
    lowDataMat = meanRemoved * redEigVects      # project onto the principal components
    return lowDataMat

# Read the CSV files
srcFrame = pd.read_csv('OGSrc.csv')
tarFrame = pd.read_csv('OGTar.csv')

# Convert the DataFrames to arrays
srcData = srcFrame.values
tarData = tarFrame.values

# Reduce the OG source-domain and target-domain data to three dimensions
lowSrcData = pca(srcData, 3)
lowTarData = pca(tarData, 3)

# Extract each reduced feature column for both domains
src_f1 = transpose(lowSrcData)[0].tolist()[0]
src_f2 = transpose(lowSrcData)[1].tolist()[0]
src_f3 = transpose(lowSrcData)[2].tolist()[0]
tar_f1 = transpose(lowTarData)[0].tolist()[0]
tar_f2 = transpose(lowTarData)[1].tolist()[0]
tar_f3 = transpose(lowTarData)[2].tolist()[0]

# 3D scatter plots for the source domain (red) and target domain (black)
ax = plt.subplot(111, projection='3d')
ax.scatter(src_f1, src_f2, src_f3, c='r')
ax.scatter(tar_f1, tar_f2, tar_f3, c='k')
ax.set_xlabel('feature1')
ax.set_ylabel('feature2')
ax.set_zlabel('feature3')
plt.show()
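As a cross-check on the hand-rolled PCA above, the same reduction can be sketched with scikit-learn's PCA. This is only an illustration on synthetic stand-in data (the real OGSrc.csv/OGTar.csv files are not available here); note that it fits the projection on the source domain and reuses it for the target, so both point clouds land in the same coordinate system, which differs slightly from the per-domain reduction above.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-ins for the OGSrc/OGTar data: two 6-feature datasets,
# the target shifted away from the source
rng = np.random.default_rng(0)
srcData = rng.normal(loc=0.0, scale=1.0, size=(100, 6))
tarData = rng.normal(loc=2.0, scale=1.0, size=(100, 6))

# Fit PCA on the source domain, then project both domains to 3 components
pca3 = PCA(n_components=3)
lowSrc = pca3.fit_transform(srcData)
lowTar = pca3.transform(tarData)

print(lowSrc.shape, lowTar.shape)  # (100, 3) (100, 3)
```

Projecting both domains with one shared basis makes the resulting scatter plots directly comparable, whereas reducing each domain with its own eigenvectors can rotate the clouds independently.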
2. KL Divergence (Kullback-Leibler Divergence)
KL divergence compares how close two probability distributions are. It does not measure a spatial distance between them; more precisely, it measures the information lost when one distribution is used to approximate the other, and this information loss is what KL divergence quantifies.
The KL divergence formula is a simple variant of the entropy formula: introduce the approximating distribution q alongside the original distribution p, and sum, over every value, p times the difference of the corresponding logarithms: D_KL(p || q) = sum_x p(x) * (log p(x) - log q(x)).
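A small worked example of the formula, using two made-up discrete distributions (`p` and `q` are illustrative values, not from the OG data), and checking the direct computation against `scipy.stats.entropy`, which computes the same quantity:

```python
import numpy as np
import scipy.stats

# Two discrete distributions over the same three events
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

# Direct implementation of the formula above
kl_pq = np.sum(p * (np.log(p) - np.log(q)))

# scipy.stats.entropy(p, q) computes the same divergence (natural log)
assert np.isclose(kl_pq, scipy.stats.entropy(p, q))

# KL divergence is asymmetric: D(p||q) != D(q||p) in general
kl_qp = np.sum(q * (np.log(q) - np.log(p)))
print(kl_pq, kl_qp)
```

The asymmetry is one reason KL divergence is not a true distance: swapping the roles of p and q changes the value.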
from numpy import *
import pandas as pd
import scipy.stats

# KL divergence, summed over rows; each row is assumed to be a
# strictly positive probability distribution
def kl(srcArr, tarArr):
    m, n = srcArr.shape
    result = 0
    for i in range(m):
        result += sum(srcArr[i] * (log(srcArr[i]) - log(tarArr[i])))
        # Equivalent (scipy normalizes each row first):
        # result += scipy.stats.entropy(srcArr[i], tarArr[i])
    return result

# Read the CSV files
srcFrame = pd.read_csv('OGSrc.csv')
tarFrame = pd.read_csv('OGTar.csv')

# Convert the DataFrames to arrays
srcData = srcFrame.values
tarData = tarFrame.values

print(srcData.shape, tarData.shape)
result = kl(srcData, tarData)
print(result)
3. Maximum Mean Discrepancy (MMD)
Map the source-domain data (drawn from distribution p) and the target-domain data (drawn from distribution q) into a reproducing kernel Hilbert space, and compute the means of the samples from the two distributions in that space; the distance between these means is the maximum mean discrepancy. In practice, the kernel that yields the largest difference on the current datasets is chosen experimentally as the MMD test statistic, which is then used to judge whether the two distributions are the same.
from numpy import *
import pandas as pd

# Gaussian (RBF) kernel: k(x, y) = exp(-||x - y||^2 / (2*sigma^2))
def gaussianKernel(xArr, yArr, s):
    temp = sum((xArr - yArr) ** 2)
    return exp(-temp / s)

def mmd(srcArr, tarArr, sigma):
    s = 2 * (sigma ** 2)
    m = srcArr.shape[0]
    n = tarArr.shape[0]
    result1 = 0  # source-source kernel sum
    result2 = 0  # source-target kernel sum
    result3 = 0  # target-target kernel sum
    for i in range(m):
        for j in range(m):
            result1 += gaussianKernel(srcArr[i], srcArr[j], s)
    for i in range(m):
        for j in range(n):
            result2 += gaussianKernel(srcArr[i], tarArr[j], s)
    for i in range(n):
        for j in range(n):
            result3 += gaussianKernel(tarArr[i], tarArr[j], s)
    print(result1, result2, result3)
    return 1 / (m ** 2) * result1 - 2 / (m * n) * result2 + 1 / (n ** 2) * result3

# Read the CSV files
srcFrame = pd.read_csv('OGSrc.csv')
tarFrame = pd.read_csv('OGTar.csv')

# Convert the DataFrames to arrays
srcData = srcFrame.values
tarData = tarFrame.values

print(srcData.shape, tarData.shape)
sigma = 0.1
print(mmd(srcData, tarData, sigma))
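The triple Python loop above makes O(m^2 + mn + n^2) kernel calls; the same biased MMD^2 statistic can be computed with NumPy broadcasting. This is a sketch on synthetic data (the array shapes and the `mmd_vectorized` name are illustrative, not from the original code), using the same squared-exponential kernel:

```python
import numpy as np

def mmd_vectorized(X, Y, sigma):
    # Pairwise squared Euclidean distances via broadcasting
    def sq_dists(A, B):
        return np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=2)
    s = 2.0 * sigma ** 2
    Kxx = np.exp(-sq_dists(X, X) / s)  # source-source kernel matrix
    Kxy = np.exp(-sq_dists(X, Y) / s)  # source-target kernel matrix
    Kyy = np.exp(-sq_dists(Y, Y) / s)  # target-target kernel matrix
    # mean() over the full matrices reproduces the 1/m^2, 1/(mn), 1/n^2 weights
    return Kxx.mean() - 2.0 * Kxy.mean() + Kyy.mean()

# Identical samples give exactly zero; shifted samples give a clearly larger value
rng = np.random.default_rng(1)
X = rng.normal(0.0, 1.0, size=(50, 4))
Y = rng.normal(3.0, 1.0, size=(50, 4))
print(mmd_vectorized(X, X, 1.0))
print(mmd_vectorized(X, Y, 1.0))
```

Like the loop version, this uses the biased estimator (the diagonals of Kxx and Kyy are included in the means), so the two implementations agree term for term.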
References:
1. The formulas in this post are taken from a CSDN blog and were originally pasted here as screenshots.
2. Machine Learning in Action