Similarity Analysis of Source Domain and Target Domain Probability Distribution

This article looks at how to measure the similarity between the probability distributions of a source domain and a target domain.

1. Use PCA to reduce the source-domain and target-domain datasets to two- or three-dimensional features, then visualize them in 2D or 3D space, i.e. each sample is drawn as a scatter point. By comparing the scatter distributions of the two domains we can analyze whether their distributions differ:

from mpl_toolkits.mplot3d import Axes3D
from numpy import *
import pandas as pd
import matplotlib.pyplot as plt

# PCA dimension reduction
def pca(dataMat, topNfeat=9999999):
    meanVals = mean(dataMat, axis=0)
    # Center the data by subtracting the column means
    meanRemoved = dataMat - meanVals
    covMat = cov(meanRemoved, rowvar=0)
    eigVals, eigVects = linalg.eig(mat(covMat))
    # Sort the eigenvalues and keep the indices of the top n largest
    eigValInd = argsort(eigVals)
    eigValInd = eigValInd[:-(topNfeat + 1):-1]
    # Select the corresponding top n eigenvectors
    redEigVects = eigVects[:, eigValInd]
    # Project the centered data onto the principal components
    lowDataMat = meanRemoved * redEigVects
    return lowDataMat

# Read CSV files
srcFrame = pd.read_csv('OGSrc.csv')
tarFrame = pd.read_csv('OGTar.csv')
# Convert DataFrame type to array type
srcData = srcFrame.values
tarData = tarFrame.values
# Reduce the source-domain and target-domain data to 3 dimensions
lowSrcData = pca(srcData, 3)
lowTarData = pca(tarData, 3)
# Extract each column (feature) of the reduced source and target arrays
src_f1 = transpose(lowSrcData)[0].tolist()[0]
src_f2 = transpose(lowSrcData)[1].tolist()[0]
src_f3 = transpose(lowSrcData)[2].tolist()[0]

tar_f1 = transpose(lowTarData)[0].tolist()[0]
tar_f2 = transpose(lowTarData)[1].tolist()[0]
tar_f3 = transpose(lowTarData)[2].tolist()[0]

# 3D scatter plot of the source domain (red) and the target domain (black)
ax = plt.subplot(111, projection='3d')
ax.scatter(src_f1, src_f2, src_f3, c='r')
ax.scatter(tar_f1, tar_f2, tar_f3, c='k')
ax.set_xlabel('feature1')
ax.set_ylabel('feature2')
ax.set_zlabel('feature3')
plt.show()
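
The text above also mentions a two-dimensional view. A minimal 2D sketch of the same comparison, assuming scikit-learn is available (it is not used in the original code above):

# Minimal 2D sketch (assumption: scikit-learn is installed; reuses the same CSV files)
from sklearn.decomposition import PCA
import pandas as pd
import matplotlib.pyplot as plt

srcData = pd.read_csv('OGSrc.csv').values
tarData = pd.read_csv('OGTar.csv').values

pca2d = PCA(n_components=2)
lowSrc2d = pca2d.fit_transform(srcData)   # components fitted on the source domain
lowTar2d = pca2d.transform(tarData)       # target domain projected with the same components

plt.scatter(lowSrc2d[:, 0], lowSrc2d[:, 1], c='r', label='source')
plt.scatter(lowTar2d[:, 0], lowTar2d[:, 1], c='k', label='target')
plt.xlabel('feature1')
plt.ylabel('feature2')
plt.legend()
plt.show()

Note that this sketch projects both domains with the components fitted on the source domain, so the two scatter clouds share one coordinate system; the code above instead fits PCA on each domain separately.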

2. KL Divergence (Kullback-Leibler Divergence)

KL divergence compares how close two probability distributions are. It does not measure a spatial distance between the two distributions; a more accurate reading is that it measures the information lost when one distribution is used to approximate the other, and this information loss is what the KL divergence quantifies.

The KL divergence formula is a simple variation of the entropy formula: alongside the original probability distribution p we introduce the approximating distribution q, and sum the differences of the corresponding logarithms of their values:
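
In standard notation, the discrete KL divergence of p relative to q described above is:

$$ D_{KL}(p \,\|\, q) = \sum_{i} p(x_i)\bigl(\log p(x_i) - \log q(x_i)\bigr) = \sum_{i} p(x_i)\log\frac{p(x_i)}{q(x_i)} $$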



from numpy import *
import pandas as pd
import scipy.stats

# KL divergence: each row is treated as a probability distribution,
# and the row-wise divergences are summed
def kl(srcArr, tarArr):
    m, n = srcArr.shape
    result = 0
    for i in range(m):
        result += sum(srcArr[i] * (log(srcArr[i]) - log(tarArr[i])))
        # Per-row computation with SciPy (scipy.stats.entropy normalizes each row to sum to 1):
        # result += scipy.stats.entropy(srcArr[i], tarArr[i])
    return result

# Read CSV files
srcFrame = pd.read_csv('OGSrc.csv')
tarFrame = pd.read_csv('OGTar.csv')
# Convert DataFrame type to array type
srcData = srcFrame.values
tarData = tarFrame.values
print(srcData.shape, tarData.shape)
result = kl(srcData, tarData)
print(result)
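
KL divergence is only defined for probability distributions, so if the CSV rows are raw feature vectors they would normally be normalized (and kept strictly positive) before calling kl. A minimal sketch, reusing srcData, tarData, and kl from the script above and assuming the values are non-negative:

# Normalize each row into a probability vector before computing KL divergence.
# Assumption: the raw values are non-negative; eps avoids log(0).
eps = 1e-12
srcProb = (srcData + eps) / (srcData + eps).sum(axis=1, keepdims=True)
tarProb = (tarData + eps) / (tarData + eps).sum(axis=1, keepdims=True)
print(kl(srcProb, tarProb))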

3. Maximum Mean Discrepancy (MMD)

The source-domain data set (drawn from distribution p) and the target-domain data set (drawn from distribution q) are mapped into a reproducing kernel Hilbert space (RKHS), and the means of the two mapped samples are computed in that space; the MMD is the distance between these two means. In practice, the kernel function that yields the largest discrepancy on the current data sets is found by experiment and used as the MMD test statistic, which then serves to judge whether the two distributions are the same.
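
With a kernel k, a source sample X = {x_1, ..., x_m} drawn from p and a target sample Y = {y_1, ..., y_n} drawn from q, the (biased) empirical estimate of the squared MMD computed by the code below is:

$$ \widehat{\mathrm{MMD}}^2(X, Y) = \frac{1}{m^2}\sum_{i=1}^{m}\sum_{j=1}^{m} k(x_i, x_j) \;-\; \frac{2}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n} k(x_i, y_j) \;+\; \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n} k(y_i, y_j) $$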



from numpy import *
import pandas as pd

# Gaussian (RBF) kernel: k(x, y) = exp(-||x - y||^2 / (2 * sigma^2))
def gaussianKernel(xArr, yArr, s):
    sqDist = sum((xArr - yArr)**2)
    return exp(-sqDist / s)

# Biased empirical estimate of MMD^2 between the two samples
def mmd(srcArr, tarArr, sigma):
    s = 2 * (sigma**2)
    m = srcArr.shape[0]
    n = tarArr.shape[0]
    result1 = 0
    result2 = 0
    result3 = 0
    # Source-source kernel sum
    for i in range(m):
        for j in range(m):
            result1 += gaussianKernel(srcArr[i], srcArr[j], s)
    # Source-target kernel sum
    for i in range(m):
        for j in range(n):
            result2 += gaussianKernel(srcArr[i], tarArr[j], s)
    # Target-target kernel sum (both loops run over the target sample)
    for i in range(n):
        for j in range(n):
            result3 += gaussianKernel(tarArr[i], tarArr[j], s)
    print(result1, result2, result3)
    return 1/(m**2)*result1 - 2/(m*n)*result2 + 1/(n**2)*result3

# Read CSV files
srcFrame = pd.read_csv('OGSrc.csv')
tarFrame = pd.read_csv('OGTar.csv')
# Convert DataFrame type to array type
srcData = srcFrame.values
tarData = tarFrame.values
print(srcData.shape, tarData.shape)
sigma = 0.1
print(mmd(srcData, tarData, sigma))
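
The nested loops above evaluate the kernel O(m^2 + mn + n^2) times; for larger data sets the same biased estimate can be sketched in vectorized form using pairwise squared distances. SciPy's cdist is an assumption here and is not used in the original code:

# Vectorized sketch of the same biased MMD^2 estimate (assumption: SciPy is available)
from numpy import *
from scipy.spatial.distance import cdist

def mmd_vectorized(srcArr, tarArr, sigma):
    s = 2 * (sigma**2)
    Kss = exp(-cdist(srcArr, srcArr, 'sqeuclidean') / s)   # source-source kernel matrix
    Kst = exp(-cdist(srcArr, tarArr, 'sqeuclidean') / s)   # source-target kernel matrix
    Ktt = exp(-cdist(tarArr, tarArr, 'sqeuclidean') / s)   # target-target kernel matrix
    # Means of the kernel matrices give the 1/m^2, 1/(mn), 1/n^2 weighted sums
    return Kss.mean() - 2 * Kst.mean() + Ktt.mean()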

References:

1. The KL divergence and MMD formulas in this article are taken from CSDN blog posts (originally pasted as screenshots).

2. Machine Learning in Action

