1. Overview
We'll write code (Python/Numpy, etc.) to better understand everything from the basics of data preprocessing to techniques used in deep learning.
Starting with basic but very useful concepts in data science and machine learning/deep learning, such as variance and covariance matrices, we'll move on to some preprocessing techniques used to feed images into neural networks. We will use concrete code to understand what each equation does!
We call preprocessing all the transformations applied to the raw data before it is fed into a machine learning or deep learning algorithm. For example, training a convolutional neural network on raw images may result in poor classification performance. Preprocessing is also important to speed up training (e.g., centering and scaling techniques).
2. Variance and Covariance
The variance of a variable describes how widely its values are spread. The covariance measures the degree of dependence between two variables. A positive covariance means that the values of the first variable tend to be large when the values of the second variable are also large. A negative covariance means the opposite: large values of one variable are associated with small values of the other. The covariance value depends on the scale of the variables, which makes it hard to interpret. The correlation coefficient is easier to interpret: it is just the covariance normalized by the standard deviations of the two variables.
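To make the scale dependence concrete, here is a minimal sketch. The helper names covariance and correlation and the example values are assumptions chosen for illustration. Multiplying one variable by 10 multiplies the covariance by 10 but leaves the correlation unchanged.

```python
import numpy as np

def covariance(x, y):
    """Population covariance of two 1D arrays (division by n)."""
    return np.mean((x - x.mean()) * (y - y.mean()))

def correlation(x, y):
    """Covariance normalized by the standard deviations of both variables."""
    return covariance(x, y) / (x.std() * y.std())

# Illustrative values only
x = np.array([1.0, 5.0, 3.0])
y = np.array([5.0, 1.0, 6.0])
x_scaled = 10 * x  # same variable expressed on a different scale

print(covariance(x, y), covariance(x_scaled, y))    # covariance changes with the scale
print(correlation(x, y), correlation(x_scaled, y))  # correlation does not
```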
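A covariance matrix is a matrix that summarizes the variances and covariances of a set of vectors, and it can tell you a lot about your variables. The diagonal entries correspond to the variance of each vector.
The formula for calculating the variance is as follows:
$$V(X) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2$$
where n is the length of the vector and $\bar{x}$ is the mean of the vector. For example, the variance of the first column vector of A (A is the 3×3 matrix used in the examples below; its first column is [1, 5, 3]) is:
$$V(A_{:,1}) = \frac{(1-3)^2 + (5-3)^2 + (3-3)^2}{3} = 2.67$$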
This is the first cell of our covariance matrix. The second element on the diagonal corresponds to the variance of the second column vector of A, and so on.
Note: The vectors extracted from matrix A correspond to the columns of A.
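As a quick check of the variance formula, the following sketch computes the variance of the first column of A by hand and compares it with np.var. It assumes the matrix A used in this section's examples.

```python
import numpy as np

A = np.array([[1, 3, 5], [5, 4, 1], [3, 8, 6]])

first_col = A[:, 0]            # the first column vector: [1, 5, 3]
n = first_col.shape[0]         # length of the vector
mean = first_col.sum() / n     # mean of the vector

# Sum of squared deviations from the mean, divided by n
variance = ((first_col - mean) ** 2).sum() / n

print(variance)
print(np.var(first_col))       # np.var also divides by n by default (ddof=0)
```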
The other cells correspond to the covariance between two column vectors of A. For example, the covariance between the first and third columns is located in the covariance matrix at column 1 and row 3 (or column 3 and row 1).
Let's confirm that the covariance between the first and third column vectors of A is equal to -2.67. The formula for the covariance between two variables X and Y is:
$$cov(X, Y) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$
The variables X and Y are the first and third column vectors from the previous example. Let's break this formula down:
1. The summation symbol means that we iterate over the elements of the vectors. We start with the first element (i=1) and compute the first element of X minus the mean of the vector X.
2. Multiply the result by the first element of Y minus the mean of the vector Y.
3. Repeat the process for each element of the vector and calculate the sum of all results.
4. Divide by the number of elements in the vector.
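The four steps above can be sketched as an explicit loop. The function name covariance_by_steps is hypothetical; the input values are the first and third columns of the matrix A used in this section.

```python
import numpy as np

def covariance_by_steps(x, y):
    """Population covariance of two equal-length sequences (divides by n)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    mean_x, mean_y = x.mean(), y.mean()
    total = 0.0
    for x_i, y_i in zip(x, y):
        # Steps 1 and 2: deviation of x_i times deviation of y_i
        # Step 3: accumulate the products over all elements
        total += (x_i - mean_x) * (y_i - mean_y)
    # Step 4: divide by the number of elements
    return total / len(x)

result = covariance_by_steps([1, 5, 3], [5, 1, 6])
print(round(result, 2))  # -2.67
```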
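1. Example 1
With the following matrix A, we will calculate the covariance between the first column vector and the third column vector:
$$A = \begin{bmatrix} 1 & 3 & 5 \\ 5 & 4 & 1 \\ 3 & 8 & 6 \end{bmatrix}$$
Here $X = [1, 5, 3]$ and $Y = [5, 1, 6]$, $\bar{x} = 3$, $\bar{y} = 4$, $n = 3$, so
$$cov(X, Y) = \frac{(1-3)(5-4) + (5-3)(1-4) + (3-3)(6-4)}{3} = \frac{-8}{3} = -2.67$$
This is indeed the value found in the covariance matrix.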
Now using Numpy, the covariance matrix can be calculated with the function np.cov. Note that if you want Numpy to treat the columns as the variables, you must use the parameter rowvar=False. Also, bias=True divides by n instead of n-1.
import numpy as np
A = np.array([[1, 3, 5], [5, 4, 1], [3, 8, 6]])
np.cov(A, rowvar=False, bias=True)
Calculation results
array([[ 2.66666667, 0.66666667, -2.66666667],
[ 0.66666667, 4.66666667, 2.33333333],
[-2.66666667, 2.33333333, 4.66666667]])
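To see what bias=True changes, this short sketch compares the two normalizations on the same matrix A. The variable names are chosen for illustration. The two results differ only by the factor n/(n-1).

```python
import numpy as np

A = np.array([[1, 3, 5], [5, 4, 1], [3, 8, 6]])
n = A.shape[0]  # number of observations (rows)

cov_biased = np.cov(A, rowvar=False, bias=True)  # divides by n
cov_unbiased = np.cov(A, rowvar=False)           # default: divides by n - 1

print(cov_unbiased)
# The two matrices are proportional, with ratio n / (n - 1)
print(np.allclose(cov_unbiased, cov_biased * n / (n - 1)))
```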
2. Example 2
There is another way to calculate the covariance matrix of A, using the dot product. You can center A around 0 (subtract the mean of each vector from each of its elements), multiply its transpose by A itself, and divide by the number of observations. Let's start with an implementation, and then we'll try to understand the connection with the previous equation:
def calculateCovariance(X):
    # Mean of each column (each column is a variable)
    meanX = np.mean(X, axis=0)
    # Number of observations (rows)
    lenX = X.shape[0]
    # Center each column around 0
    X = X - meanX
    # Dot product of the centered data with itself, divided by n
    covariance = X.T.dot(X) / lenX
    return covariance
Let's compute it on matrix A:
calculateCovariance(A)
Calculation results
array([[ 2.66666667, 0.66666667, -2.66666667],
[ 0.66666667, 4.66666667, 2.33333333],
[-2.66666667, 2.33333333, 4.66666667]])
The results are the same as in Example 1.
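The dot product between two vectors X and Y can be expressed as:
$$X^T Y = \sum_{i=1}^{n} x_i y_i$$
Then we divide by the number n of elements in the vectors:
$$\frac{1}{n} X^T Y = \frac{1}{n} \sum_{i=1}^{n} x_i y_i$$
You can see that this is very similar to the covariance formula we saw above:
$$cov(X, Y) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$
The only difference is that in the covariance formula we subtract the mean of each vector from each of its elements. That's why we need to center the data before doing the dot product.
Now if we have a centered matrix A, the dot product between its transpose and A, divided by n, gives a new matrix whose cell (j, k) is the covariance between columns j and k of A:
$$\frac{1}{n} A^T A$$
This is the covariance matrix!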
3. Data and covariance matrix visualization
To gain a deeper understanding of the covariance matrix and its purpose, we will create a function to visualize it with 2D data. You will be able to see the connection between the covariance matrix and the data.
This function will calculate the covariance matrix as we saw above. It will create two subplots: one for the covariance matrix and one for the data. Seaborn's heatmap function is used to create color gradients: small values are colored light green and large values dark blue. The data is represented as a scatter plot. We chose one of our palette colors, but you may prefer others.
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

def plotDataAndCov(data):
    # Covariance matrix with columns as variables, dividing by n
    ACov = np.cov(data, rowvar=False, bias=True)
    print('Covariance matrix:\n', ACov)
    plt.figure(figsize=(10, 10))
    # Left panel: heatmap of the covariance matrix
    ax0 = plt.subplot(2, 2, 1)
    # Choosing the colors
    cmap = sns.color_palette("GnBu", 10)
    sns.heatmap(ACov, cmap=cmap, vmin=0)
    # Right panel: scatter plot of the first two columns
    ax1 = plt.subplot(2, 2, 2)
    # data can include the colors
    if data.shape[1] == 3:
        c = data[:, 2]
    else:
        c = "#0A98BE"
    ax1.scatter(data[:, 0], data[:, 1], c=c, s=40)
    # Remove the top and right axes from the data plot
    ax1.spines['right'].set_visible(False)
    ax1.spines['top'].set_visible(False)
Now that we have the plotting function, we'll generate some random data to visualize what the covariance matrix can tell us. We will use the Numpy function np.random.normal() to draw samples from a normal distribution.
1. Uncorrelated data
This function takes the mean, the standard deviation, and the number of observations of the distribution as input. We will create two random variables of 300 observations each, with a standard deviation of 1. The first has a mean of 2 and the second has a mean of 1. Since the two samples of 300 observations are drawn independently from normal distributions, the two vectors will be uncorrelated.
np.random.seed(1234)
a1 = np.random.normal(2, 1, 300)
a2 = np.random.normal(1, 1, 300)
A = np.array([a1, a2]).T
A.shape
Note 1: We transpose the data with .T because the original shape is (2, 300) and we want the observations as rows (hence the shape (300, 2)).
Note 2: We set the seed with np.random.seed so that the same random numbers are generated on every run, for reproducibility.
Let's check what the first 10 rows of the data look like:
A[:10, :]
array([[ 2.47143516, 1.52704645],
[ 0.80902431, 1.7111124 ],
[ 3.43270697, 0.78245452],
[ 1.6873481 , 3.63779121],
[ 1.27941127, -0.74213763],
[ 2.88716294, 0.90556519],
[ 2.85958841, 2.43118375],
[ 1.3634765 , 1.59275845],
[ 2.01569637, 1.1702969 ],
[-0.24268495, -0.75170595]])
We have two column vectors.
Now, we can check if the distribution is normal:
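# sns.distplot is deprecated in recent Seaborn versions; histplot with kde=True draws a similar plot
sns.histplot(A[:,0], color="#53BB04", kde=True, stat="density")
sns.histplot(A[:,1], color="#0A98BE", kde=True, stat="density")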
plt.show()
plt.close()
We can see that the distributions have the same standard deviation, but different means (1 and 2). So this is exactly what we asked for!
Now we can plot our dataset and its covariance matrix with our function:
plotDataAndCov(A)
plt.show()
plt.close()
The result is as follows
Covariance matrix:
[[ 0.95171641 -0.0447816 ]
[-0.0447816 0.87959853]]
2. Correlated data
Now, let's construct dependent data by defining one column in terms of the other.
np.random.seed(1234)
b1 = np.random.normal(3, 1, 300)
b2 = b1 + np.random.normal(7, 1, 300)/2.
B = np.array([b1, b2]).T
plotDataAndCov(B)
plt.show()
plt.close()
The result is as follows
Covariance matrix:
[[ 0.95171641 0.92932561]
[ 0.92932561 1.12683445]]
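Because covariance values depend on the scale, it can help to compare the two datasets with the correlation coefficient as well. The sketch below regenerates both datasets with the same seed as above and uses np.corrcoef; the names corr_independent and corr_dependent are illustrative.

```python
import numpy as np

# Independent columns, as in the uncorrelated example
np.random.seed(1234)
a1 = np.random.normal(2, 1, 300)
a2 = np.random.normal(1, 1, 300)

# Second column built from the first, as in the correlated example
np.random.seed(1234)
b1 = np.random.normal(3, 1, 300)
b2 = b1 + np.random.normal(7, 1, 300) / 2.

# np.corrcoef returns the 2 x 2 correlation matrix; take the off-diagonal cell
corr_independent = np.corrcoef(a1, a2)[0, 1]
corr_dependent = np.corrcoef(b1, b2)[0, 1]

print('independent columns:', corr_independent)  # close to 0
print('dependent columns:', corr_dependent)      # close to 1
```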
4. Preprocessing
1. Mean normalization
Mean normalization simply removes the mean from each observation:
$$X' = X - \bar{x}$$
where X' is the normalized dataset, X is the original dataset, and $\bar{x}$ is the mean of X.
It has the effect of centering the data around 0. We will create the function center() to do that:
def center(X):
    # Subtract the mean of each column from that column
    newX = X - np.mean(X, axis=0)
    return newX
Let's try the matrix B we created above:
BCentered = center(B)
print('Before:\n\n')
plotDataAndCov(B)
plt.show()
plt.close()
print('After:\n\n')
plotDataAndCov(BCentered)
plt.show()
plt.close()
The result is as follows
Before:
Covariance matrix:
[[ 0.95171641 0.92932561]
[ 0.92932561 1.12683445]]
After:
Covariance matrix:
[[ 0.95171641 0.92932561]
[ 0.92932561 1.12683445]]
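The covariance matrix is identical before and after because covariance only depends on deviations from the mean. A minimal check, rebuilding B with the same seed as above (the variable names covBefore and covAfter are illustrative):

```python
import numpy as np

np.random.seed(1234)
b1 = np.random.normal(3, 1, 300)
b2 = b1 + np.random.normal(7, 1, 300) / 2.
B = np.array([b1, b2]).T

# Same operation as center(): subtract the mean of each column
BCentered = B - B.mean(axis=0)

print(B.mean(axis=0))          # original column means
print(BCentered.mean(axis=0))  # zero, up to floating point error

covBefore = np.cov(B, rowvar=False, bias=True)
covAfter = np.cov(BCentered, rowvar=False, bias=True)
print(np.allclose(covBefore, covAfter))
```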
2. Standardization
Standardization is used to put all features on the same scale. The way to do this is to divide each zero-centered dimension by its standard deviation:
$$X' = \frac{X - \bar{x}}{\sigma_X}$$
where X′ is the standardized dataset, X is the original dataset, x¯ is the mean of X, and σX is the standard deviation of X.
def standardize(X):
    # Center each column, then divide it by its standard deviation
    newX = center(X) / np.std(X, axis=0)
    return newX
Let's create another dataset with a different scale to check if it works.
np.random.seed(1234)
c1 = np.random.normal(3, 1, 300)
c2 = c1 + np.random.normal(7, 5, 300)/2.
C = np.array([c1, c2]).T
plotDataAndCov(C)
plt.xlim(0, 15)
plt.ylim(0, 15)
plt.show()
plt.close()
The result is as follows
Covariance matrix:
[[ 0.95171641 0.83976242]
[ 0.83976242 6.22529922]]
We can see that the x and y scales are different. Also note that the correlation appears to be smaller because of the scale difference. Now let's standardize it:
CStandardized = standardize(C)
plotDataAndCov(CStandardized)
plt.show()
plt.close()
The result is as follows
Covariance matrix:
[[ 1. 0.34500274]
[ 0.34500274 1. ]]
You can now see that the scales are the same and that the dataset is centered at zero along both axes. Now, take a look at the covariance matrix: the variance of each coordinate (top-left cell and bottom-right cell) is equal to 1. This new covariance matrix is actually the correlation matrix. The Pearson correlation coefficient between the two variables (c1 and c2) is 0.34500274, the off-diagonal value.
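The claim that the covariance matrix of standardized data is the correlation matrix can be checked directly against np.corrcoef. This sketch rebuilds C with the same seed as above; covStandardized and corr are illustrative names.

```python
import numpy as np

np.random.seed(1234)
c1 = np.random.normal(3, 1, 300)
c2 = c1 + np.random.normal(7, 5, 300) / 2.
C = np.array([c1, c2]).T

# Same operation as standardize(): center, then divide by the standard deviation
CStandardized = (C - C.mean(axis=0)) / C.std(axis=0)

covStandardized = np.cov(CStandardized, rowvar=False, bias=True)
corr = np.corrcoef(C, rowvar=False)  # correlation matrix of the original data

print(covStandardized)
print(np.allclose(covStandardized, corr))
```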
3. Whitening
Whitening, or sphering, data means that we want to transform it so that its covariance matrix is the identity matrix (1 on the diagonal and 0 in the other cells). It is called whitening in reference to white noise.
Performing whitening requires the following steps:
1- Zero-center the data
2- Decorrelate the data
3- Rescale the data
Again, use the matrix C above.
(1) Zero-centering
CCentered = center(C)
plotDataAndCov(CCentered)
plt.show()
plt.close()
Covariance matrix:
[[ 0.95171641 0.83976242]
[ 0.83976242 6.22529922]]
(2) Decorrelate
We need to decorrelate the data. Intuitively, this means rotating the data until there is no correlation left.
The question is: how can we find the rotation that decorrelates the data? In fact, this is exactly what the eigenvectors of the covariance matrix give us: they indicate the directions in which the data spreads the most.
So we can decorrelate the data by projecting it onto the basis formed by the eigenvectors. This has the effect of applying the desired rotation and removing the correlation between dimensions. Here are the steps:
1- Compute the covariance matrix
2- Compute the eigenvectors of the covariance matrix
3- Apply the matrix of eigenvectors to the data (this applies the rotation)
Let's create a function:
def decorrelate(X):
    XCentered = center(X)
    cov = XCentered.T.dot(XCentered) / float(XCentered.shape[0])
    # Calculate the eigenvalues and eigenvectors of the covariance matrix
    eigVals, eigVecs = np.linalg.eig(cov)
    # Apply the eigenvectors to the zero-centered data (this applies the rotation)
    decorrelated = XCentered.dot(eigVecs)
    return decorrelated
Let's try decorrelating our zero-centered matrix C to see it in action:
plotDataAndCov(C)
plt.show()
plt.close()
CDecorrelated = decorrelate(C)
plotDataAndCov(CDecorrelated)
plt.xlim(-5,5)
plt.ylim(-5,5)
plt.show()
plt.close()
Covariance matrix:
[[ 0.95171641 0.83976242]
[ 0.83976242 6.22529922]]
Covariance matrix:
[[ 5.96126981e-01 -1.48029737e-16]
[ -1.48029737e-16 3.15205774e+00]]
We can see that the correlation no longer exists, and the covariance matrix (now a diagonal matrix) confirms that the covariance between the two dimensions is equal to 0.
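Two properties are worth checking here: the eigenvector matrix has orthonormal columns (so applying it is a rotation, possibly combined with a reflection), and the variances left on the diagonal are the eigenvalues. The sketch below is an illustration that swaps in np.linalg.eigh, NumPy's routine for symmetric matrices, instead of np.linalg.eig; that substitution is a choice made here, not the text's.

```python
import numpy as np

np.random.seed(1234)
c1 = np.random.normal(3, 1, 300)
c2 = c1 + np.random.normal(7, 5, 300) / 2.
C = np.array([c1, c2]).T

CCentered = C - C.mean(axis=0)
cov = CCentered.T.dot(CCentered) / CCentered.shape[0]

# eigh assumes a symmetric matrix and returns eigenvalues in ascending order
eigVals, eigVecs = np.linalg.eigh(cov)

# Orthonormal columns: V^T V is the identity matrix
print(np.allclose(eigVecs.T.dot(eigVecs), np.eye(2)))

# After the projection, the covariance matrix is diagonal
# and its diagonal holds the eigenvalues
decorrelated = CCentered.dot(eigVecs)
covDecorrelated = np.cov(decorrelated, rowvar=False, bias=True)
print(np.allclose(covDecorrelated, np.diag(eigVals)))
```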
(3) Rescale the data
The next step is to scale the decorrelated matrix so that its covariance matrix is the identity matrix (ones on the diagonal and zeros in the other cells). To do this, we scale our decorrelated data by dividing each dimension by the square root of its corresponding eigenvalue.
def whiten(X):
    XCentered = center(X)
    cov = XCentered.T.dot(XCentered) / float(XCentered.shape[0])
    # Calculate the eigenvalues and eigenvectors of the covariance matrix
    eigVals, eigVecs = np.linalg.eig(cov)
    # Apply the eigenvectors to the zero-centered data
    decorrelated = XCentered.dot(eigVecs)
    # Rescale the decorrelated data
    whitened = decorrelated / np.sqrt(eigVals + 1e-5)
    return whitened
Note: we add a small value (here 1e-5) to avoid division by 0.
CWhitened = whiten(C)
plotDataAndCov(CWhitened)
plt.xlim(-5,5)
plt.ylim(-5,5)
plt.show()
plt.close()
Covariance matrix:
[[ 9.99983225e-01 -1.06581410e-16]
[ -1.06581410e-16 9.99996827e-01]]
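The diagonal values are slightly below 1 because of the small constant added to the eigenvalues: each variance becomes eigenvalue / (eigenvalue + 1e-5). A self-contained sketch of the check follows; whiten_sketch is a hypothetical name, and it uses np.linalg.eigh as an assumption of this example.

```python
import numpy as np

def whiten_sketch(X, eps=1e-5):
    """Center, rotate onto the eigenvectors, then rescale each dimension."""
    XCentered = X - X.mean(axis=0)
    cov = XCentered.T.dot(XCentered) / XCentered.shape[0]
    eigVals, eigVecs = np.linalg.eigh(cov)
    return XCentered.dot(eigVecs) / np.sqrt(eigVals + eps)

np.random.seed(1234)
c1 = np.random.normal(3, 1, 300)
c2 = c1 + np.random.normal(7, 5, 300) / 2.
C = np.array([c1, c2]).T

CWhitened = whiten_sketch(C)
covWhitened = np.cov(CWhitened, rowvar=False, bias=True)

# Identity matrix up to the small bias introduced by eps
print(np.allclose(covWhitened, np.eye(2), atol=1e-4))
```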
5. Image whitening
Let's take a look at how whitening can be applied to preprocess an image dataset. For this, we will follow the paper by Pal & Sudeep (2016), which provides some details on the process. This preprocessing technique is called Zero Component Analysis (ZCA).
Check out the paper to see the results they obtained.
First, load the CIFAR10 dataset. It is available from Keras.
from keras.datasets import cifar10
(X_train, y_train), (X_test, y_test) = cifar10.load_data()
X_train.shape
The training set of the CIFAR10 dataset contains 50,000 images. The shape of X_train is (50000, 32, 32, 3). Each image is 32px by 32px, and each pixel contains 3 dimensions (R, G, B). Each value is the brightness of the corresponding color between 0 and 255.
We'll start by selecting only a subset of the images, say 1000:
X = X_train[:1000]
print(X.shape)
Now we will reshape the array to have flat image data with one image per row. Each image will be (1, 3072) because 32×32×3=3072. So the array with all the images will be (1000, 3072):
X = X.reshape(X.shape[0], X.shape[1]*X.shape[2]*X.shape[3])
print(X.shape)
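The flattening step can be tried on a tiny array first. This sketch uses a hypothetical toy batch (2 images of 4 x 4 pixels with 3 channels) instead of CIFAR10 and shows that reshaping a row back recovers the original image.

```python
import numpy as np

# Hypothetical toy batch: 2 images, 4 x 4 pixels, 3 color channels
images = np.arange(2 * 4 * 4 * 3).reshape(2, 4, 4, 3)

# One image per row; -1 lets Numpy infer 4 * 4 * 3 = 48
flat = images.reshape(images.shape[0], -1)
print(flat.shape)  # (2, 48)

# Reshaping a row back gives the original image
restored = flat[0].reshape(4, 4, 3)
print(np.array_equal(restored, images[0]))
```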
The next step is to be able to see the images. The Matplotlib function imshow() can be used to display images. It needs an image of shape (M x N x 3), so let's create a function that reshapes a flattened image of shape (1, 3072) and displays it.
def plotImage(X):
    plt.figure(figsize=(1.5, 1.5))
    # Restore the flattened image to 32 x 32 pixels with 3 color channels
    plt.imshow(X.reshape(32, 32, 3))
    plt.show()
    plt.close()
For example, let's draw an image we loaded:
plotImage(X[12, :])
We can now whiten the image.
1. The first step is to rescale the images to the range [0, 1] by dividing by 255 (the maximum pixel value).
We use the following formula:
$$X' = \frac{X - \min(X)}{\max(X) - \min(X)}$$
Here the minimum value is 0 and the maximum is 255, so:
X_norm = X / 255.
print('X.min()', X_norm.min())
print('X.max()', X_norm.max())
X.min() 0.0
X.max() 1.0
2. Subtract the mean value from all images.
One approach is to take each image and remove the average of that image from each pixel (Jarrett et al., 2009). The intuition behind this process is that it centers the pixels of each image around 0.
Another way is to take each of the 3072 pixels we have (32 x 32 pixels for R, G, and B) and subtract the mean of that pixel across all images. This is called per-pixel mean subtraction. This time, each pixel is centered around 0 across all images. When you feed images to the network, each pixel is treated as a different feature. With per-pixel mean subtraction, we center each feature (pixel) around 0. This is the usual approach.
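The difference between the two approaches comes down to the axis along which the mean is taken. A minimal sketch on a hypothetical toy array (5 "images" of 12 "pixels" each, random values):

```python
import numpy as np

rng = np.random.RandomState(0)
X_toy = rng.rand(5, 12)  # rows are images, columns are pixels

# Per-image mean subtraction: one mean per row
per_image = X_toy - X_toy.mean(axis=1, keepdims=True)

# Per-pixel mean subtraction: one mean per column
per_pixel = X_toy - X_toy.mean(axis=0)

print(np.allclose(per_image.mean(axis=1), 0))  # each image is centered
print(np.allclose(per_pixel.mean(axis=0), 0))  # each pixel (feature) is centered
```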
We will now apply per-pixel mean subtraction to our 1000 images. Our data is organized along the dimensions (images, pixels): its shape is (1000, 3072) because there are 1000 images with 32×32×3=3072 pixel values each. Therefore, the mean of each pixel can be obtained by averaging along the first axis:
X_norm.mean(axis=0).shape
X_norm = X_norm - X_norm.mean(axis=0)
X_norm.mean(axis=0)
Now we want to compute the covariance matrix of the zero-centered data. As we saw above, we can use Numpy's np.cov() function to calculate it.
From the matrix X we can compute two possible correlation matrices: the correlation between rows or between columns. In our case, each row of X is an image, so the rows correspond to observations and the columns correspond to features (image pixels). We want to compute the correlations between pixels because the goal of whitening is to remove these correlations so that the algorithm can focus on higher-order relationships.
To do this, set Numpy's parameter rowvar=False: it will use the columns as variables (or features) and the rows as observations.
cov = np.cov(X_norm, rowvar=False)
The shape of the covariance matrix should be 3072 x 3072, representing the correlation between each pair of pixels (there are 3072 pixel values):
cov.shape
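The effect of rowvar on the shape of the result can be seen on a small array. In this sketch, X_toy is a hypothetical dataset with 10 observations and 4 features.

```python
import numpy as np

rng = np.random.RandomState(0)
X_toy = rng.rand(10, 4)  # 10 observations (rows), 4 features (columns)

# Columns as variables: covariance between features
cov_features = np.cov(X_toy, rowvar=False)
print(cov_features.shape)      # (4, 4)

# Default (rowvar=True): rows as variables, covariance between observations
cov_observations = np.cov(X_toy)
print(cov_observations.shape)  # (10, 10)
```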
We will compute the singular values and vectors of the covariance matrix and use them to rotate our dataset.
U,S,V = np.linalg.svd(cov)
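print(np.diag(S))
print('\nshape:', np.diag(S).shape)
np.diag(S) has the shape (3072, 3072), and so do U and its transpose U.T. We also saw that X has the shape (1000, 3072), so we need to transpose it to (3072, 1000). The shape of the product below is therefore (3072, 3072) · (3072, 3072) · (3072, 3072) · (3072, 1000) → (3072, 1000), which corresponds to the shape of the initial dataset after transposition. The final .T brings the result back to (1000, 3072).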
epsilon = 0.1
X_ZCA = U.dot(np.diag(1.0/np.sqrt(S + epsilon))).dot(U.T).dot(X_norm.T).T
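The same computation can be wrapped in a function and checked on small synthetic data. This is a sketch under stated assumptions: zca_whiten and X_toy are hypothetical names, the data is random rather than CIFAR10, and a very small epsilon is used so that the whitened covariance can be compared with the identity matrix.

```python
import numpy as np

def zca_whiten(X, epsilon=0.1):
    """ZCA-whiten X, where rows are observations and columns are features."""
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)
    U, S, _ = np.linalg.svd(cov)
    # Whitening matrix U diag(1 / sqrt(S + epsilon)) U^T (symmetric)
    W = U.dot(np.diag(1.0 / np.sqrt(S + epsilon))).dot(U.T)
    # Equivalent to W.dot(X_centered.T).T because W is symmetric
    return X_centered.dot(W)

# Toy data: 500 observations, 6 features, with correlations added by hand
rng = np.random.RandomState(0)
X_toy = rng.randn(500, 6)
X_toy[:, 1] += 0.8 * X_toy[:, 0]
X_toy[:, 3] -= 0.5 * X_toy[:, 2]

X_white = zca_whiten(X_toy, epsilon=1e-8)
cov_white = np.cov(X_white, rowvar=False)
print(np.allclose(cov_white, np.eye(6), atol=1e-6))
```

With a larger epsilon such as 0.1, the covariance of the output is no longer exactly the identity matrix: directions with small eigenvalues are amplified less.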
Let's rescale the image:
X_ZCA_rescaled = (X_ZCA - X_ZCA.min()) / (X_ZCA.max() - X_ZCA.min())
print('min:', X_ZCA_rescaled.min())
print('max:', X_ZCA_rescaled.max())
Finally, we can see the effect of whitening by comparing an image before and after whitening:
plotImage(X[12, :])
plotImage(X_ZCA_rescaled[12, :])
The result looks like the images in the Pal & Sudeep (2016) paper. They use epsilon = 0.1. We can try other values to see the effect on the image.
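One way to reason about epsilon without plotting: after whitening, the variance along an eigen-direction with eigenvalue s is s / (s + epsilon). The sketch below uses hypothetical eigenvalues chosen for illustration, and whitened_variances is an illustrative helper name.

```python
import numpy as np

def whitened_variances(S, epsilon):
    """Variance left along each eigen-direction after whitening with epsilon."""
    return S / (S + epsilon)

# Hypothetical eigenvalues, from a strong direction to a very weak one
S = np.array([10.0, 1.0, 0.1, 0.001])

for epsilon in (1e-5, 0.1, 1.0):
    print(epsilon, whitened_variances(S, epsilon))
```

A tiny epsilon scales every direction to a variance close to 1, including the weakest ones. A larger epsilon leaves the weak directions with a variance well below 1.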