1. Overview

We'll work through code (Python/Numpy) to better understand everything from the basics of data preprocessing to techniques used in deep learning.

Starting with basic but very useful concepts in data science and machine learning/deep learning, such as variance and covariance matrices, we'll move on to some preprocessing techniques used to feed images into neural networks. We'll use concrete code to understand what each equation does!

We call preprocessing all the transformations applied to the raw data before it is fed to a machine learning or deep learning algorithm. For example, training a convolutional neural network on raw images may result in poor classification performance. Preprocessing is also important to speed up training (e.g., centering and scaling techniques).

2. Variance and Covariance

The variance of a variable describes how widely its values are spread. The covariance is a measure of the degree of dependence between two variables. A positive covariance means that the values of the first variable are large when the values of the second variable are also large. A negative covariance means the opposite: large values of one variable are associated with small values of the other. The value of the covariance depends on the scale of the variables, which makes it hard to interpret. The correlation coefficient is easier to interpret: it is just the normalized covariance.

Positive covariance means that large values of one variable are associated with large values of another variable (left). Negative covariance means that large values of one variable are associated with small values of another variable (right).
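To make the scale dependence concrete, here is a minimal sketch. The height and weight values are made-up toy data, and the helper names `covariance` and `correlation` are ours, not part of Numpy.

```python
import numpy as np

# Hypothetical toy data: heights in metres and weights in kilograms
height_m = np.array([1.60, 1.70, 1.75, 1.80, 1.90])
weight_kg = np.array([55.0, 65.0, 70.0, 80.0, 90.0])

def covariance(x, y):
    # Covariance with division by n
    return np.mean((x - x.mean()) * (y - y.mean()))

def correlation(x, y):
    # Covariance normalized by the two standard deviations
    return covariance(x, y) / (x.std() * y.std())

# Same data, with heights expressed in metres and then in centimetres
cov_m = covariance(height_m, weight_kg)
cov_cm = covariance(height_m * 100, weight_kg)

corr_m = correlation(height_m, weight_kg)
corr_cm = correlation(height_m * 100, weight_kg)

print(cov_m, cov_cm)    # the covariance is multiplied by 100
print(corr_m, corr_cm)  # the correlation does not change
```

Changing the unit changes the covariance but leaves the correlation coefficient untouched, which is why the latter is easier to compare across datasets.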

A covariance matrix is a matrix that summarizes the variances and covariances of a set of vectors, and it can tell you a lot about your variables. The diagonal entries correspond to the variance of each vector:

A matrix A and its covariance matrix. The diagonal entries correspond to the variance of each column vector.

The formula for calculating the variance is as follows:

$V(\boldsymbol{X}) = \frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2$

where n is the length of the vector and $\bar{x}$ is the mean of the vector. For example, the variance of the first column vector of A is:

$V(\boldsymbol{A}_{:,1}) = \frac{(1-3)^2+(5-3)^2+(3-3)^2}{3} = 2.67$

This is the first cell of our covariance matrix. The second element on the diagonal corresponds to the variance of the second column vector of A, and so on.

Note: The vectors extracted from matrix A correspond to the columns of A.
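As a quick check of the variance formula, the following sketch computes the variance of the first column by hand and then the variance of every column with `np.var`, which divides by n by default. The matrix A is the one used in the examples of this section.

```python
import numpy as np

A = np.array([[1, 3, 5], [5, 4, 1], [3, 8, 6]])

first_col = A[:, 0]
# Variance written out from the formula (division by n)
manual_var = np.sum((first_col - first_col.mean()) ** 2) / len(first_col)

# np.var divides by n by default (ddof=0); axis=0 gives one value per column
col_vars = np.var(A, axis=0)

print(manual_var)
print(col_vars)
```

The values in `col_vars` are the ones expected on the diagonal of the covariance matrix.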

The other cells correspond to the covariance between two column vectors of A. For example, the covariance between the first and third columns is located in the covariance matrix at column 1 and row 3 (or column 3 and row 1).

The position in the covariance matrix. Columns correspond to the first variable, and rows correspond to the second (or vice versa). The covariance between the first and third column vectors of A is the element in column 1 and row 3 (or vice versa: the value is the same).

Let's confirm that the covariance between the first and third column vectors of A is equal to -2.67. The formula for the covariance between two variables X and Y is:

$cov(\boldsymbol{X},\boldsymbol{Y}) = \frac{1}{n} \sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})$

The variables X and Y are the first and third column vectors from the previous example. Let's split this formula:

1. $(x_1-\bar{x})$ The summation symbol means that we iterate over the elements of the vectors. We start with the first element (i=1) and compute the first element of X minus the mean of the vector X.

2. $(x_1-\bar{x})(y_1-\bar{y})$ Multiply the result by the first element of Y minus the mean of the vector Y.

3. $\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})$ Repeat the process for each element of the vector and calculate the sum of all results.

4. $\frac{1}{n} \sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})$ Divide by the number of elements in the vector.

1. Example 1

With the following matrix $\boldsymbol{A}= \begin{bmatrix} 1 & 3 & 5 \\ 5 & 4 & 1 \\ 3 & 8 & 6 \end{bmatrix}$ , we will calculate the covariance between the first column vector and the third column vector:

$\boldsymbol{X} = \begin{bmatrix} 1 \\ 5 \\ 3 \end{bmatrix}$ and $\boldsymbol{Y} = \begin{bmatrix} 5 \\ 1 \\ 6 \end{bmatrix}$ , $\boldsymbol{\bar{x}}=3$ , $\boldsymbol{\bar{y}}=4$ , $n=3$ , so

$cov(\boldsymbol{X},\boldsymbol{Y}) = \frac{(1-3)(5-4)+(5-3)(1-4)+(3-3)(6-4)}{3} = \frac{-8}{3} = -2.67$

This is the value found in the covariance matrix.
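The four steps of the covariance formula can be sketched directly in Numpy. This is only an illustration of the hand computation; the variable names are ours.

```python
import numpy as np

X = np.array([1, 5, 3])
Y = np.array([5, 1, 6])

# Step 1: subtract the mean of X from each element of X
x_dev = X - X.mean()
# Step 2: multiply by the corresponding element of Y minus the mean of Y
products = x_dev * (Y - Y.mean())
# Step 3: sum the products over all elements
total = products.sum()
# Step 4: divide by the number of elements
cov_xy = total / len(X)

print(products)
print(total)
print(cov_xy)
```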

Now using Numpy, the covariance matrix can be calculated using the function np.cov. It's worth noting that if you want Numpy to use the column as a vector, you must use the parameter rowvar=False. Also, bias=True allows division by n instead of n-1.

import numpy as np

A = np.array([[1, 3, 5], [5, 4, 1], [3, 8, 6]])
np.cov(A, rowvar=False, bias=True)

Calculation results

array([[ 2.66666667,  0.66666667, -2.66666667],
       [ 0.66666667,  4.66666667,  2.33333333],
       [-2.66666667,  2.33333333,  4.66666667]])
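To see what the two parameters change, here is a small sketch comparing the possible calls on the same matrix. It only uses `np.cov` with its `rowvar` and `bias` parameters.

```python
import numpy as np

A = np.array([[1, 3, 5], [5, 4, 1], [3, 8, 6]])
n = A.shape[0]

cov_n = np.cov(A, rowvar=False, bias=True)  # divides by n
cov_n_minus_1 = np.cov(A, rowvar=False)     # default: divides by n - 1

# The two estimates differ only by the factor n / (n - 1)
ratio_ok = np.allclose(cov_n_minus_1, cov_n * n / (n - 1))

# Without rowvar=False, each row is treated as a variable,
# which is the same as working on the transposed matrix
cov_rows = np.cov(A, bias=True)
same_as_transposed = np.allclose(cov_rows, np.cov(A.T, rowvar=False, bias=True))

print(ratio_ok, same_as_transposed)
```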

2. Example 2

There is another way to calculate the covariance matrix of A, using the dot product. You can center A around 0 (subtract the mean of each column vector from each of its elements), multiply the transpose of the centered matrix by the centered matrix itself, and divide by the number of observations. Let's start with an implementation, then we'll try to understand the connection to the previous equation:

def calculateCovariance(X):
    meanX = np.mean(X, axis = 0)
    lenX = X.shape[0]
    X = X - meanX
    covariance = X.T.dot(X)/lenX
    return covariance

Let's compute it on matrix A:

calculateCovariance(A)

Calculation results

array([[ 2.66666667,  0.66666667, -2.66666667],
       [ 0.66666667,  4.66666667,  2.33333333],
       [-2.66666667,  2.33333333,  4.66666667]])

The results are the same as in Example 1.

The dot product between two vectors can be expressed as: $\boldsymbol{X^\text{T}Y}= \sum_{i=1}^{n}(x_i)(y_i)$

The dot product corresponds to the sum of the products of each element of the vector.

Then divide by the number n of elements in the vector: $\frac{1}{n}\boldsymbol{X^\text{T}Y}= \frac{1}{n}\sum_{i=1}^{n}(x_i)(y_i)$

You can notice that this is very similar to the covariance formula we saw above:

$cov(\boldsymbol{X},\boldsymbol{Y}) = \frac{1}{n} \sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})$

The only difference is that in the covariance formula we subtract the mean of each vector from its elements. That's why we need to center the data before doing the dot product.

Now if we have a matrix A, the dot product between the transpose of A and A results in a new matrix, in which each cell is the dot product between two column vectors of A.

With A centered and the result divided by n, this is the covariance matrix!
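To illustrate the link between the matrix product and the individual covariances, the sketch below checks that one cell of the product equals the dot product of two centered columns. The name `gram` is ours.

```python
import numpy as np

A = np.array([[1, 3, 5], [5, 4, 1], [3, 8, 6]], dtype=float)
n = A.shape[0]

# Center each column around 0
A_centered = A - A.mean(axis=0)

# Matrix product of the transposed centered matrix with the centered matrix
gram = A_centered.T.dot(A_centered)

# Cell (row 1, column 3) is the dot product of centered columns 1 and 3
entry = A_centered[:, 0].dot(A_centered[:, 2])

print(gram[0, 2], entry)
print(gram / n)  # dividing by n gives the covariance matrix
```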

3. Data and covariance matrix visualization

To gain a deeper understanding of the covariance matrix and its purpose, we will create a function to visualize it with 2D data. You will be able to see the connection between the covariance matrix and the data.

This function will calculate the covariance matrix as we saw above. It will create two subplots: one for the covariance matrix and one for the data. Seaborn's heatmap function is used to create a color gradient: smaller values will be colored in light green and larger values in dark blue. The data is represented as a scatter plot. We chose one of our palette colors for it, but you may prefer another one.

import matplotlib.pyplot as plt
import seaborn as sns

def plotDataAndCov(data):
    ACov = np.cov(data, rowvar=False, bias=True)
    print('Covariance matrix:\n', ACov)

    # A single 10 x 10 inch figure that holds both subplots
    fig = plt.figure(figsize=(10, 10))

    ax0 = plt.subplot(2, 2, 1)

    # Choosing the colors
    cmap = sns.color_palette("GnBu", 10)
    sns.heatmap(ACov, cmap=cmap, vmin=0)

    ax1 = plt.subplot(2, 2, 2)

    # data can include the colors
    if data.shape[1]==3:
        c=data[:,2]
    else:
        c="#0A98BE"
    ax1.scatter(data[:,0], data[:,1], c=c, s=40)

    # Remove the top and right axes from the data plot
    ax1.spines['right'].set_visible(False)
    ax1.spines['top'].set_visible(False)

Now that we have the plotting function, we'll generate some random data to visualize what the covariance matrix can tell us. We will use the Numpy function np.random.normal() to draw some data from a normal distribution.

1. Uncorrelated data

This function requires the mean, the standard deviation, and the number of observations of the distribution as input. We will create two random variables of 300 observations each with a standard deviation of 1. The first has a mean of 2 and the second has a mean of 1. Since we draw the two samples of 300 observations independently from a normal distribution, the two vectors will be uncorrelated.

np.random.seed(1234)
a1 = np.random.normal(2, 1, 300)
a2 = np.random.normal(1, 1, 300)
A = np.array([a1, a2]).T
A.shape

Note 1: We transpose the data with .T because the original shape is (2, 300) and we want the observations as rows (hence the shape (300, 2)).

Note 2: We use np.random.seed for reproducibility: the same random numbers will be generated on every run.

Let's check what the first 10 rows of the data look like:

array([[ 2.47143516,  1.52704645],
       [ 0.80902431,  1.7111124 ],
       [ 3.43270697,  0.78245452],
       [ 1.6873481 ,  3.63779121],
       [ 1.27941127, -0.74213763],
       [ 2.88716294,  0.90556519],
       [ 2.85958841,  2.43118375],
       [ 1.3634765 ,  1.59275845],
       [ 2.01569637,  1.1702969 ],
       [-0.24268495, -0.75170595]])

We have two column vectors.

Now, we can check that the two distributions look normal:

sns.distplot(A[:,0], color="#53BB04")
sns.distplot(A[:,1], color="#0A98BE")
plt.show()
plt.close()

We can see that the distributions have the same standard deviation, but different means (1 and 2). So this is exactly what we asked for!

Now we can plot our dataset and its covariance matrix with our function:

plotDataAndCov(A)
plt.show()
plt.close()

The result is as follows

Covariance matrix:
[[ 0.95171641 -0.0447816 ]
 [-0.0447816   0.87959853]]

We can see on the scatter plot that the two dimensions are uncorrelated. Note that we have a mean of 1 in one dimension and a mean of 2 in the other. Also, the covariance matrix shows that the variance of each variable is about 1, while the covariance between columns 1 and 2 is very small (about 0). Since we made sure that the two vectors are independent, this is consistent (the converse is not necessarily true: a covariance of 0 does not guarantee independence).
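A small made-up example shows why a covariance of 0 does not guarantee independence. Here y is entirely determined by x, yet the covariance between them is 0 because the relationship is not linear.

```python
import numpy as np

# Hypothetical data: x is symmetric around 0 and y depends entirely on x
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = x ** 2

# Off-diagonal cell of the 2 x 2 covariance matrix of x and y
cov_xy = np.cov(x, y, bias=True)[0, 1]

print(cov_xy)
```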

2. Correlated data

Now, let's construct dependent data by defining one column from the other.

np.random.seed(1234)
b1 =  np.random.normal(3, 1, 300)
b2 = b1 + np.random.normal(7, 1, 300)/2.
B = np.array([b1, b2]).T
plotDataAndCov(B)
plt.show()
plt.close()

The result is as follows

Covariance matrix:
[[ 0.95171641  0.92932561]
 [ 0.92932561  1.12683445]]

The correlation between the two dimensions is visible on the scatter plot. We can see that a line can be drawn and used to predict y from x and vice versa. The covariance matrix is not diagonal (there are non-zero cells outside the diagonal). This means that the covariance between dimensions is not zero.
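As an illustration of the line mentioned above, here is a sketch that derives a least-squares line from the covariance matrix and compares it with `np.polyfit`. Using the ratio covariance / variance as the slope is a standard result for simple linear regression; it is our addition and not part of the plotting function.

```python
import numpy as np

# Same construction as the correlated data above
np.random.seed(1234)
b1 = np.random.normal(3, 1, 300)
b2 = b1 + np.random.normal(7, 1, 300) / 2.

cov = np.cov(b1, b2, bias=True)

# Slope and intercept of the least-squares line predicting b2 from b1
slope = cov[0, 1] / cov[0, 0]
intercept = b2.mean() - slope * b1.mean()

# Reference fit of a degree-1 polynomial (returns slope first, then intercept)
fit_slope, fit_intercept = np.polyfit(b1, b2, 1)

print(slope, intercept)
print(fit_slope, fit_intercept)
```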

4. Preprocessing

1. Mean normalization

Mean normalization simply removes the mean from each observation.

$\boldsymbol{X'} = \boldsymbol{X} - \bar{x}$ , where X' is the normalized dataset, X is the original dataset, and $\bar{x}$ is the mean of X.

It will have the effect of centering the data around 0. We will create the function center() to do that:

def center(X):
    newX = X - np.mean(X, axis = 0)
    return newX

Let's try the matrix B we created above:

BCentered = center(B)

print('Before:\n\n')

plotDataAndCov(B)
plt.show()
plt.close()

print('After:\n\n')

plotDataAndCov(BCentered)
plt.show()
plt.close()

The result is as follows

Before:

Covariance matrix:
[[ 0.95171641  0.92932561]
 [ 0.92932561  1.12683445]]

After:

Covariance matrix:
[[ 0.95171641  0.92932561]
 [ 0.92932561  1.12683445]]

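The first plot shows the original data B again, and the second plot shows the centered data (look at the x and y coordinates to compare them). The covariance matrix is unchanged because the covariance formula already subtracts the mean of each variable.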
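The following sketch checks on a small made-up array that centering moves the column means to 0 without changing the covariance matrix.

```python
import numpy as np

# Hypothetical small dataset: 4 observations, 2 features
data = np.array([[1., 10.], [2., 14.], [3., 11.], [4., 19.]])

# Subtract the mean of each column
data_centered = data - data.mean(axis=0)

means_after = data_centered.mean(axis=0)
cov_before = np.cov(data, rowvar=False, bias=True)
cov_after = np.cov(data_centered, rowvar=False, bias=True)

print(means_after)
print(cov_before)
print(cov_after)
```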

2. Standardization

Standardization is used to put all features on the same scale. The way to do this is to divide each zero-centered dimension by its standard deviation. $\boldsymbol{X'} = \frac{\boldsymbol{X} - \bar{x}}{\sigma_{\boldsymbol{X}}}$ where $\boldsymbol{X'}$ is the standardized dataset, $\boldsymbol{X}$ is the original dataset, $\bar{x}$ is the mean of $\boldsymbol{X}$, and $\sigma_{\boldsymbol{X}}$ is the standard deviation of $\boldsymbol{X}$.

def standardize(X):
    newX = center(X)/np.std(X, axis = 0)
    return newX

Let's create another dataset with a different scale to check if it works.

np.random.seed(1234)
c1 =  np.random.normal(3, 1, 300)
c2 = c1 + np.random.normal(7, 5, 300)/2.
C = np.array([c1, c2]).T

plotDataAndCov(C)
plt.xlim(0, 15)
plt.ylim(0, 15)
plt.show()
plt.close()

The result is as follows

Covariance matrix:
[[ 0.95171641  0.83976242]
 [ 0.83976242  6.22529922]]

We can see that the x and y scales are different. Also note that the correlation appears to be smaller due to the scale difference. Now let's standardize it:

CStandardized = standardize(C)

plotDataAndCov(CStandardized)
plt.show()
plt.close()

The result is as follows

Covariance matrix:
[[ 1.          0.34500274]
 [ 0.34500274  1.        ]]

You can now see that the scales are the same and the dataset is centered at zero along both axes. Now, take a look at the covariance matrix: you can see that the variance of each coordinate (top-left cell and bottom-right cell) is equal to 1. This new covariance matrix is actually the correlation matrix: the Pearson correlation coefficient between the two variables (c1 and c2) is the off-diagonal value, 0.34500274.
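To check the claim that the covariance matrix of standardized data is the correlation matrix, this sketch rebuilds the dataset C as above and compares the result with `np.corrcoef`.

```python
import numpy as np

# Same construction as the dataset C above
np.random.seed(1234)
c1 = np.random.normal(3, 1, 300)
c2 = c1 + np.random.normal(7, 5, 300) / 2.
C = np.array([c1, c2]).T

# Standardize: zero-center, then divide by the standard deviation of each column
C_std = (C - C.mean(axis=0)) / C.std(axis=0)

cov_std = np.cov(C_std, rowvar=False, bias=True)
corr = np.corrcoef(C, rowvar=False)

print(cov_std)
print(corr)
```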

3. Whitening

Whitening, or sphering, data means that we want to transform it so that its covariance matrix is the identity matrix (1 on the diagonal and 0 in the other cells). It is called whitening in reference to white noise.

Performing whitening requires the following steps:

1- Zero-center the data
2- Decorrelate the data
3- Rescale the data

Again, use the matrix C above.

(1) Zero-centering

CCentered = center(C)

plotDataAndCov(CCentered)
plt.show()
plt.close()

Covariance matrix:
[[ 0.95171641  0.83976242]
 [ 0.83976242  6.22529922]]

(2) Decorrelate

We need to decorrelate the data. Intuitively, this means rotating the data until there is no correlation left. See below:

The graph on the left shows correlated data. For example, if you take a data point with a large x value, it is likely that its y value will also be large. Now take all the data points and rotate them (maybe around 45 degrees counterclockwise): the new data (plotted on the right) is no longer correlated.

The question is: how can we find the right rotation to decorrelate the data? In fact, this is exactly what the eigenvectors of the covariance matrix give us: they indicate the directions along which the data spreads the most:

The eigenvectors of the covariance matrix give you the directions that maximize the variance. The direction of the green line is where the variance is greatest. Look at the minimum and maximum points projected on this line: the spread is huge. Compare this to the projection on the orange line: the spread is very small.

So we can decorrelate the data by projecting it onto the basis formed by the eigenvectors. This will have the effect of applying the desired rotation and removing the correlation between dimensions. Here are the steps:

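1- Calculate the covariance matrix
2- Calculate the eigenvectors of the covariance matrix
3- Apply the matrix of eigenvectors to the data (this will apply the rotation)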

Let's create a function:

def decorrelate(X):
    XCentered = center(X)
    cov = XCentered.T.dot(XCentered)/float(XCentered.shape[0])
    # Calculate the eigenvalues and eigenvectors of the covariance matrix
    eigVals, eigVecs = np.linalg.eig(cov)
    # Apply the eigenvectors to the centered data (this rotates it)
    decorrelated = XCentered.dot(eigVecs)
    return decorrelated

Let's try decorrelating our matrix C to see it in action:

plotDataAndCov(C)
plt.show()
plt.close()

CDecorrelated = decorrelate(C)
plotDataAndCov(CDecorrelated)
plt.xlim(-5,5)
plt.ylim(-5,5)
plt.show()
plt.close()

Covariance matrix:
[[ 0.95171641  0.83976242]
 [ 0.83976242  6.22529922]]

Covariance matrix:
[[  5.96126981e-01  -1.48029737e-16]
 [ -1.48029737e-16   3.15205774e+00]]

We can see that the correlation no longer exists, and the covariance matrix (now a diagonal matrix) confirms that the covariance between the two dimensions is equal to 0.
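Here is a sketch that checks this result numerically: after projecting the centered data onto the eigenvectors, the covariance matrix should be diagonal with the eigenvalues on the diagonal. Note that this sketch uses `np.linalg.eigh`, which is designed for symmetric matrices, instead of the `np.linalg.eig` call used in the function above, so the order of the eigenvalues may differ.

```python
import numpy as np

# Same construction as the dataset C above
np.random.seed(1234)
c1 = np.random.normal(3, 1, 300)
c2 = c1 + np.random.normal(7, 5, 300) / 2.
C = np.array([c1, c2]).T

C_centered = C - C.mean(axis=0)
cov = C_centered.T.dot(C_centered) / C_centered.shape[0]

# eigh returns the eigenvalues in ascending order and orthonormal eigenvectors
eig_vals, eig_vecs = np.linalg.eigh(cov)

# Rotate the centered data into the eigenvector basis
rotated = C_centered.dot(eig_vecs)
cov_rotated = rotated.T.dot(rotated) / rotated.shape[0]

print(eig_vals)
print(cov_rotated)
```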

(3) Rescale the data

The next step is to scale the decorrelated matrix in order to obtain a covariance matrix equal to the identity matrix (ones on the diagonal and zeros in the other cells). To do this, we scale our decorrelated data by dividing each dimension by the square root of its corresponding eigenvalue.

def whiten(X):
    XCentered = center(X)
    cov = XCentered.T.dot(XCentered)/float(XCentered.shape[0])
    # Calculate the eigenvalues and eigenvectors of the covariance matrix
    eigVals, eigVecs = np.linalg.eig(cov)
    # Apply the eigenvectors to the centered data (this rotates it)
    decorrelated = XCentered.dot(eigVecs)
    # Rescale the decorrelated data
    whitened = decorrelated / np.sqrt(eigVals + 1e-5)
    return whitened

Note: we add a small value (here $10^{-5}$ ) to avoid division by 0.

CWhitened = whiten(C)

plotDataAndCov(CWhitened)
plt.xlim(-5,5)
plt.ylim(-5,5)
plt.show()
plt.close()

Covariance matrix:
[[  9.99983225e-01  -1.06581410e-16]
 [ -1.06581410e-16   9.99996827e-01]]
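The diagonal values are slightly below 1 because of the small value added before taking the square root: each decorrelated dimension has a variance equal to its eigenvalue, so after rescaling its variance is the eigenvalue divided by (eigenvalue + $10^{-5}$). A short sketch, taking the eigenvalues from the covariance matrix printed for the decorrelated data:

```python
import numpy as np

# Variances of the decorrelated data, copied from the output printed earlier
eig_vals = np.array([0.596126981, 3.15205774])
epsilon = 1e-5

# Variance of each dimension after dividing by sqrt(eigenvalue + epsilon)
whitened_variances = eig_vals / (eig_vals + epsilon)

print(whitened_variances)
```

The printed values should agree with the diagonal of the covariance matrix above.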

5. Image whitening

Let's take a look at how whitening can be applied to preprocess an image dataset. For this, we will follow the paper by Pal & Sudeep (2016), which provides some details on the process. This preprocessing technique is called Zero Component Analysis (ZCA).

Check out the paper, but this is what they got:

Whitened images from the CIFAR10 dataset. Results from the Pal & Sudeep (2016) paper. The original image (left) and the image after ZCA (right) are shown.

First load the CIFAR dataset. This dataset is available from Keras.

from keras.datasets import cifar10
(X_train, y_train), (X_test, y_test) = cifar10.load_data()
X_train.shape

The training set of the CIFAR10 dataset contains 50,000 images. The shape of X_train is (50000, 32, 32, 3). Each image is 32px by 32px, and each pixel contains 3 dimensions (R, G, B). Each value is the brightness of the corresponding color between 0 and 255.

We'll start by selecting only a subset of the images, say 1000:

X = X_train[:1000]
print(X.shape)

Now we will reshape the array to have flat image data with one image per row. Each image will be (1, 3072) because 32×32×3=3072. So the array with all the images will be (1000, 3072):

X = X.reshape(X.shape[0], X.shape[1]*X.shape[2]*X.shape[3])
print(X.shape)

The next step is to be able to see the image. The function imshow() in Matplotlib can be used to display images. It needs an image of shape (MxNx3), so let's create a function to reshape the image and be able to visualize them from shape (1, 3072).

def plotImage(X):
    plt.figure(figsize=(1.5, 1.5))
    plt.imshow(X.reshape(32,32,3))
    plt.show()
    plt.close()

For example, let's draw an image we loaded:

plotImage(X[12, :])
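The flattening and the reshape used for display are inverse operations. The sketch below checks this on a made-up array with the same layout as the image data; it does not load CIFAR10.

```python
import numpy as np

# Hypothetical stand-in for two 32 x 32 RGB images
images = np.arange(2 * 32 * 32 * 3).reshape(2, 32, 32, 3)

# Flatten: one image per row
flat = images.reshape(images.shape[0], -1)

# Reshape one row back to (32, 32, 3), as plotImage() does before display
restored = flat[1].reshape(32, 32, 3)

print(flat.shape)
print(np.array_equal(restored, images[1]))
```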

We can now whiten the image.

1. The first step is to rescale the image to get the range [0, 1] by dividing by 255 (the maximum value in pixels).

Use the following formula $\frac{data - min(data)}{max(data) - min(data)}$ , where the minimum value is 0, so: $\frac{data}{max(data)} = \frac{data}{255}$

X_norm = X / 255.
print('X.min()', X_norm.min())
print('X.max()', X_norm.max())

X.min() 0.0
X.max() 1.0
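The general min-max formula can be sketched as a small function. The name `min_max_rescale` and the tiny array are ours, for illustration only. When the minimum is 0 and the maximum is 255, the result is the same as dividing by 255.

```python
import numpy as np

# Hypothetical tiny "image" covering the full 0-255 range
pixels = np.array([[0, 64], [128, 255]], dtype=np.uint8)

def min_max_rescale(data):
    # Convert to float first so that the arithmetic is not done on uint8
    data = data.astype(np.float64)
    return (data - data.min()) / (data.max() - data.min())

rescaled = min_max_rescale(pixels)

print(rescaled)
print(np.allclose(rescaled, pixels / 255.))
```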

2. Subtract the mean value from all images.

One approach is to take each image and remove the average of that image from each pixel (Jarrett et al., 2009). The intuition behind this process is that it centers the pixels of each image around 0.

Another way is to take each of the 3072 values we have per image (32 x 32 pixels for R, G, and B) and subtract the mean of that value across all images. This is called per-pixel mean subtraction. This time, each pixel will be centered around 0 across all images. When you feed an image to the network, each pixel is treated as a different feature. With per-pixel mean subtraction, we center each feature (pixel) around 0. This is the common way to deal with it.
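The difference between the two approaches comes down to the axis along which the mean is taken. A minimal sketch on a made-up matrix with 4 "images" of 6 "pixels":

```python
import numpy as np

rng = np.random.RandomState(0)
X_toy = rng.rand(4, 6)  # hypothetical data: rows are images, columns are pixels

# Per-image mean subtraction: one mean per row
per_image = X_toy - X_toy.mean(axis=1, keepdims=True)

# Per-pixel mean subtraction: one mean per column
per_pixel = X_toy - X_toy.mean(axis=0)

print(per_image.mean(axis=1))  # each image is centered around 0
print(per_pixel.mean(axis=0))  # each pixel is centered around 0
```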

We will now do a per-pixel mean subtraction on our 1000 images. Our data is organized along the dimensions (images, pixels). Its shape is (1000, 3072) because there are 1000 images with 32×32×3=3072 values each. Therefore, the mean of each pixel can be obtained along the first axis:

X_norm.mean(axis=0).shape
X_norm = X_norm - X_norm.mean(axis=0)
X_norm.mean(axis=0)

Now we want to compute the covariance matrix of the zero-centered data. As we saw above, we can use Numpy's np.cov() function to calculate it.

We can calculate two possible correlation matrices from the matrix X: the correlation between rows or between columns. In our case, each row of the matrix X is an image, so the rows correspond to observations and the columns correspond to features (image pixels). We want to compute the correlations between pixels because the goal of whitening is to remove these correlations so that the algorithm can focus on higher-order relationships.

To do this, set Numpy's parameter rowvar=False: it will use the columns as variables (or features) and the rows as observations.

cov = np.cov(X_norm, rowvar=False)

The shape of the covariance matrix should be 3072 x 3072 to represent the correlation between each pair of pixels (and there are 3072 pixels); you can check it with cov.shape.

We will now compute the singular values and singular vectors of the covariance matrix and use them to rotate our dataset.

U,S,V = np.linalg.svd(cov)
print(np.diag(S))
print('\nshape:', np.diag(S).shape)

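The whitened dataset is computed as $\boldsymbol{X}_{ZCA} = \boldsymbol{U} . diag(\frac{1}{\sqrt{diag(\boldsymbol{S}) + \epsilon}}) . \boldsymbol{U^{\text{T}}} . \boldsymbol{X}$ , where $\boldsymbol{U}$ contains the singular vectors and $\boldsymbol{S}$ the singular values of the covariance matrix, $\boldsymbol{X}$ is the zero-centered dataset, and $\epsilon$ is a constant added to the singular values. Let's check the shapes. $diag(\frac{1}{\sqrt{diag(\boldsymbol{S}) + \epsilon}})$ has the shape (3072, 3072), as do $\boldsymbol{U}$ and $\boldsymbol{U^{\text{T}}}$ . We also saw that X has the shape (1000, 3072), so we need to transpose it to (3072, 1000). The shape of $\boldsymbol{X}_{ZCA}$ is therefore

$(3072, 3072) . (3072, 3072) . (3072, 3072) . (1000, 3072)^{\text{T}} = (3072, 3072) . (3072, 1000) = (3072, 1000)$ which corresponds to the shape of the initial dataset after transposition. The code then transposes the result back to (1000, 3072).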

epsilon = 0.1
X_ZCA = U.dot(np.diag(1.0/np.sqrt(S + epsilon))).dot(U.T).dot(X_norm.T).T
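To see what this transformation does without loading the images, here is a sketch on a small synthetic dataset. The data, the mixing matrix, and the much smaller epsilon are our assumptions: with a tiny epsilon and a well-conditioned covariance matrix, the covariance of the transformed data should be very close to the identity matrix.

```python
import numpy as np

rng = np.random.RandomState(0)

# Hypothetical stand-in for the image matrix: 500 observations, 5 correlated features
mixing = np.eye(5) + 0.5 * np.ones((5, 5))
X_toy = rng.randn(500, 5).dot(mixing)
X_toy = X_toy - X_toy.mean(axis=0)  # per-feature mean subtraction

cov = np.cov(X_toy, rowvar=False)
U, S, V = np.linalg.svd(cov)

# Much smaller than the 0.1 used for the images (assumption for this check)
epsilon = 1e-8
W = U.dot(np.diag(1.0 / np.sqrt(S + epsilon))).dot(U.T)

# Same pattern as above: transpose, transform, transpose back
X_zca = W.dot(X_toy.T).T

cov_zca = np.cov(X_zca, rowvar=False)
print(X_zca.shape)
print(np.round(cov_zca, 3))
```

With a larger epsilon such as 0.1, the directions with small singular values are amplified less, so the resulting covariance matrix is only approximately the identity.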

Let's rescale the image:

X_ZCA_rescaled = (X_ZCA - X_ZCA.min()) / (X_ZCA.max() - X_ZCA.min())
print('min:', X_ZCA_rescaled.min())
print('max:', X_ZCA_rescaled.max())

Finally, we can see the effect of whitening by comparing an image before and after whitening:

The result looks like the images from the Pal & Sudeep (2016) paper. They use epsilon = 0.1. We can try other values to see the effect on the image.

Machine Learning Notes - Preprocessing and Image Whitening for Deep Learning
