KTV song recommendation - PCA dimensionality reduction + logistic regression - predicting gender and handling overfitting

Foreword

The previous article used logistic regression to predict user gender, but the sparse input matrix slowed down training. So this time we consider dimensionality reduction. There are many dimensionality reduction schemes; here only PCA and SVD are considered.

Principles of PCA and SVD

If you are interested, you can study the details yourself: https://medium.com/@jonathan_hui/machine-learning-singular-value-decomposition-svd-principal-component-analysis-pca-1d45e885e491

In brief:

  • PCA maps high-dimensional data into a lower-dimensional coordinate system, so that the projected data is as spread out (high variance) as possible
  • SVD plays the same role for non-square matrices; in practice PCA is usually computed via SVD (see the sketch after this list)
  • There is not much difference between SVD and PCA in actual use
  • If the number of features is larger than the number of records, neither works particularly well; you can look up the specific reasons yourself
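
To make the PCA/SVD relationship concrete, here is a small illustration (my own sketch on random data, not part of the original experiment) that computes the projection with plain numpy SVD and checks it against sklearn's PCA:

import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 20)          # 100 records, 20 features
X_centered = X - X.mean(axis=0)      # PCA works on centered data

# SVD of the centered matrix: X_centered = U * S * Vt
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

k = 5
X_proj_svd = X_centered @ Vt[:k].T   # project onto the top-k right singular vectors

# sklearn's PCA gives the same projection up to the sign of each component
X_proj_pca = PCA(n_components=k).fit_transform(X)
print(np.allclose(np.abs(X_proj_svd), np.abs(X_proj_pca)))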

Code

Data acquisition and processing

Data loading and processing have been covered in previous articles, so they are omitted here. The shape of the original matrix is 2000*1900.
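
For completeness, a matrix of this shape could be built from a play log roughly like the sketch below (a hypothetical example; the column names user_id and song_id are assumptions, not the original code):

import pandas as pd

# hypothetical play log: one row per (user, song) play event
plays = pd.DataFrame({
    'user_id': ['u1', 'u1', 'u2'],
    'song_id': ['s1', 's2', 's1'],
})

# one row per user, one column per song, 1 if the user has sung that song
song_hot_matrix = pd.crosstab(plays['user_id'], plays['song_id']).clip(upper=1)
print(song_hot_matrix.shape)   # (number of users, number of songs), e.g. 2000*1900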

PCA and matrix transformation

Checking the optimal number of components

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
pca = PCA().fit(song_hot_matrix)
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance');

It can be seen from the figure that about 1500 components already explain more than 90% of the variance.
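
The same number can be read off programmatically instead of from the figure (using the pca object fitted above):

cum_var = np.cumsum(pca.explained_variance_ratio_)
# first component count at which cumulative explained variance reaches 90%
n_components_90 = np.argmax(cum_var >= 0.90) + 1
print(n_components_90)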

Keep 99% of the explained variance

pca = PCA(n_components=0.99, whiten=True)  # keep enough components for 99% explained variance; whiten rescales them to unit variance
song_hot_matrix_pca = pca.fit_transform(song_hot_matrix)

The compressed matrix is 2000*1565, so not much compression is gained.
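
The number of retained components and the new shape can be checked directly (assuming the pca object and matrices above):

print(pca.n_components_)           # components kept to reach 99% explained variance, here 1565
print(song_hot_matrix.shape)       # original: (2000, 1900)
print(song_hot_matrix_pca.shape)   # compressed: (2000, 1565)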

Model training

import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"   # see issue #152
os.environ["CUDA_VISIBLE_DEVICES"] = ""

import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Activation, Embedding,Flatten,Dropout
import matplotlib.pyplot as plt
from keras.utils import np_utils
from sklearn import datasets
from sklearn.model_selection import train_test_split

n_class=user_decades_encoder.get_class_count()
song_count=song_label_encoder.get_class_count()
print(n_class)
print(song_count)

train_X,test_X, train_y, test_y = train_test_split(song_hot_matrix_pca,
                                                   decades_hot_matrix,
                                                   test_size = 0.2,
                                                   random_state = 0)
train_count = np.shape(train_X)[0]
# build the neural network model
model = Sequential()
model.add(Dense(input_dim=song_hot_matrix_pca.shape[1], units=n_class))
model.add(Activation('softmax'))

# choose the loss function and optimizer
model.compile(loss='categorical_crossentropy', optimizer='sgd', metrics=['accuracy'])

# training loop: repeatedly train on the full training set as one batch
print('Training -----------')
for step in range(train_count):
    scores = model.train_on_batch(train_X, train_y)
    if step % 50 == 0:
        print("Trained on %d samples, loss: %f, accuracy: %f" % (step, scores[0], scores[1]*100))
print('finish!')

Training result:

Trained on 4750 samples, loss: 0.371499, accuracy: 83.207470
Trained on 4800 samples, loss: 0.381518, accuracy: 82.193959
Trained on 4850 samples, loss: 0.364363, accuracy: 83.763909
Trained on 4900 samples, loss: 0.378466, accuracy: 82.551670
Trained on 4950 samples, loss: 0.391976, accuracy: 81.756759
Trained on 5000 samples, loss: 0.378810, accuracy: 83.505565

Test set validation:

# accuracy evaluation
from sklearn.metrics import classification_report
scores = model.evaluate(test_X, test_y, verbose=0)
print("%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))


Y_test = np.argmax(test_y, axis=1)
y_pred = model.predict_classes(test_X)   # test_X is already PCA-transformed
print(classification_report(Y_test, y_pred))

accuracy: 50.20%

The model is clearly overfitting.
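
The gap becomes obvious when training and test accuracy are evaluated side by side (a quick check with the model and splits above):

train_scores = model.evaluate(train_X, train_y, verbose=0)
test_scores = model.evaluate(test_X, test_y, verbose=0)
print("train accuracy: %.2f%%" % (train_scores[1] * 100))
print("test accuracy:  %.2f%%" % (test_scores[1] * 100))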

Dealing with overfitting: adding Dropout

Here overfitting is handled by adding Dropout, which randomly drops part of the inputs during training. The code:

# build the neural network model, with Dropout applied to the input
model = Sequential()
model.add(Dropout(0.5, input_shape=(song_hot_matrix_pca.shape[1],)))  # randomly drop 50% of the inputs during training
model.add(Dense(units=n_class))
model.add(Activation('softmax'))

accuracy: 70%

Dealing with overfitting: L1/L2 regularization

Here an L2 penalty is added to the layer weights:

from keras import regularizers

# build the neural network model with an L2 penalty on the weights
model = Sequential()
model.add(Dense(input_dim=song_hot_matrix_pca.shape[1], units=n_class, kernel_regularizer=regularizers.l2(0.01)))
model.add(Activation('softmax'))

accuracy: 62%

Summary

In practice SVD is used in much the same way as PCA, so it is not demonstrated in detail here (see the sketch below). In my tests on this data set, PCA did speed up training, but it discarded too many features, which made the model prone to overfitting; adding Dropout or a regularization term improved the situation. The next article will cover dimensionality reduction with an autoencoder.
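
For reference, the SVD variant mentioned above would look almost the same with sklearn's TruncatedSVD (a sketch under the same assumptions, not run against this data set):

from sklearn.decomposition import TruncatedSVD

# TruncatedSVD works directly on sparse matrices and does not center the data
svd = TruncatedSVD(n_components=1500, random_state=0)
song_hot_matrix_svd = svd.fit_transform(song_hot_matrix)
print(svd.explained_variance_ratio_.sum())   # fraction of variance kept by 1500 components
print(song_hot_matrix_svd.shape)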
