KTV song recommendation - PCA dimensionality reduction + logistic regression - gender prediction and handling over-fitting

In a previous post I used logistic regression to predict a user's gender; because the matrix is fairly sparse, training is slow. So I consider dimensionality reduction. There are many dimensionality-reduction approaches; here I only look at PCA and SVD.

Principles of PCA and SVD

Interested readers can look into the details themselves: https://medium.com/@jonathan_hui/machine-learning-singular-value-decomposition-svd-principal-component-analysis-pca-1d45e885e491

Briefly:

  • PCA maps high-dimensional data into a lower-dimensional coordinate system, which makes sparse data denser
  • SVD can be thought of as PCA for non-square matrices
  • In practice there is not much difference between using SVD and PCA (a small sketch comparing the two follows this list)
  • If the number of features is larger than the number of records, you may not get good results; you can look up the specific reason yourself.
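As a small illustration of the middle two points (this is my own sketch, not code from the post; the random sparse matrix merely stands in for the user-song matrix): scikit-learn's TruncatedSVD works directly on a sparse, non-square matrix, while PCA centers the data and therefore needs it dense, yet the resulting projections are used the same way.

import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.decomposition import PCA, TruncatedSVD

# Stand-in for the sparse user-song matrix (rows = users, columns = songs)
X = sparse_random(2000, 1900, density=0.01, random_state=0)

# TruncatedSVD handles the sparse matrix directly (no centering)
svd = TruncatedSVD(n_components=100, random_state=0)
X_svd = svd.fit_transform(X)

# PCA centers the data, so the matrix must be made dense first
pca = PCA(n_components=100, random_state=0)
X_pca = pca.fit_transform(X.toarray())

print(X_svd.shape, X_pca.shape)  # both (2000, 100)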

Code

Data acquisition and processing

This has been covered in previous articles, so it is skipped here. The original data matrix has shape 2000 * 1900.
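Since the preprocessing is skipped, the sketch below only illustrates what song_hot_matrix (users x songs, one-hot) and decades_hot_matrix (users x gender, one-hot) might look like; the plays DataFrame and its column names are my own placeholders, and the real preprocessing also builds the user_decades_encoder and song_label_encoder objects used further down.

import pandas as pd

# Assumed play records: one row per (user, song), with the user's gender
plays = pd.DataFrame({
    'user_id': ['u1', 'u1', 'u2', 'u3'],
    'song':    ['songA', 'songB', 'songA', 'songC'],
    'gender':  ['F', 'F', 'M', 'F'],
})

# users x songs one-hot matrix: 1 if the user ever sang the song
song_hot = pd.crosstab(plays['user_id'], plays['song']).clip(upper=1)
song_hot_matrix = song_hot.values        # real shape in this post: 2000 * 1900

# users x gender one-hot labels, aligned to the same user order
gender = plays.drop_duplicates('user_id').set_index('user_id')['gender']
decades_hot_matrix = pd.get_dummies(gender.loc[song_hot.index]).values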

PCA and matrix transformation

View optimal number of dimensions

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

pca = PCA().fit(song_hot_matrix)
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance');

As can be seen from the figure, about 1500 dimensions are needed to reach 90%+ cumulative explained variance.
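To read off the exact number of components for a given variance threshold instead of eyeballing the plot, something like the following works (a small sketch reusing the pca fitted above):

cum_var = np.cumsum(pca.explained_variance_ratio_)
n_90 = np.argmax(cum_var >= 0.90) + 1   # first component count reaching 90%
n_99 = np.argmax(cum_var >= 0.99) + 1   # first component count reaching 99%
print("90%% variance: %d components, 99%% variance: %d components" % (n_90, n_99))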

Keep 99% of the explained variance

pca = PCA(n_components=0.99, whiten=True)
song_hot_matrix_pca = pca.fit_transform(song_hot_matrix)

The compressed feature matrix is 2000 * 1565, which is not much compression.
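Any later prediction has to push new user rows through the same fitted transform; a tiny sketch (new_song_hot_rows is an assumed array with the same 1900 song columns as the training matrix):

print(pca.n_components_)                         # 1565 components kept for 99% variance
new_features = pca.transform(new_song_hot_rows)  # shape (k, 1565), same space as song_hot_matrix_pca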

Model training

import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"   # see issue #152
os.environ["CUDA_VISIBLE_DEVICES"] www.tianjiptzc.cn= "" import numpy as np from keras.models import Sequential from keras.layers import Dense, Activation, Embedding,Flatten,Dropout import matplotlib.pyplot as plt from keras.utils import np_utils from sklearn import datasets from sklearn.model_selection import train_test_split n_class=user_decades_encoder.get_class_count(www.shentuylgw.cn) song_count=song_label_encoder.get_class_count(shentuylzc.cn ) print(n_class) print(song_count) train_X,test_X, train_y, test_y = train_test_split(song_hot_matrix_pca, decades_hot_matrix, test_size = 0.2, random_state = 0) train_count = np.shape(train_X)[0] # 构建神经网络模型 model = Sequential(www.shicaiyulezc.cn) model.add(Dense(input_dim=song_hot_matrix_pca.shape[1], units=n_class)) model.add(Activation('softmax')www.lecaixuangj.cn) # 选定loss函数和优化器 model.compile(loss='categorical_crossentropy', optimizer='sgd', metrics=['accuracy']) # 训练过程 print('Training -----------') for step in range(train_count): scores = model.train_on_batch(train_X, train_y) if step % 50 == 0: print("训练样本 %d 个, 损失: %f, 准确率: %f" % (step, scores[0], scores[1]*100)) print('finish!') 

Training results:

Trained 4750 samples, loss: 0.371499, accuracy: 83.207470
Trained 4800 samples, loss: 0.381518, accuracy: 82.193959
Trained 4850 samples, loss: 0.364363, accuracy: 83.763909
Trained 4900 samples, loss: 0.378466, accuracy: 82.551670
Trained 4950 samples, loss: 0.391976, accuracy: 81.756759
Trained 5000 samples, loss: 0.378810, accuracy: 83.505565

Test set validation:

# Accuracy evaluation
from sklearn.metrics import classification_report
scores = model.evaluate(test_X, test_y, verbose=0)
print("%s: %.2f%%" % (model.metrics_names[1], scores[1]*100)) Y_test = np.argmax(test_y, axis=1) y_pred = model.predict_classes(song_hot_matrix_pca.transform(test_X)) print(classification_report(Y_test, y_pred)) 

accuracy: 50.20%

Clearly the model has over-fit.
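One way to catch this earlier, rather than only at evaluation time, is to hold out a validation split during training and watch the gap between the training and validation curves. This is not from the original post, just a sketch using Keras fit with validation_split in place of the train_on_batch loop above:

history = model.fit(train_X, train_y, epochs=20, batch_size=128,
                    validation_split=0.1, verbose=2)

# Older Keras versions use 'acc'/'val_acc' instead of 'accuracy'/'val_accuracy'
acc_key = 'accuracy' if 'accuracy' in history.history else 'acc'
plt.plot(history.history[acc_key], label='train')
plt.plot(history.history['val_' + acc_key], label='validation')
plt.xlabel('epoch')
plt.ylabel('accuracy')
plt.legend()
# A widening gap between the two curves is the over-fitting signal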

Handling over-fitting

Here over-fitting is handled by adding Dropout, which randomly drops features during training. The code:

# Build the neural network model, this time with Dropout on the input features
model = Sequential()
model.add(Dropout(0.5, input_shape=(song_hot_matrix_pca.shape[1],)))
model.add(Dense(units=n_class))
model.add(Activation('softmax'))
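The snippet above only rebuilds the model; compiling, training and evaluating work the same way as before. A minimal sketch of those remaining steps, reusing the data split from earlier (the epoch count and batch size are my own assumptions, not values from the original run):

model.compile(loss='categorical_crossentropy', optimizer='sgd', metrics=['accuracy'])
model.fit(train_X, train_y, epochs=20, batch_size=128, verbose=2)

# Evaluate the Dropout model on the held-out test set
scores = model.evaluate(test_X, test_y, verbose=0)
print("test %s: %.2f%%" % (model.metrics_names[1], scores[1] * 100))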

Accuracy reaches 70%.

well done

In fact, the SVD approach is similar to PCA, so it is not demonstrated here. In my tests on this data set, PCA does speed up training, but it drops so many features that the data easily leads to over-fitting; adding Dropout improves the over-fitting. In the next post I will share dimensionality reduction with an autoencoder.


Origin www.cnblogs.com/laobeipai/p/12436445.html