[Ali Tianchi Algorithm Learning Competition] Test Your Love-at-First-Sight Index (machine learning / deep learning / data processing / Python basics)


Address: https://tianchi.aliyun.com/competition/entrance/531825/introduction?spm=5176.12281973.0.0.4c883b74SwHDoH
Personality tests, career predictions, love indexes... you have probably come across these little quiz games on web pages or WeChat official accounts. You answer a few simple questions, and a lengthy analysis is thrown in your face that sounds plausible enough. In the information age, using data analysis to power such personality tests has long been common.

Background

"During 2002-2004, when preparing the thesis, Professor Ray Fisman and Professor Sheena Iyengar invited volunteers to participate in the lightning speed dating experiment (blind date wheel battle, fast communication with a blind date every 4 minutes, and then change to another blind date ), provide some relevant personal information to the blind date object, and ask the blind date object whether they are willing to meet again in the near future. The analysis data of this learning competition records the relevant information of the volunteers and the blind date in the love at first sight blind date experiment. result."

Contestants can analyze the interactions between different fields of the dataset and train a machine learning model to predict how one or more characteristics of a participant affect whether the date succeeds. In other words, the task is to use the other feature columns to predict the "match" field of the dataset, where 1 = success and 0 = failure.

Problem-solving approach:

Drop columns with a high missing rate, fill columns with a low missing rate, apply linear + nonlinear dimensionality reduction, design the model, train, and predict.

1. Data preprocessing

import pandas as pd

def missing(data, threshold, dropna=False):  # the third argument controls whether rows with missing values are also dropped
    # fraction of missing values in each column
    percent_missing = data.isnull().sum() / len(data)
    missing = pd.DataFrame({'column_name': data.columns,
                            'percent_missing': percent_missing})
    # columns whose missing rate exceeds the threshold
    out = missing[missing['percent_missing'] > threshold]
    # drop those columns; if dropna is True, also drop rows that still contain missing values
    data_new = data.drop(out['column_name'], axis=1).dropna() if dropna else data.drop(out['column_name'], axis=1)
    return data_new

I processed the data in three rounds. In the first round, columns with a missing rate above 50% were dropped. In the second round, the remaining columns (missing rate below 50%) were filled with their column means. In the third round, a final missing-value check was run, and any rows that still could not be filled were dropped entirely.

data = pd.read_csv('./data/speed_dating_train.csv')
data = missing(data, 0.5)
data.fillna(data.mean(), inplace=True)
data = missing(data, 0, True)
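
As a quick sanity check after the three rounds (my own addition, not part of the original pipeline), you can confirm that nothing is still missing and see how many columns survive:

# verify that the cleaned training data contains no remaining NaNs
assert data.isnull().sum().sum() == 0
print(f'{data.shape[0]} rows, {data.shape[1]} columns remain after cleaning')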

2. Data dimensionality reduction

The original data has more than 180 features, and about 130 remain after preprocessing, but many of them have nothing to do with whether the match succeeds, such as the participant's ID, while some features, such as income, correlate strongly with a successful match.
Since there are plenty of samples, and love is an elusive thing anyway, I used the Pearson correlation coefficient and kept only the features whose absolute correlation with the label is greater than 0.15.

corr = data.corr('pearson')
# Boolean matrix: True where the absolute correlation between two variables exceeds 0.15,
# which picks out the pairs of variables that are relatively strongly correlated.
bool_matrix = (corr.abs() > 0.15)
# Keep only the columns whose correlation with 'match' exceeds 0.15 in absolute value,
# reducing the dimensionality to the variables most related to the target.
data = data.loc[:, bool_matrix.loc['match']]
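
To see which columns survive this filter and how strongly each correlates with the label, a small inspection snippet (not in the original post) can be added:

# correlation of each selected feature with 'match', sorted by strength
selected = corr.loc['match'][bool_matrix.loc['match']].drop('match')
print(selected.abs().sort_values(ascending=False))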

I originally planned to use PCA for linear dimensionality reduction, but after printing the shape I found that only 16 features were left, so I decided it was fine to skip it.

print(data.shape)
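
For reference, had the feature count still been large, PCA could have been applied with scikit-learn roughly as follows (a minimal sketch of the step I skipped; the n_components value is just an example):

from sklearn.decomposition import PCA

features = data.drop('match', axis=1)
pca = PCA(n_components=10)  # example value; the real choice would depend on the explained variance
reduced = pca.fit_transform(features)
print(pca.explained_variance_ratio_.sum())  # fraction of variance retained by the components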

3. Design model & training

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import TensorDataset, DataLoader
import pandas as pd

Training set / validation set

The distribution of successful and failed matches is imbalanced: the ratio of positive to negative samples is roughly 1:5. To balance the training set I therefore oversampled the positive class by simply copying it, as shown in the code below; an alternative using a class weight in the loss is sketched right after it.

device = torch.device("cuda:0" if torch.cuda.is_available() else 'cpu')

# the first 6500 samples form the training set; the remaining 1700+ samples form the validation set
train_data = data[0:6500]
test_data = data[6500:]
# make 4 extra copies of the positive training samples
positive_samples = train_data[train_data['match'] == 1].copy()
positive_samples = pd.concat([positive_samples] * 4)
train_data = pd.concat([train_data, positive_samples])
# shuffle the training set so the sample order is random
train_data = train_data.sample(frac=1).reset_index(drop=True)
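
As an alternative to duplicating rows (not what this post does), the class imbalance could also be handled with a positive-class weight in the loss, e.g. with BCEWithLogitsLoss; a rough sketch, assuming the final Sigmoid layer is removed from the model:

# weight positive samples ~5x in the loss to offset the roughly 1:5 class ratio
pos_weight = torch.tensor([5.0], device=device)
criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)
# note: with BCEWithLogitsLoss the model should output raw logits,
# i.e. drop the final nn.Sigmoid() and apply torch.sigmoid() only when predicting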

Convert to tensors

train_X = train_data.drop('match', axis=1)
train_y = train_data['match']

test_X = test_data.drop('match', axis=1)
test_y = test_data['match']

test_X = torch.from_numpy(test_X.to_numpy()).float().to(device)
test_y = torch.from_numpy(test_y.to_numpy()).float().to(device)

train_X = torch.from_numpy(train_X.to_numpy()).float().to(device)
train_y = torch.from_numpy(train_y.to_numpy()).float().to(device)
# wrap the tensors into a DataLoader for mini-batch stochastic gradient descent
dataset = TensorDataset(train_X, train_y)
dataloader = DataLoader(dataset, batch_size=30, shuffle=True)

Model Design/Training

# a small feed-forward network with a sigmoid output (binary logistic classifier)
model = nn.Sequential(
    nn.Linear(16, 64),
    nn.ReLU(),
    nn.Linear(64, 1),
    nn.Sigmoid()
).to(device)
# optimizer: stochastic gradient descent
optimizer = optim.SGD(model.parameters(), lr=0.01)
# loss function: binary cross-entropy
criterion = nn.BCELoss()
# initialize the model weights
def init_weights(m):
    if type(m) == nn.Linear:
        nn.init.normal_(m.weight, std=0.01)

model.apply(init_weights)
# train the model
epochs = 60
for epoch in range(epochs):
    for X, y in dataloader:
        # forward pass to get predictions
        y_pred = model(X)
        # compute the loss
        loss = criterion(y_pred.squeeze(), y)
        # backward pass to compute gradients
        loss.backward()
        # update the parameters
        optimizer.step()
        # reset the gradients
        optimizer.zero_grad()
    # print the loss every 10 epochs
    if (epoch + 1) % 10 == 0:
        print(f'Epoch {epoch + 1}, Loss: {loss.item():.4f}')
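
To keep an eye on generalization during training (not part of the original code), a validation check could also be run each epoch, along these lines (sketch; it would sit inside the epoch loop and reuses criterion, test_X, and test_y from above):

model.eval()
with torch.no_grad():
    val_loss = criterion(model(test_X).squeeze(), test_y)
print(f'Epoch {epoch + 1}, validation loss: {val_loss.item():.4f}')
model.train()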

4. Test model

# accuracy helper
def accuracy(y_pred, y_true):
    y_pred = y_pred.squeeze()
    y_true = y_true.squeeze()
    correct = torch.eq(y_pred, y_true).float()
    acc = correct.sum() / len(correct)
    return acc

with torch.no_grad():  # no gradients needed for evaluation
    y_pred = model(test_X)
    y_pred = y_pred.round()  # round predictions to 0 or 1
    acc = accuracy(y_pred, test_y)
    print(f'Accuracy: {acc:.4f}')
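
Because the classes are imbalanced, accuracy alone can be misleading; a quick way to double-check (my own addition) is to also look at precision, recall, and ROC AUC, e.g. with scikit-learn:

from sklearn.metrics import precision_score, recall_score, roc_auc_score

y_true = test_y.cpu().numpy()
y_hat = y_pred.squeeze().cpu().numpy()  # rounded 0/1 predictions from above
with torch.no_grad():
    y_prob = model(test_X).squeeze().cpu().numpy()  # raw predicted probabilities

print('precision:', precision_score(y_true, y_hat))
print('recall:   ', recall_score(y_true, y_hat))
print('ROC AUC:  ', roc_auc_score(y_true, y_prob))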

Result:

Epoch 20, Loss: 0.0105
Epoch 40, Loss: 0.0041
Epoch 60, Loss: 0.0017
Accuracy: 1.0000

5. Save data & submit

# preprocess data2 with the same steps as the training data
data2 = pd.read_csv('./data/speed_dating_test.csv')
data2 = missing(data2, 0.5)
data2.fillna(data2.mean(), inplace=True)
data2 = missing(data2, 0, True)
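# (added note, not from the original post) the model expects exactly the 16 feature
# columns that were kept during training; assuming `data` is still the 17-column
# DataFrame from the training pipeline, the test features could be aligned like this:
feature_cols = [c for c in data.columns if c != 'match']
data2 = data2.reindex(columns=feature_cols)
data2 = data2.fillna(data2.mean())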
data2 = torch.from_numpy(data2.to_numpy()).float().to(device)
# data2 has no 'match' column, so there is nothing to split off
predict_test = model(data2)
predict_test = predict_test.round().cpu()
predict_list = predict_test.squeeze().tolist()  # convert the tensor to a Python list
predict_int = [int(x) for x in predict_list]  # cast each element to int

# the submission must follow the official format, so load the sample file and replace its 'match' column
save = pd.read_csv('data/sample_submission.csv')
save['match'] = predict_int
# save the result
save.to_csv('data/predict_submission.csv', index=False)  # index=False: do not write the row index

Submit the file to get your score and ranking.


Origin: blog.csdn.net/Dec1steee/article/details/130160929