[CSDN Creation Topics] What are the competitions

Foreword/Background

I have never won a prize in any of the mathematical modeling competitions I have entered. I plan to enter many more and keep accumulating experience. Whatever my current level, I am the kind of person who loves to take part and enjoys it. What I care about most is what I learn along the way, although winning an award is of course a motivation. I believe that one day I will earn an award that recognizes my own work.

Introduction to the competition

 I have participated in this competition.

Entry process

First, you need a capable team: one person who can program, one who can present, and one who can handle the mathematical algorithms. Ideally each member also knows a little about the other two roles; that is the kind of team I would feel comfortable in. Second, all three must be enthusiastic and free of negativity, and the team must be cohesive, so that nobody's complaints dampen the others' enthusiasm during the final sprint, when problems come up, or when staying up late. Finally, and this is very important, prepare before the competition. What should you prepare? For example, code for the algorithms commonly used in the competition, and familiarity with the tools for writing and typesetting the paper.

Competition experience

When I competed, I had no real team. The competition required at least two participants, so I recruited one person just to make up the numbers. As a result, after finishing the programming I had little time left to write the paper. In the end many points were left unfinished, and I did not even get to the typesetting. The problem itself was solved, but the submission fell short.

Experience

Teams are important! Teams are important! Teams are important!

Data sharing

The data analysis part is shared below.

First, load the data

(omitted)

Merge the feature tables and the label table

import pandas as pd

# Merge the data (pd.merge joins on the columns the two tables share)
df_1 = pd.merge(bhv_train,cust_train)
train = pd.merge(df_1,train_label)
test = pd.merge(bhv_test,cust_test)
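
The merges above rely on pandas picking the join key from the shared column names. As a minimal sketch on toy data, assuming a single key column (the name `id` and the toy tables are hypothetical, not taken from the competition data), the key and the expected key uniqueness can be stated explicitly:

```python
import pandas as pd

# Toy stand-ins for the competition tables; "id" is a hypothetical key column.
bhv = pd.DataFrame({"id": [1, 2, 3], "B1": [0.5, 1.2, 3.3]})
cust = pd.DataFrame({"id": [1, 2, 4], "A1": [10, 20, 40]})
label = pd.DataFrame({"id": [1, 2], "label": [0, 1]})

# Name the key explicitly and let pandas check that it is unique on both sides.
features = pd.merge(bhv, cust, on="id", how="inner", validate="one_to_one")
train_demo = pd.merge(features, label, on="id", how="inner", validate="one_to_one")

# An inner join silently drops ids that are missing from either table (3 and 4 here).
print(train_demo)
```

Comparing row counts before and after the merge is a quick way to notice samples lost to the inner join.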

View dimensions

# Number of samples and feature dimensions
train.shape  # (7206, 34)
test.shape   # (1655, 34)

View feature names

# View the feature names
train.columns
test.columns

Because the competition provides desensitized data, we don't know exactly what these features mean.

# View basic information about the dataset
train.info()

train.head().T

# View descriptive statistics of the data
train.describe().T

The steps above give us a better understanding of the data.

Next, analyze the data types.

# Numeric features
numerical_feature = list(train.select_dtypes(exclude=['object']).columns)
numerical_feature
len(numerical_feature)  # 34
# Continuous variables
serial_feature = []
# Discrete variables
discrete_feature = []
# Single-valued variables
unique_feature = []

for fea in numerical_feature:
    temp = train[fea].nunique()  # number of distinct values
    if temp == 1:
        unique_feature.append(fea)
    # Custom rule: a variable with at most 10 distinct values is treated as discrete
    elif temp <= 10:
        discrete_feature.append(fea)
    else:
        serial_feature.append(fea)
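
The same split can be wrapped in a small reusable function. This is only a sketch: the function name `split_by_cardinality`, the `max_discrete` parameter, and the toy DataFrame are assumptions for illustration. Unlike the loop above, it also puts all-NaN columns (zero distinct values) in the constant group.

```python
import pandas as pd

def split_by_cardinality(df, max_discrete=10):
    """Group non-object columns by their number of distinct values (NaN is not counted)."""
    groups = {"constant": [], "discrete": [], "continuous": []}
    for col in df.select_dtypes(exclude=["object"]).columns:
        n = df[col].nunique()
        if n <= 1:
            groups["constant"].append(col)
        elif n <= max_discrete:
            groups["discrete"].append(col)
        else:
            groups["continuous"].append(col)
    return groups

# Toy data: column names are made up for the example.
demo = pd.DataFrame({
    "A1": range(20),      # 20 distinct values
    "flag": [0, 1] * 10,  # 2 distinct values
    "const": [7] * 20,    # 1 distinct value
})
groups = split_by_cardinality(demo)
print(groups)
```

The threshold of 10 is a heuristic; an integer-coded feature with many categories would still be classed as continuous.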
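import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Visualize the distribution of each continuous numeric feature
# (sns.distplot is deprecated; sns.histplot with kde=True is a close replacement)
f = pd.melt(train, value_vars=serial_feature)
g = sns.FacetGrid(f, col="variable",  col_wrap=3, sharex=False, sharey=False)
g = g.map(sns.histplot, "value", kde=True, stat="density")

plt.figure(1 , figsize = (8 , 5))
sns.histplot(train.A3, bins=40, kde=True, stat="density")
plt.xlabel('A3')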

discrete_feature

['label']

import seaborn as sns
import matplotlib.pyplot as plt
df_ = train[discrete_feature]  # discrete variables
sns.set_style("whitegrid")  # use the whitegrid theme
plt.figure(figsize=(8,10))
for i, item in enumerate(df_):
    plt.subplot(4,2,(i+1))  # 4x2 grid, one panel per discrete variable
    # Recent seaborn versions require x to be passed by keyword
    ax=sns.countplot(x=item,data = df_,palette="Pastel1")
    plt.xlabel(str(item),fontsize=14)
    plt.ylabel('Count',fontsize=14)
    plt.xticks(fontsize=13)
    plt.yticks(fontsize=13)
    #plt.title("Churn by "+ str(item))
plt.tight_layout()
plt.show()

label=train.label           
label.value_counts()/len(label)
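
Dividing the counts by the length gives the share of each class. As a small sketch on made-up labels, `value_counts(normalize=True)` returns the same proportions directly:

```python
import pandas as pd

# Toy labels: three negatives and one positive.
label_demo = pd.Series([0, 0, 0, 1], name="label")

# Proportion of each class, equivalent to value_counts() / len(label_demo).
share = label_demo.value_counts(normalize=True)
print(share)
```

A strongly skewed split is worth knowing before choosing an evaluation metric.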

# Compare each feature's distribution for positive and negative samples
train_positive = train[train['label'] == 1]
train_negative = train[train['label'] != 1]
f, ax = plt.subplots(len(numerical_feature),2,figsize = (10,80))
for i,col in enumerate(numerical_feature):
    # sns.distplot is deprecated; sns.histplot with kde=True is a close replacement
    sns.histplot(train_positive[col],ax = ax[i,0],color = "blue",kde=True,stat="density")
    ax[i,0].set_title("positive")
    sns.histplot(train_negative[col],ax = ax[i,1],color = 'red',kde=True,stat="density")
    ax[i,1].set_title("negative")
plt.subplots_adjust(hspace = 1)

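Viewing missing values

# Drop the label
# X_missing = train.drop(['label'],axis=1)
X_missing = test
# Count missing values per feature
missing = X_missing.isna().sum()
missing = pd.DataFrame(data={'feature': missing.index,'missing_count':missing.values})
# Use ~ to negate the mask and keep only the rows whose missing count is not 0
missing = missing[~missing['missing_count'].isin([0])].copy()
# Missing ratio
missing['missing_ratio'] =  missing['missing_count']/X_missing.shape[0]
missing.to_csv("2455.csv")
# Visualization (note: this plots the training set, while the table above covers the test set)
s=(train.isnull().sum()/len(train)).plot.bar(figsize = (20,6),color=['#d6ecf0','#a3d900','#88ada6','#ffb3a7','#cca4e3','#a1afc9'])

# As the plot shows, every feature has less than 10% missing values, so all features are kept.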
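
The same summary can be written as a function and applied to both the training and test sets. A minimal sketch, assuming nothing about the real data (the function name `missing_summary`, the column names, and the toy DataFrame are made up for illustration):

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Return the count and ratio of missing values for columns that have any."""
    count = df.isna().sum()
    summary = pd.DataFrame({"missing_count": count, "missing_ratio": count / len(df)})
    summary = summary[summary["missing_count"] > 0]
    return summary.sort_values("missing_ratio", ascending=False)

# Toy data with missing values in two of three columns.
demo = pd.DataFrame({
    "A1": [1.0, np.nan, 3.0, np.nan],
    "A2": [1.0, 2.0, 3.0, 4.0],
    "A3": [np.nan, 2.0, 3.0, 4.0],
})
summary = missing_summary(demo)
print(summary)
```

Calling it once per dataset makes it easy to check that the two sets have similar missing-value patterns.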

Outlier Handling


# Numeric features
numerical_feature = list(train.select_dtypes(exclude=['object']).columns)
def find_outliers_by_3segama(data,fea):
    # Flag values more than 3 standard deviations from the mean of column fea
    data_std = np.std(data[fea])
    data_mean = np.mean(data[fea])
    outliers_cut_off = data_std * 3
    lower_rule = data_mean - outliers_cut_off
    upper_rule = data_mean + outliers_cut_off
    data[fea+'_outliers'] = data[fea].apply(lambda x: 'outlier' if x > upper_rule or x < lower_rule else 'normal')
    return data
data_train = train.copy()
for fea in numerical_feature:
    data_train = find_outliers_by_3segama(data_train,fea)
    print(data_train[fea+'_outliers'].value_counts())
    # Number of positive labels among the outliers and among the normal values
    print(data_train.groupby(fea+'_outliers')['label'].sum())
    print('*'*10)

# Inspect outliers with box plots
fig,ax=plt.subplots(figsize=(15,5))
train.boxplot(ax=ax)

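# Apply the Pauta criterion (3-sigma rule)
import numpy as np
import pandas as pd

data = train
# data.shape[0] is the number of rows, data.shape[1] the number of columns
# Loop over the columns; column 0 is skipped (presumably an ID column)
for i in range(1, data.shape[1]):
    print("Processing column " + str(i))
    lie = data.iloc[:, i].to_numpy()
    print(lie)
    # Mean and sample standard deviation of the column.
    # The nan-aware versions are used because the data has missing values;
    # np.mean and np.std would return NaN and no outlier would be detected.
    mea = np.nanmean(lie)
    s = np.nanstd(lie, ddof=1)
    print("Mean and standard deviation: " + str(mea) + " " + str(s))
    # Find the rows more than three standard deviations from the mean
    for t in range(data.shape[0]):
        if abs(lie[t] - mea) > 3 * s:
            print(">3sigma" + " " + str(t) + " " + str(i))
            # Set the outlier to NaN, except in column 33
            if i != 33:
                data.iloc[t, i] = np.nan

# Keep the processed data under the original name
train = data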
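
The nested loop can also be expressed with vectorized pandas operations. The following is a sketch under assumptions, not the method used in the competition: the function name `mask_3sigma`, the column list, and the toy data are hypothetical, and it works on a copy instead of modifying the input in place.

```python
import numpy as np
import pandas as pd

def mask_3sigma(df, cols):
    """Return a copy of df with values beyond mean ± 3·std in cols set to NaN."""
    out = df.copy()
    part = out[cols].astype(float)
    mean = part.mean()      # column means, NaN skipped
    std = part.std(ddof=1)  # sample standard deviations, NaN skipped
    is_outlier = (part - mean).abs() > 3 * std
    out[cols] = part.mask(is_outlier)
    return out, is_outlier

# Toy data: one extreme value in A1; the label column is left untouched.
demo = pd.DataFrame({"A1": [1.0] * 19 + [100.0], "label": [0] * 19 + [1]})
cleaned, flags = mask_3sigma(demo, ["A1"])
print(flags["A1"].sum(), "value(s) masked")
```

Passing the feature columns by name avoids hard-coding a column index for the label.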

Data relationships

f, ax = plt.subplots(1,1, figsize = (20,20))
cor = train[numerical_feature].corr()
sns.heatmap(cor, annot = True, linewidth = 0.2, linecolor = "white", ax = ax, fmt =".1g" )

# View the correlation between each variable and the label
train.corr()["label"].sort_values()
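
Sorting by raw correlation puts strong negative and strong positive features at opposite ends. As a small sketch on made-up data (column names and values are hypothetical), ranking by absolute value lists the most strongly related features first:

```python
import pandas as pd

# Toy data: A1 rises with the label, A2 falls with it, A3 is only weakly related.
demo = pd.DataFrame({
    "A1": [1, 2, 3, 4, 5, 6],
    "A2": [6, 5, 4, 3, 2, 1],
    "A3": [1, 3, 2, 3, 2, 3],
    "label": [0, 0, 0, 1, 1, 1],
})

# Pearson correlation of each feature with the label, excluding the label itself.
corr_with_label = demo.corr()["label"].drop("label")

# Rank by absolute correlation, strongest first, keeping the original sign.
ranked = corr_with_label.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked)
```

Pearson correlation only captures linear relationships, so a low value does not prove a feature is useless.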

The above shares part of the exploratory data analysis (EDA) process.

I will not go into the specific analysis here; only the operations are shared.

Origin blog.csdn.net/qq_21402983/article/details/126598877