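Datawhale Data Mining for Beginners: Baseline
Baseline v1.0
Tip: This is a first-pass baseline, offered as a starting point. It provides a basic baseline and a brief introduction to the competition workflow. Discussion and feedback are welcome.
Competition: Data Mining for Beginners - Used Car Price Prediction
URL: https://tianchi.aliyun.com/competition/entrance/231784/introduction?spm=5176.12281957.1004.1.38b02448ausjSX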
# List the data files in the datalab directory
!ls datalab/
'ls' is not recognized as an internal or external command,
operable program or batch file.
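The error above comes from running the Unix `ls` command in a Windows shell. One portable alternative is to list the directory from Python itself. The sketch below is only an illustration: the helper name `list_data_files` is hypothetical, and it assumes the data lives in a folder called `datalab` as in the cell above.

```python
from pathlib import Path

def list_data_files(directory="datalab"):
    """Return the sorted entry names in `directory`, or [] if it does not exist."""
    path = Path(directory)
    if not path.is_dir():
        return []
    return sorted(entry.name for entry in path.iterdir())

# Works the same way on Windows, Linux, and macOS.
print(list_data_files("datalab"))
```

Because this avoids shell commands entirely, the same cell runs unchanged on a local Windows Anaconda install and on a hosted notebook.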
Step 1: Import the toolkits
## Basic tools
import numpy as np
import pandas as pd
import warnings
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.special import jn
from IPython.display import display, clear_output
import time
warnings.filterwarnings('ignore')
%matplotlib inline
## Models for prediction
from sklearn import linear_model
from sklearn import preprocessing
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor,GradientBoostingRegressor
## Dimensionality reduction
from sklearn.decomposition import PCA,FastICA,FactorAnalysis,SparsePCA
import lightgbm as lgb
import xgboost as xgb
## Parameter search and evaluation
from sklearn.model_selection import GridSearchCV,cross_val_score,StratifiedKFold,train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error
Step 2: Read the data
## Read the data with pandas (a convenient library for loading tabular data)
Train_data = pd.read_csv('used_car_train_20200313.csv', sep=',')
TestA_data = pd.read_csv('used_car_testA_20200313.csv', sep=',')
## Print the shape of each dataset
print('Train data shape:',Train_data.shape)
print('TestA data shape:',TestA_data.shape)
Train data shape: (150000, 31)
TestA data shape: (50000, 30)
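The two shapes differ by exactly one column, which is what you would expect if the test set simply lacks the label. A quick way to confirm which columns differ is sketched below. It uses tiny in-memory CSV text with made-up values in place of the real files, and the variable names are illustrative.

```python
import io
import pandas as pd

# Made-up miniature stand-ins for the train and test CSV files.
train_csv = io.StringIO("SaleID,power,price\n0,60,1850\n1,0,3600\n")
test_csv = io.StringIO("SaleID,power\n150000,75\n")

train = pd.read_csv(train_csv, sep=",")
test = pd.read_csv(test_csv, sep=",")

# Columns present in one frame but not the other.
only_in_train = [c for c in train.columns if c not in test.columns]
only_in_test = [c for c in test.columns if c not in train.columns]
print("Only in train:", only_in_train)
print("Only in test:", only_in_test)
```

Applied to `Train_data` and `TestA_data`, the same two list comprehensions would show whether `price` is the only difference.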
1) Quick look at the data
## Use .head() to preview the first rows of the loaded data
Train_data.head()
| | SaleID | name | regDate | model | brand | bodyType | fuelType | gearbox | power | kilometer | ... | v_5 | v_6 | v_7 | v_8 | v_9 | v_10 | v_11 | v_12 | v_13 | v_14 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 736 | 20040402 | 30 | 6 | 1 | 0.0 | 0 | 60 | 12.5 | ... | 0.235676 | 0.101988 | 0.129549 | 0.022816 | 0.097462 | -2.881803 | 2.804097 | -2.420821 | 0.795292 | 0.914763 |
1 | 1 | 2262 | 20030301 | 40 | 1 | 2 | 0.0 | 0 | 0 | 15 | ... | 0.264777 | 0.121004 | 0.135731 | 0.026597 | 0.020582 | -4.900482 | 2.096338 | -1.030483 | -1.722674 | 0.245522 |
2 | 2 | 14874 | 20040403 | 115 | 15 | 1 | 0.0 | 0 | 163 | 12.5 | ... | 0.251410 | 0.114912 | 0.165147 | 0.062173 | 0.027075 | -4.846749 | 1.803559 | 1.565330 | -0.832687 | -0.229963 |
3 | 3 | 71865 | 19960908 | 109 | 10 | 0 | 0.0 | 1 | 193 | 15 | ... | 0.274293 | 0.110300 | 0.121964 | 0.033395 | 0.000000 | -4.509599 | 1.285940 | -0.501868 | -2.438353 | -0.478699 |
4 | 4 | 111080 | 20120103 | 110 | 5 | 1 | 0.0 | 0 | 68 | 5 | ... | 0.228036 | 0.073205 | 0.091880 | 0.078819 | 0.121534 | -1.896240 | 0.910783 | 0.931110 | 2.834518 | 1.923482 |
5 rows × 31 columns
2) Inspect the data info
## .info() gives a quick view of the column names and how many non-null (non-NaN) values each column has
Train_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 31 columns):
SaleID 150000 non-null int64
name 150000 non-null int64
regDate 150000 non-null int64
model 150000 non-null int64
brand 150000 non-null int64
bodyType 150000 non-null int64
fuelType 150000 non-null float64
gearbox 150000 non-null object
power 150000 non-null object
kilometer 150000 non-null object
notRepairedDamage 150000 non-null object
regionCode 150000 non-null int64
seller 150000 non-null int64
offerType 150000 non-null float64
creatDate 150000 non-null float64
price 150000 non-null float64
v_0 150000 non-null float64
v_1 150000 non-null float64
v_2 150000 non-null float64
v_3 150000 non-null float64
v_4 150000 non-null float64
v_5 150000 non-null float64
v_6 150000 non-null float64
v_7 150000 non-null float64
v_8 150000 non-null float64
v_9 150000 non-null float64
v_10 150000 non-null float64
v_11 150000 non-null float64
v_12 148531 non-null float64
v_13 146417 non-null float64
v_14 135884 non-null float64
dtypes: float64(19), int64(8), object(4)
memory usage: 35.5+ MB
## View the column names via .columns
Train_data.columns
Index(['SaleID', 'name', 'regDate', 'model', 'brand', 'bodyType', 'fuelType',
'gearbox', 'power', 'kilometer', 'notRepairedDamage', 'regionCode',
'seller', 'offerType', 'creatDate', 'price', 'v_0', 'v_1', 'v_2', 'v_3',
'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12',
'v_13', 'v_14'],
dtype='object')
TestA_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 30 columns):
SaleID 50000 non-null int64
name 50000 non-null int64
regDate 50000 non-null int64
model 50000 non-null int64
brand 50000 non-null int64
bodyType 50000 non-null int64
fuelType 50000 non-null float64
gearbox 50000 non-null object
power 50000 non-null object
kilometer 50000 non-null object
notRepairedDamage 50000 non-null object
regionCode 50000 non-null int64
seller 50000 non-null float64
offerType 50000 non-null float64
creatDate 50000 non-null float64
v_0 50000 non-null float64
v_1 50000 non-null float64
v_2 50000 non-null float64
v_3 50000 non-null float64
v_4 50000 non-null float64
v_5 50000 non-null float64
v_6 50000 non-null float64
v_7 50000 non-null float64
v_8 50000 non-null float64
v_9 50000 non-null float64
v_10 50000 non-null float64
v_11 50000 non-null float64
v_12 49548 non-null float64
v_13 48880 non-null float64
v_14 45356 non-null float64
dtypes: float64(19), int64(7), object(4)
memory usage: 11.4+ MB
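In the `.info()` output, missing values have to be inferred from non-null counts that fall below the row count (here `v_12`, `v_13`, and `v_14`). Counting NaN per column directly is often easier to read. This is a small sketch on a toy frame with made-up values; the same two lines would apply to `Train_data` or `TestA_data`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data; values are invented for illustration.
toy = pd.DataFrame({
    "v_12": [0.1, np.nan, 0.3, 0.4],
    "v_13": [np.nan, np.nan, 1.0, 2.0],
    "v_14": [1.0, 2.0, 3.0, 4.0],
})

missing = toy.isnull().sum()       # number of NaN values per column
missing = missing[missing > 0]     # keep only columns that have gaps
print(missing.sort_values(ascending=False))
```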
3) Browse summary statistics
## .describe() shows summary statistics for the numeric feature columns
Train_data.describe()
| | SaleID | name | regDate | model | brand | bodyType | fuelType | regionCode | seller | offerType | ... | v_5 | v_6 | v_7 | v_8 | v_9 | v_10 | v_11 | v_12 | v_13 | v_14 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 150000.000000 | 150000.000000 | 1.500000e+05 | 150000.000000 | 150000.000000 | 150000.000000 | 150000.000000 | 1.500000e+05 | 1.500000e+05 | 1.500000e+05 | ... | 150000.000000 | 150000.000000 | 150000.000000 | 150000.000000 | 150000.000000 | 150000.000000 | 150000.000000 | 148531.000000 | 146417.000000 | 135884.000000 |
mean | 74999.500000 | 68349.172873 | 2.003417e+07 | 47.128953 | 8.052527 | 1.870747 | 1.394827 | 1.997783e+05 | 2.841478e+05 | 1.415701e+06 | ... | 0.246643 | 0.062381 | 0.174574 | 0.296920 | 0.406928 | -0.164371 | -0.446352 | -0.085471 | 0.022190 | 0.008456 |
std | 43301.414527 | 61103.875095 | 5.364988e+04 | 49.535881 | 7.864603 | 5.221312 | 15.676749 | 1.985073e+06 | 2.376421e+06 | 5.151321e+06 | ... | 0.116636 | 0.133581 | 0.927042 | 1.773396 | 1.962003 | 3.758661 | 2.002930 | 2.257730 | 1.267140 | 1.050417 |
min | 0.000000 | 0.000000 | 1.991000e+07 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | ... | 0.000000 | 0.000000 | -0.273510 | -8.206004 | -8.399672 | -9.168192 | -9.404106 | -9.639552 | -6.113291 | -6.546556 |
25% | 37499.750000 | 11156.000000 | 1.999091e+07 | 10.000000 | 1.000000 | 0.000000 | 0.000000 | 7.390000e+02 | 0.000000e+00 | 0.000000e+00 | ... | 0.241064 | 0.000161 | 0.055272 | 0.036050 | 0.035225 | -3.666042 | -2.026105 | -1.745234 | -0.999703 | -0.426907 |
50% | 74999.500000 | 51638.000000 | 2.003091e+07 | 30.000000 | 6.000000 | 1.000000 | 0.000000 | 2.010000e+03 | 0.000000e+00 | 0.000000e+00 | ... | 0.256928 | 0.001547 | 0.090081 | 0.058523 | 0.063335 | 1.240603 | -0.457218 | -0.160305 | 0.008602 | 0.155026 |
75% | 112499.250000 | 118841.250000 | 2.007111e+07 | 66.000000 | 13.000000 | 3.000000 | 1.000000 | 3.719000e+03 | 0.000000e+00 | 0.000000e+00 | ... | 0.265170 | 0.104255 | 0.120590 | 0.081996 | 0.094738 | 2.691063 | 1.115744 | 1.572130 | 0.929041 | 0.700543 |
max | 149999.000000 | 196812.000000 | 2.015121e+07 | 247.000000 | 39.000000 | 999.000000 | 3500.000000 | 2.016041e+07 | 2.016041e+07 | 2.016041e+07 | ... | 1.401999 | 1.387847 | 12.357011 | 18.819042 | 18.801218 | 18.802072 | 13.562011 | 11.147669 | 8.658418 | 2.743993 |
8 rows × 27 columns
TestA_data.describe()
| | SaleID | name | regDate | model | brand | bodyType | fuelType | regionCode | seller | offerType | ... | v_5 | v_6 | v_7 | v_8 | v_9 | v_10 | v_11 | v_12 | v_13 | v_14 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000.000000 | 50000.000000 | 5.000000e+04 | 50000.000000 | 50000.000000 | 50000.00000 | 50000.000000 | 5.000000e+04 | 5.000000e+04 | 5.000000e+04 | ... | 50000.000000 | 50000.000000 | 50000.000000 | 50000.000000 | 50000.000000 | 50000.000000 | 50000.000000 | 49548.000000 | 48880.000000 | 45356.000000 |
mean | 174999.500000 | 68542.223280 | 2.003393e+07 | 46.844520 | 8.056240 | 1.82978 | 1.310420 | 1.846027e+05 | 2.693423e+05 | 1.420901e+06 | ... | 0.245942 | 0.062157 | 0.167631 | 0.278298 | 0.394680 | -0.180498 | -0.441886 | -0.093695 | 0.018620 | 0.010720 |
std | 14433.901067 | 61052.808133 | 5.368870e+04 | 49.469548 | 7.819477 | 4.38703 | 11.319999 | 1.907934e+06 | 2.314646e+06 | 5.160173e+06 | ... | 0.113548 | 0.132116 | 0.889236 | 1.705061 | 1.904670 | 3.749435 | 2.011260 | 2.264558 | 1.262383 | 1.039922 |
min | 150000.000000 | 0.000000 | 1.991000e+07 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000e+00 | 0.000000e+00 | -4.137733e+00 | ... | 0.000000 | 0.000000 | 0.000000 | -7.481381 | -8.088973 | -9.160049 | -8.916949 | -8.249206 | -5.881834 | -6.112667 |
25% | 162499.750000 | 11203.500000 | 1.999091e+07 | 10.000000 | 1.000000 | 0.00000 | 0.000000 | 7.500000e+02 | 0.000000e+00 | 0.000000e+00 | ... | 0.241186 | 0.000157 | 0.055647 | 0.035926 | 0.034963 | -3.656769 | -2.033607 | -1.744112 | -0.999841 | -0.428772 |
50% | 174999.500000 | 52248.500000 | 2.003091e+07 | 29.000000 | 6.000000 | 1.00000 | 0.000000 | 2.025000e+03 | 0.000000e+00 | 0.000000e+00 | ... | 0.257005 | 0.005279 | 0.090068 | 0.058519 | 0.063502 | 1.208642 | -0.447549 | -0.165747 | 0.009142 | 0.152347 |
75% | 187499.250000 | 118856.500000 | 2.007110e+07 | 65.000000 | 13.000000 | 3.00000 | 1.000000 | 3.739000e+03 | 0.000000e+00 | 0.000000e+00 | ... | 0.265163 | 0.104231 | 0.120749 | 0.081606 | 0.094614 | 2.675705 | 1.136973 | 1.567727 | 0.924715 | 0.702953 |
max | 199999.000000 | 196805.000000 | 2.015121e+07 | 246.000000 | 39.000000 | 500.00000 | 610.000000 | 2.016041e+07 | 2.016041e+07 | 2.016041e+07 | ... | 1.339804 | 1.332522 | 12.338872 | 18.761276 | 18.811053 | 18.856218 | 12.950498 | 7.430223 | 5.228962 | 2.624622 |
8 rows × 26 columns
Step 3: Build features and labels
1) Extract the names of the numeric feature columns
numerical_cols = Train_data.select_dtypes(exclude = 'object').columns
print(numerical_cols)
Index(['SaleID', 'name', 'regDate', 'model', 'brand', 'bodyType', 'fuelType',
'regionCode', 'seller', 'offerType', 'creatDate', 'price', 'v_0', 'v_1',
'v_2', 'v_3', 'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11',
'v_12', 'v_13', 'v_14'],
dtype='object')
categorical_cols = Train_data.select_dtypes(include = 'object').columns
print(categorical_cols)
Index(['gearbox', 'power', 'kilometer', 'notRepairedDamage'], dtype='object')
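Columns such as `power` and `kilometer` are reported as `object` here, so the numeric-only selection leaves them out. If they are meant to be numeric, one way to investigate is to coerce them and count how many values fail to parse. The sketch below uses a toy frame; the `"-"` entry is a hypothetical bad value, and `to_numeric_report` is an illustrative helper, not part of the baseline.

```python
import pandas as pd

# Toy stand-in: numbers stored as strings, with one hypothetical bad value.
toy = pd.DataFrame({
    "power": ["60", "0", "163", "-"],
    "kilometer": ["12.5", "15", "12.5", "5"],
})

def to_numeric_report(df, cols):
    """Coerce `cols` to numbers; return (converted frame, parse failures per column)."""
    out = df.copy()
    failures = {}
    for col in cols:
        converted = pd.to_numeric(out[col], errors="coerce")
        # Values that were present but could not be parsed become NaN.
        failures[col] = int((converted.isna() & out[col].notna()).sum())
        out[col] = converted
    return out, failures

converted, failures = to_numeric_report(toy, ["power", "kilometer"])
print(converted.dtypes)
print(failures)
```

A nonzero failure count points to rows worth inspecting before deciding whether the column can be used as a numeric feature.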
2) Build the training and test samples
## Select the feature columns
feature_cols = [col for col in numerical_cols if col not in ['SaleID','name','regDate','creatDate','price','model','brand','regionCode','seller']]
feature_cols = [col for col in feature_cols if 'Type' not in col]
## Extract the feature columns and the label column to build the training and test samples
X_data = Train_data[feature_cols]
Y_data = Train_data['price']
X_test = TestA_data[feature_cols]
print('X train shape:',X_data.shape)
print('X test shape:',X_test.shape)
print('Y data shape:',Y_data.shape)
X train shape: (150000, 15)
X test shape: (50000, 15)
Y data shape: (150000,)
## Define a helper that prints summary statistics, for reuse later
def Sta_inf(data):
    print('_min',np.min(data))
    print('_max:',np.max(data))
    print('_mean',np.mean(data))
    print('_ptp',np.ptp(data))
    print('_std',np.std(data))
    print('_var',np.var(data))
3) Basic distribution statistics of the label
print('Sta of label:')
Sta_inf(Y_data)
Sta of label:
_min -4.236904217
_max: 99999.0
_mean 5635.615473908368
_ptp 100003.236904217
_std 7481.058399218618
_var 55966234.77251944
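When the same statistics need to be compared across arrays (for example the label versus a model's predictions), returning them instead of printing can be more convenient. The following is a minimal variant under that assumption; `summarize` is a hypothetical name, and it uses population statistics (`ddof=0`, NumPy's default for arrays).

```python
import numpy as np

def summarize(data):
    """Return min, max, mean, peak-to-peak range, std and var of `data` as a dict."""
    arr = np.asarray(data, dtype=float)
    return {
        "min": arr.min(),
        "max": arr.max(),
        "mean": arr.mean(),
        "ptp": arr.max() - arr.min(),  # peak-to-peak: the range max - min
        "std": arr.std(),              # population std (ddof=0)
        "var": arr.var(),              # population variance (ddof=0)
    }

stats = summarize([1.0, 2.0, 3.0, 4.0])
for name, value in stats.items():
    print(name, value)
```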
## Plot a histogram of the label to inspect its distribution
plt.hist(Y_data)
plt.show()
plt.close()
[Figure: histogram of the price label (output_24_0.png); image not available]
4) Fill missing values with -1
X_data = X_data.fillna(-1)
X_test = X_test.fillna(-1)
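One thing to keep in mind with a `-1` sentinel is that the `v_*` features themselves take negative values (the `.describe()` output shows minimums well below `-1`), so a filled `-1` can look like a real value. A possible refinement, sketched here as an assumption rather than as part of the baseline, is to add a 0/1 indicator column before filling. The helper name and the toy values are illustrative.

```python
import numpy as np
import pandas as pd

# Toy frame; note that v_12 already contains a genuine -1.0.
toy = pd.DataFrame({"v_12": [0.5, np.nan, -1.0], "v_13": [np.nan, 2.0, 3.0]})

def fill_with_flag(df, value=-1):
    """Fill NaN with `value` and add a 0/1 `<col>_missing` column for each affected column."""
    out = df.copy()
    for col in df.columns:
        mask = df[col].isna()
        if mask.any():
            out[col + "_missing"] = mask.astype(int)
    return out.fillna(value)

filled = fill_with_flag(toy)
print(filled)
```

With the indicator in place, a tree model can tell a filled `-1` apart from a measured `-1`.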
Step 4: Model training and prediction
1) Use 5-fold cross-validation with xgb to check how the model parameters perform
Y_data=Y_data.astype(int)
print(Y_data)
0 1850
1 3600
2 6222
3 2400
4 5200
...
149995 5900
149996 9500
149997 7500
149998 4999
149999 4700
Name: price, Length: 150000, dtype: int32
# print(X_data)
## xgb-Model
xgr = xgb.XGBRegressor(n_estimators=120, learning_rate=0.1, gamma=0, subsample=0.8,\
        colsample_bytree=0.9, max_depth=7) #,objective ='reg:squarederror'
scores_train = []
scores = []
## 5-fold cross-validation
sk=StratifiedKFold(n_splits=5,shuffle=True,random_state=0)
for train_ind,val_ind in sk.split(X_data,Y_data):
    print(train_ind,val_ind)
    train_x=X_data.iloc[train_ind].values
    train_y=Y_data.iloc[train_ind]
    val_x=X_data.iloc[val_ind].values
    val_y=Y_data.iloc[val_ind]
    xgr.fit(train_x,train_y)
    pred_train_xgb=xgr.predict(train_x)
    pred_xgb=xgr.predict(val_x)
    score_train = mean_absolute_error(train_y,pred_train_xgb)
    scores_train.append(score_train)
    score = mean_absolute_error(val_y,pred_xgb)
    scores.append(score)
## Note: score_train holds only the last fold's value, so this prints the last fold's train MAE;
## use np.mean(scores_train) for the average over all folds
print('Train mae:',np.mean(score_train))
print('Val mae',np.mean(scores))
[ 1 2 4 ... 149994 149996 149997] [ 0 3 11 ... 149995 149998 149999]
[16:52:52] WARNING: src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
[ 0 1 2 ... 149997 149998 149999] [ 6 7 10 ... 149987 149990 149993]
[16:53:23] WARNING: src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
[ 0 1 3 ... 149997 149998 149999] [ 2 4 8 ... 149980 149986 149991]
[16:53:54] WARNING: src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
[ 0 2 3 ... 149997 149998 149999] [ 1 9 14 ... 149971 149973 149983]
[16:54:25] WARNING: src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
[ 0 1 2 ... 149995 149998 149999] [ 5 16 22 ... 149994 149996 149997]
[16:54:56] WARNING: src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
Train mae: 658.4843255196383
Val mae 751.1093641226322
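`StratifiedKFold` is designed for classification: it keeps class proportions similar across folds, which is why the price had to be cast to `int` first, and each distinct price is then treated as its own class. For a continuous target, a plain shuffled K-fold split (for example `sklearn.model_selection.KFold`) is the more usual choice. The NumPy-only sketch below shows the mechanics of such a split and averages both MAE lists over all folds. The target is synthetic, the "model" is just the training median, and `kfold_indices` is a hypothetical helper.

```python
import numpy as np

def kfold_indices(n_samples, n_splits=5, seed=0):
    """Yield (train_idx, val_idx) pairs from a shuffled K-fold split."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    folds = np.array_split(order, n_splits)
    for i in range(n_splits):
        val_idx = folds[i]
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, val_idx

# Synthetic continuous target standing in for price (made-up data).
rng = np.random.default_rng(1)
y = rng.gamma(shape=2.0, scale=3000.0, size=1000)

train_maes, val_maes = [], []
for train_idx, val_idx in kfold_indices(len(y)):
    pred = np.median(y[train_idx])  # trivial baseline in place of a real model
    train_maes.append(np.mean(np.abs(y[train_idx] - pred)))
    val_maes.append(np.mean(np.abs(y[val_idx] - pred)))

print("Train MAE (mean over folds):", np.mean(train_maes))
print("Val MAE (mean over folds):", np.mean(val_maes))
```

Each sample lands in exactly one validation fold, and no stratification by target value is involved.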
2) Define the xgb and lgb model functions
def build_model_xgb(x_train,y_train):
    model = xgb.XGBRegressor(n_estimators=150, learning_rate=0.1, gamma=0, subsample=0.8,\
        colsample_bytree=0.9, max_depth=7) #, objective ='reg:squarederror'
    model.fit(x_train, y_train)
    return model
def build_model_lgb(x_train,y_train):
    estimator = lgb.LGBMRegressor(num_leaves=127,n_estimators = 150)
    param_grid = {
        'learning_rate': [0.01, 0.05, 0.1, 0.2],
    }
    gbm = GridSearchCV(estimator, param_grid)
    gbm.fit(x_train, y_train)
    return gbm
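When no `scoring` argument is given, `GridSearchCV` ranks candidates with the estimator's own `score` method, which for scikit-learn style regressors is R². Since this walkthrough evaluates models by MAE, it may make sense to select on MAE as well. The sketch below illustrates the `scoring` argument on synthetic data with a small decision tree standing in for the LightGBM model; the data and the grid are made up.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"max_depth": [2, 4, 6]},
    scoring="neg_mean_absolute_error",  # scikit-learn maximizes scores, so MAE is negated
    cv=3,
)
search.fit(X, y)

print("Best params:", search.best_params_)
print("Best CV MAE:", -search.best_score_)
```

The same `scoring` and `cv` arguments could be passed where `build_model_lgb` constructs its `GridSearchCV`.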
3) Split the dataset (Train, Val) for model training, evaluation, and prediction
## Split data with val
x_train,x_val,y_train,y_val = train_test_split(X_data,Y_data,test_size=0.3)
print('Train lgb...')
model_lgb = build_model_lgb(x_train,y_train)
val_lgb = model_lgb.predict(x_val)
MAE_lgb = mean_absolute_error(y_val,val_lgb)
print('MAE of val with lgb:',MAE_lgb)
print('Predict lgb...')
model_lgb_pre = build_model_lgb(X_data,Y_data)
subA_lgb = model_lgb_pre.predict(X_test)
print('Sta of Predict lgb:')
Sta_inf(subA_lgb)
Train lgb...
MAE of val with lgb: 733.8377157667358
Predict lgb...
Sta of Predict lgb:
_min -4653.411185581689
_max: 89013.32963882982
_mean 5640.487279687575
_ptp 93666.7408244115
_std 7358.320067626515
_var 54144874.21763509
print('Train xgb...')
model_xgb = build_model_xgb(x_train,y_train)
val_xgb = model_xgb.predict(x_val)
MAE_xgb = mean_absolute_error(y_val,val_xgb)
print('MAE of val with xgb:',MAE_xgb)
print('Predict xgb...')
model_xgb_pre = build_model_xgb(X_data,Y_data)
subA_xgb = model_xgb_pre.predict(X_test)
print('Sta of Predict xgb:')
Sta_inf(subA_xgb)
Train xgb...
[16:56:15] WARNING: src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
MAE of val with xgb: 759.8445985911052
Predict xgb...
[16:56:47] WARNING: src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
Sta of Predict xgb:
_min -847.2674
_max: 90753.81
_mean 5640.081
_ptp 91601.08
_std 7335.928
_var 53815840.0
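4) Weighted fusion of the two models' results
## Here we use a simple weighted fusion
val_Weighted = (1-MAE_lgb/(MAE_xgb+MAE_lgb))*val_lgb+(1-MAE_xgb/(MAE_xgb+MAE_lgb))*val_xgb
val_Weighted[val_Weighted<0]=10 # Some predictions are negative, but a negative price cannot occur in reality, so we correct them after the fact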
print('MAE of val with Weighted ensemble:',mean_absolute_error(y_val,val_Weighted))
MAE of val with Weighted ensemble: 732.8346989009693
sub_Weighted = (1-MAE_lgb/(MAE_xgb+MAE_lgb))*subA_lgb+(1-MAE_xgb/(MAE_xgb+MAE_lgb))*subA_xgb
## Inspect the distribution of the predicted values
plt.hist(sub_Weighted)
plt.show()
plt.close()
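The weights in the fusion formula are `1 - MAE_i / (MAE_lgb + MAE_xgb)`, so the model with the lower validation MAE receives the larger weight and the two weights sum to 1. The sketch below wraps that formula in two hypothetical helpers and applies the same negative-value correction (replace with 10) inside the blend, so it can be reused for both validation and test predictions. The MAE values are the ones printed earlier in this notebook.

```python
import numpy as np

def inverse_error_weights(mae_a, mae_b):
    """Return weights (w_a, w_b) = (1 - mae_a/total, 1 - mae_b/total); they sum to 1."""
    total = mae_a + mae_b
    return 1 - mae_a / total, 1 - mae_b / total

def blend(pred_a, pred_b, mae_a, mae_b, floor=10):
    """Weighted average of two prediction arrays; negative results are set to `floor`."""
    w_a, w_b = inverse_error_weights(mae_a, mae_b)
    blended = w_a * np.asarray(pred_a, dtype=float) + w_b * np.asarray(pred_b, dtype=float)
    blended[blended < 0] = floor
    return blended

# Validation MAEs reported above for lgb and xgb.
w_lgb, w_xgb = inverse_error_weights(733.8377157667358, 759.8445985911052)
print("lgb weight:", w_lgb)
print("xgb weight:", w_xgb)
print("sum:", w_lgb + w_xgb)
```

Because the two MAEs are close, the weights end up near one half each, which is why the blended validation MAE is only slightly better than lgb alone.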
5) Output the results
sub = pd.DataFrame()
sub['SaleID'] = TestA_data.SaleID
sub['price'] = sub_Weighted
sub.to_csv('./sub_Weighted.csv',index=False)
sub.head()
| | SaleID | price |
---|---|---|
0 | 150000 | 40535.122371 |
1 | 150001 | 349.752006 |
2 | 150002 | 6698.887589 |
3 | 150003 | 11373.901882 |
4 | 150004 | 562.043999 |
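Before uploading, it can be worth running a few sanity checks on the submission frame. The sketch below assumes the expected layout is exactly the two columns written above (`SaleID`, `price`) in test-set order. `check_submission` is a hypothetical helper and the toy values are made up.

```python
import numpy as np
import pandas as pd

def check_submission(sub, expected_ids):
    """Return a list of problems found in a SaleID/price submission frame."""
    if list(sub.columns) != ["SaleID", "price"]:
        return ["columns must be exactly ['SaleID', 'price']"]
    problems = []
    if sub["SaleID"].tolist() != list(expected_ids):
        problems.append("SaleID values do not match the test set order")
    if sub["price"].isna().any():
        problems.append("price contains NaN")
    if (sub["price"] < 0).any():
        problems.append("price contains negative values")
    return problems

# Toy example with one negative and one missing prediction.
toy_sub = pd.DataFrame({"SaleID": [150000, 150001, 150002],
                        "price": [4053.1, -12.0, np.nan]})
print(check_submission(toy_sub, [150000, 150001, 150002]))
```

For the real frame, the call would be `check_submission(sub, TestA_data.SaleID)`, and an empty list means none of these checks fired.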
Baseline END.
— By: ML67
Email: [email protected]
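PS: Graduate student at Huazhong University of Science and Technology and a long-time participant on Tianchi and similar platforms. Happy to exchange ideas with everyone.
github: https://github.com/mlw67 (will be organizing some book derivations and code in the near future)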
— By: AI蜗牛车
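PS: Graduate student at Southeast University; research focuses on spatio-temporal sequence prediction and time series data mining
WeChat official account: AI蜗牛车
Zhihu: https://www.zhihu.com/people/seu-aigua-niu-che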
github: https://github.com/chehongshu
— By: 阿泽
PS: Graduate student in computer science at Fudan University
Zhihu: 阿泽 https://www.zhihu.com/people/is-aze (mainly knowledge summaries aimed at beginners)
— By: 小雨姑娘
PS: Data mining enthusiast who has placed near the top in multiple competitions.
Zhihu: 小雨姑娘的机器学习笔记: https://zhuanlan.zhihu.com/mlbasic
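About Datawhale:
Datawhale is an open-source organization focused on data science and AI. It brings together outstanding learners from universities and well-known companies across many fields, and gathers team members who share an open-source and exploratory spirit. With the vision "for the learner, growing together with learners", Datawhale encourages authentic self-expression, openness and inclusiveness, mutual trust and help, the willingness to experiment, and the courage to take responsibility. Datawhale also applies open-source ideas to open-source content, open-source learning, and open-source solutions, supporting talent development and growth and building connections between people, between people and knowledge, between people and companies, and between people and the future.
The topic materials for this data mining learning path will be shared on Tianchi. For details, follow Datawhale: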