Grid Search
import numpy as np
from sklearn import datasets
digits=datasets.load_digits()
X=digits.data
y=digits.target
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=666)
from sklearn.neighbors import KNeighborsClassifier
knn_clf=KNeighborsClassifier(n_neighbors=4,weights="uniform")
knn_clf.fit(X_train,y_train)
knn_clf.score(X_test,y_test)
0.9916666666666667
Grid Search
import numpy as np
from sklearn import datasets
digits=datasets.load_digits()
X=digits.data
y=digits.target
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=666)
param_grid=[
{
'weights':['uniform'],
'n_neighbors':[i for i in range(1,11)]
},
{
'weights':['distance'],
'n_neighbors':[i for i in range(1,11)],
'p':[i for i in range(1,6)]
}
]
Instantiate a classifier with default parameters
from sklearn.neighbors import KNeighborsClassifier
knn_clf=KNeighborsClassifier()
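n_jobs sets how many CPU cores are used (-1 means all available cores); verbose controls how much progress output is printed during the search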
from sklearn.model_selection import GridSearchCV
grid_search=GridSearchCV(knn_clf,param_grid,n_jobs=-1,verbose=2)
%%time
grid_search.fit(X_train,y_train)
Fitting 5 folds for each of 60 candidates, totalling 300 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done 33 tasks | elapsed: 3.4s
[Parallel(n_jobs=-1)]: Done 154 tasks | elapsed: 18.6s
Wall time: 44.7 s
[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed: 44.6s finished
GridSearchCV(cv=None, error_score=nan,
estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30,
metric='minkowski',
metric_params=None, n_jobs=None,
n_neighbors=1, p=2,
weights='uniform'),
iid='deprecated', n_jobs=-1,
param_grid=[{'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'weights': ['uniform']},
{'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'p': [1, 2, 3, 4, 5], 'weights': ['distance']}],
pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
scoring=None, verbose=2)
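The "60 candidates" and "300 fits" in the log above follow directly from the grid: each dictionary in param_grid contributes the Cartesian product of its value lists, and every candidate is fitted once per fold. Here is a minimal sketch of that arithmetic in plain Python. The helper name count_candidates is hypothetical, and the fold count of 5 is read off the log rather than set in the code.

```python
from itertools import product

# Same grid as in the notebook, written with list(range(...)).
param_grid = [
    {"weights": ["uniform"], "n_neighbors": list(range(1, 11))},
    {"weights": ["distance"], "n_neighbors": list(range(1, 11)), "p": list(range(1, 6))},
]


def count_candidates(grid):
    """Count the parameter combinations across all sub-grids (hypothetical helper)."""
    total = 0
    for sub_grid in grid:
        # Each sub-grid contributes the Cartesian product of its value lists.
        total += sum(1 for _ in product(*sub_grid.values()))
    return total


n_candidates = count_candidates(param_grid)  # 10 uniform + 10 * 5 distance
n_folds = 5  # taken from the "Fitting 5 folds" log line
print(n_candidates, n_candidates * n_folds)  # prints: 60 300
```

The first sub-grid leaves p out because p only matters for the distance metric itself. Searching p together with weights='uniform' as well would add another 40 candidates.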
Get the best model from best_estimator_
grid_search.best_estimator_
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=None, n_neighbors=1, p=2,
weights='uniform')
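Accuracy of the best model (best_score_ is the mean cross-validated accuracy on the training data, not the test-set accuracy)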
grid_search.best_score_
0.9860820751064653
Return the best parameters
grid_search.best_params_
{'n_neighbors': 1, 'weights': 'uniform'}
knn_clf=grid_search.best_estimator_
knn_clf.score(X_test,y_test)
0.9833333333333333
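The two numbers reported above measure different things. best_score_ is the mean accuracy over the cross-validation folds carved out of X_train, while score(X_test, y_test) is the accuracy on the held-out test split. The sketch below makes the link to cv_results_ explicit. It assumes a much smaller grid (the name small_grid and its values are illustrative, not the notebook's grid) so that it runs quickly.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=666
)

# Illustrative, deliberately tiny grid (an assumption for speed).
small_grid = {"n_neighbors": [1, 3, 5], "weights": ["uniform"]}
grid_search = GridSearchCV(KNeighborsClassifier(), small_grid, cv=5)
grid_search.fit(X_train, y_train)

# One mean validation score per candidate; "test" here means the held-out CV fold.
mean_scores = grid_search.cv_results_["mean_test_score"]
for params, score in zip(grid_search.cv_results_["params"], mean_scores):
    print(params, round(float(score), 4))

# best_score_ is the entry of mean_test_score at best_index_.
cv_score = grid_search.best_score_
# The refitted best model is then evaluated on data the search never saw.
test_score = grid_search.best_estimator_.score(X_test, y_test)
print("cross-validated:", cv_score, "held-out test:", test_score)
```

Since the search only ever sees X_train, the test-set score remains an independent check on the chosen hyperparameters.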
More distance definitions
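Cosine similarity (vector space) | Adjusted cosine similarity | Pearson correlation coefficient | Jaccard similarity coefficient |
---|---|---|---|
Measures how similar two vectors are by the angle between them, using the cosine of that angle: $\cos\theta = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}$ | Obtained by subtracting the mean before applying the cosine formula | Applies when: (1) the two variables are linearly related and both are continuous; (2) the populations of both variables are normally distributed, or unimodal and close to normal; (3) the observations come in pairs, and the pairs are independent of one another | Given two sets A and B, the Jaccard coefficient is the size of their intersection divided by the size of their union. A larger value means greater similarity. When A and B are both empty, jaccard(A, B) = 1 |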
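The definitions in the table can be sketched in a few lines of plain Python. This is a minimal illustration, not a library implementation, and the function names are hypothetical. Pearson correlation is written here as the cosine similarity of mean-centered vectors, which is the same "subtract the mean first" idea the table gives for adjusted cosine similarity. Which mean is subtracted in a particular adjusted-cosine variant is an assumption that depends on the application.

```python
import math


def cosine_similarity(a, b):
    """cos(theta) = (a . b) / (|a| * |b|); undefined for zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def pearson_correlation(a, b):
    """Cosine similarity after subtracting each vector's own mean.

    Undefined (division by zero) when either vector is constant.
    """
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    centered_a = [x - mean_a for x in a]
    centered_b = [y - mean_b for y in b]
    return cosine_similarity(centered_a, centered_b)


def jaccard(set_a, set_b):
    """|A intersect B| / |A union B|, with jaccard(empty, empty) defined as 1."""
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


print(cosine_similarity([1, 0], [0, 1]))  # orthogonal vectors
print(pearson_correlation([1, 2, 3], [3, 5, 7]))  # perfectly linear relation
print(jaccard({1, 2, 3}, {2, 3, 4}))  # 2 shared elements out of 4
```

Cosine similarity and Pearson correlation are similarities rather than distances: larger values mean "closer". A common way to turn them into a distance is to use 1 minus the similarity.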
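Why grid_search.fit(X_train,y_train) raised an error
- weights was misspelled as weight
- The sklearn packages had to be imported again, otherwise an error was raised. I do not know the exact cause: the video ran without errors, but my own run failed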
param=[
{
'weights':['uniform'],
'n_neighbors':[i for i in range(1,11)]
},
{
'weights':['distance'],
'n_neighbors':[i for i in range(1,11)],
'p':[i for i in range(1,6)]
}
]
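A misspelled key such as weight instead of weights can be caught before the search starts by comparing the grid's keys against the names the estimator accepts, which get_params() returns. A minimal sketch follows. The helper unknown_param_names and the two small example grids are hypothetical and not part of sklearn.

```python
from sklearn.neighbors import KNeighborsClassifier


def unknown_param_names(estimator, param_grid):
    """Return grid keys that the estimator does not accept (hypothetical helper)."""
    valid = set(estimator.get_params().keys())
    # GridSearchCV accepts either one dict or a list of dicts.
    sub_grids = param_grid if isinstance(param_grid, list) else [param_grid]
    unknown = set()
    for sub_grid in sub_grids:
        unknown |= set(sub_grid) - valid
    return sorted(unknown)


bad_grid = [{"weight": ["uniform"], "n_neighbors": [1, 2, 3]}]  # typo: weight
good_grid = [{"weights": ["uniform"], "n_neighbors": [1, 2, 3]}]

print(unknown_param_names(KNeighborsClassifier(), bad_grid))   # prints: ['weight']
print(unknown_param_names(KNeighborsClassifier(), good_grid))  # prints: []
```

Running this kind of check before fit means the typo shows up immediately, rather than as an error raised from inside the parallel workers.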