Learning Directory:
1. Sklearn transformers and estimators
1. Transformer (the parent class of the feature-engineering steps)
2. Estimator (sklearn's implementation of a machine learning algorithm)
Step 1: Instantiate an estimator
Step 2: Call estimator.fit(x_train, y_train) to train; once the call completes, the model has been generated
Step 3: Model evaluation
1. Directly compare the true value and the predicted value
y_predict = estimator.predict(x_test) (predict on the test set)
y_test == y_predict
2. Calculate the accuracy
accuracy = estimator.score(x_test, y_test)
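The three steps above can be sketched as follows. This is a minimal illustration, assuming the iris dataset and a KNeighborsClassifier with n_neighbors=3; the function name evaluate_estimator and these settings are illustrative choices, not part of the notes.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def evaluate_estimator(random_state=22):
    """Hypothetical helper: fit an estimator and evaluate it in two ways."""
    iris = load_iris()
    x_train, x_test, y_train, y_test = train_test_split(
        iris.data, iris.target, random_state=random_state
    )

    # Step 1: instantiate an estimator (n_neighbors=3 is an assumed value)
    estimator = KNeighborsClassifier(n_neighbors=3)

    # Step 2: train; after fit() returns, the model exists
    estimator.fit(x_train, y_train)

    # Step 3, method 1: compare true and predicted values element by element
    y_predict = estimator.predict(x_test)
    matches = y_test == y_predict

    # Step 3, method 2: accuracy, i.e. the fraction of correct predictions
    accuracy = estimator.score(x_test, y_test)
    return matches, accuracy


matches, accuracy = evaluate_estimator()
print(matches)
print(accuracy)
```

For a classifier, score() returns the mean accuracy, so it equals the fraction of True entries in the element-wise comparison.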
2. KNN algorithm (K-nearest neighbors algorithm)
Principle: For a given sample, find the k samples closest to it in the feature space (its nearest neighbors). If most of these k samples belong to one category, the sample is assigned to that category.
Distance formulas: Euclidean distance, Manhattan distance, Minkowski distance
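The three distances are closely related: the Minkowski distance with p=1 is the Manhattan distance, and with p=2 it is the Euclidean distance. A small pure-Python sketch, with function names chosen only for illustration:

```python
def minkowski_distance(a, b, p):
    """Minkowski distance of order p between two equal-length points."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


def manhattan_distance(a, b):
    """Sum of absolute coordinate differences (Minkowski with p=1)."""
    return minkowski_distance(a, b, 1)


def euclidean_distance(a, b):
    """Straight-line distance (Minkowski with p=2)."""
    return minkowski_distance(a, b, 2)


print(manhattan_distance((0, 0), (3, 4)))
print(euclidean_distance((0, 0), (3, 4)))
```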
Example: Use the KNN algorithm to classify the iris dataset
Advantages:
Simple, easy to understand, easy to implement; no explicit training phase (distances are computed only when test samples arrive)
Disadvantages:
Lazy algorithm: classifying test samples is computationally intensive, and the memory overhead is large.
The value of k must be specified: when k is too small, the result is easily affected by outliers; when k is too large, it is easily affected by class imbalance.
Use case: small datasets
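To make the principle and the "lazy" behaviour concrete, here is a from-scratch sketch of KNN classification. It assumes Euclidean distance and a simple majority vote; knn_predict and the toy points are hypothetical and are not how sklearn implements the algorithm internally.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # No model is built beforehand: every prediction scans all training
    # samples, which is why KNN is costly at prediction time.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy data: two well-separated groups (illustrative values only)
points = [(0, 0), (0, 1), (1, 0), (5, 5), (6, 5), (5, 6)]
labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(points, labels, (0.5, 0.5), k=3))
print(knn_predict(points, labels, (5.5, 5.5), k=3))
```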
3. Model selection and tuning
Application: it helps us choose the value of k in the KNN algorithm.
1. What is cross-validation?
The training data is divided into 4 parts: three are used for training and one for validation. A different part serves as the validation set each time. After 4 rounds, the four model results are averaged. This is 4-fold cross-validation.
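A minimal sketch of 4-fold cross-validation using sklearn's cross_val_score. The iris dataset and n_neighbors=5 are assumptions made for illustration. Note that for a classifier with an integer cv, sklearn uses stratified folds, so each fold keeps roughly the same class proportions.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
estimator = KNeighborsClassifier(n_neighbors=5)  # assumed value of k

# cv=4: each of the 4 parts serves as the validation set exactly once
fold_scores = cross_val_score(estimator, iris.data, iris.target, cv=4)

# The cross-validation result is the average of the 4 fold scores
mean_score = fold_scores.mean()

print(fold_scores)
print(mean_score)
```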
2. Hyperparameter search: grid search
Principle of grid search: try each candidate hyperparameter value in turn, then select the best one to build the model
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV


def knn_iris_gscv():
    """Classify the iris dataset with KNN, adding grid search and cross-validation.

    :return: None
    """
    # Load the data
    iris = load_iris()
    # Split the dataset
    x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=22)
    # Feature engineering: standardization
    transfer = StandardScaler()  # instantiate
    x_train = transfer.fit_transform(x_train)
    # Do not call fit() here: the test set must be scaled with the
    # statistics learned from the training set
    x_test = transfer.transform(x_test)
    # KNN estimator
    estimator = KNeighborsClassifier()  # instantiate
    # Add grid search and cross-validation
    # Prepare the parameters: try k = 1, 3, 5, 7, 9, 11
    param_dict = {'n_neighbors': [1, 3, 5, 7, 9, 11]}
    estimator = GridSearchCV(estimator, param_grid=param_dict, cv=10)  # 10-fold cross-validation
    estimator.fit(x_train, y_train)  # fit on the training data
    # Model evaluation
    # Method 1: directly compare the true values and the predicted values
    y_predict = estimator.predict(x_test)
    print('y_predict:\n', y_predict)
    print('Direct comparison of true and predicted values:\n', y_test == y_predict)
    # Method 2: calculate the accuracy
    score = estimator.score(x_test, y_test)
    print('Accuracy:\n', score)
    print('Best parameters:\n', estimator.best_params_)
    print('Best score:\n', estimator.best_score_)
    print('Best estimator:\n', estimator.best_estimator_)
    print('Cross-validation results:\n', estimator.cv_results_)


if __name__ == '__main__':
    knn_iris_gscv()
Case: Predicting Facebook check-in locations
Reduce the data range (from more than 20 million rows to about 80,000):
Convert the timestamp to year, month, day, hour, minute and second (pd.to_datetime() can parse many different date formats):
Keep only the day, the hour and the day of the week (the year and month are the same for every row, so there is no need to keep them):
Group by place_id with groupby() and use .count() to count the check-ins for each place_id; keep only the row_id column:
Keep the IDs of the places with more than three check-ins:
From place_count[place_count>3], keep only the index values, i.e. the place_id values whose check-in count is greater than 3:
Use a Boolean index on the data to keep only the rows whose place_id has place_count>3:
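The preprocessing steps above can be sketched with pandas as follows. The real check-in file is not part of these notes, so the small DataFrame, its values, the x/y bounds and the assumption that the timestamp is in seconds are all hypothetical stand-ins chosen only to make the steps runnable.

```python
import pandas as pd

# Hypothetical stand-in for the check-in table (values are made up)
data = pd.DataFrame({
    "row_id": range(8),
    "x": [1.0, 1.1, 1.2, 1.3, 1.4, 5.0, 5.1, 9.0],
    "y": [2.0, 2.1, 2.2, 2.3, 2.4, 6.0, 6.1, 9.5],
    "time": [470702, 186555, 322648, 704587, 472130, 178065, 666829, 369002],
    "place_id": [10, 10, 10, 10, 20, 20, 30, 10],
})

# 1. Reduce the data range (bounds are illustrative assumptions)
data = data.query("x < 6 and y < 7").copy()

# 2. Convert the timestamp (assumed to be in seconds) to datetimes
time_value = pd.to_datetime(data["time"], unit="s")

# 3. Keep only the day, the day of the week and the hour
data = data.assign(
    day=time_value.dt.day,
    weekday=time_value.dt.weekday,
    hour=time_value.dt.hour,
)

# 4. Count check-ins per place_id, keeping only the row_id column
place_count = data.groupby("place_id").count()["row_id"]

# 5. IDs of the places with more than three check-ins
frequent_ids = place_count[place_count > 3].index.values

# 6. Boolean index: keep only rows whose place_id is in that list
data_final = data[data["place_id"].isin(frequent_ids)]

print(place_count)
print(data_final)
```

Using isin() builds the Boolean mask in one step, so rows belonging to rarely visited places are dropped before any model is trained.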
Summary of classification algorithms: