Data mining algorithms

KNN algorithm

# Model parameters vs. hyperparameters
# Model parameters are variables internal to the model and can be estimated from the data; hyperparameters are external configuration of the model whose values must be set manually.
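# A minimal runnable sketch of the distinction (the toy numbers below are assumptions for illustration, not from the original notes):

from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier

# hyperparameters: external configuration, set manually before training
knn = KNeighborsClassifier(n_neighbors=5, weights='uniform')

# model parameters: internal variables, estimated from the data by fit()
lr = LinearRegression().fit([[0], [1], [2]], [1, 3, 5])
print(lr.coef_, lr.intercept_)  # learned parameters: slope and intercept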


# Cross-validation
# The general procedure for solving a machine learning problem:
# Data preprocessing: handle duplicates and outliers, feature selection, feature transformation, feature dimensionality reduction
# Data modeling: machine learning problems fall into two categories, classification and regression; we choose a suitable model according to the problem
# Model assessment: evaluate the model's predictions; commonly used metrics include accuracy, recall, F1, PRC, ROC_AUC, IoU, and so on
# Cross-validation is an effective means of model evaluation.

# If we validate the model on the training set itself, overfitting is likely. So we generally split the data into a training set and a test set, but doing only that still causes problems.
# For example, the practice problems we do before an exam are the training set, and the exam paper is the test set. If the exam consists entirely of the original practice problems, then a student who
# memorized them all gets full marks without necessarily having real ability, and a new, unseen problem may still be unsolvable: that situation is overfitting. Conversely, if the practice problems are
# too easy and the exam too hard, the results are also unsatisfactory.
# To make the evaluation adequately reflect the model's true performance, we introduce cross-validation.
# The basic idea of cross-validation is to partition the original data in some way: one part serves as the training set, another as the validation set. First train the classifier on the training set,
# then test the trained model on the validation set to evaluate the classifier's performance. In 10-fold cross-validation, the training data is divided into ten parts; in each round one part is held out
# as the validation set and the remaining nine are used for training. After ten rounds, the ten results are averaged and used as the model's indicator.
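# A minimal sketch of 10-fold cross-validation with scikit-learn's cross_val_score (the iris dataset is used purely for illustration):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
# cv=10: split into ten folds, train on nine, validate on the held-out one, repeat ten times
scores = cross_val_score(KNeighborsClassifier(), X, y, cv=10)
print(scores)         # one score per fold
print(scores.mean())  # the average is used as the model's indicator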


# Differences between KNN and linear regression
# 1) KNN can do both regression and classification. If we use it for regression, when should we choose KNN and when linear regression? If the linear relationship in the data is obvious,
# linear regression will do better than KNN; if it is not obvious, KNN will do better (a sketch follows below). Generally speaking, these two are the simplest, most basic machine learning models.
# As the dimensionality rises, both models, especially KNN, face the problem of failure, which is the so-called curse of dimensionality.
# 2) Linear regression is not affected by the scale (units) of the features, while KNN is, so it is best to standardize the data for KNN.
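# A minimal sketch of point 1 on synthetic data (the data-generating functions below are assumptions, chosen to make the contrast visible):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
targets = {
    'linear target':    2 * X.ravel() + rng.normal(0, 0.3, 300),
    'nonlinear target': np.sin(3 * X.ravel()) + rng.normal(0, 0.1, 300),
}
for name, y in targets.items():
    x_tr, x_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=3)
    lr = LinearRegression().fit(x_tr, y_tr)
    knn = KNeighborsRegressor().fit(x_tr, y_tr)
    print(name, 'LinearRegression R2:', round(lr.score(x_te, y_te), 3),
          'KNN R2:', round(knn.score(x_te, y_te), 3))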

# KNN algorithm
# KNN (K-Nearest Neighbor) is one of the simplest machine learning algorithms. It can be used for classification and regression, and it is a supervised learning algorithm.
# Its idea: if the majority of the K samples most similar to a given sample (i.e., its nearest neighbors in feature space) belong to a certain category, then the sample also belongs to that category.
# In other words, the method decides the category of a sample based solely on the categories of its few nearest samples (a from-scratch sketch follows below).
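# A from-scratch sketch of that idea on toy data (all names and numbers assumed): compute distances, take the K nearest, majority vote:

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    dists = np.linalg.norm(X_train - x, axis=1)      # distance from x to every training sample
    nearest_labels = y_train[np.argsort(dists)[:k]]  # labels of the k nearest neighbors
    return Counter(nearest_labels).most_common(1)[0][0]  # majority vote decides the class

X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([4.5, 5.0])))  # -> 1, its nearest neighbors are class 1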

# Hyperparameters
    # n_neighbors
    # weights: 'uniform' or 'distance'
    
# Model
    
#     from sklearn.neighbors import KNeighborsRegressor, KNeighborsClassifier
#     from sklearn.model_selection import train_test_split

# x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=3)
# knn = KNeighborsRegressor()
# knn.fit(x_train, y_train)




    # Hyperparameter tuning
        # grid search: GridSearchCV
            # scenario: selecting suitable hyperparameters
            # this is the tuning method usually used for machine learning models; it returns the best combination of hyperparameters, and modeling with those parameters gives the best model
    # from sklearn.model_selection import GridSearchCV

    # parameters = {'n_neighbors': range(1, 10), 'weights': ['uniform', 'distance']}
        # the hyperparameter combinations we want to try
    # clf = GridSearchCV(estimator=knn, param_grid=parameters, n_jobs=-1, scoring='r2', cv=5, verbose=10)
        # scoring: choose a different evaluation metric depending on the model. Linear regression -> r2, MSE; logistic regression -> F1, Recall, Accuracy
        # estimator: the model whose hyperparameters are being tuned
        # cv: the number of cross-validation folds
    # clf.fit(x_train, y_train)
    # clf.best_score_
    # clf.best_estimator_  # the best model, already trained with the best hyperparameters. It is already fitted, so use it directly; do not call fit again!!!
    # clf.best_estimator_.predict(x_test)
    # print(clf.best_estimator_.score(x_train, y_train))
    # print(clf.best_estimator_.score(x_test, y_test))
    
    
    
    # Data normalization
        # actually this also belongs to preprocessing
    # from sklearn.preprocessing import StandardScaler, MinMaxScaler

# scaler = MinMaxScaler()
# x_train_sca = scaler.fit_transform(x_train)  # note: the training set and the test set use different methods
# x_test_sca = scaler.transform(x_test)        # the test set reuses the scaling parameters fitted on the training set
# knn.fit(x_train_sca, y_train)
# print(knn.score(x_train_sca, y_train))
# print(knn.score(x_test_sca, y_test))


    
# Pipeline: many algorithmic steps can be connected in series, such as feature extraction, normalization, and classification, organized together to form a typical machine learning workflow. It brings two main benefits:
#     1. A single call to fit and predict trains and predicts with every model in the pipeline.
#     2. It can be combined with grid search to select hyperparameters (a sketch follows after the code below).
# Notes:
# except for the last step, every step must be a transformer and implement the fit_transform function
# a custom transformer class must implement fit_transform, because fit_transform is what qualifies it as a transform step
# every transform step returns its data as a numpy array
#
# each estimator in the pipeline can be regarded as a step, and the steps are then executed in sequence as a whole.
# The pipeline has all the methods of its last estimator. When a method is called on the pipeline object, it executes like this:
# for the fit method, fit_transform is invoked on the first n-1 estimators in turn, and then fit is called on the last estimator;
# for other methods (e.g. predict), transform is invoked on the first n-1 estimators in turn, and then that method (predict) is called on the last estimator.

#     from sklearn.pipeline import Pipeline
    
#     steps = [('scaler',MinMaxScaler()),('knn',KNeighborsRegressor())]
#     p = Pipeline(steps=steps)
#     p.set_params(knn__n_neighbors=5,knn__weights='distance')
#     p.fit(x_train,y_train)
#     print(p.score(x_train,y_train))
#     print(p.score(x_test,y_test))
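# A sketch of benefit 2 above, combining the pipeline with grid search (the synthetic data and variable names are assumptions): step name + '__' + parameter name addresses a hyperparameter inside the pipeline.

import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = 3 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(0, 0.2, 200)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=3)

pipe = Pipeline(steps=[('scaler', MinMaxScaler()), ('knn', KNeighborsRegressor())])
parameters = {'knn__n_neighbors': range(1, 10), 'knn__weights': ['uniform', 'distance']}
clf = GridSearchCV(estimator=pipe, param_grid=parameters, scoring='r2', cv=5)
clf.fit(x_train, y_train)
print(clf.best_params_)                           # best hyperparameter combination
print(clf.best_estimator_.score(x_test, y_test))  # already fitted, usable directly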

 

 

Naive Bayes
