Classification algorithm - k-nearest neighbors (KNN):
Definition:
If, among the k samples most similar to a given sample (i.e. its nearest neighbors in feature space), the majority belong to a particular category, then the sample is also assigned to that category.
Origin:
The KNN algorithm is a classification algorithm first proposed by Cover and Hart.
Distance calculation formula:
The distance between two samples can be calculated with, for example, the Euclidean distance: the square root of the sum of squared differences across all features.
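A minimal sketch of the Euclidean distance described above, for two samples with three features each (the sample values are made up for illustration):

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance between two samples a and b in feature space."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Two samples with three features each
p1 = (1.0, 2.0, 3.0)
p2 = (4.0, 6.0, 3.0)
print(euclidean_distance(p1, p2))  # sqrt(3^2 + 4^2 + 0^2) = 5.0
```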
sklearn k-nearest neighbor API:
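The sklearn API follows the usual fit/predict/score estimator pattern. A minimal sketch on made-up toy data (two well-separated clusters):

```python
from sklearn.neighbors import KNeighborsClassifier

# Toy data: two well-separated clusters
X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]  # feature values
y = [0, 0, 0, 1, 1, 1]                                # target values

knn = KNeighborsClassifier(n_neighbors=3)  # n_neighbors is the k value
knn.fit(X, y)                              # "training" just stores the samples
print(knn.predict([[1, 1], [6, 6]]))       # each query gets its neighbors' majority class
print(knn.score(X, y))                     # accuracy on the given data
```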
Problems:
1. What value should k take, and what is its impact?
A small k value: the prediction is susceptible to outliers.
A large k value: the prediction is easily dominated by whichever class is most numerous among the neighbors, so it is sensitive to class proportions in the data.
2. Performance issues
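The effect of k in point 1 can be sketched on made-up toy data: a single class-1 outlier sits inside the class-0 cluster, so a query point near it flips class depending on k.

```python
from sklearn.neighbors import KNeighborsClassifier

# Toy data: a cluster of class 0 near the origin, a cluster of class 1
# far away, plus one class-1 outlier sitting inside the class-0 cluster.
X = [[0, 0], [0, 1], [1, 0], [1, 1], [0.5, 0.5],   # class 0 cluster
     [10, 10], [10, 11], [11, 10],                 # class 1 cluster
     [0.6, 0.6]]                                   # class 1 outlier
y = [0, 0, 0, 0, 0, 1, 1, 1, 1]

query = [[0.7, 0.7]]  # a query point right next to the outlier
for k in (1, 5):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    # k=1 follows the outlier (class 1); k=5 recovers the majority (class 0)
    print(f'k={k}: predicted class {knn.predict(query)[0]}')
```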
k-nearest neighbor advantages and disadvantages:
Advantages:
Simple and easy to understand; no parameters to estimate and no training phase.
Disadvantages:
It is a lazy algorithm: all the distance computation happens when classifying test samples, so prediction is slow and the memory overhead is large.
The value of k must be specified; an inappropriate choice of k means classification accuracy is not guaranteed.
Applicable scenarios:
Small-data scenarios, from a few thousand to tens of thousands of samples; test against the specific business scenario.
KNN example - predicting check-in locations:
Data source:
kaggle official website, link: https://www.kaggle.com/c/facebook-v-predicting-check-ins/data (login required to download)
Data processing:
1. Narrow the range of values with DataFrame.query(); because the dataset is very large, only a slice of the data is used.
2. Process the date data with pd.to_datetime() and pd.DatetimeIndex(), both interfaces of the pandas library.
3. Add segmented date features: after converting the timestamp in the source data, add new columns such as day and hour.
4. Delete useless data with DataFrame.drop(), from the pandas library.
5. Delete the check-in locations with fewer than n check-ins, using some pandas knowledge:
place_count = data.groupby('place_id').aggregate(np.count_nonzero)
tf = place_count[place_count.row_id > 3].reset_index()
data = data[data['place_id'].isin(tf.place_id)]
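The three lines above can be seen on a small made-up frame in the same shape as the check-in set (row_id plus a place_id per check-in):

```python
import pandas as pd

# Hypothetical toy data: place 'a' has 4 check-ins, place 'b' only 2
data = pd.DataFrame({
    'row_id': range(6),
    'place_id': ['a', 'a', 'a', 'a', 'b', 'b'],
})

# Count check-ins per place, keep places with more than 3,
# then keep only the rows whose place_id survived the filter.
place_count = data.groupby('place_id').count()
tf = place_count[place_count.row_id > 3].reset_index()
data = data[data['place_id'].isin(tf.place_id)]
print(data['place_id'].unique())  # only 'a' has more than 3 check-ins
```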
Example workflow:
1. Process the dataset
2. Split the dataset
3. Transform (standardize) the dataset
4. Run the estimator workflow for classification and prediction
Code:
import os
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler


def knn_cls():
    """K-nearest neighbor algorithm: predict a user's check-in location"""
    # 1. Read the data
    fb_location = os.path.join(os.path.join(os.path.curdir, 'data'), 'fb_location')
    data = pd.read_csv(os.path.join(fb_location, 'train.csv'))
    # print(data.head(10))

    # 2. Data processing
    # 2.1 Narrow the data range with a filter query
    data = data.query('x > 1.0 & x < 1.25 & y > 2.5 & y < 2.75')
    # 2.2 Process the time column
    time_value = pd.to_datetime(data['time'], unit='s')  # accurate to the second
    # print(time_value)
    # 2.3 Convert the dates to a DatetimeIndex
    time_value = pd.DatetimeIndex(time_value)
    # 2.4 Construct some features
    data.loc[:, 'day'] = time_value.day
    data.loc[:, 'hour'] = time_value.hour
    data.loc[:, 'weekday'] = time_value.weekday
    # 2.5 Drop the timestamp feature
    data.drop(['time'], axis=1, inplace=True)
    # print(data)
    # 2.6 Remove check-in locations with fewer than n check-ins
    place_count = data.groupby('place_id').count()
    tf = place_count[place_count.row_id > 3].reset_index()
    data = data[data['place_id'].isin(tf.place_id)]
    # 2.7 Extract the feature values and target values
    y = data['place_id']
    x = data.drop(['place_id', 'row_id'], axis=1)
    # 2.8 Split the data into training and test sets
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25)

    # 3. Feature engineering (standardization)
    std = StandardScaler()
    # Standardize the feature values of the training and test sets
    x_train = std.fit_transform(x_train)
    # Use transform (not fit_transform) here so the test set reuses the
    # mean and standard deviation computed from the training set
    x_test = std.transform(x_test)

    # 4. Run the algorithm
    knn = KNeighborsClassifier(n_neighbors=9)

    # fit, predict, score
    knn.fit(x_train, y_train)

    # 5. Get the predictions and accuracy
    y_predict = knn.predict(x_test)
    print('Predicted check-in locations:', y_predict)
    print('Prediction accuracy:', knn.score(x_test, y_test))


if __name__ == '__main__':
    knn_cls()