Using the k-nearest neighbor (KNN) algorithm to predict check-ins

Classification algorithm: k-nearest neighbor (KNN):

Definition:

  If, among the k samples most similar to a given sample in feature space (i.e. its nearest neighbors), most belong to a particular category, then the sample is also assigned to that category.
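The definition above can be sketched directly in code. The following is a minimal hand-rolled KNN classifier (a toy illustration, not the implementation used later in this post): compute the distance from the new sample to every training sample, take the k closest, and vote.

```python
import numpy as np

def knn_predict(x_train, y_train, x_new, k=3):
    """Predict the label of x_new by majority vote among its k nearest training samples."""
    # Euclidean distance from x_new to every training sample
    distances = np.sqrt(((x_train - x_new) ** 2).sum(axis=1))
    # Labels of the k closest samples
    nearest_labels = y_train[np.argsort(distances)[:k]]
    # Majority vote
    values, counts = np.unique(nearest_labels, return_counts=True)
    return values[np.argmax(counts)]

# Two well-separated clusters: class 0 near (1, 1), class 1 near (8, 8)
x_train = np.array([[1.0, 1.0], [1.2, 0.9], [8.0, 8.0], [8.1, 7.9]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(x_train, y_train, np.array([1.1, 1.0])))  # -> 0
print(knn_predict(x_train, y_train, np.array([8.0, 8.1])))  # -> 1
```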

Origin:

  The KNN algorithm is a classification algorithm first proposed by Cover and Hart.

Distance formula:

  The distance between two samples can be computed with the Euclidean distance. For two n-dimensional samples a = (a1, a2, ..., an) and b = (b1, b2, ..., bn):

  d(a, b) = sqrt((a1 - b1)^2 + (a2 - b2)^2 + ... + (an - bn)^2)
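As a quick check of the formula, a one-line Python implementation (using only the standard library):

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance between two equal-length sequences of coordinates."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

print(euclidean_distance((1, 2), (4, 6)))     # 5.0 (a 3-4-5 right triangle)
print(euclidean_distance((0, 0, 0), (1, 2, 2)))  # 3.0
```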

 sklearn k-nearest neighbor API: sklearn.neighbors.KNeighborsClassifier(n_neighbors=5, algorithm='auto')
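A minimal usage example of the sklearn API on toy one-dimensional data (the full check-in example appears later):

```python
from sklearn.neighbors import KNeighborsClassifier

# Two clusters on a line: class 0 near 0-2, class 1 near 10-12
X = [[0], [1], [2], [10], [11], [12]]
y = [0, 0, 0, 1, 1, 1]

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)
print(knn.predict([[1.5], [10.5]]))  # [0 1]
```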

Problems:

1. What value should k take, and what effect does it have?

  If k is too small, the prediction is easily swayed by outliers.

  If k is too large, the prediction is dominated by whichever class happens to have the most samples among the neighbors (a class-imbalance effect).

2. Performance: predicting one sample requires computing the distance to every training sample.
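The effect of k on outliers described in problem 1 can be shown with a small contrived data set: a single mislabeled point sits inside the wrong cluster, and k=1 trusts it while k=3 votes it down.

```python
from sklearn.neighbors import KNeighborsClassifier

# Class 0 is clustered near x=0 and class 1 near x=10, plus one
# class-1 outlier at x=0.1 sitting inside the class-0 cluster.
X = [[0.0], [0.2], [0.4], [0.6], [10.0], [10.2], [0.1]]
y = [0, 0, 0, 0, 1, 1, 1]

knn1 = KNeighborsClassifier(n_neighbors=1).fit(X, y)
knn3 = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(knn1.predict([[0.08]]))  # [1] -- the single nearest neighbor is the outlier
print(knn3.predict([[0.08]]))  # [0] -- two of the three neighbors are class 0
```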

k-nearest neighbor advantages and disadvantages:

  Advantages:

    Simple and easy to understand; no parameters to estimate and no training phase.

  Disadvantages:

    It is a lazy algorithm: all computation happens when classifying test samples, so prediction is slow and memory overhead is large.

    The value of k must be specified, and a poor choice of k hurts classification accuracy.

  Use cases:

    Small-data scenarios, roughly a few thousand to a few tens of thousands of samples; test against the specific business scenario.

k-nearest neighbor example: predicting check-ins:

Data source:

  Kaggle, link: https://www.kaggle.com/c/facebook-v-predicting-check-ins/data   (requires logging in to the site to download)

Data processing:

1. Narrow the range of values with DataFrame.query(); the full data set is too large, so take only a subset.

2. Process the date column with pd.to_datetime() and pd.DatetimeIndex(), two pandas interfaces.

3. Construct new date features: after converting the timestamp in the source data, add new columns such as day, hour, and weekday.

4. Delete useless columns with DataFrame.drop() (pandas).

5. Delete places with fewer than n check-ins, using a few pandas idioms:

  place_count = data.groupby('place_id').count()

  tf = place_count[place_count.row_id > 3].reset_index()

  data = data[data['place_id'].isin(tf.place_id)]
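The groupby/filter idiom in step 5 is easy to verify on a toy DataFrame (made-up data, place 1 has 4 check-ins and place 2 has only 2, with the threshold n=3):

```python
import pandas as pd

data = pd.DataFrame({
    'row_id': range(6),
    'place_id': [1, 1, 1, 1, 2, 2],
})

# Count check-ins per place; .count() fills every column with the group size
place_count = data.groupby('place_id').count()
# Keep only places with more than 3 check-ins; reset_index turns
# place_id back from the index into an ordinary column
tf = place_count[place_count.row_id > 3].reset_index()
# Keep only the rows whose place_id survived the filter
data = data[data['place_id'].isin(tf.place_id)]
print(data['place_id'].unique())  # [1]
```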

Workflow of the example:

1. Process the data set

2. Split the data set into training and test sets

3. Standardize the data sets (feature scaling)

4. Use the estimator to classify and predict

Code:

import os
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler


def knn_cls():
    """Predict user check-in locations with the k-nearest neighbor algorithm."""

    # 1. Read the data
    fb_location = os.path.join(os.path.join(os.path.curdir, 'data'), 'fb_location')
    data = pd.read_csv(os.path.join(fb_location, 'train.csv'))
    # print(data.head(10))

    # 2. Data processing
    # 2.1 Reduce the data: filter rows with query()
    data = data.query('x > 1.0 & x < 1.25 & y > 2.5 & y < 2.75')
    # 2.2 Convert the timestamp
    time_value = pd.to_datetime(data['time'], unit='s')  # accurate to the second
    # print(time_value)
    # 2.3 Convert to a DatetimeIndex so date parts are easy to extract
    time_value = pd.DatetimeIndex(time_value)
    # 2.4 Construct new features
    data.loc[:, 'day'] = time_value.day
    data.loc[:, 'hour'] = time_value.hour
    data.loc[:, 'weekday'] = time_value.weekday
    # 2.5 Drop the raw timestamp feature
    data.drop(['time'], axis=1, inplace=True)
    # print(data)
    # 2.6 Remove places with too few check-ins
    place_count = data.groupby('place_id').count()
    tf = place_count[place_count.row_id > 3].reset_index()
    data = data[data['place_id'].isin(tf.place_id)]
    # 2.7 Extract the feature values and the target value
    y = data['place_id']
    x = data.drop(['place_id', 'row_id'], axis=1)
    # 2.8 Split into training and test sets
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25)

    # 3. Feature engineering (standardization)
    std = StandardScaler()
    # Standardize the feature values of the training and test sets
    x_train = std.fit_transform(x_train)
    # Use transform, not fit_transform, so the test set reuses the
    # mean and standard deviation computed on the training set
    x_test = std.transform(x_test)

    # 4. Run the algorithm
    knn = KNeighborsClassifier(n_neighbors=9)

    # fit, predict, score
    knn.fit(x_train, y_train)

    # 5. Get the predictions and the accuracy
    y_predict = knn.predict(x_test)
    print('predicted check-in locations:', y_predict)
    print('prediction accuracy:', knn.score(x_test, y_test))


if __name__ == '__main__':
    knn_cls()
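To address the question raised earlier of how to choose k, sklearn's GridSearchCV can try several candidate values with cross-validation. A sketch on the built-in iris data set (used here instead of the Kaggle data so the example is self-contained):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

x, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)

# Standardize exactly as in the check-in example
std = StandardScaler()
x_train = std.fit_transform(x_train)
x_test = std.transform(x_test)

# Search over candidate k values with 5-fold cross-validation
gc = GridSearchCV(KNeighborsClassifier(), {'n_neighbors': [3, 5, 7, 9]}, cv=5)
gc.fit(x_train, y_train)
print('best k:', gc.best_params_['n_neighbors'])
print('test accuracy:', gc.score(x_test, y_test))
```

GridSearchCV refits the model on the full training set with the best k found, so `gc.score` evaluates that best model directly.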


Origin www.cnblogs.com/springionic/p/11787590.html