# A Rookie's Road to Machine Learning # Day 12

In pandas, calling describe() on a one-dimensional array (a Series) returns a set of summary statistics: count, mean, std, min, 25%, 50%, 75%, and max.
Taking the four-element Series [1, 3, 5, 9] as an example, the return values of describe() are explained as follows (a concrete example follows the list):

1. count: the number of elements; the example above has 4 elements, so 4 is returned;

2. mean: the average of the values; the mean of 1, 3, 5, 9 is 4.5;

3. std: the standard deviation of the values;

4. min: the minimum value of the array;

5. 25%, 50%, 75%: the values at three percentile positions, i.e. the quartiles in statistics. First the positions of the three quartiles must be determined: pandas locates the quantile q by linear interpolation at the (1-based) position 1 + q × (n − 1). With n = 4, Q1 sits at position 1 + 0.25 × 3 = 1.75, Q2 at 1 + 0.5 × 3 = 2.5, and Q3 at 1 + 0.75 × 3 = 3.25. Interpolating between the neighboring elements gives:

Q1 = 1 × 0.25 + 3 × 0.75 = 2.5

Q2 = 3 × 0.5 + 5 × 0.5 = 4

Q3 = 5 × 0.75 + 9 × 0.25 = 6

These are the corresponding values returned by the function; the 50% entry is the median.

6. max: returns the maximum value of the array.
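
To make this concrete, here is the example Series and the output of describe(); the values match the calculations above:

```python
import pandas as pd

s = pd.Series([1, 3, 5, 9])
print(s.describe())
# count    4.000000
# mean     4.500000
# std      3.415650
# min      1.000000
# 25%      2.500000
# 50%      4.000000
# 75%      6.000000
# max      9.000000
# dtype: float64
```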

Extended information:

The describe() function takes three optional parameters: percentiles, include, and exclude. Their meanings are as follows (a combined sketch follows the list):

1. percentiles: by default the quartiles (25%, 50%, 75%) are returned; this can be changed, e.g. describe(percentiles=[.75, .8]) returns the 50%, 75%, and 80% positions, since the 50% entry is always included.

2. include: by default only numeric columns are summarized. With include='all', statistics for every column type are shown; with include=[np.number], only numeric columns are returned; with include=[np.object], only object-typed columns; with include=['category'], only categorical columns; with include=['O'], only string (object) columns.

3. exclude: the converse of include; it specifies a type (or types) not to return, i.e. statistics are computed for every column except those of the excluded types.
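
A short combined sketch of the three parameters on a small mixed-type DataFrame (the column names are made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1, 3, 5, 9], "y": ["a", "a", "b", "b"]})

df.describe(percentiles=[.75, .8])  # returns 50%, 75%, 80% (50% is always kept)
df.describe(include="all")          # statistics for every column
df.describe(include=[np.number])    # numeric columns only (the default behavior)
df.describe(include=["O"])          # object (string) columns only
df.describe(exclude=[np.number])    # everything except numeric columns
```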
Reference: pandas API documentation, describe()

Reference: Baidu Encyclopedia, Quartile

The loc function indexes rows by row label. Key point here: label! label! label!
loc[1] selects the row whose label is 1 (out of row labels 0, 1, 2, 3), not necessarily the row at integer position 1.
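
A quick illustration with a toy DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"a": [10, 20, 30, 40]}, index=[0, 1, 2, 3])
print(df.loc[1])  # selects by LABEL 1, not by integer position
# a    20
# Name: 1, dtype: int64
```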

Seaborn reference link:
https://blog.csdn.net/wqc_csdn/article/details/80515920

np.where()

First, it must be emphasized that the return value of where() differs depending on the input:

1. When the input is a one-dimensional array, the returned indices are one-dimensional, so there is only one index array.

2. When the input is a two-dimensional array, the position of each value that satisfies the condition needs two coordinates, so two index arrays are returned to indicate the positions: the first gives the row indices and the second the column indices.

For example:
```python
>>> b = np.arange(10)
>>> b
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> np.where(b > 5)
(array([6, 7, 8, 9], dtype=int64),)

>>> a = np.reshape(np.arange(20), (4, 5))
>>> a
array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19]])
>>> np.where(a > 10)
(array([2, 2, 2, 2, 3, 3, 3, 3, 3], dtype=int64),
 array([1, 2, 3, 4, 0, 1, 2, 3, 4], dtype=int64))
```

Remark: np.where returns the index values as a tuple; add [0] to get the desired index array itself.
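
For instance, continuing the one-dimensional example above:

```python
>>> np.where(b > 5)[0]   # take the first element of the tuple
array([6, 7, 8, 9], dtype=int64)
```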

In MATLAB, randint(n, m) produces an n × m matrix whose elements are randomly 0 or 1.

To generate numbers in a range, an interval can be given: randint(2, 3, [1 6]) produces a 2 × 3 random matrix whose elements are random integers in the interval [1, 6].

The range can also be given as a single number: randint(num, N, 3) in MATLAB produces a num × N matrix with elements in the range [0, 3 − 1].
If the number is negative, as in randint(num, N, -3), the interval is [-3 + 1, 0].

Original link: https://blog.csdn.net/yimixgg/article/details/87875103
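
For comparison, a rough NumPy analogue of these MATLAB calls (my addition for illustration; note that NumPy's upper bound is exclusive, while MATLAB's interval [1, 6] is inclusive):

```python
import numpy as np

rng = np.random.default_rng()

# n x m matrix of random 0s and 1s, like randint(3, 4)
rng.integers(0, 2, size=(3, 4))

# 2 x 3 matrix with entries in the inclusive interval [1, 6], like randint(2, 3, [1 6])
rng.integers(1, 7, size=(2, 3))
```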

Display an image: plt.imshow(X_recovered)
Tip: X_recovered is an array of pixel values.
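
A minimal runnable sketch (the random array here stands in for the actual recovered pixels):

```python
import matplotlib.pyplot as plt
import numpy as np

# Stand-in for X_recovered: an (H, W, 3) array of pixel values in [0, 1].
X_recovered = np.random.rand(64, 64, 3)

plt.imshow(X_recovered)
plt.axis("off")  # hide axis ticks for a cleaner image
plt.show()
```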

**sklearn.cluster.KMeans parameter introduction (VIP)**
Why introduce the KMeans in the sklearn library?
  sklearn is currently the most popular integrated machine learning library for Python, and reading the English documentation for this method directly is exhausting, time-consuming, and inefficient.
  There is another reason: k-means++ was introduced above, and the default choice of initial cluster centers in sklearn.cluster.KMeans happens to be k-means++.
Parameters:
n_clusters: int, default=8. The number of clusters to form, i.e. the number of centroids to generate.
max_iter: int, default=300. The maximum number of iterations the k-means algorithm performs.
n_init: int, default=10. The number of times the algorithm is run with different centroid initializations; the final solution is the best run in the sense of inertia.
init: one of three values: 'k-means++', 'random', or an ndarray.
This parameter specifies the initialization method; the default is 'k-means++'.
(1) 'k-means++' selects the initial centroids in a special way that accelerates the convergence of the iterations (the k-means++ method introduced above).
(2) 'random' chooses the initial centroids randomly from the training data.
(3) If an ndarray is passed, it should have shape (n_clusters, n_features) and gives the initial centroids directly (see the sketch after this parameter list).
precompute_distances: three possible values: 'auto', True, or False.
Precomputing distances is faster but takes more memory.
(1) 'auto': do not precompute distances if the number of samples multiplied by the number of clusters exceeds 12 million. This corresponds to roughly 100 MB of overhead per job using double precision.
(2) True: always precompute distances.
(3) False: never precompute distances.
tol: float, default=1e-4. Used together with inertia to determine the convergence condition.
n_jobs: int. The number of processes used for the computation; internally, the n_init runs are computed in parallel.
(1) A value of -1 uses all CPUs; a value of 1 performs no parallel computation, which is convenient for debugging.
(2) A value less than -1 uses (n_cpus + 1 + n_jobs) CPUs, so n_jobs = -2 uses the total number of CPUs minus one.
random_state: int or numpy.RandomState, optional.
The generator used to initialize the centroids; an integer value fixes the seed. By default NumPy's random number generator is used.
copy_x: bool, default=True.
When precomputing distances, centering the data first gives more accurate results. If True, the original data is not modified; if False, the original data is modified in place and restored before the function returns, but small numerical differences between the data before and after may remain because the mean is subtracted and later added back.
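
A minimal sketch of passing explicit initial centroids (the toy data is made up for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 2.0], [1.5, 1.8], [8.0, 8.0], [9.0, 11.0]])

# Initial centroids with shape (n_clusters, n_features).
init_centroids = np.array([[1.0, 2.0], [9.0, 9.0]])

# With a fixed init there is nothing to re-randomize, so n_init=1.
km = KMeans(n_clusters=2, init=init_centroids, n_init=1).fit(X)
print(km.cluster_centers_)
```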
Attributes:
cluster_centers_: array, [n_clusters, n_features]. The coordinates of the cluster centers.
labels_: the cluster assignment of each point.
inertia_: float. The sum of squared distances of each point to the centroid of its cluster.
Notes:
  This k-means implementation uses Lloyd's algorithm. The average computational complexity is O(knT), where n is the sample size and T is the number of iterations.
  The worst-case computational complexity is O(n^(k + 2/p)), where n is the sample size and p is the number of features. (D. Arthur and S. Vassilvitskii, 'How slow is the k-means method?' SoCG 2006)
  In practice the k-means algorithm is very fast, among the fastest practical clustering algorithms, but its solution is only a local optimum determined by a particular set of initial values. Therefore, to make the result more accurate, it is rerun several times with different initial values in practice.
Methods (a combined usage sketch follows the list):
fit(X[, y]): compute k-means clustering.
fit_predict(X[, y]): compute the cluster centroids and predict the cluster index for each sample.
fit_transform(X[, y]): compute the clustering and transform X to cluster-distance space.
get_params([deep]): get the parameters of the estimator.
predict(X): predict the closest cluster for each sample.
score(X[, y]): compute the clustering error on X (sklearn returns it negated, so larger is better).
set_params(**params): set the parameters of this estimator manually.
transform(X[, y]): transform X into cluster-distance space. In the new space each dimension is the distance to a cluster center; note that even if X is sparse, the array returned by transform is usually dense.
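
A small end-to-end sketch tying the main parameters, attributes, and methods together (the toy data is made up):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

km = KMeans(n_clusters=2, init="k-means++", n_init=10,
            max_iter=300, random_state=0).fit(X)

print(km.labels_)            # cluster index of each sample
print(km.cluster_centers_)   # centroid coordinates, shape (2, 2)
print(km.inertia_)           # sum of squared distances to the closest centroids
print(km.predict([[0, 0], [12, 3]]))  # assign new points to the nearest cluster
```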
