Data mining: descriptive statistical analysis

Disclaimer: This is an original article by the blogger and may not be reproduced without the blogger's permission. https://blog.csdn.net/xiaxianba/article/details/91384171

Statistics is the basis of data analysis. Statistics is divided into descriptive statistics and inferential statistics: descriptive statistics is the foundation, and inferential statistics is the leading part. Baidu Encyclopedia defines descriptive statistics roughly as follows: methods that use charts or mathematical techniques to organize and analyze data, and to estimate and describe the distribution of the data, its numerical characteristics, and the relationships between random variables. Descriptive statistics is divided into three major parts: central tendency analysis, dispersion analysis, and correlation analysis.

First, central tendency analysis

  1. Mean: the sum of all values divided by the number of values.
  2. Median: the middle value after sorting the values by size; how it is obtained depends on whether the total count N is odd or even.
    2.1 N is odd: the value at position (N + 1) / 2.
    2.2 N is even: the average of the values at positions N / 2 and N / 2 + 1.
  3. Mode: the value that occurs most often.
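    The odd/even rule for the median can be sketched in Python as follows. This is only a minimal illustration; the function name median_by_position is made up for this example and is not from the original article.

```python
def median_by_position(values):
    """Median via the position rule: (N + 1) / 2 for odd N,
    average of positions N / 2 and N / 2 + 1 for even N."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of empty data is undefined")
    if n % 2 == 1:
        # Positions are 1-based in the rule, list indexes are 0-based.
        return ordered[(n + 1) // 2 - 1]
    # Even N: average the two middle values.
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2


print(median_by_position([3, 1, 2]))
print(median_by_position([4, 1, 3, 2]))
```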
    Now that the measures of central tendency and how to calculate them are clear, let us illustrate them with an example. Suppose we have the heights of a class, height = [165,166,167,168,170,170,170,172,175,180,190]. How do we describe the central tendency of these heights? For this data the mean is about 172.09, and the median and the mode are both 170.
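    One possible way to compute these three measures for the height data is with Python's standard statistics module. The original article's code is not available, so this is a sketch rather than the author's implementation; the variable names are assumptions.

```python
import statistics

height = [165, 166, 167, 168, 170, 170, 170, 172, 175, 180, 190]

# Mean: total sum divided by the number of values.
mean_height = sum(height) / len(height)
# Median: middle value of the sorted data (N = 11 is odd here).
median_height = statistics.median(height)
# Mode: the most frequent value.
mode_height = statistics.mode(height)

print("mean:", round(mean_height, 2))
print("median:", median_height)
print("mode:", mode_height)
```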

Second, dispersion analysis

  1. Range: the maximum minus the minimum.
  2. Variance: the average of the squared differences between each value and the mean.
  3. Standard deviation: the square root of the variance, that is, the square root of the average squared difference from the arithmetic mean.
  4. Coefficient of variation: the standard deviation of the data divided by the mean of the data.
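    These measures describe how spread out the data is. We use the same example as above to illustrate the degree of dispersion. For the height data, dividing by N as in the definitions above, the range is 25, the variance is about 48.63, the standard deviation is about 6.97, and the coefficient of variation is about 0.041.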
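    The dispersion measures for the height data might be computed as below. This sketch assumes the population formulas (dividing by N), which match the definitions in the list; the sample versions statistics.variance and statistics.stdev divide by N - 1 and give slightly larger values.

```python
import statistics

height = [165, 166, 167, 168, 170, 170, 170, 172, 175, 180, 190]

# Range: maximum minus minimum.
data_range = max(height) - min(height)
# Population variance: mean of squared differences from the mean.
variance = statistics.pvariance(height)
# Population standard deviation: square root of the variance.
std_dev = statistics.pstdev(height)
# Coefficient of variation: standard deviation relative to the mean.
coeff_var = std_dev / statistics.fmean(height)

print("range:", data_range)
print("variance:", round(variance, 2))
print("standard deviation:", round(std_dev, 2))
print("coefficient of variation:", round(coeff_var, 4))
```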
    z-score: besides the dispersion measures above, we introduce one more measure of deviation, namely the number of standard deviations by which a value differs from the mean. The formula is z-score = [X - mean(X)] / std(X). The resulting z-scores have a mean of 0 and a standard deviation (and variance) of 1.
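    As a small check of the z-score formula, the sketch below standardizes the height data and confirms that the result has mean 0 and standard deviation 1. It assumes the population standard deviation is used for std(X); the variable names are illustrative.

```python
import statistics

height = [165, 166, 167, 168, 170, 170, 170, 172, 175, 180, 190]

mean_height = statistics.fmean(height)
std_height = statistics.pstdev(height)  # population standard deviation

# z-score = [X - mean(X)] / std(X), applied to every value.
z_scores = [(x - mean_height) / std_height for x in height]

# Up to floating-point rounding, the mean is 0 and the std is 1.
z_mean = statistics.fmean(z_scores)
z_std = statistics.pstdev(z_scores)

print([round(z, 2) for z in z_scores])
print("mean of z-scores:", round(z_mean, 6))
print("std of z-scores:", round(z_std, 6))
```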

Third, correlation analysis

  1. Covariance (COV): for two variables X and Y, multiply each observation's "difference between X and its mean" by its "difference between Y and its mean", then sum these products and take their average; the result is the covariance. Covariance measures how the two variables vary together. Variance is a special case of covariance, namely the case where the two variables are the same. A positive covariance indicates positive correlation, a negative covariance indicates negative correlation, and a covariance of 0 indicates no linear correlation.
  2. Correlation coefficient (CORRCOEF): the covariance of X and Y divided by the product of their standard deviations.
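    The two definitions above can be sketched directly in Python. The weight list here is hypothetical data invented only for illustration (it is not from the original article), and the helper names mean, covariance, and correlation are likewise assumptions. The covariance divides by N, following the "find the mean" wording above.

```python
import math


def mean(values):
    return sum(values) / len(values)


def covariance(x, y):
    """Average product of the deviations of x and y from their means."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)


def correlation(x, y):
    """Covariance divided by the product of the standard deviations."""
    std_x = math.sqrt(covariance(x, x))  # variance is covariance(x, x)
    std_y = math.sqrt(covariance(y, y))
    return covariance(x, y) / (std_x * std_y)


height = [165, 166, 167, 168, 170, 170, 170, 172, 175, 180, 190]
# Hypothetical weights, made up for this example only.
weight = [52, 54, 55, 58, 60, 61, 63, 65, 68, 72, 80]

cov_hw = covariance(height, weight)
corr_hw = correlation(height, weight)
print("covariance:", round(cov_hw, 2))
print("correlation:", round(corr_hw, 4))
```

    Unlike the covariance, the correlation coefficient always lies between -1 and 1, so it can be compared across pairs of variables measured in different units.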

References
1. Baidu Encyclopedia, definition of descriptive statistics
2. Descriptive statistics using Python
