Applications based on the normal distribution: descriptive statistics and distribution inference

Introduction:

Hello everyone, and welcome to a little bit of analysis every day. This issue introduces the basic relationship between descriptive statistics and distributions: the basic types of distribution, the relationship between central tendency and distribution, the relationship between dispersion and distribution, and how distributions and descriptive statistics apply in real life, illustrated with a national income case. The article is aimed at data analysis novices; it explains the ideas in simple terms, and the case is close to real life. The next issue will introduce the skewness coefficient, so stay tuned.

Concept introduction:

Types of distribution:

The last issue mainly introduced the normal distribution. Besides the normal distribution there are many other types of distribution, and today I will give you a quick overview. Distributions that arise from classical probability, such as the binomial distribution and the uniform distribution, will not be covered here. The distributions introduced this time are the ones commonly used in statistical inference.

First, the T distribution.

Suppose the population to be analyzed follows a normal distribution. Draw every possible sample of size n from the population and calculate the T statistic for each sample. The values of all these T statistics form a continuous probability distribution, which is the T distribution. Its probability density function is:

$$f(t) = c\left(1 + \frac{t^{2}}{v}\right)^{-\frac{v+1}{2}}$$

t represents the value of the T statistic; v represents the degree of freedom, which is equal to the sample size n minus 1; c is a constant, making the area under the T distribution function curve equal to 1.
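As a rough check on the role of the constant c, the sketch below evaluates the standard T density and integrates it numerically. The closed form used for c (a ratio of gamma functions) is the textbook one; the helper names `t_pdf` and `area_under_curve`, the sample size of 10, and the integration range are assumptions made only for this illustration.

```python
import math


def t_pdf(t, v):
    """Density of the T distribution with v degrees of freedom."""
    # c is the normalizing constant that makes the total area equal to 1.
    c = math.gamma((v + 1) / 2) / (math.sqrt(v * math.pi) * math.gamma(v / 2))
    return c * (1 + t * t / v) ** (-(v + 1) / 2)


def area_under_curve(f, lo, hi, steps=200_000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    width = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * width) for i in range(steps)) * width


v = 9  # hypothetical sample size n = 10, so v = n - 1 = 9
area = area_under_curve(lambda t: t_pdf(t, v), -200.0, 200.0)
print(f"area under the T density for v={v}: {area:.4f}")
```

The printed area should be very close to 1, which is exactly what c is there to guarantee.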

What is the T distribution used for? You have probably heard of the T test for coefficients and the T test for samples. The T distribution is used to judge whether a relationship between two continuous variables is significant, and it is often used to judge whether a coefficient in a linear regression is significant. If a coefficient is not significant, you usually remove the variable and refit the model. The usual rule is that a two-sided P value below 5% (0.05) is considered significant, and one above 5% is considered not significant. Typical questions are whether there is a significant relationship between height and age, or between GDP and investment.
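To make the coefficient T test concrete, here is a minimal sketch for a one-variable linear regression. The investment and GDP figures are invented for illustration, and the constant `T_CRITICAL_DF8` is the tabulated two-sided 5% critical value for 8 degrees of freedom. Note that for a regression slope the degrees of freedom are n − 2 rather than n − 1, because two parameters are estimated.

```python
import math

# Hypothetical data for illustration only: investment (x) and GDP (y), arbitrary units.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 19.9]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n
sxx = sum((xi - mean_x) ** 2 for xi in x)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

# Least-squares estimates of the slope and intercept.
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Residual variance uses n - 2 degrees of freedom (slope and intercept are fitted).
residual_ss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
slope_se = math.sqrt(residual_ss / (n - 2) / sxx)
t_stat = slope / slope_se

# Tabulated two-sided 5% critical value of the T distribution, 8 degrees of freedom.
T_CRITICAL_DF8 = 2.306
significant = abs(t_stat) > T_CRITICAL_DF8

print(f"slope={slope:.3f}, t={t_stat:.1f}, significant at 5%: {significant}")
```

Comparing |t| with the critical value is equivalent to checking whether the two-sided P value is below 5%; a statistics package would normally report the P value directly.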

Second, the chi-square (χ²) distribution.

If n mutually independent random variables ξ₁, ξ₂, ..., ξₙ all follow the standard normal distribution, the sum of their squares is a new random variable, and its distribution is called the chi-square distribution. Its probability density function is:

$$f(\chi^{2}) = c\,(\chi^{2})^{\frac{v}{2}-1}\,e^{-\frac{\chi^{2}}{2}}, \qquad \chi^{2} > 0$$

χ² represents the chi-square statistic; e is the base of the natural logarithm, approximately 2.72; v represents the degrees of freedom, which equals n for the sum of n squares defined above and n − 1 when the statistic is computed from the variance of a sample of size n; c is the normalizing constant that makes the total area under the chi-square density curve equal to 1.
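The definition as a sum of squared standard normal variables can be checked with a small simulation. This is only a sketch: the seed, the number of draws, and the choice of 5 squared terms are arbitrary assumptions. In this construction the degrees of freedom equal the number of squared terms, and a chi-square variable with v degrees of freedom has mean v and variance 2v.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible


def chi_square_draw(v):
    """One chi-square value: the sum of v squared standard normal draws."""
    return sum(random.gauss(0.0, 1.0) ** 2 for _ in range(v))


v = 5  # hypothetical number of squared terms (degrees of freedom)
draws = [chi_square_draw(v) for _ in range(20_000)]

sample_mean = statistics.fmean(draws)
sample_var = statistics.variance(draws)

# Theory: mean = v and variance = 2v, so expect values near 5 and 10.
print(f"simulated mean={sample_mean:.2f}, simulated variance={sample_var:.2f}")
```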

The chi-square distribution is used to test variances and to test whether two categorical variables are related. It is also used in logistic regression. For example, take a class of 60 people, 35 males and 25 females, and ask whether height level differs significantly between men and women: split the 60 height records by gender, group the heights into levels, and apply a chi-square test to the resulting counts. The usual rule is that a P value below 5% is considered significant, and one above 5% is considered not significant. The significance test of a binary logistic regression model also uses the chi-square test.
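The class example can be sketched as a chi-square test on a 2×2 table of counts. The split into "tall" and "short" and the individual cell counts are invented for illustration; only the totals of 35 males and 25 females come from the text. No continuity correction is applied, and `CHI2_CRITICAL_DF1` is the tabulated 5% critical value for 1 degree of freedom.

```python
# Hypothetical counts for the class of 60 (35 males, 25 females),
# with height recorded as a level: "tall" or "short".
observed = [
    [25, 10],  # males: tall, short
    [8, 17],   # females: tall, short
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# Chi-square statistic: sum of (observed - expected)^2 / expected over all cells,
# where expected counts assume height level is independent of gender.
chi2 = 0.0
for i, row in enumerate(observed):
    for j, count in enumerate(row):
        expected = row_totals[i] * col_totals[j] / total
        chi2 += (count - expected) ** 2 / expected

dof = (len(observed) - 1) * (len(observed[0]) - 1)

# Tabulated 5% critical value of the chi-square distribution, 1 degree of freedom.
CHI2_CRITICAL_DF1 = 3.841
significant = chi2 > CHI2_CRITICAL_DF1

print(f"chi2={chi2:.3f}, dof={dof}, significant at 5%: {significant}")
```

With these made-up counts the statistic is about 9.16, well above 3.841, so the difference in height level between the two groups would be judged significant.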

Third, the F distribution.

The F distribution describes the relationship between the variances of two normally distributed populations.

The F statistic can be thought of as the ratio of two chi-square (χ²) statistics, each divided by its own degrees of freedom. When two sample variances are compared, the larger one is conventionally placed in the numerator and the smaller one in the denominator, so that F is at least 1. The probability density function is:

$$f(F) = c\,F^{\frac{v_1}{2}-1}\left(1 + \frac{v_1}{v_2}F\right)^{-\frac{v_1+v_2}{2}}, \qquad F > 0$$

v₁ represents the degrees of freedom of the numerator of the F statistic; v₂ represents the degrees of freedom of the denominator of the F statistic; c is the normalizing constant that makes the total area under the F density curve equal to 1.

What is the F distribution used for? You have almost certainly used it in data analysis, perhaps without knowing it. The F test can be used to compare variances and also to test models: the overall significance of a linear regression model is tested with the F distribution (for logistic regression, the overall test is usually based on the chi-square distribution instead). The usual rule is that a P value below 5% is considered significant, and one above 5% is considered not significant.
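As a small illustration of the variance comparison, the sketch below forms an F statistic from two samples. The two groups are made-up numbers, and the code assumes the common convention of putting the larger sample variance in the numerator. Turning the statistic into a P value would need an F table or a statistics package, which is left out here.

```python
import statistics

# Hypothetical measurements from two groups, for illustration only.
group_a = [1.0, 2.0, 3.0, 4.0, 5.0]
group_b = [2.0, 4.0, 6.0, 8.0, 10.0]

var_a = statistics.variance(group_a)  # sample variance, divides by n - 1
var_b = statistics.variance(group_b)

# Assumed convention: the larger sample variance goes in the numerator, so F >= 1.
if var_a >= var_b:
    f_stat = var_a / var_b
    df_num, df_den = len(group_a) - 1, len(group_b) - 1
else:
    f_stat = var_b / var_a
    df_num, df_den = len(group_b) - 1, len(group_a) - 1

print(f"F={f_stat:.2f} with ({df_num}, {df_den}) degrees of freedom")
```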

The relationship between distribution and descriptive statistical analysis:

I asked you a few questions in the last issue. How do you describe the characteristics of the frequency distribution graph?

1. Is there a lot of data on the left or on the right?

2. Is the left steep or the right steep?

3. Are there any extremely large or extremely small outliers?

4. Is it 'convex' or 'concave'?

5. What does the overall shape look like?

These questions describe the characteristics of a distribution. Its shape, its steepness, and its outliers are all linked to descriptive statistics through specific indicators. Let's look at them one by one.

The relationship between central tendency and distribution:

From previous issues we know that the mean, median, and mode are indicators of central tendency. However, the mean and median do not represent the central tendency of every data set. For data with an inverted-U-shaped distribution, such as the normal distribution, the mean, median, and mode can all represent the central tendency. For data with a U-shaped distribution, only the mode does. For example, for data made up of 49 values of 1, 49 values of 99, and one value of 50, the mean and the median are both 50, while the modes are 1 and 99. Here only the mode represents where the data actually concentrate.
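The U-shaped example can be reproduced with the Python standard library; the variable names below are chosen only for this sketch.

```python
import statistics

# 49 values of 1, 49 values of 99, and a single value of 50 (a U-shaped data set).
data = [1] * 49 + [99] * 49 + [50]

mean = statistics.mean(data)
median = statistics.median(data)
modes = statistics.multimode(data)  # every value that shares the highest count

print(f"mean={mean}, median={median}, modes={modes}")
```

The mean and median both land on 50, a value that occurs only once, while the modes 1 and 99 are where the data actually pile up.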

In addition, the relative positions of the mean, median, and mode are related to the left-right shape of the distribution graph. When the mean is less than the median and the mode, the bulge is on the right and the long tail is on the left; when the mean is greater than the median and the mode, the bulge is on the left and the long tail is on the right.
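This rule of thumb can be written as a small helper. The function name `tail_direction` and the sample data are hypothetical, and the rule is only a heuristic, not a formal test of skewness.

```python
import statistics


def tail_direction(mean, median, mode):
    """Apply the rule of thumb: compare the mean with the median and the mode."""
    if mean < median and mean < mode:
        return "bulge on the right, long tail on the left"
    if mean > median and mean > mode:
        return "bulge on the left, long tail on the right"
    return "no clear direction from this rule"


# Hypothetical data with a few large values on the right.
right_tailed = [1, 1, 1, 2, 2, 3, 4, 5, 8, 15]
# Mirror image of the same data, so the large deviations are on the left.
left_tailed = [-value for value in right_tailed]

for name, data in [("right_tailed", right_tailed), ("left_tailed", left_tailed)]:
    result = tail_direction(
        statistics.mean(data), statistics.median(data), statistics.mode(data)
    )
    print(f"{name}: {result}")
```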

The relationship between dispersion and distribution:

The indicators of dispersion are the range, the variance, and the standard deviation. This time we mainly discuss the standard deviation (the square root of the variance). I have just covered how to judge whether the data lean to the left or to the right; now I will introduce the indicator of whether the distribution is 'convex' or 'concave'. The rule of thumb used here is: the larger the variance, the more 'convex' the distribution, and the smaller the variance, the more 'concave'. How do you decide whether the variance is large or small? Compare it with a normal distribution that has the same mean.

'Convex' and 'concave' also have a further use. 'Convex' means that the data are concentrated around the mode and the curve drops rapidly on both sides. When such data are plotted, some values far out on the sides differ greatly from the concentrated values, which means there are outliers; whether the outliers are on the large side or the small side can be seen from the skew of the distribution. 'Concave' means that the data are not very concentrated around the mode, the curve declines gently on both sides, and the values do not differ much, which means the data have no obvious outliers.

Did that make sense? If not, don't worry: we have made a short video to help you digest it. Those who are interested can follow our public account to watch.

Comprehensive application scenarios:

Next, let's look at an interesting case.

# National income level case

# (1) A white-collar worker earns more than the people around him, but less than the average salary reported in the national industry statistics. Why?

# (2) x is income, y is the corresponding number of people

x=['1000','2000','3000','4000','5000','6000','7000','8000','9000','10000','20000','30000','40000','50000','1000000','2000000']

y = [1000,3000,7000,10000,14000,16000,14000,8000,1000,500,100,100,100,100,50,50]

Requirements: Calculate the mode, median and average of the data, explain the above phenomenon, and evaluate the overall income situation of the country.

Based on the data, we draw the graph, calculate the indicators, and see what is going on.

The graph looks 'convex': the right side drops steeply, the mean is greater than the median and the mode, and there are extremely large outliers on the right side.
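The three indicators for this case can be computed directly from the grouped data. The sketch below reuses the x and y lists from the case; the helper `weighted_median` and the other variable names are introduced only for this illustration, and plotting is left out.

```python
x = ['1000', '2000', '3000', '4000', '5000', '6000', '7000', '8000', '9000',
     '10000', '20000', '30000', '40000', '50000', '1000000', '2000000']
y = [1000, 3000, 7000, 10000, 14000, 16000, 14000, 8000, 1000,
     500, 100, 100, 100, 100, 50, 50]

incomes = [int(value) for value in x]  # x holds the incomes as strings
total_people = sum(y)

# Mode: the income level with the largest number of people.
mode_income = incomes[y.index(max(y))]

# Mean: total income divided by the total number of people.
mean_income = sum(i * c for i, c in zip(incomes, y)) / total_people


def weighted_median(values, counts):
    """Median of grouped data; values must be sorted in ascending order."""
    total = sum(counts)
    lower_pos = (total + 1) // 2   # 1-based position of the lower middle value
    upper_pos = total // 2 + 1     # 1-based position of the upper middle value
    cumulative = 0
    lower = upper = None
    for value, count in zip(values, counts):
        cumulative += count
        if lower is None and cumulative >= lower_pos:
            lower = value
        if cumulative >= upper_pos:
            upper = value
            break
    return (lower + upper) / 2


median_income = weighted_median(incomes, y)

print(f"mode={mode_income}, median={median_income}, mean={mean_income:.2f}")
```

For this data the mode and the median are both 6000, while the mean is about 7653.33, so a salary of, say, 7000 is above what most people earn and still below the average.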

Conclusion 1: What the worker sees around him is the mode, so his income is higher than theirs. When the extremely large values are included in the average, the overall average income is pulled up, which is why his salary is below the average.

Conclusion 2: The mean is greater than the median and the mode and there are extremely large values, so the national income gap is large. The data are concentrated around the mode and the median, so most people's incomes are at a similar level. The overall income level is low, and most people earn less than the average.
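Both conclusions can be checked with a few lines. This is a sketch on the same case data; treating the two highest brackets (1,000,000 and 2,000,000) as the extreme values to leave out is an assumption made for the illustration.

```python
incomes = [1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000,
           10000, 20000, 30000, 40000, 50000, 1000000, 2000000]
counts = [1000, 3000, 7000, 10000, 14000, 16000, 14000, 8000, 1000,
          500, 100, 100, 100, 100, 50, 50]

total_people = sum(counts)
mean_income = sum(i * c for i, c in zip(incomes, counts)) / total_people

# Share of people whose income is below the overall mean.
people_below_mean = sum(c for i, c in zip(incomes, counts) if i < mean_income)
share_below_mean = people_below_mean / total_people

# Mean after leaving out the two extreme brackets (assumed cut-off: 1,000,000).
kept = [(i, c) for i, c in zip(incomes, counts) if i < 1_000_000]
trimmed_mean = sum(i * c for i, c in kept) / sum(c for _, c in kept)

print(f"share below the mean: {share_below_mean:.1%}")
print(f"mean with all data: {mean_income:.2f}")
print(f"mean without the two top brackets: {trimmed_mean:.2f}")
```

Roughly 87% of people earn less than the mean, and dropping just 100 of the 75,000 people lowers the mean from about 7653 to about 5661, which shows how strongly the extreme values pull the average up.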

That is all for this issue. We will keep updating every week. See you in the next issue, and we look forward to your visit.

Hello everyone, this time we have provided the case code; please visit the official account to get it. If you have any suggestions, such as topics you want to learn about, problems in the content, materials you want, what to share next time, or problems you have run into while learning, please leave a message below. Please follow us if you like it.

Origin blog.csdn.net/qq_40433634/article/details/108771681