【Statistical Analysis】(task1) Hypothesis Test 1: Methodology and Unary Numerical Test

Summary

  • Mathematically, we cannot prove a hypothesis with a particular sample, but we can use one to disprove (reject) a proposition. Hypothesis testing is therefore essentially about how to reject the null hypothesis $H_0$ in order to accept the alternative hypothesis $H_1$.
  • The basic steps of hypothesis testing based on the p-value (the minimum significance level at which the null hypothesis can be rejected, given the observed sample):
    • Determine the alternative hypothesis $H_1$; its sign determines which cumulative probability we use.
    • Identify the formula of the test statistic. Different hypothesis tests have their own test statistics; consult a reference to find it.
    • Specify the distribution the test statistic follows, so that the cumulative probability can be calculated.
    • Calculate the p-value from $H_1$ and the test statistic.
    • Compare the p-value with the significance level $\alpha$: if $p>\alpha$, the null hypothesis cannot be rejected; if $p<\alpha$, it can be rejected.
  • Common hypothesis tests are classified by data type and data characteristics; in practice, select the test that matches the characteristics of the data to be analyzed and the analysis task:
    • For the more common and important hypothesis tests (such as the various t tests), briefly learn their principles;
    • For tests that are uncommon or hard to derive (such as normality tests), it is enough to know how to use them.
  • Different test statistics have different forms and follow different distributions, but the idea of hypothesis testing is shared: construct the test statistic - look up the quantile of the corresponding distribution - compute the critical value (rejection region) - make a judgment.

Zero, basic knowledge review

0.1 Classification of data

  • Statistic: a variable used in statistical theory to analyze and test data. We usually infer properties of a population from a sample via statistics, e.g., inferring the lifespan of the bulbs produced by a factory from the average lifespan of 100 of them. Common statistics include the sample mean, sample variance, sample moments, sample k-th central moment, sample skewness, and sample kurtosis; there are also statistics constructed for the needs of statistical tests, such as the $z$ statistic, $t$ statistic, $\chi^2$ statistic, and $F$ statistic.
  • Classification of statistical data (common practice):
    • Categorical data: compute the frequency or proportion, mode, and variation ratio of each group; perform contingency-table analysis, the $\chi^2$ test, etc.;
    • Ordinal data: compute the median and interquartile range, rank correlation coefficients, etc.;
    • Numerical data (divided into discrete variables, such as product counts, and continuous variables, such as temperature): many more statistical methods are available, such as computing various statistics, parameter estimation and testing, etc.
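As a quick illustration of these conventions, the typical summary statistics for each data type can be computed with pandas (the small datasets below are made up for illustration):

```python
import pandas as pd

# Categorical data: frequencies and the mode
blood_type = pd.Series(['A', 'B', 'A', 'O', 'A', 'AB', 'O', 'A'])
print(blood_type.value_counts())  # frequency of each category
print(blood_type.mode()[0])       # mode: 'A'

# Ordinal data: median and interquartile range (after encoding the order)
satisfaction = pd.Series([1, 2, 2, 3, 4, 4, 5, 5, 5])  # e.g. 1 = very poor ... 5 = very good
print(satisfaction.median())  # median
print(satisfaction.quantile(0.75) - satisfaction.quantile(0.25))  # interquartile range

# Numerical data: mean, variance and other moment statistics
temperature = pd.Series([21.5, 22.0, 19.8, 23.1, 20.6])
print(temperature.mean(), temperature.var())  # sample mean and sample variance (ddof=1)
```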


0.2 Graphical display of data

  • Graphical display of the data:
    • Qualitative data: categorical data (frequency distribution tables, including contingency/cross tables; pie charts, bar charts, Pareto charts, ring charts, etc.) and ordinal data (line charts or frequency plots drawn from cumulative frequencies).
    • Numerical data:
      • If grouped, e.g., after grouping at equal intervals, compute the in-group frequencies and use a histogram or bar chart;
      • If ungrouped, use stem-and-leaf plots or boxplots. Boxplots are less familiar: they can compare the distribution characteristics of several groups of data, summarizing each group with 5 features: the maximum, the minimum, the median, and the two quartiles. Boxplots of different distributions are shown below;
      • For time series data, use a line chart.
      • For multivariate data: scatter plots, bubble charts, radar charts, etc.
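A minimal matplotlib sketch of a comparative boxplot (synthetic course scores, not the textbook data); each box summarizes one group with the five features above:

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the script also runs headless
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Synthetic scores for three courses, to compare their distributions side by side
scores = [rng.normal(75, 8, 50), rng.normal(82, 5, 50), rng.normal(68, 12, 50)]

fig, ax = plt.subplots()
bp = ax.boxplot(scores)  # one box per course: median, quartiles, whiskers, outliers
ax.set_xticks([1, 2, 3])
ax.set_xticklabels(['Math', 'English', 'Physics'])
ax.set_ylabel('score')
plt.show()
```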

[Boxplot example] Plot the scores of 11 students in 8 courses as boxplots (figure from Jia Junping, Statistics, 6th edition, p. 59):

0.3 Generalized measures of data


1. The basics of hypothesis testing

Reason for learning: when studying courses such as probability theory and mathematical statistics or multivariate statistical analysis, I always felt that I had learned many hypothesis tests but lacked a systematic overview, so I never knew which test to select for a given kind of data and analysis requirement.

(1) Several characteristics of hypothesis testing:

  • Hypothesis testing runs through every aspect of statistical analysis. In mathematical modeling, it not only lets us do exploratory mining of the data but also gives us grounds for choosing a model;
  • After modeling, the validity of the model can be verified through specific hypothesis tests. The appropriate test must be selected according to the characteristics of the data and the needs of the task.
  • Unlike regression analysis and other systematic, structured statistical analysis tasks, each field has its own applicable hypothesis tests. For example, regression analysis has significance tests for model coefficients, and time series analysis has unit root tests, white noise tests, and so on.

(2) Two major categories of hypothesis testing (hypothesis testing based on statistical models and hypothesis testing not based on statistical models):

  • The former is based on a known statistical model and "serves" the use of that model. The coefficient significance test of the linear regression model mentioned earlier is a typical example. For such tests, first learn the corresponding statistical model, then study the hypothesis test in that context.
  • The latter (the focus of this study) "starts directly from the data" and tests properties of the data themselves, such as the normality test, the two-sample t test, analysis of variance, and so on.

1.1 The principle of hypothesis testing

(1) The essence of hypothesis testing

[Example] Known: the average score of Class One is $\bar{x}=108.2$, the sample standard deviation is $s=4$, and the class size is $n=25$; by past experience, the grade's test scores follow a normal distribution. Can the prefect conclude that the grade-wide average score is at least 110?

Simple analysis:
Population: mathematics scores of the entire grade
Sample: mathematics scores of one class
Known: the sample mean $\bar{x}=108.2$
The prefect's question: can we infer from one class's sample whether the population mean reaches 110?

  • We need to answer "yes" or "no" to the proposition that the overall grade point average is not less than 110. Such questions are called hypothesis testing questions.
  • The process of mathematically testing and answering hypothesis testing questions is called hypothesis testing.

Summary: this kind of "yes or no" answer to a proposition about the population, based on sample information and known information, is the essence of hypothesis testing. That is, hypothesis testing does not verify a property of the sample itself, but a property of the population from which the sample is drawn.

Hypothesis testing can be roughly divided into two types: parametric hypothesis testing and nonparametric hypothesis testing.

  • If the hypothesis concerns a parameter or set of parameters of the population, the test is a parametric hypothesis test. The hypothesis of Example.1 concerns the population mean, and the mean is a parameter, so it is a parametric hypothesis test;
  • If the hypothesis cannot be represented by a set of parameters, the hypothesis test is a nonparametric hypothesis test, a typical one is the normality test.

1.2 Derivation of Hypothesis Testing

(1) The establishment of the hypothesis

Example.1: The math score $X$ of students in a certain grade follows a normal distribution $X\sim N\left( \mu ,\sigma ^2 \right)$. Take one class of students as a sample; the sample mean is $\bar{x}=108.2$, the sample standard deviation is $s=4$, and the class size is $n=25$. Can the population mean be considered to satisfy $\mu >110$?

Whether the proposition "the population mean $\mu >110$" holds involves the following two hypotheses:
$$H_0:\mu \leqslant 110 \leftrightarrow H_1:\mu >110$$
$H_0$ is called the null hypothesis and $H_1$ the alternative hypothesis. The two hypotheses must be mutually exclusive, because only then is rejecting $H_0$ equivalent to accepting $H_1$.

  • The discussion of whether the proposition holds is transformed into whether to reject the null hypothesis $H_0$, rather than directly accepting the alternative: mathematically, we cannot prove a hypothesis with a particular sample, but we can use one to disprove (reject) a proposition. Therefore, hypothesis testing is essentially about how to reject the null hypothesis $H_0$ in order to accept the alternative hypothesis $H_1$.
  • So in Example.1, it is hard to prove directly that $\mu >110$ holds; instead we verify it indirectly by showing that the assumption $\mu \leqslant 110$ is untenable. In actual hypothesis testing, we usually take the proposition we want to establish as the alternative hypothesis $H_1$, and judge whether to accept $H_1$ by testing whether $H_0$ is rejected.

(2) Three types of one-parameter tests and precautions for the null hypothesis

Although we usually take the proposition we want to test as the alternative hypothesis $H_1$, this is not an absolute rule: in some hypothesis tests the null/alternative hypotheses are fixed. For example, when testing whether the population of a sample follows a specific distribution (such as a normality test), the two hypotheses are set as
$$H_0:\text{the population of the sample follows the given distribution} \leftrightarrow H_1:\text{the population of the sample does not follow the given distribution}$$
As another example, in the most common one-parameter tests, "=" appears only in the null hypothesis $H_0$ and never in the alternative hypothesis $H_1$; that is, we never set up hypotheses such as
$$H_0:\mu \ne 110 \leftrightarrow H_1:\mu =110$$

The three most common one-parameter test problems, taking the mean test as an example:
$$H_0:\mu \leqslant \mu _0 \leftrightarrow H_1:\mu >\mu _0$$
$$H_0:\mu \geqslant \mu _0 \leftrightarrow H_1:\mu <\mu _0$$
$$H_0:\mu =\mu _0 \leftrightarrow H_1:\mu \ne \mu _0$$
The first two are one-sided tests; the third is a two-sided test. There is another, more common way to write the null hypotheses of these three problems:
$$H_0:\mu =\mu _0 \leftrightarrow H_1:\mu >\mu _0$$
$$H_0:\mu =\mu _0 \leftrightarrow H_1:\mu <\mu _0$$
$$H_0:\mu =\mu _0 \leftrightarrow H_1:\mu \ne \mu _0$$

Q: Why are the null hypotheses all written with "="?
A: Take the first test problem as an example: accepting $H_1$ means $\mu$ is significantly greater than $\mu _0$, i.e., we reject even the assumption that $\mu$ equals $\mu _0$; and if $\mu =\mu _0$ is already unacceptable, $\mu <\mu _0$ is all the more so. Therefore, although this notation is not strictly mutually exclusive, its result is equivalent to the previous notation.

In the parametric tests that follow, the sign in the null hypothesis is uniformly set to "=", and different test problems are distinguished only by the alternative hypothesis $H_1$.

(3) Critical value, rejection domain, significance level

In Example.1, since the sample mean $\bar{x}$ is an unbiased estimate of the population mean $\mu$, if the null hypothesis deserves to be rejected, that is, if $\mu >110$, then $\bar{x}$ is very likely greater than 110. Conversely, if the $\bar{x}$ computed from the actual sample is much larger than 110, the null hypothesis is very likely false. To obtain a criterion for rejecting the null hypothesis, we set a critical value $C$: if the actual sample yields $\bar{x}-110>C$, we reject the null hypothesis. The set determined by $\bar{x}-110>C$ is also called the rejection region:

$$\left\{ \bar{x}: \bar{x}>110+C \right\}$$
Once the sample result falls into the rejection region, we reject the null hypothesis; otherwise the null hypothesis cannot be rejected. Rejection regions differ across hypothesis tests, but the core logic is exactly the same.

1) Determine the critical value C with probability

The next question: how do we determine the critical value $C$? With probability.

Due to the randomness of sampling, judging the nature of the population from sample information always carries a chance of error. Whether or not we reject the null hypothesis $H_0$, we have some probability of making one of the following two types of errors:

  • Type I error: the null hypothesis $H_0$ is true, but the data fall into the rejection region (so we wrongly reject $H_0$). The probability of making a Type I error, i.e., of false rejection, is denoted $\alpha$.
  • Type II error: the null hypothesis $H_0$ is false, but the data do not fall into the rejection region (so we wrongly accept $H_0$). The probability of making a Type II error, i.e., of false acceptance, is denoted $\beta$.

The two error probabilities pull against each other: with the sample size fixed, reducing the probability of one type of error by adjusting the test rule inevitably increases the probability of the other. Since we cannot keep both low at the same time, we compromise: the common practice is to control only the probability of making a Type I error, $\alpha$.
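The claim that limiting $\alpha$ fixes the Type I error rate can be checked with a small simulation (a sketch under illustrative settings: a right-tailed one-sample t test at $\alpha=0.05$, sampling from a normal population for which the null hypothesis is true):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
alpha, n, trials = 0.05, 25, 20000
crit = stats.t.isf(alpha, df=n - 1)  # right-tail critical value t_{n-1}(1 - alpha)

# H0 is true: every sample comes from N(110, 4^2); count how often we wrongly reject
rejections = 0
for _ in range(trials):
    x = rng.normal(110, 4, n)
    tstat = (x.mean() - 110) / (x.std(ddof=1) / np.sqrt(n))
    if tstat > crit:
        rejections += 1

print('empirical Type I error rate:', rejections / trials)  # close to 0.05 when H0 is true
```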

2) The problem of determining the critical value

Type 1 error: The null hypothesis is true, but the null hypothesis is rejected.
Type II error: The null hypothesis is false, but the null hypothesis is not rejected.

Back to determining the critical value. When choosing it, we require the probability of a Type I error to be at a given, small level $\alpha$ (usually $\alpha=0.05$ or $0.1$); this $\alpha$ is also called the significance level.

The criterion for determining the critical value $C$ is: given that the null hypothesis $H_0$ is true, the probability that the data fall into the rejection region should be exactly the given $\alpha$. In Example.1, this probability can be written as:
$$P_{H_0\ \text{true}}\left( \bar{x}-\mu _0>C \right) = P\left( \bar{x}-110>C \right) = \alpha$$

(4) Handling Probabilities with Distributions - Constructing Test Statistics

To handle $P\left( \bar{x}-110>C \right)$: this form resembles the tail probability (quantile) of a distribution that we met in earlier study, so the next step is to rewrite it as a quantile of some distribution, from which the critical value $C$ can be solved.

Note: 110 is the value of $\mu$ under the null hypothesis $H_0$, and $E\left( \bar{x} \right) =\mu$, so the probability is actually:
$$P\left( \bar{x}-E\left( \bar{x} \right) >C \right), \quad E\left( \bar{x} \right) =\mu _0=110$$
Since $\bar{x}$ follows a normal distribution and in Example.1 only the sample standard deviation $s$ is known (not the population standard deviation), we standardize by $s/\sqrt{n}$ and construct the t statistic:
$$P\left( \bar{x}-E\left( \bar{x} \right) >C \right) = P\left( \frac{\bar{x}-E\left( \bar{x} \right)}{s/\sqrt{n}}>\frac{C}{s/\sqrt{n}} \right) = \alpha, \quad \frac{\bar{x}-E\left( \bar{x} \right)}{s/\sqrt{n}}\sim t_{n-1}$$
This means $\dfrac{C}{s/\sqrt{n}}$ is exactly the $t_{n-1}\left( 1-\alpha \right)$ quantile, which is known for a given distribution, so $C$ is solved:
$$C=\frac{s}{\sqrt{n}}\cdot t_{n-1}\left( 1-\alpha \right)$$
Substituting this back into the probability above gives
$$P\left( \frac{\bar{x}-\mu _0}{s/\sqrt{n}}>t_{n-1}\left( 1-\alpha \right) \right) = P\left( \bar{x}>\mu _0+\frac{s}{\sqrt{n}}\cdot t_{n-1}\left( 1-\alpha \right) \right) = \alpha$$
That is: as long as $\bar{x}>110+\dfrac{s}{\sqrt{n}}\cdot t_{n-1}\left( 1-\alpha \right)$, we can reject the null hypothesis at significance level $\alpha$.

  • In the above process, we used known statistics to construct a statistic that follows a known distribution, which lets us compute the probability; the constructed statistic is the test statistic.
  • Different test statistics have different forms and follow different distributions, but the idea of hypothesis testing is shared: construct the test statistic - look up the quantile of the corresponding distribution - compute the critical value (rejection region) - make a judgment.

In the above example, the test statistic is
$$t=\frac{\bar{x}-\mu _0}{s/\sqrt{n}}$$
the corresponding t-distribution quantile is
$$t_{n-1}\left( 1-\alpha \right)$$
and the rejection region is
$$\bar{x}>110+\frac{s}{\sqrt{n}}\cdot t_{n-1}\left( 1-\alpha \right)$$
Implementing the above hypothesis testing process manually in python:

## Load packages
import numpy as np
import pandas as pd
from scipy.stats import t

n = 25
x_bar = 108.2
s = 4
mu = 110

# Compute the test statistic t = (x_bar - mu) / (s / sqrt(n))
tvalue = (x_bar - mu) / (s / np.sqrt(n))
print('t statistic: {}'.format(tvalue))

# Look up the quantile
'''
ppf: left-tail quantile
isf: right-tail quantile
interval: two-sided quantiles
'''
T_isf = t.isf(0.05, n - 1)  # H1 uses ">", so the right-tail quantile is needed; 0.05 is the significance level, n-1 the degrees of freedom
# If H1 used "<", the left-tail quantile ppf would be used instead, with the same arguments: significance level, then degrees of freedom

print('quantile: {}'.format(T_isf))
# Rejection region: x_bar > mu + (s / sqrt(n)) * t_{n-1}(1 - alpha)
Deny_domain = mu + (s / np.sqrt(n)) * T_isf
print('critical point of the rejection region: {}'.format(Deny_domain))

# Judgment
print('Is the sample mean in the rejection region: {}'.format(x_bar > Deny_domain))
print('Therefore, the null hypothesis cannot be rejected: we cannot conclude that the population mean exceeds 110.')

#t statistic: about -2.25
#quantile: about 1.7109
#critical point of the rejection region: about 111.37
#Is the sample mean in the rejection region: False
#Therefore, the null hypothesis cannot be rejected: we cannot conclude that the population mean exceeds 110.

Of course, the rejection region can also be expressed directly with the test statistic and the corresponding distribution quantile:
$$t>t_{n-1}\left( 1-\alpha \right)$$
This is more convenient and more general, since no extra time is spent computing the critical value $C$. The rejection regions of the three hypothesis tests (taking the normal distribution as an example) are shown in the figure below.
[Figure: rejection regions of the right-tailed, left-tailed, and two-sided tests]

As shown, the three alternative hypotheses correspond to the right-tail quantile, the left-tail quantile, and the two-sided quantiles, respectively. In practice, we select the quantile and construct the rejection region according to the sign of the alternative hypothesis $H_1$.

# Judge directly with the test statistic and the distribution quantile
print('Is the test statistic in the rejection region: {}'.format(tvalue>T_isf))
# Is the test statistic in the rejection region: False

Suppose a two-sided test is run with the data of Example.1:
$$H_0:\mu =110 \leftrightarrow H_1:\mu \ne 110$$
Then the rejection region is
$$\left| t \right|>t_{n-1}\left( 1-\frac{\alpha}{2} \right)$$

# Two-sided test
tvalue = (x_bar - mu) / (s / np.sqrt(n))  # one-sample t statistic
## compute the two-sided quantiles
T_int = t.interval(1 - 0.05, n - 1)  # for a two-sided test (two-sided quantiles), pass 1 - alpha; here 1 - 0.05 = 0.95
print('absolute value of the test statistic t: {}'.format(np.abs(tvalue)))
print('two-sided quantiles: {}'.format(T_int))
print('The test statistic falls inside the rejection region, so the null hypothesis is rejected at the 0.05 level')

#absolute value of the test statistic t: about 2.25
#two-sided quantiles: (-2.0638985616280205, 2.0638985616280205)
#The test statistic falls inside the rejection region, so the null hypothesis is rejected at the 0.05 level

1.3 Basic steps of hypothesis testing - based on p-values

  • A disadvantage of the rejection-region method is that the quantile depends on the significance level $\alpha$: testing at different significance levels requires computing different quantiles for comparison, which is tedious.
  • p-value: once the sample observations and the hypotheses are fixed, a single constant indicator decides whether to reject the null hypothesis. The p-value is the minimum significance level at which the null hypothesis can be rejected given the observed sample; it depends only on the observed sample and on which hypothesis test we perform. The smaller the p-value, the stronger the case for rejecting the null hypothesis.

The smaller the p-value, the more confidently the null hypothesis can be rejected. For example, if the p-value is 0.001, which is below 0.01, the null hypothesis can be rejected even at the 0.01 significance level; if the p-value is 0.025, which is above 0.01 but below 0.05, the null hypothesis can be rejected at the 0.05 significance level but not at 0.01.

The form of the p-value depends on the alternative hypothesis $H_1$ we make:

  • If the sign of $H_1$ is $\ne$: $\ pvalue=P\left( \left| X \right|>\left| \text{test statistic} \right| \right)$
  • If the sign of $H_1$ is $>$: $\ pvalue=P\left( X>\text{test statistic} \right)$
  • If the sign of $H_1$ is $<$: $\ pvalue=P\left( X<\text{test statistic} \right)$

where:

  • $X$ is a random variable following the specific distribution;
  • "test statistic" is the test statistic described earlier.
  • The p-value is essentially a cumulative probability: for an alternative with sign $>$, it is the right-tail cumulative probability; for $<$, the left-tail cumulative probability; and for the same test statistic, the two-sided p-value is twice the corresponding one-sided p-value.
# Run the three hypothesis tests with the data of Example.1, using p-values
tvalue = (x_bar - mu) / (s / np.sqrt(n))  # one-sample t statistic
'''
sf: right-tail cumulative probability
cdf: left-tail cumulative probability
'''
# If H1 is mu > 110
pvalue = t.sf(tvalue, n - 1)
print('p-value for H1: mu > 110: {}'.format(pvalue))

# If H1 is mu < 110
pvalue = t.cdf(tvalue, n - 1)
print('p-value for H1: mu < 110: {}'.format(pvalue))

# If H1 is mu != 110
pvalue = t.cdf(tvalue, n - 1) * 2  # twice the left-tail probability, because the right-tail probability exceeds 0.5 and a p-value cannot exceed 1
print('p-value for H1: mu != 110: {}'.format(pvalue))

#p-value for H1: mu > 110: about 0.983
#p-value for H1: mu < 110: about 0.017
#p-value for H1: mu != 110: about 0.034

Note: using p-values for hypothesis testing is more common in practice. All hypothesis-testing packages in python output both the test statistic and the p-value. In subsequent learning, p-values are used uniformly for hypothesis testing.

The python scipy.stats module contains many ready-to-use hypothesis-testing APIs, but compared with SPSS and R, which specialize in statistical analysis, python offers relatively few hypothesis-testing functions. If a test has no corresponding api in python, the p-value must be computed manually, for example the later Hotelling T2 test on the mean vector.

[Summary] The basic steps of hypothesis testing based on the p-value:

  1. Determine the alternative hypothesis $H_1$; its sign determines which cumulative probability we use.
  2. Identify the formula of the test statistic. Different hypothesis tests have their own test statistics; consult a reference to find it.
  3. Specify the distribution the test statistic follows; only then can the cumulative probability be calculated.
  4. Calculate the p-value from $H_1$ and the test statistic.
  5. Compare the p-value with the significance level $\alpha$: if $p>\alpha$, the null hypothesis cannot be rejected; if $p<\alpha$, it can be rejected.
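These steps are exactly what library routines perform internally. As a sketch, scipy's `ttest_1samp` runs the whole procedure on raw data (the `alternative` keyword requires scipy >= 1.6; the sample below is simulated for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
scores = rng.normal(108, 4, 25)  # simulated class scores

# Steps 1-4: H1 is mu > 110, so we ask for the right-tailed p-value
res = stats.ttest_1samp(scores, popmean=110, alternative='greater')
print('test statistic:', res.statistic)
print('p-value:', res.pvalue)

# Step 5: compare with the significance level
alpha = 0.05
print('reject H0:', res.pvalue < alpha)
```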

1.4 Classification of Hypothesis Testing


The common hypothesis tests are classified according to data types and data characteristics. In practical applications, the corresponding hypothesis tests can be selected directly according to the characteristics of the data to be analyzed and the analysis tasks:

  • For some of the more common and important hypothesis tests (such as various t tests), briefly learn the principles;
  • Learn how to use tests that are uncommon and difficult to explain, such as normality tests.

2. Hypothesis testing for unary numerical data

This part covers hypothesis tests on between-group means for unary numerical data: how to test properties of the mean of the population from which the sample is drawn, and how to implement each test in python. Contents:

  1. Normality test
  2. Testing whether the population mean of one group of data equals a fixed value
  3. Testing the equality of the population means of two groups of data
  4. Testing the equality of the population means of more than two groups of data

Tests 2~4 each split into two cases: if the data are roughly normally distributed, parametric tests (t tests) can be used, which are more powerful than nonparametric tests but require the normality assumption; if the data are not normally distributed, nonparametric tests can be used instead.
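As a sketch of this choice (simulated, clearly non-normal data; the pairing shown is the usual one): the Mann-Whitney U test is the common nonparametric counterpart of the two-sample t test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
a = rng.exponential(2.0, 40)         # clearly non-normal sample
b = rng.exponential(2.0, 40) + 1.0   # shifted copy of the same shape

# Parametric: two-sample t test (assumes rough normality)
t_res = stats.ttest_ind(a, b)
# Nonparametric counterpart: Mann-Whitney U test (no normality assumption)
u_res = stats.mannwhitneyu(a, b)

print('t-test p-value      :', t_res.pvalue)
print('Mann-Whitney p-value:', u_res.pvalue)
```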

2.1 Normality test

Since parametric tests are more powerful than nonparametric tests, we should use a parametric test whenever the data are normally distributed, so testing the data for normality is very necessary.

Here are three methods for judging the normality of data: visual judgment via the normal probability plot; the Shapiro-Wilk test; and D'Agostino's K-squared test.

(1) Probability map

Statistics offers many tools for visually assessing distributions; the probability plot is one of them.

import matplotlib.pyplot as plt
from scipy import stats

# Generate 1000 samples from a normal distribution
data_norm = stats.norm.rvs(loc=10, scale=10, size=1000) # rvs(loc, scale, size): random numbers from the given distribution; loc: mean; scale: standard deviation; size: number of samples
# Generate 1000 samples from a chi-square distribution
data_chi = stats.chi2.rvs(2, 3, size=1000)

# Draw the two probability plots
fig = plt.figure(figsize=(12, 6))
ax1 = fig.add_subplot(1, 2, 1)
plot1 = stats.probplot(data_norm, plot=ax1) # normal data
ax2 = fig.add_subplot(1, 2, 2)
plot2 = stats.probplot(data_chi, plot=ax2) # chi-square data

For a given sample dataset, the probability plot:

  • first sorts the data $x$ in ascending order and computes the theoretical quantiles of the target distribution corresponding to the sorted data;
  • then plots the points in a two-dimensional graph with the theoretical quantiles on the horizontal axis and the ordered sample values on the vertical axis.
  • If the data roughly follow the target distribution, the points lie approximately on a straight line; if not, the points visibly deviate from that line.

(2) Two normality tests

Probability plots only give a rough impression of normality; they are not precise. To judge more accurately whether the population of the sample is normally distributed, a formal normality test is needed.

The two hypotheses of the normality test are:
$$H_0:\text{the population of the sample follows a normal distribution} \leftrightarrow H_1:\text{the population of the sample does not follow a normal distribution}$$
There are many kinds of normality tests. Here we cover only two of the most commonly used and most powerful: the Shapiro-Wilk test for small samples, and D'Agostino's K-squared test for large samples.

1) Shapiro-Wilk test (small sample normality test)

The Shapiro-Wilk test is one of the most powerful methods for normality testing. It is a frequentist method, and its theoretical derivation is relatively involved.

This method suits normality testing with small samples because it works best when every sample value is unique; once the sample is large, repeated values become inevitable, which greatly reduces the method's power.

Applicable sample sizes: no fewer than 8; under 50 is best, under 2000 is acceptable, and above 5000 the test is no longer applicable.
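A minimal usage sketch of `scipy.stats.shapiro` on a small simulated sample:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
sample = rng.normal(0, 1, 30)  # small sample, within the Shapiro-Wilk comfort zone

stat, p = stats.shapiro(sample)
print('W statistic:', stat)
print('p-value    :', p)
print('looks normal at the 0.05 level:', p > 0.05)
```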

2) D'Agostino's K-squared test (large sample normality test)

D'Agostino's K-squared test quantifies the asymmetry and tail behavior of the data distribution by computing its skewness and kurtosis, and then measures how far these values are from those expected under a normal distribution.

This method is a common and powerful normality test suited to large samples, because skewness and kurtosis estimates are strongly affected by the amount of data: the larger the sample, the more accurately they are estimated.

Applicable sample size: no fewer than 4; beyond that, the larger the better.
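The ingredients of this test can be computed directly with `scipy.stats` (simulated large sample for illustration): `skew` and `kurtosis` give the two moments, and `normaltest` combines them into the K-squared statistic:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
sample = rng.normal(0, 1, 6000)  # large sample, suitable for this test

print('skewness:', stats.skew(sample))      # near 0 for normal data
print('kurtosis:', stats.kurtosis(sample))  # excess kurtosis, near 0 for normal data
stat, p = stats.normaltest(sample)          # K^2 statistic built from both moments
print('K^2 statistic:', stat, 'p-value:', p)
```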

(3) Use multiple methods to judge normality at the same time

In practice, because data are complex, judging normality with only one method may mislead, so we usually apply several methods at once. If different methods disagree, examine the data carefully for the cause of the inconsistency. For example: if the Shapiro-Wilk test is significant (non-normal) while D'Agostino's K-squared test is not (normal), the cause may be a large sample size or duplicated values in the sample; in that case we should adopt the conclusion of D'Agostino's K-squared test rather than that of the Shapiro-Wilk test.

[Code Practice] Define a function in Python that combines the probability plot, the Shapiro-Wilk test, and D'Agostino's K-squared test.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

data_small = stats.norm.rvs(0, 1, size=30)    # small-sample normal dataset
data_large = stats.norm.rvs(0, 1, size=6000)  # large-sample normal dataset

# Define a normality-test function that outputs:
## the normal probability plot
## the p-value of the (small-sample) Shapiro-Wilk test
## the p-value of the (large-sample) D'Agostino's K-squared test

def check_normality(data: np.ndarray, show_flag: bool = True) -> pd.Series:
    """
    Parameters
    ----------
    data : numpy array or pandas.Series
    show_flag : whether to display the probability plot

    Returns
    -------
    the p-values of the two tests; the probability plot
    """

    if show_flag:
        _ = stats.probplot(data, plot=plt)
        plt.show()

    pVals = pd.Series(dtype='float64')
    # D'Agostino's K-squared test
    _, pVals['Omnibus'] = stats.normaltest(data)

    # Shapiro-Wilk test
    _, pVals['Shapiro-Wilk'] = stats.shapiro(data)

    print(f'Normality test results for a dataset of size {len(data)}: ----------------')
    print(pVals)
    return pVals

check_normality(data_small, show_flag=True)
check_normality(data_large, show_flag=False)  # a warning appears when the sample size exceeds 5000

2.2 Mean Test

This section tests the population mean of a univariate numerical sample. Each mean test offers both a parametric version (a t-test) and a non-parametric alternative to choose from.

(1) Test of a single-group sample mean

Application scenario: testing whether the mean of the population from which a sample is drawn equals a reference value; this is the test of a single-group sample mean.

The test problem of Example.1 is actually of this kind (although its alternative hypothesis needs to be replaced with $\ne$).

Example.2 At Bisheng Middle School, Mr. Wang's class has just finished an English test. Because the class is large, it is hard to finish grading and computing statistics in a short time, but Mr. Wang wants to know whether the class average differs significantly from the target of 137 set by the prefect. He therefore randomly selected the English scores of 10 students:

136, 136, 134, 136, 131, 133, 142, 145, 137, 140

Q: Can Mr. Wang conclude that there is no significant difference between the class average and the target of 137 set by the prefect?
This is a typical single-group sample mean test: we compare whether the population mean represented by this sample (the English scores of 10 students) equals the reference value 137. For this kind of problem we have two tests available: the one-sample t-test and the Wilcoxon signed rank test.

1) One-Sample t-test

The t-test requires the population to follow a normal distribution, i.e.
$$x \sim N(\mu, \sigma^2)$$
In Example.2, this means the English score of every student in Mr. Wang's class is assumed to follow a normal distribution. In a t-test, the population standard deviation $\sigma$ does not need to be known in advance; this distinguishes it from the z-test of probability theory and mathematical statistics, and makes the t-test more widely applicable in practice.

The basic flow of hypothesis testing with p-values is the same as in Section 1.1.3.

The two hypotheses of the one-sample t-test are:
$$H_0: \mu = \mu_0 \leftrightarrow H_1: \mu \ne \mu_0$$
The corresponding test statistic is:
$$Test\ statistic = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$$
The distribution of the test statistic is:
$$Test\ statistic \sim t_{n-1}$$
where $n$ is the sample size and $s$ is the sample standard deviation. From this information we can calculate the p-value.
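As a quick sanity check on Example.2 (our own illustrative script), the t statistic can be computed by hand and compared with `scipy.stats.ttest_1samp`. The average of the ten sampled scores happens to be exactly 137, so the statistic is 0:

```python
import numpy as np
from scipy import stats

scores = np.array([136, 136, 134, 136, 131, 133, 142, 145, 137, 140])
n = len(scores)

# test statistic computed by hand: (x-bar - mu0) / (s / sqrt(n))
t_manual = (scores.mean() - 137) / (scores.std(ddof=1) / np.sqrt(n))

# scipy's built-in one-sample t-test for comparison
t_scipy, p = stats.ttest_1samp(scores, 137)
print(t_manual, t_scipy, p)   # the sample mean is exactly 137, so t = 0 and p = 1
```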

2) Wilcoxon signed rank test

If the sample data are not normal, we should instead use the Wilcoxon signed rank test. This is a classic non-parametric test; its underlying ideas are introduced below:

(1) First of all, what is a "rank"?
Let $x_1, \cdots, x_n$ be a simple random sample from a continuous distribution, and sort it in ascending order to obtain the ordered sample $x_{(1)} \leqslant \cdots \leqslant x_{(n)}$. The ordinal $r$ of an observation $x_i$ within the ordered sample is called the rank of $x_i$. In other words, the rank says that $x_i$ is the $r$-th smallest of all the sample values.

A rank sum test needs a "sum of ranks". Let $x_1, \cdots, x_n$ be the sample; take absolute values, and let $R_i$ denote the rank of $|x_i|$ in $(|x_1|, \cdots, |x_n|)$. Define
$$I(x_i > 0) = \begin{cases} 1, & x_i > 0 \\ 0, & x_i \leqslant 0 \end{cases}$$
Then
$$W^+ = \sum_{i=1}^{n} R_i I(x_i > 0)$$
is called the rank sum statistic.
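The statistic $W^+$ can be computed directly with `scipy.stats.rankdata`; a small sketch using the differences from Example.2 (scores minus the reference value 137):

```python
import numpy as np
from scipy.stats import rankdata

# differences between the Example.2 scores and the reference value 137
x = np.array([136, 136, 134, 136, 131, 133, 142, 145, 137, 140]) - 137
x = x[x != 0]                  # zero differences carry no sign and are dropped
R = rankdata(np.abs(x))        # ranks of |x_i|; ties receive average ranks
W_plus = R[x > 0].sum()        # rank sum statistic W+
print(W_plus)                  # -> 20.5
```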

(2) Steps of the Wilcoxon signed rank test.
For a one-sample mean comparison, the two hypotheses are still
$$H_0: \mu = \mu_0 \leftrightarrow H_1: \mu \ne \mu_0$$
For the sample $x_1, \cdots, x_n$ to be analyzed, subtract the reference value $\mu_0$ from every observation to obtain $x_1 - \mu_0, \cdots, x_n - \mu_0$, and compute their rank sum statistic $W^+$.

(3) The test statistic can be calculated as
$$Test\ statistic = \frac{W^+ - \frac{n(n+1)}{4}}{\sqrt{\frac{n(n+1)(2n+1)}{24}}}$$
and approximately follows the distribution
$$Test\ statistic \rightarrow N(0, 1)$$
where $n$ is the sample size.

(4) As in the t-test, the calculation of the p-value depends on the sign of the alternative hypothesis $H_1$.

This method works best when the sample size is greater than 25, because only then is the test statistic approximately normally distributed. The sample size in Example.2 is 10, so strictly speaking the approximation does not apply.

# Define a single-group mean-test function that outputs the p-values of both
# the t-test and the Wilcoxon signed rank test
def check_mean(data, checkvalue, confidence=0.05, alternative='two-sided'):
    '''
    Parameters
    ----------
    data : numpy array or pandas.Series
    checkvalue : the reference mean to compare against
    confidence : significance level
    alternative : type of test, determined by the sign of the alternative
        hypothesis: 'two-sided', 'greater' (right-tailed), 'less' (left-tailed)

    Returns
    -------
    the p-values under both tests
    whether the null hypothesis is rejected at the significance level
    '''
    pVal = pd.Series(dtype='float64')
    # test for normal data: one-sample t-test
    _, pVal['t-test'] = stats.ttest_1samp(data, checkvalue, alternative=alternative)
    print('t-test------------------------')
    if pVal['t-test'] < confidence:
      print(('The target value {0:4.2f} differs from the sample mean at significance level {1:} (p={2:5.3f}).'.format(checkvalue, confidence, pVal['t-test'])))
    else:
      print(('At significance level {1:}, we cannot reject that the target value {0:4.2f} equals the sample mean (p={2:5.3f}).'.format(checkvalue, confidence, pVal['t-test'])))

    # test for non-normal data: Wilcoxon signed rank test
    _, pVal['wilcoxon'] = stats.wilcoxon(data - checkvalue, alternative=alternative)
    print('wilcoxon------------------------')
    if pVal['wilcoxon'] < confidence:
      print(('The target value {0:4.2f} differs from the sample mean at significance level {1:} (p={2:5.3f}).'.format(checkvalue, confidence, pVal['wilcoxon'])))
    else:
      print(('At significance level {1:}, we cannot reject that the target value {0:4.2f} equals the sample mean (p={2:5.3f}).'.format(checkvalue, confidence, pVal['wilcoxon'])))

    return pVal
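Applying this to Example.2 is equivalent to calling the two scipy tests directly on the scores:

```python
import numpy as np
from scipy import stats

scores = np.array([136, 136, 134, 136, 131, 133, 142, 145, 137, 140])

# one-sample t-test against the target 137
_, p_t = stats.ttest_1samp(scores, 137)
# Wilcoxon signed rank test on the differences (the zero difference is dropped)
_, p_w = stats.wilcoxon(scores - 137)
print(p_t, p_w)   # both p-values are far above any usual significance level
```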

Whether we use the t-test or the Wilcoxon test, the p-value is quite large, so we clearly cannot reject the null hypothesis: Mr. Wang may conclude that the class average does not differ significantly from the target of 137.

(2) Test for equality of the means of two groups of samples

1) Independent groups

Two-sample t-test
Mann-Whitney U rank-sum test

2) Paired groups

Paired t-test
Paired Wilcoxon signed rank test
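All four tests listed above are available in scipy; a sketch on simulated data (the data and variable names are illustrative, and whether the groups are treated as independent or paired is our own framing):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
before = rng.normal(10, 2, size=40)
after = before + rng.normal(1, 1, size=40)   # paired measurements with a shift

# 1) independent groups
_, p_t2 = stats.ttest_ind(before, after)      # two-sample t-test (assumes normality)
_, p_mw = stats.mannwhitneyu(before, after)   # Mann-Whitney U rank-sum test

# 2) paired groups
_, p_pt = stats.ttest_rel(before, after)      # paired t-test
_, p_pw = stats.wilcoxon(before, after)       # paired Wilcoxon signed rank test

print(p_t2, p_mw, p_pt, p_pw)
```

Note how the paired tests exploit the per-subject pairing and detect the shift more easily than the independent-group tests.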

(3) Analysis of variance (test for equality of means among multiple groups of samples)

We have covered mean tests for a single group and for two groups of samples; we now learn to test the population means of multiple groups of samples simultaneously.

Analysis of variance (ANOVA): a statistical method for comparing the means of multiple populations. The focus below is on the principles and ideas of one-way ANOVA, and on how multi-way ANOVA is applied.

1) Introduction to ANOVA

What counts as a factor in one-way/multi-way ANOVA? Three example ANOVA questions:

  1. Whether the quality indicators of the same type of product with four different trademarks are consistent.
  2. Does taking three different sales methods for the same product lead to significantly different sales volumes?
  3. Whether there are significant differences in the purchasing power of residents in five different residential areas.

In these examples, trademarks, sales methods, residential areas, and so on, i.e. the basis for distinguishing the groups, are called factors, usually denoted by capital letters $A, B, C$, etc.; the different states of a factor are called levels, denoted $A_1, A_2$, etc. Each of the three examples involves only one factor, so they are all one-way ANOVA problems; when more than one factor is involved, we speak of multi-way ANOVA.

For one-way ANOVA, the number of groups to be compared is essentially the number of levels of the factor. In example 1, for instance, we are actually comparing the sample means of the quality indicators of products with four different trademarks (that is, the means of the populations they come from); the factor "trademark" has 4 levels.

How does ANOVA compare multiple population means? Not pairwise, but all at the same time. Written as a hypothesis test:
$$H_0: \mu_1 = \mu_2 = \cdots = \mu_k \leftrightarrow H_1: \text{the } \mu_i \text{ are not all equal}$$
where $k$ is the number of groups (factor levels).
When the population means are not all equal:
[Figure: the group distribution curves are shifted relative to one another]

2) One-way ANOVA
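The computational details of one-way ANOVA are not developed here, but a minimal sketch shows how it is run in practice. The three groups below simulate the "sales methods" example; all data and names are illustrative:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# sales volumes under three sales methods (one factor, three levels)
method1 = rng.normal(100, 10, size=20)
method2 = rng.normal(100, 10, size=20)
method3 = rng.normal(120, 10, size=20)   # this method's population mean differs

f_stat, p = stats.f_oneway(method1, method2, method3)
print(f'F = {f_stat:.2f}, p = {p:.4g}')  # a small p-value rejects H0: all means equal
```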

3) Two-way ANOVA

  • One-way ANOVA compares multiple population means, but in essence it asks whether factor $A$ is significant. If it is significant, the population means are unequal because of factor $A$; if not, factor $A$ cannot make them unequal.
  • If we increase the number of factors to two, the ANOVA becomes a two-way ANOVA. Notably, a two-way ANOVA explores not only whether each of the two factors is significant, but also whether their interaction term is significant (similar to regression analysis). For this kind of multi-way analysis of variance, borrowing linear regression models can achieve twice the result with half the effort.

3. Small exercises

Three lathes produce the same type of ball. We sample 13, 14, and 16 balls from them respectively; the measured diameters are:
Lathe A: 15.0, 14.5, 15.2, 15.5, 14.8, 15.1, 15.2, 14.8, 13.9, 16.0, 15.1, 14.5, 15.2;
Lathe B: 15.2, 15.0, 14.8, 15.2, 15.0, 15.0, 14.8, 15.1, 14.8, 15.0, 13.7, 14.1, 15.5, 15.9;
Lathe C: 14.6, 15.0, 14.7, 13.9, 13.8, 15.5, 15.5, 16.2, 16.1, 15.3, 15.4, 15.9, 15.2, 16.0, 14.8, 14.9

Suppose the significance level is $\alpha = 0.01$. Q:

  1. Do the diameters of balls produced by lathe A/B obey a normal distribution?

  2. Is there a significant difference in the variance of the ball diameter produced by the A/B lathe?

  3. Is there a significant difference in the diameter of the balls produced by the A/B lathe?

  4. Is there a significant difference among the ball diameters produced by the three lathes? Treating this as a one-way ANOVA, what is the factor in this question?
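These exercises can be attacked with the tests introduced above. The sketch below is one possible approach (the choice of Levene's test for question 2 is our own, since the text does not name a specific variance test):

```python
import numpy as np
from scipy import stats

a = np.array([15.0, 14.5, 15.2, 15.5, 14.8, 15.1, 15.2, 14.8, 13.9, 16.0, 15.1, 14.5, 15.2])
b = np.array([15.2, 15.0, 14.8, 15.2, 15.0, 15.0, 14.8, 15.1, 14.8, 15.0, 13.7, 14.1, 15.5, 15.9])
c = np.array([14.6, 15.0, 14.7, 13.9, 13.8, 15.5, 15.5, 16.2, 16.1, 15.3, 15.4, 15.9, 15.2, 16.0, 14.8, 14.9])

alpha = 0.01

# Q1: normality of A and B (Shapiro-Wilk, since the samples are small)
_, p_a = stats.shapiro(a)
_, p_b = stats.shapiro(b)

# Q2: equality of variances of A and B (Levene's test, robust to non-normality)
_, p_var = stats.levene(a, b)

# Q3: equality of the two means (two-sample t-test)
_, p_mean = stats.ttest_ind(a, b)

# Q4: one-way ANOVA across the three lathes; the factor is "lathe" (3 levels)
_, p_anova = stats.f_oneway(a, b, c)

for name, p in [('normality A', p_a), ('normality B', p_b),
                ('equal variances A/B', p_var), ('equal means A/B', p_mean),
                ('ANOVA A/B/C', p_anova)]:
    print(f'{name}: p = {p:.4f}, reject H0: {p < alpha}')
```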

Attachment: time schedule

Task content time
Task01 Hypothesis Testing 1: Methodology and Unary Numerical Testing 8-13 - Thursday 8-18
Task02 Hypothesis Test 2: Multivariate Numerical Vector Test 8-29 - Saturday 8-20
Task03 Hypothesis Test 3: Categorical Data Testing 8-21 - 8-22 Monday
Task04 Applied Stochastic Processes and Simulation Systems 8-23 - 8-25 Thursday
Task05 Financial Quantitative Analysis and Stochastic Simulation 8-26 - Sunday 9-28


Origin blog.csdn.net/qq_35812205/article/details/126325831