Significance test [t-test, analysis of variance, ks test]

Significance test [t-test, analysis of variance, ks test]

0. Directory

1 Basic definition of significance test (what?)

2. Significance of using significance test (why?)

3. The specific operation process of the significance test (how?)

1. Basic definition of significance test

  • Statistical hypothesis testing
    • Make an assumption in advance about the parameters of the population (random variable) or the overall distribution form, and then use the sample information to judge whether the assumption is reasonable
  • Significance test
    • A type of statistical hypothesis testing
    • Significance test is a method used to detect whether there is a difference between the experimental group and the control group in a scientific experiment and whether the difference is significant.
  • Statistical assumptions must be made before using the significance test, that is, null hypothesis/null hypothesis/null hypothesis
  • Null hypothesis/null hypothesis/null hypothesis (null hypothesis)
    • There is no significant difference between the data results
    • Refers to pre-established assumptions when performing a statistical test. When the null hypothesis is established, the relevant statistics should obey a certain known probability distribution.
    • When the calculated value of the statistic falls into the negative domain, it can be known that a small probability event has occurred, and the null hypothesis should be rejected.
  • 若原假设为真,而检验的结论却劝你放弃原假设。此时,我们把这种错误称之为第一类错误。通常把第一类错误出现的概率记为α
  • 若原假设不真,而检验的结论却劝你采纳原假设。此时,我们把这种错误称之为第二类错误。通常把第二类错误出现的概率记为β
  • Usually only the maximum probability α of making the first type of error is limited, and the probability β of making the second type of error is not considered. We call such a hypothesis test a significance test , and the probability α is called a significance level.

2. Significance of using significance test

  • Examples
    • A fan wants to evaluate the online influence of Ronaldo and Messi. The following are the number of likes/comments they received after the monthly social network release in 2017. I want to know whether there is any obvious difference between the two
    • CR7= {23,25,26,27,23,24,22,23,25,29,30,32}
    • Messi= {24,25,23,26,27,25,25,28,30,31,29,28}
    • According to the definition of the null hypothesis, the assumption that "there is no significant difference in the number of likes between the two people" is made, and the final calculation shows that the p_value of the variance test = 0.459, which means that there is no significant difference in the number of likes between the two

3. The specific operation process of the significance test

variance analysis

  • Analysis of Variance (ANOVA for short), also known as "Analysis of Variance", was invented by RA Fisher and is used to test the significance of the difference between the means of two or more samples.
  • In the case of significance level α = 0.05, p > 0.05 accepts the null hypothesis, p value < 0.05 rejects the null hypothesis
  • The null hypothesis is that there is no significant difference between the two. Since p=0.459>0.05, the null hypothesis is accepted, that is, there is no significant difference between the two
  • If the p value here is less than 0.05, then the null hypothesis must be rejected, that is, there is a significant difference between the two
  • Another understanding of p_value
    • The p_value=0.459 in the example means that the probability of accidental factors causing such a difference in the data is 0.459, which is much larger than 0.05. Then it means that accidental factors are likely to cause this difference, so there is no difference between the data itself.
    '''
    
     方差齐性检验  在显著性水平α =0.05的情况下,p>0.05接受原假设, 所以接受原假设,即样本集B和样本集H间不存在显著性差异
    
    '''
    
    from scipy import stats  # 导入相应模块
    
    v3=[23,25,26,27,23,24,22,23,25,29,30,32]
    v4=[24,25,23,26,27,25,25,28,30,31,29,28]
    
    stats.levene(v3,v4, center="mean")
    fVal, pSD = stats.levene(v3,v4, center="mean")
    
    print("ANOVA-0",fVal, pSD)
    

    Output:
    0.5671069450362157
    0.45939425229350794

T test (T-Test)

  • The t-test is used to determine whether there is a significant difference between the means of two variables and to judge whether they belong to the same distribution
  • two-tailed test
  • The function ttest_ind() takes two samples of the same size and produces a tuple of t-statistics and p-values
  • Find whether given values ​​v1 and v2 come from the same distribution:
    '''
    
     T-test 在显著性水平α =0.05的情况下,p>0.05接受原假设, 所以接受原假设,即样本集B和样本集H间不存在显著性差异
    
    '''
    
    v3=[23,25,26,27,23,24,22,23,25,29,30,32]
    v4=[24,25,23,26,27,25,25,28,30,31,29,28]
    
    import numpy as np
    from scipy.stats import ttest_ind
    from scipy import stats
    
    res = ttest_ind(v3, v4)
    print(res)
    

    Output result
    Ttest_indResult(statistic=-0.8599394154935148, pvalue=0.3990967787539713)

KS test

  • The KS test is used to check whether a given value fits the distribution

  • The function receives two arguments; the value of the test and the CDF

    • CDF is the cumulative distribution function (Cumulative Distribution Function), also known as the distribution function. CDF can be a string or a callable that returns a probability.
  • Can be used as a one-tailed or two-tailed test, by default it is a two-tailed test. We can pass parameter substitutions as strings with one of both sides, less than or greater than.

  • Finds if a given value follows a normal distribution

    import numpy as np
    from scipy.stats import kstest
     
    v = np.random.normal(size=100)
     
    res = kstest(v, 'norm')
     
    print(res)
    

    Output result
    KstestResult
    (statistic=0.047798701221956841, pvalue=0.97630967161777515)

Guess you like

Origin blog.csdn.net/crist_meng/article/details/129368658