The SciPy library provides a variety of normality tests and hypothesis testing methods

The SciPy library provides a variety of normality tests and hypothesis testing methods. The following is a list of some commonly used ones:

Normality test methods:

  1. Shapiro-Wilk test: scipy.stats.shapiro
  2. Anderson-Darling test: scipy.stats.anderson
  3. Kolmogorov-Smirnov test: scipy.stats.kstest
  4. D'Agostino-Pearson test: scipy.stats.normaltest
  5. Lilliefors test: statsmodels.stats.diagnostic.lilliefors (not part of SciPy; provided by statsmodels)

When testing whether data are normally distributed, the commonly used methods fall into three broad categories, summarized below:

  1. Comprehensive statistical tests:

    • Shapiro-Wilk test: Based on the W statistic; evaluates whether the data conform to a normal distribution and performs well across a wide range of sample sizes.
    • D'Agostino test: Combines skewness and kurtosis information to test for normality; suitable for medium and large sample sizes.
    • Shapiro-Francia test: Uses the W' statistic to assess normality; especially suitable for large samples.
    • Lilliefors test: A variant of the Kolmogorov-Smirnov test for the common case where the mean and variance of the normal distribution are estimated from the data.
    • Ryan-Joiner test: Based on the correlation between the ordered data and their expected normal scores; closely related to the Shapiro-Wilk test.
  2. Goodness-of-fit tests for the normal distribution:

    • Kolmogorov-Smirnov test: Compares the empirical cumulative distribution function of the observed data with the theoretical normal CDF; suitable for large samples.
    • Anderson-Darling test: An extension of the Kolmogorov-Smirnov test that gives more weight to the tails of the distribution.
    • Cramér-von Mises test: Evaluates the fit between the observed data and the theoretical normal distribution, taking differences across the entire distribution into account.
    • D'Agostino's K-squared test: An omnibus test that combines skewness and kurtosis into a single K² statistic (this is what scipy.stats.normaltest implements).
  3. Graphical methods (normal probability plots):

    • Q-Q plot (Quantile-Quantile): Compares the quantiles of the observed data with the quantiles of the normal distribution; departures from a straight line suggest non-normality.
    • P-P plot (Probability-Probability): Similar to the Q-Q plot, but compares cumulative probabilities.
    • S-P plot (Survival-Probability): Also similar to the Q-Q plot, but compares survival probabilities.

These methods provide a diverse toolkit for assessing whether data follow a normal distribution; choosing among them depends on sample size and data characteristics.
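The graphical methods above can be computed with scipy.stats.probplot, which returns the theoretical and ordered sample quantiles for a Q-Q plot (pass a matplotlib Axes via plot= to actually draw it). A minimal sketch, assuming normally distributed example data:

```python
from scipy import stats
import numpy as np

# Create a sample dataset
rng = np.random.default_rng(0)
data = rng.normal(0, 1, 100)

# Compute Q-Q plot coordinates against the normal distribution;
# with fit=True (the default) a least-squares line is also fitted
(osm, osr), (slope, intercept, r) = stats.probplot(data, dist='norm')

# r is the correlation between theoretical and sample quantiles;
# values close to 1 suggest the data are approximately normal
print(f"Q-Q correlation coefficient r = {r:.4f}")
```

Passing `plot=plt` (a matplotlib module or Axes) draws the plot directly; the fitted correlation coefficient `r` gives a rough numeric summary of how straight the Q-Q line is.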

Hypothesis testing methods:

  1. Independent samples t-test: scipy.stats.ttest_ind
  2. Paired samples t-test: scipy.stats.ttest_rel
  3. One-sample t-test: scipy.stats.ttest_1samp
  4. Analysis of Variance (ANOVA): scipy.stats.f_oneway
  5. Kruskal-Wallis test: scipy.stats.kruskal
  6. Mann-Whitney U test: scipy.stats.mannwhitneyu
  7. Wilcoxon signed rank test: scipy.stats.wilcoxon
  8. Chi-square test: scipy.stats.chisquare
  9. Fisher's exact test: scipy.stats.fisher_exact

These methods cover the normality tests and hypothesis tests most commonly used in statistical analysis. You can choose the appropriate method based on your specific data and research question. Each method has its own assumptions and prerequisites, so apply them with care.

Here is a brief description of these different normality testing methods:

  1. Shapiro-Wilk test (scipy.stats.shapiro):

    • The Shapiro-Wilk test is a statistical test of whether data come from a normal distribution.
    • The Shapiro-Wilk statistic is based on the difference between the observed data and the values expected under a normal distribution.
    • Interpretation: If the p-value is less than the chosen significance level (usually 0.05), the null hypothesis is rejected, indicating that the data do not follow a normal distribution.
      from scipy import stats
      import numpy as np
      
      # Create a sample dataset
      data = np.random.normal(0, 1, 100)
      
      # Run the Shapiro-Wilk normality test
      stat, p = stats.shapiro(data)
      
      # Report the result
      if p < 0.05:
          print("Data do not follow a normal distribution")
      else:
          print("Data may follow a normal distribution")
      
  2. Anderson-Darling test (scipy.stats.anderson):

    • The Anderson-Darling test is also used to test whether data come from a normal distribution.
    • The Anderson-Darling statistic is based on the difference between the empirical distribution of the data and the expected normal distribution, with extra weight on the tails.
    • Interpretation: If the Anderson-Darling statistic exceeds the critical value at a given significance level, the null hypothesis is rejected at that level, indicating that the data do not follow a normal distribution.
      from scipy import stats
      import numpy as np
      
      # Create a sample dataset
      data = np.random.normal(0, 1, 100)
      
      # Run the Anderson-Darling normality test
      result = stats.anderson(data)
      
      # Report the result; for the normal distribution,
      # critical_values[2] is the critical value at the 5% significance level
      print("Anderson-Darling statistic:", result.statistic)
      print("Critical values:", result.critical_values)
      if result.statistic > result.critical_values[2]:
          print("Data do not follow a normal distribution")
      else:
          print("Data may follow a normal distribution")
      
  3. Kolmogorov-Smirnov test (scipy.stats.kstest):

    • The Kolmogorov-Smirnov test checks whether data come from a specified probability distribution, including the normal distribution.
    • The test is based on the largest difference between the empirical and theoretical cumulative distribution functions.
    • Interpretation: If the p-value is less than the chosen significance level, the null hypothesis is rejected, indicating that the data do not come from the specified distribution.
      from scipy import stats
      import numpy as np
      
      # Create a sample dataset
      data = np.random.normal(0, 1, 100)
      
      # Run the Kolmogorov-Smirnov test against the standard normal N(0, 1);
      # pass args=(mean, std) to test against other normal parameters
      stat, p = stats.kstest(data, 'norm')
      
      # Report the result
      if p < 0.05:
          print("Data do not follow a normal distribution")
      else:
          print("Data may follow a normal distribution")
      
  4. D'Agostino-Pearson test (scipy.stats.normaltest):

    • The D'Agostino-Pearson test is also used to test whether data come from a normal distribution.
    • The test statistic combines the sample skewness and kurtosis.
    • Interpretation: If the p-value is less than the chosen significance level, the null hypothesis is rejected, indicating that the data do not follow a normal distribution.
      from scipy import stats
      import numpy as np
      
      # Create a sample dataset
      data = np.random.normal(0, 1, 100)
      
      # Run the D'Agostino-Pearson normality test
      stat, p = stats.normaltest(data)
      
      # Report the result
      if p < 0.05:
          print("Data do not follow a normal distribution")
      else:
          print("Data may follow a normal distribution")
      
      
  5. Lilliefors test (statsmodels.stats.diagnostic.lilliefors):

    • The Lilliefors test checks whether data come from a specific distribution family (usually the normal) when its parameters are estimated from the data. Note that it is not part of scipy.stats; it is provided by the statsmodels package.
    • Interpretation: If the p-value is less than the chosen significance level, the null hypothesis is rejected, indicating that the data do not come from the specified distribution.
      from statsmodels.stats.diagnostic import lilliefors
      import numpy as np
      
      # Create a sample dataset
      data = np.random.normal(0, 1, 100)
      
      # Run the Lilliefors normality test
      # (provided by statsmodels, not scipy.stats)
      stat, p = lilliefors(data)
      
      # Report the result
      if p < 0.05:
          print("Data do not follow a normal distribution")
      else:
          print("Data may follow a normal distribution")
      

Choose among these tests based on your needs, but note that the interpretation of results can be affected by sample size, data distribution, and significance level. Using these methods properly often requires a solid understanding of their principles and assumptions.
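The Cramér-von Mises test listed earlier is also available in SciPy (version 1.6 and later) as scipy.stats.cramervonmises. A minimal sketch against the standard normal distribution, using generated example data:

```python
from scipy import stats
import numpy as np

# Create a sample dataset
rng = np.random.default_rng(42)
data = rng.normal(0, 1, 100)

# Run the Cramér-von Mises test against the standard normal N(0, 1);
# pass args=(mean, std) to test against other normal parameters
result = stats.cramervonmises(data, 'norm')

print("Cramér-von Mises statistic:", result.statistic)
print("p-value:", result.pvalue)
if result.pvalue < 0.05:
    print("Data do not follow a normal distribution")
else:
    print("Data may follow a normal distribution")
```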

----------------------------

Hypothesis testing methods:

Here is a brief introduction to these hypothesis testing methods and when to use them:

  1. Independent samples t-test (scipy.stats.ttest_ind):

    • When to use: Compares the means of two independent (unrelated) groups of samples.
    • Hypothesis: Tests whether the means of two independent samples differ significantly.
  2. Paired samples t-test (scipy.stats.ttest_rel):

    • When to use: Compares the mean difference between two related (paired) samples, such as before-and-after measurements on the same group of people.
    • Hypothesis: Tests whether the means of two related samples differ significantly.
  3. One-sample t-test (scipy.stats.ttest_1samp):

    • When to use: Tests whether the mean of a sample differs significantly from a known reference value (or theoretical mean).
    • Hypothesis: Tests whether the mean of a single sample differs from a given theoretical mean.
  4. Analysis of variance (ANOVA) (scipy.stats.f_oneway):

    • When to use: Compares the means of three or more groups of samples.
    • Hypothesis: Tests whether there is a significant difference among the group means.
  5. Kruskal-Wallis test (scipy.stats.kruskal):

    • When to use: Compares three or more independent groups, typically for non-normally distributed data; a non-parametric alternative to one-way ANOVA.
    • Hypothesis: Tests whether the distributions of the independent groups differ significantly.
  6. Mann-Whitney U test (scipy.stats.mannwhitneyu):

    • When to use: Compares two independent groups, typically for non-normally distributed data; often described as a test of medians.
    • Hypothesis: Tests whether values in one group tend to be larger than values in the other.
  7. Wilcoxon signed rank test (scipy.stats.wilcoxon):

    • When to use: Compares two paired samples, typically for non-normally distributed paired data.
    • Hypothesis: Tests whether the paired differences are centered on zero, i.e. whether the paired samples differ significantly.
  8. Chi-square test (scipy.stats.chisquare):

    • When to use: Compares observed frequencies with expected frequencies; typically used to assess the goodness of fit of categorical data.
    • Hypothesis: Tests whether the observed frequencies differ significantly from the expected frequencies.
  9. Fisher's exact test (scipy.stats.fisher_exact):

    • When to use: Tests the association between two categorical variables in a 2x2 contingency table; typically used for small samples.
    • Hypothesis: Tests whether the two categorical variables are associated.

These test methods are suitable for different types of data and research questions. You can choose the appropriate method for statistical analysis based on the nature of the data and the purpose of the research.

  1. Independent samples t-test: scipy.stats.ttest_ind
    from scipy import stats
    import numpy as np
    
    # Create two groups of sample data
    group1 = np.array([25, 30, 35, 40, 45])
    group2 = np.array([20, 28, 32, 38, 42])
    
    # Run the independent samples t-test
    t_stat, p_value = stats.ttest_ind(group1, group2)
    
    # Report the result
    if p_value < 0.05:
        print("The group means differ significantly")
    else:
        print("No significant difference between the group means")
    
  2. Paired samples t-test: scipy.stats.ttest_rel
    from scipy import stats
    import numpy as np
    
    # Create two groups of paired data
    before = np.array([30, 32, 34, 36, 38])
    after = np.array([28, 31, 35, 37, 40])
    
    # Run the paired samples t-test
    t_stat, p_value = stats.ttest_rel(before, after)
    
    # Report the result
    if p_value < 0.05:
        print("The paired samples differ significantly")
    else:
        print("No significant difference between the paired samples")
    
  3. One-sample t-test: scipy.stats.ttest_1samp
    from scipy import stats
    import numpy as np
    
    # Create a sample dataset
    data = np.random.normal(0, 1, 100)
    
    # Run the one-sample t-test against a population mean of 0
    t_stat, p_value = stats.ttest_1samp(data, 0)
    
    # Report the result
    if p_value < 0.05:
        print("The sample mean differs significantly from zero")
    else:
        print("No significant difference between the sample mean and zero")
    
  4. Analysis of Variance (ANOVA): scipy.stats.f_oneway
    from scipy import stats
    import numpy as np
    
    # Create several groups of sample data
    group1 = np.random.normal(0, 1, 100)
    group2 = np.random.normal(1, 1, 100)
    group3 = np.random.normal(2, 1, 100)
    
    # Run the one-way ANOVA
    f_stat, p_value = stats.f_oneway(group1, group2, group3)
    
    # Report the result
    if p_value < 0.05:
        print("The groups differ significantly")
    else:
        print("No significant difference among the groups")
    
  5. Kruskal-Wallis test: scipy.stats.kruskal
    from scipy import stats
    
    # Create several groups of sample data
    group1 = [25, 30, 35, 40, 45]
    group2 = [20, 28, 32, 38, 42]
    group3 = [15, 18, 22, 28, 32]
    
    # Run the Kruskal-Wallis test
    h_stat, p_value = stats.kruskal(group1, group2, group3)
    
    # Report the result
    if p_value < 0.05:
        print("The groups differ significantly")
    else:
        print("No significant difference among the groups")
    
  6. Mann-Whitney U test: scipy.stats.mannwhitneyu
    from scipy import stats
    
    # Create two groups of sample data
    group1 = [25, 30, 35, 40, 45]
    group2 = [20, 28, 32, 38, 42]
    
    # Run the Mann-Whitney U test
    u_stat, p_value = stats.mannwhitneyu(group1, group2)
    
    # Report the result
    if p_value < 0.05:
        print("The two groups differ significantly")
    else:
        print("No significant difference between the two groups")
    
  7. Wilcoxon signed rank test: scipy.stats.wilcoxon
    from scipy import stats
    
    # Create two groups of paired data
    before = [25, 30, 35, 40, 45]
    after = [20, 28, 32, 38, 42]
    
    # Run the Wilcoxon signed rank test
    w_stat, p_value = stats.wilcoxon(before, after)
    
    # Report the result
    if p_value < 0.05:
        print("The paired data differ significantly")
    else:
        print("No significant difference between the paired data")
    
  8. Chi-square test: scipy.stats.chisquare
    from scipy import stats
    import numpy as np
    
    # Create observed and expected frequency arrays
    # (their totals must match for chisquare)
    observed = np.array([20, 25, 30])
    expected = np.array([15, 30, 30])
    
    # Run the chi-square goodness-of-fit test
    chi_stat, p_value = stats.chisquare(observed, f_exp=expected)
    
    # Report the result
    if p_value < 0.05:
        print("Observed frequencies differ significantly from expected")
    else:
        print("No significant difference between observed and expected frequencies")
    
  9. Fisher's exact test: scipy.stats.fisher_exact
    from scipy import stats
    
    # Create a 2x2 contingency table
    contingency_table = [[10, 5], [3, 15]]
    
    # Run Fisher's exact test
    odds_ratio, p_value = stats.fisher_exact(contingency_table)
    
    # Report the result
    if p_value < 0.05:
        print("The two categorical variables are associated")
    else:
        print("No significant association between the two categorical variables")
    


Origin blog.csdn.net/book_dw5189/article/details/133473852