Data mining and statistical analysis - T-test, normality test and consistency test - code reproduction

The T-test is a statistical test used to determine whether there is a statistically significant difference between the means of two groups of samples. The following is a detailed introduction to the T-test:

Definition:

The T test is a parametric test based on the premise that the data approximates a normal distribution. It determines whether there is a significant difference between the means of two sample groups by calculating the T statistic and comparing it to a specific distribution (T distribution).

Main types:

One-sample T-test: Compares the mean of a sample to a known or hypothesized mean.

Independent-samples T-test (also known as the two-independent-samples T-test): Compares the means of two independent samples, for example two groups of people receiving different treatments.

Paired samples T-test (also called correlated-samples T-test): Compares the means of the same group of people or entities at two different time points or conditions. For example, student performance on pre- and post-tests.
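Each of the three types maps directly to a function in scipy.stats. A minimal sketch with made-up data (the numbers here are illustrative, not from the article):

```python
import numpy as np
from scipy.stats import ttest_1samp, ttest_ind, ttest_rel

rng = np.random.default_rng(0)
sample = rng.normal(5.2, 1.0, 30)    # one group, hypothesized mean 5.0
group_a = rng.normal(5.0, 1.0, 30)   # two independent groups
group_b = rng.normal(5.5, 1.0, 30)
before = rng.normal(5.0, 1.0, 30)    # paired: the same subjects measured twice
after = before + rng.normal(0.3, 0.5, 30)

print(ttest_1samp(sample, popmean=5.0))  # one-sample T-test
print(ttest_ind(group_a, group_b))       # independent-samples T-test
print(ttest_rel(before, after))          # paired-samples T-test
```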

Premise:

The data are approximately normally distributed.
For an independent-samples T-test, the variances of the two samples should be similar (homogeneity of variances).
The data should be continuous.
In the paired-samples T-test, the differences should follow a normal distribution.
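These premises can be checked before running the test. A sketch using scipy.stats (shapiro for normality, levene for homogeneity of variances), on simulated data standing in for real samples:

```python
import numpy as np
from scipy.stats import shapiro, levene

rng = np.random.default_rng(1)
group_a = rng.normal(50, 10, 30)
group_b = rng.normal(55, 10, 30)

# Normality of each sample (Shapiro-Wilk)
for name, g in [("A", group_a), ("B", group_b)]:
    stat, p = shapiro(g)
    print(f"Group {name}: Shapiro-Wilk p = {p:.3f}")

# Homogeneity of variances (Levene's test); a large p-value
# is consistent with similar variances
stat, p = levene(group_a, group_b)
print(f"Levene p = {p:.3f}")
```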

Calculation:

The formula for the T statistic differs slightly depending on the type of test, but the basic idea is the same: the mean of the difference divided by the standard error of the difference. This expresses the size of the difference in sample means relative to the variation expected by chance.
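For the independent-samples case, "mean of the difference divided by the standard error of the difference" can be written out directly. A sketch with illustrative numbers, checked against scipy's equal-variance ttest_ind:

```python
import numpy as np
from scipy.stats import ttest_ind

a = np.array([5.1, 4.9, 5.6, 5.2, 4.8])
b = np.array([5.9, 6.1, 5.7, 6.3, 6.0])

# Pooled-variance form of the two-sample T statistic
na, nb = len(a), len(b)
sp2 = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
se = np.sqrt(sp2 * (1 / na + 1 / nb))
t_manual = (a.mean() - b.mean()) / se

t_scipy, p = ttest_ind(a, b)   # equal_var=True by default
print(t_manual, t_scipy)       # the two values agree
```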

Interpret the results:

The result gives a T value and a p-value. The p-value is the probability of observing a difference at least as extreme as the one in the data if the null hypothesis (that there is no difference) were true.
If the p-value is less than a predetermined significance level (usually 0.05), then we reject the null hypothesis and consider that there is a significant difference between the two groups.
The sign of the T value tells us which group has a higher mean.

Case

Background: Suppose we want to study whether a new method of teaching mathematics has a positive impact on student achievement. To this end, we randomly selected two groups of students, one using traditional teaching methods (control group) and the other using new teaching methods (experimental group). After the course, both groups of students were tested.

Data:
Scores for control group (traditional method): 85, 88, 75, 66, 90, 78, 77, 79, 80
Scores for the experimental group (new method): 92, 95, 90, 85, 97, 91, 88, 90, 93

Step 1: First, we need to calculate the mean of the two groups.
Control group mean ≈ 79.8
Experimental group mean ≈ 91.2

Step 2: Calculate the T statistic. This requires more complex calculations involving the variance of the two groups, sample size, etc. But for simplicity, let's assume that the calculated T value is 3.5.

Step 3: Find a T-distribution table or use statistical software to determine the p-value. Let's say we get a p-value of 0.003.
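Rather than assuming the T value and p-value, the calculation can be run directly on the case data with scipy (the exact numbers will differ somewhat from the illustrative 3.5 and 0.003 above):

```python
from scipy.stats import ttest_ind

control = [85, 88, 75, 66, 90, 78, 77, 79, 80]
experimental = [92, 95, 90, 85, 97, 91, 88, 90, 93]

t_stat, p_value = ttest_ind(experimental, control)
print(f"T = {t_stat:.2f}, p = {p_value:.4f}")
# p falls far below 0.05, so the difference is significant
```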

explain:

A T value of 3.5 means that the mean difference between the experimental group and the control group is 3.5 times its standard error. This is a relatively large value, indicating a significant difference between the two groups.
A p-value of 0.003 is much smaller than the common significance level of 0.05, which means that the data we observed is statistically significant.

Conclusion:
Based on the results of the T-test, we have enough evidence to reject the null hypothesis (that the two teaching methods have the same effect) and conclude that the new teaching method has a positive impact on students' mathematics scores.

It should be noted that this conclusion is only based on our sample data. Real education research will involve more control variables, larger sample sizes, and more sophisticated statistical methods to ensure the accuracy and reliability of conclusions.

The above is just a simple case; now let's explore and implement a more complex case in code.

Suppose you are a drug researcher studying the effects of a new drug on blood pressure. To do this, you conduct a randomized, double-blind, controlled experiment.

You randomly select 50 patients with high blood pressure: 25 receive the new drug and the other 25 receive a placebo. Each patient's blood pressure is measured before and after the experiment.

Task:

You want to know whether a new drug has a significant lowering effect on blood pressure.

import numpy as np
from scipy.stats import ttest_rel

# Hypothetical data
np.random.seed(42)  # makes the results reproducible

# Generate simulated data
baseline_bp = np.random.normal(150, 20, 25)  # baseline blood pressure
after_treatment_bp = baseline_bp - np.random.normal(10, 5, 25)  # blood pressure after treatment

# Run the paired-samples T-test
t_stat, p_value = ttest_rel(baseline_bp, after_treatment_bp)

print("T-statistic:", t_stat)
print("P-value:", p_value)

if p_value < 0.05:
    print("The new drug has a significant blood-pressure-lowering effect.")
else:
    print("The new drug has no significant blood-pressure-lowering effect.")

We use the ttest_rel function from scipy.stats, which is designed specifically for paired-sample T-tests.
First, we generate simulated data representing each patient's baseline and post-treatment blood pressure. Then we use the T-test to compare the two sets of data and draw a conclusion based on the p-value.
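Since the research question is directional (does the drug lower blood pressure?), the alternative parameter of ttest_rel (available in scipy >= 1.6) can make the test one-sided; a sketch on the same simulated data:

```python
import numpy as np
from scipy.stats import ttest_rel

np.random.seed(42)
baseline_bp = np.random.normal(150, 20, 25)
after_treatment_bp = baseline_bp - np.random.normal(10, 5, 25)

# One-sided test: H1 is that baseline blood pressure exceeds
# post-treatment blood pressure
t_stat, p_value = ttest_rel(baseline_bp, after_treatment_bp,
                            alternative='greater')
print("one-sided p:", p_value)
```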

In practice, however, most experiments involve comparisons across multiple groups or variables. When multiple variables are involved, a single T-test may no longer be appropriate or sufficient to draw conclusions, and more complex statistical methods are needed. Here are a few common methods and how they handle multiple variables:

Multiple T-tests:

When we compare multiple groups, multiple T-tests may be performed. But this increases the risk of making a Type I error (incorrectly rejecting the true null hypothesis).
Solution: The Bonferroni correction is a simple way to adjust the significance level by dividing the alpha value (such as 0.05) by the number of comparisons.

For multiple T tests, Bonferroni correction is a method to control the familywise error rate (FWER). The basic idea is that when we conduct multiple hypothesis tests, we control the overall risk of making a Type I error by lowering the significance standard (α value) of each test.

Here are the steps on how to implement Bonferroni correction:

1. Determine the original α value; typically, α is chosen to be 0.05.
2. Determine the number of comparisons to be made, denoted m.
3. Adjust the α value: the new α value is the original α divided by the number of comparisons, i.e. new α = original α / m.
4. Perform each T-test.
5. Compare the p-value obtained for each test with the adjusted α value. A test result is considered significant if its p-value is less than the adjusted α value.

Example:

Suppose you want to conduct T-tests among 5 groups of data; you then need to make a total of 10 comparisons (5*(5-1)/2 = 10). If the original α value is 0.05, the Bonferroni-corrected α value is 0.05/10 = 0.005. This means that the p-value of each T-test must be less than 0.005 to be considered significant.

Python code implementation:

Here is a simplified example using Python's scipy.stats library to perform a multiple T-test, applying the Bonferroni correction:

import numpy as np
from scipy.stats import ttest_ind

# Hypothetical data
np.random.seed(42)

group1 = np.random.normal(50, 10, 30)
group2 = np.random.normal(52, 10, 30)
group3 = np.random.normal(55, 10, 30)
group4 = np.random.normal(53, 10, 30)
group5 = np.random.normal(50, 10, 30)

groups = [group1, group2, group3, group4, group5]

# Number of pairwise comparisons
m = len(groups) * (len(groups) - 1) // 2

# Bonferroni-corrected α value
alpha_corrected = 0.05 / m

# Pairwise T-tests
for i in range(len(groups)):
    for j in range(i+1, len(groups)):
        t_stat, p_value = ttest_ind(groups[i], groups[j])
        print(f"Group {i+1} vs Group {j+1}: p-value = {p_value:.4f}")
        if p_value < alpha_corrected:
            print(f"Significant difference between Group {i+1} and Group {j+1}")
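statsmodels bundles the Bonferroni adjustment (and less conservative alternatives such as Holm) in multipletests, which takes the raw p-values and returns the reject decisions. A sketch on the same kind of simulated groups:

```python
import numpy as np
from scipy.stats import ttest_ind
from statsmodels.stats.multitest import multipletests

np.random.seed(42)
groups = [np.random.normal(mu, 10, 30) for mu in (50, 52, 55, 53, 50)]

# Collect raw p-values from all pairwise T-tests
pvals = []
for i in range(len(groups)):
    for j in range(i + 1, len(groups)):
        pvals.append(ttest_ind(groups[i], groups[j]).pvalue)

reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method='bonferroni')
print(reject)  # which comparisons remain significant after correction
```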

Analysis of Variance (ANOVA):

One-way ANOVA can be used when we want to compare the means of three or more groups.
If we want to consider two or more independent variables at the same time, we can use a two-way (factorial) ANOVA.
The premises of ANOVA are that the data should follow a normal distribution and that the variances of the groups should be equal (homogeneity of variances).
Analysis of covariance (ANCOVA):

ANCOVA allows us to compare the means of multiple groups while controlling for one or more continuous covariates.
This can help us correct for or eliminate the effects of certain variables, allowing us to see the effects of other factors more clearly.

  1. Basic principle:
    The basic idea of ANOVA is to divide the total variation of the data into two parts: variation between groups and variation within groups. An F statistic is then calculated from these two parts and used to determine whether the means of the different groups are equal.

Between-group variation: reflects the differences between different groups.
Within-group variation: reflects the differences within the same group.

  2. Hypotheses:
    Null hypothesis (H0): All group means are equal.
    Alternative hypothesis (Ha): The means of at least two groups are not equal.
  3. Premise assumptions:
    The data follow a normal distribution.
    The variances of the groups are equal, that is, the variances are homogeneous.
    Observations are independent.
  4. Types:
    One-way ANOVA: Used when we have only one categorical independent variable.
    Factorial (two-way or higher) ANOVA: Used when we have two or more categorical independent variables. (MANOVA, by contrast, refers to designs with two or more dependent variables.)
  5. Interpretation of results:
    If the p-value is less than a predetermined significance level (usually 0.05), we reject the null hypothesis and conclude that the means of at least two groups are not equal.
    The F statistic reflects the size of the between-group differences relative to the within-group differences. A larger F value indicates a larger difference between groups.
  6. Notes:
    If the result of the ANOVA is significant, it means that at least two groups differ, but it does not tell us which group differences are significant. To determine this, we need to perform a post hoc test such as the Tukey-Kramer method or the Bonferroni correction.
    When the data do not meet the assumptions of normality or homogeneity of variances, non-parametric alternatives such as the Kruskal-Wallis H test may be needed.
    ANOVA is a powerful tool for comparing the means of multiple groups, but when using it you need to ensure that its prerequisite assumptions are met, applying appropriate transformations or choosing alternative methods when necessary.
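A one-way ANOVA along these lines can be run with scipy.stats.f_oneway; a minimal sketch with three simulated groups:

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(0)
g1 = rng.normal(50, 5, 20)
g2 = rng.normal(55, 5, 20)
g3 = rng.normal(60, 5, 20)

f_stat, p_value = f_oneway(g1, g2, g3)
print(f"F = {f_stat:.2f}, p = {p_value:.4g}")
if p_value < 0.05:
    print("At least two group means differ; a post hoc test can show which.")
```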

When the result of an ANOVA is significant, it means that we have rejected the null hypothesis (that all group means are equal), but it does not tell us which differences between the groups are significant. To determine which groups are significantly different from each other, we need to perform a post hoc test (also called a multiple comparison test).

Here are some commonly used post hoc testing methods:

  • Tukey-Kramer method (Tukey's Honest Significant Difference, Tukey's HSD):

It compares all pairwise combinations of groups.
Tukey's HSD assumes equal sample sizes; the Tukey-Kramer variant extends it to unequal sample sizes.
It controls the familywise error rate, keeping it below the significance level α.

  • Bonferroni correction:

This is a conservative approach that adjusts the significance level by dividing the alpha value by the number of comparisons.
For example, if you have 5 sets of data (making 10 comparisons) and α=0.05, the significance level for each comparison is 0.005.

  • Scheffé’s Test:

This is a very conservative method that can be used for any combination of comparisons, including pairwise comparisons and comparisons of multiple groups.
Suitable for unbalanced designs, where the samples in each group are of different sizes.

  • Dunnett’s Test:

Dunnett’s Test is useful when you want to compare all other groups with a specific control group.
It adjusts the significance level to control for the error of multiple comparisons.

  • Newman-Keuls Test:

It performs all possible pairwise comparisons.
It is a stepwise procedure: it first considers the largest overall differences and then works through the pairwise comparisons.

Performing post hoc tests in Python:
Using the statsmodels library, we can perform a Tukey-Kramer post hoc test:

from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Example data
data = [83, 85, 84, 76, 88, 92, 95, 89, 90, 78, 83, 86]
groups = ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C']

tukey_results = pairwise_tukeyhsd(data, groups, alpha=0.05)

print(tukey_results)

This example gives pairwise comparison results for the three groups 'A', 'B', and 'C'. A pair of groups is considered significantly different if the p-value for that comparison is less than 0.05.

Regression analysis:

Regression analysis can be used when we are interested in the relationship between a continuous dependent variable and one or more independent variables (continuous or categorical).
Linear regression applies when the dependent variable is continuous, with one or more continuous independent variables.
Logistic regression is used when the dependent variable is categorical.

Mixed effects model:

Mixed-effects models can be used when our data have a nested structure or repeated measurements (e.g., measuring the same person multiple times).
This model can handle both fixed effects (the usual regression coefficients) and random effects (which describe the random variability in the data).

Principal component analysis (PCA) and factor analysis:

PCA or factor analysis can be used when we have a large number of related variables and want to reduce dimensions.
These methods can help us extract several key components or factors from multiple variables.
When dealing with multiple variables, the key is to choose a method appropriate to the data structure, study design, and research question. This often requires a solid understanding of statistical methods and a choice informed by the characteristics, distribution, and assumptions of the data.
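A short dimensionality-reduction sketch with scikit-learn's PCA, on simulated data where four variables are built from two underlying factors:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
base = rng.normal(0, 1, (100, 2))
# Four correlated variables driven by two underlying factors
data = np.column_stack([
    base[:, 0], base[:, 0] + 0.1 * rng.normal(size=100),
    base[:, 1], base[:, 1] + 0.1 * rng.normal(size=100),
])

pca = PCA(n_components=2)
reduced = pca.fit_transform(data)
print(pca.explained_variance_ratio_)  # most variance sits in two components
```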

Application scenarios:

For example, you may want to determine whether two different training methods have a significant impact on student performance, or whether there is a significant difference between the test scores of males and females.
The T-test is a fundamental tool in statistics, widely used in experimental research and data analysis to help researchers determine whether observed effects are more than mere chance.


Normality test:

Definition: Normality test is a statistical method used to determine whether a set of data approximates a normal distribution. The normal distribution, also known as the Gaussian distribution, is a key assumption behind many statistical techniques and methods, such as t-tests, ANOVA, and linear regression.

method:

Graphical method: Use a QQ plot (Quantile-Quantile plot) or a probability plot to visually assess whether the data follow a normal distribution.

Statistical tests:

Shapiro-Wilk test: suitable for small sample data.
Kolmogorov-Smirnov test: Can be used for any continuous distribution, but is less sensitive to normality testing than the Shapiro-Wilk test.
Anderson-Darling test: Similar to Kolmogorov-Smirnov, but gives more weight to differences in the tails.

Application scenario: Before performing a parametric analysis such as a t-test or ANOVA, a normality test is usually run first to verify the assumption that the data are normally distributed.
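All three tests are available in scipy.stats; a sketch on one simulated sample (note that estimating the mean and standard deviation from the sample, as done for the K-S test here, makes that test approximate):

```python
import numpy as np
from scipy.stats import shapiro, kstest, anderson

rng = np.random.default_rng(0)
data = rng.normal(100, 15, 200)

stat, p = shapiro(data)
print(f"Shapiro-Wilk: p = {p:.3f}")

# K-S against a normal with the sample's own mean and std
stat, p = kstest(data, 'norm', args=(data.mean(), data.std(ddof=1)))
print(f"Kolmogorov-Smirnov: p = {p:.3f}")

result = anderson(data, dist='norm')
print("Anderson-Darling statistic:", result.statistic)
```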

Consistency test:

Definition: Consistency test is a statistical method used to determine whether two or more data sets come from the same distribution. This is not limited to the normal distribution, but can be any other distribution.

method:

Kolmogorov-Smirnov test: This is a non-parametric method used to compare the cumulative distribution functions of two independent sets of data.
Chi-square goodness-of-fit test: Tests whether an observed frequency distribution differs significantly from an expected distribution.
Mann-Whitney U test (Wilcoxon rank-sum test): A non-parametric method for comparing whether two independent samples come from the same distribution.
Kruskal-Wallis H test: An extension of the Mann-Whitney U test for three or more independent samples.

Application scenario: When you want to know whether two or more sets of data come from the same population distribution, you can use the consistency test. For example, you might want to know whether the effects are the same across treatment groups.
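The methods above map directly to scipy.stats functions; a sketch comparing simulated samples drawn from the same and from shifted distributions:

```python
import numpy as np
from scipy.stats import ks_2samp, mannwhitneyu, kruskal

rng = np.random.default_rng(0)
a = rng.normal(50, 10, 40)
b = rng.normal(50, 10, 40)   # same distribution as a
c = rng.normal(60, 10, 40)   # shifted distribution

print("K-S (a vs b):", ks_2samp(a, b).pvalue)
print("Mann-Whitney U (a vs c):", mannwhitneyu(a, c).pvalue)
print("Kruskal-Wallis (a, b, c):", kruskal(a, b, c).pvalue)
```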

In summary, both tests are critical steps in validating data properties or model assumptions in data analysis, ensuring accuracy when you draw conclusions or use specific statistical methods.


Origin blog.csdn.net/weixin_47723732/article/details/133774312