Prerequisites for the T-test | Normality | Independence | Homogeneity of variances | Random sampling

The T-test is a statistical method used to decide whether there is a significant difference between the means of two groups of data. Before conducting a T-test, however, several prerequisites must be met to ensure the accuracy and reliability of the results. These prerequisites include:

  1. Normality: The T-test requires that the data within each group follow a normal distribution. Normality can be checked with statistical methods (such as the Shapiro-Wilk test) or graphical methods (such as a Q-Q plot). If the data do not follow a normal distribution, consider transforming the data or switching to a non-parametric test. The T-test is not very strict about this assumption when the sample size is large (usually greater than 30 or 40), because by the central limit theorem the distribution of the sample mean tends toward a normal distribution even when the population itself is not normal.

    For small samples, however, the normality assumption matters more, because the central limit theorem offers less protection. If a small sample fails the normality check, non-parametric methods or other corrective measures may be needed.

    Note that with large samples a Z-test approximation using the sample standard deviation S is reasonable; the exact sample-size threshold varies by field and research question. If the sample size is relatively small, a T-test is the more appropriate choice.

  2. Independence: Observations must be independent of each other. Observations in one group should not be affected by, or correlated with, observations in the other group, and there should be no repeated measurements of the same subject. If the samples are not independent, the results of the test may be inaccurate.

  3. Homogeneity of variances: The T-test assumes that the variances of the two groups are equal. This can be checked with a statistical test for homogeneity of variance, such as Levene's test. If the variances differ significantly, consider a modified T-test such as Welch's T-test.

  4. Random sampling: The data must be randomly sampled so that the sample is representative of the population and the results can be generalized to it. Random sampling helps avoid selection bias and makes the conclusions of the T-test more general.

If the data do not meet these prerequisites, the accuracy of the T-test results may suffer. In such cases, you can try a non-parametric method, such as the Wilcoxon rank sum test, to handle data that does not meet the prerequisites.
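As a minimal sketch of such a fallback (with made-up data), SciPy provides the Wilcoxon rank sum test as `scipy.stats.ranksums`; the closely related Mann-Whitney U test is `scipy.stats.mannwhitneyu`:

```python
import scipy.stats as stats

# Two small samples that may not be normally distributed (illustrative data)
group_A = [1.2, 3.4, 2.2, 5.9, 2.8, 3.1, 4.0]
group_B = [4.5, 6.1, 5.2, 7.8, 6.6, 5.9, 8.3]

# Wilcoxon rank sum test: compares the rank distributions of the two samples
# without assuming normality
statistic, p_value = stats.ranksums(group_A, group_B)

print("Wilcoxon rank sum test: statistic =", statistic, ", p-value =", p_value)
```

Because the test works on ranks rather than raw values, it is robust to outliers and skewed distributions, at some cost in power when the data really are normal.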

Before conducting a T-test, it is recommended to conduct data exploration and statistical tests to determine whether these prerequisites are met, and to take appropriate measures to handle situations where the conditions are not met. This ensures the reliability and validity of the T-test results.

Let us now walk through each prerequisite of the T-test in detail with a concrete example, and use Python to implement the corresponding test and handling.

Question 1: Normality

Normality is an important prerequisite for the T-test. We first need to check whether the data in each group follow a normal distribution, which we can do with the Shapiro-Wilk normality test. Suppose we have performance data for two groups, Group A and Group B, and we want to compare whether there is a significant difference between them.

import scipy.stats as stats
import numpy as np

# Generate example data
np.random.seed(0)
group_A = np.random.normal(0, 1, 50)
group_B = np.random.normal(0.5, 1, 50)

# Normality test
statistic_A, p_value_A = stats.shapiro(group_A)
statistic_B, p_value_B = stats.shapiro(group_B)

print("Group A normality test: Statistic =", statistic_A, ", p-value =", p_value_A)
print("Group B normality test: Statistic =", statistic_B, ", p-value =", p_value_B)

If the p-value is less than the significance level (usually 0.05), then we can reject the null hypothesis, indicating that the data does not follow a normal distribution. In this case, we may want to consider using non-parametric testing methods or try transforming the data.
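As a sketch of the transformation route (using simulated right-skewed data, not real measurements), a log transformation often helps positive, right-skewed data pass the normality check:

```python
import numpy as np
import scipy.stats as stats

# Simulated right-skewed (lognormal) data
np.random.seed(0)
skewed = np.random.lognormal(mean=0, sigma=1, size=50)

# Before transformation: normality is likely rejected for skewed data
_, p_before = stats.shapiro(skewed)

# After a log transformation the data is normal by construction here
_, p_after = stats.shapiro(np.log(skewed))

print("p-value before log transform:", p_before)
print("p-value after log transform: ", p_after)
```

In practice the choice of transformation (log, square root, Box-Cox, etc.) depends on the shape of the data, and the transformed scale must still be interpretable for the research question.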

Question 2: Independence

Independence is another prerequisite for the T-test. It is important to ensure that there are no correlations or confounding factors between the two groups. For example, when comparing the test scores of students in two different classes, each student should appear in only one group.
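Independence is enforced by study design rather than by a statistical test, but a simple bookkeeping check can catch obvious violations. A sketch with hypothetical student IDs (all identifiers invented here):

```python
# Hypothetical student IDs for two classes (illustrative data)
class_1_ids = {"s01", "s02", "s03", "s04", "s05"}
class_2_ids = {"s06", "s07", "s08", "s09", "s10"}

# Any overlap would mean repeated measurements on the same student,
# which violates the independence assumption of a two-sample T-test
overlap = class_1_ids & class_2_ids

if overlap:
    print("Warning: these students appear in both groups:", overlap)
else:
    print("No shared students; the groups are disjoint")
```

If the same subjects were measured twice (e.g. before and after an intervention), the paired samples T-test discussed later is the appropriate tool instead.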

Question 3: Homogeneity of variances

Homogeneity of variances is one of the prerequisites for the T-test. We can use Levene's test to check whether the variances of the two groups are equal. Suppose we have survival-time data for two groups of patients treated with different medications, and we want to compare them to see if they differ significantly.

import scipy.stats as stats
import numpy as np

# Generate example data
np.random.seed(1)
group_1 = np.random.normal(5, 2, 50)
group_2 = np.random.normal(5, 4, 50)

# Test for homogeneity of variances
statistic, p_value = stats.levene(group_1, group_2)

print("Levene's test: Statistic =", statistic, ", p-value =", p_value)

If the p-value is less than the significance level, we reject the assumption of homogeneity of variances, indicating that the variances of the two groups are not equal. In that case, we can consider a T-test that does not require equal variances, such as Welch's T-test.
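SciPy's `stats.ttest_ind` performs Welch's T-test when called with `equal_var=False`. A sketch that regenerates the unequal-variance groups from the Levene example and runs both variants:

```python
import numpy as np
import scipy.stats as stats

# Two groups with the same mean but clearly different variances
np.random.seed(1)
group_1 = np.random.normal(5, 2, 50)
group_2 = np.random.normal(5, 4, 50)

# Student's T-test (assumes equal variances) vs Welch's T-test
t_student, p_student = stats.ttest_ind(group_1, group_2)
t_welch, p_welch = stats.ttest_ind(group_1, group_2, equal_var=False)

print("Student's t:", t_student, ", p-value:", p_student)
print("Welch's t:  ", t_welch, ", p-value:", p_welch)
```

Welch's version adjusts the degrees of freedom (Welch-Satterthwaite) rather than pooling the variances, so it remains valid when the two groups have unequal spread or unequal sizes.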

Question 4: Random Sampling

Ensuring that the data are randomly sampled is a basic prerequisite for representative results. Random sampling means that every individual has an equal chance of being selected into the sample, free from interference by other factors.
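As an illustrative sketch (the population below is simulated), simple random sampling without replacement can be done with NumPy:

```python
import numpy as np

# A simulated population of 10,000 individuals (e.g. heights in cm)
rng = np.random.default_rng(42)
population = rng.normal(170, 10, size=10_000)

# Simple random sample without replacement: every individual
# has the same chance of being selected
sample = rng.choice(population, size=100, replace=False)

print("Population mean:", population.mean())
print("Sample mean:    ", sample.mean())
```

With a genuinely random sample, the sample mean is an unbiased estimate of the population mean; convenience samples offer no such guarantee.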

In summary, these prerequisites are crucial to the accuracy of the T-test. In practical applications, you should test and meet these prerequisites based on the characteristics of the data to ensure that your T-test results are reliable. If these conditions are not met, appropriate alternative methods or data processing techniques may be considered.

The Shapiro-Wilk normality test is a statistical method used to test whether data comes from a normal distribution. The null hypothesis of this test is that the data samples follow a normal distribution. If the p-value is less than the significance level (usually 0.05), then we can reject the null hypothesis, indicating that the data does not follow a normal distribution.

The following is sample code for using the Shapiro-Wilk normality test in Python:

import scipy.stats as stats
import numpy as np

# Generate example data
np.random.seed(0)
data = np.random.normal(0, 1, 100)

# Perform the Shapiro-Wilk normality test
statistic, p_value = stats.shapiro(data)

# Print the test result
print("Shapiro-Wilk test: Statistic =", statistic, ", p-value =", p_value)

# Judge normality from the p-value
alpha = 0.05
if p_value > alpha:
    print("The sample may come from a normal distribution (fail to reject normality)")
else:
    print("The sample does not come from a normal distribution (reject normality)")

In this example, we generate a random data sample that follows a normal distribution and then use the Shapiro-Wilk test to test whether it follows a normal distribution. Based on the p-value results, we can determine whether the data comes from a normal distribution.

Note that the Shapiro-Wilk test has high power for large samples, so even small, practically unimportant departures from normality can produce significant p-values; it is also suitable for small samples. If the p-value is less than the significance level, the data likely do not follow a normal distribution, and you may want to consider non-parametric statistical methods or an appropriate transformation of the data.

  • The paired samples t-test is used to compare whether the difference between two sets of paired data is significant. When performing a paired-samples t-test, the standard error must be calculated to evaluate whether the difference between the sample means exceeds the range of random variation.

Standard error is the standard deviation of a sample statistic; it reflects the deviation of the sample statistic from the population parameter. In a paired samples t-test we are concerned with the differences between paired observations, so we need to calculate the standard error of those differences.

import scipy.stats as stats

# Data
before_improvement = [25, 26, 24, 23, 22, 26, 27, 24, 25, 26]
after_improvement = [28, 27, 29, 30, 32, 27, 26, 30, 28, 29]

# Calculate differences
differences = [after - before for before, after in zip(before_improvement, after_improvement)]

# Calculate mean and standard deviation of differences
mean_difference = sum(differences) / len(differences)
std_dev_difference = (sum([(d - mean_difference)**2 for d in differences]) / (len(differences) - 1))**0.5

# Calculate standard error
SE = std_dev_difference / (len(differences)**0.5)

# Calculate t-statistic
t_statistic = mean_difference / SE

# Calculate degrees of freedom
df = len(differences) - 1

# Calculate p-value
p_value = 2 * (1 - stats.t.cdf(abs(t_statistic), df=df))

print(f"t-statistic: {t_statistic}")
print(f"p-value: {p_value}")
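As a cross-check, SciPy's built-in paired test `stats.ttest_rel` should reproduce the manual result above:

```python
import scipy.stats as stats

# Same data as the manual calculation above
before_improvement = [25, 26, 24, 23, 22, 26, 27, 24, 25, 26]
after_improvement = [28, 27, 29, 30, 32, 27, 26, 30, 28, 29]

# Paired (related) samples t-test; argument order determines the sign of t
t_statistic, p_value = stats.ttest_rel(after_improvement, before_improvement)

print(f"t-statistic: {t_statistic}")
print(f"p-value: {p_value}")
```

Agreement between the hand-rolled computation and the library call is a quick sanity check that the standard error and degrees of freedom were derived correctly.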
  • The independent samples t-test is used to compare whether the means of two independent groups (that is, samples from different groups or conditions) are significantly different. Here are the steps to conduct an independent samples t-test and an example of implementation in Python:

import scipy.stats as stats

# Data
group1 = [25, 26, 24, 23, 22, 26, 27, 24, 25, 26]
group2 = [28, 27, 29, 30, 32, 27, 26, 30, 28, 29]

# Perform the independent samples t-test
t_statistic, p_value = stats.ttest_ind(group1, group2)

print(f"t-statistic: {t_statistic}")
print(f"p-value: {p_value}")

The one-sample t-test is used to test whether the mean of a sample differs significantly from a known population mean. Here are the steps to perform a one-sample t-test and an example implementation in Python:

import scipy.stats as stats

# Data
sample_data = [28, 27, 29, 30, 32, 27, 26, 30, 28, 29]
reference_value = 30  # reference value

# Perform the one-sample t-test
t_statistic, p_value = stats.ttest_1samp(sample_data, reference_value)

print(f"t-statistic: {t_statistic}")
print(f"p-value: {p_value}")

The T-test is not very strict about the normality assumption, especially when the sample size is large, because of the central limit theorem. This important statistical principle states:

When a sufficiently large sample is drawn at random from a population, the distribution of the sample mean approaches a normal distribution, even if the population itself is not normally distributed.

In short, the T-test has looser requirements on normality when the sample size is large (in practice, often taken as greater than 30 or 40), but for small samples the normality assumption should be examined carefully, and non-parametric methods or other corrective measures considered if it fails.
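A quick simulation (illustrative only) makes the theorem concrete: observations drawn from a strongly skewed exponential population are far from normal, but the means of repeated samples of size 40 are much closer to symmetric:

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)

# A clearly non-normal, right-skewed population: exponential
population = rng.exponential(scale=1.0, size=100_000)

# Draw 1,000 samples of size 40 and record each sample mean
sample_means = rng.choice(population, size=(1000, 40)).mean(axis=1)

# Skewness near 0 suggests symmetry; the exponential has skewness ~2
print("Skewness of raw observations:", skew(population))
print("Skewness of sample means:    ", skew(sample_means))
```

The skewness of the sample means shrinks roughly by a factor of the square root of the sample size, which is why the T-test becomes forgiving of non-normality as samples grow.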


Origin blog.csdn.net/book_dw5189/article/details/132770042