Hypothesis testing in Python

Building on the theory of hypothesis testing, this article uses Python to run hypothesis tests on real data.

Importing test data

Download the test data file from: https://pan.baidu.com/s/1t4SKF6U2yyjT365FaE692A
Data field description:

Gender: sex, 1 = male, 2 = female
Temperature: body temperature (degrees Fahrenheit)
HeartRate: heart rate

After downloading, use the pandas read_csv function to import the data.

import numpy as np
import pandas as pd
from scipy import stats

test_df = pd.read_csv('test.csv')
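
Before running any test it can help to confirm that the file has the fields described above. The following is a minimal sketch of such a check. It assumes the CSV has a header row with exactly these three column names, and the rows in `sample_csv` are hypothetical stand-ins, not values from the real file.

```python
import io
import pandas as pd

# Hypothetical rows in the layout described above; not the real file.
sample_csv = io.StringIO(
    "Gender,Temperature,HeartRate\n"
    "1,96.3,70\n"
    "1,98.6,82\n"
    "2,98.4,84\n"
    "2,100.8,77\n"
)
demo_df = pd.read_csv(sample_csv)

# Check that every documented column is present.
expected_columns = ["Gender", "Temperature", "HeartRate"]
missing = [c for c in expected_columns if c not in demo_df.columns]
if missing:
    raise ValueError(f"missing columns: {missing}")

# Gender should only contain the codes 1 (male) and 2 (female).
bad_codes = set(demo_df["Gender"].unique()) - {1, 2}
n_missing_values = int(demo_df.isna().sum().sum())

print(demo_df.describe())
print("unexpected gender codes:", bad_codes)
print("missing values:", n_missing_values)
```

With the real data, the same checks would be applied to the DataFrame returned by `pd.read_csv('test.csv')`.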

Questions

For this test data, we ask the following questions:

  1. Is the population mean body temperature 98.6 degrees Fahrenheit?
  2. Does human body temperature follow a normal distribution?
  3. Which body temperature values are abnormal?
  4. Is there a significant difference between male and female body temperature?
  5. How strongly are body temperature and heart rate correlated (strong, weak, or moderate)?

Problem-solving steps

1. Is the population mean body temperature 98.6 degrees Fahrenheit?

Solution 1:
This question can be turned into a hypothesis test with the following hypotheses:
$ H_0: \mu = 98.6 $
$ H_1: \mu \neq 98.6 $

This is a two-sided test: the null hypothesis is rejected if either $ \mu > \mu_0 $ or $ \mu < \mu_0 $ holds.

According to the Renmin University of China textbook "Statistics", 7th edition, p. 163:

With a large sample, if the population is normally distributed, the sample statistic is normally distributed; if the population is not normally distributed, the sample statistic is asymptotically normally distributed. In both cases the sample statistic can be treated as normal, so the $ z $ statistic can be used.

In this problem the population standard deviation $ \sigma $ is unknown, so it is replaced by the sample standard deviation $ s $.

## Compute the z statistic
mu = 98.6
temp = test_df['Temperature']
# Sample mean
sample_mean = np.mean(temp)

# Sample standard deviation (ddof=1 divides by n - 1)
sample_std = np.std(temp, ddof=1)
# Sample size
sample_size = temp.size

z = (sample_mean - mu) / (sample_std / np.sqrt(sample_size))
print(round(z, 4))
-5.4548

For a two-sided test at significance level $ \alpha = 0.05 $, the critical value is
$ z_{\frac{\alpha}{2}} = \pm 1.96 $
Since $ |z| > |z_{\frac{\alpha}{2}}| $, we reject the null hypothesis: the population mean body temperature is not 98.6 degrees Fahrenheit.
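
The decision rule can also be expressed through the critical value and a two-sided p-value taken from the standard normal distribution. The sketch below wraps this in a hypothetical helper, `one_sample_z`, and runs it on randomly generated readings; neither the helper name nor the sample values come from the original data.

```python
import numpy as np
from scipy import stats

def one_sample_z(sample, mu0, alpha=0.05):
    """Two-sided one-sample z-test using the sample standard deviation."""
    sample = np.asarray(sample, dtype=float)
    se = sample.std(ddof=1) / np.sqrt(sample.size)
    z = (sample.mean() - mu0) / se
    critical = stats.norm.ppf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    p_value = 2 * stats.norm.sf(abs(z))        # two-sided tail probability
    return z, critical, p_value

# Hypothetical readings, only to exercise the function.
rng = np.random.default_rng(0)
demo = rng.normal(loc=98.2, scale=0.7, size=130)

z, critical, p_value = one_sample_z(demo, mu0=98.6)
print(f"z = {z:.3f}, critical = +/-{critical:.3f}, p = {p_value:.4g}")
print("reject H0" if abs(z) > critical else "fail to reject H0")
```

Comparing `abs(z)` with the critical value and comparing the p-value with `alpha` always lead to the same decision; the p-value additionally shows how far into the tail the statistic lies.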

Solution 2:
The ztest function in the statsmodels package (statsmodels.stats.weightstats.ztest) performs the calculation directly; see http://www.statsmodels.org/stable/generated/statsmodels.stats.weightstats.ztest.html
Its signature is:

statsmodels.stats.weightstats.ztest(x1, x2=None, value=0, alternative='two-sided', usevar='pooled', ddof=1.0)
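
As a rough illustration of the call, the sketch below applies `ztest` to randomly generated readings and compares the statistic with the hand-computed one. The sample is hypothetical; for the real data, `demo` would be replaced by the Temperature column.

```python
import numpy as np
from statsmodels.stats.weightstats import ztest

# Hypothetical readings, only to exercise the call.
rng = np.random.default_rng(1)
demo = rng.normal(loc=98.2, scale=0.7, size=130)

# value is the mean under H0; the function returns (statistic, p-value).
z_sm, p_sm = ztest(demo, value=98.6, alternative='two-sided')

# Same statistic computed by hand for comparison.
z_manual = (demo.mean() - 98.6) / (demo.std(ddof=1) / np.sqrt(demo.size))

print(f"ztest:  z = {z_sm:.4f}, p = {p_sm:.4g}")
print(f"manual: z = {z_manual:.4f}")
```

Note that `value` defaults to 0, so it has to be set to the hypothesized mean for a one-sample test.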

2. Does human body temperature follow a normal distribution?

Solution:

Following the article "Verifying the distribution type of data with Python", first draw a histogram of the distribution, then test it with the scipy.stats.kstest function.

%matplotlib inline
import seaborn as sns
# Histogram with a kernel density estimate overlaid
sns.histplot(temp, color='b', bins=10, kde=True)

Judging from the plot, there are very few values above 99.3. At first glance the data do not appear to follow a normal distribution.

Then verify with kstest.

In: stats.kstest(temp, 'norm')
Out: KstestResult(statistic=1.0, pvalue=0.0)

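The p-value is below 0.05, so the test rejects the hypothesis. Note, however, that kstest(temp, 'norm') compares the data with the standard normal distribution $ N(0, 1) $. Temperatures around 98 lie far outside its range, which is why the statistic is exactly 1.0. The result therefore rules out a standard normal distribution, not normality in general; testing against a normal distribution with the sample's own mean and standard deviation requires passing them through the args parameter.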
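
The difference between testing against the standard normal and testing against a normal distribution with matching location and scale can be seen in a small sketch. The readings here are randomly generated and hypothetical. One caveat: when the mean and standard deviation are estimated from the same sample, the KS p-value is known to be conservative (too large), so it should be read as a rough guide.

```python
import numpy as np
from scipy import stats

# Hypothetical normally distributed readings, only for illustration.
rng = np.random.default_rng(0)
demo = rng.normal(loc=98.25, scale=0.73, size=130)

# Against the standard normal N(0, 1): every value lies far in the right tail.
res_standard = stats.kstest(demo, 'norm')

# Against a normal with the sample's own mean and standard deviation.
res_fitted = stats.kstest(demo, 'norm', args=(demo.mean(), demo.std(ddof=1)))

print(res_standard)
print(res_fitted)
```

The first call reproduces a statistic of 1.0 even though the sample was drawn from a normal distribution, which shows that this result alone says nothing about the shape of the data.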

Check whether the data follow a t distribution:

In: np.random.seed(1)
    ks = stats.t.fit(temp)
    df = ks[0]
    loc = ks[1]
    scale = ks[2]
    t_estm = stats.t.rvs(df=df, loc=loc, scale=scale, size=sample_size)
    stats.ks_2samp(temp, t_estm)
Out: Ks_2sampResult(statistic=0.11538461538461536, pvalue=0.33281734591562734)

The idea is to fit a t distribution to the temperature data, draw a random sample of the same size from the fitted distribution, and use the ks_2samp function to compare that sample with the data. Because the p-value is greater than 0.05, the hypothesis that the data follow the fitted t distribution cannot be rejected.
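
A variation on this idea, sketched below, skips the random sample and compares the data with the fitted CDF directly through a one-sample `kstest`, so the result does not depend on the random seed. The helper `ks_against_fit` and the generated readings are hypothetical and are not part of the original analysis.

```python
import numpy as np
from scipy import stats

def ks_against_fit(data, dist):
    """Fit dist by maximum likelihood, then run a one-sample KS test
    against the fitted CDF. Returns (fitted parameters, KS result)."""
    params = dist.fit(data)
    result = stats.kstest(data, dist.cdf, args=params)
    return params, result

# Hypothetical readings, only to exercise the helper.
rng = np.random.default_rng(1)
demo = rng.normal(loc=98.25, scale=0.73, size=130)

for dist in (stats.norm, stats.t):
    params, result = ks_against_fit(demo, dist)
    print(f"{dist.name}: D = {result.statistic:.3f}, p = {result.pvalue:.3f}")
```

As with any KS test on fitted parameters, a large p-value means only that the fit cannot be rejected, not that the distribution has been confirmed.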

Check whether the data follow a chi-square distribution:

In: np.random.seed(1)
    chi_square = stats.chi2.fit(temp)
    df = chi_square[0]
    loc = chi_square[1]
    scale = chi_square[2]
    chi_estm = stats.chi2.rvs(df=df, loc=loc, scale=scale, size=sample_size)
    stats.ks_2samp(temp, chi_estm)
Out: Ks_2sampResult(statistic=0.07692307692307687, pvalue=0.8215795712396048)

The p-value is 0.82, higher than for the t distribution, which suggests that the fitted chi-square distribution matches the temperature data more closely. The plot below compares the fitted chi-square distribution with the test data.

from matplotlib import pyplot as plt
plt.figure()
temp.plot(kind = 'kde')
chi2_distribution = stats.chi2(chi_square[0], chi_square[1],chi_square[2])
x = np.linspace(chi2_distribution.ppf(0.01), chi2_distribution.ppf(0.99), 100)
plt.plot(x, chi2_distribution.pdf(x), c='orange')
plt.xlabel('Human temperature')
plt.title('temperature on chi_square', size=20)
plt.legend(['test_data', 'chi_square'])

3. Which body temperature values are abnormal?

Solution:

Given that the temperature data follow the fitted chi-square distribution, Python can compute the values of that distribution at cumulative probabilities P = 0.025 and P = 0.925. Values outside these two cutoffs have low probability and are treated as abnormal. Note that 0.925 leaves 7.5% in the upper tail against 2.5% in the lower tail; a symmetric 95% interval would use P = 0.975 for the upper cutoff.

In: chi2_distribution.ppf(0.025)
Out:97.0690523831819
In: chi2_distribution.ppf(0.925)
Out:99.332801136025
In: temp[temp<97.069]
Out:0     96.3
    1     96.7
    2     96.9
    3     97.0
    65    96.4
    66    96.7
    67    96.8
    Name: Temperature, dtype: float64
In: temp[temp>99.332]
Out:63      99.4
    64      99.5
    126     99.4
    127     99.9
    128    100.0
    129    100.8
    Name: Temperature, dtype: float64
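
For a symmetric cutoff, a frozen scipy distribution offers `interval`, which returns the central range directly. The sketch below uses a hypothetical normal model and hypothetical readings purely to show the mechanics; in the analysis above, `model` would be the fitted `chi2_distribution`.

```python
import pandas as pd
from scipy import stats

# Hypothetical fitted model; stands in for the frozen distribution fitted to the data.
model = stats.norm(loc=98.25, scale=0.73)

# interval(0.95) returns the central 95% range: (ppf(0.025), ppf(0.975)).
lower, upper = model.interval(0.95)

# Hypothetical readings, only to show the filtering step.
readings = pd.Series([96.3, 97.4, 98.2, 98.6, 99.1, 100.8], name='Temperature')
abnormal = readings[(readings < lower) | (readings > upper)]

print(f"cutoffs: {lower:.3f} to {upper:.3f}")
print(abnormal)
```

Selecting both tails with a single boolean mask keeps the lower and upper cutoffs tied to the same coverage level.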

4. Is there a significant difference between male and female body temperature?

Solution:

This is a hypothesis test on the difference between two population means. Because the question does not specify the direction of the difference, it is a two-sided test. The null and alternative hypotheses are:

$ H_0: \mu_1 - \mu_2 = 0 $ (no significant difference)

$ H_1: \mu_1 - \mu_2 \ne 0 $ (a significant difference)

The variances $ \sigma_1^2 $ and $ \sigma_2^2 $ are unknown and cannot be assumed equal, and the sample sizes $ n_1 $ and $ n_2 $ are both 65.

In: test_df.groupby(['Gender']).size()
Out:Gender
    1    65
    2    65
    dtype: int64

In this situation, the sampling distribution of the difference between the two sample means approximately follows a t distribution with f degrees of freedom, where f is:
$$
f = \frac{(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2})^2}{\frac{(\frac{s_1^2}{n_1})^2}{n_1-1} + \frac{(\frac{s_2^2}{n_2})^2}{n_2-1}}
$$
The test statistic t is calculated as:
$$
t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}
$$

male_df = test_df.loc[test_df['Gender'] == 1]
female_df = test_df.loc[test_df['Gender'] == 2]
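  • Method 1: write a function that computes the statistic
def cal_f(a, b):
    n_1 = len(a)
    n_2 = len(b)
    mean_1 = a.mean()
    mean_2 = b.mean()
    # pandas Series.std() uses ddof=1, i.e. the sample standard deviation
    std_1 = a.std()
    std_2 = b.std()

    # s_i^2 / n_i for each group
    s_1 = std_1**2 / n_1
    s_2 = std_2**2 / n_2
    # Degrees of freedom from the formula above
    f = (s_1 + s_2)**2 / (s_1**2 / (n_1 - 1) + s_2**2 / (n_2 - 1))
    print('degree of freedom=%.3f' % f)

    # Under the null hypothesis, mu_1 - mu_2 = 0
    t = (mean_1 - mean_2) / np.sqrt(s_1 + s_2)

    # Critical value: significance level 0.05, two-sided, so 0.025 in each tail
    v = stats.t.ppf(0.025, f)

    print('stat=%.3f, boundary=%.3f' % (t, v))

    if abs(t) > abs(v):
        print("Reject the null hypothesis: male and female body temperatures differ significantly.")
    else:
        print("Cannot reject the null hypothesis: no significant difference between male and female body temperatures.")

Calling the custom function

In: cal_f(male_df['Temperature'],female_df['Temperature'])
Out:degree of freedom=127.510
    stat=-2.285, boundary=-1.979
    Reject the null hypothesis: male and female body temperatures differ significantly.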
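  • Method 2: use the ttest_ind function
# equal_var=False selects Welch's t-test, matching the unequal-variance setup of Method 1
stat, p = stats.ttest_ind(male_df['Temperature'], female_df['Temperature'], equal_var=False)
print('stat=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
    print('Cannot reject the null hypothesis: no significant difference between male and female body temperatures.')
else:
    print('Reject the null hypothesis: male and female body temperatures differ significantly.')

Out:
stat=-2.285, p=0.024
Reject the null hypothesis: male and female body temperatures differ significantly.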

Note: the p-value returned by the function is the two-sided probability, so it can be compared directly with the significance level 0.05. The following check confirms that it is two-sided:

# Plug the statistic from Method 1 into the cumulative distribution function of the t distribution
In: stats.t.cdf(stat, 127.51)
Out:0.011969134059074056

This result is exactly half of the two-sided probability (2 × 0.012 ≈ 0.024).
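
The relationship between the one-sided tail area and the two-sided p-value can be checked end to end. The sketch below builds two hypothetical groups of 65 random readings, computes the statistic, the degrees of freedom, and the two-sided p-value by hand, and compares them with `ttest_ind(..., equal_var=False)`. The group data are assumptions made for illustration only.

```python
import numpy as np
from scipy import stats

# Hypothetical groups, only to exercise the calculation.
rng = np.random.default_rng(2)
group_a = rng.normal(loc=98.1, scale=0.70, size=65)
group_b = rng.normal(loc=98.4, scale=0.74, size=65)

# Manual statistic and degrees of freedom, using s_i^2 / n_i for each group.
v_a = group_a.var(ddof=1) / group_a.size
v_b = group_b.var(ddof=1) / group_b.size
t_manual = (group_a.mean() - group_b.mean()) / np.sqrt(v_a + v_b)
dof = (v_a + v_b) ** 2 / (v_a ** 2 / (group_a.size - 1) + v_b ** 2 / (group_b.size - 1))

# Two-sided p-value: twice the tail area beyond |t|.
p_manual = 2 * stats.t.sf(abs(t_manual), dof)

res = stats.ttest_ind(group_a, group_b, equal_var=False)
print(f"manual: t = {t_manual:.4f}, df = {dof:.2f}, p = {p_manual:.4f}")
print(f"scipy:  t = {res.statistic:.4f}, p = {res.pvalue:.4f}")
```

Using the survival function `sf` on `abs(t)` avoids having to track the sign of the statistic.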

5. How strongly are body temperature and heart rate correlated?

The Pearson correlation coefficient can be used to test the relationship between the two sets of data. Before computing it, a scatter plot shows how the two variables are distributed in two-dimensional space.

heartrate_s = test_df['HeartRate']
temperature_s = test_df['Temperature']
from matplotlib import pyplot as plt
plt.scatter(heartrate_s, temperature_s)

Compute the Pearson correlation coefficient:

In: stat, p = stats.pearsonr(heartrate_s, temperature_s)
    print('stat=%.3f, p=%.3f' % (stat, p))
Out:stat=0.254, p=0.004

A Pearson correlation coefficient of 0 means no linear correlation, a value greater than 0 indicates a positive correlation, and 1 is a perfect positive correlation. Here the coefficient is 0.254 and the p-value is 0.004. The p-value is below 0.05, so the correlation is statistically significant, but a coefficient of 0.254 is small: body temperature and heart rate are only weakly positively correlated. The scatter plot agrees, since the points are widely scattered.
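
The coefficient and the p-value answer different questions: the coefficient measures the strength of the linear relationship, while the p-value measures the evidence that it is not zero. The sketch below computes r from its definition on hypothetical paired data and compares it with `pearsonr`; the generated values are assumptions for illustration. It also prints r squared, the share of variance explained by a linear fit. For r = 0.254 that share is about 0.065.

```python
import numpy as np
from scipy import stats

# Hypothetical paired data with a weak positive relationship.
rng = np.random.default_rng(3)
x = rng.normal(loc=74, scale=7, size=130)
y = 98.2 + 0.025 * (x - 74) + rng.normal(scale=0.7, size=130)

# r from its definition: covariance divided by the product of standard deviations.
xc, yc = x - x.mean(), y - y.mean()
r_manual = (xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum())

r, p = stats.pearsonr(x, y)
print(f"r = {r:.3f} (manual {r_manual:.3f}), p = {p:.4f}")
print(f"r squared = {r ** 2:.3f}")  # share of variance explained by a linear fit
```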


Origin www.cnblogs.com/shenfeng/p/hypothesis_using_python.html