Probability density function of the normal distribution | Various normality tests | QQ plots

The probability density function (PDF) of the normal distribution is evaluated at a specific value x of the random variable, given the distribution's parameters (mean μ and standard deviation σ), yielding the probability density value f(x). This value represents the probability density, not the probability, of the random variable taking the value x under that normal distribution.

Specifically, the probability density function of the normal distribution is:

f(x) = (1 / (σ√(2π))) · exp(-(x - μ)² / (2σ²))

This probability density function describes how the probability density is distributed as the random variable x ranges over different values. In other words, f(x) represents the relative likelihood of the random variable X at the point x. The curve of the normal distribution is bell-shaped, centered on the mean μ, with the standard deviation σ determining the width of the curve. The farther a data point lies from the mean, the lower its probability density.

Probability density and probability: The probability density function gives the probability density at each value, but for continuous random variables the probability of any single point is zero. Probability is the accumulation of probability density over an interval, not a value attached to a single point.

The value of the normal probability density function is not itself a probability; it describes the relative density of the random variable at different values. To calculate actual probabilities, you need to integrate the density over an interval.

The probability density function f(x) of the normal distribution is a mathematical function that describes the probability density of a random variable X at a specific value x. Simply put, it expresses the relative likelihood of X taking values near x, given the mean (μ) and standard deviation (σ) of the distribution.

Specifically, f(x) can be interpreted as the following two points:

  1. Relative likelihood: f(x) is not a direct probability but a probability density. It tells you the relative likelihood that the random variable X takes values near x under the normal distribution. If f(x) is high at some x, values near x are more likely to occur.

  2. Area under the curve: The graph of the normal probability density function is a bell-shaped curve. By integrating the area under the curve, you can find the probability that the random variable X falls within a given range of values. That is, to know the probability that X lies in a certain interval, integrate f(x) over that interval.

To summarize, the probability density function f(x) describes the relative likelihood of each possible value under a normal distribution. It represents probability density rather than a direct probability. By integrating f(x), you can calculate the probability that the random variable X falls within a given range of values.
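As a minimal illustration of these two points, a sketch using scipy.stats (the same library used later in this article): the pdf gives the density at a point, while the probability of an interval comes from differences of the CDF.

import scipy.stats as stats

mu, sigma = 0, 1

# Density at single points: relative likelihoods, not probabilities
print("f(0) =", stats.norm.pdf(0, mu, sigma))   # ~ 0.3989, at the mean
print("f(2) =", stats.norm.pdf(2, mu, sigma))   # ~ 0.0540, farther from the mean

# Probability of an interval: accumulated density, via the CDF
p = stats.norm.cdf(1, mu, sigma) - stats.norm.cdf(-1, mu, sigma)
print("P(-1 <= X <= 1) =", p)                   # ~ 0.6827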

The probability density function (PDF) describes the relative probability density of a continuous random variable at different values. That is, the PDF reflects the relative frequency or density of occurrences of the random variable at different values, rather than a direct probability.

The following are some important concepts for reading the PDF at different values:

  1. Probability density value: The value f(x) of the PDF represents the relative probability density near the specific value x. More precisely, f(x) is the density per unit length at x, i.e. the relative probability within an infinitesimal interval around x.

  2. Value range: The PDF describes the probability density over all possible values of the random variable. Because this range is continuous, the probability of any single exact value is zero; only intervals carry nonzero probability.

  3. Curve shape: The graph of a PDF is usually a curve, and its shape is determined by the distribution characteristics of the random variable. For example, the PDF of a normal distribution is a bell-shaped curve with a peak at the mean, indicating that values near the mean have a high relative probability density.

  4. Probability calculation: To calculate the probability that the random variable falls within an interval [a, b], integrate the density over that interval: P(a ≤ X ≤ b) = ∫[a, b] f(x) dx.

  5. Probability comparison: By comparing the PDF's relative probability density at different values, you can understand how common different values are. A higher density value means values there occur more frequently; a lower density value means they are less common.

In short, the probability density function describes the relative frequency or density with which a continuous random variable takes different values. This lets us see which values occur more often, but to compute actual probabilities we need to integrate the density over an interval, as sketched below.
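To make the "probability is accumulated density" point concrete, a small sketch that integrates the normal density numerically over an interval and checks the result against the closed-form CDF:

import scipy.stats as stats
from scipy.integrate import quad

mu, sigma = 0, 1
a, b = -1.0, 2.0

# P(a <= X <= b) as the integral of the density over [a, b]
p_integral, _ = quad(lambda x: stats.norm.pdf(x, mu, sigma), a, b)

# The same probability from the cumulative distribution function
p_cdf = stats.norm.cdf(b, mu, sigma) - stats.norm.cdf(a, mu, sigma)

print("integral of f over [a, b]:", p_integral)
print("CDF(b) - CDF(a):         ", p_cdf)  # the two values agree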

The probability density function of the normal distribution is a mathematical equation describing how probability density is distributed across different values. It is very important in statistics and data science because it lets us quantify the likelihood of a data point appearing at different locations. This is one of the reasons the normal distribution is so widely used.

-----------

The critical role of the normal distribution in statistics and data analysis cannot be overstated. Here are some insights into its most important uses in these areas:

  1. Parameter Estimation : The properties of the normal distribution make it very useful in parameter estimation. By performing maximum likelihood estimation on the data, the mean and standard deviation of the normal distribution can be estimated, giving a better understanding of the overall characteristics of the data.

  2. Hypothesis testing : Many hypothesis testing methods are based on the properties of normal distribution, such as t-test, F-test, etc. These tests are used to compare means or variances between different groups to determine whether they are significantly different.

  3. Statistical Inference : The normal distribution plays a key role in statistical inference. By estimating the parameters of a normal distribution and testing hypotheses, inferences can be drawn about the population, such as confidence intervals and the credibility of hypotheses.

  4. Central limit theorem : The central limit theorem states that the mean of a large number of independent random variables tends toward a normal distribution, regardless of the variables' own distribution. This makes the normal distribution fundamental to statistical inference with large samples and explains why so much real-world data is approximately normally distributed around its mean (a simulation sketch follows at the end of this section).

  5. Model Fitting : The normal distribution is often used to fit data because it provides a good fit to the data distribution of many natural and social phenomena. This is important for building statistical models and predicting future data points.

  6. Visualization : The graph of the probability density function of the normal distribution is a commonly used visualization tool for understanding the distribution characteristics of data. By drawing a normal distribution curve, you can quickly understand the center position and dispersion of your data.

  7. Risk Management and Finance : In finance, the normal distribution is often used to model the volatility of asset prices, which is critical for risk management and investment decisions.

  8. Engineering and Natural Sciences : The normal distribution is widely used in engineering, physics, biology and other natural science fields to model and analyze phenomena, such as measurement errors, weather models, etc.

In summary, the mathematical properties and versatility of the normal distribution make it an indispensable tool in statistics and data analysis. It helps us understand and explain the statistical properties of various natural and social phenomena, thereby supporting scientific research, decision-making, and problem solving.
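To make point 4 above concrete, a minimal simulation sketch: averaging draws from a strongly skewed distribution pulls the result toward normality.

import numpy as np
import scipy.stats as stats

# Central limit theorem sketch: raw draws from a right-skewed (exponential)
# distribution vs. means of groups of 50 such draws
np.random.seed(0)
raw = np.random.exponential(scale=1.0, size=10000)                   # skewness of the exponential is 2
means = np.random.exponential(scale=1.0, size=(10000, 50)).mean(axis=1)

print("skewness of raw draws:", round(stats.skew(raw), 2))          # far from 0
print("skewness of 50-sample means:", round(stats.skew(means), 2))  # much closer to 0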

-------------------

Skewness and kurtosis are two statistics that describe the shape of a distribution; for the normal distribution they take characteristic values:

  1. Skewness : Skewness measures the asymmetry of a data distribution. The skewness of a normal distribution is 0 (sample values will be close to 0), meaning the distribution is symmetric: the mean sits at the center and the data falls symmetrically on both sides. Positive skewness indicates a right-skewed distribution (the tail extends to the right); negative skewness indicates a left-skewed distribution (the tail extends to the left). The larger the absolute value of the skewness, the more pronounced the asymmetry.

  2. Kurtosis : Kurtosis measures how peaked or flat a distribution is, equivalently how heavy its tails are. The raw kurtosis of a normal distribution is 3, which serves as the baseline (many software packages instead report excess kurtosis, raw kurtosis minus 3, for which the baseline is 0). A distribution with kurtosis greater than 3 is leptokurtic: sharper peak and heavier tails. A distribution with kurtosis less than 3 is platykurtic: flatter peak and lighter tails.

To summarize, a normal distribution has skewness close to 0, indicating symmetry, and raw kurtosis close to 3, indicating a moderately peaked shape. These two statistics describe the shape of the normal distribution; other distributions will show different values. In practice, skewness and kurtosis help us characterize a data set and, by comparison against these baselines, judge whether the data is approximately normal.
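A quick sketch of computing these two statistics with scipy.stats. Note the convention difference: scipy (like the pandas .kurtosis() used later in this article) reports excess kurtosis by default.

import numpy as np
import scipy.stats as stats

np.random.seed(0)
data = np.random.normal(loc=0, scale=1, size=5000)

print("skewness:", stats.skew(data))                        # ~ 0 for normal data
print("excess kurtosis:", stats.kurtosis(data))             # ~ 0 (raw kurtosis - 3)
print("raw kurtosis:", stats.kurtosis(data, fisher=False))  # ~ 3, the baseline in the text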

-------------------

A normality test is used to determine whether a given data set is consistent with a normal distribution. In statistics and data analysis there are several ways to test for normality; some of the common methods include:

  1. Shapiro-Wilk test : The Shapiro-Wilk test is a widely used method to test whether data conforms to a normal distribution. Its null hypothesis is that the data follows a normal distribution. If the p-value is less than the significance level (usually 0.05), the null hypothesis can be rejected, indicating that the data does not follow a normal distribution.

  2. D'Agostino and Pearson test : This is another common test method for normal distribution. It determines whether the data conforms to a normal distribution based on the skewness and kurtosis of the data. Similar to the Shapiro-Wilk test, the normal distribution assumption can be rejected if the p-value is less than the significance level.

  3. Kolmogorov-Smirnov test : This test compares the fit of the given data to a theoretical normal distribution, based on the maximum difference between the empirical and theoretical cumulative distribution functions.

Different normality tests have different prerequisites and characteristics. Here are some common methods with their main assumptions and features:

  1. Shapiro-Wilk test :

    • Prerequisite: the data is continuous, and the sample should not be too small (a common recommendation is more than 5 to 10 observations); scipy's implementation also warns that p-values become less accurate for very large samples (above about 5000).
    • Features: This is a relatively powerful normality test method that is suitable for various data set sizes. It is relatively sensitive to non-normality and can be used with both small and large samples.
  2. Kolmogorov-Smirnov test :

    • Prerequisite: the data is continuous. For single-sample testing, it is usually required that the sample size should not be too small. For two-sample testing, the sizes of the two samples should be close.
    • Features: This test compares the empirical cumulative distribution function of the data with that of a theoretical normal distribution. It is flexible and supports both one-sample and two-sample comparisons, but it may lack sensitivity for small samples. Note that if μ and σ are estimated from the same data being tested, the standard K-S p-values are too lenient; the Lilliefors variant below corrects for this.
  3. Anderson-Darling test :

    • Prerequisite: The data is continuous. Usually used for large sample data.
    • Features: This test is a refinement of the Kolmogorov-Smirnov test that gives extra weight to the tails of the distribution, so it is more sensitive to tail departures from normality and performs well on large samples. Rather than a single p-value, scipy's implementation returns the test statistic together with critical values at several significance levels.
  4. QQ plot (Quantile-Quantile plot) :

    • Prerequisite: Suitable for continuous data. No specific sample size is required, but graphical interpretation may require experience.
    • Features: This is a visualization method that determines whether the data conforms to the normal distribution by visually comparing the quantiles of the data with the quantiles of the theoretical normal distribution. It provides a quick initial impression but does not provide a specific p-value.
  5. Lilliefors test :

    • Prerequisite: the data is continuous; applicable in particular when the sample is small.
    • Features: This is a variant of the Kolmogorov-Smirnov test whose critical values are corrected for estimating the mean and standard deviation from the sample itself, something the standard Kolmogorov-Smirnov test does not account for. This makes it more reliable than the standard test, especially for small samples.

Each testing method has its scope and limitations, and choosing the appropriate method depends on your data characteristics and research questions. Usually, it is recommended to combine the results of multiple methods to make the final judgment. Furthermore, normality testing is usually a step in statistical analysis rather than the final conclusion.

These are some common normality testing methods; choose the one appropriate to your data and needs to assess whether the data conforms to a normal distribution. Note that a normality test does not require the data to be perfectly normal; it determines whether the data deviates significantly from a normal distribution. The sketch below applies three of these tests to simulated data:

import scipy.stats as stats
import numpy as np

# Generate simulated data: random normal draws using NumPy
np.random.seed(0)  # fix the random seed for reproducibility
data = np.random.normal(0, 1, 1000)  # 1000 points from a normal distribution with mean 0 and std 1

# Shapiro-Wilk test
statistic, p_value = stats.shapiro(data)
if p_value > 0.05:
    print("Shapiro-Wilk test: data is consistent with a normal distribution")
else:
    print("Shapiro-Wilk test: data is not consistent with a normal distribution")

# D'Agostino and Pearson test
statistic, p_value = stats.normaltest(data)
if p_value > 0.05:
    print("D'Agostino-Pearson test: data is consistent with a normal distribution")
else:
    print("D'Agostino-Pearson test: data is not consistent with a normal distribution")

# Kolmogorov-Smirnov test (against the standard normal; valid here because
# the data was generated with mean 0 and std 1)
statistic, p_value = stats.kstest(data, 'norm')
if p_value > 0.05:
    print("Kolmogorov-Smirnov test: data is consistent with a normal distribution")
else:
    print("Kolmogorov-Smirnov test: data is not consistent with a normal distribution")

Multiple normality tests 

import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt

data = np.random.normal(loc=12, scale=2.5, size=340)
df = pd.DataFrame({'Data': data})

# Descriptive statistics
mean = df['Data'].mean()
std_dev = df['Data'].std()
skewness = df['Data'].skew()
kurtosis = df['Data'].kurtosis()  # pandas reports excess kurtosis (normal ~ 0)

print("Mean:", mean)
print("Standard deviation:", std_dev)
print("Skewness:", skewness)
print("Kurtosis (excess):", kurtosis)

# Create a 2x1 subplot layout
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(6, 6))
# Visualization: stats.probplot normal probability plot (Q-Q plot)
stats.probplot(data, plot=ax1, dist='norm', fit=True, rvalue=True)  # draw on ax1
ax1.set_title("Q-Q Plot")

# Visualization: histogram
ax2.hist(data, bins=10, rwidth=0.8, density=True)  # bins bars of relative width rwidth (0-1); 1 leaves no gaps
ax2.set_title("Histogram")

# Adjust spacing between subplots
plt.tight_layout()
# Show the figure
plt.show()

# Normality test: Shapiro-Wilk
stat, p = stats.shapiro(data)
print("Shapiro-Wilk statistic:", stat)
print("Shapiro-Wilk p-value:", p)

# Anderson-Darling test: returns the statistic plus critical values
# at several significance levels instead of a single p-value
result = stats.anderson(df['Data'], dist='norm')
print("Anderson-Darling statistic:", result.statistic)
print("Anderson-Darling critical values:", result.critical_values)

# One-sample K-S test. Note: kstest(data, 'norm') would compare against the
# *standard* normal, which this data (mean 12, std 2.5) is not, so supply the
# parameters (with the caveat that estimating them from the same data makes
# the p-value optimistic; see the Lilliefors test above)
statistic, p_value = stats.kstest(data, 'norm', args=(mean, std_dev))
print("K-S statistic:", statistic)
print("K-S p-value:", p_value)

# D'Agostino-Pearson normality test
k2, p_value = stats.normaltest(data)
print(f"normaltest statistic (K^2): {k2}")
print(f"normaltest p-value: {p_value}")

 

The scipy.stats module is a submodule of the SciPy library and is used to perform various statistical analyses and probability distribution related operations. This module provides many functions for performing statistical tests, fitting probability distributions, generating random variables, and more. Here are some common capabilities of the scipy.stats module:

  1. Statistical tests : scipy.stats provides many statistical tests, such as the t-test, ANOVA, the chi-square test, normality tests, etc. These methods are used to analyze differences between data sets, test hypotheses, and determine whether data fit certain distributions.

  2. Probability distribution : This module contains many implementations of continuous and discrete probability distributions, such as normal distribution, exponential distribution, Poisson distribution, gamma distribution, etc. These distributions can be used to model and analyze different types of random variables.

  3. Fitting distributions : You can use the fit method to fit data to a specific probability distribution. This is useful for determining whether the data fits a known distribution and for estimating the parameters of the distribution.

  4. Generate random variables : scipy.stats allows you to generate random variables that follow a specified probability distribution. This is useful for simulating experiments and generating random data points.

  5. Descriptive Statistics : You can use this module to calculate descriptive statistics for your data, such as mean, standard deviation, median, percentile, etc.

  6. Probability density function and cumulative distribution function : You can use this module to calculate the probability density function (PDF) and cumulative distribution function (CDF) and their inverse functions.

  7. Statistics calculation : This module provides the calculation of various statistics, such as correlation coefficient, covariance, skewness, kurtosis, etc.

  8. Hypothesis testing : In addition to the common t-test and chi-square test, some advanced hypothesis testing methods are also provided, such as Kolmogorov-Smirnov test, Anderson-Darling test, etc.

This is only part of the functionality of the scipy.stats module. It is a very useful tool for statistics, data analysis, and scientific computing, able to process and analyze many types of data and perform statistical inference and hypothesis testing. For details on a specific feature, consult the official SciPy documentation or explore the module further.
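A brief sketch touching several of the capabilities listed above (generating random variates, fitting a distribution, and evaluating the pdf, cdf, and inverse cdf):

import numpy as np
import scipy.stats as stats

np.random.seed(0)

# Generate random variates from a normal distribution (point 4)
sample = stats.norm.rvs(loc=10, scale=2, size=1000)

# Fit a normal distribution to the data: maximum likelihood estimates (point 3)
mu_hat, sigma_hat = stats.norm.fit(sample)
print("fitted mean:", mu_hat, "fitted std:", sigma_hat)

# Evaluate the pdf, cdf, and inverse cdf (ppf) of the fitted distribution (point 6)
print("density at the mean:", stats.norm.pdf(mu_hat, mu_hat, sigma_hat))
print("P(X <= 12):", stats.norm.cdf(12, mu_hat, sigma_hat))
print("95th percentile:", stats.norm.ppf(0.95, mu_hat, sigma_hat))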

 ---------------------

A QQ plot (quantile-quantile plot) is a very useful visualization tool for comparing an actual data distribution to a theoretical one (such as the normal distribution). By plotting quantiles against each other as a scatter plot, the QQ plot helps you visually judge how closely the data follows the theoretical distribution.

The steps to create a QQ plot are as follows:

  1. Collect actual data : First, you need to collect or prepare the actual data set that you want to analyze.

  2. Sort data : Arrange the actual data in ascending order for subsequent quantile calculation.

  3. Calculate quantiles : For each data point, calculate its percentile rank within the entire data set, typically using a plotting position based on the empirical cumulative distribution function (for example, (i - 0.5)/n, as used below). These quantile values represent the relative position of each data point within the distribution.

  4. Generate theoretical quantiles : Based on a selected theoretical distribution (such as the normal distribution), calculate the theoretical quantiles corresponding to the same percentile ranking. These theoretical quantiles are derived from a theoretical distribution, and if the data fit that theoretical distribution, they should follow the same distribution.

  5. Draw QQ plot : Plot the quantiles of the actual data against the quantiles of the theoretical distribution as a scatter plot. Typically the x-axis carries the theoretical quantiles and the y-axis the quantiles of the actual data. If the data approximately fits the theoretical distribution, the points should fall roughly on a straight line (the 45-degree line y = x when the data is on the same scale as the theoretical distribution).

  6. Interpret the results : Observe the distribution of points on the QQ chart. If they are closely aligned along the 45 degree diagonal, then the data are likely to fit the chosen theoretical distribution. If the points deviate from the diagonal, it may indicate that the data does not conform to the theoretical distribution.

The QQ plot is a powerful tool for visually evaluating the distribution of your data and checking whether it approximately follows a theoretical distribution such as the normal. If the points line up closely along a straight line, that is a good indication the data conforms to the chosen distribution. The scipy.stats.probplot function creates such probability plots, visualizing the fit between sample data and a theoretical distribution (the normal by default), which helps you determine whether the sample conforms to that distribution.

import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats

# Generate simulated height data (normally distributed)
# Suppose you have 100 height observations and want to check whether they follow a normal distribution
np.random.seed(0)
heights = np.random.normal(loc=170, scale=10, size=100)

# Draw the Q-Q plot
stats.probplot(heights, dist="norm", plot=plt)
plt.title("Q-Q Plot for Heights")
plt.xlabel("Theoretical Quantiles")
plt.ylabel("Sample Quantiles")
plt.show()

To plot a QQ plot comparing the quantiles of actual data with those of a theoretical normal distribution, you first need to compute both sets of quantiles. A quantile is the value below which a given percentage of the data falls. Theoretical quantiles are obtained from the inverse cumulative distribution function (the quantile function). For a normal distribution, the quantiles can be calculated as follows:

  1. Compute the quantiles of the theoretical normal distribution:

    • For a given probability (percentage) p (for example, p=0.25 represents the 25% quantile, the lower quartile), the corresponding quantile is obtained from the inverse cumulative distribution function of the normal distribution (in scipy, stats.norm.ppf). This is usually done with statistical software or libraries since it has no simple closed form.
  2. Compute quantiles of real data:

    • For your actual data set, you need to sort the data from small to large.
    • Then, calculate the percentile rank of each data point using the formula: rank = ((i - 0.5) / n) * 100%, where i is the position of the data point after sorting and n is the total number of data points in the dataset.
  3. Draw QQ chart:

    • Now that you have the theoretical normal distribution and the quantiles of the actual data, you can plot them on a QQ plot.
    • The x-axis represents the quantile of the theoretical normal distribution, and the y-axis represents the quantile of the actual data.
    • If the data points lie close to the diagonal, the data is likely to be normally distributed.

We use NumPy to generate an example data set assumed to follow a normal distribution. Then we calculate the quantiles of the actual data and of the theoretical normal distribution and draw a QQ plot using the matplotlib and seaborn libraries. The QQ plot visualizes the fit between the actual data and the theoretical distribution: if the points lie close to the red dashed line, the data is close to normal.

import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns

# Generate an example data set, assumed to follow a normal distribution
data = np.random.normal(loc=0, scale=1, size=1000)

# Quantiles of the actual data (avoiding the extreme 0th/100th percentiles,
# whose theoretical counterparts are infinite)
percentiles = np.percentile(data, [1, 25, 50, 75, 99])

# Quantiles of the theoretical normal distribution: ppf is the inverse CDF;
# ppf(0) and ppf(1) would be -inf/+inf, so use 0.01 and 0.99 instead
theoretical_percentiles = stats.norm.ppf([0.01, 0.25, 0.5, 0.75, 0.99], loc=0, scale=1)

# Draw the Q-Q plot
plt.figure(figsize=(8, 6))
sns.set(style="whitegrid")
sns.scatterplot(x=theoretical_percentiles, y=percentiles)
plt.xlabel("Theoretical Quantiles")
plt.ylabel("Sample Quantiles")
plt.title("Q-Q Plot")
plt.plot([-2.5, 2.5], [-2.5, 2.5], color='red', linestyle='--')  # add the y = x diagonal reference line
plt.show()
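As an alternative to computing the quantiles by hand, statsmodels offers a one-call QQ plot. A sketch, assuming statsmodels is installed; line='45' draws the y = x reference line, appropriate here because the data is already on the standard normal scale:

import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

np.random.seed(0)
data = np.random.normal(loc=0, scale=1, size=1000)

# Compare sample quantiles to standard normal quantiles with a reference line
sm.qqplot(data, line='45')
plt.title("Q-Q Plot (statsmodels)")
plt.show()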
