Measuring data spread

Original link: https://blog.csdn.net/pipisorry/article/details/72820982

 

This section surveys measures of data spread, or dispersion. These measures include the range, quartiles, percentiles, and interquartile range. The five-number summary can be displayed in a box plot, which is useful for identifying outliers. Variance and standard deviation also indicate the spread of a data distribution.

Central tendency
In statistics, central tendency refers to the degree to which a set of data clusters around a central value; it reflects the location of the center of a set of data. A measure of central tendency seeks a representative or central value for the data. Measures of central tendency for low-level (e.g., nominal) data also apply to higher-level measurement data and can reveal the center around which many observations cluster; conversely, measures of central tendency for high-level data do not apply to low-level measurement data.

In statistics, a central tendency (or measure of central tendency), often colloquially called the average, is a central or typical value of a probability distribution. The most common measures of central tendency are the arithmetic mean, the median, and the mode.

For one-dimensional data, central tendency can be measured with the following statistics.

Arithmetic mean

The sum of the observations divided by the number of observations, i.e., $\frac{x_1 + x_2 + \cdots + x_n}{n}$. Often simply called the average, it is an unbiased estimate of the expected value of the underlying probability distribution.

Median

The middle value when all observations are sorted by size.

Mode

The value that occurs most frequently among the observations.

Geometric mean

The n-th root of the product of the observations, i.e., $(x_1 \times x_2 \times \cdots \times x_n)^{1/n}$.

Harmonic mean

The number of observations divided by the sum of their reciprocals, i.e., $\frac{n}{\frac{1}{x_1} + \frac{1}{x_2} + \cdots + \frac{1}{x_n}}$.

Weighted mean

An arithmetic mean that accounts for the differing contributions of different groups of data.

Truncated mean (English: Truncated_mean)

The arithmetic mean obtained after ignoring extreme values beyond certain cutoffs or a specified proportion at each end. For example, the interquartile mean (English: Interquartile_mean) is the arithmetic mean obtained after ignoring the lowest 25% and the highest 25% of the data.

Midrange (English: Midrange)

The arithmetic mean of the maximum and minimum values, i.e., $\frac{\min(x) + \max(x)}{2}$.

Midhinge (English: Midhinge)

The arithmetic mean of the first and third quartiles, i.e., $\frac{Q_1 + Q_3}{2}$.

Trimean (English: Trimean)

A weighted average of the three quartiles, i.e., $\frac{Q_1 + 2Q_2 + Q_3}{4}$.

Winsorized mean (English: Winsorized_mean)

The arithmetic mean obtained after replacing a specified proportion of the extreme values with the nearest remaining observations. For example, for 10 observations (arranged from smallest to largest as $x_1$ to $x_{10}$), the 10% winsorized mean is

$\frac{\overbrace{x_2 + x_2} + x_3 + x_4 + x_5 + x_6 + x_7 + x_8 + \overbrace{x_9 + x_9}}{10},$

where $x_2$ and $x_9$ replace $x_1$ and $x_{10}$, respectively.

The statistics above can still be applied to multidimensional variables one dimension at a time, but the results are not guaranteed to remain consistent after a rotation of the axes.
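As an illustration, most of these central-tendency statistics can be computed directly with NumPy/SciPy. This is only a sketch; the sample values are made up:

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 2.0, 3.0, 4.0, 7.0, 9.0])  # toy sample, sorted

arithmetic = x.mean()                    # (x1 + ... + xn) / n
geometric = stats.gmean(x)               # n-th root of the product
harmonic = stats.hmean(x)                # n / (sum of reciprocals)
median = np.median(x)
midrange = (x.min() + x.max()) / 2
q1, q2, q3 = np.percentile(x, [25, 50, 75])
midhinge = (q1 + q3) / 2
trimean = (q1 + 2 * q2 + q3) / 4
trimmed = stats.trim_mean(x, 0.25)       # truncated mean: drop 25% at each end
```

The quartile-based statistics (midhinge, trimean) depend on the quartile convention used; here they follow NumPy's default linear interpolation.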

The relationship between the mean, median, and mode
For a symmetric probability distribution, the different measures of central tendency give the same result, but when the skewness departs from 0 they may disagree. For a unimodal probability distribution, the mean (μ), median (ν), and mode (θ) satisfy the following: [4]

$\frac{|\theta - \mu|}{\sigma} \leq \sqrt{3},$

$\frac{|\nu - \mu|}{\sigma} \leq \sqrt{0.6},$

$\frac{|\theta - \nu|}{\sigma} \leq \sqrt{3},$

where σ is the standard deviation. For an arbitrary probability distribution, [5][6]

$\frac{|\nu - \mu|}{\sigma} \leq 1.$
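As a quick numeric sanity check of these bounds, consider the exponential distribution, whose mean, median, mode, and standard deviation are all known in closed form:

```python
import math

# Exponential distribution with rate 1: mean mu = 1, sigma = 1,
# median nu = ln 2, mode theta = 0 (unimodal and right-skewed).
mu, sigma = 1.0, 1.0
nu = math.log(2)
theta = 0.0

assert abs(theta - mu) / sigma <= math.sqrt(3)    # mode-mean bound
assert abs(nu - mu) / sigma <= math.sqrt(0.6)     # median-mean bound
assert abs(theta - nu) / sigma <= math.sqrt(3)    # mode-median bound
assert abs(nu - mu) / sigma <= 1                  # bound for any distribution
```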

[Wikipedia central tendency]

Skewness
In probability theory and statistics, skewness measures the asymmetry of the probability distribution of a real-valued random variable. The skewness value may be positive, negative, or even undefined.

Negative skewness means the left tail of the probability density function is longer than the right tail; most of the values (including the median) lie to the right of the mean.

Positive skewness means the right tail of the probability density function is longer than the left tail; most of the values (though not necessarily including the median) lie to the left of the mean.

A skewness of zero indicates that the values are distributed fairly evenly on both sides of the mean, but it does not necessarily imply symmetry.

From Jia Junping's textbook (blogger's note): a right-skewed distribution indicates the presence of very large values in the data, with the mean pulled toward that extreme. That is, positive (right) skewness means the extreme values lie on the positive (right) side.

 

Figure: negative skew (left) and positive skew (right)

For a symmetric distribution, the mean equals the median and the skewness is zero (furthermore, if the distribution is also unimodal, then mean = median = mode).

The skewness γ1 of a random variable X is the standardized third moment, defined as:

$\gamma_1 = \operatorname{E}\left[\left(\frac{X - \mu}{\sigma}\right)^{3}\right] = \frac{\mu_3}{\sigma^3} = \frac{\operatorname{E}[(X - \mu)^3]}{\left(\operatorname{E}[(X - \mu)^2]\right)^{3/2}} = \frac{\kappa_3}{\kappa_2^{3/2}}$

where μ3 is the third central moment, σ is the standard deviation, and E is the expectation operator. The last expression represents skewness as the ratio of the third cumulant to the 1.5th power of the second cumulant. This is analogous to representing kurtosis as the fourth cumulant divided by the square of the second cumulant.

If Y is the sum of n independent variables, each with the same distribution as X, then the third cumulant of Y is n times that of X and the second cumulant of Y is likewise n times that of X, so $\operatorname{Skew}[Y] = \operatorname{Skew}[X] / \sqrt{n}$. This agrees with the central limit theorem: as more variables are summed, the distribution approaches a Gaussian and its skewness shrinks.
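This $1/\sqrt{n}$ scaling can be checked by simulation. The sketch below assumes the fact that Exponential(1) has skewness 2, and sums n = 16 independent copies:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 16  # number of i.i.d. terms in each sum

# X ~ Exponential(1) has skewness 2, so the sum Y of n = 16 copies
# should have skewness approximately 2 / sqrt(16) = 0.5.
x = rng.exponential(size=200_000)
y = rng.exponential(size=(200_000, n)).sum(axis=1)

skew_x = stats.skew(x)   # close to 2
skew_y = stats.skew(y)   # close to 0.5
```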

For a right-skewed distribution, mean > median > mode.

Since more of the observations lie on the left, and the median by definition has equally many observations on each side, the mean falls to the right of the median (only to the right of the median does the area enclosed on the left exceed 0.5).
Moreover (blogger's note), for a right-skewed density the point that splits the area into two halves should lie to the right of the peak (the mode), so the median exceeds the mode. (Strictly it should be median ≥ mode; an extreme example such as [1, ..., 2, 2, 2, 2, 2, ..., 10000] makes this clear.)

Jia Junping's textbook: a right-skewed distribution indicates very large values in the data; the mean is pulled toward that extreme, while the mode and median are positional values and are not much affected by it.
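This ordering is easy to confirm on simulated right-skewed data. A sketch using exponential samples (whose underlying density has mode 0):

```python
import numpy as np

rng = np.random.default_rng(1)
# Exponential data are right-skewed: the density peaks at 0 (the mode)
# and has a long right tail.
data = rng.exponential(scale=1.0, size=100_000)

mean, median = data.mean(), np.median(data)
# The long right tail pulls the mean above the median, which in turn
# exceeds the mode (0 here): mean > median > mode.
```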

Kurtosis
In statistics, kurtosis measures the peakedness of the probability distribution of a real-valued random variable. High kurtosis means that the variance is increased more by infrequent extreme deviations than by frequent modestly sized deviations.

Kurtosis describes how sharp or flat the peak of a data distribution is. There are three cases:
  ● When the frequency curve is steeper than the normal curve, the distribution is leptokurtic (sharply peaked).
  ● When the data are relatively dispersed around the mode, so that the frequency curve is flatter than the normal curve, the distribution is platykurtic (flat-topped).
  ● When the frequency distribution follows the normal law exactly, so that the frequency curve coincides with the normal curve, the kurtosis is normal (mesokurtic).

Kurtosis is the mean of the fourth power of the deviations from the mean, divided by the fourth power of the standard deviation. The formula is:

  $\alpha_4 = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^4}{\sigma^4}$

  where α4 is the kurtosis and σ4 is the fourth power of the standard deviation.
  Since the kurtosis of the normal distribution is 3, the distribution is sharply peaked (leptokurtic) when α4 > 3 and flat-topped (platykurtic) when α4 < 3.

Blogger's note: kurtosis can thus be used to test whether data follow a normal distribution.

Alternatively (Wikipedia), kurtosis can be defined as the fourth cumulant divided by the square of the second cumulant, which equals the fourth central moment divided by the square of the variance, minus 3:

$\gamma_2 = \frac{\kappa_4}{\kappa_2^{2}} = \frac{\mu_4}{\sigma^{4}} - 3$

This is also known as excess kurtosis. The "minus 3" makes the excess kurtosis of the normal distribution equal to zero.

If Y is the sum of n independent variables, each with the same distribution as X, then Kurt[Y] = Kurt[X] / n; if kurtosis were instead defined as μ4/σ4, the corresponding formula would be more complicated.

If the excess kurtosis is positive, the distribution is called leptokurtic; if it is negative, it is called platykurtic.

By kurtosis, distributions fall into three classes: normal (kurtosis = 3), thin-tailed / platykurtic (kurtosis < 3), and heavy-tailed / leptokurtic (kurtosis > 3); in each case, look at both tails.
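These three classes can be illustrated with simulated data. Note that `scipy.stats.kurtosis` reports *excess* kurtosis (kurtosis − 3) by default:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 200_000

normal = rng.normal(size=n)     # mesokurtic: excess kurtosis ~ 0
laplace = rng.laplace(size=n)   # heavy tails (leptokurtic): excess ~ +3
uniform = rng.uniform(size=n)   # thin tails (platykurtic): excess ~ -1.2

k_normal = stats.kurtosis(normal)    # near 0
k_laplace = stats.kurtosis(laplace)  # clearly positive
k_uniform = stats.kurtosis(uniform)  # clearly negative
```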

 

[Wikipedia kurtosis]

[Skewness and kurtosis of the normal distribution of test]

Pippi blog

 

 

[Probability theory: the mean, variance and covariance matrix]

Standard deviation (English: Standard Deviation, SD)
Denoted by the mathematical symbol σ (sigma), the standard deviation is the most commonly used measure of the dispersion of a set of values in probability statistics. It is defined as the square root of the variance, and it reflects the degree of dispersion among the individuals within a group; it measures deviation from the expected value. As a measure of the spread of a distribution, it has two properties:

a non-negative value;
the same unit as the measured data.
In short, the standard deviation measures how spread out a set of values is around their mean. A large standard deviation means that most values differ substantially from their mean; a small standard deviation means the values lie close to the mean. From a geometric point of view, the standard deviation can be understood as a function of the distance from a point in n-dimensional space to a certain line.

Importantly, observations generally lie no more than a few standard deviations from the mean. Precisely, Chebyshev's inequality shows that the proportion of observations at least k standard deviations from the mean is at most 1/k². The standard deviation is therefore a good indicator of the dispersion of a data set.

Population standard deviation
 $SD = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2}$

 

where μ is the mean ($\bar{x}$).

 

Standard deviation of a random variable
The standard deviation of a random variable X is defined as:

$\sigma = \sqrt{\operatorname{E}((X - \operatorname{E}(X))^2)} = \sqrt{\operatorname{E}(X^2) - (\operatorname{E}(X))^2}$

Note that not all random variables have a standard deviation, because for some random variables the expectation does not exist.

Standard deviation of a discrete random variable
If X is a discrete random variable taking the real values x1, x2, ..., xn, each with equal probability, then the standard deviation of X is defined as:

$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$, where $\mu = \frac{1}{N}(x_1 + \cdots + x_N)$

However, if each xi may have a different probability pi, then the standard deviation of X is defined as:

$\sigma = \sqrt{\sum_{i=1}^{N} p_i (x_i - \mu)^2}$, where $\mu = \sum_{i=1}^{N} p_i x_i$.

Sample standard deviation
In the real world, finding the true standard deviation of an entire population is usually unrealistic. In most cases, the population standard deviation is estimated by drawing a random sample and computing the sample standard deviation.

When values x1, ⋯, xn are taken as a sample from a larger population X1, ⋯, XN with n < N, the sample standard deviation is often defined as:

$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$
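In NumPy the population and sample formulas differ only in the `ddof` argument. A small sketch with made-up data:

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # toy data, mean 5

# Population standard deviation: divide the sum of squares by N
# (ddof=0, numpy's default).
pop_sd = np.std(x)
# Sample standard deviation: divide by n - 1 (ddof=1), the usual
# estimator when the data are a sample from a larger population.
samp_sd = np.std(x, ddof=1)
```

The sample version is always slightly larger, since it divides by n − 1 instead of n.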

[Wikipedia standard deviation]


Range
The range is a statistic used to express the amount of variation (English: measures of variation): the difference between the maximum and minimum values, i.e., the maximum minus the minimum.

R = xmax − xmin

The range is the simplest measure of dispersion and is easily affected by extreme values. It is suitable for interval and ratio variables, but not for nominal or ordinal variables. The range does not make full use of the information in the data, but it is very simple to compute, and it is appropriate only for small samples (n < 10). Ranges of data in different units cannot be compared directly; for such comparisons a relative (ratio) measure such as the coefficient of variation is needed.

Moving Range
The moving range is the difference between the maximum and minimum of two or more consecutive sample values. It is computed as follows: each time a new data point is obtained it is added to the sample and the "oldest" point is dropped, and the range of the resulting window is computed; each moving range therefore shares at least one point with the one computed before it. In general, moving ranges are used for individuals (single-value) control charts, and the moving range is usually computed over two consecutive points.
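With a window of two consecutive points, the usual choice for individuals/moving-range charts, the moving range is just the absolute difference of successive values. A sketch with made-up measurements:

```python
import numpy as np

# Hypothetical consecutive measurements from a process.
samples = np.array([5.0, 7.0, 6.0, 10.0, 9.0])

# Moving range over pairs of consecutive points: |x[i+1] - x[i]|.
# Each range shares one point with the previous one, as described above.
moving_range = np.abs(np.diff(samples))
```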

Quartile
A quartile is a kind of quantile in statistics: all values are sorted in ascending order and split into four equal parts, and the three cut points are the quartiles.

The first quartile (Q1), also called the "lower quartile", equals the value at the 25% position of all sample values sorted in ascending order.
The second quartile (Q2), also called the "median", equals the value at the 50% position of all sample values sorted in ascending order.
The third quartile (Q3), also called the "upper quartile", equals the value at the 75% position of all sample values sorted in ascending order.
Selecting the quartile values (different conventions exist)

1. One main convention locates the quartile from the chosen percentage (p) and the sample size (n):

$L_p = n \cdot \frac{p}{100}$

Case 1: if L is an integer, take the average of the L-th and (L+1)-th values.
Case 2: if L is not an integer, round up to the nearest integer (e.g., if L = 1.2, take the 2nd value).

2. Another method, with n the number of items, determines the quartile positions as:

Position of Q1 = (n + 1) × 0.25

Position of Q2 = (n + 1) × 0.5

Position of Q3 = (n + 1) × 0.75

3. Yet another method is based on n − 1, namely:

Position of Q1 = 1 + (n − 1) × 0.25

Position of Q2 = 1 + (n − 1) × 0.5

Position of Q3 = 1 + (n − 1) × 0.75
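These conventions correspond to different interpolation rules, several of which NumPy exposes through `np.percentile` (the `method` keyword assumes NumPy ≥ 1.22; the sample values are made up):

```python
import numpy as np

x = np.array([1, 3, 5, 7, 9, 11, 13])  # n = 7, already sorted

# The default "linear" method interpolates between order statistics
# (the n - 1 basis above); other methods implement other conventions.
for method in ("linear", "lower", "higher", "nearest", "midpoint"):
    q1, q2, q3 = np.percentile(x, [25, 50, 75], method=method)
    print(f"{method:>8}: Q1={q1}, Q2={q2}, Q3={q3}")
```

For the same data, different methods can return different quartiles, which is why textbooks and software packages sometimes disagree.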

Interquartile range (InterQuartile Range, IQR)
The gap between the third quartile and the first quartile is called the interquartile range (IQR).

The quartiles are commonly used to construct box plots, simple graphical summaries of a probability distribution. For a symmetric distribution (in which the median necessarily equals the midhinge, the mean of the first and third quartiles), half the IQR equals the median absolute deviation (MAD). The median reflects the central tendency.

$IQR = Q_3 - Q_1$
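The IQR underlies the usual box-plot outlier rule: points beyond 1.5 × IQR from the quartiles are flagged. A sketch (the data and the conventional 1.5 factor are illustrative, not from the original text):

```python
import numpy as np

data = np.array([1, 2, 2, 3, 4, 5, 5, 6, 7, 50])  # 50 is an obvious outlier

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# Conventional box-plot fences: 1.5 * IQR beyond each quartile.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = data[(data < lower) | (data > upper)]
```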

[Wikipedia quartile]

Coefficient of variation / dispersion coefficient (Coefficient of Variation)
In probability theory and statistics, the coefficient of variation, also called the "dispersion coefficient" or unitized risk, is a normalized measure of the dispersion of a probability distribution, defined as the ratio of the standard deviation σ to the mean μ [1]:

$c_v = \frac{\sigma}{\mu}$

The coefficient of variation is defined only when the mean is non-zero, and it is generally applied when the mean is greater than zero.

When comparing the degree of dispersion of two data sets, if their measurement scales differ greatly or their data carry different units, comparing standard deviations directly is inappropriate; the effects of scale and units must first be eliminated. The coefficient of variation does exactly this: it is the ratio of the standard deviation of the original data to the mean of the original data.

The coefficient of variation is meaningful only for values measured on a ratio scale. For example, for a distribution of temperatures, computing in Celsius or in Kelvin does not change the standard deviation, but it does change the mean, so the coefficient of variation differs depending on the scale used. That is, a coefficient of variation computed from an interval (non-ratio) scale is meaningless.
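For example, the CV lets us compare dispersion across different units. A sketch with made-up height and weight samples:

```python
import numpy as np

# Hypothetical samples in different units (cm vs kg): their standard
# deviations are not directly comparable, but their CVs are.
heights_cm = np.array([160.0, 165.0, 170.0, 175.0, 180.0])
weights_kg = np.array([55.0, 60.0, 70.0, 80.0, 95.0])

cv_height = heights_cm.std(ddof=1) / heights_cm.mean()
cv_weight = weights_kg.std(ddof=1) / weights_kg.mean()
# Relative to their means, the weights here are more dispersed.
```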

 
