Data Science and Machine Learning Algorithms from Scratch-Knowledge Point Supplement-00

Parameter Estimation

Parameter estimation falls into two categories: point estimation and interval estimation. Point estimation is further divided into moment estimation and maximum likelihood estimation. Take rainfall as an example: estimating the rainfall as 550 mm is a point estimate, while estimating it as 500-600 mm is an interval estimate.

Conceptual understanding: when we want to know some index of a population, but measuring it for the whole population is too much work or simply impractical, we can use sampling. We select part of the population as a sample, measure it, and use the value of the sample statistic to estimate the population value. For example, to learn the height of a school's students, we can randomly select some students, measure their heights, compute the average, and use this sample mean to estimate the average height of all students. That is point estimation.

Interval estimation builds on point estimation by giving a range for the population parameter. The interval is usually obtained by adding and subtracting the estimation error from the sample statistic. Put another way, interval estimation starts from the point estimate and the sampling standard error, and constructs an interval containing the parameter to be estimated according to a given probability. This given probability is called the confidence level (or confidence coefficient), and the resulting interval is called the confidence interval. A confidence interval is a numerical range, derived from sample information, that may contain the population parameter.

The confidence level expresses how much confidence we place in the confidence interval. Take the interval estimate of the average height of a school's students: at a 95% confidence level, the average height is between 1.4 meters and 1.5 meters. Here (1.4, 1.5) is the confidence interval and 95% is the confidence level, that is, we are 95% confident that this interval contains the average height of the school's students. More precisely, if the sampling were repeated many times, about 95% of the intervals built this way would contain the true average.
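
The height example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the heights are simulated (no real measurements are given here), the helper name `mean_confidence_interval` is made up for this sketch, and the interval uses the normal approximation (point estimate plus or minus z times the standard error), which is reasonable for a sample of a few hundred values.

```python
import numpy as np
from statistics import NormalDist


def mean_confidence_interval(sample, confidence=0.95):
    """Return (point_estimate, lower, upper) for the population mean.

    Hypothetical helper: uses the normal approximation, so it assumes
    the sample is reasonably large.
    """
    sample = np.asarray(sample, dtype=float)
    n = sample.size
    point_estimate = sample.mean()                    # point estimate of the mean
    standard_error = sample.std(ddof=1) / np.sqrt(n)  # sampling standard error
    z = NormalDist().inv_cdf(0.5 + confidence / 2)    # about 1.96 for 95%
    margin = z * standard_error                       # estimation error
    return point_estimate, point_estimate - margin, point_estimate + margin


# Simulated heights in meters (an assumption made purely for illustration).
rng = np.random.default_rng(0)
heights = rng.normal(loc=1.45, scale=0.08, size=200)

estimate, lower, upper = mean_confidence_interval(heights, confidence=0.95)
print(f"point estimate: {estimate:.3f} m")
print(f"95% confidence interval: ({lower:.3f}, {upper:.3f}) m")
```

Raising `confidence` widens the interval, and increasing the sample size narrows it, because the standard error shrinks with the square root of the sample size.
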

The core idea of hypothesis testing

[Figures: the core idea of hypothesis testing (images not recoverable)]

Biased and unbiased estimates


# Randomly generate 100,000 integers between 1 and 10
%matplotlib inline
import matplotlib.pyplot as plt
from IPython.core.pylabtools import figsize
figsize(15,5)
import pandas as pd
import numpy as np

np.random.seed(42)
# The population size N is 100000
N=100000
population = pd.Series(np.random.randint(1,11,N)) # N random integers from 1 to 10 (the upper bound 11 is exclusive)
print(population)


# Simulate repeated sampling from the population
samples = {}
# The size of each sample
n=30
# Draw 500 samples, each containing 30 values.
num_of_samples= 500
for i in range(num_of_samples):
    samples[i]= population.sample(n).reset_index(drop=True)

samples=pd.DataFrame(samples) # Collect the samples in a DataFrame, one column per sample
samples


# ddof (delta degrees of freedom): ddof=0 divides by n, ddof=1 divides by n-1
biased_samples=samples.var(ddof=0).to_frame() # ddof=0: biased variance estimate, one per sample
biased_samples
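
To make the role of `ddof` concrete, here is a small sketch on a hand-checkable array. The four values are made up for illustration; the only library behavior relied on is that `np.var` divides the sum of squared deviations by `n - ddof`.

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0, 4.0])  # toy values chosen for illustration
n = data.size
squared_deviations = (data - data.mean()) ** 2  # deviations from the sample mean

biased = squared_deviations.sum() / n          # divide by n     (ddof=0)
unbiased = squared_deviations.sum() / (n - 1)  # divide by n - 1 (ddof=1)

# The manual results should match NumPy's own variance function.
print(biased, np.var(data, ddof=0))
print(unbiased, np.var(data, ddof=1))
```

Note that the defaults differ between libraries: `np.var` uses `ddof=0`, while the pandas `var` method uses `ddof=1`, so it is safer to pass `ddof` explicitly as the code here does.
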


biased_samples=biased_samples.expanding().mean() # running mean of the biased estimates
biased_samples


biased_samples.columns=["biased var estimate (divided by n)"]
biased_samples


unbiased_sample=samples.var(ddof=1).to_frame() # ddof=1: unbiased variance estimate, one per sample
unbiased_sample


unbiased_sample=unbiased_sample.expanding().mean()
unbiased_sample


unbiased_sample.columns=["unbiased var estimate(divided by n-1)"]
unbiased_sample


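ax=unbiased_sample.plot()
biased_samples.plot(ax=ax)
# True population variance, drawn on the same axes as a horizontal reference line
real_population_variance=pd.Series(population.var(ddof=0),index=samples.columns)
real_population_variance.plot(ax=ax, label="population variance (ddof=0)", legend=True)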
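
The gap between the two curves can also be checked numerically. For i.i.d. samples of size n, the expected value of the divide-by-n estimate is (n-1)/n times the population variance, while the divide-by-(n-1) estimate has expectation equal to the population variance. The sketch below is a standalone check under assumptions of its own: it draws with replacement using NumPy only (the code above samples without replacement from a finite population, which makes a negligible difference at this population size), and it uses 20,000 samples rather than 500 to reduce noise.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 30               # size of each sample, as in the experiment above
num_samples = 20000  # more repetitions than above, to reduce noise

# Each row is one i.i.d. sample of n integers drawn uniformly from 1..10.
draws = rng.integers(1, 11, size=(num_samples, n))

sigma2 = (10**2 - 1) / 12            # variance of a discrete uniform on 1..10
expected_biased = (n - 1) / n * sigma2

mean_biased = draws.var(axis=1, ddof=0).mean()    # average divide-by-n estimate
mean_unbiased = draws.var(axis=1, ddof=1).mean()  # average divide-by-(n-1) estimate

print(f"population variance:            {sigma2:.3f}")
print(f"(n-1)/n * population variance:  {expected_biased:.3f}")
print(f"mean of biased estimates:       {mean_biased:.3f}")
print(f"mean of unbiased estimates:     {mean_unbiased:.3f}")
```

The average of the biased estimates should settle near the smaller value and the average of the unbiased estimates near the population variance, which is the same pattern the plot shows.
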


Unfinished; to be updated.


Origin blog.csdn.net/qq_37978800/article/details/114003899