Sharing simple data analysis questions (with answers)

  1. Estimate the number of newborns born this year without using any public references
    Answer:

    1) Adopt a two-layer model (demographic profile → population conversion): number of newborns = Σ (number of women of childbearing age in each age group × fertility rate of that age group) (see the sketch after this list)
    2) From numbers to numbers: if data on the number of newborns in previous years is available, build a time-series model for prediction (taking into account structural breaks such as the relaxation of the two-child policy)
    3) Look for leading indicators, such as the number of new active users X of a baby-products platform, which represents newborn-family users. X_n / newborn_n is that year's conversion rate from newborns to newborn-family users; for example, X_2007 / newborn_2007 is the 2007 conversion rate. The conversion rate evolves as the platform grows, so this year's approximate conversion rate can be extrapolated from previous years, and the number of newborns this year can then be back-calculated from this year's count of newborn-family users.
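A minimal sketch of the two-layer model in point 1), using made-up age groups, population counts, and fertility rates purely for illustration (none of these numbers are real statistics):

```python
# Two-layer model: newborns = sum over age groups of
# (women of childbearing age in the group) * (fertility rate of the group).
women_by_age_group = {           # hypothetical population counts
    "20-24": 30_000_000,
    "25-29": 45_000_000,
    "30-34": 50_000_000,
    "35-39": 48_000_000,
}
fertility_rate_by_age_group = {  # hypothetical births per woman per year
    "20-24": 0.04,
    "25-29": 0.07,
    "30-34": 0.05,
    "35-39": 0.02,
}

newborns = sum(
    women_by_age_group[g] * fertility_rate_by_age_group[g]
    for g in women_by_age_group
)
print(f"Estimated newborns this year: {newborns:,.0f}")
```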
  2. Which distribution does the number of planets per unit volume in the observable universe follow?
    A. Student's t distribution
    B. Poisson distribution
    C. Normal distribution
    D. Binomial distribution
    Solution:
    A. Student's t distribution: used to estimate the mean of a normal distribution from a small sample
    B. Poisson distribution: the probability of a given number of events occurring within a fixed interval; it can also be viewed as the limit of a binomial distribution with large n and small p
    C. Normal distribution: the mean of a random variable over many groups of independent repeated experiments
    D. Binomial distribution: independent repeated experiments, such as multiple coin tosses
    If the volume is treated like the time interval, this question matches B, the Poisson distribution.
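A small numerical check of the "large n, small p" statement above, assuming numpy and scipy are available; the values of n and p below are arbitrary placeholders:

```python
import numpy as np
from scipy import stats

# A binomial with large n and small p is well approximated by a Poisson
# distribution with lambda = n * p.
n, p = 10_000, 0.0003            # hypothetical values for illustration
lam = n * p                      # expected count per unit volume

k = np.arange(0, 11)
binom_pmf = stats.binom.pmf(k, n, p)
pois_pmf = stats.poisson.pmf(k, lam)

for ki, b, q in zip(k, binom_pmf, pois_pmf):
    print(f"k={ki:2d}  binomial={b:.5f}  poisson={q:.5f}")
```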

  3. Common dimensionality reduction methods
    1) PCA and factor analysis (see the PCA sketch after this list)
    2) LDA
    3) Manifold methods: LLE (locally linear embedding), Laplacian eigenmaps, ISOMAP
    4) Autoencoder feature extraction
    5) SVD
    6) Tree-model feature extraction
    7) Embedding
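A minimal sketch of method 1), PCA, on synthetic data, assuming scikit-learn is available:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: 200 samples with 50 correlated features driven by 5 factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 5))
mixing = rng.normal(size=(5, 50))
X = latent @ mixing + 0.1 * rng.normal(size=(200, 50))

# PCA projects the data onto the directions of maximum variance.
pca = PCA(n_components=5)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                         # (200, 5)
print(pca.explained_variance_ratio_.round(3))  # variance captured per component
```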

4. When users first enter the app, they are asked to select attributes. How can user churn be reduced while still collecting complete user information?
Answer:
Analyzing this with the Technology Acceptance Model (TAM), the main factors that affect whether users accept the attribute-selection step are:
1) Perceived usefulness:
a. Use copy to tell users what benefits selecting attributes will bring them
2) Perceived ease of use:
a. Link the user's third-party account (such as Weibo); during the cold-start phase this can match attributes the user is likely to choose and recommend them
b. Make the interaction smooth and well designed
3) Attitude toward use: the user's attitude toward filling in information
a. Allow users to skip this step and remind them to fill it in later
b. Tell users that the information they fill in will be well protected
4) Behavioral intention: the user's purpose in using the app, which is hard to control
5) External variables: such as time of use, usage environment, etc., which are also hard to control

5. Advantages and disadvantages of SVM
1) Advantages:
a. Can handle non-linearly separable problems via kernels (see the sketch after this list)
b. The final classifier is determined by the support vectors, so complexity depends on the number of support vectors rather than the dimensionality of the sample space, avoiding the curse of dimensionality
c. Robustness: only a small number of support vectors are used, so the key samples are captured and redundant samples are discarded
d. Performs well in high-dimensional, small-sample settings, such as text classification
2) Disadvantages:
a. Training is computationally expensive on large datasets
b. Does not extend naturally to multi-class problems
c. There is no principled methodology for choosing the kernel function
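A minimal sketch of advantage a) above, fitting a kernel SVM on non-linearly separable synthetic data, assuming scikit-learn is available:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Hypothetical non-linearly separable data (two interleaving half-moons).
X, y = make_moons(n_samples=500, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An RBF kernel lets the SVM learn the non-linear decision boundary.
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X_train, y_train)

print("support vectors per class:", clf.n_support_)
print("test accuracy:", round(clf.score(X_test, y_test), 3))
```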
6. A brief introduction to random forests and some details

1) Principle of random forests: build multiple decision trees and combine them with bagging to improve generalization
2) Sources of randomness: subsampling (sampling with replacement), subsampling of features, and low-dimensional projections (feature combinations; see Hsuan-Tien Lin's "Machine Learning Foundations")
3) Because sampling is done with replacement, the out-of-bag (OOB) samples can be used for validation
4) OOB samples can also be used for feature selection. The idea:
    a. If a feature is informative, injecting noise into it will noticeably degrade the model
    b. Injecting noise changes the feature's distribution, so a better approach is to shuffle (permute) the values of that feature and compare the model before and after
    c. But we do not want to train two models, so we take a shortcut with OOB: permute that feature's values in the OOB samples, feed them into the already-trained model, and use the resulting change in error as the test (see the sketch after this list)
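A minimal sketch of the permutation idea in point 4). scikit-learn does not expose per-feature OOB permutation importance directly, so a held-out validation set stands in for the OOB samples here; the data and parameters are arbitrary:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical data with only a few informative features.
X, y = make_classification(n_samples=2000, n_features=8, n_informative=3,
                           random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
forest.fit(X_train, y_train)
print("OOB accuracy:", round(forest.oob_score_, 3))

# Permute one feature at a time and measure how much accuracy drops.
baseline = forest.score(X_val, y_val)
rng = np.random.default_rng(0)
for j in range(X_val.shape[1]):
    X_perm = X_val.copy()
    X_perm[:, j] = rng.permutation(X_perm[:, j])   # shuffle that feature's values
    drop = baseline - forest.score(X_perm, y_val)
    print(f"feature {j}: importance (accuracy drop) = {drop:.3f}")
```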

(Refer to @王娟's answer: https://www.zhihu.com/question/26225801)

7. Introduction to the principle of GBDT
1) First, a word about Adaboost Tree, a boosting-style tree ensemble method. The basic idea is to train multiple trees in sequence, giving larger weights to the misclassified samples before training each subsequent tree. For tree models, weighting a sample amounts to increasing its sampling probability: when sampling with replacement, the misclassified samples are more likely to be drawn.
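A minimal sketch of the sequential re-weighting idea, assuming scikit-learn's AdaBoostClassifier on synthetic data (an illustration, not code from the original answer):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# Hypothetical data. AdaBoost trains weak learners (by default depth-1
# decision trees) in sequence and increases the weight of misclassified
# samples before fitting the next one.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

clf = AdaBoostClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)
print("training accuracy:", round(clf.score(X, y), 3))
```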

2) GBDT is an improvement on Adaboost Tree. Each tree is a CART (classification and regression tree) that outputs a value at each leaf node. The error is the residual: the true value minus the leaf output. What GBDT does is use gradient descent to drive this error down.

In a GBDT iteration, suppose the strong learner obtained in the previous round is f_{t-1}(x), with loss function L(y, f_{t-1}(x)). The goal of this round is to find a CART regression-tree weak learner h_t(x) that minimizes this round's loss L(y, f_t(x)) = L(y, f_{t-1}(x) + h_t(x)). In other words, each iteration finds the decision tree that makes the sample loss as small as possible.

The idea of GBDT can be explained with a simple example. Suppose a person's true age is 30. We first fit with 20 and find the residual is 10. We then fit that residual with 6 and find a gap of 4 remains. In the third round we fit the remaining gap with 3, leaving a gap of only 1. If the iteration budget is not yet exhausted we can keep going, and the error in the fitted age shrinks with each round.
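A minimal sketch of this residual-fitting loop under squared loss, assuming scikit-learn's DecisionTreeRegressor as the weak learner and a synthetic 1-D dataset; the learning rate and tree depth are arbitrary choices:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical 1-D regression data.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=300)

learning_rate = 0.1
prediction = np.zeros_like(y)        # start from a constant prediction (0 here)
trees = []

for _ in range(100):
    residual = y - prediction                     # what is still unexplained
    tree = DecisionTreeRegressor(max_depth=2)
    tree.fit(X, residual)                         # each tree fits the residual
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

print("training MSE:", round(float(np.mean((y - prediction) ** 2)), 4))
```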

(Reference: https://www.cnblogs.com/pinard/p/6140514.html)

3) After multiple trees have been obtained, their outputs are combined to produce the final prediction; in GBDT the tree outputs are summed, whereas in Adaboost Tree each tree is weighted according to its classification error.
