Data Analysis Interview

Data analysis job interviews can generally be broken down into three parts:

1) Technical fundamentals

2) Project experience

3) Business questions


02 When filling missing values with statistical indicators, why are the mean and median usually used for continuous fields, rather than the maximum, minimum, or mode?
Answer: Filling with the mean or median keeps the data balanced to some extent and, in many cases, preserves the original distribution. Filling with the maximum or minimum is likely to shift the distribution of the processed data; when the proportion of missing values is large, it directly skews the distribution, and the filled data no longer matches an objective understanding of the actual business. Of course, filling with the maximum or minimum is entirely reasonable in certain specific scenarios, but in general, the mean and median are the more appropriate choices for continuous features.
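A minimal sketch of this point (the data below are hypothetical): on a right-skewed continuous feature with 20% missing, median filling barely moves the shape, while max filling piles all the imputed rows onto the extreme right tail.

```python
import numpy as np
import pandas as pd

# Hypothetical skewed continuous feature (e.g. income) with 20% missing.
rng = np.random.default_rng(0)
income = pd.Series(rng.lognormal(mean=10, sigma=0.5, size=1000))
income.iloc[rng.choice(1000, size=200, replace=False)] = np.nan

median_filled = income.fillna(income.median())
max_filled = income.fillna(income.max())

# Median filling keeps the centre of the distribution roughly in place;
# max filling stacks 200 rows at the extreme and visibly distorts the shape.
print(f"median-filled skew: {median_filled.skew():.2f}")
print(f"max-filled skew:    {max_filled.skew():.2f}")
```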

03 When handling both missing values and outliers of a feature, why are outliers usually processed first?
Answer: If missing values are handled first and filled with common statistical indicators (maximum, minimum, mean, etc.), those statistics are computed with the outliers included, which effectively implants noise into the missing cells and, to some extent, spreads the influence of the outliers, directly distorting the reasonable distribution of the data. If outliers are handled first, the influence of the noisy data is removed before an appropriate missing-value filling method is applied, which better preserves the original shape of the feature distribution and has a noticeably smaller impact on subsequent model training.
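A tiny numeric sketch of how the order of operations changes the fill value (the values below are invented for illustration):

```python
import numpy as np
import pandas as pd

s = pd.Series([10.0, 12.0, 11.0, np.nan, 13.0, 500.0])  # 500.0 is noise

# Missing values first: the outlier inflates the mean used for imputation,
# so the noise spreads into the previously missing cell (fill value ~109).
filled_first = s.fillna(s.mean())

# Outliers first: cap them with the 1.5*IQR rule, then impute on clean data
# (fill value ~12), preserving the original shape of the distribution.
q1, q3 = s.quantile([0.25, 0.75])
capped = s.clip(upper=q3 + 1.5 * (q3 - q1))
filled_after = capped.fillna(capped.mean())
```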

04 Why are boxplots generally not used for outlier detection on discrete numerical features?
Answer: In terms of the underlying logic of the boxplot, a discrete numerical feature fully supports outlier identification through a boxplot, and the result even has some explanatory value; compared with applying boxplots to continuous features, however, the process is clearly less reasonable for discrete features. For example, suppose a discrete feature takes the values 1, 2, 3, 4, 10. Judged by a boxplot, 10 would be treated as an outlier; but if the feature represents an e-commerce membership-card level, 10 is perfectly meaningful, and treating it as an outlier is unreasonable. Therefore, for discrete numerical features, outliers are usually judged by value proportions or by domain experience.
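The membership-level example can be reproduced directly; a small sketch using the standard 1.5×IQR rule and the value-proportion alternative:

```python
import pandas as pd

# Membership-card level is discrete; level 10 is a valid top tier, not noise.
levels = pd.Series([1, 1, 2, 2, 2, 3, 3, 4, 10])

q1, q3 = levels.quantile([0.25, 0.75])
upper = q3 + 1.5 * (q3 - q1)
print(levels[levels > upper])           # the IQR rule wrongly flags 10

# More suitable for discrete codes: inspect the share of each value instead.
print(levels.value_counts(normalize=True))
```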

05 Why is exploratory feature analysis necessary before data modeling?
Answer: The main purpose of exploring the sample data is to provide a reference for subsequent data cleaning and feature engineering. For data cleaning, statistical analysis of the data tells us each feature's distribution type (continuous or discrete), value type (varchar, int, float, date), missing values, abnormal values, and so on, which in turn determines the concrete cleaning method. For example, the logic for handling missing values differs greatly between continuous and discrete features: continuous features are filled with the mean, while discrete features are filled with the mode. For feature engineering, the value types of the known feature fields likewise drive very different choices of feature encoding, feature standardization, feature correlation analysis, and similar steps. Therefore, in a data analysis task, exploring the sample right after importing the data is very helpful for becoming familiar with the sample's characteristics and planning the follow-up processing.
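A minimal exploration sketch along these lines, assuming a hypothetical sample file `samples.csv`:

```python
import pandas as pd

df = pd.read_csv("samples.csv")   # hypothetical sample file

df.info()                         # value type of each field (int, float, object, date)
print(df.describe())              # distribution summary of the numeric fields
print(df.isna().sum())            # missing-value count per feature
print(df.nunique())               # low cardinality hints at a discrete feature
```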

06 Why should feature derivation not produce too many field dimensions?
Answer: Feature derivation is a feature-engineering step often used in data modeling, especially when the pool of feature variables is small. However, when deriving from the original features, one cannot blindly pursue the number of processed features; the business meaning and application value of each feature must be considered, a point that banks and other traditional financial institutions pay particular attention to. At the same time, by repeatedly deriving from the original features, for example along dimensions such as statistical differences and ratios, an unlimited number of fields can in theory be produced; but the correlations between the new fields are very strong, and most of them will with high probability be deleted during subsequent feature screening, which is obviously wasteful. Even if correlation screening is skipped, the derived fields will directly introduce collinearity during model fitting, which is not the result we want. Therefore, in feature derivation, the most important thing is to analyze objectively and keep to a reasonable set of derivation dimensions and methods.
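A small illustration of why derived fields tend to be deleted in screening (hypothetical data; `balance_share` and `balance_log` are invented derived fields):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({"balance": rng.gamma(2.0, 1000.0, size=500)})

# Two fields derived from the same original feature.
df["balance_log"] = np.log1p(df["balance"])
df["balance_share"] = df["balance"] / df["balance"].sum()

# balance and balance_share are a constant rescaling of each other
# (correlation 1.0), so one of the pair would be dropped in screening anyway.
print(df.corr().round(3))
```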

07 Why is correlation analysis between feature variables necessary?
Answer: Correlation analysis of feature variables is very important in scenarios such as data testing and data modeling. In third-party data testing, correlation analysis yields quantitative indicators for the related fields, providing a very intuitive reference for selecting and introducing features. In data modeling, correlation analysis of features has become standard practice: based on the correlation coefficients between fields, we can retain the fields that carry the most information, which not only greatly weakens collinearity in model fitting but also improves the efficiency and stability of subsequent model training.
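A common screening sketch along these lines, assuming a simple pairwise-correlation threshold (the helper name and the 0.9 cutoff are illustrative choices, not a fixed standard):

```python
import numpy as np
import pandas as pd

def drop_highly_correlated(df: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Drop one feature out of every pair whose |corr| exceeds the threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is examined once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)
```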

Interview question: How should traffic be split when implementing an A/B test?

Answer: There are three core ideas behind an A/B test. First, run multiple variants in parallel. Second, control variables: only one variable should differ between any two variants. Third, a variant's effect must significantly exceed the control group's before the result counts. If the A/B test runs on a single stage of the funnel, the traffic for each variant should be mutually exclusive and assigned randomly, which ensures that every variant's traffic is drawn from the same sample space.
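One common way to get a mutually exclusive, random, and stable split is to hash user IDs into buckets. The sketch below is an illustrative assumption about the mechanism, not the only possible design:

```python
import hashlib

def assign_bucket(user_id: str, experiment: str, n_buckets: int = 2) -> int:
    """Deterministically map a user to one bucket of an experiment.

    Hashing the (experiment, user) pair makes the split random but stable:
    a user always sees the same variant, buckets are mutually exclusive,
    and the experiment name acts as a salt so different tests stay independent.
    """
    digest = hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % n_buckets

# Bucket 0 = control, bucket 1 = treatment.
print(assign_bucket("user_42", "checkout_button_v2"))
```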

Interview question: Our company has a product, a "co-branded credit card" launched in cooperation with a bank, which can be used to withdraw cash. What risks do you see, and how can they be reduced?

Answer: I am not entirely clear on the specific business process behind the "co-branded card" you mention, so I will assume it works much like a bank's own credit card.

The difference is that your company acts as the funding source and the traffic entry point, while the bank acts as the card issuer. I see three risks.

The first is overdue risk, which exists throughout the financial field. The mitigation is to continuously iterate the risk-control rules and retrain the model regularly so it adapts to changes in the customer base. If possible, data can also be shared with the partner bank to reduce the impact of data silos.

The second is fraud risk, which can be reduced by means such as face-to-face signing. When banks issue credit cards, they almost always require an in-person interview at an offline branch; cooperating with a bank makes it possible to take full advantage of this.

The third is policy risk, namely that the bank terminates its cooperation with your company out of compliance considerations.
 
