2020 Xiaohongshu campus recruitment: data analysis written test questions


Hello everyone, I'm Coke

Today I'll walk through the written test questions from Xiaohongshu's 2020 campus recruitment for data analysts.

1. A merchant in the Xiaohongshu mall is pricing a product. If the price is set at 500 yuan, the lowest price on the entire network, customers will definitely buy here; for every 1 yuan increase in price, the probability of losing the customer increases by 1%. The best price for the merchant to quote is ()

A. 520
B. 535
C. 550
D. 565

Answer: C

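Analysis:
The best price is the one that maximizes profit. Let the price increase be x, the profit be y, and M be the number of customers (unknown, but fixed). The analysis takes the profit per sale to be the markup x over the 500-yuan floor, so we need the maximum of the quadratic function y=M(1-x/100)x. It is reached at x=50, so the best price is 500+50=550 yuan.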
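A quick numerical check of this reasoning, as a minimal sketch: it assumes, as the analysis does, that profit per sale equals the markup x and that the share of remaining customers is 1 - x/100. The function name `relative_profit` is only illustrative, and M is set to 1 because it does not change where the maximum lies.

```python
def relative_profit(x: int) -> float:
    # Share of customers kept (1 - x/100) times profit per sale (x), with M = 1.
    return (1 - x / 100) * x

# Try every whole-yuan markup from 0 to 100 and keep the most profitable one.
best_markup = max(range(0, 101), key=relative_profit)
best_price = 500 + best_markup
print(best_markup, best_price)
```

The search lands on a markup of 50 yuan, matching the vertex of the parabola.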


2. In a card-collecting event there are 5 different cards, each appearing with equal probability. Every time you share a note, you get one card. The expected number of notes you need to share to collect all the cards is closest to which of the following? ()

A. 9
B. 11
C. 13
D. 15

Answer: B

Analysis:
This tests the sum of several geometric distributions.

The setup fits the geometric distribution: the draws are independent, every card is equally likely, and we ask how many trials are needed until a success. For a geometric distribution with success probability p, the expected number of trials is 1/p.

Back to this question, there are several situations:

  • Suppose there is only one kind of card: a single draw collects everything, so the expectation is 1

  • Suppose there are two kinds of cards: the first draw always yields a new card; getting the other kind is again geometric with p = 1/2 and expectation 2, so the total expectation is 1+2=3

  • Suppose there are 3 kinds of cards: the first draw always yields a new card, expectation 1; the second new card is one of the remaining two kinds, p = 2/3, expectation 3/2; the third new card is the last remaining kind, p = 1/3, expectation 3. The overall expectation is 1+3/2+3=11/2

  • By analogy, for all 5 kinds of cards:
    the first new card takes 1 draw; the second is one of the remaining 4 kinds, p=4/5, E=5/4; the third is one of the remaining 3 kinds, p=3/5, E=5/3; the fourth is one of the remaining 2 kinds, p=2/5, E=5/2; the fifth is the last remaining kind, p=1/5, E=5.

  • The total expectation is: 1+5/4+5/3+5/2+5 = 137/12, approximately 11.42, so the closest option is 11

This is the same problem as "Ji Wu Fu", the collect-five-Fu-cards event.
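The stage-by-stage sum can be sketched in a few lines of Python. The helper name `expected_draws` is an assumption made for illustration; exact fractions are used so the result can be compared with the hand calculation.

```python
from fractions import Fraction

def expected_draws(n_cards: int) -> Fraction:
    # With k kinds already collected, a draw yields a new kind with
    # probability (n - k) / n, so that stage takes n / (n - k) draws on average.
    return sum(Fraction(n_cards, n_cards - k) for k in range(n_cards))

print(expected_draws(5), float(expected_draws(5)))
```

Calling `expected_draws(2)` and `expected_draws(3)` reproduces the two smaller cases worked through in the list.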


3. In Excel, how do you combine the text in column a and the text in column b into a single string c? ()

A. c=a+b
B. c=a&b
C. c=a and b
D. c=a*b

Answer: B

Analysis:
This tests basic Excel usage.

Excel joins text with the "&" operator; the CONCATENATE function also works. In Python, strings are concatenated with "+".
In SQL, use the CONCAT function, or "+" in dialects that support it, such as SQL Server.


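4. select count(open), count(distinct user_id) from temp1
()
A. 3,4
B. 5,5
C. 5,3
D. 3,5

The table temp1 was not provided with this question, so it cannot be answered here. The point being tested is presumably that count(open) skips NULL values, while count(distinct user_id) counts each user only once.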


5. To investigate the average transportation expenses of the company's 1,000 employees, 100 of them are selected by sampling without replacement. Previous surveys show that the population variance s² is 100. The variance of the sample mean is ()

A. 0.1
B. 1
C. 100/111
D. 10/111

Answer: C

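Analysis:
When sampling without replacement from a finite population, the variance of the sample mean includes the finite population correction:

Var(sample mean) = (s²/n) × (N-n)/(N-1)

With s² = 100, n = 100 and N = 1000:

100/100 × (1000-100)/(1000-1) = 900/999 = 100/111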
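The same arithmetic as a small sketch, using exact fractions. The function name `var_sample_mean` and its parameter names are illustrative assumptions; the formula is the finite population correction for sampling without replacement.

```python
from fractions import Fraction

def var_sample_mean(pop_var: int, n: int, N: int) -> Fraction:
    # (population variance / sample size) times the finite population correction.
    return Fraction(pop_var, n) * Fraction(N - n, N - 1)

result = var_sample_mean(pop_var=100, n=100, N=1000)
print(result)
```

Without the correction factor the result would be 100/100 = 1, which is option B, the answer for sampling with replacement.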

6. Knowing that the month-on-month growth rates from February to May are 5.6%, 7.1%, 8.5%, and 6.4% respectively, the growth rate in May compared with January is ()

A. 5.6%×7.1%×8.5%×6.4%
B. (105.6%×107.1%×108.5%×106.4%)-100%
C. (5.6%×7.1%×8.5%×6.4%)+100%
D. 105.6%×107.1%×108.5%×106.4%

Answer: B

Analysis:
This tests the fixed-base growth rate versus the month-on-month (chain) growth rate.

The growth rate in May compared with January is a fixed-base growth rate. Chain growth rates cannot be converted into a fixed-base growth rate directly. Instead, add 1 to each chain growth rate, multiply the results together, and subtract 1 from the product. Here the fixed-base growth rate is (105.6%×107.1%×108.5%×106.4%)-100%.
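A short sketch of the chain-to-fixed-base conversion with the four rates from the question. The variable names are illustrative.

```python
import math

# Month-on-month growth rates for February through May.
monthly_growth = [0.056, 0.071, 0.085, 0.064]

# Add 1 to each rate, multiply, then subtract 1 from the product.
fixed_base_growth = math.prod(1 + g for g in monthly_growth) - 1
print(f"{fixed_base_growth:.2%}")
```

The compounded result is roughly 30.6%, noticeably more than the 27.6% obtained by simply adding the four rates.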


7. "You can't have both fish and bear's paw" means: ()

A. You get fish or you get bear's paw
B. If you get bear's paw, you don't get fish
C. Either you get fish or you get bear's paw, but not both
D. If you don't get bear's paw, you get fish

Answer: B

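Analysis:
This tests mutually exclusive events.

"You can't have both" says that getting fish and getting bear's paw cannot both happen: not (fish and bear's paw). That is logically equivalent to "if you get bear's paw, you don't get fish", which is B. It does not say that you must get one of the two, so options A, C and D, which all require getting at least one of them, do not follow.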
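The equivalence can be checked with a truth table. This is a sketch under one stated assumption: option A is read as an inclusive "or" and option C as an exclusive "either/or". The names `cannot_have_both` and `options` are illustrative.

```python
from itertools import product

def cannot_have_both(fish: bool, paw: bool) -> bool:
    # "You can't have both": not (fish and paw).
    return not (fish and paw)

# Each option as a Boolean formula (A inclusive or, C exclusive or: assumptions).
options = {
    "A": lambda fish, paw: fish or paw,
    "B": lambda fish, paw: (not paw) or (not fish),  # paw implies not fish
    "C": lambda fish, paw: fish != paw,
    "D": lambda fish, paw: paw or fish,              # not paw implies fish
}

# Keep the options that agree with the statement on all four truth assignments.
equivalent = [
    name
    for name, rule in options.items()
    if all(rule(f, p) == cannot_have_both(f, p)
           for f, p in product([False, True], repeat=2))
]
print(equivalent)
```

The other three options fail on the case where neither fish nor bear's paw is obtained, which the original statement allows.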


8. Which of the following are discriminative models? () -- multiple choice

A. Hidden Markov model
B. Decision tree
C. Support vector machine
D. Naive Bayes
E. Maximum entropy model

Answer: BCE

Analysis:
This tests basic concepts of machine learning algorithms.

Decision trees, support vector machines, and maximum entropy models are discriminative models. Other typical discriminative models include KNN, logistic regression, and neural networks. Naive Bayes and hidden Markov models are generative models.

Regarding discriminative and generative models, the blog post "Machine learning: discriminative models and generative models" by nolonely on Blog Garden (cnblogs) gives an example:

  • Discriminative model example: to decide whether an animal is a goat or a sheep, learn a model from historical data, then extract the animal's features and predict the probability that it is a goat and the probability that it is a sheep.

  • Generative model example: first learn a goat model from the features of goats and a sheep model from the features of sheep. Then extract the animal's features, put them into the goat model to see what the probability is, put them into the sheep model to see what the probability is, and pick whichever is larger.


9. Among the following Excel formula input formats, the correct one is ()

A. =SUM(1,2,,,,99,100)
B. =SUM(E1:E6)
C. =SUM(E1;E6)
D. SUM(“18”,”25”,7)

Answer: B

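Analysis:
This tests basic Excel usage.

Option B is the standard way to write SUM: a formula starts with "=" and a contiguous range is written with a colon, as in E1:E6. Option D, for example, lacks the leading "=".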


10. Regarding the normal distribution, which of the following statements are correct? () -- multiple choice

A. The normal distribution has concentration and symmetry
B. The mean and variance of the normal distribution determine its position and shape
C. The skewness of the normal distribution is 0 and the kurtosis is 1
D. The standard normal distribution has a mean of 0 and a variance of 1

Answer: ABD

Analysis:
This tests basic knowledge of the normal distribution.

The normal distribution curve is symmetric (symmetry), and the mean and median sit at the center, where the values cluster (concentration).
The mean of the normal distribution determines the central position of the curve, and the variance indicates the dispersion: the larger the variance, the flatter and wider the curve, which determines its shape.
The standard normal distribution has a mean of 0 and a variance of 1.
The normal distribution has a skewness of 0 and a kurtosis of 3 (excess kurtosis 0), not 1, so C is wrong.


11. X follows the uniform distribution on the interval (1,5). Find the probability that at least 2 of 3 independent observations of X are greater than 2 ()

Answer: 27/32

Analysis:
This tests the use of the binomial distribution.

For X uniform on (1,5), the probability that a single observation is greater than 2 is p = (5-2)/(5-1) = 3/4, so q = 1/4. Let Y be the number of the n = 3 independent observations that are greater than 2; then Y~B(3,3/4).

"At least 2 observations greater than 2" means P(Y=2)+P(Y=3):
   P = 3!/(2!(3-2)!) × (3/4)^2 × (1/4) + 3!/(3!0!) × (3/4)^3
     = 3 × (3/4) × (3/4) × (1/4) + (3/4)^3
     = 27/64 + 27/64
     = 27/32
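The binomial sum can be reproduced exactly with a short sketch. The helper name `prob_at_least` is an assumption for illustration; `math.comb` supplies the binomial coefficients.

```python
from fractions import Fraction
from math import comb

def prob_at_least(k_min: int, n: int, p: Fraction) -> Fraction:
    # Sum the binomial probabilities P(Y = k) for k = k_min, ..., n.
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_min, n + 1))

# P(X > 2) for X uniform on (1, 5): length of (2, 5) over length of (1, 5).
p = Fraction(5 - 2, 5 - 1)
result = prob_at_least(2, 3, p)
print(result)
```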

For the binomial distribution, please refer to my previous article:
Probability distribution of discrete random variables


12. A good estimator in sampling estimation should meet three criteria: (), and four factors affect a time series: ()

Answers: unbiasedness, consistency, efficiency; long-term trend, seasonal variation, cyclical fluctuation, irregular fluctuation

Analysis:
This tests the basic statistical concepts of sampling estimation and time series. It is a pure concept question.


13. Please give three common clustering algorithms: ()

Answer: K-means clustering, K-medoids clustering, the EM algorithm, the OPTICS algorithm, the DBSCAN algorithm, etc.

Analysis:
This tests basic concepts of clustering algorithms.


14. The face recognition system of Xiaohongshu recognizes the identities of people currently entering Xiaohongshu. The system recognizes three different kinds of people: employees, food delivery staff and strangers. Which learning method is suitable for this application requirement ()

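Answer: multi-class classification

Analysis:
This tests applications of machine learning. There are three known categories of people to tell apart, so it is a multi-class classification problem.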


15. Xiaohongshu has launched a new module on the homepage. The purpose is to increase the user's browsing time. Please design an analysis plan to measure whether the user's stay time has been improved after the module is launched?

Analysis:
The idea is an A/B test, which is discussed in detail in question 19 below.


16. A table (not reproduced here) shows the sales data of an e-commerce company by category and by month
(1) Use SUMIF or SUMIFS to calculate, in cell F3, the sales volume of facial cleanser in 201901
(2) Use a function to count the months in which the sales volume of facial cleanser exceeded 1 million
(3) Use a function to calculate the monthly compound growth rate of the facial cleanser category

Answer:
=SUMIFS(C4:C15,B4:B15,E4,A4:A15,F3)
=COUNTIFS(B2:B13,B2,C2:C13,">100")
=POWER(160/120,1/3)-1

Analysis:
This tests practical use of Excel.

Part (1) tests the SUMIFS function, which is used for conditional summation. The function takes at least three parameters:

  • sum_range: the cell or range of cells to be summed

  • criteria_range: the range that is checked against the condition during summation

  • criteria: the condition, usually a specific value that appears in the criteria range.
    Expanded, the formula reads SUMIFS(sum_range, criteria_range1, criteria1, criteria_range2, criteria2): sum C4:C15 over the rows where B4:B15 matches E4 and A4:A15 matches F3.

Part (2) tests the COUNTIFS function, which is used for conditional counting. Its parameters are:

  • criteria_range[N]: the range that is checked against the N-th condition

  • criteria[N]: the N-th condition.
    Expanded, the formula counts the rows where B2:B13 matches B2 and C2:C13 is greater than 100 (the threshold of 100 suggests that sales are recorded in units of 10,000).

Part (3) is the calculation of the compound growth rate. Its formula is:
(ending value/beginning value)^(1/number of periods) - 1
Here the monthly compound growth rate of facial cleanser is to be calculated. Excel's POWER function computes the power.
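For readers more at home in Python, the three formulas can be mirrored as follows. This is only a sketch: the rows are hypothetical stand-ins for the missing table (only 120, 160 and the 3 periods come from the answer above), and the helper names `sumifs`, `countifs` and `compound_growth` are illustrative, not Excel or library APIs.

```python
# Hypothetical rows (month, category, sales); the original table is not shown.
rows = [
    ("201901", "facial cleanser", 120),
    ("201902", "facial cleanser", 90),
    ("201903", "facial cleanser", 140),
    ("201904", "facial cleanser", 160),
    ("201901", "lipstick", 200),
]

def sumifs(rows, category, month):
    # Conditional sum: add sales where both category and month match.
    return sum(s for m, c, s in rows if c == category and m == month)

def countifs(rows, category, threshold):
    # Conditional count: rows of the category whose sales exceed the threshold.
    return sum(1 for m, c, s in rows if c == category and s > threshold)

def compound_growth(start, end, periods):
    # (ending value / beginning value)^(1 / number of periods) - 1
    return (end / start) ** (1 / periods) - 1

print(sumifs(rows, "facial cleanser", "201901"))
print(countifs(rows, "facial cleanser", 100))
print(f"{compound_growth(120, 160, 3):.2%}")
```

With the values 120 and 160 over 3 periods, the monthly compound growth rate comes out at about 10%.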


17. There is an order transaction table, orders, and a favorites transaction table, favorites. The sample tables are not reproduced here; the answer below uses the columns user_id, item_id and pay_time from orders, and user_id, item_id and fav_time from favorites.

Please write a single SQL statement that extracts every user's behavior features for each product. The features are: purchased, purchased but not favorited, favorited but not purchased, and favorited and purchased.

Answer:

-- Column aliases: 已购买 = purchased, 购买未收藏 = purchased but not favorited,
-- 收藏未购买 = favorited but not purchased, 收藏且购买 = favorited and purchased.
-- LEFT JOIN ... UNION ... RIGHT JOIN emulates a full outer join
-- (useful in dialects such as MySQL that lack FULL OUTER JOIN).
SELECT o.user_id, o.item_id,
  (CASE WHEN o.pay_time IS NOT NULL THEN 1 ELSE 0 END) AS '已购买',
  (CASE WHEN o.pay_time IS NOT NULL AND f.fav_time IS NULL THEN 1 ELSE 0 END) AS '购买未收藏',
  (CASE WHEN o.pay_time IS NULL AND f.fav_time IS NOT NULL THEN 1 ELSE 0 END) AS '收藏未购买',
  (CASE WHEN o.pay_time IS NOT NULL AND f.fav_time IS NOT NULL THEN 1 ELSE 0 END) AS '收藏且购买'
FROM orders o
LEFT JOIN favorites f
  ON o.user_id = f.user_id
  AND o.item_id = f.item_id
UNION
-- Second half: rows that exist in favorites, including those with no order.
SELECT f.user_id, f.item_id,
  (CASE WHEN o.pay_time IS NOT NULL THEN 1 ELSE 0 END) AS '已购买',
  (CASE WHEN o.pay_time IS NOT NULL AND f.fav_time IS NULL THEN 1 ELSE 0 END) AS '购买未收藏',
  (CASE WHEN o.pay_time IS NULL AND f.fav_time IS NOT NULL THEN 1 ELSE 0 END) AS '收藏未购买',
  (CASE WHEN o.pay_time IS NOT NULL AND f.fav_time IS NOT NULL THEN 1 ELSE 0 END) AS '收藏且购买'
FROM orders o
RIGHT JOIN favorites f
  ON o.user_id = f.user_id
  AND o.item_id = f.item_id
ORDER BY user_id, item_id;

Analysis:
This tests the use of CASE WHEN, outer joins and UNION in SQL.
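The logic of the query (a full outer join on user and item, followed by four 0/1 flags) can be sketched in plain Python. The sample data and the function name `behavior_flags` are hypothetical, and the English flag names stand in for the Chinese column aliases.

```python
# Hypothetical sample data keyed by (user_id, item_id); values are timestamps.
orders = {
    ("u1", "i1"): "2019-01-01 10:00:00",
    ("u2", "i2"): "2019-01-02 11:00:00",
}
favorites = {
    ("u1", "i1"): "2018-12-30 09:00:00",
    ("u3", "i3"): "2019-01-03 12:00:00",
}

def behavior_flags(orders, favorites):
    result = {}
    # The union of keys plays the role of the full outer join.
    for key in sorted(set(orders) | set(favorites)):
        bought = orders.get(key) is not None
        faved = favorites.get(key) is not None
        result[key] = {
            "purchased": int(bought),
            "purchased_not_favorited": int(bought and not faved),
            "favorited_not_purchased": int(faved and not bought),
            "favorited_and_purchased": int(bought and faved),
        }
    return result

for key, flags in behavior_flags(orders, favorites).items():
    print(key, flags)
```

Each (user, item) pair appears once, whether it comes from orders, from favorites, or from both.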


18. The praise rate (the share of positive reviews) is an important indicator when users evaluate products. We need the praise rate of the "DW" brand in the "Mother and Baby" category among the evaluations submitted by the user 'Xiao Zhang' from March 1, 2019 to March 31, 2019 (praise rate = number of "good review" evaluations / total number of evaluations). Please write the query in SQL, Python or another language:
User evaluation details table: a
Fields: id (evaluation id, primary key), create_time (evaluation creation time, format '2019-01-01'), user_name (user name), goods_id (product id, foreign key), sub_time (evaluation submission time, format '2019-01-01 23:10:32'), sat_name (rating type, one of "good review", "medium review", "bad review")
Product details table: b
Fields: goods_id (product id, primary key), goods_name (product category), brand_name (brand name)

Answer:

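select
  -- '好评' = good review; '好评率' = praise rate
  sum(case when a.sat_name = '好评' then 1 else 0 end)
    / sum(case when a.sat_name is not null then 1 else 0 end) as '好评率'
from a join b on a.goods_id = b.goods_id
where a.user_name = '小张'   -- Xiao Zhang
and b.goods_name = '母婴'    -- Mother and Baby category
and b.brand_name = 'DW'
-- the question asks about submission time; the half-open range keeps all of March 31
and a.sub_time >= '2019-03-01' and a.sub_time < '2019-04-01'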

Analysis:
This tests writing SQL: a join, filter conditions, and conditional aggregation with CASE WHEN.
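To try a query of this shape, here is a sketch that runs it against an in-memory SQLite database through Python's standard `sqlite3` module. All rows are hypothetical ('小李' is a made-up second user and '差评' means bad review). Two choices are assumptions rather than part of the original answer: the filter uses sub_time with a half-open date range so that reviews submitted late on March 31 are included, and the numerator is multiplied by 1.0 because SQLite would otherwise do integer division.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE a (id INTEGER PRIMARY KEY, create_time TEXT, user_name TEXT,
                goods_id INTEGER, sub_time TEXT, sat_name TEXT);
CREATE TABLE b (goods_id INTEGER PRIMARY KEY, goods_name TEXT, brand_name TEXT);
INSERT INTO b VALUES (1, '母婴', 'DW'), (2, '母婴', 'Other');
INSERT INTO a VALUES
  (1, '2019-03-01', '小张', 1, '2019-03-01 08:00:00', '好评'),
  (2, '2019-03-15', '小张', 1, '2019-03-15 12:30:00', '差评'),
  (3, '2019-03-31', '小张', 1, '2019-03-31 23:10:32', '好评'),
  (4, '2019-04-01', '小张', 1, '2019-04-01 00:00:01', '好评'),
  (5, '2019-03-10', '小张', 2, '2019-03-10 09:00:00', '好评'),
  (6, '2019-03-10', '小李', 1, '2019-03-10 09:00:00', '差评');
""")

# Good reviews divided by all rated reviews, for one user, category and brand.
query = """
SELECT 1.0 * SUM(CASE WHEN a.sat_name = '好评' THEN 1 ELSE 0 END)
       / SUM(CASE WHEN a.sat_name IS NOT NULL THEN 1 ELSE 0 END)
FROM a JOIN b ON a.goods_id = b.goods_id
WHERE a.user_name = '小张' AND b.goods_name = '母婴' AND b.brand_name = 'DW'
  AND a.sub_time >= '2019-03-01' AND a.sub_time < '2019-04-01'
"""
praise_rate = conn.execute(query).fetchone()[0]
print(praise_rate)
```

Only rows 1 to 3 pass every filter, two of them good reviews, so the rate is 2/3. Row 3 is the one that a BETWEEN '2019-03-01' AND '2019-03-31' filter on a datetime column would miss.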


19. After some research, we have developed a new recommendation algorithm for the "Related Products" module on the product page, and we intend to run an A/B test (50% of users keep the original algorithm as the control group, and 50% of users get the new algorithm as the experimental group) to evaluate the effect of the new algorithm. Assuming you are the data analyst for this experiment, how would you evaluate the performance of the control group and the experimental group? (Assume all required data are available.) Please list the three most important indicators in order of importance and give your analysis process/thoughts.

Analysis:

  • Indicators: click-through rate (clicks/impressions) of related products; conversion rate of add-to-cart/buy-now after entering the product details page; total sales

  • Method: hypothesis testing

Hypothesis testing can proceed as follows:
1. State the null hypothesis and the alternative hypothesis.
2. Run the A/B test for a period of time.
3. Run a t-test and calculate the p-value.
4. Interpret the result: if the indicator with the new algorithm is much higher than without it, and such a result would be very unlikely if the new algorithm had no effect, then the null hypothesis is rejected, that is, the new algorithm is effective.

Principle: proof by contradiction based on small-probability events
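The test steps can be sketched for the first indicator, the click-through rate. Note the swap: because that indicator is a proportion, this sketch uses a two-proportion z-test with a normal approximation instead of the t-test named above. The click and impression counts are hypothetical, and the function name `two_proportion_z_test` is illustrative.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(clicks_a, views_a, clicks_b, views_b):
    # H0: both groups share the same click-through rate; H1: group B is higher.
    p_a = clicks_a / views_a
    p_b = clicks_b / views_b
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    se = sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    z = (p_b - p_a) / se
    p_value = 1 - NormalDist().cdf(z)  # one-sided p-value
    return z, p_value

# Hypothetical counts: control (A) versus new algorithm (B).
z, p_value = two_proportion_z_test(1000, 20000, 1150, 20000)
print(round(z, 2), p_value)
```

With these made-up counts the p-value is far below 0.05, so the null hypothesis of "no effect" would be rejected. With identical counts in both groups, z is 0 and the one-sided p-value is 0.5.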


20. Suppose we find that the sales of category X in a store in March this year dropped by 50% compared with March last year. If you were the data analyst in charge, how would you analyze it? Please write down your analytical thinking/process/ideas.

Analysis:
This is an open question. Here is one line of thinking:

  • Rule out problems with the data itself: first confirm that the data is correct and that the data source and metric definitions are right, then continue the analysis;

  • Confirm whether the decline is reasonable: check the 50% drop against month-on-month, year-on-year, and cohort comparisons;

  • Analyze external causes: which external factors may be related to the decline, and to what degree, such as whether other departments have shipped product iterations, adjusted operating strategies, or had equipment failures;

  • Analyze internal causes: break the indicator down and analyze it along several dimensions, such as users, products, and market;

  • Confirm the degree of impact: identify which link caused the indicator to fall, whether the decline affects key indicators, and by how much;

  • Formulate follow-up measures: how to avoid such problems in the future.


21. The DAU of an app in July increased by 10% compared with May of the same year. As a data analyst, from what angles would you analyze the reasons for the increase in DAU? Please list at least two ways of breaking it down.

Analysis:
This question is very similar to the previous one: one asks why an indicator fell, the other why it rose. This one puts more weight on analyzing internal causes, but the first and most important step is still to check the accuracy of the data.

A netizen's answer on Niuke.com lays out the idea very clearly (it is not reproduced here).


22. Pick any community app (excluding Xiaohongshu) that you have used, and answer the following questions:
(1) Describe the user characteristics of this app, and compare them with the user characteristics of Xiaohongshu: what is similar and what is different?
(2) Estimate how many people post content on this app every day. List the auxiliary data you need, and briefly describe the estimation method
(3) The app you chose will soon invite one of three groups of artists, A, B or C, for a joint activity whose main purpose is to increase DAU. Given that the form of the activity is exactly the same, which group would you choose?
Answer requirements: 1) briefly describe the analysis approach, 2) list the corresponding data indicators

Analysis:
This is an open question.


23. After some research, we decided to add a short video introduction page when new users activate the app for the first time, to improve their understanding of the product, and we plan to evaluate it with an A/B test (50% of users are the control group, and 50% of users will see the short video introduction). If you were the data analyst for this experiment, how would you evaluate the performance of the control group and the experimental group? Please list the indicators you think are important, and give the analysis process and the statistical methods that may be used.

Analysis:
The purpose should be to understand the behavior of users after watching the short video introduction page, so as to judge whether the short video introduction page is useful.

  • Indicators: for the experimental group, look at the click-through rate, bounce rate, and viewing time of the short video; between the two groups, compare the number of activated users, the registration activation rate, and subsequent retention.

  • Method: Hypothesis Testing


24. There is a convenience store downstairs from the Shanghai office of Xiaohongshu, with an area of about 20 square meters, which mainly sells snacks and drinks. Please estimate the weekly turnover of this convenience store.

Analysis:
For estimation problems like this, the main approach is logical decomposition: break a complex problem into specific, simple ones. Here is one line of reasoning:

Turnover can be split into customer flow × average spend. Of the 20 square meters, assume 10 hold goods and 10 are customer area, which fits 5 customers at the same time. Assuming each customer spends 10 minutes in the store, the customer flow is 30 people per hour. With an average spend of 25 yuan per customer and 10 business hours a day, the weekly turnover is 30 × 25 × 10 × 7 = 52,500 yuan.
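Writing the decomposition as a function makes it easy to vary the assumptions. All inputs are the guesses from the paragraph above, and the name `weekly_turnover` is illustrative.

```python
def weekly_turnover(concurrent_customers, minutes_per_visit,
                    spend_per_customer, hours_per_day, days_per_week=7):
    # Customers served per hour = customers in the store at once x visits per hour.
    customers_per_hour = concurrent_customers * 60 / minutes_per_visit
    return customers_per_hour * spend_per_customer * hours_per_day * days_per_week

estimate = weekly_turnover(concurrent_customers=5, minutes_per_visit=10,
                           spend_per_customer=25, hours_per_day=10)
print(estimate)
```

Changing a single assumption, for example doubling the time per visit to 20 minutes, halves the estimate, which shows how sensitive such estimates are to each input.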


25. If the app has a function that uploads the user's location to the database every minute, how would you make use of it?

Analysis:
The answer should focus on what can be done with the user's location information. For example, the location data gives the user's movement trajectory, which can be used to analyze the user's habits and provide matching real-time recommendations.

Summary

  • Some questions test statistics, such as applications of the geometric and binomial distributions;

  • Some are fairly basic math problems, such as maximizing a quadratic function and computing growth rates;

  • Some test basic Excel usage, such as whether a formula is written correctly;

  • Some test basic knowledge of machine learning and statistics, such as which clustering algorithms exist; knowing them is enough;

  • SQL is tested through application; the key is being able to write the SQL for the two long questions directly;

  • Among the long questions, the focus is on applying the A/B test. Three questions rely on this idea, so it is very important.



Origin blog.csdn.net/data_cola/article/details/116026175