Uber ML

Linear regression assumption violations: impact and fix

https://www.1point3acres.com/bbs/forum.php?mod=viewthread&tid=466856&extra=page%3D1%26filter%3Dsortid%26sortid%3D311%26searchoption%5B3046%5D%5Bvalue%5D%3D22%26searchoption%5B3046%5D%5Btype%5D%3Dradio%26sortid%3D311%26orderby%3Ddateline

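Difference between lasso and ridge, when to use a tree vs. regression and how to choose between them, feature selection, and why I ended up using lasso in one of my projects.
Difference between gradient boosting and random forest, the gradient boosting algorithm, how a tree splits, which models can predict whether a user will sign up or not, and how to choose among them. Finally, how to explain to a non-technical audience that the model's performance is AUC = 0.8.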
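One way to explain AUC = 0.8 to a non-technical audience: pick one user who signed up and one who did not, at random; about 80% of the time the model gives the higher score to the one who signed up. A minimal sketch of that pairwise reading follows. The labels and scores are made-up illustration data, and `auc_by_pairs` is a hypothetical helper name, not a library function.

```python
def auc_by_pairs(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up data: 1 = signed up, 0 = did not sign up.
labels = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.8, 0.3, 0.2, 0.1]
print(auc_by_pairs(labels, scores))
```

The double loop is quadratic, so it is only for intuition; for real data a rank-based implementation is the usual choice.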
How to predict whether a restaurant will churn from Eats platform?
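Prediction: how would you predict the quantity of restaurant orders on the Uber Eats platform? Lots of discussion of details. One question really stumped me: some restaurants have a long history and some have a very short one (say, only 2 months), so how do you handle the short-history case? I said to backfill directly in the time series model based on the most recent history. The interviewer's reaction was that this is acceptable, since it is a variant of the missing-data problem, and he did not squeeze me further. After the interview I thought: maybe use KNN to pick a few similar restaurants and predict for this restaurant from their historical data (just my guess)? I don't really understand how time series models handle very short histories, so discussion is welcome~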
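The KNN idea could look roughly like the sketch below: find the k mature restaurants closest to the new one in some feature space and average their order histories. Everything here is assumed for illustration: the feature vectors (which would need to be scaled in practice), the weekly order counts, and the helper name `knn_cold_start_forecast`. It is only one possible answer, not what the interviewer was looking for.

```python
import math


def knn_cold_start_forecast(new_features, mature, k=2):
    """Average the weekly order history of the k most similar mature restaurants."""
    ranked = sorted(mature, key=lambda r: math.dist(new_features, r["features"]))
    neighbors = ranked[:k]
    # Only forecast as far as every neighbor has history.
    horizon = min(len(r["weekly_orders"]) for r in neighbors)
    return [
        sum(r["weekly_orders"][t] for r in neighbors) / len(neighbors)
        for t in range(horizon)
    ]


# Hypothetical restaurants: features might be (scaled price level, scaled distance to downtown).
mature = [
    {"name": "A", "features": (1.0, 0.2), "weekly_orders": [100, 110, 120]},
    {"name": "B", "features": (0.9, 0.3), "weekly_orders": [80, 90, 100]},
    {"name": "C", "features": (0.1, 0.9), "weekly_orders": [300, 310, 320]},
]
print(knn_cold_start_forecast((0.95, 0.25), mature, k=2))
```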
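Consider seasonality: orders typically peak at lunchtime on weekdays, with a smaller peak in the evening; on weekends the peaks disappear.
Explain RF and boosting.
Explain ridge regression and lasso.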
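A compact way to see the ridge vs. lasso difference is the special case of an orthonormal design (XᵀX = I), where both have closed forms in terms of the OLS coefficient. Assuming the objectives ½‖y − Xβ‖² + (λ/2)‖β‖² for ridge and ½‖y − Xβ‖² + λ‖β‖₁ for lasso, ridge rescales every coefficient while lasso soft-thresholds it, which is why lasso can set coefficients exactly to zero and so doubles as feature selection. The function names and the example coefficients below are illustrative only.

```python
def ridge_shrink(beta_ols, lam):
    """Ridge under an orthonormal design: proportional shrinkage, never exactly zero."""
    return beta_ols / (1.0 + lam)


def lasso_shrink(beta_ols, lam):
    """Lasso under an orthonormal design: soft-thresholding, small coefficients become zero."""
    magnitude = max(abs(beta_ols) - lam, 0.0)
    return magnitude if beta_ols >= 0 else -magnitude


lam = 0.5
for beta in [3.0, 0.3, -0.2, -2.0]:  # made-up OLS coefficients
    print(beta, ridge_shrink(beta, lam), lasso_shrink(beta, lam))
```

Outside the orthonormal case there is no such simple formula, but the qualitative behavior (ridge shrinks, lasso shrinks and selects) is the same.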
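Asked whether I'm familiar with linear regression, and what its assumptions are.
I answered: normal distribution, and the variables need to be independent (then he prompted me: what happens if they are not independent? I said multicollinearity, and then I realized that requiring independence is too strong and is not actually needed).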
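On the multicollinearity point: a common diagnostic is the variance inflation factor, VIF = 1 / (1 − R²), where R² comes from regressing one predictor on the others. With only two predictors, R² is just the squared Pearson correlation, so a small sketch needs no regression library. The data and the helper names below are made up for illustration.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def vif_two_predictors(x1, x2):
    """VIF for the two-predictor case, where R^2 equals the squared correlation."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r * r)


x1 = [1, 2, 3, 4, 5, 6]
x2_collinear = [1.1, 1.9, 3.2, 3.9, 5.1, 5.8]  # nearly a copy of x1
x2_unrelated = [3, 1, 4, 1, 5, 2]

print(vif_two_predictors(x1, x2_collinear))  # large: coefficient variance is inflated
print(vif_two_predictors(x1, x2_unrelated))  # close to 1: little inflation
```

A large VIF means the coefficient estimates are unstable and their standard errors inflated; predictions can still be fine, which is why correlated predictors are a problem of degree rather than a violated assumption (only perfect collinearity breaks OLS).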
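Asked me the difference between random forest and gradient boosting.
I answered: random forest is based on bagging and each tree is independent; in gradient boosting each tree builds on the previously trained trees, so they are not independent.
Then he asked how bagging works. I answered: resample, then train a model on each sample.
Then he asked: do you know the bias-variance trade-off? I answered blablabla overfitting to training blabla underfitting.
Then he asked: do you know how random forest and gradient boosting differ in how they address bias and variance?
I answered: random forest reduces variance, GBM reduces bias.
(Roughly like this; I may have left something out.)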
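To make the structural difference concrete, here is a minimal sketch with one-split regression stumps on a single feature: bagging trains each stump independently on a bootstrap resample and averages them, while gradient boosting (squared loss, so the negative gradient is just the residual) fits each stump to what the previous ones left unexplained. This is a toy under stated assumptions, not a random forest (no feature subsampling, no deep trees) and not any library's implementation; all names and the toy data are made up.

```python
import random


def fit_stump(xs, ys):
    """Best single-threshold regression stump under squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - lmean) ** 2 for y in left) + sum((y - rmean) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, lmean, rmean)
    if best is None:  # all x values identical: fall back to a constant
        const = sum(ys) / len(ys)
        return lambda x: const
    _, threshold, lmean, rmean = best
    return lambda x: lmean if x <= threshold else rmean


def fit_bagging(xs, ys, n_trees=25, seed=0):
    """Independent stumps on bootstrap resamples, averaged at prediction time."""
    rng = random.Random(seed)
    n = len(xs)
    stumps = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        stumps.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: sum(s(x) for s in stumps) / len(stumps)


def fit_boosting(xs, ys, n_trees=25, learning_rate=0.5):
    """Sequential stumps, each fit to the residuals of the ensemble so far."""
    base = sum(ys) / len(ys)
    residuals = [y - base for y in ys]
    stumps = []
    for _ in range(n_trees):
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        residuals = [r - learning_rate * stump(x) for x, r in zip(xs, residuals)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)


def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


xs = [i / 10 for i in range(20)]
ys = [x * x for x in xs]  # toy target
print("single stump:", mse(fit_stump(xs, ys), xs, ys))
print("bagged stumps:", mse(fit_bagging(xs, ys), xs, ys))
print("boosted stumps:", mse(fit_boosting(xs, ys), xs, ys))
```

Averaging independent models mainly smooths out the instability of any single one (variance), whereas the sequential residual fitting lets a set of weak, high-bias learners add up to a flexible function (bias), which matches the answer given above.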
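Then the modeling questions started.
How would you personalize the restaurant recommendation page for a user?
I answered: consider the user's own history, other people's history, the restaurant category, blablabla. He asked many detailed questions, such as how you decide to build the model and exactly how you compute distance, but I can't quite remember them, sorry~
How do you evaluate whether your model is good?
I said: deploy it for a month, then collect data such as CTR.
Then he said: what if that takes too long and you can't deploy it?
I said: use historical data, compare the user's actual ratings with our predicted ratings, and compute MSE or similar. Then he asked about product: which metrics matter? I said CTR, total amount, and amount per restaurant. He kept asking whether I could think of others, but I couldn't come up with any.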
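6. How do you compare which of two models is better?
I said: collect the data above, then use a t-test.
Then he asked what a t-test is.
I answered: assuming the two samples have equal variance, test whether their means are equal.
He asked how you tell.
I said: if the p-value < 0.05, they are significantly different.
He asked what a p-value is. I explained at length, but whenever I'm asked to explain the p-value it comes out as a mess.
Then he asked how the p-value is computed; I couldn't remember the formula.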
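For reference, the computation could be sketched as follows: form the pooled-variance test statistic (matching the equal-variance assumption above), then take the two-sided tail probability of observing a statistic at least that extreme if the means were truly equal. One loud assumption: the exact t-test uses the Student t distribution with n1 + n2 − 2 degrees of freedom; to stay within the standard library this sketch swaps in the normal distribution, which is only a reasonable approximation for large samples. The data and the function name are made up.

```python
import math
from statistics import NormalDist, mean, variance


def two_sample_p_value(a, b):
    """Pooled-variance test statistic with a normal-approximation two-sided p-value."""
    n1, n2 = len(a), len(b)
    # Pooled variance: weighted average of the two sample variances.
    pooled_var = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2)
    std_err = math.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    stat = (mean(a) - mean(b)) / std_err
    # ASSUMPTION: normal tail instead of Student t with n1 + n2 - 2 degrees of freedom.
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(stat)))
    return stat, p_value


model_a = [1, 2, 3, 4, 5] * 10  # made-up per-user metric under model A
model_b = [3, 4, 5, 6, 7] * 10  # made-up per-user metric under model B
stat, p_value = two_sample_p_value(model_a, model_b)
print(stat, p_value, p_value < 0.05)
```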
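Then we got onto how to know how long the test needs to run. I said you look at power, confidence level, and the discernible difference to determine the total sample size. He asked for the formula again; I didn't know it.
Then he asked: with one discernible difference of 0.01 and another of 0.05, which needs more samples? I answered wrong at first; it should be 0.01 that needs more. He asked how many times more; I said 5 times. He said no, it's 25 times, because the required sample size is inversely proportional to the square of the discernible difference. I lost...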
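The usual textbook formula for a two-sided, two-sample comparison of means is n per group = 2 (z_{1−α/2} + z_{power})² σ² / δ², where δ is the discernible difference. Since δ sits squared in the denominator, going from 0.05 to 0.01 multiplies the sample size by (0.05 / 0.01)² = 25. A quick check, assuming σ = 1, α = 0.05, and power = 0.8 (these defaults are illustrative, not from the interview):

```python
from statistics import NormalDist


def samples_per_group(delta, sigma=1.0, alpha=0.05, power=0.8):
    """Approximate per-group sample size for a two-sided two-sample test of means."""
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_power = NormalDist().inv_cdf(power)
    return 2.0 * (z_alpha + z_power) ** 2 * sigma ** 2 / delta ** 2


n_small_effect = samples_per_group(0.01)
n_large_effect = samples_per_group(0.05)
print(n_small_effect, n_large_effect, n_small_effect / n_large_effect)
```

The ratio does not depend on σ, α, or power, since those factors cancel; only the squared ratio of the two differences remains.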
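7. Then he asked: if there are dozens of models at the same time, how do I tell which one is best, and what problems would arise?
At first I said that if we stick with t-tests, which only compare two at a time, it would take a long time.
But then I said maybe ANOVA? I hadn't gotten any feedback yet when time ran out, so the call ended.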
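Besides taking a long time, a likely issue the interviewer had in mind (my guess, since the call ended before any feedback) is multiple comparisons: run enough pairwise tests at α = 0.05 and some will look significant by chance. A rough sketch of the scale of the problem, treating the tests as independent (a simplifying assumption, since pairwise comparisons share data) and using 30 models as a hypothetical count:

```python
def family_wise_error_rate(alpha, n_tests):
    """Chance of at least one false positive across n_tests independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_alpha(alpha, n_tests):
    """Per-test threshold that keeps the family-wise error rate at or below alpha."""
    return alpha / n_tests


n_models = 30  # hypothetical
n_pairs = n_models * (n_models - 1) // 2  # every pairwise comparison

print(n_pairs)
print(family_wise_error_rate(0.05, n_pairs))
print(bonferroni_alpha(0.05, n_pairs))
```

ANOVA answers a different question (is there any difference among the group means at all), so it is typically followed by corrected pairwise comparisons to find which model is actually better.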
