The Machine Learning Road - Common Interview Questions

Most of the content comes from the book "百面机器学习" (One Hundred Faces of Machine Learning), an interview guide written by algorithm engineers.

1. Feature Engineering

1.1 Why do numerical features need to be normalized?

Normalization puts every feature on the same order of magnitude and eliminates the influence of different units (dimensions) between features.

For example, when analyzing the impact of a person's height and weight on health, the two features are measured in different units and span different numeric ranges, so whichever has the larger range would dominate the result unless both are normalized.

1.2 Supplementary knowledge

Structured data: can be viewed as a table in a relational database; each column has a clear definition and contains numerical or categorical values.

Unstructured data: text, images, audio, etc.; it cannot be represented by a simple numerical value, has no clear categorical definition, and each piece of data may have a different size.

1.3 Methods

Linear normalization (Min-Max Scaling)

It applies a linear transformation to the raw data so that the result is mapped into the [0, 1] range, achieving an equal-proportion scaling of the original data: X_norm = (X - X_min) / (X_max - X_min).

Zero-mean normalization (Z-Score Normalization)

It maps the original data to a distribution with mean 0 and standard deviation 1: z = (x - μ) / σ, where μ is the mean and σ the standard deviation of the original data.
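A quick NumPy sketch of both methods (the small height/weight array is made-up toy data):

```python
import numpy as np

# Toy data: each row is a sample, columns are height (cm) and weight (kg)
X = np.array([[170.0, 60.0],
              [180.0, 80.0],
              [160.0, 50.0]])

# Min-Max scaling: map each column linearly into [0, 1]
x_min, x_max = X.min(axis=0), X.max(axis=0)
X_minmax = (X - x_min) / (x_max - x_min)

# Z-score normalization: subtract the column mean, divide by the column std
X_zscore = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_minmax)
print(X_zscore)
```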

 

 

 1.4 Note

Models trained by gradient descent usually require normalization, including linear regression, logistic regression, SVMs, and neural networks; decision tree models, however, do not need normalization.

 

1.2 How should categorical features be handled during data preprocessing?

Ordinal Encoding

Typically used for categorical data whose categories have an ordering (size) relationship.

For example, grades can be divided into three levels: high is encoded as 3, medium as 2, and low as 1.

 

One-hot Encoding

Used for categorical features whose categories have no ordering relationship.

For example, blood type (type A, type B, type AB, type O):

Type A = (1, 0, 0, 0), type B = (0, 1, 0, 0), type AB = (0, 0, 1, 0), type O = (0, 0, 0, 1).

Note: when the number of category values is large:

1. Use sparse vector representations to save space.

2. Combine with feature selection to reduce the dimensionality.

 

Binary Encoding

First assign each category an ID using ordinal encoding, then use the binary representation of that category ID as the encoding result; this needs fewer dimensions than one-hot encoding.
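A small pandas sketch of the three encodings, using the grade and blood-type examples above (column and variable names are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({"blood": ["A", "B", "AB", "O", "A"]})

# Ordinal encoding: ordered categories mapped to integers (the grade example)
grade_order = {"low": 1, "medium": 2, "high": 3}
grades = pd.Series(["high", "low", "medium"]).map(grade_order)

# One-hot encoding: one binary column per category
one_hot = pd.get_dummies(df["blood"], prefix="blood")

# Binary encoding: give each category an integer ID, then write that ID in binary
codes = df["blood"].astype("category").cat.codes.to_numpy()      # e.g. 0..3
n_bits = int(codes.max()).bit_length()
binary = pd.DataFrame({f"bit{i}": (codes >> i) & 1 for i in reversed(range(n_bits))})

print(grades, one_hot, binary, sep="\n\n")
```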

 

 

1.3 What are combined (crossed) features? How do we handle high-dimensional combined features?

To improve the ability to fit complex relationships, feature engineering often combines first-order discrete features pairwise to form higher-order combined features.
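As a rough sketch of pairwise feature crossing (the language/site_type columns are invented toy data), two discrete features can be crossed by concatenating their values and encoding the result:

```python
import pandas as pd

df = pd.DataFrame({
    "language":  ["zh", "en", "zh", "en"],
    "site_type": ["movie", "movie", "tv", "tv"],
})

# Second-order combination of two first-order discrete features
df["language_x_site"] = df["language"] + "_" + df["site_type"]

# The crossed feature is itself categorical and can then be one-hot encoded
crossed = pd.get_dummies(df["language_x_site"])
print(crossed)
```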

 

 

 

 

 

Note

When ID-type features (such as user ID or item ID) are introduced, a problem arises: crossing them produces an m × n dimensional feature, far too many parameters to learn.

Solution: matrix factorization, i.e. represent each ID by a low-dimensional (k-dimensional) dense vector, reducing the number of parameters from m × n to (m + n) × k.

 

1.4 How do we find effective feature combinations?

One method is based on decision trees: each path from the root to a leaf node can be regarded as one feature combination, and gradient boosted decision trees (GBDT) can be used to construct the trees.
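One common realisation of this idea is the GBDT leaf-index trick: train a gradient boosted tree ensemble and treat the leaf each sample reaches in each tree as a new combined feature. A hedged scikit-learn sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import OneHotEncoder

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Each root-to-leaf path in a tree corresponds to one feature combination
gbdt = GradientBoostingClassifier(n_estimators=20, max_depth=3, random_state=0)
gbdt.fit(X, y)

# apply() returns, for every sample, the index of the leaf it reaches in each tree
leaves = gbdt.apply(X)[:, :, 0]              # shape: (n_samples, n_trees)

# One-hot encode the leaf indices to obtain the new combined features
combo_features = OneHotEncoder().fit_transform(leaves)
print(combo_features.shape)
```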

 

1.5 What are the common text representation models? What are their respective strengths and weaknesses?

Text representation models

Bag-of-Words model

TF-IDF

Topic Model

Word Embedding model

 

Bag-of-Words model and N-gram model

Each document is viewed as a bag of words, ignoring the order in which the words appear.

Weight calculation formula: TF-IDF(t, d) = TF(t, d) × IDF(t)

IDF: if a word appears in a very large number of documents, it is probably a fairly generic term that contributes little to distinguishing the particular semantics of any one document, so its weight is penalized: IDF(t) = log( N / (number of documents containing t + 1) ), where N is the total number of documents.
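A brief scikit-learn sketch (the three-sentence corpus is invented); TfidfVectorizer combines term frequency with the IDF down-weighting described above (get_feature_names_out assumes scikit-learn 1.x):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs can be friends",
]

# Bag-of-words with TF-IDF weights; words common to many documents get low IDF
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)          # sparse (n_documents, n_terms) matrix

print(vectorizer.get_feature_names_out())
print(X.toarray().round(2))
```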

 

 

 

N-gram

Splitting a document only at the single-word level is not ideal, so it can also be split into phrases (groups of consecutive words). Words are usually also processed with word stemming, i.e. different inflected forms of a word are unified into the same stem.
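A small sketch of both ideas, using NLTK's Porter stemmer for stemming and CountVectorizer's ngram_range option for word n-grams (the sample sentences are made up):

```python
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

# Word stemming: different inflected forms are reduced to the same stem
stemmer = PorterStemmer()
print([stemmer.stem(w) for w in ["runs", "running", "ran"]])   # 'runs'/'running' -> 'run'

# N-grams: count contiguous word pairs (bigrams) as well as single words
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(["the quick brown fox", "the quick red fox"])
print(vectorizer.get_feature_names_out())
```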

 

Topic model

Discovers representative topics from a text corpus (obtaining the distribution of words within each topic) and can compute the topic distribution of each document.

 

Word embeddings and deep learning models

Word embedding is the general term for models that turn words into vectors. The core idea is to map each word to a dense vector (Dense Vector) in a low-dimensional space (usually K = 50 to 300 dimensions); each dimension of the K-dimensional space can be seen as a latent topic, though not as directly interpretable as the topics in a topic model.

 

Word2Vec

Two architectures: CBOW (predict the current word from its surrounding context) and Skip-gram (predict the surrounding context from the current word).
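A minimal gensim (4.x API) sketch with toy tokenized sentences; sg=0 selects CBOW and sg=1 selects Skip-gram:

```python
from gensim.models import Word2Vec

sentences = [
    ["machine", "learning", "is", "fun"],
    ["deep", "learning", "is", "a", "subset", "of", "machine", "learning"],
]

# CBOW (sg=0) predicts the centre word from its context;
# Skip-gram (sg=1) predicts the context from the centre word.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(model.wv["learning"].shape)    # a 50-dimensional dense word vector
```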

 

 

 

 

1.6 Methods for handling insufficient image data

Transfer Learning, GNN, image processing, upsampling techniques, data augmentation

 

1.6 In an image classification task, what problems does insufficient training data cause, and how can they be alleviated?

 

It causes overfitting.

Solutions

1. Model-based methods:

Simplify the model (e.g. reduce a non-linear model to a linear one)

Add constraint terms to shrink the hypothesis space (L1/L2 regularization)

Ensemble learning

Dropout and related hyperparameters, etc.

2. Data-based methods:

Data Augmentation: transform the images directly in image space (random rotation, cropping, flipping, adding noise, colour perturbation, etc.), as in the sketch below.
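A hedged torchvision sketch of image-space augmentation; the particular transforms and parameters are one reasonable choice, not a fixed recipe:

```python
from torchvision import transforms

# Randomly perturb each training image so the model sees more variation
train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(degrees=15),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
    transforms.ToTensor(),
])
# Typically passed as the transform argument of an ImageFolder / Dataset.
```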

 

 

 

 

Besides transforming directly in the image space, one can also first extract features from the images and then apply transformations in the feature space, using generic data augmentation or upsampling techniques, for example SMOTE.
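A brief SMOTE sketch with the imbalanced-learn (imblearn) package, run on synthetic feature vectors standing in for extracted image features:

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Imbalanced toy data in "feature space"
X, y = make_classification(n_samples=300, n_features=20,
                           weights=[0.9, 0.1], random_state=0)
print("before:", Counter(y))

# SMOTE synthesizes new minority-class samples by interpolating between neighbours
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("after: ", Counter(y_res))
```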

 

Fine-tuning via transfer learning (fine-tune): start from a model pretrained on a large dataset and adapt it to the small target dataset.
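A minimal PyTorch/torchvision fine-tuning sketch (torchvision >= 0.13 weight names; the 10-class head is an arbitrary assumption): load an ImageNet-pretrained backbone, freeze it, and retrain only a new classification head:

```python
import torch.nn as nn
from torchvision import models

# Load a network pretrained on a large dataset (ImageNet)
model = models.resnet18(weights="IMAGENET1K_V1")

# Freeze the pretrained feature extractor
for param in model.parameters():
    param.requires_grad = False

# Replace the final layer with a new head for the small target dataset
model.fc = nn.Linear(model.fc.in_features, 10)   # 10 target classes (assumed)

# Only the new head's parameters will be updated during fine-tuning
trainable = [p for p in model.parameters() if p.requires_grad]
print(sum(p.numel() for p in trainable), "trainable parameters")
```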
