Text Classification

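What Is Text Classification

Text classification is one of the most common families of NLP tasks. The input is usually a piece of text and the output is a predicted class label. Typical tasks include topic classification, sentiment analysis, authorship attribution, and authenticity detection. Many other problems can also be recast as classification problems.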

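Typical Steps

Choose a task of interest
Collect a suitable dataset
Annotate the data
Select features
Choose a machine learning method
Tune hyperparameters on the validation set
Try several algorithms and parameter settings
Train the final model
Evaluate on the test set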
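
The steps above rely on separate training, validation, and test portions of the dataset. A minimal sketch of such a split follows. The function name `split_dataset`, the 80/10/10 proportions, and the toy data are assumptions made for illustration, not something the steps prescribe.

```python
import random

def split_dataset(examples, dev_fraction=0.1, test_fraction=0.1, seed=0):
    """Shuffle labelled examples and split them into train/dev/test lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = int(len(shuffled) * test_fraction)
    n_dev = int(len(shuffled) * dev_fraction)
    test = shuffled[:n_test]
    dev = shuffled[n_test:n_test + n_dev]
    train = shuffled[n_test + n_dev:]
    return train, dev, test

# Toy (text, label) pairs, invented for illustration only.
data = [(f"document {i}", "pos" if i % 2 == 0 else "neg") for i in range(20)]
train, dev, test = split_dataset(data)
print(len(train), len(dev), len(test))
```

The test portion should be touched only once, at the final evaluation step; all tuning happens on the validation portion.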

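Machine Learning Algorithms

This section briefly introduces a few basic machine learning algorithms.

1. Naive Bayes

Assume the features are conditionally independent given the class, then use Bayes' rule to find the most probable class: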

\[P(c_n \mid f_1 \dots f_m) \propto p(c_n) \prod_{i=1}^{m} p(f_i \mid c_n) \]

Pros: fast to "train" and classify; robust, low-variance; good for low-data situations; the optimal classifier if the independence assumption is correct; extremely simple to implement.

Cons: the independence assumption rarely holds; low accuracy compared to similar methods in most situations; smoothing is required for unseen class/feature combinations.
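
A minimal sketch of a Naive Bayes text classifier follows, assuming bag-of-words features and add-one (Laplace) smoothing. The function names and the toy documents are invented for illustration. Scores are computed in log space, so the product of probabilities becomes a sum.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs):
    """docs: list of (tokens, label). Returns class counts, per-class word counts, vocabulary."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def predict_naive_bayes(model, tokens):
    """Return the class maximising log p(c) + sum_i log p(f_i | c)."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for tok in tokens:
            if tok not in vocab:
                continue  # skip words never seen in training
            # add-one smoothing avoids zero probabilities for unseen class/word pairs
            score += math.log((word_counts[label][tok] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy training data, invented for illustration only.
toy_docs = [
    ("great fun film".split(), "pos"),
    ("great acting".split(), "pos"),
    ("boring slow film".split(), "neg"),
    ("slow and dull".split(), "neg"),
]
nb_model = train_naive_bayes(toy_docs)
print(predict_naive_bayes(nb_model, "great film".split()))
```

"Training" here is just counting, which is why the method is so fast.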

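2. Logistic Regression

Logistic regression is a small modification of linear regression: a link function transforms the linear score, in effect straightening a curve, so that the model outputs a probability between 0 and 1.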

\[P(c_n \mid f_1 \dots f_m) = \frac{1}{Z} \exp\left(\sum_{i=0}^{m} w_i f_i\right) \]

Training works much as it does for a regression model: the weights are found by minimising a cost function, and a regularisation term can be added as a penalty.

Pros: unlike Naïve Bayes, not confounded by diverse, correlated features.

Cons: high bias; slow to train; some feature scaling issues; often needs a lot of data to work well; choosing the regularisation is a nuisance but important, since overfitting is a big problem.
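
The probability formula for logistic regression can be sketched as below. This assumes one weight vector per class, with Z normalising over the classes and f_0 = 1 acting as a bias feature. These are assumptions, since the formula does not spell them out, and the weights are invented rather than learned.

```python
import math

def class_probabilities(weights, features):
    """weights: {class: [w_0..w_m]}; features: [f_0..f_m] with f_0 = 1 as the bias."""
    scores = {c: sum(w * f for w, f in zip(ws, features)) for c, ws in weights.items()}
    max_score = max(scores.values())  # subtracting the max keeps exp() numerically stable
    exps = {c: math.exp(s - max_score) for c, s in scores.items()}
    z = sum(exps.values())  # the normaliser Z
    return {c: e / z for c, e in exps.items()}

# Hypothetical weights for two classes and two features plus a bias.
toy_weights = {"pos": [0.0, 1.5, -1.0], "neg": [0.0, -1.5, 1.0]}
toy_features = [1.0, 2.0, 0.0]
lr_probs = class_probabilities(toy_weights, toy_features)
print(lr_probs)
```

In practice the weights would come from minimising a regularised cost function on the training set instead of being written by hand.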

3. Support Vector Machines (SVM)

Main idea: find a hyperplane that separates the training data, then use it to classify test instances. Details are omitted here.

Pros: fast and accurate linear classifier; can handle non-linearity with the kernel trick; works well with huge feature sets.

Cons: multiclass classification is awkward; feature scaling can be tricky; deals poorly with class imbalances; uninterpretable.

4. K-Nearest Neighbour (KNN)

Main idea: compute the distance between a new instance and the labelled instances (for example, Euclidean or cosine distance), then assign the label that is most common among the k nearest neighbours.

Pros: simple, effective; no training required; inherently multiclass; optimal with infinite data.

Cons: k has to be selected; issues with unbalanced classes; often slow (the k neighbours have to be found); features must be selected carefully.
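
A minimal KNN sketch with both distance measures follows, assuming instances are already represented as numeric feature vectors. The function names and toy vectors are invented for illustration.

```python
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0  # treat zero vectors as maximally distant

def knn_predict(train, query, k=3, distance=euclidean):
    """train: list of (vector, label). Majority vote over the k nearest neighbours."""
    neighbours = sorted(train, key=lambda item: distance(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Toy two-dimensional feature vectors, invented for illustration only.
knn_train = [
    ([1.0, 0.0], "sports"),
    ([0.9, 0.1], "sports"),
    ([0.0, 1.0], "politics"),
    ([0.1, 0.9], "politics"),
    ([0.2, 0.8], "politics"),
]
print(knn_predict(knn_train, [0.8, 0.2], k=3))
print(knn_predict(knn_train, [0.8, 0.2], k=3, distance=cosine_distance))
```

There is no training step: all the work happens at prediction time, when every stored instance is compared with the query. That is why KNN is often slow.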

5. Decision Tree

Main idea: build a tree from the feature information; each leaf node corresponds to a class.

Pros: in theory, very interpretable; fast to build and test; feature representation/scaling irrelevant; good for small feature sets; handles non-linearly-separable problems.

Cons: in practice, often not that interpretable; highly redundant sub-trees; not competitive for large feature sets.
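
How a finished tree classifies an instance can be sketched as follows. The tree is hand-written as nested dictionaries instead of being learned from data, and the feature names are hypothetical.

```python
def tree_predict(node, features):
    """Walk a tree of nested dicts until a leaf (a plain label string) is reached."""
    while isinstance(node, dict):
        # Each internal node tests one boolean feature; missing features count as False.
        branch = "yes" if features.get(node["feature"], False) else "no"
        node = node[branch]
    return node

# Hand-written toy tree; a real tree would be induced from training data.
toy_tree = {
    "feature": "contains_goal",
    "yes": "sports",
    "no": {
        "feature": "contains_election",
        "yes": "politics",
        "no": "other",
    },
}
print(tree_predict(toy_tree, {"contains_election": True}))
```

Each prediction corresponds to one root-to-leaf path, which is what makes small trees easy to read.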

6. Random Forest

Main idea: a forest consists of multiple decision trees, and the final label is chosen by a vote among them.

Pros: usually more accurate and more robust than decision trees; a great classifier for small- to moderate-sized feature sets; training is easily parallelised.

Cons: same negatives as decision trees: too slow with large feature sets.
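
The voting step can be sketched as below. Only the vote is shown. The three "trees" are hand-written stand-ins with hypothetical feature names, whereas a real random forest trains each tree on a random sample of the data and features.

```python
from collections import Counter

def forest_predict(trees, features):
    """Each tree is a callable returning a label; the forest returns the majority label."""
    votes = Counter(tree(features) for tree in trees)
    return votes.most_common(1)[0][0]

# Stand-in trees written by hand for illustration only.
toy_trees = [
    lambda f: "sports" if f.get("contains_goal") else "other",
    lambda f: "sports" if f.get("contains_team") else "politics",
    lambda f: "politics" if f.get("contains_election") else "sports",
]
print(forest_predict(toy_trees, {"contains_goal": True}))
```

Because each tree votes independently, the trees can be trained and evaluated in parallel.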

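7. Neural Network

Main idea: layers of nodes are connected to one another. Each node computes a weighted sum of the outputs of the previous layer, applies a non-linear activation, and passes the result on to the next layer. Details are omitted here. Each individual node is essentially a linear model followed by a non-linearity.

Pros: extremely powerful; state-of-the-art accuracy on many tasks in natural language processing and vision.

Cons: not an off-the-shelf classifier; very difficult to choose good parameters; slow to train; prone to overfitting.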

Tuning

After fitting a model on the training set, its hyperparameters can be tuned on the validation set. Common techniques are k-fold cross-validation and grid search.
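
The two techniques can be combined as in the sketch below. It assumes a toy one-dimensional nearest-neighbour classifier as the model being tuned. All function names, the data, and the grid of candidate values for `k` are invented for illustration.

```python
from itertools import product

def k_fold_indices(n_examples, n_folds):
    """Yield (train_idx, val_idx) pairs; each example lands in the validation fold once."""
    folds = [list(range(i, n_examples, n_folds)) for i in range(n_folds)]
    for i, val_idx in enumerate(folds):
        train_idx = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train_idx, val_idx

def grid_search(param_grid, score_fn):
    """Score every combination in param_grid and return the best (params, score)."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def knn_1d(train, x, k):
    """Toy model: majority label (0 or 1) among the k training points closest to x."""
    nearest = sorted(train, key=lambda item: abs(item[0] - x))[:k]
    positives = sum(label for _, label in nearest)
    return 1 if 2 * positives > len(nearest) else 0

# Toy data: the label is 1 when x >= 5. Invented for illustration only.
toy_data = [(x, int(x >= 5)) for x in range(10)]

def cv_accuracy(params, data=toy_data, n_folds=5):
    """Cross-validated accuracy of knn_1d for one hyperparameter setting."""
    correct = 0
    for train_idx, val_idx in k_fold_indices(len(data), n_folds):
        train = [data[j] for j in train_idx]
        for j in val_idx:
            x, label = data[j]
            correct += knn_1d(train, x, params["k"]) == label
    return correct / len(data)

best_params, best_score = grid_search({"k": [1, 3, 5]}, cv_accuracy)
print(best_params, best_score)
```

Grid search proposes candidate settings, and cross-validation scores each one without touching the test set.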

Evaluation

Common evaluation metrics:

Accuracy = correct predictions / total predictions
Precision = tp / (tp + fp)
Recall = tp / (tp + fn)
F1-score = 2 * precision * recall / (precision + recall)
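
The four metrics can be computed from gold and predicted labels as in the sketch below. The function name, the choice of "pos" as the positive class, and the toy labels are assumptions for illustration. Returning 0.0 when a denominator is zero is one common convention, not the only one.

```python
def binary_metrics(gold, predicted, positive="pos"):
    """Accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    accuracy = sum(g == p for g, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Toy labels, invented for illustration only.
gold_labels = ["pos", "pos", "pos", "neg", "neg", "neg"]
pred_labels = ["pos", "pos", "neg", "pos", "neg", "neg"]
metrics = binary_metrics(gold_labels, pred_labels)
print(metrics)
```

In this toy case tp = 2, fp = 1 and fn = 1, so precision, recall and F1 all equal 2/3, as does accuracy (4 of 6 correct).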

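There are also the macro F-score and the micro F-score for multi-class problems: the macro F-score averages the per-class F-scores, while the micro F-score is computed from counts pooled over all classes.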
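
The difference between the two averages can be sketched as follows, assuming single-label multi-class data. The function names and toy labels are invented for illustration. The identity F1 = 2tp / (2tp + fp + fn) is used, which is equivalent to the harmonic mean of precision and recall.

```python
def f1_from_counts(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_micro_f1(gold, predicted):
    """Return (macro F1, micro F1) over all labels seen in gold or predicted."""
    pairs = list(zip(gold, predicted))
    labels = sorted(set(gold) | set(predicted))
    counts = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        counts.append((tp, fp, fn))
    # Macro: average the per-class F1 scores, so every class weighs the same.
    macro = sum(f1_from_counts(*c) for c in counts) / len(labels)
    # Micro: pool the counts over classes first, so frequent classes dominate.
    micro = f1_from_counts(*(sum(column) for column in zip(*counts)))
    return macro, micro

# Toy labels, invented for illustration only.
multi_gold = ["a", "a", "a", "a", "b", "b", "c"]
multi_pred = ["a", "a", "a", "b", "b", "c", "c"]
macro_f1, micro_f1 = macro_micro_f1(multi_gold, multi_pred)
print(round(macro_f1, 3), round(micro_f1, 3))
```

On imbalanced data the two can differ noticeably, because the macro average gives rare classes as much influence as frequent ones.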