Task: text detection (including tilted text)

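Contributions
- Proposes an end-to-end fully convolutional network for text detection
- Can produce geometry annotations in either of two formats, quadrangles or rotated boxes, depending on the application
- Improves on the state-of-the-art methods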
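Core idea: the design borrows mainly from U-Net. A U-shaped network produces (1) a pixel-level segmentation prediction and (2) a pixel-level geometry prediction. From (1) and (2), the coordinates of the four corners of each bounding box can be computed. NMS then removes redundant, duplicate bounding boxes.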
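The decoding step described above can be sketched as follows. This is a minimal illustration rather than the paper's implementation: it ignores the angle and treats boxes as axis-aligned, it uses plain greedy NMS, and the names `decode_boxes`, `nms`, `score_thresh`, `iou_thresh` and `stride`, as well as the channel order (top, right, bottom, left), are assumptions made for this example.

```python
import numpy as np

def decode_boxes(score, geo, score_thresh=0.8, stride=4):
    """Turn per-pixel predictions into axis-aligned boxes (angle ignored).

    score: (H, W) text probability per pixel.
    geo:   (H, W, 4) distances to the top, right, bottom and left box edges,
           assumed to be measured in input-image pixels.
    Returns an (N, 5) array of [x1, y1, x2, y2, score].
    """
    ys, xs = np.where(score > score_thresh)
    # Map score-map coordinates back to input-image coordinates.
    px, py = xs * stride, ys * stride
    d = geo[ys, xs]
    boxes = np.stack(
        [px - d[:, 3], py - d[:, 0], px + d[:, 1], py + d[:, 2], score[ys, xs]],
        axis=1,
    )
    return boxes.astype(np.float64)

def nms(boxes, iou_thresh=0.5):
    """Greedy NMS on [x1, y1, x2, y2, score] rows; returns kept row indices."""
    order = np.argsort(-boxes[:, 4])  # highest score first
    keep = []
    while order.size > 0:
        i, rest = order[0], order[1:]
        keep.append(int(i))
        w = np.maximum(0.0, np.minimum(boxes[i, 2], boxes[rest, 2])
                       - np.maximum(boxes[i, 0], boxes[rest, 0]))
        h = np.maximum(0.0, np.minimum(boxes[i, 3], boxes[rest, 3])
                       - np.maximum(boxes[i, 1], boxes[rest, 1]))
        inter = w * h
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter + 1e-9)
        # Drop every remaining box that overlaps the kept one too much.
        order = rest[iou <= iou_thresh]
    return keep
```

Every pixel above the threshold votes for one box, so a single text line yields many near-identical boxes; NMS keeps only the highest-scoring one of each overlapping group.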
Pipeline:
- Training phase
- Testing phase
- The ResNet-based U-Net architecture is shown in the figure below
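Algorithm details:
- How is the ground truth computed?
  - Ground truth for score: the original bounding box is shrunk inward by a distance of 0.3r, where r is the length of its short side. I don't fully understand why this step is needed; is it to remove noise?
  - Ground truth for geometry: taking RBOX data as the example, as shown in the figure below. For every point inside the bounding box, we compute its distances to the top, bottom, left and right edges, and also compute the angle. For points outside the bounding box, the ground truth is set to 0.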
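To make the two ground-truth maps concrete, here is a minimal NumPy sketch for the simplest case, a single axis-aligned box, so the angle channel is all zeros. The function name `rbox_ground_truth`, the `(x1, y1, x2, y2)` box format and the channel order (top, right, bottom, left, angle) are assumptions made for this example. Following the description above, the score map uses the shrunk box, while geometry is filled for every pixel inside the original box.

```python
import numpy as np

def rbox_ground_truth(h, w, box, shrink_ratio=0.3):
    """Score and geometry ground truth for one axis-aligned box.

    box = (x1, y1, x2, y2) in pixel coordinates of an (h, w) map.
    Returns score (h, w) and geo (h, w, 5).
    """
    x1, y1, x2, y2 = box
    r = min(x2 - x1, y2 - y1)   # short side length
    m = shrink_ratio * r        # shrink margin applied on every side
    ys, xs = np.mgrid[0:h, 0:w]

    # Score: 1 inside the shrunk box, 0 elsewhere.
    score = ((xs >= x1 + m) & (xs <= x2 - m) &
             (ys >= y1 + m) & (ys <= y2 - m)).astype(np.float32)

    # Geometry: distances to the original edges plus the angle (0 here),
    # zeroed for pixels outside the original box.
    inside = ((xs >= x1) & (xs <= x2) & (ys >= y1) & (ys <= y2)).astype(np.float32)
    geo = np.stack([ys - y1, x2 - xs, y2 - ys, xs - x1, np.zeros_like(xs)],
                   axis=-1).astype(np.float32)
    geo *= inside[..., None]
    return score, geo
```

A rotated box would need the distances measured to the rotated edges and a non-zero angle channel, which this sketch does not cover.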
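- Loss function?
  - For score: we use balanced cross-entropy, which offsets the imbalance between positive and negative samples. Its definition is given below, followed by an implementation: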
    
    $L_s = \text{balanced-xent}(\hat{Y}, Y^*) = -\beta Y^* \log \hat{Y} - (1 - \beta)(1 - Y^*) \log(1 - \hat{Y})$

    $\beta = 1 - \frac{\sum_{y^* \in Y^*} y^*}{|Y^*|}$
```
import tensorflow as tf  # TensorFlow 1.x API (tf.log, tf.app.flags)

FLAGS = tf.app.flags.FLAGS  # batch_size_per_gpu is assumed to be defined elsewhere


def cross_entropy(y_true_cls, y_pred_cls, training_mask):
    '''
    Class-balanced cross-entropy on the score map.
    :param y_true_cls: tensor, ground-truth score map
    :param y_pred_cls: tensor, predicted score map with values in [0, 1]
    :param training_mask: tensor, 1 for pixels that contribute to the loss, 0 for ignored pixels
    :return: scalar tensor
    '''
    eps = 1e-10
    # Zero out ignored pixels; they then contribute (almost) nothing to the loss.
    y_true_cls = y_true_cls * training_mask
    y_pred_cls = y_pred_cls * training_mask
    each_y_true_sample = tf.split(y_true_cls, num_or_size_splits=FLAGS.batch_size_per_gpu, axis=0)
    each_y_pred_sample = tf.split(y_pred_cls, num_or_size_splits=FLAGS.batch_size_per_gpu, axis=0)
    losses = []
    for cur_true, cur_pred in zip(each_y_true_sample, each_y_pred_sample):
        # beta = 1 - (fraction of positive pixels), computed per sample over
        # the actual number of score-map pixels (|Y*| in the formula).
        num_pixels = tf.cast(tf.size(cur_true), tf.float32)
        beta = 1 - tf.reduce_sum(cur_true) / num_pixels
        # eps inside each log keeps both terms finite when the prediction is exactly 0 or 1.
        cur_loss = (-beta * cur_true * tf.log(cur_pred + eps)
                    - (1 - beta) * (1 - cur_true) * tf.log(1 - cur_pred + eps))
        losses.append(cur_loss)
    # Sum over the samples in the batch, then average over pixels.
    return tf.reduce_mean(tf.add_n(losses))
```
  - For geometry: we compute an IoU loss, defined as follows:
    
    $L_{AABB} = -\log IoU(\hat{R}, R^*) = -\log\frac{|\hat{R} \cap R^*|}{|\hat{R} \cup R^*|}$

    $w_i = \min(\hat{d_2}, d^*_2) + \min(\hat{d_4}, d^*_4)$

    $h_i = \min(\hat{d_3}, d^*_3) + \min(\hat{d_1}, d^*_1)$

    $\hat{R} = (\hat{d_1} + \hat{d_3}) * (\hat{d_2} + \hat{d_4})$

    $R^* = (d^*_1 + d^*_3) * (d^*_2 + d^*_4)$

    $|\hat{R} \cap R^*| = w_i * h_i$

    $|\hat{R} \cup R^*| = \hat{R} + R^* - |\hat{R} \cap R^*|$
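The formulas above translate almost line by line into NumPy. The sketch below is only an illustration: the function name `aabb_iou_loss`, the `(d1, d2, d3, d4)` layout along the last axis and the small `eps` used to avoid `log(0)` are assumptions, not part of the definition.

```python
import numpy as np

def aabb_iou_loss(d_pred, d_true, eps=1e-9):
    """-log IoU between predicted and ground-truth boxes.

    d_pred, d_true: (..., 4) arrays holding (d1, d2, d3, d4) for each pixel.
    Returns an array of per-pixel losses with shape (...).
    """
    d1p, d2p, d3p, d4p = np.moveaxis(np.asarray(d_pred, dtype=np.float64), -1, 0)
    d1t, d2t, d3t, d4t = np.moveaxis(np.asarray(d_true, dtype=np.float64), -1, 0)

    area_pred = (d1p + d3p) * (d2p + d4p)                    # \hat{R}
    area_true = (d1t + d3t) * (d2t + d4t)                    # R^*
    w_i = np.minimum(d2p, d2t) + np.minimum(d4p, d4t)        # intersection width
    h_i = np.minimum(d3p, d3t) + np.minimum(d1p, d1t)        # intersection height
    inter = w_i * h_i
    union = area_pred + area_true - inter
    return -np.log((inter + eps) / (union + eps))
```

For example, a ground truth of `(2, 3, 2, 3)` and a prediction of `(1, 3, 1, 3)` give an intersection of 12 and a union of 24, so the IoU is 0.5 and the loss is `log 2`. Because both boxes are defined around the same pixel, the intersection can be computed from the element-wise minima without any coordinate arithmetic.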
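Open questions
- In the actual implementation, the mask is not upscaled to the original image size but only to 1/4 of it. Reportedly this helps detect small text; I don't know the reasoning behind it.
- In the actual implementation, the loss on score is not the balanced cross-entropy loss but a dice loss.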
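For reference, a dice loss on the score map can be sketched as below. This is the standard dice formulation written in NumPy, not a copy of any particular implementation; the function name, the argument names and the `eps` value are assumptions for this example.

```python
import numpy as np

def dice_loss(y_true, y_pred, training_mask, eps=1e-5):
    """1 - dice coefficient, counting only pixels where training_mask is 1."""
    inter = np.sum(y_true * y_pred * training_mask)
    total = np.sum(y_true * training_mask) + np.sum(y_pred * training_mask) + eps
    return 1.0 - 2.0 * inter / total
```

Dice loss is a ratio of overlap to total mass, so it needs no explicit class weight such as β to cope with the scarcity of positive pixels.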

[Paper Reading] EAST: An Efficient and Accurate Scene Text Detector
