A Brief Summary of Papers on Arbitrary-Orientation Scene Text Detection

Arbitrary-orientation scene text detection

Summary of the papers' key ideas
Characteristic: the innovations that add a new branch stand out the most
Scene text detection


Segmentation-based detection methods


spcnet (mask_rcnn + tcm + rescore)
psenet (progressive scale expansion)
mask text spotter (adds a new segmentation branch)
craft
incepText

Regression-based detection methods:


r2cnn (classification branch, horizontal branch, inclined branch)
rrpn (rotated rpn)
textboxes (ssd)
textboxes++
sstd (the precursor that tcm improved on)
rtn
ctpn (fine-scale proposals)

Hybrid methods combining segmentation and regression:


spcnet
uses mask_rcnn for instance segmentation; accuracy is improved by the new tcm module (which produces a global semantic segmentation map) together with rescore: each instance mask is projected onto the global semantic map and re-scored (see the sketch after this list)
pixel-anchor (deeplabv3 + ssd):
the segmentation part detects medium and large targets, while ssd detects the small ones
east (deeplabv3)
af-rpn
every sliding point inside the text core region directly predicts the offsets from itself to the vertices of the text box
(uses ohem)
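To make the rescore step above concrete, here is a minimal sketch, assuming the fused score is a simple product of the classifier confidence and the mean semantic response inside the instance mask; the paper's exact formula may differ, and all names here are illustrative.

```python
import numpy as np

def rescore(cls_score, instance_mask, semantic_map):
    """Fuse an instance's classification score with the global
    semantic segmentation map (a simple-product assumption).

    cls_score:     classifier confidence for one instance.
    instance_mask: (H, W) boolean mask of the instance.
    semantic_map:  (H, W) global text-probability map.
    """
    if not instance_mask.any():
        return 0.0
    seg_score = semantic_map[instance_mask].mean()
    return cls_score * seg_score
```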


In the official FPN setup, the heads on the different pyramid levels share parameters during training; this barely affects results, the stated reason being that the feature pyramid makes the different levels learn semantic features of the same level of abstraction.
After FPN obtains proposals from all pyramid levels, they are pooled together and a single NMS is applied (see the sketch below).
Every pyramid level uses the same anchor scale, because the different feature maps then naturally correspond to objects of different sizes.
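As a concrete picture of the second point, here is a minimal numpy sketch, assuming hypothetical `level_boxes` / `level_scores` inputs (one array per pyramid level): proposals from all levels are concatenated and passed through one shared NMS rather than per-level NMS.

```python
import numpy as np

def nms(boxes, scores, iou_thr=0.7):
    """Plain NMS over axis-aligned boxes given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]          # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_thr]
    return keep

def gather_fpn_proposals(level_boxes, level_scores, iou_thr=0.7):
    """Concatenate proposals from every pyramid level, then run one NMS."""
    boxes = np.concatenate(level_boxes, axis=0)
    scores = np.concatenate(level_scores, axis=0)
    keep = nms(boxes, scores, iou_thr)
    return boxes[keep], scores[keep]
```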


*********************** The text in parentheses after each paper name marks its highlight ********************


hybrid:---------------------------------------------------------------
1.af-rpn(af)
anchor-free
directly predicts the offsets from a centre point to the four box vertices,
avoiding the situation where, "to achieve high recall, anchors of various scales and shapes should be designed to cover the scale and shape variabilities of objects"
scale-friendly
an FPN detects small, medium, and large targets separately (implementation details differ from the original FPN); a decoding sketch follows
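A rough decoding sketch for the anchor-free prediction above (not the paper's code): each positive location, i.e. a sliding point inside the text core region, predicts 8 values, the (dx, dy) offsets to the four quad vertices. `score_map`, `offsets`, `stride`, and the threshold are all assumptions.

```python
import numpy as np

def decode_quads(score_map, offsets, stride=4, score_thr=0.9):
    """
    score_map: (H, W) text/non-text probability per sliding point.
    offsets:   (H, W, 8) predicted (dx, dy) to each of the 4 vertices.
    Returns quads of shape (N, 4, 2) in image coordinates.
    """
    ys, xs = np.where(score_map > score_thr)
    # Centre of each positive location, mapped back to image coordinates.
    cx = (xs + 0.5) * stride
    cy = (ys + 0.5) * stride
    centers = np.stack([cx, cy], axis=1)[:, None, :]   # (N, 1, 2)
    deltas = offsets[ys, xs].reshape(-1, 4, 2)         # (N, 4, 2)
    return centers + deltas
```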


2.inceptext(inceptext)
overall it is fpn + inception_module + deformable_conv + deformable PSROI pooling
inception-text
designs three inception-like convolution branches (1*1, 3*3, 5*5) to detect targets at small, medium, and large scales,
and also adds deformable convolution to adjust the receptive field, focusing detection on the text itself so it is less constrained by orientation; two fused feature maps further add multi-scale information.
deformable psroi pooling
(focuses detection on the text, less constrained by orientation)
adds offsets so the pooling concentrates on the text region and tends to learn the context surrounding the text
Each image is randomly cropped and scaled to have a short edge in {640, 800, 960, 1120}.
The anchor scales are {2, 4, 8, 16}, and the ratios are {0.2, 0.5, 2, 5}. A module sketch follows.
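A simplified PyTorch sketch of an inception-text-style block matching the notes above: parallel 1*1 / 3*3 / 5*5 branches for small / medium / large text, concatenated and fused back. The deformable convolution the paper puts inside the branches is omitted here for brevity; channel counts are illustrative.

```python
import torch
import torch.nn as nn

class InceptionTextBlock(nn.Module):
    """Multi-branch block covering three receptive-field sizes."""
    def __init__(self, in_ch, branch_ch=64):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, branch_ch, kernel_size=1)
        self.b3 = nn.Sequential(
            nn.Conv2d(in_ch, branch_ch, kernel_size=1),
            nn.Conv2d(branch_ch, branch_ch, kernel_size=3, padding=1),
        )
        self.b5 = nn.Sequential(
            nn.Conv2d(in_ch, branch_ch, kernel_size=1),
            nn.Conv2d(branch_ch, branch_ch, kernel_size=5, padding=2),
        )
        self.fuse = nn.Conv2d(3 * branch_ch, in_ch, kernel_size=1)

    def forward(self, x):
        # Concatenate the three scale branches, then fuse channels.
        out = torch.cat([self.b1(x), self.b3(x), self.b5(x)], dim=1)
        return self.fuse(out)
```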

3.rtn (no standout highlight)
multi-scale features, plus ctpn vertical boxes, plus regression-only prediction
hierarchical convolutional
obtains stronger semantic features by fusing resnet stages 4 and 5
vertical proposal mechanism
uses ctpn to obtain vertical boxes, in order to drop the classification of proposals


4.fots (an improvement on east)
simultaneous detection and recognition, sharing computation and visual information
contributions:
(1) end-to-end trainable by sharing convolutional features; detects and recognizes simultaneously
(2) RoIRotate, which extracts the oriented text regions from the convolutional feature maps
loss = pixel-wise classification loss + IoU loss + angle loss (see the sketch below)
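A hedged PyTorch sketch of that loss line: per-pixel classification on the score map, an IoU loss on the predicted distances to the four box sides, and an angle loss of the form 1 - cos(Δθ). Tensor layouts and loss weights are assumptions, not the paper's exact values.

```python
import torch
import torch.nn.functional as F

def detection_loss(score_pred, score_gt, geo_pred, geo_gt,
                   theta_pred, theta_gt, lambda_iou=1.0, lambda_theta=10.0):
    """score_*: (N,1,H,W) probabilities; geo_*: (N,4,H,W) distances to
    the top/right/bottom/left box sides; theta_*: (N,1,H,W) angles."""
    cls_loss = F.binary_cross_entropy(score_pred, score_gt)

    mask = score_gt > 0.5                    # only text pixels contribute
    if not mask.any():
        return cls_loss
    d_pred = geo_pred.permute(0, 2, 3, 1)[mask.squeeze(1)]   # (K, 4)
    d_gt = geo_gt.permute(0, 2, 3, 1)[mask.squeeze(1)]       # (K, 4)
    area_p = (d_pred[:, 0] + d_pred[:, 2]) * (d_pred[:, 1] + d_pred[:, 3])
    area_g = (d_gt[:, 0] + d_gt[:, 2]) * (d_gt[:, 1] + d_gt[:, 3])
    h_i = torch.min(d_pred[:, 0], d_gt[:, 0]) + torch.min(d_pred[:, 2], d_gt[:, 2])
    w_i = torch.min(d_pred[:, 1], d_gt[:, 1]) + torch.min(d_pred[:, 3], d_gt[:, 3])
    inter = w_i * h_i
    iou_loss = -torch.log((inter + 1.0) / (area_p + area_g - inter + 1.0)).mean()

    angle_loss = (1.0 - torch.cos(theta_pred[mask] - theta_gt[mask])).mean()
    return cls_loss + lambda_iou * iou_loss + lambda_theta * angle_loss
```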

5.pixel-anchor
combines FPN and ASPP as an encoder-decoder structure for the segmentation part
adaptive SSD (adds an adaptive predictor layer, APL) at the anchor level (sharing features with the segmentation part)
to better handle large variance in size and aspect ratio (proposes long anchors and an anchor-density scheme)
the segmentation heat map from the pixel module is fed into the anchor module, forming an attention mechanism (sketch below)
all the boxes from the pixel level and the anchor level are gathered and a cascaded NMS is applied
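A minimal sketch of that attention coupling, assuming the heat map acts as a soft spatial gate (the exact form in pixel-anchor may differ): the pixel-level text probability map is resized to the anchor branch's resolution and used to re-weight its features.

```python
import torch.nn.functional as F

def attend_anchor_features(anchor_feat, seg_heatmap):
    """
    anchor_feat: (N, C, H, W) features entering the anchor (SSD) branch.
    seg_heatmap: (N, 1, H', W') text probability map from the pixel branch.
    """
    attn = F.interpolate(seg_heatmap, size=anchor_feat.shape[-2:],
                         mode='bilinear', align_corners=False)
    # Residual gating: keep the original features, emphasise likely text.
    return anchor_feat * (1.0 + attn)
```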


regression:---------------------------------------------------------------
1.ctpn
detecting text in fine-scale proposals
generates vertical proposals
recurrent connectionist text proposals
connects the vertical proposals (a linking sketch follows this entry)
side-refinement
adjusts the predicted text-line boundaries using the anchors at the left and right edges
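A simplified sketch of that linking step: two fixed-width vertical proposals are chained when they are close horizontally and their vertical extents overlap enough. ctpn does this with a learned recurrent connection; here it is a plain greedy rule with illustrative thresholds.

```python
import numpy as np

def vertical_overlap(a, b):
    """a, b: boxes as (x1, y1, x2, y2); overlap ratio of vertical extents."""
    inter = min(a[3], b[3]) - max(a[1], b[1])
    union = max(a[3], b[3]) - min(a[1], b[1])
    return max(inter, 0.0) / union

def link_proposals(boxes, max_gap=50, min_v_overlap=0.7):
    """Greedily chain vertical proposals left-to-right into text lines."""
    order = np.argsort([b[0] for b in boxes])
    lines, used = [], set()
    for i in order:
        if i in used:
            continue
        line = [i]
        used.add(i)
        for j in order:
            if j in used:
                continue
            last = boxes[line[-1]]
            if (0 <= boxes[j][0] - last[2] <= max_gap
                    and vertical_overlap(last, boxes[j]) >= min_v_overlap):
                line.append(j)
                used.add(j)
        lines.append([boxes[k] for k in line])
    return lines
```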
2.textboxes
applies ssd to scene text detection (multi-scale)
3.textboxes++
its random-crop data augmentation is worth borrowing
4.r2cnn (inclined box)
three ROI poolings with different pooled sizes
anchor scales (4, 8, 16, 32)
the axis-aligned box and the inclined box are predicted together, with the former enclosing the latter
inclined NMS
computes convolutional feature maps on an image pyramid (not the main point)
augments ICDAR 2015:
"We rotate our image at the following angles (-90, -75, -60, -45, -30, -15, 0, 15, 30, 45, 60, 75, 90)." (see the sketch below)
r2cnn's ablation experiments are worth borrowing
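A small OpenCV sketch of the quoted rotation augmentation; note that the ground-truth quads would need the same affine transform applied, which is omitted here.

```python
import cv2

ANGLES = [-90, -75, -60, -45, -30, -15, 0, 15, 30, 45, 60, 75, 90]

def rotate_image(img, angle):
    """Rotate an image about its centre, keeping the original size."""
    h, w = img.shape[:2]
    m = cv2.getRotationMatrix2D((w / 2.0, h / 2.0), angle, 1.0)
    return cv2.warpAffine(img, m, (w, h))

def augment(img):
    return [rotate_image(img, a) for a in ANGLES]
```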
5.rrpn
rrpn
r-anchors (54 = 3*3*6) generate inclined proposals (representation: x, y, h, w, θ)
RROI pooling
skew NMS
image rotation strategy during data augmentation (an anchor-generation sketch follows)
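A hedged sketch of r-anchor generation: 3 scales * 3 ratios * 6 orientations = 54 anchors per location, each in the (x, y, h, w, θ) representation from the notes. The concrete scale/ratio/angle values below are assumptions in the spirit of the notes, not guaranteed to match the paper.

```python
import numpy as np

SCALES = [8, 16, 32]
RATIOS = [1 / 8, 1 / 5, 1 / 2]   # h/w ratios, favouring long text boxes
ANGLES = [-np.pi / 6, 0, np.pi / 6, np.pi / 3, np.pi / 2, 2 * np.pi / 3]

def r_anchors(cx, cy, base=16):
    """Return the 54 rotated anchors centred at (cx, cy)."""
    anchors = []
    for s in SCALES:
        for r in RATIOS:
            area = float(base * s) ** 2
            h = np.sqrt(area * r)
            w = area / h
            for t in ANGLES:
                anchors.append((cx, cy, h, w, t))
    return np.array(anchors)      # shape (54, 5)
```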


segmentation ------------------------------------------------------
1.text-attention
trains a CNN with more informative supervised signals:
text region mask, character label, and binary text/non-text information

text region regression is trained using an additional sub-network
that includes two deconvolutional layers
2.sstd(text attention)
text attention module
the attention map indicates rough text regions and is further
encoded into the AIFs.
hierarchical inception module
capture richer context information by using multi-scale receptive fields (an attention-module sketch follows)
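A rough PyTorch sketch of a text attention module in this spirit (an interpretation of the notes, not sstd's exact module): a per-pixel text map is predicted from the features and multiplied back onto them, so later layers focus on rough text regions, and the map itself can receive extra supervision.

```python
import torch
import torch.nn as nn

class TextAttention(nn.Module):
    def __init__(self, in_ch):
        super().__init__()
        self.score = nn.Conv2d(in_ch, 1, kernel_size=1)  # rough text map

    def forward(self, x):
        attn = torch.sigmoid(self.score(x))              # (N, 1, H, W)
        # Return attended features plus the map for auxiliary supervision.
        return x * attn, attn
```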
3.mask text spotter
precise text detection and recognition are achieved via semantic segmentation
(1) end-to-end trainable model for text spotting
(2) handles text of various shapes
(3) via semantic segmentation
(4) SOTA performance in both detection and text spotting
4.east
directly predicts words or text lines of arbitrary orientation, as rotated boxes or quads, over full images
(1) only two stages: FCN (pvanet and u-net) + NMS
(2) flexible geometric shapes
(3) both accuracy and speed (a geometry-decoding sketch follows)
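A compact sketch of decoding an east-style rotated-box geometry: each text pixel predicts its distances to the four sides of a rotated rectangle plus an angle, from which the four corners can be restored. Coordinate conventions here are assumptions, with y pointing down.

```python
import numpy as np

def restore_rbox(x, y, d, theta):
    """
    (x, y): pixel location; d = (top, right, bottom, left) side distances;
    theta:  box rotation angle in radians.
    Returns the 4 corner points as a (4, 2) array.
    """
    dt, dr, db, dl = d
    # Corners relative to the pixel in the box's own (unrotated) frame.
    rel = np.array([[-dl, -dt], [dr, -dt], [dr, db], [-dl, db]], float)
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s], [s, c]])
    return rel @ rot.T + np.array([x, y], float)
```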
5.craft
(not considered for borrowing)

Reposted from www.cnblogs.com/ywheunji/p/12334925.html