FCENet paper summary and code explanation based on PaddleOCR (continuously updated)

Fourier Contour Embedding for Arbitrary-Shaped Text Detection

At the end of the article, recent CVPR papers and code links in the OCR field are summarized.

Contribution

  1. An arbitrary-shaped text detection method is proposed that reaches SOTA on the CTW1500 and Total-Text datasets.
  2. A Fourier Contour Embedding (FCE) method is proposed to fit arbitrarily shaped text contours.
  3. An FCENet network model is built with a backbone, a Feature Pyramid Network (FPN), and post-processing consisting of the Inverse Fourier Transform (IFT) and Non-Maximum Suppression (NMS).

Fourier Contour Embedding

The author defines a complex-valued function $f: \mathbb{R} \mapsto \mathbb{C}$ to represent a closed text contour:

$$f(t) = x(t) + iy(t) \tag{1}$$

where $i$ is the imaginary unit and $(x(t), y(t))$ are the spatial coordinates at time $t$. Since the text contour is closed, $f$ satisfies $f(t) = f(t+1)$, and $f(t)$ can be re-expressed via the inverse Fourier transform (IFT) as:

$$f(t) = f(t, \mathbf{c}) = \sum_{k=-\infty}^{+\infty} c_k e^{2\pi i k t} \tag{2}$$

where $k \in \mathbb{Z}$ denotes the frequency and the $c_k$ are complex Fourier coefficients characterizing the initial state of frequency $k$. Each component $c_k e^{2\pi i k t}$ describes a circular motion at fixed frequency $k$ with initial hand vector $c_k$.

So the contour can be seen as a combination of circular motions of different frequencies, as shown below:

[Figure: a text contour decomposed into circular motions of different frequencies]

The author points out that the low-frequency components capture the overall shape of the text contour, while the high-frequency components encode its fine details. In the experiments, $K = 5$ already yields a satisfactory approximation of the text contour.

Text contour functions usually have no analytical solution. In the actual encoding, the author samples the text contour discretely, discretizing the continuous function $f$ into $N$ points $f(\frac{n}{N})$ with $n \in \{1, \ldots, N\}$. The coefficients $c_k$ can then be computed by the discrete Fourier transform:

$$c_k = \frac{1}{N} \sum_{n=1}^{N} f\!\left(\frac{n}{N}\right) e^{-2\pi i k \frac{n}{N}} \tag{3}$$

where $c_k = u_k + iv_k$, with $u_k$ the real part and $v_k$ the imaginary part of the complex number. When $k = 0$, $c_0 = u_0 + iv_0 = \frac{1}{N}\sum_n f(\frac{n}{N})$ is the center of the contour. For any text contour $f$, the Fourier Contour Embedding (FCE) method proposed in the article can represent the contour in the Fourier domain as a compact $2(2K+1)$-dimensional vector $[u_{-K}, v_{-K}, \ldots, u_0, v_0, \ldots, u_K, v_K]$, which is defined as the Fourier signature vector.

The FCE method consists of two stages.

  1. resampling stage

    • A fixed number of $N$ points is sampled equidistantly on the text contour ($N = 400$ in the experiments), giving the resampled point sequence $\{f(\frac{1}{N}), \ldots, f(1)\}$.

    • This resampling is necessary because different datasets provide different, and relatively small, numbers of ground-truth points per text instance.

    • The resampling strategy enables FCE to be compatible with all datasets with the same settings.

    • Different resampling sequences result in different Fourier signature vectors, so in the resampling stage the starting point, sampling direction, and moving speed of $f(t)$ are constrained:

      • Starting point: the starting point $f(0)$ (equal to $f(1)$) is set as the intersection of the text contour and the horizontal line through the center point $(u_0, v_0)$.
      • Sampling direction: resample clockwise along the contour.
      • Uniform speed: points are sampled at equal intervals, so the distance between adjacent points is constant, ensuring a uniform sampling speed.
  2. Fourier transform stage

    • In the Fourier transform stage, the sequence of resampled points is transformed into its corresponding Fourier feature vector. (A minimal sketch of both stages follows.)
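Putting the two stages together, here is a minimal NumPy sketch of the FCE method under the settings above (N = 400, K = 5). The function names are illustrative, not from the paper's code:

import numpy as np

def fourier_signature(points, K=5):
    """points: (N, 2) array of contour points resampled equidistantly,
    clockwise, starting from the constrained starting point.
    Returns the 2(2K+1)-dim vector [u_-K, v_-K, ..., u_0, v_0, ..., u_K, v_K]."""
    N = len(points)
    f = points[:, 0] + 1j * points[:, 1]          # f(n/N) = x + iy
    n = np.arange(1, N + 1)
    sig = []
    for k in range(-K, K + 1):
        c_k = np.sum(f * np.exp(-2j * np.pi * k * n / N)) / N  # Eq. (3)
        sig.extend([c_k.real, c_k.imag])          # (u_k, v_k)
    return np.array(sig)

def reconstruct_contour(sig, K=5, num_points=50):
    """Approximate the contour from a signature vector via the IFT (Eq. 2)."""
    t = np.linspace(0, 1, num_points, endpoint=False)
    f = np.zeros(num_points, dtype=complex)
    for i, k in enumerate(range(-K, K + 1)):
        c_k = sig[2 * i] + 1j * sig[2 * i + 1]
        f += c_k * np.exp(2j * np.pi * k * t)
    return np.stack([f.real, f.imag], axis=1)     # (num_points, 2)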

FCENet

FCENet uses ResNet50 with DCN as the backbone, uses FPN as the neck network to extract multi-scale features, and proposes a Fourier prediction head. The article mentions that the model uses the P3, P4 and P5 feature maps of FPN for prediction.

The Fourier prediction head contains two branches, responsible for classification and regression respectively. Each branch consists of three $3\times3$ convolutional layers and one $1\times1$ convolutional layer, each followed by a ReLU activation.

  1. classification branch

    In the classification branch, the model predicts per-pixel masks for the text region (TR) and the text center region (TCR). Experiments find that TCR effectively improves the model's prediction performance, because it filters out low-quality predictions of text boundaries.

  2. regression branch

    A text Fourier feature vector is regressed for each pixel in the text. To target text instances of different scales, P3, P4, and P5 feature maps are responsible for small, medium, and large text instances, respectively. The detection results are reconstructed from the Fourier domain to the spatial domain by IFT and NMS.

insert image description here

The detection head consists of two branches, classification and regression. In the classification branch, the output has 4 channels: the first two give the probability that each pixel belongs to a text region (TR), and the last two the probability that it belongs to a text center region (TCR). The classification branch effectively segments the text region and then finds the center of the text region's contour.

The regression branch has 22 output channels: with Fourier degree $K = 5$, the Fourier expansion keeps the 5 positive and 5 negative frequencies plus $k = 0$, i.e. 11 complex Fourier coefficients, each represented by $(u_k, v_k)$, for a total of 22 variables. The final detection result is obtained with the inverse Fourier transform, as sketched below.
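As an illustration of this channel layout, here is a PaddlePaddle sketch (not the exact FCEHead code; the tensor shapes assume the P3 level of an 800x800 input):

import paddle
import paddle.nn as nn

fourier_degree = 5                                   # K
out_conv_cls = nn.Conv2D(256, 4, 3, padding=1)       # 2 ch TR + 2 ch TCR
out_conv_reg = nn.Conv2D(256, 2 * (2 * fourier_degree + 1), 3, padding=1)

feat = paddle.randn([4, 256, 100, 100])              # P3-scale feature map
cls_pred = out_conv_cls(feat)                        # [4, 4, 100, 100]
reg_pred = out_conv_reg(feat)                        # [4, 22, 100, 100]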
————————————————
Copyright statement: this part is based on the original article by CSDN blogger "Heng Youcheng", under the CC 4.0 BY-SA license; reprints must include the original source link and this statement.
Original link: https://blog.csdn.net/lx_ros/article/details/129112928

FCENet code analysis based on PaddleOCR


PaddleOCR code structure

We mainly focus on two aspects, scene text detection and recognition, and the main files of interest are as follows:

  • configs - mainly configuration files

  • ppocr - mainly stores model-related files, which is the core of the PaddleOCR framework. Including:

    • data augmentation
    • loss function
    • Evaluation index
    • model definition
    • optimizer
    • Post-processing
    • Tools

    These together make up the algorithm implementations.

  • tools - stores scripts for model training, inference, and more

These are the files we need to pay attention to. Next, we will start with the config file to understand the configuration file of PaddleOCR.

configs file

The configs folder contains the following 7 entries, corresponding to popular OCR tasks:

[Figure: contents of the configs folder]

We again focus on the two folders det and rec. cls is ppocr's text direction classifier; if needed, it can classify the orientation of detected text so that corresponding processing can be applied.

Let's take DBNet in the det task, i.e. the det_res18_db_v2.0.yml file, as an example to illustrate the structure of ppocr configuration files. Config files are generally named task_backbone_algorithm_dataset.

1. Global field
Global:
  use_gpu: true
  epoch_num: 1200
  log_smooth_window: 20
  print_batch_step: 2
  save_model_dir: ./output/ch_db_res18_icdar2015+project/
  save_epoch_step: 1200
  eval_batch_step: [3000, 2000]
  cal_metric_during_train: False
  pretrained_model: ./pretrain_models/ch_ppocr_server_v2.0_det_train/best_accuracy
  checkpoints: #./output/ch_db_res18_icdar2015+project/best_accuracy
  save_inference_dir:
  use_visualdl: False
  infer_img: ./data/icdar2015/text_localization/test/
  save_res_path: ./output/det_db/predicts_db.txt

Global configures the overall training process; the items under it apply to the entire training run.

use_gpu: whether to use the GPU for training; true or false
epoch_num: number of training epochs
log_smooth_window: log smoothing window; the median of the values in the window queue is printed each time
print_batch_step: print a log every this many iterations
save_model_dir: directory where trained models are saved
save_epoch_step: save a model every this many epochs (in this file save_epoch_step equals epoch_num, so in practice a checkpoint is only written at epoch 1200; the best_accuracy model is still saved during training, which greatly reduces disk usage)
eval_batch_step: model evaluation interval; 2000 or [1000, 2000]: 2000 means evaluate every 2000 iterations, while [1000, 2000] means evaluate every 2000 iterations starting from iteration 1000
cal_metric_during_train: whether to compute metrics during training; this evaluates the model on the current batch, i.e. its performance on the training set
pretrained_model: path to the pretrained model
checkpoints: path to model parameters, used to resume training after an interruption
save_inference_dir: directory where the model converted for inference is saved after training
use_visualdl: whether to use VisualDL to visualize the training log
infer_img: path to the test image(s); can be a file or a directory
save_res_path: path where the detection results are saved

In the recognition task, there will be some fields as follows:

character_dict_path: path to the character dictionary
max_text_length: maximum text length
use_space_char: whether to recognize the space character
2. Architecture field
Architecture:
  model_type: det
  algorithm: DB
  Transform:
  Backbone:
    name: ResNet_vd
    layers: 18
  Neck:
    name: DBFPN
    out_channels: 256
  Head:
    name: DBHead
    k: 50

The Architecture field defines the network model's hyperparameters. In PaddleOCR the network is divided into four stages: Transform, Backbone, Neck, and Head.

model_type: network type, generally a detection or recognition task
algorithm: the algorithm the model uses
Transform: transformation module; the ppocr docs note this is currently only supported for rec tasks
Backbone: the following items set the backbone network's parameters
	name: backbone class name
	layers: number of ResNet layers
Neck: neck network settings
	name: neck class name
	out_channels: number of output channels
Head: head network settings
	name: head class name
	k: binarization coefficient in DBHead
3. Loss field
Loss:
  name: DBLoss
  balance_loss: true
  main_loss_type: DiceLoss
  alpha: 5
  beta: 10
  ohem_ratio: 3

The Loss field defines the hyperparameters of the loss function.

name: loss class name
balance_loss: whether DBLoss balances the number of positive and negative samples (using OHEM)
main_loss_type: loss used for the shrink_map in DBLoss
alpha: coefficient of the shrink_map loss in DBLoss
beta: coefficient of the threshold_map loss in DBLoss
ohem_ratio: negative-to-positive sample ratio for OHEM in DBLoss
4. Optimizer field
Optimizer:
  name: Adam
  beta1: 0.9
  beta2: 0.999
  lr:
    name: Cosine
    learning_rate: 0.001
    warmup_epoch: 2
  regularizer:
    name: 'L2'
    factor: 0

Hyperparameters for the optimizer

name: optimizer class name
beta1: exponential decay rate for the first-moment estimates
beta2: exponential decay rate for the second-moment estimates
lr: learning-rate decay settings
	name: decay class name
	learning_rate: base learning rate
regularizer: network regularization settings
	name: regularizer class name
	factor: regularization coefficient
5. PostProcess field
PostProcess:
  name: DBPostProcess
  thresh: 0.3
  box_thresh: 0.6
  max_candidates: 1000
  unclip_ratio: 1.5

Hyperparameters for post-processing

name: post-processing class name
thresh: binarization threshold applied to the segmentation map in DBPostProcess
box_thresh: filtering threshold for output boxes in DBPostProcess; boxes scoring below it are not output
max_candidates: maximum number of text boxes output by DBPostProcess
unclip_ratio: ratio by which text boxes are expanded in DBPostProcess
6. Metric field
Metric:
  name: DetMetric
  main_indicator: hmean

Define the evaluation method

name: metric evaluation class name
main_indicator: main indicator, used to select the best model
7. Data field
Train:
  dataset:
    name: SimpleDataSet
    data_dir: ./data/icdar2015/text_localization/
    label_file_list:
      - ./data/icdar2015/text_localization/train_icdar2015_label_merge.txt
    ratio_list: [1.0]
    transforms:
      - DecodeImage: # load image
          img_mode: BGR
          channel_first: False
      - DetLabelEncode: # Class handling label
      - IaaAugment:
          augmenter_args:
            - { 'type': Fliplr, 'args': { 'p': 0.5 } }
            - { 'type': Affine, 'args': { 'rotate': [-10, 10] } }
            - { 'type': Resize, 'args': { 'size': [0.5, 3] } }
      - EastRandomCropData:
          size: [960, 960]
          max_tries: 50
          keep_ratio: true
      - MakeBorderMap:
          shrink_ratio: 0.4
          thresh_min: 0.3
          thresh_max: 0.7
      - MakeShrinkMap:
          shrink_ratio: 0.4
          min_text_size: 8
      - NormalizeImage:
          scale: 1./255.
          mean: [0.485, 0.456, 0.406]
          std: [0.229, 0.224, 0.225]
          order: 'hwc'
      - ToCHWImage:
      - KeepKeys:
          keep_keys: ['image', 'threshold_map', 'threshold_mask', 'shrink_map', 'shrink_mask'] # the order of the dataloader list
  loader:
    shuffle: True
    drop_last: False
    batch_size_per_card: 8
    num_workers: 4


Eval:
  dataset:
    name: SimpleDataSet
    data_dir: 
      - ./data/icdar2015/text_localization/
    label_file_list:
      - ./data/icdar2015/text_localization/test_icdar2015_label_merge.txt
    transforms:
      - DecodeImage: # load image
          img_mode: BGR
          channel_first: False
      - DetLabelEncode: # Class handling label
      - DetResizeForTest:
#           image_shape: [736, 1280]
      - NormalizeImage:
          scale: 1./255.
          mean: [0.485, 0.456, 0.406]
          std: [0.229, 0.224, 0.225]
          order: 'hwc'
      - ToCHWImage:
      - KeepKeys:
          keep_keys: ['image', 'shape', 'polys', 'ignore_tags']
  loader:
    shuffle: False
    drop_last: False
    batch_size_per_card: 1 # must be 1
    num_workers: 2

Define the data set and some hyperparameters of the dataloader during training

dataset: returns one sample per iteration
	name: dataset class name
	data_dir: path where the dataset images are stored
	label_file_list: paths to the label files
	ratio_list: dataset sampling ratios; if label_file_list contains two train lists and ratio_list is [0.4, 0.6], 40% of the samples are drawn from train_list1 and 60% from train_list2 to form the whole dataset
	transforms: list of transforms applied to the images and labels
loader: dataloader settings
	shuffle: whether to shuffle the dataset order each epoch
	batch_size_per_card: per-card batch size during training
	drop_last: whether to drop the last incomplete mini-batch when the number of samples is not divisible by the batch size
	num_workers: number of subprocesses used for data loading; 0 means data is loaded in the main process

The ppocr documentation explains each config field in detail; see:

configs configuration file

FCE Debug

First, PaddleOCR's training script is at tools/train.py in the project; within train.py we focus on the main entry:

if __name__ == '__main__':
    config, device, logger, vdl_writer = program.preprocess(is_train=True)
    seed = config['Global']['seed'] if 'seed' in config['Global'] else 1024
    set_seed(seed)
    main(config, device, logger, vdl_writer)

In the main method, the training environment is initialized first, and the parsed config is passed into the constructors of the corresponding components. Note that the builders for the dataloader, model, post-processing, pre-processing, etc. are all implemented as build_<component> methods in ppocr/<component>/__init__.py. Taking the loss builder in ppocr/losses/__init__.py as an example:

# build_loss is called in train.py
loss_class = build_loss(config['Loss'])

# implementation of build_loss in ppocr/losses/__init__.py
def build_loss(config):
    support_dict = [
        'DBLoss', 'PSELoss', 'EASTLoss', 'SASTLoss', 'FCELoss', 'CTCLoss',
        'ClsLoss', 'AttentionLoss', 'SRNLoss', 'PGLoss', 'CombinedLoss',
        'CELoss', 'TableAttentionLoss', 'SARLoss', 'AsterLoss', 'SDMGRLoss',
        'VQASerTokenLayoutLMLoss', 'LossFromOutput', 'PRENLoss', 'MultiLoss',
        'TableMasterLoss', 'SPINAttentionLoss', 'VLLoss', 'StrokeFocusLoss',
        'SLALoss', 'CTLoss', 'RFLLoss', 'DRRGLoss', 'CANLoss', 'TelescopeLoss'
    ]
    config = copy.deepcopy(config)
    module_name = config.pop('name')
    assert module_name in support_dict, Exception('loss only support {}'.format(
        support_dict))
    module_class = eval(module_name)(**config)
    return module_class

The code above first parses the input config and pops the loss class name, verifying that it is in the list of supported algorithms.
It then uses eval to instantiate the loss object, where the instantiated class corresponds to the class name in the configuration file.
The constructed loss class is then returned.
PaddleOCR constructs all its other components in the same way, as can be seen in the __init__.py file of each package under ppocr.
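As a usage sketch, feeding build_loss a dict mirroring the Loss section of the DBNet config above is equivalent to instantiating the class directly:

loss_config = {
    'name': 'DBLoss',             # popped and checked against support_dict
    'balance_loss': True,         # remaining keys become constructor kwargs
    'main_loss_type': 'DiceLoss',
    'alpha': 5,
    'beta': 10,
    'ohem_ratio': 3,
}
loss_class = build_loss(loss_config)   # same as DBLoss(balance_loss=True, ...)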

Here, after instantiating the classes of all components, the following code will be called to start training:

# start train
program.train(config, train_dataloader, valid_dataloader, device, model,
               loss_class, optimizer, lr_scheduler, post_process_class,
               eval_class, pre_best_model_dict, logger, vdl_writer, scaler,
               amp_level, amp_custom_black_list)

The program class is implemented in the tools/program.py file.

In the program.train method, the following for loop implements the entire training process:

    for epoch in range(start_epoch, epoch_num + 1):
        # ... omitted ...
        for idx, batch in enumerate(train_dataloader):
            # ... omitted ...
            preds = model(images)
            loss = loss_class(preds, batch)
            avg_loss = loss['loss']
            avg_loss.backward()
            optimizer.step()
            optimizer.clear_grad()

In the training double loop, the batch tensor has shape [4, 3, 800, 800].

Here 4 is the batch size, 3 the number of channels, and [800, 800] the height and width after preprocessing and image augmentation.

The batch contains 4 elements: batch[0] is the image tensor to train on, and batch[1:4] are the target mask maps at the three scales.

After that, the images are sent into the built model for prediction. The model is organized as follows:

  • Base_model
    • backbone
    • neck
    • head

The model returns the sorted 'levels' field produced by the head network: a list of length 3, where each element is itself a list of length 2 holding the cls-branch feature map and the reg-branch feature map for that scale.

Base_model

Enter the forward function in the base_model class:

[Code screenshot: BaseModel.forward]

Here, the input image is forward-propagated through the model structure configured in the Architecture field of the config file. Transform is generally not used in detection tasks, so x is fed straight into the backbone, which is likewise constructed from the backbone class specified in the config file. FCENet uses ResNet_vd; we just need to look into det_resnet_vd.py to see its structure.

backbone

The Backbone used by FCENet is ResNet50_vd. The data enters the forward function in the ResNet_vd class, and the shape is [4,3,800,800].

In the forward function, the data first passes through the three conv1_x layers, attributes of the ResNet_vd class of type ConvBNLayer. Their structure after initialization is as follows:

[Code screenshot: ConvBNLayer structure]

The ConvBNLayer class contains three main layer structures:

  1. AvgPool2D;
  2. Conv2D or DeformableConvV2, selected according to the hyperparameters;
  3. BatchNorm.

Whether the data goes through the average pooling is decided by is_vd_mode in the configuration, and the conv layer type is chosen when the ConvBNLayer is initialized.
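A simplified sketch of this structure (the DeformableConvV2 branch and weight-attribute details are omitted; see det_resnet_vd.py for the real class):

import paddle.nn as nn

class ConvBNLayer(nn.Layer):
    def __init__(self, in_channels, out_channels, kernel_size,
                 stride=1, is_vd_mode=False, act=None):
        super().__init__()
        self.is_vd_mode = is_vd_mode
        # the "vd" trick: a stride-2 average pool instead of a strided conv
        self.pool = nn.AvgPool2D(kernel_size=2, stride=2,
                                 padding=0, ceil_mode=True)
        self.conv = nn.Conv2D(in_channels, out_channels, kernel_size,
                              stride=1 if is_vd_mode else stride,
                              padding=(kernel_size - 1) // 2, bias_attr=False)
        self.bn = nn.BatchNorm(out_channels, act=act)

    def forward(self, x):
        if self.is_vd_mode:
            x = self.pool(x)
        return self.bn(self.conv(x))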

In the ResNet_vd class, the data first enters conv1_1; in this example the conv layer is a plain Conv2D.

In conv1_1, the average pooling is skipped and the conv layer is applied with out_channels=32, kernel_size=[3,3], stride=2, padding=1, so the data shape becomes [4,32,400,400].

In conv1_2, the conv layer parameters are out_channels=32, kernel_size=[3,3], stride=1, padding=1, so the shape stays [4,32,400,400].

In conv1_3, the conv layer parameters are out_channels=64, kernel_size=[3,3], stride=1, padding=1; only the channel count changes and the spatial shape is unchanged.

After these three conv+bn layers comes a max_pool with kernel_size=3, stride=2, padding=1, so the data entering the residual blocks has shape [4,64,200,200].
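These shapes follow from the standard convolution output-size formula; a quick sanity check (the helper name is ours, not PaddleOCR's):

def conv_out(size, kernel=3, stride=1, padding=1):
    # floor((size + 2*padding - kernel) / stride) + 1
    return (size + 2 * padding - kernel) // stride + 1

assert conv_out(800, stride=2) == 400   # conv1_1
assert conv_out(400, stride=1) == 400   # conv1_2 / conv1_3
assert conv_out(400, stride=2) == 200   # max_pool (kernel 3, stride 2, pad 1)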

These three layers of conv1_x and maxpool implement some layers before the block in ResNet. You can see the picture below:

[Figure: ResNet architecture, stem layers before the residual blocks]

Next, a for loop traverses the block to complete the overall forward propagation of ResNet. The feature maps of different scales will be output according to the out_indices attribute.


For the specific structure in the block, refer to the init method of the ResNet_vd class.

Stage structure:

[Figure: stage structure of ResNet_vd]

The 4 blocks represent the conv2-conv5 stages in the figure above.

The feature maps after the backbone are downsampled by 8x, 16x, and 32x respectively, so the output of self.backbone(x) is a list of length 3, where:

  • x[0]=[batch_size, 512, 100, 100]
  • x[1]=[batch_size, 1024, 50, 50]
  • x[2]=[batch_size, 2048, 25, 25]

The number of channels are 512, 1024, 2048 respectively.

After the backbone, the type of x is checked: if x is a dictionary, y is updated with it; otherwise the corresponding field of y is set to x.

neck net

In the same way, the output of the backbone is sent into the neck.

In fce_fpn, the inputs in body_feats are first counted to determine the number of levels.

Then each level is traversed, and the corresponding feature map of body_feats is sent into the corresponding lateral conv in lateral_convs. See below:

[Code screenshot: lateral_convs loop in fce_fpn]

Here, lateral_convs applies a 1x1 convolution to each level's feature map in body_feats; the purpose is to unify the channel counts of the feature maps.


After the above for loop, the laterals list holds the feature maps of the different scales with their channels unified to 256.

Next, the feature maps are upsampled using the interpolate function from paddle.nn.functional.


The feature maps are traversed in reverse order: each deeper map is upsampled and then added to the shallower one.

Therefore, within laterals, only laterals[0] and laterals[1] are fused with the upsampled results of the deeper feature maps.

The fused laterals are then passed through a convolution with kernel_size=3x3, padding=1, stride=1, whose purpose is to smooth the feature maps after fusion.


From this analysis we can see that the core of fce_fpn is two lists, lateral_convs and fpn_convs, as sketched below.
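A condensed, illustrative sketch of this core logic (not the exact fce_fpn code; shapes assume the 800x800 input from above):

import paddle
import paddle.nn as nn
import paddle.nn.functional as F

in_channels = [512, 1024, 2048]                  # backbone output channels
lateral_convs = nn.LayerList([nn.Conv2D(c, 256, 1) for c in in_channels])
fpn_convs = nn.LayerList([nn.Conv2D(256, 256, 3, padding=1)
                          for _ in in_channels])

body_feats = [paddle.randn([4, 512, 100, 100]),
              paddle.randn([4, 1024, 50, 50]),
              paddle.randn([4, 2048, 25, 25])]
# 1x1 lateral convs unify all levels to 256 channels
laterals = [conv(x) for conv, x in zip(lateral_convs, body_feats)]
# top-down: upsample the deeper map and add it to the shallower one
for i in range(len(laterals) - 1, 0, -1):
    laterals[i - 1] = laterals[i - 1] + F.interpolate(
        laterals[i], scale_factor=2, mode='nearest')
# 3x3 fpn convs smooth the fused maps -> [P3, P4, P5]
outs = [conv(x) for conv, x in zip(fpn_convs, laterals)]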

head net

In det_fce_head, the forward function of the FCEHead class takes feats as input. Somewhat arcanely, a multi_apply function is used to run the paper's two branches over all levels; the things to focus on are the forward_single function and the separate training/inference branches after forward.

[Code screenshot: FCEHead.forward_single]

This function sends the data into the classification branch and the regression branch; a common implementation of multi_apply is sketched below.
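multi_apply is the usual mmdet-style helper: it maps forward_single over the levels and regroups the per-level outputs by branch. It is commonly implemented like this:

from functools import partial

def multi_apply(func, *args, **kwargs):
    """Apply func to each level's inputs and regroup the per-level outputs."""
    pfunc = partial(func, **kwargs) if kwargs else func
    map_results = map(pfunc, *args)
    return tuple(map(list, zip(*map_results)))

# forward_single returns (cls_map, reg_map) for one level, so
#   cls_res, reg_res = multi_apply(self.forward_single, feats)
# yields two lists of length 3, one entry per level (P3, P4, P5).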

Classification Branch

The classification branch corresponds to the out_conv_cls attribute of the FCEHead class, a convolutional layer with in_channels=256, out_channels=4, kernel_size=3, stride=1, padding=1. Its output is this section from the paper:

[Figure: classification branch output, from the paper]

Regression Branch

The regression branch corresponds to the out_conv_reg attribute of the FCEHead class, a convolutional layer with in_channels=256, out_channels=22 (related to the fourier_degree field in the config file; the paper gives the length of the Fourier signature vector as $2(2K+1)$), kernel_size=3, stride=1, padding=1. Its output corresponds to this part of the paper:

[Figure: regression branch output, from the paper]

Here is the output of the two branches:

In the forward function of the FCEHead class, after the classification branch and the regression branch, the data structure returned to the model is as follows:

out is a dictionary whose levels field holds a list of length 3; each element is a list of length 2 storing the classification feature map and the regression feature map at the corresponding scale.

[Figure: FCEHead return data structure]

After backbone, neck and head, the obtained data structure is as follows:

[Figure: output data structure after backbone, neck and head]
Each field saves the above output results.

FCE pre-processing stage

  • DecodeImage:
  • DetLabelEncode:
  • Color Jitter:
  • RandomScaling:
  • RandomCropFlip:
  • RandomCropPolyInstances:
  • RandomRotatePolyInstances:
  • SquareResizePad:
  • IaaAugment:
  • CFPNetTargets:
  • NormalizeImage:
  • ToCHWImage:
  • KeepKeys:

The above are the pre-processing classes of FCENet configured in the config file. The most important is CFPNetTargets (named FCENetTargets in the original project), which is mainly responsible for generating FCENet's ground-truth targets. The specific steps are as follows:

  1. First, in the generate_level_targets method, the bounding rectangle of each text instance is computed, and the instances are assigned to different scales according to the rectangle's aspect ratio.
  2. Since different feature maps are selected from the backbone (corresponding to 8x, 16x, and 32x downsampling), the generate_level_targets method computes, per scale: the text region mask (generate_text_region_mask), the text center region mask (generate_center_region_mask), the effective region mask (generate_effective_mask), and the Fourier map (generate_fourier_maps).
    [Figures: text region mask, text center region mask, effective mask, and one channel of the 22-channel Fourier map]

In the resampling stage, the original FCENet resamples along the text instance boundary: the distances between the boundary points are computed and points are sampled according to those distances; by default 400 points are sampled. A minimal sketch of such equidistant resampling follows.
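This is a sketch with illustrative names, not the project's exact code:

import numpy as np

def resample_polygon(polygon, n=400):
    """polygon: (M, 2) boundary points; returns (n, 2) equidistant points."""
    pts = np.asarray(polygon, dtype=np.float64)
    closed = np.vstack([pts, pts[:1]])              # close the contour
    seg = np.linalg.norm(np.diff(closed, axis=0), axis=1)
    cum = np.concatenate([[0.0], np.cumsum(seg)])   # arc length at each vertex
    targets = np.linspace(0, cum[-1], n, endpoint=False)
    out = np.empty((n, 2))
    for i, t in enumerate(targets):
        j = np.searchsorted(cum, t, side='right') - 1
        r = (t - cum[j]) / max(seg[j], 1e-8)        # fraction along segment j
        out[i] = closed[j] * (1 - r) + closed[j + 1] * r
    return out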

FCE post-processing stage

During inference, the input image is first converted into a tensor and sent into the model class to obtain the inferred feature maps.

They are then sent to the post-processing class, fce_postprocess, which reconstructs text boxes from the Fourier coefficients in the feature maps.
Model inference yields predictions at 3 scales; each scale's feature map has 26 channels, corresponding to the 4-channel classification branch and the 22-channel regression branch of the network structure (with Fourier degree 5, (2*5+1)*2 = 22).

The sizes of the feature maps correspond to the downsampling ratios of the input image.
In the post-processing class, the classification and regression branches are first separated for each scale's feature map, and the class method get_boundary is called to generate the text boxes.
get_boundary calls _get_boundary_single, whose most important part is the fcenet_decode method, which decodes the regression branch's output into text box coordinates.
In fcenet_decode, all the feature maps are taken out first.

The text region map and the text center region map are then multiplied pixel by pixel to obtain score_map; only the foreground channel of each prediction is used in the code, so the other channel of each of the two feature maps goes unused.

After obtaining score_pred, it is thresholded with score_thr to get the predicted text region mask tr_mask, and OpenCV's findContours is used to find the text region borders.
Next, each border is traversed and drawn onto deal_map, which is then multiplied by score_pred; this step extracts the scores along the text border. The result is filtered to obtain score_mask.
numpy's argwhere is then used to find the indices of all True positions. Note that argwhere returns indices in (row, col), i.e. (y, x) order, so the two columns are swapped when building the complex offset dxy, keeping the convention of Eq. (1), f(t) = x(t) + iy(t): the real part is the x coordinate and the imaginary part is the y coordinate.
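A condensed sketch of this decoding step with toy inputs (shapes assume K = 5 and the P3 scale; variable names follow the code described above):

import numpy as np

fourier_degree, scale = 5, 8                      # K, P3 downsampling ratio
h, w = 100, 100
score_map = np.random.rand(h, w) * (np.random.rand(h, w) > 0.99)
x_pred = np.random.randn(h, w, 2 * fourier_degree + 1)   # u_k channels
y_pred = np.random.randn(h, w, 2 * fourier_degree + 1)   # v_k channels

score_mask = score_map > 0
xy_text = np.argwhere(score_mask)                 # (n, 2) indices, (y, x) order
dxy = xy_text[:, 1] + xy_text[:, 0] * 1j          # swap to x + iy, per Eq. (1)
c = x_pred[score_mask] + y_pred[score_mask] * 1j  # (n, 11) Fourier coefficients
c[:, fourier_degree] += dxy                       # shift c_0 to the pixel position
c *= scale                                        # back to input-image coordinates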

Next, the x and y prediction maps are taken from the regression branch and combined into complex numbers to form c (the Fourier coefficients).

Here x and y are indexed by score_mask, i.e. the mask of text border scores computed from the text region, which selects the activated Fourier feature vectors in the regression branch.
For example (Eg1), if the text region yields a connected mask region of 30 pixels, the 30 corresponding feature vectors are taken from the regression feature map, giving a [30, 11] complex matrix (with Fourier degree K=5). Each row is one Fourier signature, i.e. one predicted text box candidate; each column is the Fourier coefficient at a different frequency.
The regressed coefficients describe the contour relative to each pixel, so the pixel coordinate dxy (a complex array whose real and imaginary parts are the x- and y-offsets) is added to the fourier_degree-th element, i.e. the constant coefficient c_0, which is the contour center; this moves the polygon's center from the origin to the pixel's position. Likewise, c *= scale restores the coordinates from the downsampled feature map to the input image scale.

Then c and the number of reconstruction points num_reconstr_points (50 by default) are passed into the fourier2poly method to transform the Fourier coefficients.
First, a complex matrix a of shape [n (the number of Fourier signature vectors), num_reconstr_points] is created.
It is worth noting the assignment to a here:
[Code screenshot: the assignment to a in fourier2poly]

Columns 5-10 of the Fourier coefficients (k = 0..5) are assigned to columns 0-5 of a, and columns 0-4 (k = -5..-1) are assigned to columns 45-49 of a (the concrete column indices assume the fixed parameters K=5 and num_reconstr_points=50).

This reorders the Fourier coefficients for the inverse Fourier transform. The Fourier series coefficients are complex and include both positive- and negative-frequency terms; in fourier_coeff the first half stores the negative-frequency coefficients and the second half the non-negative ones, while np.fft.ifft expects the non-negative frequencies at the front of the array and the negative frequencies wrapped around at the end. Copying the second half of fourier_coeff to the front of a and the first half to the back puts the coefficients in this standard FFT order.

Next, the inverse Fourier transform yields the predicted text instance contours.

Multiplying by num_reconstr_points calibrates the coordinate range: np.fft.ifft normalizes its output by 1/n, so multiplying the resulting complex polygon coordinates back by num_reconstr_points restores their amplitude, keeping the scale and proportions of the reconstructed polygon consistent with the original contour.
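Putting the reordering, the IFT, and the rescaling together, the method looks roughly like this (a sketch following the description above):

import numpy as np

def fourier2poly(fourier_coeff, num_reconstr_points=50):
    """fourier_coeff: (n, 2K+1) complex array ordered [c_-K, ..., c_0, ..., c_K].
    Returns an (n, 2 * num_reconstr_points) int array of flattened polygons."""
    n, dim = fourier_coeff.shape
    k = (dim - 1) // 2
    a = np.zeros((n, num_reconstr_points), dtype=complex)
    a[:, :k + 1] = fourier_coeff[:, k:]       # c_0..c_K  -> front columns
    a[:, -k:] = fourier_coeff[:, :k]          # c_-K..c_-1 -> wrapped at the end
    # np.fft.ifft normalizes by 1/num_reconstr_points; multiply it back
    poly_complex = np.fft.ifft(a, axis=1) * num_reconstr_points
    polygon = np.zeros((n, num_reconstr_points, 2))
    polygon[:, :, 0] = poly_complex.real      # x coordinates
    polygon[:, :, 1] = poly_complex.imag      # y coordinates
    return polygon.astype('int32').reshape((n, -1))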

Taking Eg1 as an example, poly_complex is a 30x50 complex matrix, and each row represents a text box.


An all-zero array polygon of shape (n, num_reconstr_points, 2) is created to store the polygon coordinates, where the first dimension is the number of candidate text instances, the second the number of polygon points, and the third the coordinate dimension (x and y).
Finally, the result is reshaped to (n, -1), flattening the point dimensions; the result is a 30x100 matrix, each row representing one text instance.
The most important part is now done; what remains is to filter the candidate boxes with NMS and keep the one with the highest confidence, where the confidence is the score at the activated positions of score_map under score_mask.
Note that in post-processing, NMS is first applied per instance, and then boxes from different scales are filtered by IoU overlap.

CVPR papers on OCR detection and recognition in recent years

2021

scene text detection

Fourier Contour Embedding for Arbitrary-Shaped Text Detection

​ FCENet

  • Paper: https://arxiv.org/abs/2104.10442

scene text recognition

Read Like Humans: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Recognition

ABINet

  • Paper: https://arxiv.org/abs/2103.06495

  • Code: https://github.com/FangShancheng/ABINet

2022

end-to-end text recognition

SwinTextSpotter: Scene Text Spotting via Better Synergy between Text Detection and Text Recognition

  • Paper: https://arxiv.org/abs/2203.10209
  • Code: https://github.com/mxin262/SwinTextSpotter

Origin: https://blog.csdn.net/qq_32577169/article/details/129581716