One Tool to Handle Annotation Data Format Conversion

Author: Yang Yicheng

Foreword

Usually, after a new model is released, its GitHub repository ships task scripts for training on custom datasets, so developers can quickly train and validate the model architecture on their own data. However, these training scripts often support only a few dataset formats. For example, the official YOLOv8 repository requires the dataset to follow the Ultralytics standard; otherwise the developer has to write conversion scripts by hand to turn the original dataset into the Ultralytics format, even though this work has no effect on the deployed model after training. A tool that makes it easy to manage training data and convert between formats can therefore greatly improve the efficiency of model training.

Introducing the Datumaro Kit

Project address: https://github.com/openvinotoolkit/datumaro

Datumaro is an annotation data management tool that can be used both as a Python library and from the command line. It supports the following functions:

  • Two-way conversion of annotation data formats, covering classification, segmentation, detection, keypoint detection, text detection, text recognition, re-identification, and point cloud tasks, with mutual conversion between the common annotation formats for these tasks (COCO, Pascal VOC, YOLO, and many more)
  • Build and modify datasets
  • Combine multiple datasets
  • Dataset label filtering, such as removing images with specific labels (see the Python sketch after this list)
  • Modify dataset labels
  • Dataset splitting, e.g. into training, validation, and test sets
  • Dataset sampling, e.g. selecting suitable training samples with an entropy-based method
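
As a quick taste of the Python side of the toolkit, the sketch below loads a dataset and keeps only the images that contain a given label. The path, format name, and label "dog" are placeholders, and the expression uses Datumaro's XPath-like filter syntax (check the documentation of your Datumaro version for the exact details):

import datumaro as dm

# Load an existing dataset, stating its format explicitly
dataset = dm.Dataset.import_from("path/to/dataset", "coco")

# Keep only the items that contain at least one "dog" annotation
dataset.filter('/item/annotation[label="dog"]')

# Inspect what is left
for item in dataset:
    print(item.id, item.subset, len(item.annotations))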

Datumaro Kit in Practice

Datumaro supports a command-line mode, so a single command is enough to convert data between two different formats. Next, based on the command-line mode, I will demonstrate the basic workflow of converting annotation data formats with Datumaro.

  • Datumaro installation and basic usage

Datumaro can be installed from PyPI. If you want to try out the latest features right away, you can also install it directly from the GitHub repository:

# From PyPI:
$ pip install datumaro[default]

# From the GitHub repository:
$ pip install 'git+https://github.com/openvinotoolkit/datumaro[default]'

Datumaro's command-line invocation is very simple. If you already have a dataset in one standard format, you only need to specify the source format and path, together with the target format and output path:

$ datum convert -if voc -i <path/to/voc> -f coco -o <output/dir>
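
Since Datumaro is also a Python library, the same conversion can be written in a few lines of code. The sketch below mirrors the command above; the paths are placeholders and the format names are the same ones the CLI uses:

# Minimal Python-API sketch of the VOC-to-COCO conversion shown above
import datumaro as dm

# Load the source dataset, stating its format explicitly
dataset = dm.Dataset.import_from("path/to/voc", "voc")

# Re-export it in COCO format; save_media also copies the image files
dataset.export("output/dir", "coco", save_media=True)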

  • Hands-on: a YOLOv8 object detection dataset

I happen to be preparing a meter-reading recognition project, so here I borrow the meter detection dataset provided by PaddlePaddle. The goal is to build a dial detection task with the YOLOv8 model.

Dataset download address: https://bj.bcebos.com/paddlex/examples/meter_reader/datasets/meter_det.tar.gz

After downloading the dataset, you can see the file directory structure as follows:

├── meter_det
│   ├── annotations
│   │   ├── instance_train.json
│   │   └── instance_test.json
│   ├── test
│   └── train

The training and validation images are stored in the train and test directories respectively:

├── test
│   ├── 20190822_105.jpg
│   ├── 20190822_110.jpg
│   ├── 20190822_123.jpg
│   ├── 20190822_124.jpg
│   ├── 20190822_127.jpg
│   ├── …

We can open one of the images at random to check the data:

Figure: Image data example

As a first step, we can use the datum detect command to automatically identify the dataset format:

$ datum detect './meter_det'
Output: Detected format: image_dir

As we can see, since the dataset does not follow any standard format specification, Datumaro treats it as a plain image folder. In this case we can look through a few standard dataset formats, find the one closest to our original data, and adjust the directory by hand. Checking the data formats supported by Datumaro at https://openvinotoolkit.github.io/datumaro/latest/docs/data-formats/supported_formats.html, we find that the original data is closest to the COCO format:

└─ Dataset/
    ├── dataset_meta.json # a list of custom labels (optional)
    ├── images/
    │   ├── train/
    │   │   ├── <image_name1.ext>
    │   │   ├── <image_name2.ext>
    │   │   └── ...
    │   └── val/
    │       ├── <image_name1.ext>
    │       ├── <image_name2.ext>
    │       └── ...
    └── annotations/
        ├── <task>_<subset_name>.json
        └── ...

Figure: COCO data format example

So we slightly modified the original directory by hand, adding an images directory to hold the image data separately. The reorganized layout looks like this:

├── meter_det_coco
│   ├── annotations
│   │   ├── instances_train.json
│   │   └── instances_val.json
│   └── images
│       ├── train
│       └── val

There is a point that is easily overlooked here: Datumaro infers the purpose of a COCO dataset, such as detection or segmentation, from the names of the .json annotation files, so the <task> part of each file name must be set to "instances".
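
A minimal sketch of the whole reorganization, using only the Python standard library; the directory and file names, and the mapping of the original test subset to val, follow the listings above:

import shutil
from pathlib import Path

src = Path("meter_det")        # original dataset
dst = Path("meter_det_coco")   # reorganized COCO-style copy

# Put the images under an images/ directory, renaming the test subset to val
shutil.copytree(src / "train", dst / "images" / "train")
shutil.copytree(src / "test", dst / "images" / "val")

# Copy the annotation files so that the <task> prefix becomes "instances"
(dst / "annotations").mkdir(parents=True, exist_ok=True)
shutil.copy(src / "annotations" / "instance_train.json",
            dst / "annotations" / "instances_train.json")
shutil.copy(src / "annotations" / "instance_test.json",
            dst / "annotations" / "instances_val.json")

With the directory rearranged, we run the detect command again on the modified dataset: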

$ datum detect './meter_det_coco'
Output: Detected format: coco

You can see that Datumaro now recognizes it as a standard COCO dataset. Finally, we can use the convert command mentioned above to turn the dataset from COCO into the Ultralytics standard in one step:

$ datum convert -if coco -i '/home/ethan/intel/data/meter_det_coco' -f yolo_ultralytics -o '/home/ethan/intel/data/meter_det_yolo' -- --save-media

PS: --save-media automatically copies the image files to the new dataset directory

The converted dataset directory is as follows:

├── meter_det_yolo
│   ├── data.yaml
│   ├── images
│   │   ├── train
│   │   └── val
│   ├── labels
│   │   ├── train
│   │   └── val
│   ├── train.txt
│   └── val.txt
└── table.jpg

With the dataset now in the Ultralytics standard, we can simply point the configuration file in the YOLOv8 repository at its path and then launch the training script to build the model:

model.train(data='data.yaml', epochs=100, imgsz=640)
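
For completeness, here is a minimal end-to-end training sketch using the Ultralytics Python API. The choice of the yolov8n.pt pretrained weights is an illustrative assumption; data.yaml is the file generated by Datumaro during the conversion above:

from ultralytics import YOLO

# Load a pretrained YOLOv8 model (yolov8n.pt is just an example checkpoint)
model = YOLO('yolov8n.pt')

# Train on the converted dataset; data.yaml was produced by the datum convert step
model.train(data='/home/ethan/intel/data/meter_det_yolo/data.yaml', epochs=100, imgsz=640)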

Summary

Datumaro is a very powerful annotation data management tool. It helps developers convert easily between various data standards and manage and transform these datasets effectively, which greatly improves the reusability of existing datasets and speeds up the verification of new models.



Source: blog.csdn.net/gc5r8w07u/article/details/130767833