A tool for converting VOC-format data sets to COCO format

When working on object detection tasks, we often need to convert a data set from VOC format to COCO format. VOC and COCO are two widely used object detection annotation formats: VOC stores the annotations for each image in a separate XML file, while COCO stores all annotations in a single JSON file. The conversion is usually done to accommodate different deep learning frameworks or tools.
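To make the difference concrete, here is a minimal, hand-written illustration (the file name, coordinates and ids are made up for the example). A VOC annotation for a single object looks like this:

<annotation>
    <filename>000000000001.jpg</filename>
    <size>
        <width>640</width>
        <height>480</height>
        <depth>3</depth>
    </size>
    <object>
        <name>cat</name>
        <bndbox>
            <xmin>48</xmin>
            <ymin>240</ymin>
            <xmax>195</xmax>
            <ymax>371</ymax>
        </bndbox>
    </object>
</annotation>

The corresponding entries in a COCO JSON file would be (note that a COCO bbox is [x, y, width, height] rather than corner coordinates):

{
    "images": [{"file_name": "000000000001.jpg", "height": 480, "width": 640, "id": 1}],
    "annotations": [{"image_id": 1, "bbox": [48, 240, 147, 131], "area": 19257, "category_id": 1, "id": 1, "iscrowd": 0}],
    "categories": [{"supercategory": "none", "id": 1, "name": "cat"}]
}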

To simplify this process, I will share a Python script that converts VOC-format data sets to COCO format and also supports automatically copying the images to a specified directory.

First, we need to prepare the following:

  • A folder of VOC-format annotations (containing the XML files)
  • An output folder for the converted COCO-format JSON files
  • A list of category names (as strings, e.g. ['cat', 'dog', 'person'])
  • An image folder storing the corresponding image files

Next, we create a VOC2COCOConverter object and pass in the parameters listed above. We can also set the proportions parameter to control how the generated COCO-format data set is split between the training, validation and test sets. By default this parameter is [8, 1, 1], which divides the data set into 80% training, 10% validation and 10% test; the values are relative weights, so [80, 10, 10] is equivalent. You can also set it to [100] or [80, 20] if you need a different number of splits.
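Because the proportions are normalized by their sum, the split sizes follow the same arithmetic the script uses internally. A quick illustrative sketch (the file count is made up):

total_files = 1000
proportions = [8, 1, 1]
limits = [int(total_files * p / sum(proportions)) for p in proportions]
# limits == [800, 100, 100]; any remainder left over from integer
# division is folded into the last split by the script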

We can also set the copy_images parameter to decide whether to copy the image files. If it is set to True, the script automatically copies each image into a folder with the same name as the generated COCO-format JSON file. This feature is useful for data set management and makes it convenient to explore the data, debug models, or use the data set with other frameworks.
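For example, a converter that produces an 80/20 train/val split and copies the images could be created like this (the paths are placeholders):

converter = VOC2COCOConverter(
    xml_dir='path/to/xml/directory',
    json_dir='path/to/json/directory',
    classes=['cat', 'dog', 'person'],
    img_dir='path/to/image/directory',
    proportions=[80, 20],
    copy_images=True,
)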

Here is the sample code:

import os
import glob
import json
import shutil
import xml.etree.ElementTree as ET
from collections import defaultdict
from tqdm import tqdm

START_BOUNDING_BOX_ID = 1

class VOC2COCOConverter:
    def __init__(self, xml_dir, json_dir, classes, img_dir, proportions=[8, 1, 1], copy_images=False, min_samples_per_class=20):
        self.xml_dir = xml_dir
        self.json_dir = json_dir
        self.img_dir = img_dir
        self.classes = classes
        self.proportions = proportions
        self.copy_images = copy_images
        self.min_samples_per_class = min_samples_per_class

        self.pre_define_categories = {}
        for i, cls in enumerate(self.classes):
            self.pre_define_categories[cls] = i + 1
            
    def convert(self):
        xml_files_by_class = self._get_sorted_xml_files_by_class()
        dataset_size = len(self.proportions)
        xml_files_by_dataset = [defaultdict(list) for _ in range(dataset_size)]
        xml_files_count_by_dataset = [0] * dataset_size
        
        for cls, xml_files in xml_files_by_class.items():
            total_files = len(xml_files)
            datasets_limits = [int(total_files * p / sum(self.proportions)) for p in self.proportions]
            datasets_limits[-1] = total_files - sum(datasets_limits[:-1]) # adjust to make sure the sums are correct due to integer division

            start = 0
            for i, limit in enumerate(datasets_limits):
                xml_files_by_dataset[i][cls] = xml_files[start:start + limit]
                xml_files_count_by_dataset[i] += limit
                start += limit

        for idx, xml_files_dict in enumerate(xml_files_by_dataset):
            dataset_dir = ''
            if self.copy_images:
                dataset_dir = os.path.join(self.json_dir, f'dataset_{idx + 1}')
                os.makedirs(dataset_dir, exist_ok=True)
                
            json_file_name = f'dataset_{idx + 1}.json'
            xml_files = sum(xml_files_dict.values(), [])
            self._convert_annotation(tqdm(xml_files), os.path.join(self.json_dir, json_file_name))
            if dataset_dir:
                self._copy_images(tqdm(xml_files), dataset_dir)

            print(f"\n在数据集{
      
      idx+1}中,各个类型的样本数量分别为:")
            for cls, files in xml_files_dict.items():
                print(f"类型 {
      
      cls} 的样本数量是: {
      
      len(files)}")

        print("\n各个数据集中相同类型样本的数量比值是:")
        for cls in self.classes:
            print("\n类型 {}:".format(cls))
            for i in range(len(self.proportions) - 1):
                if len(xml_files_by_dataset[i + 1].get(cls, [])) != 0 :
                    print("数据集 {} 和 数据集 {} 的样本数量比是: {}".format(
                        i + 1, 
                        i + 2, 
                        len(xml_files_by_dataset[i].get(cls, [])) / len(xml_files_by_dataset[i + 1].get(cls, []))
                    ))

    def _get_sorted_xml_files_by_class(self):
        xml_files_by_class = defaultdict(list)
        for xml_file in glob.glob(os.path.join(self.xml_dir, "*.xml")):
            tree = ET.parse(xml_file)
            root = tree.getroot()
            # Collect the distinct classes in this file so that a file with
            # several objects of the same class is only listed once per class
            classes_in_file = {obj.find('name').text for obj in root.findall('object')}
            for class_name in classes_in_file:
                if class_name in self.classes:
                    xml_files_by_class[class_name].append(xml_file)


        # Filter classes
        if self.min_samples_per_class is not None:
            xml_files_by_class = {
                cls: files
                for cls, files in xml_files_by_class.items()
                if len(files) > self.min_samples_per_class
            }

        xml_files_by_class = dict(
            sorted(xml_files_by_class.items(), key=lambda item: len(item[1]), reverse=True))

        return xml_files_by_class

    def _copy_images(self, xml_files, dataset_dir):
        # Assumes the images are .jpg files sharing the XML file's base name
        for xml_file in xml_files:
            img_file = os.path.join(self.img_dir, os.path.basename(xml_file).replace('.xml', '.jpg'))
            if os.path.exists(img_file):
                shutil.copy(img_file, dataset_dir)

    def _get_files_by_majority_class(self):
        # Alternative grouping strategy (not used by convert()): assign each
        # XML file to the single class that appears most often in that file.
        xml_files_by_class = defaultdict(list)
        for xml_file in glob.glob(os.path.join(self.xml_dir, "*.xml")):
            tree = ET.parse(xml_file)
            root = tree.getroot()
            class_counts = defaultdict(int)
            for obj in root.findall('object'):
                class_name = obj.find('name').text
                if class_name in self.classes:
                    class_counts[class_name] += 1
            if class_counts:  # skip files with no objects of a known class
                majority_class = max(class_counts, key=class_counts.get)
                xml_files_by_class[majority_class].append(xml_file)

        return dict(sorted(xml_files_by_class.items(), key=lambda item: len(item[1]), reverse=True))

    def _convert_annotation(self, xml_list, json_file):
        json_dict = {"info": ['none'], "license": ['none'], "images": [], "annotations": [], "categories": []}
        categories = self.pre_define_categories.copy()
        bnd_id = START_BOUNDING_BOX_ID
        all_categories = {}

        for xml_f in xml_list:
            tree = ET.parse(xml_f)
            root = tree.getroot()

            filename = os.path.basename(xml_f)[:-4] + ".jpg"

            # Note: assumes the base file name ends in digits; the last 9
            # characters before the extension become the numeric image id
            image_id = int(filename.split('.')[0][-9:])

            size = self._get_and_check(root, 'size', 1)
            width = int(self._get_and_check(size, 'width', 1).text)
            height = int(self._get_and_check(size, 'height', 1).text)
            image = {'file_name': filename, 'height': height, 'width': width, 'id': image_id}
            json_dict['images'].append(image)

            for obj in self._get(root, 'object'):
                category = self._get_and_check(obj, 'name', 1).text
                if category in all_categories:
                    all_categories[category] += 1
                else:
                    all_categories[category] = 1

                if category not in categories:
                    new_id = len(categories) + 1
                    print(filename)
                    print("[warning] 类别 '{}' 不在 'pre_define_categories'({})中,将自动创建新的id: {}".format(category, self.pre_define_categories, new_id))
                    categories[category] = new_id

                category_id = categories[category]
                bndbox = self._get_and_check(obj, 'bndbox', 1)
                xmin = int(float(self._get_and_check(bndbox, 'xmin', 1).text))
                ymin = int(float(self._get_and_check(bndbox, 'ymin', 1).text))
                xmax = int(float(self._get_and_check(bndbox, 'xmax', 1).text))
                ymax = int(float(self._get_and_check(bndbox, 'ymax', 1).text))
                o_width = abs(xmax - xmin)
                o_height = abs(ymax - ymin)

                ann = {'area': o_width * o_height, 'iscrowd': 0, 'image_id': image_id,
                       'bbox': [xmin, ymin, o_width, o_height], 'category_id': category_id,
                       'id': bnd_id, 'ignore': 0, 'segmentation': []}
                json_dict['annotations'].append(ann)
                bnd_id = bnd_id + 1

        for cate, cid in categories.items():
            cat = {'supercategory': 'none', 'id': cid, 'name': cate}
            json_dict['categories'].append(cat)
        with open(json_file, 'w') as json_fp:
            json.dump(json_dict, json_fp)
        print("------------已完成创建 {}--------------".format(json_file))
        print("找到 {} 类别: {} -->>> 你的预定类别 {}: {}".format(len(all_categories), all_categories.keys(), len(self.pre_define_categories), self.pre_define_categories.keys()))
        print("类别: id --> {}".format(categories))
        
    def _get(self, root, name):
        return root.findall(name)

    def _get_and_check(self, root, name, length):
        elems = root.findall(name)
        if len(elems) == 0:
            raise ValueError('Can not find %s in %s.' % (name, root.tag))
        if length > 0 and len(elems) != length:
            raise ValueError('The size of %s is supposed to be %d, but is %d.' % (name, length, len(elems)))
        if length == 1:
            elems = elems[0]
        return elems

if __name__ == '__main__':
    # Folder containing the VOC XML annotation files
    xml_dir = 'path/to/xml/directory'
    # Folder where the COCO JSON files will be written
    json_dir = 'path/to/json/directory'
    # List of category names, as strings
    classes = ['cat', 'dog', 'person']
    # Folder containing the image files
    img_dir = 'path/to/image/directory'
    # Train/validation/test split proportions
    proportions = [80, 10, 10]
    # Create a VOC2COCOConverter object and run the conversion
    converter = VOC2COCOConverter(xml_dir, json_dir, classes, img_dir, proportions, copy_images=True)
    converter.convert()

The code above only converts the VOC format into COCO format; it does not decide which split is the training, validation or test set. After conversion, you therefore need to rename each folder and annotation file yourself (see the sketch below). The test split is optional and can be adjusted to your needs.
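For instance, assuming dataset_1, dataset_2 and dataset_3 correspond to the train, val and test splits (they are produced in the order of proportions), the renaming could be scripted like this; the paths are placeholders, and the JSON files are moved into an annotations folder to match the COCO layout shown below:

import os

json_dir = 'path/to/json/directory'
ann_dir = os.path.join(json_dir, 'annotations')
os.makedirs(ann_dir, exist_ok=True)

for idx, split in enumerate(['train', 'val', 'test'], start=1):
    # Move dataset_N.json to annotations/instances_<split>.json
    src = os.path.join(json_dir, f'dataset_{idx}.json')
    if os.path.exists(src):
        os.rename(src, os.path.join(ann_dir, f'instances_{split}.json'))
    # Rename the copied image folder dataset_N to <split>, if it exists
    img_folder = os.path.join(json_dir, f'dataset_{idx}')
    if os.path.isdir(img_folder):
        os.rename(img_folder, os.path.join(json_dir, split))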

This is the basic directory structure of the COCO dataset:

|-- annotations
|   |-- instances_train.json
|   |-- instances_val.json
|   |-- instances_test.json
|-- train
|   |-- image1.jpg
|   |-- image2.jpg
|   |-- ...
|-- val
|   |-- image1.jpg
|   |-- image2.jpg
|   |-- ...
|-- test
|   |-- image1.jpg
|   |-- image2.jpg
|   |-- ...

annotations folder: stores annotation files, such as instances_train.json, instances_val.json, etc.

train folder: stores the image files of the training set.

val folder: stores the image files of the validation set.

test folder: stores the image files of the test set.
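Once the files are arranged this way, you can sanity-check the conversion with pycocotools (a minimal sketch, assuming pycocotools is installed and the working directory is the data set root):

from pycocotools.coco import COCO

coco = COCO('annotations/instances_train.json')
print(len(coco.getImgIds()), 'images')
print(len(coco.getAnnIds()), 'annotations')
print([c['name'] for c in coco.loadCats(coco.getCatIds())])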

The above is a brief introduction to this VOC-to-COCO format conversion script. You can modify and optimize it to fit your actual needs. The script converts data set formats efficiently, and its support for automatically copying image files makes the resulting data sets easier to manage and use.

Note: This article and its code were generated entirely by GPT-4; the code has been tested and works as expected.

Source: blog.csdn.net/Lc_001/article/details/132698406