Script tools for dataset processing [with code]

When working on classification projects (including object detection), dataset preprocessing is often involved. Here I am open-sourcing some utility scripts I wrote for everyone to use, and I will update them from time to time.

Related functions:

1. Converting one-hot labels to single index labels for classification tasks

2. Counting the number of samples in each category

3. Plotting the width/height distribution and aspect-ratio distribution of images in the dataset

4. Visualizing images with extreme aspect ratios in the dataset


1. One-hot labels to index labels

For example, our dataset has the following format, with the labels in one-hot form:

image1_path.png 1 0 0 # 1 0 0 is a cat

image2_path.png 0 1 0 # 0 1 0 is a dog

image3_path.png 0 0 1 # 0 0 1 is a bird

Now we want to convert the one-hot labels into single class indices, e.g. 0-cat, 1-dog, 2-bird, producing the following form:

image1_path.png 0

image2_path.png 1

image3_path.png 2

......

The code is as follows. The train and test flags select the training set or the test set; modify the paths according to your own needs.

# 1 0 0 -> 0 cat
# 0 1 0 -> 1 dog
# 0 0 1 -> 2 bird

train = False
test = False
# set exactly one of the two flags to True, otherwise label_list_path is undefined

if train:
    label_list_path = 'or_train.txt'  # txt path of the one-hot dataset
    txt_path = '/train.txt'  # save path for the converted file
elif test:
    label_list_path = 'or_test.txt'  # txt path of the one-hot dataset
    txt_path = '/test.txt'  # save path for the converted file
with open(label_list_path, 'r') as f:
    lines = f.readlines()

label_list = []
for line in lines:
    parts = line.strip().split()
    image_path = parts[0]
    one_hot_label = ' '.join(parts[1:])
    label = ''

    if one_hot_label == '1 0 0':  # cat
        label = '0'
    elif one_hot_label == '0 1 0':  # dog
        label = '1'
    elif one_hot_label == '0 0 1':  # bird
        label = '2'
    label_list.append(image_path + ' ' + label)

with open(txt_path, 'w') as file:
    for label in label_list:
        file.write(label + '\n')
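If more classes are added, the if/elif chain above grows quickly. As a minimal sketch of a generalization (my own, not part of the original script), the class index can be read directly as the position of the '1' in the one-hot part, assuming exactly one '1' per line:

```python
def one_hot_line_to_index(line):
    """Convert a line like 'img.png 0 1 0' to 'img.png 1'."""
    parts = line.strip().split()
    image_path, one_hot = parts[0], parts[1:]
    # the position of the '1' is the class id (cat=0, dog=1, bird=2, ...)
    return image_path + ' ' + str(one_hot.index('1'))

print(one_hot_line_to_index('image2_path.png 0 1 0'))  # -> image2_path.png 1
```

This works for any number of classes without editing the mapping by hand, at the cost of trusting the input to be well-formed one-hot vectors.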

 


2. Counting the number of samples in each category

The function below counts and prints the number of samples in each category, where txt_path is the dataset's txt file. Again, adapt it to your actual situation.

def label_count(txt_path):
    '''
    Counts the number of samples in each class and its share of the whole dataset.
    txt_path: path to label_list.txt; labels must be single indices, not one-hot.
    '''
    all_targets = 0
    class1 = 0  # number of samples in class 1
    class2 = 0  # number of samples in class 2
    class3 = 0  # number of samples in class 3
    with open(txt_path, 'r') as f:
        lines = f.readlines()
    for line in lines:
        if not line.strip():  # skip blank lines
            continue
        label = line.split()[1]
        if label != '':
            all_targets += 1
        if label == '0':  # cat
            class1 += 1
        elif label == '1':  # dog
            class2 += 1
        elif label == '2':  # bird
            class3 += 1
    print("Total number of targets: {}".format(all_targets))
    print("0-cat: {}, share {:.2f}%".format(class1, (class1 / all_targets) * 100))
    print("1-dog: {}, share {:.2f}%".format(class2, (class2 / all_targets) * 100))
    print("2-bird: {}, share {:.2f}%".format(class3, (class3 / all_targets) * 100))
    return (class1, class2, class3)

The printed output looks like this:

Total number of targets: 100928
0-cat: 20570, share 20.38%
1-dog: 15288, share 15.15%
2-bird: 65070, share 64.47%

If you also want to display the per-class counts as a bar chart, the code is as follows:

import matplotlib.pyplot as plt

def plot_bar(data):
    '''
    Displays the number of samples in each class as a bar chart.
    '''
    class_names = ['cat', 'dog', 'bird']
    # class counts
    counts = [x for x in data]
    # draw the bar chart
    plt.bar(class_names, counts)

    # add a count label above each bar
    for i in range(len(class_names)):
        plt.text(i, counts[i], str(counts[i]), ha='center', va='bottom')
    # set the title and axis labels
    plt.title('Number of targets per class')
    plt.xlabel('Class')
    plt.ylabel('Count')

    # show the figure
    plt.show()
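label_count above hard-codes three classes. A hedged alternative sketch (mine, not from the original post) uses collections.Counter so it works for any set of labels:

```python
from collections import Counter

def label_count_generic(txt_path):
    """Count every distinct label in a 'path label' txt file."""
    with open(txt_path, 'r') as f:
        counts = Counter(line.split()[1] for line in f if line.strip())
    total = sum(counts.values())
    for label, n in sorted(counts.items()):
        print("{}: {} ({:.2f}%)".format(label, n, n / total * 100))
    return counts
```

The returned Counter can then be fed to plot_bar, e.g. as `[counts[k] for k in sorted(counts)]`, provided the class-name list is adjusted to match.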

 


3. Width/height distribution and aspect-ratio distribution of images in the dataset

Suppose you want to examine the width/height distribution and the aspect-ratio distribution of all images in the dataset. Here root_path is the dataset's root directory and txt_path is its txt file; modify the code for your own setup, as long as every image can be read via the txt. The code is as follows:

from PIL import Image
import numpy as np
import matplotlib.pyplot as plt

def Dataset_shape_distribution(root_path, txt_path):
    with open(txt_path, 'r') as f:
        lines = f.readlines()
    widths = []   # widths of all images
    heights = []  # heights of all images

    for line in lines:
        image_path = root_path + '/' + line.split()[0]
        img = Image.open(image_path)
        w, h = img.size
        widths.append(w)
        heights.append(h)

    # compute the aspect ratios
    aspect_ratios = [widths[i] / heights[i] for i in range(len(widths))]

    # -------------- histogram counts and bins of the aspect ratios --------------
    hist, bins = np.histogram(aspect_ratios, bins=50)
    # find the most populated range
    max_freq_index = np.argmax(hist)  # index of the largest count
    most_common_range = (bins[max_freq_index], bins[max_freq_index + 1])  # bin edges at that index
    print("Most common aspect-ratio range:", np.around(most_common_range, decimals=2))

    hist_w, bins_w = np.histogram(widths, bins=50)
    max_freq_w_index = np.argmax(hist_w)  # index of the largest count
    most_common_w_range = (bins_w[max_freq_w_index], bins_w[max_freq_w_index + 1])  # bin edges at that index
    print("Most common width range:", np.around(most_common_w_range, decimals=2))

    hist_h, bins_h = np.histogram(heights, bins=50)
    max_freq_h_index = np.argmax(hist_h)  # index of the largest count
    most_common_h_range = (bins_h[max_freq_h_index], bins_h[max_freq_h_index + 1])  # fixed: was bins_w in the original
    print("Most common height range:", np.around(most_common_h_range, decimals=2))

    # for normalized display:
    # min_width = min(widths)
    # max_width = max(widths)
    # min_height = min(heights)
    # max_height = max(heights)
    # normalized_widths = [(w - min_width) / (max_width - min_width) for w in widths]
    # normalized_heights = [(h - min_height) / (max_height - min_height) for h in heights]

    # ------------------------- plotting -----------------------------------------------
    # show the distributions of w and h as histograms
    # bins: number of bars in the histogram
    plt.hist(widths, bins=50, alpha=0.5, color='b', edgecolor='black')
    plt.title('Datasets Width Distribution')
    plt.xlabel('Width')
    plt.ylabel('Count')
    plt.show()
    # plot the height distribution
    plt.hist(heights, bins=50, alpha=0.5, color='b', edgecolor='black')
    plt.title('Datasets Height Distribution')
    plt.xlabel('Height')
    plt.ylabel('Count')
    plt.show()

    # scatter plot of width vs height
    # plt.scatter(normalized_widths, normalized_heights, s=0.9)
    plt.scatter(widths, heights, s=0.9)
    plt.xlabel('Width')
    plt.ylabel('Height')
    plt.title('Width vs Height Distribution')
    plt.show()

    # aspect-ratio histogram
    plt.hist(aspect_ratios, bins=50, edgecolor='black')
    plt.xlabel('Aspect Ratio')
    plt.ylabel('Count')
    plt.title('Aspect Ratio Distribution')
    plt.show()

    # aspect-ratio histogram with the most common range highlighted
    plt.hist(aspect_ratios, bins=50, edgecolor='black')
    plt.xlabel('Aspect Ratio')
    plt.ylabel('Count')
    plt.title('Aspect Ratio Distribution')
    # mark the most common range
    plt.axvspan(most_common_range[0], most_common_range[1], color='r', alpha=0.5)
    # show the figure
    plt.show()

The output looks like this:

The most common aspect-ratio range is: [0.33 0.43]
The most common width range is: [80.72 156.44]
The most common height range is: [411.84 535.04]
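The peak-finding logic (np.histogram followed by np.argmax) is easy to check in isolation. The helper below is a small sketch of the same idea, using made-up numbers rather than real dataset statistics:

```python
import numpy as np

def most_common_range(values, bins=50):
    """Return the (low, high) edges of the most populated histogram bin."""
    hist, edges = np.histogram(values, bins=bins)
    i = int(np.argmax(hist))
    return edges[i], edges[i + 1]

# with two bins over [0.3, 1.0], the first bin [0.3, 0.65) holds three
# of the four values, so it is reported as the most common range
low, high = most_common_range([0.3, 0.3, 0.3, 1.0], bins=2)
```

The same helper could replace the three near-identical blocks in Dataset_shape_distribution, which also avoids copy-paste slips like the bins_w/bins_h mix-up.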

The width, height, and aspect-ratio distributions of the dataset are also plotted, as follows:

 

[Figure: dataset width distribution]

[Figure: dataset height distribution]

[Figure: width vs. height scatter plot]

[Figure: aspect-ratio distribution (most common range highlighted in red)]


4. Visualize images with extreme aspect ratios in the data set 

From the aspect-ratio distribution in section 3, we can see that some images have extreme aspect ratios. We can display these extreme samples to see what kind of data they are. The code is as follows:

Here root_path is the root directory, txt_path is the dataset's txt file, save_path is the save directory, and whr_thre is the aspect-ratio threshold; images whose ratio falls below the threshold are saved to save_path. Again, modify these path parameters for your own project.

from PIL import Image

def Extreme_data_display(root_path, txt_path, save_path, whr_thre=1.5):
    '''
    Analyzes the aspect ratios of the dataset and saves the images whose
    ratio falls below the threshold (exposing the more extreme samples).
    '''
    with open(txt_path) as f:
        lines = f.readlines()
    for line in lines:
        image_path = root_path + '/' + line.split()[0]
        img = Image.open(image_path)
        w, h = img.size
        ratio = w / h
        # if ratio >= whr_thre:  # flip the comparison to catch very wide images instead
        if ratio <= whr_thre:
            img.save(save_path + line.split()[0].split('/')[-1])

As for why we visualize extreme data: in pedestrian detection, for example, some images appear as "thin strips". A large number of such samples can affect the training of the network. For example:

[Figure: extreme sample]


Origin blog.csdn.net/z240626191s/article/details/133796474