Using Python to count the images in each subfolder and plot a bar chart to analyze the long-tail distribution of categories

In many classification datasets, each category is stored in a folder named after it, such as the VOC dataset:

Taking my own dataset as an example, the following script analyzes how balanced the data is across categories:

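import os
import matplotlib.pyplot as plt

# Root folder of the dataset; each subfolder holds one category
path = 'soybeanleaf'
# Keep only subfolders, so stray files in the root do not break os.listdir below
dirs = [d for d in os.listdir(path) if os.path.isdir(os.path.join(path, d))]
num_dir = len(dirs)

# Count the entries in each category folder
num = [len(os.listdir(os.path.join(path, d))) for d in dirs]

print(dirs)
print(num)

# Sort categories by count, largest first
d = dict(zip(dirs, num))
sort_d = sorted(d.items(), key=lambda item: item[1], reverse=True)
x = [name for name, _ in sort_d]
y = [count for _, count in sort_d]

# Horizontal bar chart of per-category counts
plt.barh(x, y)
# Very small tick labels so that many category names fit on the axis
plt.yticks(fontproperties='Times New Roman', size=2)
plt.tight_layout()
plt.savefig('leafdir.png', dpi=300)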

The script prints the name of each subfolder and the number of files it contains.
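Note that the script counts every entry in a subfolder, not only images. If the folders also contain annotation or text files, a variant along the following lines could count image files only. This is a minimal sketch: the function name `count_images` and the extension set `IMAGE_EXTS` are assumptions for illustration and are not part of the original script.

```python
from pathlib import Path

# Assumed set of image extensions; adjust it to match the dataset.
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".bmp"}


def count_images(root, exts=IMAGE_EXTS):
    """Return {subfolder_name: number_of_image_files} for each subfolder of root."""
    counts = {}
    for sub in sorted(Path(root).iterdir()):
        if not sub.is_dir():
            continue  # skip stray files in the dataset root
        counts[sub.name] = sum(
            1
            for f in sub.iterdir()
            if f.is_file() and f.suffix.lower() in exts
        )
    return counts
```

Calling `count_images('soybeanleaf')` would return a dictionary that can take the place of `d` in the script above.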

It also sorts the categories by count and draws a bar chart of the data distribution:

From these results you can see how the data is distributed across categories: which categories have many samples and which have few. The classification results of an algorithm can then be interpreted with this imbalance in mind.
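To put numbers on "more" and "less", the per-category counts can be reduced to a few summary values. The sketch below is only one possible way to do this: the function name `summarize_imbalance`, the `tail_fraction` threshold, and the example category names and counts are all hypothetical and do not come from the dataset above.

```python
def summarize_imbalance(counts, tail_fraction=0.5):
    """Summarize class imbalance from a {category: count} mapping."""
    if not counts:
        raise ValueError("counts must not be empty")
    # Sort categories by count, largest first
    ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    largest_name, largest = ordered[0]
    smallest_name, smallest = ordered[-1]
    mean = sum(counts.values()) / len(counts)
    # Ratio of the largest category to the smallest; inf if a folder is empty
    ratio = largest / smallest if smallest > 0 else float("inf")
    # "Tail" categories: fewer samples than tail_fraction * mean (assumed threshold)
    tail = [name for name, n in ordered if n < tail_fraction * mean]
    return {
        "largest": largest_name,
        "smallest": smallest_name,
        "mean": mean,
        "imbalance_ratio": ratio,
        "tail_classes": tail,
    }


# Hypothetical counts, for illustration only
example = {"healthy": 1200, "rust": 300, "mosaic": 60}
summary = summarize_imbalance(example)
print(summary)
```

With the real data, the dictionary `d` built by the script can be passed in place of `example`.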

Origin blog.csdn.net/u013685264/article/details/126362797