Many classification datasets are organized by folder: the samples of each category are stored in a subfolder named after that category, such as the VOC dataset.
Taking my own dataset as an example, the following script analyzes how balanced the data is across categories:
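import os
import matplotlib.pyplot as plt

path = 'soybeanleaf'  # dataset root; each subfolder is one category
# Keep only subfolders so that stray files in the root are not counted as categories
dirs = [d for d in os.listdir(path) if os.path.isdir(os.path.join(path, d))]
num_dir = len(dirs)

# Count the files in each category folder
num = []
for i in range(num_dir):
    file_i = os.listdir(os.path.join(path, dirs[i]))
    num.append(len(file_i))
print(dirs)
print(num)

# Sort the categories by file count, largest first
d = dict(zip(dirs, num))
sort_d = sorted(d.items(), key=lambda item: item[1], reverse=True)
x = []
y = []
for it in sort_d:
    x.append(it[0])
    y.append(it[1])

# Draw a horizontal bar chart of the per-category counts and save it
plt.barh(x[0:num_dir], y[0:num_dir])
plt.yticks(fontproperties='Times New Roman', size=2)
plt.savefig('leafdir.png', dpi=300)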
The script prints the names of the subfolders and the number of files in each one.
It also computes the distribution of the dataset across categories and saves it as a bar chart in leafdir.png.
From these results you can see how the data is distributed across the categories of the dataset, that is, which categories have more data and which have less, and you can then interpret the classification results of the algorithm with this imbalance in mind.
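To put a number on "more" and "less", the counts can be turned into per-category shares and a single largest-to-smallest ratio. The following is only a minimal sketch, not part of the original script: the helper names `count_per_class` and `summarize_balance` are hypothetical, and using the max/min ratio as an imbalance measure is an assumption made here for illustration.

```python
import os


def count_per_class(root):
    """Return {subfolder name: number of entries in it} for each subfolder of root."""
    counts = {}
    for name in sorted(os.listdir(root)):
        sub = os.path.join(root, name)
        if os.path.isdir(sub):  # skip stray files in the dataset root
            counts[name] = len(os.listdir(sub))
    return counts


def summarize_balance(counts):
    """Return (share of the data per category, largest/smallest count ratio)."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError('no files found in any category')
    shares = {name: n / total for name, n in counts.items()}
    smallest = min(counts.values())
    # The ratio is undefined when a category is empty, so report infinity then
    ratio = max(counts.values()) / smallest if smallest > 0 else float('inf')
    return shares, ratio
```

For example, `summarize_balance(count_per_class('soybeanleaf'))` would return the share of each category together with the ratio between the largest and the smallest category. A ratio close to 1 suggests a roughly balanced dataset, while a large ratio points to categories that may be underrepresented during training.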