First, look at how the sample function is used.
DataFrame.sample(n=None, frac=None, replace=False, weights=None, random_state=None, axis=None)
The calling object can be either a Series or a DataFrame.
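n and frac serve the same purpose and cannot be passed together: n is the number of items to sample (an integer), while frac is the fraction of items to sample (a float).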
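As a minimal sketch of the difference between n and frac, the example below samples from a made-up 20-row table (the img_XXX names are illustrative only, not from a real dataset):

```python
import pandas as pd

# Made-up table of 20 image names, used only for illustration
df = pd.DataFrame({"name": [f"img_{i:03d}" for i in range(20)]})

by_count = df.sample(n=5)            # exactly 5 rows
by_fraction = df.sample(frac=0.25)   # 25% of 20 rows, i.e. 5 rows

print(len(by_count), len(by_fraction))

# Passing both n and frac is rejected by pandas with a ValueError
try:
    df.sample(n=5, frac=0.25)
    both_rejected = False
except ValueError as err:
    both_rejected = True
    print("ValueError:", err)
```

Using frac is convenient when the split should scale with the size of the dataset; n is the better fit when a fixed number of samples is required.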
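replace controls whether sampling is done with replacement: with replace=True the same row can be drawn more than once, while with the default replace=False each row is drawn at most once. It does not overwrite the original DataFrame; sample always returns a new object.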
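The effect of replace is easiest to see on a very small table. The sketch below uses a made-up 3-row frame and asks for 6 rows, which is only possible when rows may repeat:

```python
import pandas as pd

# Made-up 3-row table, used only for illustration
df = pd.DataFrame({"name": ["a", "b", "c"]})

# With replacement the sample may be larger than the table,
# so at least one row has to appear more than once.
with_repl = df.sample(n=6, replace=True, random_state=0)
print(with_repl["name"].tolist())

# Without replacement, asking for more rows than exist raises ValueError.
try:
    df.sample(n=6, replace=False)
    oversample_failed = False
except ValueError:
    oversample_failed = True

# The original table is left untouched by sample().
print(len(df))
```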
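weights assigns a sampling weight to each element along the axis being sampled, so its length must equal the length of that axis, otherwise an error is raised. Missing weights are set to 0. If the weights do not sum to 1, they are normalized so that they do. inf and -inf values are not allowed.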
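A small sketch of these weight rules, on a made-up 4-row table: the weights below sum to 10 rather than 1, and one of them is missing. The 200 draws and the seed are arbitrary choices for the illustration.

```python
import numpy as np
import pandas as pd

# Made-up 4-row table, used only for illustration
df = pd.DataFrame({"name": ["a", "b", "c", "d"]})

# The weights sum to 10, so pandas rescales them to 0.7, 0.3, 0, 0.
# The NaN weight is treated as 0, so "c" and "d" can never be drawn.
weights = [7, 3, 0, np.nan]
picked = df.sample(n=200, replace=True, weights=weights, random_state=0)
print(picked["name"].value_counts())

# A weights list whose length differs from the axis length is rejected.
try:
    df.sample(n=1, weights=[1, 2])
    length_checked = False
except ValueError:
    length_checked = True
```

In the printed counts, "a" should appear roughly 70% of the time and "b" roughly 30%.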
axis is the direction of sampling: axis=0 samples rows, axis=1 samples columns.
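The two directions can be compared by checking the shape of the result. This is a minimal sketch on a made-up 5x4 table:

```python
import numpy as np
import pandas as pd

# Made-up 5-row, 4-column table, used only for illustration
df = pd.DataFrame(np.arange(20).reshape(5, 4), columns=list("ABCD"))

rows = df.sample(n=2, axis=0)  # 2 of the 5 rows, all 4 columns kept
cols = df.sample(n=2, axis=1)  # 2 of the 4 columns, all 5 rows kept

print(rows.shape)  # (2, 4)
print(cols.shape)  # (5, 2)
```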
random_state is the seed for the random number generator; fixing it makes the sampling result reproducible.
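A quick way to see what reproducibility means here: two calls with the same seed return the same rows in the same order. The table and the seed value 42 below are arbitrary choices for the illustration.

```python
import pandas as pd

# Made-up table of 50 image names, used only for illustration
df = pd.DataFrame({"name": [f"img_{i:03d}" for i in range(50)]})

first = df.sample(n=10, random_state=42)
second = df.sample(n=10, random_state=42)

# Same seed, same rows in the same order
same_result = first.equals(second)
print(same_result)
```

Fixing the seed is useful when a train/test split needs to be regenerated later without changing which images land in which set.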
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(50, 4), columns=list('ABCD'))
df.head()
Out:
          A         B         C         D
0  1.140379 -1.935629  1.308425 -0.849782
1  1.485294  0.549036  0.598258 -0.425654
2  1.196884  0.978419  0.863556 -0.127361
3 -0.125885  0.646688 -0.893611  0.139235
4 -0.706401  0.279978  0.430465  1.238776

df.sample(frac=0.1, replace=True)
Out:
           A         B         C         D
22 -0.366721  0.242974  1.258249  0.865142
28  0.480038 -0.892942 -0.919231  0.368843
24  0.851685  0.316462 -0.511101  0.605807
35  0.985688 -0.885189 -0.384027  1.192959
31 -1.965679 -1.334534 -0.860851 -0.174803
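Overall, the script first reads the image names from the JPEGImages folder of a VOC-format dataset, puts them into a DataFrame, samples them with DataFrame.sample, and then saves the results to the Main folder.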
# encoding=utf-8
import os
import shutil
import pandas as pd

# os.listdir() and open() do not expand '~', so expand it explicitly
root = os.path.expanduser('~/VOCdevkit/VOC2007')
path = os.path.join(root, 'JPEGImages')
main_dir = os.path.join(root, 'ImageSets', 'Main')
test_img_dir = '/home/xin/dogfacedetect/models/data/hashiqi/test'

# Read the file names (without extension) from the JPEGImages folder
lis = [os.path.splitext(i)[0] for i in os.listdir(path)]
df = pd.DataFrame(lis, columns=['name'])  # build the pandas table

temp1 = df.sample(100)  # randomly sample 100 names as the training set
train = temp1['name'].values.tolist()
print(len(train))
with open(os.path.join(main_dir, 'train.txt'), 'w') as f:  # save as train.txt
    f.write('\n'.join(train))  # VOC image-set files list one name per line

# Remove the names in train.txt, then sample the rest again as the test set
dft = pd.DataFrame(list(set(df.name.values) - set(temp1.name.values)), columns=['name'])
print(len(dft))
temp2 = dft.sample(10)
test = temp2['name'].values.tolist()
print(len(test))
with open(os.path.join(main_dir, 'test.txt'), 'w') as f:  # save as test.txt
    f.write('\n'.join(test))
# Copy every test image into a separate folder
for i in test:
    shutil.copy(os.path.join(path, i + '.jpg'), test_img_dir)

# Remove the names in test.txt; everything left becomes the validation set
val = list(set(dft.name.values) - set(temp2.name.values))
print(len(val))
assert len(val) == 10  # holds only if JPEGImages contains exactly 120 images
with open(os.path.join(main_dir, 'val.txt'), 'w') as f:  # save as val.txt
    f.write('\n'.join(val))

print(len(set(train) & set(val)))  # overlap between train and val, should be 0
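The same three-way split can also be sketched without set arithmetic, by dropping the sampled rows by index before sampling again. The example below is only an illustration under stated assumptions: split_ids and write_ids are hypothetical helpers, the image IDs are made up, and the output goes to a temporary directory rather than a real VOC tree.

```python
import os
import tempfile

import pandas as pd


def split_ids(ids, n_train, n_test, seed=None):
    """Split image IDs into disjoint train/test/val lists (hypothetical helper)."""
    df = pd.DataFrame(sorted(ids), columns=["name"])
    train = df.sample(n=n_train, random_state=seed)
    rest = df.drop(train.index)   # rows that were not chosen for train
    test = rest.sample(n=n_test, random_state=seed)
    val = rest.drop(test.index)   # whatever is left becomes validation
    return train["name"].tolist(), test["name"].tolist(), val["name"].tolist()


def write_ids(path, ids):
    """Write one image ID per line (hypothetical helper)."""
    with open(path, "w") as f:
        f.write("\n".join(ids) + "\n")


# Made-up IDs standing in for the names read from JPEGImages
ids = [f"{i:06d}" for i in range(120)]
train, test, val = split_ids(ids, n_train=100, n_test=10, seed=0)

# Write to a temporary directory instead of ImageSets/Main
out_dir = tempfile.mkdtemp()
for split_name, split in [("train", train), ("test", test), ("val", val)]:
    write_ids(os.path.join(out_dir, split_name + ".txt"), split)

print(len(train), len(test), len(val))
```

Dropping by index keeps the remaining rows in their original order, and passing a seed makes the split repeatable, which the set-based version does not guarantee.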