CRNN Pitfall Notes

Pitfalls hit during training:

1. trainroot and valroot cannot be found

Rename them to match the option names defined in opt: trainRoot and valRoot.
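
To illustrate why the spelling has to match: argparse stores each option under exactly the name given to add_argument, so an option declared as --trainRoot is only reachable as opt.trainRoot. A minimal sketch; the two paths are placeholders, not my real dataset locations:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--trainRoot', required=True, help='path to the training dataset')
parser.add_argument('--valRoot', required=True, help='path to the validation dataset')

# Placeholder paths, for illustration only.
opt = parser.parse_args(['--trainRoot', 'data/train_lmdb',
                         '--valRoot', 'data/val_lmdb'])

print(opt.trainRoot)               # the attribute keeps the capital R
print(hasattr(opt, 'trainroot'))   # the lowercase spelling does not exist
```

Any code that reads opt.trainroot or opt.valroot therefore fails with an AttributeError until it is renamed.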

2. Crash when validation starts during training (112 train, 112 val)

[0/10000][495/10000] Loss: 25.622013
[0/10000][496/10000] Loss: 17.766382
[0/10000][497/10000] Loss: 19.240152
[0/10000][498/10000] Loss: 17.980661
[0/10000][499/10000] Loss: 14.364298
[0/10000][500/10000] Loss: 15.110470

Start val
Traceback (most recent call last):
  File "train.py", line 208, in <module>
    val(crnn, test_dataset, criterion)
  File "train.py", line 158, in val
    preds = preds.transpose(1, 0).contiguous().view(-1)
RuntimeError: dimension out of range (expected to be in range of [-1, 0], but got 1)

 python train.py --trainRoot /export/gpudata/fujingling/projects/crnn.pytorch-master/dataset/lmdb_capcha_ten_thousand/ --valRoot /export/gpudata/fujingling/projects/crnn.pytorch-master/dataset/lmdb_xjxl_112/ --ngpu 1 --random_sample --cuda --displayInterval 1 --saveInterval 1000 --batchSize 1 --nepoch 10000 --lr 0.0001 --workers 1 --expr_dir expr_captcha --n_test_disp 50

Namespace(adadelta=False, adam=False, alphabet='12345678abcdefghijkmnopqrstuvwxyzABDEFHKLMNRTY', batchSize=1, beta1=0.5, cuda=True, displayInterval=1, expr_dir='expr_captcha', imgH=32, imgW=100, keep_ratio=False, lr=0.0001, manualSeed=1234, n_test_disp=50, nepoch=10000, ngpu=1, nh=256, pretrained='', random_sample=True, saveInterval=1000, trainRoot='/export/gpudata/fujingling/projects/crnn.pytorch-master/dataset/lmdb_capcha_ten_thousand/', valInterval=500, valRoot='/export/gpudata/fujingling/projects/crnn.pytorch-master/dataset/lmdb_xjxl_112/', workers=1)

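I do not yet know exactly what causes this. The message itself says that preds has only one dimension when transpose(1, 0) is called. My guess is that n_test_disp was too large, though batchSize 1 may also be involved, since both were changed in the command that works.

The error goes away with the following command (I really need to read the source code properly):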

 python train.py --trainRoot /export/gpudata/fujingling/projects/crnn.pytorch-master/dataset/lmdb_xjxl_112/ --valRoot /export/gpudata/fujingling/projects/crnn.pytorch-master/dataset/lmdb_xjxl_112/ --ngpu 1 --random_sample --cuda --displayInterval 10 --saveInterval 100 --batchSize 10 --nepoch 10000 --lr 0.0001 --workers 1 --expr_dir expr_captcha --n_test_disp 10 --valInterval 10 

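test.py is a script I adapted from demo.py.

Pitfalls hit during testing:

1. When loading the model, the parameter names of every layer fail to match: every key in my saved model carries an extra `module.` prefix.

Solution: https://stackoverflow.com/questions/44230907/keyerror-unexpected-key-module-encoder-embedding-weight-in-state-dict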
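
The idea behind that answer, as I understand it: a model wrapped in torch.nn.DataParallel is held under an attribute named module, so every key in its saved state_dict starts with `module.` and a plain, unwrapped model cannot match them. A minimal sketch of the key renaming, using an ordinary dict in place of a real state_dict; the helper name strip_module_prefix and the example keys are made up for illustration:

```python
from collections import OrderedDict


def strip_module_prefix(state_dict, prefix='module.'):
    """Return a copy of state_dict with the leading prefix removed from each key."""
    cleaned = OrderedDict()
    for key, value in state_dict.items():
        # Only strip when the prefix is really there, so checkpoints
        # saved without DataParallel pass through unchanged.
        if key.startswith(prefix):
            key = key[len(prefix):]
        cleaned[key] = value
    return cleaned


# Illustrative keys and values; a real state_dict maps names to tensors.
saved = OrderedDict([('module.cnn.conv0.weight', 1), ('module.cnn.conv0.bias', 2)])
print(list(strip_module_prefix(saved).keys()))
```

Checking for the prefix before slicing means the same loading code works whether or not the checkpoint came from a DataParallel model.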

My test code follows:

import torch
from torch.autograd import Variable
import utils
import dataset
from PIL import Image
from glob import glob
import models.crnn as crnn


model_path = './expr/netCRNN_0_1000.pth'
img_path = './dataset/test/output_catptcha_20181106_cut/'
alphabet = '12345678abcdefghijkmnopqrstuvwxyzABDEFHKLMNRTY'

nclass = len(alphabet) + 1
nc = 1
nh=256
imgH = 32
model = crnn.CRNN(imgH, nc, nclass, nh)
if torch.cuda.is_available():
    model = model.cuda()
#print('loading pretrained model from %s' % model_path)
#model.load_state_dict(torch.load(model_path))


# original saved file with DataParallel
state_dict = torch.load(model_path)
# create new OrderedDict that does not contain `module.`
from collections import OrderedDict
new_state_dict = OrderedDict()
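for k, v in state_dict.items():
    # strip the `module.` prefix added by DataParallel, but only if it is present
    name = k[len('module.'):] if k.startswith('module.') else k
    new_state_dict[name] = v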
# load params
model.load_state_dict(new_state_dict)


converter = utils.strLabelConverter(alphabet)

transformer = dataset.resizeNormalize((100, 32))
image_paths = glob(img_path + '*.jpg')

model.eval()  # switch to inference mode once, before the loop
for path in image_paths:
    # load as grayscale, then resize and normalize to the 100x32 network input
    image = Image.open(path).convert('L')
    image = transformer(image)
    if torch.cuda.is_available():
        image = image.cuda()
    image = image.view(1, *image.size())  # add a batch dimension
    image = Variable(image)

    preds = model(image)

    # take the most likely class at each time step
    _, preds = preds.max(2)
    preds = preds.transpose(1, 0).contiguous().view(-1)

    preds_size = Variable(torch.IntTensor([preds.size(0)]))
    raw_pred = converter.decode(preds.data, preds_size.data, raw=True)
    sim_pred = converter.decode(preds.data, preds_size.data, raw=False)
    print(path)
    print('%-20s => %-20s' % (raw_pred, sim_pred))

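With 1 million training samples and 100,000 validation samples, validation accuracy is close to 100%, but results on the data provided by the business team are very poor.

I then trained on the 100 test images from the business team and validated on the same images, and the results were acceptable.

This suggests that the data I generated and the business team's data have different feature distributions.

I am now trying to convert all images to grayscale for both training and testing to see whether that helps.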

//TODO

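Validation logs are never printed:

The batch size or --valInterval is set too large. For example, with 100 training images and a batch size of 64 there are only two batches per epoch, so validation never triggers when valInterval is larger than that. Make sure valInterval does not exceed the number of batches per epoch (training samples / batchSize, rounded up).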
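
The arithmetic can be checked with a small sketch. It assumes the training loop runs validation whenever the batch counter within an epoch is a multiple of valInterval, which matches the log above where "Start val" appears right after batch 500 with valInterval=500. The function names are mine, not from train.py:

```python
import math


def batches_per_epoch(num_samples, batch_size):
    """Number of batches in one epoch when the last partial batch is kept."""
    return int(math.ceil(num_samples / float(batch_size)))


def validation_steps(num_samples, batch_size, val_interval):
    """Batch indices (1-based, within one epoch) at which validation would run."""
    total = batches_per_epoch(num_samples, batch_size)
    return [i for i in range(1, total + 1) if i % val_interval == 0]


print(batches_per_epoch(100, 64))      # two batches per epoch
print(validation_steps(100, 64, 500))  # empty list: validation never runs
print(validation_steps(112, 10, 10))   # the settings from the working command
```

An empty list means the validation log will never appear, no matter how many epochs are trained.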

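Something is wrong with the training data; I am not sure exactly what. The KeyError: ' ' at the end of the traceback suggests that at least one label contains a space, which is not in the alphabet.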

Traceback (most recent call last):
  File "train.py", line 198, in <module>
    cost = trainBatch(crnn, criterion, optimizer)
  File "train.py", line 177, in trainBatch
    t, l = converter.encode(cpu_texts)
  File "/export/gpudata/fujingling/projects/crnn.pytorch-master/utils.py", line 51, in encode
    text, _ = self.encode(text)
  File "/export/gpudata/fujingling/projects/crnn.pytorch-master/utils.py", line 45, in encode
    for char in text
KeyError: ' '
Exception NameError: "global name 'FileNotFoundError' is not defined" in <bound method _DataLoaderIter.__del__ of <torch.utils.data.dataloader._DataLoaderIter object at 0x7f569f92c150>> ignored
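
Since encode looks each character up in a table built from the alphabet, one way to locate the offending samples is to scan all labels before training. A minimal sketch; find_bad_labels is a hypothetical helper and the sample labels are invented, only the alphabet is the one I actually use:

```python
def find_bad_labels(labels, alphabet):
    """Return (index, label, unknown_chars) for every label with characters outside alphabet."""
    allowed = set(alphabet)
    bad = []
    for index, label in enumerate(labels):
        unknown = sorted(set(label) - allowed)
        if unknown:
            bad.append((index, label, unknown))
    return bad


alphabet = '12345678abcdefghijkmnopqrstuvwxyzABDEFHKLMNRTY'
labels = ['abc1', 'a b', 'xyz']  # invented examples; the second contains a space

for index, label, unknown in find_bad_labels(labels, alphabet):
    print('label #%d %r has characters outside the alphabet: %r' % (index, label, unknown))
```

Labels flagged this way can be cleaned (for example by stripping whitespace) or dropped before the LMDB dataset is built.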

The following error shows up partway through a run, every single time:

Traceback (most recent call last):
  File "train.py", line 198, in <module>
    cost = trainBatch(crnn, criterion, optimizer)
  File "train.py", line 173, in trainBatch
    data = train_iter.next()
  File "/export/docker/JXQ-23-46-110.h.chinabank.com.cn/fujingling/conda/envs/crnn/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 280, in __next__
    idx, batch = self._get_batch()
  File "/export/docker/JXQ-23-46-110.h.chinabank.com.cn/fujingling/conda/envs/crnn/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 259, in _get_batch
    return self.data_queue.get()
  File "/export/docker/JXQ-23-46-110.h.chinabank.com.cn/fujingling/conda/envs/crnn/lib/python2.7/multiprocessing/queues.py", line 378, in get
    return recv()
  File "/export/docker/JXQ-23-46-110.h.chinabank.com.cn/fujingling/conda/envs/crnn/lib/python2.7/site-packages/torch/multiprocessing/queue.py", line 22, in recv
    return pickle.loads(buf)
  File "/export/docker/JXQ-23-46-110.h.chinabank.com.cn/fujingling/conda/envs/crnn/lib/python2.7/pickle.py", line 1388, in loads
    return Unpickler(file).load()
  File "/export/docker/JXQ-23-46-110.h.chinabank.com.cn/fujingling/conda/envs/crnn/lib/python2.7/pickle.py", line 864, in load
    dispatch[key](self)
  File "/export/docker/JXQ-23-46-110.h.chinabank.com.cn/fujingling/conda/envs/crnn/lib/python2.7/pickle.py", line 1139, in load_reduce
    value = func(*args)
  File "/export/docker/JXQ-23-46-110.h.chinabank.com.cn/fujingling/conda/envs/crnn/lib/python2.7/site-packages/torch/multiprocessing/reductions.py", line 68, in rebuild_storage_fd
    fd = multiprocessing.reduction.rebuild_handle(df)
  File "/export/docker/JXQ-23-46-110.h.chinabank.com.cn/fujingling/conda/envs/crnn/lib/python2.7/multiprocessing/reduction.py", line 157, in rebuild_handle
    new_handle = recv_handle(conn)
  File "/export/docker/JXQ-23-46-110.h.chinabank.com.cn/fujingling/conda/envs/crnn/lib/python2.7/multiprocessing/reduction.py", line 83, in recv_handle
    return _multiprocessing.recvfd(conn.fileno())
OSError: [Errno 4] Interrupted system call
Exception NameError: "global name 'FileNotFoundError' is not defined" in <bound method _DataLoaderIter.__del__ of <torch.utils.data.dataloader._DataLoaderIter object at 0x7f1ebbb37f50>> ignored

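Solution: https://github.com/pytorch/pytorch/issues/4220

I never found what actually triggers the error, but after applying the change described in that issue it no longer occurs.

A model trained on the GPU, then loaded on the CPU (per the error below, the input is a CUDA tensor while the model weights are CPU tensors):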

Traceback (most recent call last):
  File "test_chinese.py", line 57, in <module>
    preds = model(image)
  File "/export/gpudata/fujingling/conda/envs/crnn/lib/python2.7/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/export/gpudata/fujingling/projects/crnn.pytorch-master/models/crnn.py", line 70, in forward
    conv = self.cnn(input)
  File "/export/gpudata/fujingling/conda/envs/crnn/lib/python2.7/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/export/gpudata/fujingling/conda/envs/crnn/lib/python2.7/site-packages/torch/nn/modules/container.py", line 91, in forward
    input = module(input)
  File "/export/gpudata/fujingling/conda/envs/crnn/lib/python2.7/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/export/gpudata/fujingling/conda/envs/crnn/lib/python2.7/site-packages/torch/nn/modules/conv.py", line 301, in forward
    self.padding, self.dilation, self.groups)
RuntimeError: Input type (torch.cuda.FloatTensor) and weight type (torch.FloatTensor) should be the same

The solution is below; I have not tried it myself.

Reposted from blog.csdn.net/u012135425/article/details/83933888