Handling corrupted images in large datasets: solutions for OSError: image file is truncated and PIL.UnidentifiedImageError: cannot identify image file

Project scenario:

When training a deep learning model on a large image dataset (tens of thousands, hundreds of thousands, or millions of images), some of the images may be damaged. Training can then abort suddenly with a data-loading error several hours into the first epoch, which wastes a lot of time. The methods below can be used to troubleshoot and fix the two errors I ran into. I hope they are a useful reference~


Problem 1: OSError: image file is truncated (cause analysis and solution)

Error message: OSError: image file is truncated

Cause analysis: This error is raised when PIL decodes an image whose data is incomplete, for example because a download, copy, or write was interrupted before the file was fully saved. A JPEG file is a binary stream that starts with the bytes ff d8 and ends with ff d9. In most cases where OSError: image file is truncated is reported, the tail of the file has been cut off: the closing ff d9 is missing, so the decoder runs out of data before it reaches the end marker and raises the error.
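
The marker check described above can be sketched with the standard library alone. The helper name jpeg_marker_status is hypothetical, and the check is only a heuristic: a file can end in ff d9 and still be damaged in the middle, and a valid file with extra bytes after the end marker will be reported as missing it.

```python
from pathlib import Path

JPEG_SOI = b"\xff\xd8"  # start-of-image marker, first two bytes of a JPEG
JPEG_EOI = b"\xff\xd9"  # end-of-image marker, last two bytes of a JPEG


def jpeg_marker_status(path):
    """Return (has_start_marker, has_end_marker) for the file at `path`.

    Hypothetical helper for illustration; it only looks at the first and
    last two bytes and does not decode the image.
    """
    data = Path(path).read_bytes()
    return data.startswith(JPEG_SOI), data.endswith(JPEG_EOI)
```

A result of (True, False) is the typical signature of the truncated-tail case.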

There are two solutions.

Method 1: Ignore the missing data. Add the two lines below near the top of your code. When PIL then encounters a truncated image, it stops decoding at the point where the data ends and returns what it has decoded so far instead of raising an error. The image is still used; the part that could not be decoded simply contains no real image content.

from PIL import ImageFile
ImageFile.LOAD_TRUNCATED_IMAGES = True
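
Note that ImageFile.LOAD_TRUNCATED_IMAGES is a module-level flag, so setting it affects every image loaded afterwards in the same process. If you would rather enable it only around specific loads, one possible approach is a small context manager. This is a sketch, not part of PIL: the names allow_truncated_images and make_truncated_jpeg are hypothetical, and the truncated JPEG is synthetic test data built from random noise.

```python
import os
from contextlib import contextmanager
from io import BytesIO

from PIL import Image, ImageFile


@contextmanager
def allow_truncated_images():
    """Temporarily enable LOAD_TRUNCATED_IMAGES, then restore the old value."""
    previous = ImageFile.LOAD_TRUNCATED_IMAGES
    ImageFile.LOAD_TRUNCATED_IMAGES = True
    try:
        yield
    finally:
        ImageFile.LOAD_TRUNCATED_IMAGES = previous


def make_truncated_jpeg():
    """Build a noisy JPEG in memory and cut off its second half (test data only)."""
    noise = Image.frombytes("RGB", (128, 128), os.urandom(128 * 128 * 3))
    buffer = BytesIO()
    noise.save(buffer, format="JPEG")
    data = buffer.getvalue()
    return data[: len(data) // 2]


truncated = make_truncated_jpeg()

with allow_truncated_images():
    img = Image.open(BytesIO(truncated))
    img.load()  # decodes the available data instead of raising OSError
    print(img.size)
```

Outside the with block the flag is back to its previous value, so other code still sees the error for truncated files.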

Method 2: Complete the data and then use it. If many images are damaged, discarding them would cost too much data, and the truncated part has little effect on the image, you can repair the files by appending ff d9 to the data. The code is as follows; loop over your own dataset in the same way.

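from io import BytesIO

from PIL import Image

ori_img = '001.jpg'
with open(ori_img, 'rb') as f:
    # Read the file as raw bytes
    data = f.read()

# Append the end-of-image marker (ff d9) if the file does not already end with it
if not data.endswith(b'\xff\xd9'):
    data += b'\xff\xd9'

img = Image.open(BytesIO(data))
if img.mode != "RGB":
    img = img.convert('RGB')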
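
To apply Method 2 to a whole folder, the traversal could look roughly like the sketch below. It uses only the standard library. The function name append_missing_eoi and the directory layout are assumptions for illustration: it handles only files named *.jpg directly inside one folder, and it writes patched copies to a separate folder so the originals stay untouched.

```python
from pathlib import Path

JPEG_EOI = b"\xff\xd9"  # end-of-image marker


def append_missing_eoi(src_dir, dst_dir):
    """Copy every .jpg from src_dir to dst_dir, appending ff d9 where it is missing.

    Hypothetical helper; returns the names of the files that were patched.
    """
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    patched = []
    for src in sorted(src_dir.glob("*.jpg")):
        data = src.read_bytes()
        if not data.endswith(JPEG_EOI):
            data += JPEG_EOI
            patched.append(src.name)
        # Files that were already complete are copied unchanged
        (dst_dir / src.name).write_bytes(data)
    return patched
```

Appending the marker only restores the missing terminator; it does not recover the lost pixel data, so it is worth spot-checking a few of the patched images by eye.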

Problem 2: PIL.UnidentifiedImageError: cannot identify image file (cause analysis and solution)

Error message: PIL.UnidentifiedImageError: cannot identify image file '001.jpg'

Cause analysis: PIL raises this error when it cannot recognize the file as an image in any format it supports, for example because the file is empty, its header is corrupted, or it is not actually an image. Double-clicking such a file does not open it either. These images can simply be deleted.
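
The two errors in this post can be told apart in code. The sketch below assumes a Pillow version that exposes UnidentifiedImageError (Pillow 7.0 or later); the function name classify_image and its return strings are hypothetical. Because UnidentifiedImageError is a subclass of OSError, it has to be caught first.

```python
from PIL import Image, UnidentifiedImageError


def classify_image(path):
    """Return "ok", "unidentified", or "unreadable" for the file at `path`."""
    try:
        with Image.open(path) as img:
            # Image.open only reads the header; load() forces a full decode
            img.load()
    except UnidentifiedImageError:
        # PIL could not recognize the file as any supported image format
        return "unidentified"
    except OSError:
        # Covers "image file is truncated" as well as missing files
        return "unreadable"
    return "ok"
```

Other exception types can occur for badly damaged files, so a dataset scan may want a broader except clause than this sketch uses.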

In my project, the training dataset contains on the order of 100,000 images, and the dataset provider told me that a few dozen of them cannot be opened. Since a txt file listing the dataset had already been generated, a simple traversal is enough: print the path and line number of every image that cannot be opened, then delete those entries manually. The code is as follows:

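from PIL import Image

txt_path = '/home/yy/dataset/train.txt'

with open(txt_path, 'r') as fh:
    # Line numbers start at 1 so they match what a text editor shows
    for i, line in enumerate(fh, start=1):
        words = line.split()
        if not words:
            continue  # skip blank lines
        img_path = words[0]  # the first column is the image path
        try:
            # Image.open only reads the header, which is enough to trigger
            # UnidentifiedImageError; the with block closes the file again
            with Image.open(img_path):
                pass
        except Exception:
            print(img_path, i)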

This error can also come from a problem with the PIL installation itself rather than with the file, so rule that out yourself. If many image files cannot be opened, you can delete them directly in code instead of by hand. See the following blog post:

When python reads images through PIL, an error is reported: OSError: cannot identify image file
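
As a more cautious alternative to deleting files in code, which cannot be undone, the list file itself can be split into a clean list and a list of bad entries. The sketch below is only one way to do this: the function name split_list_file and the output paths are hypothetical, and it assumes the same layout as the list file above, with the image path as the first whitespace-separated column. It fully decodes every image, so it is slower than a header-only check.

```python
from PIL import Image


def split_list_file(list_path, clean_path, bad_path):
    """Write entries with readable images to clean_path, the rest to bad_path.

    Hypothetical helper; returns (kept, dropped) counts.
    """
    clean, bad = [], []
    with open(list_path) as src:
        for line in src:
            words = line.split()
            if not words:
                continue  # skip blank lines
            try:
                with Image.open(words[0]) as img:
                    img.load()  # full decode, so truncated files are caught too
            except Exception:
                bad.append(line.rstrip("\n"))
            else:
                clean.append(line.rstrip("\n"))
    with open(clean_path, "w") as out:
        out.writelines(entry + "\n" for entry in clean)
    with open(bad_path, "w") as out:
        out.writelines(entry + "\n" for entry in bad)
    return len(clean), len(bad)
```

Training can then point at the clean list, and the bad list can be reviewed before anything is actually removed from disk.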


Other references:

Two blog posts analyze the first error, OSError: image file is truncated, in more detail and offer further solutions, for your reference~

Analysis and solution of PIL Image "image file is truncated" problem in Python program
OSError: image file is truncated solution ideas and solutions.

Origin blog.csdn.net/qq_39691492/article/details/124468857