This article continues with an overview of the dataset and shows how the image files and annotation files are processed.
The code used in this article: My Github
Dataset preview
The dataset is divided into three parts:
Images are the image data we need to download.
Annotation Files are the annotation data for object detection (that is, XML annotation files that mark tumor location and classification).
Clinical Data is patient-related clinical data.
We mainly focus on the first two parts.
The Images section contains 355 subfolders named in the pattern "Lung_Dx-Xxxxx", where X is the category letter and xxxx is the case number. For example, A0002 means case No. 0002 of category A (adenocarcinoma).
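The folder naming convention can be unpacked programmatically. The following is a minimal sketch: the helper name parse_case_folder, the CATEGORY_NAMES table, and the assumption that every case number has exactly four digits are illustrative and not taken from the dataset documentation. Only category A (adenocarcinoma) is named in this article, so it is the only entry in the table.

```python
import re

# Only category "A" is documented in this article; extend the table as needed.
CATEGORY_NAMES = {"A": "adenocarcinoma"}

# Assumed pattern: "Lung_Dx-" + one category letter + a four-digit case number.
_FOLDER_RE = re.compile(r"^Lung_Dx-([A-Z])(\d{4})$")


def parse_case_folder(folder_name):
    """Split a folder name such as 'Lung_Dx-A0002' into (category, case_number)."""
    match = _FOLDER_RE.match(folder_name)
    if match is None:
        raise ValueError(f"Unexpected folder name: {folder_name!r}")
    return match.group(1), match.group(2)


category, case_number = parse_case_folder("Lung_Dx-A0002")
print(category, case_number, CATEGORY_NAMES.get(category, "unknown"))
```

Raising an error on an unexpected name, rather than skipping it silently, makes it easier to notice folders that do not follow the assumed pattern.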
Annotation Files contains the annotation files in xml format, that is, the object detection annotations that mark the tumor location and classification.
In the xml file shown in the figure below, an object of type A is marked together with the four coordinates of its bounding box.
An xml file can also contain multiple bounding boxes, for example when two nodular lesions are marked on one CT image.
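To make the structure concrete, here is a hedged sketch of reading every bounding box from one annotation. It assumes the files follow the common Pascal VOC layout (an object element with a name and a bndbox holding xmin, ymin, xmax, ymax); the article does not spell out the tag names, so check them against a real file. The function name parse_voc_boxes and the sample XML are made up for illustration.

```python
import xml.etree.ElementTree as ET


def parse_voc_boxes(xml_text):
    """Return (boxes, labels) from a Pascal VOC-style annotation string.

    Each box is [xmin, ymin, xmax, ymax]; labels holds one class name per box.
    """
    root = ET.fromstring(xml_text)
    boxes, labels = [], []
    for obj in root.iter("object"):
        bndbox = obj.find("bndbox")
        boxes.append([int(float(bndbox.findtext(tag)))
                      for tag in ("xmin", "ymin", "xmax", "ymax")])
        labels.append(obj.findtext("name"))
    return boxes, labels


# Made-up sample with two boxes; the coordinates are not from the dataset.
SAMPLE_XML = """<annotation>
  <object>
    <name>A</name>
    <bndbox><xmin>10</xmin><ymin>20</ymin><xmax>50</xmax><ymax>60</ymax></bndbox>
  </object>
  <object>
    <name>A</name>
    <bndbox><xmin>100</xmin><ymin>120</ymin><xmax>140</xmax><ymax>170</ymax></bndbox>
  </object>
</annotation>"""

boxes, labels = parse_voc_boxes(SAMPLE_XML)
print(boxes, labels)
```

For a file on disk, the same loop can be run on ET.parse(path).getroot() instead of ET.fromstring.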
This raises a question: how do the CT images and the XML annotation files correspond? In other words, how do we know which CT image the bounding box in a given xml file belongs to?
Correspondence between CT pictures and xml files
The file name of each xml file is a long series of numbers, such as "1.3.6.1.4.1.14519.5.2.1.6655.2359.122259036515695905512549026864.xml". Without the extension, this is the globally unique identifier of one CT image in a CT sequence, called the SOP Instance UID.
We can open a dcm image file with DICOM viewing software (such as RadiAnt) and inspect its tags to see the SOP Instance UID. The name of an xml file matches the SOP Instance UID of exactly one dcm file, which is how each annotation is tied to its image.
View dcm file information
RadiAnt is paid software, so we can use Python's pydicom library instead to view and process dcm files.
First, install the library with the command pip install pydicom.
Then we can load the required dcm file and read its SOPInstanceUID attribute to view the SOP Instance UID.
The pixel matrix can also be obtained to draw the CT image.
The code is as follows:
import pydicom
import matplotlib.pyplot as plt

# pydicom.dcmread replaces the older pydicom.read_file
im = pydicom.dcmread('manifest-1608669183333/Lung-PET-CT-Dx/Lung_Dx-A0001/04-04-2007-NA-Chest-07990/2.000000-5mm-40805/1-01.dcm')
# Get the UID
uid = im.SOPInstanceUID
# Get the pixel matrix
img_arr = im.pixel_array
# Print the matrix shape
print(img_arr.shape)
# Draw the image
plt.imshow(img_arr, cmap=plt.cm.gray)
plt.title("UID:{}".format(uid))
plt.show()
Result: the CT slice is drawn in grayscale with its UID as the title.
Pair dcm images with xml annotation files
We need to process the data, pair each dcm image with its xml annotation file, and look at the resulting bounding box.
For example, suppose we want to look at the data for case A0001.
# Look at the image files and XML annotation files for case A0001
import os
import pydicom

# Collect every dcm file under the case directory
dcm_file = []
for root, dirs, files in os.walk('manifest-1608669183333/Lung-PET-CT-Dx/Lung_Dx-A0001'):
    for file in files:
        # Match on the extension rather than on 'dcm' appearing anywhere in the path
        if file.endswith('.dcm'):
            dcm_file.append(os.path.join(root, file))
print(dcm_file[0])

# Read the SOP Instance UID of every dcm file
dcm_file_uid = []
for dcm in dcm_file:
    # stop_before_pixels=True skips the pixel data, which is not needed here
    im = pydicom.dcmread(dcm, stop_before_pixels=True)
    dcm_file_uid.append(im.SOPInstanceUID)
print(dcm_file_uid[0])

# Collect every XML file under the annotation directory for this case
xml_file = []
for root, dirs, files in os.walk('Lung-PET-CT-Dx-Annotations-XML-Files-rev12222020/Annotation/A0001'):
    for file in files:
        if file.endswith('.xml'):
            xml_file.append(os.path.join(root, file))
print(xml_file[0])

# Select the dcm files whose SOP Instance UID matches an XML file name
to_select_dcm = []
for xml in xml_file:
    # The XML file name without its extension is the SOP Instance UID
    xml_file_name = os.path.splitext(os.path.basename(xml))[0]
    if xml_file_name in dcm_file_uid:
        idx = dcm_file_uid.index(xml_file_name)
        to_select_dcm.append(dcm_file[idx])
print(to_select_dcm[:5])
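The matching step can also be written with a dictionary lookup that keeps each dcm path together with the xml file it matched. This is only a sketch of an alternative, not the author's method; the function names uid_from_xml_path and pair_by_uid are hypothetical, and the paths and UIDs in the example are made up.

```python
import os


def uid_from_xml_path(xml_path):
    """Return the XML file name without its directory or '.xml' extension."""
    return os.path.splitext(os.path.basename(xml_path))[0]


def pair_by_uid(dcm_paths, dcm_uids, xml_paths):
    """Return (dcm_path, xml_path) pairs whose SOP Instance UIDs match."""
    # A dictionary gives constant-time lookup instead of scanning a list.
    uid_to_dcm = dict(zip(dcm_uids, dcm_paths))
    pairs = []
    for xml_path in xml_paths:
        dcm_path = uid_to_dcm.get(uid_from_xml_path(xml_path))
        if dcm_path is not None:  # skip annotations with no matching image
            pairs.append((dcm_path, xml_path))
    return pairs


# Made-up paths and UIDs, for illustration only.
pairs = pair_by_uid(
    dcm_paths=["scan/1-01.dcm", "scan/1-02.dcm"],
    dcm_uids=["1.2.3.1", "1.2.3.2"],
    xml_paths=["Annotation/A0001/1.2.3.2.xml", "Annotation/A0001/9.9.9.xml"],
)
print(pairs)
```

With the lists built earlier, this would be called as pair_by_uid(dcm_file, dcm_file_uid, xml_file). Because every result carries its own xml path, there is no need to assume that the first entry of to_select_dcm belongs to the first entry of xml_file.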
Let's take the first paired dcm image and XML annotation file and preview the bounding box.
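# Take the first paired dcm image and XML annotation file.
# Note: this assumes xml_file[0] is the annotation that matched to_select_dcm[0].
im, xml = to_select_dcm[0], xml_file[0]
# Read the dcm image file
im = pydicom.dcmread(im)
# Get the UID
uid = im.SOPInstanceUID
# Get the pixel matrix
img_arr = im.pixel_array
# Print the matrix shape
print(img_arr.shape)
# Draw the image
plt.imshow(img_arr, cmap=plt.cm.bone)
plt.title("UID:{}".format(uid))
plt.show()

# Take the first bounding box and its class.
# get_labelFromXml is not defined in this article; from its use here it is
# expected to return a list of boxes and a list of labels for one XML file.
bbox, label = get_labelFromXml(xml)
bbox, label = bbox[0], label[0]  # one dcm image may have several boxes; take the first

def bbox_to_rect(bbox, color):
    # Convert a box from (upper-left x, upper-left y, lower-right x, lower-right y)
    # to the matplotlib format: ((upper-left x, upper-left y), width, height)
    return plt.Rectangle(
        xy=(bbox[0], bbox[1]), width=bbox[2]-bbox[0], height=bbox[3]-bbox[1],
        fill=False, edgecolor=color, linewidth=2)

fig = plt.imshow(img_arr, cmap=plt.cm.bone)
plt.title("UID:{}".format(uid))
fig.axes.add_patch(bbox_to_rect(bbox, 'red'))
fig.axes.text(bbox[0]+12, bbox[1]+12, label,
              va='center', ha='center', fontsize=12, color='red')
plt.show()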
In this way, we can draw the bounding box to preview the data.