Playing with the Lung-PET-CT-Dx lung cancer object detection dataset (Part 2): previewing the dataset and drawing bounding boxes

This article previews the dataset to see how the image files and annotation files are organized and processed.
The code used in this article: My Github

Dataset preview

The dataset is divided into three parts:
Images: the image data we need to download.
Annotation Files: the annotation data for object detection (XML files that mark tumor location and class).
Clinical Data: patient-related clinical data.

We mainly focus on the first two parts.


The Images section contains 355 subfolders named "Lung_Dx-Xxxxx", where X is the category letter and xxxx is the case number. For example, A0002 means case No. 0002 of category A (adenocarcinoma).
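As a small illustration of this naming scheme, the sketch below splits a folder name into its category letter and case number. The function name parse_subject_folder and the regular expression are assumptions made for illustration; only category A (adenocarcinoma) is described in this article, so other letters are returned without a description.

```python
import re

# Only category A is described in this article; other letters are left unnamed.
CATEGORY_NAMES = {"A": "adenocarcinoma"}

# Assumed pattern: "Lung_Dx-" + one capital letter + a 4-digit case number.
FOLDER_PATTERN = re.compile(r"^Lung_Dx-([A-Z])(\d{4})$")


def parse_subject_folder(name):
    """Split a folder name such as 'Lung_Dx-A0002' into its parts (hypothetical helper)."""
    match = FOLDER_PATTERN.match(name)
    if match is None:
        raise ValueError("unexpected folder name: {!r}".format(name))
    category, case_number = match.groups()
    return {
        "category": category,
        "case_number": case_number,
        "description": CATEGORY_NAMES.get(category, "unknown"),
    }


print(parse_subject_folder("Lung_Dx-A0002"))
```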



Under Annotation Files are annotation files in xml format. These are the object detection annotations, marking the tumor location and class.
In the example XML file, an object of type A is marked together with the four coordinates of its bounding box.
Some xml files contain several bounding boxes, for example when two nodular lesions are marked on one CT image.
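To make the structure concrete, here is a minimal sketch of reading such a file with Python's standard xml.etree.ElementTree. It assumes the annotations follow the common Pascal VOC layout (object / name / bndbox with xmin, ymin, xmax, ymax); that layout, the sample XML string, and the function name parse_boxes are assumptions for illustration, not taken from the dataset documentation or the author's code.

```python
import xml.etree.ElementTree as ET

# Made-up sample in the assumed Pascal VOC layout, with two boxes in one file.
SAMPLE_XML = """
<annotation>
  <object>
    <name>A</name>
    <bndbox><xmin>100</xmin><ymin>120</ymin><xmax>150</xmax><ymax>180</ymax></bndbox>
  </object>
  <object>
    <name>A</name>
    <bndbox><xmin>300</xmin><ymin>310</ymin><xmax>340</xmax><ymax>360</ymax></bndbox>
  </object>
</annotation>
"""


def parse_boxes(root):
    """Return (boxes, labels); each box is [xmin, ymin, xmax, ymax]."""
    boxes, labels = [], []
    for obj in root.iter("object"):
        box = obj.find("bndbox")
        coords = [int(float(box.findtext(tag))) for tag in ("xmin", "ymin", "xmax", "ymax")]
        boxes.append(coords)
        labels.append(obj.findtext("name"))
    return boxes, labels


boxes, labels = parse_boxes(ET.fromstring(SAMPLE_XML.strip()))
print(boxes, labels)
```

For a file on disk, ET.parse(path).getroot() gives the same kind of root element.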



This raises a question: how do the CT images and the XML annotation files correspond? That is, how do we know which CT image the bounding box in an xml file belongs to?

Correspondence between CT pictures and xml files

The file name of each xml file is a long string of numbers, such as "1.3.6.1.4.1.14519.5.2.1.6655.2359.122259036515695905512549026864.xml". This string is the globally unique identifier of one CT image within a CT series, called the SOP Instance UID.

We can open a dcm image file with a DICOM viewer (such as RadiAnt) and inspect its tags to see the SOP Instance UID. Each dcm image corresponds one-to-one with the xml file whose name is that UID.
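In code, the UID can be recovered from an xml file name by dropping the directory and the extension. The sketch below also does a loose sanity check based on the DICOM rule that a UID consists of digits and dots and is at most 64 characters long; the helper name uid_from_xml_path and the directory part of the sample path are assumptions for illustration.

```python
import os
import re

# Loose check: dot-separated groups of digits (DICOM UIDs are at most 64 characters).
UID_PATTERN = re.compile(r"^\d+(\.\d+)*$")


def uid_from_xml_path(xml_path):
    """Return the SOP Instance UID encoded in an annotation file name (hypothetical helper)."""
    uid = os.path.splitext(os.path.basename(xml_path))[0]
    if len(uid) > 64 or not UID_PATTERN.match(uid):
        raise ValueError("file name does not look like a DICOM UID: {!r}".format(uid))
    return uid


path = "Annotation/A0001/1.3.6.1.4.1.14519.5.2.1.6655.2359.122259036515695905512549026864.xml"
print(uid_from_xml_path(path))
```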


View dcm file information

Since RadiAnt is paid software, we can instead use Python's pydicom library to view and process dcm files.
First, install the library with the command pip install pydicom.

Then we can load the dcm file we need.
Its SOPInstanceUID attribute gives the SOP Instance UID.
The pixel matrix can also be obtained to draw the CT image.
The code is as follows:

import pydicom
import matplotlib.pyplot as plt

# dcmread replaces the deprecated read_file alias
im = pydicom.dcmread('manifest-1608669183333/Lung-PET-CT-Dx/Lung_Dx-A0001/04-04-2007-NA-Chest-07990/2.000000-5mm-40805/1-01.dcm')

# Get the SOP Instance UID
uid = im.SOPInstanceUID

# Get the pixel matrix
img_arr = im.pixel_array
# Print the matrix size
print(img_arr.shape)

# Draw the image
plt.imshow(img_arr, cmap=plt.cm.gray)
plt.title("UID:{}".format(uid))
plt.show()

Result: the CT slice is displayed with its SOP Instance UID as the title.


Pair dcm images with xml annotation files

We need to process the data: pair each dcm image with its xml annotation file, then view the resulting bounding box.

For example, let's look at the data for case A0001.

# View the image files and xml annotation files of case A0001
import os
import pydicom

# Collect all dcm files under the given directory
dcm_file=[]
for root, dirs, files in os.walk('manifest-1608669183333/Lung-PET-CT-Dx/Lung_Dx-A0001'):
    for file in files:
        # Match on the file extension only, not on "dcm" anywhere in the path
        if file.endswith('.dcm'):
            dcm_file.append(os.path.join(root, file))
print(dcm_file[0])

# Collect the SOP Instance UID of every dcm file found above
dcm_file_uid=[]
for dcm in dcm_file:
    im=pydicom.dcmread(dcm)
    dcm_uid=im.SOPInstanceUID
    dcm_file_uid.append(dcm_uid)
print(dcm_file_uid[0])

# Collect all xml files under the annotation directory of the same case
xml_file=[]
for root, dirs, files in os.walk('Lung-PET-CT-Dx-Annotations-XML-Files-rev12222020/Annotation/A0001'):
    for file in files:
        if file.endswith('.xml'):
            xml_file.append(os.path.join(root, file))
print(xml_file[0])



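# Find the files in dcm_file whose SOP Instance UID matches an xml file name
to_select_dcm = []
for xml in xml_file:
    # File name without directory or ".xml" extension, i.e. the UID
    # (replaces the fixed slice xml[66:-4], which breaks if the directory path changes)
    xml_file_name = os.path.splitext(os.path.basename(xml))[0]
    if xml_file_name in dcm_file_uid:
        idx = dcm_file_uid.index(xml_file_name)
        to_select_dcm.append(dcm_file[idx])
to_select_dcm[:5]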
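The same pairing can be kept as explicit (dcm, xml) pairs, which avoids relying on list positions later. The sketch below is one possible variant, not the author's code: it takes a mapping from UID to dcm path and uses made-up UIDs and paths so that it runs without the dataset. The function name pair_annotations is hypothetical.

```python
import os


def pair_annotations(uid_to_dcm, xml_paths):
    """Return (pairs, unmatched): (dcm_path, xml_path) pairs, and xml files with no dcm."""
    pairs, unmatched = [], []
    for xml_path in xml_paths:
        # The xml file name without its extension is the SOP Instance UID.
        uid = os.path.splitext(os.path.basename(xml_path))[0]
        dcm_path = uid_to_dcm.get(uid)
        if dcm_path is None:
            unmatched.append(xml_path)
        else:
            pairs.append((dcm_path, xml_path))
    return pairs, unmatched


# Made-up UIDs and paths, for illustration only.
uid_to_dcm = {"1.2.3": "series/1-01.dcm", "1.2.4": "series/1-02.dcm"}
xml_paths = ["Annotation/A0001/1.2.9.xml", "Annotation/A0001/1.2.4.xml"]

pairs, unmatched = pair_annotations(uid_to_dcm, xml_paths)
print(pairs)
print(unmatched)
```

With the real data, the mapping could be built as dict(zip(map(str, dcm_file_uid), dcm_file)). A dictionary lookup also avoids searching the whole UID list once per xml file.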



Let's take the first paired dcm image and its XML annotation file to preview the bounding box.

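# Take the first paired dcm image
im_path = to_select_dcm[0]

# Read the dcm image file
im = pydicom.dcmread(im_path)

# Get the UID
uid = im.SOPInstanceUID
# Pick the xml file named after this UID, so the pair is guaranteed to match
# (xml_file[0] is only correct if the first xml file had a matching dcm)
xml = next(x for x in xml_file if os.path.basename(x) == uid + '.xml')

# Get the pixel matrix
img_arr = im.pixel_array
# Print the matrix size
print(img_arr.shape)
# Draw the image
plt.imshow(img_arr, cmap=plt.cm.bone)
plt.title("UID:{}".format(uid))
plt.show()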



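# Get the bounding boxes and classes from the xml file
# (get_labelFromXml is a helper that is not defined in this article)
bbox, label = get_labelFromXml(xml)
bbox, label = bbox[0], label[0]  # one dcm image may have several boxes; take the first

def bbox_to_rect(bbox, color):
    # Convert a box from (top-left x, top-left y, bottom-right x, bottom-right y)
    # to the matplotlib format: ((top-left x, top-left y), width, height)
    return plt.Rectangle(
        xy=(bbox[0], bbox[1]), width=bbox[2]-bbox[0], height=bbox[3]-bbox[1],
        fill=False, edgecolor=color, linewidth=2)

fig = plt.imshow(img_arr, cmap=plt.cm.bone)
plt.title("UID:{}".format(uid))

# Draw the box, then write the class label just inside its top-left corner
fig.axes.add_patch(bbox_to_rect(bbox, 'red'))
fig.axes.text(bbox[0]+12, bbox[1]+12, label,
                      va='center', ha='center', fontsize=12, color='red')
plt.show()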
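As a quick check of the conversion inside bbox_to_rect, the sketch below applies the same arithmetic to a made-up box without needing matplotlib. The function name corners_to_xywh and the sample numbers are assumptions for illustration.

```python
def corners_to_xywh(bbox):
    """Convert (xmin, ymin, xmax, ymax) to ((x, y), width, height)."""
    xmin, ymin, xmax, ymax = bbox
    if xmax < xmin or ymax < ymin:
        raise ValueError("box corners are out of order: {!r}".format(bbox))
    return (xmin, ymin), xmax - xmin, ymax - ymin


# Made-up box: top-left corner (100, 120), bottom-right corner (150, 180).
xy, width, height = corners_to_xywh((100, 120, 150, 180))
print(xy, width, height)
```

Because imshow places the image origin at the top-left by default, the point passed as xy is the top-left corner of the box as it appears on screen.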


In this way, we can draw the bounding box to preview the data.

Origin blog.csdn.net/takedachia/article/details/129867427