Tianchi / LUNA16 / Kaggle lung cancer: comparing the lung nodule datasets

The lung nodule competitions that can currently be found online are Tianchi, LUNA16, and Kaggle.

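The forums of all three sites have plenty of source code for lung nodule recognition and detection. As a beginner, I want to start with a rough picture of how the data provided by the three sites is similar and how it differs.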

Tianchi Medical AI Competition:

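The competition dataset provides low-dose lung CT scans (mhd format) from several thousand high-risk patients. Each scan is a 3D image made up of a series of axial slices of the chest cavity, and the number of 2D slices varies with the scanner, the slice thickness, and the patient. The mhd file has a header containing the necessary information about the patient ID, as well as scan parameters such as slice thickness.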
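To make the header idea concrete, here is a minimal sketch that parses the "Key = Value" lines of a MetaImage (.mhd) header using only the standard library. The tag names (DimSize, ElementSpacing, Offset, ElementDataFile) are standard MetaImage tags, but the sample values are hypothetical and are not taken from the Tianchi data. It also assumes the usual x, y, z ordering of DimSize and ElementSpacing.

```python
def parse_mhd_header(text):
    """Parse the 'Key = Value' lines of a MetaImage (.mhd) header into a dict."""
    header = {}
    for line in text.splitlines():
        if "=" not in line:
            continue  # skip blank or malformed lines
        key, value = line.split("=", 1)
        header[key.strip()] = value.strip()
    return header


# Hypothetical header contents, for illustration only.
sample_header = """ObjectType = Image
NDims = 3
DimSize = 512 512 280
ElementSpacing = 0.70 0.70 1.25
Offset = -180.0 -170.0 -340.0
ElementType = MET_SHORT
ElementDataFile = LKDS_00001.raw
"""

header = parse_mhd_header(sample_header)
dim_size = [int(v) for v in header["DimSize"].split()]
spacing = [float(v) for v in header["ElementSpacing"].split()]

num_slices = dim_size[2]       # number of axial slices (z dimension)
slice_spacing_mm = spacing[2]  # distance between neighbouring slices, in mm
print(num_slices, slice_spacing_mm)
```

In practice a library such as SimpleITK is the more common way to read mhd files. The plain-text parser above is only meant to show what the header holds.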

The preliminary-round data, which users are allowed to download, is as follows:

1. Data volume: 1000 patients, all of whom have nodules.
Data quality:
a) Approximate distribution of nodule size (mm):

5-10 mm	10-30 mm
50%	50%

b) Except for nodules that underwent pathological analysis, all other nodules were marked and confirmed by three doctors.

2. Data format:
1) CT images: mhd format
2) Nodule annotations: a csv file giving each nodule's position and size (mm)

seriesuid	coordX	coordY	coordZ	diameter_mm
LKDS_00001	-100.56	67.26	-231.81	6.44

3. Slice thickness (mm)
All CT scans have a slice thickness of less than 2 mm.
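The annotation columns listed above (seriesuid, coordX, coordY, coordZ, diameter_mm) give nodule centres in millimetres. To locate a nodule in the image array, they usually have to be converted to voxel indices. The sketch below reads the sample row with the csv module and does that conversion. It assumes a comma-separated file, an axis-aligned scan (identity direction matrix), and hypothetical origin and spacing values that would normally come from the scan's mhd header.

```python
import csv
import io

# The sample row from the table above, written as comma-separated text (assumed delimiter).
csv_text = (
    "seriesuid,coordX,coordY,coordZ,diameter_mm\n"
    "LKDS_00001,-100.56,67.26,-231.81,6.44\n"
)


def read_annotations(fileobj):
    """Read nodule annotations into a list of dicts with numeric fields."""
    nodules = []
    for row in csv.DictReader(fileobj):
        nodules.append({
            "seriesuid": row["seriesuid"],
            "world_xyz": (float(row["coordX"]), float(row["coordY"]), float(row["coordZ"])),
            "diameter_mm": float(row["diameter_mm"]),
        })
    return nodules


def world_to_voxel(world_xyz, origin_xyz, spacing_xyz):
    """Convert world coordinates (mm) to voxel indices, assuming an axis-aligned scan."""
    return tuple(
        int(round((w - o) / s))
        for w, o, s in zip(world_xyz, origin_xyz, spacing_xyz)
    )


nodules = read_annotations(io.StringIO(csv_text))

# Hypothetical values; in practice read Offset and ElementSpacing from the .mhd header.
origin = (-180.0, -170.0, -340.0)
spacing = (0.70, 0.70, 1.25)

voxel = world_to_voxel(nodules[0]["world_xyz"], origin, spacing)
print(nodules[0]["seriesuid"], voxel)
```

For a real file, pass open(path, newline="") to read_annotations instead of the in-memory string.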

However, the dataset for this Tianchi competition has not been made public yet, so it probably cannot be downloaded online.

LUNA16：

For this challenge, we use the publicly available LIDC/IDRI database. This data uses the Creative Commons Attribution 3.0 Unported License. We excluded scans with a slice thickness greater than 2.5 mm. In total, 888 CT scans are included. The LIDC/IDRI database also contains annotations which were collected during a two-phase annotation process using 4 experienced radiologists. Each radiologist marked lesions they identified as non-nodule, nodule < 3 mm, and nodules >= 3 mm. See this publication for the details of the annotation process. The reference standard of our challenge consists of all nodules >= 3 mm accepted by at least 3 out of 4 radiologists. Annotations that are not included in the reference standard (non-nodules, nodules < 3 mm, and nodules annotated by only 1 or 2 radiologists) are referred as irrelevant findings. The list of irrelevant findings is provided inside the evaluation script package (annotations_excluded.csv).

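In summary: the challenge uses the publicly available LIDC/IDRI database.

Scans with a slice thickness greater than 2.5 mm were excluded, leaving 888 CT scans in total.

The reference standard consists of all nodules >= 3 mm accepted by at least 3 of the 4 radiologists.

Annotations outside the reference standard (non-nodules, nodules < 3 mm, and nodules annotated by only 1 or 2 radiologists) are treated as irrelevant findings.

The list of irrelevant findings is in the evaluation script package (annotations_excluded.csv).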
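The reference-standard rule can be written as a small predicate. This is a simplified sketch, not the LUNA16 organisers' code. It assumes each finding has already been reduced to a diameter and a count of radiologists who accepted it as a nodule >= 3 mm. The field names and sample findings are hypothetical.

```python
def is_reference_nodule(diameter_mm, num_accepting, min_diameter_mm=3.0, min_votes=3):
    """True if a finding meets the rule: nodule >= 3 mm accepted by at least 3 of 4 radiologists."""
    return diameter_mm >= min_diameter_mm and num_accepting >= min_votes


# Hypothetical findings, for illustration only.
findings = [
    {"id": "a", "diameter_mm": 6.4, "votes": 4},  # large, full agreement
    {"id": "b", "diameter_mm": 2.5, "votes": 4},  # too small
    {"id": "c", "diameter_mm": 8.0, "votes": 2},  # too few radiologists
    {"id": "d", "diameter_mm": 3.0, "votes": 3},  # exactly at both thresholds
]

reference = [f["id"] for f in findings
             if is_reference_nodule(f["diameter_mm"], f["votes"])]
irrelevant = [f["id"] for f in findings
              if not is_reference_nodule(f["diameter_mm"], f["votes"])]
print(reference, irrelevant)
```

Findings in the irrelevant group correspond to what annotations_excluded.csv lists; in the challenge description they are set apart from the reference standard rather than being part of it.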

Data is available on the download page. The data is structured as follows:

subset0.zip to subset9.zip: 10 zip files which contain all CT images
annotations.csv: csv file that contains the annotations used as reference standard for the 'nodule detection' track
sampleSubmission.csv: an example of a submission file in the correct format
candidates_V2.csv: csv file that contains the candidate locations for the ‘false positive reduction’ track

Additional data includes:

evaluation script: the evaluation script that is used in the LUNA16 framework
lung segmentation: a directory that contains the lung segmentation for CT images computed using automatic algorithms
additional_annotations.csv: csv file that contain additional nodule annotations from our observer study. The file will be available soon

Note: The dataset is used for both training and testing dataset. To allow easier reproducibility, please use the given subsets for training the algorithm for 10-folds cross-validation.
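One way to follow the 10-fold recommendation is to hold out one subset per fold and train on the other nine. The sketch below only generates the fold assignments. The directory names subset0 to subset9 are assumed to match the zip file names after extraction.

```python
def ten_fold_splits(num_subsets=10):
    """Yield (train_subsets, test_subsets) pairs, holding out one subset per fold."""
    names = ["subset%d" % i for i in range(num_subsets)]
    for held_out in range(num_subsets):
        test = [names[held_out]]
        train = [name for i, name in enumerate(names) if i != held_out]
        yield train, test


splits = list(ten_fold_splits())
for fold, (train, test) in enumerate(splits):
    print("fold", fold, "test:", test[0], "train subsets:", len(train))
```

Because every scan appears in exactly one test fold, results pooled over the ten folds cover the whole dataset.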

Kaggle:

In this dataset, you are given over a thousand low-dose CT images from high-risk patients in DICOM format. Each image contains a series with multiple axial slices of the chest cavity. Each image has a variable number of 2D slices, which can vary based on the machine taking the scan and patient.

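In short: the dataset contains over a thousand low-dose CT scans from high-risk patients, stored in DICOM format. Each scan is a series of axial slices of the chest, and the number of 2D slices varies from scan to scan depending on the scanner and the patient.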

The DICOM files have a header that contains the necessary information about the patient id, as well as scan parameters such as the slice thickness.

In short: the DICOM header carries the patient ID and scan parameters such as slice thickness.
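Each DICOM file holds a single slice, so a scan has to be reassembled by ordering its slices. A common approach is to sort by the z component of the ImagePositionPatient attribute and take the gap between neighbouring slices as the slice spacing. The sketch below shows only that arithmetic on plain numbers. The z values are hypothetical and would normally be read from the headers with a DICOM library such as pydicom.

```python
import statistics


def order_slices(z_positions):
    """Return slice indices sorted by z position, plus the median gap between slices (mm)."""
    order = sorted(range(len(z_positions)), key=lambda i: z_positions[i])
    ordered_z = [z_positions[i] for i in order]
    gaps = [b - a for a, b in zip(ordered_z, ordered_z[1:])]
    return order, statistics.median(gaps)


# Hypothetical z positions (mm) in the order the files happened to be listed.
z_positions = [-100.0, -97.5, -102.5, -95.0]

order, slice_spacing_mm = order_slices(z_positions)
print(order, slice_spacing_mm)
```

Using the median gap instead of the first gap makes the estimate less sensitive to a missing or duplicated slice.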

The competition task is to create an automated method capable of determining whether or not the patient will be diagnosed with lung cancer within one year of the date the scan was taken. The ground truth labels were confirmed by pathology diagnosis.

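In other words, the task is to build an automated method that determines whether a patient will be diagnosed with lung cancer within one year of the CT scan. All labels were confirmed by pathology diagnosis.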

The images in this dataset come from many sources and will vary in quality. For example, older scans were imaged with less sophisticated equipment. You should expect the stage 2 data to be, on the whole, more recent and higher quality than the stage 1 data (generally having thinner slice thickness). Ideally, your algorithm should perform well across a range of image quality.

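The images come from different sources and vary in quality; for example, older scans were taken with less sophisticated equipment. Stage 2 scans should on the whole be of higher quality than stage 1 scans (generally with thinner slices). (Note: stage 1 and stage 2 can both be found on the Kaggle competition page.) Ideally, your algorithm should perform well across this range of image quality.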

Notes

Use of external data is permitted in this competition, provided the data is freely available. If you are using a source of external data, you must post the source to the official external data forum thread no later than one week prior to the deadline of the first stage.
This is a two-stage competition. In order to appear on the final competition leaderboard and receive ranking points, your team must make a submission during both stages of the competition.
Due to the large file size, Kaggle is beta testing use of BitTorrent as an alternate means of download. The image archives are encrypted in order to prevent outside access. Please do not share the decryption password. The large stage1.7z archive hosted on BitTorrent is the same as the version available for direct download.

Note:

These are competition logistics; a quick skim is enough.

File Descriptions

Each patient id has an associated directory of DICOM files. The patient id is found in the DICOM header and is identical to the patient name. The exact number of images will differ from case to case, varying according to the number of slices. Images were compressed as .7z files due to the large size of the dataset.

stage1.7z - contains all images for the first stage of the competition, including both the training and test set. This file is also hosted on BitTorrent.
stage1_labels.csv - contains the cancer ground truth for the stage 1 training set images
stage1_sample_submission.csv - shows the submission format for stage 1. You should also use this file to determine which patients belong to the leaderboard set of stage 1.
sample_images.7z - a smaller subset of the full dataset, provided for people who wish to preview the images before downloading the large file.
data_password.txt - contains the decryption key for the image files
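As a quick sanity check on the labels file, the sketch below loads a labels CSV into a dict and computes the fraction of positive cases. It assumes stage1_labels.csv has two columns named id and cancer (0 or 1), which should be checked against the actual file. The patient ids shown are hypothetical placeholders.

```python
import csv
import io

# Hypothetical contents in the assumed "id,cancer" layout.
labels_text = (
    "id,cancer\n"
    "patient_a,0\n"
    "patient_b,1\n"
    "patient_c,0\n"
    "patient_d,0\n"
)


def read_labels(fileobj):
    """Map each patient id to its 0/1 cancer label."""
    return {row["id"]: int(row["cancer"]) for row in csv.DictReader(fileobj)}


labels = read_labels(io.StringIO(labels_text))
positive_rate = sum(labels.values()) / len(labels)
print(len(labels), "patients, positive rate:", positive_rate)
```

Each id is expected to match the name of that patient's DICOM directory, since the description says each patient id has an associated directory.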

The DICOM standard is complex and there are a number of different tools to work with DICOM files. You may find the following resources helpful for managing the competition data:

The lite version of OsiriX is useful for viewing images on OSX.
pydicom: A package for working with images in python.
oro.dicom: A package for working with images in R.
Mango: A useful DICOM viewer for Windows users.

My biggest takeaway so far is that Tianchi and LUNA16 both use mhd data while Kaggle uses DICOM data; in other words, the CT image formats differ.