Camelyon16 dataset tiling preprocessing

Preface

This article records the tile (patch) preprocessing pipeline for the Camelyon16 dataset. The dataset can be used for two tasks: 1. training a multiple-instance learning model to predict slide-level labels; 2. training a classification network to detect the tumor area of a whole-slide image (WSI). For the tumor area detection task, the WSIs need to be cut into small patches, and this article records that cutting procedure.


1. Introduction to Camelyon16 data set

The data consists of whole-slide images (WSIs) in *.tif format. A single WSI has a resolution of roughly 100k×100k pixels and a size of about 1 GB. The dataset contains about 400 WSIs in total, roughly 700+ GB. The contents of the dataset are as follows:
1. Training (training set): (1) normal contains 159 WSIs without cancer cells; (2) tumor contains 111 WSIs with cancer cells; (3) lesion_annotations contains the .xml annotation files of the cancer cell areas for the tumor WSIs.

2. Testing (test set): (1) images contains 129 WSIs in total, some with cancer cells and some without; (2) lesion_annotations contains the annotation files of the cancer areas for those test WSIs that contain cancer cells.

The classification project does not use the WSIs that contain no cancer cells (those are needed for multiple-instance learning), so the training set is the 111 tumor WSIs, and the test set is the few dozen WSIs among the 129 test images that contain cancer cells.
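As a quick sanity check (not part of the original pipeline), a downloaded WSI and its resolution pyramid can be inspected with OpenSlide; the level indices used throughout this article refer to this pyramid. The path below is an assumed example:

import openslide

wsi_path = 'CAMELYON16/training/tumor/Tumor_001.tif'  # assumed example path

slide = openslide.OpenSlide(wsi_path)
print(slide.level_count)        # number of pyramid levels
print(slide.level_dimensions)   # (width, height) of each level; level 0 is full resolution
print(slide.level_downsamples)  # downsample factor of each level relative to level 0
slide.close()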


2. Camelyon16 data set slicing

Although the original 400 WSI files contain all the necessary information, they are not suitable for directly training a deep convolutional neural network (CNN): a full slide does not fit in machine memory, and the images are stored at multiple resolutions. Therefore, we must sample smaller patch images, e.g. 256×256, that a typical CNN can handle. Sampling informative and representative patch images is one of the keys to good tumor detection performance. Obtaining the patches involves three parts: generating the masks, filtering the center coordinates of the patch images, and extracting the patches.

All the programs below process a single file. If batch processing is required, just get the file name list from os.listdir and traverse it with a for loop, as sketched below.
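A minimal sketch of such a batch loop, assuming a hypothetical process_one_file function that wraps any of the single-file steps below:

import os

xml_dir = 'Camelyon16/lesion_annotations'  # assumed input directory

for name in os.listdir(xml_dir):
    if not name.endswith('.xml'):
        continue  # skip non-annotation files
    process_one_file(os.path.join(xml_dir, name))  # hypothetical per-file wrapper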


1. Convert xml annotation to json format

Each annotation is a list of polygons, where each polygon is represented by its vertices. Positive polygons represent tumor areas, and negative polygons represent normal areas. At this stage, the annotation format is converted into a simpler .json format.
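For reference, the converted .json file has the following structure (the names and coordinates here are made-up examples):

{
 "positive": [
  {"name": "Annotation 0", "vertices": [[12345, 67890], [12400, 67950], ...]}
 ],
 "negative": [
  {"name": "Annotation 2", "vertices": [[23456, 78901], ...]}
 ]
}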

import json
import numpy as np
import xml.etree.ElementTree as ET

def camelyon16xml2json(inxml, outjson):
    """
    Convert an annotation of camelyon16 xml format into a json format.
    Arguments:
        inxml: string, path to the input camelyon16 xml format
        outjson: string, path to the output json format
    """
    root = ET.parse(inxml).getroot()
    annotations_tumor = \
        root.findall('./Annotations/Annotation[@PartOfGroup="Tumor"]')
    annotations_0 = \
        root.findall('./Annotations/Annotation[@PartOfGroup="_0"]')
    annotations_1 = \
        root.findall('./Annotations/Annotation[@PartOfGroup="_1"]')
    annotations_2 = \
        root.findall('./Annotations/Annotation[@PartOfGroup="_2"]')
    annotations_positive = \
        annotations_tumor + annotations_0 + annotations_1
    annotations_negative = annotations_2

    json_dict = {}
    json_dict['positive'] = []
    json_dict['negative'] = []

    for annotation in annotations_positive:
        X = list(map(lambda x: float(x.get('X')),
                 annotation.findall('./Coordinates/Coordinate')))
        Y = list(map(lambda x: float(x.get('Y')),
                 annotation.findall('./Coordinates/Coordinate')))
        vertices = np.round([X, Y]).astype(int).transpose().tolist()
        name = annotation.attrib['Name']
        json_dict['positive'].append({'name': name, 'vertices': vertices})

    for annotation in annotations_negative:
        X = list(map(lambda x: float(x.get('X')),
                 annotation.findall('./Coordinates/Coordinate')))
        Y = list(map(lambda x: float(x.get('Y')),
                 annotation.findall('./Coordinates/Coordinate')))
        vertices = np.round([X, Y]).astype(int).transpose().tolist()
        name = annotation.attrib['Name']
        json_dict['negative'].append({'name': name, 'vertices': vertices})

    with open(outjson, 'w') as f:
        json.dump(json_dict, f, indent=1)

if __name__=='__main__':
    xml_path = 'Camelyon16/lesion_annotations/tumor_001.xml' # path to the input xml file
    json_path = 'Camelyon16/json_annotations/tumor_001.json' # path to the output json file
    camelyon16xml2json(xml_path, json_path)

2. Obtain the mask of the tumor area

In this step, the json annotation is used to generate a mask file of the tumor area, saved in .npy format.

import numpy as np
import openslide
import cv2
import json

wsi_path = 'CAMELYON16/training/tumor/Tumor_001.tif' # path to the WSI file
json_path = 'PatchCamelyon/train/annotation_json/Tumor_001.json' # path to the json annotation file
npy_path = 'PatchCamelyon/train/tumor_npy/Tumor_001.npy' # output path of the mask file
level = 6 # at which WSI level to obtain the mask

if __name__=='__main__':
    slide = openslide.OpenSlide(wsi_path)
    w, h = slide.level_dimensions[level]
    mask_tumor = np.zeros((h, w)) # the initial mask, all values are 0

    factor = slide.level_downsamples[level] # downsample factor of this level, e.g. 2^6 at level 6

    with open(json_path) as f:
        dicts = json.load(f)
    tumor_polygons = dicts['positive']
    for tumor_polygon in tumor_polygons:
        # plot a polygon
        vertices = np.array(tumor_polygon["vertices"]) / factor
        vertices = vertices.astype(np.int32)

        cv2.fillPoly(mask_tumor, [vertices], (255))

    mask_tumor = mask_tumor[:] > 127 # binarize to a boolean mask
    mask_tumor = np.transpose(mask_tumor) # (h, w) -> (w, h), so axis 0 is x
    np.save(npy_path, mask_tumor) # tumor area mask of Tumor_001.tif at level 6

3. Obtain the mask of the tissue area

The tissue region can be obtained by image segmentation with the Otsu thresholding algorithm. RGB_min sets a manually adjustable minimum intensity threshold per channel. You can convert tissue_mask into a binary image and save it to check the result.

import openslide
import numpy as np
from PIL import Image
from skimage.color import rgb2hsv
from skimage.filters import threshold_otsu

wsi_path = 'CAMELYON16/training/tumor/Tumor_001.tif' # path to the WSI file
npy_path = 'PatchCamelyon/train/tissue_npy/Tumor_001.npy' # output path of the mask file
level = 6 # at which WSI level to obtain the mask
RGB_min = 50 # min value for RGB channel

if __name__=='__main__':
    slide = openslide.OpenSlide(wsi_path)
    img_RGB = np.transpose(np.array(slide.read_region((0, 0),
                        level,
                        slide.level_dimensions[level]).convert('RGB')),
                        axes=[1, 0, 2])
    img_HSV = rgb2hsv(img_RGB)

    background_R = img_RGB[:, :, 0] > threshold_otsu(img_RGB[:, :, 0])
    background_G = img_RGB[:, :, 1] > threshold_otsu(img_RGB[:, :, 1])
    background_B = img_RGB[:, :, 2] > threshold_otsu(img_RGB[:, :, 2])
    tissue_RGB = np.logical_not(background_R & background_G & background_B)
    tissue_S = img_HSV[:, :, 1] > threshold_otsu(img_HSV[:, :, 1])
    min_R = img_RGB[:, :, 0] > RGB_min
    min_G = img_RGB[:, :, 1] > RGB_min
    min_B = img_RGB[:, :, 2] > RGB_min

    tissue_mask = tissue_S & tissue_RGB & min_R & min_G & min_B

    np.save(npy_path, tissue_mask) # tissue mask of Tumor_001.tif at level 6
    # img = Image.fromarray(tissue_mask.astype(np.uint8) * 255)
    # img.save('tumor_001_tissue.png') # optionally save the binary image to check the result

4. Obtain the mask of the no_tumor area

The tissue area consists of tumor and no_tumor, so the mask of the no_tumor area can be obtained by a logical operation on tissue_mask and tumor_mask.

import numpy as np

tumor_npy_path = 'PatchCamelyon/train/tumor_npy/Tumor_001.npy'
tissue_npy_path = 'PatchCamelyon/train/tissue_npy/Tumor_001.npy'
no_tumor_npy_path = 'PatchCamelyon/train/no_tumor_npy/Tumor_001.npy'

if __name__=='__main__':
    tumor_mask = np.load(tumor_npy_path)
    tissue_mask = np.load(tissue_npy_path)
    no_tumor_mask = tissue_mask & (~tumor_mask) # tissue that is not tumor

    np.save(no_tumor_npy_path, no_tumor_mask)

5. Randomly sample the tissue (tumor, no_tumor) areas

Thousands of patches can be cut from one WSI, but not all of them are needed; it is enough to sample a fixed number from each WSI.
The sampling principle is simple. Since the masks obtained above are all WSI masks at level 6, with a resolution of about 1k×2k, we directly sample some points in the low-resolution mask to get the coordinates of the sampling points at level 6, then multiply by the zoom factor (2^6) to get their coordinates at level 0. These are the center-point coordinates of the patches, and they are saved to a txt file.

import os
import numpy as np

mask_path = 'PatchCamelyon/train/tumor_npy/Tumor_001.npy'
# mask_path = 'PatchCamelyon/train/no_tumor_npy/Tumor_001.npy' # the no_tumor area uses the same procedure
txt_path = 'PatchCamelyon/train/tumor_txt/Tumor_001.txt'
patch_number = 1000 # number of sampled points
level = 6 # level of the npy mask file

if __name__=='__main__':
    mask_tissue = np.load(mask_path)
    X_idcs, Y_idcs = np.where(mask_tissue)
    centre_points = np.stack((X_idcs, Y_idcs), axis=1) # shape (N, 2)
    if centre_points.shape[0] > patch_number:
        sampled_points = centre_points[np.random.randint(centre_points.shape[0],
                                                         size=patch_number), :]
    else:
        sampled_points = centre_points # keep all points if there are not enough

    sampled_points = (sampled_points * 2 ** level).astype(np.int32) # scale level-6 coords up to level 0
    mask_only_name = os.path.split(mask_path)[-1].split(".")[0]
    name = np.full((sampled_points.shape[0], 1), mask_only_name)
    center_points = np.hstack((name, sampled_points))

    with open(txt_path, "a") as f:
        np.savetxt(f, center_points, fmt="%s", delimiter=",")

6. Get the patch data set

According to the coordinates of the sampling points, patches can be cut from the WSI at level 0. Tumor and no_tumor have to be processed separately to obtain the two classes of patches. The test set also has to be cut into patches with the same procedure. Here, cutting the tumor patches of the training set is taken as the example.

import os
import openslide
from multiprocessing import Pool

wsi_path = 'CAMELYON16/training/tumor/tumor_001.tif' # path to the WSI file
txt_path = 'PatchCamelyon/train/sample_spot_txt/tumor_001.txt' # path to the sampling-point txt file
patch_file = 'PatchCamelyon/train/train_patch/tumor' # output folder for the patches
patch_size = 256 # patch size, 256*256 by default
level = 0 # cut the WSI at level 0 by default
num_process = 16 # number of processes; cutting with multiprocessing is much faster

def process(opts):
    j, pid, x_center, y_center, wsi_path = opts
    # the txt stores patch centers; read_region expects the top-left corner
    x = int(int(x_center) - patch_size / 2)
    y = int(int(y_center) - patch_size / 2)
    slide = openslide.OpenSlide(wsi_path)
    img = slide.read_region((x, y), level, (patch_size, patch_size)).convert('RGB')
    img.save(os.path.join(patch_file, pid + '_' + str(100000 + j) + '.png'))

if __name__=='__main__':
    os.makedirs(patch_file, exist_ok=True) # make sure the output folder exists
    opt_list = []
    with open(txt_path) as f:
        for j, line in enumerate(f):
            pid, x_center, y_center = line.strip('\n').split(',')
            # pid is the file name without suffix, e.g. tumor_001
            opt_list.append((j, pid, x_center, y_center, wsi_path))
    pool = Pool(processes=num_process)
    pool.map(process, opt_list) # cut patches with multiple processes

Summary

After cutting is completed, we obtain the training set (train) of PatchCamelyon (the patch version of Camelyon16). It contains two classes of patches, tumor and no_tumor, with roughly 110,000 patches per class (about 110 WSIs, 1,000 samples per class per WSI). The test set is not sampled; instead, it is cut into all patches according to its tissue_mask for testing.
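A minimal sketch of how the full (non-sampled) coordinate list for a test WSI could be produced, reusing the logic of step 5 under the same level-6 mask assumption (the paths below are assumed examples):

import os
import numpy as np

mask_path = 'PatchCamelyon/test/tissue_npy/Test_001.npy' # assumed path to a test tissue mask
txt_path = 'PatchCamelyon/test/spot_txt/Test_001.txt'    # assumed output path
level = 6 # level of the npy mask file

mask = np.load(mask_path)
X_idcs, Y_idcs = np.where(mask) # take every tissue pixel instead of sampling
points = np.stack((X_idcs, Y_idcs), axis=1) * 2 ** level # scale level-6 coords to level 0
pid = os.path.split(mask_path)[-1].split(".")[0]
name = np.full((points.shape[0], 1), pid)
with open(txt_path, "a") as f:
    np.savetxt(f, np.hstack((name, points.astype(np.int32))), fmt="%s", delimiter=",")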

A validation set is also needed when training the network. Simply take 20% of the data from the training set (tip: after getting the patch list, randomly shuffle it, take the paths and labels of the first 80% as training data and those of the last 20% as validation data, save them as txt files, and read the images directly by path during training), as sketched below.
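A minimal sketch of that 80/20 split, assuming the folder layout used above (train_patch/tumor and train_patch/no_tumor, with label 1 for tumor and 0 for no_tumor):

import os
import random

patch_root = 'PatchCamelyon/train/train_patch' # assumed patch root folder

items = []
for label, cls in [(1, 'tumor'), (0, 'no_tumor')]:
    folder = os.path.join(patch_root, cls)
    items += [(os.path.join(folder, n), label) for n in os.listdir(folder)]

random.shuffle(items)
split = int(len(items) * 0.8) # first 80% for training, last 20% for validation
for fname, part in [('train_list.txt', items[:split]), ('val_list.txt', items[split:])]:
    with open(fname, 'w') as f:
        for path, label in part:
            f.write('{},{}\n'.format(path, label))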

