Practical: Repairing Medical Image Datasets with Deep Learning Methods

In medical imaging, data storage archives are organized around clinical assumptions. Unfortunately, this means that when you want to extract one type of image, such as a frontal chest x-ray, you usually end up with a folder full of many other images as well, and there is no easy way to tell them apart.


Figure 1: It makes sense that these images end up in the same folder, because in radiology we archive cases, not images. These are all the body parts scanned in the same session after a patient was injured.

Depending on the institution, you may get images flipped horizontally or vertically. They may contain inverted pixel values. They may be rotated. The question is, when dealing with a huge dataset, say 50,000 to 100,000 images, how can you spot these distortions without a doctor checking every image?

You can try to write some elegant rules, like: most chest x-rays are taller than they are wide, which leaves black borders on either side of the patient, so if there are more than 50 rows of black pixels at the bottom of the image, it has probably been rotated 90 degrees.
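For concreteness, a hand-written rule like that might look something like the following sketch (the function name and thresholds here are made up for illustration, and it assumes a 2D numpy array where background pixels are near zero):

import numpy as np

def probably_rotated(image, black_level=10, min_black_rows=50):
    # Count consecutive near-black rows starting from the bottom of the image.
    row_means = image.mean(axis=1)
    black_rows = 0
    for value in row_means[::-1]:
        if value < black_level:
            black_rows += 1
        else:
            break
    # The heuristic: a tall stack of black rows at the bottom suggests a 90 degree rotation.
    return black_rows > min_black_rows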

But, as always, hand-written rules like this fail in practice.


Figure 2: Only the middle image here has the classic "black borders"

Fragile rules like these just don't solve the problem.

Enter Software 2.0: we use machine learning to build solutions we couldn't code by hand. Problems like detecting rotated images are embarrassingly learnable, meaning machines can perform these tasks about as well as humans.

So the obvious solution is to use deep learning to fix the dataset for us. In this post I'll show where these techniques can be applied, how to do it with minimal effort, and walk through some examples. I'll use the CXR14 dataset from Wang et al., which appears to be well curated but still contains some bad images. If you use CXR14, I can even give you a set of around 430 new labels so you don't have to worry about those bad images yourself!

An embarrassingly learnable problem?

The first question we really need to ask is: is the problem embarrassingly learnable?

Considering that most studies in the dataset are fine, you need a very high level of precision to avoid excluding those "good" studies. We should aim for something like 99.9%.

The cool thing is that for visually obvious problems, this is simple and we can solve them just fine. A good test question is: "Can you imagine a single visual rule that would solve this problem?" For the kind of tasks ImageNet is built around, like distinguishing dogs from cats, the answer is clearly no.

There is too much variation within each class and too much similarity between them. I use this example a lot in talks: I can't even imagine how to write rules to visually distinguish the two kinds of animals. That is not embarrassingly learnable.

But in medical data, many problems are actually quite simple, because the variation in medical images is small. Anatomy, angle, exposure, distance and background are all stable. To illustrate this, let's look at a simple example from CXR14. Some of the frontal chest x-rays in the dataset are rotated (this is not identified in the labels, so we don't know which ones). They can be rotated 90 degrees, or upside down at 180 degrees.

Is this embarrassingly learnable?


Figure 3: The difference between rotated and upright chest x-rays is really simple

The answer is yes. Visually, an abnormal study is completely different from a normal one. You could use a simple visual rule like "the shoulders should be above the heart" and it would hold for almost every example. Given that the anatomy is very stable, and every image has shoulders and a heart, this should be a rule a convolutional neural network can learn.

Feeding the model

The second question we want to ask is: do we have enough training data?

In the case of rotated images, we certainly have enough data, because we can generate it ourselves. All we need is a few thousand normal chest x-rays, rotated at random. For example, if you were working with numpy arrays, you might use a function like this:

import numpy as np

def rotate(image):
    # Rotate by a random multiple of 90 degrees in the height/width plane
    # (axes 1 and 2, since axis 0 holds the channels).
    rotated_image = np.rot90(image, k=np.random.choice(range(1, 4)), axes=(1, 2))
    return rotated_image

This just rotates the image by 90, 180 or 270 degrees. In this case the rotation is applied around the second and third axes, since the first axis holds the channels (following the Theano convention for array dimensions).

Note: in this case there are very few rotated images in the CXR14 dataset, so the chance of accidentally "correcting" an already rotated image is very small. We can treat the data as if it contains no rotated images, which makes learning easier. If abnormal images were common, you would be better off manually selecting both normal and abnormal examples. Since problems like rotation are easy to recognize, I find I can produce thousands of labels in an hour, so this doesn't take much effort, and because these problems are so simple I often only need a few hundred examples to "solve" the challenge.

So we build a dataset of normal images, rotate half of them, and label them accordingly. In my case I selected 4,000 training cases, 2,000 of which were rotated, plus a 2,000-case validation set, 1,000 of which were rotated. This seems like a decent amount of data (remember the rule of thumb that around 1,000 examples per class can be enough for an easy problem), and it fits in RAM, so it's easy to train on my home computer.
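As a rough sketch of how that dataset might be assembled (assuming square, channels-first numpy arrays already loaded into an array I'll call normals, and reusing the rotate() function above; the names are illustrative):

import numpy as np

# normals: un-rotated chest x-rays, shape (N, channels, height, width)
images = normals.copy()
labels = np.zeros(len(images))

# rotate a random half of the cases and label them as positives
rotated_idx = np.random.choice(len(images), size=len(images) // 2, replace=False)
for idx in rotated_idx:
    images[idx] = rotate(images[idx])
    labels[idx] = 1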

In an interesting departure from usual machine learning practice, I don't need a separate test set. The proof is in the pudding: I will run the model on the entire dataset and judge the results by examining the flagged images.

In general, for this kind of work we want to make our lives as easy as possible. I downscaled the images to 256 x 256 pixels, since rotation detection shouldn't require high resolution, and I used a pre-trained ResNet50 in Keras as the base network. There is no particular reason to use a pre-trained network, since almost any network will converge on such a simple problem, but it's convenient and doesn't slow anything down because training is fast anyway. I used default hyperparameters, because this simple task needed no tuning.

You can use whatever network and code you have at hand. A VGG-net would do. A DenseNet would also work. Really, any network can handle this task.
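For what it's worth, a minimal Keras setup along those lines might look like this sketch (the classification head, optimizer, and channels-last input shape are my assumptions, not necessarily the exact configuration used here):

from keras.applications import ResNet50
from keras.layers import GlobalAveragePooling2D, Dense
from keras.models import Model

# pre-trained ResNet50 backbone on 256 x 256 inputs
base = ResNet50(weights='imagenet', include_top=False, input_shape=(256, 256, 3))

x = GlobalAveragePooling2D()(base.output)
out = Dense(1, activation='sigmoid')(x)  # single output: rotated or not

model = Model(inputs=base.input, outputs=out)
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# model.fit(train_images, train_labels, validation_data=(val_images, val_labels), epochs=5)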

After a few dozen training iterations, I got the results I expected on the validation set:


Figure 4: AUC = 0.999, ACC = 0.996, PREC = 0.998, REC = 0.994

Nice. This is exactly what I would expect to see if the task were embarrassingly learnable.

Checking the results

As I've said before, in medical image analysis we always need to check our results by looking at the images, to make sure the model or process actually achieves what we intended.

So the final step is to run the model on the entire dataset, generate predictions, and then exclude the rotated studies. Since there are so few rotated studies in the data, and I know the recall is very high, I can simply review all of the images the model predicts as rotated.

If this were a problem with a large number of abnormal images, say more than 5% of the data, it would be more efficient to collect a few hundred random cases, manually label them as a test set, and then track your model's performance on appropriate metrics.
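If you do build such a hand-labeled test set, tracking those metrics is only a few lines with scikit-learn (a sketch: y_true and y_prob are placeholders for your manual labels and your model's outputs):

from sklearn.metrics import roc_auc_score, accuracy_score, precision_score, recall_score

# y_true: manual labels (0/1), y_prob: model predictions in [0, 1]
y_pred = (y_prob > 0.5).astype(int)

print("AUC :", roc_auc_score(y_true, y_prob))
print("ACC :", accuracy_score(y_true, y_pred))
print("PREC:", precision_score(y_true, y_pred))
print("REC :", recall_score(y_true, y_pred))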

I'm particularly concerned about normal studies being flagged as rotated (false positives), because I don't want to lose valuable training cases. This is actually more of a worry than you might think: the model is likely to over-call certain types of cases (probably those acquired with the patient slumped or oblique), and if we exclude those we introduce bias, leaving a dataset that is no longer representative of the "real world". That obviously matters for medical data, because our goal is to build systems that work in real clinics.

In total, the model flagged 171 cases as "rotated". Interestingly, it behaves more like an "anomaly" detector, picking up many abnormal images that aren't actually rotated. This makes sense if the model has learned something about normal anatomy: anything out of the ordinary, like a rotated film or an x-ray of another body part, looks different to the model. So we get more out of it than just finding improperly rotated images.

Of the 171 flagged predictions, 51 were rotated frontal chest x-rays. Given the ridiculously low prevalence (51 in roughly 120,000 images), that is actually a very good false positive rate.


Figure 5: Examples of rotated chest radiographs

Of the remaining 120 cases, 56 were not frontal chest radiographs at all, mainly lateral views and abdominal x-rays. I wanted to get rid of these anyway.

What about the rest? They are a mix: studies where the image contains large black or white borders, washed-out studies where the whole image is gray, studies with inverted pixel values, and so on.

In total, only about 10 studies were what I would call "false positives" (meaning frontal x-rays I might actually want to keep). Thankfully, even if you want to add these back in, there are only 171 predictions to review, so it's easy to manage manually.

So the rotation detector seems to partially solve some other problems as well (like inverted pixel values). To know how well it does this, we need to check whether it misses other bad cases. We can test this because pixel inversion is also easy to generate data for (x = max - x for every pixel x in the image).
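Generating inverted training examples is even simpler than generating rotations; something like this is enough (a sketch, assuming a numpy image array):

def invert(image):
    # Flip the pixel values so bright becomes dark and vice versa.
    return image.max() - image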

The embarrassingly learnable test applies here too. In this case there might even be a way to do it without machine learning (the pixel histograms should look quite different), but it is at least simple.

So, did this dedicated detector find more inversions than the rotation detector? Yes. The rotation detector found 4 in the entire dataset, while the inversion detector found 38 inverted studies. So the rotation detector only catches a fraction of the other bad studies.

Take-home message: training a separate model for each problem is the right approach.

So we need purpose-built models for the remaining data cleaning tasks.

A trickle of data

To demonstrate what a small amount of labeled data can do, I took the lateral and other bad films the rotation detector had picked up (n=56) and trained a new model on them. Since I didn't have much data, I decided to go hog-wild and not use a validation set. Since these tasks are embarrassingly learnable, once training accuracy gets close to 100% the model tends to generalize well. Obviously there is a risk of overfitting here, but I chose to take that risk.

It worked really well! I found hundreds more lateral images, abdominal images, and some images of the pelvis.

Obviously, if I were building this dataset from scratch it would be easier to solve this problem properly, because I could pull a lot of relevant non-frontal images. To do better than this I would need to extract a set of images from my local hospital archive, which is beyond the scope of this post. So I can't be sure I caught most of them, but for such a small training set it's a pretty decent effort.

One other thing I noticed in the CXR14 data is that my models performed poorly on images of children. These paediatric images look quite different from adult images, and they are flagged as "abnormal" by the rotation detector, the inversion detector, and the bad-film detector. I would suggest ignoring them, and since patient age is included in the labels this can be done without any deep learning. Considering there are only 286 patients under 5 in the dataset, I would personally exclude them all unless you specifically want to study that age group and really know what you are doing from a medical imaging standpoint. In fact, I would probably exclude everyone under 10, since that is a reasonable cut-off at which body size and pathology become more "adult". There are about 1,400 cases under the age of 10, or roughly 1% of the data.

Take-home message: pediatric chest x-rays are very different from adult ones. Given that patients under 10 make up only about 1% of the data, they should be excluded unless there is a good reason to keep them.
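Since patient age is in the CXR14 label file, that exclusion is a couple of lines with pandas (a sketch: the filename and "Patient Age" column follow the released Data_Entry CSV, but check your copy; older versions stored ages as strings like "058Y", which would need stripping first):

import pandas as pd

labels = pd.read_csv("Data_Entry_2017.csv")                 # the CXR14 label file
adults = labels[labels["Patient Age"].astype(int) >= 10]    # drop patients under 10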

Poor positioning and heavily magnified images can also be a problem, but this is task-dependent, and arbitrarily defining a "bad image" for every possible task is impossible, so I won't attempt it here.

Overall, using deep learning to solve these simple data cleaning problems works well. After about an hour of effort, I had cleaned up most of the rotated and inverted images in the dataset. I have probably identified a good portion of the lateral images and other body parts too, although I'm sure purpose-built detectors would do better; without more raw data, that is beyond the scope of this post.

Looking at the CXR14 data more broadly, it doesn't contain many image errors; the team at the National Institutes of Health organized their data well. In medical datasets this is not always the case, and if you want to build high-performance medical AI systems on data drawn from clinical infrastructure, you have to deal with this kind of noise.

Further considerations

So far we have tackled some very simple challenges, but not every problem we encounter in medical imaging is this simple.

Our team applied these techniques when building a large hip fracture dataset. We excluded images of other body regions, we excluded cases with implanted metal such as hip replacements, and we also cropped in on the hip region, removing parts of the image that were not relevant to our problem, since hip fractures do not occur outside the hip.

We handled the metal exclusion with an automated text-mining process: the prostheses are almost always mentioned in the reports when they are present, so I searched for keywords related to implants. These labels took about 10 minutes to create.
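That kind of report mining can be as simple as a keyword search over the report text; a sketch along those lines (the file, column names, and keyword list here are illustrative, not our exact pipeline):

import pandas as pd

reports = pd.read_csv("hip_reports.csv")   # hypothetical file: one row per study, with a 'report' text column

metal_keywords = ["replacement", "arthroplasty", "prosthesis", "fixation", "dynamic hip screw"]
pattern = "|".join(metal_keywords)

# flag any study whose report mentions an implant-related keyword, then drop them
reports["has_metal"] = reports["report"].str.lower().str.contains(pattern, na=False)
no_metal = reports[~reports["has_metal"]]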

For the wrong-body-part and bounding box problems, we had no way to automatically generate labels, so we went straight to manual labeling. Even for something as complex as bounding box prediction (which is really an anatomical landmark identification task), we only needed about 750 cases, which took only about an hour per dataset to label.

In this case, we use a human-labeled test set to quantify the results. From one of our papers:

[Figure: human-labeled test set results, from the paper]

Considering that labeling the fracture problem itself takes months, spending an extra hour or two to get a clean dataset is a small price to pay. And the system can now accept any clinical image and automatically exclude irrelevant or low-quality ones, which is exactly what a medical AI system needs to do in real life, unless you want to hire someone to pre-process every image your model analyzes.

Summary

We all agree that deep neural networks are as good as humans at solving vision problems, given enough data. However, how much data is "enough" depends largely on the difficulty of the task.

For a subset of medical image analysis problems, the kind we often need to solve when building medical datasets, the tasks are very simple, which makes them solvable with small amounts of data. Typically a model can be built in under an hour of effort, whereas a doctor would spend many hours manually reviewing each dataset.

As proof of the method, and to thank the readers of my blog, I am providing labels for ~430 bad images to exclude from the CXR14 dataset, and I suggest also excluding the ~1,400 children under the age of 10 unless you have a good reason to keep them. This probably doesn't change the results of any published papers, but the cleaner these datasets are, the better.

The conclusions I present here are not technically innovative in any way, which is why I haven't written a formal paper about them. But for those of us building new datasets, especially doctors with no deep learning experience, I hope this sparks some ideas about how Software 2.0 can solve your data problems with an order of magnitude less labor than manual methods. The main obstacle to building great medical AI systems right now is the enormous cost of collecting and cleaning data, a job we don't usually think of deep neural networks as being much use for.

I checked all my images in Windows File Explorer!

As an addendum to this post, here is how I review the rotation detector's predictions on my own machine.


I just move the cases I want to review into a new folder, then open it using the "extra large icons" view mode. Images at this size are about a quarter of the screen height, which on most screens is large enough to spot gross abnormalities like rotation. When I am labeling images for these obvious problems, I just ctrl-click all the positive examples in the folder and cut/paste them into a new folder. That's the secret to labeling a thousand images an hour.

As janky as this system is, it works better than most of the tools I've pulled from online repos or written myself.

The Python code for moving the files is pretty simple, but it's the code I use most when building datasets, so I thought I should include it:

# keep only the cases the model predicted as rotated
pos = rotation_labs[rotation_labs['Preds'] > 0.5]

In this case rotation_labs is a pandas dataframe storing the image index/filename and the model prediction for each case. I subset it into a dataframe containing only the positive examples.

import shutil

# copy every predicted-positive image into the review folder
for i in pos['Index']:
    fname = "F:/cxr8/chest xrays/images/" + i
    shutil.copy(fname, "F:/cxr8/data building/rotation/")

All this does is copy the relevant images into a folder I made called "rotation".

Then I can go to that folder and take a look. If I do some manual curation and want to read the images back out, it's as simple as:

import os
new_list = os.listdir("F:/cxr8/data building/rotation/")

William Gale is my outstanding co-author on this work. He landed an ML research position at Microsoft straight out of his undergraduate degree and now works on language problems. One to watch.


Originally published: 2018-05-04

Author of this article: Xiao Pan

This article comes from Xinzhiyuan, a partner of the Yunqi community. For related updates, follow "AI_era".

Original link: Practical: Repairing Medical Image Datasets with Deep Learning Methods
