Three papers in the eye of the "AI breast cancer detection" storm

Three promising breast cancer diagnosis papers from Google, NYU and DeepHealth have sparked lively discussions in the deep learning and medical research community.

A few years ago, a group of researchers at New York University began publishing papers on applying deep learning to breast cancer screening. The team's latest paper, "Deep Neural Networks Improve Radiologists' Performance in Breast Cancer Screening," was published in October 2019.

In December, Boston-based DeepHealth—a startup that uses machine learning to help radiologists—published "Robust Breast Cancer Detection in Mammography and Digital Breast Tomosynthesis Using an Annotation-Efficient Deep Learning Approach" on arXiv. According to the paper, the proposed method achieves state-of-the-art (SOTA) performance in mammogram classification. Co-authors include researchers from Rhode Island Hospital and Brown University, Henan Provincial People's Hospital in China, the Medford Radiology Group and the University of Massachusetts Medical School.

William Lotter, DeepHealth CTO and co-founder, is lead author. The paper is undergoing journal review, and DeepHealth plans to present three related abstracts at the Society for Breast Imaging meeting in mid-April, Lotter said in an email to Synced.

On New Year's Eve, a simple Reddit post praising the DeepHealth paper gave the first hint of the controversy surrounding the studies. The r/MachineLearning post, titled "Deep learning model for breast cancer detection beats five full-time radiologists and previous SOTA models at NYU and MIT," garnered more than 600 upvotes and 106 comments in less than two days. However, DeepHealth felt the headline was "exaggerated and not necessarily constructive" and requested that the post be removed, which it was.

Then on New Year's Day, Google's global breast cancer study grabbed headlines around the world. "International Evaluation of an AI System for Breast Cancer Screening" presents a new AI system that reads mammograms more accurately than human radiologists, with fewer false positives and fewer false negatives. Published in the journal Nature, the paper was written by researchers from Google Health, Google DeepMind, the Cancer Research UK Imperial Centre, Northwestern University and the Royal Surrey County Hospital.

However, as Google DeepMind founder and CEO Demis Hassabis and others celebrated the paper's release, Turing Award winner and Facebook Chief AI Scientist Yann LeCun crashed the party, tweeting that the authors of the Google paper owed something to the NYU researchers and should "cite previous research on the same topic." Unlike the Google system, NYU's method has been open-sourced, he added.


Hassabis shot back that Google did cite the NYU paper, and took a stab at LeCun in the process: "Maybe people should read the paper before tweeting outrage with misinformation."

LeCun backed off a little: "I'm not mad ;-)" and "I did read the paper, but missed the citation the first time."

Important challenges and controversies regarding novelty

Globally, breast cancer is the most common cancer among women, according to the World Health Organization. Although breast cancer deaths in the United States have declined in recent years, the disease remains the second leading cause of cancer death among women in the country, according to the Centers for Disease Control and Prevention.

Breast cancer is a condition in which cells in the breast grow out of control and can spread throughout the body through the blood and lymph vessels. Mammograms allow doctors to identify breast cancer tumors before they are large enough to cause symptoms or be detected by patients. But despite the increased use of digital mammograms, reading mammograms remains a daunting task, even for professional radiologists.

Shravya Shetty, technical lead at Google Health, said in a blog post summarizing the Nature paper that Google has been working with leading clinical research partners in the UK and US for several years to explore whether artificial intelligence can improve breast cancer detection. The results of that work are presented in the new paper.

However, Google's paper didn't generate as much excitement in the AI community as it did in the mainstream media.

Back on Twitter, again questioning the paper's novelty, LeCun retweeted a comment from Hugh Harvey of the Royal College of Radiologists: "Congratulations Google, but let's not forget the NYU team who published even better results last year, validated on more cases, tested on more readers, and with their code and data available. They just don't have the PR machine to drive awareness."

Krzysztof J. Geras, a co-author of the NYU paper and a professor of radiology at NYU, told Synced: "The list of papers applying deep learning to breast cancer screening is very long. My paper may be the first to combine large-scale experiments with different possible models, careful evaluation, very good results, an extensive reader study, and trained models publicly available online. However, there is still room for improvement, and I believe there will be many papers that go further in different aspects in the coming years."

The NYU researchers introduced a deep convolutional neural network for screening classification that was trained and evaluated on more than 1 million images from 200,000 breast exams. When tested on exams from sites affiliated with the NYU School of Medicine, their system scored an impressive 0.895 on the AUC performance metric (AUC, the area under the receiver operating characteristic curve, ranges from 0 to 1; higher is better). The Google system had an AUC of 0.889 on the UK screening data and 0.8107 on the US screening data.
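For readers less familiar with the metric, AUC can be computed directly from ground-truth labels and model scores. Below is a minimal sketch using scikit-learn; the labels and scores are made-up illustrative values, not taken from any of the three papers:

```python
from sklearn.metrics import roc_auc_score

# Hypothetical ground-truth labels (1 = biopsy-confirmed cancer, 0 = no cancer)
# and hypothetical model-predicted malignancy scores for six screening exams.
y_true = [0, 0, 1, 0, 1, 1]
y_score = [0.05, 0.40, 0.85, 0.20, 0.60, 0.95]

# AUC is the probability that a randomly chosen positive exam is scored
# higher than a randomly chosen negative one; 0.5 is chance, 1.0 is perfect.
print(roc_auc_score(y_true, y_score))  # 1.0 here: every positive outranks every negative
```

In the actual studies, of course, AUC is reported over tens of thousands of held-out exams rather than a toy list like this.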

Geras acknowledged the power of the Google paper's careful analysis of results, but warned in a tweet that "novelty is hard to quantify" and that "multiple papers have shown similar results."

In fact, an earlier NYU study from last August reported an AUC of 0.919. The new DeepHealth paper gives even more striking AUC scores: 0.971 on data from China, 0.95 on data from the UK, and 0.957 on data from the US.

However, Geras cautions that because the models were trained and evaluated on different datasets, it is difficult to fairly compare the results or to say whether any model is actually state-of-the-art. He believes it would be good for multiple groups to use similar methods to achieve similar results, "to validate our method together and show that the toolbox we use - in this case, deep neural networks - is robust and applicable in different scenarios."

There are certainly similarities between the Google and DeepHealth studies in terms of scale, methodology, and results, but perhaps the biggest difference is that the Google paper was published in the prestigious journal Nature, while the DeepHealth paper is still awaiting review on arXiv. With Google's study the first to appear in a peer-reviewed journal, DeepHealth will face more pressure to prove the novelty of its model.

"One of the core innovations of our paper is that we propose a model that is applicable to digital breast tomosynthesis (DBT, or 3D mammography) in addition to 2D mammography," Lotter wrote in an email. ).” He explained that the method achieves good performance without requiring strongly labeled DBT data. Developing an AI model for DBT is more challenging because 3D mammograms typically contain 50 to 100 times more data than 2D mammograms. But the DeepHealth team believes that DBT is critical given the widespread use of the technique and its remarkable clinical accuracy.

AI and radiologists in breast cancer detection

Google's system was trained and tuned on mammograms from more than 76,000 women in the UK and 15,000 in the US, and evaluated on separate datasets of more than 25,000 women in the UK and 3,000 in the US. It reduced false positive rates by 5.7% in the US and 1.2% in the UK, and reduced false negative rates by 9.4% in the US and 2.7% in the UK.

"Reading mammograms is a perfect problem for machine learning and artificial intelligence, but I honestly didn't expect it to do better," Mozziyar Etemadi, a research assistant professor at Northwestern University and one of the co-authors of the Google paper, told TIME.

In an independent reader study, Google's AI system outperformed all six human experts. Meanwhile, DeepHealth's system outperformed all five full-time breast imaging specialists, with an average 14% improvement in absolute sensitivity. The NYU model outperformed a group of 12 radiologists, residents and medical students with 2 to 25 years of experience.

Faced with a flurry of headlines about "AI defeating radiologists," the Google, DeepHealth and NYU teams all emphasized that the goal of their systems is not to replace radiologists, but to support them in interpreting breast cancer screening exams.

It's worth noting that radiologists often use other sources of data in their diagnosis, such as family cancer history and previous images. But in DeepHealth's comparison, radiologists received mammograms and nothing else, which could have affected their performance. The Google team did feed their human experts (but not the AI) the patient history and previous mammograms, and their system still achieved a higher AUC than the radiologists.

It's also worth noting that AUC reflects only some aspects of model performance. Several studies have shown that, under specific experimental conditions, neural networks can achieve higher AUC than radiologists in classifying breast cancer screening exams. However, this advantage may not carry over to other measures: "For example, in our study, radiologists remained relatively strong when evaluated with PRAUC," Geras explained.
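To see why the choice of metric matters, note that ROC AUC and PRAUC (the area under the precision-recall curve) can tell quite different stories on the same predictions when cancers are rare, as they are in a screening population. Below is a minimal sketch with simulated data, assuming scikit-learn and NumPy; the 1% prevalence and the score distributions are illustrative assumptions, not values from any of the studies:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)

# Simulated screening cohort of 10,000 exams with rare cancers (~1% prevalence).
y_true = rng.binomial(1, 0.01, size=10_000)

# Hypothetical model scores: noisy, but shifted upward for cancer cases.
y_score = rng.normal(0.0, 1.0, size=10_000) + 2.0 * y_true

# ROC AUC can look strong even though the many negatives produce a large
# absolute number of false positives at any usable operating threshold...
print("ROC AUC:", roc_auc_score(y_true, y_score))

# ...while PRAUC (approximated here by average precision) is dragged down
# by exactly those false positives, so it is far less flattering.
print("PRAUC  :", average_precision_score(y_true, y_score))
```

The gap between the two numbers in a run like this is one way to see why a model that "beats radiologists on AUC" is not automatically better on every clinically relevant measure.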

Model Generalization and Future Applications

From a scientific perspective, ensuring generalization across populations is critical for real-world deployment. The DeepHealth model was trained mainly on Western populations, yet generalized well to a Chinese population. Meanwhile, the Google team trained their model on UK data and evaluated it on US data to see how it would generalize to another healthcare system.

However, it is arguably more meaningful to demonstrate that models trained on Western populations can generalize to Asian populations, given differences in geography, culture, and ethnicity. For example, Chinese women tend to have higher breast density than Western women, which can make mammography technically more challenging. Showing that models trained on Western populations can still achieve high performance on Chinese populations suggests they may generalize to other populations as well.

The NYU model, which was trained on private data, has some idiosyncratic biases in the distribution of inputs, outputs, and the relationship between the two, Geras said, and it will not necessarily generalize well to other datasets.

While it's difficult to directly compare the generalization abilities of the three models -- or their overall performance in medical diagnostics -- the ultimate test will be in a real clinical setting. However, without an open-source model, it will be difficult for others—especially small groups that can also contribute to the advancement of the field—to build on Google's work. This prompted many in the research community to criticize Google's decision not to release the code for its model.

But why not post the code? Danilo Bzdok, a professor of biomedical engineering at McGill University, said Google's model training code may not be of much use because it includes a lot of dependencies based on internal tools, infrastructure and hardware.

Open-sourcing models for medical use can be complicated, and the DeepHealth team is still deciding how to "find a solution that best benefits the research community while mitigating the potential for misuse on real patients," Lotter told Synced. "From a research perspective, we're big fans of open code. However, from a clinical perspective, blindly publishing code that can be used directly to interpret mammograms is not a small risk."

Lotter emphasized that DeepHealth's goal is to build a usable product, not a pure research project, which requires quality management systems, FDA studies and approvals, and more. "There is a substantial risk of harm if someone circumvents these components and uses our model code directly for clinical decision-making, especially without ensuring proper preprocessing, input validation, and monitoring."

Screening is only the first step in a breast cancer diagnosis, which often requires more than a mammogram. There are broader questions about when to start screening, the ideal interval between mammograms, and the extent of the benefits versus harmful effects of mammograms.

These three powerful papers have focused the attention of the deep learning and medical communities on the potential of deep learning to greatly improve breast cancer screening and diagnosis. Hopefully we will see continued progress in this promising research area.

The NYU paper "Deep Neural Networks Improve Radiologists' Performance in Breast Cancer Screening" is available here, the DeepHealth paper "Robust Breast Cancer Detection in Mammography and Digital Breast Tomosynthesis Using an Annotation-Efficient Deep Learning Approach" is available here, and the Google paper "International Evaluation of an AI System for Breast Cancer Screening" is available here.
