Artificial Intelligence in Cancer Diagnosis

Shrey S Sukhadia, Assistant Director, Bioinformatics, Department of Pathology and Laboratory Medicine, Dartmouth-Hitchcock Medical Center

Over the past several decades, pathologists and clinicians have evaluated tumors using several data modalities, such as Radiology/Imaging, Pathology, Genomics and Drug Discovery. These have been correlated with each other either manually or using sophisticated statistical approaches. Artificial Intelligence (AI) models trained on radiographic or histopathological images have been shown to predict the status of tumors (i.e., whether they are benign or cancerous). AI models have also been trained on radiographic image features to predict gene expression in cancer. Further, all the aforementioned modalities (or features from them) could be stitched together into an AI model to predict patient outcomes in cancer.

The screening of cancer patients in the clinic encompasses various data modalities. It begins with imaging using radiological techniques, including computed tomography (CT), magnetic resonance imaging (MRI) and positron emission tomography (PET). This is followed by a surgical biopsy of the tumor region of interest (ROI), as segmented by radiologists on these images using software such as 3D Slicer, ITK-SNAP and Myrian Studio. The biopsied tumor material, also termed the “specimen”, is examined by a pathologist for its shape, size and other physical features. The specimen is then fixed in formalin, embedded in paraffin and cut into thin slices, i.e., “histological sections”, which are mounted on glass slides and stained with hematoxylin. The stain aids visualization of several components of the underlying cells, such as cell nuclei and ribosomes, under a microscope. This setup allows pathologists to analyze and mark several cellular or protein markers present in the specimen. Further, these slides are sent to a genomics laboratory, where the tumor material is scraped off and processed to extract and purify deoxyribonucleic acid (DNA) and ribonucleic acid (RNA). The DNA and RNA undergo several laboratory procedures to prepare high-quality libraries, which are sequenced on either short-read or long-read sequencers such as Illumina’s NovaSeq or PacBio’s Sequel II, respectively. Sophisticated bioinformatics techniques then reveal the DNA mutations and RNA expression in the specimen. These are interpreted further by genomic experts and reported to the clinician who ordered the genomic test for the tumor specimen of the respective patient.

Based on the results from Radiology, Histopathology and Genomics, a treatment regimen is determined for the patient at a tumor board, where clinicians and pathologists discuss patient cases and arrive at a consensus on the treatment plan. This may include one or several therapies, such as Radiation, Chemotherapy, Drugs targeted to specific genes/proteins or Immunotherapy. After completion of these therapies, the patient’s tumor is re-examined by radiology/imaging, histopathology or both, to determine whether the tumor has shrunk or remained unaffected by the therapy.

AI models could encompass results from each of the aforementioned techniques (or tests) and help predict one result from a set of others, or predict therapy outcomes or the chances of disease recurrence in patients from all of their test results stitched together into a well-validated model. AI can be categorized mainly into two techniques: a) Machine Learning (ML), and b) Deep Learning (DL). Machine learning works well with the quantitative features/results extracted from the aforementioned techniques, while deep learning can take in high-dimensional images from Radiology, or microscopic scans of specimen slides from Histopathology, to identify image patterns in an automated way and inform diagnosis. ML enables a more granular analysis of feature associations across multiple data modalities than DL.

ML models encompass regression- or classification-based supervised models that can be trained and validated to predict either gene or protein expression in tumors from their radiological image features, or to predict therapy response or disease recurrence in patients using a combination of features (or results) from the radiological, pathological and genomic examination of their tumors and the respective treatments administered. DL models, in contrast, include neural networks (among them unsupervised ones) that self-learn patterns from radiological or pathological images, or from a matrix of genomic results, to predict the status of tumors (i.e., whether they are cancerous or benign) or the response of the respective patients to treatment.

The application of AI models requires the adoption of robust software, one example of which is the recently released “ImaGene” (Figure 1). Such software allows researchers to train models on tumor-specific features/results from the aforementioned techniques/tests using a variety of ML model types, such as MultiTask Linear Regression/LASSO, Decision Trees, Random Forest, Support Vector Classifier and Multi-Layer Perceptron Classifier (aka supervised neural networks), to predict gene expression (or any omics outcome) from radiology image features.
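To make the idea of predicting gene expression from an image feature concrete, the following is a minimal sketch in plain Python. It is not ImaGene's interface and not a multi-task LASSO: it assumes a single imaging feature and fits one independent least-squares line per gene, a deliberately simplified stand-in for the regression models named above. The feature values, the gene names (GENE_A, GENE_B) and the function names are all hypothetical and chosen only for illustration.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares for a single feature: returns (slope, intercept)."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def fit_per_gene(feature, expression_by_gene):
    """Fit one independent line per gene (no shared penalty, unlike multi-task LASSO)."""
    return {gene: fit_line(feature, values)
            for gene, values in expression_by_gene.items()}


def predict(models, feature_value):
    """Predict the expression of every modeled gene for one new tumor."""
    return {gene: slope * feature_value + intercept
            for gene, (slope, intercept) in models.items()}


# Synthetic, made-up data: one texture feature measured on five tumors.
texture = [1.0, 2.0, 3.0, 4.0, 5.0]
expression = {
    "GENE_A": [3.0, 5.0, 7.0, 9.0, 11.0],  # constructed as 2 * texture + 1
    "GENE_B": [9.0, 8.0, 7.0, 6.0, 5.0],   # constructed as -texture + 10
}

models = fit_per_gene(texture, expression)
print(predict(models, 6.0))  # predicted expression for a new tumor with texture 6.0
```

In practice many image features would be used at once, the model would be validated on tumors it was not trained on, and a penalized multi-task model would share information across genes, which this sketch leaves out.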

Deep learning algorithms primarily include four types: a) Feed-Forward networks, where every neuron of layer “j” connects to the neurons of layer “j+1” and information flows only in the “forward” direction, b) Convolutional Neural Networks (CNN), where each neuron calculates a weighted sum over a small neighborhood of data positions, such as adjacent image pixels, c) Recurrent Neural Networks (RNN), used for processing sequential or time-series data, and d) Autoencoders, which conduct non-linear dimensionality reduction.
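The first two network types can be illustrated with a short sketch in plain Python. It assumes a one-dimensional row of pixel intensities, a hand-picked two-value kernel and a sigmoid activation, none of which come from a specific published model; the function names are hypothetical. The convolution slides the kernel along the input and takes a weighted sum at each position, and the fully connected layer then feeds every resulting value forward to every neuron of the next layer.

```python
import math


def conv1d(signal, kernel):
    """Slide the kernel over the signal and take a weighted sum at each
    position (the 'valid' cross-correlation form commonly used in CNNs)."""
    k = len(kernel)
    return [sum(kernel[j] * signal[i + j] for j in range(k))
            for i in range(len(signal) - k + 1)]


def dense_layer(inputs, weights, biases):
    """Fully connected feed-forward layer: each output neuron applies a
    sigmoid to the weighted sum of all inputs plus its bias."""
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        z = sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        outputs.append(1.0 / (1.0 + math.exp(-z)))
    return outputs


# Toy row of pixel intensities with one bright region in the middle.
pixels = [0.0, 0.0, 1.0, 1.0, 0.0]

# A difference kernel responds where intensity rises (+1) or falls (-1).
edges = conv1d(pixels, [-1.0, 1.0])

# Two neurons with illustrative weights: one reads the rising edge,
# the other reads the falling edge.
hidden = dense_layer(
    edges,
    weights=[[0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]],
    biases=[0.0, 0.0],
)
print(edges, hidden)
```

In a trained network the kernel values and layer weights are learned from labeled images rather than chosen by hand as they are here.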

Breast tumor MRIs have recently been analyzed with deep learning algorithms to predict whether the tumors are benign, which could ultimately help reduce the unnecessary and painful biopsies performed to determine the status of a tumor, i.e., cancerous or benign. Deep learning approaches have also been applied in the histopathology domain, for instance by training models on hematoxylin-eosin-stained breast cancer microscopy images to predict whether the respective tissue is benign or cancerous.

The Cancer Imaging Archive (TCIA) portal hosts radiology imaging and omic datasets from multiple cancer studies, conducted either by The Cancer Genome Atlas (TCGA) or by a specific research group such as Non-Small Cell Lung Cancer (NSCLC) Radiogenomics (Figure 2). TCIA also hosts the supporting clinical data for the specimens, enabling AI-based imaging-omic research. In addition, TCIA hosts data from several groups, such as: a) Cancer Moonshot Biobank (CMB), b) Applied Proteogenomics Organizational Learning and Outcomes (APOLLO) and c) Clinical Proteomic Tumor Analysis Consortium (CPTAC). Together, these groups provide terabytes of data on which AI models could be trained to predict patient outcomes in cancer in the near future.

In a nutshell, AI aids clinicians and pathologists in predicting the status of tumors in patients using their radiographic images, their histopathological images or both together. AI-based models pave the way for researchers to predict omic-based features of tumors from their imaging features. Further, AI-based techniques enable the stitching of features from imaging and omic modalities to predict therapy outcome or disease recurrence in cancer patients. Publicly available software such as “ImaGene” and data sources such as TCIA and TCGA enable the building and validation of AI models in the imaging-omic domain. Cross-validation of AI models on patient data across various hospitals or research organizations would boost their accuracy in predicting patient outcomes in cancer and contribute to advancing the field of cancer diagnosis and research.

--Issue 01--

Author Bio

Shrey S Sukhadia

Shrey Sukhadia has been leading Bioinformatics efforts for 12 years at clinical laboratories in top-tier hospitals in the United States, such as Dartmouth-Health, Phoenix Children’s Hospital and the Hospital of the University of Pennsylvania. Through his PhD at Queensland University of Technology in Australia, he has developed a robust AI-based software platform, “ImaGene”, that facilitates the building and validation of several AI models for the prediction of omics data, such as genomic or proteomic data, patient/therapy outcomes or disease recurrences, from imaging or multi-omics (i.e., imaging features mixed with any omics feature) datasets.
