Machine learning to predict bacteriologic confirmation of Mycobacterium tuberculosis in infants and very young children

Jonathan P. Smith, Kyle Milligan, Kimberly D. McCarthy, Walter Mchembere, Elisha Okeyo, Susan K. Musau, Albert Okumu, Rinn Song, Eleanor S. Click, Kevin P. Cai


Diagnosis of tuberculosis (TB) among young children (<5 years) is challenging due to the paucibacillary nature of clinical disease and its clinical similarities to other childhood diseases. We used machine learning to develop accurate prediction models of microbial confirmation using simply defined and easily obtainable clinical, demographic, and radiologic factors. We evaluated eleven supervised machine learning models (using stepwise regression, regularized regression, decision tree, and support vector machine approaches) to predict microbial confirmation in young children (<5 years) using samples from invasive (reference-standard) or noninvasive procedures. Models were trained and tested using data from a large prospective cohort of young children with symptoms suggestive of TB in Kenya.


Microbial confirmation of TB disease remains among the most pressing challenges facing clinicians and researchers seeking to accurately diagnose TB and initiate treatment in young children. Pediatric TB is paucibacillary by nature, and the primary specimen used to confirm TB disease in adults, expectorated sputum, is not feasible to collect from young children. The most accurate reference standards for specimen collection in young children, gastric aspirate and induced sputum, require highly invasive procedures that often cause significant physical and mental discomfort to the child and family [3]. Unfortunately, even under ideal conditions these invasive procedures remain suboptimal, with a diagnostic yield of only 25–50 percent in high-resource settings [3,4]. Recent work has investigated alternative specimen collection using minimally invasive or noninvasive procedures, such as oral swabs, nasopharyngeal aspirate, urine, or stool samples [5–8]. While more comfortable and more feasible in limited-resource settings, these alternatives typically result in similar or lower bacteriologic yields [5–8].
Materials and methods

The purpose of the M’toto study was to identify combinations of invasive and minimally invasive bacteriologic specimen collection procedures that produced the highest yield of bacteriologically confirmed TB diagnoses. Clinical study staff collected a panel of up to eight specimen types, including two samples each of the current invasive reference standard procedures (gastric aspirate (GA) and induced sputum (IS)), as well as samples from the minimally invasive procedures of nasopharyngeal aspirate (NPA), stool, string test (ST), and urine. Two samples of cervical lymph node fine-needle aspirate (FNA) were taken if indicated, and a single sample of blood was taken. Samples were collected within three days of study enrollment. The panel was tested for microbial confirmation with both the PCR-based Xpert MTB/RIF (Xpert) and mycobacteria growth indicator tube (MGIT).


However, when considering the comprehensive range of metrics used to examine model performance, there was substantial heterogeneity between models (Table 2), particularly in the metrics that prioritize the correct classification of positive samples (AUPRC, sensitivity, PPV, F2). The median AUPRC estimate was 0.46 (range: 0.39–0.52), suggesting a substantial increase in predictive ability over baseline (~0.10 for a random estimator in these data). Among models predicting confirmation from specimens collected using invasive procedures, tree-based models demonstrated the highest overall performance as measured by AUROC and AUPRC; however, SVM models demonstrated lower overall misclassification error in predicting microbial confirmation (Table 2).
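The positive-class metrics named above can be illustrated with a minimal sketch. It is not the study's code: the function names and the toy labels and scores are hypothetical, chosen only so that the outcome prevalence is 0.10. It shows why the AUPRC of a random estimator is approximately the prevalence of the positive class, and how F2 weights sensitivity (recall) more heavily than PPV (precision). Average precision is used here as one common estimator of AUPRC, and the ranking assumes no tied scores among positive samples.

```python
from typing import Sequence, Tuple


def safe_ratio(num: float, den: float) -> float:
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (tp, fp, fn, tn) for binary labels coded 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def f_beta(precision: float, recall: float, beta: float = 2.0) -> float:
    """F-beta score; beta = 2 weights recall more heavily than precision."""
    b2 = beta ** 2
    return safe_ratio((1 + b2) * precision * recall, b2 * precision + recall)


def average_precision(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Mean of the precision values at the rank of each positive sample."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return safe_ratio(total, sum(y_true))


def random_baseline_auprc(y_true: Sequence[int]) -> float:
    """Expected AUPRC of a random estimator: the positive-class prevalence."""
    return safe_ratio(sum(y_true), len(y_true))


# Hypothetical toy data: 2 confirmed cases among 20 children (prevalence 0.10).
y_true = [1, 1] + [0] * 18
scores = [0.9, 0.3, 0.6] + [0.1] * 17
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # illustrative 0.5 threshold

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
sens = safe_ratio(tp, tp + fn)   # sensitivity (recall)
ppv = safe_ratio(tp, tp + fp)    # positive predictive value (precision)
f2 = f_beta(ppv, sens, beta=2.0)
ap = average_precision(y_true, scores)
baseline = random_baseline_auprc(y_true)
print(f"sensitivity={sens:.2f} PPV={ppv:.2f} F2={f2:.2f} AP={ap:.2f} baseline={baseline:.2f}")
```

Because the baseline depends on prevalence, an AUPRC value is only interpretable relative to the proportion of positive samples in the data at hand, unlike AUROC, whose random baseline is 0.5 regardless of prevalence.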


These findings have two key implications. First, clinical teams seeking to determine whether an invasive sampling procedure should be carried out for a child with presumptive TB could use such tools at the initial patient encounter for rapid decision-making. Knowledge that a child is very unlikely to produce a positive sample may reinforce a clinical TB diagnosis, thereby hastening treatment initiation and improving patient outcomes. This is particularly useful in limited-resource, high-incidence settings where patient follow-up is challenging. Second, future research in pediatric TB, including vaccine trials and novel approaches to microbial confirmation among children, requires confirmation using invasive sampling procedures as the reference standard. Researchers seeking to enroll a cohort of children with a high microbial yield can use these tools to guide enro

Citation: Smith JP, Milligan K, McCarthy KD, Mchembere W, Okeyo E, Musau SK, et al. (2023) Machine learning to predict bacteriologic confirmation of Mycobacterium tuberculosis in infants and very young children. PLOS Digit Health 2(5): e0000249.

Editor: Bo Wang, University of Toronto, CANADA
Received: October 27, 2022; Accepted: April 4, 2023; Published: May 17, 2023

This is an open access article, free of all copyright, and may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose. The work is made available under the Creative Commons CC0 public domain dedication.

Data Availability: All models, full model code, and data to recreate this analysis can be found on the GitHub repository,

Funding: This work was supported by the US Agency for International Development (USAID) and the US Centers for Disease Control and Prevention (CDC). A portion of this work was funded by the President’s Emergency Plan for AIDS Relief (PEPFAR) through the Centers for Disease Control and Prevention, and the Eunice Kennedy Shriver National Institute of Child Health & Human Development [K23HD072802 to RS]. The findings and conclusions in this report are those of the authors and do not necessarily represent the official position of the funding agencies. The funders had no role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.
