Statistical Sciences ARES: Ana Kenney
Join us at the Statistical Sciences Applied Research and Education Seminar (ARES) with Ana Kenney, Assistant Professor in the Department of Statistics at UC Irvine.
Free Hybrid (In-person/Online) Event | Registration Required
Leveraging problem structure to improve feature recommendations in biomedical research
Random forests (RFs) have been shown to achieve high prediction accuracy and to capture the underlying data-generating mechanisms more precisely in many biomedical problems. However, they can lack interpretability, making it challenging to determine which biological features should be prioritized for further study. The cost and resources required for follow-up investigations (e.g., clinical trials) necessitate the development of improved, stable feature recommendations based on these highly predictive models.
In this talk, we first discuss highly collaborative, interdisciplinary work on identifying and testing epistatic drivers of cardiomyocyte hypertrophy through a machine learning approach based on iterative random forests. Through follow-up gene silencing experiments and cell imaging analysis, we were able to validate our recommendations and show that cardiomyocyte hypertrophy is modifiable by two specific pairwise interactions.
The challenges and insights into RFs encountered in this work motivated the development of generalized mean decrease in impurity (MDI), or MDI+, a unifying framework for random forest feature importances. We show that, for a given feature and each fitted tree in an RF, MDI (the default importance method for RFs) is the unnormalized R-squared value obtained when fitting a linear regression using only the subset of local decision stumps corresponding to nodes that split on this feature. MDI+ goes beyond this restrictive ordinary least squares setting and allows the use of more appropriate models and metrics that can be tailored to different problem structures (e.g., using robust regression when there is potential data contamination). We demonstrate that this flexibility improves feature importance rankings, often by 10% or more, in terms of the AUROC for classifying signal vs. non-signal features. Moreover, in a case study on drug response prediction, MDI+ extracts well-established predictive genes with greater stability and robustness compared to existing feature importance measures.
Ana Maria Kenney is an Assistant Professor in the Department of Statistics at UC Irvine. She works at the interface of statistics, interpretable machine learning, and large-scale optimization to advance biomedical research. She has been on several interdisciplinary teams across institutions spanning cardiovascular genetics, “Omics” contributions to early infant growth, and early cancer screening. She recently completed a postdoc with Professor Bin Yu in the Department of Statistics at UC Berkeley. Previously, she received a dual-title Ph.D. in Statistics and Operations Research at Penn State, jointly advised by Francesca Chiaromonte and Matthew Reimherr. There she was a Biomedical Big Data to Knowledge Training Fellow and an Alfred P. Sloan Ph.D. Scholar.