University of California, Berkeley
Colloquium Tea held at 4:00 pm in 154 Hurley Hall
Title: Towards reliable hypothesis generation via statistical machine learning and collaboration
Abstract: Given the rapid and continual influx of data in the biomedical field, the question arises: Can we leverage modern statistical machine learning tools to extract interpretable insights from the data and generate reliable scientific hypotheses to guide future experiments? In this talk, we discuss a particular instance regarding cardiac hypertrophy, an important and common heart disease that presents as the enlargement and thickening of the heart wall and carries significant risk for heart failure and sudden cardiac death. We, a highly interdisciplinary, collaborative team of statisticians, scientists, and clinicians from UC Berkeley, UCSF, and Stanford, developed a statistical machine learning pipeline, the low-signal iterative random forest (lo-siRF), to recommend well-vetted epistatic (i.e., non-additive gene-gene interaction) drivers of cardiac hypertrophy for follow-up knockdown experiments. At a high level, lo-siRF builds upon the computationally efficient interaction search engine of iterative random forests but has been specifically tailored for low signal-to-noise data. To extract reliable insights from such low-signal data, lo-siRF leverages biologically grounded principles and a new local stability-driven importance score for random forests. We ran several gene-silencing experiments to assess these lo-siRF-recommended genes and gene-gene interactions. These experiments confirmed the effects of several genes that are well-established drivers of cardiac hypertrophy (i.e., TTN, IGF1R). Moreover, these experiments implicated a new gene (CCDC141) and two gene-gene interactions (CCDC141-TTN, CCDC141-IGF1R), expanding the scope of the genetic regulation of cardiac structure.