BDI Seminar: Joint analysis of gene expression levels and histological images identifies genes associated with cellular morphology

Histopathological images are used to identify and characterize complex phenotypes such as tumor stage. Our goal is to associate histological image phenotypes with high-dimensional genomic markers. The main limitations to incorporating histology image phenotypes in genomic studies are that the relevant image features are difficult to identify and extract in an automated way, and that confounders are difficult to control in this high-dimensional setting. In this work, we use convolutional autoencoders and sparse canonical correlation analysis (CCA) on gene expression levels and pathology images from paired samples to find subsets of genes whose expression values in a tissue sample correlate with subsets of visual features from the stained tissue images. In three data sets, two from TCGA and one from GTEx, we discuss three types of associations with the image phenotypes. First, in TCGA, we find gene sets associated with the structure of the extracellular matrix and cell wall infrastructure, implicating uncharacterized genes in extracellular processes. Second, we find sets of genes associated with specific cell types, including muscle tissue, neuronal cells, and cell type heterogeneity. Third, in the GTEx data, we find two image features that capture population variation in muscle and neuronal tissues and are associated with genetic variants, suggesting that genetic variation regulates population variation in cell morphological traits. I will also briefly touch on other work, including time series modeling of transcriptional responses to perturbation and electronic medical records data.