Thesis Defense: Xinyi Zhang

Wednesday, April 16
10:00 am - 12:00 pm

45-600B

Thesis title: Representation learning for cell and tissue biology: from multimodality integration to simple biomarkers

Abstract: Biological processes involve complex interactions across different scales, from intracellular changes in gene expression and chromatin organization to intercellular communication and tissue organization. While advances in experimental techniques have enabled measurement of multiple modalities in the same cells, comprehensive understanding requires computational methods that can integrate these diverse data types. However, multimodal data collection remains resource-intensive and technically challenging, limiting its application across large patient cohorts. There is therefore a need not only to integrate multimodal data but also to identify simple biomarkers that capture cell state information from a scalable modality.
First, we introduce a graph-based autoencoder framework that integrates gene expression, spatial coordinates, and cellular imaging data into a joint representation. Applied to an Alzheimer’s disease mouse model, this approach reveals spatio-temporal disease progression patterns and associated nuclear morphological and transcriptional changes. We extend this methodology to analyze ovarian cancer, enabling cross-patient comparison despite significant biological heterogeneity and technical batch effects.
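The announcement describes this framework only at a high level. Purely as an illustrative sketch of the integration idea — toy synthetic data, a spatial k-nearest-neighbor graph, simple neighborhood averaging, and PCA standing in for a trained graph autoencoder's latent space, none of which are the actual implementation — it might look like:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy per-cell data: gene expression counts, 2-D spatial coordinates,
# and image-derived features (all synthetic stand-ins).
n_cells, n_genes, n_img = 100, 50, 8
expr = rng.poisson(2.0, size=(n_cells, n_genes)).astype(float)
coords = rng.uniform(0.0, 1.0, size=(n_cells, 2))
img = rng.normal(size=(n_cells, n_img))

# Spatial k-NN graph: each cell is linked to its k nearest neighbors.
k = 5
d2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)
np.fill_diagonal(d2, np.inf)
nbrs = np.argsort(d2, axis=1)[:, :k]

# One round of graph message passing: average each cell's features
# (expression concatenated with image features) with its neighbors'.
feats = np.hstack([expr, img])
agg = (feats + feats[nbrs].mean(axis=1)) / 2.0

# Low-dimensional joint embedding per cell via PCA (a stand-in for the
# latent space a trained graph autoencoder would learn).
agg_c = agg - agg.mean(axis=0)
_, _, Vt = np.linalg.svd(agg_c, full_matrices=False)
joint = agg_c @ Vt[:10].T  # shape (n_cells, 10)
```

The point of the sketch is only the data flow: modalities are fused per cell, smoothed over the spatial graph, and compressed into one joint representation that downstream analyses operate on.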

Recognizing the complexity and limited scalability of multimodal measurements and sequencing-based assays, we develop unsupervised image autoencoder frameworks that extract comprehensive cellular state information from imaging data alone. This approach enables meaningful cell state identification in breast cancer tissue microarrays spanning 122 patients and in neurodegenerative disease samples across multiple pathologies. Importantly, the proportions and spatial organization of these computationally derived cell states are predictive of pathology diagnosis, clinical phenotype, and mutation status, providing scalable biomarkers that also yield insights into the involvement of chromatin organization and the mechanical microenvironment in disease.

Building on our finding that chromatin images contain rich cell state information, we develop PUPS (Predictions of Unseen Proteins’ Subcellular localization), which combines protein language models with image inpainting to predict protein subcellular localization. PUPS addresses the limitations of experimental profiling, which can measure only a small number of proteins per experiment and has covered less than 1% of protein-cell line combinations. PUPS accurately predicts localization patterns across different cell types and states, revealing that proteins with variable nuclear-cytosolic distribution across cell lines are associated with transcription regulation, while those with high single-cell variability relate to cell division processes.

Finally, we introduce APOLLO (Autoencoder with a Partially Overlapping Latent space learned through Latent Optimization), which disentangles shared and modality-specific information in multimodal datasets. Applied to paired scRNA-seq and scATAC-seq data, APOLLO identifies gene activity captured by both modalities versus modality-specific activity. When used with multiplexed imaging data, it associates the variability in the localization of a particular protein with cellular compartment morphology.
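APOLLO itself is described here only at the architectural level. As a hedged toy sketch of the partially overlapping latent space idea — linear encoders/decoders, hand-derived gradients, and synthetic paired data standing in for scRNA-seq/scATAC-seq, not the actual APOLLO model or its latent optimization procedure — one might write:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic paired data: two modalities measured in the same cells, driven
# partly by a shared factor and partly by modality-specific factors.
n, d1, d2 = 200, 30, 20
shared = rng.normal(size=(n, 2))
s1, s2 = rng.normal(size=(n, 2)), rng.normal(size=(n, 2))
X1 = np.hstack([shared, s1]) @ rng.normal(size=(4, d1))
X2 = np.hstack([shared, s2]) @ rng.normal(size=(4, d2))

k = 2  # size of the shared (overlapping) block of each 4-D latent space
E1 = rng.normal(scale=0.1, size=(d1, 4)); D1 = rng.normal(scale=0.1, size=(4, d1))
E2 = rng.normal(scale=0.1, size=(d2, 4)); D2 = rng.normal(scale=0.1, size=(4, d2))
lr, lam = 0.01, 1.0

def losses():
    rec = (np.mean((X1 @ E1 @ D1 - X1) ** 2)
           + np.mean((X2 @ E2 @ D2 - X2) ** 2))
    align = lam * np.mean((X1 @ E1[:, :k] - X2 @ E2[:, :k]) ** 2)
    return rec, align

total_before = sum(losses())
for _ in range(1000):
    Z1, Z2 = X1 @ E1, X2 @ E2
    R1, R2 = Z1 @ D1 - X1, Z2 @ D2 - X2
    diff = Z1[:, :k] - Z2[:, :k]            # shared blocks should agree
    gZ1 = 2 * R1 @ D1.T / R1.size           # reconstruction gradient
    gZ2 = 2 * R2 @ D2.T / R2.size
    gZ1[:, :k] += 2 * lam * diff / diff.size  # alignment pulls shared blocks together
    gZ2[:, :k] -= 2 * lam * diff / diff.size
    D1 -= lr * 2 * Z1.T @ R1 / R1.size
    D2 -= lr * 2 * Z2.T @ R2 / R2.size
    E1 -= lr * X1.T @ gZ1
    E2 -= lr * X2.T @ gZ2
total_after = sum(losses())
```

The shared columns of the two latent spaces are penalized toward agreement while the remaining columns are free to absorb modality-specific variation — which is the disentanglement the paragraph above describes, independent of the particular model class used here.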

Together, these computational frameworks advance our ability to integrate diverse biological data modalities, providing deeper insights into cellular states, tissue organization, and disease processes while enabling practical applications such as biomarker identification and protein localization prediction from limited measurements.

Details

  • Date: Wednesday, April 16
  • Time: 10:00 am - 12:00 pm
  • Location: MIT Building 45-600B and on Zoom: https://mit.zoom.us/j/7232101097

Host