
Statistical and Computational Methods for Genomic Data Analysis

  • Zhiwei WANG

Student thesis: Doctoral thesis

Abstract

The rapid advancement of high-throughput genomic technologies has generated a vast amount of data spanning diverse resolutions, species, tissues, and disease conditions. While these resources provide unprecedented opportunities to reveal the mechanisms underlying complex biological systems and have great potential to guide biomedical and therapeutic applications, they also pose major challenges for systematic and comprehensive investigation. Substantial effort in both statistical modeling and computation is therefore needed to extract meaningful insights from the data and translate them into knowledge. First, statistical modeling connects the observed data to scientific questions, offering a rigorous way to capture the key structure of interest. Second, as genomic datasets grow in scale and complexity, scalable and robust algorithms with accompanying software implementations become crucial in practice. In this thesis, we propose two novel statistical and computational methods and demonstrate their applications in genomic data analysis.

Characterizing cell-type-specific spatially variable genes (SVGs) within their tissue context is essential for exploring the landscape of complex biological systems in spatial transcriptomic (ST) studies. In the first part of this thesis, we present a unified framework, the Mixture of Mixed Models (MMM), designed to directly model RNA count data and identify cell-type-specific SVGs while accounting for cell type composition and correcting for platform effects. Through a comprehensive simulation study and analyses of five publicly available ST datasets from various tissues and technologies with different resolutions, we demonstrate the effectiveness and robustness of MMM in identifying cell-type-specific SVGs. Notably, our integrative analysis with genome-wide association studies (GWASs) reveals that the cell-type-specific SVGs identified by MMM in a mouse brain study exhibit significant heritability enrichment in brain-related phenotypes. This finding suggests that cell-type-specific SVGs play a vital role in elucidating the mechanisms underlying complex traits and diseases. When we apply MMM to a high-resolution Xenium human breast cancer dataset while accounting for uncertainty in cell segmentation, we find that certain cell-type-specific SVGs may contribute to cell-cell communication, thereby regulating the tissue microenvironment. Furthermore, we show the versatility of MMM by applying it to 3D tissue models constructed from multiple ST slices, highlighting its utility in analyzing 3D ST data.

Matrix factorization methods have been widely used in high-dimensional data analysis, notably in genomic studies for dimension reduction and gene factor identification. However, their effectiveness in practice is frequently compromised by poor data quality, such as high sparsity and a low signal-to-noise ratio (SNR). In the second part of this thesis, we consider a matrix factorization problem that uses auxiliary information, which is widely available in real-world applications, to overcome the challenges caused by poor data quality. Unlike existing methods that mainly rely on simple linear models to combine auxiliary information with the main data matrix, we propose to integrate gradient boosted trees into the probabilistic Matrix Factorization framework to effectively leverage Auxiliary Information (MFAI). MFAI thus naturally inherits several salient features of gradient boosted trees, such as the capability of flexibly modeling nonlinear relationships and robustness to irrelevant features and missing values in auxiliary information. The parameters in MFAI can be automatically determined under the empirical Bayes framework, making it adaptive in its use of auxiliary information and resistant to overfitting. Moreover, MFAI is computationally efficient and scalable to large datasets by exploiting variational inference. We demonstrate the advantages of MFAI through comprehensive numerical results from simulation studies and real data analyses.
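
To make the idea of combining matrix factorization with boosted trees concrete, the following is a minimal illustrative sketch, not the thesis implementation. It assumes a rank-1 factorization, a single scalar auxiliary feature per row, depth-1 trees (stumps), fixed hyperparameters, and simple point estimates in place of the variational empirical Bayes procedure described above. All names (`fit_stump`, `fit_rank1`, `beta`, `sigma2`, `lr`) are hypothetical.

```python
import random

def fit_stump(x, r):
    """Least-squares regression stump on a 1-D feature.
    Returns (threshold, left_mean, right_mean)."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    best = None
    for k in range(1, len(order)):
        lo, hi = order[:k], order[k:]
        if x[lo[-1]] == x[hi[0]]:
            continue  # cannot split between equal feature values
        ml = sum(r[i] for i in lo) / len(lo)
        mr = sum(r[i] for i in hi) / len(hi)
        sse = (sum((r[i] - ml) ** 2 for i in lo)
               + sum((r[i] - mr) ** 2 for i in hi))
        if best is None or sse < best[0]:
            best = (sse, (x[lo[-1]] + x[hi[0]]) / 2, ml, mr)
    if best is None:  # all feature values equal: fall back to a constant
        m = sum(r) / len(r)
        return (x[0], m, m)
    return best[1:]

def stump_predict(stump, xi):
    t, ml, mr = stump
    return ml if xi <= t else mr

def boosted_predict(stumps, lr, xi):
    """Additive model: learning rate times the sum of stump outputs."""
    return sum(lr * stump_predict(s, xi) for s in stumps)

def fit_rank1(Y, x, n_iter=50, lr=0.5, beta=1.0, sigma2=1.0, seed=0):
    """Toy rank-1 factorization Y[i][j] ~ z[i] * w[j] with missing entries (None).
    Assumed priors: z[i] ~ N(f(x[i]), 1/beta), w[j] ~ N(0, 1),
    where f is grown one boosted stump per iteration."""
    rng = random.Random(seed)
    n, m = len(Y), len(Y[0])
    z = [0.0] * n
    w = [rng.gauss(0, 1) for _ in range(m)]
    stumps = []
    for _ in range(n_iter):
        f = [boosted_predict(stumps, lr, xi) for xi in x]
        # Row factors: observed data pull z[i] one way, the prior mean f(x[i]) the other.
        for i in range(n):
            obs = [j for j in range(m) if Y[i][j] is not None]
            num = sum(w[j] * Y[i][j] for j in obs) / sigma2 + beta * f[i]
            den = sum(w[j] ** 2 for j in obs) / sigma2 + beta
            z[i] = num / den
        # Column loadings: ridge-type update under the N(0, 1) prior.
        for j in range(m):
            obs = [i for i in range(n) if Y[i][j] is not None]
            num = sum(z[i] * Y[i][j] for i in obs) / sigma2
            den = sum(z[i] ** 2 for i in obs) / sigma2 + 1.0
            w[j] = num / den
        # One boosting step: fit a stump to what f does not yet explain in z.
        resid = [z[i] - f[i] for i in range(n)]
        stumps.append(fit_stump(x, resid))
    return z, w, stumps
```

For a row with few or no observed entries, the update for `z[i]` is dominated by `beta * f[i]`, so the estimate falls back on what the auxiliary feature predicts; this is the mechanism by which auxiliary information can help with sparse, noisy data. A missing entry can then be imputed as `z[i] * w[j]`.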

Date of Award: 2025
Original language: English
Awarding Institution
  • The Hong Kong University of Science and Technology
Supervisor: Can YANG
