
Statistical analysis of patient-derived sequences discovers biologically significant insights in highly mutable viruses

  • Ahmed Abdul QUADEER

Student thesis: Doctoral thesis

Abstract

Advances in fast DNA sequencing technologies over the last decade have opened up unprecedented opportunities to explore a diverse set of questions in biomedical research. This thesis uses statistics and statistical signal processing tools to analyze sequences of viral proteins and uncover novel insights into biological function and structure. Robust correlation matrix estimation for high-dimensional data plays a major role in addressing such complicated biological problems. The biological information revealed by the analysis presented in this work can be useful in multiple fields: in structural biology, to identify parts of the viral protein important for structural stability; in biochemistry, to study the role of particular parts in performing different functions associated with the viral protein; and in immunology, to predict potentially vulnerable parts of the virus, which can be targeted to aid the design of potent vaccines.

The first part of this work presents a novel vaccine design for an extremely dangerous pathogen, Hepatitis C Virus (HCV). Chronic HCV infection, which affects around 3% of the world’s population, is one of the leading causes of liver cancer. Current treatments for HCV are expensive and there is no working vaccine. The vexing problem in designing an HCV vaccine is the virus’s extreme variability, which helps it evade immune surveillance. A random matrix theory (RMT) based “noise cleaning” correlation matrix estimator is used to reveal a group of “multi-dimensionally conserved sites” in an HCV protein that may be most susceptible to immune pressure, despite the high mutability of the virus. This statistical approach demonstrates for the first time the existence of such vulnerable parts in HCV; targeting them can lead to the design of an efficacious vaccine against this scourge. These results are supported by clinical evidence available in the literature. Two vaccine designs leveraging this information are also proposed.

In addition to identifying immunological significance, the second part of this work shows the remarkable power of this approach in predicting sites with biochemical (structural or functional) significance using only the viral sequence data. This work serves as the first demonstration of a statistical approach capable of addressing this fundamental problem in biology for viruses. However, this analysis also reveals the inability of the proposed method to identify distinct groups of biochemically important sites. To tackle this problem, a robust method is proposed which, in addition to using the RMT concepts, exploits the sparsity embedded in the problem using sparse principal component analysis techniques. This approach identifies multiple distinct groups of sites, each associated with a specific structural or functional property, making it the first statistical procedure to reveal the modular structure of the viral proteins. A simulation model is also presented that provides a cohesive statistical ground-truth understanding of the results obtained using the developed methods.
Date of Award: 2016
Original language: English
Awarding Institution
  • The Hong Kong University of Science and Technology
