Abstract
Large language models (LLMs) are increasingly deployed in domains that influence information access and decision-making, including healthcare, law, journalism, and public policy. Yet a central question remains: how much can we trust the output of LLMs? Trust in AI encompasses multiple dimensions: (1) robustness to a broad range of user inputs, ensuring reliable performance on diverse, open-ended tasks; (2) alignment with human values, including the avoidance of social bias and the responsible representation of political, cultural, and ethical perspectives; and (3) the avoidance of hallucinations, i.e., outputs inconsistent with user input, prior context, or external knowledge. As reliance on LLMs grows, so does the need for benchmarks that adequately assess these trust-related dimensions.

This thesis investigates methods for improving the benchmarking of LLMs by addressing three challenges. First, the emergence of powerful LLMs with user interfaces (e.g., ChatGPT) marked a paradigm shift, enabling broad engagement across diverse user intents and levels of complexity. However, existing benchmarks, designed for task-specific models, do not provide a comprehensive evaluation of generalist, interactive LLMs. Second, human value alignment remains critical in high-stakes applications, yet LLMs are often misaligned with such values, and current benchmarking methodologies offer limited insight into how human values are represented or expressed in model outputs. Third, hallucinations undermine user trust but are often conflated with factuality issues, making consistent evaluation difficult. Moreover, many benchmarks are susceptible to data leakage, compromising their long-term reliability.
To address these challenges, this thesis makes three key contributions: the first interpretable, diagnostic benchmark for ChatGPT, designed to assess robustness across open-ended user inputs; benchmarks for evaluating human values along behavioral and representational dimensions; and a robust hallucination benchmark incorporating a refined taxonomy and dynamic test-set generation. Across these contributions, the thesis presents empirical findings and offers insights into LLM benchmark design.
| Date of Award | 2025 |
|---|---|
| Original language | English |
| Awarding Institution | |
| Supervisor | Pascale Ngan FUNG (Supervisor) & Daniel PEREZ PALOMAR (Supervisor) |