Can I trust AI? Benchmarking Large Language Models

  • Ye Jin BANG

Student thesis: Doctoral thesis

Abstract

Large language models (LLMs) are increasingly deployed in domains that influence information access and decision-making, including healthcare, law, journalism, and public policy. Yet a central question remains: how much can we trust the output from LLMs? Trust in AI encompasses multiple dimensions: (1) robustness to a broad range of user inputs, ensuring reliable performance on diverse, open-ended tasks; (2) alignment with human values, including the avoidance of social bias and the responsible representation of political, cultural, and ethical perspectives; and (3) the avoidance of hallucinations, i.e., outputs inconsistent with user input, prior context, or external knowledge. As reliance on LLMs grows, so does the need for benchmarks that assess these trust-related dimensions adequately.

This thesis investigates methods for improving the benchmarking of LLMs by addressing three challenges. First, the emergence of powerful LLMs with user interfaces (e.g., ChatGPT) marked a paradigm shift, enabling broad engagement across diverse user intents and levels of complexity. However, existing benchmarks, designed for task-specific models, do not provide a comprehensive evaluation of generalist, interactive LLMs. Second, human value alignment remains critical in high-stakes applications, yet LLMs are often misaligned with such values. Current benchmarking methodologies offer limited insight into how human values are represented or expressed in model outputs. Third, hallucinations undermine user trust but are often conflated with factuality issues, making consistent evaluation difficult. Moreover, many benchmarks are susceptible to data leakage, compromising their long-term reliability.

To address these challenges, this thesis introduces three key contributions: the first interpretable, diagnostic benchmark for ChatGPT, designed to assess robustness across open-ended user inputs; benchmarks for evaluating human values along behavioral and representational dimensions; and a robust hallucination benchmark incorporating a refined taxonomy and dynamic test set generation. Across these contributions, the thesis presents empirical findings and offers insights into LLM benchmark design.

Date of Award: 2025
Original language: English
Awarding Institution
  • The Hong Kong University of Science and Technology
Supervisors: Pascale Ngan FUNG & Daniel PEREZ PALOMAR
