Abstract
The development of large language models (LLMs) increasingly hinges on data quality rather than sheer quantity. Yet key challenges remain across the LLM training pipeline: accurately evaluating model capabilities, selecting efficient training subsets, and generating high-quality data for underrepresented domains. This thesis presents a series of data-centric methods spanning evaluation, selection, and generation, aimed at systematically enhancing LLM performance.In the evaluation setting, we address the limitations of traditional likelihood-based metrics—particularly their susceptibility to exposure bias—by proposing Normalized Discounted Cumulative Gain (NDCG) as a rank-based autoregressive metric. NDCG demonstrates significantly stronger correlation with both human and GPT-4 judgments on fine-tuned models, offering a more semantically grounded alternative for model assessment.
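To make the metric concrete, below is a minimal sketch of one way a rank-based NDCG can be computed over an autoregressive model's per-step next-token distributions; the reciprocal log-discounted gain and the `ndcg_autoregressive` interface are illustrative assumptions, not the thesis's exact formulation.

```python
import numpy as np

def ndcg_autoregressive(token_logits, reference_ids):
    """Rank-based NDCG over per-step next-token distributions (sketch).

    token_logits  : (T, V) array of logits, one row per decoding step.
    reference_ids : length-T sequence of gold next-token ids.
    Each step contributes a gain discounted by the gold token's rank;
    a model that ranks the gold token first at every step scores 1.0.
    """
    dcg = 0.0
    for t, gold in enumerate(reference_ids):
        # Rank of the gold token within the step-t distribution (1 = top).
        rank = int((token_logits[t] > token_logits[t, gold]).sum()) + 1
        dcg += 1.0 / np.log2(rank + 1)
    idcg = float(len(reference_ids))  # ideal case: rank 1 everywhere, 1/log2(2) = 1
    return dcg / idcg

# Toy usage: 5 decoding steps over a vocabulary of 100 tokens.
rng = np.random.default_rng(0)
print(ndcg_autoregressive(rng.normal(size=(5, 100)), rng.integers(0, 100, size=5)))
```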
For data selection, we introduce TAGCOS, a gradient-based coreset selection method that identifies highly informative instruction-tuning subsets. TAGCOS reduces data usage by 95% without compromising performance. In the pre-training stage, we develop Fox-1, a small language model trained using a curriculum-based data scheduling strategy. This approach filters and organizes pre-training data by quality and domain, enabling strong performance with efficient resource utilization.
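The selection step can be illustrated with a simplified gradient-matching greedy loop; this is a sketch under assumed per-example gradient features (`grad_feats`), not the full TAGCOS method.

```python
import numpy as np

def greedy_coreset(grad_feats, budget):
    """Greedy gradient-matching coreset selection (simplified sketch).

    grad_feats : (N, D) per-example gradient features, e.g. last-layer
                 gradients from a warmed-up model.
    Iteratively picks the example whose gradient best covers the residual
    between the full-data gradient sum and the selected subset's sum.
    """
    residual = grad_feats.sum(axis=0)    # full-dataset gradient direction
    selected = []
    for _ in range(budget):
        scores = grad_feats @ residual   # alignment with what is still missing
        scores[selected] = -np.inf       # select without replacement
        best = int(np.argmax(scores))
        selected.append(best)
        residual -= grad_feats[best]
    return selected

# Toy usage: keep 50 of 1,000 examples (a 95% reduction).
feats = np.random.default_rng(1).normal(size=(1000, 64))
subset = greedy_coreset(feats, budget=50)
```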
In the domain of data generation, we focus on long-tail and underrepresented settings where high-quality supervision is limited. In code generation, we propose Bridge-Assist Generation, which transfers knowledge from high-resource languages to synthesize low-resource code data, achieving substantial gains on multilingual coding benchmarks. In text-to-SQL tasks, we present ExeSQL, a framework that combines execution-guided filtering with preference learning to support dialect-aware SQL generation. For vision-language alignment, we design an expert-assisted verifier training pipeline that leverages vision experts and structured rationales to mitigate hallucination through iterative data refinement.
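As a concrete illustration of execution-guided filtering, the sketch below runs candidate queries against a database and keeps only those that execute (and, when a gold result set is available, that reproduce it); the function name and the use of SQLite as a stand-in for an arbitrary target dialect are assumptions for illustration.

```python
import sqlite3

def execution_filter(candidates, db_path, gold_result=None):
    """Execution-guided filtering of generated SQL (illustrative sketch).

    candidates  : list of candidate SQL strings for one question.
    db_path     : path to a SQLite database standing in for the target dialect.
    gold_result : optional reference result set; if given, a candidate must
                  reproduce it to be accepted.
    Accepted/rejected pairs can later serve as preference-learning data.
    """
    accepted, rejected = [], []
    conn = sqlite3.connect(db_path)
    for sql in candidates:
        try:
            rows = conn.execute(sql).fetchall()
            ok = gold_result is None or sorted(rows) == sorted(gold_result)
            (accepted if ok else rejected).append(sql)
        except sqlite3.Error:  # syntax or runtime failure on this dialect
            rejected.append(sql)
    conn.close()
    return accepted, rejected
```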
Overall, this thesis demonstrates that principled approaches to data evaluation, selection, and generation can significantly improve the efficiency, reliability, and adaptability of large language models across diverse tasks and modalities.
| Date of Award | 2025 |
|---|---|
| Original language | English |
| Awarding Institution | |
| Supervisor | Xiaofang ZHOU (Supervisor) |