Abstract
While machine learning has achieved remarkable advancements across various domains in recent years, it simultaneously raises significant privacy concerns, particularly due to the potential inclusion of sensitive information within datasets. Differential privacy offers a rigorous approach to quantifying privacy leakage and has emerged as the gold standard for privacy protection. However, safeguarding privacy often comes at the cost of utility degradation. In this thesis, we investigate the cost of privacy in learning and present the following contributions.

We examine two primary models of data generation: the Probably Approximately Correct (PAC) model and the online model. In the PAC model, the input data points are independently and identically distributed (i.i.d.) and provided in advance. The algorithm takes the entire dataset as input and produces a single hypothesis. In contrast, the online model targets the scenario where data points are no longer i.i.d. and arrive sequentially in a stream. The learner must output an updated hypothesis immediately after receiving each data point. Both models can operate under two distinct settings: the realizable setting, which assumes the existence of a hypothesis within the given concept class that perfectly classifies the data, and the agnostic setting, which makes no such assumption.
For agnostic PAC learning, we study the problem under pure differential privacy and propose an algorithm that achieves near-optimal sample complexity. The improvement stems from a novel construction of the score function used in the exponential mechanism. We also extend our algorithm to the user-level privacy setting, where each user may hold multiple data points, and derive improved bounds compared to prior work.
We further introduce a general transformation that converts any private realizable learner into a private agnostic learner with only a minimal increase in sample complexity, applicable to both pure and approximate differential privacy. This result demonstrates that agnostic PAC learning incurs no extra cost of privacy compared to realizable PAC learning, since the increase in sample complexity is necessary even without privacy constraints. Additionally, we apply our techniques to resolve the sample complexity of the private prediction problem, where the algorithm is required to produce a single prediction rather than a hypothesis.
For the online learning model, we first establish two hardness results for private learning in the realizable setting, suggesting that privacy constraints can render online learning significantly more challenging or even intractable. These findings reveal significant separations between non-private learning and private learning, as well as between pure differential privacy and approximate differential privacy.
Finally, we explore online learning against an adaptive adversary, who generates each data point based on the learner’s previous predictions, under approximate differential privacy. We design a realizable learner with a logarithmic mistake bound and propose the first agnostic learner that achieves sublinear regret. Our results indicate that the cost of privacy in agnostic online learning is relatively small, with only minor degradation compared to the optimal non-private regret.
| Date of Award | 2026 |
|---|---|
| Original language | English |
| Awarding Institution | |
| Supervisor | Bo LI (Supervisor) & Wei WANG (Supervisor) |