When does learning from data work (math starting from basic probability)
Key takeaways
- This post answers a single question: when does learning from data actually work?
- You train a model on samples, it performs well on those samples, and you hope it performs well on new data.
- Nothing is assumed beyond basic probability (expectations, independence, indicator random variables).
This post answers a single question: when does learning from data actually work?
You train a model on samples, it performs well on those samples, and you hope it performs well on new data. When is that hope justified? The answer turns out to be a clean equivalence: a hypothesis class is learnable if and only if it has finite VC dimension. This is the Fundamental Theorem of Statistical Learning.
We'll build every piece from scratch: Markov's inequality, Hoeffding's lemma, concentration inequalities, the union bound, VC dimension, the Sauer–Shelah lemma, symmetrization, and the generalization bounds that tie it all together. Nothing is assumed beyond basic probability (expectations, independence, indicator random variables). We prove both directions of the equivalence: the upper bound (finite VC dimension implies learnability) and the lower bound (via information theory: total variation, KL divergence, Pinsker's inequality, and Le Cam's method).