STAT 991: Uncertainty Quantification for Machine Learning (UPenn, 2022 Spring)

This class surveys advanced topics in statistical learning based on student presentations, focusing on uncertainty quantification for machine learning.

The core topic of the course is uncertainty quantification for machine learning methods. While modern machine learning methods can achieve high prediction accuracy on a variety of problems, properly quantifying their uncertainty remains challenging. A recent surge of work has developed methods for this problem, making it one of the fastest-developing areas in contemporary statistics. This course will survey a variety of problems and approaches, such as calibration, prediction intervals (and sets), conformal inference, and out-of-distribution (OOD) detection. We will discuss both empirically successful, popular methods and theoretically justified ones. See below for a sample of papers.

In addition to the core topic, there may be a (brief) discussion of a few additional topics:

Influential recent "breakthrough" papers applying machine learning (GPT-3, AlphaFold, etc.), to get a sense of the "real" problems people want to solve.
Important recent papers in statistical learning theory, to get a sense of progress on the theoretical foundations of the area.

Part of the class will be based on student presentations of papers. We envision a critical discussion of one or two papers per lecture, with several consecutive lectures on the same theme. The goal will be to develop a deep understanding of recent research.

Influential recent ML papers

Why are people excited about ML?

Dermatologist-level classification of skin cancer with deep neural networks
Language Models are Few-Shot Learners
Highly accurate protein structure prediction with AlphaFold
End to End Learning for Self-Driving Cars

Uncertainty quantification

Why do we need to quantify uncertainty? What are the main approaches?

Conformal prediction++

Vovk et al.'s paper series, and books
- A Tutorial on Conformal Prediction
Takeuchi’s prediction regions and theory, and old lecture notes
Inductive Conformal Prediction
(A few) Papers from CMU group
Review emphasizing exchangeability: Exchangeability, Conformal Prediction, and Rank Tests
Predictive inference with the jackknife+. Slides.
Nested conformal prediction and quantile out-of-bag ensemble methods
Conditional Validity
- X-Conditional validity: Already listed above: Mondrian Confidence Machines (also in Vovk'05 book), Lei & Wasserman'14
  - Localized Conformal Prediction
- Y-conditional: Classification with confidence
- others: equalized coverage
Distribution Shift
- (essentially) known covariate shift
  - Conformal Prediction Under Covariate Shift
  - PAC Prediction Sets Under Covariate Shift
- estimated covariate shift, semiparametric efficiency
  - Distribution-free Prediction Sets Adaptive to Unknown Covariate Shift
  - Doubly Robust Calibration of Prediction Sets under Covariate Shift
- testing covariate shift: A Distribution-Free Test of Covariate Shift Using Conformal Prediction
- online gradient descent on the quantile loss: Adaptive Conformal Inference Under Distribution Shift; aggregation
- more general weighted schemes: Conformal prediction beyond exchangeability
Applications to various statistical models
- Causal estimands and Counterfactuals: Chernozhukov et al, An Exact and Robust Conformal Inference Method for Counterfactual and Synthetic Controls, Cattaneo et al, Lei and Candes, Conformal Inference of Counterfactuals and Individual Treatment Effects
- Quantile regression: Romano et al
- Conditional distribution test: Hu & Lei
Dependence
- Conformal prediction for dynamic time-series
- Exact and robust conformal inference methods for predictive machine learning with dependent data
- Model-Free Prediction Principle (Politis and collaborators). book, brief paper
Coverage given distributional properties beyond exchangeability:
- Vovk's work on online compression models and one-off-structures
- Discrete groups and sequential observations: Exact and robust conformal inference methods for predictive machine learning with dependent data
- SymmPI: Predictive Inference for Data with Group Symmetries
Language models:
- Conformal Prediction with Large Language Models for Multi-Choice Question Answering
- PAC Prediction Sets for Large Language Models of Code
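
As a rough illustration of the split (inductive) conformal idea behind several of the readings above, here is a minimal sketch. It assumes a pre-fitted point predictor, exchangeable calibration and test data, and absolute-residual scores; the function name `split_conformal_interval` and the toy predictor `f` are hypothetical and not taken from any of the listed papers.

```python
import numpy as np

def split_conformal_interval(cal_pred, cal_y, test_pred, alpha=0.1):
    """Symmetric split-conformal intervals built from absolute-residual scores."""
    scores = np.abs(np.asarray(cal_y, dtype=float) - np.asarray(cal_pred, dtype=float))
    n = scores.size
    # Rank of the conformal quantile among the n calibration scores.
    k = int(np.ceil((n + 1) * (1 - alpha)))
    if k > n:
        # Too few calibration points for this alpha: only the trivial interval is valid.
        q = np.inf
    else:
        q = np.sort(scores)[k - 1]
    test_pred = np.asarray(test_pred, dtype=float)
    return test_pred - q, test_pred + q

# Toy usage with a hypothetical pre-fitted predictor f (an assumption for illustration).
rng = np.random.default_rng(0)
f = lambda x: 2.0 * x
x_cal, x_test = rng.uniform(-2, 2, 500), rng.uniform(-2, 2, 2000)
y_cal = f(x_cal) + rng.normal(size=500)
y_test = f(x_test) + rng.normal(size=2000)
lo, hi = split_conformal_interval(f(x_cal), y_cal, f(x_test), alpha=0.1)
coverage = np.mean((y_test >= lo) & (y_test <= hi))
print(f"empirical coverage: {coverage:.3f}")  # should be close to 1 - alpha
```

The guarantee is marginal: averaged over calibration and test draws, coverage is at least 1 - alpha under exchangeability. The conditional-validity and distribution-shift readings above address what happens when one asks for more than that.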

Tolerance Regions and Related Notions

Wilks's original paper, 1941
- Wald's multivariate extension, 1943
- Tukey's paper series: 1, 2, 3; Fraser & Wormleighton's extensions 1
Books
- David & Nagaraja: Order statistics, Sec 7.2 (short but good general intro)
- Krishnamoorthy & Mathew: Statistical tolerance regions
Connections between inductive conformal prediction, training set conditional validity, tolerance regions:

Calibration

Lecture notes and summaries:
- Ryan Tibshirani's lecture at his Statistical Learning Course at Berkeley
- Silva Filho et al: Classifier Calibration: A survey on how to assess and improve predicted class probabilities
- John Duchi's Lecture Notes on Statistics and Information Theory. See Sec. 12 for calibration.
Classics
- Sec 5.a of Robert Miller's monograph: Statistical Prediction by Discriminant Analysis (1962). Calibration is called "validity" here.
- Calibration of Probabilities: The State of the Art to 1980
  - Calibration and probability judgements: Conceptual and methodological issues
- A.P. Dawid, The Well-Calibrated Bayesian
  - A Subjectivist View of Calibration
- DeGroot & Fienberg, The Comparison and Evaluation of Forecasters, 1983
- Testing
  - Early works: Cox, 1958, Miller's monograph above
  - Mincer & Zarnowitz: The Evaluation of Economic Forecasts (1969), introducing the idea of regressing the outcomes on the predicted scores; sometimes called Mincer-Zarnowitz regression
  - On Testing the Validity of Sequential Probability Forecasts
  - Comparing predictive accuracy
  - Vaicenavicius et al: Evaluating model calibration in classification
  - Widmann et al., (2021) Calibration tests beyond classification
  - T-Cal: An optimal test for the calibration of predictive models
- On-line setting (some of it is non-probabilistic):
  - Foster & Vohra (1998) Asymptotic Calibration. Biometrika
  - Vovk, V. and Shafer, G. (2005) Good randomized sequential probability forecasting is always possible. JRSS-B
  - Hart (2021), proving a claim from 1995: Calibrated Forecasts: The Minimax Proof
  - Lower bounds: Qiao and Valiant (2021) Stronger Calibration Lower Bounds via Sidestepping
- Scoring rules, etc
  - Winkler, Scoring rules and the evaluation of probabilities
  - Gneiting et al., Probabilistic forecasts, calibration and sharpness
- Decision-making
  - Calibrating Predictions to Decisions: A Novel Approach to Multi-Class Calibration. Develops a decision-theoretic perspective on calibration. Proposes a notion of decision calibration, which requires that one can use the forecast to obtain an unbiased estimate of the loss, for a class of losses and a class of Bayes-optimal decision rules for a (possibly different) class of losses. Shows (quite surprisingly!) that several notions of calibration are equivalent to decision calibration for specific classes of losses: for instance, full multiclass calibration is equivalent to decision calibration for the class of all loss functions, while top-one (or confidence) calibration is equivalent to decision calibration for losses whose action set includes the classes as well as an abstention option, and which penalize abstention less than misclassification. Also develops methods for achieving decision calibration, based on potential function ideas inspired by work on multi-calibration.
Modern ML
- On Calibration of Modern Neural Networks; suggests using Mincer-Zarnowitz regression for re-calibration
- Measuring Calibration in Deep Learning
- Distribution-free binary classification: prediction sets, confidence intervals and calibration
- Beyond Pinball Loss: Quantile Methods for Calibrated Uncertainty Quantification
- Learn then Test: Calibrating Predictive Algorithms to Achieve Risk Control
- Calibration Error for Heterogeneous Treatment Effects
- theory in random features models: A study of uncertainty quantification in overparametrized high-dimensional models
- theory on distance to calibration: A Unifying Theory of Distance from Calibration
- connection to conformal prediction: perhaps surprisingly, calibration via temperature scaling can increase the average prediction sets sizes in CP: On Calibration and Conformal Prediction of Deep Classifiers. The paper provides experimental evidence, as well as some theoretical support for this claim.
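
Two of the diagnostics that recur in the calibration readings above can be sketched in a few lines. The sketch below is for binary outcomes and is only illustrative: `expected_calibration_error` is a common equal-width binned estimator (bin count and binning scheme are assumptions), and `mincer_zarnowitz` is a plain least-squares fit of outcomes on predicted scores, where a calibrated forecaster would give intercept near 0 and slope near 1. Both function names are hypothetical.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """Binned ECE: weighted average over bins of |mean outcome - mean predicted prob|."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    # Equal-width bins on [0, 1]; a probability of exactly 1.0 goes in the last bin.
    bin_ids = np.minimum((probs * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            # mask.mean() is the fraction of points that fall in this bin.
            ece += mask.mean() * abs(labels[mask].mean() - probs[mask].mean())
    return ece

def mincer_zarnowitz(scores, outcomes):
    """Least-squares regression of outcomes on scores; returns (intercept, slope)."""
    scores = np.asarray(scores, dtype=float)
    X = np.column_stack([np.ones(scores.size), scores])
    coef, *_ = np.linalg.lstsq(X, np.asarray(outcomes, dtype=float), rcond=None)
    return coef[0], coef[1]

# Synthetic overconfident forecaster (an assumption for illustration):
# the true event probability is pulled halfway toward 0.5 relative to the score.
rng = np.random.default_rng(0)
scores = rng.uniform(0.05, 0.95, size=5000)
true_p = 0.5 + 0.5 * (scores - 0.5)
outcomes = rng.binomial(1, true_p)
print("ECE:", expected_calibration_error(scores, outcomes))
# In population the fit is intercept 0.25 and slope 0.5, far from the calibrated (0, 1).
print("intercept, slope:", mincer_zarnowitz(scores, outcomes))
```

Binned estimators like this one depend on the choice of bins, which is one of the issues the testing and measurement papers above examine.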

Language Models

Collection of links about LLM uncertainty and robustness

Calibration
- How Can We Know When Language Models Know? On the Calibration of Language Models for Question Answering. Slides. Investigates calibration of LMs for Q&A. Studies techniques such as "fine-tuning" (e.g., with MLE over answer set), post-hoc, and LM-specific (augmentation, paraphrasing). Find that "Post-processing Confidence is Effective Universally."
- Calibrate Before Use: Improving Few-Shot Performance of Language Models; LLMs have biases for generating outcomes that should a priori have pre-determined probabilities, such as 50-50
- Language Models (Mostly) Know What They Know; Perhaps surprisingly, LLMs are sometimes calibrated. On the other hand, there are a few limitations: (1) the results do not really hold for small models. (2) The results are not so robust (for instance they can easily break when they add the answer "none of the above")
- Teaching models to express their uncertainty in words; They finetune an LLM (GPT-3) on math problems, using labels generated as the empirical accuracy over various sub-tasks; further quantized into 5 quantiles (lowest, ...). Observe that this generalizes to same task, and partly to distribution shift. Limitations include: (1) very task-specific. (2) ad hoc grouping of tasks. (3) performance for other tasks (dist. shift) is very poor (not much better than constant baseline)
- A Study on the Calibration of In-context Learning. Find an empirical accuracy-calibration trade-off for in-context learning. In examples (e.g., LLaMA-7B) as the number of ICL samples increases, the prediction accuracy improves; at the same time, the calibration first worsens and then becomes better.
- Uncertainty in Language Models: Assessment through Rank-Calibration. Proposes rank-calibration to assess uncertainty & confidence measures for language models. The principle is that higher uncertainty (or lower confidence) should imply lower generation quality, on average. Rank-calibration quantifies deviations from this ideal relationship in a principled manner, without requiring ad hoc binary thresholding of the correctness score (e.g., ROUGE or METEOR).
- Linguistic Calibration of Language Models. Proposes to evaluate the usefulness of long-form language model generations by their ability to increase the quality of answering questions. For a question x, answer y, and a long-form generation z from a helper LM $\pi$, aims to maximize (over $\pi$) the expectation of log p(y|x,z), where p is a question-answering LM. [Also argues about how this improves calibration and decision-making, but this does not seem to be crucial.]
Conformal Prediction
- Conformal Prediction with Large Language Models for Multi-choice Question Answering; Use CP based on score f.
- PAC Prediction Sets for Large Language Models of Code
- Robots That Ask For Help: Uncertainty Alignment for Large Language Model Planners. Features: robot observation, user instruction, few-shot examples of possible plans in other scenarios, LLM-generated list of plans based on prev. three (call the set Y). Outcomes: entries y in Y. Score: f(x,y). Assume have iid sample from distribution over scenarios ξ:=(e,ℓ,g); e: POMDP environment, ℓ: language instruction, and g: goal + contexts + plans + labels (correct plan index). Handle sequences: predict at the sequence level, with fixed confidence.
- Prompt Risk Control: A Rigorous Framework for Responsible Deployment of Large Language Models. Input a set of prompts P. Return a subset that satisfies (an upper bound on) some user-chosen notion of risk R. Uses either Learn-Then-Test or Quantile Risk Control (Snell et al. (2023)).
- Conformal Language Modeling; Apply Learn-Then-Test.

Types of uncertainty

Der Kiureghian and Ditlevsen: Aleatory or epistemic? Does it matter?
What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?

Empirics

Calibrated Chaos: Variance Between Runs of Neural Network Training is Harmless and Inevitable
Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation. Sample, cluster (via mutual entailment using a natural language inference classification system, a DeBERTa-large model), and estimate entropy (summing over meanings). Evaluate via the AUROC of a predictor of "Y = is answer correct" based on "X = uncertainty score". Implicit assumption: uncertain generations should be less likely to be correct (?).
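
The "sample, cluster, estimate entropy" recipe in the last entry can be sketched as follows. This is a simplified plug-in variant, not necessarily the estimator used in the paper: it assumes the sampled generations are distinct, that their sequence log-probabilities are available, and that meaning clusters have already been assigned by some entailment model (the cluster ids here are hand-written, and `semantic_entropy` is a hypothetical name).

```python
import math
from collections import defaultdict

def semantic_entropy(log_probs, cluster_ids):
    """Entropy over meaning clusters.

    Sums sequence probabilities within each cluster, renormalizes over the
    sampled clusters, and returns the entropy of that distribution (in nats).
    """
    mass = defaultdict(float)
    for lp, c in zip(log_probs, cluster_ids):
        mass[c] += math.exp(lp)
    total = sum(mass.values())
    return -sum((m / total) * math.log(m / total) for m in mass.values())

# Four sampled answers with equal probability (illustrative numbers).
log_probs = [math.log(0.25)] * 4
# If all four share one meaning, the semantic entropy is zero.
print(semantic_entropy(log_probs, [0, 0, 0, 0]))
# If they split evenly into two meanings, it is log 2.
print(semantic_entropy(log_probs, [0, 0, 1, 1]))
```

Token-level entropy would treat the four answers as four distinct outcomes in both cases; clustering by meaning is what separates paraphrases from genuine disagreement.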

Bayesian approaches, ensembles

Baseline methods:

Deep ensembles: Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles
MC Dropout: Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning; Uncertainty in Deep Learning, Yarin Gal PhD Thesis
Deep Ensembles Work, But Are They Necessary?

Other approaches:

Bayesian Layers: A Module for Neural Network Uncertainty

Dataset shift

Can You Trust Your Model's Uncertainty? Evaluating Predictive Uncertainty Under Dataset Shift

Lectures

Lecture 1-2: Introduction. By Edgar Dobriban.

Lecture 3-8: Conformal Prediction, Calibration. By Edgar Dobriban. Caveat: handwritten and may be hard to read. To be typed up in the future.

Lecture 9 onwards: student presentations.

Presentation 1: Deep Learning in Medical Imaging by Rongguang Wang.

Presentation 2: Introduction to Fairness in Machine Learning by Harry Wang.

Presentation 3: Conformal Prediction with Dependent Data by Kaifu Wang.

Presentation 4: Bayesian Calibration by Ryan Brill.

Presentation 5: Conditional Randomization Test by Abhinav Chakraborty.

Presentation 6: Distribution Free Prediction Sets and Regression by Anirban Chatterjee.

Presentation 7: Advanced Topics in Fairness by Alexander Tolbert.

Presentation 8: Calibration and Quantile Regression by Ignacio Hounie.

Presentation 9: Conformal Prediction under Distribution Shift by Patrick Chao and Jeffrey Zhang.

Presentation 10: Testing for Outliers with Conformal p-values by Donghwan Lee.

Presentation 11: Out-of-distribution detection and Likelihood Ratio Tests by Alex Nguyen-Le.

Presentation 12: Online Multicalibration and No-Regret Learning by Georgy Noarov.

Presentation 13: Online Asymptotic Calibration by Juan Elenter.

Presentation 14: Calibration in Modern ML by Soham Dan.

Presentation 15: Bayesian Optimization and Some of its Applications by Seong Han.

Presentation 16: Distribution-free Uncertainty Quantification Impossibility and Possibility I by Xinmeng Huang.

Presentation 17: Distribution-free Uncertainty Quantification Impossibility and Possibility II by Shuo Li.

Presentation 18: Top-label calibration and multiclass-to-binary reductions by Shiyun Xu.

Presentation 19: Ensembles for uncertainty quantification by Rahul Ramesh.

Presentation 20: Universal Inference by Behrad Moniri.

Presentation 21: Typicality and OOD detection by Eric Lei.

Presentation 22: Bayesian uncertainty quantification and dropout by Samar Hadou. (See lec 27 for an introduction).

Presentation 23: Distribution-Free Risk-Controlling Prediction Sets by Ramya Ramalingam.

Presentation 24: Task-Driven Detection of Distribution Shifts by Charis Stamouli.

Presentation 25: Calibration: a transformation-based method and a connection with adversarial robustness by Sooyong Jang.

Presentation 26: A Theory of Universal Learning by Raghu Arghal.

Presentation 27: Deep Ensembles: An introduction by Xiayan Ji.

Presentation 28: Why are Convolutional Nets More Sample-efficient than Fully-Connected Nets? by Evangelos Chatzipantazis.

Presentation 29: E-values by Sam Rosenberg.

Other materials

Related educational materials

Course notes from STAT 300C at Stanford University, by Emmanuel Candes; 2022 edition

Recent workshops and tutorials on related topics

ICML 2021 Workshop on Uncertainty & Robustness in Deep Learning
Workshop on Distribution-Free Uncertainty Quantification at ICML: 2021, 2022
Video tutorial by AN Angelopoulos and S Bates
NeurIPS 2020 Tutorial on Practical Uncertainty Estimation and Out-of-Distribution Robustness in Deep Learning

Seminar series

International Seminar on Distribution-Free Statistics

Software tools

Uncertainty Toolbox, associated papers
Uncertainty Baselines
MAPIE, conformal-type methods
crepes
Fortuna; paper

Probability background

Penn courses STAT 430, STAT 930.
Stat 110: Probability, Harvard. edX course, book
Online probability book

ML background

Penn courses CIS 520, ESE 546, STAT 991, and links therein

Perspectives

Comments on AI, and the role of statistics by Candes, Duchi & Sabatti
Prediction, Estimation, and Attribution, by Bradley Efron
