Tutorials on Bayesian Nonparametrics

This page collects references and tutorials on Bayesian nonparametrics:

Lecture Notes

The first few chapters of these class notes provide a basic introduction to the Dirichlet process, Gaussian process, and to latent feature models. The remaining chapters cover more advanced material. The focus is on concepts; it is not a literature survey.
  • Lecture Notes on Bayesian Nonparametrics.
    P Orbanz.
    [PDF (draft)]
This file is a draft and still littered with errors.

Video tutorials

NIPS Tutorial

Machine Learning Summer School 2012

  • Bayesian nonparametrics.
    P Orbanz.
    Machine Learning Summer School, 2012.

    [Videolectures]
    [Slides] Lecture 1: Clustering, Dirichlet processes, IBPs
    [Slides] Lecture 2: Gaussian processes, model construction, exchangeability, asymptotics
  • Gaussian Processes for Machine Learning.
    JP Cunningham. Machine Learning Summer School, 2012.
    [Videolectures]

Machine Learning Summer School 2009

At MLSS 2009, I gave two talks on the basics of measure theory and stochastic process concepts involved in Bayesian nonparametrics. They complemented the talks by Yee Whye Teh at the same Summer School, which I highly recommend.
  • Nonparametric Bayesian Models.
    YW Teh.
    Machine Learning Summer School, 2009.

    [Videolectures]
  • Theoretical Foundations of Nonparametric Bayesian Models.
    P Orbanz.
    Machine Learning Summer School, 2009.

    [Videolectures]

Further Reading

Surveys

Yee Whye Teh and I have written a short introductory article:
  • Bayesian Nonparametric Models.
    P Orbanz and YW Teh.
    In Encyclopedia of Machine Learning (Springer), 2010.

    [PDF]
A machine learning introduction to nonparametric Bayes that does take into account some theory, is well written and beautifully illustrated, is given by Erik Sudderth in his thesis.
  • Graphical Models for Visual Object Recognition and Tracking.
    EB Sudderth.
    PhD thesis, 2006.

    [PDF]
If you are new to Bayesian nonparametrics, chances are that you are looking for a gentle and concise introduction to clustering with Dirichlet processes. If so, look no further:
  • A tutorial on Bayesian nonparametric models.
    SJ Gershman and DM Blei.
    Journal of Mathematical Psychology (56):1-12, 2012.

    [PDF]
A very powerful tool to construct and understand Bayesian nonparametric models are representation theorems (of de Finetti, Kingman, Aldous-Hoover etc). In the following survey, we try to explain what these theorems mean and how they are used in Bayesian nonparametrics; the main focus is on graph-valued and relational data.
  • Bayesian Models of Graphs, Arrays and Other Exchangeable Random Structures.
    P Orbanz and DM Roy.
    IEEE Transactions on Pattern Analysis and Machine Intelligence (in press).

    [arxiv 1312.7857]
An overview article on various models based on random measures (Dirichlet processes, Pólya trees, neutral-to-the-right processes, etc.) is the following:
  • Bayesian nonparametric inference for random distributions and related functions.
    SG Walker, P Damien, PW Laud, and AFM Smith.
    Journal of the Royal Statistical Society B, 61 (3):485-527, 1999.

    [MathSciNet]
A useful reference on parametric Bayesian models (exponential families etc.) is the following book. (Despite the term "theory" in the title, this text does not involve any mathematical sophistication.)
  • Bayesian Theory.
    JM Bernardo and AFM Smith.
    John Wiley & Sons, 1994.

    [MathSciNet]

Random discrete measures

Random discrete measures include models such as the Dirichlet process and the Pitman-Yor process. In applications, these models are typically used as priors on the mixing measure of a mixture model (e.g. Dirichlet process mixtures).

Dirichlet and Pitman-Yor processes

A concise introduction to the Dirichlet process is:
  • Dirichlet processes.
    YW Teh.
    In Encyclopedia of Machine Learning (Springer), 2010.

    [PDF]
Perhaps the best way to get to grips with Dirichlet process mixtures is to understand the inference algorithms. There is one and only one article to read on the basic Gibbs samplers:
  • Markov chain sampling methods for Dirichlet process mixture models.
    RM Neal.
    Journal of Computational and Graphical Statistics, 9:249-265, 2000.

    [MathSciNet]
A more detailed introduction to the Dirichlet process and its technical properties is the following book chapter:
  • Dirichlet process, related priors and posterior asymptotics.
    S Ghosal.
    In N. L. Hjort et al., editors, Bayesian Nonparametrics.
    Cambridge University Press, 2010.

    [MathSciNet]
A key reference on Dirichlet processes and stick-breaking is a classic article by Ishwaran and James, which first made ideas such as stick-breaking constructions and Pitman-Yor processes accessible to machine learning researchers. The name "Pitman-Yor process" also seems to appear here for the first time.
  • Gibbs sampling methods for stick-breaking priors.
    H Ishwaran and LF James.
    Journal of the American Statistical Association, 96: 161-173, 2001.

    [MathSciNet]
The Pitman-Yor process was introduced by Perman, Pitman and Yor. Their article is still the authoritative reference.
  • Size-biased sampling of Poisson point processes and excursions.
    M Perman, J Pitman and M Yor.
    Probability Theory and Related Fields, 25(92): 21-39, 1992.

    [MathSciNet]
For a non-technical introduction to the Pitman-Yor process, have a look at Yee Whye Teh's article on Kneser-Ney smoothing, which applies the Pitman-Yor process to an illustrative problem in language processing.
  • A Hierarchical Bayesian Language Model based on Pitman-Yor Processes.
    YW Teh.
    Coling/ACL 2006.

    [PDF]

Generalizations

Dirichlet processes and Pitman-Yor processes are two examples of random discrete probabilities. Any random discrete probability measure can in principle be used to replace the Dirichlet process in mixture models or one of its other applications (infinite HMMs etc). Over the past few years, it has become much clearer which models exist, how they can be represented, and in which cases we can expect inference to be tractable. If you are interested in understanding how these models work and what the landscape of nonparametric Bayesian clustering models looks like, I recommend the following two articles:
  • Models beyond the Dirichlet process.
    A Lijoi and I Prünster.
    In N. L. Hjort et al., editors, Bayesian Nonparametrics.
    Cambridge University Press, 2010.

    [MathSciNet] [PDF]
  • Conditional formulae for Gibbs-type exchangeable random partitions.
    S Favaro, A Lijoi, and I Prünster.
    Annals of Applied Probability, to appear.

    [PDF]
This talk by Igor Prünster gives a clear and concise overview:
  • Two tales about Bayesian nonparametric modeling.
    I Prünster.
    [Videolectures]
Random discrete measures are usually obtained using stick-breaking constructions and related techniques. The construction of models which do not admit such representations is a bit more demanding. For the construction of general random measures, see
  • Projective limit random probabilities on Polish spaces.
    P Orbanz.
    Electronic Journal of Statistics, 5:1354-1373, 2011.

    [PDF] [MathSciNet]

Point processes

Random discrete measures have natural representations as point processes. Basic knowledge of point process makes it much easier to understand random measure models, and all more advanced work on random discrete measures uses point process techniques. This is one of the topics on which "the" book to read has been written; Kingman's book on the Poisson process is certainly one of the best expository texts in probability.
  • Poisson Processes.
    JFC Kingman.
    Oxford University Press, 1993.

    [MathSciNet]
If you have any serious interest in Dirichlet processes, stick-breaking etc, I would recommend that you read at least Chapters 2, 5.1, 8 and 9. For a wider range of material (Kingman's book has only 104 pages), I have found the two volumes by Daley and Vere-Jones quite useful.
  • An introduction to the theory of point processes.
    D Daley and D Vere-Jones.
    Springer, 2nd edition, 2008.

    Volumes I [MathSciNet] and II [MathSciNet].
The conditional probability of a point process given a sample point has a number of specific properties that general conditional probabilities do not satisfy. These conditionals are called Palm measures in point process theory, and come with their own calculus. If a random discrete measure is represented as a point process, its posterior is represented by a Palm measure. Via the correspondence between random discrete measures and random partitions, the theory of Palm measures can be applied to partitions:
  • Bayesian Poisson process partition calculus with an application to Bayesian Lévy moving averages.
    LF James.
    Annals of Statistics, 33(4):1771-1799, 2005.

    [MathSciNet]
Many of James' results are far ahead of current Bayesian nonparametrics. For applications to existing models, see
  • Posterior analysis for normalized random measures with independent increments.
    LF James, A Lijoi, and I Prünster.
    Scandinavian Journal of Statistics, 36:76-97, 2009.

    [MathSciNet]

Hierarchical and covariate-dependent models

One of the most popular models based on the Dirichlet process is the dependent Dirichlet process. Despite its great popularity, Steven MacEachern's original article on the model remains unpublished and is hard to find on the web. Steven has kindly given me permission to make it available here:
  • Dependent Dirichlet processes.
    SN MacEachern.
    Technical report, Ohio State University, 2000.

    [PDF]
Bayesian models are inherently hierarchical: The prior and the likelihood represent two layers in a hierarchy. The term "hierarchical modeling" often refers to the idea that the prior can itself be split up into further hierarchy layers. This provides an almost generic way to combine existing Bayesian models into new, more complex ones.
  • Hierarchical Bayesian nonparametric models with applications.
    YW Teh and MI Jordan.
    In N. L. Hjort et al., editors, Bayesian Nonparametrics.
    Cambridge University Press, 2010.

    [MathSciNet]
A widely known nonparametric model of this type is the hierarchical Dirichlet process.
  • Hierarchical Dirichlet processes.
    YW Teh, MI Jordan, MJ Beal, and DM Blei.
    Journal of the American Statistical Association, (476):1566-1581, 2006.

    [MathSciNet]

Random functions

Distributions on random functions can be used as prior distributions in regression and related problems. The prototypical prior on smooth random functions is the Gaussian process. An excellent introduction to Gaussian process models and many references can be found in the monograph by Rasmussen and Williams.
  • Gaussian Processes for Machine Learning.
    CE Rasmussen and CKI Williams.
    MIT Press, 2006.

    [PDF]
There are many texts on the mathematical theory of Gaussian processes, for example:
  • Random Fields and Geometry.
    RJ Adler and JE Taylor.
    Springer, 2007.

    [MathSciNet]

Theory

A very good reference on abstract Bayesian methods, exchangeability, sufficiency, and parametric models (including infinite-dimensional Bayesian models) are the first two chapters of Schervish's Theory of Statistics.
  • Theory of Statistics.
    MJ Schervish.
    Springer, 1995.

    [MathSciNet]

Posterior convergence

A clear and readable introduction to the questions studied in this area, and to how they are addressed, is a survey chapter by Ghosal which is referenced above. The following monograph is a good reference that provides many more details. Be aware though that the most interesting work in this area has arguably been done in the past decade, and hence is not covered by the book.
  • Bayesian Nonparametrics.
    JK Ghosh and RV Ramamoorthi.
    Springer, 2002.

    [MathSciNet]
The following sample references are a small subset of the large and growing literature on this subject:
  • Misspecification in infinite-dimensional Bayesian statistics.
    BJK Kleijn and AW van der Vaart.
    Annals of Statistics, 34(2):837-877, 2006.

    [MathSciNet]
  • Posterior convergence rates of Dirichlet mixtures at smooth densities.
    S Ghosal and AW van der Vaart.
    Annals of Statistics, 35(2):697-723, 2007.

    [MathSciNet]
  • Rates of contraction of posterior distributions based on Gaussian process priors.
    AW van der Vaart and JH van Zanten.
    Annals of Statistics, 36(3):1435-1463, 2008.

    [MathSciNet]
  • A semiparametric Bernstein-von Mises theorem for Gaussian process priors.
    I Castillo.
    Probability Theory and Related Fields, 152:53-99, 2012.

    [PDF]

Exchangeability

For a good introduction to exchangeability and its implications for Bayesian models, see Schervish's Theory of Statistics, which is referenced above. If you are interested in the bigger picture, and in how exchangeability generalizes to other random structures than exchangeable sequences, I highly recommend an article based on David Aldous' lecture at the International Congress of Mathematicians:
  • Exchangeability and continuum limits of discrete random structures.
    DJ Aldous.
    In Proceedings of the International Congress of Mathematicians, 2010.

    [PDF]
The most comprehensive and rigorous treatise on exchangeability I am aware of is:
  • Probabilistic Symmetries and Invariance Principles.
    O Kallenberg.
    Springer, 2005.

    [MathSciNet]
I discuss applications to nonparametric Bayesian models of data not representable as exchangeable sequences in this preprint:
  • Nonparametric priors on complete separable metric spaces.
    P Orbanz.
    Preprint.

    [PDF]

Urns and power laws

When the Dirichlet process was first developed, Blackwell and MacQueen realized that a sample from a DP can be generated by a so-called Pólya urn with infinitely many colors. Roughly speaking, an urn model assumes that balls of different colors are contained in an urn, and are drawn uniformly at random; the proportions of balls per color determine the probability of each color to be drawn. A specific urn is defined by a rule for how the number of balls is changed when a color is drawn. In Pólya urns, the number of balls of a color is increased whenever that color is drawn; this process is called reinforcement, and corresponds to the rich-get-richer property of the Dirichlet process. There are many different versions of Pólya urns, defined by different reinforcement rules.

For Bayesian nonparametrics, urns provide a probabilistic tool to study the sizes of clusters in a clustering model, or more generally the weight distributions of random discrete measures. They also provide a link to population genetics, where urns model the distribution of species; you will sometimes encounter references to species sampling models. The relationship between the different terminologie is \[\begin{aligned} \text{colors in urn } = \text{ species } = \text{ clusters } \end{aligned} \] and \[\begin{aligned} \#\text{balls } = \#\text{individuals } = \text{ cluster size. } \end{aligned} \] A key property of Pólya urns is that they can generate power law distributions, which occur in applications such as language models or social networks.

If you are interested in urns and power laws, I recommend that you have a look at the following two survey articles (in this order):
  • Notes on the occupancy problem with infinitely many boxes: General asymptotics and power laws.
    AV Gnedin, B Hansen, and J Pitman.
    Probability Surveys, 4:146-171, 2007.

    [PDF]
  • A survey of random processes with reinforcement.
    R Pemantle.
    Probability Surveys, 4:1-79, 2007.

    [PDF]

Mathematical background

I am often asked for references on the mathematical foundations of Bayesian nonparametrics. There are a few specific reasons why Bayesian nonparametric models require more powerful mathematical tools than parametric ones; this is particularly true for theoretical problems.

One of the reasons is that Bayesian nonparametric models do not usually have density representation, and hence require a certain amount of measure theory. Since the parameter space of a nonparametric model is infinite-dimensional, the prior and posterior distributions are probabilities on infinite-dimensional spaces, and hence stochastic processes. If you are interested in the theory of Bayesian nonparametrics and do not have a background in probability, you may have to familiarize yourself with some topics such as stochastic processes and regular conditional probabilities. These are covered in every textbook on probability theory. Billingsley's book is a popular choice.
  • Probability and Measure.
    P Billingsley.
    J. Wiley & Sons, 1995.

    [MathSciNet]
My favorite probability textbook is Kallenberg's "Foundations". However, this book is probably not a good place to start if you do not already have a reasonable knowledge of the field. If you are interested in this book, make sure you read the second edition.
  • Foundations of Modern Probability.
    O Kallenberg.
    Springer, 2nd edition, 2001.

    [MathSciNet]
The mathematical tools for handling infinite-dimensional spaces are the subject of functional analysis. There is a marvelous textbook by Aliprantis and Border, which I believe every researcher with a serious interest in the theory of Bayesian nonparametric models should keep on their shelf.
  • Infinite Dimensional Analysis.
    CD Aliprantis and KC Border.
    Springer, 3rd edition, 2006.

    [MathSciNet]
Another problem is that Bayes' theorem does not generally hold for Bayesian nonparametric models. Technically speaking, this is due to the fact that infinite-dimensional models can be undominated. For an introduction to undominated models and the precise conditions required by Bayes' theorem, I recommend the first chapter of Schervish's textbook.

This problem has motivated my own work on conjugate models (since conjugacy is the only reasonably general way we know to get from the prior and data to the posterior); see e.g. A more rigorous treatment is given in my paper referenced above.

Historical references

The original DP paper is of course Ferguson's 1973 article. In his acknowledgments, Ferguson attributes the idea to David Blackwell.
  • A Bayesian analysis of some nonparametric problems.
    TS Ferguson.
    Annals of Statistics, 1(2), 1973.

    [MathSciNet]
At about the same time, Ferguson's student Antoniak introduced a model called a Mixture of Dirichlet Processes (MDP), which is sometimes mistaken as a Dirichlet Process mixture. The MDP puts a prior on the parameters of the DP base measure. A draw from a MDP is discrete almost surely, just as for the DP.
  • Mixtures of Dirichlet processes with applications to Bayesian nonparametric estimation.
    CE Antoniak.
    Annals of Statistics, 2(6):1152-1174, 1974.

    [MathSciNet]
Steven MacEachern has pointed out to me that Antoniak's paper also contains a Dirichlet process mixture: Antoniak introduces the idea of using a parametric likelihood with a DP or MDP, which he refers to as "random noise" (cf his Theorem 3) and as a sampling distribution (cf Example 4). If this is used with a DP, the resulting distribution is identical to a Dirichlet process mixture model. However, Albert Lo was the first author to study models of this form from a mixture perspective:
  • On a class of Bayesian nonparametric estimates. I. Density estimates.
    AY Lo.
    Annals of Statistics, 12(1):351-357, 1984.

    [MathSciNet]
Discreteness of the DP was first shown by David Blackwell. For a clear exposition of the discreteness argument used by Blackwell, see Chapter 8.3 of Kingman's book.
  • Discreteness of Ferguson selections.
    D Blackwell.
    Annals of Statistics, 1(2):356-358, 1973.

    [MathSciNet]
The Pólya urn interpretation is due to James MacQueen.
  • Ferguson distributions via Pólya urn schemes.
    D Blackwell and JB MacQueen.
    Annals of Statistics, 1(2):353-355, 1973.

    [MathSciNet]

Consistency and posterior convergence

Until the 1980s, Bayesian statistics used a definition of consistency that is weaker than the modern definition. Roughly speaking, this definition states that the model has to behave well for all values of the parameter except for a set of zero probability under the prior. In parametric models, this set of exceptions does not usually cause problems, but in nonparametric models, it can make this notion of consistency almost meaningless. Work on stronger forms of consistency began after Diaconis and Freedman pointed out the problem by constructing a pathological counter example to consistent behavior of the Dirichlet process.
  • On the consistency of Bayes estimates (with discussion).
    P Diaconis and D Freedman.
    Annals of Statistics, 14(1):1-67, 1986.

    [MathSciNet]
This matter has caused quite a bit of confusion. A result going back to Doob shows that (under very mild identifiability conditions) any Bayesian model is consistent in the weak sense:
  • Application of the theory of martingales.
    JL Doob.
    Coll. Int. du CNRS Paris. 1949.

    [MathSciNet]
The fallout of this result was a folklore belief that consistency is never an issue for Bayesian models (you may still encounter this claim from time to time in the literature). A more accurate statement is perhaps that consistency is usually not an issue in parametric models, but can cause problems in nonparametric ones (regardless of whether these models are Bayesian or non-Bayesian). In Bayesian statistics, such problems went unnoticed until Bayesian nonparametrics became a serious research topic. For modern results on consistency of Bayesian nonparametric models, see the references given above.

Exchangeability

Work on the equivalence of exchangeability and conditional independence dates back to several publications of de Finetti on sequences of binary random variables in the early 1930s, such as:
  • Fuzione caratteristica di un fenomeno aleatorio.
    B de Finetti.
    Atti della R. Academia Nazionale dei Lincei, 4:251-299, 1931.
For de Finetti's perspective on the subject, see his Theory of Probability [MathSciNet]. The generalization to arbitrary random variables, as well as the interpretation of the set of exchangeable measures as a convex polytope, is due to:
  • Symmetric measures on Cartesian products.
    E Hewitt and LJ Savage.
    Transactions of the American Mathematical Society, 80(2):470-501, 1955.

    [MathSciNet]