Gaussian processes were covered in more depth by John Cunningham:
At MLSS 2009, I gave two talks on the basics of measure theory and stochastic process concepts involved in Bayesian nonparametrics.
They complement the talks by Yee Whye Teh at the same Summer School, which I highly recommend:
Yee Whye Teh and I have recently written a short introductory article:
Erik Sudderth's thesis gives a machine learning introduction to nonparametric Bayes that takes some theory into account, is well written, and is beautifully illustrated.
If you are new to Bayesian nonparametrics, chances are that you are looking for a gentle and concise introduction to clustering with Dirichlet processes. If so, look no further:
Representation theorems (of de Finetti, Kingman, Aldous-Hoover etc.) are a very powerful tool for constructing and understanding Bayesian nonparametric models. In the following survey, we try to explain what these theorems mean and how they are used in Bayesian nonparametrics; the main focus is on graph-valued and relational data.
An overview article on various models based on random measures (Dirichlet processes, Pólya trees, neutral-to-the-right processes, etc.) is the following:
A useful reference on parametric Bayesian models (exponential families etc.) is the following book. (Despite the term "theory" in the title, this text does not involve any mathematical sophistication.)
Random discrete measures include models such as the Dirichlet process and the Pitman-Yor process. In applications, these models are typically used as priors on the mixing measure of a mixture model (e.g. Dirichlet process mixtures).
A concise introduction to the Dirichlet process is:
Perhaps the best way to get to grips with Dirichlet process mixtures is to understand the inference algorithms. There is one and only one article to read on the basic Gibbs samplers:
A more detailed introduction to the Dirichlet process and its technical properties is the following book chapter:
A key reference on Dirichlet processes and stick-breaking is a classic article by Ishwaran and James, which first made ideas such as stick-breaking constructions and Pitman-Yor processes accessible to machine learning researchers. The name "Pitman-Yor process" also seems to appear here for the first time.
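The stick-breaking idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not code from the Ishwaran and James article: the function name `stick_breaking_weights`, its parameters, and the truncation at a finite number of sticks are choices made here. With `discount = 0` the weights follow the Dirichlet process construction; with `0 < discount < 1` they follow the two-parameter (Pitman-Yor) construction.

```python
import random


def stick_breaking_weights(concentration, discount=0.0, num_sticks=50, seed=None):
    """Return (weights, remaining) for a truncated stick-breaking construction.

    Hypothetical helper for illustration only. The k-th stick proportion is
    drawn from Beta(1 - discount, concentration + k * discount), k = 1, 2, ...
    """
    if not 0.0 <= discount < 1.0:
        raise ValueError("discount must lie in [0, 1)")
    if concentration <= -discount:
        raise ValueError("concentration must exceed -discount")

    rng = random.Random(seed)
    weights = []
    remaining = 1.0  # length of the stick that has not been broken off yet
    for k in range(1, num_sticks + 1):
        proportion = rng.betavariate(1.0 - discount, concentration + k * discount)
        weights.append(remaining * proportion)  # break off a piece of the rest
        remaining *= 1.0 - proportion
    return weights, remaining


if __name__ == "__main__":
    dp_weights, dp_rest = stick_breaking_weights(2.0, num_sticks=20, seed=0)
    py_weights, py_rest = stick_breaking_weights(2.0, discount=0.5, num_sticks=20, seed=0)
    print("DP mass left after 20 sticks:", dp_rest)
    print("Pitman-Yor mass left after 20 sticks:", py_rest)
```

The function returns the unassigned mass `remaining` instead of renormalizing, so the effect of truncation stays visible.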
The Pitman-Yor process was introduced by Perman, Pitman and Yor. Their article is still the authoritative reference.
For a non-technical introduction to the Pitman-Yor process, have a look at Yee Whye Teh's article on Kneser-Ney smoothing, which applies the Pitman-Yor process to an illustrative problem in language processing.
Dirichlet processes and Pitman-Yor processes are two examples of random discrete probabilities. Any random discrete probability measure can in principle be used to replace the Dirichlet process in mixture models or one of its other applications (infinite HMMs etc.). Over the past few years, it has become much clearer which models exist, how they can be represented, and in which cases we can expect inference to be tractable. If you are interested in understanding how these models work and what the landscape of nonparametric Bayesian clustering models looks like, I recommend the following two articles:
The following talk by Igor Prünster gives a very concise overview:
Random discrete measures are usually obtained using stick-breaking constructions and related techniques. The construction of models which do not admit such representations is a bit more demanding. For the construction of general random measures, see
Random discrete measures have natural representations as point processes. Basic knowledge of point processes makes it much easier to understand random measure models, and all more advanced work on random discrete measures uses point process techniques. This is one of the topics on which "the" book to read has been written; Kingman's book on the Poisson process is certainly one of the best expository texts in probability.
If you have any serious interest in Dirichlet processes, stick-breaking etc., I would recommend that you read at least Chapters 2, 5.1, 8 and 9. For a wider range of material (Kingman's book has only 104 pages), I have found the two volumes by Daley and Vere-Jones quite useful.
The conditional probability of a point process given a sample point has a number of specific properties that general conditional probabilities do not satisfy. These conditionals are called Palm measures in point process theory, and come with their own calculus. If a random discrete measure is represented as a point process, its posterior is represented by a Palm measure. Via the correspondence between random discrete measures and random partitions, the theory of Palm measures can be applied to partitions:
Many of James' results are far ahead of current Bayesian nonparametrics. For applications to existing models, see
One of the most popular models based on the Dirichlet process is the dependent Dirichlet process. Despite its great popularity, Steven MacEachern's original article on the model remains unpublished and is hard to find on the web. Steven has kindly given me permission to make his article available:
Bayesian models are inherently hierarchical: The prior and the likelihood represent two layers in a hierarchy. The term "hierarchical modeling" often refers to the idea that the prior can itself be split up into further hierarchy layers. This provides an almost generic way to combine existing Bayesian models into new, more complex ones.
A widely known nonparametric model of this type is the hierarchical Dirichlet process.
Distributions on random functions can be used as prior distributions in regression and related problems. The prototypical prior on smooth random functions is the Gaussian process. An excellent introduction to Gaussian process models and many references can be found in the monograph by Rasmussen and Williams.
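To make "a prior on random functions" concrete, the sketch below draws one function from a zero-mean Gaussian process prior, evaluated at finitely many inputs. It is an illustration under assumptions, not code from Rasmussen and Williams: the squared exponential kernel, the jitter added to the diagonal for numerical stability, and all function names are choices made here, and the Cholesky routine is written out only to keep the example free of dependencies.

```python
import math
import random


def squared_exponential(x, y, length_scale=1.0, variance=1.0):
    """Squared exponential covariance between two scalar inputs."""
    return variance * math.exp(-0.5 * ((x - y) / length_scale) ** 2)


def cholesky(matrix):
    """Lower-triangular Cholesky factor of a symmetric positive definite matrix."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - partial)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return lower


def sample_gp_prior(inputs, kernel=squared_exponential, jitter=1e-6, seed=None):
    """Draw function values f(inputs) from a zero-mean GP prior."""
    rng = random.Random(seed)
    n = len(inputs)
    # Covariance matrix of the function values, with jitter on the diagonal.
    cov = [
        [kernel(inputs[i], inputs[j]) + (jitter if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]
    lower = cholesky(cov)
    noise = [rng.gauss(0.0, 1.0) for _ in range(n)]
    # If z is standard normal, then L z has covariance L L^T = cov.
    return [sum(lower[i][k] * noise[k] for k in range(i + 1)) for i in range(n)]


if __name__ == "__main__":
    grid = [0.5 * i for i in range(10)]
    for x, f in zip(grid, sample_gp_prior(grid, seed=0)):
        print(f"{x:4.1f}  {f: .3f}")
```

Neighboring inputs receive strongly correlated values under this kernel, which is what makes the sampled function look smooth.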
There are many texts on the mathematical theory of Gaussian processes, for example:
The first two chapters of Schervish's Theory of Statistics are a very good reference on abstract Bayesian methods, exchangeability, sufficiency, and parametric models (including infinite-dimensional Bayesian models).
A clear and readable introduction to the questions studied in this area, and to how they are addressed, is a survey chapter by Ghosal which is referenced above.
The following monograph is a good reference that provides many more details. Be aware though that the most interesting work in this area has arguably been done in the past decade, and hence is not covered by the book.
The following sample references are a small subset of the large and growing literature on this subject:
For a good introduction to exchangeability and its implications for Bayesian models, see Schervish's Theory of Statistics, which is referenced above.
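For orientation, the basic representation result for exchangeable sequences can be written as follows. This is a standard formulation added here for illustration, not a quotation from Schervish; the notation (the sample space $\mathcal{X}$, the set $\mathcal{M}(\mathcal{X})$ of probability measures on it, and the mixing measure $\nu$) is assumed, and $\mathcal{X}$ is taken to be a standard Borel space.

```latex
% De Finetti's theorem for an infinite exchangeable sequence X_1, X_2, ...
% Assumed notation: \nu is a probability measure on \mathcal{M}(\mathcal{X}).
\[
  P(X_1 \in A_1, \ldots, X_n \in A_n)
  \;=\;
  \int_{\mathcal{M}(\mathcal{X})} \prod_{i=1}^{n} \theta(A_i) \, \nu(d\theta)
  \qquad \text{for all } n \text{ and all measurable } A_1, \ldots, A_n .
\]
```

Read as a Bayesian model, $\theta$ is the random parameter, $\nu$ is the prior, and the observations are conditionally i.i.d. given $\theta$.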
If you are interested in the bigger picture, and in how exchangeability generalizes to random structures other than exchangeable sequences, I highly recommend this article based on David Aldous' lecture at the International Congress of Mathematicians:
The most comprehensive and rigorous treatise on exchangeability I am aware of is:
I discuss applications to nonparametric Bayesian models of data not representable as exchangeable sequences in this preprint:
When the Dirichlet process was first developed, Blackwell and MacQueen realized that a sample from a DP can be generated by a so-called Pólya urn with infinitely many colors. Roughly speaking, an urn model assumes that balls of different colors are contained in an urn, and are drawn uniformly at random; the proportions of balls per color determine the probability of each color to be drawn. A specific urn is defined by a rule for how the number of balls is changed when a color is drawn. In Pólya urns, the number of balls of a color is increased whenever that color is drawn; this process is called reinforcement, and corresponds to the rich-get-richer property of the Dirichlet process. There are many different versions of Pólya urns, defined by different reinforcement rules.
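The urn scheme described above can be simulated directly. The sketch below is a minimal version under stated assumptions: the function name `blackwell_macqueen_urn`, the `base_sampler` callback that produces a new color, and the standard normal base distribution in the usage example are all choices made here for illustration. At step n (counting from zero), a new color is drawn with probability concentration / (concentration + n); otherwise a previously drawn ball is copied uniformly at random, so each color is picked in proportion to how often it has already appeared.

```python
import random


def blackwell_macqueen_urn(num_draws, concentration, base_sampler, seed=None):
    """Simulate draws from a Polya urn with infinitely many colors.

    Hypothetical helper for illustration. `base_sampler(rng)` returns a fresh
    color from the base distribution; `concentration` must be positive.
    """
    if concentration <= 0.0:
        raise ValueError("concentration must be positive")

    rng = random.Random(seed)
    draws = []
    for n in range(num_draws):
        if rng.random() < concentration / (concentration + n):
            # New color: sample it from the base distribution.
            draws.append(base_sampler(rng))
        else:
            # Reinforcement: copy an earlier draw, chosen uniformly at random,
            # which selects each color in proportion to its current count.
            draws.append(rng.choice(draws))
    return draws


if __name__ == "__main__":
    sample = blackwell_macqueen_urn(200, 1.0, lambda rng: rng.gauss(0.0, 1.0), seed=1)
    print("distinct colors among 200 draws:", len(set(sample)))
```

Because earlier draws are copied, the sample contains ties; the tied values are the clusters in the clustering interpretation.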
For Bayesian nonparametrics, urns provide a probabilistic tool to study the sizes of clusters in a clustering model, or more generally the weight distributions of random discrete measures. They also provide a link to population genetics, where urns model the distribution of species; you will sometimes encounter references to species sampling models. The relationship between the different terminologies is
A key property of Pólya urns is that they can generate power law distributions, which occur in applications such as language models or social networks.
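One urn commonly used to illustrate power-law behavior is the two-parameter urn associated with the Pitman-Yor process. The sketch below is an illustration under assumptions, not taken from the surveys recommended here: the function name and parameters are hypothetical, and the concentration is restricted to positive values for simplicity. With K colors present after n draws, a new color appears with probability (concentration + discount * K) / (concentration + n), and an existing color with count c is drawn with probability proportional to c - discount.

```python
import random


def two_parameter_urn(num_draws, concentration, discount, seed=None):
    """Return the color counts after `num_draws` draws from a two-parameter urn.

    Hypothetical helper for illustration. discount = 0 recovers the urn of
    the Dirichlet process; 0 < discount < 1 gives the Pitman-Yor urn.
    """
    if concentration <= 0.0:
        raise ValueError("concentration must be positive in this sketch")
    if not 0.0 <= discount < 1.0:
        raise ValueError("discount must lie in [0, 1)")

    rng = random.Random(seed)
    counts = []  # counts[k] = number of balls of color k drawn so far
    for n in range(num_draws):
        num_colors = len(counts)
        p_new = (concentration + discount * num_colors) / (concentration + n)
        if rng.random() < p_new:
            counts.append(1)  # a color that has not been seen before
        else:
            # Reinforce an existing color, discounting each count.
            weights = [c - discount for c in counts]
            k = rng.choices(range(num_colors), weights=weights)[0]
            counts[k] += 1
    return counts


if __name__ == "__main__":
    for d in (0.0, 0.5):
        counts = two_parameter_urn(2000, 1.0, d, seed=0)
        print(f"discount={d}: {len(counts)} colors, largest count {max(counts)}")
```

For a positive discount the number of colors typically grows polynomially in the number of draws, compared with logarithmic growth when the discount is zero.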
If you are interested in urns and power laws, I recommend that you have a look at the following two survey articles (in this order):
I am often asked for references on the mathematical foundations of Bayesian nonparametrics. There are a few specific reasons why Bayesian nonparametric models require more powerful mathematical tools than parametric ones; this is particularly true for theoretical problems.
One of the reasons is that Bayesian nonparametric models do not usually have density representations, and hence require a certain amount of measure theory. Since the parameter space of a nonparametric model is infinite-dimensional, the prior and posterior distributions are probabilities on infinite-dimensional spaces, and hence stochastic processes. If you are interested in the theory of Bayesian nonparametrics and do not have a background in probability, you may have to familiarize yourself with some topics such as stochastic processes and regular conditional probabilities. These are covered in every textbook on probability theory. Billingsley's book is a popular choice.
My favorite probability textbook is Kallenberg's "Foundations". However, this book is probably not a good place to start if you do not already have a reasonable knowledge of the field. If you are interested in this book, make sure you read the second edition.
The mathematical tools for handling infinite-dimensional spaces are the subject of functional analysis. There is a marvelous textbook by Aliprantis and Border, which I believe every researcher with a serious interest in the theory of Bayesian nonparametric models should keep on their shelf.
Another problem is that Bayes' theorem does not generally hold for Bayesian nonparametric models. Technically speaking, this is due to the fact that infinite-dimensional models can be undominated. For an introduction to undominated models and the precise conditions required by Bayes' theorem, I recommend the first chapter of Schervish's textbook.
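For comparison, the dominated case can be written out explicitly. This is the standard textbook form, added here for illustration; the notation (prior $Q$ on the parameter space $\Theta$, dominating measure $\mu$, densities $p(x \mid \theta)$) is assumed rather than taken from the text above.

```latex
% Assumption: every P_\theta in the model has a density p(x | \theta)
% with respect to a single \sigma-finite measure \mu (the model is dominated).
\[
  Q(d\theta \mid x)
  \;=\;
  \frac{p(x \mid \theta) \, Q(d\theta)}
       {\int_{\Theta} p(x \mid \theta') \, Q(d\theta')}
\]
```

If no single measure $\mu$ dominates all $P_\theta$, the densities in this formula are not available, which is the sense in which Bayes' theorem fails for undominated models.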
This problem has motivated my own work on conjugate models (since conjugacy is the only reasonably general way we know to get from the prior and data to the posterior); see e.g.
A more rigorous treatment is given in my paper referenced above.
The original DP paper is of course Ferguson's 1973 article. In his acknowledgments, Ferguson attributes the idea to David Blackwell.
At about the same time, Ferguson's student Antoniak introduced a model called a Mixture of Dirichlet Processes (MDP), which is sometimes mistaken for a Dirichlet process mixture. The MDP puts a prior on the parameters of the DP base measure. A draw from an MDP is discrete almost surely, just as for the DP.
Steven MacEachern has pointed out to me that Antoniak's paper also contains a Dirichlet process mixture: Antoniak introduces the idea of using a parametric likelihood with a DP or MDP, which he refers to as "random noise" (cf. his Theorem 3) and as a sampling distribution (cf. Example 4). If this is used with a DP, the resulting distribution is identical to a Dirichlet process mixture model. However, Albert Lo was the first author to study models of this form from a mixture perspective:
Discreteness of the DP was first shown by David Blackwell. For a clear exposition of the discreteness argument used by Blackwell, see Chapter 8.3 of Kingman's book.
The Pólya urn interpretation is due to James MacQueen.
Until the 1980s, Bayesian statistics used a definition of consistency that is weaker than the modern definition. Roughly speaking, this definition states that the model has to behave well for all values of the parameter except for a set of zero probability under the prior. In parametric models, this set of exceptions does not usually cause problems, but in nonparametric models, it can make this notion of consistency almost meaningless. Work on stronger forms of consistency began after Diaconis and Freedman pointed out the problem by constructing a pathological counterexample to consistent behavior of the Dirichlet process.
This matter has caused quite a bit of confusion. A result going back to Doob shows that (under very mild identifiability conditions) any Bayesian model is consistent in the weak sense:
The fallout of this result was a folklore belief that consistency is never an issue for Bayesian models (you may still encounter this claim from time to time in the literature). A more accurate statement is perhaps that consistency is usually not an issue in parametric models, but can cause problems in nonparametric ones (regardless of whether these models are Bayesian or non-Bayesian). In Bayesian statistics, such problems went unnoticed until Bayesian nonparametrics became a serious research topic. For modern results on consistency of Bayesian nonparametric models, see the references given above.
Work on the equivalence of exchangeability and conditional independence dates back to several publications of de Finetti on sequences of binary random variables in the early 1930s, such as:
For de Finetti's perspective on the subject, see his Theory of Probability. The generalization to arbitrary random variables, as well as the interpretation of the set of exchangeable measures as a convex polytope, is due to: