Abstract: Testing for uniformity given very sparsely-sampled discrete data

Testing for uniformity given very sparsely-sampled discrete data

Liam Paninski

In press, IEEE Transactions on Information Theory

How many independent samples $N$ do we need from a distribution $p$ to decide that $p$ is $\epsilon$-distant from uniform in an L$_1$ sense, $\sum _{i=1} ^m |p(i) - 1/m|> \epsilon$? (Here $m$ is the number of bins on which the distribution is supported, and is assumed known {\it a priori}.) Somewhat surprisingly, we only need $N \epsilon^2 \gg m^{1/2}$ to make this decision reliably (this condition is both sufficient and necessary). The test for uniformity introduced here is based on the number of observed ``coincidences'' (samples that fall into the same bin), the mean and variance of which may be computed explicitly for the uniform distribution and bounded nonparametrically for any distribution that is known to be $\epsilon$-distant from uniform. Some connections to the classical birthday problem are noted.

Draft (150K, pdf) | Related work on estimation of sparsely-sampled discrete distributions | Liam Paninski's research page