Testing for uniformity given very sparsely-sampled discrete
data
In press, IEEE Transactions on Information Theory
How many independent samples $N$ do we need from a distribution $p$ to
decide that $p$ is $\epsilon$-distant from uniform in an L$_1$ sense,
$\sum _{i=1} ^m |p(i) - 1/m|> \epsilon$? (Here $m$ is the number of
bins on which the distribution is supported, and is assumed known {\it
a priori}.) Somewhat surprisingly, we only need $N \epsilon^2 \gg
m^{1/2}$ to make this decision reliably (this condition is both
sufficient and necessary). The test for uniformity introduced here is
based on the number of observed ``coincidences'' (samples that fall
into the same bin), the mean and variance of which may be computed
explicitly for the uniform distribution and bounded nonparametrically
for any distribution that is known to be $\epsilon$-distant from
uniform. Some connections to the classical birthday problem are
noted.
Draft (150K, pdf) | Related work on estimation of
sparsely-sampled discrete distributions | Liam Paninski's research page