# Recently in Miscellaneous Statistics Category

## According to Benford's law, lots of financial data are (still) fraudulent

This post is by Phil.

I love this post by Jialan Wang. Wang "downloaded quarterly accounting data for all firms in Compustat, the most widely-used dataset in corporate finance that contains data on over 20,000 firms from SEC filings" and looked at the statistical distribution of leading digits in various pieces of financial information. As expected, the distribution is very close to what is predicted by Benford's Law.

Very close, but not identical. But does that mean anything? Benford's "Law" isn't really a law, it's more of a rule or principle: it's certainly possible for the distribution of leading digits in financial data --- even a massive corpus of it --- to deviate from the rule without this indicating massive fraud or error. But, aha, Wang also looks at how the deviation from Benford's Law has changed with time, and looks at it by industry, and this is where things get really interesting and suggestive. I really can't summarize any better than Wang did, so click on the first link in this post and go read it. But come back here to comment!
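
For readers who want to try this kind of check themselves, here is a minimal sketch of comparing leading digits to Benford's prediction, P(d) = log10(1 + 1/d). The function names, the sum-of-squared-deviations summary, and the toy data (powers of 2) are my own assumptions for illustration; they are not taken from Wang's post.

```python
import math
from collections import Counter

def benford_probs():
    """Benford's law: P(d) = log10(1 + 1/d) for leading digit d = 1..9."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def leading_digit(x):
    """First significant digit of a nonzero number, read off scientific notation."""
    return int(f"{abs(x):.15e}"[0])

def leading_digit_shares(values):
    """Observed share of each leading digit among the nonzero values."""
    digits = [leading_digit(v) for v in values if v != 0]
    counts = Counter(digits)
    return {d: counts.get(d, 0) / len(digits) for d in range(1, 10)}

def sum_sq_deviation(values):
    """One possible summary of the gap between observed and Benford shares."""
    expected = benford_probs()
    observed = leading_digit_shares(values)
    return sum((observed[d] - expected[d]) ** 2 for d in range(1, 10))

# Toy data (an assumption): powers of 2 are known to follow Benford closely.
toy_values = [2 ** k for k in range(1, 201)]
print(benford_probs()[1])
print(sum_sq_deviation(toy_values))
```

A small deviation on its own says little, which is the point made above: the interesting part is how the deviation moves over time or across industries.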

## The estimated effect size is implausibly large. Under what models is this a piece of evidence that the true effect is small?


Paul Pudaite writes in response to my discussion with Bartels regarding effect sizes and measurement error models:

You [Gelman] wrote: "I actually think there will be some (non-Gaussian) models for which, as y gets larger, E(x|y) can actually go back toward zero."

I [Pudaite] encountered this phenomenon some time in the '90s. See this graph which shows the conditional expectation of X given Z, when Z = X + Y and the probability density functions of X and Y are, respectively, exp(-x^2) and 1/(y^2+1) (times appropriate constants). As the magnitude of Z increases, E[X|Z] shrinks to zero.

I wasn't sure it was worth the effort to try to publish a two paragraph paper.

I suspect that this is true whenever the tail of one distribution is 'sufficiently heavy' with respect to the tail of the other. Hmm, I suppose there might be enough substance in a paper that attempted to characterize this outcome for, say, unimodal symmetric distributions.

Maybe someone can do this? I think it's an important problem. Perhaps some relevance to the Jeffreys-Lindley paradox also.
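
The shrinkage Pudaite describes can be checked numerically. This is a rough sketch under his stated setup (density of X proportional to exp(-x^2), density of Y proportional to 1/(y^2+1), Z = X + Y); the midpoint-rule integration, the grid width, and the function name are my own choices. The normalizing constants cancel in the ratio, so they are omitted.

```python
import math

def conditional_mean(z, half_width=8.0, steps=16000):
    """E[X | Z=z] by a midpoint Riemann sum over x in [-half_width, half_width].

    The posterior density of x given z is proportional to
    exp(-x^2) / (1 + (z - x)^2).
    """
    dx = 2 * half_width / steps
    num = 0.0
    den = 0.0
    for i in range(steps):
        x = -half_width + (i + 0.5) * dx
        w = math.exp(-x * x) / (1.0 + (z - x) ** 2)
        num += x * w
        den += w
    return num / den

for z in (0.5, 1, 2, 5, 20, 100):
    print(z, round(conditional_mean(z), 4))
```

For large z a first-order expansion of the Cauchy term suggests E[X|Z=z] is roughly 1/z, which matches the claim that the conditional expectation shrinks to zero as the magnitude of Z grows.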

## Hey--here's what you missed in the past 30 days!


OK, the 30 days of statistics are over. I'll still be posting regularly on statistical topics, but now it will be mixed in with everything else, as before.

Here's what I put on the sister blogs in the past month:

1. How to write an entire report with fake data.

2. "Life getting shorter for women in hundreds of U.S. counties": I'd like to see a graph of relative change in death rates, with age on the x-axis.

4. Remember when I said I'd never again post on albedo? I was lying.

5. Update on Arrow's theorem. It's a Swiss thing, you wouldn't understand.

6. Dan Ariely can't read, but don't blame Johnson and Goldstein.

7. My co-blogger endorses college scholarships for bowling. Which reminds me that my friends and I did "intramural bowling" in high school to get out of going to gym class. Nobody paid us. We even had to rent the shoes!

8. The quest for http://www.freakonomics.com/2008/10/10/my-colleague-casey-mulligan-in-the-times-there-is-no-reason-to-panic/

9. For some reason, the commenters got all worked up about the dude with the two kids and completely ignored the lady who had to sell her summer home in the Hamptons.

10. The most outrageous parts of a story are the parts that don't even attract attention.

12. The fallacy of composition in brownstone Brooklyn.

13. No, the federal budget is not funded by taking money from poor people.

14. Leading recipient of U.S. foreign aid says that foreign aid is bad.

15. Jim Davis has some pretty controversial opinions.

16. Political scientist links to political scientist linking to political scientist claiming political science is irrelevant.

17. "Approximately one in 11.8 quadrillion." (I love that "approximately." The exact number is 11.8324589480035 quadrillion but they did us the favor of rounding.)

## Static sensitivity analysis

This is one of my favorite ideas. I used it in an application but have never formally studied it or written it up as a general method.

Sensitivity analysis is when you check how inferences change when you fit several different models or when you vary inputs within a model. Sensitivity analysis is often recommended but is typically difficult to do, what with the hassle of carrying around all these different estimates. In Bayesian inference, sensitivity analysis is associated with varying the prior distribution, which irritates me: why not consider sensitivity to the likelihood, as that's typically just as arbitrary as the prior while having a much larger effect on the inferences?

So we came up with static sensitivity analysis, which is a way to assess sensitivity to assumptions while fitting only one model. The idea is that Bayesian posterior simulation gives you a range of parameter values, and from these you can learn about sensitivity directly.

The published example comes from my paper with Frederic Bois and Don Maszle on the toxicokinetics of perchloroethylene (PERC). One of the products of the analysis was estimation of the percent of PERC metabolized at high and low doses. We fit a multilevel model to data from six experimental subjects, so we obtained inference for the percent metabolized at each dose for each person and the distribution of these percents over the general population.

Here's the static sensitivity analysis:

Each plot shows inferences for two quantities of interest--percent metabolized at each of the two doses--with the dots representing different draws from the fitted posterior distribution. (The percent metabolized is lower at high doses (an effect of saturation of the metabolic process in the liver), so in this case it's possible to "cheat" and display two summaries on each plot.) The four graphs show percent metabolized as a function of four different variables in the model. All these graphs represent inference for subject A, one of the six people in the experiment. (It would be possible to look at the other five subjects, but the set of graphs here gives the general idea.)

To understand the static sensitivity analysis, consider the upper-left graph. The simulations reveal some posterior uncertainty about the percent metabolized (it is estimated to be between about 10-40% at low dose and 0.5-2% at high dose) and also on the fat-blood partition coefficient displayed on the x-axis (it is estimated to be somewhere between 65 and 110). More to the point, the fat-blood partition coefficient influences the inference for metabolism at low dose but not at high dose. This result can be directly interpreted as sensitivity to the prior distribution for this parameter: if you shift the prior to the left or right, you will shift the inferences up or down for percent metabolized at low dose, but not at high dose.

Now look at the lower-left graph. The scaling coefficient strongly influences the percent metabolized at high dose but has essentially no effect on the low-dose rate.

Suppose that as a decision-maker you are primarily interested in the effects of low-dose exposure. Then you'll want to get good information about the fat-blood partition coefficient (if possible) but it's not so important to get more precise on the scaling coefficient. You can go similarly through the other graphs.
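
Here is one way the idea could be put into numbers, as a sketch only. The "posterior draws" below are simulated stand-ins that I made up to mimic the pattern in the upper-left graph (low-dose metabolism depends on the partition coefficient, high-dose does not); in a real analysis they would come from the fitted model. Summarizing each scatterplot by a least-squares slope across draws is also my assumption, not something from the paper.

```python
import random

def slope(xs, ys):
    """Least-squares slope of ys on xs, computed across posterior draws."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

rng = random.Random(1)
n_draws = 4000

# Hypothetical stand-ins for posterior draws (invented numbers).
partition = [rng.gauss(88, 10) for _ in range(n_draws)]
pct_low = [25 + 0.3 * (p - 88) + rng.gauss(0, 3) for p in partition]
pct_high = [1.0 + rng.gauss(0, 0.3) for _ in partition]

slope_low = slope(partition, pct_low)
slope_high = slope(partition, pct_high)
print("low-dose sensitivity:", round(slope_low, 3))
print("high-dose sensitivity:", round(slope_high, 3))
```

A slope near zero corresponds to a flat cloud of dots (little sensitivity to that parameter), and a clearly nonzero slope corresponds to a tilted cloud. A single number loses nonlinear patterns that the scatterplots would show.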

I think this has potential as a general method, but I've never studied it or written it up as such. It's a fun problem: it has applied importance but also links to a huge theoretical literature on sensitivity analysis.

## Hypothesis testing with multiple imputations

Vincent Yip writes:

I have read your paper [with Kobi Abayomi and Marc Levy] regarding multiple imputation application.

In order to diagnose my imputed data, I used Kolmogorov-Smirnov (K-S) tests to compare the distribution differences between the imputed and observed values of a single attribute as mentioned in your paper. My question is:

For example I have this attribute X with the following data: (NA = missing)

Original dataset: 1, NA, 3, 4, 1, 5, NA

Imputed dataset: 1, 2 , 3, 4, 1, 5, 6

a) in order to run the KS test, will I treat the observed data as 1, 3, 4,1, 5?

b) and for the observed data, will I treat 1, 2 , 3, 4, 1, 5, 6 as the imputed dataset for the K-S test? or just 2 ,6?

c) if I used m=5, I will have 5 set of imputed data sets. How would I apply K-S test to 5 of them and compare to the single observed distribution? Do I combine the 5 imputed data set into one by averaging each imputed values so I get one single imputed data and compare with the observed data? OR will I run KS test to all 5 and averaging the KS test result (i.e. averaging the p-values)?

I have to admit I have not thought about this in detail. I suppose it would make sense to compare the observed data (1,3,4,1,5) to the imputed (2,6). I would do the test separately for each imputation. I also haven't thought about what to do with the p-values. My intuition would be to average them but this again is not something I've thought much about. Also if the test does reject, this implies a difference between observed and imputed values. It does not show that the imputations are wrong, merely that under the model the data are not missing completely at random.
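
To make the "separately for each imputation" suggestion concrete, here is a small sketch that computes the two-sample K-S statistic (the largest gap between the two empirical CDFs) for observed versus imputed values in each completed dataset. The first completed dataset is the one from the question; the other two are invented for illustration. With samples this tiny a p-value would mean little, so the sketch stops at the statistic.

```python
def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between empirical CDFs."""
    def ecdf(sample, t):
        return sum(1 for v in sample if v <= t) / len(sample)
    points = sorted(set(a) | set(b))
    return max(abs(ecdf(a, t) - ecdf(b, t)) for t in points)

x = [1, None, 3, 4, 1, 5, None]  # original data, None = missing
observed = [v for v in x if v is not None]
missing_idx = [i for i, v in enumerate(x) if v is None]

# Completed datasets: the first is from the question, the rest are hypothetical.
completed = [
    [1, 2, 3, 4, 1, 5, 6],
    [1, 4, 3, 4, 1, 5, 3],
    [1, 1, 3, 4, 1, 5, 5],
]

for m, data in enumerate(completed, start=1):
    imputed = [data[i] for i in missing_idx]
    print(m, imputed, ks_statistic(observed, imputed))
```

Keeping the per-imputation statistics side by side, rather than averaging the imputed values first, preserves the between-imputation variation that multiple imputation is meant to carry.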

I'm sure there's a literature on combining hypothesis tests with multiple imputation. Usually I'm not particularly interested in testing--we just threw that Kolmogorov-Smirnov idea into our paper without thinking too hard about what we would do with it.

## How do we evaluate a new and wacky claim?

Around these parts we see a continuing flow of unusual claims supported by some statistical evidence. The claims are varyingly plausible a priori. Some examples (I won't bother to supply the links; regular readers will remember these examples and newcomers can find them by searching):

- Obesity is contagious
- People's names affect where they live, what jobs they take, etc.
- Beautiful people are more likely to have girl babies
- More attractive instructors have higher teaching evaluations
- In a basketball game, it's better to be behind by a point at halftime than to be ahead by a point
- Praying for someone without their knowledge improves their recovery from heart attacks
- A variety of claims about ESP

How should we think about these claims? The usual approach is to evaluate the statistical evidence--in particular, to look for reasons that the claimed results are not really statistically significant. If nobody can shoot down a claim, it survives.

The other part of the story is the prior. The less plausible the claim, the more carefully I'm inclined to check the analysis.

But what does it mean, exactly, to check an analysis? The key step is to interpret the findings quantitatively: not just as significant/non-significant but as an effect size, and then looking at the implications of the estimated effect.

I'll explore in the context of two examples, one from political science and one from psychology. An easy example is one in which the estimated effect is completely plausible (for example, the incumbency advantage in U.S. elections), or in which it is completely implausible (for example, a new and unreplicated claim of ESP).

Neither of the examples I consider here is easy: both of the claims are odd but plausible, and both are supported by data, theory, and reasonably sophisticated analysis.

The effect of rain on July 4th

My co-blogger John Sides linked to an article by Andreas Madestam and David Yanagizawa-Drott that reports that going to July 4th celebrations in childhood had the effect of making people more Republican. Madestam and Yanagizawa-Drott write:

Using daily precipitation data to proxy for exogenous variation in participation on Fourth of July as a child, we examine the role of the celebrations for people born in 1920-1990. We find that days without rain on Fourth of July in childhood have lifelong effects. In particular, they shift adult views and behavior in favor of the Republicans and increase later-life political participation. Our estimates are significant: one Fourth of July without rain before age 18 raises the likelihood of identifying as a Republican by 2 percent and voting for the Republican candidate by 4 percent. . . .

Here was John's reaction:

In sum, if you were born before 1970, and experienced sunny July 4th days between the ages of 7-14, and lived in a predominantly Republican county, you may be more Republican as a consequence.

When I [John] first read the abstract, I did not believe the findings at all. I doubted whether July 4th celebrations were all that influential. And the effects seem to occur too early in the life cycle: would an 8-year-old be affected politically? Doesn't the average 8-year-old care more about fireworks than patriotism?

But the paper does a lot of spadework and, ultimately, I was left thinking "Huh, maybe this is true." I'm still not certain, but it was worth a blog post.

My reaction is similar to John's but a bit more on the skeptical side.

Let's start with effect size. One July 4th without rain increases the probability of Republican vote by 4%. From their Figure 3, the number of rain-free July 4ths is between 6 and 12 for most respondents. So if we go from the low to the high end, we get an effect of 6*4%, or 24%.

[Note: See comment below from Winston Lim. If the effect is 24% (not 24 percentage points!) on the Republican vote and 0% on the Democratic vote, then the effect on the vote share R/(D+R) is 1.24/2.24 - 1/2 or approximately 5 percentage points. So the estimate is much less extreme than I'd thought. The confusion arose because I am used to seeing results reported in terms of the percent of the two-party vote share, but these researchers used a different form of summary.]
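
The conversion from a 24% relative change to a change in two-party share is quick to verify. This sketch assumes, as the "- 1/2" in the note implies, a baseline in which the two parties start out even.

```python
republican = 1.24  # Republican vote scaled up by 24 percent (relative change)
democratic = 1.00  # Democratic vote unchanged

share_after = republican / (republican + democratic)
shift = share_after - 0.5  # baseline two-party share assumed to be 50%

print(f"Republican two-party share: {share_after:.3f}")
print(f"Shift in percentage points: {100 * shift:.1f}")
```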

Does a childhood full of sunny July 4ths really make you 24 percentage points more likely to vote Republican? (The authors find no such effect when considering the weather in a few other days in July.) I could imagine an effect--but 24 percent of the vote? The number seems too high--especially considering the expected attenuation (noted in section 3.1 of the paper) because not everyone goes to a July 4th celebration and because the authors don't actually know the counties where the survey respondents lived as children. It's hard enough to believe an effect size of 24%, but it's really hard to believe that 24% is an underestimate.

So what could've gone wrong? The most convincing part of the analysis was that they found no effect of rain on July 2, 3, 5, or 6. But this made me wonder about the other days of the year. I'd like to see them automate their analysis and loop it thru all 365 days, then make a graph showing how the coefficient for July 4th fits in. (I'm not saying they should include all 365 in a single regression--that would be a mess. Rather, I'm suggesting the simpler option of 365 analyses, each for a single date.)
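
The 365-analyses idea might look something like the following sketch. Everything here is a stand-in: the data are simulated with no true effect of any day's weather, and a simple dry-minus-rainy difference in Republican share replaces the authors' regression, whose specification I am not reproducing. The point is only the structure: one analysis per date, then see where July 4th falls among the 365.

```python
import random

rng = random.Random(0)
n_people, n_days = 500, 365
JULY_4 = 184  # zero-based day of year in a non-leap year

# Simulated stand-in data (invented): no day's weather affects the vote.
votes_rep = [rng.random() < 0.5 for _ in range(n_people)]
dry = [[rng.random() < 0.7 for _ in range(n_days)] for _ in range(n_people)]

def dry_minus_rainy(day):
    """Difference in Republican share: dry versus rainy on the given day."""
    dry_votes = [v for v, d in zip(votes_rep, dry) if d[day]]
    wet_votes = [v for v, d in zip(votes_rep, dry) if not d[day]]
    return sum(dry_votes) / len(dry_votes) - sum(wet_votes) / len(wet_votes)

# One separate analysis per date, not a single regression with 365 terms.
estimates = [dry_minus_rainy(day) for day in range(n_days)]
rank = sorted(estimates, reverse=True).index(estimates[JULY_4]) + 1
print(f"July 4th estimate ranks {rank} of {n_days}")
```

With real data, a July 4th coefficient sitting far out in the tail of the other 364 would be more convincing than a comparison with only the four neighboring days.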

Otherwise there are various features in the analysis that could cause problems. The authors predict individual survey respondents given the July 4th weather when they were children, in the counties where they currently reside. Right away we can imagine all sorts of biases based on who moves and who stays put.

Setting aside these measurement issues, the big identification issue is that counties with more rain might be systematically different than counties with less rain. To the extent the weather can be considered a random treatment, the randomization is occurring across years within counties. The authors attempt to deal with this by including "county fixed effects"--that is, allowing the intercept to vary by county. That's ok but their data span a 70 year period, and counties have changed a lot politically in 70 years. They also include linear time trends for states, which helps some more, but I'm still a little concerned about systematic differences not captured in these trends.

No study is perfect, and I'm not saying these are devastating criticisms. I'm just trying to work through my thoughts here.

The effects of names on life choices

For another example, consider the study by Brett Pelham, Matthew Mirenberg, and John Jones of the dentists named Dennis (and the related stories of people with names beginning with F getting low grades, baseball players with K names getting more strikeouts, etc.). I found these claims varyingly plausible: the business with the grades and the strikeouts sounded like a joke, but the claims about career choices etc seemed possible.

My first step in trying to understand these claims was to estimate an effect size: my crude estimate was that, if the research findings were correct, about 1% of people choose their career based on their first names.

This seemed possible to me, but Uri Simonsohn (the author of the recent rebuttal of the name-choice article by Pelham et al.) argued that the implied effects were too large to be believed (just as I was arguing above regarding the July 4th study), which makes more plausible his claims that the results arise from methodological artifacts.

That calculation is straight Bayes: the distribution of systematic errors has much longer tails than the distribution of random errors, so the larger the estimated effect, the more likely it is to be a mistake. This little theoretical result is a bit annoying, because it is the larger effects that are the most interesting!

Simonsohn moved the discussion forward by calibrating the effect-size questions to other measurable quantities:

We need a benchmark to make a more informed judgment if the effect is small or large. For example, the Dennis/dentist effect should be much smaller than parent-dentist/child-dentist. I think this is almost certainly true but it is an easy hurdle. The J marries J effect should not be much larger than the effect of, say, conditioning on going to the same high-school, having sat next to each other in class for a whole semester.

I have no idea if that hurdle is passed. These are arbitrary thresholds for sure, but better I'd argue than both my "100% increase is too big", and your "pr(marry smith) up from 1% to 2% is ok."

Summary

No easy answers. But I think that understanding effect sizes on a real scale is a start.

## Censoring on one end, "outliers" on the other, what can we do with the middle?

This post was written by Phil.

A medical company is testing a cancer drug. They get 16 genetically identical (or nearly identical) rats that all have the same kind of tumor, give 8 of them the drug and leave 8 untreated...or maybe they give them a placebo, I don't know; is there a placebo effect in rats? Anyway, after a while the rats are killed and examined. If the tumors in the treated rats are smaller than the tumors in the untreated rats, then all of the rats have their blood tested for dozens of different proteins that are known to be associated with tumor growth or suppression. If there is a "significant" difference in one of the protein levels, then the working assumption is that the drug increases or decreases levels of that protein and that may be the mechanism by which the drug affects cancer. All of the above is done on many different cancer types and possibly several different types of rats. It's just the initial screening: if things look promising, many more tests and different tests are done, potentially culminating (years later) in human tests.

So the initial task is to determine, from 8 control and 8 treated rats, which proteins look different. There are some complications: (1) the data are left-censored, i.e. below some level a protein is simply reported as "low"; (2) even above the censoring threshold the data are very uncertain (50% or 30% uncertainty for concentrations up to maybe double the censoring threshold); (3) some proteins are reported only in discrete levels (e.g. measurements might be 3.7 or 7.4, but never in between); (4) sometimes instrument problems, chemistry problems, or abnormalities in one or more rats lead to very high measurements of one or more proteins.

For instance ("low" means < 0.10):

- Protein A, cases: 0.31, 0.14, low, 0.24, low, low, 0.14, low
- Protein A, controls: low, low, low, low, 0.24, low, low, low
- Protein B, cases: 160, 122, 99, 145, 377, 133, 123, 140
- Protein B, controls: 94, 107, 139, 135, 152, 120, 111, 118

Note the very high value of Protein B in case rat 5. The drug company would not want to flag Protein B as being affected by their drug just because they got that one big number.

Finally, the question: what's a good algorithm to recognize if the cases tend to have higher levels of a given protein than the controls? A few possibilities that come to mind: (1) generate bootstrap samples from the cases and from the controls, and see how often the medians differ by more than the observed medians do; if it's a small fraction of the time, then the observed difference is "statistically significant." (2) Use the Mann-Whitney "U-test". (3) Discard outliers, then use censored maximum likelihood (or similar) on the rest of the data, thus generating a mean (or geometric mean) and uncertainty for the cases and for the controls.

Which of those is the best approach, and if the answer is "none of them" then what do you recommend?
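
To show how option (2) copes with both complications, here is a sketch of the Mann-Whitney U statistic on the example data. Coding "low" as 0.0 is my assumption: a rank-based statistic only needs the censored values to sit below every measured value and tie with each other. The sketch computes U only; judging significance would still need a reference distribution (exact, permutation, or normal approximation).

```python
LOW = 0.0  # stand-in for "low" (< 0.10); only its rank matters here

def mann_whitney_u(cases, controls):
    """U = number of (case, control) pairs with case > control; ties count 1/2."""
    u = 0.0
    for a in cases:
        for b in controls:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u

protein_a_cases = [0.31, 0.14, LOW, 0.24, LOW, LOW, 0.14, LOW]
protein_a_controls = [LOW, LOW, LOW, LOW, 0.24, LOW, LOW, LOW]
protein_b_cases = [160, 122, 99, 145, 377, 133, 123, 140]
protein_b_controls = [94, 107, 139, 135, 152, 120, 111, 118]

u_a = mann_whitney_u(protein_a_cases, protein_a_controls)
u_b = mann_whitney_u(protein_b_cases, protein_b_controls)
print("Protein A: U =", u_a, "of", 8 * 8)
print("Protein B: U =", u_b, "of", 8 * 8)

# Replace the outlier 377 with a value just above the largest control:
# the ranks, and therefore U, do not change.
tamed = [161 if v == 377 else v for v in protein_b_cases]
print("Protein B, outlier tamed: U =", mann_whitney_u(tamed, protein_b_controls))
```

The last line illustrates why a rank-based test is not driven by the one big number in case rat 5: any value above all the controls contributes the same amount to U.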

## Weighting and prediction in sample surveys


A couple years ago Rod Little was invited to write an article for the diamond jubilee of the Calcutta Statistical Association Bulletin. His article was published with discussions from Danny Pfefferman, J. N. K. Rao, Don Rubin, and myself.
Here it all is.

I'll paste my discussion below, but it's worth reading the others' perspectives too. Especially the part in Rod's rejoinder where he points out a mistake I made.

## Putting together multinomial discrete regressions by combining simple logits

When predicting 0/1 data we can use logit (or probit or robit or some other robust model such as 0.01 + 0.98*invlogit(X*beta)). Logit is simple enough and we can use bayesglm to regularize and avoid the problem of separation.

What if there are more than 2 categories? If they're ordered (1, 2, 3, etc), we can do ordered logit (and use bayespolr() to avoid separation). If the categories are unordered (vanilla, chocolate, strawberry), there are unordered multinomial logit and probit models out there.

But it's not so easy to fit these multinomial models in a multilevel setting (with coefficients that vary by group), especially if the computation is embedded in an iterative routine such as mi where you have real time constraints at each step.

So this got me wondering whether we could kluge it with logits. Here's the basic idea (in the ordered and unordered forms):

- If you have a variable that goes 1, 2, 3, etc., set up a series of logits: 1 vs. 2,3,...; 2 vs. 3,...; and so forth. Fit each one with bayesglm (or whatever). The usual ordered logit is a special case of this model in which the coefficients (except for the constant term) are the same for each model. (At least, I'm guessing that's what's happening; if not, there are some equivalently-dimensioned constraints.) The simple alternative has no constraints. Intermediate versions could link the models with soft constraints, some prior distribution on the coefficients. (This would need some hyperparameters but these could be estimated too if the whole model is being fit in some iterative context.)

- If you have a vanilla-chocolate-strawberry variable, do the same thing; just order the categories first, either in some reasonable way based on substantive information or else using some automatic rule such as putting the categories in decreasing order of frequency in the data. In any case, you'd first predict the probability of being in category 1, then the probability of being in 2 (given that you're not in 1), then the probability of 3 (given not 1 or 2), and so forth.
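
The bookkeeping for turning the chain of logits into category probabilities can be sketched as follows. This covers only the prediction step, assuming each logit has already been fit (with bayesglm or anything else) and has produced a linear predictor; the function names are mine.

```python
import math

def invlogit(v):
    """Inverse logit: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-v))

def category_probs(linear_predictors):
    """Turn K-1 sequential logits into K category probabilities.

    linear_predictors[k] is the linear predictor for
    Pr(category k | not in any earlier category).
    """
    probs = []
    remaining = 1.0  # probability of not being in any earlier category
    for eta in linear_predictors:
        p_cond = invlogit(eta)
        probs.append(remaining * p_cond)
        remaining *= 1.0 - p_cond
    probs.append(remaining)  # the last category gets whatever is left
    return probs

# Two logits with linear predictor 0 (conditional probability 0.5 each)
print(category_probs([0.0, 0.0]))
```

Because each step conditions on not being in the earlier categories, the probabilities sum to 1 by construction, whatever the individual logits say. The price is that the fitted model depends on the ordering chosen for the categories.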

Depending on your control over the problem, you could choose how to model the variables. For example, in a political survey with some missing ethnicity responses, you might model that variable as ordered: white/other/hispanic/black. In other contexts you might go unordered.

I recognize that my patchwork of logits is a bit of a hack, but I like its flexibility, as well as the computational simplicity of building it out of simple logits. Maybe there's some literature on this (perhaps explaining why it's a bad idea)?? I'd appreciate any comments or suggestions.

## Fundamental difficulty of inference for a ratio when the denominator could be positive or negative

Ratio estimates are common in statistics. In survey sampling, the ratio estimate is when you use y/x to estimate Y/X (using the notation in which x,y are totals of sample measurements and X,Y are population totals).

In textbook sampling examples, the denominator X will be an all-positive variable, something that is easy to measure and is, ideally, close to proportional to Y. For example, X is last year's sales and Y is this year's sales, or X is the number of people in a cluster and Y is some count.

Ratio estimation doesn't work so well if X can be either positive or negative.

More generally we can consider any estimate of a ratio, with no need for a survey sampling context. The problem with estimating Y/X is that the very interpretation of Y/X can change completely if the sign of X changes.

Everything is ok for a point estimate: you get X.hat and Y.hat, you can take the ratio Y.hat/X.hat, no problem. But the inference falls apart if you have enough uncertainty in X.hat that you can't be sure of its sign.

This problem has been bugging me for a long time, and over the years I've encountered various examples in different fields of statistical theory, methods, and applications. Here I'll mention a few:
- LD50
- Ratio of regression coefficients
- Incremental cost-effectiveness ratio
- Instrumental variables
- Fieller-Creasy problem

LD50

We discuss this in section 3.7 of Bayesian Data Analysis. Consider a logistic regression model, Pr(y=1) = invlogit (a + bx), where x is the dose of a drug given to an animal and y=1 if the animal dies. The LD50 (lethal dose, 50%) is the value x for which Pr(y=1)=0.5. That is, a+bx=0, so x = -a/b. This is the value of x for which the logistic curve goes through 0.5 so there's a 50% chance of the animal dying.

The problem comes when there is enough uncertainty about b that its sign could be either positive or negative. If so, you get an extremely long-tailed distribution for the LD50, -a/b. How does this happen? Roughly speaking, the estimate for a has a normal dist, the estimate for b has a normal dist, so their ratio has a Cauchy-like dist, in which it can appear possible for the LD50 to take on values such as 100,000 or -300,000 or whatever. In a real example (such as in section 3.7 of BDA), these sorts of extreme values don't make sense.

The problem is that the LD50 has a completely different interpretation if b>0 than if b<0. If b>0, then x is the point at which any higher dose has a more than 50% chance of killing. If b<0, then any dose lower than x has a more than 50% chance to kill. The interpretation of the model changes completely. LD50 by itself is pretty pointless, if you don't know whether the curve goes up or down. And values such as LD50=100,000 are pretty meaningless in this case.
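
A quick simulation shows the long tail appearing once the sign of b is in doubt. The means and standard deviations below are invented for illustration and are not the BDA example; the draws of a and b are also taken as independent normals, which is a simplification.

```python
import random

def ld50_draws(b_sd, n=100_000, seed=0):
    """Draws of -a/b with a ~ N(-1, 0.5) and b ~ N(0.5, b_sd), independent."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n):
        a = rng.gauss(-1.0, 0.5)
        b = rng.gauss(0.5, b_sd)
        draws.append(-a / b)
    return draws

def share_extreme(draws, cutoff=100.0):
    """Fraction of draws larger than the cutoff in absolute value."""
    return sum(1 for d in draws if abs(d) > cutoff) / len(draws)

precise = ld50_draws(b_sd=0.05)   # b is about 10 sd from zero: sign is clear
uncertain = ld50_draws(b_sd=0.5)  # b is about 1 sd from zero: sign is unclear

print("share with |LD50| > 100, sign of b clear:  ", share_extreme(precise))
print("share with |LD50| > 100, sign of b unclear:", share_extreme(uncertain))
```

In the second case a noticeable share of the simulated LD50 values land beyond 100 in absolute value even though the point estimate is about 2, which is the Cauchy-like behavior described above.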

Ratio of regression coefficients

Here's an example. Political scientist Daniel Drezner pointed to a report by James Gwartney and Robert A. Lawson, who wrote:

Economic freedom is almost 50 times more effective than democracy in restraining nations from going to war. In new research published in this year's report [2005], Erik Gartzke, a political scientist from Columbia University, compares the impact of economic freedom on peace to that of democracy. When measures of both economic freedom and democracy are included in a statistical study, economic freedom is about 50 times more effective than democracy in diminishing violent conflict. The impact of economic freedom on whether states fight or have a military dispute is highly significant while democracy is not a statistically significant predictor of conflict.

What Gartzke did was run a regression and take the coefficient for economic freedom and divide it by the coefficient for democracy. Now I'm not knocking Gartzke's work, nor am I trying to make some smug slam on regression. I love regression and have used it for causal inference (or approximate causal inference) in my own work.

My only problem here is that ratio of 50. If beta.hat.1/beta.hat.2=50, you can bet that beta.hat.2 is not statistically significant. And, indeed, if you follow the link to Gartzke's chapter 2 of this report, you find this:

The "almost 50" above is the ratio of the estimates -0.567 and -0.011. (567/11 is actually over 50, but I assume that you get something less than 50 if you keep all the significant figures in the original estimate.) In words, each unit on the economic freedom scale corresponds to a difference of 0.567 on the probability (or, in this case, I assume the logit probability) of a militarized interstate dispute, while a difference of one unit on the democracy score corresponds to a difference of 0.011 on the outcome.

A factor of 50 is a lot, no?

But now look at the standard errors. The coefficient for the democracy score is -0.011 +/- 0.065. So the data are easily consistent with a coefficient of -0.011, or 0.1, or -0.1. All of these are a lot less than 0.567. Even if we put the coef of economic freedom at the low end of its range in absolute value (say, 0.567 - 2*0.179 = 0.2) and put the coef of the democracy score at the high end (say, 0.011 + 2*0.065=0.14)--even then, the ratio is still 1.4, which ain't nothing. (Economic freedom and democracy score both seem to be defined roughly on a 1-10 scale, so it seems plausible to compare their coefficients directly without transformation.) So, in the context of Gartzke's statistical and causal model, his data are saying something about the relative importance of the two factors.

But, no, I don't buy the factor of 50. One way to see the problem is: what if the coef of democracy had been +0.011 instead of -0.011? Given the standard error, this sort of thing could easily have occurred. The implication would be that democracy is associated with more war. Could be possible. Would the statement then be that economic freedom is negative 50 times more effective than democracy in restraining nations from going to war??

Or what if the coef of democracy had been -0.001? Then you could say that economic freedom is 500 times as important as democracy in preventing war.

The problem is purely statistical. The ratio beta.1/beta.2 has a completely different meaning according to the signs of beta.1 and beta.2. Thus, if the sign of the denominator (or, for that matter, the numerator) is uncertain, the ratio is super-noisy and can be close to meaningless.
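
The instability is easy to see by holding the economic-freedom coefficient at its estimate and moving the democracy coefficient around inside its rough 2-standard-error range (-0.011 +/- 2*0.065). The particular denominators tried below are my own picks within that range.

```python
beta_econ = -0.567  # estimated coefficient for economic freedom

# Democracy coefficients within about 2 standard errors of -0.011
candidates = (-0.141, -0.011, -0.001, 0.011, 0.119)

ratios = {beta_dem: beta_econ / beta_dem for beta_dem in candidates}
for beta_dem, ratio in ratios.items():
    print(f"beta_dem = {beta_dem:+.3f}  ratio = {ratio:8.1f}")
```

The same data are compatible with ratios of roughly 4, 50, several hundred, and negative values as well, so the headline "50 times" is mostly a statement about how close the denominator estimate happened to land to zero.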

Incremental cost-effectiveness ratio

Several years ago Dan Heitjan pointed me to some research on the problem of comparing two treatments that can vary on cost and efficacy.

Suppose the old treatment has cost C1 and efficacy E1, and the new treatment has cost C2 and efficacy E2. The incremental cost-effectiveness ratio is (C2-C1)/(E2-E1). In the usual scenario in which cost and efficacy both increase, we want this ratio to be low: the least additional cost per additional unit of efficacy.

Now suppose that C1,E1,C2,E2 are estimated from data, so that your estimated ratio is (C2.hat-C1.hat)/(E2.hat-E1.hat). No problem, right? No problem . . . as long as the signs of C2-C1 and E2-E1 are clear. But suppose the signs are uncertain--that could happen--so that we are not sure whether the new treatment is actually better, or whether it is actually more expensive.

1. C2 > C1 and E2 > E1. The new treatment costs more and works better. The incremental cost-effectiveness ratio is positive, and we want it to be low.
2. C2 > C1 and E2 < E1. The new treatment costs more and works worse. The incremental cost-effectiveness ratio is negative, and the new treatment is worse no matter what.
3. C2 < C1 and E2 > E1. The new treatment costs less and works better! The incremental cost-effectiveness ratio is negative, and the new treatment is better no matter what.
4. C2 < C1 and E2 < E1. The new treatment costs less and works worse. The incremental cost-effectiveness ratio is positive, and we want it to be high (that is, a large saving in cost for only a small drop in efficacy).

Consider especially quadrants 1 and 4. An estimate or a confidence interval for the incremental cost-effectiveness ratio is meaningless if you don't know what quadrant you're in.
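
The four cases can be sketched as a small Python function. This is an illustration only: the function name, the return format, and the example numbers are hypothetical and do not come from the papers cited below.

```python
def icer_quadrant(delta_cost, delta_eff):
    """Classify an incremental (cost, efficacy) pair into the four cases above.

    Returns (quadrant, icer, reading); icer is None if delta_eff is zero.
    """
    icer = delta_cost / delta_eff if delta_eff != 0 else None
    if delta_cost > 0 and delta_eff > 0:
        return 1, icer, "costs more, works better: lower ratio is better"
    if delta_cost > 0 and delta_eff < 0:
        return 2, icer, "costs more, works worse: new treatment is worse"
    if delta_cost < 0 and delta_eff > 0:
        return 3, icer, "costs less, works better: new treatment is better"
    if delta_cost < 0 and delta_eff < 0:
        return 4, icer, "costs less, works worse: higher ratio is better"
    return None, icer, "on a boundary: no quadrant"

# Two hypothetical comparisons with the same ratio but opposite readings.
same_ratio_up = icer_quadrant(1000.0, 2.0)       # quadrant 1, ratio 500
same_ratio_down = icer_quadrant(-1000.0, -2.0)   # quadrant 4, ratio 500
print(same_ratio_up)
print(same_ratio_down)
```

Both calls return a ratio of 500, yet in quadrant 1 that number is a price to be minimized and in quadrant 4 it is a saving to be maximized, which is why the ratio alone cannot be interpreted.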

Here are the references for this one:

Heitjan, Daniel F., Moskowitz, Alan J. and Whang, William (1999). Bayesian estimation of cost-effectiveness ratios from clinical trials. Health Economics 8, 191-201.

Heitjan, Daniel F., Moskowitz, Alan J. and Whang, William (1999). Problems with interval estimation of the incremental cost-effectiveness ratio. Medical Decision Making 19, 9-15.

Instrumental variables

This is another ratio of regression coefficients. For a weak instrument, the denominator can be so uncertain that its sign could go either way. But if you can't get the sign right for the instrument, the ratio estimate doesn't mean anything. So, paradoxically, when you use a more careful procedure to compute uncertainty in an instrumental variables estimate, you can get huge uncertainty estimates that are inappropriate.

Fieller-Creasy problem

This is the name in classical statistics for estimating the ratio of two parameters that are identified with independent normally distributed data. It's sometimes referred to as the problem of the ratio of two normal means, but I think the above examples are more realistic.

Anyway, the Fieller-Creasy problem is notoriously difficult: how can you get an interval estimate with close to 95% coverage? The problem, again, is that there aren't really any examples where the ratio has any meaning if the denominator's sign is uncertain (at least, none that I know of; as always, I'm happy to be educated further by my correspondents). And all the statistical difficulties in inference here come from problems where the denominator's sign is uncertain.

So I think the Fieller-Creasy problem is a non-problem. Or, more to the point, a problem that there is no point in solving. Which is one reason it's so hard to solve (recall the folk theorem of statistical computing).
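
For readers who want to see what goes wrong, here is a minimal Python sketch of the classical Fieller construction for independent normal estimates. The function name and the example inputs are illustrative assumptions; the second example reuses the regression coefficients from the first section, with the economic-freedom sign again assumed negative.

```python
import math

def fieller_set(a_hat, se_a, b_hat, se_b, z=1.96):
    """Fieller confidence set for a/b given independent normal estimates.

    Returns ("interval", lo, hi) when the set is the bounded interval [lo, hi],
    ("two rays", lo, hi) when it is everything outside (lo, hi), and
    ("whole line", None, None) when no ratio can be excluded.
    """
    # r belongs to the set when (a_hat - r*b_hat)^2 <= z^2 * (se_a^2 + r^2*se_b^2),
    # which rearranges to A*r^2 + B*r + C <= 0.
    A = b_hat**2 - (z * se_b) ** 2
    B = -2.0 * a_hat * b_hat
    C = a_hat**2 - (z * se_a) ** 2
    disc = B**2 - 4.0 * A * C
    if A == 0:
        raise ValueError("degenerate case: denominator exactly at the threshold")
    if disc <= 0:
        # Only possible when A < 0: the quadratic is never positive.
        return ("whole line", None, None)
    root = math.sqrt(disc)
    lo, hi = sorted(((-B - root) / (2 * A), (-B + root) / (2 * A)))
    return ("interval", lo, hi) if A > 0 else ("two rays", lo, hi)

# Denominator clearly away from zero: an ordinary bounded interval.
clear = fieller_set(1.0, 0.1, 2.0, 0.1)
# Denominator within two standard errors of zero: an unbounded set.
unclear = fieller_set(-0.567, 0.179, -0.011, 0.065)
print(clear)
print(unclear)
```

When the denominator's sign is uncertain, the set is no longer an interval at all: it is either two rays or the whole real line, which is the statistical difficulty described above.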

P.S. This all-statistics binge is pretty exhausting! Maybe this one can count as 2 or 3 entries?

## The pervasive twoishness of statistics; in particular, the "sampling distribution" and the "likelihood" are two different models, and that's a good thing

Lots of good statistical methods make use of two models. For example:

- Classical statistics: estimates and standard errors using the likelihood function; tests and p-values using the sampling distribution. (The sampling distribution is not equivalent to the likelihood, as has been much discussed, for example in sequential stopping problems.)

- Bayesian data analysis: inference using the posterior distribution; model checking using the predictive distribution (which, again, depends on the data-generating process in a way that the likelihood does not).

- Machine learning: estimation using the data; evaluation using cross-validation (which requires some rule for partitioning the data, a rule that stands outside of the data themselves).

- Bootstrap, jackknife, etc: estimation using an "estimator" (which, I would argue, is based in some sense on a model for the data), uncertainties using resampling (which, I would argue, is close to the idea of a "sampling distribution" in that it requires a model for alternative data that could arise).
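
The classical case in the first bullet can be illustrated with a textbook stopping-rule example, sketched here in Python. The data (9 heads and 3 tails) and the two designs are hypothetical. The likelihood is proportional to theta^9 (1 - theta)^3 under both designs, but the sampling distributions, and hence the p-values, differ.

```python
from math import comb

heads, tails = 9, 3  # hypothetical data, testing whether the coin is fair

# Design 1: flip exactly 12 times.
# One-sided p-value = Pr(at least 9 heads in 12 flips).
n = heads + tails
p_fixed_n = sum(comb(n, k) for k in range(heads, n + 1)) / 2**n

# Design 2: flip until the 3rd tail.
# Seeing at least 9 heads before the 3rd tail is the same event as
# seeing at most 2 tails in the first 11 flips.
m = heads + tails - 1
p_stop_at_3_tails = sum(comb(m, k) for k in range(tails)) / 2**m

print(f"fixed n = 12:     p = {p_fixed_n:.4f}")
print(f"stop at 3rd tail: p = {p_stop_at_3_tails:.4f}")
```

The same data and the same likelihood give two different p-values, one on each side of 0.05, because the second model (the sampling distribution) has changed.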

This commonality across these very different statistical procedures suggests to me that thinking on parallel tracks is an important and fundamental property of statistics. Perhaps, rather than trying to systematize all statistical learning into a single inferential framework (whether it be Neyman-Pearson hypothesis testing, Bayesian inference over graphical models, or some purely predictive behavioristic approach), we would be better off embracing our twoishness.

This relates to my philosophizing with Shalizi on falsification, Popper, Kuhn, and statistics as normal and revolutionary science.

Twoishness also has relevance to statistical practice in focusing one's attention on both parts of the model. To see this, step back for a moment and consider the transition from optimization problems such as "least squares" to model-based inference such as "maximum likelihood under the normal distribution." Moving from the procedure to the model was a step forward in that models can be understood, checked, and generalized, in a way that is more difficult with mere procedures. Or maybe I will take a slightly more cautious and thus defensible position and say that, if the goal is to understand, check, and generalize a learning algorithm (such as least squares), it can help to understand its expression as model-based inference.

Now back to the two levels of models. Once we recognize, for example, that bootstrap inference has two models (the implicit data model underlying the estimator, and the sampling model for the bootstrapping), we can ask questions such as:
- Are the two models coherent? Can we learn anything from the data model that will help with the sampling model, and vice-versa?
- What sampling model should we use? This is often treated as automatic or as somewhat of a technical problem (for example, how do you bootstrap time series data), but ultimately, as with any sampling problem, it should depend on the problem context.
Recognizing the bootstrapping step as a model (rather than simply a computational trick), the user is on the way to choosing the model rather than automatically taking the default.
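
As a rough illustration of how the choice of sampling model matters, the sketch below bootstraps the mean of a simulated autocorrelated series in two ways. Everything here is an assumption made for the example: the AR(1) series, the block length of 20, the number of replicates, and the function names.

```python
import random
from statistics import mean, stdev

random.seed(2)

# Hypothetical autocorrelated series: AR(1) with coefficient 0.8.
n, phi = 400, 0.8
y = [0.0]
for _ in range(n - 1):
    y.append(phi * y[-1] + random.gauss(0, 1))

def iid_resample(data):
    """Sampling model 1: observations are exchangeable."""
    return random.choices(data, k=len(data))

def block_resample(data, block_len=20):
    """Sampling model 2: moving blocks, preserving short-range dependence."""
    out = []
    while len(out) < len(data):
        start = random.randrange(len(data) - block_len + 1)
        out.extend(data[start:start + block_len])
    return out[:len(data)]

def bootstrap_se(data, resample, n_boot=500):
    """Standard error of the sample mean under a given resampling model."""
    return stdev([mean(resample(data)) for _ in range(n_boot)])

se_iid = bootstrap_se(y, iid_resample)
se_block = bootstrap_se(y, block_resample)
print(f"iid bootstrap se:   {se_iid:.3f}")
print(f"block bootstrap se: {se_block:.3f}")
```

The estimator (the sample mean) is identical in both runs; only the sampling model differs, and the two resulting standard errors are not close.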

Where does the twoishness come from? That's something we can discuss. There are aspects of sampling distributions (for example, sequential design) that don't arise in the data at hand, and there are aspects of inference (for example, regularization) that don't come from the sampling distribution. So it makes sense to me that two models are needed.

## Should we always be using the t and robit instead of the normal and logit?

My (coauthored) books on Bayesian data analysis and applied regression are like almost all the other statistics textbooks out there, in that we spend most of our time on the basic distributions such as normal and logistic and then, only as an aside, discuss robust models such as t and robit.

Why aren't the t and robit front and center? Sure, I can see starting with the normal (at least in the Bayesian book, where we actually work out all the algebra), but then why don't we move on immediately to the real stuff?

This isn't just (or mainly) a question of textbooks or teaching; I'm really thinking here about statistical practice. My statistical practice. Should t and robit be the default? If not, why not?

10. Estimating the degrees of freedom in the error distribution isn't so easy, and throwing this extra parameter into the model could make inference unstable.

9. Real data usually don't have outliers. In practice, fitting a robust model costs you more in efficiency than you gain in robustness. It might be useful to fit a contamination model as part of your data cleaning process but it's not necessary once you get serious.

8. If you do have contamination, better to model it directly rather than sweeping it under the rug of a wide-tailed error distribution.

7. Inferential instability: t distributions can yield multimodal likelihoods, which are a computational pain in their own right and also, via the folk theorem, suggest a problem with the model.

6. To make that last argument in reverse: the normal and logistic distributions have various excellent properties which make them work well even if they are not perfect fits to the data.

5. As Jennifer and I discuss in chapters 3 and 4 of our book, the error distribution is not the most important part of a regression model anyway. To the extent there is long-tailed variation, we'd like to explain this through long-tailed predictors or even a latent-variable model if necessary.

4. A related idea is that robust models are not generally worth the effort--it would be better to place our modeling efforts elsewhere.

3. Robust models are, fundamentally, mixture models, and fitting such a model in a serious way requires a level of thought about the error process that is not necessarily worth it. Normal and logistic models have their problems but they have the advantage of being more directly interpretable.

2. The problem is 100% computational. Once Stan is up and running, you'll never see me fit a normal model again.

1. Clippy!

I don't know what to think. Right now I'm leaning toward answer #2 above, but at the same time it's hard for me to imagine such a massive change in statistical practice. It might well be that in most cases the robust model won't make much of a difference, but I'm still bothered that the normal is my default choice. If computation wasn't a constraint, I think I'd want to estimate the t (with some innocuous prior distribution to average over the degrees of freedom and get a reasonable answer in those small-sample problems where the df would not be easy to estimate), or if I had to pick, maybe I'd go with a t with 7 degrees of freedom. Infinity degrees of freedom doesn't seem like a good default choice to me.
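
To give a feel for what the t with 7 degrees of freedom does differently, here is a minimal Python sketch that estimates a single location parameter. It is a toy, not a regression: the scale is assumed known and fixed at 1, the data are made up, and the fit uses the standard EM iteration for a t likelihood.

```python
from statistics import mean

def t_location(y, nu=7.0, scale=1.0, n_iter=100):
    """Location MLE under a t_nu error model with known scale, fit by EM.

    Each iteration is a weighted mean; points far from the current
    estimate get weight (nu + 1) / (nu + r^2) and so count for less.
    """
    mu = mean(y)
    for _ in range(n_iter):
        w = [(nu + 1.0) / (nu + ((yi - mu) / scale) ** 2) for yi in y]
        mu = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    return mu

# Hypothetical data: eight well-behaved points and one wild value.
y = [-1.2, -0.5, -0.1, 0.0, 0.3, 0.4, 0.9, 1.1, 15.0]
normal_fit = mean(y)       # the normal-model estimate is the plain average
t7_fit = t_location(y)     # t with 7 degrees of freedom
print(f"normal: {normal_fit:.2f}   t_7: {t7_fit:.2f}")
```

With no wild value the two estimates nearly coincide, which is consistent with the guess that the robust model often won't make much of a difference.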

## 30 days of statistics

I was talking with a colleague about one of our research projects and said that I would write something up, if blogging didn't get in the way. She suggested that for the next month I just blog about my research ideas.

So I think I'll do that. This means no mocking of plagiarists, no reflections on literature, no answers to miscellaneous questions about how many groups you need in a multilevel model, no rants about economists, no links to pretty graphs, etc., for 30 days.

Meanwhile, I have a roughly 30-day backlog. So after my next 30 days of stat blogging, the backlog will gradually appear. There's some good stuff there, including reflections on Milos, a (sincere) tribute to the haters, an updated Twitteo Killed the Bloggio Star, a question about acupuncture, and some remote statistical modeling advice I gave that actually worked! I'm sure you'll enjoy it.

But you'll have to wait for all that fun stuff. For the next thirty days, it's statistics research every day.

P.S. If anything comes up that's too topical to be held for a month, I'll post it on one of the sister blogs.

P.P.S. As always, my cobloggers can feel free to post here whenever they want on whatever they want.

P.P.P.S. We'll soon be moving the blog to a new site. I'll make an exception to the all-statistics-research rule to update you on that when it occurs.

P.P.P.P.S. To anybody whose comments don't appear: As noted earlier, we get thousands of spam comments per hour, so (a) some legitimate comments get caught by the spam filter, and (b) it's impossible for us to look through the spam to see if anything real got stuck there. Try registering as a commenter, that might help. Or maybe things will be better in a few days with the new blog software.

## A survey's not a survey if they don't tell you how they did it

Since we're on the topic of nonreplicable research . . . see here (link from here) for a story of a survey that's so bad that the people who did it won't say how they did it.

I know too many cases where people screwed up in a survey when they were actually trying to get the right answer, for me to trust any report of a survey that doesn't say what they did.

I'm reminded of this survey which may well have been based on a sample of size 6 (again, the people who did it refused to release any description of methodology).

## Hey, good news! Your p-value just passed the 0.05 threshold!

E. J. Wagenmakers writes:

Here's a link for you. The first sentences tell it all:
Climate warming since 1995 is now statistically significant, according to Phil Jones, the UK scientist targeted in the "ClimateGate" affair. Last year, he told BBC News that post-1995 warming was not significant--a statement still seen on blogs critical of the idea of man-made climate change. But another year of data has pushed the trend past the threshold usually used to assess whether trends are "real."

Now I [Wagenmakers] don't like p-values one bit, but even people who do like them must cringe when they read this. First, this apparently is a sequential design, so I'm not sure what sampling plan leads to these p-values. Secondly, comparing significance values suggests that the data have suddenly crossed some invisible line that divided nonsignificant from significant effects (as you pointed out in your paper with Hal Stern). Ugh!

I share Wagenmakers's reaction. There seems to be some confusion here between inferential thresholds and decision thresholds. Which reminds me how much I hate the old 1950s literature (both classical and Bayesian) on inference as decision, loss functions for estimators, and all the rest. I think the p-value serves a role in summarizing certain aspects of a model's fit to data but I certainly don't think it makes sense as any kind of decision threshold (despite that it is nearly universally used as such to decide on acceptance of research in scientific journals).
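
The "invisible line" can be shown with a few lines of Python. The z-scores are hypothetical, and the comparison treats the two estimates as independent with equal standard errors, which is a simplifying assumption (two years of overlapping temperature data would not be independent).

```python
from math import sqrt
from statistics import NormalDist

def two_sided_p(z):
    """Two-sided p-value for a z-score under a standard normal reference."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical z-scores for the same trend estimated in two successive years.
z_last_year, z_this_year = 1.9, 2.1
p_last, p_this = two_sided_p(z_last_year), two_sided_p(z_this_year)

# Compare the two estimates directly, assuming independence and equal
# standard errors (an assumption made only for this illustration).
z_diff = (z_this_year - z_last_year) / sqrt(2)
p_diff = two_sided_p(z_diff)

print(f"last year: p = {p_last:.3f}; this year: p = {p_this:.3f}")
print(f"difference between the two: p = {p_diff:.2f}")
```

One estimate falls on each side of 0.05, yet the difference between them is nowhere near statistically significant.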

## Traffic Prediction

I always thought predicting traffic for a particular day and time would be something easily predicted from historic data with regression. Google Maps now has this feature:

It would be good to actually include season, holiday and similar information: the predictions would be better. I wonder if one can find this data easily, or if others have done this work before.

## "Sampling: Design and Analysis": a course for political science graduate students

Early this afternoon I made the plan to teach a new course on sampling, maybe next spring, with the primary audience being political science Ph.D. students (although I hope to get students from statistics, sociology, and other departments). Columbia already has a sampling course in the statistics department (which I taught for several years); this new course will be centered around political science questions. Maybe the students can start by downloading data from the National Election Studies and General Social Survey and running some regressions, then we can back up and discuss what is needed to go further.

About an hour after discussing this new course with my colleagues, I (coincidentally) received the following email from Mike Alvarez:

If you were putting together a reading list on sampling for a grad course, what would you say are the essential readings? I thought I'd ask you because I suspect you might have taught something along these lines.

I pointed Mike here and here.

To which Mike replied:

I wasn't too far off your approach to teaching this. I agree with your blog posts that the Groves et al. book is the best basic text to use on survey methodology that is currently out there. On sampling I have in the past relied on some sort of nonlinear combination of Kish and a Wiley text by Levy and Lemeshow, though that was unwieldy for students. I'll have to look more closely at Lohr, my impression of it when I glanced at it was like yours, that it sort of underrepresented some of the newer topics.

I think Lohr's book is great, but it might not be at quite the right level for political science students. I want something that is (a) more practical and (b) more focused on regression modeling rather than following the traditional survey sampling textbook approach of just going after the population mean. I like the Groves et al. book but it's more of a handbook than a textbook. Maybe I'll have to put together a set of articles. Also, I'm planning to do it all in R. Stata might make more sense but I don't know Stata.

Any other thoughts and recommendations would be appreciated.

## Research Directions for Machine Learning and Algorithms

After reading this from John Langford:

The Deep Learning problem remains interesting. How do you effectively learn complex nonlinearities capable of better performance than a basic linear predictor? An effective solution avoids feature engineering. Right now, this is almost entirely dealt with empirically, but theory could easily have a role to play in phrasing appropriate optimization algorithms, for example.

Does this sound related to modeling the deep interactions you often talk about? (I [Jimmy] never understand the stuff on hunch, but thought that might be so?)

My reply: I don't understand that stuff on hunch so well either--he uses slightly different jargon than I do! That said, it looks interesting and important so I'm pointing you all to it.

## Why your Klout score is meaningless

As a Ph D statistician and search quality engineer, I [Braunstein] know a lot about how to properly measure things. In the past few months I've become an active Twitter user and very interested in measuring the influence of individuals. Klout provides a way to measure influence on Twitter using a score also called Klout. The range is 0 to 100. Light users score below 20, regular users around 30, and celebrities start around 75. Naturally, I was intrigued by the Klout measurement, but a careful analysis led to some serious issues with the score. . . .

Braunstein continues with some comparisons of different Twitter users and how their Klout scores don't make much sense. I don't really see the point of the Klout scores in the first place: I guess they're supposed to be a quick measure to use in pricing advertising? Whatever, I don't really care.

What did interest me was a remark on Braunstein's blog:

Everything in life can be measured. Some quantities live on natural measurement scales: height, weight, temperature, etc. Some quantities are derived measurements: happiness, deliciousness, hunger, etc. Though all useful measurements, research has repeatedly shown derived measurements to be inconsistent and not trustworthy individually. Specifically, if two individuals tell you their happiness levels are an 8 and a 9 on a scale of 10, we have no way to know: what this means for each individual without significant amounts of context, and which individual is "happier" even if 8 is less than 9.

I think Braunstein is on to something but I would frame it slightly differently. Happiness is typically defined as responses in a survey interview. So, no, I wouldn't want to call it a derived quantity. I'd rather call it a subjective measurement.

The problem with the Klout score is not that it's subjective but that it's cloudy: we don't know what it is. To understand a cloudy measurement, one has to poke it from the outside. With Google, people have done this using google fights, google trends, etc., and they find interesting things. Braunstein tries this with Klout and finds confusion.

Which makes sense given that Klout itself seems like a tool for . . . selling itself! Sort of like other notorious rating schemes such as the Places Rated Almanac and the U.S. News college ratings.

Let's try a baseball analogy. Hits, walks, home runs, stolen bases, even goofy statistics such as "saves"--all of these are direct measurements or nearly so (accepting that we still have to handle decisions about errors, sacrifices, etc.). Batting average, on-base percentage, earned run average, etc.: these are derived quantities. Some more esoteric derived quantities such as Runs Produced will be accepted only if they offer some benefits in understanding.

## An unexpected benefit of Arrow's other theorem

In my remarks on Arrow's theorem (the weak form of Arrow's Theorem is that any result can be published no more than five times. The strong form is that every result will be published five times), I meant no criticism of Bruno Frey, the author of the articles in question: I agree that it can be a contribution to publish in multiple places. Regarding the evaluation of contributions, it should be possible to evaluate research contributions and also evaluate communication. One problem is that communication is both under- and over-counted. It's undercounted in that we mostly get credit for original ideas not for exposition; it's overcounted in that we need communication skills to publish in the top journals. But I don't think these two biases cancel out.

The real reason I'm bringing this up, though, is because Arrow's theorem happened to me recently and in an interesting way. Here's the story.

Two years ago I was contacted by Harold Kincaid to write a chapter on Bayesian statistics for the Oxford Handbook of the Philosophy of the Social Sciences. I typically decline such requests because I don't know that people often read handbooks anymore, but in this case I said yes, because for about 15 years I'd been wanting to write something on the philosophy of Bayesian inference but had never gotten around to collecting my thoughts on the topic. While writing the article for Kincaid, I realized I'd like to reach a statistical audience also, so I enlisted the collaboration of Cosma Shalizi. After quite a bit of effort, we wrote an article that was promptly rejected by a statistics journal. We're now revising and I'm sure it will appear somewhere. (I liked the original a lot but the revision will be much better.)

In the meantime, though, we completed the chapter for the handbook. It overlaps with our journal article but we're aiming for different audiences.

Then came opportunity #3: I was asked if I wanted to contribute something to an online symposium on the philosophy of statistics. I took this as an opportunity to express my views as clearly and succinctly as possible. Again, there's overlap with the two previous papers but I felt that for some reason I was able to make my point more directly on this third try.

The symposium article is still under revision and I'll post it when it's done, but here's how the first draft begins:

Abstract

The frequentist approach to statistics is associated with a deductivist philosophy of science that follows Popper's doctrine of falsification. In contrast, Bayesian inference is associated with inductive reasoning and the idea that a model can be dethroned by a competing model but can never be falsified on its own.

The purpose of this article is to break these associations, which I think are incorrect and have been detrimental to statistical practice, in that they have steered falsificationists away from the very useful tools of Bayesian inference and have discouraged Bayesians from checking the fit of their models. From my experience using and developing Bayesian methods in social and environmental science, I have found model checking and falsification to be central in the modeling process.

1. The standard view of the philosophy of statistics, and its malign influence on statistical practice

Statisticians can be roughly divided into two camps, each with a clear alignment of practice and philosophy. I will divide some of the relevant adjectives into two columns:

| Frequentist | Bayesian |
|---|---|
| Objective | Subjective |
| Procedures | Models |
| P-values | Bayes factors |
| Deduction | Induction |
| Falsification | Pr (model is true) |

I shall call this the standard view of the philosophy of statistics and abbreviate it as S. The point of this article is that S is a bad idea and that one can be a better statistician--and a better philosopher--by picking and choosing among the two columns rather than simply choosing one.

## Statistical methods for healthcare regulation: rating, screening and surveillance

Here is my discussion of a recent article by David Spiegelhalter, Christopher Sherlaw-Johnson, Martin Bardsley, Ian Blunt, Christopher Wood and Olivia Grigg, that is scheduled to appear in the Journal of the Royal Statistical Society:

I applaud the authors' use of a mix of statistical methods to attack an important real-world problem. Policymakers need results right away, and I admire the authors' ability and willingness to combine several different modeling and significance testing ideas for the purposes of rating and surveillance.

That said, I am uncomfortable with the statistical ideas here, for three reasons. First, I feel that the proposed methods, centered as they are around data manipulation and corrections for uncertainty, have serious defects compared to a more model-based approach. My problem with methods based on p-values and z-scores--however they happen to be adjusted--is that they draw discussion toward error rates, sequential analysis, and other technical statistical concepts. In contrast, a model-based approach draws discussion toward the model and, from there, the process being modeled. I understand the appeal of p-value adjustments--lots of quantitatively-trained people know about p-values--but I'd much rather draw the statistics toward the data rather than the other way around. Once you have to bring out the funnel plot, this is to me a sign of (partial) failure, that you're talking about properties of a statistical summary rather than about the underlying process that generates the observed data.

My second difficulty is closely related: to me, the mapping seems tenuous from statistical significance to the ultimate healthcare and financial goals. I'd prefer a more direct decision-theoretic approach that focuses on practical significance.

That said, the authors of the article under discussion are doing the work and I'm not. I'm sure they have good reasons for using what I consider to be inferior methods, and I believe that one of the points of this discussion is to give them a chance to give this explanation.

Finally, I am glad that these methods result in ratings rather than rankings. As has been discussed by Louis (1984), Lockwood et al. (2002), and others, two huge problems arise when constructing ranks from noisy data. First, with unbalanced data (for example, different sample sizes in different hospitals) there is no way to simultaneously get reasonable point estimates of parameters and their rankings. Second, ranks are notoriously noisy. Even with moderately large samples, estimated ranks are unstable and can be misleading, violating well-known principles of quality control by encouraging decision makers to chase noise rather than understanding and reducing variation (Deming, 2000). Thus, although I am unhappy with the components of the methods being used here, I like some aspects of the output.
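
The instability of ranks is easy to reproduce in a toy simulation. The Python sketch below is purely illustrative: the 20 hospitals, their evenly spaced true performance, and the noise level are invented and are not taken from the article under discussion.

```python
import random

random.seed(3)

# Hypothetical setup: 20 hospitals whose true performance is evenly spaced
# from -1 to 1, each measured with independent noise of standard deviation 1.
n_hosp, noise_sd, n_sims = 20, 1.0, 5000
truth = [-1 + 2 * j / (n_hosp - 1) for j in range(n_hosp)]
best = n_hosp - 1  # index of the hospital that is truly best

top_hits = 0
est_ranks_of_best = []
for _ in range(n_sims):
    est = [t + random.gauss(0, noise_sd) for t in truth]
    order = sorted(range(n_hosp), key=lambda j: est[j], reverse=True)
    rank = order.index(best) + 1   # 1 = estimated best
    est_ranks_of_best.append(rank)
    top_hits += rank == 1

p_top = top_hits / n_sims
worst_rank = max(est_ranks_of_best)
print(f"truly best hospital is ranked first in {p_top:.0%} of simulations")
print(f"its worst estimated rank across simulations: {worst_rank}")
```

In this setup the truly best hospital is usually not ranked first, which is the sense in which a decision maker acting on the estimated ranks would be chasing noise.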

## Works well versus well understood

John Cook discusses the John Tukey quote, "The test of a good procedure is how well it works, not how well it is understood." Cook writes:

At some level, it's hard to argue against this. Statistical procedures operate on empirical data, so it makes sense that the procedures themselves be evaluated empirically.

But I [Cook] question whether we really know that a statistical procedure works well if it isn't well understood. Specifically, I'm skeptical of complex statistical methods whose only credentials are a handful of simulations. "We don't have any theoretical results, but hey, it works well in practice. Just look at the simulations."

Every method works well on the scenarios its author publishes, almost by definition. If the method didn't handle a scenario well, the author would publish a different scenario.

I agree with Cook but would give a slightly different emphasis. I'd say that a lot of methods can work when they are done well. See the second meta-principle listed in my discussion of Efron from last year. The short story is: lots of methods can work well if you're Tukey. That doesn't necessarily mean they're good methods. What it means is that you're Tukey. I also think statisticians are overly impressed by the appreciation of their scientific collaborators. Just cos a Nobel-winning biologist or physicist or whatever thinks your method is great, it doesn't mean your method is in itself great. If Brad Efron or Don Rubin had come through the door bringing their methods, Mister Nobel Prize would probably have loved them too.

Second, and back to the original quote above, Tukey was notorious for developing methods that were based on theoretical models and then rubbing out the traces of the theory and presenting the methods alone. For example, the hanging rootogram makes some sense--if you think of counts as following Poisson distributions. This predilection of Tukey's makes a certain philosophical sense (see my argument a few months ago) but I still find it a bit irritating to hide one's traces even for the best of reasons.

## Lottery probability update

It was reported last year that the national lottery of Israel featured the exact same 6 numbers (out of 45) twice in the same month, and statistics professor Isaac Meilijson of Tel Aviv University was quoted as saying that "the incident of six numbers repeating themselves within a month is an event of once in 10,000 years."

I shouldn't mock when it comes to mathematics--after all, I proved a false theorem once! (Or, to be precise, my collaborator and I published a false claim which we thought we'd proved, thus we thought was a theorem.)

So let me retract the mockery and move, first to the mathematics and then to the statistics.
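
As a starting point for the mathematics, the basic counting can be sketched in Python. The number of draws per month is an assumption made for illustration (roughly two draws a week); it is not stated in the report quoted above.

```python
from math import comb

n_tickets = comb(45, 6)           # number of possible draws: 8,145,060
p_specific_pair = 1 / n_tickets   # two given draws match exactly

def p_any_repeat(k, n=n_tickets):
    """Probability that k independent draws contain at least one repeat."""
    p_all_distinct = 1.0
    for i in range(k):
        p_all_distinct *= 1 - i / n
    return 1 - p_all_distinct

# Assumption for illustration only: 8 draws in a month.
draws_per_month = 8
p_month = p_any_repeat(draws_per_month)
print(f"possible combinations: {n_tickets:,}")
print(f"Pr(some repeat among {draws_per_month} draws): {p_month:.2e}")
```

The answer depends heavily on what is being counted: a match between two specific draws, any repeat within one particular month, or any repeat in any month over many years.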

## Rechecking the census

Sam Roberts writes:

The Census Bureau [reported] that though New York City's population reached a record high of 8,175,133 in 2010, the gain of 2 percent, or 166,855 people, since 2000 fell about 200,000 short of what the bureau itself had estimated.

Public officials were incredulous that a city that lures tens of thousands of immigrants each year and where a forest of new buildings has sprouted could really have recorded such a puny increase.

How, they wondered, could Queens have grown by only one-tenth of 1 percent since 2000? How, even with a surge in foreclosures, could the number of vacant apartments have soared by nearly 60 percent in Queens and by 66 percent in Brooklyn?

That does seem a bit suspicious. So the newspaper did its own survey:

## Handling multiple versions of an outcome variable

I have a question for you about what to do in a situation where you have two measures of your dependent variable and no prior reasons to strongly favor one over the other.

Here's what brings this up: I'm working on a project with Michael Ross where we're modeling transitions to and from democracy in countries worldwide since 1960 to estimate the effects of oil income on the likelihood of those events' occurrence. We've got a TSCS data set, and we're using a discrete-time event history design, splitting the sample by regime type at the start of each year and then using multilevel logistic regression models with parametric measures of time at risk and random intercepts at the country and region levels. (We're also checking for the usefulness of random slopes for oil wealth at one or the other level and then including them if they improve a model's goodness of fit.) All of this is being done in Stata with the gllamm module.

Our problem is that we have two plausible measures of those transition events. Unsurprisingly, the results we get from the two DVs differ, sometimes not by much but in a few cases to a non-trivial degree. The conventional solution to this problem seems to be to pick one version as the "preferred" measure and then report results from the other version in footnotes as a sensitivity analysis (invariably confirming the other results, of course; when's the last time you saw a sensitivity analysis in a published paper that didn't back up the "main" findings?). I just don't like that solution, though, because it sweeps under the rug some uncertainty that's arguably as informative as the results from either version alone. At the same time, it seems a little goofy just to toss both sets of results on the table and then shrug in cases where they diverge non-trivially.

Do you know of any elegant solutions to this problem? I recall seeing a paper last year that used Bayesian methods to average across estimates from different versions of a dependent variable, but I don't think that paper used multilevel models and am assuming the math required is much more complicated (i.e., there isn't a package that does this now).

My quick suggestion would be to add the two measures and then use the sum as the outcome. If it's a continuous measure there's no problem (although you'd want to prescale the measures so that they're roughly on a common scale before you add them). If they are binary outcomes you can just fit an ordered logit.
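Here is a minimal sketch of what building the combined outcome might look like. The data and the function names are made up for illustration, and fitting the ordered logit itself is left to whatever package you prefer.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale a continuous measure to mean 0, sd 1 so two measures share a scale."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def combine_continuous(a, b):
    """Sum two continuous measures after putting them on a common scale."""
    return [x + y for x, y in zip(standardize(a), standardize(b))]


def combine_binary(a, b):
    """Sum two 0/1 codings: 0 = neither codes an event, 1 = they disagree, 2 = both do."""
    return [x + y for x, y in zip(a, b)]


# Hypothetical country-year codings of "transition occurred" under two measures.
measure_a = [0, 1, 0, 1, 0]
measure_b = [0, 1, 1, 0, 0]
combined = combine_binary(measure_a, measure_b)
print(combined)
```

The 0/1/2 outcome is ordered, which is why an ordered logit is a natural fit for it.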

Jay liked my suggestion but added:

One hitch for our particular problem, though: because we're estimating event history models, the alternate versions of the DV (which is binary) also come with alternate versions of a couple of the IVs: time at risk and counts of prior events. I can't see how we could accommodate those differences in the framework you propose. Basically, we've got two alternate universes (or two alternate interpretations of the same universe), and the differences permeate both sides of the equation. Sometimes I really wish I worked in the natural sciences...

My suggestion would be to combine the predictors in some way as well.

## NYT Labs releases Openpaths, a utility for saving your iPhone data

Jake Porway writes:

We launched Openpaths the other week. It's a site where people can privately upload and view their iPhone location data (at least until an Apple update wipes it out) and also download their data for their own use. More than just giving people a neat tool to view their data with, however, we're also creating an option for them to donate their data to research projects at varying levels of anonymity. We're still working out the terms for that, but we'd love any input and to get in touch with anyone who might want to use the data.

I don't have any use for this personally but maybe it will interest some of you.

From the webpage:

## Improvement of 5 MPG: how many more auto deaths?

This entry was posted by Phil Price.

A colleague is looking at data on car (and SUV and light truck) collisions and casualties. He's interested in causal relationships. For instance, suppose car manufacturers try to improve gas mileage without decreasing acceleration. The most likely way they will do that is to make cars lighter. But perhaps lighter cars are more dangerous; how many more people will die for each mpg increase in gas mileage?

There are a few different data sources, all of them seriously deficient from the standpoint of answering this question. Deaths are very well reported, so if someone dies in an auto accident you can find out what kind of car they were in, what other kinds of cars (if any) were involved in the accident, whether the person was a driver or passenger, and so on. But it's hard to normalize: OK, I know that N people who were passengers in a particular model of car died in car accidents last year, but I don't know how many passenger-miles that kind of car was driven, so how do I convert this to a risk? I can find out how many cars of that type were sold, and maybe even (through registration records) how many are still on the road, but I don't know the total number of miles. Some types of cars are driven much farther than others, on average.

Most states also have data on all accidents in which someone was injured badly enough to go to the hospital. This lets you look at things like: given that the car is in an accident, how likely is it that someone in the car will die? This sort of analysis makes heavy cars look good (for the passengers in those vehicles; not so good for passengers in other vehicles, which is also a phenomenon of interest!) but perhaps this is misleading: heavy cars are less maneuverable and have longer stopping distance, so perhaps they're more likely to be in an accident in the first place. Conceivably, a heavy car might be a lot more likely to be in an accident, but less likely to kill the driver if it's in one, compared to a lighter car that is better for avoiding accidents but more dangerous if it does get hit.

Confounding every question of interest is that different types of driver prefer different cars. Any car that is driven by a disproportionately large fraction of men in their late teens or early twenties is going to have horrible accident statistics, whereas any car that is selected largely by middle-aged women with young kids is going to look pretty good. If 20-year-old men drove Volvo station wagons, the Volvo station wagon would appear to be one of the most dangerous cars on the road, and if 40-year-old women with 5-year-old kids drove Ferraris, the Ferrari would seem to be one of the safest.
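To make the driver-mix point concrete, here is a toy calculation. Every number in it is hypothetical: the crash rates and driver shares are invented, and by construction the car itself has no effect on risk at all.

```python
# Hypothetical annual crash rate per driver; by assumption it depends only on
# who is driving, not on which car they drive.
crash_rate = {"young_men": 0.10, "mothers": 0.02}

# Hypothetical share of each driver type among owners of each car.
driver_mix = {
    "Volvo wagon": {"young_men": 0.1, "mothers": 0.9},
    "Ferrari": {"young_men": 0.9, "mothers": 0.1},
}


def observed_rate(mix):
    """Crash rate you would see in the data for a car with this mix of drivers."""
    return sum(share * crash_rate[group] for group, share in mix.items())


observed = {car: observed_rate(mix) for car, mix in driver_mix.items()}

# Swap the clientele: now the Volvo gets the young men and the Ferrari the mothers.
swapped = {
    "Volvo wagon": observed_rate(driver_mix["Ferrari"]),
    "Ferrari": observed_rate(driver_mix["Volvo wagon"]),
}

print(observed)
print(swapped)
```

The ranking of the two cars flips when the drivers are swapped, even though neither car is any safer than the other in this setup.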

There are lots of other confounders, too. Big engines and heavy frames cost money to make, so inexpensive cars tend to be light and to have small engines, in addition to being physically small. They also tend to have less in the way of safety features (no side-curtain airbags, for example). If an inexpensive car has a poor safety record, is it because it's light, because it's small, or because it's lacking safety features? And yes, size matters, not just weight: a bigger car can have a bigger "crumple zone" and thus lower average acceleration if it hits a solid object, for example. If large, heavy cars really are safer than small, light cars, how much of the difference is due to size and how much is due to weight? Perhaps a large, light car would be the best, but building a large, light car would require special materials, like titanium or aluminum or carbon fiber, which might make it a lot more expensive...what, if anything, do we want to hold constant if we increase the fleet gas mileage? Cost? Size?

And of course the parameters I've listed above --- size, weight, safety features, and driver characteristics --- don't begin to cover all of the relevant factors.

So: is it possible to untangle the causal influence of various factors?

Most people who are involved in this research topic appear to rely on linear or logistic regression, controlling for various explanatory variables, and make various interpretations based on the regression coefficients, r-squared values, etc. Is this the best that can be done? And if so, how does one figure out the right set of explanatory variables?

This is a "causal inference" question, and according to the title of this blog, this blog should be just the place for this sort of thing. So, bring it on: where do I look to find the right way to answer this kind of question?

(And, by the way, what is the answer to the question I posed at the end of this causal inference discussion?)

## The happiness gene: My bottom line (for now)

I had a couple of email exchanges with Jan-Emmanuel De Neve and James Fowler, two of the authors of the article on the gene that is associated with life satisfaction which we blogged the other day. (Bruno Frey, the third author of the article in question, is out of town according to his email.) Fowler also commented directly on the blog.

I won't go through all the details, but now I have a better sense of what's going on. (Thanks, Jan and James!) Here's my current understanding:

1. The original manuscript was divided into two parts: an article by De Neve alone published in the Journal of Human Genetics, and an article by De Neve, Fowler, Frey, and Nicholas Christakis submitted to Econometrica. The latter paper repeats the analysis from the Adolescent Health survey and also replicates with data from the Framingham heart study (hence Christakis's involvement).

The Framingham study measures a slightly different gene and uses a slightly different life-satisfaction question compared to the Adolescent Health survey, but De Neve et al. argue that they're close enough for the study to be considered a replication. I haven't tried to evaluate this particular claim but it seems plausible enough. They find an association with p-value of exactly 0.05. That was close! (For some reason they don't control for ethnicity in their Framingham analysis--maybe that would pull the p-value to 0.051 or something like that?)

2. Their gene is correlated with life satisfaction in their data and the correlation is statistically significant. The key to getting statistical significance is to treat life satisfaction as a continuous response rather than to pull out the highest category and call it a binary variable. I have no problem with their choice; in general I prefer to treat ordered survey responses as continuous rather than discarding information by combining categories.

3. But given their choice of a continuous measure, I think it would be better for the researchers to stick with it and present results as points on the 1-5 scale. From their main regression analysis on the Adolescent Health data, they estimate the effect of having two (compared to zero) "good" alleles as 0.12 (+/- 0.05) on a 1-5 scale. That's what I think they should report, rather than trying to use simulation to wrestle this into a claim about the probability of describing oneself as "very satisfied."

They claim that having the two alleles increases the probability of describing oneself as "very satisfied" by 17%. That's not 17 percentage points, it's 17%, thus increasing the probability from 41% to 1.17*41% = 48%. This isn't quite the 46% that's in the data but I suppose the extra 2% comes from the regression adjustment. Still, I don't see this as so helpful. I think they'd be better off simply describing the estimated improvement as 0.1 on a 1-5 scale. If you really really want to describe the result for a particular category, I prefer percentage points rather than percentages.
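The percent versus percentage-point distinction is easy to check with the numbers above. This is just the arithmetic from the previous paragraph written out; the variable names are mine.

```python
baseline = 0.41           # share "very satisfied", as quoted above
relative_increase = 0.17  # the reported 17% increase

# A 17% (relative) increase multiplies the baseline.
relative_result = baseline * (1 + relative_increase)

# A 17 percentage point increase would instead add to it.
additive_result = baseline + 0.17

change_in_points = (relative_result - baseline) * 100

print(f"17% relative increase:    {relative_result:.0%}")
print(f"17 percentage points:     {additive_result:.0%}")
print(f"relative change, in points: {change_in_points:.1f}")
```

So the 17% claim amounts to a shift of about 7 percentage points, not 17.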

4. Another advantage of describing the result as 0.1 on a 1-5 scale is that it is more consistent with intuitive notions of 1% of variance explained. It's good they have this 1% in their article--I should present such R-squared summaries in my own work, to give a perspective on the sizes of the effects that I find.

5. I suspect the estimated effect of 0.1 is an overestimate. I say this for the usual reason, discussed often on this blog, that statistically significant findings, by their very nature, tend to be overestimates. I've sometimes called this the statistical significance filter, although "hurdle" might be a more appropriate term.
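A quick simulation shows how the filter works. The standard error of 0.05 matches the one quoted above; the true effect of 0.05 is purely an assumption for illustration, as are the function name and the simulation settings.

```python
import random
from statistics import mean


def significance_filter(true_effect, se, n_sims=20000, seed=1):
    """Simulate unbiased estimates and compare all of them to the 'significant' ones."""
    rng = random.Random(seed)
    estimates = [rng.gauss(true_effect, se) for _ in range(n_sims)]
    # Keep only estimates that clear the conventional 5% threshold.
    significant = [e for e in estimates if abs(e) > 1.96 * se]
    return mean(estimates), mean([abs(e) for e in significant])


# Assumed true effect of 0.05 on the 1-5 scale, with standard error 0.05.
all_mean, significant_mean = significance_filter(true_effect=0.05, se=0.05)
print(f"mean of all estimates:              {all_mean:.3f}")
print(f"mean magnitude, significant only:   {significant_mean:.3f}")
```

Any estimate that passes the threshold has to be at least 1.96 standard errors from zero, so when the true effect is smaller than that, the significant estimates exaggerate it on average.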

6. Along with the 17% number comes a claim that having one allele gives an 8% increase. 8% is half of 17% (subject to rounding) and, indeed, their estimate for the one-allele case comes from their fitted linear model. That's fine--but the data aren't really informative about the one-allele case! I mean, sure, the data are perfectly consistent with the linear model, but the nature of leverage is such that you really don't get a good estimate on the curvature of the dose-response function. (See my 2000 Biostatistics paper for a general review of this point.) The one-allele estimate is entirely model-based. It's fine, but I'd much prefer simply giving the two-allele estimate and then saying that the data are consistent with a linear model, rather than presenting the one-allele estimate as a separate number.

7. The news reports were indeed horribly exaggerated. No fault of the authors but still something to worry about. The Independent's article was titled, "Discovered: the genetic secret of a happy life," and the Telegraph's was not much better: "A "happiness gene" which has a strong influence on how satisfied people are with their lives, has been discovered." An effect of 0.1 on a 1-5 scale: an influence, sure, but a "strong" influence?

8. There was some confusion with conditional probabilities that made its way into the reports as well. From the Telegraph:

The results showed that a much higher proportion of those with the efficient (long-long) version of the gene were either very satisfied (35 per cent) or satisfied (34 per cent) with their life - compared to 19 per cent in both categories for those with the less efficient (short-short) form.

After looking at the articles carefully and having an email exchange with De Neve, I can assure you that the above quote is indeed wrong, which is really too bad because it was an attempted correction of an earlier mistake. The correct numbers are not 35, 34, 19, 19. Rather, they are 41, 46, 37, 44. A much less dramatic difference: changes of 4% and 2% rather than 16% and 15%. The Telegraph reporter was giving P(gene|happiness) rather than P(happiness|gene). What seems to have happened is that he misread Figure 2 in the Human Genetics paper. He then may have got stuck on the wrong track by expecting to see a difference of 17%.
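To see how different the two conditional probabilities can be, here is a sketch that flips them with Bayes' rule. It uses the rounded figures from the original post on this study (20%, 45%, and 35% of the sample with 0, 1, and 2 alleles; 37%, 38%, and 41% of each group "very satisfied"). I am assuming these rounded inputs are good enough for the comparison, so don't expect an exact match to the published figure.

```python
# Rounded sample shares by number of alleles, as quoted in the original post.
share = {0: 0.20, 1: 0.45, 2: 0.35}
# P(very satisfied | number of alleles), from the same post.
p_very_given_gene = {0: 0.37, 1: 0.38, 2: 0.41}

# Overall share of respondents who are very satisfied.
p_very = sum(share[k] * p_very_given_gene[k] for k in share)

# Bayes' rule: P(number of alleles | very satisfied).
p_gene_given_very = {k: share[k] * p_very_given_gene[k] / p_very for k in share}

for k in sorted(share):
    print(
        f"{k} alleles: P(very satisfied | gene) = {p_very_given_gene[k]:.0%}, "
        f"P(gene | very satisfied) = {p_gene_given_very[k]:.0%}"
    )
```

With these inputs, P(gene|happiness) for the zero-allele group comes out near the 19% in the Telegraph quote, which mostly reflects how common each genotype is in the sample rather than how happy its carriers are.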

9. The abstract for the Human Genetics paper reports a p-value of 0.01. But the baseline model (Model 1 in Table V of the Econometrica paper) reports a p-value of 0.02. The lower p-values are obtained by models that control for a big pile of intermediate outcomes.

10. In section 3 of the Econometrica paper, they compare identical to fraternal twins (from the Adolescent Health survey, it appears) and estimate that 33% of the variation in reported life satisfaction is explained by genes. As they say, this is roughly consistent with estimates of 50% or so from the literature. I bet their 33% has a big standard error, though: one clue is that the difference in correlations between identical and fraternal twins is barely statistically significant (at the 0.03 level, or, as they quaintly put it, 0.032). They also estimate 0% of the variation to be due to common environment, but again that 0% is gonna be a point estimate with a huge standard error.

I'm not saying that their twin analysis is wrong. To me the point of these estimates is to show that the Adolescent Health data are consistent with the literature on genes and happiness, thus supporting the decision to move on with the rest of their study. I don't take their point estimates of 33% and 0% seriously but it's good to know that the twin results go in the expected direction.

11. One thing that puzzles me is why De Neve et al. only studied one gene. I understand that this is the gene that they expected to relate to happiness and life satisfaction, but . . . given that it only explains 1% of the variation, there must be hundreds or thousands of genes involved. Why not look at lots and lots? At the very least, the distribution of estimates over a large sample of genes would give some sense of the variation that might be expected. I can't see the point of looking at just one gene, unless cost is a concern. Are other gene variants already recorded for the Adolescent Health and Framingham participants?

12. My struggles (and the news reporters' larger struggles) with the numbers in these articles make me feel, even more strongly than before, the need for a suite of statistical methods for building from simple comparisons to more complicated regressions. (In case you're reading this, Bob and Matt3, I'm talking about the network of models.)

As researchers, transparency should be our goal. This is sometimes hindered by scientific journals' policies of brevity. You can end up having to remove lots of the details that make a result understandable.

13. De Neve concludes the Human Genetics article as follows:

There is no single "happiness gene." Instead, there is likely to be a set of genes whose expression, in combination with environmental factors, influences subjective well-being.

I would go even further. Accepting their claim that between one-third and one-half of the variation in happiness and life satisfaction is determined by genes, and accepting their estimate that this one gene explains as much as 1% of the variation, and considering that this gene was their #1 candidate (or at least a top contender) for the "happiness gene" . . . my guess is that the set of genes that influence subjective well-being is a very large number indeed! The above disclaimer doesn't seem disclaimery-enough to me, in that it seems to leave open the possibility that this "set of genes" might be just three or four. Hundreds or thousands seems more like it.

I'm reminded of the recent analysis that found that the simple approach of predicting child's height using a regression model given parents' average height performs much better than a method based on combining 54 genes.

14. Again, I'm not trying to present this as any sort of debunking, merely trying to fit these claims in with the rest of my understanding. I think it's great when social scientists and public health researchers can work together on this sort of study. I'm sure that in a couple of decades we'll have a much better understanding of genes and subjective well-being, but you have to start somewhere. This is a clean study that can be the basis for future research.

Hmmm . . . could I publish this as a letter in the Journal of Human Genetics? Probably not, unfortunately.

P.S. You could do this all yourself! This and my earlier blog on the happiness gene study required no special knowledge of subject matter or statistics. All I did was tenaciously follow the numbers and pull and pull until I could see where all the claims were coming from. A statistics student, or even a journalist with a few spare hours, could do just as well. (Why I had a few spare hours to do this is another question. The higher procrastination, I call it.) I probably could've done better with some prior knowledge--I know next to nothing about genetics and not much about happiness surveys either--but I could get pretty far just tracking down the statistics (and, as noted, without any goal of debunking or any need to make a grand statement).

P.P.S. See comments for further background from De Neve and Fowler!

## Some interesting unpublished ideas on survey weighting

A couple years ago we had an amazing all-star session at the Joint Statistical Meetings. The topic was new approaches to survey weighting (which is a mess, as I'm sure you've heard).

Michael Elliott used mixture models for complex survey design.

And here's my introduction to the session.

## "Discovered: the genetic secret of a happy life"

I took the above headline from a news article in the (London) Independent by Jeremy Laurance reporting a study by Jan-Emmanuel De Neve, James Fowler, and Bruno Frey that reportedly just appeared in the Journal of Human Genetics.

One of the pleasures of blogging is that I can go beyond the usual journalistic approaches to such a story: (a) puffing it, (b) debunking it, (c) reporting it completely flatly. Even convex combinations of (a), (b), (c) do not allow what I'd like to do, which is to explore the claims and follow wherever my exploration takes me. (And one of the pleasures of building my own audience is that I don't need to endlessly explain background detail as was needed on a general-public site such as 538.)

OK, back to the genetic secret of a happy life. Or, in the words of the authors of the study, a gene that "explains less than one percent of the variation in life satisfaction."

"The genetic secret" or "less than one percent of the variation"?

Perhaps the secret of a happy life is in that one percent??

I can't find a link to the journal article, which appears (based on the listing on De Neve's webpage) to be single-authored, but I did find this Googledocs link to a technical report from January 2010 that seems to have all the content. Regular readers of this blog will be familiar with earlier interesting research of Fowler and Frey working separately; I had no idea that they had been collaborating.

De Neve et al. took responses to a question on life satisfaction from a survey that was linked to genetic samples. They looked at a gene called 5HTT which, according to their literature review, has been believed to be associated with happy feelings.

I haven't taken a biology class since 9th grade, so I'll give a simplified version of the genetics. You can have either 0, 1, or 2 alleles of the gene in question. Of the people in the sample, 20% have 0 alleles, 45% have 1 allele, and 35% have 2. The more alleles you have, the happier you'll be (on average): The percentage of respondents describing themselves as "very satisfied" with their lives is 37% for people with 0 alleles, 38% for those with one allele, and 41% for those with two alleles.

The key comparison here comes from the two extremes: 2 alleles vs. 0. People with 2 alleles are 4 percentage points (more precisely, 3.6 percentage points) more likely to report themselves as very satisfied with their lives. The standard error of this difference in proportions is sqrt(.41*(1-.41)/862+.37*(1-.37)/509) = 0.027, so the difference is not statistically significant at a conventional level.
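For anyone who wants to check the arithmetic, here is the same calculation as a small function. The proportions and group sizes are the ones in the formula above; the function name is mine.

```python
from math import sqrt


def diff_in_proportions(p1, n1, p2, n2):
    """Difference between two independent proportions, its standard error, and z."""
    diff = p1 - p2
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return diff, se, diff / se


# 2-allele group: 41% very satisfied, n = 862; 0-allele group: 37%, n = 509.
diff, se, z = diff_in_proportions(0.41, 862, 0.37, 509)
print(f"difference = {diff:.3f}, se = {se:.3f}, z = {z:.2f}")
```

The z-ratio comes out below 1.96, which is what "not statistically significant at a conventional level" means here.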

But in their abstract, De Neve et al. reported the following:

Having one or two alleles . . . raises the average likelihood of being very satisfied with one's life by 8.5% and 17.3%, respectively.

How did they get from a non-significant difference of 4% (I can't bring myself to write "3.6%" given my aversion to fractional percentage points) to a statistically significant 17.3%?

A few numbers that I can't figure out at all!

Here's the summary from Stephen Adams, medical correspondent of the Daily Telegraph:

The researchers found that 69 per cent of people who had two copies of the gene said they were either satisfied (34) or very satisfied (35) with their life as a whole.

But among those who had no copy of the gene, the proportion who gave either of these answers was only 38 per cent (19 per cent 'very satisfied' and 19 per cent 'satisfied').

This leaves me even more confused! According to the table on page 21 of the De Neve et al. article, 46% of people who had two copies of the gene described themselves as satisfied and 41% described themselves as very satisfied. The corresponding percentages for those with no copies were 44% and 37%.

I suppose the most likely explanation is that Stephen Adams just made a mistake, but it's no ordinary confusion because his numbers are so specific. Then again, I could just be missing something big here. I'll email Fowler for clarification but I'll post this for now so you loyal blog readers can see error correction (of one sort or another) in real time.

Where did the 17% come from?

OK, so setting Stephen Adams aside, how can we get from a non-significant 4% to a significant 17%?

- My first try is to use the numerical life-satisfaction measure. Average satisfaction on a 1-5 scale is 4.09 for the 0-allele people in this sample and 4.25 for the 2-allele people, and the difference has a standard error of 0.05. Hey--a difference of 0.16 with a standard error of 0.05--that's statistically significant! So it doesn't seem just like a fluctuation in the data.

- The main analysis of De Neve et al., reported in their Table 1, appears to be a least-squares regression of well-being (on that 1-5 scale), using the number of alleles as a predictor and also throwing in some controls for ethnicity, sex, age, and some other variables. They include error terms for individuals and families but don't seem to report the relative sizes of the errors. In any case, the controls don't seem to do much. Their basic result (Model 1, not controlling for variables such as marital status which might be considered as intermediate outcomes of the gene) yields a coefficient estimate of 0.06.

They then write, "we summarize the results for 5HTT by simulating first differences from the coefficient covariance matrix of Model 1. Holding all else constant and changing the 5HTT gene of all subjects from zero to one long allele would increase the reporting of being very satisfied with one's life in this population by about 8.5%." Huh? I completely don't understand this. It looks to me that the analyses in Table 1 are regressions on the 1-5 scale. So how can they transfer these to claims about "the reporting of being very satisfied"? Also, if it's just least squares, why do they need to work with the covariance matrix? Why can't they just look at the coefficient itself?

- They report (in Table 5) that whites have higher life satisfaction responses than blacks but lower numbers of alleles, on average. So controlling for ethnicity should increase the coefficient. I still can't see it going all the way from 4% to 17%. But maybe this is just a poverty of my intuition.

- OK, I'm still confused and have no idea where the 17% could be coming from. All I can think of is that the difference between 0 alleles and 2 alleles corresponds to an average difference of 0.16 in happiness on that 1-5 scale. And 0.16 is practically 17%, so maybe when you control for things the number jumps around a bit. Perhaps the result of their "first difference" calculations was somehow to carry that 0.16 or 0.17 and attribute it to the "very satisfied" category?

1% of variance explained

One more thing . . . that 1% quote. Remember? "the 5HTT gene explains less than one percent of the variation in life satisfaction." This is from page 14 of the De Neve, Fowler, and Frey article. 1%? How can we understand this?

Let's do a quick variance calculation:

- Mean and sd of life satisfaction responses (on the 1-5 scale) among people with 0 alleles: 4.09 and 0.8
- Mean and sd of life satisfaction responses (on the 1-5 scale) among people with 2 alleles: 4.25 and 0.8
- The difference is 0.16 so the explained variance is (0.16/2)^2 = 0.08^2
- Finally, R-squared is explained variance divided by total variance: (0.08/0.8)^2 = 0.01.
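The steps above fit in a few lines of code. This is only a sketch: the (difference/2)^2 shortcut treats the comparison as two equal-sized groups, and the function lets you vary that share, which is my addition rather than something in the article.

```python
def r_squared_two_groups(diff, total_sd, share=0.5):
    """R-squared from a two-group mean difference.

    The between-group variance is share * (1 - share) * diff**2, which reduces
    to (diff / 2)**2 when the groups are the same size.
    """
    explained_variance = share * (1 - share) * diff**2
    return explained_variance / total_sd**2


# Means of 4.09 and 4.25 on the 1-5 scale, sd of 0.8 within each group.
diff = 4.25 - 4.09
r2 = r_squared_two_groups(diff, total_sd=0.8)
print(f"R-squared = {r2:.3f}")
```

Changing the share away from one-half only makes the explained variance smaller, so 1% is, if anything, on the generous side for this back-of-the-envelope version.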

A difference of 0.16 on a 1-5 scale ain't nothing (it's approximately the same as the average difference in life satisfaction, comparing whites and blacks), especially given that most people are in the 4 and 5 categories. But it only represents 1% of the variance in the data. It's hard for me to hold these two facts in my head at the same time. The quick answer is that the denominator of the R-squared--the 0.8--contains lots of individual variation, including variation in the survey response. Still, 1% is such a small number. No surprise it didn't make it into the newspaper headline . . .

Here's another story of R-squared = 1%. Consider a 0/1 outcome with about half the people in each category. For example, half the people with some disease die in a year and half live. Now suppose there's a treatment that increases survival rate from 50% to 60%. The unexplained sd is 0.5 and the explained sd is 0.05, hence R-squared is, again, 0.01.
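The treatment example can be worked out exactly. I am assuming here that half the patients get the treatment and half don't, which the example doesn't state outright.

```python
# Survival probabilities from the example; the 50/50 treatment split is an assumption.
p_control, p_treated = 0.50, 0.60
share_treated = 0.5

# Overall survival rate and the total variance of the 0/1 outcome.
p_overall = share_treated * p_treated + (1 - share_treated) * p_control
total_variance = p_overall * (1 - p_overall)

# Variance of the group means around the overall mean (the "explained" part).
explained_variance = (
    share_treated * (p_treated - p_overall) ** 2
    + (1 - share_treated) * (p_control - p_overall) ** 2
)

r_squared = explained_variance / total_variance
print(f"explained sd = {explained_variance ** 0.5:.3f}")
print(f"total sd     = {total_variance ** 0.5:.3f}")
print(f"R-squared    = {r_squared:.4f}")
```

A treatment that saves one patient in ten still explains only about 1% of the variance in who lives and who dies.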

Summary (for now):

I don't know where the 17% came from. I'll email James Fowler and see what he says. I'm also wondering about that Daily Telegraph article but it's usually not so easy to reach newspaper journalists so I'll let that one go for now.

P.S. According to his website, Fowler was named the most original thinker of the year by The McLaughlin Group. On the other hand, our sister blog won an award by the same organization that honored Peggy Noonan. So I'd call that a tie!

P.P.S. Their data come from the National Survey of Adolescent Health, which for some reason is officially called "Add Health." Shouldn't that be "Ad Health" or maybe "Ado Health"? I'm confused where the extra "d" is coming from.

P.P.P.S. De Neve et al. note that the survey did not actually ask about happiness, only about life satisfaction. We all know people who appear satisfied with their lives but don't seem so happy, but the presumption is that, in general, things associated with more life satisfaction are also associated with happiness. The authors also remark upon the limitations of using a sample of adolescents to study life satisfaction. Not their fault--as is appropriate, they use the data they have and then discuss the limitations of their analysis.

P.P.P.P.S. De Neve and Fowler have a related paper with a nice direct title, "The MAOA Gene Predicts Credit Card Debt." This one, also from Add Health, reports: "Having one or both MAOA alleles of the low efficiency type raises the average likelihood of having credit card debt by 14%." For some reason I was having difficulty downloading the pdf file (sorry, I have a Windows machine!) so I don't know how to interpret the 14%. I don't know if they've looked at credit card debt and life satisfaction together. Being in debt seems unsatisfying; on the other hand you could go in debt to buy things that give you satisfaction, so it's not clear to me what to expect here.

P.P.P.P.P.S. I'm glad Don Rubin didn't read the above-linked article. Footnote 9 would probably make him barf.

P.P.P.P.P.P.S. Just to be clear: The above is not intended to be a "debunking" of the research of De Neve, Fowler, and Frey. It's certainly plausible that this gene could be linked to reported life satisfaction (maybe, for example, it influences the way that people respond to survey questions). I'm just trying to figure out what's going on, and, as a statistician, it's natural for me to start with the numbers.

P.^7S. James Fowler explains some of the confusion in a long comment.

## My talk at Hunter College on Thurs

Thurs 5 May at 11am at Roosevelt House, at 47-49 East 65th Street (north side of East 65th street, between Park and Madison Avenues).

## Peter Huber's reflections on data analysis

Peter Huber's most famous work derives from his paper on robust statistics published nearly fifty years ago in which he introduced the concept of M-estimation (a generalization of maximum likelihood) to unify some ideas of Tukey and others for estimation procedures that were relatively insensitive to small departures from the assumed model.
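For readers who haven't seen M-estimation in action, here is a minimal sketch of a Huber-type estimate of location, computed by iteratively reweighted averaging. The data are made up, and the tuning constant 1.345 and the MAD-based scale are common default choices rather than anything taken from Huber's book.

```python
from statistics import mean, median


def huber_location(xs, k=1.345, tol=1e-9, max_iter=100):
    """Huber M-estimate of location via iteratively reweighted averaging."""
    center = median(xs)
    # Robust scale: median absolute deviation, rescaled to match the sd under normality.
    scale = 1.4826 * median([abs(x - center) for x in xs])
    if scale == 0:
        return center
    for _ in range(max_iter):
        # Points within k scale units keep full weight; farther points are downweighted.
        weights = [
            1.0 if x == center else min(1.0, k * scale / abs(x - center)) for x in xs
        ]
        new_center = sum(w * x for w, x in zip(weights, xs)) / sum(weights)
        if abs(new_center - center) < tol:
            return new_center
        center = new_center
    return center


# Hypothetical measurements with one wild value.
data = [9.8, 10.1, 10.0, 9.9, 10.2, 50.0]
plain_mean = mean(data)
robust = huber_location(data)
print(f"mean = {plain_mean:.2f}, Huber estimate = {robust:.2f}")
```

The single outlier drags the ordinary mean far from the bulk of the data, while the Huber estimate stays close to it: that is the "relatively insensitive to small departures" idea in miniature.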

Huber has in many ways been ahead of his time. While remaining connected to the theoretical ideas from the early part of his career, his interests have shifted to computational and graphical statistics. I never took Huber's class on data analysis--he left Harvard while I was still in graduate school--but fortunately I have an opportunity to learn his lessons now, as he has just released a book, "Data Analysis: What Can Be Learned from the Past 50 Years."

The book puts together a few articles published in the past 15 years, along with some new material. Many of the examples are decades old, which is appropriate given that Huber is reviewing fifty years of the development of his ideas. (I used to be impatient with statistics books that were full of dead examples but then I started to realize this was happening to me! The 8 schools experiments are almost 35 years old. The Electric Company is 40. The chicken brains are over 20 years old. The radon study is 15 years old, the data from the redistricting study are from the 1960s and 1970s, and so on. And of course even my more recent examples are getting older at the rate of one year per year and don't keep so well once they're out of the fridge. So at this point in my career I'd like to make a virtue of necessity and say that it's just fine to work with old examples that we really understand.)

OK. As noted, Huber is modern--a follower of Tukey--in his treatment of computing and graphics as central to the statistical enterprise. His ISP software is R-like (as we would say now; of course ISP came first), and the principle of interactivity was important. He also has worked on various graphical methods for data exploration and dimension reduction; although I have not used these programs myself, I view them as close in spirit to the graphical tools that we now use to explore our data in the context of our fitted models.

Right now, data analysis seems dominated by three approaches:
- Machine learning
- Bayes
- Graphical exploratory data analysis
with some overlap, of course.

Many other statistical approaches/methods exist (e.g., time series/spatial, generalized estimating equations, nonparametrics, even some old-fashioned extensions of Fisher, Neyman, and Pearson), but they seem more along the lines of closed approaches to "inference" rather than open-ended tools for "data analysis."

I like Huber's pluralistic perspective, which ranges from contamination models to object-oriented programming, from geophysics to data cleaning. His is not a book to turn to for specific advice; rather, I enjoyed reading his thoughts on a variety of statistical issues and reflecting upon the connections between Huber's strategies for data analysis and his better-known theoretical work.

Huber writes:

Too much emphasis is put on futile attempts to automate non-routine tasks, and not enough effort is spent on facilitating routine work.

I really like this quote and would take it a step further: If a statistical method can be routinized it can be used much more often and its limitations better understood.

Huber also writes:

The interpretation of the results of goodness-of-fit tests must rely on judgment of content rather than on P-values.

This perspective is commonplace today but, as Huber writes, "for a traditional mathematical statistician, the implied primacy of judgment over mathematical proof and over statistical significance clearly goes against the grain." The next question is where the judgment comes from. One answer is that an experienced statistician might work on a few hundred applied problems during his or her career, and that will impart some judgment. But what advice can we give to people without such a personal history? My approach has been to put as many of the lessons I have learned as I can into the methods in my books, but Huber is surely right that any collection of specific instructions will miss something.

It is an occupational hazard of all scholars to have an incomplete perspective on work outside their own subfield. For example, section 5.2 of the book in question contains the following disturbing (to me) claim: "Bayesian statistics lacks a mechanism for assessing goodness-of-fit in absolute terms. . . . Within orthodox Bayesian statistics, we cannot even address the question whether a model Mi, under consideration at stage i of the investigation, is consonant with the data y."

Huh? Huh? Also please see chapter 6 of Bayesian Data Analysis and my article, "A Bayesian formulation of exploratory data analysis and goodness-of-fit testing," which appeared in the International Statistical Review in 2003. (Huber's chapter 5 was written in 2000 so too soon for my 2003 paper, but the first edition of our book and our paper on posterior predictive checks had already appeared several years before.)

Just to be clear: I'm not faulting Huber for not citing my work. The statistics literature is huge and ever-expanding. It's just unfortunate that such a basic misunderstanding--the idea that Bayesians can't check their models--persists.

I like what Huber writes about approximately specified models, and I think he'd be very comfortable with our formulation of Bayesian data analysis, from the very first page of our book, as comprising three steps: (1) Model building, (2) Inference, (3) Model checking. Step 3 is crucial to making steps 1 and 2 work. Statisticians have written a lot about the problems with inference in a world in which models are tested--and that's fine, such biases are a worthy topic of study--but consider the alternative, in which models were fit without ever being checked. This would be horrible indeed.

Here's a quote that is all too true (from section 5.7, following a long and interesting discussion of a decomposition of a time series in physics):

For some parts of the model (usually the less interesting ones) we may have an abundance of degrees of freedom, and a scarcity for the interesting parts.

This reminds me of a conversation I've had with Don Rubin in the context of several different examples. Like many (most?) statisticians, my tendency is to try to model the data. Don, in contrast, prefers to set up a model that matches what the scientists in the particular field of application are studying. He doesn't worry so much about fit to the data and doesn't do much graphing. For example, in the schizophrenics' reaction time example (featured in the mixture-modeling chapter of Bayesian Data Analysis), we used the model Don recommended of a mixture of normal distributions with a fixed lag between them. Looking at the data and thinking about the phenomenon, a fixed lag didn't make sense to me, but Don emphasized that the psychology researchers were interested in an average difference and so it didn't make sense in his perspective to try to do any further modeling on these data. He said that if we wanted to model the variation of the lag, that would be fine but it would make sense to gather more data rather than knocking ourselves out on this particular small data set. In a field such as international relations, this get-more-data approach might not work, but in experimental psychology it seems like a good idea. (And I have to admit that I have not at all kept up with whatever research has been done in eye-tracking and schizophrenia in the past twenty years.)

This all reminds me of another story from when I was teaching at Berkeley. Phil Price and I had two students working with us on hierarchical modeling to estimate the distributions of home radon in U.S. counties. One day, one of the students simply quit. Why? He said he just wasn't comfortable with the Bayesian approach. I was used to the old-style Berkeley environment so just accepted it: the kid had been indoctrinated and I didn't have the energy to try to unbrainwash him. But Phil was curious. Having just completed a cross-validation demonstrating how well Bayes was working in this example, Phil asked the student what he would do in our problem instead of a hierarchical model: if the student had a better idea, Phil said, we'd be happy to test it out. The student thought for a moment and said, well, I suppose Professor X (one of my colleagues from down the hall at the time) would say that the solution is to gather more data. At this point Phil blew up. Gather more data! We already have measurements from 80,000 houses! Could you tell us how many more measurements you think you'd need? The student had no answer to this but remained steadfast in his discomfort with the idea of performing statistical inference using conditional probability.

I think that student, and others like him, would benefit from reading Huber's book and realizing that even a deep theoretician saw the need for using a diversity of statistical methods.

Also relevant to those who worship the supposed purity of likelihoods, or permutation tests, or whatever, is this line from Huber's book:

A statistician rarely sees the raw data themselves--most large data collections in the sciences are being heavily preprocessed already in the collection stage, and the scientists not only tend to forget to mention it, but sometimes they also forget exactly what they had done.

We're often torn between modeling the raw data and modeling the processed data. The latter choice can throw away important information but has the advantage not only of computational convenience but also, sometimes, of conceptual simplicity: processed data are typically closer to the form of the scientific concepts being modeled. For example, an economist might prefer to analyze some sort of preprocessed price data rather than data on individual transactions. Sure, there's information in the transactions but, depending on the context of the analysis, this behavioral story might distract from the more immediate goals of the economist. Other times, though, the only way to solve a problem is to go back to the raw data, and Huber provides several such examples in his book.

I will conclude with a discussion of a couple of Huber's examples that overlap with my own applied research.

Radon. In section 3.8, Huber writes:

We found (through exploratory data analysis of a large environmental data set) that very high radon levels were tightly localized and occurred in houses sitting on the locations of old mine shafts. . . . The issue here is one of "data mining" in the sense of looking for a rare nugget, not one of looking, like a traditional statistician, "for a central tendency, a measure of variability, measures of pairwise association between a number of variables." Random samples would have been useless, too: either one would have missed the exceptional values altogether, or one would have thrown them out as outliers.

I'm not so sure. Our radon research was based on two random samples, one of which, as noted above, included 80,000 houses. I agree that if you have a nonrandom sample of a million houses, it's a good idea to use it for some exploratory analysis, so I'm not at all knocking what Huber has done, but I think he's a bit too quick to dismiss random samples as "useless." Also, I don't buy his claim that extreme values, if found, would've been discarded as outliers. The point about outliers is that you look at them, you don't just throw them out!

Aggregation. In chapter 6, Huber deplores that not enough attention is devoted to Simpson's paradox. But then he demonstrates the idea with two fake-data examples. If a problem is important, I think it should be important enough to appear in real data. I recommend our Red State Blue State article for starters.

Survey data. In section 7.2, Huber analyzes data from a small survey of the opinions of jurors. When I looked at the list of survey items, I immediately thought of how I would reverse some of the scales to put everything in the same direction (this is basic textbook advice). Huber ends up doing this too, but only after performing a singular value decomposition. That's fine but in general I'd recommend doing all the easy scalings first so the statistical method has a chance to discover something new. More generally, methods such as singular value decomposition and principal components analyses have their limitations--they can work fine for balanced data such as in this example but in more complicated problems I'd go with item-response or ideal-point models. In general I prefer approaches based on models rather than algorithms: when a model goes wrong I can look for the assumption that was violated, whereas when an algorithm spits out a result that doesn't make sense, I'm not always sure how to proceed. This may be a matter of taste or emphasis more than anything else; see my discussion on Tukey's philosophy.

The next example in Huber's book is the problem of reconstructing maps. I think he'd be interested in the work of Josh Tenenbaum and his collaborators on learning structured models such as maps and trees. Multidimensional scaling is fine--Huber gives a couple of references from 1970--but we can do a lot more now!

In conclusion, I found Huber's book to be enjoyable and thought provoking. It's good to have a sense of what a prominent theoretical statistician thinks about applied statistics.

## My NOAA story

I recently learned we have some readers at the National Oceanic and Atmospheric Administration so I thought I'd share an old story.

About 35 years ago my brother worked briefly as a clerk at NOAA in their D.C. (or maybe it was D.C.-area) office. His job was to enter the weather numbers that came in. He had a boss who was very orderly. At one point there was a hurricane that wiped out some weather station in the Caribbean, and his boss told him to put in the numbers anyway. My brother protested that they didn't have the data, to which his boss replied: "I know what the numbers are."

Nowadays we call this sort of thing "imputation" and we like it. But not in the raw data! I bet nowadays they have an NA code.

## Job opening at NIH for an experienced statistician

This announcement might be of interest to some of you. The application deadline is in just a few days:

The National Center for Complementary and Alternative Medicine at the National Institutes of Health is seeking an additional experienced statistician to join our Office of Clinical and Regulatory Affairs team. www.usajobs.gov is accepting applications through April 22, 2011 for the general announcement and April 21 for status (typically current federal employee) candidates. To apply to this announcement or for more information, click on the links provided below or the USAJobs link provided above and search for NIH-NCCAM-DE-11-448747 (external) or NIH-NCCAM-MP-11-448766 (internal).

You have to be a U.S. citizen for this one.

## More on the correlation between statistical and political ideology

This is a chance for me to combine two of my interests--politics and statistics--and probably to irritate both halves of the readership of this blog. Anyway...

I recently wrote about the apparent correlation between Bayes/non-Bayes statistical ideology and liberal/conservative political ideology:

The Bayes/non-Bayes fissure had a bit of a political dimension--with anti-Bayesians being the old-line conservatives (for example, Ronald Fisher) and Bayesians having more of a left-wing flavor (for example, Dennis Lindley). Lots of counterexamples at an individual level, but my impression is that on average the old curmudgeonly, get-off-my-lawn types were (with some notable exceptions) more likely to be anti-Bayesian.

This was somewhat based on my experiences at Berkeley. Actually, some of the cranky anti-Bayesians were probably Democrats as well, but when they were being anti-Bayesian they seemed pretty conservative.

Recently I received an interesting item from Gerald Cliff, a professor of mathematics at the University of Alberta. Cliff wrote:

I took two graduate courses in Statistics at the University of Illinois, Urbana-Champaign in the early 1970s, taught by Jacob Wolfowitz. He was very conservative, and anti-Bayesian. I admit that my attitudes towards Bayesian statistics come from him. He said that if one has a population with a normal distribution and unknown mean which one is trying to estimate, it is foolish to assume that the mean is random; it is fixed, and currently unknown to the statistician, but one should not assume that it is a random variable.

Wolfowitz was in favor of the Vietnam War, which was still on at the time. He is the father of Paul Wolfowitz, active in the Bush administration.

To which I replied:

Very interesting. I never met Neyman while I was at Berkeley (he had passed away before I got there) but I've heard that he was very liberal politically (as was David Blackwell). Regarding the normal distribution comment below, I would say:

1. Bayesians consider parameters to be fixed but unknown. The prior distribution is a regularization tool that allows more stable estimates.

2. The biggest assumptions in probability models are typically not in the prior distribution but in the data model. In this case, Wolfowitz was willing to assume a normal distribution with no question but then balked at using any knowledge about its mean. It seems odd to me, as a Bayesian, for one's knowledge to be divided so sharply: zero knowledge about the parameter, perfect certainty about the distributional family.
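Point 1 can be illustrated with a minimal sketch of the textbook case Wolfowitz described: a normal population with a fixed but unknown mean. Everything specific here is an assumption made for illustration and is not from the exchange above: the data standard deviation is treated as known, the prior is normal, the numbers are hypothetical, and `posterior_mean_sd` is a made-up helper name.

```python
import math

def posterior_mean_sd(ybar, n, sigma, prior_mean, prior_sd):
    """Posterior mean and sd for a fixed-but-unknown normal mean.

    Assumes a normal prior and a known data standard deviation sigma.
    """
    data_precision = n / sigma**2        # information carried by the sample mean
    prior_precision = 1 / prior_sd**2    # information carried by the prior
    post_var = 1 / (data_precision + prior_precision)
    # Precision-weighted average: the estimate is pulled toward the prior mean.
    post_mean = post_var * (data_precision * ybar + prior_precision * prior_mean)
    return post_mean, math.sqrt(post_var)

# Hypothetical numbers: same sample mean, small versus large sample.
mean_small, sd_small = posterior_mean_sd(ybar=2.0, n=4, sigma=2.0,
                                         prior_mean=0.0, prior_sd=1.0)
mean_large, sd_large = posterior_mean_sd(ybar=2.0, n=400, sigma=2.0,
                                         prior_mean=0.0, prior_sd=1.0)
print(mean_small, sd_small)
print(mean_large, sd_large)
```

With little data the estimate sits halfway between the prior mean and the sample mean; with a lot of data the prior barely matters. In this sketch the prior acts as a stabilizer for the estimate rather than as a claim that the mean itself is random.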

To return to the political dimension: From basic principles, I don't see any strong logical connection between Bayesianism and left-wing politics. In statistics, non-Bayesian ("classical") methods such as maximum likelihood are often taken to be conservative, as compared to the more assumption-laden Bayesian approach, but, as Aleks Jakulin and I have argued, the labeling of a statistical method as liberal or conservative depends crucially on what is considered your default.

As statisticians, we are generally trained to respect conservatism, which can sometimes be defined mathematically (for example, nominal 95% intervals that contain the true value more than 95% of the time) and sometimes with reference to tradition (for example, deferring to least-squares or maximum-likelihood estimates). Statisticians are typically worried about messing with data, which perhaps is one reason that the Current Index to Statistics lists 131 articles with "conservative" in the title or keywords and only 46 with the words "liberal" or "radical."

In that sense, given that, until recently, non-Bayesian approaches were the norm in statistics, it was the more radical group of statisticians (on average) who wanted to try something different. And I could see how a real hardline conservative such as Wolfowitz could see a continuity between anti-Bayesian skepticism and political conservatism, just as, on the other side of the political spectrum, a leftist such as Lindley could ally Bayesian thinking with support of socialism, a planned economy, and the like.

As noted above, I don't think these connections make much logical sense but I can see where they were coming from (with exceptions, of course, as noted regarding Neyman above).

## Is it plausible that 1% of people pick a career based on their first name?

In my discussion of the dentists-named-Dennis study, I referred to my back-of-the-envelope calculation that the effect (if it indeed exists) corresponds to an approximate 1% aggregate chance that you'll pick a profession based on your first name. Even if there are nearly twice as many dentist Dennises as would be expected from chance alone, the base rate is so low that a shift of 1% of all Dennises would be enough to do this. My point was that (a) even a small effect could show up when looking at low-frequency events such as the choice to pick a particular career or live in a particular city, and (b) any small effects will inherently be difficult to detect in any direct way.

Uri Simonsohn (the author of the recent rebuttal of the original name-choice article by Brett Pelham et al.) wrote:

## 100-year floods

According to the National Weather Service:

What is a 100 year flood? A 100 year flood is an event that statistically has a 1% chance of occurring in any given year. A 500 year flood has a .2% chance of occurring and a 1000 year flood has a .1% chance of occurring.

The accompanying map shows a part of Tennessee that in May 2010 had 1000-year levels of flooding.

At first, it seems hard to believe that a 1000-year flood would have just happened to occur last year. But then, this is just a 1000-year flood for that particular place. I don't really have a sense of the statistics of these events. How many 100-year, 500-year, and 1000-year flood events have been recorded by the Weather Service, and when have they occurred?
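The "for that particular place" point can be made concrete with a rough calculation. This is only a sketch under a strong and unrealistic assumption, namely that flood events are independent across locations and across years; real floods are spatially correlated, and the count of 1,000 monitored locations is hypothetical. The helper name `prob_at_least_one` is also made up for illustration.

```python
def prob_at_least_one(annual_prob, n_trials):
    """Chance of at least one event in n_trials independent site-years."""
    return 1 - (1 - annual_prob) ** n_trials

# One location, one year: the Weather Service definition of a 1000-year flood.
p_single_site_one_year = prob_at_least_one(0.001, 1)

# One location watched for 30 years.
p_single_site_30_years = prob_at_least_one(0.001, 30)

# A hypothetical 1,000 independent locations watched for one year.
p_1000_sites_one_year = prob_at_least_one(0.001, 1000)

print(p_single_site_one_year, p_single_site_30_years, p_1000_sites_one_year)
```

Under these assumptions, a 1000-year flood somewhere in a given year is more likely than not, even though it remains a 1-in-1000 event at any single place. Answering the actual question still requires the historical record of such events.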

## Maybe a great idea in theory, didn't work so well in practice

I followed the link of commenter "Epanechnikov" to his blog, where I found, among other things, an uncritical discussion of Richard von Mises's book, "Probability, Statistics and Truth."

The bad news is that, based on the evidence of his book, Mises didn't seem to understand basic ideas of statistical significance. See here. Or at the very least, he was grossly overconfident (which can perhaps be seen from the brash title of his book). This is not the fault of "Epanechnikov," but I just thought that people should be careful about taking too seriously the statistical philosophy of someone who didn't think to do a chi-squared test when it was called for. (This is not a Bayesian/non-Bayesian thing; it's just basic statistics.)

## Induction within a model, deductive inference for model evaluation

Jonathan Livengood writes:

I have a couple of questions on your paper with Cosma Shalizi on "Philosophy and the practice of Bayesian statistics."

First, you distinguish between inductive approaches and hypothetico-deductive approaches to inference and locate statistical practice (at least, the practice of model building and checking) on the hypothetico-deductive side. Do you think that there are any interesting elements of statistical practice that are properly inductive? For example, suppose someone is playing around with a system that more or less resembles a toy model, like drawing balls from an urn or some such, and where the person has some well-defined priors. The person makes a number of draws from the urn and applies Bayes theorem to get a posterior. On your view, is that person making an induction? If so, how much space is there in statistical practice for genuine inductions like this?

Second, I agree with you that one ought to distinguish induction from other kinds of risky inference, but I'm not sure that I see a clear payoff from making the distinction. I'm worried because a lot of smart philosophers just don't distinguish "inductive" inferences from "risky" inferences. One reason (I think) is that they have in mind Hume's problem of induction. (Set aside whether Hume ever actually raised such a problem.) Famously, Popper claimed that falsificationism solves Hume's problem. In a compelling (I think) rejoinder, Wes Salmon points out that if you want to do anything with a scientific theory (or a statistical model), then you need to believe that it is going to make good predictions. But if that is right, then a model that survives attempts at falsification and then gets used to make predictions is still going to be open to a Humean attack. In that respect, then, hypothetico-deductivism isn't anti-inductivist after all. Rather, it's a variety of induction and suffers all the same difficulties as simple enumerative induction. So, I guess what I'd like to know is in what ways you think the philosophers are misled here. What is the value / motivation for distinguishing induction from hypothetico-deductive inference? Do you think there is any value to the distinction vis-a-vis Hume's problem? And what is your take on the dispute between Popper and Salmon?

I replied:

My short answer is that inductive inference of the balls-in-urns variety takes place within a model, and the deductive Popperian reasoning takes place when evaluating a model. Beyond this, I'm not so familiar with the philosophy literature. I think of "Popper" more as a totem than as an actual person or body of work. Finally, I recognize that my philosophy, like Popper's, does not say much about where models come from. Crudely speaking, I think of models as a language, with models created in the same way that we create sentences, by working with recursive structures. But I don't really have anything formal to say on the topic.

Livengood then wrote:

The part of Salmon's writing that I had in mind is his Foundations of Scientific Inference. See especially Section 3 on deductivism, starting on page 21.

Let me just press a little bit so that I am sure I'm understanding the proposal. When you say that inductive inference takes place within a model, are you claiming that an inductive inference is justified just to the extent that the model within which the induction takes place is justified (or approximately correct or some such -- I know you won't say "true" here ...)? If so, then under what conditions do you think a model is justified? That is, under what conditions do you think one is justified in making *predictions* on the basis of a model?

My reply: No model will perform well for every kind of prediction. For any particular kind of prediction, we can use posterior predictive checks and related ideas such as cross-validation to see if the model performs well on these dimensions of interest. There will (almost) always be some assumptions required, some sense in which any prediction is conditional on something. Stepping back a bit, I'd say that scientists get experience with certain models, which work well for prediction until they don't. For an example from my own research, consider opinion polling. Those survey estimates you see in the newspapers are conditional on all sorts of assumptions. Different assumptions get checked at different times, often after some embarrassing failure.

## Single or multiple imputation?

Vishnu Ganglani writes:

It appears that multiple imputation is the best way to impute missing data because of the more accurate quantification of variance. However, when imputing missing data for income values in national household surveys, would you recommend maintaining the multiple datasets associated with multiple imputations, or would a single imputation method suffice? I have worked on household survey projects (in Scotland) and in the past have suggested single methods for ease of implementation, but with the availability of open source R software I am thinking of performing multiple imputation. I am a bit apprehensive, though, because of the complexity and the need to maintain multiple datasets (ease of implementation).

My reply: In many applications I've just used a single random imputation to avoid the awkwardness of working with multiple datasets. But if there's any concern, I'd recommend doing parallel analyses on multiple imputed datasets and then combining inferences at the end.
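One standard way to do the "combining inferences at the end" step is with Rubin's combining rules, sketched below. The reply above does not say which combining method to use, so this is an illustration rather than a prescription; the five estimates and standard errors are hypothetical, and `combine_imputations` is a made-up helper name.

```python
import math
from statistics import mean, variance

def combine_imputations(estimates, std_errors):
    """Combine one estimate per imputed dataset using Rubin's rules."""
    m = len(estimates)
    q_bar = mean(estimates)                        # pooled point estimate
    within = mean([se**2 for se in std_errors])    # average within-imputation variance
    between = variance(estimates)                  # between-imputation variance (divides by m - 1)
    total_var = within + (1 + 1 / m) * between     # total variance of the pooled estimate
    return q_bar, math.sqrt(total_var)

# Hypothetical results from the same analysis run on 5 imputed datasets.
estimates = [10.0, 12.0, 11.0, 9.0, 13.0]
std_errors = [2.0, 2.0, 2.0, 2.0, 2.0]

pooled_estimate, pooled_se = combine_imputations(estimates, std_errors)
print(pooled_estimate, pooled_se)
```

The pooled standard error is larger than the standard error from any single imputed dataset because it includes the disagreement between imputations. That extra variance is what a single imputation leaves out.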

## Assumptions vs. conditions, part 2

In response to the discussion of his remarks on assumptions vs. conditions, Jeff Witmer writes:

If [certain conditions hold] , then the t-test p-value gives a remarkably good approximation to "the real thing" -- namely the randomization reference p-value. . . .

I [Witmer] make assumptions about conditions that I cannot check, e.g., that the data arose from a random sample. Of course, just as there is no such thing as a normal population, there is no such thing as a random sample.

I disagree strongly with both the above paragraphs! I say this not to pick a fight with Jeff Witmer but to illustrate how, in statistics, even the most basic points that people take for granted, can't be.

Let's take the claims in order:

1. The purpose of a t test is to approximate the randomization p-value. Not to me. In my world, the purpose of t tests and intervals is to summarize uncertainty in estimates and comparisons. I don't care about a p-value and almost certainly don't care about a randomization distribution. I'm not saying this isn't important, I just don't think it's particularly fundamental. One might as well say that the randomization p-value is a way of approximating the ultimate goal which is the confidence interval.

2. There is no such thing as a random sample. Hey--I just drew a random sample the other day! Well, actually it was a few months ago, but still. It was a sample of records to examine for a court case. I drew random numbers in R and everything.
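For what it's worth, drawing such a sample takes only a few lines. The sketch below uses Python rather than R, and every specific in it is hypothetical (5,000 numbered records, a sample of 100, the seed value); it is not the procedure used in the court case.

```python
import random

# Hypothetical sampling frame: records numbered 1 through 5000.
record_ids = list(range(1, 5001))

# A fixed seed lets the draw be reproduced and audited later.
rng = random.Random(2011)

# Simple random sample without replacement.
sample_ids = sorted(rng.sample(record_ids, k=100))
print(sample_ids[:10])
```

The randomness is in the selection mechanism, which is fully under the analyst's control here. Whether the sampling frame covers the population of interest is a separate question.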

## Assumptions vs. conditions

Jeff Witmer writes:

I noticed that you continue the standard practice in statistics of referring to assumptions; e.g. a blog entry on 2/4/11 at 10:54: "Our method, just like any model, relies on assumptions which we have the duty to state and to check."

I'm in the 6th year of a three-year campaign to get statisticians to drop the word "assumptions" and replace it with "conditions." The problem, as I see it, is that people tend to think that an assumption is something that one assumes, as in "assuming that we have a right triangle..." or "assuming that k is even..." when constructing a mathematical proof.

But in statistics we don't assume things -- unless we have to. Instead, we know that, for example, the validity of a t-test depends on normality, which is a condition that can and should be checked. Let's not call normality an assumption, lest we imply that it is something that can be assumed. Let's call it a condition.

What do you all think?

## On summarizing a noisy scatterplot with a single comparison of two points

John Sides discusses how his scatterplot of unionization rates and budget deficits made it onto cable TV news:

It's also interesting to see how he [journalist Chris Hayes] chooses to explain a scatterplot -- especially given the evidence that people don't always understand scatterplots. He compares pairs of cases that don't illustrate the basic hypothesis of Brooks, Scott Walker, et al. Obviously, such comparisons could be misleading, but given that there was no systematic relationship depicted in that graph, these particular comparisons are not.

This idea--summarizing a bivariate pattern by comparing pairs of points--reminds me of a well-known statistical identity which I refer to in a paper with David Park:

John Sides is certainly correct that if you can pick your pair of points, you can make extremely misleading comparisons. But if you pick every pair of points, and average over them appropriately, you end up with the least-squares regression slope.
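The formula from the paper is not reproduced here, but the identity as it is commonly stated says that the least-squares slope equals the average of all pairwise slopes (y_j - y_i)/(x_j - x_i), weighted by the squared distance (x_j - x_i)^2. Here is a minimal numerical check on a small made-up dataset; the function names and the data are hypothetical.

```python
from itertools import combinations

def ls_slope(x, y):
    """Ordinary least-squares slope of y on x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return sxy / sxx

def pairwise_slope_average(x, y):
    """Average of all pairwise slopes, weighted by squared x-distance."""
    num = 0.0
    den = 0.0
    for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
        dx = xj - xi
        if dx == 0:
            continue  # pairs with equal x get zero weight
        weight = dx ** 2
        num += weight * (yj - yi) / dx
        den += weight
    return num / den

# Made-up data for illustration.
x = [1.0, 2.0, 4.0, 7.0, 8.0]
y = [3.0, 1.0, 6.0, 5.0, 9.0]

slope_ls = ls_slope(x, y)
slope_pairs = pairwise_slope_average(x, y)
print(slope_ls, slope_pairs)
```

The weights explain why a hand-picked pair can mislead: pairs of points far apart in x dominate the average, while any single comparison is only one term in it.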

Pretty cool, and it helps develop our intuition about the big-picture relevance of special-case comparisons.

## Statisticians vs. everybody else

Statisticians are literalists.

When someone says that the U.K. boundary commission's delay in redistricting gave the Tories an advantage equivalent to 10 percent of the vote, we're the kind of person who looks it up and claims that the effect is less than 0.7 percent.

When someone says, "Since 1968, with the single exception of the election of George W. Bush in 2000, Americans have chosen Republican presidents in times of perceived danger and Democrats in times of relative calm," we're like, Hey, really? And we go look that one up too.

And when someone says that engineers have more sons and nurses have more daughters . . . well, let's not go there.

So, when I was pointed to this blog by Michael O'Hare making the following claim, in the context of K-12 education in the United States:

My [O'Hare's] favorite examples of this junk [educational content with no workplace value] are spelling and pencil-and-paper algorithm arithmetic. These are absolutely critical for a clerk in an office of fifty years ago, but being good at them is unrelated to any real mental ability (what, for example, would a spelling bee in Chinese be?) and worthless in the world we live in now. I say this, by the way, aware that I am the best speller that I ever met (and a pretty good typist). But these are idiot-savant abilities, genetic oddities like being able to roll your tongue. Let's just lose them.

My first reaction was: Are you sure? I also have no systematic data on this, but I strongly doubt that being able to spell and add are "unrelated to any real mental ability" and are "genetic oddities like being able to roll your tongue." For one thing, people can learn to spell and add but I think it's pretty rare for anyone to learn how to roll their tongue! Beyond this, I expect that one way to learn spelling is to do a lot of reading and writing, and one way to learn how to add is to do a lot of adding (by playing Monopoly or whatever). I'd guess that these are indeed related to "real mental ability," however that is defined.

My guess is that, to O'Hare, my reactions would miss the point. He's arguing that schools should spend less time teaching kids spelling and arithmetic, and his statements about genetics, rolling your tongue, and the rest are just rhetorical claims. I'm guessing that O'Hare's view on the relation between skills and mental ability, say, is similar to Tukey's attitude about statistical models: they're fine as an inspiration for statistical methods (for Tukey) or as an inspiration for policy proposals (for O'Hare), but should not be taken literally. The things I write are full of qualifications, which might be a real hindrance if you're trying to propose policy changes.

## With a bit of precognition, you'd have known I was going to post again on this topic, and with a lot of precognition, you'd have known I was going to post today

Chris Masse points me to this response by Daryl Bem and two statisticians (Jessica Utts and Wesley Johnson) to criticisms by Wagenmakers et al. of Bem's recent ESP study. I have nothing to add but would like to repeat a couple bits of my discussions of last month, first here:

Classical statistical methods that work reasonably well when studying moderate or large effects (see the work of Fisher, Snedecor, Cochran, etc.) fall apart in the presence of small effects.

I think it's naive when people implicitly assume that either the study's claims are correct or the study's statistical methods are weak. Generally, the smaller the effects you're studying, the better the statistics you need. ESP is a field of small effects and so ESP researchers use high-quality statistics.

To put it another way: whatever methodological errors happen to be in the paper in question probably occur in lots of research papers in "legitimate" psychology research. The difference is that when you're studying a large, robust phenomenon, little statistical errors won't be so damaging as in a study of a fragile, possibly zero effect.

In some ways, there's an analogy to the difficulties of using surveys to estimate small proportions, in which case misclassification errors can loom large.

And here:

[One thing that Bem et al. and Wagenmakers et al. both miss] is that Bayes is not just about estimating the weight of evidence in favor of a hypothesis. The other key part of Bayesian inference--the more important part, I'd argue--is "shrinkage" or "partial pooling," in which estimates get pooled toward zero (or, more generally, toward their estimates based on external information).

Shrinkage is key, because if all you use is a statistical significance filter--or even a Bayes factor filter--when all is said and done, you'll still be left with overestimates. Whatever filter you use--whatever rule you use to decide whether something is worth publishing--I still want to see some modeling and shrinkage (or, at least, some retrospective power analysis) to handle the overestimation problem. This is something Martin and I discussed in our discussion of the "voodoo correlations" paper of Vul et al.
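The overestimation problem can be seen in a small simulation. This is a minimal sketch, not taken from any of the papers under discussion: the true effect (0.1), the standard error (1.0), and the function name `significance_filter_demo` are all assumptions chosen for illustration.

```python
import random

def significance_filter_demo(true_effect=0.1, se=1.0, n_studies=100_000, seed=0):
    """Simulate many studies of one small effect and keep the 'significant' ones.

    All numbers are hypothetical, chosen only to illustrate the point.
    """
    rng = random.Random(seed)
    # Each study reports an unbiased but noisy estimate of the true effect.
    estimates = [rng.gauss(true_effect, se) for _ in range(n_studies)]
    # The filter: keep estimates more than 1.96 standard errors from zero.
    kept = [est for est in estimates if abs(est) > 1.96 * se]
    mean_abs_kept = sum(abs(est) for est in kept) / len(kept)
    return len(kept) / n_studies, mean_abs_kept

share_kept, mean_abs_kept = significance_filter_demo()
print(f"share passing the filter: {share_kept:.3f}")
print(f"average |estimate| among those: {mean_abs_kept:.2f} (true effect: 0.1)")
```

Every estimate that passes the filter is at least 1.96 in magnitude, roughly twenty times the assumed true effect, so the published estimates are overestimates by construction. That is the gap that shrinkage or a retrospective power analysis is meant to address.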

Finally, my argument for why a top psychology journal should never have published Bem's article:

I mean, how hard would it be for the experimenters to gather more data, do some sifting, find out which subjects are good at ESP, etc. There's no rush, right? No need to publish preliminary, barely-statistically-significant findings. I don't see what's wrong with the journal asking for better evidence. It's not like a study of the democratic or capitalistic peace, where you have a fixed amount of data and you have to learn what you can. In experimental psychology, once you have the experiment set up, it's practically free to gather more data.

I made this argument in response to a generally very sensible paper by Tal Yarkoni on this topic.

P.S. Wagenmakers et al. respond (to Bem et al., that is, not to me). As Tal Yarkoni would say, I agree with Wagenmakers et al. on the substantive stuff. But I still think that both they and Bem et al. err in setting up their models so starkly: either there's ESP or there's not. Given the long history of ESP experiments (as noted by some of the commenters below), it seems more reasonable to me to suppose that these studies have some level of measurement error of magnitude larger than that of any ESP effects themselves.

As I've already discussed, I'm not thrilled with the discrete models used in these discussions and I am for some reason particularly annoyed by the labels "Strong," "Substantial," "Anecdotal" in figure 4 of Wagenmakers et al. Whether or not a study can be labeled "anecdotal" seems to me to be on an entirely different dimension than what they're calculating here. Just for example, suppose you conduct a perfect randomized experiment on a large random sample of people. There's nothing anecdotal at all about this (hypothetical) study. As I've described it, it's the opposite of anecdotal. Nonetheless, it might very well be that the effect under study is tiny, in which case a statistical analysis (Bayesian or otherwise) is likely to report no effect. It could fall into the "anecdotal" category used by Wagenmakers et al. But that would be an inappropriate and misleading label.

That said, I think people have to use what statistical methods they're comfortable with, so it's sort of silly for me to fault Wagenmakers et al. for not using the sorts of analysis I would prefer. The key point that they and other critics have made is that the Bem et al. analyses aren't quite as clean as a casual observer might think, and it's possible to make that point coming from various statistical directions. As I note above, my take on this is that if you study very small effects, then no amount of statistical sophistication will save you. If it's really true, as commenter Dean Radin writes below, that these studies "took something like 6 or 7 years to complete," then I suppose it's no surprise that something turned up.

## A departmental wiki page?

I was recently struggling with the Columbia University philosophy department's webpage (to see who might be interested in this stuff). The faculty webpage was horrible: it's just a list of names and links with no information on research interests. So I did some searching on the web and found a wonderful wikipedia page which had exactly what I wanted.

Then I checked my own department's page, and it's even worse than what they have in philosophy! (We also have this page, which is even worse in that it omits many of our faculty and has a bunch of ridiculously technical links for some of the faculty who are included.)

I don't know about the philosophy department, but the statistics department's webpage is an overengineered mess, designed from the outset to look pretty rather than to be easily updated. Maybe we could replace it entirely with a wiki?

In the meantime, if anybody feels like setting up a wikipedia entry for the research of Columbia's statistics faculty, that would be great. As it is, I think it would be difficult for outsiders who don't know us to have any idea of what we do here!

P.S. The political science department's faculty listing is useless as well. We need a wiki for that one too!

P.P.S. The physics department's wikipage is pretty useless for a potential student's purposes, though--lots on history but nothing much on what the faculty are doing now.

## Get the Data

At GetTheData, you can ask and answer data-related questions. Here's a preview:

I'm not sure a Q&A site is the best way to do this.

My pipe dream is to create a taxonomy of variables and instances, and collect spreadsheets annotated this way. Imagine doing a search of type: "give me datasets, where an instance is a person, the variables are age, gender and weight" - and out would come datasets, each one tagged with the descriptions of the variables that were held constant for the whole dataset (person_type=student, location=Columbia, time_of_study=1/1/2009, study_type=longitudinal). It would even be possible to automatically convert one variable into another, if it was necessary (like age = time_of_measurement-time_of_birth). Maybe the dream of the Semantic Web will actually be implemented for relatively structured statistical data rather than for much fuzzier "knowledge"; just consider the difficulties of developing a universal Freebase. Wolfram|Alpha is perhaps currently the closest effort to this idea (consider comparing banana consumption between different countries), but I'm not sure how I can upload my own data or do more complicated data queries - also, for some simple variables (like weight), the results are not very useful.
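The kind of search described above can be sketched in a few lines. This is only an illustration under assumed names: the `Dataset` class, `find_datasets`, `derive_age`, and the two catalog entries are hypothetical and do not correspond to any existing service.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class Dataset:
    name: str
    instance_type: str   # what one row represents, e.g. "person"
    variables: set       # variables recorded for each instance
    constants: dict = field(default_factory=dict)  # held fixed for the whole dataset

def find_datasets(catalog, instance_type, required_variables):
    """Return datasets whose rows are `instance_type` and that carry every required variable."""
    required = set(required_variables)
    return [d for d in catalog
            if d.instance_type == instance_type and required <= d.variables]

def derive_age(time_of_measurement, time_of_birth):
    """Convert one variable into another: age in whole years from two dates."""
    had_birthday = (time_of_measurement.month, time_of_measurement.day) >= (
        time_of_birth.month, time_of_birth.day)
    return time_of_measurement.year - time_of_birth.year - (0 if had_birthday else 1)

# A made-up two-entry catalog, annotated in the style of the paragraph above.
catalog = [
    Dataset("student_health", "person", {"age", "gender", "weight"},
            {"person_type": "student", "location": "Columbia",
             "time_of_study": "1/1/2009", "study_type": "longitudinal"}),
    Dataset("banana_consumption", "country", {"year", "bananas_per_capita"}),
]

matches = find_datasets(catalog, "person", ["age", "gender", "weight"])
print([d.name for d in matches], matches[0].constants)
print(derive_age(date(2009, 1, 1), date(1990, 6, 15)))
```

The hard part, of course, is not the query but agreeing on the taxonomy: the sketch assumes everyone already uses the same names for instance types and variables.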

I've talked about data tools before, as well as about Q&A sites.

## Statistician cracks Toronto lottery

Christian points me to this amusing story by Jonah Lehrer about Mohan Srivastava (perhaps the same person as R. Mohan Srivastava, coauthor of a book called Applied Geostatistics), who discovered a flaw in a scratch-off game in which he could figure out which tickets were likely to win based on partial information visible on the ticket. It appears that scratch-off lotteries elsewhere have similar flaws in their design.

The obvious question is, why doesn't the lottery create the patterns on the tickets (including which "teaser" numbers to reveal) completely at random? It shouldn't be hard to design this so that zero information is supplied from the outside, in which case Srivastava's trick would be impossible.

So why not put down the numbers randomly? Lehrer quotes Srivastava as saying:

The tickets are clearly mass-produced, which means there must be some computer program that lays down the numbers. Of course, it would be really nice if the computer could just spit out random digits. But that's not possible, since the lottery corporation needs to control the number of winning tickets. The game can't be truly random. Instead, it has to generate the illusion of randomness while actually being carefully determined.

I'd phrase this slightly differently. We're talking about \$3 payoffs here, so, no, the corporation does not need to control the number of winning tickets. What they do need to control is the probability of a win, but that can be done using a completely random algorithm.
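As a rough sketch of what a completely random algorithm could look like: decide each ticket independently with a fixed win probability. The win probability of 0.1, the print run of 100,000, and the function name `print_run` are hypothetical, and this is not a description of how any actual lottery produces its tickets.

```python
import random

def print_run(n_tickets, win_probability, seed=None):
    """Decide each ticket independently; True marks a winning ticket."""
    rng = random.Random(seed)
    return [rng.random() < win_probability for _ in range(n_tickets)]

tickets = print_run(n_tickets=100_000, win_probability=0.1, seed=1)
winners = sum(tickets)
print(f"{winners} winners out of {len(tickets)} tickets")
```

The number of winners is no longer fixed in advance, but it is tightly concentrated: with these assumed numbers it is binomial with mean 10,000 and standard deviation of about 95, a small risk when each prize is a few dollars.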

From reading the article, I think the real reason the winning tickets could be predicted is that the lottery tickets were designed to be misleadingly appealing. Lehrer writes:

Instead of just scratching off the latex and immediately discovering a loser, players have to spend time matching up the revealed numbers with the boards. Ticket designers fill the cards with near-misses (two-in-a-row matchups instead of the necessary three) and players spend tantalizing seconds looking for their win. No wonder players get hooked.

"Ticket designers fill the cards with near-misses . . .": This doesn't sound like they're just slapping down random numbers. Instead, the system seems to be rigged in the fashion of old-time carnival games in order to manipulate one's intuition that the probability of near-misses should be informative about the underlying probability of hits. (See here for some general discussion of the use of precursors to estimate the probability of extremely rare events.)

In this sense, the story is slightly more interesting than "Lottery designers made a mistake." The mistake they made is directly connected to the manipulations they make in order to sucker people into spending more money.

P.S. Lehrer writes that Srivastava does consulting. This news story should get him all the business he needs for a while!

## An addition to the model-makers' oath

Yesterday Aleks posted a proposal for a model makers' Hippocratic Oath. I'd like to add two more items:

1. From Mark Palko: "Our model only describes the data we used to build it; if you go outside of that range, you do so at your own risk."

2. In case you like to think of your methods as nonparametric or non-model-based: "Our method, just like any model, relies on assumptions which we have the duty to state and to check."

(Observant readers will see that I use "we" rather than "I" in these two items. Modeling is an inherently collaborative endeavor.)

## Model Makers' Hippocratic Oath

Emanuel Derman and Paul Wilmott wonder how to get their fellow modelers to give up their fantasy of perfection. In a Business Week article they proposed, not entirely in jest, a model makers' Hippocratic Oath:

• I will remember that I didn't make the world and that it doesn't satisfy my equations.
• Though I will use models boldly to estimate value, I will not be overly impressed by mathematics.
• I will never sacrifice reality for elegance without explaining why I have done so. Nor will I give the people who use my model false comfort about its accuracy. Instead, I will make explicit its assumptions and oversights.
• I understand that my work may have enormous effects on society and the economy, many of them beyond my comprehension.

Found via Abductive Intelligence.

## is it possible to "overstratify" when assigning a treatment in a randomized control trial?

| 1 Comment

Peter Bergman writes:

is it possible to "overstratify" when assigning a treatment in a randomized control trial? I [Bergman] have a sample size of roughly 400 people, and several binary variables correlate strongly with the outcome of interest and would also define interesting subgroups for analysis. The problem is, stratifying over all of these (five or six) variables leaves me with strata that have only 1 person in them. I have done some background reading on whether there is a rule of thumb for the maximum number of variables to stratify. There does not seem to be much agreement (some say there should be between N/50-N/100 strata, others say as few as possible). In economics, the paper I looked to is here, which seems to summarize literature related to clinical trials. In short, my question is: is it bad to have several strata with 1 person in them? Should I group these people in with another stratum?

P.S. In the paper I mention above, they also say it is important to include stratum indicators in the regression analysis to ensure the appropriate sized type-I error in the final analysis (i.e. regress outcome on treatment & strata indicators). They demonstrate this through simulation, but is there a reference (or intuition) that shows why these indicators are important theoretically?

My reply: I doubt it matters so much exactly how you do this. If you want, there are techniques to ensure balance over many predictors. In balanced setups, you have ideas such as latin squares, and similar methods can be developed in unbalanced scenarios. It's ok to have strata with one person in them, but if you think people won't like it, then you should feel free to use larger strata.
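For concreteness, here is one simple way to randomize within fine strata, including strata of size one. It is a sketch under stated assumptions rather than a recommended design: the 400 simulated people, the five binary variables, and the function name `stratified_assignment` are all made up for illustration.

```python
import random
from collections import defaultdict

def stratified_assignment(units, seed=0):
    """units: list of (unit_id, stratum_key). Returns {unit_id: 0 or 1}."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for unit_id, stratum in units:
        by_stratum[stratum].append(unit_id)
    assignment = {}
    for members in by_stratum.values():
        rng.shuffle(members)
        # Random starting arm, then alternate, so the two arms differ by at
        # most one within a stratum. A stratum of size one is a coin flip.
        first = rng.randint(0, 1)
        for i, unit_id in enumerate(members):
            assignment[unit_id] = (first + i) % 2
    return assignment

# Hypothetical sample: 400 people, each with five binary stratifying variables.
rng = random.Random(42)
units = [(i, tuple(rng.randint(0, 1) for _ in range(5))) for i in range(400)]
assignment = stratified_assignment(units, seed=1)
print(sum(assignment.values()), "of", len(assignment), "assigned to treatment")
```

Nothing breaks when a stratum holds a single person; that person is simply assigned at random. Merging small strata into larger ones only changes the `stratum_key` passed in.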

P.S. Bergman actually wrote "dummies," but I couldn't bear to see that term so I changed it to "indicators."

## "Roughly 90% of the increase in . . ." Hey, wait a minute!

Matthew Yglesias links approvingly to the following statement by Michael Mandel:

Homeland Security accounts for roughly 90% of the increase in federal regulatory employment over the past ten years.

Roughly 90%, huh? That sounds pretty impressive. But wait a minute . . . what if total federal regulatory employment had increased a bit less. Then Homeland Security could've accounted for 105% of the increase, or 500% of the increase, or whatever. The point is the change in total employment is the sum of a bunch of pluses and minuses. It happens that, if you don't count Homeland Security, the total hasn't changed much--I'm assuming Mandel's numbers are correct here--and that could be interesting.

The "roughly 90%" figure is misleading because, when written as a percent of the total increase, it's natural to quickly envision it as a percentage that is bounded by 100%. There is a total increase in regulatory employment that the individual agencies sum to, but some margins are positive and some are negative. If the total happens to be near zero, then the individual pieces can appear to be large fractions of the total, even possibly over 100%.

I'm not saying that Mandel made any mistakes, just that, in general, ratios can be tricky when the denominator is the sum of positive and negative parts. In this particular case, the margins were large but not quite over 100%, which somehow gives the comparison more punch than it deserves, I think.
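A bit of arithmetic makes the point concrete. The agency names and numbers below are invented for illustration; they are not Mandel's figures.

```python
def share_of_net_change(changes, name):
    """One component's change as a percentage of the net change across all components."""
    net = sum(changes.values())
    return 100 * changes[name] / net

# Hypothetical changes in employment by agency; the net change is +10,000.
changes = {"agency_a": 9000, "agency_b": 4000, "agency_c": -3000}
print(share_of_net_change(changes, "agency_a"))

# Same increase at agency_a, but agency_c shrinks more, so the net is only +2,000.
changes_small_net = {"agency_a": 9000, "agency_b": 4000, "agency_c": -11000}
print(share_of_net_change(changes_small_net, "agency_a"))
```

In the first case agency_a "accounts for" 90% of the increase; in the second, with exactly the same growth at agency_a, it accounts for 450%. The share says as much about the other components as it does about the one being highlighted.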

We discussed a mathematically identical case a few years ago involving the 2008 Democratic primary election campaign.

What should we call this?

There should be a name for this sort of statistical slip-up. The Fallacy of the Misplaced Denominator, perhaps? The funny thing is that the denominator has to be small (so that the numerator seems like a lot, "90%" or whatever) but not too small (because if the ratio is over 100%, the jig is up).

P.S. Mandel replies that, yes, he agrees with me in general about the problems of ratios where the denominator is a sum of positive and negative components, but that in this particular case, "all the major components of regulatory employment change are either positive or a very tiny negative." So it sounds like I was choosing a bad example to make my point!

## Splitting the data

Antonio Rangel writes:

I'm a neuroscientist at Caltech . . . I'm using the debate on the ESP paper, as I'm sure other labs around the world are, as an opportunity to discuss some basic statistical issues/ideas w/ my lab.

Request: Is there any chance you would be willing to share your thoughts about the difference between exploratory "data mining" studies and confirmatory studies? What I have in mind is that one could use a dataset to explore/discover novel hypotheses and then conduct another experiment to test those hypotheses rigorously. It seems that a good combination of both approaches could be the best of both worlds, since the first would lead to novel hypothesis discovery, and the latter to careful testing. . . it is a fundamental issue for neuroscience and psychology.

My reply: I know that people talk about this sort of thing . . . but in any real setting, I think I'd want all my data right now to answer any questions I have. I like cross-validation and have used it with success, but I don't think I could bring myself to keep the split so rigorous as you describe. Once I have the second dataset, I'd form new hypotheses, etc.

Every once in a while, the opportunity presents itself, though. We analyzed the 2000 and 2004 elections using the Annenberg polls. But when we were revising Red State Blue State to cover the 2008 election, the Annenberg data weren't available, so we went with Pew Research polls instead. (The Pew people are great--they post raw data on their website.) In the meantime, the 2008 Annenberg data have been released, so now we can check our results, once we get mrp all set up to do this.
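For a lab that does want to enforce the split Rangel describes on a single dataset, the mechanics are simple. A minimal sketch, assuming the records are independent and identified by subject ID; the function name, the 50/50 split, and the 200 subjects are hypothetical.

```python
import random

def split_exploratory_confirmatory(records, confirm_fraction=0.5, seed=0):
    """Randomly set aside a confirmatory subset before any exploration begins."""
    rng = random.Random(seed)
    shuffled = list(records)   # copy, so the caller's list is left untouched
    rng.shuffle(shuffled)
    n_confirm = int(len(shuffled) * confirm_fraction)
    return shuffled[n_confirm:], shuffled[:n_confirm]

subject_ids = list(range(200))   # stand-in for real subject records
explore, confirm = split_exploratory_confirmatory(subject_ids)
print(len(explore), "subjects to explore;", len(confirm), "held back for testing")
```

The code is the easy part. The discipline is in not touching the held-back half until the hypotheses are written down.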

## Small world: MIT, asymptotic behavior of differential-difference equations, Susan Assmann, subgroup analysis, multilevel modeling

A colleague recently sent me a copy of some articles on the estimation of treatment interactions (a topic that's interested me for a while). One of the articles, which appeared in the Lancet in 2000, was called "Subgroup analysis and other (mis)uses of baseline data in clinical trials," by Susan F. Assmann, Stuart J. Pocock, Laura E. Enos, and Linda E. Kasten. . . .

Hey, wait a minute--I know Susan Assmann! Well, I sort of know her. When I was a freshman in college, I asked my adviser, who was an applied math prof, if I could do some research. He connected me to Susan, who was one of his Ph.D. students, and she gave me a tiny part of her thesis to work on.

The problem went as follows. You have a function f(x), for x going from 0 to infinity, that is defined as follows. Between 0 and 1, f(x)=x. Then, for x higher than 1, f'(x) = f(x) - f(x-1). The goal is to figure out what f(x) does. I think I'm getting this right here, but I might be getting confused on some of the details. The original form of the problem had some sort of probability interpretation, I think--something to do with a one-dimensional packing problem, maybe f(x) was the expected number of objects that would fit in an interval of size x, if the objects were drawn from a uniform distribution. Probably not that, but maybe something of that sort.

One of the fun things about attacking this sort of problem as a freshman is that I knew nothing about the literature on this sort of problem or even what it was called (a differential-difference equation, or it can also be formulated as an integral). Nor was I set up to do any simulations on the computer. I just solved the problem from scratch. First I figured out the function in the range [1,2], [2,3], and so forth, then I made a graph (pencil on graph paper) and conjectured the asymptotic behavior of f. The next step was to prove my conjecture. It ate at me. I worked on the problem on and off for about eleven months, then one day I finally did it: I had carefully proved the behavior of my function! This accomplishment gave me a warm feeling for years after.

I never actually told Susan Assmann about this--I think that by then she had graduated, and I never found out whether she figured out the problem herself as part of her Ph.D. thesis or whether it was never really needed in the first place. And I can't remember if I told my adviser. (He was a funny guy: extremely friendly to everyone, including his freshman advisees, but one time we were in his office when he took a phone call. He was super-friendly during the call, then after the call was over he said, "What an asshole." After this I never knew whether to trust the guy. If he was that nice to some asshole on the phone, what did it mean that he was nice to us?) I switched advisers. The new adviser was much nicer--I knew him because I'd taken a class with him--but it didn't really matter since he was just another mathematician. I was lucky enough to stumble into statistics, but that's another story.

Anyway, it was funny to see that name--Susan Assmann! I did a quick web search and I'm pretty sure it is the same person. And her paper was cited 430 times--that's pretty impressive!

P.S. The actual paper by Assmann et al. is reasonable. It's a review of some statistical practice in medical research. They discuss the futility of subgroup analysis given that, compared to main effects, interactions are typically (a) smaller in magnitude and (b) estimated with larger standard errors. That's pretty much a recipe for disaster! (I made a similar argument in a 2001 article in Biostatistics, except that my article went in depth for one particular model and Assmann et al. were offering more general advice. And, unlike me, they had some data.) Ultimately I do think treatment interactions and subgroup analysis are important, but they should be estimated using multilevel models. If you try to estimate complex interactions using significance tests or classical interval estimation, you'll probably just be wasting your time, for reasons explained by Assmann et al.

## For those of you in the U.K., also an amusing paradox involving the infamous hookah story

I'll be on Radio 4 at 8.40am, on the BBC show "Today," talking about The Honest Rainmaker. I have no idea how the interview went (it was about 5 minutes), but I'm kicking myself because I was planning to tell the hookah story, but I forgot. Here it is:

I was at a panel for the National Institutes of Health evaluating grants. One of the proposals had to do with the study of the effect of water-pipe smoking, the hookah. There was a discussion around the table. The NIH is a United States government organisation; not many people in the US really smoke hookahs; so should we fund it? Someone said, 'Well actually it's becoming more popular among the young.' And if younger people smoke it, they have a longer lifetime exposure, and apparently there is some evidence that the dose you get of carcinogens from hookah smoking might be 20 times the dose of smoking a cigarette. I don't know the details of the math, but it was a lot. So even if not many people do it, if you multiply the risk, you get a lot of lung cancer.

Then someone at the table - and I couldn't believe this - said, 'My uncle smoked a hookah pipe all his life, and he lived until he was 90 years old.' And I had a sudden flash of insight, which was this. Suppose you have something that actually kills half the people. Even if you're a heavy smoker, your chance of dying of lung cancer is not 50 per cent, so therefore, even with something as extreme as smoking and lung cancer, you still have lots of cases where people don't die of the disease. The evidence is certainly all around you pointing in the wrong direction - if you're willing to accept anecdotal evidence - there's always going to be an unlimited amount of evidence which won't tell you anything. That's why the psychology is so fascinating, because even well-trained people make mistakes. It makes you realise that we need institutions that protect us from ourselves.

I think that last bit is the key point: "if you're willing to accept anecdotal evidence, there's always going to be an unlimited amount of evidence which won't tell you anything." Of course, what makes this story work so well is that it's backed up by a personal anecdote!

Damn. I was planning to tell this story but I forgot. Next time I do radio, I'm gonna bring an index card with my key point. Not my 5 key points, not my 3 key points, but my 1 key point. Actually, I'm gonna be on the radio (in Seattle) next Monday afternoon, so I'll have a chance to try this plan then.

## Theoretical vs applied statistics

Anish Thomas writes:

## Tukey's philosophy

The great statistician John Tukey, in his writings from the 1970s onward (and maybe earlier), was time and again making the implicit argument that you should evaluate a statistical method based on what it does; you should *not* be staring at the model that purportedly underlies the method, trying to determine if the model is "true" (or "true enough"). Tukey's point was that models can be great to inspire methods, but the model is the scaffolding; it is the method that is the building you have to live in.

I don't fully agree with this philosophy--I think models are a good way to understand data and also often connect usefully to scientific models (although not as cleanly as is thought by our friends who work in economics or statistical hypothesis testing).

To put it another way: What makes a building good? A building is good if it is useful. If a building is useful, people will use it. Eventually improvements will be needed, partly because the building will get worn down, partly because the interactions between the many users will inspire new, unforeseen uses, partly for the simple reason that if a building is popular, more space will be desired. At that point, work needs to be done. And, at that point, wouldn't it be great if some scaffolding were already around?

That scaffolding that we'd like to have . . . if we now switch the analogy back from buildings to statistical methods, that scaffolding is the model that was used in constructing the method in the first place.

No statistical method is perfect. In fact, it is the most useful, wonderful statistical methods that get the most use and need improvements most frequently. So I like the model and I don't see the virtue in hiding it and letting the method stand alone. The model is the basis for future improvements in many directions. And this is one reason why I think that one of the most exciting areas in statistical research is the systematization of model building. The network of models and all that.

But, even though I don't agree with the implicit philosophy of late Tukey (I don't agree with the philosophy of early Tukey either, with all that multiple comparisons stuff), I think (of course) that he made hugely important contributions. So I'd like to have this philosophy out there for statisticians and users to evaluate on their own.

I have not ever seen Tukey's ideas expressed in this way before (and they're just my own imputation; I only met Tukey once, many years ago, and we spoke for about 30 seconds), so I'm posting them here, on the first day of this new decade.

## Age and happiness: The pattern isn't as clear as you might think

A couple people pointed me to this recent news article which discusses "why, beyond middle age, people get happier as they get older." Here's the story:

When people start out on adult life, they are, on average, pretty cheerful. Things go downhill from youth to middle age until they reach a nadir commonly known as the mid-life crisis. So far, so familiar. The surprising part happens after that. Although as people move towards old age they lose things they treasure--vitality, mental sharpness and looks--they also gain what people spend their lives pursuing: happiness.

This curious finding has emerged from a new branch of economics that seeks a more satisfactory measure than money of human well-being. Conventional economics uses money as a proxy for utility--the dismal way in which the discipline talks about happiness. But some economists, unconvinced that there is a direct relationship between money and well-being, have decided to go to the nub of the matter and measure happiness itself. . . There are already a lot of data on the subject collected by, for instance, America's General Social Survey, Eurobarometer and Gallup. . . .

And here's the killer graph:

All I can say is . . . it ain't so simple. I learned this the hard way. After reading a bunch of articles on the U-shaped relation between age and happiness--including some research that used the General Social Survey--I downloaded the GSS data (you can do it yourself!) and prepared some data for my introductory statistics class. I made a little dataset with happiness, age, sex, marital status, income, and a couple other variables and ran some regressions and made some simple graphs. The idea was to start with the fascinating U-shaped pattern and then discuss what could be learned further using some basic statistical techniques of subsetting and regression.

But I got stuck--really stuck. Here was my first graph, a quick summary of average happiness level (on a 0, 1, 2 scale; in total, 12% of respondents rated their happiness at 0 (the lowest level), 56% gave themselves a 1, and 32% described themselves as having the highest level on this three-point scale). And below are the raw averages of happiness vs. age. (Note: the graph has changed. In my original posted graph, I plotted the percentage of respondents of each age who had happiness levels of 1 or 2; this corrected graph plots average happiness levels.)

Uh-oh. I did this by single years of age so it's noisy--even when using decades of GSS, the sample's not infinite--but there's nothing like the famous U-shaped pattern! Sure, if you stare hard enough, you can see a U between ages 35 and 70, but the behavior from 20-35 and from 70-90 looks all wrong. There's a big difference between the published graph, which has maxima at 20 and 85, and my graph from GSS, which has minima at 20 and 85.

There are a lot of ways these graphs could be reconciled. There could be cohort or period effects, perhaps I should be controlling for other variables, maybe I'm using a bad question, or maybe I simply miscoded the data. All of these are possibilities. I spent several hours staring at the GSS codebook and playing with the data in different ways and couldn't recover the U. Sometimes I could get happiness to go up with age, but then it was just a gradual rise from age 18, without the dip around age 45 or 50. There's a lot going on here and I very well may still be missing something important. [Note: I imagine that sort of cagey disclaimer is typical of statisticians: by our training we are so aware of uncertainty. Researchers in other fields don't seem to feel the same need to do this.]

Anyway, at some point in this analysis I was getting frustrated at my inability to find the U (I felt like the characters in that old movie they used to show on TV on New Year's Eve, all looking for "the big W") and beginning to panic that this beautiful example was too fragile to survive in the classroom.

So I called Grazia Pittau, an economist (!) with whom I'd collaborated on some earlier happiness research (in which I contributed multilevel modeling and some ideas about graphs but not much of substance regarding psychology or economics). Grazia confirmed to me that the U-shaped pattern is indeed fragile, that you have to work hard to find it, and often it shows up when people fit linear and quadratic terms, in which case everything looks like a parabola. (I'd tried regressions with age & age-squared, but it took a lot of finagling to get the coefficient for age-squared to have the "correct" sign.)
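Grazia's point about quadratic terms can be checked on made-up data. In the sketch below, "happiness" is flat until age 50 and then rises, so there is no mid-life dip at all; the data, the 0.01-per-year slope, and the helper `fit_quadratic` (a plain least-squares fit written out by hand so the example needs no libraries) are all assumptions for illustration, not GSS results.

```python
def fit_quadratic(xs, ys):
    """Least-squares fit of y = a + b*x + c*x**2; returns (a, b, c)."""
    # Normal equations for the columns 1, x, x^2, as a 3x4 augmented matrix.
    s = [sum(x ** k for x in xs) for k in range(5)]                    # sums of x^0..x^4
    t = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]    # sums of y*x^0..y*x^2
    m = [[s[0], s[1], s[2], t[0]],
         [s[1], s[2], s[3], t[1]],
         [s[2], s[3], s[4], t[2]]]
    # Gauss-Jordan elimination; the matrix is positive definite, so pivots are nonzero.
    for i in range(3):
        pivot = m[i][i]
        m[i] = [v / pivot for v in m[i]]
        for j in range(3):
            if j != i:
                factor = m[j][i]
                m[j] = [vj - factor * vi for vj, vi in zip(m[j], m[i])]
    return m[0][3], m[1][3], m[2][3]

ages = list(range(18, 86))
# Hypothetical pattern: flat until 50, then a steady rise. No U in the data.
happiness = [1.0 + 0.01 * max(0, age - 50) for age in ages]

xs = [(age - 50) / 10 for age in ages]   # center and rescale age for numerical stability
a, b, c = fit_quadratic(xs, happiness)
vertex_age = 50 + 10 * (-b / (2 * c))    # age at which the fitted parabola bottoms out
print(f"coefficient on age-squared: {c:.4f}; fitted minimum near age {vertex_age:.0f}")
```

The coefficient on age-squared comes out positive and the fitted curve bottoms out well before 50, so a plot of the fit shows happiness declining through early adulthood even though the underlying data never decline. The parabola is supplied by the model, not by the data.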

And then I encountered a paper by Paul Frijters and Tony Beatton which directly addressed my confusion. Frijters and Beatton write:

Whilst the majority of psychologists have concluded there is not much of a relationship at all, the economic literature has unearthed a possible U-shape relationship. In this paper we [Frijters and Beatton] replicate the U-shape for the German SocioEconomic Panel (GSOEP), and we investigate several possible explanations for it.

They conclude that the U is fragile and that it arises from a sample-selection bias. I refer you to the above link for further discussion.

In summary: I agree that happiness and life satisfaction are worth studying--of course they're worth studying--but, in the midst of looking for explanations for that U-shaped pattern, it might be worth looking more carefully to see what exactly is happening. At the very least, the pattern does not seem to be as clear as implied from some media reports. (Even a glance at the paper by Stone, Schwartz, Broderick, and Deaton, which is the source of the top graph above, reveals a bunch of graphs, only some of which are U-shaped.) All those explanations have to be contingent on the pattern actually existing in the population.

My goal is not to debunk but to push toward some broader thinking. People are always trying to explain what's behind a stylized fact, which is fine, but sometimes they're explaining things that aren't really happening, just like those theoretical physicists who, shortly after the Fleischmann-Pons experiment, came up with ingenious models of cold fusion. These theorists were brilliant but they were doomed because they were modeling a phenomenon which (most likely) doesn't exist.

A comment from a few days ago by Eric Rasmusen seems relevant, connecting this to general issues of confirmation bias. If you make enough graphs and you're looking for a U, you'll find it. I'm not denying the U is there, I'm just questioning the centrality of the U to the larger story of age, happiness, and life satisfaction. There appear to be many different age patterns and it's not clear to me that the U should be considered the paradigm.

P.S. I think this research (even if occasionally done by economists) is psychology, not economics. No big deal--it's just a matter of terminology--but I think journalists and other outsiders can be misled if they hear about this sort of thing and start searching in the economics literature rather than in the psychology literature. In general, I think economists will have more to say than psychologists about prices, and psychologists will have more insights about emotions and happiness. I'm sure that economists can make important contributions to the study of happiness, just as psychologists can make important contributions to the study of prices, but even a magazine called "The Economist" should know the difference.

## Instead of "confidence interval," let's say "uncertainty interval"

I've become increasingly uncomfortable with the term "confidence interval," for several reasons:

- The well-known difficulties in interpretation (officially the confidence statement can be interpreted only on average, but people typically implicitly give the Bayesian interpretation to each case),

- The ambiguity between confidence intervals and predictive intervals. (See the footnote in BDA where we discuss the difference between "inference" and "prediction" in the classical framework.)

- The awkwardness of explaining that confidence intervals are big in noisy situations where you have less confidence, and confidence intervals are small when you have more confidence.
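Two of the points in this list can be checked with a small simulation: the coverage statement holds on average over repeated samples, and the interval gets wider as the data get noisier. This is only a sketch under assumed settings (normal data, a normal-approximation 95% interval, sample size 100); the function names and parameters are hypothetical.

```python
import random
import statistics

def interval_for_mean(sample, z=1.96):
    """Normal-approximation 95% interval for the mean of a sample."""
    center = statistics.fmean(sample)
    half_width = z * statistics.stdev(sample) / len(sample) ** 0.5
    return center - half_width, center + half_width

def simulate(true_mean, noise_sd, n=100, reps=1000, seed=1):
    """Return (coverage rate, average interval width) over repeated samples."""
    rng = random.Random(seed)
    covered = 0
    total_width = 0.0
    for _ in range(reps):
        sample = [rng.gauss(true_mean, noise_sd) for _ in range(n)]
        lo, hi = interval_for_mean(sample)
        covered += lo <= true_mean <= hi  # True counts as 1
        total_width += hi - lo
    return covered / reps, total_width / reps

# Assumed noise levels, chosen only to contrast a quiet and a noisy setting.
results = {sd: simulate(true_mean=0.0, noise_sd=sd) for sd in (1.0, 5.0)}
for sd, (coverage, width) in results.items():
    print(f"noise sd {sd}: coverage {coverage:.3f}, mean width {width:.2f}")
```

Coverage should come out close to 95% at both noise levels, while the average width grows in proportion to the noise: the interval is tracking uncertainty, which is the reading the proposed term makes explicit.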

So here's my proposal. Let's use the term "uncertainty interval" instead. The uncertainty interval tells you how much uncertainty you have. That works pretty well, I think.

P.S. As of this writing, "confidence interval" outGoogles "uncertainty interval" by the huge margin of 9.5 million to 54,000. So we have a ways to go.

## WWJD? U can find out!

Two positions open in the statistics group at the NYU education school. If you get the job, you get to work with Jennifer Hill!

One position is a postdoctoral fellowship, and the other is a visiting professorship. The latter position requires "the demonstrated ability to develop a nationally recognized research program," which seems like a lot to ask for a visiting professor. Do they expect the visiting prof to develop a nationally recognized research program and then leave it there at NYU after the visit is over?

In any case, Jennifer and her colleagues are doing excellent work, both applied and methodological, and this seems like a great opportunity.

## "The truth wears off: Is there something wrong with the scientific method?"

My reply is that it reminds me a bit of what I wrote here. Or see here for the quick powerpoint version: The short story is that if you screen for statistical significance when estimating small effects, you will necessarily overestimate the magnitudes of effects, sometimes by a huge amount. I know that Dave Krantz has thought about this issue for a while; it came up when Francis Tuerlinckx and I wrote our paper on Type S errors, ten years ago.

My current thinking is that most (almost all?) research studies of the sort described by Lehrer should be accompanied by retrospective power analyses, or informative Bayesian inferences. Either of these approaches--whether classical or Bayesian, the key is that they incorporate real prior information, just as is done in a classical prospective power analysis--would, I think, moderate the tendency to overestimate the magnitude of effects.

In answer to the question posed by the title of Lehrer's article, my answer is Yes, there is something wrong with the scientific method, if this method is defined as running experiments and doing data analysis in a patternless way and then reporting, as true, results that pass a statistical significance threshold.

And corrections for multiple comparisons will not solve the problem: such adjustments merely shift the threshold without resolving the problem of overestimation of small effects.
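A minimal simulation sketch of the overestimation argument, under assumed numbers: a true effect of 0.5 measured with standard error 1, so the study is badly underpowered. The function `exaggeration_ratio` and the two thresholds (1.96 for a plain 5% test, 3.0 standing in for a multiple-comparisons adjustment) are hypothetical choices for illustration, not taken from any particular study.

```python
import random
import statistics

def exaggeration_ratio(true_effect, se, z_threshold, reps=100_000, seed=2):
    """Among estimates that pass the threshold, return
    (mean |estimate| / true effect, share of estimates that passed)."""
    rng = random.Random(seed)
    significant = []
    for _ in range(reps):
        estimate = rng.gauss(true_effect, se)  # one noisy study
        if abs(estimate) > z_threshold * se:   # "statistically significant"
            significant.append(abs(estimate))
    ratio = statistics.fmean(significant) / true_effect
    return ratio, len(significant) / reps

# Assumed setting: small true effect relative to the noise.
ratio_plain, share_plain = exaggeration_ratio(0.5, 1.0, z_threshold=1.96)
ratio_strict, share_strict = exaggeration_ratio(0.5, 1.0, z_threshold=3.0)

print(f"z > 1.96: {share_plain:.1%} significant, exaggeration x{ratio_plain:.1f}")
print(f"z > 3.00: {share_strict:.1%} significant, exaggeration x{ratio_strict:.1f}")
```

Every estimate that clears the threshold is, by construction, several times larger than the assumed true effect, and raising the threshold makes the surviving estimates even more exaggerated. That is the sense in which a stricter cutoff shifts the threshold without fixing the overestimation.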

## Compare p-values from privately funded medical trials to those in publicly funded research?

Sander Wagner writes:

I just read the post on ethical concerns in medical trials. As there seems to be a lot more pressure on private researchers I thought it might be a nice little exercise to compare p-values from privately funded medical trials with those reported from publicly funded research, to see if confirmation pressure is higher in private research (i.e. p-values are closer to the cutoff levels for significance for the privately funded research). Do you think this is a decent idea or are you sceptical? Also are you aware of any sources listing a large number of representative medical studies and their type of funding?

This sounds like something worth studying. I don't know where to get data about this sort of thing, but now that it's been blogged, maybe someone will follow up.

The American Statistical Association has an annual recommended gift list. (I think they had Red State, Blue State on the list a couple years ago.) They need some more suggestions in the next couple of days. Does anybody have any ideas?

## What do practitioners need to know about regression?

Fabio Rojas writes:

## The Joy of Stats

Hal Varian sends in this link to a series of educational videos described to be "a journey into the heart of statistics." It seems to be focused on exploratory data analysis, which it describes as "an extraordinary new method of understanding ourselves and our Universe."

## This is a footnote in one of my papers

In the annals of hack literature, it is sometimes said that if you aim to write best-selling crap, all you'll end up with is crap. To truly produce best-selling crap, you have to have a conviction, perhaps misplaced, that your writing has integrity. Whether or not this is a good generalization about writing, I have seen an analogous phenomenon in statistics: If you try to do nothing but model the data, you can be in for a wild and unpleasant ride: real data always seem to have one more twist beyond our ability to model (von Neumann's elephant's trunk notwithstanding). But if you model the underlying process, sometimes your model can fit surprisingly well, while also inviting openings for future research progress.

## Quality control problems at the New York Times

I guess there's a reason they put this stuff in the Opinion section and not in the Science section, huh?

P.S. More here.

## When Small Numbers Lead to Big Errors

My column in Scientific American.

Check out the comments. I have to remember never ever to write about guns.

## One way that psychology research is different than medical research

Medical researchers care about main effects, psychologists care about interactions. In psychology, the main effects are typically obvious, and it's only the interactions that are worth studying.

## Postdoc opportunity here at Columbia -- deadline soon!

The deadline for this year's Earth Institute postdocs is 1 Dec, so it's time to apply right away! It's a highly competitive interdisciplinary program, and we've had some statisticians in the past.

We're particularly interested in statisticians who have research interests in development and public health. It's fine--not just fine, but ideal--if you are interested in statistical methods also.

## ff

Can somebody please fix the pdf reader so that it can correctly render "ff" when I cut and paste? This comes up when I'm copying sections of articles onto the blog.

Thank you.

P.S. I googled "ff pdf" but no help there.

P.P.S. It's a problem with "fi" also.

P.P.P.S. Yes, I know about ligatures. But, if you already knew about ligatures, and I already know about ligatures, then presumably the pdf people already know about ligatures too. So why can't their clever program, which can already find individual f's, also find the ff's and separate them? I assume it's not so simple but I don't quite understand why not.
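For the case where the pasted text arrives with the Unicode ligature characters intact (for example U+FB00 for "ff" and U+FB01 for "fi"), a workaround sketch is to run it through compatibility normalization. This assumes the pdf reader emitted those code points; if the pdf's font has no usable Unicode mapping, the pasted characters may be something else entirely and this will not help. The helper name `split_ligatures` is hypothetical.

```python
import unicodedata

def split_ligatures(text):
    """Expand compatibility characters such as U+FB00 into plain letters.

    NFKC also rewrites other compatibility characters (full-width forms,
    superscript digits, and so on), so it is not a ligature-only fix.
    """
    return unicodedata.normalize("NFKC", text)

# Pasted text containing the ff (U+FB00), ffi (U+FB03), and fi (U+FB01) ligatures.
pasted = "e\ufb00ect size and coe\ufb03cient; a \ufb01tted model"
print(split_ligatures(pasted))  # prints "effect size and coefficient; a fitted model"
```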

## Ethical concerns in medical trials

I just read this article on the treatment of medical volunteers, written by doctor and bioethicist Carl Elliott.

As a statistician who has done a small amount of consulting for pharmaceutical companies, I have a slightly different perspective. As a doctor, Elliott focuses on individual patients, whereas, as a statistician, I've been trained to focus on the goal of accurately estimating treatment effects.

I'll go through Elliott's article and give my reactions.

## "The Wald method has been the subject of extensive criticism by statisticians for exaggerating results"

Paul Nee sends in this amusing item:

## Estimation from an out-of-date census

Suguru Mizunoya writes:

When we estimate the number of people from a national sampling survey (such as labor force survey) using sampling weights, don't we obtain underestimated number of people, if the country's population is growing and the sampling frame is based on an old census data? In countries with increasing populations, the probability of inclusion changes over time, but the weights can't be adjusted frequently because census takes place only once every five or ten years.

I am currently working for UNICEF for a project on estimating number of out-of-school children in developing countries. The project leader is comfortable to use estimates of number of people from DHS and other surveys. But, I am concerned that we may need to adjust the estimated number of people by the population projection, otherwise the estimates will be underestimated.

I googled around on this issue, but I could not find a right article or paper on this.

My reply: I don't know if there's a paper on this particular topic, but, yes, I think it would be standard to do some demographic analysis and extrapolate the population characteristics using some model, then poststratify on the estimated current population.
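A minimal sketch of that adjustment with entirely hypothetical numbers: two strata, sampling weights built from an old census, and projected current populations that would in practice come from a demographic model. Poststratifying here just means recomputing each stratum's weight against the projected total instead of the census total.

```python
# Hypothetical numbers throughout: strata, counts, and rates are invented.
strata = {
    # stratum: (school-age population in old census,
    #           projected current school-age population,
    #           children sampled, sampled children out of school)
    "urban": (1_000_000, 1_300_000, 500, 50),
    "rural": (2_000_000, 2_200_000, 500, 150),
}

def estimate_total(strata, use_projection):
    """Weighted estimate of the number of out-of-school children."""
    total = 0.0
    for census_pop, projected_pop, n_sampled, n_out in strata.values():
        pop = projected_pop if use_projection else census_pop
        weight = pop / n_sampled  # children represented by each respondent
        total += weight * n_out
    return total

old_frame_estimate = estimate_total(strata, use_projection=False)
poststratified_estimate = estimate_total(strata, use_projection=True)

print(f"weights from old census:   {old_frame_estimate:,.0f}")
print(f"poststratified to current: {poststratified_estimate:,.0f}")
```

With these invented inputs the census-based weights give 700,000 and the poststratified weights give 790,000, so the out-of-date frame understates the count by the amount of unmodeled growth. The sketch assumes the out-of-school rate within each stratum is estimated well by the sample; only the population totals are updated.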

P.S. Speaking of out-of-date censuses, I just hope you're not working with data from Lebanon!

## Kaggle: forecasting competitions in the classroom

Anthony Goldbloom writes:

For those who haven't come across Kaggle, we are a new platform for data prediction competitions. Companies and researchers put up a dataset and a problem and data scientists compete to produce the best solutions.

We've just launched a new initiative called Kaggle in Class, allowing instructors to host competitions for their students. Competitions are a neat way to engage students, giving them the opportunity to put into practice what they learn. The platform offers live leaderboards, so students get instant feedback on the accuracy of their work. And since competitions are judged on objective criteria (predictions are compared with outcomes), the platform offers unique assessment opportunities.

The first Kaggle in Class competition is being hosted by Stanford University's Stats 202 class and requires students to predict the price of different wines based on vintage, country, ratings and other information.

Those interested in hosting a competition for their students should visit the Kaggle in Class page or contact daniel.gold@kaggle.com

Looks cool to me. More on Kaggle here.
