Ed Leamer’s brilliant paper, Let’s Take the Con out of Econometrics, makes the point that classical statistical tests don’t hold when you go on a fishing expedition. This equally brilliant comic strip from xkcd makes the same point with elegance. The tip is from John Allen Paulos, who reminds the reader to count the number of tests.
The ‘con’ in econometrics made visible

{ 33 comments }
love!
Anyone involved in any such study already knows everything in that paper. Nothing new here that any basic stat book doesn’t teach.
That’s not to say, of course, that it’s not abused, either through ignorance or on purpose.
I liked this one too… but I would think this is a perfect illustration of what ignorance of econometrics gives you.
The con isn’t in econometrics here – the con is in journalism. Econometrics tells you exactly what to make of that green jelly bean result: not much.
The next time you see a published paper that lists how many regressions were run and which specifications were used, let me know. It may have happened in the history of economics but it isn’t common. The source of the problem is economists who oversell their results in search of notoriety and influence. The journalists are just the middlemen.
Well that’s practically tautological… if you don’t do econometrics well the econometrics you do will not be good. What is the critique, that econometrics is misleading or that econometrics isn’t practiced?
Anyway, this is just in the latest RESTAT: http://www.mitpressjournals.org/doi/pdf/10.1162/REST_a_00067
http://www.mitpressjournals.org/doi/pdf/10.1162/REST_a_00064
http://www.mitpressjournals.org/doi/pdf/10.1162/REST_a_00058
They all discuss additional specifications in their data appendix or in footnotes. Not all of them say how many were run in total, but they describe the results (whether the results were robust to the additional specifications). I’ve always known it to be common practice to report additional runs. Perhaps three articles in one journal are a fluke, but it’s been my experience at the Urban Institute too (since we do more reports than articles, we usually stuff all additional specifications in an appendix… not as pretty, but same idea). We aren’t exactly doing cutting-edge work, and this is pretty standard operating procedure. RESTAT sets the standard for the field – do you really think this is all that unusual?
I am less pessimistic than you are, but again – If it’s true and RESTAT and my experience are just flukes of good behavior, the problem isn’t econometrics. The problem is the failure to do econometrics.
The problem is the people who don’t look at those details. How do you publish a paper without providing the specifics of the analysis, and the reasons behind them? No paper should be allowed to be published otherwise.
What they don’t tell you in the manuscript is all the data mining that went on until they found an interesting post hoc result. Very large epidemiological studies tend to have the methodology completely laid out ahead of time, as it should be, but even then, few can resist going off the reservation, especially if the results turn out not to be interesting. These post hoc results often also get mentioned in the discussion sections, or in separate smaller publications as the authors try to milk their data set for all it is worth (or not worth).
Yeah but the problem again rests with the reviewers and readers. They don’t have to tell me how much data mining was done; it is easily understood with minimum detail.
It’s not all that difficult to go off the reservation, as you say, but it is also not that difficult for reviewers or readers to reject such papers. The problem is… they don’t… because a lot of the time the reviewers are just as illiterate as the authors. And the reader doesn’t have to buy my explanations for the observed significances; they just have to believe the results of the statistics (which require some basic info, otherwise it shouldn’t be published).
Nothing is inherently wrong with the statistics or their use; it’s the intellectual laziness of the interpreter (i.e., the reader, not the author) that’s the problem.
VV,
What’s wrong with computing more statistics on a data set after the fact? Do you really think that just because a data set was created to study effect A, it can’t also be used to study effect B?
Regards,
Ken
Strongly agree EG.
I think Russ and vikingvista are being awfully pessimistic here too – there’s a professional ethic that monitors these things too. The data-mining taboo was drilled into me in academia AND in my professional career, and I don’t think I’m atypical. People don’t make a point of drilling that into their students and junior colleagues unless it’s something they take seriously. Most economic studies that I’m familiar with also resemble what vikingvista outlined – you lay out a methodology before you start working on the data. In a lot of modern work, so much hinges on very careful identification assumptions that you rarely have all that much room to play around with specification.
Ken,
There are often issues with study design, but the main problem is what is illustrated in the cartoon. An equivalent certainty for post hoc analysis requires smaller P values. The more post hoc tests you do, the smaller the P values must become. You can test blue jelly beans at level .05, but if you later decide to use the same data to test green jelly beans, the same level .05 test requires a P-value for the green beans of around .025. Maybe smaller. As you choose to do more post hoc tests, the required P level gets discouragingly low. And it should. If you were to conduct a brand new study to check what you think is a significant post hoc result obtained with the wrong methodology, you would find time and time again that the results are not significant.
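The shrinking threshold described in the comment above can be sketched in a few lines of Python. This is only an illustration: the comment does not name a specific adjustment, so the Bonferroni and Šidák corrections are assumed here, and the function names are hypothetical.

```python
def bonferroni_threshold(alpha: float, num_tests: int) -> float:
    """Per-test p-value cutoff that keeps the family-wise error rate at or below alpha."""
    return alpha / num_tests


def sidak_threshold(alpha: float, num_tests: int) -> float:
    """Per-test cutoff that gives a family-wise error rate of alpha if the tests are independent."""
    return 1.0 - (1.0 - alpha) ** (1.0 / num_tests)


if __name__ == "__main__":
    # One planned test, one extra post hoc test, and the 20 jelly bean colors.
    for m in (1, 2, 20):
        print(m, bonferroni_threshold(0.05, m), sidak_threshold(0.05, m))
```

With two tests the Bonferroni cutoff is .025, matching the figure in the comment; with 20 tests it drops to .0025.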
And even though this is textbook statistics, low profile small studies are full of bad methodologies, with this error one of the more common.
EG,
First, reviewers only know what the authors give them.
Second, reviewers often lack competence in statistics. Bad methodologies get by them all the time. Have a good statistician attend your journal clubs, and you will soon develop a healthy pessimism.
“First, reviewers only know what the authors give them.
Second, reviewers often lack competence in statistics. Bad methodologies get by them all the time”
That’s my point.
I just came across a fisheries article that referred to 158 different p-values. The issues aren’t only in economics; they show up in pretty much any field where people think they know statistical inference well enough to avoid consultation with a statistician.
That’s awesome! Now how does such stuff get through reviewers? Actually I made the same mistake in my first paper when I was an RA… my professors said “don’t worry, no one is going to notice”.
Well, no one is going to notice. So much for peer review.
Isn’t it reasonable to expect that at some p level greater than .05, the null hypothesis of no relationship between jelly bean consumption and the occurrence of acne would be rejected? So, why did the cartoonist use “p greater than .05” to qualify the conclusions? Why not “p less than .05”?
oooops, NEVER MIND (in my best Emily Litella voice)
Sorry, I replied before I saw your ooops.
A difference is considered significant if its probability due to random chance is low: P < 0.05 is significant, P > 0.05 is not significant. 0.05 = 1/20 means that if you do the same experiment 20 times, you would expect random chance to produce 1 result that appears significant. The individual P’s for multiple simultaneous tests therefore need to be much lower. At level 0.0025, the green jelly bean result probably would not have been significant.
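The “1 in 20” arithmetic above can be checked with a short Python sketch. It assumes the 20 tests are independent and that none of the jelly bean colors has any real effect, so each p-value is uniform between 0 and 1; the function names are hypothetical.

```python
import random


def familywise_error_rate(alpha: float, num_tests: int) -> float:
    """Chance of at least one false positive among independent tests of true nulls."""
    return 1.0 - (1.0 - alpha) ** num_tests


def simulate_any_false_positive(alpha: float, num_tests: int,
                                trials: int, seed: int = 0) -> float:
    """Fraction of simulated studies in which at least one test looks significant."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Under a true null hypothesis, a p-value is uniform on [0, 1].
        if any(rng.random() < alpha for _ in range(num_tests)):
            hits += 1
    return hits / trials


if __name__ == "__main__":
    print("expected false positives:", 20 * 0.05)
    print("exact chance of at least one:", familywise_error_rate(0.05, 20))
    print("simulated chance:", simulate_any_false_positive(0.05, 20, 20000))
```

Under these assumptions, 20 tests at level 0.05 produce one false positive on average, and the chance of seeing at least one is about 64%.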
In practice, the researchers do all the tests for all the jelly beans internally, and just publish the results for the green ones. Often they are honest enough to mention all the tests, but in small studies it is incredibly common that they do not statistically adjust for it.
Ideally, they should also determine a meaningful biological effect size, and then determine a sample size that could find that difference. If they are just trying to find an effect simply greater than zero, the p-value is partially a function of the sample size. Given a large enough sample, all of those p-values would have been significant.
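The dependence of the p-value on sample size can be illustrated with a small Python sketch. It assumes a one-sample, two-sided z-test with known standard deviation and an observed effect that stays fixed at a tiny nonzero value; the test, the numbers, and the function name are all chosen here for illustration and are not taken from the comment.

```python
import math


def two_sided_p_value(effect: float, sd: float, n: int) -> float:
    """Two-sided p-value of a z-test for a mean difference `effect` with known `sd`."""
    z = effect * math.sqrt(n) / sd
    # For a standard normal Z, P(|Z| > |z|) equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2.0))


if __name__ == "__main__":
    # The same tiny effect (0.02 standard deviations) at growing sample sizes.
    for n in (100, 10_000, 1_000_000):
        print(n, two_sided_p_value(0.02, 1.0, n))
```

The effect never changes, yet it goes from clearly non-significant at n = 100 to significant at the .05 level at n = 10,000, which is why a meaningful effect size has to be chosen before the sample size.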
Good point.
To appreciate xkcd, you have to read the alt-text as well. On the site, it reads: “‘So, uh, we did the green study again and got no link. It was probably a–’ ‘RESEARCH CONFLICTED ON GREEN JELLY BEAN/ACNE LINK; MORE STUDY RECOMMENDED!’”
xkcd has the best alt-text on the web.
I’d like to hear the Austrian school’s thoughts on Minecraft.
There are even better texts from medical science, e.g. here: http://psg-mac43.ucsf.edu/ticr/syllabus/courses/4/2003/11/13/Lecture/readings/Steven%20Goodman.pdf
Now there is a problem even with the comic strip. A great analogy I found was this:
- When somebody is at home, there is a 95% probability that the light will be on; when nobody is at home, there is a 95% probability that the light will be off.
You see that the light is on; what is the probability that somebody is at home?
The problem is that most people think that a p-value of 0.05 means “there is a 5% probability that I am wrong”… which it doesn’t…
If I follow Reverend Bayes’ formula, it depends on how likely somebody is to be home…
If somebody is home 50% of the time, the lights being on can let you predict someone being home with 95% certainty.
If someone is only home 10% of the time, then seeing the lights on will lead you to make a ‘someone’s home’ prediction that’s only about 68% certain.
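The Bayes’ formula calculation for the light-at-home analogy can be sketched as follows. The 95% figures come from the analogy itself; the function and parameter names are hypothetical.

```python
def posterior_home_given_light(prior_home: float,
                               p_light_given_home: float = 0.95,
                               p_light_given_away: float = 0.05) -> float:
    """P(somebody is home | light is on), by Bayes' formula."""
    # Total probability of seeing the light on, whether or not anyone is home.
    p_light = (p_light_given_home * prior_home
               + p_light_given_away * (1.0 - prior_home))
    return p_light_given_home * prior_home / p_light


if __name__ == "__main__":
    for prior in (0.5, 0.1):
        print(prior, posterior_home_given_light(prior))
```

With the analogy’s numbers, a 50% prior gives a posterior of 0.95, while a 10% prior gives 0.095 / 0.14, or roughly 0.68. The same 95% reliability supports very different conclusions depending on how likely the hypothesis was to begin with.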
I always tell folks that if I need a statistician to tell me if something worked, I assume it didn’t work.
For “statistical significance”, you need your statistician’s considered opinion.
For “significance”, you need your considered opinion.
It’s a very true cartoon. I read too many epidemiologic studies that retrospectively looked at scores of factors and then drew similar idiotic correlations. What’s really bad is that these papers were in peer-reviewed journals and got past at least two peer reviewers, at least one editor, and sometimes a biostatistician.
I don’t read many economics journal articles, but some of the behavioral economics papers I’ve read contained similar flaws.
Perhaps all journal articles that include statistics should be reviewed by an independent panel of statisticians. A few medical journals now do this (after being badly embarrassed a few times).
This seems possibly related.
http://www.cnn.com/2011/US/04/08/particle.physics.tevatron/index.html
I forgot to quote the relevant part of the article which is: “At Tevatron, …they may have found evidence of a particle never observed before. … But the keyword is “may” — there’s about a 1 in 1,000 chance that it’s just a fluke of statistics.”
It’s not just data mining in the data set, it’s also a good point about the publication bias. How many papers testing a hypothesis didn’t get written because there weren’t significant results in the data set? Shouldn’t those be counted as evidence against the hypothesis as much as the one paper which did have a data set with significant results be counted as evidence in favor of it? Or at least, comparing the non-results with successful results would seem to potentially give researchers some insight on where a theory is applicable and where it isn’t, or investigate why it applies in one situation but not others.
The first two readings in my 1984 honours year methodology of econometrics course were Leamer’s Let’s Take the Con out of Econometrics from 1983 and Hendry’s equally recently published Econometrics: Alchemy or Science? These now-classics blew me away and have stayed at the front of my mind to this day.
My lecturer suggested that one guard against data mining was to put all regressions that were run in an appendix. Few do that.
Hendry developed a massive bank of tests in PCgive, but the econometric diagnostics reported in many papers in the 2000s are little more informative than those published in the 1970s.
I saw a paper the other day arguing that data mining together with publication bias increases false positives in published articles to about 40% or more.