The ‘con’ in econometrics made visible

by Russ Roberts on April 8, 2011

in Data

Ed Leamer’s brilliant paper, Let’s Take the Con out of Econometrics, makes the point that classical statistical tests don’t hold when you go on a fishing expedition. This equally brilliant comic strip from xkcd makes the same point with elegance. The tip is from John Allen Paulos, who reminds the reader to count the number of tests.

Comments

Methinks1776 April 8, 2011 at 12:29 pm

love!

E.G. April 8, 2011 at 12:31 pm

Anyone involved in any such study already knows everything in that paper. Nothing new here that any basic stat book doesn’t teach.

That’s not to say, of course, that it’s not abused, whether through ignorance or on purpose.

Daniel Kuehn April 8, 2011 at 12:34 pm

I liked this one too… but I would think this is a perfect illustration of what ignorance of econometrics gives you.

The con isn’t in econometrics here – the con is in journalism. Econometrics tells you exactly what to make of that green jelly bean result: not much.

Russ Roberts April 8, 2011 at 1:51 pm

The next time you see a published paper that lists how many regressions were run and which specifications were used, let me know. It may have happened in the history of economics but it isn’t common. The source of the problem is economists who oversell their results in search of notoriety and influence. The journalists are just the middlemen.

Daniel Kuehn April 8, 2011 at 2:22 pm

Well, that’s practically tautological… if you don’t do econometrics well, the econometrics you do will not be good. What is the critique: that econometrics is misleading, or that econometrics isn’t practiced?

Anyway, this is just in the latest RESTAT: http://www.mitpressjournals.org/doi/pdf/10.1162/REST_a_00067
http://www.mitpressjournals.org/doi/pdf/10.1162/REST_a_00064
http://www.mitpressjournals.org/doi/pdf/10.1162/REST_a_00058

They all discuss additional specifications in their data appendix or in footnotes. Not all of them say how many were run in total, but they describe the results (whether the results were robust to the additional specifications). I’ve always known it to be common practice to report additional runs. Perhaps three articles in one journal are a fluke, but it’s been my experience at the Urban Institute too (since we do more reports than articles, we usually put all the additional specifications in an appendix… not as pretty, but same idea). We aren’t exactly doing cutting-edge work, and this is pretty standard operating procedure. RESTAT sets the standard for the field – do you really think this is all that unusual?

I am less pessimistic than you are, but again – if it’s true, and RESTAT and my experience are just flukes of good behavior, then the problem isn’t econometrics. The problem is the failure to do econometrics.

E.G. April 8, 2011 at 2:41 pm

The problem is the people who don’t look at those details. How do you publish a paper without providing the specifics of the analysis, and the reasons behind them? No paper should be allowed to be published otherwise.

vikingvista April 8, 2011 at 4:12 pm

What they don’t tell you in the manuscript is all the data mining that went on until they found an interesting post hoc result. Very large epidemiological studies tend to have the methodology completely laid out ahead of time, as it should be, but even then, few can resist going off the reservation, especially if the planned results turn out not to be interesting. These post hoc results often also get mentioned in the discussion sections, or in separate smaller publications, as the authors try to milk their data set for all it is worth (or not worth).

E.G. April 8, 2011 at 5:30 pm

Yeah, but the problem again rests with the reviewers and readers. They don’t have to tell me how much data mining was done; it is easily understood with minimal detail.

It’s not all that difficult to go off the reservation, as you say, but it is also not that difficult for reviewers or readers to reject such papers. The problem is… they don’t… because a lot of the time the reviewers are just as illiterate as the authors. And the reader doesn’t have to buy my explanations for the observed significances; they just have to assess the results of the statistics (which requires some basic information, without which the paper shouldn’t be published).

Nothing is inherently wrong with the statistics or their use; it’s the intellectual laziness of the interpreter (i.e., the reader, not the author) that’s the problem.

Ken April 8, 2011 at 6:42 pm

VV,

What’s wrong with computing more statistics on a data set after the fact? Do you really think that just because a data set was created to study effect A, it can’t also be used to study effect B?

Regards,
Ken

Daniel Kuehn April 8, 2011 at 6:43 pm

Strongly agree EG.

I think Russ and vikingvista are being awfully pessimistic here too – there’s a professional ethic that monitors these things as well. The data-mining taboo was drilled into me in academia AND in my professional career, and I don’t think I’m atypical. People don’t make a point of drilling that into their students and junior colleagues unless it’s something they take seriously. Most economic studies that I’m familiar with also resemble what vikingvista outlined – you lay out a methodology before you start working on the data. In a lot of modern work, so much hinges on very careful identification assumptions that you rarely have all that much room to play around with specification.

vikingvista April 8, 2011 at 9:15 pm

Ken,

There are often issues with study design, but the main problem is what is illustrated in the cartoon. Getting an equivalent level of certainty from post hoc analysis requires smaller p-values: the more post hoc tests you do, the smaller the per-test threshold must become. You can test blue jelly beans at level .05, but if you later decide to use the same data to test green jelly beans, keeping an overall level of .05 requires a p-value for the green beans around .025, maybe smaller. As you choose to do more post hoc tests, the required p level gets discouragingly low. And it should. If you were to run a brand new study to confirm a post hoc result that looked significant only under the wrong (unadjusted) methodology, you would find time and time again that the result does not hold up.

And even though this is textbook statistics, low-profile small studies are full of bad methodologies, with this error being one of the more common.
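
Here is a minimal sketch of that arithmetic in Python, using a Bonferroni-style adjustment (the function names and the choice of adjustment are just illustrative):

```python
# Chance of at least one fluke "significant" result when you run k
# independent tests at level alpha, and the Bonferroni-adjusted per-test
# level that keeps the overall error rate near alpha.
def family_wise_error(alpha, k):
    return 1 - (1 - alpha) ** k

def bonferroni_level(alpha, k):
    return alpha / k

for k in (1, 2, 20):
    print(k, round(family_wise_error(0.05, k), 3), bonferroni_level(0.05, k))
# k = 2  -> per-test level 0.025, as with the blue-then-green jelly beans
# k = 20 -> per-test level 0.0025, and a ~64% chance of at least one
#           fluke at the unadjusted 0.05 level
```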

vikingvista April 8, 2011 at 9:19 pm

EG,

First, reviewers only know what the authors give them.

Second, reviewers often lack competence in statistics. Bad methodologies get by them all the time. Have a good statistician attend your journal clubs, and you will soon develop a healthy pessimism.

E.G. April 8, 2011 at 9:26 pm

“First, reviewers only know what the authors give them.
Second, reviewers often lack competence in statistics. Bad methodologies get by them all the time”

That’s my point.

Matthew April 9, 2011 at 11:40 am

I just came across a fisheries article that referred to 158 different p-values. The issue isn’t only in economics; it shows up in pretty much any field where people think they know statistical inference well enough to avoid consulting a statistician.

E.G. April 9, 2011 at 3:32 pm

That’s awesome! Now how does such stuff get through reviewers? Actually, I made the same mistake in my first paper when I was an RA… my professors said “don’t worry, no one is going to notice.”

Well, no one is going to notice. So much for peer review.

Bill April 8, 2011 at 12:37 pm

Isn’t it reasonable to expect that at some p level greater than .05, the null hypothesis of no relationship between jelly bean consumption and the occurrence of acne would be rejected? So, why did the cartoonist use “p greater than .05” to qualify the conclusions? Why not “p less than .05”?

Bill April 8, 2011 at 4:16 pm

oooops — NEVER MIND (in my best Emily Litella voice)

vikingvista April 8, 2011 at 5:03 pm

Sorry, I replied before I saw your ooops.

vikingvista April 8, 2011 at 4:26 pm

A difference is considered significant if its probability of arising from random chance alone is low: P &lt; 0.05 is significant, P &gt; 0.05 is not. 0.05 = 1/20 means that if you do the same experiment 20 times, you would expect random chance to produce 1 result that appears significant. The individual P’s for multiple simultaneous tests therefore need to be much lower. At level 0.0025, the green jelly bean result probably would not have been significant.

In practice, the researchers do all the tests for all the jelly beans internally, and just publish the results for the green ones. Often they are honest enough to mention all the tests, but in small studies it is incredibly common that they do not statistically adjust for it.
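
A quick Monte Carlo sketch of that 1-in-20 intuition, assuming 20 independent tests and no real effect anywhere (so every p-value is pure noise):

```python
import random

# Under the null hypothesis a p-value is uniform on [0, 1], so simulate
# 20 independent jelly-bean tests with no real effect and count how often
# at least one color clears the unadjusted 0.05 bar.
random.seed(0)
trials = 100_000
hits = sum(
    any(random.random() < 0.05 for _ in range(20))
    for _ in range(trials)
)
print(hits / trials)  # roughly 0.64, i.e. 1 - 0.95**20
```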

Matthew April 9, 2011 at 11:46 am

Ideally, they should also determine a meaningful biological effect size and then a sample size that could detect that difference. If they are just trying to find an effect that is merely greater than zero, the p-value is partly a function of the sample size: given a large enough sample, almost any nonzero effect will come out as significant.
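
A rough sketch of that trade-off, using the standard normal-approximation sample-size formula; the alpha, power, and effect sizes below are just illustrative choices:

```python
from statistics import NormalDist

# Approximate per-group sample size for a two-sample comparison with
# standardized effect size d, two-sided alpha, and a target power.
def n_per_group(d, alpha=0.05, power=0.80):
    z = NormalDist().inv_cdf
    return 2 * ((z(1 - alpha / 2) + z(power)) / d) ** 2

for d in (0.8, 0.5, 0.2, 0.05):
    print(d, round(n_per_group(d)))
# Tiny effects need enormous samples; flip it around and an enormous
# sample will flag even a tiny, practically meaningless effect.
```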

vikingvista April 9, 2011 at 3:32 pm

Good point.

J Mann April 8, 2011 at 1:05 pm

To appreciate xkcd, you have to read the alt-text as well. On the site, it reads: “‘So, uh, we did the green study again and got no link. It was probably a–’ ‘RESEARCH CONFLICTED ON GREEN JELLY BEAN/ACNE LINK; MORE STUDY RECOMMENDED!’”

Michael April 8, 2011 at 1:08 pm

xkcd has the best alt-text on the web.

Chris Bauer April 8, 2011 at 1:24 pm

I’d like to hear the Austrian school’s thoughts on Minecraft.

andy April 8, 2011 at 4:23 pm

There are even better texts from medical science, e.g. here: http://psg-mac43.ucsf.edu/ticr/syllabus/courses/4/2003/11/13/Lecture/readings/Steven%20Goodman.pdf

Now there is a problem even with the comic strip. A great analogy I found was this:
- when there’s somebody at home, there’s a 95% probability that the light will be on; when there is nobody at home, there is a 95% probability that the light will be off.
You see that the light is on; what is the probability that somebody is at home?

The problem is that most people think a p-value of 0.05 means “there is a 5% probability that I am wrong”… which it doesn’t…

jcpederson April 8, 2011 at 4:48 pm

If I follow Reverend Bayes’ formula, it depends on how likely somebody is to be home…

If somebody is home 50% of the time, seeing the light on lets you conclude someone is home with 95% certainty.

If someone is only home 10% of the time, then seeing the light on supports a ‘someone’s home’ conclusion that is only about 68% certain.
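
A minimal sketch of that calculation, assuming the 95%/5% numbers from the analogy above (the function name is just illustrative):

```python
# Bayes' rule for the light-on example: P(light on | home) = 0.95 and
# P(light on | not home) = 0.05; the prior is how often somebody is home.
def p_home_given_light_on(prior_home, p_on_home=0.95, p_on_away=0.05):
    joint_home = p_on_home * prior_home
    joint_away = p_on_away * (1 - prior_home)
    return joint_home / (joint_home + joint_away)

print(p_home_given_light_on(0.5))  # 0.95
print(p_home_given_light_on(0.1))  # ~0.679
```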

Seth April 8, 2011 at 4:28 pm

I always tell folks that if I need a statistician to tell me if something worked, I assume it didn’t work.

vikingvista April 8, 2011 at 5:08 pm

For “statistical significance”, you need your statistician’s considered opinion.
For “significance”, you need your considered opinion.

Dr. T April 8, 2011 at 6:26 pm

It’s a very true cartoon. I read too many epidemiologic studies that retrospectively looked at scores of factors and then drew similarly idiotic conclusions from chance correlations. What’s really bad is that these papers were in peer-reviewed journals and got past at least two peer reviewers, at least one editor, and sometimes a biostatistician.

I don’t read many economics journal articles, but some of the behavioral economics papers I’ve read contained similar flaws.

Perhaps all journal articles that include statistics should be reviewed by an independent panel of statisticians. A few medical journals now do this (after being badly embarrassed a few times).

Marcus April 8, 2011 at 7:38 pm
Marcus April 8, 2011 at 7:47 pm

I forgot to quote the relevant part of the article, which is: “At Tevatron, …they may have found evidence of a particle never observed before. … But the keyword is “may” — there’s about a 1 in 1,000 chance that it’s just a fluke of statistics.”

Hasdrubal April 9, 2011 at 8:40 am

It’s not just data mining within a data set; the cartoon also makes a good point about publication bias. How many papers testing a hypothesis never got written because there weren’t significant results in the data set? Shouldn’t those be counted as evidence against the hypothesis, just as the one paper that did have a data set with significant results gets counted as evidence in favor of it? At the very least, comparing the non-results with the successful results would seem to give researchers some insight into where a theory is applicable and where it isn’t, or into why it applies in one situation but not others.

Jim Rose April 10, 2011 at 3:56 am

The first two readings in my 1984 honours-year methodology of econometrics course were Leamer’s “Let’s Take the Con out of Econometrics” from 1983 and Hendry’s equally recent “Econometrics: Alchemy or Science?” These now-classics blew me away and have stayed at the front of my mind to this day.

My lecturer suggested that one guard against data mining was to put every regression that was run in an appendix. Few do that.

Hendry developed a massive bank of tests in PcGive, but the econometric diagnostics reported in many papers from the 2000s are little more informative than those published in the 1970s.

I saw a paper the other day arguing that data mining together with publication bias increases false positives in published articles to about 40% or more.
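
A figure like that is plausible from a back-of-the-envelope simulation; the assumptions below (10% of tested hypotheses true, 50% power, alpha = 0.05, only significant results published) are made up for illustration, not taken from the paper:

```python
import random

# Publication-bias sketch: only "significant" results get written up.
# What share of the published findings are false positives?
random.seed(1)
true_share, power, alpha, n = 0.10, 0.50, 0.05, 1_000_000
published_true = published_false = 0
for _ in range(n):
    is_true = random.random() < true_share        # is the hypothesis real?
    p_significant = power if is_true else alpha   # chance the test comes up significant
    if random.random() < p_significant:
        if is_true:
            published_true += 1
        else:
            published_false += 1
print(published_false / (published_true + published_false))  # around 0.47
```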
