Tuesday, 24 January 2023

Does economics have a bigger publication bias problem than other fields?

Publication bias is the tendency for studies that show statistically significant effects, often in a particular direction predicted by theory, to be much more likely to be published than studies that show statistically insignificant effects (or statistically significant effects in the direction opposite to that predicted by theory). Meta-analyses, which collate the results of many published studies, often show evidence for publication bias.

Publication bias could arise because of the 'file drawer' problem: since studies with statistically significant effects are more likely to be published, researchers put studies that find statistically insignificant effects into a file drawer, and never try to get them published. Or, publication bias could result from p-hacking: researchers make a number of choices about how the analysis is conducted in order to increase the chances of finding a statistically significant effect, which can then be published.
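To see how the file drawer problem alone can distort the published record, here is a minimal simulation sketch. Everything in it is an assumption for illustration: the true effect (0.1), the standard error (0.1), the number of studies, and the rule that a study is only "published" if its z-statistic exceeds 1.96. It is not drawn from any of the papers discussed here.

```python
import random
import statistics

def simulate_file_drawer(true_effect=0.1, std_error=0.1, n_studies=10_000, seed=0):
    """Simulate many studies of one true effect, publishing only significant ones.

    All parameter values are hypothetical and chosen for illustration.
    """
    rng = random.Random(seed)
    # Each study's estimate is the true effect plus sampling noise.
    estimates = [rng.gauss(true_effect, std_error) for _ in range(n_studies)]
    # A study leaves the file drawer only if its z-statistic exceeds 1.96.
    published = [e for e in estimates if e / std_error > 1.96]
    return statistics.mean(estimates), statistics.mean(published), len(published)

mean_all, mean_published, n_published = simulate_file_drawer()
print(f"Mean estimate across all studies:      {mean_all:.3f}")
print(f"Mean estimate across published studies: {mean_published:.3f}")
print(f"Share of studies published:             {n_published / 10_000:.1%}")
```

Because every published estimate has to clear the significance threshold, the average published effect is necessarily larger than the true effect, even though no individual researcher has done anything wrong with their analysis.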

Is publication bias worse in some fields than others? That is the question addressed by this new working paper by František Bartoš (University of Amsterdam) and co-authors. Specifically, they compare publication bias in the fields of medicine, economics, and psychology by undertaking a 'meta-analysis of meta-analyses', combining about 800,000 effect sizes from 26,000 meta-analyses. However, the sample is not balanced across the three fields, with 25,447 meta-analyses in medicine, 327 in economics, and 605 in psychology. Using this sample though, Bartoš et al. find that:

...meta-analyses in economics and psychology predominantly show evidence for an effect before adjusting for PSB [publication selection bias] (unadjusted); whereas meta-analyses in medicine often display evidence against an effect. This disparity between the fields remains even when comparing meta-analyses with equal numbers of effect size estimates. When correcting for PSB, the posterior probability of an effect drops much more in economics and psychology (medians drop from 99.9% to 29.7% and from 98.9% to 55.7%, respectively) compared to medicine (38.0% to 27.5%).

In other words, we should be much more cautious about the claims of statistically significant effects arising from meta-analyses in economics than equivalent claims from meta-analyses in psychology or medicine (over and above any general caution we should have about meta-analysis - see here). That is, these results suggest that publication bias is a much greater problem in economics than in psychology or medicine.

However, there are a few points to note here. The number of meta-analyses included in this study is much lower for economics than for psychology or medicine. Although Bartoš et al. appear to account for this, I think it suggests the potential for another issue.

Perhaps there is publication bias in meta-analyses (a meta-publication bias?), which Bartoš et al. don't test for? If it were the case that meta-analyses that show statistically significant effects were more likely to be published in economics than meta-analyses that show statistically insignificant effects, and this meta-publication bias was larger in economics than in psychology or medicine, then that would explain the results of Bartoš et al. However, it would not necessarily demonstrate that there was publication bias in the underlying economic studies. Bartoš et al. need to test for publication bias in the meta-analysis sample.

That is a reasonably technical point. However, it does seem likely that there is publication bias, and it would not surprise me if it is larger in economics than in medicine, but I wouldn't necessarily expect it to be any worse than psychology. As noted in the book The Cult of Statistical Significance by Stephen Ziliak and Deirdre McCloskey (which I reviewed here), there remains a dedication among social scientists in general, and economists in particular, to finding statistically significant results, and that is a key driver of publication bias (see here).

Maybe economists are dodgy researchers. Or maybe, we just need to be better at reporting and publishing statistically insignificant results, and adjusting meta-analyses to account for this bias.

[HT: Marginal Revolution]

Sunday, 22 January 2023

Home crowds and home advantage

It is well known that, in most if not all sports, there is a sizeable advantage to playing at home. However, it isn't clear exactly what mechanism causes this home advantage to arise. Is it because when a team (or individual sportsperson) is playing at home, they don't have to travel as far, and are refreshed and comfortable at game time? Or, is it because the team (or individual sportsperson) is more familiar with the home venue than their competitors are? Or, is it because of home fan support?

Previous research has found it very difficult to disentangle these different mechanisms as being the underlying cause of home advantage. Cue the coronavirus pandemic, which created an excellent natural experiment that allows us to test a range of hypotheses, including about home advantage in sports. Since fans were excluded from stadiums in many sports, we can see what happens to home advantage when home fan support is taken away. If home advantage is just as large without fans, we can rule out home fan support as a contributor to home advantage. If it shrinks, then home fan support must account for at least some of it.

And that is essentially what the research reported in this new article by Jeffrey Cross (Hamilton College) and Richard Uhrig (University of California, Santa Barbara), published in the Journal of Sports Economics (open access), tries to do. Specifically, they look at four of the top five European football leagues (Bundesliga, La Liga, Premier League, and Serie A), all of which faced a disrupted 2019-20 season, and after the disruption resumed play with restrictions that prevented fan attendance at games. Essentially, they compare home team performance before and after the introduction of the no-fans policy. Their preferred outcome variable is 'expected goals' rather than actual goals scored. Cross and Uhrig justify the choice as:

Due to randomness, human error, and occasional moments of athletic brilliance, the realized score of a match is a noisy signal for which team actually played better over the course of 90 minutes. In order to mitigate this noise, we focus on expected goals, or xG, which measure the quantity and quality of each team’s chances to score; they have been shown to better predict future performance and more closely track team actual performance than realized goals... Expected goals are calculated by summing the ex ante probabilities that each shot, based on its specific characteristics and historical data, is converted into a goal... For example, if a team has four shots in a game, each with a scoring probability of 0.25, then their expected goals for the match would sum to 1. However, their realized goals could take any integer value from 0 to 4...
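The four-shot example in that quote is easy to check. The sketch below sums the shot probabilities to get expected goals, and also computes the probability of each realised goal count. The second part assumes that shots are independent of each other, which is my simplifying assumption and not something Cross and Uhrig state. The function names are mine.

```python
def expected_goals(shot_probs):
    """Expected goals: the sum of each shot's ex ante scoring probability."""
    return sum(shot_probs)

def goal_distribution(shot_probs):
    """Probability of each realised goal count, assuming independent shots."""
    dist = [1.0]  # dist[k] = probability of exactly k goals so far
    for p in shot_probs:
        new = [0.0] * (len(dist) + 1)
        for k, prob in enumerate(dist):
            new[k] += prob * (1 - p)  # this shot is missed
            new[k + 1] += prob * p    # this shot is scored
        dist = new
    return dist

shots = [0.25, 0.25, 0.25, 0.25]  # the four shots from the quoted example
xg = expected_goals(shots)
dist = goal_distribution(shots)

print(f"Expected goals: {xg}")
for goals, prob in enumerate(dist):
    print(f"P({goals} goals) = {prob:.4f}")
```

Under the independence assumption, a team with an xG of exactly 1 still fails to score at all nearly a third of the time, which illustrates why realised goals are such a noisy signal of how well a team played.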

Their data goes back to the 2009-10 season, and includes some 15,906 games in total. However, they only have data on expected goals from the 2017-18 season onwards, which includes 4,336 games. Because the games with no fans were played later than usual, the temperature was higher (as the season was extending closer to summer), so they make sure to control for weather, as well as for the number of coronavirus cases.

Looking at realised goals, Cross and Uhrig find that:

...raw home field advantage decreased by 0.213 goals per game from a baseline of a 0.387 goals per game advantage for the home team... This represents a decrease of 55%.

But, as they argue, this is quite a noisy measure of home advantage. So, they turn to their measure of expected goals, and find that:

...raw home field advantage, as measured by expected goals instead of realized goals, decreased by 64% from a 0.307 expected goal advantage for the home team to just 0.110 expected goals. Although the magnitude of the decrease is smaller than realized goals in absolute terms (0.197 xG as opposed to 0.213 G), it represents a larger fraction of the initial home field advantage (64% as opposed to 55%) because the initial home field advantage is smaller as measured by expected goals than realized goals.
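The point about absolute versus proportional decreases can be verified directly from the numbers quoted above. This is just arithmetic on the reported figures; the function name is mine.

```python
def share_lost(baseline, decrease):
    """Fraction of the baseline home advantage that disappeared without fans."""
    return decrease / baseline

# Realised goals: baseline advantage of 0.387, decrease of 0.213.
goals_share = share_lost(baseline=0.387, decrease=0.213)

# Expected goals: advantage fell from 0.307 to 0.110, a decrease of 0.197.
xg_decrease = 0.307 - 0.110
xg_share = share_lost(baseline=0.307, decrease=xg_decrease)

print(f"Realised goals: {goals_share:.0%} of home advantage lost")
print(f"Expected goals: {xg_share:.0%} of home advantage lost")
```

The smaller absolute decrease (0.197 versus 0.213) becomes the larger percentage decrease because it is divided by a smaller baseline.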

Finally, looking at game outcomes, they find that:

...the lack of fans led to fewer home wins and more home losses, but the probability of a draw is unaffected, suggesting that fans are symmetrically pivotal: fans are approximately as likely to shift a result from a draw to a home win as they are from a home loss to a draw... Approximately 5.4 percentage points are shifted from the probability of winning to the probability of losing.

So, coming back to the question we started with, at least some of the home advantage that football teams experience is due to home crowd support. Given that home advantage decreased by somewhere between 55 percent and 64 percent, the share of home advantage that home crowd support is responsible for is sizeable. Of course, this doesn't necessarily extend to all sports. But it does show that home crowd support is important.

Friday, 20 January 2023

Grading bias, classroom behaviour, and assessing student knowledge

There is a large literature that documents teachers' biases in the grading of student assessments. For example, studies have used comparisons of subjectively graded (by teachers) assessments and objectively graded (or blind graded) assessments, to demonstrate gender bias and racial bias. However, grading bias may not just arise from demographics. Teachers may also show favouritism towards well-behaved students (relative to badly-behaved students). The challenge with demonstrating that bias is that researchers often lack detailed measures of student behaviour.

That is not the case for the research reported in this recent article by Bruno Ferman (Sao Paulo School of Economics) and Luiz Felipe Fontes (Insper Institute of Education and Research), published in the Journal of Public Economics (sorry, I don't see an ungated version online). They used data from a Brazilian private education company that manages schools across the country, and covered:

...about 23,000 students from grades 6-11 in 738 classrooms and 80 schools.

Importantly, the data includes student assessment results that were graded by their teacher, standardised test results that were machine-graded, and measures of student behaviour, which the company collected in order to "better predict their dropout and retention rates". Ferman and Fontes collate the behavioural data, and:

...classify a student being assessed in subject s and cycle t as well-behaved (GBits = 1) if she is in the top quartile within class in terms of good behavior notifications received until t by all teachers except the subject one. We classify bad-behaved students (BBits = 1) analogously.

They then compare maths test scores between well-behaved and badly-behaved students, and show that:

...the math test scores of ill-behaved students (BB = 1) are on average 0.31 SD below those such that BB = 0. The unconditional grade gap between students with GB = 1 and GB = 0 is even greater: 0.54 SD in favor of the better-behaved pupils.

So far, so unsurprising. Perhaps better-behaved students also study harder. However, when Ferman and Fontes control for blindly graded math scores, they find that:

...the behavior effects are significantly reduced, indicating that a share of the competence differences seen by teachers is captured by performance in the blindly-scored tests... Nevertheless, the behavior effects remain significant and are high in magnitude, indicating that teachers confound scholastic and behavioral skills when grading proficiency exams. Our results suggest that the better(worse)-behaved students have their scores inflated (deducted) by 0.14 SD...

This is quite a sizeable effect, amounting to "approximately 60% of the black-white achievement gap". And that is simply arising from teacher grading bias. Ferman and Fontes then go on to show that their results are robust to some alternative specifications, and that there is also apparent teacher bias in decisions of which students are allowed to move up to the next grade.

However, should we care about grading bias? Ferman and Fontes point out that their results:

...characterize an evaluation scheme that is condemned by educators and classroom assessment specialists, which explicitly warn against the adjustment of test scores to reflect students’ behavior... and consider this practice as unethical... Their argument is that achievement grades are the main source of feedback teachers send about the students’ proficiency levels. Therefore, test scores help pupils form perceptions about their own aptitudes and assist them in the process of self-regulation of learning; additionally, they help parents to understand how to allocate effort to improve their children’s academic achievement...

Still, one could argue that biasing test scores may be socially desirable if it induces a student to behave better, generating private benefits to the pupil and positive externalities to peers...

Let me suggest another counterpoint. If grades are a signal to universities or to employers about the relative ranking of students in terms of performance, then maybe you want those grades to reflect students' behaviour as well as students' attainment of learning outcomes. You might disagree, but I'd argue that there are already elements of this in the way that we grade students (in high schools and universities). If teachers (and educational institutions) were purists about grades reflecting student learning alone, then we would never estimate student grades for students who miss a piece of assessment, and we would never scale grades (up or down). The fact that we do those things (and did so especially during the pandemic) suggests that student grades already can't be interpreted solely as reflecting students' attainment of learning outcomes.

Employers (and universities) want grades that will be predictive of how a student will perform in the future. However, academic achievement is an imperfect measure of future performance of students. This is demonstrated clearly in this recent article by Georg Graetz and Arizo Karimi (both Uppsala University), published in the journal Economics of Education Review (open access). They used administrative data from Sweden, focusing mainly on the cohort of students born in 1992. Graetz and Karimi are most interested in explaining a gender gap that exists between high school grades (where female students do better) and the standardised Swedish SAT tests (where male students do better). Specifically:

...female students, on average, outperform male students on both compulsory school and high school GPAs by about a third of a standard deviation. At the same time, the reverse is true for the Swedish SAT, where female test takers underperform relative to male test takers by a third of a standard deviation...

Graetz and Karimi find that differences in cognitive skills, motivation, and effort explain more than half of the difference in GPAs between female and male students, and that female students have higher motivation and exert greater effort. In contrast, there is selection bias in the SAT scores. This arises in part because Swedish students can qualify for university based on grades, or based on SAT scores. So, students that already have high grades are less likely to sit the SATs. Since more of those students are females with high cognitive skills, the remaining students who sit the SAT test disproportionately include high-cognitive-skill males, which is why males on average perform better in the Swedish SATs.

However, aside from being kind of interesting, that is not the important aspect of the Graetz and Karimi paper that I want to highlight. They then go on to look at the post-high-school outcomes for students born in 1982, and look at how those outcomes relate to grades and SAT scores. In this analysis, they find that:

Grades and SAT scores are strong predictors of college graduation, but grades appear about twice as important as SAT scores, with standardized coefficients around 0.25 compared to just over 0.1...

A one-standard-deviation increase in CSGPA and HSGPA is associated with an increase in annual earnings of SEK15,500 and 25,200, respectively (SEK1,000 is equal to about USD100). But for the SAT score, the increase is only SEK8,000.

In other words, high school grades are a better predictor of both university outcomes (graduation) and employment outcomes (earnings) than standardised tests. This should not be surprising, given that, when compared with standardised tests, grades may better capture student effort and motivation, which will be predictive of student success in university and in employment. And to the extent that good student behaviour is also associated with higher motivation and greater effort, perhaps we want grades to reflect that too. [*]

None of this is to say that we shouldn't be assessing student knowledge. It's more that grades that represent a more holistic measure of student success will be more useful in predicting future student performance. That is more helpful for employers, and as a result it may be more helpful for encouraging students to study harder as well.


[*] Of course, selection bias matters here too. In the case of the Swedish SATs, the most motivated and hardest working students may have opted out of the SAT test entirely. However, the analysis that Graetz and Karimi undertook is (I think) limited to students who had both grades and SAT scores recorded.

Thursday, 19 January 2023

Tea drinking vs. beer drinking, and mortality in pre-industrial England

When I introduce the difference between causation and correlation in my ECONS101 class, I talk about how, even when there is a good story to tell about why a change in one variable causes a change in the other, that doesn't necessarily mean that an observed relationship is causal. It appears that I am just as susceptible to a good story as anyone else. When a research paper has a good story, and the data and methods seem credible, I'm willing to update my priors by a lot (unless the results also contradict a lot of the prior research). I guess that's a form of confirmation bias.

So, I was willing to accept at face value the results of the article on tea drinking and mortality in England that I blogged about earlier this week. To recap, that research found that the increase in tea drinking in 18th Century England, by promoting the boiling of water, reduced mortality. However, now I'm not so sure. What has caused me to re-evaluate my position is this other paper by Francisca Antman and James Flynn (both University of Colorado, Boulder), on the effect of beer drinking on mortality in pre-industrial England.

Antman is the author of the tea-drinking article, so it should be no surprise that the methods and data sources are similar, given the similarity of the two papers in terms of research question and setting. However, there are some key differences between the two papers (which I will come to in a minute). First, why study beer? Antman and Flynn explain that:

Although beer in the present day is regarded primarily as a beverage that would be worse for health than water, several features of both beer and water available during this historical period suggest the opposite was likely to be true. First, brewing beer would have required boiling the water, which would kill many of the dangerous pathogens that could be found in contaminated drinking water. As Bamforth (2004) puts it, ‘the boiling and the hopping were inadvertently water purification techniques’ which made beer safer than water in 17th century Great Britain. Second, the fermentation process which resulted in alcohol may have added antiseptic qualities to the beverage as well...

Notice that the first mechanism here is basically the same as for tea. Boiling water makes water safer to drink, even when it is being used in brewing. Also:

...beer in this period, which sometimes referred to as ”small beer,” was generally much weaker than it is today, and thus would have been closer to purified water. Accum (1820) found that small beer in late 18th and early 19th century England averaged just 0.75% alcohol by volume, a tiny fraction of the content of even the ‘light’ beers of today.

The data sources are very similar to those used for the tea drinking paper, and the methods are substantially similar as well. Antman and Flynn compare parish-level summer deaths (which are more likely to be associated with water-borne disease than winter deaths) between areas with high water quality and low water quality, before and after a substantial increase in the malt tax in 1780. Using this difference-in-differences approach, they find that:

...the summer death rate in low water quality parishes increases by 22.2% relative to high water quality parishes, with a p-value on the equality of the two coefficients of .001.
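For readers unfamiliar with difference-in-differences, the basic logic can be sketched with four numbers. The death rates below are entirely hypothetical and chosen only to make the arithmetic easy to follow. They are not from Antman and Flynn's data, and their actual estimates come from a regression on parish-level data, not a simple two-by-two comparison like this one.

```python
import math

def did_log_change(treated_before, treated_after, control_before, control_after):
    """Difference-in-differences on log rates.

    Returns the change in the log rate for the treated group minus the
    change in the log rate for the control group.
    """
    treated_change = math.log(treated_after / treated_before)
    control_change = math.log(control_after / control_before)
    return treated_change - control_change

# HYPOTHETICAL summer death rates (per 1,000), before and after the tax change.
low_quality_before, low_quality_after = 10.0, 12.0   # "treated": low water quality
high_quality_before, high_quality_after = 8.0, 8.8   # "control": high water quality

did = did_log_change(low_quality_before, low_quality_after,
                     high_quality_before, high_quality_after)
relative_increase = math.exp(did) - 1

print(f"Relative increase in low water quality parishes: {relative_increase:.1%}")
```

In this made-up example, death rates rise by 20 percent in the low water quality parishes and by 10 percent in the high water quality parishes, so the difference-in-differences estimate attributes a relative increase of about 9 percent to the tax change. The key assumption is that, without the tax change, the two groups of parishes would have followed the same trend.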

Antman and Flynn then use a second identification strategy, which is to compare summer deaths between parishes that have gley soil (suitable for growing barley, which is then malted and used to make beer) and parishes without gley soil, before and after the change in the malt tax. In this analysis, they find that:

...parishes with gley soil had summer death rates which increased by approximately 18% after the malt tax was implemented relative to parishes without gley soil.

Not satisfied with only two identification strategies, Antman and Flynn then use a third, which is rainfall. Their data is limited to the counties around London (because that is where they have the rainfall data from). In this analysis, they find that:

...the effect of rainier barley growing seasons on parishes with few nearby water sources is positive and significant, indicating that summer deaths rise following particularly rainy barley growing seasons... [and] ...rainy barley-growing seasons lead to more summer deaths in areas where beer is most abundant, even controlling for the number of deaths occurring in the winter months.

So, the evidence seems consistent with beer drinking being associated with lower mortality, because in areas where beer drinking decreased (because of the increase in the malt tax) by a greater amount, mortality increased by more.

But not so fast. There are two problems here, when you compare across the tea drinking and beer drinking research. First, the data that they use is not consistent. The tea drinking paper uses all deaths in each parish. The beer drinking paper uses only summer deaths, arguing that summer deaths are more likely to be from water-borne causes. If that is the case, why use all deaths in the tea drinking paper? What happens to the results from each paper when you use the same mortality data specification?

Second, the increase in the malt tax was in 1780. The decrease in the tea tax (which the tea drinking paper relies on) was in 1784. The two tax changes are awfully close together timewise, and disentangling their effects would be difficult. However, neither paper seems to account for the other properly. The beer drinking paper includes tea imports as a control variable, but in the tea drinking paper it wasn't tea imports, but tea imports interacted with water quality that was the key explanatory variable (and the timing of the tea tax change interacted with the water quality variable). The tea drinking paper doesn't really control for changes in beer drinking at all.

That second problem is the bigger issue, and creates a potential omitted variable problem in both papers. If you don't properly account for changes in tea drinking in the beer drinking paper, and the two tax changes happened around the same time, how can you be sure that the change in mortality was due to beer drinking, and not tea drinking? And vice versa for failing to include changes in beer drinking in the tea drinking paper.

However, maybe things are not all bad here. Remember that the two effects are going in opposite directions. It is possible that the decrease in the tea tax increased tea drinking, and mortality reduced, while the increase in the malt tax decreased beer drinking, and mortality increased. However, then we come back to the first problem. Why use a measure of overall mortality in the tea drinking paper, and a measure of only summer mortality in the beer drinking paper, when both papers are supposed to be looking at changes in mortality stemming from water-borne diseases?

Hopefully now you can see why I have my doubts about the tea drinking paper, as well as the beer drinking paper. Both are telling an interesting story, but the inconsistencies in data and approach across the two papers should make us extra cautious about the results, and leave us pondering the question of whether the results are causal or simply correlation.

[HT for the beer paper: The Dangerous Economist]

Read more: