Tuesday, 25 April 2023

Reason to be cautious with the inverse hyperbolic sine transformation

Trigger warning: This post is more technical than my usual posts.

Economists often transform data (on incomes, for example) by taking logarithms. This has statistical advantages, because it makes otherwise skewed variables better behaved in the analysis. It also has a neat property in terms of the interpretation of regression coefficients: in a log-linear model (where the dependent variable is measured in logs and the explanatory variable is not), the coefficient multiplied by 100 can be interpreted as an approximate percentage change, and in a log-log model (where both the dependent and explanatory variables are measured in logs), the coefficient is an elasticity.

However, there is a problem with the log transformation. The logarithm of zero (or of a negative number) is undefined, and this makes some analyses challenging. For example, in gravity models of trade or migration, small areas that are far apart may have zero flows between them. Since the gravity model relies on logs of trade or migration flows, the zero values cause a problem. Or, if you want to estimate the effect of some programme for underemployed youth on employment income, you would often use the log of income as the dependent variable. However, unemployed people may have zero reported income, and those zero values cause a problem.

There are few good ways of dealing with the problem of zeroes or negative values in a variable that you want to log-transform. You could drop all negative or zero values, but that decreases the sample size and likely biases your results (because observations that have zeroes or negative values are usually different in meaningful ways from those that have positive values). Another option is to compute ln(X+1) rather than ln(X) when log-transforming the variable X. That deals with zeroes, but not with values of -1 or below, and it also biases the results (but probably not as much as simply dropping data would).

An alternative transformation that has gained some traction in recent years is the inverse hyperbolic sine (asinh) transformation. The transformation is asinh(X) = ln(X + (X^2+1)^(1/2)), which is actually not quite as complicated as it seems. It is defined for zero values and, because X + (X^2+1)^(1/2) is always positive, for negative values as well. Moreover, it has been argued that coefficients on variables transformed in this way can be interpreted in the same way as coefficients on log-transformed variables.
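To make the formula concrete, here is a minimal Python sketch (my own illustration, not taken from McKenzie's post or the papers he cites). The helper name asinh_formula is hypothetical. It checks the closed form against the standard library's math.asinh, and shows why the transformation is often treated like a log: asinh(0) is exactly zero, while for large X the value is very close to ln(2X) = ln(X) + ln(2).

```python
import math

def asinh_formula(x: float) -> float:
    """Inverse hyperbolic sine via the closed form ln(x + sqrt(x^2 + 1))."""
    return math.log(x + math.sqrt(x * x + 1.0))

for x in [0.0, 1.0, 100.0, 10_000.0]:
    closed_form = asinh_formula(x)   # the formula from the text
    library = math.asinh(x)          # standard-library implementation
    # ln(2x) is only defined for x > 0; it is the large-x approximation
    log_approx = math.log(2.0 * x) if x > 0 else float("nan")
    print(f"x={x:>8}: formula={closed_form:.6f}  "
          f"math.asinh={library:.6f}  ln(2x)={log_approx:.6f}")
```

The gap between asinh(X) and ln(2X) shrinks quickly as X grows, which is the basis for the log-like interpretation. It is also a hint of trouble: how "log-like" the transformation is depends on how large the numbers are, and that depends on the units.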

However, all may not be as rosy as it seems. This blog post by David McKenzie at the Development Impact blog suggests that we should be much more cautious with the asinh transformation. The post draws on a variety of recent articles and working papers that have investigated the asinh transformation and its properties. The first problem is that the transformation is sensitive to the units of measurement, such that measuring in dollars can result in different coefficient estimates than measuring in thousands or millions of dollars. That should not be the case when the coefficient is supposed to be interpreted as a percentage or an elasticity!
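A toy example shows the unit sensitivity. The sketch below uses made-up data of my own (it is not from any of the papers): incomes are either 0 or 10, and the hypothetical "treatment" raises the share of people with positive income from 50% to 70%, so the extensive margin effect is 0.2. With a binary treatment and no controls, the regression coefficient is just the treated-minus-control difference in means, so no regression package is needed.

```python
import math
from statistics import mean

# Hypothetical data: incomes are either 0 or 10 (units are arbitrary).
control = [0.0] * 50 + [10.0] * 50   # 50% have positive income
treated = [0.0] * 30 + [10.0] * 70   # 70% have positive income

def asinh_effect(scale: float) -> float:
    """Treated-minus-control difference in mean asinh(scale * y)."""
    t = mean(math.asinh(scale * y) for y in treated)
    c = mean(math.asinh(scale * y) for y in control)
    return t - c

effect_y = asinh_effect(1.0)        # outcome measured as y
effect_100y = asinh_effect(100.0)   # same outcome measured as 100 * y

print(f"effect using y:      {effect_y:.3f}")
print(f"effect using 100*y:  {effect_100y:.3f}")
print(f"change in effect:    {effect_100y - effect_y:.3f}")
print(f"0.2 * ln(100):       {0.2 * math.log(100):.3f}")
```

In this toy case, the estimated "effect" more than doubles when the units change, and the change is almost exactly the extensive margin effect times ln(100). The intuition is that rescaling adds roughly ln(100) to the transformed value of everyone with positive income, but leaves the zeroes at zero. With a true log transformation (and no zeroes), rescaling adds the same constant to every observation, so the difference in means would not move at all.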

The kicker may be this bit:

Chen and Roth re-estimate 10 papers published in the AER that used the i.h.s transformation for at least one outcome, and illustrate how re-scaling the outcome units by 100 can lead to a change of more than 100% in the estimated treatment effect – with the largest changes coming for programs that had impacts on the extensive margin. E.g. In Rogall (2021)’s work on the Rwandan genocide, he looks at how the presence of armed groups fosters civilian participation in the violence. The extensive margin effect is 0.195, so a big extensive margin change. The estimated treatment effect then changes from 1.248 to 2.15 depending on whether y or 100*y is used as the outcome – which implies a massive change in the implied percentage change effect if interpreting these as either log points or like a log variable.

The Chen and Roth working paper that McKenzie refers to is available here. Given how often this transformation has been used in recent times, I had recently added it to my personal econometrics cheat sheet. However, I haven't felt the need to use it in my own work as yet (because, in gravity models for example, we tend to use Poisson pseudo-maximum likelihood (PPML), which deals with zero values better than the alternatives to log-transformation). I've now had to go back and footnote my cheat sheet with a cautionary note.

And that is probably the takeaway from McKenzie's post (and the papers he cites there), although he does provide some suggested ways of proceeding (adapted from the Chen and Roth working paper). I prefer to just suggest that when we use the asinh transformation, we need to be cautious.
