Saturday, 20 June 2020

Book review: Everybody Lies

'Big Data' has become one of the most commonly used buzzwords in both the corporate world and academia, and 'data science' is not far behind. If you want to understand the increasing relevance of these terms to social science, then Seth Stephens-Davidowitz's book Everybody Lies is a good place to start. Stephens-Davidowitz is a former Google data scientist and has a PhD in economics from Harvard (where Alberto Alesina, who sadly passed away last month, was his PhD advisor).

The subtitle of the book is "Big data, new data, and what the internet can tell us about who we really are", and the content follows on from Stephens-Davidowitz's thesis work using Google Trends search data, but goes much broader - the book also makes use of Pornhub search data, and data scraped from Wikipedia, among other sources. The underlying idea is that:
...people's search for information is, in itself, information. When and where they search for facts, quotes, jokes, places, persons, things, or help, it turns out, can tell us a lot more about what they really think, really desire, really fear, and really do than anyone might have guessed.
Stephens-Davidowitz identifies four 'unique powers' of Big Data: (1) offering up new types of data; (2) providing honest data; (3) allowing us to zoom in on small subsets of people; and (4) allowing us to do many causal experiments. On the first power, among other examples he talks about text-as-data and sentiment analysis (which is a rapidly emerging field, and I have blogged previously about its use - see here), and pictures-as-data and the use of night lights as a proxy for economic activity (I have blogged about several papers that do this - see here and here and here for examples).

On the second power, Stephens-Davidowitz rightly notes that there are a lot of topics where survey data are likely to be seriously flawed due to social desirability bias - people answering survey questions in a way that makes them look better than they actually are. However, it turns out that people are much less worried about social desirability when it comes to what they type into a search engine. Stephens-Davidowitz is able to use these search data to explore questions such as what proportion of men are homosexual (more than most surveys suggest), how racist Americans really are (much more than surveys reveal, as recent events have demonstrated), and whether Freudian slips are real (probably not).

On the third power, Stephens-Davidowitz provides several examples of where large datasets allow analyses of small areas or small groups that even large surveys would not be able to tell us much about. Stephens-Davidowitz then devotes a section to experiments and A/B testing to illustrate the fourth power of big data.

Finally, he devotes a chapter to some of the limitations of big data, and in particular, he highlights the ethical issues that may arise. That sets this book apart from, say, Reinventing Capitalism in the Age of Big Data (which I reviewed here), which is much less reflexive in discussing the potential current and future uses of big data.

Overall, I really enjoyed this book, and I highly recommend it to the general reader.


1 comment:

  1. Have now read it - all the way to the end!

    Good recommendation. Thanks.
