Economists and other social-science researchers are using large bodies of text—from Twitter messages, Google searches, Yelp reviews, and other sources—to predict asset price movements, estimate racial prejudice, and study what’s driving consumer decisions.
As text analysis becomes more popular, Stanford’s Matthew Gentzkow and Chicago Booth’s Bryan T. Kelly and Matt Taddy survey the research, identifying and explaining some of the most common ways to transform strings of words into usable data.
In its simplest form, text analysis involves counting the frequency with which words and phrases appear in documents, internet search queries, or online databases, and using these counts to answer research questions. (For more, see “Why words are the new numbers,” Spring 2015.) As the field develops, researchers are using patterns uncovered by earlier studies to test for potential biases in their own findings.
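To make the counting step concrete, here is a minimal sketch in Python. It is an illustration only, not the method of any study mentioned here: the function name `count_terms`, the simple letters-only tokenizer, and the two sample reviews are all assumptions made up for the example.

```python
import re
from collections import Counter


def count_terms(documents, ngram=1):
    """Count how often each word (ngram=1) or phrase (ngram>1) appears.

    Hypothetical helper for illustration; real studies typically use
    more careful tokenization, stemming, and stop-word handling.
    """
    counts = Counter()
    for doc in documents:
        # Lowercase the text and keep runs of letters and apostrophes as tokens.
        tokens = re.findall(r"[a-z']+", doc.lower())
        # Slide a window of length `ngram` across the tokens of one document.
        grams = zip(*(tokens[i:] for i in range(ngram)))
        counts.update(" ".join(gram) for gram in grams)
    return counts


# Made-up sample documents, standing in for reviews or search queries.
sample_reviews = [
    "Great food and great service",
    "The service was slow but the food was great",
]

word_counts = count_terms(sample_reviews)             # single words
phrase_counts = count_terms(sample_reviews, ngram=2)  # two-word phrases

print(word_counts.most_common(3))
print(phrase_counts["great food"])
```

The resulting counts are the raw material: a researcher would then relate them to an outcome of interest, such as prices, votes, or ratings.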
Researchers are also making a host of observations, from finding political slant in news coverage to measuring how much political uncertainty affects economic growth. MIT’s Albert Saiz and Wharton’s Uri Simonsohn use web search results to estimate the extent of corruption in US cities. Seth Stephens-Davidowitz, a former Google data scientist, uses the frequency of racially charged terms in searches to measure prejudice in areas of the United States. And the University of Bonn’s Benjamin Born and the European Central Bank’s Michael Ehrmann and Marcel Fratzscher analyze the effects of financial stability reports issued by central banks. Their research suggests optimistic reports move stock markets by at least 1 percent in the month following their release, while pessimistic reports have less influence.
Such analyses involve large volumes of data. “A sample of 30-word Twitter messages that use only the 1,000 most common words in the English language has roughly as many dimensions as there are atoms in the universe,” write Gentzkow, Kelly, and Taddy.
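The quote does not spell out the arithmetic, but one plausible reading is that each distinct 30-word message counts as its own dimension, which gives 1,000 choices for each of 30 positions. The sketch below works through that reading. The figure of about 10^80 atoms is a commonly cited rough order-of-magnitude estimate, brought in here as an assumption rather than taken from the article.

```python
VOCABULARY_SIZE = 1_000  # the 1,000 most common English words
MESSAGE_LENGTH = 30      # words per message

# Assumption: a commonly cited rough estimate, as a power of ten.
ATOMS_IN_UNIVERSE_EXPONENT = 80

# Each of the 30 positions can hold any of the 1,000 words.
distinct_messages = VOCABULARY_SIZE ** MESSAGE_LENGTH

# The result is an exact power of ten, so the exponent is the digit count minus one.
exponent = len(str(distinct_messages)) - 1

print(f"Distinct messages: 10^{exponent}")
print(f"Assumed atom count: 10^{ATOMS_IN_UNIVERSE_EXPONENT}")
```

Under these assumptions the number of possible messages is, if anything, larger than the atom estimate, which is why researchers cannot work with raw text directly and must first reduce it to a manageable set of counts or features.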
A process of deconstruction
While researchers have been processing text into numeric data for decades, new technologies are enabling them to analyze these data in inventive ways.