What Tweets and Yelp reviews are revealing

Amy Merrick | Nov 06, 2017


Economists and other social-science researchers are using large bodies of text—from Twitter messages, Google searches, Yelp reviews, and other sources—to predict asset price movements, estimate racial prejudice, and study what’s driving consumer decisions.

As text analysis becomes more popular, Stanford’s Matthew Gentzkow and Chicago Booth’s Bryan T. Kelly and Matt Taddy survey the research, identifying and explaining some of the most common ways to transform strings of words into usable data.

In its simplest form, text analysis involves counting the frequency with which words and phrases appear in documents, internet search queries, or online databases, and using these counts to answer research questions. (For more, see “Why words are the new numbers,” Spring 2015.) As the field develops, researchers are using patterns uncovered by earlier studies to test for potential biases in their own findings. 

Researchers are also making a host of observations—from finding political slant in news coverage to measuring how much political uncertainty affects economic growth. MIT’s Albert Saiz and Wharton’s Uri Simonsohn use web search results to estimate the extent of corruption in US cities. Seth Stephens-Davidowitz, a former Google data scientist, uses the frequency of racially charged terms in searches to measure prejudice in areas of the United States. And University of Bonn’s Benjamin Born and the European Central Bank’s Michael Ehrmann and Marcel Fratzscher analyze the effects of financial stability reports issued by central banks. Their research suggests optimistic reports move stock markets by at least 1 percent in the month following their release, while pessimistic reports have less influence.

Such analyses involve large volumes of data. “A sample of 30-word Twitter messages that use only the 1,000 most common words in the English language has roughly as many dimensions as there are atoms in the universe,” write Gentzkow, Kelly, and Taddy.
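The arithmetic behind that comparison is quick to verify: a 30-word message drawn from a 1,000-word vocabulary has 1,000^30 = 10^90 possible realizations, more than the roughly 10^80 atoms commonly estimated to be in the observable universe.

```python
# Number of distinct 30-word messages over a 1,000-word vocabulary.
vocab_size = 1_000
message_length = 30
n_messages = vocab_size ** message_length  # (10**3)**30 = 10**90

atoms_in_universe = 10 ** 80  # common order-of-magnitude estimate

print(n_messages == 10 ** 90)           # True
print(n_messages > atoms_in_universe)   # True
```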

A process of deconstruction
While researchers have been processing text into numeric data for decades, new technologies are enabling them to analyze these data in inventive ways. The walk-through below shows how researchers prepare text, on a large scale, for different kinds of analysis.
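A minimal sketch of this preprocessing in Python, assuming simple lowercasing, punctuation stripping, a crude suffix-based stemmer, and a small hand-picked stop-word list (all of these choices are illustrative, not the authors’ method):

```python
import re
from collections import Counter

TEXT = ("Good night, good night! Parting is such sweet sorrow, "
        "that I shall say good night till it be morrow.")

STOP_WORDS = {"is", "such", "that", "it", "be"}  # illustrative list

def bag_of_words(text):
    # 1. Lowercase and remove punctuation.
    words = re.findall(r"[a-z]+", text.lower())
    # 2. Remove common ("stop") words.
    words = [w for w in words if w not in STOP_WORDS]
    # 3. Reduce words to rough roots (here: strip a trailing "-ing",
    #    so "parting" becomes "part").
    words = [w[:-3] if w.endswith("ing") else w for w in words]
    return Counter(words)

counts = bag_of_words(TEXT)
print(counts["good"], counts["night"])  # 3 3
```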

Take a line from Romeo and Juliet:

“Good night, good night! Parting is such sweet sorrow, that I shall say good night till it be morrow.”

First, remove punctuation. Next, reduce words to their roots, so “parting” becomes “part.” Then remove some common words. Most of the remaining whole words are added to the “bag of words,” and counting each word’s appearances gives:

Word     Count
good     3
I        1
morrow   1
night    3
part     1
say      1
shall    1
sorrow   1
sweet    1
till     1

Methods researchers use to analyze text data

Dictionary-based methods:
After making a bag of words, researchers classify them according to a predefined dictionary such as Harvard's General Inquirer program, which groups them by sentiments such as “positive” or “optimistic.” “This is by far the most common method in the social science literature using text to date,” the researchers note.
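The idea can be sketched with a tiny, made-up sentiment dictionary (the General Inquirer’s real word lists contain thousands of entries; the words and document below are invented for illustration):

```python
from collections import Counter

# Tiny illustrative sentiment dictionary; a real resource such as the
# General Inquirer groups thousands of words into categories.
SENTIMENT = {
    "gain": "positive", "growth": "positive", "strong": "positive",
    "loss": "negative", "risk": "negative", "weak": "negative",
}

def sentiment_score(bag):
    """Net positive-minus-negative count for a bag of words."""
    tally = Counter()
    for word, count in bag.items():
        label = SENTIMENT.get(word)
        if label:
            tally[label] += count
    return tally["positive"] - tally["negative"]

# A toy document represented as word counts.
doc = Counter({"strong": 2, "growth": 1, "risk": 1, "market": 3})
print(sentiment_score(doc))  # 2
```

Words absent from the dictionary, like “market” above, simply don’t count toward either category.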

Generative model:
Here researchers explicitly simulate the process by which language is produced. For example, when more people lose their jobs, unemployment-related terms will rise in web searches. One common generative model is a topic model, which discovers the themes in a collection of documents and then clusters the documents accordingly.
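A topic model’s generative story can be sketched directly: each word in a document is produced by first picking a topic, then picking a word from that topic’s distribution. The topics and probabilities below are invented for illustration:

```python
import random

# Illustrative topic-word distributions (made up for this sketch).
TOPICS = {
    "jobs":    {"unemployment": 0.5, "hiring": 0.3, "wages": 0.2},
    "markets": {"stocks": 0.4, "bonds": 0.3, "yields": 0.3},
}

def generate_document(topic_weights, n_words, rng):
    """Generate a document the way a topic model assumes it was written:
    pick a topic for each word, then draw a word from that topic."""
    words = []
    for _ in range(n_words):
        topic = rng.choices(list(topic_weights),
                            weights=list(topic_weights.values()))[0]
        dist = TOPICS[topic]
        words.append(rng.choices(list(dist), weights=list(dist.values()))[0])
    return words

rng = random.Random(0)
doc = generate_document({"jobs": 0.8, "markets": 0.2}, n_words=10, rng=rng)
print(doc)  # mostly employment-related words, per the 0.8 weight
```

Fitting a topic model runs this process in reverse: given only the documents, it infers the topic-word distributions and each document’s topic mixture.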

Text regression model:
This type of model uses statistical methods to predict some attribute, such as real-estate prices, from counts of words. The most common text regression model, known as a penalized linear model, imposes penalties that shrink the coefficients of uninformative words toward zero. This approach does not try to model the structure of the language.
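As a sketch of the penalized-linear idea, ridge regression adds a squared-size penalty to ordinary least squares, shrinking all coefficients toward zero. The word-count matrix and prices below are toy data invented for illustration:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge regression: minimizes ||y - Xw||^2 + lam * ||w||^2."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# Toy word-count matrix: rows are property listings, columns are counts
# of the words "spacious", "renovated", and "the" (a filler word).
X = np.array([[2.0, 1.0, 5.0],
              [0.0, 3.0, 4.0],
              [1.0, 0.0, 6.0],
              [3.0, 2.0, 5.0]])
y = np.array([300.0, 280.0, 150.0, 380.0])  # toy sale prices ($000s)

w_ols   = ridge_fit(X, y, lam=0.0)   # ordinary least squares
w_ridge = ridge_fit(X, y, lam=10.0)  # penalized: coefficients shrink

print(np.linalg.norm(w_ridge) < np.linalg.norm(w_ols))  # True
```

Penalized models in the literature more often use an L1 (lasso) penalty, which sets many coefficients exactly to zero; the shrinkage principle is the same.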

Deep-learning techniques:
Artificial neural networks simulate interconnected brain cells inside a computer. They recognize nonlinear relationships in complex data sets extremely well. New “deep” versions of neural networks work faster with less tuning required by users. Already powering Google Translate, these methods are catching on quickly in research.
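A minimal illustration of the nonlinearity point: a network with one hidden layer of two ReLU units can compute XOR, a relationship no linear model on the raw inputs can capture. The weights below are hand-picked for the sketch (production systems like Google Translate learn millions of weights from data):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

# Hand-picked weights for a two-unit hidden layer that computes XOR:
#   h1 = relu(x1 + x2)       counts how many inputs are on
#   h2 = relu(x1 + x2 - 1)   fires only when both inputs are on
#   out = h1 - 2*h2          equals 1 iff exactly one input is on
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])
b1 = np.array([0.0, -1.0])
w2 = np.array([1.0, -2.0])

def xor_net(x):
    return relu(x @ W1 + b1) @ w2

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
print(xor_net(X))  # [0. 1. 1. 0.]
```

The nonlinearity in relu is what makes this possible; without it, the network collapses into a single linear map, which cannot separate XOR’s outputs.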