What Tweets and Yelp Reviews Are Revealing

By Amy Merrick
November 06, 2017
CBR - Economics

Economists and other social-science researchers are using large bodies of text—from Twitter messages, Google searches, Yelp reviews, and other sources—to predict asset price movements, estimate racial prejudice, and study what’s driving consumer decisions.

As text analysis becomes more popular, Stanford’s Matthew Gentzkow and Chicago Booth’s Bryan T. Kelly and Matt Taddy survey the research, and identify and explain some of the most common ways to transform strings of words into usable data.

In its simplest form, text analysis involves counting the frequency with which words and phrases appear in documents, internet search queries, or online databases, and using these counts to answer research questions. (For more, see “Why words are the new numbers,” Spring 2015.) As the field develops, researchers are using patterns uncovered by earlier studies to test for potential biases in their own findings.

Researchers are also making a host of observations—from finding political slant in news coverage to measuring how much political uncertainty affects economic growth. MIT’s Albert Saiz and Wharton’s Uri Simonsohn use web search results to estimate the extent of corruption in US cities. Seth Stephens-Davidowitz, a former Google data scientist, uses the frequency of racially charged terms in searches to measure prejudice in areas of the United States. And University of Bonn’s Benjamin Born and the European Central Bank’s Michael Ehrmann and Marcel Fratzscher analyze the effects of financial stability reports issued by central banks. Their research suggests optimistic reports move stock markets at least 1 percent in the month following their release, while pessimistic reports have less influence.

Such analyses involve large volumes of data. “A sample of 30-word Twitter messages that use only the 1,000 most common words in the English language has roughly as many dimensions as there are atoms in the universe,” write Gentzkow, Kelly, and Taddy.
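The scale in that quote can be checked with a few lines of arithmetic. This is a sketch of one reading only: it assumes that a "dimension" here means one distinct ordered sequence of 30 words drawn from a 1,000-word vocabulary. The variable names are illustrative.

```python
# Assumption: each distinct ordered 30-word sequence counts as one dimension.
vocabulary_size = 1_000   # the 1,000 most common English words
message_length = 30       # words per message

# Each of the 30 positions can hold any of the 1,000 words.
distinct_messages = vocabulary_size ** message_length

# Python integers are exact, so the power of ten can be checked directly.
is_ten_to_the_90 = distinct_messages == 10 ** 90
power_of_ten = len(str(distinct_messages)) - 1

print(is_ten_to_the_90, power_of_ten)
```

Under that reading, the count is a 1 followed by 90 zeros, which is why researchers reduce text to much smaller representations before analyzing it.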

A process of deconstruction
While researchers have been processing text into numeric data for decades, new technologies are enabling them to analyze these data in inventive ways. The example below shows, on a small scale, what researchers do on a large scale when preparing text for different kinds of analysis.

Good night, good night! Parting is such sweet sorrow,
That I shall say good night till it be morrow.

To prepare this passage, researchers remove punctuation, remove common words, and use only the roots of words. Most whole words are then added to the “bag of words” below.

Word	Count
good	3
I	1
morrow	1
night	3
part	1
say	1
shall	1
sorrow	1
sweet	1
till	1
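
The preparation steps for the passage above can be sketched in a few lines of Python. This is a minimal illustration, not the researchers' tooling: the stop-word list and the crude suffix-stripping rule are assumptions chosen only so that this one passage yields the counts in the table. Real projects typically use a full stop-word list and a proper stemmer.

```python
import re
from collections import Counter

TEXT = ("Good night, good night! Parting is such sweet sorrow, "
        "That I shall say good night till it be morrow.")

# Hypothetical, deliberately tiny stop-word list (an assumption for this example).
STOP_WORDS = {"is", "such", "that", "it", "be"}

def crude_stem(word):
    """Strip a trailing 'ing' from longer words; a stand-in for a real stemmer."""
    if word.endswith("ing") and len(word) > 5:
        return word[:-3]
    return word

def bag_of_words(text):
    # Lowercase and keep only runs of letters, which also drops punctuation.
    tokens = re.findall(r"[a-z]+", text.lower())
    # Remove common words, then reduce the rest to their roots.
    kept = [crude_stem(t) for t in tokens if t not in STOP_WORDS]
    return Counter(kept)

counts = bag_of_words(TEXT)
for word, n in sorted(counts.items()):
    print(word, n)
```

Because the sketch lowercases everything, the table's “I” appears as “i”. The word order of the passage is discarded, which is what makes this a “bag” of words.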

Dictionary-based methods:

After making a bag of words, researchers classify the words according to a predefined dictionary such as Harvard’s General Inquirer program, which groups them by sentiments such as “positive” or “optimistic.” “This is by far the most common method in the social science literature using text to date,” the researchers note.
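
A dictionary-based score can be sketched as a lookup over word counts. The two word lists below are hypothetical and made up for illustration. They are not the General Inquirer's categories, and `net_tone` is just one simple way to summarize the result.

```python
from collections import Counter

# Hypothetical toy dictionary; NOT taken from the General Inquirer.
TOY_DICTIONARY = {
    "positive": {"good", "sweet"},
    "negative": {"sorrow"},
}

def score_by_category(bag, dictionary):
    """Sum the counts of all words in the bag that belong to each category."""
    return {category: sum(bag[word] for word in words)
            for category, words in dictionary.items()}

# Word counts from the bag-of-words table above (with "I" lowercased).
bag = Counter({"good": 3, "night": 3, "i": 1, "morrow": 1, "part": 1,
               "say": 1, "shall": 1, "sorrow": 1, "sweet": 1, "till": 1})

scores = score_by_category(bag, TOY_DICTIONARY)

# One possible summary: positive minus negative hits, as a share of all words.
net_tone = (scores["positive"] - scores["negative"]) / sum(bag.values())
print(scores, round(net_tone, 3))
```

The method is only as good as its dictionary: a word that is positive in one setting may be neutral or negative in another.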

Generative model:

Here researchers explicitly simulate the process by which language is produced. For example, when more people lose their jobs, unemployment-related terms will rise in web searches. One common generative model is a topic model, which discovers the themes in a collection of documents and then clusters the documents accordingly.
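
The generative idea can be sketched with a much simpler model than a full topic model. The sketch below assumes each document comes from a single hidden topic, and that each topic is just a probability distribution over words. The topic names, vocabulary, and probabilities are all hypothetical. It first produces text from a topic, then runs the process in reverse with Bayes' rule to infer the topic from observed words.

```python
import math
import random

# Hypothetical word probabilities per topic (each row sums to 1).
TOPICS = {
    "labor":   {"unemployment": 0.40, "jobs": 0.40, "recipe": 0.10, "movie": 0.10},
    "leisure": {"unemployment": 0.05, "jobs": 0.05, "recipe": 0.45, "movie": 0.45},
}
PRIOR = {"labor": 0.5, "leisure": 0.5}  # assumed equal before seeing any words

def generate_document(topic, length, rng):
    """Forward direction: draw words independently from the topic's distribution."""
    words = list(TOPICS[topic])
    weights = [TOPICS[topic][w] for w in words]
    return rng.choices(words, weights=weights, k=length)

def topic_posterior(words):
    """Reverse direction: probability of each topic given the observed words."""
    log_joint = {
        topic: math.log(PRIOR[topic]) + sum(math.log(probs[w]) for w in words)
        for topic, probs in TOPICS.items()
    }
    peak = max(log_joint.values())  # subtract the max for numerical stability
    unnormalized = {t: math.exp(v - peak) for t, v in log_joint.items()}
    total = sum(unnormalized.values())
    return {t: u / total for t, u in unnormalized.items()}

sample = generate_document("labor", 5, random.Random(0))
posterior = topic_posterior(["jobs", "unemployment", "movie"])
print(sample)
print(posterior)
```

A topic model proper goes further: it lets each document mix several topics and learns the word distributions from the documents rather than taking them as given.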

Text regression model:

This type of model uses statistical methods to predict some attribute, such as real-estate prices, from counts of words. The most common text regression model, known as a penalized linear model, imposes constraints that minimize irrelevant estimates. This approach does not try to model the structure of the language.
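
One widely used penalized linear model is the lasso, which adds an L1 penalty that can shrink irrelevant coefficients all the way to zero. The sketch below fits it by coordinate descent in plain Python. The listings, word counts, prices, and penalty value are invented for illustration, and the choice of the lasso is an assumption rather than the specific model any cited study uses.

```python
def soft_threshold(value, penalty):
    """Shrink value toward zero by penalty, returning exactly 0 if it is small."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0

def lasso(X, y, penalty, sweeps=100):
    """Minimize (1/2n) * squared error + penalty * sum(|w|) on centered data."""
    n, p = len(X), len(X[0])
    col_means = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - col_means[j] for j in range(p)] for row in X]
    resid = [yi - y_mean for yi in y]   # residual when all weights are zero
    w = [0.0] * p
    for _ in range(sweeps):
        for j in range(p):
            col = [Xc[i][j] for i in range(n)]
            scale = sum(v * v for v in col) / n
            if scale == 0:
                continue  # a constant column carries no information
            # Correlation of column j with the residual that excludes its own effect.
            rho = sum(col[i] * (resid[i] + col[i] * w[j]) for i in range(n)) / n
            new_wj = soft_threshold(rho, penalty) / scale
            resid = [resid[i] - col[i] * (new_wj - w[j]) for i in range(n)]
            w[j] = new_wj
    return w

# Hypothetical listings: counts of "granite", "fixer", "the"; prices in $1,000s.
X = [[2, 2, 2], [2, 0, 0], [0, 2, 0], [0, 0, 2]]
y = [311, 349, 249, 291]

weights = lasso(X, y, penalty=5.0)
print(weights)
```

In this toy data the first two words track price and the third barely does, so the penalty keeps shrunken weights on the first two and sets the third to zero. The model never looks at grammar or word order, only counts.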

Deep-learning techniques:

Artificial neural networks simulate interconnected brain cells inside a computer. They recognize nonlinear relationships in complex data sets extremely well. New “deep” versions of neural networks work faster with less tuning required by users. Already powering Google Translate, these methods are catching on quickly in research.
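
What “nonlinear” means here can be shown with a very small network. In the sketch below the weights are picked by hand rather than learned, and the two inputs are hypothetical indicators for whether each of two words appears in a text. The output is 1 when exactly one of the words appears and 0 otherwise, a pattern that no single weighted sum of the two inputs can reproduce.

```python
def relu(x):
    """Rectified linear unit: pass positive values, clip negatives to zero."""
    return max(0.0, x)

def tiny_network(has_word_a, has_word_b):
    """Two hidden units and one output, with hand-picked (not learned) weights."""
    h1 = relu(has_word_a + has_word_b)          # active if either word appears
    h2 = relu(has_word_a + has_word_b - 1.0)    # active only if both appear
    return h1 - 2.0 * h2                        # cancels the signal when both appear

outputs = {(a, b): tiny_network(a, b) for a in (0, 1) for b in (0, 1)}
for inputs, value in outputs.items():
    print(inputs, value)
```

In practice the weights are estimated from data, and “deep” networks stack many such layers, but the principle is the same: hidden units let the model capture interactions that a linear model on word counts would miss.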

Works Cited

Benjamin Born, Michael Ehrmann, and Marcel Fratzscher, “Central Bank Communication on Financial Stability,” Economic Journal, June 2014.
Matthew Gentzkow, Bryan T. Kelly, and Matt Taddy, “Text As Data,” NBER working paper, March 2017.
Albert Saiz and Uri Simonsohn, “Proxying for Unobserved Variables with Internet Document-Frequency,” Journal of the European Economic Association, February 2013.
Seth Stephens-Davidowitz, “The Cost of Racial Animus on a Black Candidate: Evidence Using Google Search Data,” Journal of Public Economics, June 2014.
