You should run more experiments

Testing assumptions in the field can avoid costly errors in decision-making

Credit: Sebastien Thibault

Oleg Urminsky | Aug 03, 2020

When we develop business strategies, we often rely on what we believe has worked in the past. Unfortunately, we’re often wrong.

A better way to make a decision with financial consequences is to conduct an experiment. Obviously, the idea of doing experiments is really old. Think of Galileo in the late 1500s at the top of the Leaning Tower of Pisa with two spheres of different sizes, dropping them off the side to test theories of gravity.

Experiments in the business world have been around for a long time, too, but they weren’t that prevalent in the past. In the 1960s, for example, there was a debate about whether arranging a particular grocery-store item to have a lot of shelf facings—identical products with the label turned out toward the consumer—would cause customers to be more likely to buy the product, or to buy more of it.

That’s hard to tease apart, because things that sell in higher volume are going to get more shelf facings. So Kent State University’s Keith Cox partnered with a grocery chain over several weeks, and at different locations, he randomly varied the number of shelf facings for four products.

He found the evidence was mixed. Increasing shelf facings seemed to cause an uptick in sales of one product, but not of the other three. The results were not necessarily groundbreaking, but the idea of systematically varying things to test these causal effects is the core idea of experimentation.

You can see similar tests happening these days with the aid of technology, which has allowed experiments to explode online, where they’re often referred to as A/B testing. One of the early adopters was Amazon. The company had an internal debate about whether to give you a recommendation for something else to buy when you’re about to check out. At the time, this was controversial. The concern was that if Amazon gave customers a recommendation for another product, they might read about it and then think, “Oh, maybe I should get that too, but I’ll decide later,” and never complete the purchase.

Amazon decided to test this debate experimentally. When new customers came to the site, they were randomly assigned either to get an extra product recommendation or not. What Amazon found was a big increase in sales when it made the extra recommendation.
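The random assignment at the heart of a test like this can be sketched in a few lines. This is a minimal illustration, not a description of Amazon's actual system: the function name `assign_arm`, the arm labels, the experiment name, and the 50/50 split are all assumptions made for the example.

```python
import hashlib

ARMS = ("control", "recommendation")  # hypothetical arm labels


def assign_arm(customer_id: str, experiment: str = "checkout-rec") -> str:
    """Assign a customer to one of two arms (illustrative sketch).

    Hashing the customer ID together with an experiment name gives an
    assignment that is effectively random across customers but stays
    the same for any one customer on repeat visits.
    """
    digest = hashlib.sha256(f"{experiment}:{customer_id}".encode()).hexdigest()
    return ARMS[int(digest, 16) % len(ARMS)]


# Tally assignments for 10,000 made-up customer IDs.
counts = {arm: 0 for arm in ARMS}
for i in range(10_000):
    counts[assign_arm(f"customer-{i}")] += 1
print(counts)  # the two arms should be roughly equal in size
```

Because which arm a customer lands in is unrelated to anything about the customer, a difference in sales between the arms can be attributed to the recommendation itself rather than to who happened to see it.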

Amazon learned from this experiment not just the answer to one question about recommendations. The company took away something much more profound: it should use experimentation whenever feasible to make these decisions, rather than relying on a manager’s judgment or on a roomful of people debating. 

Today, lots of other companies experiment all the time. Capital One supposedly does 80,000 small experiments in a typical year. Eight to 10 years ago, Google was doing 40,000–70,000 experiments a year. Now, we couldn’t even quantify how many experiments it completes, because it has moved to a continuous-experimentation mode. If you’ve been online in the past 24 hours, you probably have been a participant in one of these experiments.

That said, the use of experimentation is unevenly distributed. Companies develop certain ways of making decisions that they trust. If you didn’t come up in an experimental culture, it’s hard to change.

In businesses that don’t treat experimentation as a core value, how are people making decisions, and how do those methods stack up to experimentation?

Testing decision-making processes

In a case study, I worked with Indranil Goswami, a graduate of Chicago Booth’s PhD Program, who’s now at the University at Buffalo, to compare sources of information for decision-making in the context of fundraising. Specifically, we looked at matching offers. You’ve probably received an appeal in the mail that says, “This is a great time to donate, because for every dollar you give, we have a sponsor who’s also going to give a dollar.” 

For fundraisers who plan to use this kind of matching appeal, there are a lot of decisions to make in the wording. What is the basis for these decisions? We can think about precedent: we did this last year and didn’t get a lot of angry letters, so let’s just do it again. For a more sophisticated approach, we can look to expert intuition. From their experience trying different things over time, professional fundraisers may be learning a lot about donor behavior and feedback. 

We also can think about this as a marketing research problem. We could show our proposed fundraising appeal to people in a focus group or do an online survey. Finally, there may be economic models that we can use to predict what would happen. 

There have been a number of studies testing matching offers. Some of them—such as a 2007 study by Northwestern’s Dean Karlan and University of Chicago’s John A. List—suggested that it helps and more funds are raised. But other studies in other settings found no difference, and a few studies found a slightly lower amount of funds raised when matching was used. 

Researchers have proposed that other factors may influence whether the match is successful. The match might be seen as a quality signal. If a sponsor is willing to support this organization and match donations, that sponsor must take the organization seriously.

There’s also speculation, though not a lot of evidence, that matching is a social cue. We do some things because we like doing them, but we may also do other things because we like being part of a group of people doing the same thing. 

There also could be negative responses among potential donors to a matching appeal. Donors might think that if an organization can get a matching sponsor, it must have lots of ways of getting money, and they should instead give to an organization that needs the money more. 

It’s awkward to ask people for money, so fundraisers also worry about whether an appeal is going to sound coercive or offend the recipients in some way.

There could be two additional factors at work. One is a substitution effect, particularly for repeat donors. A donor might think: last time, I gave $40, and you got $40. This time, I could give $20, and you’d still get $40. It’s like a half-off sale: the donor could spend half the money, make the same impact, and pocket the difference.

We also could think about the match as a negative quality or norm signal: if an organization has to resort to matching, maybe its regular fundraising wasn’t going that well, and others weren’t giving.

This is a complex situation in which to make a decision. All of these interpretations and motives seem plausible, and that makes it hard for the person designing charity appeals to make a confident prediction. 

Changing the frame

In our research, we partnered with the Hyde Park Art Center in Chicago for its 75th-anniversary fundraising drive to come up with different ways to implement a matching campaign. The first change we designed was a framing manipulation: if we’re worried that a matching appeal might be coercive, we could reframe it in nicer terms. There’s something a little strange about a matching appeal: it could feel like a wealthy sponsor saying, “I’m only going to donate if you take money out of your own wallet.”

We reframed this as “Let me help you donate more.” Instead of saying that for every dollar you donate, the sponsor will donate a dollar as well, the alternative “giving credit” appeal said that for every dollar you contribute, the sponsor will add a dollar to your donation, helping you give more. 

The second proposed idea was a threshold match: if we’re worried about the substitution effect, why not have the match kick in only above a certain point? In this version, the appeal communicated to repeat donors that anything you give above what you donated last time will be matched.

As much as possible, we kept the rest of the wording the same, because we were trying to isolate the effects of these different strategies.

To test the sources of information that decision makers might typically rely on, we looked at published academic research, we surveyed professional fundraisers, and we did a simple, cheap market-research study. We compared these sources with the ground truth of our field experiment with the Hyde Park Art Center, testing these different appeals.

In terms of economic models, we can think about the utility you get from donating money to a charity as having three parts. The first part is the utility you have from everything else in your life that costs money. The more you donate, the lower that gets.

The second part is what’s called pure altruism. I get utility from the fact that a good organization has the funds to run its programs, and every additional dollar it receives, whether from me or from someone else, gives me utility.

The third part is what’s called “warm glow.” I feel good about myself for being a donor, and if I give more, I get to feel even better about myself. And if I don’t give anything, even if the organization has all the money it needs, I’m missing out on this warm glow. 

Analyzing these three sources of utility in the standard models of altruism tells us that you’re going to donate enough that the incremental dollar you give, in terms of the combination of pure altruism and warm glow, is equal to the value you would have gotten from the incremental dollar you would have instead spent on something else. If this model of utility is accurate, we can predict that appeals with a match will be more effective than not having a match. 

What does the model predict about the effectiveness of our alternative appeals? We hope the giving-credit framing will make it so that the donor is giving $20 but gets to feel good about $40. If people actually interpret it this way, the model makes an unambiguous prediction: the giving-credit match will raise more funds than the regular match. 
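The model described above can be written compactly. What follows is a sketch under assumed notation, not necessarily the exact specification used in the research: all of the symbols are introduced here for illustration.

```latex
% Assumed notation (not from the original lecture):
%   w        donor's budget
%   d        donor's own gift
%   G        funds the charity receives from everyone else
%   m        match rate (m = 0: no match, m = 1: dollar-for-dollar)
%   u, a, g  increasing, concave utility terms for other spending,
%            pure altruism, and warm glow
\begin{align*}
U(d) &= u(w - d) + a\bigl(G + (1+m)\,d\bigr) + g(d)
  && \text{standard match} \\
u'(w - d^{*}) &= (1+m)\,a'\bigl(G + (1+m)\,d^{*}\bigr) + g'(d^{*})
  && \text{optimal gift } d^{*} \\
U_{\text{credit}}(d) &= u(w - d) + a\bigl(G + (1+m)\,d\bigr) + g\bigl((1+m)\,d\bigr)
  && \text{giving-credit framing}
\end{align*}
```

With m = 0, the second line is the no-match condition: the last dollar given is worth as much, in pure altruism plus warm glow, as the last dollar spent on something else. If the marginal altruism term changes little over the range of one donor's gift, a positive m raises the right-hand side and pushes the optimal gift up. The giving-credit version also scales the warm-glow term by the match, which is the channel behind the prediction that it should raise more than the regular match.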

The threshold match is more complicated. The model doesn’t give us a clear prediction. If you would give at least your prior donation regardless, the threshold match will motivate you to give more. But if you’re not going to give as much as last time, the threshold match provides no matching funds, compared with a full match in the standard appeal, and so it will do worse.
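The mechanics of the two match rules are easy to check with a little arithmetic. The sketch below assumes a dollar-for-dollar match and uses a hypothetical helper, `charity_receives`; the scheme labels are illustrative, and the $40 prior gift is borrowed from the repeat-donor example earlier.

```python
def charity_receives(gift: float, prior_gift: float, scheme: str) -> float:
    """Total reaching the charity from one donor under a 1:1 match.

    scheme is one of "none", "standard", or "threshold" (labels assumed
    for this example).
    """
    if scheme == "none":
        matched = 0.0
    elif scheme == "standard":
        matched = gift  # every dollar is matched
    elif scheme == "threshold":
        matched = max(0.0, gift - prior_gift)  # only dollars above the last gift
    else:
        raise ValueError(f"unknown scheme: {scheme!r}")
    return gift + matched


prior = 40.0  # what the donor gave last time
for gift in (20.0, 40.0, 60.0):
    row = {s: charity_receives(gift, prior, s) for s in ("none", "standard", "threshold")}
    print(f"gift ${gift:.0f}: {row}")
```

A repeat donor who cuts a $40 gift to $20 still delivers $40 under the standard match but only $20 under the threshold match. A donor who raises the gift to $60 delivers $120 and $80, respectively. The threshold rule removes the half-off option but offers nothing to anyone giving less than last time, which is why the model's prediction is ambiguous.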

Expert guidance is uncertain

We surveyed professional fundraisers with an average of 10 years of experience. We showed them five versions of the appeal: no match at all, the regular match, reframing the match as giving credit, the threshold match with regular wording, and the threshold match reframed as giving credit.

For each question, we asked the fundraisers to consider two of these five versions and say which would be more effective for participation and for average contribution. Overwhelmingly, the professional fundraisers said including a match would be more effective than not having one.

The fundraisers didn’t have much direct experience with the giving-credit framing. But a large majority said it probably would be better both for participation rates and for average contributions among those who sent in a donation.

The fundraisers were split on whether the threshold match would work well. It could help to deal with the substitution effect, or maybe it would instead demotivate or confuse people.

The last question we posed: If we do the threshold match, should we do the giving-credit framing or the standard framing? The professional fundraisers thought that whether or not we were doing the threshold match, the giving-credit framing was a good idea.

In a second survey with fundraisers, we focused only on comparing the giving-credit framing to the standard appeal. We first showed them one appeal and asked them to evaluate that one version, and then we showed them the other appeal and asked them to evaluate that one on its own. 

What we found is that when we first showed fundraisers the standard appeal and then the giving-credit framing, they said giving credit was going to be more effective. But if we started off describing the giving-credit appeal and then described the standard appeal, they rated the two pretty similarly.

This is strange if we think that professional fundraisers are drawing on their wealth of experience and their mental model of the donor. Their responses shouldn’t be sensitive to uninformative factors, such as the order of the comparison. Throughout behavioral science, we see that people’s judgments change with these kinds of factors when they are making up their minds on the spot, rather than relying on preexisting knowledge. This might shake our confidence in the fundraisers.

The last potential source of information is a market-research study. We designed an online survey that a charity could implement cheaply and quickly. First, we had respondents choose their favorite charity from a list of 20. We told participants that five people were going to be chosen to win $20, but they had to decide in advance, if they won, how much they wanted to give to their selected charity. Then we randomly assigned them to one of the five appeals.

In the results, there was no strong evidence of any significant differences. The full match plus giving credit did a little better, but it wasn’t a statistically significant difference compared with the other appeals.

So what actually happened when we ran the field study to test the effect of the appeals? We sent out 1,500 mailers. The donation rate was about 5 percent, and the median donation was $100. In the market-research study, the donation rate was instead 75 percent! That’s a clear warning sign that the market-research survey might not have accurately captured the thinking of potential donors.

As it turns out, the giving-credit framing with the threshold match—both of our brilliant ideas combined—actually reduced participation. The threshold match didn’t seem to make a big difference, but the giving-credit framing significantly decreased participation. Overall, there was a negative net effect on how much money was raised, on average, per appeal sent out. I felt very bad that we cost the Hyde Park Art Center money, but our intuitions were no worse than the other sources of information they could have consulted.

We ran the experiment again, in a second fundraising campaign, this time testing only the standard match versus the giving-credit framing. We sent out 3,000 mailers, and 3 percent donated.

The first experiment wasn’t a fluke. The second time, we didn’t see a difference in average contribution, but we saw a huge effect on participation. The giving-credit framing basically cut participation in half, and as a result, net donations were half of what was raised by the standard-match appeal. In fundraising, that’s a massive effect.
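To see why a gap of this size is hard to dismiss as noise, a comparison of two participation rates can be sketched with a standard pooled two-proportion z-test. This is an illustration, not the analysis from the study. The counts below are hypothetical: they assume the 3,000 mailers were split evenly and pick donor numbers consistent with roughly 3 percent participation overall and about half as many donors in one arm as in the other.

```python
import math


def two_proportion_z(donors_a: int, n_a: int, donors_b: int, n_b: int):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p_a, p_b = donors_a / n_a, donors_b / n_b
    pooled = (donors_a + donors_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail area
    return z, p_value


# HYPOTHETICAL counts, chosen for illustration only (not the study's data):
# arm A = standard match, arm B = giving-credit framing.
z, p = two_proportion_z(donors_a=60, n_a=1500, donors_b=30, n_b=1500)
print(f"z = {z:.2f}, two-sided p = {p:.4f}")
```

With these illustrative counts, z comes out at about 3.2, well beyond the conventional 1.96 cutoff. Even at a participation rate of a few percent, a halving is detectable with a few thousand mailers.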

So, the result of all this research is not only confirmation that the giving-credit framing is a bad idea, but that it’s a bad idea that we couldn’t really have predicted with the kinds of information available to fundraisers.

No substitute for experiments

You might argue that there’s something unique about fundraising when a field experiment shows terrible results for an idea that experts predicted would be successful. But there are examples in other contexts that are quite consistent. 

In education, there’s a study looking at an intervention to have parents receive alerts on their phones when their kids miss school or don’t hand in homework. Researchers polled experts for their predictions about which of four strategies would be most effective in getting parents to sign up for the alerts, but few of the experts actually identified the right one. 

In health care, there’s been almost a consensus that “hot spotting” is a good idea. The philosophy is that identifying the subset of patients who are responsible for a huge proportion of medical costs and treating them more intensively the first time they show up to the hospital will reduce costs overall. 

The basis of this idea came primarily from observational data. But an experiment just published in the New England Journal of Medicine, which randomly assigned high-risk patients either to a hot-spotting intervention or to regular treatment, found absolutely no difference in hospital readmission rates.

I’m not saying we should make all decisions on the basis of some field experiment. In fact, sometimes that may not be enough. There’s a lot of evidence showing that field experiments done in one context at one time with one population vary in how well they generalize to other settings. What I’m arguing for is, when possible, to conduct in-context field experiments.

This leaves us with the probably disappointing advice that I give to people in industry. When they ask me about the big new ideas from academics that they can implement tomorrow, I typically have to say that I don’t know their business well enough to make a confident recommendation about what would work in their context. What I can give them are some good ideas for experiments and advice on how to conduct them.

Oleg Urminsky is professor of marketing at Chicago Booth, and teaches the Experimental Marketing course on how to use experimental methods to make business decisions. This essay is adapted from a lecture given at Booth’s Kilts Center for Marketing in February 2020.