Hospital ratings are deeply flawed. Can they be fixed?

The most influential rating system rests on some faulty calculations, affecting millions of people and billions of dollars.

Credit: Federico Gastaldi

Brian Wallheimer | Aug 26, 2020

The US Centers for Medicare and Medicaid Services (CMS) regularly releases star-system ratings for US hospitals, and did so in January 2020, on the cusp of the COVID-19 outbreak. The news about ratings did not make a lot of headlines, and was quickly buried by stories about the public-health crisis. 

But the release of the ratings was an important moment for patients looking up which facilities might provide the best care for nonemergency procedures—and for the health-care industry, which has billions of dollars at stake. The pandemic has slashed hospital revenues, and hospitals with high ratings may advertise those heavily to attract non-COVID-19 patients and the dollars they bring with them. Ratings can guide patients’ decisions, shape negotiations between hospitals and insurance companies, and determine how much health-care providers get reimbursed by both the federal government and insurers.

Of the four major hospital rating systems available to patients in the US, the CMS offers perhaps the most influential one. As the health-care industry fights COVID-19 on multiple fronts, it’s worth taking a deep dive into the CMS’s ratings to understand how they affect the success or failure of a hospital. Academic researchers have scrutinized the methodology to identify flaws in the CMS system, and to come up with ideas for modifying the ratings. Their research holds the promise of producing more-accurate measurements that could better inform the decisions of patients, hospitals, insurance companies, and the federal government—and of ultimately improving the hospital system, and public health more generally.

The star system

Whether they’re looking to get a meal at a local restaurant or buy a blender on Amazon, modern shoppers are often guided by the advice of professionals or other customers. When it comes to seeking medical care, the impulse is no different. According to a 2018 Deloitte survey, 39 percent of respondents considered reputation when choosing a doctor, and nearly one-quarter said they had used a quality rating when choosing a doctor or hospital, up 5 percentage points from a 2013 survey. More than half of respondents said they planned to use such ratings in the future. 

The CMS rates a hospital’s quality of care using 51 measures centered around common, serious issues—such as heart failure and pneumonia—among Medicare patients who require hospitalization. It launched a star rating system in 2016 to provide more information about a hospital, including patient experience, complications and deaths, and value of care. Its online patient-focused Hospital Compare tool allows people to compare hospitals on the basis of their overall star ratings. 

US News & World Report rates hospitals but also publishes a few annual rankings, including the best hospitals in the country, the best hospitals in each region, and the best for a few different specialties such as cancer care. “Find the Best Hospital for You” trumpets its website, advertising its guides for patients and caregivers. The Leapfrog Group, an independent health-care watchdog, biannually scores general acute-care hospitals with its Hospital Safety Grade, ranking states on the basis of the percentage of A-rated hospitals they have. Healthgrades, a company that provides information on doctors and other health-care providers, issues an annual list called “America’s Best Hospitals,” highlighting those that it deems to be in the top 5 percent for overall clinical excellence, on the basis of clinical quality outcomes for 32 conditions and procedures. 

Each of these reviewers uses mostly the same data but distinct methodologies, and the results they come up with can be very different. The Johns Hopkins Hospital, for example, was, in 2019, ranked as the No. 3 hospital in the country by US News & World Report and received a top-5-percent designation from Healthgrades. But it received a B in the fall from Leapfrog and only three stars from the CMS.

The rating systems themselves get rated. Researchers from Northwestern Medicine, Sound Physicians, the Council of Medical Specialty Societies, the University of Michigan, Washington University, and University Hospitals did their own analysis and gave their highest grade to US News & World Report, which got a B. The CMS got a C.

But the ratings have a lot of influence, starting with patient decisions. Looking at a decade’s worth of nonemergency Medicare-patient hospital visits, Chicago Booth’s Devin G. Pope finds that the average hospital experiences a 5 percent change in patient volume due to fluctuations in its US News & World Report hospital rankings. He estimates that from 1993 to 2004, this accounted for 15,000 Medicare patients switching from lower-ranked to higher-ranked hospitals for nonemergency care, resulting in more than $750 million changing hands. An improvement in rank for a hospital’s specialty also corresponded with a rise in both the number of nonemergency patients treated and the revenue generated in that specialty. With the explosion in online ratings that has occurred since 2004, would the impact be even bigger now? “It’s possible,” says Pope. “Certainly many people are hoping to make an informed, data-driven decision these days when choosing a hospital for elective care.”

The CMS ratings have a particularly strong influence in the industry, in part because they affect a hospital’s contract negotiations with insurance companies. A better rating can give a hospital more leverage, while a drop can hurt it. A contract includes agreements on the amount insurance companies will pay for certain tests and procedures, which is enormously important to a hospital. Contracts also determine whether a hospital is considered “in network,” which in turn drives patient decisions.

The CMS ratings can also affect Medicare and Medicaid reimbursements, which, according to a 2019 analysis from data provider Definitive Healthcare, can comprise a whopping 30 percent of hospitals’ revenues. The 2010 Patient Protection and Affordable Care Act includes provisions to test value-based care, which determines payments to hospitals on the basis of health-care outcomes. This ties a significant portion of a hospital’s income to CMS formulas that determine how well a hospital is doing.

On top of all that, private insurance companies look to the ratings to determine if they’ll consider similar reimbursement models. If data and algorithms are producing inaccurate ratings, it could jeopardize reimbursements for hospitals.

In the end, a rating can affect a hospital’s success or failure, and this in turn has ramifications for the hospital’s immediate area and extended community. Ratings also have implications for public health. For example, a ratings system that minimizes the impact of certain hospital-acquired infection rates when compiling scores may end up directing patients to less-safe hospitals. Those patients could then become infected and suffer medical consequences.

Problem No. 1: Instability 

Yet, the CMS ratings system has flaws, research finds—and to understand these flaws, you need to unpack the underlying algorithm. The CMS considers mortality, safety, readmission rates, and patient experience, which each account for 22 percent of a hospital’s overall star rating. Three other categories—timeliness, effectiveness of care, and the use of medical imaging—count for 4 percent each. For all of these categories, the CMS takes several other measures into account. The safety-of-care category, for example, includes data on surgical-site infections from colon surgery, and the rate of complications for hip and knee replacements. 
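
To make the category weighting concrete, here is a minimal Python sketch. The weights come from the figures above; the function name, the example hospital, and its standardized scores are hypothetical, and the real CMS calculation involves further steps that are not reproduced here.

```python
# Category weights (percent of the overall rating) as quoted in the article.
CATEGORY_WEIGHTS = {
    "mortality": 22,
    "safety": 22,
    "readmission": 22,
    "patient_experience": 22,
    "timeliness": 4,
    "effectiveness": 4,
    "imaging": 4,
}


def overall_score(category_scores):
    """Return the weighted average of a hospital's category scores.

    Assumption for this sketch: all seven categories have a score.
    """
    missing = set(CATEGORY_WEIGHTS) - set(category_scores)
    if missing:
        raise ValueError(f"missing categories: {sorted(missing)}")
    total_weight = sum(CATEGORY_WEIGHTS.values())  # adds up to 100
    weighted_sum = sum(
        CATEGORY_WEIGHTS[name] * category_scores[name]
        for name in CATEGORY_WEIGHTS
    )
    return weighted_sum / total_weight


# Hypothetical standardized scores (higher is better); not real hospital data.
example_hospital = {
    "mortality": 0.5,
    "safety": -0.25,
    "readmission": 0.0,
    "patient_experience": 1.0,
    "timeliness": 0.5,
    "effectiveness": 0.5,
    "imaging": -0.5,
}

print(overall_score(example_hospital))
```

Because the four heavily weighted categories carry 88 percent of the total, a weak imaging score barely moves the result, while a weak safety score moves it a lot.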

The CMS then takes overall hospital scores and clusters them into five groups corresponding to the assigned number of stars. Of the more than 4,500 hospitals in the January 2020 ratings, about 9 percent have five stars, 25 percent have four stars, 24 percent have three stars, 15 percent have two stars, and 5 percent have one star. The remainder don’t have a star score because they failed to report enough information to be rated, in some cases because they simply didn’t have enough data to report. 
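
The clustering step can be sketched in a few lines of Python. The article does not say which clustering method is used, so one-dimensional k-means serves here purely as an assumed stand-in; the function names and the scores are hypothetical.

```python
def kmeans_1d(values, k=5, max_iterations=100):
    """Cluster one-dimensional scores into k groups; return sorted centers."""
    ordered = sorted(values)
    n = len(ordered)
    if n < k:
        raise ValueError("need at least k values")
    # Start the centers at evenly spaced positions in the sorted scores.
    centers = [ordered[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    for _ in range(max_iterations):
        groups = [[] for _ in range(k)]
        for value in values:
            nearest = min(range(k), key=lambda j: abs(value - centers[j]))
            groups[nearest].append(value)
        # Move each center to the mean of its group (keep it if the group is empty).
        new_centers = [
            sum(group) / len(group) if group else centers[j]
            for j, group in enumerate(groups)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return sorted(centers)


def star_rating(score, centers):
    """Assign 1 to 5 stars according to the nearest cluster center."""
    nearest = min(range(len(centers)), key=lambda j: abs(score - centers[j]))
    return nearest + 1


# Hypothetical overall scores for ten hospitals; not real data.
hypothetical_scores = [-2.0, -1.9, -1.0, -0.9, 0.0, 0.1, 1.0, 1.1, 2.0, 2.1]
centers = kmeans_1d(hypothetical_scores)
for score in hypothetical_scores:
    print(score, star_rating(score, centers))
```

One consequence of any clustering approach is that a hospital's stars depend on where the group boundaries fall, which in turn depends on every other hospital's score.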

All the data are dumped into what statisticians call a latent variable model, which determines a score for each category and gives weight to metrics that are statistically correlated but not necessarily indicative of a hospital’s performance. The CMS has spent tens of millions of dollars developing its latent variable model. It has contracted with the Center for Outcomes Research and Evaluation at Yale, paying $73 million over five years, and with Lantana Consulting Group, for $13.5 million over the same time frame. 

The model assumes that there is a single, important underlying quality trait that cannot be directly measured, and that if a few metrics in a category are correlated, they must be aligned with that latent variable; those metrics are therefore given more weight. The latent variable essentially allows the algorithm, rather than people, to determine which measures are most important.

There is nothing inherently wrong with this approach. The IQ test is a classic example of a latent variable model, measuring and weighting test takers’ scores. However, according to Chicago Booth’s Dan Adelman, the key issue with the latent variable model in the CMS context is that every time the ratings are recalculated, the model may shift weight from one measure in a category to another. This “knife’s-edge instability,” Adelman says, means that a hospital could be rated differently each time the ratings are calculated, even if it has seen few changes or improvements.

This is what happened to Rush University Medical Center in Chicago, which had the maximum five stars until July 2018, when the CMS, in a preview of its new ratings calculation, dropped it to a three-star hospital. A change in the weights under the latent variable model’s safety-of-care metric hurt Rush. Before, the model had put most of the weight on the Patient Safety and Adverse Events Composite, known as PSI 90, which includes a number of factors, such as hospital mistakes, patient falls, and infection rates. But then the weight shifted to complications from knee and hip surgeries, which Adelman argues is less indicative of a hospital’s overall safety record than the wide-ranging PSI-90 score.

This shifting means hospitals are left to the mercy of a test in which the answer key can change without warning. Hospital administrators trying to improve their scores, which presumably reflect the safety of the hospital, are left with no clear guidance on where to focus their improvement efforts. Plus, as Adelman writes, “the fundamental problem is this: even if a hospital improves along every measure relative to all other hospitals, it is still possible for a hospital’s score to decrease.”
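
A toy calculation shows how this can happen. In the Python sketch below, the two measure names echo the Rush example, but every number is invented for illustration, and the weights are simply set by hand rather than estimated by a latent variable model.

```python
def group_score(measure_scores, weights):
    """Weighted sum of standardized measure scores (higher is better)."""
    return sum(weights[name] * measure_scores[name] for name in weights)


# Hypothetical standardized scores for one hospital in two rating periods.
# The hospital improves on both measures between the periods.
scores_before = {"psi_90": 1.0, "hip_knee_complications": -1.0}
scores_after = {"psi_90": 1.1, "hip_knee_complications": -0.9}

# Hypothetical weights: most of the weight moves from one measure to the other.
weights_before = {"psi_90": 0.9, "hip_knee_complications": 0.1}
weights_after = {"psi_90": 0.1, "hip_knee_complications": 0.9}

score_before = group_score(scores_before, weights_before)
score_after = group_score(scores_after, weights_after)

print(score_before)
print(score_after)
```

The hospital does better on each measure in the second period, yet its group score falls, because the weight has moved to the measure on which it is weakest.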

Another issue with the model is distortion involving hospital size. Small hospitals have fewer things to report, and if a hospital doesn’t have many data related to a measure the model weighs heavily, the algorithm gives the hospital an average score. This is good for small, poor-quality hospitals, as their ratings will be pulled toward the mean.

In all, the model confuses hospitals and has the potential to mislead patients. “You move the weights around a little bit, and all of a sudden all the top hospitals change. It’s kind of a funky business,” Adelman says. “If it’s wrong, you have bad hospitals that look good, and good hospitals that look bad. And you’re sending people to the wrong hospitals.”

Adelman proposes a replacement to the latent variable model—one that factors in patient volumes and measures hospitals against best performers. In his efficient frontier model, every hospital would have its own unique set of weights. The more people affected by a particular measure, the more weight is given to that measure.

Learn more about the Efficient Frontier Hospital Rating System

The Efficient Frontier Hospital Rating System scores more than 3,700 individual hospitals using a method that addresses some of the limitations underlying the US Centers for Medicare and Medicaid Services’ Hospital Compare ratings. More information, a detailed example showing how the system works, and the searchable EFHRS database are available via Chicago Booth Review.

Imagine Hospital A. In Adelman’s model, its weights are determined by comparing it with other hospitals that are more efficient and better performing in key dimensions. These hospitals are combined to create a virtual hospital that sits between Hospital A and an ideal hospital that achieves the maximum performance along every measure. The model constructs this virtual hospital by combining hospitals that perform most efficiently on the basis of factors such as mortality and readmissions. It then finds measure weights that score the hospital as close as possible to these efficient hospitals measured under the same weights, while also ensuring that measures impacting more people are weighted more.

In this approach, which essentially provides a stable answer key, hospitals that are improving could still see their ratings drop if national volumes of patients impacted by measures shift dramatically relative to one another. “However, infinitesimal shifts would result in only infinitesimal shifts in hospital scores (a result of math programming sensitivity analysis), not dramatic shifts in the scores and measure weights as we see in the LVM [latent variable model] approach with respect to correlations,” Adelman writes. “Thus, our approach enjoys substantially greater stability properties.”

Problem No. 2: Small-data issues

Adelman’s efficient frontier model addresses only the shortcomings of the latent variable model, while other research suggests that issues with hospital data, risk adjustment, and methodology also affect the accuracy of the CMS rating system. Here, too, the small-data problem at small hospitals creates challenges.

The CMS website offers information about hospital heart-attack mortality rates, but some hospitals deal with fewer heart-attack patients than others, and therefore a small hospital’s rating could be affected by one or two heart-attack deaths. To address this, the CMS adjusts the data, with the goal of making a fairer comparison; the outcome, however, is that small hospitals look much safer than they actually are. 

The problem—say Edward I. George, Paul R. Rosenbaum, and Jeffrey H. Silber of the University of Pennsylvania; Chicago Booth’s Veronika Ročková; and INSEAD’s Ville A. Satopää—is that the model doesn’t take into account hospital characteristics such as volume or the procedures the professionals there can do. In cases in which a hospital has few heart-attack mortality data, the CMS simply estimates it to have a rate that is closer to the national average. In 2007, of all the hospitals rated, large and small, almost 100 percent were classified as “no different than the national rate.” The next year, none was worse than average, and nine were better than average. 

“For any one small hospital, there is not much data to contradict that prediction,” the researchers write. But, they ask, when the CMS model claims that its mortality rate is close to the national average, “is this a discovery or an assumption?”

To find out, the researchers analyzed data from Medicare billing records for 377,615 patients treated for heart attacks at 4,289 hospitals between July 2009 and the end of 2011. This analysis suggests the actual heart-attack-mortality rate is 12 percent at large hospitals and 28 percent at small hospitals. The CMS model adjusts the rate to 13 percent at large hospitals and 23 percent at small ones. It tries to compensate for the lack of data from small hospitals by borrowing information from large ones, Ročková says. “This would only work if the small and large hospitals were comparable in terms of their performance. The data, however, speaks to the contrary.”

It would be more reasonable, the researchers argue, to borrow information from hospitals of similar size. They do this, plus take into account hospital volume (number of patients), nurse-to-bed ratio, and the hospital’s technological capability—particularly its ability to perform percutaneous coronary intervention (PCI), better known as angioplasty, to improve blood flow to the heart.
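
The difference between the two kinds of borrowing can be illustrated with a simple weighted blend. This Python sketch is a deliberately crude stand-in for both the CMS model and the researchers' model: the function name, the prior strength of 50, the 15 percent pooled rate, and the hospital counts are all hypothetical, while the 28 percent small-hospital rate is the figure quoted above.

```python
def shrunken_rate(deaths, patients, reference_rate, prior_strength=50):
    """Blend a hospital's observed mortality rate with a reference rate.

    prior_strength is a hypothetical tuning constant: it acts like that many
    extra patients whose mortality equals the reference rate.
    """
    return (deaths + prior_strength * reference_rate) / (patients + prior_strength)


pooled_rate = 0.15      # hypothetical rate across all hospitals, large and small
small_peer_rate = 0.28  # rate at small hospitals, as quoted in the article

# A hypothetical small hospital: 3 deaths among 10 heart-attack patients.
toward_pooled = shrunken_rate(3, 10, pooled_rate)
toward_peers = shrunken_rate(3, 10, small_peer_rate)

# A hypothetical large hospital: 120 deaths among 1,000 patients.
large_hospital = shrunken_rate(120, 1000, pooled_rate)

print(toward_pooled)
print(toward_peers)
print(large_hospital)
```

With so few patients, the small hospital's estimate is dominated by whichever reference rate it is pulled toward, so borrowing from all hospitals makes it look considerably safer than borrowing from its peers does. The large hospital's estimate hardly moves from its observed 12 percent.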

In the researchers’ proposed expanded model, “hospital characteristics that generally indicate better mortality (say PCI or increased volume) can be utilized to direct patients away from specific hospitals that do not perform PCI and have small volume,” they write. “If patients instead utilized the HC [Hospital Compare tool in the CMS] model, which does not include hospital characteristics, they would not be directed away from these hospitals. While there may be some small hospitals with excellent outcomes despite not performing PCI, the vast majority of such hospitals perform worse than those larger hospitals that do perform PCI.”

Problem No. 3: The underlying data

Thus, research suggests at least two problems with how the CMS ratings are compiled, and another research project indicates there are some issues with the data that are fed into the ratings. Analysis Group’s Christopher Ody, Chicago Booth PhD candidate Lucy Msall, and Harvard’s Leemore S. Dafny, David C. Grabowski, and David M. Cutler highlight an issue with readmissions data, one of the seven categories used to inform the algorithm behind the CMS star system.

The researchers’ study isn’t about hospital ratings. Rather, it looks at another program administered by the CMS, the Affordable Care Act’s Hospital Readmissions Reduction Program (HRRP). Using a value-based care approach, the ACA contains rules that penalize hospitals with a higher-than-expected 30-day readmission rate, premised on the idea that hospitals could do a better job of avoiding readmissions. 

Prior to the HRRP’s implementation, in October 2012, the government reimbursed hospitals for Medicare-covered patients on the basis of the kind of care provided. But once the HRRP went into effect, hospitals with high readmissions for heart attacks, heart failure, and pneumonia were docked 1 percent of reimbursements. This increased annually, until it reached 3 percent in 2015. Findings from several studies suggest that the plan has worked, noting that readmission rates declined not only for the targeted conditions, but for others as well.

However, Ody, Msall, Dafny, Grabowski, and Cutler probe this conclusion by looking at what goes into the readmission rates, which are risk adjusted to account for the incoming health of a patient. The sicker a patient is upon her first hospital admission, the greater the likelihood she will be readmitted. In an attempt to be fair, and not have sick patients hurt hospitals’ ratings or readmission statistics, the CMS considers patient data on age, sex, and comorbidities (the simultaneous presence of two or more chronic problems) from diagnoses in the year before hospitalization.

A patient arriving at a hospital may have a severe cough, high blood pressure, diabetes, and other medical issues. Hospital staff can note these health issues, and others, on the patient’s chart. When it comes time to send the information to the CMS, as part of submitting Medicare claims, staff electronically submit codes that indicate symptoms or illnesses. The CMS uses these codes to make its risk adjustments. 

About the same time that the ACA’s program went into effect, the CMS made a change to these electronic-transaction standards that hospitals use to submit Medicare claims, the researchers point out. Prior to the readmissions penalty program, hospitals could include a maximum of 10 patient-diagnosis codes in their submissions. Even if the patient had dozens of other symptoms or illnesses, the hospital staff could electronically add no more than 10 codes.

But coincidentally, just as the HRRP began, the CMS changed the rules and allowed for up to 25 diagnosis codes, which helped doctors paint a more accurate picture of a patient’s health. “We document that around January 2011, the share of inpatient claims with nine or 10 diagnoses plummeted and the share with 11 or more rose sharply,” the researchers write. Prior to the rule change, more than 80 percent of submissions had nine or 10 diagnosis codes. After the change, 15 percent had nine or 10 codes, while 70 percent of submissions had 11 or more. There was little change in the number of submissions with eight or fewer codes. Rather, doctors included more codes and better indicated all the health issues patients presented.

The CMS didn’t take this into account when evaluating the effect of the HRRP—and the diagnosis-code change may account for about half of the supposed progress made by hospitals in reducing readmissions, the researchers write. The additional codes helped show that many patients were sicker than they would have looked previously. And while about half of hospitals’ overall decline in readmissions may have been due to hospitals doing a better job, the other half resulted from both recording more accurate data and recognizing the health of incoming patients, the researchers conclude. 

They note that the program may have unfairly penalized certain hospitals, including ones that treat poorer and less-healthy patients, who are readmitted more frequently. Say two people, one affluent and one poor, both had heart attacks and went to two different hospitals. Doctors at each hospital would have entered no more than 10 data points indicating what was wrong with their patients, so that in the system, the patients looked similar. In fact, though, there was more wrong with the poor patient, who was more likely to be readmitted. 
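
The effect of the cap can be shown with a short Python sketch. The code strings are placeholders rather than real diagnosis codes, the two patients are invented, and counting codes is only a rough proxy for the CMS's actual risk adjustment.

```python
def submitted_codes(diagnoses, max_codes):
    """Return the diagnosis codes that fit within the submission limit."""
    return diagnoses[:max_codes]


def recorded_condition_count(diagnoses, max_codes):
    """Count the conditions visible to the payer after truncation."""
    return len(submitted_codes(diagnoses, max_codes))


# Placeholder codes for two hypothetical heart-attack patients.
affluent_patient = [f"DX{i:02d}" for i in range(1, 11)]  # 10 conditions
poorer_patient = [f"DX{i:02d}" for i in range(1, 19)]    # 18 conditions

for cap in (10, 25):
    print(
        cap,
        recorded_condition_count(affluent_patient, cap),
        recorded_condition_count(poorer_patient, cap),
    )
```

Under a 10-code limit both patients appear to have 10 conditions, so any adjustment based on the submitted codes treats them alike. Under a 25-code limit the sicker patient's extra conditions become visible.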

“Pay-for-performance schemes expose participants to the risk of unstable funding, in ways that may seem unfair or contrary to other social goals,” the researchers write. “In the case of the HRRP, the program was found to have initially penalized hospitals that cared predominantly for patients of low socioeconomic status—hospitals that are more likely to be safety-net providers already operating on tight budgets.”

The system change addressed this problem, in part—but it remains an issue, as hospital staff are still limited as to how many codes they can input, even if the limit is higher than it was before. Even as hospitals serving a poorer, sicker population submit more data on the health of their patients, they are still more likely to suffer from high readmission rates and be penalized, the researchers say. And the incoming health of patients doesn’t necessarily reflect the quality of hospital care. 

Readmission rates may similarly affect the number of stars a hospital receives from the CMS, and they also illustrate how incomplete data and analysis can skew ratings. Patients comparing hospitals on the CMS website see “unplanned hospital visits” as one of the seven categories they can use for evaluation. Under that, hospitals are scored for readmissions for heart issues, pneumonia, hip and knee replacements, colonoscopies, and more. These data affect not only hospital reimbursement but also how patients and insurance companies view a hospital.

Changing the system?

The CMS generally releases hospital ratings twice a year. When it issued ratings in February 2019, however, 15 months after the previous ones, it announced that it would be taking public stakeholder comments on potential changes to the rating system, an indication that there could be a chance to correct some of the problems in the methodology. 

The CMS’s announcement suggested the latent variable model could be on the chopping block, potentially to be replaced with “an explicit approach (such as an average of measure scores) to group score calculation.” Other potential changes included assigning hospitals to peer groups, modifying the frequency of ratings releases, and developing a tool that would allow users to modify ratings according to their preferred measures. 

But there’s no guarantee that the latent variable model will be scrapped or significantly changed. As for the other problems researchers have identified, Ročková, for one, says that although representatives of the CMS have shown positive interest in her team’s proposed model, it has not yet been incorporated into the agency’s current recommendation system.

Adelman argues that there should be a moratorium on all hospital ratings during the pandemic. Even poorly rated hospitals are full of medical staff—many demoralized by equipment shortages, furloughs, and pay cuts—working tirelessly and risking their own lives to save the lives of others. Because of this, and because there are no measures related to COVID-19 responsiveness or preparedness, publicly rating hospitals at this time is not appropriate, Adelman says.

The CMS, through a spokesperson, says that it will go through “appropriate rulemaking” for any changes to the star-rating methodology.

“The agency, with its vast network of partners in health-care delivery and on behalf of people with Medicare benefits, patients, and their families, most certainly celebrates and appreciates the amazing work that medical staff (and many others) have been doing,” reads a statement from the CMS, adding that it “has responded by offering unprecedented waivers and flexibilities to remove barriers, expand telehealth, and allow all providers, and especially hospitals, to focus on patient care.”

The CMS is assessing how COVID-19 has impacted data reporting, according to the agency. But for now, the current ratings stand. In spite of its flaws, for the foreseeable future, the CMS rating system will continue to drive patient decisions, shape hospital budgets, and influence public policy.