A problem with hospital ratings—and how to fix it

Credit: John Devolle

Brian Wallheimer | Dec 03, 2019

Until recently, Chicago’s Rush University Medical Center boasted the maximum of 5 stars in the Centers for Medicare and Medicaid Services (CMS) hospital rating system. Data used to compute the July 2018 rating indicated that the hospital had improved in many areas, so it came as a shock when hospital administrators got a preview of the new ratings and Rush had dropped to 3 stars, according to Chicago Booth’s Dan Adelman.

Even a hospital that improves in every single metric can experience a rating drop, says Adelman. This indicates a problem with the current CMS system—and he suggests a way to address it.

The CMS rating system organizes hundreds of hospital metrics into seven categories: mortality, safety of care, readmission, patient experience, effectiveness of care, timeliness of care, and efficient use of medical imaging. It then uses what statisticians call a latent variable model, which gives weight to metrics that are statistically correlated but not necessarily indicative of a hospital’s performance.

The latent variable model assumes that in each category, a single, unobservable factor drives the performance measures. If a few metrics in a category are correlated with one another, the model assumes that they are driven by this latent variable and gives them more weight when computing the hospital’s score.
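
The weighting effect described above can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the CMS methodology: it uses the leading eigenvector of a correlation matrix as a stand-in for a fitted latent variable model, and the three metrics and their data are invented for illustration. Two metrics move together and a third is unrelated to them.

```python
import random


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def leading_weights(corr, iterations=200):
    """Leading eigenvector of corr via power iteration, rescaled to sum to 1.

    Assumption: this eigenvector stands in for the loadings of a
    one-factor model; a real latent variable model is fit differently.
    """
    n = len(corr)
    v = [1.0] * n
    for _ in range(iterations):
        w = [sum(corr[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    total = sum(abs(x) for x in v)
    return [abs(x) / total for x in v]


random.seed(0)  # fixed seed so the illustration is repeatable
n_hospitals = 500

# Hypothetical data: metrics A and B share a hidden driver, metric C does not.
latent = [random.gauss(0, 1) for _ in range(n_hospitals)]
metric_a = [f + 0.3 * random.gauss(0, 1) for f in latent]
metric_b = [f + 0.3 * random.gauss(0, 1) for f in latent]
metric_c = [random.gauss(0, 1) for _ in range(n_hospitals)]

metrics = [metric_a, metric_b, metric_c]
corr = [[pearson(x, y) for y in metrics] for x in metrics]
weights = leading_weights(corr)
print([round(w, 2) for w in weights])
```

In this sketch the two correlated metrics absorb nearly all the weight, whether or not they matter most to patients, and the uncorrelated metric gets very little.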

For example, a hospital’s Patient Safety and Adverse Events Composite, known as PSI-90, takes into account a number of factors including hospital mistakes, patient falls, and infection rates. Until recently, the PSI-90 had been given the most weight in performance measures, but thanks to stronger correlations in the data used to calculate the July 2018 ratings, a new factor was given more weight: complications from knee and hip surgeries. 

The problem is that these surgeries affect far fewer patients and might not be applicable to all hospitals, yet the knee and hip surgeries became a big factor by which all hospital systems were rated.

Hospitals view their individual ratings before the CMS releases the information to the public, and hospital uproar over the new results caused the CMS to delay publication until February 2019. The CMS also modified the ratings so that PSI-90 dominates again. Rush was bumped from 3 stars to 4.

Adelman argues that rating shifts caused by small changes in correlations result in “knife-edge” instability that renders the evaluation system meaningless for patients who might rely on it when choosing a facility for their care. Hospitals, which use the ratings to negotiate payments with insurance companies, cannot determine where to focus their improvement efforts. The ratings also affect a hospital’s reputation, which in turn affects patient volume and payor mix (an industry term that refers to the distribution of more-profitable patients, who use private insurance, and less-profitable ones, on public insurance). And when patients are attracted to hospitals that rate higher but have worse outcomes, the overall health of people in an area suffers.

“It’s like developing a grading scheme for school,” Adelman says. “The teacher gives the grading scale out at the beginning of the semester and tells everyone the weights for attendance, quizzes, papers, and tests. But this is like going through the semester and then telling everyone where the weight is at the end based on how the students perform. And every semester that might change.”

A benefit of the CMS rating model, says Adelman, is that it doesn’t require anyone at the CMS or its affiliates to manually determine the weight of each metric, a step that could introduce bias and opinion. Rather, the model chooses how to weight each metric, and additional metrics are easily integrated.

Adelman argues that the same could be accomplished with a model he has created that relies not on correlation but on patient representation and the measurement of hospitals against best performers. In his model, each hospital gets its own unique weights. A measure that affects more people is given more weight.
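
The idea that a measure affecting more people carries more weight can be sketched in a few lines. This is only an illustration under simplifying assumptions: weights are made strictly proportional to patient counts, which is simpler than Adelman's hospital-specific weights, and the measure names, patient counts, and scores are hypothetical.

```python
def patient_weights(patients_by_measure):
    """Weight each measure by its share of affected patients (an assumption)."""
    total = sum(patients_by_measure.values())
    return {m: n / total for m, n in patients_by_measure.items()}


def weighted_score(scores, weights):
    """Combine per-measure scores (0 to 1, higher is better) into one score."""
    return sum(weights[m] * scores[m] for m in scores)


# Hypothetical hospital: a broad safety measure and a narrow surgical measure.
patients = {"psi_90": 12000, "hip_knee_complications": 200}
scores = {"psi_90": 0.60, "hip_knee_complications": 0.90}

pw = patient_weights(patients)
overall = weighted_score(scores, pw)
print(pw, round(overall, 3))
```

Under this scheme the narrow surgical measure cannot dominate the overall score, because few of the hospital's patients are affected by it.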

To weight particular measures for Hospital A, Adelman’s model compares it to other hospitals that are more efficient and better performing along key dimensions. These hospitals are combined to create a “virtual hospital” that sits between Hospital A and an ideal hospital that achieves the maximum performance along every measure. The virtual hospital thus dominates Hospital A. The idea is reminiscent of portfolio optimization in finance, in which investors seek to maximize a portfolio’s return by taking combinations of investments on the efficient frontier, the set of portfolios that offer the highest expected return for each level of risk. Rather than combine investments that are measured by risk and return, Adelman’s model combines hospitals that perform most efficiently on the basis of factors such as mortality and readmissions. The model then finds the measure weights under which Hospital A’s score comes as close as possible to the virtual hospital’s score, computed with the same weights, while maximizing Hospital A’s score.
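
The virtual hospital can be pictured with a toy example. The sketch below is not Adelman's optimization. It assumes just two measures scaled from 0 to 1 (higher is better), two hypothetical peer hospitals, and a simple grid search in place of a proper linear program. It looks for the blend of peers that improves on Hospital A by the largest margin on its weakest measure.

```python
def combine(peers, shares):
    """Blend peer hospitals measure by measure using the given shares."""
    n_measures = len(peers[0])
    return tuple(
        sum(share * peer[k] for peer, share in zip(peers, shares))
        for k in range(n_measures)
    )


def min_gain(candidate, hospital):
    """Smallest improvement of the candidate over the hospital on any measure."""
    return min(c - h for c, h in zip(candidate, hospital))


# Hypothetical scores: (mortality measure, readmission measure).
hospital_a = (0.60, 0.70)
peer_b = (0.90, 0.65)  # strong on mortality, slightly worse on readmissions
peer_c = (0.55, 0.95)  # strong on readmissions, slightly worse on mortality

best_share, best_gain, virtual = None, float("-inf"), None
for step in range(101):
    share = step / 100  # share of peer B in the blend
    candidate = combine([peer_b, peer_c], [share, 1 - share])
    gain = min_gain(candidate, hospital_a)
    if gain > best_gain:
        best_share, best_gain, virtual = share, gain, candidate

# The blend dominates Hospital A if it is better on every measure.
dominates = all(v > a for v, a in zip(virtual, hospital_a))
print(best_share, virtual, dominates)
```

Neither peer beats Hospital A on both measures by itself, but a blend of the two does, which is the sense in which a virtual hospital built from better performers can dominate the hospital being rated.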

This model eliminates the possibility that a hospital would receive a lower rating even if it improves in all metrics, according to Adelman. “Measure weights obey desirable structural properties under reasonable conditions, including that scores improve when hospitals improve, and that better-performing hospitals score higher,” he says.

Adelman warns that his model is designed to combat the problems with the latent variable model but does not address other concerns in hospital rankings—including those related to underlying measures, steps in the methodology, or the ratings system itself.