The American Enterprise Institute’s statistical fumbles: Misusing linear regression for short-term COVID-19 trends

Disclaimer: I am not a medical professional, an expert in statistics or epidemiology, or a former FDA commissioner.

Since the widespread outbreak of COVID-19 in the United States, certain conservative-leaning organizations and institutions – including federal and state governments – seem to have gotten into a bit of a tussle with the fields of statistics and data analytics generally. These entities will typically express agreement that the spread of the pandemic must be brought under control as a precondition to restoring former levels of economic activity, employment, school attendance, and every other aspect of our lives that has been upended. But whether out of a differing arrangement of priorities and values in encouraging economic recovery versus protecting human lives, an especially strong desire to return to pre-pandemic normalcy, or simply the need to appear to be doing something about all this, certain groups and individuals are indeed content to settle for the mere appearance of improvement.

Several states as well as the CDC have been improperly combining statistics from coronavirus viral tests, which detect active infection, and antibody tests, which detect past exposure to coronavirus and can still give a negative result even in those with an active infection. Because viral tests are typically given only to those with symptoms that may be caused by COVID-19 (a group more likely to receive a positive result), while antibody tests are given to the general population (a group less likely to receive a positive result), combining these statistics creates the false impression of both an expanding capacity to test for new and active cases, as well as a lower rate of positive test results. Those are desirable goals, but the choice to combine these two very different measures wrongly portrays greater progress toward those goals than has actually been attained.

Meanwhile, the state of Georgia presented its COVID-19 cases as steadily declining prior to initiating its reopening on April 29. However, there are substantial lag times between infection, the onset of symptoms, the confirmation of a case via viral testing, the reporting of a case, and the inclusion of that case in reported data. By mid-May, updates to the data on confirmed COVID-19 cases in Georgia made that pre-April 29 “decline” disappear – instead, the number of new infections had actually remained steady in the prior 14 days. Georgia’s Department of Health later published a graph, since retracted, which gave the impression of cases continuing to decline over time – because the data had been ordered not by date, but by number of cases in descending order. Commenters at statistician Andrew Gelman’s blog have raised the possibility that this could have been an innocent mistake caused by inept use of SAS dashboards; even if that is the case, the fact that no one in the loop noticed such an egregious error in that asset prior to delivery is inexcusable.
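To see how that ordering produces a spurious decline, here's a minimal sketch in Python with made-up counts (the actual chart was produced in a SAS dashboard, not code like this):

```python
import pandas as pd

# Hypothetical daily case counts for one county: no real trend here
cases = pd.DataFrame({
    "date": pd.date_range("2020-04-26", periods=5),
    "new_cases": [110, 95, 120, 88, 130],
})

# Plotted in chronological order, the series simply bounces around
print(cases)

# Sorted by count in descending order, any series whatsoever appears
# to "decline" steadily from left to right
print(cases.sort_values("new_cases", ascending=False))
```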

In a particularly appalling display of statistical illiteracy, Kevin Hassett of the White House’s Council of Economic Advisers presented a “cubic fit” to forecast COVID-19 deaths, a model which projected that there would be no deaths in the United States by May 15 (there turned out to be 1,286 confirmed deaths in the United States on that date). While this was claimed to “summarize COVID-19’s observed trajectory”, Hassett later revealed that he had simply used a feature of Microsoft Excel to draw a cubic polynomial fitted to the data – there was no reason to regard this as a useful forecast based on any actual modeling at all. Others have analyzed the behavior of a cubic polynomial as applied to that data, revealing that the graph may have been truncated after the month of August in order to avoid showing that the cubic “model”, as an odd-degree polynomial, would thereafter forecast the number of deaths growing to infinity.
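The same exercise is easy to reproduce outside of Excel. Here's a minimal sketch using numpy's polynomial fitting on made-up daily death counts; the point is only that an unconstrained cubic, extrapolated past the data it was fit to, swings off toward positive or negative infinity regardless of what the epidemic actually does:

```python
import numpy as np

# Made-up daily deaths over 60 days: a rise followed by a slow decline
days = np.arange(60)
deaths = 2000 * np.exp(-((days - 25) / 18.0) ** 2) + 200

# Fit a cubic polynomial, much as Excel's trendline feature does
coeffs = np.polyfit(days, deaths, deg=3)
cubic = np.poly1d(coeffs)

# Within the fitted range, the curve can look plausible...
print(cubic(30))
# ...but extrapolated a few months out, an odd-degree polynomial is
# guaranteed to diverge, yielding wildly unphysical "forecasts"
print(cubic(150))
```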

Which brings us to the present issue: yet another example of the tendency to assume that just because you can tell Excel to draw a line, this somehow constitutes a meaningful representation of trends in COVID-19 data.

To be clear on what this post is and is not, this is not a positive case being made for an assessment of the pandemic’s spread and its impacts up to this point, or predictions of the path that this will take. This is not in the vein of widely-shared, poorly-sourced Facebook posts from a friend of a friend who’s a doctor offering the inside scoop on what’s “really” going on. This is a negative case against a specific interpretation of the data and its associated claims, intended to offer a narrow criticism explaining why this analysis is inappropriate and misleading.

Scott Gottlieb, commissioner of the FDA from 2017 to 2019, now serves as a fellow at the American Enterprise Institute, a right-wing think tank. On the subjects usually covered here, the AEI offers uncritical reposts of the standard array of anti-trans talking points, ranging from claims of a “transgender war on women” to stoking fears about our use of public restrooms to accusations of “denying biological science”, along with the occasional podcast featuring arch-TERF Meghan Murphy and Christina Hoff “a lovely, girly trans woman” Sommers. Will their analysis of COVID-19 data prove to be any more rigorous?

Gottlieb’s AEI-branded tweets over the past month suggest not. On May 14, Gottlieb posted the following graph while asserting that there is “a sustained decline in #Covid19 deaths nationally, another indication that the U.S. epidemic is slowing”:

@ScottGottliebMD, 14 May 2020

There are several problems with how this chart was created and how certain statistics were chosen to represent this data. First: why choose those particular seven-day blocks of daily data? The x-axis is labeled as “days since peak in deaths”, apparently referring to April 29, when 2,700 confirmed deaths were reported by the cited COVID Tracking Project. Yet that same source reported 2,746 deaths on May 7 – and lag in reporting is not a factor here, as the archived page from May 13 displays the same figure. Why is April 29 described as the “peak” when May 7 had a greater number of deaths, and that information would have been available to the chart’s creators at the time it was posted? And how is it useful or accurate to speak of a “peak” at all while we’re still very much in the middle of an ongoing pandemic that hasn’t been eradicated or even effectively contained? Describing this as a peak amounts to claiming that future daily death figures will never exceed it. There does not yet appear to be a sufficient basis for making such a claim.

Second: Why include those lines? And why include those equations? The lines displayed on each seven-day block are linear regressions of each respective set of seven points – essentially, the unique line that minimizes the sum of the squared vertical distances between each point and the line. Gottlieb’s chart also includes the equations that describe these lines: an input of x days is evaluated, producing a result of y deaths if the trajectory of the line is followed perfectly. The coefficient of x (known as the slope), 9.64286 in the first seven-day set, is multiplied by the number of days since Gottlieb’s Peak, and the second number – referred to as the intercept – is added to that. Such an equation states that for each additional day that passes, daily deaths increase by 9.64286 from a starting point of 1,686.29 at day 0. (The phrase “half your age plus seven” expresses this same general concept.)
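For anyone who wants to replicate this outside of Excel, here's a minimal sketch using scipy. The daily counts below are hypothetical stand-ins, not the COVID Tracking Project figures, but any seven numbers will do:

```python
from scipy import stats

# Hypothetical daily death counts for a seven-day block (x = days 0..6)
days = [0, 1, 2, 3, 4, 5, 6]
deaths = [1720, 1650, 1790, 1600, 1755, 1680, 1740]

fit = stats.linregress(days, deaths)

# The fitted line: y = slope * x + intercept, read exactly like the
# equations printed on Gottlieb's chart
print(f"y = {fit.slope:.5f}x + {fit.intercept:.2f}")
```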

What’s important to understand about this line is that you can tell Excel to draw such a line for any scatterplot of data, even random data. A linear regression will not refuse to compute in the absence of a process generating a genuine trend in reality – it has no way of knowing whether one exists. You could throw seven darts at Gottlieb’s seven-day chart and still calculate an optimal line and its equation. So the fact that such a line can be drawn and its equation calculated does not, by itself, indicate anything.
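This is easy to demonstrate: feed the same procedure pure noise and it will happily hand back a line and an equation anyway. A sketch:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Seven completely random "daily death" values: darts at the chart
days = np.arange(7)
noise = rng.uniform(1500, 2000, size=7)

fit = stats.linregress(days, noise)
# A slope and intercept are always produced, trend or no trend
print(f"y = {fit.slope:.2f}x + {fit.intercept:.2f}")
```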

What would be a useful indicator? The one thing the AEI happened not to select in Excel: another important value generated by the calculation of a linear regression, r-squared. The option can be found in the lower right corner here after adding a trendline chart element:

This chart uses the same data as Gottlieb’s first seven-day set. What happens when you select the r-squared value?

This value, called the coefficient of determination, expresses the proportion of the variation in y that is accounted for by the linear relationship with x. So while it’s possible to generate that line and equation from that set of data, the passage of time only explains about 0.1% of the variation seen in the daily deaths over that period. Presenting this as a time trend massively overstates the role of time in this pattern – indeed, it overstates the presence of any meaningful pattern correlating time and daily deaths at all. The variation in that seven-day set could be 99.9% due to something (or many somethings) other than time.

Can we quantitatively evaluate just how meaningful this hypothesized trend is? Yes, using the widely accepted threshold of statistical significance of p < .05. The square root of r-squared, known as the correlation coefficient r (in this case, r = 0.037), is a value between -1 and 1 indicating how closely the data points group around the calculated line – the closer to -1 or 1, the tighter the grouping; the closer to 0, the more widely scattered. In combination with the number of data points in the set (7), this can be used to calculate the linear regression’s p-value – the likelihood of obtaining a correlation coefficient at least this extreme if there were actually no correlation between x and y at all. For Gottlieb’s first seven-day set, p = 0.9372 – very much not below .05. For his second seven-day set, the correlation appears slightly stronger, with r = -0.531 (the negative sign indicating a decreasing relationship) and r-squared = 0.282 (28.2% of variation accounted for by the calculated linear trend). But even this set has p = 0.2196, not at all significant by the generally accepted definition. It is misleading to present those lines and those equations as if they depict a genuine trend – an actual pattern relating the passage of time and daily deaths over those periods.
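The arithmetic from r and n to a p-value is short enough to sketch directly, using the figures above (r = 0.037 and r = -0.531, with n = 7 points in each set):

```python
import math
from scipy import stats

def regression_p_value(r: float, n: int) -> float:
    """Two-tailed p-value for a correlation coefficient r from n points."""
    df = n - 2
    t = r * math.sqrt(df / (1 - r**2))   # t-statistic for the slope
    return 2 * stats.t.sf(abs(t), df)    # two-tailed tail probability

print(regression_p_value(0.037, 7))   # ~0.937: first seven-day set
print(regression_p_value(-0.531, 7))  # ~0.220: second seven-day set
```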

Third: What happens when you choose even slightly different seven-day blocks of daily death statistics? The “trend” lines, their equations, and their correlation coefficients vary wildly on a day-to-day basis – and a lack of significant correlation clearly isn’t stopping him from presenting these lines and equations as if they tell us anything useful. While the data from May 7 to May 13 present a trend in which deaths are decreasing by 153 daily, shifting that window to cover May 8 to May 14 instead produces an equation indicating an increase of 48 deaths daily. In fact, across the overall span of April 30 to May 13, when we look at the slope of the equation produced for the period defined by each day and its preceding week, 6 of those 14 possible slopes are positive, showing an increase in deaths over time rather than a decrease. Looking at the slope of each weekly window starting with March 1–7, it’s clear that you could easily pick whichever seven-day block you wanted, depending on whether you wanted to claim that deaths were growing or declining.

Source: COVID Tracking Project
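Checking this for yourself takes only a few lines. Here's a sketch of computing the slope for every trailing seven-day window of a daily series; the counts below are placeholders rather than the actual COVID Tracking Project data:

```python
import numpy as np
from scipy import stats

# Placeholder daily death counts; substitute the real series
deaths = np.array([1520, 1710, 1640, 1880, 1750, 1600, 1820,
                   1700, 1760, 1590, 1850, 1680, 1730, 1610])

days = np.arange(7)
for start in range(len(deaths) - 6):
    window = deaths[start:start + 7]
    slope = stats.linregress(days, window).slope
    # The sign of the slope, "growing" vs. "declining", can flip
    # from one window to the next
    print(f"days {start}-{start + 6}: slope = {slope:+.1f}")
```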

This would obviously be ridiculous – and yet it would be perfectly acceptable within the standards Gottlieb has established. To claim there’s a “sustained decline” based on such a choice of figures is unjustified.

What’s especially baffling is that it seems Gottlieb might be aware of the problem here. The slopes of those seven-day sliding windows show a weekly pattern because daily deaths from COVID-19 in the U.S. show a weekly pattern, with some days of the week tending to produce higher counts than others. This means that if you’re limiting yourself to looking at only a week of data, you can show an “increase” or “decrease” in daily deaths depending on which day of the week you start with:

Source: COVID Tracking Project

One way to “smooth” periodic trends such as these is to use, in place of daily counts, a seven-day average – a mean of either the preceding week’s values (a trailing average), or of each day’s value together with the three days before and after it (a centered average). And a seven-day average is exactly what Gottlieb offered in response to those who criticized the other charts for the aforementioned issues:

This chart does show a decrease in daily deaths from April 30 to May 6. So why present another chart with a “trend” line indicating an increase over that same period? In fairness to Gottlieb, rolling weekly averages from April 30 to May 13 do show a statistically significant correlation – a “sustained decline”. Why not just use that, instead of presenting an analysis of small samples with substantial daily fluctuations and no significant trends at all?
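Computing such a trailing seven-day average takes essentially one line with pandas. A minimal sketch, again with placeholder counts:

```python
import pandas as pd

# Placeholder daily death counts; substitute the real series
deaths = pd.Series([1520, 1710, 1640, 1880, 1750, 1600, 1820,
                    1700, 1760, 1590, 1850, 1680, 1730, 1610])

# Mean of each day and the six days preceding it; this smooths out
# the day-of-week reporting cycle
rolling = deaths.rolling(window=7).mean()
print(rolling)
```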

Again: I’m not an expert in these fields. I’m someone who took a couple of college stats courses before transferring to a university, where I’m now pursuing a degree in statistics. But these aren’t expert problems – they’re basic mistakes, and there isn’t a level of expertise one can attain where everything wraps back around and these approaches suddenly become okay. What this does show is that the practice of statistics isn’t just the emergence of objective and unimpeachable results from the dispassionate application of equations to data. Statistical analysis intimately involves human choices at every step from start to finish. Some choices may be made well, some may be made poorly, and some may be deliberately deceptive. (Some may even be unintentionally deceptive – for instance, among the 8,192 equally “valid” choices of which of 13 variables to control for, vitamin E intake is associated with a decreased risk of death in 59% of these models and an increased risk of death in 41%.)

And I know that concerns other than accuracy can creep into those choices – I’ve been there. Performing research and analysis at an online content marketing agency often meant working with an eye toward what the data might permit us to say rather than what it strictly does say, toward what was most current, and toward what could facilitate the most newsworthy storylines. Presenting data from May 13 on May 14, when covering a phenomenon that requires waiting weeks to obtain accurate statistics on today, is a decision born of such influences. But in this case, a client’s campaign placements aren’t what’s at stake – human lives are. A pandemic is not something that responds to wishes, only actions. We have the power to act to change its course, but making those decisions requires a full understanding of the reality we’re acting within, where we’ve been, and where we’re heading. No, reporting on the effects of actions taken a month and a half ago may not be breaking news. And waiting weeks for reliable data may not be exciting.

But it can be lifesaving.

Zinnia Jones: My work focuses on insights to be found across transgender sociology, public health, psychiatry, history of medicine, cognitive science, the social processes of science, transgender feminism, and human rights, taking an analytic approach that intersects these many perspectives and is guided by the lived experiences of transgender people. I live in Orlando with my family, and work mainly in technical writing.