Visualizing the Covid-19 Pandemic

I’ve held off on creating a visualization of the pandemic so far because I think things have been covered well by others. But there are still a few visualizations that I’d like to see, so I decided to give it a try. When I look at a Covid-19 visualization, I generally want to answer a few questions:

  • How bad is it? Total numbers are needed here to show the human toll.
  • Is it getting worse? Time series data is helpful here because it puts current numbers in context. Ideally the y-axis shows new cases rather than total cases, so increases are easy to spot.
  • How effective is our response? Per capita numbers are useful here because they allow us to see how well governments are responding relative to each other.
  • How risky are things for me? Per capita numbers are also helpful here because they’re proportional to individual risk. But individual risk can also be calculated using a model, which I do in my next post. Either way, the numbers have to be at a high enough resolution (e.g. state or county) to matter for individuals.

So I created a few visualizations to try to answer these questions. And just to get this out of the way at the start, I want to stress that I am not an epidemiologist. But for the most part I’m communicating existing information rather than creating it, so this shouldn’t be too much of a problem.

Data Sources

For the international time series, I’m using the data compiled by Our World In Data. The state-level results in the US are from The Covid Tracking Project. The county-level results are courtesy of the New York Times county dataset. These visualizations wouldn’t be possible without the tireless work of these groups, and I appreciate their effort.

Of course, these sources aren’t perfect and some leaders have decided to bury their heads in the sand rather than mount an effective response. There’s not a lot that can be done about the suppression of data, so I’ll just have to live with the results I have for now. All of my code for this post is available here.

Population-Adjusted Time Series

The first few visualizations are time series plots showing new cases per million per day on the y-axis. The bubble size and color represent the total deaths for each place. I think this does a good job of communicating both the current state of things and the cumulative toll.
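As a rough illustration of the y-axis quantity, here is a minimal sketch of turning a cumulative case series into new cases per million per day. It assumes the source reports cumulative totals; the function name, the clamping of downward revisions to zero, and the example numbers are all made up for illustration and are not taken from the code behind these plots.

```python
def new_cases_per_million(cumulative_cases, population):
    """Return daily new cases per million people from a cumulative series.

    The result has one fewer entry than the input, because the first day
    has no earlier total to compare against.
    """
    if population <= 0:
        raise ValueError("population must be positive")
    rates = []
    for previous, current in zip(cumulative_cases, cumulative_cases[1:]):
        # Assumption: a downward revision in the source counts as zero new cases.
        new_cases = max(current - previous, 0)
        rates.append(new_cases * 1_000_000 / population)
    return rates


# Made-up numbers for a hypothetical region of 500,000 people.
example_rates = new_cases_per_million([100, 150, 225, 225, 300], 500_000)
print(example_rates)  # [100.0, 150.0, 0.0, 150.0]
```

Dividing by population is what makes a small county and a large country comparable on the same axis.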

First, here are the global results. Note that each plot is interactive, with tooltips and scroll-to-zoom enabled:

And here are the results by state in the US:

And here are the results for the top 100 counties in the US:

I think this county visualization shows why it’s so important to have data at a sub-state resolution. At the time of writing, my home state of Wisconsin has 150 new cases per million per day, but my home county of Milwaukee has 300 new cases per million per day. That makes things about twice as risky here as in the state overall, and on par with the state of Texas, but nobody is communicating this risk! The media certainly aren’t reporting on Milwaukee with the same level of alarm as Texas, but they should be.

If you want to look up your own county, here’s a searchable/sortable table with all of the country, state, and top 250 county results. Note that there’s an additional column in this table called simple_probability, which is the probability that a given person in the region is infected. This column is created by summing up the number of new cases over the past ten days, multiplying by ten, and dividing by the region’s population. This is a huge simplification for a number of reasons I get into in my next post. But in the absence of any other source for this risk estimate, I’ll keep providing it as a back-of-the-envelope estimate. Just know that this probability will be an underestimate during the steepest growth of new cases, and an overestimate when new cases are flat or declining.
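A minimal sketch of this back-of-the-envelope calculation follows. It assumes the scaled ten-day case count is divided by the region's population to turn it into a probability. The function signature, the cap at 1.0, and the example region are assumptions for illustration, not the code behind the table.

```python
def simple_probability(daily_new_cases, population, window=10, multiplier=10):
    """Rough share of a region's population that is currently infected.

    Sums new cases over the last `window` days, scales by `multiplier`,
    and divides by population (the division and the cap are assumptions).
    """
    if population <= 0:
        raise ValueError("population must be positive")
    recent = daily_new_cases[-window:]
    estimate = sum(recent) * multiplier / population
    # Assumption: cap at 1.0 so the result is always a valid probability.
    return min(estimate, 1.0)


# Hypothetical region of one million people with 300 new cases per day,
# i.e. 300 new cases per million per day.
example = simple_probability([300] * 10, 1_000_000)
print(example)  # 0.03
```

So a place running at 300 new cases per million per day for ten days comes out at roughly a 3% chance that any given person is infected under this simple estimate.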

Update: Newer estimates put the correct multiplier in the 4-8x range, but I’m going to keep using a 10x multiplier. This is because 10x will probably still be right in certain contexts like the exponential growth phase, and there’s probably more downside to understating things than overstating them currently.
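Because the estimate is linear in the multiplier, a value computed at 10x can be converted to another multiplier by simple scaling. The sketch below shows this, assuming the value has not been capped or rounded; the helper name and the 0.03 starting value are hypothetical.

```python
def rescale_estimate(estimate, old_multiplier=10, new_multiplier=4):
    """Convert an estimate computed with one multiplier to another multiplier."""
    if old_multiplier <= 0:
        raise ValueError("old_multiplier must be positive")
    return estimate * new_multiplier / old_multiplier


# Hypothetical region whose 10x estimate is 0.03 (3%).
for multiplier in (4, 8, 10):
    rescaled = rescale_estimate(0.03, 10, multiplier)
    print(f"{multiplier}x multiplier: {rescaled:.1%}")
```

With these numbers the 4-8x range corresponds to roughly 1.2% to 2.4%, compared with 3% at 10x.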

So overall I think these plots do a pretty good job of meeting the criteria I set out initially. But they could still use a more robust estimate for the probability of infection, which I try to calculate in my next post.

References

[1] Our World in Data, Coronavirus coverage. https://ourworldindata.org/coronavirus

[2] The Covid Tracking Project. https://covidtracking.com/

[3] NYTimes, Coronavirus (Covid-19) Data in the United States. https://github.com/nytimes/covid-19-data

[4] Trevor Bedford. https://twitter.com/trvrb/status/1249414308355649536

[5] Coronavirus Infections Much Higher Than Reported Cases in Parts of U.S., Study Shows. NYTimes. https://www.nytimes.com/2020/07/21/health/coronavirus-infections-us.html

[6] Commercial Laboratory Seroprevalence Survey Data, CDC. https://www.cdc.gov/coronavirus/2019-ncov/cases-updates/commercial-lab-surveys.html