Visualizing the Covid-19 Pandemic

I’ve held off on creating a visualization of the pandemic so far because I think things have been covered well by others. But there are still a few visualizations that I’d like to see, so I decided to give it a try. When I look at a Covid-19 visualization, I generally want to answer a few questions:

  • How bad is it? Total numbers are needed here to show the human toll.
  • Is it getting worse? Time series data is helpful here to put current data in context. Preferably show new cases rather than total cases on the y-axis so we can quickly see increases.
  • How effective is our response? Per capita numbers are useful here because they allow us to see how well governments are responding relative to each other.
  • How risky are things for me? Per capita numbers are also helpful here because they’re proportionate to individual risk. But individual risk can also be calculated using a model, which I do below. Either way, the numbers have to be at a high enough resolution (e.g. state or county) to matter for individuals.

So I created a few visualizations to try to answer these questions. And just to get this out of the way at the start, I want to stress that I am not an epidemiologist. But for the most part I’m communicating existing information rather than creating it, so this shouldn’t be too much of a problem.

Data Sources

For the international timeseries, I’m using the data compiled by Our World In Data. The state level results in the US are from The Covid Tracking Project. For the infection probabilities, I’m using the predicted infections timeseries from Youyang Gu’s model along with population data from the Census Bureau. These visualizations wouldn’t be possible without the tireless work of these groups, so I appreciate their effort.

Of course, these sources aren’t perfect and some leaders have decided to bury their heads in the sand rather than mount an effective response. There’s not a lot that can be done about the suppression of data, so I’ll just have to live with the results I have for now. All of my code for this post is available here.

Population Adjusted Timeseries

The first two visualizations are just timeseries plots showing new cases per million on the y-axis. The bubble size and color represent the total deaths for each place. I think this does a good job of communicating both the current state of things and the cumulative toll.
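As a rough sketch of the y-axis calculation (not the code behind the plots), new cases per million can be derived from a cumulative case series like this. The function name and the sample numbers are made up for illustration, and it assumes the source data is a cumulative count per day:

```python
def new_cases_per_million(cumulative_cases, population):
    """Convert a cumulative case series into daily new cases per million people."""
    if population <= 0:
        raise ValueError("population must be positive")
    # Daily new cases are the differences between consecutive cumulative counts.
    daily_new = [
        later - earlier
        for earlier, later in zip(cumulative_cases, cumulative_cases[1:])
    ]
    return [count * 1_000_000 / population for count in daily_new]


# Hypothetical region with 2 million residents and three days of cumulative counts.
print(new_cases_per_million([0, 10, 30], 2_000_000))
```

Real feeds sometimes contain downward revisions, which would show up here as negative daily values, so a production version would need to decide how to handle those.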

First, here are the global results. Note that each plot is interactive with tooltips and scroll to zoom enabled:

And here are the results by state in the US:

So overall I think these two plots do a pretty good job of meeting the criteria I set out initially. But one thing they could do better is communicate individual risk by calculating the probability of infection. This is what I try to do next.

What’s the probability a person has Covid-19?

At first I thought this would be a fairly simple question to answer. Just sum up the new cases over the past 14 days and divide by the total population of each region, right? But this probability ended up being really difficult to estimate, for a few reasons outlined in this paper:

  1. On average, there’s a 10-day lag between an infection and a reported case and a 20-day lag between an infection and death. This means the counts we see today reflect the past. The effect of this lag on the numbers depends on the growth rate of the pandemic at the time.
  2. Roughly 35-40% of cases are asymptomatic. These cases will never show up in the numbers unless we do random testing.
  3. Even among symptomatic people, a large fraction (right now estimated at 65% for the US) will never have a positive test. Perhaps they don’t seek one out, one isn’t available, or they have a false negative result.
  4. The availability of testing influences many of the numbers above.
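To make the gap concrete, here is a small sketch of the naive 14-day estimate alongside the share of infections that would ever be reported if items 2 and 3 are taken at face value. The function names and sample inputs are illustrative assumptions, and the second function ignores the lag and testing-availability effects in items 1 and 4:

```python
def naive_prevalence(daily_new_cases, population, window=14):
    """Naive estimate: reported cases in the last `window` days / population."""
    if population <= 0:
        raise ValueError("population must be positive")
    return sum(daily_new_cases[-window:]) / population


def detected_fraction(asymptomatic_share, untested_symptomatic_share):
    """Share of infections that are symptomatic AND get a positive test."""
    return (1.0 - asymptomatic_share) * (1.0 - untested_symptomatic_share)


# Hypothetical region: 100 reported cases per day, 1 million residents.
print(naive_prevalence([100] * 20, 1_000_000))

# With 40% asymptomatic and 65% of symptomatic cases never testing positive,
# only about a fifth of infections would appear in the reported counts.
print(round(detected_fraction(0.40, 0.65), 2))  # 0.21
```

Dividing the naive estimate by this fraction is tempting, but it would still miss the reporting lag, which is part of why a model-based estimate is used instead.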

The end result is that it’s probably more accurate to estimate the true number of infections using a SEIR model that matches the reported deaths, rather than back-calculating infections from reported cases. This is what the model I ended up choosing does; see the appendix below for more information.
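For readers who haven’t seen one, here is a minimal discrete-time SEIR sketch. It is not Youyang Gu’s model, which additionally tunes its parameters so that predicted deaths match reported deaths; the parameter values below are placeholders chosen only to make the example run:

```python
def simulate_seir(population, initial_infectious, beta, sigma, gamma, days):
    """Step a basic SEIR model forward one day at a time.

    beta  : transmission rate per day
    sigma : rate at which exposed people become infectious (1 / incubation days)
    gamma : rate at which infectious people are removed (1 / infectious days)
    Returns a list of (S, E, I, R) tuples, one per day including day 0.
    """
    s = float(population - initial_infectious)
    e, i, r = 0.0, float(initial_infectious), 0.0
    history = [(s, e, i, r)]
    for _ in range(days):
        new_exposed = beta * s * i / population
        new_infectious = sigma * e
        new_removed = gamma * i
        s -= new_exposed
        e += new_exposed - new_infectious
        i += new_infectious - new_removed
        r += new_removed
        history.append((s, e, i, r))
    return history


# Placeholder parameters, not fitted to any real data.
history = simulate_seir(1_000_000, 10, beta=0.5, sigma=0.2, gamma=0.1, days=100)
peak_day = max(range(len(history)), key=lambda day: history[day][2])
print("Day with the most infectious people:", peak_day)
```

The quantity of interest for this post is the E and I compartments over time: divided by the population, they give the kind of infection probability shown in the plots that follow.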

So after that lengthy introduction, here is the probability that a person is infected by state in the US:

This model also has results for some counties that contain major urban areas, so here’s the probability of infection by county:

Note that although the peak for New York state above is a little over 5%, the peak for the five boroughs of New York City is over 10%. This means that if you attended a meeting with 10 random people at the peak of the outbreak, there was a 65% chance that at least one attendee was infected (1-(1-p)^n = 1-(1-0.1)^10 = 0.65). One thing to add is that the probability someone is infected isn’t necessarily equal to the probability they’re infectious: there may be a smaller window of time in which someone can actually spread the infection, but that’s still uncertain for now.
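The group calculation above can be written as a short function. This is a sketch that assumes each attendee is infected independently with the same probability p, which won’t hold exactly for households or coworkers; the function name is just for illustration:

```python
def prob_at_least_one_infected(p, n):
    """Probability that at least one of n people is infected: 1 - (1 - p)^n."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n < 0:
        raise ValueError("n must be non-negative")
    return 1.0 - (1.0 - p) ** n


# New York City at its peak: p = 0.10, meeting with 10 people.
print(round(prob_at_least_one_infected(0.10, 10), 2))  # 0.65
```

The same function works for any regional probability and group size, so it can be reused with the values in the table below.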

Sometimes a table is the best way to visualize data, so here’s the state and county data combined and sorted by probability:

The idea here is that people can use these probabilities to estimate the risk of their lifestyle given their location. So if there’s a 1% chance that someone is infected in your region, attending a meeting with 10 individuals means there’s a 9.6% chance of being exposed to at least one person with Covid-19 (1-(1-0.01)^10 = 0.096). Of course, this approach could backfire. Here are some potential problems:

  • This approach requires some math, but creating a risk calculator similar to this one could help.
  • People could just be really bad at estimating how many people they interact with.
  • The spread via aerosols or surfaces could make estimating the number of interactions impossible.
  • County level data isn’t available for every urban area, so people may underestimate their risk by using statewide risk estimates.

But I think this visualization provides people with more actionable information than others I’ve seen, so I decided to put it out there. I’ll try to update it as often as possible when new data is available. If you want to embed these visualizations elsewhere, please let me know because I could host them on Amazon S3 or something.

Appendix: Model Selection

There are a number of models that try to predict the course of the pandemic, most of which are compiled on the Reich Lab forecasting hub and FiveThirtyEight. But the only ones that include an estimate of the true number of infections over time are the models created by IHME, Columbia, Imperial College, and Youyang Gu.

First, I looked at IHME’s model, but immediately something seemed off. Here are the predicted cases for Wisconsin during a time when cases were increasing:

Predicted infections are actually lower than measured infections, something that would only happen due to testing lag when cases were declining significantly. This could be because their initial model fit a Gaussian curve to the data, which forced a symmetric increase and decline around a peak. While I don’t think this is their approach anymore, everything still seemed to be asymptoting towards zero when I reviewed it, so that doesn’t inspire much confidence.

The Columbia and Imperial models do a better job, but they’re not updated as frequently as I’d like. Youyang Gu’s model is updated daily, performs really well in predictions, and has good reviews from subject matter experts, so I decided to use it. But I still wanted to do a few checks to validate it. First, I compared it to Imperial College’s model:

If they agreed perfectly, their estimates would sit on a 45 degree line. So there’s some deviation, especially in the high case counts. One way to quantify this deviation is using a concordance correlation coefficient, which ends up being 0.917. Perfect concordance would give a value of 1.0, so this is pretty good.
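For reference, here is a sketch of Lin’s concordance correlation coefficient in plain Python. It uses population (divide-by-n) variances and covariance, which is one common convention; the exact variant behind the 0.917 figure isn’t stated, and the sample inputs are invented:

```python
def concordance_correlation(x, y):
    """Lin's concordance correlation coefficient between two paired series."""
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("x and y must be non-empty and the same length")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    denominator = var_x + var_y + (mean_x - mean_y) ** 2
    if denominator == 0:
        raise ValueError("both series are constant and equal; CCC is undefined")
    return 2 * cov_xy / denominator


# Perfectly correlated but offset by 1: Pearson's r would be 1.0, CCC is lower.
print(concordance_correlation([1, 2, 3], [2, 3, 4]))
```

Unlike Pearson correlation, this penalizes a consistent offset or scale difference between the two models, which is why it suits a comparison against the 45 degree line.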

Next, I compared it against recent serology tests in Spain, which suggest 5% of the population has been infected so far. If we just sum up the reported case counts as of 5/13/2020 and divide by the Spanish population, it gives an estimated percent infected of 0.5%, which is a 10x undercount. But if we sum up the predicted infections from Youyang’s model and divide by the population, we get an estimated 6.7% of the population infected. So this is at least of the right order of magnitude, and could be correct depending on when the serology study actually ended.

New York state also completed a serology study on April 23rd that estimated a New York City infection rate of 21.2% and a statewide rate of 13.9%. The model predictions of 20.9% in the city and 12.5% statewide as of 4/23/2020 are quite close. So overall this seems like a quality model and I’ll probably continue using it.
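The serology comparisons above can be summarized with a few lines of arithmetic. This sketch only uses the percentages quoted in this appendix; the variable and function names are illustrative:

```python
# (location, serology estimate in %, model estimate in %), as quoted above.
CHECKS = [
    ("Spain", 5.0, 6.7),
    ("New York City", 21.2, 20.9),
    ("New York state", 13.9, 12.5),
]


def compare(serology_pct, model_pct):
    """Return (difference in percentage points, model / serology ratio)."""
    return model_pct - serology_pct, model_pct / serology_pct


for name, serology, model in CHECKS:
    diff, ratio = compare(serology, model)
    print(f"{name}: {diff:+.1f} points, ratio {ratio:.2f}")

# Reported cases alone gave about 0.5% for Spain versus 5% from serology.
undercount_factor = 5.0 / 0.5
print("Spain undercount from reported cases:", undercount_factor)
```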


[1] Our World in Data, Coronavirus coverage.

[2] The Covid Tracking Project.

[3] COVID-19 Projections Using Machine Learning.

[4] Communicating the Risk of Death from Novel Coronavirus Disease (COVID-19).

[5] COVID-19 Pandemic Planning Scenarios.

[6] Variation in False-Negative Rate of Reverse Transcriptase Polymerase Chain Reaction–Based SARS-CoV-2 Tests by Time Since Exposure.

[7] Using a delay-adjusted case fatality ratio to estimate under-reporting.

[8] Inferring cases from recent deaths.

[9] Coronavirus Case Counts Are Meaningless.

[10] Reich Lab COVID-19 Forecast Hub.

[11] Where The Latest COVID-19 Models Think We’re Headed — And Why They Disagree.

[12] The results of a Spanish study on Covid-19 immunity have a scary takeaway.

[13] Online COVID-19 Dashboard Calculates How Risky Reopenings and Gatherings Can Be.

[14] COVID-19 Event Risk Assessment Planning tool.