Hypothesis testing on Instagram: COVID-19 prevalence in my network

On an otherwise boring Friday night, I decided to launch a poll on my Instagram to ask my friends if they had tested positive for COVID-19. This post recounts the story.

As the days of 2020 went by, I started counting the number of people I knew who had tested positive for COVID-19. By mid-October, that number had reached 16. These are friends, acquaintances, or family members who had told me that they had tested positive. My immediate reaction was the following: if the official statistics are to be trusted, the world incidence rate of COVID-19 is about 0.5%. If I know 16 people who have been infected, does that mean that I personally know 3,200 people (16 divided by 0.005)? A quick glance at my Facebook friends list confirmed that no, I do not know that many people. But this was a mere observation; what if I could conduct a small poll from a (random) sample and ask how many people had been infected with COVID-19?
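The back-of-the-envelope arithmetic in that paragraph can be written out as a tiny sketch. The function name `implied_network_size` is made up for illustration; the inputs (16 known cases, a 0.5% incidence rate) are the ones quoted above.

```python
def implied_network_size(known_cases: int, incidence_rate: float) -> float:
    """How many people I would need to know for `known_cases` infections
    to be expected at the given incidence rate."""
    if not 0 < incidence_rate <= 1:
        raise ValueError("incidence_rate must be in (0, 1]")
    return known_cases / incidence_rate


# 16 known cases at a 0.5% incidence rate
print(round(implied_network_size(16, 0.005)))  # 3200
```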

That’s how the idea of using the Instagram poll feature came up. The methodology, results and commentary are below.

Describing the Sample that responded to my Instagram Poll

My Instagram network is by no means representative of the world. It mostly reflects the places where I’ve lived, and the people I’ve met along the way. I have been fortunate to live in a number of countries, and so I’ve met people from all over the world. Below is a geographic distribution of the people who have replied to my poll:

As the figure above shows, the 82 respondents of my Insta Poll reside in 24 countries. The “Other” category contains 12 respondents from 12 different countries (Brazil, Panama, Mauritius, Japan, Rwanda, Belgium, Hong Kong, Tunisia, UK, Nigeria, Spain, Korea). Out of the 24 countries represented in this poll, 11 are currently among the 30 countries most severely affected by COVID-19, as measured by total cases reported to date (Source: Worldometers Coronavirus Tracker).

Based on this description of the sample, three things are clear: (1) this is a very small sample, (2) it is not skewed towards one specific geography and (3) people from the most severely hit country (USA) are over-represented.

Another important thing to keep in mind about the demographics of this sample is that all respondents are 20-something years old, live in urban areas, have college degrees, and are more likely to travel than other people.

Null Hypothesis: The COVID-19 incidence rate

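Hypothesis testing starts with a null hypothesis: an assumed value for the true rate in the population, here the COVID-19 incidence rate. Then, assuming this null hypothesis to be true, we ask how likely it is that a sample would differ from it as much as ours did purely by chance. If that probability (the p-value) falls below a chosen significance level, we say that the difference is statistically significant. When that’s the case, we reject the null hypothesis.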

Now, we could choose different null hypotheses to compare the results of my poll with. For example:

  • Choose the world incidence rate of COVID-19: 0.5%. We get to this number by dividing the total number of confirmed cases worldwide (~41 million) by the world population (~7.4 billion).
  • Choose the incidence rate of the top 10 worst affected countries: 1.18%.
  • Choose the incidence rate of the top 30 worst affected countries: 0.93%.
  • Construct a hypothetical incidence rate based on the incidence rates in the countries represented in my sample, and their respective weights within the sample: 1.9%.

I am choosing different null hypotheses because my sample is definitely not representative of the world population, so I am trying to find the population that my sample might most realistically be mirroring. This is an imperfect exercise, simply because I do not have enough data. If I had more data on all sorts of demographic splits, I could try to find incidence rates for urban / rural areas, male / female, young / old, college grads / non-college grads, etc.

Alas, this is the closest I could get with the data that I have on hand.
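For readers who want to see how the first and last of these null rates could be computed, here is a minimal sketch. The world figures are the rounded ones quoted above. The per-country numbers are invented placeholders, since this post does not list respondent counts or incidence rates by country, and the function names are my own.

```python
from typing import Mapping


def incidence_rate(confirmed_cases: float, population: float) -> float:
    """Share of a population with a confirmed case."""
    return confirmed_cases / population


def weighted_incidence(respondents: Mapping[str, int],
                       rates: Mapping[str, float]) -> float:
    """Average of country-level rates, weighted by each country's
    number of respondents in the sample."""
    total = sum(respondents.values())
    return sum(n * rates[country] for country, n in respondents.items()) / total


# World rate from the rounded figures in the text (~41 million / ~7.4 billion)
world_rate = incidence_rate(41e6, 7.4e9)
print(f"World incidence rate: {world_rate:.2%}")  # World incidence rate: 0.55%

# HYPOTHETICAL placeholder data, NOT the real poll breakdown
example_respondents = {"Country A": 50, "Country B": 32}
example_rates = {"Country A": 0.02, "Country B": 0.01}
print(f"Weighted rate: {weighted_incidence(example_respondents, example_rates):.2%}")
```

Weighting by respondents means that a country with many respondents pulls the null rate towards its own incidence rate, which is the point of the fourth scenario.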

The COVID-19 incidence rate in my Instagram sample: 6%

Now, let’s test whether this difference between the sample and the true population is due to randomness, or whether it is actually statistically significant (yes, there will be p-values).

We reject every null hypothesis about COVID-19 incidence rate at the 5% significance level

As you can see, we reject the null hypothesis at the 5% significance level in all 4 scenarios:

  • In the first scenario, the probability of drawing a sample at least as extreme as ours, given the global COVID-19 incidence rate of 0.5%, is less than 0.000000000000665 (that’s 13 zeros, counting the one before the decimal point). So we reject the null hypothesis at both the 5% and 1% significance levels. Again, I would like to reiterate that my sample probably does not represent the world population! We reach a similar conclusion for scenarios 2 and 3.
  • In the fourth scenario, things get a little bit more interesting. We still reject the null hypothesis at the 5% significance level, but the p-value is much larger than in the previous three scenarios. Moreover, we fail to reject the null hypothesis at the 1% significance level. So... could it be that my sample is representative of the 24 countries where my respondents come from? Possibly…
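The four tests above can be sketched in a few lines of Python. Two things here are assumptions rather than facts from the poll. First, the count of 5 positives out of 82 is inferred from the rounded 6% rate, since the exact count is not stated. Second, the choice of test is mine: a two-sided one-sample z-test for a proportion, with an exact one-sided binomial tail alongside it, because the normal approximation is rough when the expected number of positives under the null is this small. The output therefore need not match the p-values reported above.

```python
import math
from typing import Tuple


def two_sided_z_test(positives: int, n: int, p0: float) -> Tuple[float, float]:
    """Return (z, p_value) for H0: true proportion == p0,
    using the normal approximation to the binomial."""
    p_hat = positives / n
    se = math.sqrt(p0 * (1 - p0) / n)  # standard error under H0
    z = (p_hat - p0) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p_value


def exact_upper_tail(positives: int, n: int, p0: float) -> float:
    """P(X >= positives) for X ~ Binomial(n, p0): an exact one-sided p-value."""
    return sum(math.comb(n, k) * p0**k * (1 - p0) ** (n - k)
               for k in range(positives, n + 1))


# Null rates quoted in the text
NULL_RATES = {
    "world": 0.005,
    "top 10 countries": 0.0118,
    "top 30 countries": 0.0093,
    "weighted by sample countries": 0.019,
}

# ASSUMPTION: 5 positives out of 82 respondents (consistent with ~6%)
POSITIVES, N = 5, 82

for name, p0 in NULL_RATES.items():
    z, p_two_sided = two_sided_z_test(POSITIVES, N, p0)
    p_exact = exact_upper_tail(POSITIVES, N, p0)
    print(f"{name}: z = {z:.2f}, two-sided p = {p_two_sided:.3g}, "
          f"exact one-sided p = {p_exact:.3g}")
```

A p-value below 0.05 (or 0.01) corresponds to rejecting that null hypothesis at the 5% (or 1%) significance level.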

Why did I do all of this? (Well, besides 2020 boredom)

More broadly, I think this polling feature on Instagram could be exploited for similar research questions, if and only if it can be anonymized. My friends were kind enough to share private information, and I promised not to share individual results, just the aggregate. But imagine if we could launch anonymized polls on some celebrity’s Instagram (Cristiano Ronaldo has 240 million followers, for example). We could get a lot of good insights, very quickly, and very cheaply.

Finally, I would like to close by stressing that I am not an expert in statistics, so if I abused the subject, please write me immediately (kenza.bouhaj@gmail.com)! I am thankful for my statistics class at the Harvard Kennedy School for re-teaching me all these concepts, and to my Insta friends for indulging me on that otherwise boring Friday night.

Curious. Passionate about storytelling through data. Interested in Work, Skills and EdTech. Twitter: @KenzaBouhaj
