Hypothesis testing on Instagram: COVID-19 prevalence in my network

Kenza Bouhaj
7 min read · Oct 22, 2020

On an otherwise boring Friday night, I decided to launch a poll on my Instagram to ask my friends if they had tested positive for COVID-19. This post recounts the story.

As the days of 2020 went by, I started counting the number of people I knew who had tested positive for COVID-19. By mid-October, that number had reached 16. These are friends, acquaintances, or family members who had told me that they had tested positive. My immediate reaction was the following: if the official statistics are to be trusted, the world incidence rate of COVID-19 is 0.5%. If I know 16 people who have been infected, does that mean that I personally know 3,200 people (16 ÷ 0.005 = 3,200)? A quick glance at my Facebook friends list confirmed that no, I do not know that many people. But this was a mere observation; what if I could conduct a small poll of a (random) sample and ask how many people had been infected with COVID-19?

That’s how the idea of using the Instagram poll feature came up. The methodology, results and commentary are below.

Describing the Sample that responded to my Instagram Poll

As I mentioned before, I launched a poll on a Friday evening on my Instagram “story”, asking my friends to answer the question: “Have you had COVID-19?”. They could either click “Yes” or “No” and I could track the responses of the poll. Upon reflection, the more scientific and accurate question that I should have asked is: “Have you tested positive for COVID-19?” but I realized that a bit too late, so for the purposes of this post, we will stick to the original question that I asked: “Have you had COVID-19?”. On that day, 154 people saw my “story,” meaning that they had the opportunity to engage with the poll. Out of those 154 people, 82 responded to my poll, a 53% response rate. Now, it could be that those who did not respond did not get tested for COVID-19, did not want to reveal health information, or simply did not want to engage with the poll.

My Instagram network is by no means representative of the world. It mostly reflects the places where I’ve lived, and the people I’ve met along the way. I have been fortunate to live in a number of countries, and so I’ve met people from all over the world. Below is a geographic distribution of the people who have replied to my poll:

As the figure above shows, the 82 respondents of my Insta Poll reside in 24 countries. In the “Other” category, there are 12 respondents from 12 different countries (Brazil, Panama, Mauritius, Japan, Rwanda, Belgium, Hong Kong, Tunisia, UK, Nigeria, Spain, Korea). Out of the 24 countries represented in this poll, 11 are currently among the 30 countries most severely affected by COVID-19, as measured by total cases reported to date (Source: Worldometers Coronavirus Tracker).

Based on this description of the sample, three things are clear: (1) this is a very small sample, (2) it is not skewed towards one specific geography and (3) people from the most severely hit country (USA) are over-represented.

Another important thing to keep in mind about the demographics of this sample is that all respondents are 20-something years old, live in urban areas, have college degrees, and are more likely to travel than other people.

Null Hypothesis: The COVID-19 incidence rate

The first step of statistical inference is to choose the null hypothesis. In this case, my null hypothesis is that the true incidence rate in the population my sample is drawn from equals the proportion of the world population that has tested positive for COVID-19.

Then, assuming this null hypothesis to be true, we ask how likely it is that the result from the sample differs from the hypothesized rate purely by chance. If chance alone is very unlikely to explain the difference, we say that the difference is statistically significant. When that’s the case, we reject the null hypothesis.

Now, we could choose different null hypotheses to compare the results of my poll with. For example:

  • Choose the world incidence rate of COVID-19: roughly 0.5%. We get to this number by dividing the total number of confirmed cases worldwide (~41 million) by the world population (~7.4 billion).
  • Choose the incidence rate of the top 10 worst affected countries: 1.18%.
  • Choose the incidence rate of the top 30 worst affected countries: 0.93%.
  • Construct a hypothetical incidence rate based on the incidence rates in the countries represented in my sample, and their respective weights within the sample: 1.9%
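The arithmetic behind the first and last options can be sketched in a few lines. This is only an illustration: the function names are made up for this example, and the per-country figures passed to `weighted_rate` are hypothetical placeholders, not the poll data (the post does not list the per-country rates or counts).

```python
def incidence_rate(confirmed_cases: float, population: float) -> float:
    """Share of a population with a confirmed case."""
    return confirmed_cases / population


def weighted_rate(rates: dict, respondents: dict) -> float:
    """Average of country rates, weighted by respondents per country."""
    total = sum(respondents.values())
    return sum(rates[country] * n for country, n in respondents.items()) / total


# World rate from the figures quoted above (~41 million cases, ~7.4 billion people).
world = incidence_rate(41e6, 7.4e9)
print(f"{world:.2%}")  # 0.55%, i.e. roughly the 0.5% used in this post

# HYPOTHETICAL inputs, only to show the mechanics of the weighting.
example_rates = {"Country A": 0.02, "Country B": 0.01}
example_respondents = {"Country A": 3, "Country B": 1}
example = weighted_rate(example_rates, example_respondents)
print(f"{example:.2%}")  # 1.75%
```

Weighting by respondents means a country with many respondents pulls the hypothetical rate towards its own incidence rate, which is the point of the fourth option.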

I am choosing different null hypotheses because my sample is definitely not representative of the world population, so I am trying to find the population that my sample might realistically be mirroring. This is an imperfect exercise, simply because I do not have enough data. If I had more data on all sorts of demographic splits, I could try to find incidence rates among urban / rural areas, male / female, young / old, college grads / non-college grads, etc.

Alas, this is the closest I could get with the data that I have on hand.

The COVID-19 incidence rate in my Instagram sample: 6%

As the figure below indicates, 5 out of the 82 respondents reported having had COVID-19, which makes the incidence rate in this sample 6.1%. To put things into perspective, my sample has an incidence rate of COVID-19 that is about 12 times what is officially reported worldwide (0.5% × 12 = 6%).
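As a quick check of these two figures, here is the arithmetic written out. The variable names are illustrative only, and the world rate is the rounded 0.5% used throughout this post.

```python
positives = 5        # respondents who answered "Yes"
respondents = 82     # total respondents to the poll
world_rate = 0.005   # rounded world incidence rate (0.5%)

sample_rate = positives / respondents
ratio = sample_rate / world_rate

print(f"{sample_rate:.1%}")  # 6.1%
print(f"{ratio:.1f}x")       # 12.2x
```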

Now, let’s test whether this difference between the sample and the true population is due to randomness, or whether it is actually statistically significant (yes, there will be p-values).

We reject every null hypothesis about COVID-19 incidence rate at the 5% significance level

Now that we have all the data we need, we can proceed with the testing. The table below summarizes the findings:

As you can see, we reject the null hypothesis at the 5% significance level in all 4 scenarios, and at the 1% significance level in 3 of them:

  • In the first scenario, the probability of drawing a sample at least as extreme as ours, given the global COVID-19 incidence rate of 0.5%, is less than 0.000000000000665 (that’s 13 zeros). So we reject the null hypothesis at the 5% and 1% significance levels. Again, I would like to reiterate that my sample probably does not represent the world population! We reach a similar conclusion for scenarios 2 and 3.
  • In the fourth scenario, things get a little bit more interesting. We still reject the null hypothesis at the 5% significance level, but the p-value is much larger than in the previous three scenarios. Moreover, we fail to reject the null hypothesis at the 1% significance level. So... could it be that my sample is representative of the 24 countries where my respondents come from? Possibly…
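For readers who want to experiment, here is a minimal sketch of how p-values like these could be computed. It assumes a one-sample z-test for a proportion with a two-sided p-value and the standard error taken under the null hypothesis; the post does not state which test produced the table, so treat that choice, the function name, and the scenario labels as assumptions. The null rates are the rounded values listed earlier.

```python
import math


def z_test_proportion(successes: int, n: int, p0: float) -> tuple:
    """One-sample z-test for a proportion; returns (z, two-sided p-value)."""
    p_hat = successes / n
    se = math.sqrt(p0 * (1 - p0) / n)   # standard error under the null
    z = (p_hat - p0) / se
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) for the standard normal
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Rounded null rates from the four scenarios (labels are illustrative).
null_rates = {
    "world": 0.005,
    "top 10 countries": 0.0118,
    "top 30 countries": 0.0093,
    "sample-weighted": 0.019,
}

for name, p0 in null_rates.items():
    z, p = z_test_proportion(5, 82, p0)
    print(f"{name}: z = {z:.2f}, p = {p:.3g}")
```

Under these assumptions, the first scenario gives a p-value of roughly 6.7e-13, in line with the figure quoted above. The other rows depend on the unrounded rates and on the exact test behind the table, so they may not match it. Note also that with only 5 positives the normal approximation behind a z-test is rough, and an exact binomial test can give noticeably different p-values.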

Why did I do all of this? (Well, besides 2020 boredom)

The short answer to this question is the following: I am skeptical of the official reporting about the global COVID-19 incidence rate. Even before I conducted this small and fun experiment, I was suspicious of the numbers, and thought I knew too many people who had been infected with COVID-19, and that the numbers did not match what the official statistics were reporting. It is troubling to me, because all the people whom I knew to have had the disease tested positive for it (it wasn’t speculation on their part), so they should have been, in theory, accounted for in the official statistics. The polling on Instagram, although far from perfect, did confirm my suspicion.

More broadly, I think this polling feature on Instagram could be used for similar research questions, if and only if it can be anonymized. My friends were kind enough to share private information, and I promised not to share individual results, just the aggregate. But imagine if we could launch anonymized polls on some celebrity’s Instagram (Cristiano Ronaldo has 240 million followers, for example). We could get a lot of good insights, very quickly, and very cheaply.

Finally, I would like to close by stressing that I am not an expert in statistics, so if I abused the subject, please write me immediately (kenza.bouhaj@gmail.com)! I am thankful for my statistics class at the Harvard Kennedy School for re-teaching me all these concepts, and to my Insta friends for indulging me on that otherwise boring Friday night.

Kenza Bouhaj

Curious. Passionate about storytelling through data. Interested in Work, Skills and EdTech. Twitter: @KenzaBouhaj