Who will win the Champions League this year? My models agree on one winner: Bayern Munich

Kenza Bouhaj
7 min read · Jul 23, 2020

In this blog post, I use the caret package in R to build my models. The corresponding course on DataCamp is linked here.

Ah, Football! The Beautiful Game!

I knew 2020 was a truly strange year when European football was suspended back in March. This massive industry, which generated 28.9 billion euros in 2019 (up 85% from 2009), has fans all over the world. I have been a football fan for as long as I can remember; my earliest memories include words such as “Zidane” and “Arsenal”. In my household, an El Clásico between FC Barcelona and Real Madrid is anticipated for weeks, with the same enthusiasm reserved for a national celebration.

Now that the competition is set to resume in about two weeks, I thought it would be a great idea to use my newly acquired data skills to predict who might be the winner this year.

It was an absolute treat to do this exercise. Besides being upset that my team, FC Barcelona, does not fare well in my results (ok, only mildly, because they don’t deserve to win anything this year), I am happy to be able to do this kind of analysis.

Assembling the Data

Since the Champions League was paused in March 2020 in the middle of the Round of 16, twelve teams remain in the competition. Because we do not have enough information about what happens in the subsequent rounds, I construct a dataset that is based on (a) historical group stage results and (b) the performance of competing teams in their home leagues.

My final dataset thus consists of a union of two sets of data for 16 years (from the 2003/2004 until the current 2019/2020 season):

  • Historical Champions League data, including group stage results, country coefficients, and whether a team was a finalist or a winner
  • Historical Home League data, mainly final standings

My final dataset contains ~500 observations and 12 variables.
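To make the "union" of the two sources concrete, here is a minimal sketch in Python (the post itself works in R). It assumes the two tables are matched on team and season, which is my reading of the description rather than something stated outright. The column names and every value in the rows are hypothetical placeholders.

```python
# All rows below are made-up placeholders, not values from the real dataset.
champions_league = [
    {"team": "Team A", "season": "2017/2018", "country_coef": 20.1,
     "group_points": 15, "winner": True},
    {"team": "Team B", "season": "2017/2018", "country_coef": 11.3,
     "group_points": 9, "winner": False},
    {"team": "Team C", "season": "2017/2018", "country_coef": 4.2,
     "group_points": 3, "winner": False},
]
home_league = [
    {"team": "Team A", "season": "2017/2018", "league_rank": 1, "league_points": 90},
    {"team": "Team B", "season": "2017/2018", "league_rank": 3, "league_points": 72},
]


def join_on_team_season(cl_rows, league_rows):
    """Attach each team's home league columns to its Champions League row."""
    # Index the home league rows by (team, season) for quick lookup.
    league_by_key = {(row["team"], row["season"]): row for row in league_rows}
    joined = []
    for cl_row in cl_rows:
        league_row = league_by_key.get((cl_row["team"], cl_row["season"]))
        if league_row is None:
            continue  # no home league record for this team and season
        joined.append({**cl_row, **league_row})
    return joined


dataset = join_on_team_season(champions_league, home_league)
print(len(dataset), "rows with", len(dataset[0]), "columns each")
```

Dropping unmatched rows is a design choice made only for this sketch; keeping them with missing values would be just as reasonable.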

Fetching data for all competing teams, their Champions League performance and their corresponding home league results for the past 16 years was a long process. I was unable to easily scrape tables from websites such as Transfermarkt and BDFutbol, so I spent two days manually pulling the data into a format that I could easily manipulate. Was it fun? Absolutely not. Was it worth it? Only if I get this right (check back with me on August 23!).

Now, there are other predictors that I could have included in this dataset, such as: strength of the squad (e.g., how many players were finalists of the Ballon d’Or the year before?), player transfer data (e.g., how many players were bought that year?), the team’s financials (e.g., what is the team’s budget? revenues? transfer deals?), the team’s coach (e.g., are we dealing with a Zinedine Zidane, a Jürgen Klopp or a Pep Guardiola?), luck factor (e.g., is the team Real Madrid?), etc.

I am sure that all the betting companies, fancy magazines and these clubs’ teams have very detailed and sophisticated data (probably down to what Messi has for breakfast), but in my case we’re dealing with a headcount of one playing with data as a hobby :)

Understanding the Predictors

I aim to predict this year’s winner and runner-up based on what happened in the past 15 seasons of the Champions League. My predictors are as follows:

A. Historical CL data

  • Country coefficients: an index given by UEFA to participating countries that indicates the strength of that country’s competing clubs. It is calculated by looking at the performance of the teams within a country in both the Champions League and the Europa League over the previous five seasons. Therefore, it changes every year. This coefficient is used to determine how many clubs will participate in any given CL season and is also used for seeding the clubs.
  • Group Stage final ranking
  • Group Stage accumulated points
  • Group Stage goals scored vs. received

B. Home League data

  • Final Rank
  • Total league points accumulated
  • Goals scored vs. received

Model 1: Logistic Regression

My first model is a logistic regression model that predicts whether a team will be a winner (Yes / No) or will reach a final (Yes / No).
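As a reminder of what such a model estimates, the standard logistic regression form is sketched below. The notation is introduced here for illustration only: y = 1 means the team wins (or reaches the final), x_1, ..., x_p stand for the predictors listed above, and the beta coefficients are whatever the fitting procedure estimates (their values are not reported in this post).

```latex
P(y = 1 \mid x_1, \dots, x_p)
  = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p)}}
```

A positive coefficient means that a higher value of that predictor raises the estimated probability, holding the others fixed.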

A. Probability of Winning the Champions League

B. Probability of Reaching the Final of the Champions League

Model 2: Random Forest

My second model is a random forest model that predicts whether a team will be a winner (Yes / No) or will reach a final (Yes / No).

A. Probability of Winning the Champions League

B. Probability of Reaching the Final of the Champions League

Which predictors are most important?

A. Country Coefficient

As is probably obvious, the greater the country coefficient of a given team, the more likely that team is to win the Champions League. On average, winners belong to countries whose coefficients are 5 points higher than those of non-winners. To give you an idea of this scale, England had a coefficient of 22.6 in 2018/2019 and Spain a coefficient of 19.6 in the same season. Generally, the five big leagues (Spain, England, Italy, Germany and France) have a coefficient greater than 10.

B. Champions League Group Stage Points

Another important predictor is the Group Stage Points. To illustrate its importance, I plot it against the points accumulated by teams in their home leagues, colored by whether the team ends up winning. The figure below shows that, for the same number of home league points, a team is more likely to win if it earned more points in the Champions League group stage. Interestingly, beyond a certain threshold (above 95 home league points), there have been cases where winners did not do as well in the group stage as non-winners but still ended up winning the CL. This might lead us to believe that teams who have exceptional home seasons might be more likely to win the Champions League.

This pattern makes sense, especially if we account for the noise associated with teams from low-coefficient countries. For example, a team from Azerbaijan could be the best in its home league (i.e., have the largest number of points) but will not do well in the group stage. To remove this noise, I exclude teams whose countries have coefficients below 10; that is, I keep only the five big leagues in the figure below.
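The filtering step can be sketched in a few lines of Python (again, the actual analysis was done in R). The cutoff of 10 comes from the text above. The `country_coef` column name and the sample rows are hypothetical.

```python
COEF_THRESHOLD = 10  # cutoff described in the text

# Made-up rows for illustration only.
sample_rows = [
    {"team": "Team A", "country_coef": 19.6},
    {"team": "Team B", "country_coef": 10.0},
    {"team": "Team C", "country_coef": 3.4},
]


def keep_strong_leagues(rows, threshold=COEF_THRESHOLD):
    """Drop teams whose country coefficient is below the threshold."""
    return [row for row in rows if row["country_coef"] >= threshold]


filtered = keep_strong_leagues(sample_rows)
print([row["team"] for row in filtered])
```

A team sitting exactly at 10 is kept here, since the text only excludes coefficients below 10.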

The figure below shows that the trend still holds: a team’s performance in the Champions League group stage is a better indication of whether it will win than its performance in its home league.

C. Group Stage Goal Difference

Another important predictor is the goal difference (goals scored minus goals received), which I compare between the home league and the Champions League Group Stage. As the figure below shows, most winners have had a goal difference of at least +5 in the group stage. Moreover, for the same home league goal difference, winners are more likely to have higher group stage goal differences.
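For clarity, this is how the two goal differences could be derived from the raw goals columns. It is a small Python sketch with hypothetical column names and made-up numbers, not the code behind the figures.

```python
def goal_difference(scored, received):
    """Goals scored minus goals received."""
    return scored - received


# Made-up season for a single hypothetical team.
sample = {
    "group_scored": 12, "group_received": 4,
    "league_scored": 70, "league_received": 35,
}

group_gd = goal_difference(sample["group_scored"], sample["group_received"])
league_gd = goal_difference(sample["league_scored"], sample["league_received"])

# The "+5 or better" pattern noted in the text, checked for this one row.
meets_typical_winner_margin = group_gd >= 5
print(group_gd, league_gd, meets_typical_winner_margin)
```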

Again, this follows our logic from above. To remove the noise, I once again exclude countries whose coefficients are below 10. The resulting graph shows two much closer trend lines, but the conclusion remains the same: Champions League group stage goal difference is a better indicator than home league goal difference.

Conclusion

It looks like Bayern Munich distinguished itself in my models due to its excellent performance in this season’s group stage. With a perfect 18 points (out of a possible 18), Bayern Munich won its group with a 19 goal differential (ahead of Tottenham, Olympiakos and Crvena zvezda). By comparison, FC Barcelona won its group with 14 points and a goal differential of 5, while Real Madrid came out second in its group.

Bayern Munich is also favored because it has won the Bundesliga with 82 points and a mind-boggling 68 goal differential.

Now, it could be that this model turns out completely wrong, because it over-emphasizes these few metrics and could be missing a lot of crucial information. For example, since 2020 is a strange year, it could be misleading to try to predict results based on what happened in previous years. Most teams have had a cramped 3–4 weeks in which they had to finish the rest of their home leagues’ fixtures. This might be highly motivating for some or highly exhausting for others.

Anyhow, this was all a lot to digest. I leave you with my all-time favorite Champions League image, and I am eager to find out about all of this in about a month!

Update: Bayern Munich did win.

Kenza Bouhaj

Curious. Passionate about storytelling through data. Interested in Work, Skills and EdTech. Twitter: @KenzaBouhaj