(Part Two) Who are the Unbanked in Morocco? Using unsupervised learning to further understand the banking sector’s unserved customers
In my previous blog post, I used descriptive statistics and supervised learning techniques to understand who is likely to have a bank account in Morocco, drawing from the World Bank’s 2017 Global Financial Inclusion Survey. I concluded that the most important indicators of having a bank product in Morocco are employment status (being in the workforce), income level (belonging to the top income quintile) and education level (specifically, having a tertiary degree).
In this blog post, I take a deeper look to understand if I can find distinct groups within the “unbanked” population. This is a useful exercise for banks to understand what the needs of their potential customers are. Once the distinct groups are identified, it is possible to create or tailor banking products that fit the needs of this population.
To accomplish this, I use an unsupervised learning technique called K-means clustering. I then describe the resulting groups using some characteristics provided in the survey in order to label these distinct clusters.
K-means clustering is an unsupervised learning technique that partitions all observations in a dataset into a pre-determined number of clusters, k. It works by choosing k centroids (cluster centers), assigning each data point to its nearest centroid, recomputing each centroid as the mean of the points assigned to it, and repeating these two steps until the assignments stop changing. The goal is to come up with non-overlapping clusters whose members are as similar to one another as possible.
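To make the assign-and-update loop concrete, here is a minimal NumPy sketch of plain k-means. It is an illustration only, not the code behind the analysis in this post: the function names, the random starting centroids, and the toy two-blob dataset are all my own assumptions.

```python
import numpy as np

def nearest_centroid(X, centroids):
    """Index of the closest centroid for every row of X (Euclidean distance)."""
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means sketch; returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Assumption: start from k distinct observations picked at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = nearest_centroid(X, centroids)
        # Update step: each centroid moves to the mean of its points
        # (an empty cluster keeps its previous centroid).
        updated = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(updated, centroids):
            break  # centroids stopped moving
        centroids = updated
    return nearest_centroid(X, centroids), centroids

# Toy data (synthetic, not survey data): two well-separated blobs of 50 points.
rng = np.random.default_rng(42)
toy = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
labels, centroids = kmeans(toy, k=2)
print(np.bincount(labels))
```

On data this cleanly separated, each blob should end up as its own cluster. Real survey data is far less tidy, and the result can depend on the starting centroids, which is why library implementations typically run several random starts.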
For example, let’s say I have one simple dataset of 1,000 observations with only one variable, Gender. Let’s assume the Gender variable takes 3 values: Female, Male and Other. If I run a k-means clustering algorithm on this dataset (with Gender encoded numerically) and k = 3, I will likely have the Females grouped together as one cluster, the Males grouped together as the next cluster, and the Others grouped together as the last cluster. Of course, real-life data is more complicated and messier than just one variable, which is why the k-means technique is useful when we have a large set of variables and we don’t know much about the different sub-groups within the data. More on k-means technique here.
As I briefly mentioned, k-means only works if the number of clusters, denoted k, is set in advance. A necessary first step is to decide what that number is. Thankfully, there is a technique, called the elbow curve, which we can use to come up with a reasonable number of clusters. The elbow curve plots an evaluation metric (called the within-group sum of squares, or wss) against the possible number of clusters (in this example, from 1 to 15). The wss on the y-axis measures how spread out the points are around their centroids, so for a given k, a smaller wss means tighter clusters. In the graph below, at k = 1, the wss is at around 2600. As we divide the data into k = 3 clusters, the wss drops to 2000. Because the wss keeps falling as k grows, we do not simply pick the k with the lowest wss. Instead, we look for the value on the x-axis beyond which adding another cluster yields only a small further reduction in the wss. Practically speaking, we choose the k where the curve begins to flatten out and form an elbow. This ultimately becomes a judgement call. In the graph below, I choose k = 4 as my optimal number of clusters, because the curve is “going down” more rapidly at k < 4 than it does at k > 4.
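The numbers behind an elbow curve can be computed with a short loop over candidate values of k. The sketch below is a hedged illustration rather than the code used for the graph in this post: `kmeans_wss` is a hypothetical helper, the four-blob dataset is synthetic, and using five random starts per k is my own choice.

```python
import numpy as np

def kmeans_wss(X, k, n_iter=100, n_init=5, seed=0):
    """Lowest within-group sum of squares over n_init random starts of plain k-means."""
    rng = np.random.default_rng(seed)
    best = np.inf
    for _start in range(n_init):
        centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
        for _step in range(n_iter):
            # Squared distance from every point to every centroid.
            d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
            labels = d2.argmin(axis=1)
            updated = np.array([
                X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
                for j in range(k)
            ])
            if np.allclose(updated, centroids):
                break
            centroids = updated
        # wss: squared distance from each point to its nearest final centroid.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        best = min(best, d2.min(axis=1).sum())
    return best

# Synthetic data with four blobs (an assumption for illustration only).
rng = np.random.default_rng(1)
centers = np.array([[0, 0], [6, 0], [0, 6], [6, 6]])
X = np.vstack([rng.normal(c, 0.7, size=(60, 2)) for c in centers])

# One wss value per candidate k, from 1 to 15 as in the post.
wss = {k: kmeans_wss(X, k) for k in range(1, 16)}
for k, value in wss.items():
    print(f"k={k:2d}  wss={value:8.1f}")
```

Plotting these values against k gives the elbow curve. At k = 1 the wss is simply the total sum of squares around the overall mean, which is why the curve always starts at its highest point.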
When I run the k-means clustering algorithm with k = 4 on the survey respondents who report not having a banking product, this population is divided into 4 clusters. As you can see in the graph below, the overall unbanked population is not evenly distributed across these 4 clusters: there are many more respondents in Cluster 3 than in the other clusters.
Now that the computer has helped us group our population, the onus is on us to make sense of these 4 clusters: Who might belong to one cluster and not the next?
This exercise is more art than science, and we need to be creative in order to derive insights. To do so, I pulled a number of characteristics from the survey and computed averages per cluster. Below is a graph indicating how each cluster compares along these characteristics. The y-axis shows the set of characteristics available in the survey, while the x-axis represents the percentage of each cluster’s population that has a given characteristic.
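Computing these per-cluster averages is simple once every respondent has a cluster label. The following is a small sketch under stated assumptions: the characteristic names and the six rows of 0/1 flags are made up for illustration and are not the survey’s actual column names or data.

```python
import numpy as np

# Hypothetical characteristic names (not the survey's real column names).
characteristics = ["female", "secondary_degree", "in_workforce"]

# Made-up 0/1 flags, one row per respondent, one column per characteristic.
flags = np.array([
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
    [0, 0, 1],
    [1, 0, 0],
    [0, 1, 1],
])
# Made-up cluster label for each respondent.
cluster = np.array([1, 1, 2, 2, 1, 2])

def share_by_cluster(flags, cluster):
    """Map each cluster id to the percentage of its members with each characteristic."""
    return {
        int(c): 100 * flags[cluster == c].mean(axis=0)
        for c in np.unique(cluster)
    }

shares = share_by_cluster(flags, cluster)
for c, pct in shares.items():
    print(c, dict(zip(characteristics, pct.round(1))))
```

Because each flag is 0 or 1, the mean of a column within a cluster is exactly the share of that cluster’s members who have the characteristic, which is the quantity on the x-axis of the graph.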
For example, let’s take a look at the first characteristic: “Female”. What the graph below shows is that within Cluster 2 (denoted by the green dot), around 15% of the population is female. In other words, Cluster 2 has a lot more males than females, and so we might be able to say that Cluster 2 contains mostly males. Of course, that is not the only characteristic of Cluster 2. So let’s analyze each cluster on its own to be able to come up with a suitable description for each.
Interpreting Cluster 1
Members of Cluster 1 are more likely to be female than male, because almost 90% of the group’s population is female. Members of Cluster 1 are not likely to have a secondary degree (less than 40% have one), but a larger proportion of Cluster 1 has a secondary degree compared to Clusters 2, 3 and 4. We can apply a similar logic to being in the top 40% of income: while they are not likely to be in the top 40% of income, they are more likely than members of other clusters to belong to that income bracket. Members of Cluster 1 are not likely to be in the workforce, as only ~10% are part of the workforce, and they are not likely to receive wage payments. However, they are very likely to be able to come up with emergency funds, generally do not borrow (either from family, or for medical reasons), and are unlikely to receive domestic remittances. Two additional data points not shown in the graph above: the average age within Cluster 1 is 34 years, and the main source of their emergency funds is their family.
We can thus describe members of Cluster 1 as young females, who are unemployed, likely to be middle-income, and rely on family for emergency funds.
Interpreting Cluster 2
Members of Cluster 2 are likely to be male, unlikely to have a secondary degree, likely to be employed in the workforce and likely to receive wage payments. A slight majority of Cluster 2 can come up with emergency funding, but they are unlikely to borrow from family or receive domestic remittances. The average age within Cluster 2 is 40 years.
We can thus describe members of Cluster 2 as middle-aged males, who are employed (likely in a low-pay, formal job), can come up with emergency funding, but are generally on their own (cannot rely on family).
Interpreting Cluster 3
We can describe members of Cluster 3 as middle-aged females, who are unemployed, have the lowest income levels amongst the 4 clusters, and cannot come up with emergency funds nor can they rely on family.
Interpreting Cluster 4
We can describe members of Cluster 4 as middle-aged females, who are unemployed, are generally low-income, but have relied on family and are more likely to borrow for medical needs. Their reliance on family is also evident in their being more likely to receive domestic remittances.
This is a useful exercise for the banking industry to understand who its unserved customers are, and how it can create products that fulfill the needs of these customers.
In the example above, the segment that banks in Morocco can serve most immediately is Cluster 2, whose members are employed, receive wage payments, and do not rely on family for emergency funds. Banks can offer this “market segment” a low-fee checking account or incentivize them to open a savings account with attractive rates. Similarly, members of Cluster 4 can benefit from a medical insurance product or a low-fee checking account to receive domestic remittances.
This clustering exercise is, of course, imperfect, given that we are missing some crucial data points on customer demographics and behaviors. Such data includes number of dependents, income levels, type of work (formal vs. informal), and spending on other products (e.g., airtime and internet). A next step for the banking industry in Morocco is to conduct a similar, more thorough market research survey to understand the needs of the unbanked.