Deciphering Beyonce’s Lemonade: Insights using text mining, sentiment analysis and NLP
I learnt the techniques mentioned in this post largely from the Text Mining with R Skill Track on DataCamp.
I remember the day Beyonce dropped her “Lemonade” album, without prior notice. I was in university in the US, and like all my friends, I was very busy preparing for my final exams. That’s usually a time when we collectively try to stay away from social media and focus on our exam prep. Alas, Queen B decided otherwise. As soon as the visual album dropped, everybody lost it on social media (ok, I’m exaggerating, but you get the point). Lemonade was a phenomenal album, not only because it took us all by surprise and was musically solid, but also because it told a vulnerable story. My good friend DJ Hammer (the music expert in my friend group!) once told me that the best music albums are those that tell a cohesive story: each song can function on its own but is best enjoyed as part of a whole.
Because I knew that about Lemonade, I decided to use it as my first experiment in learning text mining, sentiment analysis and natural language processing. I did it partly to see if the computer can guess what a song is all about, and mostly because I’m a pop music junkie and this would be a fun thing to do :)
Of course, applications of these text mining techniques go beyond just analyzing lyrics. We can scrape Twitter to understand the wave of sentiments related to a period of time (e.g., Coronavirus, the BLM protests, the Arab Spring) or we can scrape product reviews to understand more richly what consumers think about a particular product.
Throughout the analyses below, I use the geniusR package, which helps you fetch information about artists, albums, and songs from the popular lyrics website: genius.com. To use it, you only need to sign up for a free account on genius.com and receive a set of log-in and password information. Think about this procedure as a form of web scraping where all the underlying work is already done for you by the geniusR package, so you just need to call the functions to get the information to your workspace.
Let’s try different text mining techniques to see if we can extract meaning from the Lemonade album.
After loading the data from the geniusR package into my local workspace, I prepare my dataset for analysis:
(a) First, I download all the songs within the Lemonade album and stitch them together in one dataset. I make sure to index each song with its title and its order within the album.
(b) Second, I use the tidytext package to arrange the dataset such that each word in the lyrics gets its own row, which will come in handy later on. Of course, each row retains its characteristics (e.g., which song it belongs to).
(c) Third, I remove what are known as stop words: the pronouns, connectors and other grammatical words that are frequent in any language but do not tell us anything about meaning. Stop words include words such as “the”, “he”, “and”, “or”. These words have been compiled in lists that are widely available.
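To make steps (b) and (c) concrete, here is a minimal sketch of the same idea. The analysis in this post was done in R with tidytext; the snippet below is a stand-alone Python illustration, not the code behind the charts. The song titles, the text, and the tiny stop-word list are all made up for the example.

```python
import re

# Toy stand-in for the lyrics table: (track_number, title, text).
# These lines are invented for illustration; they are not real lyrics.
songs = [
    (1, "Song A", "I run and you catch me"),
    (2, "Song B", "The love we had was sweet, sweet love"),
]

# A tiny hypothetical stop-word list; real lists hold hundreds of entries.
STOP_WORDS = {"i", "and", "you", "me", "the", "we", "had", "was", "he", "or"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def one_word_per_row(songs):
    """Return one (track_number, title, word) row per non-stop word."""
    rows = []
    for track, title, text in songs:
        for word in tokenize(text):
            if word not in STOP_WORDS:
                rows.append((track, title, word))
    return rows


tidy_rows = one_word_per_row(songs)
for row in tidy_rows:
    print(row)
```

Each row keeps the track number and title, so any later count can be grouped by song.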
A. Most Frequently Used Words
Now that my data has been cleaned and prepared for analysis, the first and most obvious step would be to look at the most frequent words in the album to see if there’s any recurring theme that we can extract.
Okay… once we plot a graph with the most frequently used words (words that are repeated more than 6 times across the songs in the album), we can try to discern what the major themes are in this album: We can see that the word “love” is the most frequently used, followed by “slay” and then “daddy”. While these words all have positive connotations, we can also see some words that might suggest some negative thoughts, such as “hurt”, “shoot” or “hard”.
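Counting words and keeping only those above a threshold takes very little code. The sketch below is a Python illustration with invented counts (they are not the album's real numbers); `MIN_COUNT` is a name I introduce here to mirror the "more than 6 times" cut-off.

```python
from collections import Counter

# Hypothetical cleaned tokens for the whole album; the counts are made up.
words = ["love"] * 8 + ["slay"] * 7 + ["hurt"] * 3 + ["sweet"] * 2

MIN_COUNT = 6  # keep words repeated more than 6 times, as in the chart

counts = Counter(words)
# most_common() sorts by frequency, highest first.
frequent = [(word, n) for word, n in counts.most_common() if n > MIN_COUNT]
print(frequent)
```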
Hmmm, so what can we conclude for now? This is an album with positive emotions (e.g., “love”, “sweet”), some negative emotions (e.g., “hurt”, “catch”) and some words of empowerment (e.g., “slay”, “forward”).
But as it is, we can’t really conclude much beyond some basic things one expects every pop music album to have: themes of love, hurt and some empowerment slogan. So we’ll need to dig deeper to understand what this album is all about.
B. Using Sentiment Analysis to classify words into themes
Now that we’ve explored the basics, the next step is to join sentiment dictionaries with our dataset. Sentiment dictionaries are off-the-shelf data files that classify frequently used words by the feelings or sentiments they convey. For example, the word “scared” conveys the sentiment “fear”, and the word “happy” conveys the sentiment “joy”.
There are plenty of dictionaries that different people have created and made available for anyone to use for free. One such dictionary that is widely used is the NRC dictionary, created by Saif Mohammad and Peter Turney.
When I join the dictionary to my data, the words of my dataset are classified into the sentiments represented in the NRC dictionary. Because I use an inner join, any word in my dataset that does not appear in the NRC dictionary is dropped from the dataset altogether. That is indeed a major limitation of these dictionaries: they cannot cover every possible word in the world, and extending them by hand would defeat the purpose of letting the computer do the work.
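Here is a small sketch of what that inner join does, written in Python for illustration. The mini-lexicon is hypothetical: “scared” and “happy” follow the examples above, while the entry for “hurt” is a placeholder I made up, not an actual NRC entry.

```python
from collections import Counter

# Hypothetical mini-lexicon mapping a word to one or more emotions.
EMOTION_LEXICON = {
    "scared": ["fear"],
    "happy": ["joy"],
    "hurt": ["sadness", "anger"],  # placeholder entry, not from NRC
}

# Hypothetical cleaned tokens; "slay" is deliberately absent from the lexicon.
words = ["hurt", "happy", "slay", "scared", "hurt"]


def inner_join(words, lexicon):
    """Pair each word with each of its emotions; unmatched words are dropped."""
    return [(w, emotion) for w in words for emotion in lexicon.get(w, [])]


joined = inner_join(words, EMOTION_LEXICON)
emotion_counts = Counter(emotion for _, emotion in joined)
dropped = sorted({w for w in words if w not in EMOTION_LEXICON})

print(emotion_counts)
print("dropped:", dropped)
```

Note that a word can map to several emotions, so the joined table can have more rows than there were matched words, and anything missing from the lexicon silently disappears.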
So what does this look like for the Lemonade album? (To remove the disproportionate effect of the word “love” as seen above, I exclude it from this analysis)
Woah! It looks like once we exclude the word “love”, this album actually inspires more negative feelings than positive ones! It looks like Beyonce expresses negative thoughts such as anger, fear and sadness more frequently than she does positive feelings such as joy and trust!
Okay… so if the feelings are mixed in this album, let’s try to see if there are any differences in sentiment from one song to the next. To do that, let’s go back to analyzing the most frequent words in each song, as shown in the graph below.
It’s difficult to guess what these songs are all about just by looking at the most frequent words in them, but let’s give it a try:
- Song 1: Could be about uncertainty: Beyonce is conveying hesitation and may even be wishing to “run” where someone cannot “catch” her.
- Song 2: Here we can see themes of “craziness”, “wickedness” and “jealousy”.
- Song 3: Clearly the overarching theme here is “hurt”
- Song 4: Ouch, this could be a break-up song! We can see “bye”, “middle fingers” and “nah” (spoiler, it is!)
- Song 5: This song is probably about her husband and baby daddy Jay-Z. She might not be exactly happy with him given the word “shoot” and “trouble”
- Song 6: This song still does not inspire positivity, with words such as “drought” and “stop” being prominent.
- Songs 7 and 8: Okay… positive feelings start to appear here, with words such as “forward” and “promise”
- Song 9: This song, with words like “run”, “freedom” and “winner”, inspires escape and, well... freedom.
- Song 10: In this song, just the frequency of the word “slay” inspires empowerment and positivity.
Okay, so from what I could guess, it looks like some songs in this album are “sad” and “negative”, while others inspire “positivity” and “freedom”. It’s also probable that the first songs (1 through 6) are more negative than positive, and the last songs (7 through 10) are more positive than negative.
To check whether this is true, and to see if the computers can guess it on their own, let’s go back to our dictionaries and use a different one this time: the bing dictionary, developed by Bing Liu. Unlike the NRC dictionary with its many emotion categories, the bing dictionary classifies words simply as “positive” or “negative”. A useful application in our example is to look at the proportion of positive and negative words within each song, and verify that the first few songs indeed have more negative than positive words, and so on.
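The per-song proportions can be sketched as follows. This is a Python illustration under stated assumptions: `POLARITY` is a made-up four-word stand-in for the bing dictionary, and the rows are invented rather than taken from the album.

```python
from collections import defaultdict

# Hypothetical two-class lexicon standing in for the bing dictionary.
POLARITY = {
    "hurt": "negative",
    "hell": "negative",
    "sweet": "positive",
    "love": "positive",
}

# Hypothetical (track_number, word) rows; not taken from the real album.
rows = [
    (1, "hurt"), (1, "hell"), (1, "love"), (1, "run"),
    (2, "love"), (2, "sweet"), (2, "hurt"), (2, "sweet"),
]


def polarity_shares(rows, lexicon):
    """Return {track: {"positive": share, "negative": share}} per song."""
    tallies = defaultdict(lambda: {"positive": 0, "negative": 0})
    for track, word in rows:
        label = lexicon.get(word)
        if label is not None:  # words outside the lexicon are ignored
            tallies[track][label] += 1
    shares = {}
    for track, tally in tallies.items():
        total = tally["positive"] + tally["negative"]
        shares[track] = {label: n / total for label, n in tally.items()}
    return shares


shares = polarity_shares(rows, POLARITY)
for track in sorted(shares):
    print(track, shares[track])
```

Working with shares rather than raw counts keeps long and short songs comparable.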
Wow! The computers more or less confirmed what we guessed. The graph shows that the first 6 songs are more likely to contain negative words than positive ones, while the last 4 are more likely to be positive than negative.
So could it be that this album narrates a story that starts off sad and gets increasingly more joyful?
Let’s group the first 6 songs as “Group 1”, presumably the “sad songs”, and the last 4 songs as “Group 2”, presumably the “happier songs”, and use the NRC dictionary again to spell out themes within the song. This time, we’ll use a chartJSRadar chart to illustrate the point!
This graph, while scary to look at at first, is actually a great visual to compare songs in Group 1 and those in Group 2. As you can see, Group 1 songs have more words that inspire “fear”, “sadness”, and even “disgust” and “anger”. On the other hand, Group 2 songs have words that represent “trust”, “joy” and “surprise”!
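A radar chart like this one needs a single number per emotion for each group. Here is a rough Python sketch of how those numbers could be computed; the rows are invented, and normalizing to shares (rather than plotting raw counts) is my assumption for the example, not necessarily what the chart above uses.

```python
from collections import Counter

# Order of the axes on the radar chart (a subset of emotions, for brevity).
EMOTIONS = ["anger", "disgust", "fear", "joy", "sadness", "surprise", "trust"]

# Hypothetical (track_number, emotion) rows, as produced by a lexicon join.
rows = [
    (1, "fear"), (2, "sadness"), (3, "anger"), (6, "fear"),
    (7, "joy"), (8, "trust"), (9, "surprise"), (10, "joy"),
]


def group_of(track):
    """Songs 1 to 6 form Group 1; songs 7 to 10 form Group 2."""
    return "Group 1" if track <= 6 else "Group 2"


def radar_profiles(rows):
    """Return {group: [share of each emotion, in EMOTIONS order]}."""
    counts = {"Group 1": Counter(), "Group 2": Counter()}
    for track, emotion in rows:
        counts[group_of(track)][emotion] += 1
    profiles = {}
    for group, counter in counts.items():
        total = sum(counter.values())
        profiles[group] = [counter[e] / total if total else 0.0 for e in EMOTIONS]
    return profiles


profiles = radar_profiles(rows)
for group, values in profiles.items():
    print(group, dict(zip(EMOTIONS, values)))
```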
Like the bing dictionary, the NRC dictionary also confirms what we thought was happening in this album: Beyonce starts off with negative feelings and finishes with positive ones.
C. That was too much work… Can the computer do this on its own?
Is there a technique that can magically group these songs into some common themes without us having to look for dictionaries, append them to our datasets, and count words? The Latent Dirichlet Allocation (LDA) method attempts to do just that, but is no replacement for the human mind!
LDA belongs to the family of unsupervised learning, a collection of machine learning techniques that aim to search for patterns in existing data. Think about it as clustering, except for words (which are not continuous variables, but rather discrete).
So in this case, how would LDA help?
All I am going to do is prepare the data in a form that LDA understands (grouping words by song and re-arranging the data frame as a matrix with one row per song and one column per word, where each cell holds a word count) and call the LDA function from the LDA package. The LDA method then proceeds to group the words into a pre-determined number of clusters (in this case, I choose k = 2), with the assumption that each cluster will be part of an overarching theme. In this case, let’s see if LDA will group our words into one cluster that expresses “sadness / negativity” and one that expresses “joy / positivity”.
In the graph below, I plot the most prominent words in each cluster that the LDA has identified.
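If you are curious what happens under the hood, below is a minimal sketch of LDA fitted with collapsed Gibbs sampling, written in plain Python. It is an illustration only, not the R function used for this post: the function name `lda_gibbs`, the hyperparameters `alpha` and `beta`, and the four toy "songs" are all assumptions I introduce for the example, and library implementations may use a different estimation method.

```python
import random


def lda_gibbs(docs, k=2, alpha=0.1, beta=0.01, iterations=200, seed=0):
    """Fit LDA by collapsed Gibbs sampling on a list of token lists.

    Returns (vocab, topic_word, doc_topic), where topic_word[t][v] and
    doc_topic[d][t] are smoothed probabilities.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_vocab = len(vocab)

    doc_topic_n = [[0] * k for _ in docs]             # tokens of doc d in topic t
    topic_word_n = [[0] * n_vocab for _ in range(k)]  # word v counts in topic t
    topic_n = [0] * k                                 # total tokens in topic t

    # Start from a random topic assignment for every token.
    assignments = []
    for d, doc in enumerate(docs):
        doc_assign = []
        for w in doc:
            t = rng.randrange(k)
            doc_assign.append(t)
            doc_topic_n[d][t] += 1
            topic_word_n[t][word_id[w]] += 1
            topic_n[t] += 1
        assignments.append(doc_assign)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v, t = word_id[w], assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic_n[d][t] -= 1
                topic_word_n[t][v] -= 1
                topic_n[t] -= 1
                # Unnormalized probability of each topic for this token.
                weights = [
                    (doc_topic_n[d][j] + alpha)
                    * (topic_word_n[j][v] + beta) / (topic_n[j] + n_vocab * beta)
                    for j in range(k)
                ]
                t = rng.choices(range(k), weights=weights)[0]
                assignments[d][i] = t
                doc_topic_n[d][t] += 1
                topic_word_n[t][v] += 1
                topic_n[t] += 1

    topic_word = [
        [(topic_word_n[t][v] + beta) / (topic_n[t] + n_vocab * beta)
         for v in range(n_vocab)]
        for t in range(k)
    ]
    doc_topic = [
        [(doc_topic_n[d][t] + alpha) / (len(doc) + k * alpha) for t in range(k)]
        for d, doc in enumerate(docs)
    ]
    return vocab, topic_word, doc_topic


# Four toy "songs" built from invented word lists (not real lyrics).
docs = [
    ["hurt", "hell", "bye", "hurt", "catch"],
    ["hurt", "bye", "hell", "catch", "bye"],
    ["love", "slay", "kiss", "sweet", "love"],
    ["slay", "love", "sweet", "kiss", "slay"],
]
vocab, topic_word, doc_topic = lda_gibbs(docs, k=2)
for t, probs in enumerate(topic_word):
    top = sorted(zip(vocab, probs), key=lambda p: p[1], reverse=True)[:4]
    print(f"topic {t + 1}:", [w for w, _ in top])
```

The sampler only ever sees word counts per song. Which topic ends up as "topic 1" is arbitrary, and naming the topics is still left to us.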
Woah! Not bad!
While not perfect, this classification did group together a lot of “negative” words in group 1 (e.g., “hurt”, “catch”, “hell”, “bye”), and did the same for “positive” words in group 2 (e.g., “love”, “slay”, “kiss”, “sweet”).
Obviously, the LDA method itself did not give us labels for these two groups. It merely displayed a partition of the most frequent words into two groups that are most similar within themselves. The onus is always on us to explain the two clusters and derive meaning from them.
Conclusion (and the gist on Lemonade — contains spoilers!)
These text mining, sentiment analysis and natural language processing techniques were all useful in this exercise, but they are far from perfect. I could have spent one hour listening to every song in the Lemonade album and I would have quickly concluded what this is all about: Beyonce blaming Jay-Z for cheating on her in the first few songs, and then forgiving him and choosing to stay in the marriage in the last few songs. Of course, even after an entire afternoon spent on this analysis, the fancy data science techniques did not get us to that obvious conclusion. They merely got us to the point that this album tells a story: one that starts off with pain and finishes with joy. Even with their limitations, I think it’s quite impressive that these techniques got us where we got to!
As I said in the intro, applications of this kind of work are plentiful and can be useful. Whether the source is Twitter data, product reviews, Reddit threads or other qualitative material, you can apply text mining to all kinds of industries and research questions. However, the implications of these tools for our future are scary: they can easily be used for mass surveillance.
And so we don’t leave each other on a gloomy note, I would like to direct you to watch the Lemonade album.
Again, thank you so much if you’ve read all the way to here!