Lyrical Progression of Led Zeppelin
Dec. 26, 2024
The goal of this analysis was to explore the progression of Led Zeppelin's lyrics over time using machine learning techniques. Specifically, I aimed to assess whether there was a noticeable change in the lyrical characteristics of their songs from album to album using features such as word diversity, happiness, sentiment, sentence length, readability, and word homogeneity, among others. I hypothesized that Led Zeppelin's song lyrics did undergo progression from 1969 to 1982. The analysis was performed using three techniques: classification, regression, and unsupervised learning, and was programmed in Python using Jupyter notebooks from the Anaconda distribution.
The dataset initially contained a range of features, including word diversity, sentiment, sentence length, and the frequencies of positive, negative, and happiness words. During data preprocessing, I encountered a few issues. Rows 2 and 28 had values offset to the right, which I corrected by shifting them back to the left. Row 57 carried the class label "Other songs" and did not specify an album or year. While I could have imputed the average year for this entry, I decided to remove it, as it didn't provide any additional information for analyzing Led Zeppelin's specific albums.
Additionally, some columns had ill-formatted names, such as Emoticon :) / :-), Emoticon :( / :-(, and PRP$ frequency, and were removed. Other columns had spelling errors in their names, such as hapiness words frequency, lanuages freqency, familiy frequency, and Soundex homoganity hist bin 0, which I corrected. Features whose values were all 0 were removed to avoid confusing the classifier and to improve the quality of the analysis. Finally, I trimmed the directory path from the values in the Path column, leaving only the file name.
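The column cleanup described above can be sketched with pandas; the toy frame, its values, and the file names below are hypothetical stand-ins for the real dataset:

```python
import pandas as pd

# Toy frame standing in for the real dataset (hypothetical values;
# the misspelled column names echo the ones described above).
df = pd.DataFrame({
    "Path": ["lyrics/albums/song_a.txt", "lyrics/albums/song_b.txt"],
    "hapiness words frequency": [0.12, 0.08],
    "lanuages freqency": [0.0, 0.0],  # an all-zero feature column
    "Word diversity": [0.55, 0.61],
})

# Correct the misspelled column names.
df = df.rename(columns={
    "hapiness words frequency": "happiness words frequency",
    "lanuages freqency": "languages frequency",
})

# Drop feature columns whose values are all zero.
zero_cols = [c for c in df.columns if c != "Path" and (df[c] == 0).all()]
df = df.drop(columns=zero_cols)

# Trim the directory portion of Path, keeping only the file name.
df["Path"] = df["Path"].str.split("/").str[-1]
```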
For feature selection, I chose features of the lyrics that may reveal lyrical variation. The complete list of selected features includes:
- Quotation length mean
- Quotation length stddev
- Coleman–Liau index
- Emotion circumplex
- Total number of words
- Word diversity
- Automated readability index
- positive words frequency
- negative words frequency
- happiness words frequency
- happy words frequency
- Soundex diversity
- Soundex homogeneity mean
- Soundex homogeneity StdDev
- Sentence length mean
- Sentiment mean
- Sentiment stddev
- Word homogeneity mean
- Word homogeneity StdDev
- Sentiment skewness
- Sentiment maximum difference
- Sentiment mean difference
- Lemma diversity
Most of the features are self-explanatory, such as sentiment, positive and negative word frequency, and quotation length. Others require more explanation. Both the Coleman-Liau and Automated Readability indexes assess the readability of a given text and rate it by grade level; the Coleman-Liau index differs from ARI in that it was designed for computerized scoring based on character counts rather than counting syllables [1][2]. The emotion circumplex is an emotional classification model that places emotions in a two-dimensional circular space spanned by the opposing axes of arousal and valence; it is often used to assess the stimulation of emotional words [4]. Soundex is a phonetic algorithm that encodes names based on their pronunciation in English [5]. Finally, lemma diversity pertains to lemmatization in Natural Language Processing, which reduces words to their root or base forms, called lemmas [3]. Punctuation, specific noun frequencies, and other overly broad features were excluded to improve the quality of the analysis.
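For reference, the Coleman-Liau index has a simple closed form over letter, word, and sentence counts [1]; the sketch below computes it from raw counts supplied by the caller (the example counts are illustrative, not drawn from the lyrics dataset):

```python
def coleman_liau(letters: int, words: int, sentences: int) -> float:
    """Coleman-Liau index: estimates a U.S. grade level from letter and
    sentence counts rather than syllable counts."""
    L = letters / words * 100   # average letters per 100 words
    S = sentences / words * 100  # average sentences per 100 words
    return 0.0588 * L - 0.296 * S - 15.8

# Illustrative counts: 450 letters, 100 words, 5 sentences.
grade = coleman_liau(450, 100, 5)
```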
Classification
The classification step aimed to predict the album from which a song originated based on its lyrical features. To achieve this, I utilized a Random Forest Classifier, which is an ensemble learning method that builds multiple decision trees and aggregates their predictions [6]. This model was chosen for its reduced risk of overfitting, and ability to handle high-dimensional datasets [6], such as the one in this case, where the feature space was relatively complex with various sentiment-related and linguistic features.
Procedure
In the classification step, I began preprocessing by using LabelEncoder to convert the Class values into a numerical format. After classification, I applied inverse_transform to map the encoded class labels back to their original album names for easier interpretation in the confusion matrix chart. The Path and Class columns were dropped before the dataframe was passed to the train_test_split function. Following the cleaning process, I split the dataset into a training set and a test set using a 75-25 split, chosen to leave the classifier enough training data while retaining a meaningful test set. Unlike previous projects where I manually shuffled and split the data, I used the train_test_split function from sklearn to ensure replicability of the results. After some trial and error, a seed value of 42 was selected for the random_state parameter, as it yielded the best classification results.
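A minimal sketch of this encoding-and-splitting step, with a small hypothetical frame standing in for the real dataset (the column names and values are illustrative):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for the cleaned dataset.
df = pd.DataFrame({
    "Path": [f"song{i}.txt" for i in range(8)],
    "Word diversity": [0.50, 0.60, 0.55, 0.70, 0.65, 0.40, 0.45, 0.52],
    "Sentiment mean": [0.10, -0.20, 0.05, 0.30, 0.00, -0.10, 0.20, 0.15],
    "Class": ["Led Zeppelin I", "Led Zeppelin I", "Led Zeppelin II",
              "Led Zeppelin II", "Led Zeppelin III", "Led Zeppelin III",
              "Led Zeppelin IV", "Led Zeppelin IV"],
})

# Encode album names as integers for the classifier; inverse_transform
# later maps predictions back to album names.
le = LabelEncoder()
y = le.fit_transform(df["Class"])

# Drop the non-feature columns before splitting.
X = df.drop(columns=["Path", "Class"])

# 75-25 split with a fixed seed for replicability.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)
```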
For model training, I used a RandomForestClassifier, training it on the lyrical features from the training set. Once the model was trained, I proceeded to prediction, where the model assigned a predicted album label to each song in the test set; each prediction corresponded to one of Led Zeppelin's albums. To evaluate the model's performance, I used accuracy, precision, recall, and f1-score (see table 1).
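The training and evaluation loop can be sketched as follows, with sklearn's make_classification standing in for the lyrical features (the sample, feature, and class counts are assumptions for illustration, not the real data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 76 "songs", 20 lyrical features, 4 "albums".
X, y = make_classification(n_samples=76, n_features=20, n_informative=8,
                           n_classes=4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# Train the ensemble and predict an album label for each test song.
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Per-class precision/recall/f1 plus overall accuracy.
print(classification_report(y_test, y_pred))
print("accuracy:", accuracy_score(y_test, y_pred))
```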
Classification Report:

| class | precision | recall | f1-score | support |
|---|---|---|---|---|
| 0 | 1.00 | 1.00 | 1.00 | 1.00 |
| 1 | 1.00 | 1.00 | 1.00 | 1.00 |
| 2 | 1.00 | 1.00 | 1.00 | 1.00 |
| 3 | 1.00 | 1.00 | 1.00 | 1.00 |
| 4 | 1.00 | 1.00 | 0.50 | 0.67 |
| 5 | 1.00 | 1.00 | 0.50 | 0.00 |
| 6 | 0.00 | 0.00 | 1.00 | 1.00 |
| 7 | 1.00 | 1.00 | 1.00 | 1.00 |
| accuracy | | | 0.95 | 19 |
| macro avg | 0.89 | 0.83 | 0.85 | 19 |
| weighted avg | 1.00 | 0.95 | 0.96 | 19 |

Table 1
Results
Despite achieving a classification accuracy of 95%, which suggested that the model was making accurate album classifications, the confusion matrix highlighted Led Zeppelin IV-1971 as being misclassified (see figure 1). This could be due to the high similarity in lyrical characteristics between Led Zeppelin IV-1971 and Led Zeppelin III-1970, or to the model needing more data to distinguish the two albums. With that exception, the selected features provided enough distinction between albums for the model to accurately predict the correct one.
Figure 1
Regression
The regression step was focused on predicting the year of the song based on its lyrical characteristics. Instead of categorizing songs into specific albums as in classification, the goal here was to estimate the continuous year of release for each song, treating the year as a target variable. The approach used for this task was Linear Regression, a fundamental statistical method to model the relationship between a set of independent variables as predictors (features) and a continuous dependent variable as the target (year) [7].
Procedure
In the regression analysis, I utilized all the available features in the dataset to predict the year of each song. These features included linguistic metrics such as word diversity, sentiment, sentence length, the Coleman-Liau and ARI indexes, word homogeneity, lemma diversity, and others. Following the same approach as in the classification step, I split the dataset into a training set and a test set using a 75-25 split. For model training, I used LinearRegression, where the features served as predictors and the target variable was the year of the song. Once trained, the regression model was used to predict the year for each song in the test set. The model's performance was evaluated using the mean absolute error (MAE) and the correlation coefficient.
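A sketch of the regression step on synthetic data (the feature-to-year relationship below is invented purely for illustration, not taken from the lyrics dataset):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic stand-in: 76 "songs" with 10 features, where one feature is
# loosely correlated with the (continuous) release year.
n = 76
X = rng.normal(size=(n, 10))
year = 1969 + X[:, 0] * 2 + rng.normal(scale=2, size=n) + 6.5

X_train, X_test, y_train, y_test = train_test_split(
    X, year, test_size=0.25, random_state=42)

# Fit the linear model and predict a year for each test song.
reg = LinearRegression().fit(X_train, y_train)
y_pred = reg.predict(X_test)

# Evaluate with MAE and the correlation between actual and predicted years.
mae = mean_absolute_error(y_test, y_pred)
corr = np.corrcoef(y_test, y_pred)[0, 1]
```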
Results
The regression model achieved a mean absolute error of 2.78 years and a correlation coefficient of 0.73. These results indicate a strong relationship between the features and a song's year: on average, the predicted year fell within about three years of the actual year, suggesting that the model was able to make reasonably accurate predictions from the selected features (see figure 2).
Figure 2
Unsupervised Learning
For the unsupervised learning portion, I used K-Means clustering to group the songs based on their lyrical features without any prior knowledge of their labels (such as album or year). The idea was to observe whether the songs naturally clustered into meaningful groups based on their characteristics and to determine whether these clusters corresponded to albums. K-Means is a popular clustering algorithm that partitions the data into a predefined number of clusters (in this case, 4) [8]. It iterates to minimize the variance within each cluster, grouping similar items together [8]. I also used Principal Component Analysis (PCA) for dimensionality reduction, allowing for better visualization of the clusters in two dimensions [9].
Procedure
In the unsupervised learning analysis, I began by scaling the data using StandardScaler to account for the varying scales of the features, such as Coleman-Liau and ARI indexes, sentiment, word homogeneity, word and Lemma diversity, etc. This step ensured that all features contributed equally to the clustering process. Next, I applied PCA to reduce the data’s dimensionality to two components, allowing for a 2D visualization of the clustering results (PCA1 & PCA2). This step was critical for interpreting the data visually and assessing the effectiveness of the clustering.
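The scaling and projection steps can be sketched as follows (the random matrix with deliberately mixed feature scales stands in for the real feature table):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Stand-in feature table: 76 "songs" x 15 features on very different scales.
X = rng.normal(size=(76, 15)) * rng.uniform(0.1, 50.0, size=15)

# Standardize so each feature contributes equally to the clustering.
X_scaled = StandardScaler().fit_transform(X)

# Project onto two principal components for a 2D view (PCA1 & PCA2).
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)
```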
Following dimensionality reduction, I used K-Means to cluster the songs into four distinct groups. The choice of four clusters was based on the idea that Led Zeppelin's discography might reflect four major phases or eras. After clustering, I evaluated the results using a scatter plot of the two PCA components, with each cluster represented by a distinct color; the four clusters showed some distinction from one another.
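The clustering step itself is a short call to sklearn's KMeans; here four well-separated synthetic blobs stand in for the PCA-projected songs:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)

# Stand-in for the two PCA components of each song: four synthetic
# groups of 19 points each (76 total), centered at distinct locations.
centers = [(0, 0), (5, 0), (0, 5), (5, 5)]
X_2d = np.vstack([rng.normal(loc=c, scale=0.5, size=(19, 2))
                  for c in centers])

# Partition the songs into four clusters (four hypothesized "eras").
km = KMeans(n_clusters=4, n_init=10, random_state=42)
labels = km.fit_predict(X_2d)
```

Each entry of `labels` gives the cluster assignment used to color the scatter plot.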
Results
This unsupervised approach highlighted mostly distinct groupings in the dataset, suggesting real differences among songs across clusters. Although some points overlap between clusters 1 and 2 and between clusters 1 and 3, the distances between centroids indicate that the lyrical features of Led Zeppelin's songs did evolve over time, further corroborating the results of the classification and regression analyses.
Figure 3
Conclusion
The analysis of the progression of Led Zeppelin's lyrics over the years revealed measurable changes over time. Classification showed that the Random Forest model had significant success in correctly labeling each song; the confusion matrix confirmed that most songs were correctly classified except for Led Zeppelin IV-1971, suggesting that the model could be improved by training with more features. The clustering results suggested that the songs did exhibit differences across albums, as most clustered together with only a few points sharing space with other clusters. The regression model performed well and showed a positive relationship between the song features and the predicted year of release.
The final piece of this analysis is interpreting a select set of feature means. The following means were chosen to gauge the level of difference between album years: word diversity, sentiment, sentence length, quotation length, word homogeneity, and Soundex homogeneity. The sentence and quotation length means sit on a different scale from the other means because of how each feature is measured. The Soundex homogeneity and word diversity means remained roughly constant, while the sentence length and quotation length means showed some variation over the years. The sentiment mean showed some change but remained relatively stable throughout. The sentence length mean increased during Led Zeppelin's early years up to 1971, then decreased until stabilizing in 1976. The quotation length mean actually decreased in the first year, then increased dramatically up to 1973 and later decreased significantly, suggesting that Led Zeppelin used quotations less over time (see figure 4).
Figure 4
After reviewing all of the model results and the mean analysis, my conclusion is that Led Zeppelin's lyrics did undergo lyrical progression over the years, as evidenced by the high classification accuracy, the significant regression results, and the mostly distinct clusters. While there is some overlap in specific song features, the overall patterns indicate that Led Zeppelin experimented with their lyrical approach throughout their discography.
Bibliography
[1] B. Scott, “The Coleman-Liau Readability Formula,” ReadabilityFormulas.com, Sep. 27, 2023. https://readabilityformulas.com/the-coleman-liau-readability-formula/
[2] B. Scott, “How to Use the Automated Readability Index (ARI) for Clearer Communication,” ReadabilityFormulas.com, Jul. 19, 2023. https://readabilityformulas.com/the-automated-readability-index-ari/
[3] Hidevs Community, “Understanding Lemmatization in NLP - Hidevs Community - Medium,” Medium, May 04, 2024. https://hidevscommunity.medium.com/understanding-lemmatization-in-nlp-cc8091647aae (accessed Dec. 08, 2024).
[4] Wikipedia Contributors, “Emotion classification,” Wikipedia, Dec. 04, 2024. https://en.wikipedia.org/wiki/Emotion_classification
[5] Wikipedia Contributors, “Soundex,” Wikipedia, Sep. 19, 2019. https://en.wikipedia.org/wiki/Soundex
[6] IBM, “What Is Random Forest?,” www.ibm.com, 2023. https://www.ibm.com/topics/random-forest (accessed Dec. 03, 2024).
[7] IBM, “Linear regression,” www.ibm.com, Feb. 15, 2024. https://www.ibm.com/docs/en/db2/11.5?topic=building-linear-regression (accessed Dec. 03, 2024).
[8] IBM, “K-means clustering,” www.ibm.com, Nov. 14, 2024. https://www.ibm.com/docs/en/db2/12.1?topic=building-k-means-clustering (accessed Dec. 03, 2024).
[9] IBM, “What is principal component analysis?,” www.ibm.com, Dec. 08, 2023. https://www.ibm.com/topics/principal-component-analysis (accessed Dec. 03, 2024).