Insights from NLP with Music Lyrics

Brian G Herbert
8 min read · Nov 23, 2022


About a year ago I built a Python app that created a corpus of music lyrics and performed several NLP analyses on it. Lyrics in the corpus are categorized by music genre (such as rock, hip-hop, firstwave, country) and identified with an artist. Since then, I have revisited a social media analytics app that I had built and refactored it to hold its data within objects instantiated from custom class definitions, similar to my lyrics app. All of this gave me some ideas about managing content for NLP apps and about the often tedious process of formatting and scrubbing data.

Surprising result: matching test lyrics to trained vectors scores higher with more artists in the training set

I had several higher-level objectives when I set out to design and build the app. The nature of this corpus (an aggregate of thousands of relatively small units, each a song or 'track'), the need to group and analyze by categories like genre and artist, and a set of common tasks to perform on the units or groups all begged for object-oriented containers built from custom Python classes. I could even use class inheritance to get some of the polymorphic behavior I wanted from the various types of corpus containers.

Next, I wanted to train document vectors from containers of lyrics at both the genre and artist level. Python's class definition features were a great fit to support this. By defining a generator-iterator in the built-in __iter__ method that any Python object can implement, I could feed any of my lyrics container instances directly to the training algorithms of a vectorizing package like gensim (I can't say enough about gensim: it is very powerful, yet even a 'part-time' developer like myself can use it effectively). Instead of scripting the selection, formatting, and streaming of particular lyrics, once I instantiated a group of lyrics I simply passed the object to gensim's Doc2Vec.
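Here is a minimal sketch of what that looks like (mm_rap is a container instance from the example later in this article; the hyperparameter values are illustrative). Because the container defines __iter__, gensim can restart iteration for each of its passes over the corpus, which a bare one-shot generator would not allow:

from gensim.models.doc2vec import Doc2Vec

# build the model, then make one vocabulary pass and several training
# passes over the container; each pass calls iter(mm_rap) afresh
model = Doc2Vec(vector_size=100, min_count=2, epochs=40)
model.build_vocab(mm_rap)
model.train(mm_rap, total_examples=model.corpus_count, epochs=model.epochs)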

Firstwave: Matching Test Set to Vector Model Trained on Artist Corpus
Hip-Hop: Matching Test Set to Vector Model Trained on Artist Corpus

I've worked with Python for four years and have been around enterprise Java for about 20, but I'd never seen a problem that lent itself so perfectly to object-oriented class definitions, including inheritance, polymorphism, and data encapsulation. I've now done NLP work on several distinct types of written content: white papers and press releases, tweets and short social media posts, and music lyrics. Each differs in its use of slang, symbolism, and colloquialisms, as well as in grammar and structure. Researchers have spent years on these issues and solved many of them; I don't aim to reinvent the wheel, I simply seek the most adaptable approach so that I can apply best practices for a particular type of content.

Best-case matching of unseen lyrics to the trained Doc2Vec model:
19 out of 23 unseen Madness lyrics had their top match on Madness;
5 out of 23 had 100% of their top ten matches on Madness tracks.
Each test evaluated 7,343 tracks from 27 different artists.
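For context, matching an unseen track against a trained model looks roughly like this in gensim (the trained model and the tokenized test lyrics are assumed to already exist):

# infer a vector for the unseen track, then rank the trained document tags
inferred = model.infer_vector(test_lyrics_tokens)
top_ten = model.dv.most_similar([inferred], topn=10)
for tag, score in top_ten:
    print(tag, round(score, 3))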

Data formatting and scrubbing take up most of the total effort on these projects, but they have to be applied intelligently based on downstream needs. For example, the VADER package is outstanding for calculating sentiment on tweets, but it works best and gives consistent results when it is fed raw content. Twitter has its own grammar, set of idioms, and usage of punctuation and symbols, and these can vary across localities, generational cohorts, and other social groupings. VADER was built to handle these, and it also allows customization of lexicons, idioms, and rules (I suggest resisting the temptation to mess with these unless an omission in the package has been identified that is needed for a special case). It can be tempting to run basic data scrubbing as part of upstream processing, but this will distort sentiment scoring. A Python class that serves as a 'container' and also applies and keeps track of the formatting and scrubbing applied to its data protects one's sanity! Before I adopted this method for NLP work, I sometimes missed or duplicated steps while trying to determine data state and 'the' optimal sequence of tasks.
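A quick illustration of the point about raw content, using the vaderSentiment package (the example text is mine): capitalization, repeated punctuation, and emoticons all feed the score, so scrubbing them upstream would change the results:

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
# raw tweet-style text: the caps, '!!!', and emoticon all boost the score
print(analyzer.polarity_scores("AWESOME show tonight!!! :)"))
# a scrubbed version of the same text scores noticeably lower
print(analyzer.polarity_scores("awesome show tonight"))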

Container classes have instance methods for many NLP tasks and can be applied at the track, artist, genre, or corpus level. Each of 135 artists showed a unique profile for word choice, variety, and frequency.

There are some intermediate tasks that must be run to support certain NLP and vector analyses, such as tf*idf calculation, word-frequency tables, or bag-of-words (BOW) analysis. I added these as instance methods on my container classes, with logic to track whether they have been called and whether they need to be run again before their results are served. I created dictionaries of artists by genre which 'controlled' the corpus creation process. Once I was satisfied with the resulting corpus for each genre, I ran the container instantiation. In this case, with my underlying data staying constant, I could run intermediate tasks like tf*idf by artist once per container instance. But I didn't want to design based on that assumption: I might add artists to a genre at any time, and more often than not containers will need to respond to changes in the underlying data and update their derived calculations.
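The pattern amounts to a 'dirty flag' on derived results. Here is a hypothetical sketch of that logic (names are illustrative, not from the actual app):

# recompute derived results only when the underlying data has changed
class LyricsContainer:
    def __init__(self, tracks: dict):
        self.tracks = tracks          # artist -> list of word tokens
        self._word_freq: dict = {}
        self._stale: bool = True      # flipped whenever underlying data changes

    def add_artist(self, artist: str, words: list):
        self.tracks[artist] = words
        self._stale = True            # derived results must be recomputed

    def word_freq(self) -> dict:
        if self._stale:               # rerun the calculation only when needed
            self._word_freq = {}
            for words in self.tracks.values():
                for wrd in words:
                    self._word_freq[wrd] = self._word_freq.get(wrd, 0) + 1
            self._stale = False
        return self._word_freq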

Why Lyrics?

I've been an amateur musician for many years, and 'amateur' may overstate it! I mostly play guitar, ukulele, and mandolin and sing at home, and I record a few tracks for SoundCloud or to send to friends. My kids opened me up to more styles of music, so while 'rock' and 'firstwave' were my top genres, I have also become interested in 'hip-hop'. I wanted to combine my musical interests with my Python analytics interests and look at similarities and differences in lyrics at the level of both musical genre and individual artist.

Why Container Classes?

In this article I've focused on my takeaways regarding application design. I defined my data model and objectives, then wrote scripts and functions to build my corpus and begin analyzing it. It was at that point, when I saw the complexity that was emerging, that I had the epiphany regarding container classes. I had identified 8 musical genres and about 135 artists for representative corpora. The classes made it simpler to execute multi-step NLP analyses on various groupings of genres and/or artists. Inheritance allowed me to override my first attempts: in some cases my methods were too inefficient or couldn't handle additional hierarchical segmentation of the corpus, so I enhanced my model by deriving a new class and adding more robust methods. This ended up not being wasteful, as it pushed me toward polymorphism, with more generic methods on ancestor classes and more detailed, fine-grained methods on descendant classes. It's just a matter of instantiating the correct class based on the data and the granularity of results that I need.
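A hypothetical sketch of that ancestor/descendant split (class and method names are illustrative, not from the actual app):

class CorpusContainer:
    """Ancestor: generic methods that work on any grouping of lyrics."""
    def __init__(self, name: str, tracks: list):
        self.name = name
        self.tracks = tracks          # each track is a list of word tokens

    def word_count(self) -> int:
        return sum(len(trak) for trak in self.tracks)

class ArtistContainer(CorpusContainer):
    """Descendant: finer-grained methods for a single artist's corpus."""
    def __init__(self, name: str, tracks: list, genre: str):
        super().__init__(name, tracks)
        self.genre = genre

    def long_tracks(self, min_words: int) -> list:
        # a detail method that only makes sense at artist granularity
        return [trak for trak in self.tracks if len(trak) >= min_words]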

With each NLP application I've built, I formed a conceptual model that would let me iterate toward efficiency as well as simplicity of operation. I started out interactively scripting the extraction, formatting, and cleansing steps and the calculation of derived or intermediate data. Then I refactored into function definitions with passed parameters. Finally, I zeroed in on custom 'container' classes, primarily with instance methods. Now I can pass the containers around and stream particular data from them using Python's generator-iterator features. Everything in Python is an object, and every object has a class definition. The definition controls how a corresponding object is instantiated and how it will behave during its existence. That is my layman's explanation; Python has an extraordinarily clean, common-sense design, so much so that I feel I can describe it.

# get core attributes from aggregator (helper to instantiate MusicalMeg objects)
# aggregator collects artists, tracks, and folders by genre
rap_core = Genre_Aggregator(gen='rap')
firstw_core = Genre_Aggregator(gen='firstwave')

# instantiate genre containers from core
mm_rap = MusicalMeg(ga_obj=rap_core)
mm_firstw = MusicalMeg(ga_obj=firstw_core)

# show tf*idf histograms for all rap artists
for artst in mm_rap.artists:
    plot_artist_tfidf(lobj=mm_rap, artst=artst)


# MusicalMeg - class definition
# (Genre_Aggregator, feed_specific_artists, and LYRICDIR are defined
# elsewhere in the app; the calc_* methods are instance methods not shown)
from gensim.corpora import Dictionary
from gensim.models.doc2vec import TaggedDocument

class MusicalMeg:
    def __init__(self, ga_obj: Genre_Aggregator, lydir: str = LYRICDIR):
        self.filter_tfidf: bool = True
        self.stream_td: bool = True
        self.folder = lydir
        if isinstance(ga_obj, Genre_Aggregator):
            self.genre = ga_obj.genre
            self.trax = ga_obj.trax
            self.artists = ga_obj.artist_list
        self.words_raw_count: int = 0
        self.word_freq: dict = {}
        self.corpbow = []
        self.corpdic: Dictionary = Dictionary()
        self.word_artists: dict = {}
        self.artist_words: dict = {}
        self.words_trak: list = []
        self.idf_words: dict = {}
        self.tf_by_artist: dict = {}
        self.tf_by_trak: dict = {}
        self.tfidf_artist: dict = {}
        self.tfidf_trak: dict = {}
        self.tfidf_cutoff_pct: int = 30
        self.calc_word_freq()
        self.calc_tf_by_trak()
        self.calc_idf_for_words()

    # iter method- uses a python generator via 'yield' to stream lyrics
    def __iter__(self):
        if self.artists:
            for artist in self.artists:
                trak_ct: int = 0
                artst_words: int = 0
                tmp: str = str(artist).replace(" ", "").replace(".", "")
                art_lyrfile: str = str(self.genre) + "_" + tmp + ".lyr"
                if self.trax.get(artist):
                    traxdict = self.trax.get(artist)
                    art_taglst = list(traxdict.keys())
                else:
                    art_taglst = []
                # this function streams each track for a given artist
                for lyrs, artst in feed_specific_artists(artlst=art_lyrfile, srcdir=self.folder):
                    if art_taglst:
                        try:
                            doctag = art_taglst[trak_ct]
                        except IndexError:
                            doctag = artist + str(trak_ct).rjust(3, "0")
                            print("Index out-of-range for registry: %s" % doctag)
                    else:
                        print("custom tag needed %s trak %d" % (artist, trak_ct))
                        doctag = artist + str(trak_ct).rjust(3, "0")
                    lyrics_tok: list = [wrd for wrd in lyrs]

                    if lyrics_tok:
                        # self.stream_td selects doc-vector vs word-vector training
                        if self.stream_td:
                            yield TaggedDocument(words=lyrics_tok, tags=[doctag])
                        else:
                            # if not tagged docs, just stream word-tokenized lyrics
                            yield lyrics_tok

                    artst_words += len(lyrics_tok)
                    trak_ct += 1
                self.artist_words[artst] = artst_words

As I mentioned in comparing white papers, tweets, and music lyrics, each type of textual content has characteristic traits, from 'document' and file size and quantity to word and symbol use. Despite the differences, the types and sometimes the sequence of tasks overlap considerably. Whatever the type of textual content, there is also a need to simplify and report on processing through some kind of metadata approach. All of this is facilitated by using OO containers. In the MusicalMeg definition above, in addition to the existing attributes, I can add variables that track formatting and scrubbing operations on the content. This could be a nominal value that identifies state; any downstream task could then check that container content is in an acceptable state and, if not, run the required formatting or cleansing.
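A hypothetical sketch of that state-tracking idea (the attribute and its values are illustrative, not part of MusicalMeg):

class TrackedContainer:
    def __init__(self):
        self.clean_state: str = "raw"   # nominal state: raw, formatted, or scrubbed

    def scrub(self):
        # ... formatting and cleansing of content would run here ...
        self.clean_state = "scrubbed"

    def require_state(self, needed: str):
        # downstream tasks call this to verify (or trigger) preprocessing
        if needed == "scrubbed" and self.clean_state != "scrubbed":
            self.scrub()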

I have an older version of gs_lyrics uploaded to a repository on my GitHub account at https://github.com/briangalindoherbert. This week I will upload the updated version with improved container classes and some additional functions for Doc2Vec and LDA analysis.
