An Object-Oriented Approach to Managing Data for NLP or ML

Brian G Herbert
Dec 29, 2021


Can plotting choice and frequency of word use tell us about an artist’s creativity? Can machine learning identify a popular artist from just a couple of previously unseen lines of their lyrics? What country artist’s lyrics fit best with the Hip-Hop genre? What similarities and differences can I find across the genres of music to which I listen?

It is not just the shape but x-axis position that distinguishes word use between artists

I set out to create a lyrics corpus made up of a list of artists across several genres, such as 'rap', 'alternative', 'rock', 'punk', 'country', 'firstwave' and 'pop'. I am a fan of a diverse range of music, partly from being a guitarist and singer most of my life. I wanted to build a platform I could use to explore lyrics for personal interest, but also to gain professional experience with word embeddings, phrase detection, LDA topic modeling, and 'paragraph vector' models, particularly gensim's Doc2Vec.

Wordcloud from lyrics of legendary Atlanta rap duo OutKast

With a slight adjustment to the usual entities of the nlp schema, Doc2Vec could be a powerful way to look at lyric styles and word choices between artists and across genres. What most nlp articles refer to as a 'document', in this app I refer to as a 'music track', aka a single song. Similarly, a 'sentence' becomes a line in a verse or chorus of a song. This article is about how I built a set of objects to simplify applying nlp to music lyrics, but here is a sneak peek at some early results from building a Doc2Vec model from my rap genre object:

Test lyrics sources: 1. Yelawolf, 2. Kendrick Lamar, 3. TechN9ne, 4. Lil Jon, 5. OutKast

-- test 1 most similar docs --
('Yelawolf144', 0.7901421189308167)
('Yelawolf134', 0.7844868302345276)
('TechN9ne542', 0.643967866897583)
('Jeezy338', 0.6077883243560791)

-- test 2 most similar docs --
('KendrickLamar002', 0.7180711030960083)
('KendrickLamar029', 0.7117006778717041)
('OutKast183', 0.5632824301719666)
('OutKast182', 0.5611984133720398)

-- test 3 most similar docs --
('TechN9ne275', 0.7073992490768433)
('TechN9ne216', 0.6986358761787415)
('Drake013', 0.6495011448860168)
('KanyeWest302', 0.6161069273948669)

-- test 4 most similar docs --
('LilJon017', 0.7342942953109741)
('WizKhalifa188', 0.6358765959739685)
('KanyeWest137', 0.6336359977722168)
('KendrickLamar116', 0.607702374458313)

-- test 5 most similar docs --
('OutKast184', 0.8478816747665405)
('OutKast130', 0.8465065956115723)
('OutKast051', 0.8424118757247925)
('OutKast111', 0.8394500613212585)

There is a lot to unpack and explain in these results, detail I defer to my next article, but they are intriguing with respect to consistency in word choice and lyrical style. That was just a teaser; this article is about the OO model I built to simplify leveraging my lyrics corpus with nlp.

WordCloud for the lyrics of U2

First, I'll explain my approach to building the lyrics corpus. There are multiple sources for lyrics: some sites have APIs, and others require HTML scraping with a tool like Beautiful Soup. I opted for Genius.com, since after signing up for a Genius account I received a token to access the API. With Genius I also had the option of using the excellent python package lyricsgenius, which wraps the API calls and lets me simply call python methods to get artist, album, track, and lyrics data.

I instantiate the core lyricsgenius object, Genius, passing it my API token (xxx'ed out above). I wrap each of the lyricsgenius calls with my own methods, so I can filter the results to my preferred format and add retry logic or try-except clauses for exception handling. At first get_lyrics was hitting HTTPError, JSONDecodeError, and Timeout errors, but it runs reliably now. Part of that fix was also adjusting the parameters I use to instantiate Genius, as seen in the above code.
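As an illustration of that retry logic (a minimal stdlib sketch, not the repo's actual code), a decorator like the following can wrap any flaky network call; the name with_retries and its parameters are my own assumptions:

```python
import time
from functools import wraps

def with_retries(max_tries=3, delay=1.0, exceptions=(Exception,)):
    """Retry a flaky call a few times, with linear backoff, before giving up."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_tries + 1):
                try:
                    return fn(*args, **kwargs)
                except exceptions:
                    if attempt == max_tries:
                        raise  # out of retries: surface the original error
                    time.sleep(delay * attempt)  # back off a bit longer each time
        return wrapper
    return decorator
```

In a wrapper method, this would decorate the method that calls into lyricsgenius, with exceptions set to the HTTPError, JSONDecodeError, and Timeout classes mentioned above.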

The result of this block of code is a set of ".lyr" lyrics files in LYRICDIR, one for each artist, named "<genre name>_<artist name>.lyr". A single 'registry' file is also created for each genre. That file has a python dict for each artist, with key = song 'tag' and value = artist name plus track name, as in:

{
    'Wu-TangClan131': 'Wu-tang-clan-cream',
    'Wu-TangClan132': 'Wu-tang-clan-method-man',
    'Wu-TangClan133': 'Wu-tang-clan-protect-ya-neck'
}

Generating this registry at the same time as the lyrics file keeps the tags and the actual lyrics in sync. Once I began using Doc2Vec and understood that it needed to be passed a stream of lyrics in TaggedDocument format, I decided to produce this track registry and use it in the object's generator-iterator, so that when testing and using models I would have traceability back to the name of a song, its artist, and the source lyrics.
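To make the registry idea concrete, here is a hedged sketch of how such a dict could be generated; build_registry and slugify are hypothetical helpers, and the real values come from Genius's own URL slugs, so they can differ slightly from what this approximation produces:

```python
def slugify(text):
    """Approximate a Genius-style slug: lowercase alphanumerics joined by dashes."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in text)
    return "-".join(cleaned.split())

def build_registry(artist, titles, start=1):
    """Map a track tag (artist name + zero-padded index) to a track slug,
    mirroring the per-artist registry dict shown above."""
    registry = {}
    for i, title in enumerate(titles, start=start):
        tag = f"{artist}{i:03d}"
        registry[tag] = slugify(f"{artist} {title}").capitalize()
    return registry
```

The important property is that the tag index is assigned at the same moment the lyrics file is written, which is what keeps the two in sync.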

There are always data-scrubbing hassles with any nlp project; the biggest one on this project was artist name mismatches with the names on Genius. Sometimes multi-byte Unicode or extended ASCII characters appeared in a name: there are about five different dash ("-") characters in Unicode. There is also the zero-width space (\u200b), a real bugger because it is invisible and you can only detect it with code! It was embedded in one artist name, and it took me a while to add it to my detection and cleaning code. There is the option of passing a numeric Genius artist ID in most calls, but of course you first have to have that artist ID, which is an extra, manual step. I opted to search by name and simply add to my text-cleaning code as errors came up.
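A minimal sketch of the kind of cleaning this requires; clean_artist_name is a hypothetical helper, and the character sets below cover only the offenders mentioned here (the dash variants and invisible zero-width characters):

```python
import unicodedata

# Invisible characters that can hide inside names scraped from the web.
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\ufeff"}
# A few of the many Unicode dash variants, all mapped to a plain hyphen.
DASHES = {"\u2010", "\u2011", "\u2012", "\u2013", "\u2014", "\u2212"}

def clean_artist_name(name: str) -> str:
    out = []
    for ch in name:
        if ch in ZERO_WIDTH:
            continue          # drop invisible characters entirely
        out.append("-" if ch in DASHES else ch)
    # NFKC normalization folds remaining compatibility characters.
    return unicodedata.normalize("NFKC", "".join(out)).strip()
```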

Creating Objects to Manage the Corpus

I started by defining a fairly simple class, LyricalLou, which would 'own' the lyrics within a genre: it could provide information on the vocabulary and word frequencies, overall as well as by artist. One of its most important features was an __iter__ method that streamed the slice of the lyrics corpus the object 'contained'. As I added functionality to the app I used inheritance and defined child classes; rather than digress into those details, I'll skip to the current evolution of my OO approach to the corpus:

Excerpts from code for Genre_Aggregator and MusicalMeg class definitions

The evolution of my object approach made everything much simpler. An instance of Genre_Aggregator is responsible for finding all the lyrics for a genre, identifying the artists, and pulling the song registries for those artists. Because it takes on this responsibility, instantiating one of the core MusicalMeg objects is much simpler and more reliable.

Creation of a support class greatly simplified instantiating the core lyrics object

Genre_Aggregator allowed me to do away with lots of conditional code in the __init__ method that constructs a lyrics object, and it reduced the chance of omissions or conflicts during construction. Here is how the code now works to create a lyrics object that will 'own' and perform various nlp tasks on all the alternative rock artists and lyrics in the corpus:

alt_core = Genre_Aggregator(gen='alternative')
mm_alt = MusicalMeg(ga_obj=alt_core)

That's it. This object can then perform nlp tasks on what it 'owns', as well as use its generator-iterator construct to train nlp models:

mm_alt.calc_tfidf_by_trak()
mm_alt.calc_tfidf_by_artist()
arts, artvals = check_outliers(mm_alt)
doc_model = document_skipgram(mm_alt, passes=12, grpsize=10, dim=100, thrds=4, dm=0)

By default, iterating over the object generates word-tokenized lists by track in gensim TaggedDocument format, but this can be modified by changing the state of instance variables. The boolean instance variable controlling the output format is:

self.stream_td: bool = True
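To make the toggle concrete, here is a minimal stdlib sketch of an object with this shape; LyricsStreamer is a stand-in for my actual classes, and the plain (words, [tag]) tuple mirrors the two-field shape of gensim's TaggedDocument:

```python
class LyricsStreamer:
    """Sketch of a corpus object whose __iter__ honors stream_td.
    Tracks are held here as a dict of tag -> raw lyrics; in the real
    classes they are streamed from the .lyr files on disk."""
    def __init__(self, tracks):
        self.tracks = tracks
        self.stream_td: bool = True  # True: tagged pairs; False: plain word lists

    def __iter__(self):
        for tag, text in self.tracks.items():
            words = text.lower().split()
            if self.stream_td:
                yield (words, [tag])  # same shape as gensim's TaggedDocument
            else:
                yield words
```

Because the object is re-iterable, it can be passed directly to model training, and flipping stream_td changes what the same object streams.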

In the following code, I generate train and test corpus slices as simple word-tokenized lists to feed Phrases identification:

Using the alternative rock Object to build and test a Phrases model

I'll explain a few pieces of the above code. First, I set the stream_td instance variable on mm_alt to False, so that it streams word-tokenized lists rather than TaggedDocuments. In the first for loop, I use the object's knowledge of the number of tracks (self.words_trak) to set the number of iterations, and I pass the object itself, whose __iter__ method streams the lyrics.

tfidf plot of TMBG differs significantly from others- right-shift of x-axis values

In the second for loop, I take advantage of the fact that the iterator still streams tags along with the word-tokenized lyrics, so I pick those up (test_tagsource.append(td[1])). This gives me traceability: I can match a particular test back to the name of the artist and the song, and even pull the lyrics for that song.
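The train/test flow those two loops implement can be sketched roughly as follows; split_with_traceability is a hypothetical helper, not the code in the screenshot, and it assumes the corpus yields (words, [tag]) pairs:

```python
import random

def split_with_traceability(tagged_stream, test_frac=0.1, seed=42):
    """Split a stream of (words, [tag]) pairs into train and test slices.
    Test documents keep their tags in a parallel list, so any test result
    can be traced back to the artist and song it came from."""
    rng = random.Random(seed)
    train_docs, test_docs, test_tagsource = [], [], []
    for words, tags in tagged_stream:
        if rng.random() < test_frac:
            test_docs.append(words)
            test_tagsource.append(tags[0])  # keep the tag for traceability
        else:
            train_docs.append(words)
    return train_docs, test_docs, test_tagsource
```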

I want to give proper credit: my idea to build objects to manage slices of my lyrics corpus came from a post by guru Radim Řehůřek of RaRe Technologies. He explained how to leverage Python generator and iterator capabilities to stream a corpus and reduce its memory footprint. I took his advice, then expanded my concept of how I could use custom classes to manage a large corpus of lyrics.

There is a lot more to tell about this music project in future posts. I wanted to keep this article focused on the OO framework, as I think it could be applied to projects with other specialized corpora, particularly when dealing with distinct slices of a corpus, as I do here with music genres. I've posted my code to the gs_lyrics repo on my GitHub account: https://github.com/briangalindoherbert.

Update January 10, 2022

I have continued to build flexibility into my framework for the lyrics app, particularly in how I split out training and testing data for the models and how I capture results by musical genre.

Recently I have been running tests on the doc2vec models by genre, using previously unseen lyrics to retrieve the top 10 most similar tracks. I measure how often the artist who was the source of the unseen lyrics appears in the top ten most similar tracks returned, and plot the aggregate results by artist. I posted an article on this at https://linkedin.com/in/briangalindoherbert.
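That measurement can be sketched as follows; top_n_hit_rate and the shape of the results dict are my own assumptions for illustration, not the repo's code. It relies only on the tag convention shown earlier (artist name followed by a track number):

```python
import re
from collections import defaultdict

def artist_of(tag):
    """Strip the trailing track number from a tag like 'OutKast184'."""
    return re.match(r"\D+", tag).group(0)

def top_n_hit_rate(results, n=10):
    """results maps each held-out track's tag to the ranked list of
    (tag, similarity) pairs the model returned. Returns, per artist,
    the fraction of queries whose true artist appears in the top n."""
    hits, totals = defaultdict(int), defaultdict(int)
    for query_tag, sims in results.items():
        artist = artist_of(query_tag)
        totals[artist] += 1
        if any(artist_of(t) == artist for t, _ in sims[:n]):
            hits[artist] += 1
    return {a: hits[a] / totals[a] for a in totals}
```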

There are many different approaches to identifying topics, as well as to mapping authors to a corpus of documents. I took the approach of using document-level vectors to identify similarities across tracks by a given artist. It made me realize I have only scratched the surface of both the implementation and interpretation of vector embeddings for lyrics.
