The curious case of applying NLP to Twitter
How to adapt tf*idf to a corpus of Tweets
Term Frequency / Inverse Document Frequency (tf*idf) is now seen as ‘old-school’ for identifying important words or phrases in a corpus. For one thing, the best mathematical justification for it comes from Zipf’s law, a heuristic observation that a term’s frequency is roughly inversely proportional to its rank in the frequency table — which means the most frequent terms tend to carry the least distinguishing information.
We need to tune out the ordinary without “throwing the baby out with the bathwater”: terms must be frequent enough to matter, but not so frequent as to be ordinary, such as prepositions.
tf*idf tempers the raw frequency of occurrence of a word or n-gram (a sequence of n words; a bi-gram, for example, is a two-word phrase) by multiplying it by the log of the ratio of total documents in the corpus to the number of documents that contain the word.
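In code, that definition is compact. Here is a minimal sketch; the function and argument names are mine, for illustration only:

import math

def tfidf_for_term(term_count, doc_len, n_docs, docs_with_term):
    # term frequency: raw occurrences tempered by document length
    tf = term_count / doc_len
    # inverse document frequency: log of (total docs / docs containing the term)
    idf = math.log(n_docs / docs_with_term)
    return tf * idf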
You call that a %#?! Document?!
Most NLP functions use the concept of a ‘document’ in their algorithms. The document is the building-block element of a ‘corpus’, and a collection of more than one corpus is known as ‘corpora’ (the plural of ‘corpus’). The issue is that when dealing with articles, press releases, web page content…just about any application of NLP other than tweets…documents are at least several sentences long, often much longer.
Tweets are limited to 280 characters (and were limited to just 140 until late 2017). That means the values we will find for term frequency with tweets will be very different from those found when analyzing corpora from a different medium: in a 20-word tweet, a word appearing once already has a raw term frequency of 0.05, a level a word in a 500-word article only reaches by appearing 25 times.
The top 10 terms by frequency in the corpus for my Superleague dataset are shown below, from the output of a method I wrote.
show_topten_dict(word_cnt)
0. league
1. super
2. football
3. clubs
4. fans
5. european
6. uefa
7. out
8. club
9. chelsea
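The helper itself is just a ranked dump of a word-count dictionary; a rough equivalent, assuming word_cnt maps each word to its corpus count:

from collections import Counter

def show_topten_dict(word_cnt: dict):
    # print the ten highest-count words, most frequent first
    for rank, (wrd, _cnt) in enumerate(Counter(word_cnt).most_common(10)):
        print(f"{rank}. {wrd}")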
This frequency list doesn’t give us information we can use, though, to determine what is important (what terms form the building blocks of topics) or what is unimportant (what terms we can put in a STOP list to remove from the corpus). For that we need to turn either to older methods like tf*idf, or to newer machine learning methods, to understand the importance of words.
There are multiple methods for calculating term frequency:
- the ratio of the count of occurrences of the target word to the count of total words within the document (COUNT method).
- the ratio of the count of the target word to the count of the most frequent word, or ‘top’ word in the document (TOP method).
- the ratio of the count of the target word to the count of unique words in the document (UNIQ method).
There are other calculation methods, but these are the three I implemented as parameter options for my calc_tf method. The core of the code is sketched below.
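This sketch implements the three definitions above exactly as listed; the word_tokens flag mirrors my real signature and is assumed here to mean the input arrives already word-tokenized:

from collections import Counter

def calc_tf(docs, word_tokens=True, calctyp="COUNT"):
    # docs: word-tokenized documents (each a list of words) when word_tokens=True
    # returns one {word: term frequency} dict per document
    tf_docs = []
    for doc in docs:
        if not doc:
            continue
        counts = Counter(doc)
        if calctyp == "COUNT":
            denom = len(doc)                     # total words in the document
        elif calctyp == "TOP":
            denom = counts.most_common(1)[0][1]  # count of the most frequent word
        elif calctyp == "UNIQ":
            denom = len(counts)                  # count of unique words
        else:
            raise ValueError(f"unknown calctyp: {calctyp}")
        tf_docs.append({wrd: cnt / denom for wrd, cnt in counts.items()})
    return tf_docs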
The formal definition of tf*idf also calculates its value for a word within a document within a larger corpus. I have not seen a formal definition of a tf*idf value for a word across the overall corpus: should it be the average of the document-level tf*idf values for that word, the sum of those values, or something else?
I have experimented with the three term frequency calc types and the two aggregate tf*idf types, and found that using UNIQ for tf and AVG for aggregating resulted in the best separation of important or distinctive language from unimportant language. How did I determine optimal results? I generated wordclouds and compared them to my own understanding of the content from having read many of the threads of tweets.
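Either aggregation is a short roll-up. Below is a sketch mirroring my corpus_tfidf parameter; one assumption to flag is that AVG here averages over only the documents in which the word actually appears, not over all documents:

def corpus_tfidf(tfidf_docs, calctyp="AVG"):
    # tfidf_docs: one {word: tf*idf} dict per document
    totals, n_appear = {}, {}
    for doc in tfidf_docs:
        for wrd, val in doc.items():
            totals[wrd] = totals.get(wrd, 0.0) + val
            n_appear[wrd] = n_appear.get(wrd, 0) + 1
    if calctyp == "SUM":
        return totals
    # AVG: mean over the documents containing the word (an assumption)
    return {wrd: totals[wrd] / n_appear[wrd] for wrd in totals}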
I used tf*idf values to generate an automatic STOP list of unimportant words, and then generated wordclouds from the resulting filtered set of word-tokenized Tweets:
Wordcloud from tf*idf filtered Superleague tweets
Sorry about the uncensored language, but this was produced without any manual selection or machine learning: just traditional tf*idf used to filter out STOP words, and a wordcloud generated from what remained. It captures a lot of the important themes that I detected from reading through the threads on my own.
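The stop-list builder behind that is nothing exotic. In the sketch below, the cutoff (treating the bottom ten percent of corpus tf*idf scores as ‘ordinary’) is an assumed placeholder value; the idea is what matters:

def do_tfidf_stops(corpus_scores, pct=0.10):
    # corpus_scores: {word: corpus-level tf*idf}
    # the lowest-scoring words are frequent-but-ordinary: STOP list candidates
    ranked = sorted(corpus_scores, key=corpus_scores.get)
    cutoff = max(1, int(len(ranked) * pct))
    return set(ranked[:cutoff])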
To run this through, I created a number of functions which generate intermediate results and eventually produce the STOP lists and filtered datasets. Here is how I set up the scripts to generate multiple tf calcs and multiple aggregate tf*idf calcs; I am showing just tf calc type TOP here, with both aggregate types (SUM and AVG) at the end:
# first do some cleaning and prepping prior to initiating the tf*idf calcs:
limrt_cln = gsutil.remove_parens_stops(words_t4d, stop2=STOP_ADD)
limrt_cln = gsutil.do_stops(limrt_cln, stop1=STOP_ESL1, stop2=STOP_TWEET)
limrt_cln = gsutil.do_stops(limrt_cln, stop1=GS_STOP, stop2=ad_hoc)
inp_tfidf = gsutil.do_start_stops(limrt_cln)
words_rtclean = gsutil.do_wrd_tok(inp_tfidf)
tw_wrd_scrub: list = []
for sent in words_rtclean:
    # only keep purely alphabetic words (punctuation was already stripped earlier)
    tmplst: list = []
    for wrd in sent:
        if wrd.isalpha():
            tmplst.append(wrd)
    tw_wrd_scrub.append(tmplst)
# term frequency using calc type TOP (COUNT and UNIQ are run the same way):
wf_TOP = t2i.calc_tf(tw_wrd_scrub, word_tokens=True, calctyp="TOP")
tws_pertop = t2i.count_tweets_for_word(wf_TOP)
# idf does not vary by term frequency type, so a single calc will do
idf_by_wrd = t2i.calc_idf(wf_TOP, tws_pertop)
tfidftop = t2i.calc_tfidf_new(wf_TOP, idf_by_wrd)
tfit_fin = t2i.corpus_tfidf(tfidftop, calctyp="SUM")
tfit_finav = t2i.corpus_tfidf(tfidftop, calctyp="AVG")
# stop list for the SUM aggregate type:
TFIT_SUM = t2i.do_tfidf_stops(tfit_fin)
# or stop list for the AVG aggregate type:
TFIT_AVG = t2i.do_tfidf_stops(tfit_finav)
# filter the 'junk' words out of the corpus using each stop list:
tw_tfit_fin = gsutil.do_stops(tw_wrd_scrub, stop1=TFIT_SUM, stop2=GS_STOP)
tw_tfic_fin = gsutil.do_stops(tw_wrd_scrub, stop1=TFIT_AVG, stop2=GS_STOP)
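For completeness, the idf and combine steps in that script reduce to standard formulas. These are hypothetical sketches of count_tweets_for_word, calc_idf, and calc_tfidf_new, not my production versions:

import math

def count_tweets_for_word(tf_docs):
    # number of tweets (documents) whose tf dict contains each word
    doc_counts = {}
    for doc in tf_docs:
        for wrd in doc:
            doc_counts[wrd] = doc_counts.get(wrd, 0) + 1
    return doc_counts

def calc_idf(tf_docs, doc_counts):
    # idf = log(total docs / docs containing the word)
    n_docs = len(tf_docs)
    return {wrd: math.log(n_docs / cnt) for wrd, cnt in doc_counts.items()}

def calc_tfidf_new(tf_docs, idf_by_wrd):
    # per-document tf*idf: multiply each word's tf by its corpus-wide idf
    return [{wrd: tf * idf_by_wrd[wrd] for wrd, tf in doc.items()} for doc in tf_docs]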
With the various filtered datasets I then generated a wordcloud for each and compared them. Here is the call for one of those; you can extrapolate to figure out the additional calls :-) :
gsPT.do_cloud(tw_tfit_fin, opt_stops=None, maxwrd=125)
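do_cloud itself is a thin wrapper around the wordcloud package; here is a minimal stand-in, where everything beyond the parameter names in the call above is assumption:

import matplotlib.pyplot as plt
from wordcloud import WordCloud

def do_cloud(token_docs, opt_stops=None, maxwrd=125):
    # flatten the word-tokenized tweets back into one blob of text
    text = " ".join(wrd for doc in token_docs for wrd in doc)
    wc = WordCloud(max_words=maxwrd, stopwords=opt_stops, background_color="white")
    plt.imshow(wc.generate(text), interpolation="bilinear")
    plt.axis("off")
    plt.show()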
For comparison, below is a word cloud using term frequency type TOP (ratio of the target’s count to the count of the most frequent word in the doc) and tf*idf aggregating type SUM (summing the tf*idf values found for the target across the individual documents).
Next time, I’m going to drill down into what I’ve done with Gensim topic modeling using this same codebase and Twitter corpora.
The lesson learned for me has been to think through the differences between social media posts and other types of ‘documents’, and where those differences impact the assumptions inherent in certain NLP algorithms. It wasn’t until I got ‘deep’ into working with the Twitter data model that I felt comfortable actually tweaking some of the variables or constants in models to improve their performance.
All the Best,
Brian