In prior posts, I used storm to filter the twitter sample stream for religious tweets and then used elasticsearch to perform simple analytics. Since then, I've accumulated about 1 million religious tweets. Now the challenge is: how to gain insights into this mass of 140-character messages? After trying to use mahout to analyze the tweets, I decided to try scikit-learn. Luckily, there is an excellent example that uses Non-negative Matrix Factorization (NMF) to generate "topics" for a text corpus.
In the example, each topic is an array of terms extracted by the TfidfVectorizer, ordered by the term weights calculated by NMF. Here's an example "topic" extracted from tweets:
[u'impossible possible', u'cast anxiety', u'anxiety cares', u'said mortals', u'mortals impossible', u'cares peter', u'cares 1peter', u'possible said', u'men impossible', u'1peter cast', u'possible mat', u'worries cast', u'rid worries', u'cares cares', u'worries cares', u'peter rid', u'said unto', u'unto men', u'beheld said', u'possible men']
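For context, here is a minimal sketch of that kind of pipeline, condensed from the scikit-learn topic-extraction example; corpus is assumed to be a list of text documents, and the n_features, n_topics and n_top_words values are placeholders:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

n_features, n_topics, n_top_words = 1000, 10, 20

# corpus: a list of text documents (here, the concatenated tweet blobs described below)
vectorizer = TfidfVectorizer(max_features=n_features, stop_words='english')
tfidf = vectorizer.fit_transform(corpus)

nmf = NMF(n_components=n_topics, random_state=1).fit(tfidf)
feature_names = vectorizer.get_feature_names()  # get_feature_names_out() in newer scikit-learn

for topic in nmf.components_:
    # top n_top_words terms for this topic, highest weight first
    print([feature_names[i] for i in topic.argsort()[:-n_top_words - 1:-1]])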
5 Changes to the NMF Example
As with any code snippet one finds on the web, certain changes are required to make the example fit a specific application. Here are the 5 changes I made to the NMF topic extraction example to make it work with religious tweets.
1. Concatenate similar tweets into single documents within the corpus
When preparing the tweets for analysis, I concatenate similar tweets into one giant text blob, then pass several of these concatenated documents to the vectorizer and NMF. This scales a little better than tokenizing and analyzing thousands of tweets as separate documents. A sketch of the grouping step is below.
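This is a minimal sketch of that concatenation step, assuming a hypothetical group_key() function that buckets similar tweets (for example, by the bible verse they cite):

from collections import defaultdict

def build_corpus(tweets, group_key):
    # group_key is a hypothetical function mapping a tweet to its bucket,
    # e.g. the bible verse it cites
    groups = defaultdict(list)
    for tweet in tweets:
        groups[group_key(tweet)].append(tweet)
    # one concatenated document per group of similar tweets
    return [' '.join(group) for group in groups.values()]

# corpus = build_corpus(tweets, group_key)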
2. Ensure the number of requested topics does not exceed the number of 'documents' in the corpus.
This is really simple, but it prevented the majority of the failures I encountered when running NMF. A simple check like the one below fixed my issues.
n_topics = min(n_samples, n_topics)
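For context, a small sketch of how the clamped value feeds into NMF; tfidf stands for the matrix returned by the vectorizer and n_samples for the number of concatenated documents in the corpus:

from sklearn.decomposition import NMF

n_topics = min(n_samples, n_topics)  # never request more topics than documents
nmf = NMF(n_components=n_topics, random_state=1).fit(tfidf)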
3. Use min-gram=2 and max-gram=2
I found that single-word tokens were noisy, but setting ngram_range=(2, 2) revealed useful bigrams that reflected natural language patterns.
vectorizer = TfidfVectorizer(max_features=n_features, ngram_range=(2,2))
4. Use a stop word list
I created a stop-word list to filter out tokens that are not interesting, such as bible verse citations (e.g. John 3:16), translation acronyms (e.g. NIV, KJV) or common twitter strings (e.g. RT, retweet). Then I passed the stop word list to the TfidfVectorizer.
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

stoplist = ['retweet', 'rt', 'http', 'nlt', 'kjv']
# bv_tokens holds the bible verse citation tokens described above
vectorizer.set_params(stop_words=set(list(ENGLISH_STOP_WORDS) + stoplist + bv_tokens))
counts = vectorizer.fit_transform(corpus)
5. Capture the weights and terms returned by NMF
I used the weights returned by NMF to rank the results of the analysis. Capture the weights like this:
for topic_idx, topic in enumerate(nmf.components_):
    # indices of the top-weighted terms for this topic
    top_indices = topic.argsort()[:-n_top_words - 1:-1]
    print [{'text': feature_names[i], 'weight': topic[i]} for i in top_indices]
The output is something like this:
[{'text': u'impossible possible', 'weight': 0.45413113168606384}, {'text': u'cast anxiety', 'weight': 0.40382848799298487}, {'text': u'anxiety cares', 'weight': 0.40382848799298487}, {'text': u'said mortals', 'weight': 0.2794653118068085}, {'text': u'mortals impossible', 'weight': 0.2794653118068085}]
Topic Analysis in Action
People often tweet bible verses for religious holidays. What surprised me were the religious tweets sent for secular holidays. I used the NMF analysis to show the most interesting "topics" for religious tweets on Memorial Day 2014. The terms returned within a topic seed several elasticsearch queries that return the phrases displayed by bakkify, and the ordering uses the weights returned by NMF. A sketch of that seeding step follows.
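As a rough sketch of that seeding step, assuming an elasticsearch index named 'tweets' with a 'text' field (both names are placeholders) and the elasticsearch-py client:

from elasticsearch import Elasticsearch

es = Elasticsearch()

def search_topic_terms(topic_terms):
    # topic_terms is the list of {'text': ..., 'weight': ...} dicts captured from NMF,
    # already ordered by weight; each term seeds one phrase query
    results = []
    for term in topic_terms:
        hits = es.search(index='tweets',
                         body={'query': {'match_phrase': {'text': term['text']}}})
        results.append({'term': term, 'hits': hits['hits']['hits']})
    return results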
The analysis correlated 'Psalm 33:12' with "Happy Memorial Day" when I analyzed over 10k tweets sent the week of 2014-05-26.
Click here to see the Memorial Day topic analysis on bakkify.com. I also ran the analysis for other holidays such as New Year's, Valentine's Day and Thanksgiving.