{"id":666,"date":"2014-12-17T03:48:42","date_gmt":"2014-12-17T03:48:42","guid":{"rendered":"http:\/\/technicalelvis.com\/blog\/?p=666"},"modified":"2014-12-17T10:40:00","modified_gmt":"2014-12-17T10:40:00","slug":"topic-analysis-of-religious-tweets","status":"publish","type":"post","link":"https:\/\/technicalelvis.com\/blog\/2014\/12\/17\/topic-analysis-of-religious-tweets\/","title":{"rendered":"Topic Analysis of Religious Tweets Using Scikit-Learn"},"content":{"rendered":"<p>In prior posts, I've used Storm to <a title=\"habakkuk starter\" href=\"http:\/\/technicalelvis.com\/blog\/2012\/06\/21\/habakkuk-starter\/\">filter the Twitter sample stream for religious tweets<\/a> and then used <a title=\"valentine's day with elasticsearch\" href=\"http:\/\/technicalelvis.com\/blog\/2013\/02\/20\/valentines-day-scripture-usage-on-twitter\/\">Elasticsearch to perform simple analytics<\/a>. Since then, I've accumulated about 1 million religious tweets. Now the challenge is how to gain insight into this mass of 140-character messages. After trying to use <a title=\"mahout analysis\" href=\"http:\/\/technicalelvis.com\/blog\/2013\/03\/28\/mahout-twitter-1\/\">Mahout to analyze the tweets<\/a>, I decided to try <a href=\"http:\/\/scikit-learn.org\/\">scikit-learn<\/a>. Luckily, <a href=\"http:\/\/scikit-learn.org\/stable\/auto_examples\/applications\/topics_extraction_with_nmf.html\">there is an excellent example using Non-negative Matrix Factorization (NMF) to generate \"topics\" for a text corpus<\/a>.<\/p>\n<p>In the example, each topic is an array of terms, extracted by the TfidfTransformer and ordered by the term weights that NMF calculates. 
Here's an example \"topic\" extracted from tweets:<\/p>\n<pre style=\"color: #000000;\">[u'impossible possible', u'cast anxiety', u'anxiety cares', u'said mortals', u'mortals impossible', u'cares peter', u'cares 1peter', u'possible said', u'men impossible', u'1peter cast', u'possible mat', u'worries cast', u'rid worries', u'cares cares', u'worries cares', u'peter rid', u'said unto', u'unto men', u'beheld said', u'possible men']<\/pre>\n<h1>5 Changes to the NMF Example<\/h1>\n<p>As with any code snippet one finds on the web, the example needs some changes to fit a specific application. Here are 5 changes I made to the NMF topic extraction example so that it works with religious tweets.<\/p>\n<h4>1. Concatenate similar tweets into single documents within the corpus<\/h4>\n<p>When preparing the tweets for analysis, I concatenate similar tweets into one large text blob. I then pass several of these concatenated documents to the vectorizer and NMF. This scales a little better than tokenizing and analyzing thousands of tweets as separate documents.<\/p>\n<h4>2. Ensure the number of requested topics does not exceed the number of 'documents' in the corpus<\/h4>\n<p>This is really simple, but it prevented the majority of the failures I encountered when running NMF. A simple check like the one below fixed my issues.<\/p>\n<pre>n_topics = min(n_samples, n_topics)<\/pre>\n<h4>3. Use min-gram=2 and max-gram=2<\/h4>\n<p>I found that single-word tokens were noisy, but setting both the minimum and maximum n-gram size to 2 (ngram_range=(2,2)) revealed useful bigrams that reflected natural language patterns.<\/p>\n<pre>from sklearn.feature_extraction.text import TfidfVectorizer\r\nvectorizer = TfidfVectorizer(max_features=n_features, ngram_range=(2,2))<\/pre>\n<h4>4. Use a stop word list<\/h4>\n<p>I created a stop word list to filter out tokens that are not interesting, such as Bible verse citations (e.g. John 3:16), translation acronyms (e.g. NIV, KJV) or common Twitter strings (e.g. RT, retweet). 
I then passed the stop word list to the TfidfVectorizer:<\/p>\n<pre>from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS\r\nstoplist = ['retweet', 'rt', 'http', 'nlt', 'kjv']\r\n# bv_tokens is defined elsewhere; it is assumed to hold the Bible verse citation tokens\r\nvectorizer.set_params(stop_words=list(ENGLISH_STOP_WORDS.union(stoplist, bv_tokens)))\r\ncounts = vectorizer.fit_transform(corpus)<\/pre>\n<h4>5. Capture the weights and terms returned by NMF<\/h4>\n<p><span style=\"line-height: 1.714285714; font-size: 1rem;\">I used the weights returned by NMF to rank the results of the analysis. Capture the weights like this:<\/span><\/p>\n<pre>for topic_idx, topic in enumerate(nmf.components_):\r\n  # argsort is ascending, so this slice takes the n_top_words largest weights, highest first\r\n  sorted_topics = topic.argsort()[:-n_top_words - 1:-1]\r\n  print([{'text': feature_names[i], 'weight':topic[i]} for i in sorted_topics])<\/pre>\n<p>The output is something like this:<\/p>\n<pre>[{'text': u'impossible possible', 'weight': 0.45413113168606384}, {'text': u'cast anxiety', 'weight': 0.40382848799298487}, {'text': u'anxiety cares', 'weight': 0.40382848799298487}, {'text': u'said mortals', 'weight': 0.2794653118068085}, {'text': u'mortals impossible', 'weight': 0.2794653118068085}]<\/pre>\n<h1>Topic Analysis in Action<\/h1>\n<p>People often tweet Bible verses for religious holidays. What surprised me were the religious tweets sent for secular holidays. I used the NMF analysis to show the most interesting \"topics\" for religious tweets on Memorial Day 2014. I used the terms returned within a topic to seed several Elasticsearch queries that return the phrases displayed by bakkify. 
The ordering uses the weights returned by NMF.<\/p>\n<p>The analysis correlated <a href=\"https:\/\/www.bible.com\/search\/bible?q=psalm+33%3A12\">'Psalm 33:12'<\/a>\u00a0with \"Happy Memorial Day\" when I analyzed over 10,000 tweets sent the week of\u00a02014-05-26.<\/p>\n<p><a href=\"http:\/\/www.bakkify.com\/topics\/memorialday\/\">Click here to see\u00a0the Memorial Day topic analysis\u00a0on bakkify.com.<\/a>\u00a0I also ran the analysis for other holidays such as <a href=\"http:\/\/www.bakkify.com\/topics\/newyears\/\">New Year's<\/a>, <a href=\"http:\/\/www.bakkify.com\/topics\/valentines\/\">Valentine's Day<\/a> and <a href=\"http:\/\/www.bakkify.com\/topics\/thanksgiving\/\">Thanksgiving<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In prior posts, I&#8217;ve used Storm to filter the Twitter sample stream for religious tweets and then used Elasticsearch to perform simple analytics. Since then, I&#8217;ve accumulated about 1 million religious tweets. Now the challenge is how to gain insight into this mass of 140-character messages. 
After trying to use mahout to analyze the tweets, &hellip; <a href=\"https:\/\/technicalelvis.com\/blog\/2014\/12\/17\/topic-analysis-of-religious-tweets\/\" class=\"more-link\">Continue reading <span class=\"screen-reader-text\">Topic Analysis of Religious Tweets Using Scikit-Learn<\/span> <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"footnotes":""},"categories":[15,12],"tags":[],"class_list":["post-666","post","type-post","status-publish","format-standard","hentry","category-habakkuk-mining","category-twitter_mining"],"_links":{"self":[{"href":"https:\/\/technicalelvis.com\/blog\/wp-json\/wp\/v2\/posts\/666","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/technicalelvis.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/technicalelvis.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/technicalelvis.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/technicalelvis.com\/blog\/wp-json\/wp\/v2\/comments?post=666"}],"version-history":[{"count":20,"href":"https:\/\/technicalelvis.com\/blog\/wp-json\/wp\/v2\/posts\/666\/revisions"}],"predecessor-version":[{"id":686,"href":"https:\/\/technicalelvis.com\/blog\/wp-json\/wp\/v2\/posts\/666\/revisions\/686"}],"wp:attachment":[{"href":"https:\/\/technicalelvis.com\/blog\/wp-json\/wp\/v2\/media?parent=666"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/technicalelvis.com\/blog\/wp-json\/wp\/v2\/categories?post=666"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/technicalelvis.com\/blog\/wp-json\/wp\/v2\/tags?post=666"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{
rel}","templated":true}]}}