Review of Reproducible Research by Johns Hopkins University on Coursera

This week I completed the course Reproducible Research by Johns Hopkins University on Coursera. The course introduces tools for publishing research documents that contain data processing code, raw data and results. Research is "reproducible" if an independent researcher can fetch the code, fetch the data, execute the scripts and verify the results.

IMO, this is akin to the software engineering practices of Software Quality Assurance, Code Reviews and Continuous Integration. These practices are meant to solve the problem where the code "works on my machine" but not anywhere else. This is extremely important in bioinformatics because erroneous research can lead to erroneous clinical trials, as described in the lecture The Importance of Reproducible Research in High-Throughput Biology.

Key Lectures
My favorite lecture of the course was The Importance of Reproducible Research in High-Throughput Biology, given by Keith A. Baggerly, Ph.D. of the MD Anderson Cancer Center, Houston, TX. The lecture discusses Dr. Baggerly's attempt to reverse engineer the results of a study that had numerous errors. See this NYT article for more details.

Projects
For the first project, we analyzed activity monitoring data recorded by a fitness tracker. I calculated the mean number of steps for each 5-minute interval, grouped by weekdays and weekends (i.e. 1 group for Monday-Friday intervals, 1 group for Saturday-Sunday intervals). I concluded that the user is most active on weekdays because the 5-minute interval with the highest mean step count occurs in the weekday group.
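The course itself is taught in R; purely as an illustration, the weekday/weekend averaging described above might look like this in Python. The records are made-up values rather than the project data, and `mean_steps_by_interval` is a hypothetical helper name.

```python
from collections import defaultdict
from datetime import date

# Hypothetical records: (date, 5-minute interval id, steps). Made-up values.
records = [
    (date(2012, 10, 1), 835, 100),  # Monday
    (date(2012, 10, 2), 835, 200),  # Tuesday
    (date(2012, 10, 6), 835, 50),   # Saturday
    (date(2012, 10, 7), 835, 70),   # Sunday
    (date(2012, 10, 1), 840, 30),   # Monday
]

def mean_steps_by_interval(rows):
    """Return {(day_type, interval): mean steps} for weekday/weekend groups."""
    totals = defaultdict(lambda: [0, 0])  # key -> [sum of steps, count]
    for day, interval, steps in rows:
        # date.weekday() is 0-4 for Monday-Friday and 5-6 for Saturday-Sunday
        day_type = "weekend" if day.weekday() >= 5 else "weekday"
        totals[(day_type, interval)][0] += steps
        totals[(day_type, interval)][1] += 1
    return {key: total / float(count) for key, (total, count) in totals.items()}

means = mean_steps_by_interval(records)
# The group and interval with the highest mean step count
peak = max(means, key=means.get)
print(peak, means[peak])
```

Comparing the peak of each group is only one way to read "most active"; comparing total daily steps per group would be another.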

For the second project, we analyzed the U.S. National Oceanic and Atmospheric Administration's (NOAA) storm database. First, I showed the data processing steps performed prior to the analysis. Next, I calculated the total number of fatalities, the total number of injuries and the total economic cost per weather event type. Finally, I ranked the weather event types based on (1) public health impact and (2) economic impact. The results show tornadoes pose a significant public health risk in terms of injuries, fatalities and economic cost. Additionally, excessive heat poses a public health risk based on fatalities. Floods pose the greatest risk in terms of economic cost. I also published the report to RPubs.com.
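As a rough sketch of the "sum per event type, then rank" step, here is a small Python version. The rows are invented for illustration and do not come from the NOAA database; `totals_by_event` and `rank` are hypothetical helper names.

```python
from collections import defaultdict

# Hypothetical rows: (event type, fatalities, injuries, cost in USD). Made-up values.
rows = [
    ("TORNADO", 5, 40, 1000000.0),
    ("FLOOD", 1, 2, 5000000.0),
    ("EXCESSIVE HEAT", 4, 0, 0.0),
    ("TORNADO", 2, 10, 250000.0),
]

def totals_by_event(data):
    """Sum fatalities, injuries and cost for each event type."""
    totals = defaultdict(lambda: {"fatalities": 0, "injuries": 0, "cost": 0.0})
    for event, fatalities, injuries, cost in data:
        totals[event]["fatalities"] += fatalities
        totals[event]["injuries"] += injuries
        totals[event]["cost"] += cost
    return dict(totals)

def rank(totals, field):
    """Return event types ordered from largest to smallest total for one field."""
    return sorted(totals, key=lambda event: totals[event][field], reverse=True)

totals = totals_by_event(rows)
print(rank(totals, "fatalities"))
print(rank(totals, "cost"))
```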

I earned a certificate for completing the course. The next course in the series is Statistical Inference.

Topic Analysis of Religious Tweets Using Scikit-Learn

In prior posts, I used storm to filter the Twitter sample stream for religious tweets and then used elasticsearch to perform simple analytics. Since then, I've accumulated about 1 million religious tweets. Now the challenge is how to gain insight into this mass of 140-character messages. After trying to use mahout to analyze the tweets, I decided to try scikit-learn. Luckily, there is an excellent example that uses Non-negative Matrix Factorization (NMF) to generate "topics" for a text corpus.

In the example, each topic is an array of terms extracted by the TfidfTransformer, ordered by the term weights calculated by NMF. Here's an example "topic" extracted from tweets:

[u'impossible possible', u'cast anxiety', u'anxiety cares', u'said mortals', u'mortals impossible', u'cares peter', u'cares 1peter', u'possible said', u'men impossible', u'1peter cast', u'possible mat', u'worries cast', u'rid worries', u'cares cares', u'worries cares', u'peter rid', u'said unto', u'unto men', u'beheld said', u'possible men']

5 Changes to the NMF Example

As with any code snippet one finds on the web, there are certain changes required to make the example fit a specific application. Here are 5 changes I made to the NMF topic extraction example to work with religious tweets.

1. Concatenate similar tweets into single documents within the corpus

When preparing the tweets for analysis, I concatenate each group of similar tweets into one large text blob. Then I pass several of these concatenated documents to the Vectorizer and NMF. This scales a little better than tokenizing and analyzing thousands of tweets as separate documents.
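A minimal sketch of the concatenation step follows. How tweets are judged "similar" is not spelled out here, so the grouping key (a cited verse in this example) is an assumption, as are the sample tweets and the `build_corpus` helper name.

```python
from collections import defaultdict

# Hypothetical input: (group key, tweet text). The key stands in for whatever
# makes two tweets "similar"; a cited verse is just one possible choice.
tweets = [
    ("1peter 5:7", "Cast all your anxiety on him"),
    ("matthew 19:26", "With God all things are possible"),
    ("1peter 5:7", "He cares for you"),
]

def build_corpus(keyed_tweets):
    """Join the tweets that share a key into one document per key."""
    groups = defaultdict(list)
    for key, text in keyed_tweets:
        groups[key].append(text)
    # Sort the keys so the document order is deterministic.
    return [" ".join(groups[key]) for key in sorted(groups)]

corpus = build_corpus(tweets)
n_samples = len(corpus)  # number of "documents" handed to the vectorizer
print(n_samples, corpus)
```

Grouping shrinks the corpus considerably, so `n_samples` can end up smaller than the number of topics requested.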

2. Ensure the number of requested topics does not exceed the number of 'documents' in the corpus

This is really simple, but it prevented the majority of the failures I encountered when running NMF. A simple check like the one below fixed my issues.

n_topics = min(n_samples, n_topics)

3. Use bigrams only (minimum and maximum n-gram size of 2)

I found that single-word tokens were noisy, but setting both the minimum and maximum n-gram size to 2 (ngram_range=(2,2) in the TfidfVectorizer) revealed useful bigrams that reflected natural language patterns.

vectorizer = TfidfVectorizer(max_features=n_features, ngram_range=(2,2))
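To make the effect of bigram-only tokenization concrete, here is a small standard-library approximation of what the vectorizer does to one tweet. The regular expression is scikit-learn's default `token_pattern`; the `bigrams` helper and the sample sentence are illustrative only, and the real vectorizer does more (stop words, feature limits, weighting).

```python
import re

# scikit-learn's default token pattern: words of two or more characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def bigrams(text):
    """Lowercase the text, tokenize it, and join each adjacent pair of tokens."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

print(bigrams("Cast all your anxiety on him"))
```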

4. Use a stop word list

I created a stop-word list to filter out tokens that are not interesting, such as Bible verse citations (e.g. John 3:16), translation acronyms (e.g. NIV, KJV) or common Twitter strings (e.g. RT, retweet). I then passed the stop-word list to the TfidfVectorizer.

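from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
# bv_tokens is a list of Bible verse citation tokens built elsewhere
stoplist = ['retweet', 'rt', 'http', 'nlt', 'kjv']
# a plain list is accepted across scikit-learn versions; duplicates are harmless
vectorizer.set_params(stop_words=list(ENGLISH_STOP_WORDS) + stoplist + bv_tokens)
counts = vectorizer.fit_transform(corpus)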
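One detail worth keeping in mind: the vectorizer removes stop words from the token stream before it builds n-grams, so a bigram can join two words that were not adjacent in the original tweet. The sketch below imitates that order of operations with a deliberately tiny, hypothetical stop list; `bigrams_without_stop_words` is not a scikit-learn function.

```python
# Hypothetical stop list: a few common English words plus Twitter noise.
stop_words = {"all", "your", "on", "rt", "http"}

def bigrams_without_stop_words(tokens, stop_words):
    """Remove stop words first, then pair up the tokens that remain."""
    kept = [token for token in tokens if token not in stop_words]
    return [" ".join(pair) for pair in zip(kept, kept[1:])]

# Tokens are assumed to be lowercased already, as the vectorizer does by default.
tokens = ["rt", "cast", "all", "your", "anxiety", "on", "him"]
print(bigrams_without_stop_words(tokens, stop_words))
```

This is why a topic can contain a pair such as 'cast anxiety' even though those two words never appear side by side in the verse.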

5. Capture the weights and terms returned by NMF

I used the weights returned by NMF to rank the results returned by the analysis. Capture the weights like this:

for topic_idx, topic in enumerate(nmf.components_):
    # argsort() is ascending, so slice backwards to take the n_top_words largest weights
    top_indices = topic.argsort()[:-n_top_words - 1:-1]
    print([{'text': feature_names[i], 'weight': topic[i]} for i in top_indices])

The output is something like this:

[{'text': u'impossible possible', 'weight': 0.45413113168606384}, {'text': u'cast anxiety', 'weight': 0.40382848799298487}, {'text': u'anxiety cares', 'weight': 0.40382848799298487}, {'text': u'said mortals', 'weight': 0.2794653118068085}, {'text': u'mortals impossible', 'weight': 0.2794653118068085}]
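The slice `[:-n_top_words - 1:-1]` can be hard to read: `argsort()` returns indices in ascending order of weight, and the slice walks backwards from the end to pick the `n_top_words` largest. A plain-Python equivalent, using made-up weights and a hypothetical `top_terms` helper, might look like this (tied weights may come out in a different order than with `argsort`):

```python
def top_terms(weights, feature_names, n_top_words):
    """Return the n_top_words highest-weighted terms, largest weight first."""
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return [{"text": feature_names[i], "weight": weights[i]} for i in order[:n_top_words]]

# Made-up weights for illustration only.
feature_names = ["said unto", "cast anxiety", "impossible possible"]
weights = [0.1, 0.4, 0.45]
print(top_terms(weights, feature_names, 2))
```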

Topic Analysis in Action

People often tweet Bible verses for religious holidays. What surprised me were the religious tweets sent for secular holidays. I used the NMF analysis to show the most interesting "topics" for religious tweets on Memorial Day 2014. I used the terms returned within a topic to seed several elasticsearch queries that return the phrases displayed by bakkify. The ordering uses the weights returned by NMF.
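The exact queries are not shown here, so the following is only a guess at how topic terms could seed them. It builds one `match_phrase` query body per term, ordered by NMF weight. The field name `text`, the `slop` value, the choice of `match_phrase` and the `build_phrase_queries` helper are all assumptions, and the weights are illustrative.

```python
def build_phrase_queries(topic_terms, field="text", slop=2):
    """Build one match_phrase query body per term, highest weight first."""
    ranked = sorted(topic_terms, key=lambda term: term["weight"], reverse=True)
    return [
        # slop lets the phrase match even when other words sit between the two terms
        {"query": {"match_phrase": {field: {"query": term["text"], "slop": slop}}}}
        for term in ranked
    ]

# Illustrative terms and weights, not real analysis output.
topic_terms = [
    {"text": "cast anxiety", "weight": 0.40},
    {"text": "impossible possible", "weight": 0.45},
]
queries = build_phrase_queries(topic_terms)
print(queries[0])
```

A nonzero `slop` seems useful here because the bigrams were formed after stop-word removal, so the two words may not be adjacent in the stored tweet.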

The analysis correlated 'Psalm 33:12' with "Happy Memorial Day" when I analyzed over 10k tweets sent the week of 2014-05-26.

Click here to see the Memorial Day topic analysis on bakkify.com. I also ran the analysis for other holidays such as New Year's, Valentine's Day and Thanksgiving.