All posts by Telvis

Prince Song Recommender using R and Shiny

May 6, 2016 Telvis

Overview

This project contains a Prince song recommender developed using R and Shiny. The app was developed for the Developing Data Products course by Johns Hopkins University.

View the Shiny App

Click this link to view the R/Shiny application.

Slides

Click this link to view the presentation slides.

Data

This project uses data from the Million Song Dataset.

data-science

Review of Statistical Inference by Johns Hopkins University on Coursera

March 2, 2016 Telvis

This class is great. I recommend purchasing Dr. Caffo's "Statistical inference for data science" book and working the problems prior to completing the quizzes.

The final project had 2 parts. For part 1, we investigate the exponential distribution in R and compare it with the Central Limit Theorem. For part 2, we explore the ToothGrowth dataset and perform t-tests on the data.
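Part 1 can be illustrated with a small simulation. The sketch below uses Python's standard library rather than R, and the rate (0.2), sample size (40) and number of simulations (1000) are assumptions picked for illustration, not values taken from the project report.

```python
import random
import statistics

def simulate_sample_means(rate, sample_size, n_simulations, seed=0):
    """Return the means of n_simulations samples drawn from Exp(rate)."""
    rng = random.Random(seed)
    return [
        statistics.mean(rng.expovariate(rate) for _ in range(sample_size))
        for _ in range(n_simulations)
    ]

rate, sample_size = 0.2, 40  # illustrative values (assumptions)
means = simulate_sample_means(rate, sample_size, n_simulations=1000)

# Theory: Exp(rate) has mean 1/rate and standard deviation 1/rate, so the
# CLT predicts sample means centred on 1/rate with standard error
# (1/rate) / sqrt(sample_size).
theoretical_mean = 1 / rate
theoretical_se = (1 / rate) / sample_size ** 0.5

print("mean of sample means:", round(statistics.mean(means), 3),
      "theory:", theoretical_mean)
print("sd of sample means:", round(statistics.stdev(means), 3),
      "theory:", round(theoretical_se, 3))
```

A histogram of `means` should look roughly bell-shaped even though the exponential distribution itself is heavily skewed.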

data-science

Review of Reproducible Research by Johns Hopkins University on Coursera

January 7, 2016 Telvis

This week I completed the course Reproducible Research by Johns Hopkins University on Coursera. The course introduces tools to publish research documents containing data processing code, raw data and results. Research is "reproducible" if an independent researcher can fetch the code, fetch the data, execute the scripts and verify the results.

IMO, this is akin to the software engineering practices of Software Quality Assurance, Code Reviews and Continuous Integration. These practices are meant to solve the problem where the code "works-on-my-machine" but not anywhere else. This is extremely important in bioinformatics because erroneous research can lead to erroneous clinical trials - as described in the lecture: The Importance of Reproducible Research in High-Throughput Biology.

Key Lectures
My favorite lecture of the course was The Importance of Reproducible Research in High-Throughput Biology, given by Keith A. Baggerly, Ph.D. of the MD Anderson Cancer Center, Houston, TX. The lecture discusses Dr. Baggerly's attempt to reverse engineer the results of a study that had numerous errors. See this NYT article for more details.

Projects
For the first project, we analyzed activity monitoring data created by a fitness tracker. First, I calculate the mean number of steps for each 5-minute interval grouped by weekends and weekdays (i.e. 1 group for Monday-Friday intervals, 1 group for Saturday-Sunday intervals). I conclude that the user is most active on weekdays because the maximum 5-minute interval occurs in the weekday group.
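The weekday/weekend grouping can be sketched as follows. This is a minimal Python illustration, not the R code from the project; the record layout (ISO date, interval id, step count) and the sample values are assumptions made up for the example.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

def mean_steps_by_day_type(records):
    """records: iterable of (iso_date, interval, steps) tuples.

    Returns {"weekday": {interval: mean_steps}, "weekend": {...}}.
    """
    buckets = {"weekday": defaultdict(list), "weekend": defaultdict(list)}
    for iso_date, interval, steps in records:
        # date.weekday() is 0 for Monday ... 6 for Sunday
        is_weekend = date.fromisoformat(iso_date).weekday() >= 5
        buckets["weekend" if is_weekend else "weekday"][interval].append(steps)
    return {
        day_type: {interval: mean(values) for interval, values in by_interval.items()}
        for day_type, by_interval in buckets.items()
    }

# Tiny made-up sample, only to show the shape of the result
records = [
    ("2012-10-01", 835, 200),  # Monday
    ("2012-10-02", 835, 260),  # Tuesday
    ("2012-10-06", 835, 150),  # Saturday
    ("2012-10-07", 835, 170),  # Sunday
]
averages = mean_steps_by_day_type(records)
for day_type, by_interval in averages.items():
    peak = max(by_interval, key=by_interval.get)
    print(day_type, "peak interval:", peak, "mean steps:", by_interval[peak])
```

Comparing the peak interval of each group is the comparison the conclusion above rests on.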

For the second project, we analyze the U.S. National Oceanic and Atmospheric Administration's (NOAA) storm database. First, I show the data processing steps performed prior to the analysis. Next, I calculate the total number of fatalities, the total number of injuries and the total economic cost per weather event type. Finally, I rank the weather event types based on (1) public health impact and (2) economic impact. The results show tornadoes pose a significant public health risk in terms of injuries, fatalities and economic cost. Additionally, excessive heat poses a public health risk based on fatalities. Floods pose the greatest risk in terms of economic cost. I also published the report to RPubs.com.
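The sum-then-rank step might look like the sketch below. It is written in Python rather than the R used for the report, and the field names (`event_type`, `fatalities`, `injuries`, `cost_usd`) and the rows are hypothetical, made up purely to show the mechanics.

```python
from collections import defaultdict

def rank_event_types(rows, metric):
    """Sum `metric` per event type; return (event_type, total) pairs, largest first."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["event_type"]] += row[metric]
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

# Made-up rows; field names are hypothetical
rows = [
    {"event_type": "TORNADO", "fatalities": 3, "injuries": 40, "cost_usd": 2.5e6},
    {"event_type": "FLOOD", "fatalities": 1, "injuries": 2, "cost_usd": 9.0e6},
    {"event_type": "TORNADO", "fatalities": 2, "injuries": 15, "cost_usd": 1.0e6},
    {"event_type": "EXCESSIVE HEAT", "fatalities": 4, "injuries": 5, "cost_usd": 0.0},
]
for metric in ("fatalities", "injuries", "cost_usd"):
    print(metric, rank_event_types(rows, metric))
```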

I earned a certificate for completing the course. The next course in the series is Statistical Inference.

me-and-my-bass-guitar

No Worries

June 27, 2015 Telvis

First "finished" track with the Fender J.

twitter_mining

Sparks and Shenanigans

March 20, 2015 Telvis

Shenanigans! https://github.com/telvis07/spark_shenanigans

I'm planning to use Spark Streaming in my religious tweet work. The first step was to get smart on Scala and Spark. This GitHub repo contains examples that use Spark Streaming for (1) reading from Twitter, (2) performing streaming queries and (3) writing to Elasticsearch.

me-and-my-bass-guitar

Singing ABC’s

February 15, 2015 Telvis

Singing ABC's with Dakota and Dani. Beat by Logic Pro X.

me-and-my-bass-guitar

funky-flight

February 9, 2015 Telvis

New #bass track 'funky-flight' posted to SoundCloud.

me-and-my-bass-guitar

Kotaz with Steely Beats

January 31, 2015 Telvis

Fun with Kotaz ad-libbing and Logic Pro. Using the "Steely Beats" Drum Machine.

me-and-my-bass-guitar

Keep it simple

January 31, 2015 Telvis

This is my first song written/recorded with Logic Pro X. It has 2 bass tracks and a virtual drum track. The Logic drummer is "Nikki" with the "Four on the Floor" Drumkit.

habakkuk-mining, twitter_mining

Topic Analysis of Religious Tweets Using Scikit-Learn

December 17, 2014 Telvis

In prior posts, I used Storm to filter the Twitter sample stream for religious tweets and then used Elasticsearch to perform simple analytics. Since then, I've accumulated about 1 million religious tweets. Now the challenge is how to gain insights into this mass of 140-character messages. After trying to use Mahout to analyze the tweets, I decided to try scikit-learn. Luckily, there is an excellent example using Non-Negative Matrix Factorization (NMF) to generate "topics" for a text corpus.

In the example, each topic is an array of terms extracted by the TfidfTransformer, ordered by the term weights calculated by NMF. Here's an example "topic" extracted from tweets:

[u'impossible possible', u'cast anxiety', u'anxiety cares', u'said mortals', u'mortals impossible', u'cares peter', u'cares 1peter', u'possible said', u'men impossible', u'1peter cast', u'possible mat', u'worries cast', u'rid worries', u'cares cares', u'worries cares', u'peter rid', u'said unto', u'unto men', u'beheld said', u'possible men']

5 Changes to the NMF Example

As with any code snippet one finds on the web, there are certain changes required to make the example fit a specific application. Here are 5 changes I made to the NMF topic extraction example to work with religious tweets.

1. Concatenate similar tweets into single documents within the corpus

When preparing the tweets for analysis, I concatenate similar tweets into 1 giant text blob. Then I pass several of these concatenated documents to the Vectorizer and NMF. This scales a little better than tokenizing and analyzing thousands of tweets as separate documents.
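A minimal sketch of the concatenation step, assuming "similar" means tweets that share some grouping key. Here the key is a hypothetical `verse` field, and the tweets are made up; the real grouping criterion may well differ.

```python
from collections import defaultdict

def build_corpus(tweets, key):
    """Concatenate tweets that share key(tweet) into one document per group."""
    groups = defaultdict(list)
    for tweet in tweets:
        groups[key(tweet)].append(tweet["text"])
    # One text blob per group; sorted keys keep the document order stable
    return [" ".join(groups[k]) for k in sorted(groups)]

# Made-up tweets; the "verse" field is a hypothetical grouping key
tweets = [
    {"verse": "1 Peter 5:7", "text": "Cast all your anxiety on him"},
    {"verse": "Matthew 19:26", "text": "With God all things are possible"},
    {"verse": "1 Peter 5:7", "text": "Casting all your cares upon him"},
]
corpus = build_corpus(tweets, key=lambda t: t["verse"])
print(len(corpus), "documents")
print(corpus[0])
```

The resulting `corpus` list is the kind of input that would then be handed to the vectorizer, with one entry per group instead of one per tweet.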

2. Ensure the number of requested topics does not exceed the number of 'documents' in the corpus.

This is really simple, but it prevented the majority of the failures I encountered when running NMF. A simple check like the one below fixed my issues.

n_topics = min(n_samples, n_topics)

3. Use min-gram=2 and max-gram=2

I found that single-word tokens were noisy, but setting both the minimum and maximum n-gram size to 2 (ngram_range=(2,2)) revealed useful bigrams that reflected natural language patterns.

vectorizer = TfidfVectorizer(max_features=n_features, ngram_range=(2,2))
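To see what kind of tokens a (2,2) n-gram range produces, here is a pure-Python approximation. It is not scikit-learn's implementation: it only mimics the default behaviour as I understand it (lowercase, keep tokens of two or more word characters, drop stop words, then pair up neighbours). The stop-word set is a tiny illustrative stand-in.

```python
import re

# Rough stand-in for the default word tokenizer: tokens of 2+ word characters
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

def bigrams(text, stop_words=frozenset()):
    """Return adjacent word pairs after lowercasing and dropping stop words."""
    tokens = [t for t in TOKEN_PATTERN.findall(text.lower()) if t not in stop_words]
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

# Tiny illustrative stop-word set (an assumption, not the real list)
stop_words = {"all", "your", "on", "him", "because", "he", "for", "you"}
print(bigrams("Cast all your anxiety on him because he cares for you", stop_words))
# -> ['cast anxiety', 'anxiety cares']
```

Because stop words are dropped before the pairs are formed, a bigram such as 'cast anxiety' can appear even though those two words are not adjacent in the original tweet.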

4. Use a stop word list

I created a stop-word list to filter out tokens that are not interesting, such as Bible verse citations (e.g. John 3:16), translation acronyms (e.g. NIV, KJV) or common Twitter strings (e.g. RT, retweet). Then I passed the stop-word list to the TfidfVectorizer.

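from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
# bv_tokens holds the bible-verse tokens built elsewhere; a plain list works across scikit-learn versions
stoplist = ['retweet', 'rt', 'http', 'nlt', 'kjv']
vectorizer.set_params(stop_words=list(set(ENGLISH_STOP_WORDS) | set(stoplist) | set(bv_tokens)))
counts = vectorizer.fit_transform(corpus)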

5. Capture the weights and terms returned by NMF

I used the weights returned by NMF to rank the results returned by the analysis. Capture the weights like this:

for topic_idx, topic in enumerate(nmf.components_):
    # indices of the n_top_words largest weights, largest first
    sorted_topics = topic.argsort()[:-n_top_words - 1:-1]
    print([{'text': feature_names[i], 'weight': topic[i]} for i in sorted_topics])

The output is something like this:

[{'text': u'impossible possible', 'weight': 0.45413113168606384}, {'text': u'cast anxiety', 'weight': 0.40382848799298487}, {'text': u'anxiety cares', 'weight': 0.40382848799298487}, {'text': u'said mortals', 'weight': 0.2794653118068085}, {'text': u'mortals impossible', 'weight': 0.2794653118068085}]
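The slice `[:-n_top_words - 1:-1]` is easy to misread, so here is the same idea in plain Python with no NumPy. The feature names and weights are made up, and `sorted(range(...))` plays the role of `argsort()`.

```python
def top_indices(weights, n_top_words):
    """Indices of the n_top_words largest weights, largest first."""
    # Indices ordered by ascending weight, like numpy's argsort()
    ascending = sorted(range(len(weights)), key=lambda i: weights[i])
    # [:-n - 1:-1] walks backwards from the last index and stops after n items
    return ascending[:-n_top_words - 1:-1]

feature_names = ["said unto", "cast anxiety", "unto men", "impossible possible"]
topic = [0.05, 0.40, 0.10, 0.45]  # made-up weights for one topic
for i in top_indices(topic, n_top_words=2):
    print({"text": feature_names[i], "weight": topic[i]})
```

If `n_top_words` exceeds the number of terms, the slice simply returns every index in descending order of weight.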

Topic Analysis in Action

People often tweet bible verses for religious holidays. What surprised me were the religious tweets sent for secular holidays. I used the NMF analysis to show the most interesting "topics" for religious tweets on Memorial Day 2014. I use the terms returned within a topic to seed several elasticsearch queries that return the phrases displayed by bakkify. The ordering uses the weights returned by NMF.
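One way the query-seeding step could look is sketched below. It only builds Elasticsearch query bodies as dictionaries, without calling any client. The field name `text`, the `size` value and the choice of a `match` query with `operator: and` are all assumptions; the actual queries behind bakkify may be different.

```python
import json

def term_queries(topic_terms, field="text", size=5):
    """Build one query body per topic term, highest NMF weight first."""
    ranked = sorted(topic_terms, key=lambda term: term["weight"], reverse=True)
    return [
        {
            "size": size,
            # "and" requires every word of the bigram to appear in the tweet,
            # without requiring the words to be adjacent
            "query": {"match": {field: {"query": term["text"], "operator": "and"}}},
        }
        for term in ranked
    ]

topic_terms = [
    {"text": "cast anxiety", "weight": 0.40382848799298487},
    {"text": "impossible possible", "weight": 0.45413113168606384},
]
for body in term_queries(topic_terms):
    print(json.dumps(body))
```

A `match` query is used here rather than a strict phrase match because the bigrams are formed after stop words are removed, so the two words are often not neighbours in the original tweet.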

The analysis correlated 'Psalm 33:12' with "Happy Memorial Day" when I analyzed over 10k tweets sent the week of 2014-05-26.

Click here to see the Memorial Day topic analysis on bakkify.com. I also ran the analysis for other holidays such as New Year's, Valentine's Day and Thanksgiving.