Category Archives: twitter_mining

Twitter mining project posts

Topic Analysis of Religious Tweets Using Scikit-Learn

In prior posts, I've used storm to filter the twitter sample stream for religious tweets and then used elasticsearch to perform simple analytics. Since then, I've accumulated about 1 million religious tweets. Now the challenge is: how do I gain insight into this mass of 140-character messages? After trying to use mahout to analyze the tweets, I decided to try scikit-learn. Luckily, there is an excellent example that uses Non-negative Matrix Factorization (NMF) to generate "topics" for a text corpus.

In the example, each topic is an array of terms extracted by the TfidfVectorizer and ordered by the term weights calculated by NMF. Here's an example "topic" extracted from tweets:

[u'impossible possible', u'cast anxiety', u'anxiety cares', u'said mortals', u'mortals impossible', u'cares peter', u'cares 1peter', u'possible said', u'men impossible', u'1peter cast', u'possible mat', u'worries cast', u'rid worries', u'cares cares', u'worries cares', u'peter rid', u'said unto', u'unto men', u'beheld said', u'possible men']

5 Changes to the NMF Example

As with any code snippet found on the web, certain changes are required to make the example fit a specific application. Here are 5 changes I made to the NMF topic extraction example to make it work with religious tweets.

1. Concatenate similar tweets into single documents within the corpus

When preparing the tweets for analysis, I concatenate similar tweets into one giant text blob, then pass several of these concatenated documents to the vectorizer and NMF. This scales a little better than tokenizing and analyzing thousands of tweets as separate documents.
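As a rough sketch (the grouping key and helper here are hypothetical, not the project's actual code), the preparation step looks something like this:

from collections import defaultdict

def build_corpus(tweets, group_key):
    """Concatenate similar tweets into one 'document' per group.

    tweets: list of dicts with a 'text' field.
    group_key: function returning a grouping label for a tweet
               (hypothetical -- e.g. the bible verse it cites).
    Returns a list of large text blobs to feed to the vectorizer and NMF.
    """
    groups = defaultdict(list)
    for tweet in tweets:
        groups[group_key(tweet)].append(tweet['text'])
    return [' '.join(texts) for texts in groups.values()]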

2. Ensure the number of requested topics does not exceed the number of 'documents' in the corpus.

This is really simple, but it prevented the majority of the failures I encountered when running NMF. A simple check like the one below fixed my issues.

 n_topics = min(n_samples, n_topics)

3. Use min-gram=2 and max-gram=2

I found that single-word tokens were noisy, but setting ngram_range=(2, 2) revealed useful bigrams that reflected natural language patterns.

vectorizer = TfidfVectorizer(max_features=n_features, ngram_range=(2,2))

4. Use a stop word list

I created a stop-word list to filter out tokens that are not interesting, such as bible verse citations (e.g. John 3:16), translation acronyms (e.g. NIV, KJV) or common twitter strings (e.g. RT, retweet). Then I passed the stop-word list to the TfidfVectorizer.

stoplist = ['retweet', 'rt', 'http', 'nlt', 'kjv']
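# bv_tokens: bible verse citation tokens (e.g. 'john 3:16') built elsewhere in the script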
vectorizer.set_params(stop_words=set(list(ENGLISH_STOP_WORDS)+stoplist+bv_tokens))
counts = vectorizer.fit_transform(corpus)

5. Capture the weights and terms returned by NMF

I used the weights returned by NMF to rank results returned by the analysis. Capture the weights like this:

for topic_idx, topic in enumerate(nmf.components_):
  sorted_topics = topic.argsort()[:-n_top_words - 1:-1]
  print [{'text': feature_names[i], 'weight':topic[i]} for i in sorted_topics]

The output is something like this:

[{'text': u'impossible possible', 'weight': 0.45413113168606384}, {'text': u'cast anxiety', 'weight': 0.40382848799298487}, {'text': u'anxiety cares', 'weight': 0.40382848799298487}, {'text': u'said mortals', 'weight': 0.2794653118068085}, {'text': u'mortals impossible', 'weight': 0.2794653118068085}]

Topic Analysis in Action

People often tweet bible verses for religious holidays. What surprised me were the religious tweets sent for secular holidays. I used the NMF analysis to show the most interesting "topics" for religious tweets on Memorial Day 2014. I use the terms returned within a topic to seed several elasticsearch queries that return the phrases displayed by bakkify. The ordering uses the weights returned by NMF.
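The seeding step looks roughly like the sketch below. The client library, index name and field name are assumptions for illustration, not the production code.

from elasticsearch import Elasticsearch

es = Elasticsearch(['http://localhost:9200'])

def phrases_for_topic(topic, index='habakkuk', size=10):
    """Query elasticsearch for tweets matching a topic's top bigrams and
    keep the NMF weight so the results can be ordered by it."""
    results = []
    for term in topic:  # e.g. {'text': u'impossible possible', 'weight': 0.45}
        body = {'query': {'match_phrase': {'text': term['text']}}, 'size': size}
        hits = es.search(index=index, body=body)['hits']['hits']
        results.append({'term': term['text'],
                        'weight': term['weight'],
                        'tweets': [hit['_source']['text'] for hit in hits]})
    # order the display by the NMF weights
    return sorted(results, key=lambda r: r['weight'], reverse=True)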

The analysis correlated 'Psalm 33:12' with "Happy Memorial Day" when I analyzed over 10k tweets sent the week of 2014-05-26.

Click here to see the Memorial Day topic analysis on bakkify.com. I also ran the analysis for other holidays such as New Year's, Valentine's Day and Thanksgiving.

Added JSON output to mahout clusterdump

In a prior post, I used mahout to cluster religious tweeters by the bible books found in their tweets. The clusterdump utility prints the kmeans cluster output in a free-text format. I've submitted a patch to mahout that adds a JSON output format to clusterdump. JSON is machine readable and makes it easy for an application developed in another framework (like django) to read the clusters.

The code lives in my mahout fork on github. Run the commands below to build it.

git clone git@github.com:telvis07/mahout.git
cd mahout
mvn compile package -DskipTests

# to (optionally) run the unittest for this feature
mvn -pl integration \
 -Dtest=*.TestClusterDumper#testJsonClusterDumper test

./bin/mahout clusterdump -d dictionary -dt \
  text -i clusters/clusters-*-final -p clusters/clusteredPoints \
  -n 10 -o clusterdump.json -of JSON

The command produces output similar to this...

{
  "top_terms": [
    {
      "term": "proverbs",
      "weight": 0.19125590817015531
    },
    {
      "term": "romans",
      "weight": 0.16306549628629305
    }
  ],
  "points": [
    {
      "vector_name": "ssbo",
      "weight": "1.0",
      "point": "ssbo = [proverbs:1.000]"
    },
    {
      "vector_name": "37_DC",
      "weight": "1.0",
      "point": "37_DC = [proverbs:1.000]"
    },
    {
      "vector_name": "3HHHs",
      "weight": "1.0",
      "point": "3HHHs = [proverbs:1.000]"
    },
    {
      "vector_name": "EPUBC",
      "weight": "1.0",
      "point": "EPUBC = [proverbs:1.000]"
    },
    {
      "vector_name": "ILJ_4",
      "weight": "1.0",
      "point": "ILJ_4 = [romans:1.000]"
    }
  ],
  "cluster_id": 10515,
  "cluster": "VL-10515{n=5924 c=[genesis:0.000, exodus:0.009, ...]}"
}
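
Since the output is plain JSON, a downstream application can read the clusters with nothing but the standard library. A minimal sketch, assuming one JSON object per cluster in the output file:

import json

with open('clusterdump.json') as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        cluster = json.loads(line)
        top_terms = [(t['term'], t['weight']) for t in cluster['top_terms']]
        # e.g. render cluster['cluster_id'] and top_terms in a django view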

Using Mahout to Group Religious Twitter Users

This post describes a method to group religious tweeters by their scripture reference patterns using the kmeans clustering algorithm. I use Apache Pig to process data retrieved by Habakkuk and Mahout to perform clustering.

Clustering Primer

Clustering consists of representing an entity (e.g. tweeter) as a feature vector, choosing a similarity metric and then applying a clustering algorithm to group the vectors based on similarity.

Feature Vector Example

Suppose we have two tweeters. Tweeter #1 tweets a reference to Exodus 1:1 and tweeter #2 tweets a reference to Genesis 1:1. Each tweeter can be represented as a "feature vector" that counts the number of references to a book by that tweeter.

Positions (Genesis Reference Count, Exodus Reference Count)
Tweeter #1: (0, 1)
Tweeter #2: (1, 0)

Each book has an assigned position within the feature vector. The first vector shows tweeter #1 referenced Genesis zero (0) times and Exodus one (1) time. The second vector shows tweeter #2 referenced Genesis one (1) time and Exodus zero (0) times. The vector can be extended to count every book or even every scripture reference. There are 66 books and 31,102 verses in the KJV-based bible. Because these numbers are fixed, books and verses can easily be mapped to integers that serve as vector positions.
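
Here is a toy sketch of that mapping (only a few of the 66 books shown; this is not the project's dictionary code):

# fixed book -> vector position mapping
BOOKS = ['genesis', 'exodus', 'leviticus', 'numbers', 'deuteronomy']
BOOK_ID = dict((book, i) for i, book in enumerate(BOOKS))

def book_vector(book_counts):
    """Turn a dict like {'exodus': 1} into a fixed-length count vector."""
    vector = [0] * len(BOOKS)
    for book, count in book_counts.items():
        vector[BOOK_ID[book]] = count
    return vector

# tweeter #1 referenced Exodus once, tweeter #2 referenced Genesis once
tweeter_1 = book_vector({'exodus': 1})   # [0, 1, 0, 0, 0]
tweeter_2 = book_vector({'genesis': 1})  # [1, 0, 0, 0, 0]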

Similarity Metric

There are many methods for calculating the similarity between two vectors. Cosine similarity measures the angle between vectors, so smaller angles imply more similar vectors. Euclidean distance measures the distance between vectors, so "close" vectors are deemed similar. Other measures include the Tanimoto coefficient and Manhattan distance. There are many techniques; consult the webernets for more details.
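
For concreteness, the two measures mentioned above can be computed by hand on the toy vectors from the previous section (a sketch using plain Python lists):

import math

def cosine_similarity(a, b):
    """1.0 means the vectors point the same way, 0.0 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Smaller distances mean more similar vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# the two tweeters above share no books: similarity 0.0, distance sqrt(2)
cosine_similarity([0, 1], [1, 0])    # 0.0
euclidean_distance([0, 1], [1, 0])   # ~1.414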

Kmeans Clustering

As the number of vectors grows, it becomes computationally expensive to calculate the similarity of all vectors in a data set. Kmeans clustering is an efficient algorithm for identifying groups of similar vectors in large data sets.

The algorithm groups n vectors into k clusters of similar vectors. A cluster can be thought of as a geometric circle whose center defines a centroid point. The kmeans algorithm randomly picks k initial centroid points and assigns all n vectors to a cluster based on the nearest centroid. Next, a new round begins where new centroid points are calculated from the mean of the vectors assigned to each of the k clusters in the previous round (hence k-means). The algorithm stops when the centroid points from subsequent rounds are 'close enough' or the maximum number of rounds has been reached.
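
A bare-bones version of that loop, using Euclidean distance for the nearest-centroid assignment (just a sketch to show the idea; Mahout's implementation runs on Hadoop and is far more involved):

import math
import random

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(vectors, k, max_rounds=10, threshold=0.001):
    """Toy kmeans: pick k random centroids, assign, recompute, repeat."""
    centroids = random.sample(vectors, k)
    clusters = []
    for _ in range(max_rounds):
        # assign each vector to the cluster with the nearest centroid
        clusters = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k), key=lambda i: distance(v, centroids[i]))
            clusters[nearest].append(v)
        # new centroids are the mean of each cluster's vectors (hence k-means)
        new_centroids = [
            [sum(dim) / float(len(c)) for dim in zip(*c)] if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        # stop when the centroids from subsequent rounds are 'close enough'
        if max(distance(a, b) for a, b in zip(centroids, new_centroids)) < threshold:
            break
        centroids = new_centroids
    return centroids, clusters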

Performing Kmeans clustering using Hadoop

Two Hadoop-based applications are used to perform this analysis. First, the raw tweets stored in JSON must be processed and converted into feature vectors using Apache Pig. Second, kmeans clustering is performed on the feature vectors using Apache Mahout.

Feature Extraction

Habakkuk tweets are "tagged" with a book (e.g. genesis) and a bibleverse (e.g. genesis 1:1). The pig script below describes a data flow that transforms the habakkuk tweets into per-tweeter book feature vectors. The load statement uses elephant-bird to load the raw JSON from disk into Pig. The join statement uses the book name to correlate a book id with each tweet; the book id serves as the vector position. The group by organizes the records by tweeter and book, and the foreach counts the book references. Finally, the store statement uses elephant-bird to write the data in a format mahout can read.

This is just a code snippet. Check github for the full script.


-- load habakkuk json data, generate screenname and book reference
tweets = load '$data' using com.twitter.elephantbird.pig.load.JsonLoader();
filtered = foreach tweets generate
        (chararray)$0#'screenname' as screenname,
        (chararray)$0#'book' as book;

-- load book ids for join
bookids = load '$book_dict' as (book:chararray, docfreq:int, bookid:int);
filtered = join bookids by book, filtered by book;

-- group using tuple(screenname,book) as key
by_screen_book = group filtered by (screenname, bookids::bookid);

-- generate counts for each screenname, book
book_counts = foreach by_screen_book {
    generate group.screenname as screenname,
         group.bookids::bookid as bookid,
         COUNT(filtered) as count;
}

-- group by screenname: bag{(screenname, bookid, count)}
grpd = group book_counts by screenname;

-- nested projection to get: screenname: entries:bag{(bookid, count)}
-- uses ToTuple because SEQFILE_STORAGE expects bag to be in a tuple
vector_input = foreach grpd generate group,
       org.apache.pig.piggybank.evaluation.util.ToTuple(book_counts.(bookid, count));

-- store to sequence files
STORE vector_input INTO '$output' USING $SEQFILE_STORAGE (
  '-c $TEXT_CONVERTER', '-c $VECTOR_CONVERTER -- -cardinality 66'
);

Mahout in Action

The example mahout command below uses kmeans to generate 2 clusters (-k 2), choosing the initial clusters at random and placing them in kmeans-initial-clusters. The maximum number of iterations is 10 (-x). Kmeans will use the cosine distance measure (-dm) with a convergence threshold (-cd) of 0.1 instead of the default value of 0.5, because cosine distances lie between 0 and 1.

$ mahout kmeans -i book_vectors-nv  \
-c kmeans-initial-clusters -k 2 -o clusters \
-x 10 -ow -cl -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
-cd 0.1

Results

Finally, the clusterdump command prints information about the clusters, such as the top books and the tweeters in each cluster. These clusters were generated from a small sample of only 10 tweets.

$ mahout clusterdump -d join_data/book.dictionary \
      -dt text -s clusters/clusters-1 \
      -p clusters/clusteredPoints -n 10 -o clusterdump.log
$ cat clusterdump.log
CL-0 ...
    Top Terms: 
        luke                                    =>  0.4444444444444444
        matthew                                 =>  0.3333333333333333
        john                                    =>  0.2222222222222222
        galatians                               =>  0.1111111111111111
        philippians                             =>  0.1111111111111111
    Weight:  Point:
    1.0: Zigs26 = [luke:1.000]
    1.0: da_nellie = [john:1.000]
    1.0: austinn_21 = [luke:1.000]
    1.0: YUMADison22 = [luke:1.000]
    1.0: chap_stique = [galatians:1.000]
    1.0: ApesWhitelaw = [matthew:2.000, john:1.000]
    1.0: alexxrenee22 = [luke:1.000]
    1.0: AbigailObregon3 = [philippians:1.000]
    1.0: thezealofisrael = [matthew:1.000]
VL-7 ...
    Top Terms: 
        ephesians                               =>                 1.0
    Weight:  Point:
    1.0: Affirm_Success = [ephesians:1.000]

The results show 2 clusters. One cluster has 9 tweeters with luke, matthew, john, galatians and philippians as the top books. The second cluster has 1 tweeter with ephesians as the top book. Obviously, YMMV with different convergence thresholds, data and distance metrics.

Valentine’s Day Scripture Usage on Twitter

I'd like to know which bible verses were popular on twitter on Valentine's Day 2013. I've added elasticsearch support to my project habakkuk to store religious tweets for analysis. The result turns out to be a collection of love scriptures. Very nice!

Results

  1. John 3:16  "For God so loved the world that he gave his one and only Son, that whoever believes in him shall not perish but have eternal life."
  2. 1 John 4:19 "We love because he first loved us."
  3. 1 Corinthians 13:4 "Love is patient, love is kind. It does not envy, it does not boast, it is not proud."
  4. 1 Corinthians 13:13 "And now these three remain: faith, hope and love. But the greatest of these is love."
  5. John 14:23  "Jesus replied, “Anyone who loves me will obey my teaching. My Father will love them, and we will come to them and make our home with them."
  6. Psalm 37:23 "The steps of a good man are ordered by the Lord: and he delighteth in his way."
  7. John 15:13 "Greater love has no one than this: to lay down one’s life for one’s friends."
  8. 1 Corinthians 13:7  "[Love]  always protects, always trusts, always hopes, always perseveres."
  9. Philippians 4:13 "I can do all this through him who gives me strength."
  10. Romans 5:8 "But God demonstrates his own love for us in this: While we were still sinners, Christ died for us."

Technical Details

To get the data for Valentine's Day 2013, I executed the following:

$ python bible_facet.py -s 2013-02-14 -e 2013-02-15

The script prints the raw query JSON for the Elasticsearch Query DSL, followed by the top 10 bible references. It's basically a faceted search on the bibleverse field.

    {
      "query": {
        "filtered": {
          "filter": {
            "range": {
              "created_at_date": {
                "to": "2013-02-15T00:00:00",
                "include_upper": false,
                "from": "2013-02-14T00:00:00"
              }
            }
          },
          "query": {
            "match_all": {}
          }
        }
      },
      "facets": {
        "bibleverse": {
          "terms": {
            "field": "bibleverse",
            "order": "count",
            "size": 10
          }
        }
      },
      "size": 0
    }

Total bibleverse 1568
  john 3:16 85
  i_john 4:19 42
  i_corinthians 13:4 31
  i_corinthians 13:13 24
  john 14:23 19
  psalm 37:23 18
  john 15:13 18
  i_corinthians 13:7 18
  philippians 4:13 17
  romans 5:8 15
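
For reference, a facet query like the one above can be issued from Python roughly as follows. This is only a sketch: the client library, index name and response handling are assumptions, and the project's bible_facet.py is the real implementation.

from elasticsearch import Elasticsearch

es = Elasticsearch(['http://localhost:9200'])

query = {
    "query": {
        "filtered": {
            "filter": {"range": {"created_at_date": {"from": "2013-02-14T00:00:00",
                                                     "to": "2013-02-15T00:00:00",
                                                     "include_upper": False}}},
            "query": {"match_all": {}}
        }
    },
    "facets": {"bibleverse": {"terms": {"field": "bibleverse",
                                        "order": "count",
                                        "size": 10}}},
    "size": 0
}

result = es.search(index='habakkuk', body=query)
# each facet entry holds the bibleverse string and its tweet count
for entry in result['facets']['bibleverse']['terms']:
    print entry['term'], entry['count']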

 

Project Details

Please reference the elasticsearch readme in habakkuk for details regarding the query that obtained this data. Special thanks to http://www.biblegateway.com for the NIV Bible text. I should also shout out to openbible.info; their twitter Lent blog is awesome.

Finding religious tweets using storm

This is an example of using storm to filter tweets in real time. Storm is a development platform designed for real-time streaming applications. It is open source and can be found here. The storm-starter project is an excellent place to start when building your own project. Maven's project page is also useful.

Storm Overview

Storm applications are defined using a "topology" model composed of spouts and bolts. A spout defines the topology entry point and data source. A bolt defines a component that performs a mutation, filter or aggregation on a data stream. Storm has multi-language support for bolts; in this example I use a python bolt. Spouts and bolts consume and emit tuples, which are object representations of the messages that pass through the topology.

Advantages

The topology design has at least 2 advantages, IMHO. First, the modular design makes components interchangeable, so major design changes can be made by simply switching out bolts. Second, the platform is designed to scale horizontally (like hadoop), so one should be able to test on a single system and then deploy on a large cluster with minimal change. That's pretty awesome!

Introducing Habakkuk

Habakkuk is an application for filtering tweets containing Christian Bible references. The goal is to capture the book name, chapter number, verse number and tweet text for further analysis.

Storm Topology

The storm topology consists of two components. The TwitterSampleSpout initiates a twitter streaming connection using the username and password from the command line. The ScriptureFilterBolt is a bolt written in python that applies a regex for bible reference matching. The code snippet below shows how the TwitterSampleSpout and ScriptureFilterBolt are connected using a storm TopologyBuilder class. The topology is run using the LocalCluster class that is intended for single host testing.

public static void main(String[] args) {
   String username = args[0];
   String pwd = args[1];
   TopologyBuilder builder = new TopologyBuilder();

   builder.setSpout("twitter",
                    new TwitterSampleSpout(username, pwd),
                    1);
   builder.setBolt("filter",
                   new ScriptureFilterBolt(), 1)
                   .shuffleGrouping("twitter");

   Config conf = new Config();
   LocalCluster cluster = new LocalCluster();
   cluster.submitTopology("test", conf, builder.createTopology());
 }

Twitter Spout

The TwitterSampleSpout is copied directly from the storm-starter project. The only change is in nextTuple(), where the username and status text are copied to a hashmap. I ran into serialization issues trying to emit the Twitter4j status object to a ShellBolt. There's probably a reasonable workaround for this but my java-fu wasn't up to the task. 🙂

The snippet below is the modified nextTuple() from TwitterSampleSpout.java. It simply puts the name and text in a hashmap and emits the hashmap.

   @Override
    public void nextTuple() {
        Status ret = queue.poll();
        if(ret==null) {
            Utils.sleep(50);
        } else {
            Map data = new HashMap();
            data.put("username", ret.getUser().getName());
            data.put("text", ret.getText());
            _collector.emit(new Values(data));
        }
    }

Python Bolt

The ScriptureFilterBolt is basically copied from the storm-starter project and modified to load my python script. The python module ScriptureFilterBolt.py simply applies a python regex to the status text and looks for matches. Right now, the match is printed to a console but in the future the tuple will be emitted to another bolt.

The snippet below shows how to define a ShellBolt to invoke a python module.

package technicalelvis.habakkuk.bolt;
import backtype.storm.task.ShellBolt;
import backtype.storm.topology.IRichBolt;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.tuple.Fields;
import java.util.Map;

public class ScriptureFilterBolt extends ShellBolt implements IRichBolt{
	private static final long serialVersionUID = 1L;
	public ScriptureFilterBolt() {
        super("python", "ScriptureFilterBolt.py");
    }

	public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("book"));
    }

    public Map getComponentConfiguration() {
        return null;
    }
}

The snippet below shows the python bolt implementation.

import storm
from find_all_scriptures import find_all_scriptures, filtergroupdict
import jsonlib2 as json
class ScriptureParserBolt(storm.BasicBolt):
    def process(self,tup):
        txt = tup.values[0]['text']
        matches = find_all_scriptures(txt)
        for ma in matches:
            storm.log("%s"%tup.values[0])
            ret = filtergroupdict(ma)
            #get matched string
            matext = \
              ma.string[ma.start():ma.end()]\
                .replace('\r\n','')
            storm.log("Match %s %s {STRING:'%s'}"%\
                      (ret['book'],ret['verse'], matext))
ScriptureParserBolt().run()

A snippet of the bible reference regex is shown below. The full source is on github.

import re

find_all_scriptures = re.compile("""
(
   (?P<genesis>ge\w{0,5}\.?)     # genesis
   |(?P<exodus>ex\w{0,4}\.?)     # exodus
   |(?P<leviticus>le\w{0,7}\.?)  # leviticus
   # other bible books (group names here are illustrative; see github for the full pattern)
)
\s+(?P<verse>\d{1,3}\s*:\s*\d{1,3})
""", re.VERBOSE | re.MULTILINE).finditer

Build and Run

Please refer to the maven documentation for pom.xml info. Execute the following commands to build the jar and start the topology.

$ git clone git@github.com:telvis07/habakkuk.git
$ cd habakkuk/java/habakkuk-core
$ mvn compile
$ mvn package
$ storm jar target/habakkuk-core-0.0.1-SNAPSHOT-jar-with-dependencies.jar technicalelvis.habakkuk.MainTopology habakkuk.properties

You should see output like below.

143736 [Thread-27] INFO  backtype.storm.task.ShellBolt  - Shell msg: {u'username': u"someuser", u'text': u'I like John 3:16'}
143736 [Thread-27] INFO  backtype.storm.task.ShellBolt  - Shell msg: Match John 3:16 {STRING: 'John 3:16'}

Summary

Storm is a platform for developing real-time streaming, filtering and aggregation apps. A storm app is represented as a topology composed of spouts and bolts. Bolts have multi-language support, which allows Python code to be integrated into a topology.

Fork me on github

This code will evolve over time. Find the complete codebase on github at: https://github.com/telvis07/habakkuk. The develop branch has the latest stuff.

twitter mining: count hashtags per day

We can use CouchDB views to count twitter hashtags per day. I use two views. The first view uses a mapper to map hashtags to a [YEAR, MONTH, DAY] tuple; this view can subsequently be queried for the hashtags used on a given date.

import couchdb
from couchdb.design import ViewDefinition

def time_hashtag_mapper(doc):
    """Hash tag by timestamp"""
    from datetime import datetime
    if doc.get('created_at'):
        _date = doc['created_at']
    else:
        _date = 0 # Jan 1 1970

    if doc.get('entities') and doc['entities'].get('hashtags'):
        dt = datetime.fromtimestamp(_date).utctimetuple()
        for hashtag in (doc['entities']['hashtags']):
            yield([dt.tm_year, dt.tm_mon, dt.tm_mday], 
                   hashtag['text'].lower())

view = ViewDefinition('index',
                      'time_hashtags',
                      time_hashtag_mapper,
                      language='python')
view.sync(db)

The second view maps each tweet to a tuple containing the [YEAR, MONTH, DAY, HASHTAG]. Then a reducer is used to count the tweets matching the tuple.

import couchdb
from couchdb.design import ViewDefinition

def date_hashtag_mapper(doc):
    """tweet by date+hashtag"""
    from datetime import datetime
    if doc.get('created_at'):
        _date = doc['created_at']
    else:
        _date = 0 # Jan 1 1970

    dt = datetime.fromtimestamp(_date).utctimetuple()
    if doc.get('entities') and doc['entities'].get('hashtags'):
        for hashtag in (doc['entities']['hashtags']):
            yield ([dt.tm_year, dt.tm_mon, dt.tm_mday, 
                    hashtag['text'].lower()], 
                   doc['_id'])

def sumreducer(keys, values, rereduce):
    """count then sum"""
    if rereduce:
        return sum(values)
    else:
        return len(values)

view = ViewDefinition('index',
                      'daily_tagcount',
                      date_hashtag_mapper,
                      reduce_fun=sumreducer,
                      language='python')
view.sync(db)

Finally, query the first view to find tags for the day and then query the second view for tweet counts per tag for the day.

import sys
import couchdb
import time
from datetime import date, datetime

server = couchdb.Server('http://localhost:5984')
dbname = sys.argv[1]
db = server[dbname]

_date  = sys.argv[2]
dt = datetime.strptime(_date,"%Y-%m-%d").utctimetuple()

# get tags for this time interval
_key = [dt.tm_year, dt.tm_mon, dt.tm_mday]
tags = [row.value for row in db.view('index/time_hashtags', key=_key)]
tags = list(set(tags))
print "Tags today",len(tags)
print ""

# get count for date and hashtag
for tag in sorted(tags):
    _key = [dt.tm_year, dt.tm_mon, dt.tm_mday, tag]
    tag_count = \
      [ (row.value) for row in db.view('index/daily_tagcount', key=_key) ]
    print "Found %d %s on %s-%s-%s "%\
      (tag_count[0],tag,_key[0],_key[1],_key[2])

This code will evolve over time.
Find the complete codebase on github at: https://github.com/telvis07/twitter_mining. The develop branch has the latest stuff.

twitter mining by geolocation

Twitter's streaming api permits filtering tweets by geolocation. According to the api documentation, only tweets that are created using the Geotagging API can be filtered. The code below uses tweepy to filter tweets for the San Francisco area.

#!/usr/bin/env python
import tweepy
import ConfigParser
import os, sys

class Listener(tweepy.StreamListener):
    def on_status(self, status):
        print "screen_name='%s' tweet='%s'"%(status.author.screen_name, status.text)

def login(config):
    """Tweepy oauth dance
    The config file should contain:

    [auth]
    CONSUMER_KEY = ...
    CONSUMER_SECRET = ...
    ACCESS_TOKEN = ...
    ACCESS_TOKEN_SECRET = ...
    """     
    CONSUMER_KEY = config.get('auth','CONSUMER_KEY')
    CONSUMER_SECRET = config.get('auth','CONSUMER_SECRET')
    ACCESS_TOKEN = config.get('auth','ACCESS_TOKEN')
    ACCESS_TOKEN_SECRET = config.get('auth','ACCESS_TOKEN_SECRET')
    
    auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
    auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)
    return auth


fn=sys.argv[1]
config = ConfigParser.RawConfigParser()
config.read(fn)
try:
    auth = login(config)
    streaming_api = tweepy.streaming.Stream(auth, Listener(), timeout=60)
    # San Francisco area.
    streaming_api.filter(follow=None, locations=[-122.75,36.8,-121.75,37.8]) 
except KeyboardInterrupt:
    print "got keyboardinterrupt"

Find the complete codebase on github at: https://github.com/telvis07/twitter_mining

twitter mining: top tweets with links

It's useful to filter out "conversational" tweets and look for tweets with links to another page or picture, etc.

We create a view that only maps tweets with link entities.

import couchdb
from couchdb.design import ViewDefinition
import sys

def url_tweets_by_created_at(doc):
    if doc.get('created_at'):
        _date = doc['created_at']
    else:
        _date = 0 # Jan 1 1970

    if doc.get('entities') and doc['entities'].get('urls') \
      and len(doc['entities']['urls']):
        if doc.get('user'):
            yield (_date, doc)

view = ViewDefinition('index', 'daily_url_tweets', 
                      url_tweets_by_created_at, language='python')
view.sync(db)

Next we create an app that reads from this view and displays the results.

import sys
import time
import couchdb
from datetime import datetime

def run(db, date, limit=10):
    """Query a couchdb view for tweets. Sort in memory by follower count.
    Return the top 10 tweeters and their tweets"""
    print "Finding top %d tweeters"%limit

    dt = datetime.strptime(date,"%Y-%m-%d")
    stime=int(time.mktime(dt.timetuple()))
    etime=stime+86400-1
    tweeters = {}
    tweets = {}
    # get screen_name, follower_counts and tweet ids for looking up later
    for row in db.view('index/daily_url_tweets', startkey=stime, endkey=etime):
        status = row.value
        screen_name = status['user']['screen_name']
        followers_count = status['user']['followers_count']
        tweeters[screen_name] = int(followers_count)
        if not tweets.has_key(screen_name):
            tweets[screen_name] = []
        tweets[screen_name].append(status['id_str'])

    # sort
    print len(tweeters.keys())
    di = tweeters.items()
    di.sort(key=lambda x: x[1], reverse=True)
    out = {}
    for i in range(limit):
        screen_name = di[i][0]
        followers_count = di[i][1]
        out[screen_name] = {}
        out[screen_name]['follower_count'] = followers_count
        out[screen_name]['tweets'] = {}
        # print i,screen_name,followers_count
        for tweetid in tweets[screen_name]:
            status = db[tweetid]
            text = status['orig_text']
            # print tweetid,orig_text
            urls = status['entities']['urls']
            #name = status['user']['name']
            for url in urls:
                text = text.replace(url['url'],url['expanded_url'])
            out[screen_name]['tweets'][tweetid] = text

    return out

server = couchdb.Server('http://localhost:5984')
dbname = sys.argv[1]
db = server[dbname]
date = '2012-03-05'
output = run(db, date)

Find the complete codebase on github at: https://github.com/telvis07/twitter_mining

twitter mining: top tweets by follower count

We can find interesting tweets using the author's follower count and the tweet timestamp. We store tweets in CouchDB and collect them with tweepy streaming. With these tools we can find the top N tweets per day. The code below uses the couchpy view server to write a view in python. The steps to set up couchpy are found here: basically, you install couchpy and add the following to /etc/couchdb/local.ini.

Install couchpy and couchdb-python with the following command.

pip install couchdb

Test that couchpy is installed.

$ which couchpy
/usr/bin/couchpy

Edit /etc/couchdb/local.ini

[query_servers]
python=/usr/bin/couchpy

This is a simple view mapper that maps each tweet to a timestamp so we can query by start and end time.


import couchdb
from couchdb.design import ViewDefinition
import sys

server = couchdb.Server('http://localhost:5984')
db = sys.argv[1]
db = server[db]

def tweets_by_created_at(doc):
    if doc.get('created_at'):
        _date = doc['created_at']
    else:
        _date = 0 # Jan 1 1970
    
    if doc.get('user'):
        yield (_date, doc) 
        
view = ViewDefinition('index', 'daily_tweets', tweets_by_created_at, language='python')
view.sync(db)

The code below queries the view for all tweets within a date range. Then we sort in memory by the follower count.

import sys
import time
import couchdb
from datetime import datetime

def run(db, date, limit=10):
    """Query a couchdb view for tweets. Sort in memory by follower count.
    Return the top 10 tweeters and their tweets"""
    print "Finding top %d tweeters"%limit
        
    dt = datetime.strptime(date,"%Y-%m-%d")
    stime=int(time.mktime(dt.timetuple()))
    etime=stime+86400-1
    tweeters = {}
    tweets = {}
    for row in db.view('index/daily_tweets', startkey=stime, endkey=etime):
        status = row.value
        screen_name = status['user']['screen_name']
        followers_count = status['user']['followers_count']
        tweeters[screen_name] = int(followers_count)
        if not tweets.has_key(screen_name):
            tweets[screen_name] = []
        tweets[screen_name].append(status['id_str'])
        
    # sort
    di = tweeters.items() 
    di.sort(key=lambda x: x[1], reverse=True)
    out = {}
    for i in range(limit):
        screen_name = di[i][0]
        followers_count = di[i][1]
        out[screen_name] = {}
        out[screen_name]['follower_count'] = followers_count
        out[screen_name]['tweets'] = {}
        # print i,screen_name,followers_count
        for tweetid in tweets[screen_name]:
            orig_text = db[tweetid]['orig_text']
            # print tweetid,orig_text
            out[screen_name]['tweets'][tweetid] = orig_text

    return out

server = couchdb.Server('http://localhost:5984')
dbname = sys.argv[1]
db = server[dbname]
date = '2012-03-05'
output = run(db, date)

Find the complete codebase on github at: https://github.com/telvis07/twitter_mining