twitter mining: top tweets by follower count

We can find interesting tweets using the author's follower count and the tweet timestamp. We store tweets in CouchDB and collect them with tweepy's streaming API. With these tools we can find the top N tweets per day. The code below uses the couchpy view server to write a view in Python. Setting up couchpy takes two steps: install it, then register it as a query server in /etc/couchdb/local.ini.

Install couchdb-python, which ships the couchpy view server, with the following command.

pip install couchdb

Check that couchpy is installed.

$ which couchpy
/usr/bin/couchpy

Edit /etc/couchdb/local.ini to register couchpy as the Python query server, then restart CouchDB.

[query_servers]
python=/usr/bin/couchpy

This is a simple view mapper that maps each tweet to its timestamp so we can query by start and end time.


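import sys

import couchdb
from couchdb.design import ViewDefinition

server = couchdb.Server('http://localhost:5984')
dbname = sys.argv[1]  # database name is the first command-line argument
db = server[dbname]

def tweets_by_created_at(doc):
    # Tweets without a timestamp get key 0 (Jan 1 1970) and sort first.
    _date = doc.get('created_at') or 0

    # Only emit documents that carry a user, i.e. that look like tweets.
    if doc.get('user'):
        yield (_date, doc)

# Register the map function as view 'daily_tweets' in design document 'index'.
view = ViewDefinition('index', 'daily_tweets', tweets_by_created_at, language='python')
view.sync(db)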
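
The mapper's fallback key of 0 and the integer range query used later suggest that created_at is stored as epoch seconds. Twitter's API delivers created_at as a string such as "Mon Mar 05 00:00:00 +0000 2012", so a conversion along these lines would be needed before saving each tweet. This is only a sketch: the helper name to_epoch is hypothetical, and whether the collector in this project converts timestamps this way is an assumption.

```python
import calendar
from datetime import datetime

# Layout of the created_at string in Twitter payloads (always UTC).
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S +0000 %Y"

def to_epoch(created_at):
    """Convert a Twitter created_at string to integer epoch seconds (UTC).

    Hypothetical helper, not part of the original code.
    """
    dt = datetime.strptime(created_at, TWITTER_TIME_FORMAT)
    # timegm reads the time tuple as UTC, unlike time.mktime (local time).
    return calendar.timegm(dt.timetuple())

print(to_epoch("Mon Mar 05 00:00:00 +0000 2012"))
```

With numeric keys, CouchDB collates the view rows in chronological order, which is what makes a startkey/endkey range query by day possible.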

The code below queries the view for all tweets within a date range. Then we sort in memory by the follower count.

import sys
import time
from datetime import datetime

import couchdb

def run(db, date, limit=10):
    """Query a couchdb view for tweets. Sort in memory by follower count.
    Return the top `limit` tweeters and their tweets."""
    print("Finding top %d tweeters" % limit)

    # Build a one-day window [stime, etime] in epoch seconds (local time).
    dt = datetime.strptime(date, "%Y-%m-%d")
    stime = int(time.mktime(dt.timetuple()))
    etime = stime + 86400 - 1
    tweeters = {}  # screen_name -> follower count
    tweets = {}    # screen_name -> list of tweet ids
    for row in db.view('index/daily_tweets', startkey=stime, endkey=etime):
        status = row.value
        screen_name = status['user']['screen_name']
        followers_count = status['user']['followers_count']
        tweeters[screen_name] = int(followers_count)
        tweets.setdefault(screen_name, []).append(status['id_str'])

    # Sort tweeters by follower count, largest first.
    di = sorted(tweeters.items(), key=lambda x: x[1], reverse=True)
    out = {}
    # Slicing avoids an IndexError when fewer than `limit` tweeters are found.
    for screen_name, followers_count in di[:limit]:
        out[screen_name] = {'follower_count': followers_count, 'tweets': {}}
        for tweetid in tweets[screen_name]:
            # Look up the stored document by tweet id to get its text.
            orig_text = db[tweetid]['orig_text']
            out[screen_name]['tweets'][tweetid] = orig_text

    return out

server = couchdb.Server('http://localhost:5984')
dbname = sys.argv[1]  # assumption: database name passed on the command line, as in the view script
db = server[dbname]
date = '2012-03-05'
output = run(db, date)
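
The two in-memory steps in run, building the day window and ranking by follower count, can be pulled out into small functions that are testable without CouchDB. This is a sketch under stated assumptions: the names day_bounds and top_tweeters are hypothetical, calendar.timegm is swapped in for time.mktime so the window is computed in UTC instead of the machine's local time zone (appropriate only if timestamps were stored in UTC), and heapq.nlargest stands in for the full sort.

```python
import calendar
import heapq
from datetime import datetime

def day_bounds(date):
    """Return (start, end) epoch seconds covering one UTC day, inclusive.

    Hypothetical helper; the original code uses time.mktime (local time).
    """
    dt = datetime.strptime(date, "%Y-%m-%d")
    start = calendar.timegm(dt.timetuple())
    return start, start + 86400 - 1

def top_tweeters(followers, limit=10):
    """Return up to `limit` (screen_name, follower_count) pairs, largest first."""
    # nlargest returns fewer items when the dict has fewer than `limit` entries.
    return heapq.nlargest(limit, followers.items(), key=lambda item: item[1])

start, end = day_bounds("2012-03-05")
print(start, end)
print(top_tweeters({"alice": 120, "bob": 4500, "carol": 980}, limit=2))
```

The bounds would be passed as startkey and endkey to db.view, and the follower dictionary corresponds to tweeters in run.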

Find the complete codebase on GitHub at: https://github.com/telvis07/twitter_mining
