We can find interesting tweets using the author's follower count and the tweet's timestamp. We store tweets in CouchDB and search for tweets using tweepy streaming. With these tools we can find the top N tweets per day. The code below uses the couchpy view server to write a view in Python. The steps to set up couchpy are found here. Basically, you install couchpy and add a query server entry to /etc/couchdb/local.ini.
Install couchpy and couchdb-python with the following command.
pip install couchdb
Check that couchpy is installed.
$ which couchpy
/usr/bin/couchpy
Edit /etc/couchdb/local.ini
[query_servers]
python=/usr/bin/couchpy
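As a quick sanity check on the syntax of that entry, here is a minimal sketch that parses the same two lines with Python 3's standard configparser module. It works on a sample string rather than the real file, and the names sample_ini and couchpy_path are only illustrative.

```python
import configparser

# Sample text mirroring the local.ini entry above (not read from disk).
sample_ini = """
[query_servers]
python=/usr/bin/couchpy
"""

parser = configparser.ConfigParser()
parser.read_string(sample_ini)

# The key is the language name used in the view definition,
# the value is the path that `which couchpy` reported.
couchpy_path = parser.get('query_servers', 'python')
print(couchpy_path)
```

The value should match whatever path `which couchpy` printed on your machine.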
This is a simple view mapper that maps each tweet to its timestamp so we can query by start and end time.
import couchdb
from couchdb.design import ViewDefinition
import sys

server = couchdb.Server('http://localhost:5984')
db = sys.argv[1]
db = server[db]

def tweets_by_created_at(doc):
    if doc.get('created_at'):
        _date = doc['created_at']
    else:
        _date = 0 # Jan 1 1970
    if doc.get('user'):
        yield (_date, doc)

view = ViewDefinition('index', 'daily_tweets', tweets_by_created_at, language='python')
view.sync(db)
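Because the mapper is a plain Python generator, it can be tried out without a running CouchDB. The sketch below repeats the mapper and feeds it a few hypothetical documents. It assumes created_at is stored as epoch seconds, which is what the integer start and end keys in the query imply; the sample documents are made up for illustration.

```python
def tweets_by_created_at(doc):
    """Same mapper as above: key each tweet by its created_at timestamp."""
    if doc.get('created_at'):
        _date = doc['created_at']
    else:
        _date = 0  # Jan 1 1970
    if doc.get('user'):
        yield (_date, doc)


# Hypothetical documents; created_at is assumed to be epoch seconds.
docs = [
    {'_id': '1', 'created_at': 1330905600, 'user': {'screen_name': 'alice'}},
    {'_id': '2', 'user': {'screen_name': 'bob'}},  # no timestamp, keyed at 0
    {'_id': '3', 'created_at': 1330905700},        # no user, not emitted
]

# Collect every (key, value) pair the mapper would emit for these docs.
emitted = [row for doc in docs for row in tweets_by_created_at(doc)]
keys = [key for key, _ in emitted]
print(keys)
```

Documents without a user field never reach the view, and documents without a timestamp sort to the very beginning of the index, so they fall outside any normal date range.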
The code below queries the view for all tweets within a date range. Then we sort in memory by the follower count.
import sys
import time
from datetime import datetime

import couchdb

def run(db, date, limit=10):
    """Query a couchdb view for tweets. Sort in memory by follower count.
    Return the top `limit` tweeters and their tweets"""
    print("Finding top %d tweeters" % limit)

    # The view is keyed by timestamp, so one day is the range [stime, etime].
    dt = datetime.strptime(date, "%Y-%m-%d")
    stime = int(time.mktime(dt.timetuple()))
    etime = stime + 86400 - 1

    tweeters = {}
    tweets = {}
    for row in db.view('index/daily_tweets', startkey=stime, endkey=etime):
        status = row.value
        screen_name = status['user']['screen_name']
        followers_count = status['user']['followers_count']
        tweeters[screen_name] = int(followers_count)
        if screen_name not in tweets:
            tweets[screen_name] = []
        tweets[screen_name].append(status['id_str'])

    # sort
    di = sorted(tweeters.items(), key=lambda x: x[1], reverse=True)

    out = {}
    # Slice instead of indexing so days with fewer than `limit` tweeters work.
    for screen_name, followers_count in di[:limit]:
        out[screen_name] = {}
        out[screen_name]['follower_count'] = followers_count
        out[screen_name]['tweets'] = {}
        for tweetid in tweets[screen_name]:
            orig_text = db[tweetid]['orig_text']
            out[screen_name]['tweets'][tweetid] = orig_text
    return out

server = couchdb.Server('http://localhost:5984')
dbname = sys.argv[1]  # database name from the command line, as in the view script
db = server[dbname]
date = '2012-03-05'
output = run(db, date)
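The two core steps, computing the one-day key range and ranking tweeters by follower count, can be sketched on their own without a database. This is an illustration under stated assumptions rather than the code from the repository: day_window and top_tweeters are hypothetical helper names, the sample statuses are made up, and the date is interpreted as UTC with calendar.timegm, whereas the code above uses time.mktime (local time). The ranking uses heapq.nlargest instead of a full sort.

```python
import calendar
import heapq
from datetime import datetime


def day_window(date):
    """Return (start, end) epoch seconds covering one UTC day."""
    dt = datetime.strptime(date, "%Y-%m-%d")
    start = calendar.timegm(dt.timetuple())  # interpret the date as UTC
    return start, start + 86400 - 1


def top_tweeters(statuses, limit=10):
    """Return up to `limit` (screen_name, followers_count, tweet_ids) tuples."""
    followers = {}
    tweet_ids = {}
    for status in statuses:
        name = status['user']['screen_name']
        followers[name] = int(status['user']['followers_count'])
        tweet_ids.setdefault(name, []).append(status['id_str'])
    # nlargest returns at most `limit` items, largest follower count first.
    best = heapq.nlargest(limit, followers.items(), key=lambda kv: kv[1])
    return [(name, count, tweet_ids[name]) for name, count in best]


# Hypothetical statuses standing in for the values of the view rows.
sample = [
    {'id_str': '1', 'user': {'screen_name': 'alice', 'followers_count': 50}},
    {'id_str': '2', 'user': {'screen_name': 'bob', 'followers_count': 900}},
    {'id_str': '3', 'user': {'screen_name': 'alice', 'followers_count': 50}},
]

start, end = day_window('2012-03-05')
ranked = top_tweeters(sample, limit=2)
print(start, end)
print(ranked)
```

Whether to use UTC or local time depends on how the timestamps were stored when the tweets were saved; the key range only lines up with calendar days if both sides agree.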
Find the complete codebase on GitHub at: https://github.com/telvis07/twitter_mining