We can find interesting tweets using the author's follower count and the tweet's timestamp. We store tweets in CouchDB and collect them with tweepy streaming. With these tools we can find the top N tweets per day. The code below uses the couchpy view server to write a view in Python. The steps to set up couchpy are found here. Basically, you install couchpy and register it as a query server in /etc/couchdb/local.ini.
Install couchpy and couchdb-python with the following command.
pip install couchdb
Test couchpy is installed.
$ which couchpy
/usr/bin/couchpy
Edit /etc/couchdb/local.ini
[query_servers]
python=/usr/bin/couchpy
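If you want to double-check the setting from Python, here is a minimal sketch that parses ini text with the standard configparser module. The helper name python_query_server is hypothetical, and the sample text simply mirrors the two lines above; a real local.ini may contain other sections.

```python
import configparser


def python_query_server(ini_text):
    """Return the command registered for the 'python' query server, or None."""
    # interpolation=None keeps '%' characters in values from being interpreted;
    # strict=False tolerates duplicate sections or options.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(ini_text)
    return parser.get("query_servers", "python", fallback=None)


sample = """
[query_servers]
python=/usr/bin/couchpy
"""

print(python_query_server(sample))  # /usr/bin/couchpy
```

To check the real file, pass open('/etc/couchdb/local.ini').read() instead of the sample string.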
This is a simple view mapper that maps each tweet to its timestamp so we can query by start and end time.
import sys

import couchdb
from couchdb.design import ViewDefinition

server = couchdb.Server('http://localhost:5984')
db = sys.argv[1]
db = server[db]


def tweets_by_created_at(doc):
    if doc.get('created_at'):
        _date = doc['created_at']
    else:
        _date = 0  # Jan 1 1970
    if doc.get('user'):
        yield (_date, doc)


view = ViewDefinition('index', 'daily_tweets', tweets_by_created_at, language='python')
view.sync(db)
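Because the mapper is a plain Python generator, you can try it on a few hand-made documents before syncing it to CouchDB. The sketch below repeats the mapper and feeds it made-up sample documents. It assumes created_at is stored as integer epoch seconds, which is what the integer startkey and endkey in the query imply; the field values are invented for illustration.

```python
def tweets_by_created_at(doc):
    """Same mapper as above: key is created_at (0 if missing), value is the doc."""
    if doc.get('created_at'):
        _date = doc['created_at']
    else:
        _date = 0  # Jan 1 1970
    if doc.get('user'):
        yield (_date, doc)


# Made-up documents; created_at is assumed to be epoch seconds.
sample_docs = [
    {'_id': '1', 'created_at': 1330923600, 'user': {'screen_name': 'alice'}},
    {'_id': '2', 'user': {'screen_name': 'bob'}},   # no timestamp, so key is 0
    {'_id': '3', 'created_at': 1330927200},         # no user, so no row is emitted
]

rows = [row for doc in sample_docs for row in tweets_by_created_at(doc)]
keys = [key for key, _ in rows]
print(keys)  # [1330923600, 0]
```

Documents without a user field produce no row, so they never show up in the date-range query.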
The code below queries the view for all tweets within a date range. Then we sort in memory by the follower count.
import sys
import time
from datetime import datetime

import couchdb


def run(db, date, limit=10):
    """Query a couchdb view for tweets. Sort in memory by follower count.
    Return the top `limit` tweeters and their tweets"""
    print("Finding top %d tweeters" % limit)
    # Convert the date to a one-day range of epoch seconds (local time).
    dt = datetime.strptime(date, "%Y-%m-%d")
    stime = int(time.mktime(dt.timetuple()))
    etime = stime + 86400 - 1
    tweeters = {}  # screen_name -> follower count
    tweets = {}    # screen_name -> list of tweet ids
    for row in db.view('index/daily_tweets', startkey=stime, endkey=etime):
        status = row.value
        screen_name = status['user']['screen_name']
        followers_count = status['user']['followers_count']
        tweeters[screen_name] = int(followers_count)
        if screen_name not in tweets:
            tweets[screen_name] = []
        tweets[screen_name].append(status['id_str'])
    # sort by follower count, largest first
    di = sorted(tweeters.items(), key=lambda x: x[1], reverse=True)
    out = {}
    # slicing avoids an IndexError when fewer than `limit` tweeters were found
    for screen_name, followers_count in di[:limit]:
        out[screen_name] = {}
        out[screen_name]['follower_count'] = followers_count
        out[screen_name]['tweets'] = {}
        for tweetid in tweets[screen_name]:
            orig_text = db[tweetid]['orig_text']
            out[screen_name]['tweets'][tweetid] = orig_text
    return out


server = couchdb.Server('http://localhost:5984')
dbname = sys.argv[1]  # assumed: database name from the command line, as in the view script
db = server[dbname]
date = '2012-03-05'
output = run(db, date)
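The two pieces of logic in run, the one-day key range and the in-memory sort, can be tried without a database. The sketch below is one possible way to factor them out. The helper names day_range and top_tweeters and the follower counts are made up, and heapq.nlargest is swapped in for the full sort because only the top few entries are needed.

```python
import heapq
import time
from datetime import datetime


def day_range(date):
    """Return (start, end) epoch seconds for a 'YYYY-MM-DD' day in local time."""
    dt = datetime.strptime(date, "%Y-%m-%d")
    stime = int(time.mktime(dt.timetuple()))
    # Like the code above, this assumes a day is exactly 86400 seconds long.
    return stime, stime + 86400 - 1


def top_tweeters(tweeters, limit=10):
    """Return up to `limit` (screen_name, follower_count) pairs, largest first."""
    return heapq.nlargest(limit, tweeters.items(), key=lambda item: item[1])


followers = {'alice': 1200, 'bob': 35, 'carol': 98000}  # made-up counts
print(top_tweeters(followers, limit=2))  # [('carol', 98000), ('alice', 1200)]
```

Note that time.mktime interprets the date in the machine's local time zone, so the range lines up with stored timestamps only if both use the same convention.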
Find the complete codebase on GitHub at: https://github.com/telvis07/twitter_mining