This post describes a method to group religious tweeters by their scripture reference patterns using the kmeans clustering algorithm. I use Apache Pig to process data retrieved by Habakkuk and Mahout to perform clustering.

## Clustering Primer

Clustering consists of representing an entity (e.g. tweeter) as a feature vector, choosing a similarity metric and then applying a clustering algorithm to group the vectors based on similarity.

### Feature Vector Example

Suppose we have two tweeters. Tweeter #1 tweets a reference to Exodus 1:1 and tweeter #2 tweets a reference to Genesis 1:1. Each tweeter can be represented as a “feature vector” that counts the number of references to a book by that tweeter.

Positions (Genesis Reference Count, Exodus Reference Count) Tweeter #1: (0, 1) Tweeter #2: (1, 0)

Each book has an assigned position within the feature vector. The first vector shows tweeter #1 referenced Genesis zero (0) times and Exodus one (1) time. The second vector shows tweeter #2 referenced Genesis one (1) time and Exodus zero (0) times. The vector can be extended to count every book or even every scripture reference. There are 66 books and 31102 bible verses in the KJV-based bible. Because the number is fixed, they can easily be mapped to integers used as vector position.

### Similarity Metric

There are many methods for calculating the similarity between to vectors. Cosine similarity can measure the angle between vectors so smaller angles imply similar vectors. Euclidean distance measures the distance between vectors so “close” vectors are deemed similar. Other measures include Tanimoto and Manhattan distance. There are many techniques, consult the webernets for more details.

### Kmeans Clustering

As the number of vectors grow, it becomes computationally expensive to calculate the similarity of all vectors in a data set. Kmeans clustering is an efficient algorithm to identify similar groups in large data sets.

The algorithm groups *n* vectors into *k* clusters of similar vectors. A cluster can be thought of as a geometric circle where the center of the circle defines a *centroid* point. The kmeans algorithm randomly picks *k* initial centroid points and assigns all *n* vectors to a cluster based on the nearest centroid. Next, a new round begins where new centroid points are calculated based on the *mean* of the vectors assigned to the *k* clusters in the previous round (hence k-means). The algorithm stops when the centroid points from subsequent rounds are ‘close enough’ or the maximum number of rounds have been reached.

## Performing Kmeans clustering using Hadoop

Two Hadoop-based applications are used to perform this analysis. First, the raw tweets stored in JSON must be processed and converted into feature vectors using Apache Pig. Second, kmeans clustering is performed on the feature vectors using Apache Mahout.

### Feature Extraction

Habakkuk tweets are “tagged” with book (e.g. genesis) and bibleverse (e.g genesis 1:1). The pig script below describes a data flow for transforming the habakkuk tweets into book feature vectors per tweeter. The **load** statement uses elephant-bird to load the raw JSON from disk into PIG. The **join** statement uses the book name to correlate a book id with each tweet. The book id serves as the vector position. The **foreach** counts the book references per tweeter and the **group by** organizes the records by tweeter. Finally, the **store** uses elephant-bird to store the data in a format mahout can read.

This is just a code snippet. Check github for the full script.

-- load habakkuk json data, generate screenname and book reference tweets = load '$data' using com.twitter.elephantbird.pig.load.JsonLoader(); filtered = foreach tweets generate (chararray)$0#'screenname' as screenname, (chararray)$0#'book' as book; -- load book ids for join bookids = load '$book_dict' as (book:chararray, docfreq:int, bookid:int); filtered = join bookids by book, filtered by book; -- group using tuple(screenname,book) as key by_screen_book = group filtered by (screenname, bookids::bookid); -- generate counts for each screenname, book book_counts = foreach by_screen_book { generate group.screenname as screenname, group.bookids::bookid as bookid, COUNT(filtered) as count; } -- group by screenname: bag{(screenname, bookid, count)} grpd = group book_counts by screenname; -- nested projection to get: screenname: entries:bag{(bookid, count)} -- uses ToTuple because SEQFILE_STORAGE expects bag to be in a tuple vector_input = foreach grpd generate group, org.apache.pig.piggybank.evaluation.util.ToTuple(book_counts.(bookid, count)); -- store to sequence files STORE vector_input INTO '$output' USING $SEQFILE_STORAGE ( '-c $TEXT_CONVERTER', '-c $VECTOR_CONVERTER -- -cardinality 66' );

### Mahout in Action

The example mahout command below uses kmeans to generate 2 clusters (-k 2) and choose the initial clusters at random and place them in kmeans-initial-clusters. The maximum number of iterations is 10 (-x). Kmeans will use cosine distance measure (-dm) and a convergence threshold (-cd) of 0.1, instead of using the default value of 0.5 because cosine distances lie between 0 and 1.

$ mahout kmeans -i book_vectors-nv \ -c kmeans-initial-clusters -k 2 -o clusters \ -x 10 -ow -cl -dm org.apache.mahout.common.distance.CosineDistanceMeasure \ -cd 0.1

### Results

Finally, the *clusterdump* command can print information about the clusters such as the top books and the tweeters in the cluster. These clusters were generated with a small sample with only 10 tweets.

$ mahout clusterdump -d join_data/book.dictionary \ -dt text -s clusters/clusters-1 \ -p clusters/clusteredPoints -n 10 -o clusterdump.log $ cat clusterdump.log CL-0 ... Top Terms: luke => 0.4444444444444444 matthew => 0.3333333333333333 john => 0.2222222222222222 galatians => 0.1111111111111111 philippians => 0.1111111111111111 Weight: Point: 1.0: Zigs26 = [luke:1.000] 1.0: da_nellie = [john:1.000] 1.0: austinn_21 = [luke:1.000] 1.0: YUMADison22 = [luke:1.000] 1.0: chap_stique = [galatians:1.000] 1.0: ApesWhitelaw = [matthew:2.000, john:1.000] 1.0: alexxrenee22 = [luke:1.000] 1.0: AbigailObregon3 = [philippians:1.000] 1.0: thezealofisrael = [matthew:1.000] VL-7 ... Top Terms: ephesians => 1.0 Weight: Point: 1.0: Affirm_Success = [ephesians:1.000]

The results show 2 clusters. 1 cluster has 9 tweeters with the top books as luke, matthew, john, galations and phillippians. The second cluster has 1 tweeter with ephesians as a top book. Obviously, YMMV with different convergence thresholds, data and distance metrics.

# References

I recommend the following books to anyone learning Pig and Mahout.

hi am getting following error while using cluster dump

13/10/04 16:56:11 ERROR common.AbstractJob: Unexpected -s while processing Job-Specific Options:

usage: [Generic Options] [Job-Specific Options]

I was using mahout 0.5 when I wrote this. I think -s has changed to -i.

Hi

Can you please tell me how to decide which similarity measure should be used before applying any measures to generate recommendations based on data we have ?

Thanks

Gaurhari

Basicallly, trial and error. I generally perform the analysis with a small sample of the dataset using tools like scikit-learn. Then examine whether the recommendations “look correct” with various metrics. If the results don’t look good, then change the metric.

That said, I think it depends on the characteristics of your data. Check out Chapter 7 in “Mahout in Action”. http://www.manning.com/owen/