Knock at the Door

Relax. Today might be the day when things get better.

Revelation 3:20 - Behold, I stand at the door, and knock: if any man hear my voice, and open the door, I will come in to him, and will sup with him, and he with me.

Happy in Work

I hope we find something extraordinary in every ordinary day.

Ecclesiastes 2:24 - A person can do nothing better than to eat and drink and find satisfaction in their own toil. This too, I see, is from the hand of God.

Weights

If we want to fly, we have to give up the things that weigh us down.

Hebrews 12:1 - Wherefore seeing we also are compassed about with so great a cloud of witnesses, let us lay aside every weight, and the sin which doth so easily beset us, and let us run with patience the race that is set before us.

Added JSON output to mahout clusterdump

In a prior post, I used Mahout to cluster religious tweeters by the Bible books found in their tweets. The clusterdump utility prints the kmeans cluster output in a free-text format. I've submitted a patch to Mahout that adds a JSON output format to clusterdump. JSON is machine readable and makes it easy for an application developed in another framework (like Django) to read the clusters.

The code lives in my Mahout fork on GitHub. Run the commands below to build it.

git clone git@github.com:telvis07/mahout.git
cd mahout
mvn compile package -DskipTests

# to (optionally) run the unittest for this feature
mvn -pl integration \
 -Dtest=*.TestClusterDumper#testJsonClusterDumper test

./bin/mahout clusterdump -d dictionary -dt text \
  -i clusters/clusters-*-final -p clusters/clusteredPoints \
  -n 10 -o clusterdump.json -of JSON

The command produces output similar to this...

{
  "top_terms": [
    {
      "term": "proverbs",
      "weight": 0.19125590817015531
    },
    {
      "term": "romans",
      "weight": 0.16306549628629305
    }
  ],
  "points": [
    {
      "vector_name": "ssbo",
      "weight": "1.0",
      "point": "ssbo = [proverbs:1.000]"
    },
    {
      "vector_name": "37_DC",
      "weight": "1.0",
      "point": "37_DC = [proverbs:1.000]"
    },
    {
      "vector_name": "3HHHs",
      "weight": "1.0",
      "point": "3HHHs = [proverbs:1.000]"
    },
    {
      "vector_name": "EPUBC",
      "weight": "1.0",
      "point": "EPUBC = [proverbs:1.000]"
    },
    {
      "vector_name": "ILJ_4",
      "weight": "1.0",
      "point": "ILJ_4 = [romans:1.000]"
    }
  ],
  "cluster_id": 10515,
  "cluster": "VL-10515{n=5924 c=[genesis:0.000, exodus:0.009, ...]}"
}
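
For example, a consumer could load the clusters with a few lines of Python. This is a minimal sketch that assumes the output file holds one JSON object per cluster, one object per line; adjust the parsing if your clusterdump output is shaped differently.

import json

# Minimal sketch: assumes clusterdump.json holds one JSON object per
# cluster, one object per line. Adjust if your output differs.
with open("clusterdump.json") as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        cluster = json.loads(line)
        top_terms = ", ".join(t["term"] for t in cluster["top_terms"])
        print("cluster %s: %d points, top terms: %s" % (
            cluster["cluster_id"], len(cluster["points"]), top_terms))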

Using Mahout to Group Religious Twitter Users

This post describes a method for grouping religious tweeters by their scripture-reference patterns using the kmeans clustering algorithm. I use Apache Pig to process the data collected by Habakkuk, and Apache Mahout to perform the clustering.

Clustering Primer

Clustering consists of representing each entity (e.g., a tweeter) as a feature vector, choosing a similarity metric, and then applying a clustering algorithm to group the vectors based on similarity.

Feature Vector Example

Suppose we have two tweeters. Tweeter #1 tweets a reference to Exodus 1:1 and tweeter #2 tweets a reference to Genesis 1:1. Each tweeter can be represented as a "feature vector" that counts that tweeter's references to each book.

Positions (Genesis Reference Count, Exodus Reference Count)
Tweeter #1: (0, 1)
Tweeter #2: (1, 0)

Each book has an assigned position within the feature vector. The first vector shows that tweeter #1 referenced Genesis zero (0) times and Exodus one (1) time. The second vector shows that tweeter #2 referenced Genesis one (1) time and Exodus zero (0) times. The vector can be extended to count every book or even every scripture reference: there are 66 books and 31,102 verses in the KJV Bible. Because these counts are fixed, books (or verses) can easily be mapped to integers that serve as vector positions.
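
As a minimal sketch (in Python, for illustration only; the actual pipeline below does this with Pig and Mahout), the book-to-position mapping and counting might look like this:

# Illustrative sketch: map each book to a fixed vector position,
# then count a tweeter's references at those positions.
BOOKS = ["genesis", "exodus"]  # extend to all 66 books
BOOK_ID = {book: i for i, book in enumerate(BOOKS)}

def feature_vector(referenced_books):
    vector = [0] * len(BOOKS)
    for book in referenced_books:
        vector[BOOK_ID[book]] += 1
    return vector

print(feature_vector(["exodus"]))   # tweeter #1 -> [0, 1]
print(feature_vector(["genesis"]))  # tweeter #2 -> [1, 0]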

Similarity Metric

There are many methods for calculating the similarity between two vectors. Cosine similarity measures the angle between vectors, so smaller angles imply more similar vectors. Euclidean distance measures the distance between vectors, so "close" vectors are deemed similar. Other measures include the Tanimoto and Manhattan distances. There are many techniques; consult the webernets for more details.
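
For instance, a bare-bones cosine similarity in Python (a sketch, not what Mahout runs internally) looks like this:

import math

# Cosine similarity between two count vectors (assumes non-zero vectors).
# Values near 1.0 mean a small angle, i.e. similar reference patterns.
def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([0, 1], [1, 0]))  # 0.0: orthogonal, dissimilar
print(cosine_similarity([2, 1], [4, 2]))  # 1.0: same direction, similar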

Kmeans Clustering

As the number of vectors grows, it becomes computationally expensive to calculate the similarity between all pairs of vectors in a data set. Kmeans clustering is an efficient algorithm for identifying groups of similar vectors in large data sets.

The algorithm groups n vectors into k clusters of similar vectors. A cluster can be thought of as a geometric circle where the center of the circle defines a centroid point. The kmeans algorithm randomly picks k initial centroid points and assigns all n vectors to a cluster based on the nearest centroid. Next, a new round begins where new centroid points are calculated based on the mean of the vectors assigned to the k clusters in the previous round (hence k-means). The algorithm stops when the centroid points from subsequent rounds are 'close enough' or the maximum number of rounds has been reached.
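
A toy, single-machine version of that loop might look like the Python sketch below (illustrative only; Mahout's distributed implementation does the real work, and Euclidean distance keeps the example short):

import random

# Toy kmeans sketch: pick k random centroids, assign each vector to the
# nearest centroid, recompute centroids as cluster means, and stop when
# the centroids barely move or the round limit is hit.
def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def mean(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def kmeans(vectors, k, max_rounds=10, threshold=1e-3):
    centroids = random.sample(vectors, k)
    for _ in range(max_rounds):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k), key=lambda i: dist(v, centroids[i]))
            clusters[nearest].append(v)
        new_centroids = [mean(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if all(dist(a, b) < threshold
               for a, b in zip(centroids, new_centroids)):
            break
        centroids = new_centroids
    return clusters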

Performing Kmeans clustering using Hadoop

Two Hadoop-based applications are used to perform this analysis. First, the raw tweets stored in JSON must be processed and converted into feature vectors using Apache Pig. Second, kmeans clustering is performed on the feature vectors using Apache Mahout.

Feature Extraction

Habakkuk tweets are "tagged" with a book (e.g. genesis) and a bibleverse (e.g. genesis 1:1). The Pig script below describes a data flow for transforming the Habakkuk tweets into per-tweeter book feature vectors. The load statement uses elephant-bird to load the raw JSON from disk into Pig. The join statement uses the book name to correlate a book id with each tweet; the book id serves as the vector position. The foreach counts the book references per tweeter and the group by organizes the records by tweeter. Finally, the store uses elephant-bird to store the data in a format Mahout can read.

This is just a code snippet. Check GitHub for the full script.


-- load habakkuk json data, generate screenname and book reference
tweets = load '$data' using com.twitter.elephantbird.pig.load.JsonLoader();
filtered = foreach tweets generate
        (chararray)$0#'screenname' as screenname,
        (chararray)$0#'book' as book;

-- load book ids for join
bookids = load '$book_dict' as (book:chararray, docfreq:int, bookid:int);
filtered = join bookids by book, filtered by book;

-- group using tuple(screenname,book) as key
by_screen_book = group filtered by (screenname, bookids::bookid);

-- generate counts for each screenname, book
book_counts = foreach by_screen_book {
    generate group.screenname as screenname,
         group.bookids::bookid as bookid,
         COUNT(filtered) as count;
}

-- group by screenname: bag{(screenname, bookid, count)}
grpd = group book_counts by screenname;

-- nested projection to get: screenname: entries:bag{(bookid, count)}
-- uses ToTuple because SEQFILE_STORAGE expects bag to be in a tuple
vector_input = foreach grpd generate group,
       org.apache.pig.piggybank.evaluation.util.ToTuple(book_counts.(bookid, count));

-- store to sequence files
STORE vector_input INTO '$output' USING $SEQFILE_STORAGE (
  '-c $TEXT_CONVERTER', '-c $VECTOR_CONVERTER -- -cardinality 66'
);

Mahout in Action

The example mahout command below uses kmeans to generate 2 clusters (-k 2), choosing the initial centroids at random and placing them in kmeans-initial-clusters. The maximum number of iterations is 10 (-x). Kmeans will use the cosine distance measure (-dm) and a convergence threshold (-cd) of 0.1 instead of the default 0.5, because cosine distances lie between 0 and 1.

$ mahout kmeans -i book_vectors-nv  \
-c kmeans-initial-clusters -k 2 -o clusters \
-x 10 -ow -cl -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
-cd 0.1

Results

Finally, the clusterdump command prints information about the clusters, such as the top books and the tweeters in each cluster. These clusters were generated from a small sample of only 10 tweets.

$ mahout clusterdump -d join_data/book.dictionary \
      -dt text -s clusters/clusters-1 \
      -p clusters/clusteredPoints -n 10 -o clusterdump.log
$ cat clusterdump.log
CL-0 ...
    Top Terms: 
        luke                                    =>  0.4444444444444444
        matthew                                 =>  0.3333333333333333
        john                                    =>  0.2222222222222222
        galatians                               =>  0.1111111111111111
        philippians                             =>  0.1111111111111111
    Weight:  Point:
    1.0: Zigs26 = [luke:1.000]
    1.0: da_nellie = [john:1.000]
    1.0: austinn_21 = [luke:1.000]
    1.0: YUMADison22 = [luke:1.000]
    1.0: chap_stique = [galatians:1.000]
    1.0: ApesWhitelaw = [matthew:2.000, john:1.000]
    1.0: alexxrenee22 = [luke:1.000]
    1.0: AbigailObregon3 = [philippians:1.000]
    1.0: thezealofisrael = [matthew:1.000]
VL-7 ...
    Top Terms: 
        ephesians                               =>                 1.0
    Weight:  Point:
    1.0: Affirm_Success = [ephesians:1.000]

The results show 2 clusters. One cluster has 9 tweeters, with luke, matthew, john, galatians, and philippians as the top books. The second cluster has 1 tweeter with ephesians as the top book. Obviously, YMMV with different convergence thresholds, data, and distance metrics.

The Storm Is Over

I hope we stay positive when times get tough, because the storms of life don't last forever.

(Genesis 8:10-11 NIV) [Noah] waited seven more days and again sent out the dove from the ark. When the dove returned to him in the evening, there in its beak was a freshly plucked olive leaf! Then Noah knew that the water had receded from the earth.