In a prior post, I used Mahout to cluster religious tweeters by the Bible books mentioned in their tweets. The clusterdump utility prints the k-means cluster output as free text. I've submitted a patch to Mahout that adds a JSON output format to clusterdump. JSON is machine readable, which makes it easy for an application built on another framework (like Django) to read the clusters.
The code lives in my Mahout fork on GitHub. Run the commands below to build it and dump the clusters as JSON.
    git clone git@github.com:telvis07/mahout.git
    cd mahout
    mvn compile package -DskipTests

    # to (optionally) run the unittest for this feature
    mvn -pl integration \
      -Dtest=*.TestClusterDumper#testJsonClusterDumper test

    ./bin/mahout clusterdump -d dictionary -dt text \
      -i clusters/clusters-*-final -p clusters/clusteredPoints \
      -n 10 -o clusterdump.json -of JSON
The clusterdump command produces output similar to this:
    {
      "top_terms": [
        { "term": "proverbs", "weight": 0.19125590817015531 },
        { "term": "romans", "weight": 0.16306549628629305 }
      ],
      "points": [
        { "vector_name": "ssbo", "weight": "1.0", "point": "ssbo = [proverbs:1.000]" },
        { "vector_name": "37_DC", "weight": "1.0", "point": "37_DC = [proverbs:1.000]" },
        { "vector_name": "3HHHs", "weight": "1.0", "point": "3HHHs = [proverbs:1.000]" },
        { "vector_name": "EPUBC", "weight": "1.0", "point": "EPUBC = [proverbs:1.000]" },
        { "vector_name": "ILJ_4", "weight": "1.0", "point": "ILJ_4 = [romans:1.000]" }
      ],
      "cluster_id": 10515,
      "cluster": "VL-10515{n=5924 c=[genesis:0.000, exodus:0.009, ...]}"
    }
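To give a feel for how another application might read this output, here is a minimal Python sketch. It assumes the output file holds one JSON object per cluster, written back to back, and that each object has the `top_terms`, `points` and `cluster_id` fields shown above. The helper names `iter_clusters` and `summarize` are my own for illustration, and the second cluster in `SAMPLE` is made up to show iteration.

```python
import json

# Abbreviated sample: the first object follows the output shown above,
# the second is a made-up, empty cluster used only to show iteration.
SAMPLE = (
    '{"top_terms": [{"term": "proverbs", "weight": 0.19}, '
    '{"term": "romans", "weight": 0.16}], '
    '"points": [{"vector_name": "ssbo", "weight": "1.0", '
    '"point": "ssbo = [proverbs:1.000]"}], "cluster_id": 10515}\n'
    '{"top_terms": [], "points": [], "cluster_id": 7}\n'
)


def iter_clusters(text):
    """Yield each JSON object found in text (assumed: one per cluster)."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


def summarize(cluster, n_terms=3):
    """Return (cluster_id, first n_terms top terms, number of points)."""
    terms = [t["term"] for t in cluster.get("top_terms", [])[:n_terms]]
    return cluster["cluster_id"], terms, len(cluster.get("points", []))


# To read a real dump, pass the file contents instead of SAMPLE, e.g.
# open("clusterdump.json").read()
for cluster in iter_clusters(SAMPLE):
    print(summarize(cluster))
```

Decoding objects one after another, rather than calling `json.loads` on the whole file, means the sketch works whether the clusters are written one per line or pretty-printed across several lines.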