Added JSON output to mahout clusterdump

In a prior post, I used Mahout to cluster religious tweeters by the Bible books mentioned in their tweets. The clusterdump utility prints the k-means cluster output as free text. I've submitted a patch to Mahout that adds a JSON output format to clusterdump. JSON is machine-readable, which makes it easy for an application built on another framework (such as Django) to read the clusters.

The code lives in my Mahout fork on GitHub. Run the commands below to build it, optionally run the unit test, and dump the clusters as JSON.

git clone git@github.com:telvis07/mahout.git
cd mahout
mvn compile package -DskipTests

# (optional) run the unit test for this feature
# the test pattern is quoted so the shell does not try to expand the *
mvn -pl integration \
 -Dtest='*.TestClusterDumper#testJsonClusterDumper' test

# dump the final clusters in JSON format
./bin/mahout clusterdump -d dictionary -dt text \
  -i clusters/clusters-*-final -p clusters/clusteredPoints \
  -n 10 -o clusterdump.json -of JSON

The clusterdump command produces output similar to this:

{
  "top_terms": [
    {
      "term": "proverbs",
      "weight": 0.19125590817015531
    },
    {
      "term": "romans",
      "weight": 0.16306549628629305
    }
  ],
  "points": [
    {
      "vector_name": "ssbo",
      "weight": "1.0",
      "point": "ssbo = [proverbs:1.000]"
    },
    {
      "vector_name": "37_DC",
      "weight": "1.0",
      "point": "37_DC = [proverbs:1.000]"
    },
    {
      "vector_name": "3HHHs",
      "weight": "1.0",
      "point": "3HHHs = [proverbs:1.000]"
    },
    {
      "vector_name": "EPUBC",
      "weight": "1.0",
      "point": "EPUBC = [proverbs:1.000]"
    },
    {
      "vector_name": "ILJ_4",
      "weight": "1.0",
      "point": "ILJ_4 = [romans:1.000]"
    }
  ],
  "cluster_id": 10515,
  "cluster": "VL-10515{n=5924 c=[genesis:0.000, exodus:0.009, ...]}"
}
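
Since the point of the JSON format is to let another application read the clusters, here is a minimal Python sketch of a reader. It assumes the dump contains one JSON object per cluster, written back to back and separated only by whitespace; I have not confirmed that layout for every case, so adjust the parsing if your file differs. The names iter_clusters and summarize are hypothetical helpers, not part of Mahout, and the second sample cluster is made up for illustration.

```python
import json


def iter_clusters(text):
    """Yield each top-level JSON value found in `text`, in order.

    Assumption: the dump holds one JSON object per cluster, written
    back to back and separated only by whitespace or newlines.
    """
    decoder = json.JSONDecoder()
    pos = 0
    while True:
        # raw_decode() rejects leading whitespace, so skip it first.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            return
        cluster, pos = decoder.raw_decode(text, pos)
        yield cluster


def summarize(cluster, n_terms=3):
    """Return (cluster_id, first n_terms terms, number of points)."""
    terms = [t["term"] for t in cluster["top_terms"][:n_terms]]
    return cluster["cluster_id"], terms, len(cluster["points"])


# Two abbreviated clusters: the first reuses values from the output above,
# the second is a made-up empty cluster to show several objects in one dump.
sample = """
{"top_terms": [{"term": "proverbs", "weight": 0.19125590817015531},
               {"term": "romans", "weight": 0.16306549628629305}],
 "points": [{"vector_name": "ssbo", "weight": "1.0",
             "point": "ssbo = [proverbs:1.000]"}],
 "cluster_id": 10515}
{"top_terms": [], "points": [], "cluster_id": 1}
"""

for cluster in iter_clusters(sample):
    print(summarize(cluster))
```

To read a real dump, pass the contents of clusterdump.json to iter_clusters instead of the sample string. Note that in the output above the weight of each point is a string ("1.0") while the weight of each top term is a number, so wrap point weights in float() before doing arithmetic on them.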
