Apache Mahout: Scalable machine learning and data mining
For clustering and classifying documents it is usually necessary to convert the raw text into vectors that can then be consumed by the clustering algorithms. The approaches for doing so are described below.

NOTE: Your Lucene index must be created with the same version of Lucene used in Mahout. As of Mahout 0.9 this is Lucene 4.6.1. If these versions don't match you will likely get 'Exception in thread "main" org.apache.lucene.index.CorruptIndexException: Unknown format version: -11' as an error.

Mahout has utilities that allow one to easily produce Mahout Vector representations from a Lucene (and Solr, since they are the same) index. For this, we assume you know how to build a Lucene/Solr index. For those who don't, it is probably easiest to get up and running using Solr, as it can ingest things like PDFs, XML, Office documents, etc. and create a Lucene index. For those wanting to use just Lucene, see the Lucene website or check out Lucene In Action by Erik Hatcher, Otis Gospodnetic and Mike McCandless.

To get started, make sure you get a fresh copy of Mahout from GitHub and are comfortable building it. The vector utilities define interfaces and implementations for efficiently iterating over a data source (only Lucene is supported currently, but the design should be extensible to databases, Solr, etc.) and produce a Mahout Vector file and term dictionary, which can then be used for clustering. The main code for driving this is the driver program located in the org.apache.mahout.utils.vectors package. The driver program offers several input options, which can be displayed by specifying the --help option.

In a typical run, the driver reads the index specified by --dir, uses the body field in it, and writes the vectors to the output directory and the dictionary to dict.txt. With --max set to 50 it outputs only 50 vectors; if you don't specify --max, all the documents in the index are output.

Mahout also has utilities to generate Vectors from a directory of text documents.
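Returning to the Lucene driver described above, its invocation can be sketched roughly as follows. This is a sketch under stated assumptions, not the project's official example: the lucene.vector command name and the --field, --dictOut and --output flags are recalled from the Mahout 0.9 command line and should be confirmed with --help; MAHOUT_HOME, WORK_DIR and the index path are placeholders; and run_mahout is a hypothetical wrapper that only prints the command when no Mahout installation is found.

```shell
# Assumed locations; adjust to your own installation and index.
MAHOUT_HOME="${MAHOUT_HOME:-$HOME/mahout}"
WORK_DIR="${WORK_DIR:-/tmp/mahout-work}"

# Hypothetical helper: run Mahout if it is installed, otherwise print the command.
run_mahout() {
  if [ -x "$MAHOUT_HOME/bin/mahout" ]; then
    "$MAHOUT_HOME/bin/mahout" "$@"
  else
    echo "dry run: mahout $*"
  fi
}

# Read the "body" field from the index under --dir, write the vectors to
# out.txt and the term dictionary to dict.txt, and stop after 50 documents.
# The index path is a placeholder; flag names are assumptions to verify.
run_mahout lucene.vector \
  --dir "$WORK_DIR/index" \
  --field body \
  --dictOut "$WORK_DIR/dict.txt" \
  --output "$WORK_DIR/out.txt" \
  --max 50
```

Dropping the --max line should make the driver emit a vector for every document in the index, as noted above.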
Before creating the vectors, you need to convert the documents to SequenceFile format. SequenceFile is a Hadoop class which allows us to write arbitrary (key, value) pairs into it. The DocumentVectorizer requires the key to be a Text with a unique document id, and the value to be the Text content in UTF-8 format. Mahout has a nifty utility which reads a directory path, including its sub-directories, and creates the SequenceFile in a chunked manner for us. The output of seqDirectory will be a SequenceFile <Text, Text> of all documents (/sub-directory-path/documentFileName, documentText).

The seq2sparse utility then creates SequenceFiles of tokenized documents <Text, StringTuple> (docID, tokenizedDoc) and vectorized documents <Text, VectorWritable> (docID, TF-IDF Vector). In addition, seq2sparse creates SequenceFiles for a dictionary (wordIndex, word), a word frequency count (wordIndex, count) and a document frequency count (wordIndex, DFCount) in the output directory.

The --minSupport option is the minimum frequency a word must have to be considered as a feature, --minDF is the minimum number of documents the word needs to appear in, and --maxDFPercent is the maximum value of the ratio (document frequency of a word / total number of documents), expressed as a percentage, for the word to be considered a good feature. These options are helpful in removing high-frequency features like stop words. The vectorized documents can then be used as input to many of Mahout's classification and clustering algorithms.

As an example, the documents can be vectorized using trigrams, L_2 length normalization and a maximum document frequency cutoff of 85%. The sequence file that such a run writes to the $WORK_DIR/reuters-out-seqdir-sparse-kmeans/tfidf-vectors directory can then be used as input to the Mahout k-Means clustering algorithm.

If you already have a processing pipeline for your documents (texts, images or whatever items you wish to treat), the question arises of how to convert its vectors into the Mahout vector format.
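For the text-directory workflow described above (seqDirectory followed by seq2sparse), the two steps might be run roughly as follows. Treat this as a sketch under stated assumptions: the command names seqdirectory and seq2sparse and the short flags (-i, -o, -c, -chunk, -wt, -ng, -n, --namedVector) are recalled from the Mahout 0.9 command line and should be checked against each command's help output; the reuters-out input directory, MAHOUT_HOME and WORK_DIR are placeholders; and run_mahout is a hypothetical wrapper that prints the command instead of running it when Mahout is not installed.

```shell
# Assumed locations; adjust to your own installation and data.
MAHOUT_HOME="${MAHOUT_HOME:-$HOME/mahout}"
WORK_DIR="${WORK_DIR:-/tmp/mahout-work}"

# Hypothetical helper: run Mahout if it is installed, otherwise print the command.
run_mahout() {
  if [ -x "$MAHOUT_HOME/bin/mahout" ]; then
    "$MAHOUT_HOME/bin/mahout" "$@"
  else
    echo "dry run: mahout $*"
  fi
}

# Step 1: convert a directory tree of UTF-8 text files into chunked
# SequenceFiles of (document path, document text). Chunk size is in megabytes.
run_mahout seqdirectory \
  -i "$WORK_DIR/reuters-out" \
  -o "$WORK_DIR/reuters-out-seqdir" \
  -c UTF-8 \
  -chunk 64

# Step 2: build TF-IDF vectors with a maximum n-gram size of 3 (trigrams),
# L_2 length normalization (-n 2) and an 85% document frequency cutoff.
run_mahout seq2sparse \
  -i "$WORK_DIR/reuters-out-seqdir" \
  -o "$WORK_DIR/reuters-out-seqdir-sparse-kmeans" \
  --namedVector \
  -wt tfidf \
  -ng 3 \
  -n 2 \
  --maxDFPercent 85
```

The --minSupport and --minDF options discussed above could be appended to the second command in the same way to prune rare terms as well as very common ones.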
Probably the easiest way to do this conversion is to implement your own Iterable over your vectors (for example, a class named VectorIterable) and then reuse the existing VectorWriter classes.

Copyright © 2014-2016 The Apache Software Foundation, Licensed under the Apache License, Version 2.0. Apache Mahout, Mahout, Apache, the Apache feather logo, and the elephant rider logo are either registered trademarks or trademarks of The Apache Software Foundation in the United States and other countries.