Starting with version 4.2, Lucene provides a document classification function. In this article, we run the document classification functions of both Lucene and Mahout on the same corpus and compare the results.
Lucene implements Naive Bayes and k-NN classifiers. The trunk, which corresponds to Lucene 5, the next major release, additionally implements a boolean (two-class) perceptron classifier. We use Lucene 4.6.1, the most recent version at the time of writing, to perform document classification with Naive Bayes and k-NN.
For Mahout, we perform document classification with Naive Bayes and Random Forest.
Overview of Lucene Document Classification
Lucene’s classifiers for document classification are defined by the Classifier interface.
public interface Classifier<T> {

    /**
     * Assign a class (with score) to the given text String
     * @param text a String containing text to be classified
     * @return a {@link ClassificationResult} holding assigned class of type <code>T</code> and score
     * @throws IOException If there is a low-level I/O error.
     */
    public ClassificationResult<T> assignClass(String text) throws IOException;

    /**
     * Train the classifier using the underlying Lucene index
     * @param atomicReader the reader to use to access the Lucene index
     * @param textFieldName the name of the field used to compare documents
     * @param classFieldName the name of the field containing the class assigned to documents
     * @param analyzer the analyzer used to tokenize / filter the unseen text
     * @param query the query to filter which documents use for training
     * @throws IOException If there is a low-level I/O error.
     */
    public void train(AtomicReader atomicReader, String textFieldName, String classFieldName, Analyzer analyzer, Query query)
        throws IOException;
}
Because Classifier uses the index as its training data, you need an open reader on a prepared index and pass it as the first argument of train(). Pass the name of the Lucene field that holds the tokenized and indexed text as the second argument, and the name of the field that holds the document category as the third. The fourth argument is a Lucene Analyzer and the fifth is a Query. The Analyzer is the one used when classifying an unknown document (in my personal opinion this is a bit confusing, and it would be better passed to the assignClass() method described below). The Query narrows down the documents used for training; pass null if no narrowing is needed. train() has two more overloads with different arguments, which I will not explain here.
After calling train(), call assignClass() with the unknown document as a String to obtain the classification result. Classifier is a generic interface, and assignClass() returns a ClassificationResult parameterized by the type variable T.
public class ClassificationResult<T> {

    private final T assignedClass;
    private final double score;

    /**
     * Constructor
     * @param assignedClass the class <code>T</code> assigned by a {@link Classifier}
     * @param score the score for the assignedClass as a <code>double</code>
     */
    public ClassificationResult(T assignedClass, double score) {
        this.assignedClass = assignedClass;
        this.score = score;
    }

    /**
     * retrieve the result class
     * @return a <code>T</code> representing an assigned class
     */
    public T getAssignedClass() {
        return assignedClass;
    }

    /**
     * retrieve the result score
     * @return a <code>double</code> representing a result score
     */
    public double getScore() {
        return score;
    }
}
Calling the getAssignedClass() method of ClassificationResult gives you a classification result of the type T.
Note that Lucene’s classifier is unusual in that the train() method does little work while assignClass() does most of it. This is very different from other commonly used machine learning software. In the learning phase of typical machine learning software, a model file is created by learning from the corpus with the selected algorithm (this is where most of the time and effort goes; as Mahout is based on Hadoop, it uses MapReduce to try to reduce the time required here). In the classification phase, an unknown document is then classified by referring to the previously created model file. This phase usually requires few resources.
Because Lucene uses the index as its model file, the train() method, which is the learning phase, does almost nothing (learning is complete as soon as the index is created). Lucene’s index, however, is optimized for high-speed keyword search and is not in an appropriate format for a document classification model. Document classification is therefore performed by searching the index in the assignClass() method, which is the classification phase. Contrary to commonly used machine learning software, Lucene’s classifier requires a lot of computing power in the classification phase. For sites that mainly focus on search, this function should still be appealing, because they already have an index and can classify documents without the extra cost of building a model.
Now, let’s quickly go through how the two implementation classes of the Classifier interface perform document classification, and call them from a program.
Using Lucene SimpleNaiveBayesClassifier
SimpleNaiveBayesClassifier is the first implementation class of the Classifier interface. As the name suggests, it is a Naive Bayes classifier. Naive Bayes classification finds the class c that maximizes the conditional probability P(c|d), the probability that document d belongs to class c. Bayes’ theorem rewrites P(c|d) as P(c)P(d|c)/P(d); since P(d) does not depend on c, it is enough to find the class c with the highest P(c)P(d|c). The calculation is usually done with logarithms to avoid underflow, and the assignClass() method of SimpleNaiveBayesClassifier repeats it once per class to perform MLE (maximum likelihood estimation).
Before using SimpleNaiveBayesClassifier, we need to prepare the learning data in an index. Here we use the livedoor news corpus as our corpus. Let’s add the livedoor news corpus to the index through Solr, using the following schema definition.
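The argmax over P(c)P(d|c) with logarithms can be sketched in plain Java as follows. This is a toy multinomial Naive Bayes for illustration only and is not Lucene's implementation: the class name NaiveBayesSketch, the add-one smoothing and the pre-tokenized String[] input are all assumptions made for the example.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy multinomial Naive Bayes (hypothetical sketch, not Lucene's code). */
final class NaiveBayesSketch {
    private final Map<String, Integer> docsPerClass = new HashMap<>();
    private final Map<String, Map<String, Integer>> termCounts = new HashMap<>();
    private final Map<String, Integer> totalTerms = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    /** Record one labeled training document that is already tokenized. */
    void add(String label, String... tokens) {
        totalDocs++;
        docsPerClass.merge(label, 1, Integer::sum);
        Map<String, Integer> counts = termCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String t : tokens) {
            counts.merge(t, 1, Integer::sum);
            totalTerms.merge(label, 1, Integer::sum);
            vocabulary.add(t);
        }
    }

    /** log P(c) + sum of log P(w|c); add-one smoothing is an assumption of this sketch. */
    double logScore(String label, String... tokens) {
        double score = Math.log(docsPerClass.get(label) / (double) totalDocs);
        Map<String, Integer> counts = termCounts.get(label);
        double denom = totalTerms.getOrDefault(label, 0) + vocabulary.size();
        for (String t : tokens) {
            score += Math.log((counts.getOrDefault(t, 0) + 1) / denom);
        }
        return score;
    }

    /** Evaluate every class once and return the one with the highest log score. */
    String classify(String... tokens) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : docsPerClass.keySet()) {
            double s = logScore(label, tokens);
            if (s > bestScore) {
                bestScore = s;
                best = label;
            }
        }
        return best;
    }

    /** Tiny made-up training set used only to exercise the sketch. */
    static NaiveBayesSketch sampleModel() {
        NaiveBayesSketch nb = new NaiveBayesSketch();
        nb.add("sports", "goal", "match", "team");
        nb.add("sports", "team", "win", "goal");
        nb.add("it", "phone", "app", "update");
        return nb;
    }

    public static void main(String[] args) {
        NaiveBayesSketch nb = sampleModel();
        System.out.println(nb.classify("team", "goal"));
        System.out.println(nb.classify("app", "phone"));
    }
}
```

Summing logarithms instead of multiplying probabilities keeps the score in a representable range even for long documents, which is the underflow concern mentioned above.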
<?xml version="1.0" encoding="UTF-8"?>
<schema name="example" version="1.5">
  <fields>
    <field name="url" type="string" indexed="true" stored="true" required="true" multiValued="false"/>
    <field name="cat" type="string" indexed="true" stored="true" required="true" multiValued="false"/>
    <field name="title" type="text_ja" indexed="true" stored="true" multiValued="false"/>
    <field name="body" type="text_ja" indexed="true" stored="true" multiValued="true"/>
    <field name="date" type="date" indexed="true" stored="true"/>
  </fields>
  <uniqueKey>url</uniqueKey>
  <types>
    <fieldType name="string" class="solr.StrField" sortMissingLast="true"/>
    <fieldType name="boolean" class="solr.BoolField" sortMissingLast="true"/>
    <fieldType name="int" class="solr.TrieIntField" precisionStep="0" positionIncrementGap="0"/>
    <fieldType name="float" class="solr.TrieFloatField" precisionStep="0" positionIncrementGap="0"/>
    <fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/>
    <fieldType name="double" class="solr.TrieDoubleField" precisionStep="0" positionIncrementGap="0"/>
    <fieldType name="date" class="solr.TrieDateField" precisionStep="0" positionIncrementGap="0"/>
    <fieldType name="text_ja" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="false">
      <analyzer>
        <tokenizer class="solr.JapaneseTokenizerFactory" mode="search"/>
        <filter class="solr.JapaneseBaseFormFilterFactory"/>
        <filter class="solr.JapanesePartOfSpeechStopFilterFactory" tags="lang/stoptags_ja.txt"/>
        <filter class="solr.CJKWidthFilterFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ja.txt"/>
        <filter class="solr.JapaneseKatakanaStemFilterFactory" minimumLength="4"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
  </types>
</schema>
Note that the cat field holds the classification class, while the body field is the field used for learning. First, start Solr with the above schema.xml and add the livedoor news corpus. You can stop Solr as soon as you finish adding the corpus.
Next, we need a Java program that uses SimpleNaiveBayesClassifier. To keep things simple, we reuse the documents used for learning, as is, for the classification test. The program looks as follows.
import java.io.File;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.ja.JapaneseAnalyzer;
import org.apache.lucene.classification.ClassificationResult;
import org.apache.lucene.classification.SimpleNaiveBayesClassifier;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.AtomicReader;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.SlowCompositeReaderWrapper;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.Version;

public final class TestLuceneIndexClassifier {

    public static final String INDEX = "solr2/collection1/data/index";
    public static final String[] CATEGORIES = {
        "dokujo-tsushin",
        "it-life-hack",
        "kaden-channel",
        "livedoor-homme",
        "movie-enter",
        "peachy",
        "smax",
        "sports-watch",
        "topic-news"
    };
    // counts[correct category][assigned category]
    private static int[][] counts;
    private static Map<String, Integer> catindex;

    public static void main(String[] args) throws Exception {
        init();

        final long startTime = System.currentTimeMillis();
        SimpleNaiveBayesClassifier classifier = new SimpleNaiveBayesClassifier();
        IndexReader reader = DirectoryReader.open(dir());
        AtomicReader ar = SlowCompositeReaderWrapper.wrap(reader);

        // The index itself is the training data: body is the text field, cat the class field.
        classifier.train(ar, "body", "cat", new JapaneseAnalyzer(Version.LUCENE_46));
        final int maxdoc = reader.maxDoc();
        // Classify every indexed document and tally the result against its stored category.
        for (int i = 0; i < maxdoc; i++) {
            Document doc = ar.document(i);
            String correctAnswer = doc.get("cat");
            final int cai = idx(correctAnswer);
            ClassificationResult<BytesRef> result = classifier.assignClass(doc.get("body"));
            String classified = result.getAssignedClass().utf8ToString();
            final int cli = idx(classified);
            counts[cai][cli]++;
        }
        final long endTime = System.currentTimeMillis();
        // Divide before narrowing to int so that long runs cannot overflow.
        final int elapse = (int) ((endTime - startTime) / 1000);

        // print results
        int fc = 0, tc = 0;
        for (int i = 0; i < CATEGORIES.length; i++) {
            for (int j = 0; j < CATEGORIES.length; j++) {
                System.out.printf(" %3d ", counts[i][j]);
                if (i == j) {
                    tc += counts[i][j];
                } else {
                    fc += counts[i][j];
                }
            }
            System.out.println();
        }
        float accrate = (float) tc / (float) (tc + fc);
        float errrate = (float) fc / (float) (tc + fc);
        System.out.printf("\n\n*** accuracy rate = %f, error rate = %f; time = %d (sec); %d docs\n", accrate, errrate, elapse, maxdoc);

        reader.close();
    }

    static Directory dir() throws IOException {
        return FSDirectory.open(new File(INDEX));
    }

    static void init() {
        counts = new int[CATEGORIES.length][CATEGORIES.length];
        catindex = new HashMap<String, Integer>();
        for (int i = 0; i < CATEGORIES.length; i++) {
            catindex.put(CATEGORIES[i], i);
        }
    }

    static int idx(String cat) {
        return catindex.get(cat);
    }
}
Here we specified JapaneseAnalyzer as the Analyzer (this differs slightly from index creation, where we used JapaneseTokenizer and the related TokenFilters through Solr). The string array CATEGORIES holds the hard-coded document categories. Executing this program displays a confusion matrix like Mahout's, with the rows (correct category) and columns (assigned category) in the same order as the hard-coded category array.
Executing this program displays the following.
 760    0    4   23   37   37    2    2    5
  40  656    7   44   25    4   90    1    3
  87   57  392  102   68   24  113    5   16
  40   15    6  391   33    8   16    2    0
  14    2    0    5  845    2    0    1    1
 134    2    2   26  107  549   19    3    0
  43   36   13   17   26   36  693    5    1
   6    0    0   23   35    0    1  829    6
  10    9    9   25   66    6    5   45  595

*** accuracy rate = 0.775078, error rate = 0.224922; time = 67 (sec); 7367 docs
The classification accuracy rate is about 77%.
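The accuracy figure is the sum of the confusion matrix diagonal divided by the sum of all cells. As a minimal sketch, the calculation can be reproduced from the Naive Bayes matrix printed above; the class name ConfusionMatrixStats and the per-class recall helper are additions for illustration and are not part of the original program.

```java
/** Accuracy and per-class recall from a confusion matrix (hypothetical helper). */
final class ConfusionMatrixStats {

    /** Rows are the correct category, columns the assigned one (Naive Bayes run above). */
    static final int[][] NAIVE_BAYES_MATRIX = {
        {760,   0,   4,  23,  37,  37,   2,   2,   5},
        { 40, 656,   7,  44,  25,   4,  90,   1,   3},
        { 87,  57, 392, 102,  68,  24, 113,   5,  16},
        { 40,  15,   6, 391,  33,   8,  16,   2,   0},
        { 14,   2,   0,   5, 845,   2,   0,   1,   1},
        {134,   2,   2,  26, 107, 549,  19,   3,   0},
        { 43,  36,  13,  17,  26,  36, 693,   5,   1},
        {  6,   0,   0,  23,  35,   0,   1, 829,   6},
        { 10,   9,   9,  25,  66,   6,   5,  45, 595}
    };

    /** Overall accuracy: diagonal sum divided by the sum of all cells. */
    static double accuracy(int[][] m) {
        long correct = 0;
        long total = 0;
        for (int i = 0; i < m.length; i++) {
            for (int j = 0; j < m[i].length; j++) {
                total += m[i][j];
                if (i == j) {
                    correct += m[i][j];
                }
            }
        }
        return total == 0 ? 0.0 : (double) correct / total;
    }

    /** Recall of one class: correctly assigned documents divided by the row total. */
    static double recall(int[][] m, int row) {
        long rowTotal = 0;
        for (int v : m[row]) {
            rowTotal += v;
        }
        return rowTotal == 0 ? 0.0 : (double) m[row][row] / rowTotal;
    }

    public static void main(String[] args) {
        System.out.printf("accuracy = %f%n", accuracy(NAIVE_BAYES_MATRIX));
        // Row 2 is kaden-channel in the CATEGORIES order.
        System.out.printf("recall(kaden-channel) = %f%n", recall(NAIVE_BAYES_MATRIX, 2));
    }
}
```

Looking at recall per row shows that the errors are not spread evenly: kaden-channel and livedoor-homme are recognized much less reliably than movie-enter or sports-watch.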
Using Lucene KNearestNeighborClassifier
The other implementation class of Classifier is KNearestNeighborClassifier. You create an instance by passing k, which must be at least 1, to the constructor. The program is exactly the same as the one for SimpleNaiveBayesClassifier; all you need to do is replace the part that creates the SimpleNaiveBayesClassifier instance with KNearestNeighborClassifier.
As described above, assignClass() does all the work for KNearestNeighborClassifier as well; one interesting point is that it uses Lucene's MoreLikeThis. MoreLikeThis is a tool that treats a reference document as a query and performs a search, which lets you find documents similar to the reference document. KNearestNeighborClassifier uses MoreLikeThis to obtain the k documents most similar to the unknown document passed to assignClass(). A majority vote over those k documents then determines the category of the unknown document.
Running the same program with KNearestNeighborClassifier displays the following for k=1.
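The "find the k most similar documents, then take a majority vote" idea can be sketched without Lucene as follows. Note the substitution: this toy uses cosine similarity over raw term frequencies in place of a MoreLikeThis search, so it illustrates the voting rule rather than how KNearestNeighborClassifier is actually implemented. The class KnnSketch and its sample data are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy k-NN text classifier (hypothetical sketch; cosine similarity stands in for MoreLikeThis). */
final class KnnSketch {

    /** A labeled training document reduced to term frequencies. */
    static final class Example {
        final String label;
        final Map<String, Integer> tf;

        Example(String label, String... tokens) {
            this.label = label;
            this.tf = termFrequencies(tokens);
        }
    }

    static Map<String, Integer> termFrequencies(String... tokens) {
        Map<String, Integer> tf = new HashMap<>();
        for (String t : tokens) {
            tf.merge(t, 1, Integer::sum);
        }
        return tf;
    }

    /** Cosine similarity between two term-frequency vectors. */
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
        }
        for (int v : b.values()) {
            normB += v * v;
        }
        return (normA == 0 || normB == 0) ? 0.0 : dot / Math.sqrt(normA * normB);
    }

    /** Majority vote among the k training examples most similar to the query. */
    static String classify(List<Example> training, int k, String... queryTokens) {
        Map<String, Integer> query = termFrequencies(queryTokens);
        List<Example> sorted = new ArrayList<>(training);
        // Most similar first; the sort is stable, so ties keep their original order.
        sorted.sort(Comparator.comparingDouble((Example ex) -> cosine(query, ex.tf)).reversed());
        Map<String, Integer> votes = new HashMap<>();
        String best = null;
        for (Example ex : sorted.subList(0, Math.min(k, sorted.size()))) {
            int v = votes.merge(ex.label, 1, Integer::sum);
            if (best == null || v > votes.get(best)) {
                best = ex.label; // on a tie, the label that reached the count first wins
            }
        }
        return best;
    }

    /** Tiny made-up training set used only to exercise the sketch. */
    static List<Example> sampleTraining() {
        return Arrays.asList(
            new Example("sports", "goal", "match", "team"),
            new Example("sports", "team", "win", "goal"),
            new Example("it", "phone", "app", "update"),
            new Example("it", "app", "update", "phone", "battery"));
    }

    public static void main(String[] args) {
        List<Example> training = sampleTraining();
        System.out.println(classify(training, 3, "team", "goal", "app"));
        System.out.println(classify(training, 1, "phone", "update"));
    }
}
```

As with Lucene's classifier, nothing is precomputed here: every call to classify() compares the query against the whole training set, which is why the cost sits in the classification phase.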
 724   14   28   22    6   30    8   18   20
 121  630   41   13    2    9   35    6   13
 165   28  582   10    5   16   26    7   25
 229   15   15  213    6   14    6    2   11
 134   37   15    8  603   12   19    7   35
 266   38   39   24   14  412   22    9   18
 810   16    1    3    2    3   32    1    2
 316   18   14   12    5    7    8  439   81
 362   17   29   10    1    7    7   16  321

*** accuracy rate = 0.536989, error rate = 0.463011; time = 13 (sec); 7367 docs
Now the accuracy rate is 53%. With k=3, the accuracy rate goes down further to 48%.
 652    5   78    3    7   40   13   38   34
 127  540   82   15    1   10   58   23   14
 169   34  553    3    7   16   38   15   29
 242   10   32  156   12   13   15   10   21
 136   30   21    9  592   11   19   15   37
 309   34   58    5   23  318   40   28   27
 810    8    3    1    0   10   37    1    0
 312    8   44    7    5    2   13  442   67
 362   11   45    5    6   10   16   34  281

*** accuracy rate = 0.484729, error rate = 0.515271; time = 9 (sec); 7367 docs
Document Classification by NLP4L and Mahout
If you want to use a Lucene index as input data for Mahout, there is a handy command available. However, since our purpose is supervised document classification, you need to output the field information that specifies the class in addition to the document vectors.
The tools that make this easy are MSDDumper and TermsDumper from NLP4L, which we developed. NLP4L stands for Natural Language Processing for Lucene; it is a natural language processing tool set that treats a Lucene index as a corpus.
Depending on the settings, MSDDumper and TermsDumper select and extract important words from a Lucene field according to measures such as tf*idf and output them in a format that Mahout commands can read easily. Let’s use this function to select 2,000 important words from the body field of the index and run the Mahout classification.
Looking only at the result, Mahout Naive Bayes shows an accuracy rate of 96%.
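To give an idea of what "selecting important words by tf*idf" means, here is a minimal sketch in plain Java. It is not NLP4L's code: the class TfIdfTopTerms, the scoring formula (total term frequency multiplied by log(N/df)) and the alphabetical tie-break are assumptions chosen for the example, and the article does not state which tf*idf variant the dumpers actually use.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

/** Picks the n highest-scoring terms of a small corpus by tf*idf (hypothetical sketch). */
final class TfIdfTopTerms {

    static List<String> topTerms(List<List<String>> docs, int n) {
        Map<String, Integer> termFreq = new HashMap<>(); // occurrences over the whole corpus
        Map<String, Integer> docFreq = new HashMap<>();  // number of documents containing the term
        for (List<String> doc : docs) {
            for (String t : doc) {
                termFreq.merge(t, 1, Integer::sum);
            }
            for (String t : new HashSet<>(doc)) {
                docFreq.merge(t, 1, Integer::sum);
            }
        }
        // score = tf * log(N / df); a term that occurs in every document scores 0.
        Map<String, Double> score = new HashMap<>();
        for (Map.Entry<String, Integer> e : termFreq.entrySet()) {
            double idf = Math.log((double) docs.size() / docFreq.get(e.getKey()));
            score.put(e.getKey(), e.getValue() * idf);
        }
        List<String> terms = new ArrayList<>(score.keySet());
        // Highest score first; ties are broken alphabetically so the result is deterministic.
        terms.sort((a, b) -> {
            int c = Double.compare(score.get(b), score.get(a));
            return c != 0 ? c : a.compareTo(b);
        });
        return new ArrayList<>(terms.subList(0, Math.min(n, terms.size())));
    }

    /** Tiny made-up corpus used only to exercise the sketch. */
    static List<List<String>> sampleDocs() {
        return Arrays.asList(
            Arrays.asList("the", "camera", "camera", "lens"),
            Arrays.asList("the", "match", "goal", "goal", "goal"),
            Arrays.asList("the", "camera", "phone"));
    }

    public static void main(String[] args) {
        System.out.println(topTerms(sampleDocs(), 3));
    }
}
```

In the sample corpus the word "the" appears in every document and therefore drops to the bottom of the ranking, which is the effect that makes such a selection useful before classification.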
=======================================================
Summary
-------------------------------------------------------
Correctly Classified Instances          :       7128       96.7689%
Incorrectly Classified Instances        :        238        3.2311%
Total Classified Instances              :       7366

=======================================================
Confusion Matrix
-------------------------------------------------------
a     b     c     d     e     f     g     h     i     <--Classified as
823   1     1     6     12    19    2     4     2      |  870   a = dokujo-tsushin
1     848   2     1     0     1     11    4     2      |  870   b = it-life-hack
5     6     830   1     1     0     3     1     17     |  864   c = kaden-channel
2     6     6     486   3     1     6     0     0      |  510   d = livedoor-homme
0     0     1     1     865   1     0     1     1      |  870   e = movie-enter
31    3     6     12    14    762   6     4     4      |  842   f = peachy
0     0     2     0     0     1     867   0     0      |  870   g = smax
0     0     0     1     0     0     0     897   2      |  900   h = sports-watch
2     4     1     1     0     0     0     12    750    |  770   i = topic-news

=======================================================
Statistics
-------------------------------------------------------
Kappa                                        0.955
Accuracy                                   96.7689%
Reliability                                87.0076%
Reliability (standard deviation)             0.307
Mahout Random Forest shows an accuracy rate of 97%.
=======================================================
Summary
-------------------------------------------------------
Correctly Classified Instances          :       7156       97.1359%
Incorrectly Classified Instances        :        211        2.8641%
Total Classified Instances              :       7367

=======================================================
Confusion Matrix
-------------------------------------------------------
a     b     c     d     e     f     g     h     i     <--Classified as
838   5     2     6     3     7     2     0     1      |  864   a = kaden-channel
0     895   0     1     4     0     0     0     0      |  900   b = sports-watch
0     0     869   0     0     1     0     0     0      |  870   c = smax
0     2     0     839   1     0     14    2     12     |  870   d = dokujo-tsushin
1     17    0     0     748   0     2     0     2      |  770   e = topic-news
1     5     0     1     5     855   2     0     1      |  870   f = it-life-hack
0     1     0     23    0     0     793   1     24     |  842   g = peachy
0     11    0     14    1     2     18    454   11     |  511   h = livedoor-homme
0     1     0     2     0     0     2     0     865    |  870   i = movie-enter

=======================================================
Statistics
-------------------------------------------------------
Kappa                                       0.9608
Accuracy                                   97.1359%
Reliability                                87.0627%
Reliability (standard deviation)            0.3076
Summary
In this article, we ran the document classification functions of both Lucene and Mahout on the same corpus and compared the results. Mahout's accuracy rate appears to be higher, but, as already stated, its learning data does not use all the words: it uses only the top 2,000 important words in the body field. Lucene’s classifier, whose accuracy rate was only in the 70% range, uses all the words in the body field. Lucene should be able to exceed a 90% accuracy rate if you prepare a field that holds only words selected specifically for document classification. It may also be a good idea to create another Classifier implementation class whose train() method performs such a selection.
I should add that the accuracy rate goes down to around 80% when the test data is not used for learning and is instead classified as truly unknown data.
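Evaluating on truly unknown data means keeping part of the corpus out of the learning step. A minimal sketch of such a hold-out split over document ids follows; the class HoldOutSplit, the 80/20 ratio and the fixed seed are assumptions for illustration, and the article does not say how its own split was made. With Lucene, one possible way to apply the result would be the Query argument of train(), which narrows down the documents used for learning.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Splits document ids into a training part and a held-out test part (hypothetical helper). */
final class HoldOutSplit {

    /**
     * Shuffles the ids 0..numDocs-1 with a fixed seed and returns two lists:
     * index 0 holds the training ids, index 1 the test ids.
     */
    static List<List<Integer>> split(int numDocs, double testFraction, long seed) {
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < numDocs; i++) {
            ids.add(i);
        }
        Collections.shuffle(ids, new Random(seed)); // fixed seed keeps the split repeatable
        int testSize = (int) Math.round(numDocs * testFraction);
        List<List<Integer>> parts = new ArrayList<>();
        parts.add(new ArrayList<>(ids.subList(testSize, numDocs))); // training ids
        parts.add(new ArrayList<>(ids.subList(0, testSize)));       // test ids
        return parts;
    }

    public static void main(String[] args) {
        // 7367 is the number of documents reported by the runs above.
        List<List<Integer>> parts = split(7367, 0.2, 42L);
        System.out.printf("training = %d docs, test = %d docs%n",
            parts.get(0).size(), parts.get(1).size());
    }
}
```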
I hope this article will help you all in some way.
http://soleami.com/blog/comparing-document-classification-functions-of-lucene-and-mahout.html