Term vectors are an advanced topic in Lucene. They enable many interesting features, such as returning products similar to the one currently being viewed. Whenever you need to return "things similar to X", term vectors are worth considering; in Lucene this is what MoreLikeThis implements.
A term vector is essentially a mathematical model built from how often each term occurs in a document and how many documents contain the term; the similarity of two term vectors can be judged by computing the angle between them. The MoreLikeThis implementation shipped with Lucene 5, however, computes similarity by scoring: candidate terms are placed into a priority queue ordered by their final score, so the highest-scoring terms naturally sit at the top of the queue.
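For reference, the angle between two term vectors A and B follows from the standard cosine formula (plain linear algebra, nothing Lucene-specific):

$$\theta = \arccos\left(\frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}\right)$$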
```java
/**
 * Create a PriorityQueue from a word->tf map.
 *
 * @param words a map of words keyed on the word(String) with Int objects as the values.
 */
private PriorityQueue<ScoreTerm> createQueue(Map<String, Int> words) throws IOException {
    // have collected all words in doc and their freqs
    int numDocs = ir.numDocs();
    final int limit = Math.min(maxQueryTerms, words.size());
    FreqQ queue = new FreqQ(limit); // will order words by score
    for (String word : words.keySet()) { // for every word
        int tf = words.get(word).x; // term freq in the source doc
        if (minTermFreq > 0 && tf < minTermFreq) {
            continue; // filter out words that don't occur enough times in the source
        }
        // go through all the fields and find the largest document frequency
        String topField = fieldNames[0];
        int docFreq = 0;
        for (String fieldName : fieldNames) {
            int freq = ir.docFreq(new Term(fieldName, word));
            topField = (freq > docFreq) ? fieldName : topField;
            docFreq = (freq > docFreq) ? freq : docFreq;
        }
        if (minDocFreq > 0 && docFreq < minDocFreq) {
            continue; // filter out words that don't occur in enough docs
        }
        if (docFreq > maxDocFreq) {
            continue; // filter out words that occur in too many docs
        }
        if (docFreq == 0) {
            continue; // index update problem?
        }
        float idf = similarity.idf(docFreq, numDocs);
        float score = tf * idf;
        if (queue.size() < limit) {
            // there is still space in the queue
            queue.add(new ScoreTerm(word, topField, score, idf, docFreq, tf));
        } else {
            ScoreTerm term = queue.top();
            if (term.score < score) { // update the smallest in the queue in place and update the queue.
                term.update(word, topField, score, idf, docFreq, tf);
                queue.updateTop();
            }
        }
    }
    return queue;
}
```
In essence, each candidate term's score is just TF-IDF computed through the Similarity: `score = tf * idf`.
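As a concrete illustration, with Lucene 5's DefaultSimilarity the idf factor is `1 + ln(numDocs / (docFreq + 1))`; a minimal sketch with made-up numbers:

```java
// Minimal sketch assuming DefaultSimilarity's idf formula:
//   idf = 1 + ln(numDocs / (docFreq + 1))
// All numbers below are made up for illustration.
int numDocs = 1000; // total documents in the index
int docFreq = 9;    // documents containing the term
int tf      = 3;    // occurrences of the term in the source document
float idf   = (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
float score = tf * idf; // 3 * (1 + ln(100)) ≈ 16.8
```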
Lucene 5 offers two ways to retrieve term vectors:
1. By document id:
```java
Fields vectors = reader.getTermVectors(docID);
```
2. By document id and field name:
```java
Terms termFreqVector = reader.getTermVector(i, "subject");
```
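Note that these methods only return data for fields that were indexed with term vectors enabled; a minimal sketch of such a field definition (the "subject" field matches the examples below):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.TextField;

FieldType type = new FieldType(TextField.TYPE_STORED);
type.setStoreTermVectors(true);          // store the term vector itself
type.setStoreTermVectorPositions(true);  // optional: positions
type.setStoreTermVectorOffsets(true);    // optional: character offsets
type.freeze();
Document doc = new Document();
doc.add(new Field("subject", "extreme agile methodology", type));
```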
Below is sample code demonstrating TermVector operations in Lucene 5:
```java
package com.yida.framework.lucene5.termvector;

import java.io.IOException;
import java.nio.file.Paths;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.CharsRefBuilder;

/**
 * Find similar books - test
 * @author Lanxiaowei
 */
public class BookLikeThis {
    public static void main(String[] args) throws IOException {
        String indexDir = "C:/lucenedir";
        Directory directory = FSDirectory.open(Paths.get(indexDir));
        IndexReader reader = DirectoryReader.open(directory);
        IndexSearcher searcher = new IndexSearcher(reader);
        // upper bound on document IDs (one greater than the largest doc ID)
        int numDocs = reader.maxDoc();
        BookLikeThis blt = new BookLikeThis();
        for (int i = 0; i < numDocs; i++) {
            System.out.println();
            Document doc = reader.document(i);
            System.out.println(doc.get("title"));
            Document[] docs = blt.docsLike(reader, searcher, i, 10);
            if (docs.length == 0) {
                System.out.println(" -> Sorry, none like this");
            }
            for (Document likeThisDoc : docs) {
                System.out.println(" -> " + likeThisDoc.get("title"));
            }
        }
        reader.close();
        directory.close();
    }

    public Document[] docsLike(IndexReader reader, IndexSearcher searcher,
            int id, int max) throws IOException {
        // load the document by its doc id
        Document doc = reader.document(id);
        // get all authors of this book
        String[] authors = doc.getValues("author");
        BooleanQuery authorQuery = new BooleanQuery();
        // match books sharing any of the authors
        for (String author : authors) {
            authorQuery.add(new TermQuery(new Term("author", author)), Occur.SHOULD);
        }
        // give author matches double weight
        authorQuery.setBoost(2.0f);
        // fetch the term vector of the subject field
        Terms vector = reader.getTermVector(id, "subject");
        TermsEnum termsEnum = vector.iterator(null);
        CharsRefBuilder spare = new CharsRefBuilder();
        BytesRef text = null;
        BooleanQuery subjectQuery = new BooleanQuery();
        while ((text = termsEnum.next()) != null) {
            spare.copyUTF8Bytes(text);
            String term = spare.toString();
            // (a noise-word filter could be applied here)
            TermQuery tq = new TermQuery(new Term("subject", term));
            // build a BooleanQuery from the terms in the subject field's term vector
            subjectQuery.add(tq, Occur.SHOULD);
        }
        BooleanQuery likeThisQuery = new BooleanQuery();
        likeThisQuery.add(authorQuery, BooleanClause.Occur.SHOULD);
        likeThisQuery.add(subjectQuery, BooleanClause.Occur.SHOULD);
        // exclude the source document itself
        likeThisQuery.add(new TermQuery(new Term("isbn", doc.get("isbn"))),
                BooleanClause.Occur.MUST_NOT);
        // use max, not a hard-coded 10, so the parameter actually bounds results
        TopDocs hits = searcher.search(likeThisQuery, max);
        int size = max;
        if (max > hits.scoreDocs.length) {
            size = hits.scoreDocs.length;
        }
        Document[] docs = new Document[size];
        for (int i = 0; i < size; i++) {
            docs[i] = reader.document(hits.scoreDocs[i].doc);
        }
        return docs;
    }
}
```
Here is sample code that judges similarity by computing the angle between term vectors:
```java
package com.yida.framework.lucene5.termvector;

import java.io.IOException;
import java.nio.file.Paths;
import java.util.Iterator;
import java.util.Map;
import java.util.TreeMap;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.CharsRefBuilder;

/**
 * Automatic book categorization via term vectors
 * (the smaller the angle between vectors, the higher the similarity)
 *
 * @author Lanxiaowei
 */
public class CategoryTest {
    public static void main(String[] args) throws IOException {
        String indexDir = "C:/lucenedir";
        Directory directory = FSDirectory.open(Paths.get(indexDir));
        IndexReader reader = DirectoryReader.open(directory);
        Map<String, Map<String, Integer>> categoryMap = new TreeMap<String, Map<String, Integer>>();
        // build a term vector for each category
        buildCategoryVectors(categoryMap, reader);
        getCategory("extreme agile methodology", categoryMap);
        getCategory("montessori education philosophy", categoryMap);
    }

    /**
     * Pick the category automatically from the term vectors
     * (returns the category with the smallest angle, i.e. the highest similarity).
     */
    private static String getCategory(String subject,
            Map<String, Map<String, Integer>> categoryMap) {
        // split the subject on spaces
        String[] words = subject.split(" ");
        Iterator<String> categoryIterator = categoryMap.keySet().iterator();
        double bestAngle = Double.MAX_VALUE;
        String bestCategory = null;
        while (categoryIterator.hasNext()) {
            String category = categoryIterator.next();
            double angle = computeAngle(categoryMap, words, category);
            if (angle < bestAngle) {
                bestAngle = angle;
                bestCategory = category;
            }
        }
        System.out.println("The best like:" + bestCategory + "-->" + subject);
        return bestCategory;
    }

    public static void buildCategoryVectors(
            Map<String, Map<String, Integer>> categoryMap, IndexReader reader)
            throws IOException {
        int maxDoc = reader.maxDoc();
        // iterate over all documents in the index
        for (int i = 0; i < maxDoc; i++) {
            Document doc = reader.document(i);
            // read the value of the category field
            String category = doc.get("category");
            Map<String, Integer> vectorMap = categoryMap.get(category);
            if (vectorMap == null) {
                vectorMap = new TreeMap<String, Integer>();
                categoryMap.put(category, vectorMap);
            }
            Terms termFreqVector = reader.getTermVector(i, "subject");
            TermsEnum termsEnum = termFreqVector.iterator(null);
            addTermFreqToMap(vectorMap, termsEnum);
        }
    }

    /**
     * Accumulate, for each term in the term vector, the number of documents
     * that contain it; the key is the term text, the value the total document count.
     */
    private static void addTermFreqToMap(Map<String, Integer> vectorMap,
            TermsEnum termsEnum) throws IOException {
        CharsRefBuilder spare = new CharsRefBuilder();
        BytesRef text = null;
        while ((text = termsEnum.next()) != null) {
            spare.copyUTF8Bytes(text);
            String term = spare.toString();
            int docFreq = termsEnum.docFreq();
            System.out.println("term:" + term + "-->docFreq:" + docFreq);
            // if the term is already present, add to its accumulated frequency
            if (vectorMap.containsKey(term)) {
                vectorMap.put(term, vectorMap.get(term) + docFreq);
            } else {
                vectorMap.put(term, docFreq);
            }
        }
    }

    /**
     * Compute the angle between the query and a category term vector
     * (the smaller the angle, the greater the similarity). The query is
     * treated as a binary vector: each query word has weight 1, so the dot
     * product reduces to the sum of the category's frequencies for the query
     * words, and the query vector's norm is sqrt(words.length).
     */
    private static double computeAngle(Map<String, Map<String, Integer>> categoryMap,
            String[] words, String category) {
        Map<String, Integer> vectorMap = categoryMap.get(category);
        int dotProduct = 0;
        int sumOfSquares = 0;
        for (String word : words) {
            int categoryWordFreq = 0;
            if (vectorMap.containsKey(word)) {
                categoryWordFreq = vectorMap.get(word).intValue();
            }
            dotProduct += categoryWordFreq;
            sumOfSquares += categoryWordFreq * categoryWordFreq;
        }
        double denominator;
        if (sumOfSquares == words.length) {
            // special case: every frequency is 1, so
            // sqrt(sumOfSquares) * sqrt(words.length) == sumOfSquares;
            // this avoids floating-point precision issues
            denominator = sumOfSquares;
        } else {
            denominator = Math.sqrt(sumOfSquares) * Math.sqrt(words.length);
        }
        double ratio = dotProduct / denominator;
        return Math.acos(ratio);
    }
}
```
A MoreLikeThis usage example:
```java
package com.yida.framework.lucene5.termvector;

import java.io.IOException;
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queries.mlt.MoreLikeThis;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

/**
 * MoreLikeThis (more like this) test
 *
 * @author Lanxiaowei
 */
public class MoreLikeThisTest {
    public static void main(String[] args) throws IOException {
        String indexDir = "C:/lucenedir";
        Directory directory = FSDirectory.open(Paths.get(indexDir));
        IndexReader reader = DirectoryReader.open(directory);
        IndexSearcher searcher = new IndexSearcher(reader);
        MoreLikeThis moreLikeThis = new MoreLikeThis(reader);
        moreLikeThis.setAnalyzer(new StandardAnalyzer());
        // fields from which candidate terms are extracted
        moreLikeThis.setFieldNames(new String[] { "title", "author", "subject" });
        moreLikeThis.setMinTermFreq(1);
        moreLikeThis.setMinDocFreq(1);
        int docNum = 1;
        Query query = moreLikeThis.like(docNum);
        TopDocs topDocs = searcher.search(query, Integer.MAX_VALUE);
        ScoreDoc[] scoreDocs = topDocs.scoreDocs;
        // the book with document id 1
        System.out.println(reader.document(docNum).get("title") + "-->");
        for (ScoreDoc sdoc : scoreDocs) {
            Document doc = reader.document(sdoc.doc);
            // books similar to the book with document id 1
            System.out.println(" more like this: " + doc.get("title"));
        }
    }
}
```
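Beyond setMinTermFreq and setMinDocFreq, MoreLikeThis exposes a few more tuning knobs; the values below are purely illustrative:

```java
moreLikeThis.setMaxQueryTerms(25); // cap the number of query terms generated
moreLikeThis.setMinWordLen(3);     // ignore words shorter than 3 characters
moreLikeThis.setMaxWordLen(20);    // ignore words longer than 20 characters
moreLikeThis.setMaxDocFreq(1000);  // skip terms occurring in more than 1000 docs
```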
A MoreLikeThisQuery usage example:
```java
package com.yida.framework.lucene5.termvector;

import java.io.IOException;
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queries.mlt.MoreLikeThisQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

/**
 * MoreLikeThisQuery test
 * @author Lanxiaowei
 */
public class MoreLikeThisQueryTest {
    public static void main(String[] args) throws IOException {
        String indexDir = "C:/lucenedir";
        Directory directory = FSDirectory.open(Paths.get(indexDir));
        IndexReader reader = DirectoryReader.open(directory);
        IndexSearcher searcher = new IndexSearcher(reader);
        String[] moreLikeFields = new String[] { "title", "author" };
        // "lucene in action" is the likeText; "author" is the field whose
        // type governs how the likeText is analyzed
        MoreLikeThisQuery query = new MoreLikeThisQuery("lucene in action",
                moreLikeFields, new StandardAnalyzer(), "author");
        query.setMinDocFreq(1);
        query.setMinTermFrequency(1);
        TopDocs topDocs = searcher.search(query, Integer.MAX_VALUE);
        ScoreDoc[] scoreDocs = topDocs.scoreDocs;
        for (ScoreDoc sdoc : scoreDocs) {
            Document doc = reader.document(sdoc.doc);
            // books similar to the likeText "lucene in action"
            System.out.println(" more like this: " + doc.get("title"));
        }
    }
}
```
Note that MoreLikeThisQuery requires an analyzer, because you supply a likeText (the reference text) that must be tokenized into terms; the TF-IDF of each term is then computed to produce the final score. Since tokenization depends on the field type, you must also supply a fieldName, and the likeText is analyzed according to that field's type. The moreLikeFields parameter controls which fields terms are extracted from when computing similarity. You can also remove stop words from the likeText via setStopWords.
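A minimal sketch of stop-word removal (the stop-word set here is made up for illustration):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// illustrative stop words; they are ignored when the likeText is analyzed
Set<String> stopWords = new HashSet<String>(Arrays.asList("in", "the", "a"));
query.setStopWords(stopWords);
```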
Finally, a small tip. You all know that reader.document(docId) loads an indexed document by its doc id, but it has an overload that deserves your attention:
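The overload is `document(int docID, Set<String> fieldsToLoad)`; a minimal sketch (the field names are illustrative):

```java
Set<String> fieldsToLoad = new HashSet<String>();
fieldsToLoad.add("title");
fieldsToLoad.add("isbn");
// loads only the title and isbn stored fields; all others are skipped
Document doc = reader.document(docId, fieldsToLoad);
```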
The Set argument specifies which stored field values to load; with the single-argument version, every Store.YES field is returned. The benefit is lower memory usage: if you know certain field values won't be shown to the user for a given query, leave them out of the Set. It is analogous to `select * from table` versus `select id,name from table` in SQL.
We have now seen two ways of computing similarity: the vector-angle approach and TF-IDF scoring. You can also implement a similarity algorithm of your own, but that goes beyond Lucene itself; it is a matter of algorithm design, and a good algorithm determines how precise the matches are.