Reference: http://blog.csdn.net/zzpchina/archive/2006/01/15/579875.aspx
The term IR (Information Retrieval) describes search tools like Lucene.
I couldn't grab the second edition of Lucene in Action on Amazon, so I went straight to the source code: http://www.manning.com/hatcher3/LIAsourcecode.zip
It uses lucene-core-3.0.2.jar.
-------------------------------------------------------
Evaluating search quality:
D.5.1 Precision and recall

Precision and recall are standard metrics in the information retrieval community for objectively measuring the relevance of search results. Precision measures what fraction of the documents returned for each query were relevant. For example, if a query has 20 hits and only 1 is relevant, precision is 0.05. If only 1 hit was returned and it was relevant, precision is 1.0. Recall measures what percentage of the relevant documents for that query was actually returned. So if the query listed 8 documents as being relevant, but 6 were in the result set, that's a recall of 0.75.

In a properly configured search application, these two measures are naturally at odds with one another. Say, at one extreme, you only show the user the very best (top 1) document matching their query. With such an approach, your precision will typically be high, because the first result has a good chance of being relevant, while your recall would be very low, because if there are many relevant documents for a given query you have only returned one of them. If we increase top 1 to top 10, we suddenly return many documents for each query. The precision will necessarily drop because most likely you are now allowing some non-relevant documents into the result set. But recall should increase because each query should return a larger subset of its relevant documents.

Still, you'd like the relevant documents to be higher up in the ranking. To measure this, average precision is computed. This measure computes precision at each of the N cutoffs, where N ranges from 1 to a maximum value, and then takes the average. So this measure is higher if your search application generally returns relevant documents earlier in the result set. Mean average precision, or MAP, then measures the mean of average precision across a set of queries. A related measure, mean reciprocal rank or MRR, measures 1/M where M is the first rank that had a relevant document. You want both of these numbers to be as high as possible!
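To make these definitions concrete, here is a minimal self-contained sketch (not from the book; the ranked list and relevance judgments are made-up toy data) that computes precision, recall, and average precision for a single query. It uses the common formulation that averages the precision at each rank where a relevant document appears, divided by the total number of relevant documents:

import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class MetricsDemo {
    public static void main(String[] args) {
        // hypothetical search result, best hit first, and the ground-truth relevant docs
        List<String> ranked = Arrays.asList("d3", "d7", "d1", "d9", "d4");
        Set<String> relevant = new HashSet<String>(Arrays.asList("d1", "d3", "d8"));

        int hits = 0;
        double precisionSum = 0.0;
        for (int i = 0; i < ranked.size(); i++) {
            if (relevant.contains(ranked.get(i))) {
                hits++;
                // precision at the cutoff where this relevant doc appears
                precisionSum += (double) hits / (i + 1);
            }
        }
        double precision = (double) hits / ranked.size();          // 2/5 = 0.4
        double recall = (double) hits / relevant.size();           // 2/3 ~ 0.67
        double averagePrecision = precisionSum / relevant.size();  // (1/1 + 2/3)/3 ~ 0.56

        System.out.println("precision@" + ranked.size() + " = " + precision);
        System.out.println("recall = " + recall);
        System.out.println("average precision = " + averagePrecision);
    }
}

For this toy data, precision@5 is 0.4, recall is 2/3, and average precision is (1/1 + 2/3)/3 ≈ 0.56. The book's own benchmark driver, PrecisionRecall, wires these metrics up against a real index: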
import java.io.File;
import java.io.PrintWriter;
import java.io.BufferedReader;
import java.io.FileReader;

import org.apache.lucene.search.*;
import org.apache.lucene.store.*;
import org.apache.lucene.benchmark.quality.*;
import org.apache.lucene.benchmark.quality.utils.*;
import org.apache.lucene.benchmark.quality.trec.*;

public class PrecisionRecall {

    public static void main(String[] args) throws Throwable {
        File topicsFile = new File("D:/Workspaces/suanfa/sohu3/src/lia/benchmark/topics.txt");
        File qrelsFile = new File("D:/Workspaces/suanfa/sohu3/src/lia/benchmark/qrels.txt");
        Directory dir = FSDirectory.open(new File("indexes/MeetLucene"));
        org.apache.lucene.search.Searcher searcher = new IndexSearcher(dir, true);

        String docNameField = "filename";
        PrintWriter logger = new PrintWriter(System.out, true);

        TrecTopicsReader qReader = new TrecTopicsReader();               // #1 read the TREC topics (queries)
        QualityQuery qqs[] = qReader.readQueries(
            new BufferedReader(new FileReader(topicsFile)));             // #1
        Judge judge = new TrecJudge(
            new BufferedReader(new FileReader(qrelsFile)));              // #2 read the relevance judgments
        judge.validateData(qqs, logger);                                 // #3 verify topics and judgments match
        QualityQueryParser qqParser = new SimpleQQParser("title", "contents"); // #4 topic title -> query on "contents"
        QualityBenchmark qrun = new QualityBenchmark(qqs, qqParser, searcher, docNameField);
        SubmissionReport submitLog = null;
        QualityStats stats[] = qrun.execute(judge, submitLog, logger);   // #5 run all queries against the index
        QualityStats avg = QualityStats.average(stats);                  // #6 print MAP/MRR summary statistics
        avg.log("SUMMARY", 2, logger, " ");
        dir.close();
    }
}
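The two input files follow TREC conventions; the real ones ship in LIAsourcecode under src/lia/benchmark. Roughly (illustrative values, not the actual file contents): topics.txt holds one <top> block per query,

<top>
<num> Number: 0
<title> apache
<desc> Description:
<narr> Narrative:
</top>

and qrels.txt holds one whitespace-separated judgment per line in the form "queryID iteration docName relevance", e.g. "0 0 apache1.0.txt 1". The iteration column is ignored, and a non-zero relevance marks the document as relevant for that query.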
-----------------------------------------------------
The hello world is in LIAsourcecode\lia2e\src\lia\meetlucene\Indexer.java. A simplified version:
import java.io.File;
import java.io.FileFilter;
import java.io.FileReader;
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class Indexer {

    public static void main(String[] args) throws IOException {
        String indexDir = "D:\\Workspaces\\suanfa\\sohu3\\src\\lia\\meetlucene\\index"; // args[0]
        String dataDir = "D:\\Workspaces\\suanfa\\sohu3\\src\\lia\\meetlucene\\data";   // args[1]

        long start = System.currentTimeMillis();
        Directory dir = FSDirectory.open(new File(indexDir));
        IndexWriter writer = new IndexWriter(dir,
            new StandardAnalyzer(Version.LUCENE_30),
            true,                                      // create a fresh index
            IndexWriter.MaxFieldLength.UNLIMITED);
        int numIndexed = 0;
        try {
            TextFilesFilter filter = new TextFilesFilter();
            File[] files = new File(dataDir).listFiles();
            for (File f : files) {
                if (!f.isDirectory() && !f.isHidden() && f.exists()
                        && f.canRead() && filter.accept(f)) {
                    System.out.println("Indexing " + f.getCanonicalPath());
                    Document doc = new Document();
                    doc.add(new Field("contents", new FileReader(f)));  // tokenized + indexed, not stored
                    doc.add(new Field("filename", f.getName(),
                        Field.Store.YES, Field.Index.NOT_ANALYZED));    // stored, indexed as one exact term
                    doc.add(new Field("fullpath", f.getCanonicalPath(),
                        Field.Store.YES, Field.Index.NOT_ANALYZED));
                    writer.addDocument(doc);
                    numIndexed = writer.numDocs();
                }
            }
        } finally {
            writer.close();
        }
        long end = System.currentTimeMillis();
        System.out.println("Indexing " + numIndexed + " files took "
            + (end - start) + " milliseconds");
    }

    private static class TextFilesFilter implements FileFilter {
        public boolean accept(File path) {
            return path.getName().toLowerCase().endsWith(".txt"); // only index .txt files
        }
    }
}
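Two details in this indexer are easy to miss: the "contents" field is built from a Reader, so it is tokenized and indexed but not stored (the Reader is consumed at indexing time, and the text itself cannot be retrieved from the index), while "filename" and "fullpath" are stored but NOT_ANALYZED, i.e. indexed as single exact terms. That is why the matching search side below prints the stored "fullpath" rather than the document body: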
import java.io.File;
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class Searcher {

    public static void main(String[] args)
            throws IllegalArgumentException, IOException, ParseException {
        String indexDir = "D:\\Workspaces\\suanfa\\sohu3\\src\\lia\\meetlucene\\index"; // args[0]
        String q = "Redistri*";                                                         // args[1]

        Directory dir = FSDirectory.open(new File(indexDir));       // open the index built by Indexer
        IndexSearcher is = new IndexSearcher(dir);
        QueryParser parser = new QueryParser(Version.LUCENE_30, "contents",
            new StandardAnalyzer(Version.LUCENE_30));                // parse q against the "contents" field
        Query query = parser.parse(q);

        long start = System.currentTimeMillis();
        TopDocs hits = is.search(query, 10);                         // fetch the top 10 hits
        long end = System.currentTimeMillis();

        System.err.println("Found " + hits.totalHits + " document(s) (in " + (end - start)
            + " milliseconds) that matched query '" + q + "':");
        for (ScoreDoc scoreDoc : hits.scoreDocs) {
            Document doc = is.doc(scoreDoc.doc);                     // load the matching document
            System.out.println(doc.get("fullpath"));                 // print its stored path
        }
        is.close();
    }
}
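Note that "Redistri*" is a prefix query. Lucene 3.x's QueryParser lowercases expanded terms by default, so it effectively searches for redistri*, which matches the lowercased terms StandardAnalyzer wrote to the index. Run the Indexer above first, or the search will find nothing.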
----------------------------------
Without Lucene: counting word occurrences in a text file directly with plain I/O streams.
package com.hao;

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.util.Iterator;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UserTreeMap {

    public static void main(String[] args) throws Exception {
        // test();
        Map map = getMapFromFile("D:\\Workspaces\\suanfa\\sohu3\\src\\english.txt");
        Iterator it = map.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry entry = (Map.Entry) it.next();
            System.out.println(entry.getKey() + "--" + entry.getValue());
        }
    }

    public static Map getMapFromFile(String filepath) throws Exception {
        BufferedReader buf = new BufferedReader(new FileReader(filepath));
        StringBuffer sbuf = new StringBuffer();            // buffer for the whole file
        String line = null;
        while ((line = buf.readLine()) != null) {
            sbuf.append(line).append(' ');                 // separator so words at line breaks don't merge
        }
        buf.close();                                       // done reading

        Pattern expression = Pattern.compile("[a-zA-Z]+"); // regex matching one word
        String string1 = sbuf.toString();                  // optionally .toLowerCase()
        Matcher matcher = expression.matcher(string1);
        TreeMap myTreeMap = new TreeMap();                 // word -> occurrence count, sorted by word
        int n = 0;                                         // total number of words
        while (matcher.find()) {                           // for every word in the text
            Object word = matcher.group();                 // the word is the map key
            n++;
            if (myTreeMap.containsKey(word)) {             // seen before: increment its count
                Integer count = (Integer) myTreeMap.get(word);
                myTreeMap.put(word, new Integer(count.intValue() + 1));
            } else {                                       // first occurrence
                myTreeMap.put(word, new Integer(1));
            }
        }
        return myTreeMap;
    }

    public static void test() throws Exception {
        BufferedReader buf = new BufferedReader(new FileReader("D:\\sohu3\\english.txt"));
        System.out.println("Reading english.txt");
        StringBuffer sbuf = new StringBuffer();
        String line = null;
        while ((line = buf.readLine()) != null) {
            sbuf.append(line).append(' ');
        }
        buf.close();

        Pattern expression = Pattern.compile("[a-zA-Z]+");
        String string1 = sbuf.toString().toLowerCase();    // normalize to lower case
        Matcher matcher = expression.matcher(string1);
        TreeMap myTreeMap = new TreeMap();
        int n = 0;
        while (matcher.find()) {
            Object word = matcher.group();
            n++;
            if (myTreeMap.containsKey(word)) {
                Integer count = (Integer) myTreeMap.get(word);
                myTreeMap.put(word, new Integer(count.intValue() + 1));
            } else {
                myTreeMap.put(word, new Integer(1));
            }
        }

        System.out.println("Statistics:");
        System.out.println("  total words in the text: " + n);
        System.out.println("Details are written to result.txt in the current directory");
        BufferedWriter bufw = new BufferedWriter(new FileWriter("result.txt"));
        Iterator iter = myTreeMap.keySet().iterator();     // iterate keys in sorted order
        while (iter.hasNext()) {
            Object key = iter.next();
            bufw.write((String) key + ":" + myTreeMap.get(key)); // one "word:count" per line
            bufw.newLine();
        }
        bufw.write("total words in english.txt: " + n);
        bufw.newLine();
        bufw.write("distinct words in english.txt: " + myTreeMap.size());
        bufw.close();
    }
}
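A note on the TreeMap choice: it keeps words sorted alphabetically, which is what makes the sorted result.txt output free; a HashMap would do the same counting with less overhead if order didn't matter. Also notice that getMapFromFile leaves .toLowerCase() commented out, so it counts "Word" and "word" separately, unlike test().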