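I have been studying Lucene lately, so here is my getting-started program to share with everyone:
The class that operates on the index:
Java code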
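- // Imports assume Lucene 2.4.x plus its contrib highlighter.
- // Constants and QueryResult are the author's own classes and are not shown in this post.
- import java.io.File;
- import java.io.IOException;
- import java.util.ArrayList;
- import java.util.Date;
- import java.util.HashMap;
- import java.util.List;
- import java.util.Map;
- import org.apache.lucene.document.Document;
- import org.apache.lucene.document.NumberTools;
- import org.apache.lucene.index.CorruptIndexException;
- import org.apache.lucene.index.IndexReader;
- import org.apache.lucene.index.IndexWriter;
- import org.apache.lucene.index.IndexWriter.MaxFieldLength;
- import org.apache.lucene.index.Term;
- import org.apache.lucene.index.TermDocs;
- import org.apache.lucene.queryParser.MultiFieldQueryParser;
- import org.apache.lucene.queryParser.QueryParser;
- import org.apache.lucene.search.Filter;
- import org.apache.lucene.search.IndexSearcher;
- import org.apache.lucene.search.Query;
- import org.apache.lucene.search.RangeFilter;
- import org.apache.lucene.search.ScoreDoc;
- import org.apache.lucene.search.Sort;
- import org.apache.lucene.search.SortField;
- import org.apache.lucene.search.TopDocs;
- import org.apache.lucene.search.highlight.Formatter;
- import org.apache.lucene.search.highlight.Fragmenter;
- import org.apache.lucene.search.highlight.Highlighter;
- import org.apache.lucene.search.highlight.QueryScorer;
- import org.apache.lucene.search.highlight.Scorer;
- import org.apache.lucene.search.highlight.SimpleFragmenter;
- import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
- import org.apache.lucene.store.Directory;
- public class IndexDao {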
- public IndexDao() {
- try {
- indexWriter = new IndexWriter(Constants.INDEX_STORE_PATH,
- Constants.analyzer, MaxFieldLength.LIMITED);
- } catch (Exception e) {
- e.printStackTrace();
- }
- }
- public IndexDao(Directory dir) {
- try {
- indexWriter = new IndexWriter(dir,Constants.analyzer,MaxFieldLength.LIMITED);
- } catch (Exception e) {
- e.printStackTrace();
- }
- }
- public IndexDao(boolean isCreate) {
- try {
- indexWriter = new IndexWriter(Constants.INDEX_STORE_PATH,Constants.analyzer, isCreate,MaxFieldLength.LIMITED);
- } catch (Exception e) {
- e.printStackTrace();
- }
- }
- // The index writer shared by all write operations
- private IndexWriter indexWriter = null;
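- /**
- * Adds files to the index, creating it if necessary. Walks the folder recursively.
- *
- * @param folder the directory to scan
- * @param unIndeies the file types to index, compared case-insensitively (despite the name)
- * @throws IOException
- * @throws CorruptIndexException
- */
- public void saveIndex(File folder, String[] unIndeies)
- throws CorruptIndexException, IOException {
- if (folder.isDirectory()) {
- String[] files = folder.list();
- if (files == null) {
- return; // the directory could not be read
- }
- for (int i = 0; i < files.length; i++) {
- File f = new File(folder, files[i]);
- if (f.isHidden()) {
- continue;
- }
- if (f.isDirectory()) {
- saveIndex(f, unIndeies); // recurse into subdirectories
- continue; // a directory itself is never indexed as a file
- }
- String fileType = ReadFile.validateFile(f);
- for (int j = 0; j < unIndeies.length; j++) {
- if (fileType.equalsIgnoreCase(unIndeies[j])) {
- System.out.println("Indexing: " + f.getName());
- Document doc = ReadFile.indexFile(f);
- indexWriter.addDocument(doc);
- break; // index each file at most once
- }
- }
- }
- }
- }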
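- /**
- * Deletes the documents that match a term. A Term is the smallest unit of search and
- * represents one keyword in a Field, e.g. <title, lucene>: new Term("title", "lucene");
- * new Term("id", "5"); new Term("id", UUID);
- * Note that the writer is closed afterwards, so this IndexDao instance cannot be reused.
- *
- * @param term the term identifying the documents to delete
- */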
- public void deleteIndex(Term term) {
- try {
- indexWriter.deleteDocuments(term);
- } catch (Exception e) {
- throw new RuntimeException(e);
- } finally {
- try {
- indexWriter.close();
- } catch (Exception e) {
- e.printStackTrace();
- }
- }
- }
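- /**
- * Updates the index. Equivalent to indexWriter.deleteDocuments(term) followed by
- * indexWriter.addDocument(doc). The writer is closed afterwards.
- *
- * @param term the term identifying the documents to replace
- * @param doc the new document
- */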
- public void updateIndex(Term term, Document doc) {
- try {
- indexWriter.updateDocument(term, doc);
- } catch (Exception e) {
- throw new RuntimeException(e);
- } finally {
- try {
- indexWriter.close();
- } catch (Exception e) {
- e.printStackTrace();
- }
- }
- }
- /**
- * Searches the index. Callers can derive the page count from the result as:
- * totalPage = recordCount / pageSize; if (recordCount % pageSize > 0) totalPage++;
- *
- * @param queryString the query text entered by the user
- * @param firstResult index of the first hit to return
- * @param maxResults maximum number of hits to return
- * @return the total hit count together with the documents of the current page
- */
- public QueryResult search(String queryString, int firstResult,
- int maxResults) {
- try {
- // 1. Parse the search text into a Query
- String[] fields = { "name", "content" };
- Map<String, Float> boosts = new HashMap<String, Float>();
- boosts.put("name", 2f);
- boosts.put("content", 3f); // the default boost is 1.0f
- QueryParser queryParser = new MultiFieldQueryParser(fields,
- Constants.analyzer, boosts);
- Query query = queryParser.parse(queryString);
- // Query query = IKQueryParser.parse("content", queryString);
- Date start = new Date();
- QueryResult result = search(query, firstResult, maxResults);
- Date end = new Date();
- System.out.println("Search finished in " + (end.getTime() - start.getTime())
- + " ms");
- return result;
- } catch (Exception e) {
- throw new RuntimeException(e);
- }
- }
- public QueryResult search(Query query, int firstResult, int maxResults) {
- IndexSearcher indexSearcher = null;
- try {
- // 2. Run the query
- indexSearcher = new IndexSearcher(Constants.INDEX_STORE_PATH);
- Filter filter = new RangeFilter("size",
- NumberTools.longToString(0), NumberTools
- .longToString(1000000), true, true);
- // Sorting
- Sort sort = new Sort();
- sort.setSort(new SortField("size")); // ascending by default
- // sort.setSort(new SortField("size", true));
- TopDocs topDocs = indexSearcher.search(query, filter, 10000, sort);
- int recordCount = topDocs.totalHits;
- List<Document> recordList = new ArrayList<Document>();
- // Prepare the highlighter
- Formatter formatter = new SimpleHTMLFormatter("<font color='red'>",
- "</font>");
- Scorer scorer = new QueryScorer(query);
- Highlighter highlighter = new Highlighter(formatter, scorer);
- Fragmenter fragmenter = new SimpleFragmenter(50);
- highlighter.setTextFragmenter(fragmenter);
- // 3. Collect the documents of the current page.
- // scoreDocs holds at most 10000 entries even when totalHits is larger,
- // so the page end is clamped to its length.
- int end = Math.min(firstResult + maxResults, topDocs.scoreDocs.length);
- for (int i = firstResult; i < end; i++) {
- ScoreDoc scoreDoc = topDocs.scoreDocs[i];
- int docSn = scoreDoc.doc; // internal document number
- Document doc = indexSearcher.doc(docSn); // fetch the document by its number
- // Highlight. getBestFragment returns null when the keyword does not occur in the field value
- String hc = highlighter.getBestFragment(Constants.analyzer,
- "content", doc.get("content"));
- if (hc == null) {
- String content = doc.get("content");
- int endIndex = Math.min(50, content.length());
- hc = content.substring(0, endIndex); // at most the first 50 characters
- }
- doc.getField("content").setValue(hc);
- recordList.add(doc);
- }
- // Return the result
- return new QueryResult(recordCount, recordList);
- } catch (Exception e) {
- throw new RuntimeException(e);
- } finally {
- // The searcher is still null if its constructor threw
- if (indexSearcher != null) {
- try {
- indexSearcher.close();
- } catch (IOException e) {
- e.printStackTrace();
- }
- }
- }
- }
- public void close() {
- // Optimize the index before closing the writer
- try {
- indexWriter.optimize();
- indexWriter.close();
- } catch (CorruptIndexException e) {
- e.printStackTrace();
- } catch (IOException e) {
- e.printStackTrace();
- }
- }
- public void readIndex(String key, String value) {
- IndexReader reader = null;
- IndexSearcher indexSearcher = null;
- try {
- // Directory fsDir = FSDirectory.getDirectory(
- // Constants.INDEX_STORE_PATH, false);
- // if (IndexReader.isLocked(fsDir)) {
- // System.out.println("------unlock-----");
- // IndexReader.unlock(fsDir);
- // }
- reader = IndexReader.open(Constants.INDEX_STORE_PATH);
- // The loop header is commented out together with its body; otherwise the
- // next statement would become the loop body and run once per document.
- // for (int i = 0; i < reader.numDocs(); i++)
- // System.out.println(reader.document(i));
- System.out.println("Version: " + reader.getVersion());
- System.out.println("Number of documents in the index: " + reader.numDocs());
- Term term = new Term(key, value);
- TermDocs docs = reader.termDocs(term);
- indexSearcher = new IndexSearcher(Constants.INDEX_STORE_PATH);
- while (docs.next()) {
- int docSn = docs.doc(); // internal document number
- Document doc = indexSearcher.doc(docSn); // fetch the document by its number
- System.out.println("Document path: " + doc.get("path"));
- System.out.println("Number of the Document containing " + term + ": " + docs.doc());
- System.out.println("The term occurs " + docs.freq() + " time(s) in the document");
- }
- } catch (CorruptIndexException e) {
- e.printStackTrace();
- } catch (IOException e) {
- e.printStackTrace();
- } finally {
- try {
- if (indexSearcher != null) indexSearcher.close();
- if (reader != null) reader.close();
- } catch (IOException e) {
- e.printStackTrace();
- }
- }
- }
- }
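The Javadoc on `search` gives the rule for the page count, and the method body clamps the page window to the hits that are actually available. Here is a minimal standalone sketch of that arithmetic using only the JDK. The class `PagingSketch` and its method names are hypothetical and not part of the code above.

```java
// Sketch only: PagingSketch is a hypothetical helper, not part of IndexDao.
final class PagingSketch {

    /** Page count for recordCount hits, using the rule from the search() Javadoc. */
    static int totalPages(int recordCount, int pageSize) {
        if (pageSize <= 0) {
            throw new IllegalArgumentException("pageSize must be positive");
        }
        int totalPage = recordCount / pageSize;
        if (recordCount % pageSize > 0) {
            totalPage++; // a partially filled last page still counts
        }
        return totalPage;
    }

    /** Exclusive end index of the current page, clamped to the hits available. */
    static int pageEnd(int firstResult, int maxResults, int availableHits) {
        return Math.min(firstResult + maxResults, availableHits);
    }

    public static void main(String[] args) {
        System.out.println(totalPages(25, 10));  // 3 pages for 25 hits
        System.out.println(pageEnd(20, 10, 25)); // last page stops at index 25
    }
}
```

With 25 hits and a page size of 10, the third page starts at `firstResult = 20` and ends at index 25 rather than 30, which is the same clamp the loop in `search` relies on.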
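`saveIndex` walks a folder recursively, skips hidden entries, and keeps only the files whose type is on a list. The traversal itself can be tried without Lucene. The sketch below uses only the JDK and returns the matching files instead of indexing them; `FileScanSketch` and its methods are hypothetical names chosen for illustration.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

// Sketch only: FileScanSketch is a hypothetical helper, not part of IndexDao.
final class FileScanSketch {

    /** Recursively collects the visible files under folder whose extension is wanted. */
    static List<File> collect(File folder, Collection<String> extensions) {
        List<File> result = new ArrayList<>();
        File[] children = folder.listFiles(); // null if not a directory or unreadable
        if (children == null) {
            return result;
        }
        for (File f : children) {
            if (f.isHidden()) {
                continue;
            }
            if (f.isDirectory()) {
                result.addAll(collect(f, extensions)); // recurse, never treat a directory as a file
            } else if (hasExtension(f.getName(), extensions)) {
                result.add(f);
            }
        }
        return result;
    }

    /** Case-insensitive check of the text after the last dot. */
    static boolean hasExtension(String fileName, Collection<String> extensions) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0) {
            return false;
        }
        String ext = fileName.substring(dot + 1);
        for (String wanted : extensions) {
            if (ext.equalsIgnoreCase(wanted)) {
                return true;
            }
        }
        return false;
    }
}
```

Separating "find the files" from "index the files" also makes the walk easy to test on a temporary directory before pointing it at a real document folder.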
The file-reading utility class:
Java code
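- // Imports assume Lucene 2.4.x, Apache POI (HWPF) and the old org.pdfbox PDFBox packages.
- // Constants and ExcelReader are the author's own classes.
- import java.io.BufferedReader;
- import java.io.File;
- import java.io.FileInputStream;
- import java.io.IOException;
- import java.io.InputStreamReader;
- import java.io.StringWriter;
- import java.text.DecimalFormat;
- import org.apache.lucene.document.Document;
- import org.apache.lucene.document.Field;
- import org.apache.lucene.document.Field.Index;
- import org.apache.lucene.document.Field.Store;
- import org.apache.lucene.document.NumberTools;
- import org.apache.poi.hwpf.HWPFDocument;
- import org.apache.poi.hwpf.usermodel.Paragraph;
- import org.apache.poi.hwpf.usermodel.Range;
- import org.pdfbox.pdmodel.PDDocument;
- import org.pdfbox.util.PDFTextStripper;
- public class ReadFile {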
- public static String readWord(File f) {
- StringBuffer content = new StringBuffer(""); // document text
- FileInputStream fis = null;
- try {
- fis = new FileInputStream(f);
- HWPFDocument doc = new HWPFDocument(fis);
- Range range = doc.getRange();
- int paragraphCount = range.numParagraphs(); // number of paragraphs
- for (int i = 0; i < paragraphCount; i++) { // read every paragraph
- Paragraph pp = range.getParagraph(i);
- content.append(pp.text());
- }
- // System.out.println("-------word--------"+content.toString());
- } catch (Exception e) {
- System.out.println("Indexing failed: " + f.getAbsolutePath());
- e.printStackTrace();
- } finally {
- // Close the stream, which the original version left open
- if (fis != null) {
- try {
- fis.close();
- } catch (IOException e) {
- e.printStackTrace();
- }
- }
- }
- return content.toString().trim();
- }
- public static String readPdf(File f) {
- StringBuffer content = new StringBuffer(""); // document text
- PDDocument pdfDocument = null;
- FileInputStream fis = null;
- try {
- // Large files (over 10048576 bytes, about 9.6 MB) are indexed by name only
- if (f.length() > 10048576) {
- DecimalFormat df = new DecimalFormat("#.00");
- System.out.println("File size: " + df.format((double) f.length() / 1048576) + "M");
- return f.getName();
- }
- fis = new FileInputStream(f);
- PDFTextStripper stripper = new PDFTextStripper();
- pdfDocument = PDDocument.load(fis);
- // Encrypted files are indexed by name only
- if (pdfDocument.isEncrypted()) {
- return f.getName();
- }
- StringWriter writer = new StringWriter();
- stripper.writeText(pdfDocument, writer);
- content.append(writer.getBuffer().toString());
- } catch (IOException e) {
- System.out.println("Indexing failed: " + f.getAbsolutePath());
- System.err.println("IOException=" + e);
- //System.exit(1);
- } finally {
- if (pdfDocument != null) {
- // System.err.println("Closing document " + f + "...");
- org.pdfbox.cos.COSDocument cos = pdfDocument.getDocument();
- try {
- cos.close();
- // System.err.println("Closed " + cos);
- pdfDocument.close();
- } catch (IOException e) {
- System.out.println("Indexing failed: " + f.getAbsolutePath());
- e.printStackTrace();
- }
- }
- // Close the stream on every path, including the early return for encrypted files
- if (fis != null) {
- try {
- fis.close();
- } catch (IOException e) {
- e.printStackTrace();
- }
- }
- }
- // System.out.println("-------pdf--------"+content.toString());
- return content.toString().trim();
- }
- public static String readHtml(File f) {
- StringBuffer content = new StringBuffer("");
- FileInputStream fis = null;
- try {
- fis = new FileInputStream(f);
- // Read the page. Mind the character encoding here: it must match the one
- // declared in the HTML header, otherwise the text comes out garbled.
- BufferedReader reader = new BufferedReader(new InputStreamReader(fis, "gb2312"));
- String line = null;
- while ((line = reader.readLine()) != null) {
- content.append(line + "\n");
- }
- reader.close();
- } catch (Exception e) {
- System.out.println("Indexing failed: " + f.getAbsolutePath());
- e.printStackTrace();
- } finally {
- // Also release the stream when reading failed before reader.close()
- if (fis != null) {
- try {
- fis.close();
- } catch (IOException e) {
- e.printStackTrace();
- }
- }
- }
- String contentString = content.toString();
- // System.out.println("---------htm index----"+contentString);
- return contentString;
- }
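- public static String readTxt(File f) {
- StringBuffer content = new StringBuffer("");
- BufferedReader reader = null;
- try {
- // No charset is given, so the platform default encoding is used
- reader = new BufferedReader(new InputStreamReader(
- new FileInputStream(f)));
- for (String line = null; (line = reader.readLine()) != null;) {
- content.append(line).append("\n");
- }
- } catch (IOException e) {
- System.out.println("Indexing failed: " + f.getAbsolutePath());
- e.printStackTrace();
- } finally {
- // Close the reader, which the original version left open
- if (reader != null) {
- try {
- reader.close();
- } catch (IOException e) {
- e.printStackTrace();
- }
- }
- }
- return content.toString().trim();
- }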
- public static String readExcel(File f, String fileType) {
- StringBuffer content = new StringBuffer("");
- try {
- ExcelReader er = new ExcelReader(f, fileType);
- // Check for null before appending, so the literal text "null" never ends up in the index
- String line;
- while ((line = er.readLine()) != null) {
- content.append(line).append("\n");
- }
- er.close();
- } catch (Exception e) {
- System.out.println("Indexing failed: " + f.getAbsolutePath());
- e.printStackTrace();
- }
- return content.toString();
- }
- /**
- * Returns "dir" for names without a dot, the matching entry of Constants.EXTENSION
- * for known extensions (case-insensitive), and "otherType" for everything else.
- */
- public static String validateFile(File f) {
- String fileType = "otherType";
- String fileName = f.getName();
- if (fileName.lastIndexOf('.') == -1) {
- fileType = "dir";
- return fileType;
- }
- fileName = fileName.substring(fileName.lastIndexOf('.') + 1, fileName
- .length());
- String[] extension = Constants.EXTENSION;
- for (int i = 0; i < extension.length; i++) {
- if (fileName.equalsIgnoreCase(extension[i])) {
- fileType = extension[i];
- break;
- }
- }
- return fileType;
- }
- public static Document indexFile(File f) {
- Document doc = new Document();
- try {
- doc.add(new Field( "name" , f.getName(), Store.YES, Index.ANALYZED));
- doc.add(new Field( "size" , NumberTools.longToString(f.length()),
- Store.YES, Index.NOT_ANALYZED));
- doc.add(new Field( "path" , f.getAbsolutePath(), Store.YES,
- Index.NOT_ANALYZED));
- String fileType = validateFile(f);
- if (fileType.equals( "txt" )) {
- doc.add(new Field( "content" , ReadFile.readTxt(f), Store.YES,
- Index.ANALYZED));
- } else if (fileType.equals( "pdf" )) {
- doc.add(new Field( "content" , ReadFile.readPdf(f), Store.YES,
- Index.ANALYZED));
- } else if (fileType.equals( "doc" )) {
- doc.add(new Field( "content" , ReadFile.readWord(f), Store.YES,
- Index.ANALYZED));
- } else if (fileType.equals( "htm" )) {
- doc.add(new Field( "content" , ReadFile.readHtml(f), Store.YES,
- Index.ANALYZED));
- } else if (fileType.equals( "xls" )){
- doc.add(new Field( "content" , ReadFile.readExcel(f, fileType), Store.YES,
- Index.ANALYZED));
- }else {
- doc.add(new Field( "content" , f.getName(), Store.YES, Index.ANALYZED));
- }
- } catch (Exception e) {
- System.out.println("Indexing failed: " + f.getAbsolutePath());
- e.printStackTrace();
- }
- return doc;
- }
- }
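`validateFile` decides how a file is handled from its name alone: no dot means "dir", a known extension is returned as the type, and anything else is "otherType". The following standalone sketch reproduces that rule so it can be tried in isolation. `FileTypeSketch` is a hypothetical name, and `KNOWN_EXTENSIONS` is an assumed stand-in for `Constants.EXTENSION`, whose real contents are not shown in this post.

```java
// Sketch only: FileTypeSketch is hypothetical; KNOWN_EXTENSIONS is an assumption,
// since Constants.EXTENSION is not shown in the post.
final class FileTypeSketch {

    static final String[] KNOWN_EXTENSIONS = { "txt", "pdf", "doc", "htm", "xls" };

    /** Classifies a file name the same way validateFile does. */
    static String classify(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot == -1) {
            return "dir"; // same convention as validateFile: a name without a dot counts as "dir"
        }
        String ext = fileName.substring(dot + 1);
        for (String known : KNOWN_EXTENSIONS) {
            if (ext.equalsIgnoreCase(known)) {
                return known; // the canonical lower-case spelling from the list
            }
        }
        return "otherType";
    }

    public static void main(String[] args) {
        System.out.println(classify("report.PDF"));
        System.out.println(classify("notes"));
        System.out.println(classify("archive.tar.gz"));
    }
}
```

One consequence worth knowing: a regular file without an extension is also reported as "dir", and a file named `page.html` is "otherType" unless "html" is on the list, because `indexFile` only checks for "htm".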
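`readHtml` hard-codes "gb2312" and `readTxt` uses the platform default, so either can produce garbled text when the file is encoded differently. One possible variation, sketched here with only the JDK, takes the charset as a parameter and closes the reader with try-with-resources (Java 7 or later). `TextReadSketch` and `readAll` are hypothetical names, not part of `ReadFile`.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

// Sketch only: TextReadSketch is a hypothetical helper, not part of ReadFile.
final class TextReadSketch {

    /** Reads the whole file with the given charset; every line ends with '\n' in the result. */
    static String readAll(File f, Charset charset) throws IOException {
        StringBuilder content = new StringBuilder();
        // try-with-resources closes the reader (and the underlying stream) on every path
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(f), charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                content.append(line).append('\n');
            }
        }
        return content.toString();
    }
}
```

A caller would pass, for example, `Charset.forName("GB2312")` for pages that declare that encoding, assuming the running JDK ships that charset, or `StandardCharsets.UTF_8` for UTF-8 files.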
Java code
- import java.io.BufferedReader;
- import java.io.InputStream;
- import org.apache.poi.hssf.usermodel.HSSFWorkbook;
- public class ExcelReader {
- // Reader over the input file
- private BufferedReader reader = null;
- // File type
- private String filetype;
- // Binary input stream of the file
- private InputStream is = null;
- // Current sheet
- private int currSheet;
- // Current position
- private int currPosition;
- // Number of sheets
- private int numOfSheets;
- // HSSFWorkbook
- HSSFWorkbook workbook = null;
- // Cells are separated by a space
- private static String EXCEL_LINE_DELIMITER = " ";
- // Maximum number of columns
- // private static int MAX_EXCEL_COLUMNS = 64;
- // Constructor: creates an ExcelReader
-