How to parse Word (.doc) documents for Lucene

笑我痴狂


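Add poi-scratchpad-3.0.2-FINAL-20080204.jar to the lib directory. WordExtractor lives in this scratchpad jar, which also needs the core poi jar of the same version on the classpath.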

package com.cs;

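public interface Parsable {

	/** Title of the document (the parsers below use the file name). */
	String getTitle();

	/** Full extracted text, or null if the file could not be read. */
	String getContent();

	/** Short summary (the parsers below use the first 200 characters). */
	String getSummary();
}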
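
To try out the Parsable contract without any POI dependency, here is a minimal sketch of an in-memory implementation. The class names StringParsable and ParsableDemo are hypothetical and not part of the original post; the interface is repeated only so the sketch compiles on its own.

```java
// Copy of the Parsable interface from above so this sketch compiles on its own.
interface Parsable {
    String getTitle();
    String getContent();
    String getSummary();
}

// Hypothetical in-memory implementation, for illustration only.
class StringParsable implements Parsable {
    // Same summary length as the parsers in this post.
    private static final int SUMMARY_LENGTH = 200;

    private final String title;
    private final String content;

    StringParsable(String title, String content) {
        this.title = title;
        this.content = content;
    }

    @Override
    public String getTitle() {
        return title;
    }

    @Override
    public String getContent() {
        return content;
    }

    @Override
    public String getSummary() {
        // Return an empty summary instead of failing when there is no content.
        if (content == null) {
            return "";
        }
        return content.length() > SUMMARY_LENGTH
                ? content.substring(0, SUMMARY_LENGTH)
                : content;
    }
}

class ParsableDemo {
    public static void main(String[] args) {
        Parsable doc = new StringParsable("note.txt", "Lucene indexes text, not binary files.");
        System.out.println(doc.getTitle() + ": " + doc.getSummary());
    }
}
```

Code that builds index entries can then depend on Parsable alone, and the DocParser and TextParser classes below plug in wherever a StringParsable would.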

package com.cs;

import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;

import org.apache.poi.hwpf.extractor.WordExtractor;

public class DocParser implements Parsable {

	private File file;

	private String content;

	private WordExtractor wordExtractor;

	public DocParser(File file) {
		this.file = file;
	}

	public String getContent() {
		// Return the cached text if the file has already been parsed.
		if (content != null) {
			return content;
		}

		// try-with-resources closes the stream even if extraction fails.
		try (InputStream is = new FileInputStream(file)) {
			wordExtractor = new WordExtractor(is);
			content = wordExtractor.getText();
			return content;
		} catch (FileNotFoundException e) {
			e.printStackTrace();
		} catch (IOException e) {
			e.printStackTrace();
		}
		// Parsing failed; callers must be prepared for null.
		return null;
	}

	/**
	 * The summary is the first 200 characters of the content.
	 */
	public String getSummary() {
		String text = getContent();
		// getContent() returns null when parsing fails; avoid a NullPointerException.
		if (text == null) {
			return "";
		}
		return text.length() > 200 ? text.substring(0, 200) : text;
	}

	public String getTitle() {

		return file.getName();
	}
	public static void main(String[] args) {
		DocParser docParser = new DocParser(new File("E:\\EclipseStudyWorkspace\\LuceneParse\\fileSource\\XPDF使用文档.doc")) ;
		System.out.println("doc content : "+docParser.getContent()) ;
	}
}
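
getSummary() above cuts the text with substring(0, 200), which counts UTF-16 chars. If the cut falls between the two halves of a surrogate pair, the summary ends in a broken character. The following is a minimal sketch of a truncation helper that counts code points instead. SummaryUtil and summarize are hypothetical names, not part of the original post, and the limit here means 200 code points rather than 200 chars.

```java
// Hypothetical helper, for illustration only.
class SummaryUtil {

    // Returns at most maxCodePoints code points from the start of text,
    // never splitting a surrogate pair. A null input yields an empty string.
    static String summarize(String text, int maxCodePoints) {
        if (text == null) {
            return "";
        }
        if (text.codePointCount(0, text.length()) <= maxCodePoints) {
            return text;
        }
        // Index of the char that follows the first maxCodePoints code points.
        int end = text.offsetByCodePoints(0, maxCodePoints);
        return text.substring(0, end);
    }

    public static void main(String[] args) {
        // "\uD83D\uDE00" is one code point stored as two chars.
        String sample = "a\uD83D\uDE00b";
        System.out.println(summarize(sample, 2).length());
    }
}
```

For text that stays within the Basic Multilingual Plane, which includes most Chinese text, this behaves the same as the substring version.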

Parsing txt files

package com.cs;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStreamReader;

public class TextParser implements Parsable {
	
	private File file ;
	
	private String content  ;
	
	public TextParser(File file) {
		super();
		this.file = file;
	}
	
	public String getContent() {
		// Return the cached text if the file has already been read.
		if (content != null) {
			return content;
		}
		// Note: InputStreamReader without a charset uses the platform default encoding.
		// try-with-resources closes the reader even if reading fails.
		try (BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(file)))) {
			StringBuilder sb = new StringBuilder();
			String line;
			while ((line = br.readLine()) != null) {
				sb.append(line).append("\n");
			}
			content = sb.toString();
			return content;
		} catch (FileNotFoundException e) {
			e.printStackTrace();
		} catch (IOException e) {
			e.printStackTrace();
		}
		// Reading failed; callers must be prepared for null.
		return null;
	}

	public String getSummary() {
		String text = getContent();
		// getContent() returns null when reading fails; avoid a NullPointerException.
		if (text == null) {
			return "";
		}
		// The summary is the first 200 characters of the content.
		return text.length() > 200 ? text.substring(0, 200) : text;
	}

	public String getTitle() {
		
		return file.getName();
	}
	public static void main(String[] args) {
		TextParser textParser = new TextParser(new File("E:\\EclipseStudyWorkspace\\LuceneParse\\fileSource\\文档.txt")) ;
		System.out.println("text content : "+textParser.getContent()) ;
	}
	
}
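
TextParser builds its InputStreamReader without a charset, so the result depends on the platform default encoding, and a file saved in a different encoding comes out garbled. As a minimal sketch, assuming the caller knows the file's encoding, the text can be read with an explicit charset through java.nio.file.Files. CharsetTextReader and readAll are hypothetical names, not part of the original post.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, for illustration only.
class CharsetTextReader {

    // Reads the whole file and decodes it with the given charset.
    // Unlike TextParser, line terminators are kept exactly as stored in the file.
    static String readAll(Path path, Charset charset) throws IOException {
        byte[] bytes = Files.readAllBytes(path);
        return new String(bytes, charset);
    }

    public static void main(String[] args) throws IOException {
        // Write a small UTF-8 sample to a temporary file, then read it back.
        Path tmp = Files.createTempFile("sample", ".txt");
        try {
            Files.write(tmp, "第一行\nsecond line\n".getBytes(StandardCharsets.UTF_8));
            System.out.print(readAll(tmp, StandardCharsets.UTF_8));
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

Which charset to pass is an assumption the caller has to supply; for files written on a Chinese-locale Windows machine, Charset.forName("GBK") may be the right choice instead of UTF-8.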

2010-10-10 15:11