Parsing PDFs with PDFBox

小小流浪猪

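1. Working with PDF documents using PDFBox

PDF stands for Portable Document Format, an electronic file format developed by Adobe. The format is independent of the operating system, so the same file can be used on Windows, Unix, Mac OS, and other platforms.

A PDF file packages text, fonts, formatting, colors, and device- and resolution-independent graphics and images into a single file. To extract the text it contains, the file has to be parsed according to that format. Fortunately, a number of tools already do this work for us, and PDFBox is one of them.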
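To make "parsed according to that format" a little more concrete: a PDF file starts with a header of the form %PDF-1.x. The following is a minimal sketch that reads the version from that header using only the Java standard library. The class and method names (PdfHeader, version) are illustrative and not part of PDFBox, and a real parser does far more than this.

```java
import java.nio.charset.StandardCharsets;

/** Illustrative helper (not part of PDFBox): reads the version from a PDF header. */
final class PdfHeader {
    private static final String MAGIC = "%PDF-";

    /**
     * Returns the version that follows "%PDF-" (for example "1.4"),
     * or null if the bytes do not start with a PDF header.
     */
    static String version(byte[] head) {
        // ISO-8859-1 maps every byte to one char, so binary data cannot break decoding.
        String s = new String(head, StandardCharsets.ISO_8859_1);
        if (!s.startsWith(MAGIC)) {
            return null;
        }
        int end = MAGIC.length();
        while (end < s.length()
                && (Character.isDigit(s.charAt(end)) || s.charAt(end) == '.')) {
            end++;
        }
        return s.substring(MAGIC.length(), end);
    }

    public static void main(String[] args) {
        byte[] sample = "%PDF-1.4\n".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println("PDF version: " + version(sample));
    }
}
```

In practice you would pass the first few bytes read from the file; everything after the header (objects, cross-reference table, trailer) is what a library such as PDFBox handles for you.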

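PDFBox is a PDF API library implemented in Java. It supports a range of operations on PDF documents, such as creating and manipulating them and extracting their content, and it also ships with some command-line utilities.

Its main features are:

PDF text extraction
Merging PDF documents
Encrypting and decrypting PDF documents
Integration with the Lucene search engine
Filling in form data
Creating a PDF from a text file
Creating images from PDF pages
Printing PDF documents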

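2. Downloading PDFBox

PDFBox is one of the most commonly used PDF text extraction tools. It can be downloaded from http://sourceforge.net/projects/pdfbox/, where the latest version is available. This text uses PDFBox-0.7.3. PDFBox is an open-source Java PDF library that gives you access to the various pieces of information in a PDF file.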

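3. Configuring Eclipse

The following steps create a project in Eclipse and import the PDF libraries:

(1) Create an ordinary Java project named pdfprj in the Eclipse workspace.

(2) Unzip the downloaded PDFBox-0.7.3.zip.

(3) Open the external directory, which contains all the external packages PDFBox depends on. Copy the following JAR files into the lib directory of the pdfprj project (create the lib directory first if it does not exist yet).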

bcmail-jdk14-132.jar
bcprov-jdk14-132.jar
checkstyle-all-4.2.jar
FontBox-0.1.0-dev.jar
lucene-core-2.0.0.jar

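Then copy PDFBox-0.7.3.jar from PDFBox's lib directory into the project's lib directory.

(4) Right-click the project and choose "Build Path -> Configure Build Path -> Add JARs" from the context menu, then add all the JARs in the project's lib directory to the project's Build Path.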
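Once the Build Path is set up, it can be worth confirming that the library classes are actually visible before running anything else. The sketch below uses only the standard library. The class name ClasspathCheck is illustrative, and the two PDFBox class names are assumptions based on the org.pdfbox package layout of the 0.7.x releases; later releases use different package names.

```java
/** Illustrative helper: reports whether a class can be found on the classpath. */
final class ClasspathCheck {

    /** Returns true if the named class can be loaded, without initializing it. */
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        } catch (LinkageError e) {
            // The class was found but one of its own dependencies is missing.
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed class names for PDFBox 0.7.x; adjust them to the version you installed.
        String[] names = {
            "org.pdfbox.pdmodel.PDDocument",
            "org.pdfbox.util.PDFTextStripper"
        };
        for (String name : names) {
            System.out.println(name + " -> " + (isOnClasspath(name) ? "found" : "MISSING"));
        }
    }
}
```

If a name is reported as MISSING, the corresponding JAR was most likely not added to the Build Path in step (4).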

4. Parsing PDF content with PDFBox

Extracting the text content of a PDF

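import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStreamWriter;
import java.util.List;

// Package names below assume PDFBox 0.7.3 (org.pdfbox.*).
import org.pdfbox.exceptions.CryptographyException;
import org.pdfbox.exceptions.InvalidPasswordException;
import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.util.PDFTextStripper;

public class PDFBOX {
	private PDDocument document = null;

	public static void main(String[] args) throws IOException {
		String file = "d:\\pdf\\pdf-type.pdf";
		PDFBOX parse = new PDFBOX();
		parse.openPDFFile(file);
	}

	public void openPDFFile(String file) throws IOException {
		InputStream is = new FileInputStream(new File(file));
		try {
			this.document = this.parseDocument(is);
			// Get the page count
			List pages = this.document.getDocumentCatalog().getAllPages();
			int pageSize = pages.size();
			System.out.println("PDF page count: " + pageSize);
			this.getPdfText();
		} finally {
			is.close();
		}
	}

	public PDDocument parseDocument(InputStream input) throws IOException {
		PDDocument document = PDDocument.load(input);
		if (document.isEncrypted()) {
			try {
				// Try the empty password; this only works for documents without a user password
				document.decrypt("");
			} catch (CryptographyException e) {
				e.printStackTrace();
			} catch (InvalidPasswordException e) {
				e.printStackTrace();
			}
		}
		return document;
	}

	/*
	 * Extract the text content of the PDF and write it to d:\pdf-type.txt
	 */
	public void getPdfText() throws IOException {
		PDFTextStripper stripper = new PDFTextStripper();
		// No charset is given, so the platform default encoding is used
		OutputStreamWriter osw = new OutputStreamWriter(new FileOutputStream(
				"d:\\pdf-type.txt"));
		BufferedWriter bw = new BufferedWriter(osw);
		try {
			stripper.setShouldSeparateByBeads(true);
			stripper.writeText(document, bw);
		} finally {
			// Close the writer and the document even if extraction fails
			bw.close();
			document.close();
		}
	}
}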
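One detail of the extraction code above: the OutputStreamWriter is constructed without a charset, so the output file uses the platform default encoding. The following is a minimal sketch, using only the standard library, of opening the writer with an explicit UTF-8 encoding instead. The names Utf8Output, openUtf8Writer and readUtf8 are illustrative, and the sketch assumes Java 7 or later (for StandardCharsets and java.nio.file.Files).

```java
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

/** Illustrative helper: writes and reads text files with an explicit UTF-8 encoding. */
final class Utf8Output {

    /** Opens a buffered writer that always encodes as UTF-8, whatever the platform default is. */
    static Writer openUtf8Writer(File target) throws IOException {
        return new BufferedWriter(
                new OutputStreamWriter(new FileOutputStream(target), StandardCharsets.UTF_8));
    }

    /** Reads the whole file back as UTF-8 text. */
    static String readUtf8(File source) throws IOException {
        return new String(Files.readAllBytes(source.toPath()), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        File out = File.createTempFile("pdf-text", ".txt");
        Writer writer = openUtf8Writer(out);
        try {
            // "\u6587\u672c" is the Chinese word for "text"
            writer.write("PDF \u6587\u672c");
        } finally {
            writer.close();
        }
        System.out.println(readUtf8(out));
    }
}
```

A writer opened this way could be handed to the text stripper in place of the default-encoded one, which helps when the extracted text contains Chinese or other non-ASCII characters.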

Extracting PDF document information:

  
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Package names below assume PDFBox 0.7.3 (org.pdfbox.*).
import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.pdmodel.PDDocumentCatalog;
import org.pdfbox.pdmodel.PDDocumentInformation;
import org.pdfbox.pdmodel.PDPage;
import org.pdfbox.pdmodel.PDResources;
import org.pdfbox.pdmodel.graphics.xobject.PDXObjectImage;

// The class name PdfInfoParser is illustrative; the methods can live in any class.
public class PdfInfoParser {

    public static final String DATE_FORMAT = "yyyy-MM-dd HH:mm:ss";

    /**
     * Prints the document information of a PDF and saves its embedded images.
     * @param pdfPath       path of the PDF document
     * @param imgSavePath   path prefix for the extracted images; a running number is appended
     * @throws Exception
     */
    public static void pdfParse( String pdfPath, String imgSavePath ) throws Exception
    {
        InputStream input = null;
        File pdfFile = new File( pdfPath );
        PDDocument document = null;
        try{
            input = new FileInputStream( pdfFile );
            // Load the PDF document
            document = PDDocument.load( input );

            /** Document properties **/
            PDDocumentInformation info = document.getDocumentInformation();
            System.out.println( "Title: " + info.getTitle() );
            System.out.println( "Subject: " + info.getSubject() );
            System.out.println( "Author: " + info.getAuthor() );
            System.out.println( "Keywords: " + info.getKeywords() );

            System.out.println( "Creator application: " + info.getCreator() );
            System.out.println( "PDF producer: " + info.getProducer() );

            System.out.println( "Trapped: " + info.getTrapped() );

            System.out.println( "Created: " + dateFormat( info.getCreationDate() ));
            System.out.println( "Modified: " + dateFormat( info.getModificationDate()));

            /** Page information **/
            PDDocumentCatalog cata = document.getDocumentCatalog();
            List pages = cata.getAllPages();
            int count = 1;
            for( int i = 0; i < pages.size(); i++ )
            {
                PDPage page = ( PDPage ) pages.get( i );
                if( null != page )
                {
                    PDResources res = page.findResources();

                    // Get the images used on this page
                    Map imgs = res.getImages();
                    if( null != imgs )
                    {
                        Set keySet = imgs.keySet();
                        Iterator it = keySet.iterator();
                        while( it.hasNext() )
                        {
                            Object obj =  it.next();
                            PDXObjectImage img = ( PDXObjectImage ) imgs.get( obj );
                            // Save each image under the prefix plus a running number
                            img.write2file( imgSavePath + count );
                            count++;
                        }
                    }
                }
            }
        }finally{
            if( null != input )
                input.close();
            if( null != document )
                document.close();
        }
    }

    /**
     * Formats a date using DATE_FORMAT.
     * @param calendar   the date to format; may be null
     * @return the formatted date, or null if calendar is null
     */
    public static String dateFormat( Calendar calendar )
    {
        if( null == calendar )
            return null;
        SimpleDateFormat format = new SimpleDateFormat( DATE_FORMAT );
        return format.format( calendar.getTime() );
    }
}
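The date helper above can be tried on its own, without a PDF. The sketch below is a standalone variant under two assumptions that are not in the original code: it pins the formatter to Locale.ROOT so that the output does not depend on the default locale's calendar or digits, and it uses the illustrative class name DateFormatDemo. The calendar's local time zone is the JVM default on both sides, so the wall-clock fields are printed unchanged.

```java
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.Locale;

/** Illustrative standalone variant of the dateFormat helper. */
final class DateFormatDemo {
    static final String DATE_FORMAT = "yyyy-MM-dd HH:mm:ss";

    /** Formats the calendar with DATE_FORMAT, or returns null for a null calendar. */
    static String dateFormat(Calendar calendar) {
        if (calendar == null) {
            return null;
        }
        // Locale.ROOT is an assumption of this sketch: it keeps the year Gregorian
        // and the digits ASCII regardless of the machine's default locale.
        SimpleDateFormat format = new SimpleDateFormat(DATE_FORMAT, Locale.ROOT);
        return format.format(calendar.getTime());
    }

    public static void main(String[] args) {
        // Months are zero-based in Calendar, so use the named constant.
        Calendar created = new GregorianCalendar(2010, Calendar.SEPTEMBER, 26, 19, 6, 0);
        System.out.println("Created: " + dateFormat(created));
        System.out.println("Missing: " + dateFormat(null));
    }
}
```

Returning null for a missing date mirrors the helper above; a PDF is not required to carry creation or modification dates, so callers should expect that case.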

2010-09-26 19:06
#1 wqy110 2012-08-03

PDFBOX: I get an error saying this class cannot be found.
