deepfuture

Getting Started with Lucene: Parsing PDFs (A Detailed Walkthrough of Parsing Chinese PDFs with xpdf)


Download xpdf and xpdf-chinese-simplified.tar.gz, then extract xpdf-chinese-simplified.tar.gz into the directory where xpdf is installed so that it forms a subdirectory.

http://www.foolabs.com/xpdf/download.html

The following packages are available:

Configuration notes for the Chinese package (from its README):

Xpdf: Chinese Simplified support package
========================================

Xpdf project: http://www.foolabs.com/xpdf/
2004-jul-27

If this package includes CMap files, they contain their own copyright
notices and distribution conditions. All other files in the package
are Copyright 2002-2004 Glyph & Cog, LLC, and are licensed under the
GNU General Public License (GPL), version 2.

This package provides support files needed to use the Xpdf tools with
Chinese (Simplified) PDF files.

Contents:
- Adobe-GB1 character collection support
- ISO-2022-CN encoding
- EUC-CN encoding
- GBK encoding

Place all of these files in a directory, typically:

Unix - /usr/local/share/xpdf/chinese-simplified
Win32 - C:\Program Files\xpdf\chinese-simplified

Add the contents of the "add-to-xpdfrc" file to your system-wide
xpdfrc config file, which is typically:

Unix - /usr/local/etc/xpdfrc
Win32 - C:\Program Files\xpdf\xpdfrc

Alternatively, on Unix systems you can add these lines to your
personal xpdfrc file in $HOME/.xpdfrc.

xpdf runs on the following platforms:

Precompiled binaries are available for the following machines:

  • x86, Linux (statically linked to Motif, t1lib, and FreeType):
    xpdf-3.02pl4-linux.tar.gz (11985186 bytes)
  • SPARC, Solaris 10 (statically linked to t1lib and FreeType):
    not currently available
  • x64, Solaris 10 (statically linked to t1lib and FreeType):
    not currently available
  • x86, DOS/Win32 -- pdftops, pdftotext, pdfimages, pdfinfo, and pdffonts only:
    Win32 (built with MSVC): xpdf-3.02pl4-win32.zip (2046671 bytes)
    DOS6 (built with djgpp, with DPMI support from csdpmi5b): xpdf-3.02pl4-dos6.zip (1754621 bytes)

I've received reports of xpdf compiling successfully on the following systems (but binaries are not available on the net):

  • x86 and MIPS, SINIX V5.4 (email f.miane@opengroup.org for binaries) (xpdf 0.5)
  • Apollo 425e, DomainOS 10.4.1.2 (xpdf 0.5)
  • m68k (HP-9000/425), HP-UX 9.0 (xpdf 0.5)
  • Alpha, Linux (xpdf 0.7)
  • POWER, AIX 4.2.1, gcc 2.8.1 (xpdf 0.7a)
  • UltraSPARC 2, Linux 2.2.5 (xpdf 0.80)
  • SPARC, Solaris 2.7, gcc 2.8.1 (xpdf 0.90)
  • DG/UX (xpdf 0.90)
  • LynxOS 2.5.1 (xpdf 0.90)
  • HP-UX 10.20 and 11.00 (xpdf 0.90)
  • MacOS X / Darwin (xpdf 0.92)
  • QNX / X11 (xpdf 0.93)
  • x86, OpenBSD 3.0 (xpdf 1.00)
  • MacOS X / Darwin (xpdf 2.03)

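xpdf is more adaptable than PDFBox: it can parse English PDFs as well as PDFs that contain Chinese and other languages. However, xpdf is actually run from the command line.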

The following shows xpdf run from the command line to parse an English PDF.

The command is:

D:\workspace\testsearch2\xpdf>pdftotext ../htmls/xxxx.pdf xxxx.txt

 

Edit the xpdfrc file:

cidToUnicode Adobe-GB1 D:\workspace\testsearch2\xpdf\xpdf-chinese-simplified\Adobe-GB1.cidToUnicode
unicodeMap ISO-2022-CN D:\workspace\testsearch2\xpdf\xpdf-chinese-simplified\ISO-2022-CN.unicodeMap
unicodeMap EUC-CN D:\workspace\testsearch2\xpdf\xpdf-chinese-simplified\EUC-CN.unicodeMap
unicodeMap GBK D:\workspace\testsearch2\xpdf\xpdf-chinese-simplified\GBK.unicodeMap
cMapDir Adobe-GB1 D:\workspace\testsearch2\xpdf\xpdf-chinese-simplified\CMap
toUnicodeDir D:\workspace\testsearch2\xpdf\xpdf-chinese-simplified\CMap

fontDir c:\windows\Fonts
displayCIDFontTT Adobe-GB1 c:\windows\fonts\SimHei.ttf

textEOL dos
On Linux, look at the add-to-xpdfrc file and copy its contents into xpdfrc.
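Since the Chinese entries differ only in the directory they point to, they can be generated for any install location. The following is a minimal sketch, not part of xpdf: the class name `XpdfrcChinese` and the method `entries` are hypothetical, the caller is assumed to pass a directory that already ends with a path separator, and paths containing spaces are not handled.

```java
class XpdfrcChinese {

    /**
     * Returns the xpdfrc entries for the Chinese Simplified package.
     * dirWithSeparator must end with "/" or "\" (assumption of this sketch).
     */
    static String entries(String dirWithSeparator) {
        String d = dirWithSeparator;
        StringBuilder sb = new StringBuilder();
        sb.append("cidToUnicode Adobe-GB1 ").append(d).append("Adobe-GB1.cidToUnicode\n");
        // One unicodeMap entry per output encoding shipped in the package.
        for (String enc : new String[] {"ISO-2022-CN", "EUC-CN", "GBK"}) {
            sb.append("unicodeMap ").append(enc).append(' ')
              .append(d).append(enc).append(".unicodeMap\n");
        }
        sb.append("cMapDir Adobe-GB1 ").append(d).append("CMap\n");
        sb.append("toUnicodeDir ").append(d).append("CMap\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        // Default Unix location named in the package README.
        System.out.print(entries("/usr/local/share/xpdf/chinese-simplified/"));
    }
}
```

The font-related lines (fontDir, displayCIDFontTT) and textEOL are left out because they depend on the machine rather than on the package directory.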

To parse a Chinese PDF, an encoding option must be added (the same -enc GBK option also works for English documents):

D:\workspace\testsearch2\xpdf>pdftotext -layout -enc GBK ..\htmls\readme.pdf
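Because -enc GBK makes pdftotext write GBK-encoded bytes, a Java program that later reads the text file (for example, before handing the text to Lucene) has to decode it with the same charset. The following is a minimal sketch, assuming the JDK in use ships the GBK charset; the class name `GbkTextReader` and the default file name are illustrative only.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class GbkTextReader {

    /** Reads a file written by "pdftotext -enc GBK" and decodes it as GBK. */
    static String readGbk(Path txtFile) throws IOException {
        // Charset.forName throws UnsupportedCharsetException if this JDK lacks GBK.
        Charset gbk = Charset.forName("GBK");
        return new String(Files.readAllBytes(txtFile), gbk);
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical default; pass the real output file as the first argument.
        Path txt = Paths.get(args.length > 0 ? args[0] : "readme.txt");
        System.out.println(readGbk(txt));
    }
}
```

Reading the file with the platform default charset would only work where that default happens to be GBK, so naming the charset explicitly keeps the result the same across machines.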

 

 

The main options are as follows:

OPTIONS
Many of the following options can be set with configuration file
commands. These are listed in square brackets with the description of
the corresponding command line option.

-f number
Specifies the first page to convert.

-l number
Specifies the last page to convert.

-layout
Maintain (as best as possible) the original physical layout of
the text. The default is to 'undo' physical layout (columns,
hyphenation, etc.) and output the text in reading order.

-fixed number
Assume fixed-pitch (or tabular) text, with the specified character
width (in points). This forces physical layout mode.

-raw Keep the text in content stream order. This is a hack which
often "undoes" column formatting, etc. Use of raw mode is no
longer recommended.

-htmlmeta
Generate a simple HTML file, including the meta information.
This simply wraps the text in <pre> and </pre> and prepends the
meta headers.

-enc encoding-name
Sets the encoding to use for text output. The encoding-name
must be defined with the unicodeMap command (see xpdfrc(5)).
The encoding name is case-sensitive. This defaults to "Latin1"
(which is a built-in encoding). [config file: textEncoding]

The Simplified Chinese package provides only the following three encodings:

ISO-2022-CN
EUC-CN
GBK

-eol unix | dos | mac
Sets the end-of-line convention to use for text output. [config
file: textEOL]

-nopgbrk
Don't insert page breaks (form feed characters) between pages.
[config file: textPageBreaks]

-opw password
Specify the owner password for the PDF file. Providing this
will bypass all security restrictions.

-upw password
Specify the user password for the PDF file.

-q Don't print any messages or errors. [config file: errQuiet]

-cfg config-file
Read config-file in place of ~/.xpdfrc or the system-wide config
file.

-v Print copyright and version information.

-h Print usage information. (-help and --help are equivalent.)
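The options above map naturally onto an argument list built in code. The following is a minimal sketch covering only -f, -l, -layout and -enc; the class name `PdfToTextCommand`, the method `build` and the convention that a page number of 0 or less means "not set" are assumptions made for illustration, not part of xpdf.

```java
import java.util.ArrayList;
import java.util.List;

class PdfToTextCommand {

    /**
     * Builds the argument list for pdftotext.
     * firstPage/lastPage <= 0 means "not set"; a null encoding keeps the default.
     */
    static List<String> build(String exe, int firstPage, int lastPage,
                              boolean layout, String encoding,
                              String pdfFile, String txtFile) {
        List<String> cmd = new ArrayList<>();
        cmd.add(exe);
        if (firstPage > 0) {
            cmd.add("-f");
            cmd.add(Integer.toString(firstPage));
        }
        if (lastPage > 0) {
            cmd.add("-l");
            cmd.add(Integer.toString(lastPage));
        }
        if (layout) {
            cmd.add("-layout");
        }
        if (encoding != null) {
            cmd.add("-enc");
            cmd.add(encoding);
        }
        // Input and output file names always come last.
        cmd.add(pdfFile);
        cmd.add(txtFile);
        return cmd;
    }

    public static void main(String[] args) {
        List<String> cmd = build("pdftotext", 1, 3, true, "GBK", "readme.pdf", "readme.txt");
        System.out.println(String.join(" ", cmd));
        // prints: pdftotext -f 1 -l 3 -layout -enc GBK readme.pdf readme.txt
    }
}
```

Keeping each option and its value as separate list elements means the list can be passed directly to `ProcessBuilder` without any shell quoting.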

Next, we wrap the command line in a Java class:

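package extract;

import java.io.IOException;

public class ExtractorCJKPDF {

    // Directory containing the xpdf binaries; adjust to the local installation.
    private static final String XPDF_PATH = "D:/workspace/testsearch2/xpdf/";

    /**
     * Converts a PDF file to a text file by running pdftotext.
     *
     * @param pdffile path of the PDF file to read
     * @param txtfile path of the text file to write
     */
    public static void pdf2text(String pdffile, String txtfile)
            throws IOException, InterruptedException {
        // -layout keeps the original layout, -enc sets the output encoding,
        // -nopgbrk suppresses the page breaks between pages.
        String[] cmd = new String[] {XPDF_PATH + "pdftotext", "-layout", "-enc", "GBK",
                "-nopgbrk", pdffile, txtfile};
        ProcessBuilder pb = new ProcessBuilder(cmd);
        // Pass pdftotext's messages through to the console so that a full
        // output buffer cannot block the child process.
        pb.inheritIO();
        Process p = pb.start();
        // Wait until the conversion has finished before returning.
        int exitCode = p.waitFor();
        if (exitCode != 0) {
            throw new IOException("pdftotext exited with code " + exitCode);
        }
    }

    public static void main(String[] args) {
        try {
            pdf2text("D:/workspace/testsearch2/htmls/123.pdf",
                    "D:/workspace/testsearch2/htmls/123.txt");
        } catch (IOException e) {
            e.printStackTrace();
        } catch (InterruptedException e) {
            // Restore the interrupt flag so callers can see the interruption.
            Thread.currentThread().interrupt();
        }
    }

}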
