Posted: 2010-02-03
Last modified: 2010-04-07
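Software version: pdfbox-0.8.0-incubating
PDF conversion software: Adobe Acrobat 6.0, Foxit PDF Creator

Problem: PDFs produced with the fairly specialized Foxit PDF Creator extract without problems. PDFs produced with Acrobat open normally in Adobe Reader, but the text comes out garbled when read with pdfbox.

Conversion method: open the Word file, choose Print, and select the printer as shown in the figure.

After conversion, the fonts in the two PDFs differ. The Acrobat output uses Identity-H, whereas the Foxit output, which can be read correctly, uses UniGB-UCS2-H. So the problem most likely lies in the font encoding. I don't know whether there is a solution. The run result is as follows:

The source code that parses the PDF is as follows: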
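    // Package names as they appear in the stack traces later in this post
    import java.io.FileInputStream;
    import org.apache.pdfbox.pdfparser.PDFParser;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.util.PDFTextStripper;

    public void testPDF() {
        try {
            String ts = GetTextFromPdf("c:\\temp\\test.pdf");
            System.out.println(ts);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public String GetTextFromPdf(String filename) throws Exception {
        PDDocument pdfdocument = null;
        FileInputStream is = new FileInputStream(filename);
        try {
            PDFParser parser = new PDFParser(is);
            parser.parse();
            pdfdocument = parser.getPDDocument();
            // Extract the text of the document with the default stripper settings
            PDFTextStripper ts = new PDFTextStripper();
            return ts.getText(pdfdocument);
        } finally {
            // Release the document and the stream even if parsing fails
            if (pdfdocument != null) {
                pdfdocument.close();
            }
            is.close();
        }
    }

The attachment contains the Word file and the problematic PDF file.

× Please do not reply just to suggest converting with a different library.

//*********************************************************

Today I downloaded version 1.1.0 and was pleasantly surprised: it fixes not only the UniGB-UCS2-H problem but also the Identity-H encoding problem, and it also fixes an older, moderately annoying bug where stray spaces tended to appear inside the text. An official upgrade really is better, and far more comfortable than the assorted pile of patches I had cobbled together myself. It passed my ten or so test samples. Only one PDF, with embedded fonts, had a problem. The exception is as follows: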
    java.lang.NoClassDefFoundError: org/bouncycastle/jce/provider/BouncyCastleProvider
        at org.apache.pdfbox.pdmodel.PDDocument.openProtection(PDDocument.java:1098)
        at org.apache.pdfbox.pdmodel.PDDocument.decrypt(PDDocument.java:579)
        at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:235)
        at org.apache.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:180)
        at testPDFBox.GetTextFromPdf(testPDFBox.java:56)
        at testPDFBox.testPDF(testPDFBox.java:32)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
        at java.lang.reflect.Method.invoke(Unknown Source)
        at junit.framework.TestCase.runTest(TestCase.java:168)
        at junit.framework.TestCase.runBare(TestCase.java:134)
        at junit.framework.TestResult$1.protect(TestResult.java:110)
        at junit.framework.TestResult.runProtected(TestResult.java:128)
        at junit.framework.TestResult.run(TestResult.java:113)
        at junit.framework.TestCase.run(TestCase.java:124)
        at junit.framework.TestSuite.runTest(TestSuite.java:232)
        at junit.framework.TestSuite.run(TestSuite.java:227)
        at org.junit.internal.runners.OldTestClassRunner.run(OldTestClassRunner.java:76)
        at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:38)
        at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
        at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:460)
        at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:673)
        at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:386)
        at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:196)

After I downloaded the bcprov-jdk15-133.jar package, the class-not-found error no longer appears, but a parse error occurs instead:
    2010-4-7 15:26:28 org.apache.pdfbox.filter.FlateFilter decode
    SEVERE: Stop reading corrupt stream
    java.io.IOException: Error: Expected an integer type, actual=''
        at org.apache.pdfbox.pdfparser.BaseParser.readInt(BaseParser.java:1275)
        at org.apache.pdfbox.pdfparser.PDFObjectStreamParser.parse(PDFObjectStreamParser.java:81)
        at org.apache.pdfbox.cos.COSDocument.dereferenceObjectStreams(COSDocument.java:449)
        at org.apache.pdfbox.pdmodel.PDDocument.openProtection(PDDocument.java:1100)
        at org.apache.pdfbox.pdmodel.PDDocument.decrypt(PDDocument.java:579)
        at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:235)
        at org.apache.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:180)
        at testPDFBox.GetTextFromPdf(testPDFBox.java:56)
        at testPDFBox.testPDF(testPDFBox.java:32)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
        at java.lang.reflect.Method.invoke(Unknown Source)
        at junit.framework.TestCase.runTest(TestCase.java:168)
        at junit.framework.TestCase.runBare(TestCase.java:134)
        at junit.framework.TestResult$1.protect(TestResult.java:110)
        at junit.framework.TestResult.runProtected(TestResult.java:128)
        at junit.framework.TestResult.run(TestResult.java:113)
        at junit.framework.TestCase.run(TestCase.java:124)
        at junit.framework.TestSuite.runTest(TestSuite.java:232)
        at junit.framework.TestSuite.run(TestSuite.java:227)
        at org.junit.internal.runners.OldTestClassRunner.run(OldTestClassRunner.java:76)
        at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:38)
        at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
        at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:460)
        at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:673)
        at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:386)
        at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:196)

Notice: ITeye articles are copyrighted by their authors and protected by law. Do not reproduce without the author's written permission.
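The `NoClassDefFoundError` in the first trace means the Bouncy Castle provider class could not be loaded when PDFBox tried to open the document's protection. A minimal sketch of a pre-flight check follows, using only the JDK. The class name `ClasspathCheck` and the helper `isClassAvailable` are hypothetical names chosen for illustration; the provider class name is taken from the stack trace.

```java
public class ClasspathCheck {
    /**
     * Returns true if the named class can be located by this class's
     * class loader. The class is looked up without being initialized.
     */
    public static boolean isClassAvailable(String className) {
        try {
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Not on the classpath, or present but not loadable
            return false;
        }
    }

    public static void main(String[] args) {
        // Class name as reported in the NoClassDefFoundError
        String provider = "org.bouncycastle.jce.provider.BouncyCastleProvider";
        if (isClassAvailable(provider)) {
            System.out.println("Bouncy Castle provider found");
        } else {
            System.out.println("Bouncy Castle provider missing: add a bcprov jar to the classpath");
        }
    }
}
```

Running a check like this before the extraction test separates a missing-jar problem from a genuine parse failure such as the corrupt-stream error in the second trace.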
Posted: 2010-12-13
The test.pdf in your attachment extracts correctly on my machine.
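For readers comparing the two encodings named in the original question, here is a small sketch of why they can behave differently during text extraction. With UniGB-UCS2-H, each 2-byte code in the content stream is already a UCS-2 value, so it can be decoded directly. With Identity-H, each 2-byte code is a CID, and recovering text needs a separate code-to-Unicode table (in a PDF, typically the font's ToUnicode CMap). This uses only the JDK and does not call PDFBox; the class `CMapDemo`, both helper methods, and the sample CID values are made up for illustration, and whether this is what went wrong in pdfbox 0.8.0 is an assumption the thread does not confirm.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class CMapDemo {
    /** UniGB-UCS2-H style: the 2-byte codes are UCS-2, so decode as UTF-16BE. */
    public static String decodeUcs2(byte[] codes) {
        return new String(codes, StandardCharsets.UTF_16BE);
    }

    /** Identity-H style: each 2-byte code is a CID that must be looked up in a table. */
    public static String decodeIdentity(byte[] codes, Map<Integer, Character> toUnicode) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i + 1 < codes.length; i += 2) {
            int cid = ((codes[i] & 0xFF) << 8) | (codes[i + 1] & 0xFF);
            Character ch = toUnicode.get(cid);
            // Unknown CIDs become the Unicode replacement character
            sb.append(ch != null ? ch : '\uFFFD');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+4E2D U+6587 stored as big-endian UCS-2
        byte[] ucs2 = {0x4E, 0x2D, 0x65, (byte) 0x87};
        System.out.println(decodeUcs2(ucs2));

        // Hypothetical CIDs 1 and 2 with a hand-built lookup table
        byte[] identity = {0x00, 0x01, 0x00, 0x02};
        Map<Integer, Character> toUnicode = new HashMap<>();
        toUnicode.put(1, '\u4E2D');
        toUnicode.put(2, '\u6587');
        System.out.println(decodeIdentity(identity, toUnicode));
    }
}
```

Decoding the Identity-H bytes with `decodeUcs2` instead would yield the control characters U+0001 and U+0002 rather than readable text, which is one way garbled output can arise when the lookup table is missing or ignored.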