Converting Word to HTML with POI (tables, images, styles)

chembo

Blog category: poi word html

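Straight to the point: the requirement is to preview Word documents in a web page. I am using POI 3.8. The code below supports tables and images, does not support pagination, and handles only .doc files, not .docx.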

import java.io.BufferedWriter;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.util.List;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;

import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.converter.PicturesManager;
import org.apache.poi.hwpf.converter.WordToHtmlConverter;
import org.apache.poi.hwpf.usermodel.Picture;
import org.apache.poi.hwpf.usermodel.PictureType;
import org.w3c.dom.Document;

/**
 * @author: Chembo Huang
 * @since: May 3, 2012
 * @modified: May 3, 2012
 * @version:
 */
public class Word2Html {

	public static void main(String argv[]) {
		try {
			convert2Html("D://1.doc","D://1.html");
		} catch (Exception e) {
			e.printStackTrace();
		}
	}

	/**
	 * Writes content to path as GB2312 text. The encoding here must match
	 * the one set on the serializer in convert2Html.
	 */
	public static void writeFile(String content, String path) {
		FileOutputStream fos = null;
		BufferedWriter bw = null;
		try {
			File file = new File(path);
			fos = new FileOutputStream(file);
			bw = new BufferedWriter(new OutputStreamWriter(fos,"GB2312"));
			bw.write(content);
		} catch (FileNotFoundException fnfe) {
			fnfe.printStackTrace();
		} catch (IOException ioe) {
			ioe.printStackTrace();
		} finally {
			try {
				// Closing the writer flushes it and closes fos as well.
				if (bw != null)
					bw.close();
				else if (fos != null)
					fos.close();
			} catch (IOException ie) {
				ie.printStackTrace();
			}
		}
	}

	public static void convert2Html(String fileName, String outPutFile)
			throws TransformerException, IOException,
			ParserConfigurationException {
		// Load the .doc file; the stream can be closed once the document is parsed.
		FileInputStream in = new FileInputStream(fileName);
		HWPFDocument wordDocument;
		try {
			wordDocument = new HWPFDocument(in);
		} finally {
			in.close();
		}
		WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(
				DocumentBuilderFactory.newInstance().newDocumentBuilder()
						.newDocument());
		// Only decides the URL written into each <img>; the bytes are saved below.
		wordToHtmlConverter.setPicturesManager(new PicturesManager() {
			public String savePicture(byte[] content,
					PictureType pictureType, String suggestedName,
					float widthInches, float heightInches) {
				return "test/" + suggestedName;
			}
		});
		wordToHtmlConverter.processDocument(wordDocument);
		// Save pictures to D:/test/ so the relative "test/..." URLs resolve
		// when the HTML file is written to D:/.
		File picDir = new File("D:/test/");
		picDir.mkdirs();
		List<Picture> pics = wordDocument.getPicturesTable().getAllPictures();
		if (pics != null) {
			for (Picture pic : pics) {
				try {
					OutputStream picOut = new FileOutputStream(
							new File(picDir, pic.suggestFullFileName()));
					try {
						pic.writeImageContent(picOut);
					} finally {
						picOut.close();
					}
				} catch (FileNotFoundException e) {
					e.printStackTrace();
				}
			}
		}
		Document htmlDocument = wordToHtmlConverter.getDocument();
		ByteArrayOutputStream out = new ByteArrayOutputStream();
		DOMSource domSource = new DOMSource(htmlDocument);
		StreamResult streamResult = new StreamResult(out);

		TransformerFactory tf = TransformerFactory.newInstance();
		Transformer serializer = tf.newTransformer();
		serializer.setOutputProperty(OutputKeys.ENCODING, "GB2312");
		serializer.setOutputProperty(OutputKeys.INDENT, "yes");
		serializer.setOutputProperty(OutputKeys.METHOD, "html");
		serializer.transform(domSource, streamResult);
		out.close();
		// Decode with the serializer's charset. new String(bytes) without a
		// charset uses the platform default, which differs between machines.
		writeFile(new String(out.toByteArray(), "GB2312"), outPutFile);
	}
}
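
The listing serializes the DOM to bytes, turns the bytes into a String, and then encodes that String again in writeFile. One way to avoid juggling two encoding settings is to let the Transformer write straight to the output file. The following is only a sketch using JDK classes: the class DomToHtmlFile, its methods, and the choice of UTF-8 are assumptions for illustration, and the small hand-built DOM stands in for wordToHtmlConverter.getDocument().

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;

import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper, not part of POI or of the original post.
public class DomToHtmlFile {

	/** Serializes doc as HTML directly into target, using a single encoding. */
	public static void writeHtml(Document doc, File target, String encoding)
			throws Exception {
		Transformer serializer = TransformerFactory.newInstance().newTransformer();
		serializer.setOutputProperty(OutputKeys.ENCODING, encoding);
		serializer.setOutputProperty(OutputKeys.INDENT, "yes");
		serializer.setOutputProperty(OutputKeys.METHOD, "html");
		OutputStream out = new FileOutputStream(target);
		try {
			// The serializer encodes the bytes itself; no String round trip.
			serializer.transform(new DOMSource(doc), new StreamResult(out));
		} finally {
			out.close();
		}
	}

	/** Builds a tiny stand-in document: html > body > p containing text. */
	public static Document sampleDocument(String text) throws Exception {
		Document doc = DocumentBuilderFactory.newInstance()
				.newDocumentBuilder().newDocument();
		Element html = doc.createElement("html");
		Element body = doc.createElement("body");
		Element p = doc.createElement("p");
		p.setTextContent(text);
		body.appendChild(p);
		html.appendChild(body);
		doc.appendChild(html);
		return doc;
	}

	public static void main(String[] args) throws Exception {
		File target = File.createTempFile("word2html", ".html");
		writeHtml(sampleDocument("中文 preview"), target, "UTF-8");
		// Read the file back with the same charset to check the result.
		System.out.println(new String(Files.readAllBytes(target.toPath()),
				StandardCharsets.UTF_8));
	}
}
```

In the real converter, the Document returned by wordToHtmlConverter.getDocument() would be passed to writeHtml in place of sampleDocument(...).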


2012-05-03 21:44

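#41 zhanglongbin 2016-05-03

Thanks for sharing!
My problem: converting Word to HTML locally with both encodings set to GB2312 gave garbled text; changing both to UTF-8 fixed it.
After deploying to the server, the Chinese text was garbled again. I tried every encoding combination I could think of without success, until I finally ended up with
serializer.setOutputProperty(OutputKeys.ENCODING, "GB2312");
bw = new BufferedWriter(new OutputStreamWriter(fos,"UTF-8"));
With this combination the server output is correct, but Chinese text is garbled when debugging locally. Do you know why? Why does it behave differently on different machines?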
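
One plausible explanation for the machine-dependent results is the call new String(out.toByteArray()) in the posted code: without a charset argument it decodes with the platform default, which is often GBK on Chinese Windows and UTF-8 on Linux servers. The sketch below uses only the JDK to show the effect of decoding with a different charset than the one used for encoding. The class CharsetRoundTrip is a hypothetical name for illustration, and GBK stands in for whatever the default on a given machine happens to be.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original post.
public class CharsetRoundTrip {

	/** Encodes text with 'written', then decodes the bytes with 'read'. */
	public static String roundTrip(String text, Charset written, Charset read) {
		byte[] bytes = text.getBytes(written);
		return new String(bytes, read);
	}

	public static void main(String[] args) {
		String text = "中文";
		Charset gbk = Charset.forName("GBK");

		// Same charset on both sides: the text survives the round trip.
		String same = roundTrip(text, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
		System.out.println("UTF-8 -> UTF-8 intact: " + text.equals(same));

		// UTF-8 bytes decoded as GBK: the text no longer matches.
		String mixed = roundTrip(text, StandardCharsets.UTF_8, gbk);
		System.out.println("UTF-8 -> GBK intact:   " + text.equals(mixed));

		// new String(bytes) with no charset argument uses this value.
		System.out.println("Default charset here:  " + Charset.defaultCharset());
	}
}
```

If this is indeed the cause, passing the serializer's charset explicitly when decoding (or skipping the String step entirely) should make the output independent of the machine.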

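#40 wjs876046992 2016-04-18

The document numbering comes out wrong: everything is read as 1 and 1.1, whereas my document is numbered 1, 1.1, 1.2.3, and so on. Is there any way to fix this?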

#39 408729253 2015-12-09

Why is my converted output garbled?

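#38 huazi221 2015-05-16

How can I get each individual page of a Word document with POI?

For example, if a Word document has three pages, I want to get those three pages and process each one separately, but I can't find which API does this.
Any advice would be appreciated.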

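#37 chembo 2015-01-14

jiefalcon wrote:

How can this be made to read docx, xlsx, and xls as well?

It doesn't look like that can be supported. The code above is the complete source.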

#36 jiefalcon 2015-01-08

How can this be made to read docx, xlsx, and xls as well?

#35 jiefalcon 2015-01-08

Could you share the source code?

#34 奔跑者java 2014-12-10

Hello, I need Word-to-HTML conversion in my project and took an example from the web, but it always fails with: The supplied data appears to be in the Office 2007+ XML. You are calling the part of POI that deals with OLE2 Office Documents. You need to call a different part of POI to process this data (eg XSSF instead of HSSF). I searched online but could not find the cause, so I'd like to ask: have you run into this problem?
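
The error message quoted above says the data is Office 2007+ XML rather than OLE2, so the input is probably a .docx (possibly renamed to .doc), which the HWPF-based code in this post cannot read. One way to catch this early is to look at the first bytes of the file before handing it to POI. The sketch below is an illustration only: the class WordFormatSniffer and its method are hypothetical names, and a ZIP signature only shows that the file is a ZIP container, not that it is a valid .docx.

```java
import java.util.Arrays;

// Hypothetical helper, not part of POI or of the original post.
public class WordFormatSniffer {

	/** First 8 bytes of an OLE2 compound file (.doc, .xls, .ppt). */
	private static final byte[] OLE2_MAGIC = {
			(byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
			(byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1 };

	/** First 4 bytes of a ZIP archive, the container used by .docx/.xlsx. */
	private static final byte[] ZIP_MAGIC = { 'P', 'K', 0x03, 0x04 };

	/** Returns "OLE2", "ZIP" or "UNKNOWN" for the leading bytes of a file. */
	public static String sniff(byte[] head) {
		if (startsWith(head, OLE2_MAGIC)) {
			return "OLE2";
		}
		if (startsWith(head, ZIP_MAGIC)) {
			return "ZIP";
		}
		return "UNKNOWN";
	}

	private static boolean startsWith(byte[] data, byte[] prefix) {
		return data.length >= prefix.length
				&& Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
	}

	public static void main(String[] args) {
		byte[] docHead = { (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
				(byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1 };
		byte[] docxHead = { 'P', 'K', 0x03, 0x04, 0x14, 0x00 };
		System.out.println(sniff(docHead));
		System.out.println(sniff(docxHead));
	}
}
```

A caller could read the first eight bytes of the file, pass them to sniff, and only construct an HWPFDocument when the result is "OLE2".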

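#33 chembo 2014-10-28

feng_xing2 wrote:

Reading from a local file is the easy case. In a real project the file usually does not come from the local disk but from a file upload. Have you managed to read Word content from an upload? I tried HWPFDocument doc = new HWPFDocument(request.getInputStream()); and it fails immediately with
java.io.IOException: Invalid header signature; read 0x65572D2D2D2D2D2D, expected 0xE11AB1A1E011CFD0

You could try saving the upload to local disk first and then reading it from there.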
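
A note on the signature in that error: read as little-endian bytes, 0x65572D2D2D2D2D2D is the ASCII text "------We", which looks like the start of a multipart boundary. That suggests the raw request body was passed to HWPFDocument instead of the uploaded file's own content. Assuming the file part's stream has been obtained through whatever upload API the application uses, the "save locally first" suggestion could be sketched as follows with JDK classes only. UploadSpooler and spoolToTempFile are hypothetical names, and the ByteArrayInputStream in main merely stands in for the uploaded part.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of POI or of the original post.
public class UploadSpooler {

	/** Copies the uploaded file content to a temp file and returns its path. */
	public static Path spoolToTempFile(InputStream uploaded) throws IOException {
		Path tmp = Files.createTempFile("upload-", ".doc");
		// createTempFile already created the file, so overwrite it.
		Files.copy(uploaded, tmp, StandardCopyOption.REPLACE_EXISTING);
		return tmp;
	}

	public static void main(String[] args) throws IOException {
		// Stand-in for the file part's input stream from an upload API.
		byte[] fakeUpload = { (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0 };
		Path saved = spoolToTempFile(new ByteArrayInputStream(fakeUpload));
		System.out.println("Saved " + Files.size(saved) + " bytes to " + saved);
		Files.delete(saved);
	}
}
```

The returned path can then be passed to convert2Html as its fileName argument via saved.toString().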

#32 feng_xing2 2014-10-14

#31 zhanghao88915 2014-09-17

Am I referencing the wrong jar? I get the error org.apache.poi.hwpf.model.CHPX cannot be cast to org.apache.poi.hwpf.usermodel.Range

#30 zhanghao88915 2014-09-17

Is there no source code?

#29 you2790 2014-08-18

拨弦waitC wrote:

Why does it fail when execution reaches wordToHtmlConverter.processDocument(wordDocument)?
The error is: java.lang.NoSuchMethodError: org.w3c.dom.Attr.getTextContent()Ljava/lang/String;

You are missing a jar: poi-scratchpad-3.8-20120326.jar

#28 拨弦waitC 2014-04-25

Why does it fail when execution reaches wordToHtmlConverter.processDocument(wordDocument)?
The error is: java.lang.NoSuchMethodError: org.w3c.dom.Attr.getTextContent()Ljava/lang/String;

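#27 e_endswell 2013-12-13

How can I detect the page breaks of the source doc? I would like the output HTML to be laid out page by page like the doc as far as possible, and then convert it to PDF.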

#26 a30292330 2013-12-09

Hello, after converting with your method, why does the formatting change?

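#25 chembo 2013-11-26

waytofall wrote:

chembo wrote:

waytofall wrote:

A strange problem: if these two lines
bw = new BufferedWriter(new OutputStreamWriter(fos,"GB2312"));
serializer.setOutputProperty(OutputKeys.ENCODING, "GB2312");
are changed to utf-8, the output is garbled. The HTML page displays garbled text, and the source is also garbled when opened in Notepad++.
This is odd. It should only be a matter of output encoding. Word presumably uses Unicode by default, and if GB2312 output is correct then decoding is fine, so why does UTF-8 output go wrong?

Yes, I noticed this problem too. I took the lazy route and hard-coded GB2312, and never found time to look at the source code afterwards. If you find out where the problem is, please let me know!

I figured out what is going on. It is not a POI problem but a matter of how Java reads and writes files. A file written with UTF-8 encoding opens in Notepad++ without garbling, and the menu shows its encoding as UTF-8 without BOM. Files without a BOM seem to be garbled when opened in a browser (whereas files in ANSI encodings such as GBK open correctly). If I use Notepad++ to change the encoding to UTF-8 (that is, with BOM), the browser opens it correctly (I use Chrome). Java does not write a BOM by default, so you can write the BOM at the start of the file and then write the file content. :)

Very professional!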
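
What waytofall describes, writing a BOM before the content, could look roughly like the sketch below. It uses only JDK classes; Utf8BomWriter and writeWithBom are hypothetical names, and the sketch assumes the content itself is meant to be UTF-8 (so the serializer's ENCODING property would also need to be UTF-8 for the charset declared in the HTML to agree with the file).

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the original post.
public class Utf8BomWriter {

	/** The three bytes of the UTF-8 byte order mark (U+FEFF). */
	private static final byte[] UTF8_BOM = {
			(byte) 0xEF, (byte) 0xBB, (byte) 0xBF };

	/** Writes the BOM first, then the content encoded as UTF-8. */
	public static void writeWithBom(String content, File target)
			throws IOException {
		OutputStream out = new FileOutputStream(target);
		try {
			out.write(UTF8_BOM);
			out.write(content.getBytes(StandardCharsets.UTF_8));
		} finally {
			out.close();
		}
	}

	public static void main(String[] args) throws IOException {
		File target = File.createTempFile("with-bom", ".html");
		writeWithBom("<html><body><p>中文</p></body></html>", target);
		System.out.println("Wrote " + target.length() + " bytes to " + target);
	}
}
```

Whether the BOM is actually needed depends on how the browser determines the encoding; this only shows the mechanism mentioned in the comment.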

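#24 waytofall 2013-11-25

chembo wrote:

waytofall wrote:

Yes, I noticed this problem too. I took the lazy route and hard-coded GB2312, and never found time to look at the source code afterwards. If you find out where the problem is, please let me know!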

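#23 chembo 2013-09-26

waytofall wrote:

Yes, I noticed this problem too. I took the lazy route and hard-coded GB2312, and never found time to look at the source code afterwards. If you find out where the problem is, please let me know!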

#22 waytofall 2013-09-25
