htmlparser解析一些网页时,繁体中文会变成乱码

lzj0470

浏览: 1287530 次
性别:
来自: 深圳

最近访客更多访客>>

gljhh

hedgehog12

chen88358323

wyx065747

博主相关

博客

微博

相册

留言

关于我

文章分类

社区版块

存档分类

博客分类：

java

htmlparser解析一些网页时繁体中文会变成乱码

htmlparser解析一些网页时,繁体中文会变成乱码

最近发现用htmlparser解析一些网页时,繁体中文会变成乱码.分析了下原因,发现在用stringbean的时候htmlparser会自己根据meta来决定用哪种内码来解码,而有的网站在meta中是用gb2312来做charset,实际应用的时候又用到了gbk.gb2312是不能表示繁体的,所以就出现了乱码.解决的办法很简单,gbk是兼容gb2312的,所以在htmlparser的page.java的getcharser()那里加一句判断,如果ret是gb2312就设置为gbk,这样问题就解决了.

修改的page.java的代码如下(/lexer/page.java)

    public String getCharset (String content)
    {
        final String CHARSET_STRING = "charset";
        int index;
        String ret;

        if (null == mSource)
            ret = DEFAULT_CHARSET;
        else
            // use existing (possibly supplied) character set:
            // bug #1322686 when illegal charset specified
            ret = mSource.getEncoding ();
        if (null != content)
        {
            index = content.indexOf (CHARSET_STRING);

            if (index != -1)
            {
                content = content.substring (index +
                    CHARSET_STRING.length ()).trim ();
                if (content.startsWith ("="))
                {
                    content = content.substring (1).trim ();
                    index = content.indexOf (";");
                    if (index != -1)
                        content = content.substring (0, index);

                    //remove any double quotes from around charset string
                    if (content.startsWith ("\"") && content.endsWith ("\"")
                        && (1 < content.length ()))
                        content = content.substring (1, content.length () - 1);

                    //remove any single quote from around charset string
                    if (content.startsWith ("'") && content.endsWith ("'")
                        && (1 < content.length ()))
                        content = content.substring (1, content.length () - 1);

ret = findCharset (content, ret);

                    // Charset names are not case-sensitive;
                    // that is, case is always ignored when comparing
                    // charset names.
//                    if (!ret.equalsIgnoreCase (content))
//                    {
//                        System.out.println (
//                            "detected charset \""
//                            + content
//                            + "\", using \""
//                            + ret
//                            + "\"");
//                    }
                }
            }
        }
        if(ret.equalsIgnoreCase("gb2312"))ret="GBK"; //to avoid decode problem
                                                                                        //edited by linyunfan
        return (ret);
    }

在最后加入了这句

if(ret.equalsIgnoreCase("gb2312"))ret="GBK";

分享到：

httpclient提交post处理中文乱码问题 | java用正则表达式实现去掉URL一个参数和它 ...

2008-11-15 00:26
浏览 2615
评论(3)
查看更多

3 楼 qq123zhz 2011-10-21

确实是对的。。。

2 楼 flash 2008-11-26

不好意思，没看完就下结论了。错怪了。。。

1 楼 flash 2008-11-26

如果页面是utf-8编码呢？你这个方法明显不能解决所有中文乱码的问题

发表评论

您还没有登录,请您登录后再发表评论

最近访客更多访客>>

博主相关

文章分类

社区版块

存档分类

最新评论