htmlparser(2)

tw5566


Blog category:

java

Continued from part 1

            else if (node instanceof TextNode) {
                // Skip text nodes until the page title has been found.
                if ("".equals(title))
                    continue;
                // Collapse whitespace runs (including the full-width space U+3000) into one space.
                stringText = node.toPlainTextString().replaceAll("[ \t\n\f\r\u3000]+", " ");
                stringText = TextHtml.html2text(stringText.trim());
                if (!"".equals(stringText)) {
                    body.append(stringText);
                    body.append(" ");
                }
            } else if (node instanceof TagNode) {
                TagNode tagNode = (TagNode) node;
                String name = tagNode.getTagName();
                if ("TITLE".equals(name) && !tagNode.isEndTag()) {
                    // The node right after <title> is expected to hold the title text.
                    node = lexer.nextNode();
                    if (node == null)
                        break; // end of input reached inside <title>
                    stringText = node.toPlainTextString().trim();
                    if (!"".equals(stringText)) {
                        title = stringText;
                    }
                } else if ("META".equals(name)) {
                    // Not every META tag carries a CONTENT attribute, so guard against null.
                    String contentCharSet = tagNode.getAttribute("CONTENT");
                    if (contentCharSet != null
                            && contentCharSet.toLowerCase().indexOf("charset") > -1) {
                        String newCharSet = getCharset(contentCharSet);
                        if (!charSet.equals(newCharSet)) {
                            // The page declares a different encoding: stop and parse again.
                            tryAgain = true;
                            charSet = newCharSet;
                            break;
                        }
                    }
                }
            }
        }

        /** If a new character encoding was detected in the META information, the page must be parsed again using that encoding. */
        if (tryAgain) {
            body = new StringBuffer();
            try {
                // Open a fresh connection: the first input stream has already been read.
                uc = (HttpURLConnection) uc.getURL().openConnection();
                lexer = new Lexer(new Page(uc.getInputStream(), charSet));
            } catch (Exception e) {
                e.printStackTrace();
            }
            lexer.setNodeFactory(new PrototypicalNodeFactory());
            while (null != (node = lexer.nextNode())) {
                if (node instanceof TextNode) {
                    // Skip text nodes if no title was found in the first pass.
                    if ("".equals(title))
                        continue;
                    // Collapse whitespace runs (including the full-width space U+3000) into one space.
                    stringText = node.toPlainTextString().replaceAll("[ \t\n\f\r\u3000]+", " ");
                    stringText = TextHtml.html2text(stringText.trim());
                    if (!"".equals(stringText)) {
                        body.append(stringText);
                        body.append(" ");
                    }
                }
            }
        }
    }
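
The two-pass idea used above (decode with an assumed encoding, look for a charset in the META content, then decode again if it differs) can be tried out without HTMLParser. The following is only a minimal sketch under my own assumptions: the class `TwoPassDecode` and the helpers `extractCharset`, `normalize` and `decode` are hypothetical names, not the `getCharset` method of this series and not part of the HTMLParser API. It also simply takes the first `charset=` it finds anywhere in the text, which is cruder than checking real META tags.

```java
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TwoPassDecode {
    // Matches charset=NAME inside a value such as "text/html; charset=gb2312".
    private static final Pattern CHARSET = Pattern.compile(
            "charset\\s*=\\s*[\"']?([A-Za-z0-9][\\w.:+-]*)", Pattern.CASE_INSENSITIVE);

    /** Hypothetical helper: returns the declared charset name, or null if none is declared. */
    public static String extractCharset(String content) {
        if (content == null) {
            return null;
        }
        Matcher m = CHARSET.matcher(content);
        return m.find() ? m.group(1) : null;
    }

    /** Collapses runs of ASCII whitespace and full-width spaces (U+3000) into one space. */
    public static String normalize(String text) {
        return text.replaceAll("[ \t\n\f\r\u3000]+", " ").trim();
    }

    /** Decodes with the assumed charset, then decodes again if the page declares another one. */
    public static String decode(byte[] raw, String assumed) {
        String text = new String(raw, Charset.forName(assumed));
        String declared = extractCharset(text);
        if (declared != null
                && !declared.equalsIgnoreCase(assumed)
                && Charset.isSupported(declared)) {
            // Second pass: the same bytes, this time with the declared encoding.
            text = new String(raw, Charset.forName(declared));
        }
        return text;
    }
}
```

The first pass works because the META tag itself is plain ASCII, so it survives decoding with most wrong encodings. Unlike the code above, this sketch keeps the raw bytes in memory instead of opening a second connection.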



    /**
     * Determines the final encoding of the web page.
     * @param name the META tag value after processing by the getCharset method
     * @param _default the default charset
     * @return the canonical charset name, or _default if the name cannot be resolved
     */
    public static String findCharset(String name, String _default) {
        String ret;
        try {
            Class<java.nio.charset.Charset> cls;
            Method method;
            Object object;
            cls = java.nio.charset.Charset.class;
            // Equivalent to Charset.forName(name).name(), called through reflection.
            method = cls.getMethod("forName", new Class[] { String.class });
            object = method.invoke(null, new Object[] { name });
            method = cls.getMethod("name", new Class[] {});
            object = method.invoke(object, new Object[] {});
            ret = (String) object;
        } catch (NoSuchMethodException nsme) {
            ret = name;
        } catch (IllegalAccessException ia) {
            ret = name;
        } catch (InvocationTargetException ita) {
            // forName rejected the name (illegal or unsupported charset).
            ret = _default;
            System.out.println("unable to determine canonical charset name for "
                    + name + " - using " + _default);
        }
        return ret;
    }
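
Since `java.nio.charset.Charset` is already referenced directly as a class literal in `findCharset`, the same lookup can probably be written without reflection. A minimal sketch, assuming a runtime where `java.nio.charset` is always present; the class `CharsetNames` and the method `canonicalName` are hypothetical names of my own, not part of this series or of HTMLParser:

```java
import java.nio.charset.Charset;

public class CharsetNames {
    /**
     * Returns the canonical name of the given charset,
     * or fallback if the name is null, illegal or unsupported.
     */
    public static String canonicalName(String name, String fallback) {
        try {
            return Charset.forName(name).name();
        } catch (IllegalArgumentException e) {
            // Covers IllegalCharsetNameException, UnsupportedCharsetException
            // and the IllegalArgumentException thrown for a null name.
            return fallback;
        }
    }
}
```

For example, `CharsetNames.canonicalName("utf-8", "ISO-8859-1")` resolves the alias to the canonical name `UTF-8`, while an unknown name falls back to the second argument.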

To be continued in part 3.


2009-01-16 15:21