Web crawler: fetching web pages with HttpClient + Jericho HTML Parser

guoyiqi

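Jericho HTML Parser is a simple but powerful Java library for parsing HTML. It can analyze and manipulate parts of an HTML document, including some common server-side tags, while reproducing unrecognized or invalid HTML as-is. It also provides a useful HTML form analyzer.
Download: http://sourceforge.net/project/showfiles.php?group_id=101067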

HttpClient serves as the HTTP client component that communicates with the server; jdom is used alongside it to parse XML data.

HttpClient can be downloaded from http://jakarta.apache.org/commons/httpclient/downloads.html
HttpClient depends on logging, a subproject of Apache Jakarta Commons. You can download Commons Logging from http://jakarta.apache.org/site/downloads/downloads_commons-logging.cgi, then take commons-logging.jar out of the downloaded archive and add it to the CLASSPATH.
HttpClient also depends on codec, another subproject of Apache Jakarta Commons. You can download the latest Commons Codec from http://jakarta.apache.org/site/downloads/downloads_commons-codec.cgi, then take commons-codec-1.x.jar out of the downloaded archive and add it to the CLASSPATH.

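Fetching web page content mainly relies on the GET method.

Using HttpClient takes the following 6 steps:

1. Create an instance of HttpClient.

2. Create an instance of a connection method, here GetMethod. Pass the address to connect to into the GetMethod constructor.

3. Call executeMethod on the instance created in step 1 to execute the method instance created in step 2.

4. Read the response.

5. Release the connection. The connection must be released whether or not the method executed successfully.

6. Process the content that was retrieved.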

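Create a project named snatch in Eclipse.
Add the four jar files downloaded above (Jericho HTML Parser, HttpClient, commons-logging, commons-codec) to the project's build path.
The environment is now set up.

First, a look at how HttpClient is used.
Create a package named test in the project, and in it a class named HttpClientTest.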

package test;

import java.io.IOException;

import org.apache.commons.httpclient.*;
import org.apache.commons.httpclient.methods.GetMethod;
import org.apache.commons.httpclient.params.HttpMethodParams;

public class HttpClientTest {

    public static void main(String[] args) {
        // Create the HttpClient instance
        HttpClient httpClient = new HttpClient();
        // Create the GET method instance
        GetMethod getMethod = new GetMethod("http://www.google.com.cn");
        // Use the default retry handler provided by the library
        getMethod.getParams().setParameter(HttpMethodParams.RETRY_HANDLER,
                new DefaultHttpMethodRetryHandler());
        try {
            // Execute getMethod
            int statusCode = httpClient.executeMethod(getMethod);
            if (statusCode != HttpStatus.SC_OK) {
                System.err.println("Method failed: " + getMethod.getStatusLine());
            }
            // Read the response body
            byte[] responseBody = getMethod.getResponseBody();
            // Process the content (decoded here with the platform default charset)
            System.out.println(new String(responseBody));
        } catch (HttpException e) {
            // Fatal error: the protocol may be wrong or the returned content is malformed
            System.out.println("Please check your provided http address!");
            e.printStackTrace();
        } catch (IOException e) {
            // Network error
            e.printStackTrace();
        } finally {
            // Release the connection
            getMethod.releaseConnection();
        }
    }
}

This prints the source code of the page.
Here, byte[] responseBody = getMethod.getResponseBody(); is what reads the content.
The content can also be read in these ways:
InputStream inputStream = getMethod.getResponseBodyAsStream();
String responseBody = getMethod.getResponseBodyAsString();
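
When the body is read as a stream, the bytes still have to be decoded with the right charset. The sketch below is not from the original post. It uses only the JDK, and a ByteArrayInputStream stands in for the stream that getResponseBodyAsStream() would return. The class name StreamToString, the helper readAll, and the sample text are all hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of HttpClient or the original post.
final class StreamToString {

    // Reads the stream to the end and decodes the collected bytes with the given charset.
    static String readAll(InputStream in, Charset charset) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return new String(buffer.toByteArray(), charset);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a response body; a real page would come from the HTTP response stream.
        byte[] body = "<title>\u4e2d\u6587</title>".getBytes(StandardCharsets.UTF_8);
        try (InputStream in = new ByteArrayInputStream(body)) {
            System.out.println(readAll(in, StandardCharsets.UTF_8));
        }
    }
}
```

Decoding the whole byte array at once, rather than chunk by chunk, avoids splitting a multi-byte character across two reads.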

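The following example combines the two libraries.
It extracts the first few items of the "信息快递" (news express) column from http://www.ahcourt.gov.cn/gb/ahgy_2004/fyxw/index.html.
Create a new class named CourtNews.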

package test;

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.commons.httpclient.DefaultHttpMethodRetryHandler;
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.HttpException;
import org.apache.commons.httpclient.HttpStatus;
import org.apache.commons.httpclient.methods.GetMethod;
import org.apache.commons.httpclient.params.HttpMethodParams;

import au.id.jericho.lib.html.Element;
import au.id.jericho.lib.html.HTMLElementName;
import au.id.jericho.lib.html.Segment;
import au.id.jericho.lib.html.Source;

/**
 * @author oscar 07-5-17
 */
public class CourtNews {

    // Maximum number of news items to collect
    private int newsCount = 3;
    // Each entry is a String[] {title, url}
    private List newsList = new ArrayList();

    public int getNewsCount() {
        return newsCount;
    }

    public void setNewsCount(int newsCount) {
        this.newsCount = newsCount;
    }

    public List getNewsList() {
        HttpClient httpClient = new HttpClient();
        GetMethod getMethod = new GetMethod(
                "http://www.ahcourt.gov.cn/gb/ahgy_2004/fyxw/index.html");
        getMethod.getParams().setParameter(HttpMethodParams.RETRY_HANDLER,
                new DefaultHttpMethodRetryHandler());
        try {
            int statusCode = httpClient.executeMethod(getMethod);
            if (statusCode != HttpStatus.SC_OK) {
                System.err.println("Method failed: " + getMethod.getStatusLine());
            }
            String responseBody = getMethod.getResponseBodyAsString();
            // The body was decoded as ISO-8859-1 (the HttpClient default when the
            // response declares no charset); re-decode the same bytes as GB2312
            responseBody = new String(responseBody.getBytes("ISO-8859-1"), "GB2312");
            Source source = new Source(responseBody);
            int tableCount = 0;
            for (Iterator i = source.findAllElements(HTMLElementName.TABLE).iterator();
                    i.hasNext(); tableCount++) {
                Segment segment = (Segment) i.next();
                // On this page the news links sit in the table at index 13
                if (tableCount == 13) {
                    int hrefCount = 0;
                    for (Iterator j = segment.findAllElements(HTMLElementName.A).iterator();
                            j.hasNext();) {
                        Segment childsegment = (Segment) j.next();
                        String title = childsegment.extractText();
                        // Kept as in the source: the first argument was lost in extraction and
                        // the result is discarded, so this call has no effect as written
                        title.replace("", " ");
                        title = trimTitle(title);
                        Element childelement = (Element) childsegment;
                        if (hrefCount < newsCount) {
                            String[] news = new String[] {
                                    title,
                                    "http://www.ahcourt.gov.cn"
                                            + childelement.getAttributeValue("href") };
                            newsList.add(news);
                            hrefCount++;
                        }
                    }
                }
            }
        } catch (HttpException e) {
            System.out.println("Please check your provided http address!");
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            getMethod.releaseConnection();
        }
        return newsList;
    }

    // Removes every space character from the title
    private String trimTitle(String title) {
        StringBuilder titlenew = new StringBuilder();
        for (int i = 0; i < title.length(); i++) {
            if (!Character.isSpaceChar(title.charAt(i))) {
                titlenew.append(title.charAt(i));
            }
        }
        return titlenew.toString();
    }

    public static void main(String[] args) {
        CourtNews justice = new CourtNews();
        justice.setNewsCount(4);
        List list = justice.getNewsList();
        Iterator it = list.iterator();
        while (it.hasNext()) {
            String[] news = (String[]) it.next();
            System.out.println(news[0]);
            System.out.println(news[1]);
        }
    }
}
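
The least obvious step in CourtNews is the charset re-decoding of the response body. The sketch below is not from the original post. It isolates that trick using only the JDK: text wrongly decoded as ISO-8859-1 is turned back into its raw bytes and decoded again with the real charset. The class name CharsetRepair and the helper redecode are hypothetical, and the demo falls back to UTF-8 as a stand-in when the runtime does not ship GB2312.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of HttpClient or the original post.
final class CharsetRepair {

    // Recovers text that was decoded as ISO-8859-1 although its bytes were in another charset.
    // ISO-8859-1 maps each byte value to exactly one char, so getBytes returns the original bytes.
    static String redecode(String misdecoded, Charset actual) {
        byte[] raw = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, actual);
    }

    public static void main(String[] args) {
        // GB2312 matches the post; UTF-8 is only a stand-in if GB2312 is unavailable.
        Charset actual = Charset.isSupported("GB2312")
                ? Charset.forName("GB2312")
                : StandardCharsets.UTF_8;
        String original = "\u4e2d\u6587";
        // Simulate the wrong first decode that the HTTP layer would have performed.
        String garbled = new String(original.getBytes(actual), StandardCharsets.ISO_8859_1);
        System.out.println(redecode(garbled, actual).equals(original));
    }
}
```

This only works because the first decode was ISO-8859-1, which loses no bytes. If the body had first been decoded with a charset that replaces unknown byte sequences, the original bytes could not be recovered this way.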

2007-05-17 15:02