commons-httpclient is no longer maintained;
HttpComponents is its successor project.
The goal here is to fetch the entire content of a page using httpcomponents-client-4.0.1.
It is a slightly modified version of ClientAbortMethod from the bundled examples
(the added code is marked with comments -- it just reads the input stream, nothing special).

Java code:
/*
* ====================================================================
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
* ====================================================================
*
* This software consists of voluntary contributions made by many
* individuals on behalf of the Apache Software Foundation. For more
* information on the Apache Software Foundation, please see
* <http://www.apache.org/>.
*
*/
package org.apache.http.examples.client;

import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;

import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;

/**
 * A modified version of the ClientAbortMethod example: instead of aborting
 * the request before reading the body, it reads the entire response content
 * and prints it.
 */
public class MyClientAbortMethod {

    public final static void main(String[] args) throws Exception {
        HttpClient httpclient = new DefaultHttpClient();
        HttpGet httpget = new HttpGet("http://www.apache.org/");
        System.out.println("executing request " + httpget.getURI());
        HttpResponse response = httpclient.execute(httpget);
        HttpEntity entity = response.getEntity();
        System.out.println("----------------------------------------");
        System.out.println(response.getStatusLine());
        if (entity != null) {
            System.out.println("Response content length: " + entity.getContentLength());
            // start: read the whole page content
            InputStream is = entity.getContent();
            BufferedReader in = new BufferedReader(new InputStreamReader(is));
            StringBuilder buffer = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) {
                buffer.append(line).append("\n");
            }
            // end: read the whole page content
            System.out.println(buffer.toString());
        }
        System.out.println("----------------------------------------");
        // The original example called abort() instead of reading the body;
        // once the content has been fully consumed, abort() is effectively a no-op.
        httpget.abort();
        // When the HttpClient instance is no longer needed,
        // shut down the connection manager to ensure
        // immediate deallocation of all system resources
        httpclient.getConnectionManager().shutdown();
    }
}
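The read-loop above can be pulled out into a small helper that works on any InputStream, which makes it easy to test without hitting the network. This is just a sketch: the class and method names (`StreamReadDemo`, `readAll`) and the hard-coded UTF-8 charset are my own assumptions, not part of the original example. (In practice, HttpClient 4.x also ships `EntityUtils.toString(entity)`, which does this in one call.)

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;

public class StreamReadDemo {

    // Read an entire InputStream into a String, line by line,
    // mirroring the loop used in MyClientAbortMethod above.
    static String readAll(InputStream is) throws IOException {
        BufferedReader in = new BufferedReader(new InputStreamReader(is, "UTF-8"));
        StringBuilder buffer = new StringBuilder();
        String line;
        try {
            while ((line = in.readLine()) != null) {
                buffer.append(line).append('\n');
            }
        } finally {
            in.close(); // also closes the underlying stream
        }
        return buffer.toString();
    }

    public static void main(String[] args) throws IOException {
        InputStream is = new ByteArrayInputStream("hello\nworld".getBytes("UTF-8"));
        System.out.print(readAll(is)); // prints "hello" and "world" on separate lines
    }
}
```

Note that `readLine()` strips the original line terminators, so the helper normalizes them to `\n`, just as the example code does.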