Web Crawlers: How to Convert Relative Paths to Absolute Paths

johnson.lee

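I have been writing a spider recently and ran into a problem: how to convert the relative paths of hyperlinks in a page into absolute paths. I searched Google and found plenty of examples, but none of them is quite right, because each considers only part of the problem. They roughly do the following.
Suppose the current page is http://localhost:5995/web/index.html
The code is: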

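String url = "http://localhost:5995/web/index.html";
// Cut the URL just after its last "/" to get the page's "directory"
int offset = url.lastIndexOf("/");
String cate = url.substring(0, offset + 1);
// Append the relative path found in the page
String absPath = cate + "relative path found in the page";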

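The code above only works when the requested path is not a directory on the server. Some sites do not block directory browsing. Taking the URL above as an example, suppose the site has the following directory structure: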
web
├─images
│ ├─logo.gif
│ ├─bg.gif
│ ├─banner.swf
│ └─style.css
├─scripts
│ ├─swfobject.js
│ ├─common.js
│ └─AjaxRequest.js
└─upload
    ├─201001090001.rar
    ├─201001090002.rar
    ├─201001090003.rar
    ├─201001090004.rar
    └─201001090005.rar
Suppose this site does not block directory browsing. Requesting http://localhost:5995/web/scripts in a browser then shows:

Directory Listing -- /web/scripts/

--------------------------------------------------------------------------------

[To Parent Directory]

 Wednesday, December 16, 2009 12:48 PM        8,670 swfobject.js
    Monday, December 07, 2009 05:09 PM        1,450 common.js
  Thursday, December 31, 2009 01:54 PM        3,309 AjaxRequest.js


--------------------------------------------------------------------------------
Version Information: ASP.NET Development Server 8.0.0.0

The page source is:

<html>
    <head>
    <title>Directory Listing -- /web/scripts/</title>
        <style>
        	body {font-family:"Verdana";font-weight:normal;font-size: 8pt;color:black;} 
        	p {font-family:"Verdana";font-weight:normal;color:black;margin-top: -5px}
        	b {font-family:"Verdana";font-weight:bold;color:black;margin-top: -5px}
        	h1 { font-family:"Verdana";font-weight:normal;font-size:18pt;color:red }
        	h2 { font-family:"Verdana";font-weight:normal;font-size:14pt;color:maroon }
        	pre {font-family:"Lucida Console";font-size: 8pt}
        	.marker {font-weight: bold; color: black;text-decoration: none;}
        	.version {color: gray;}
        	.error {margin-bottom: 10px;}
        	.expandable { text-decoration:underline; font-weight:bold; color:navy; cursor:hand; }
        </style>
    </head>
    <body bgcolor="white">

    <h2> <i>Directory Listing -- /web/scripts/</i> </h2></span>

            <hr width=100% size=1 color=silver>

<PRE>
<A href="/web/">[To Parent Directory]</A>

 Wednesday, December 16, 2009 12:48 PM        8,670 <A href="swfobject.js">swfobject.js</A>
    Monday, December 07, 2009 05:09 PM        1,450 <A href="common.js">common.js</A>
  Thursday, December 31, 2009 01:54 PM        3,309 <A href="AjaxRequest.js">AjaxRequest.js</A>
</PRE>
            <hr width=100% size=1 color=silver>

              <b>Version Information:</b>&nbsp;ASP.NET Development Server 8.0.0.0

            </font>

    </body>
</html>

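After the spider extracts "swfobject.js", "common.js" and "AjaxRequest.js", the truncation approach above is bound to go wrong: it cuts at the last "/" and produces http://localhost:5995/web/swfobject.js. Here the last segment must not be cut off; the correct result is "http://localhost:5995/web/scripts" + "/" + "swfobject.js". So how does the browser get it right? On the page http://localhost:5995/web/scripts, hovering over the swfobject.js link shows the absolute path http://localhost:5995/web/scripts/swfobject.js in the status bar, not http://localhost:5995/web/swfobject.js. My guess was that the HTTP response headers must carry the relevant information, so I tested with the following code: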


// Requires: java.io.IOException, java.net.URL, java.util.List, java.util.Map
try {
	URL base = new URL("http://localhost:5995/web/scripts");
	// Fetch the response headers for the requested URL
	Map<String, List<String>> props = base.openConnection().getHeaderFields();
	// The status line is stored under the null key
	for (Map.Entry<String, List<String>> entry : props.entrySet()) {
		System.out.println(entry.getKey() + "=" + entry.getValue());
	}
} catch (IOException e) {
	e.printStackTrace();
}

The output is:

localhost:5995
null=[HTTP/1.1 200 OK]
Date=[Sat, 09 Jan 2010 01:00:25 GMT]
Content-type=[text/html; charset=utf-8]
Content-Length=[1494]
Connection=[Close]
Server=[ASP.NET Development Server/8.0.0.0]

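Nothing in this output distinguishes a directory from an HTML file. Then I thought that simulating the browser request with a Socket might turn up something unexpected. Here is the code: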

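// Requires: java.io.IOException, java.io.InputStream, java.io.OutputStream,
//           java.net.Socket, java.net.UnknownHostException
public static void main(String[] args) {
	// try-with-resources closes the socket and its streams on every path
	try (Socket s = new Socket("127.0.0.1", 5995)) {
		OutputStream os = s.getOutputStream();
		// Send a raw HTTP request; nothing here follows redirects for us
		os.write(("GET /web/scripts HTTP/1.1\r\n" +
				"Host: localhost:5995\r\n" +
				// "close" asks the server to end the connection after replying,
				// so the read loop below terminates
				"Connection: close\r\n" +
				"\r\n").getBytes());
		os.flush();
		InputStream is = s.getInputStream();
		int c = -1;
		// Print the response exactly as the server sent it
		while ((c = is.read()) != -1) {
			System.out.print((char)c);
		}
	} catch (UnknownHostException e) {
		e.printStackTrace();
	} catch (IOException e) {
		e.printStackTrace();
	}
}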

The output is:

HTTP/1.1 302 Found
Server: ASP.NET Development Server/8.0.0.0
Date: Sat, 09 Jan 2010 01:11:42 GMT
Content-Length: 135
Location: /web/scripts/
Connection: Close

<html><head><title>Object moved</title></head><body>
<h2>Object moved to <a href='/website1/scripts/'>here</a>.</h2>
</body></html>

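I hardly need to spell it out, do I? The key is that when the browser first requests http://localhost:5995/web/scripts, the server answers with a 302 response, telling the browser that the resource has moved and that it should request this path instead: Location: /web/scripts/. The browser then requests http://localhost:5995/web/scripts/, this time with a trailing "/", and from that URL it can work out the absolute path of every relative hyperlink in the page. ^_^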
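To make the effect of the trailing "/" concrete, here is a minimal sketch that resolves a link against both forms of the page URL using the JDK's java.net.URI. The class name RelativeLinks and the helper resolve are made up for this illustration; they are not part of the spider described in this post.

```java
import java.net.URI;

class RelativeLinks {
    /** Resolves href against the URL of the page it was found on. */
    static String resolve(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        // No trailing slash: "scripts" is treated like a file name and dropped.
        // prints http://localhost:5995/web/swfobject.js
        System.out.println(resolve("http://localhost:5995/web/scripts", "swfobject.js"));

        // Trailing slash (the redirect target): "scripts/" is kept.
        // prints http://localhost:5995/web/scripts/swfobject.js
        System.out.println(resolve("http://localhost:5995/web/scripts/", "swfobject.js"));

        // A root-relative link replaces the whole path.
        // prints http://localhost:5995/web/
        System.out.println(resolve("http://localhost:5995/web/scripts/", "/web/"));
    }
}
```

The resolution rule is the same in both calls; only the base URL differs, which is why the spider needs the URL after the redirect rather than the one it originally requested.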
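This also explains the earlier URLConnection test: it followed the redirect and issued a second request, which is why the 302 response never showed up.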
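The 302 can also be observed without a raw Socket by telling HttpURLConnection not to follow redirects. The following is only a sketch: the class name RedirectProbe and the method probe are hypothetical, and the default URL assumes the local test server used throughout this post is running.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

class RedirectProbe {
    /** Sends one GET without following redirects; returns "<status> <Location or empty>". */
    static String probe(String address) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(address).toURL().openConnection();
        // Redirects are followed by default; switch that off for this connection
        conn.setInstanceFollowRedirects(false);
        try {
            int status = conn.getResponseCode();
            String location = conn.getHeaderField("Location");
            return status + " " + (location == null ? "" : location);
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) {
        // Assumption: the test server from this post listens on localhost:5995
        String address = args.length > 0 ? args[0] : "http://localhost:5995/web/scripts";
        try {
            System.out.println(probe(address));
        } catch (IOException e) {
            System.err.println("Request failed: " + e.getMessage());
        }
    }
}
```

Against a server that behaves like the one above, a request for the directory without the trailing "/" should report the 302 status together with the Location value, while the same request with the trailing "/" should report 200 and no Location.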

2010-01-09 09:26
Comments (2)

#2 蜗牛笔 2012-06-05

johnson.lee wrote:

In fact, getURL() of the JDK's built-in URLConnection class returns the real path of the requested URL. See the code:

try {
	// The request targets a directory on the server
	URL url = new URL("http://localhost:5995/web/scripts");
	URLConnection conn = url.openConnection();
	url = conn.getURL();
	System.out.println(url);
} catch (IOException e) {
	e.printStackTrace();
}

The output is:

http://localhost:5995/web/scripts/

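The JDK's built-in HttpURLConnection handles the 302 response automatically, which is why a "/" is appended to the original URL. So the correct absolute path is simply the URL returned by URLConnection#getURL() plus the relative path.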

Are you sure? When I try it, that is not what I get.
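One possible explanation for the different results, offered as an assumption rather than something confirmed in this thread: openConnection() does not send anything by itself, so a getURL() call made right after it may still return the original URL. In the JDK implementations I am aware of, getURL() reflects the redirect target only once the response has actually been fetched. A minimal sketch, with the hypothetical names FinalUrlDemo and finalUrl:

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

class FinalUrlDemo {
    /** Returns the URL the connection ended up at after any followed redirects. */
    static String finalUrl(String address) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(address).toURL().openConnection();
        try {
            // Asking for the status code forces the request, and any
            // redirect handling, to happen before getURL() is consulted
            conn.getResponseCode();
            return conn.getURL().toString();
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) {
        // Assumption: the test server from this post listens on localhost:5995
        String address = args.length > 0 ? args[0] : "http://localhost:5995/web/scripts";
        try {
            System.out.println(finalUrl(address));
        } catch (IOException e) {
            System.err.println("Request failed: " + e.getMessage());
        }
    }
}
```

This relies on observed behavior of HttpURLConnection rather than a documented guarantee, so a crawler that needs to be certain could instead disable automatic redirects and read the Location header itself.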

#1 johnson.lee 2010-03-03