Posted: 2010-08-11
Last modified: 2010-08-11
In version 1.2.3, the API for fetching a URL is:

    String url = "";
    Document doc = Jsoup.parse(new URL(url), 3000);

This API offers no way to set a Cookie or any other request header, so below I modify the source to add one. First, a look at the existing code.

The source of Jsoup.parse:

    public static Document parse(URL url, int timeoutMillis) throws IOException {
        return DataUtil.load(url, timeoutMillis);
    }

The source of DataUtil.load (the beginning of the method):

    static Document load(URL url, int timeoutMillis) throws IOException {
        String protocol = url.getProtocol();
        Validate.isTrue(protocol.equals("http") || protocol.equals("https"), "Only http & https protocols supported");

        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setInstanceFollowRedirects(true);
        conn.setConnectTimeout(timeoutMillis);
        conn.setReadTimeout(timeoutMillis);
        conn.connect();
        // (excerpt ends here; the rest of the method reads and parses the response)

Clearly, all that is needed is a call to HttpURLConnection.setRequestProperty for each header before connect().

In DataUtil.java I overloaded load and extracted the response handling into a shared private method:

    static Document load(URL url, Map<String, String> requestMap, int timeoutMillis) throws IOException {
        String protocol = url.getProtocol();
        Validate.isTrue(protocol.equals("http") || protocol.equals("https"), "Only http & https protocols supported");

        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setInstanceFollowRedirects(true);
        conn.setConnectTimeout(timeoutMillis);
        conn.setReadTimeout(timeoutMillis);

        // Set the request headers; this must happen before connect().
        for (Map.Entry<String, String> entry : requestMap.entrySet()) {
            conn.setRequestProperty(entry.getKey(), entry.getValue());
        }

        conn.connect();
        return load(url, conn);
    }

    static Document load(URL url, int timeoutMillis) throws IOException {
        String protocol = url.getProtocol();
        Validate.isTrue(protocol.equals("http") || protocol.equals("https"), "Only http & https protocols supported");

        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setInstanceFollowRedirects(true);
        conn.setConnectTimeout(timeoutMillis);
        conn.setReadTimeout(timeoutMillis);
        conn.connect();
        return load(url, conn);
    }

    // Shared by both overloads: checks the response and parses the body.
    private static Document load(URL url, HttpURLConnection conn) throws IOException {
        int res = conn.getResponseCode();
        if (res != HttpURLConnection.HTTP_OK)
            throw new IOException(res + " error loading URL " + url.toString());

        String contentType = conn.getContentType();
        if (contentType == null || !contentType.startsWith("text/"))
            throw new IOException(String.format("Unhandled content type \"%s\" on URL %s. Must be text/*",
                    contentType, url.toString()));

        InputStream inStream = new BufferedInputStream(conn.getInputStream());
        try {
            String charSet = getCharsetFromContentType(contentType); // may be null, readInputStream deals with it
            return readInputStream(inStream, charSet, url.toExternalForm());
        } finally {
            inStream.close(); // close the stream even if parsing throws
        }
    }

In Jsoup.java I overloaded parse:

    public static Document parse(URL url, Map<String, String> requestMap, int timeoutMillis) throws IOException {
        return DataUtil.load(url, requestMap, timeoutMillis);
    }

With these changes in place, write an ant script to build the jar.

Usage example:

    String url = "";
    Map<String, String> requestMap = new HashMap<String, String>();
    requestMap.put("Cookie", "");
    Document doc = Jsoup.parse(new URL(url), requestMap, 3000);

This lets you send a Cookie header, which solves the problem of crawling pages that require a login.

Notice: Copyright in ITeye articles belongs to the author and is protected by law. Reproduction without the author's written permission is prohibited.
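To try the header-setting step from the post without rebuilding jsoup, here is a minimal standalone sketch using only the JDK. The class name HeaderSketch, the method name open, the example URL, and the header values are all my own placeholders and not part of jsoup. It only prepares the connection and never calls connect(), so it runs without network access.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper class for illustration; not part of jsoup.
class HeaderSketch {

    /**
     * Prepares an HTTP(S) connection with the given request headers applied.
     * The connection is NOT connected yet; the caller decides when to connect.
     */
    static HttpURLConnection open(URL url, Map<String, String> headers, int timeoutMillis)
            throws IOException {
        String protocol = url.getProtocol();
        if (!protocol.equals("http") && !protocol.equals("https")) {
            throw new IllegalArgumentException("Only http & https protocols supported");
        }

        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setInstanceFollowRedirects(true);
        conn.setConnectTimeout(timeoutMillis);
        conn.setReadTimeout(timeoutMillis);

        // Headers must be set before the connection is established.
        if (headers != null) {
            for (Map.Entry<String, String> entry : headers.entrySet()) {
                conn.setRequestProperty(entry.getKey(), entry.getValue());
            }
        }
        return conn;
    }

    public static void main(String[] args) throws IOException {
        Map<String, String> headers = new LinkedHashMap<String, String>();
        headers.put("Cookie", "sessionid=PLACEHOLDER"); // placeholder value, assumed for illustration
        headers.put("User-Agent", "sketch-agent/0.1");  // placeholder value, assumed for illustration

        HttpURLConnection conn = open(new URL("http://example.com/"), headers, 3000);
        System.out.println("User-Agent to be sent: " + conn.getRequestProperty("User-Agent"));
    }
}
```

Unlike the patched load in the post, this sketch tolerates a null header map. As far as I know, later jsoup releases added Jsoup.connect(url) with cookie(...), header(...) and timeout(...) methods, so on those versions patching the source should not be necessary.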