HttpClient + Jsoup: Simulating a Login and Parsing HTML to Extract Information
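I have recently been building a campus Android client that gathers information from the university's various websites into a single platform for students to browse.
The approach is as follows:
The Guangdong University of Technology library website serves as the example.
Goal: log in to the library with a personal account and retrieve that account's borrowing records.
Login URL: http://222.200.98.171:81/login.aspx
Chrome's developer tools are used here (press F12 in the browser to open them).
Open the source of the login page; below is the form tag from that source: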
<form name="aspnetForm" method="post" action="login.aspx?ReturnUrl=%2fuser%2fuserinfo.aspx" onsubmit="javascript:return WebForm_OnSubmit();" id="aspnetForm"> <div> <input type="hidden" name="__EVENTTARGET" id="__EVENTTARGET" value="" /> <input type="hidden" name="__EVENTARGUMENT" id="__EVENTARGUMENT" value="" /> <input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="/wEPDwULLTE0MjY3MDAxNzcPZBYCZg9kFgoCAQ8PFgIeCEltYWdlVXJsBRt+XGltYWdlc1xoZWFkZXJvcGFjNGdpZi5naWZkZAICDw8WAh4EVGV4dAUt5bm/5Lic5bel5Lia5aSn5a2m5Zu+5Lmm6aaG5Lmm55uu5qOA57Si57O757ufZGQCAw8PFgIfAQUcMjAxM+W5tDAz5pyIMDXml6UgIOaYn+acn+S6jGRkAgQPZBYEZg9kFgQCAQ8WAh4LXyFJdGVtQ291bnQCCBYSAgEPZBYCZg8VAwtzZWFyY2guYXNweAAM55uu5b2V5qOA57SiZAICD2QWAmYPFQMTcGVyaV9uYXZfY2xhc3MuYXNweAAM5YiG57G75a+86IiqZAIDD2QWAmYPFQMOYm9va19yYW5rLmFzcHgADOivu+S5puaMh+W8lWQCBA9kFgJmDxUDCXhzdGIuYXNweAAM5paw5Lmm6YCa5oqlZAIFD2QWAmYPFQMUcmVhZGVycmVjb21tZW5kLmFzcHgADOivu+iAheiNkOi0rWQCBg9kFgJmDxUDE292ZXJkdWVib29rc19mLmFzcHgADOaPkOmGkuacjeWKoWQCBw9kFgJmDxUDEnVzZXIvdXNlcmluZm8uYXNweAAP5oiR55qE5Zu+5Lmm6aaGZAIID2QWAmYPFQMbaHR0cDovL2xpYnJhcnkuZ2R1dC5lZHUuY24vAA/lm77kuabppobpppbpobVkAgkPZBYCAgEPFgIeB1Zpc2libGVoZAIDDxYCHwJmZAIBD2QWBAIDD2QWBAIBDw9kFgIeDGF1dG9jb21wbGV0ZQUDb2ZmZAIHDw8WAh8BZWRkAgUPZBYGAgEPEGRkFgFmZAIDDxBkZBYBZmQCBQ8PZBYCHwQFA29mZmQCBQ8PFgIfAQWlAUNvcHlyaWdodCAmY29weTsyMDA4LTIwMDkuIFNVTENNSVMgT1BBQyA0LjAxIG9mIFNoZW56aGVuIFVuaXZlcnNpdHkgTGlicmFyeS4gIEFsbCByaWdodHMgcmVzZXJ2ZWQuPGJyIC8+54mI5p2D5omA5pyJ77ya5rex5Zyz5aSn5a2m5Zu+5Lmm6aaGIEUtbWFpbDpzenVsaWJAc3p1LmVkdS5jbmRkZL5QuJMrEZz+0UxuTVpXZ/EaY5A4" /> </div> <script type="text/javascript"> //<![CDATA[ var theForm = document.forms['aspnetForm']; if (!theForm) { theForm = document.aspnetForm; } function __doPostBack(eventTarget, eventArgument) { if (!theForm.onsubmit || (theForm.onsubmit() != false)) { theForm.__EVENTTARGET.value = eventTarget; theForm.__EVENTARGUMENT.value = eventArgument; theForm.submit(); } } //]]> </script> <script 
src="/WebResource.axd?d=kbLQnwjf5uNQN4GcWRC5kD1rIySOzkR3uLyKE5xUO0j4Fa2lQPZwQlk_qYaspRXtlojncSBfRJNkA00qXOMQqsKd8WY1&t=634751988274393221" type="text/javascript"></script> <script src="/WebResource.axd?d=nsbO6ZJty6_6fuRufFNYnRiJ-xEoD0xQr70NX6g0v64gngATPLSnyyt7jyZkELLW6THXmh92_m0Y5TyvhES_-JroQeU1&t=634751988274393221" type="text/javascript"></script> <script type="text/javascript"> //<![CDATA[ function WebForm_OnSubmit() { if (typeof(ValidatorOnSubmit) == "function" && ValidatorOnSubmit() == false) return false; return true; } //]]> </script> <div> <input type="hidden" name="__EVENTVALIDATION" id="__EVENTVALIDATION" value="/wEWBQKa7ezdCwKOmK5RApX9wcYGAsP9wL8JAqW86pcIaBhXmFYzd5pGDTk/afln2TfArPw=" /> </div> <input name="ctl00$ContentPlaceHolder1$txtlogintype" type="hidden" id="ctl00_ContentPlaceHolder1_txtlogintype" value="0" /> <div id="Login" class="clearFix"> <div class="LoginTitle"> 登录我的图书馆 </div> <div class="LeftLogin"> <div class="LoginDiv"> <div class="loginContent"> <div class="loginInfo"> <span class="leftInfo">图书证号:</span> <span class="rightInfo"> <input name="ctl00$ContentPlaceHolder1$txtUsername_Lib" type="text" id="ctl00_ContentPlaceHolder1_txtUsername_Lib" class="txtInput" autocomplete="off" /><span id="ctl00_ContentPlaceHolder1_rfv_UserName_Lib" style="color:Red;display:none;">请输入证号</span> </span> </div> <div class="loginInfo"> <span class="leftInfo">密 码:</span> <span class="rightInfo"> <input name="ctl00$ContentPlaceHolder1$txtPas_Lib" type="password" id="ctl00_ContentPlaceHolder1_txtPas_Lib" class="txtInput" /><span id="ctl00_ContentPlaceHolder1_rfv_Password_Lib" style="color:Red;display:none;">请输入密码</span> </span> </div> <div> <span id="ctl00_ContentPlaceHolder1_lblErr_Lib"></span> </div> <div class="loginInfo"> <input type="submit" name="ctl00$ContentPlaceHolder1$btnLogin_Lib" value="登录" onclick="javascript:WebForm_DoPostBackWithOptions(new WebForm_PostBackOptions("ctl00$ContentPlaceHolder1$btnLogin_Lib", "", true, "", "", false, false))" 
id="ctl00_ContentPlaceHolder1_btnLogin_Lib" class="btn" /> <input type="button" value="清空" onclick="rset()" class="btn"/> </div> </div> </div> </div> <div class="RightDescription"> <img src="images/pin.gif" /> <br/> 1. 如果您使用的是公共电脑,请在使用完毕后,务必退出登录,以保安全。<br /> 2. 首次登录,请先<a href="changepas.aspx">修改初始密码</a>。 </div> </div> <script type="text/javascript"> //<![CDATA[ var Page_Validators = new Array(document.getElementById("ctl00_ContentPlaceHolder1_rfv_UserName_Lib"), document.getElementById("ctl00_ContentPlaceHolder1_rfv_Password_Lib")); //]]> </script> <script type="text/javascript"> //<![CDATA[ var ctl00_ContentPlaceHolder1_rfv_UserName_Lib = document.all ? document.all["ctl00_ContentPlaceHolder1_rfv_UserName_Lib"] : document.getElementById("ctl00_ContentPlaceHolder1_rfv_UserName_Lib"); ctl00_ContentPlaceHolder1_rfv_UserName_Lib.controltovalidate = "ctl00_ContentPlaceHolder1_txtUsername_Lib"; ctl00_ContentPlaceHolder1_rfv_UserName_Lib.focusOnError = "t"; ctl00_ContentPlaceHolder1_rfv_UserName_Lib.errormessage = "请输入证号"; ctl00_ContentPlaceHolder1_rfv_UserName_Lib.display = "Dynamic"; ctl00_ContentPlaceHolder1_rfv_UserName_Lib.evaluationfunction = "RequiredFieldValidatorEvaluateIsValid"; ctl00_ContentPlaceHolder1_rfv_UserName_Lib.initialvalue = ""; var ctl00_ContentPlaceHolder1_rfv_Password_Lib = document.all ? 
document.all["ctl00_ContentPlaceHolder1_rfv_Password_Lib"] : document.getElementById("ctl00_ContentPlaceHolder1_rfv_Password_Lib"); ctl00_ContentPlaceHolder1_rfv_Password_Lib.controltovalidate = "ctl00_ContentPlaceHolder1_txtPas_Lib"; ctl00_ContentPlaceHolder1_rfv_Password_Lib.focusOnError = "t"; ctl00_ContentPlaceHolder1_rfv_Password_Lib.errormessage = "请输入密码"; ctl00_ContentPlaceHolder1_rfv_Password_Lib.display = "Dynamic"; ctl00_ContentPlaceHolder1_rfv_Password_Lib.evaluationfunction = "RequiredFieldValidatorEvaluateIsValid"; ctl00_ContentPlaceHolder1_rfv_Password_Lib.initialvalue = ""; //]]> </script> <script type="text/javascript"> //<![CDATA[ var Page_ValidationActive = false; if (typeof(ValidatorOnLoad) == "function") { ValidatorOnLoad(); } function ValidatorOnSubmit() { if (Page_ValidationActive) { return ValidatorCommonOnSubmit(); } else { return true; } } //]]> </script> </form>
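There is a lot of markup here, and we need to extract from it the form data required for login. Tags such as input and select make up the content of a login form; this form contains only input tags, so extracting those is enough.
The code consists of two methods, initLoginParmas(String userName, String passWord) and getLoginFormData(String url):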
/**
 * Initializes the login parameters.
 *
 * @param userName the library card number
 * @param passWord the password
 * @return the form fields as name/value pairs
 * @throws ParseException
 * @throws IOException
 */
public static List<NameValuePair> initLoginParmas(String userName,
        String passWord) throws ParseException, IOException {
    List<NameValuePair> parmasList = new ArrayList<NameValuePair>();
    HashMap<String, String> parmasMap = getLoginFormData(LoginUrl);
    Set<String> keySet = parmasMap.keySet();
    // Fill in the user name and password fields; all other fields keep the values from the page
    for (String temp : keySet) {
        if (temp.contains("Username")) {
            parmasMap.put(temp, userName);
        } else if (temp.contains("txtPas")) {
            parmasMap.put(temp, passWord);
        }
    }
    Set<String> keySet2 = parmasMap.keySet();
    System.out.println("表单内容:"); // "Form contents:"
    for (String temp : keySet2) {
        System.out.println(temp + " = " + parmasMap.get(temp));
    }
    for (String temp : keySet2) {
        parmasList.add(new BasicNameValuePair(temp, parmasMap.get(temp)));
    }
    // System.out.println("initParams \n" + parmasMap);
    return parmasList;
}
/**
 * Collects the input fields of the login form.
 *
 * @param url the login page URL
 * @return a map from each input's name to its value
 * @throws IOException
 * @throws ParseException
 */
public static HashMap<String, String> getLoginFormData(String url)
        throws ParseException, IOException {
    Document document = Jsoup.parse(getHtml(url));
    // Find all form elements
    Elements element1 = document.getElementsByTag("form");
    // Keep the form whose submit method is post
    Element element = element1.select("[method=post]").first();
    // Take the input tags in that form that have a name attribute
    Elements elements = element.select("input[name]");
    HashMap<String, String> parmas = new HashMap<String, String>();
    for (Element temp : elements) {
        // Store each input's name and value in the map
        parmas.put(temp.attr("name"), temp.attr("value"));
    }
    return parmas;
}
The resulting form data is shown below (the password value has been replaced with a placeholder):
表单内容:
ctl00$ContentPlaceHolder1$txtlogintype = 0 __VIEWSTATE = /wEPDwULLTE0MjY3MDAxNzcPZBYCZg9kFgoCAQ8PFgIeCEltYWdlVXJsBRt+XGltYWdlc1xoZWFkZXJvcGFjNGdpZi5naWZkZAICDw8WAh4EVGV4dAUt5bm/5Lic5bel5Lia5aSn5a2m5Zu+5Lmm6aaG5Lmm55uu5qOA57Si57O757ufZGQCAw8PFgIfAQUcMjAxM+W5tDAz5pyIMDXml6UgIOaYn+acn+S6jGRkAgQPZBYEZg9kFgQCAQ8WAh4LXyFJdGVtQ291bnQCCBYSAgEPZBYCZg8VAwtzZWFyY2guYXNweAAM55uu5b2V5qOA57SiZAICD2QWAmYPFQMTcGVyaV9uYXZfY2xhc3MuYXNweAAM5YiG57G75a+86IiqZAIDD2QWAmYPFQMOYm9va19yYW5rLmFzcHgADOivu+S5puaMh+W8lWQCBA9kFgJmDxUDCXhzdGIuYXNweAAM5paw5Lmm6YCa5oqlZAIFD2QWAmYPFQMUcmVhZGVycmVjb21tZW5kLmFzcHgADOivu+iAheiNkOi0rWQCBg9kFgJmDxUDE292ZXJkdWVib29rc19mLmFzcHgADOaPkOmGkuacjeWKoWQCBw9kFgJmDxUDEnVzZXIvdXNlcmluZm8uYXNweAAP5oiR55qE5Zu+5Lmm6aaGZAIID2QWAmYPFQMbaHR0cDovL2xpYnJhcnkuZ2R1dC5lZHUuY24vAA/lm77kuabppobpppbpobVkAgkPZBYCAgEPFgIeB1Zpc2libGVoZAIDDxYCHwJmZAIBD2QWBAIDD2QWBAIBDw9kFgIeDGF1dG9jb21wbGV0ZQUDb2ZmZAIHDw8WAh8BZWRkAgUPZBYGAgEPEGRkFgFmZAIDDxBkZBYBZmQCBQ8PZBYCHwQFA29mZmQCBQ8PFgIfAQWlAUNvcHlyaWdodCAmY29weTsyMDA4LTIwMDkuIFNVTENNSVMgT1BBQyA0LjAxIG9mIFNoZW56aGVuIFVuaXZlcnNpdHkgTGlicmFyeS4gIEFsbCByaWdodHMgcmVzZXJ2ZWQuPGJyIC8+54mI5p2D5omA5pyJ77ya5rex5Zyz5aSn5a2m5Zu+5Lmm6aaGIEUtbWFpbDpzenVsaWJAc3p1LmVkdS5jbmRkZL5QuJMrEZz+0UxuTVpXZ/EaY5A4 ctl00$ContentPlaceHolder1$txtPas_Lib =密码不告诉你 __EVENTVALIDATION = /wEWBQKa7ezdCwKOmK5RApX9wcYGAsP9wL8JAqW86pcIaBhXmFYzd5pGDTk/afln2TfArPw= ctl00$ContentPlaceHolder1$txtUsername_Lib = 3110006527 ctl00$ContentPlaceHolder1$btnLogin_Lib = 登录
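The name/value pairs above end up in the POST body as application/x-www-form-urlencoded text. As a rough illustration of what that body looks like when GBK is the charset, here is a minimal sketch using only the JDK (Java 10 or later for the Charset overloads). The class and method names (FormEncodingSketch, encodeForm) are made up for this example, and it is not how HttpClient's UrlEncodedFormEntity is implemented internally; it only shows the same kind of encoding.

```java
import java.net.URLEncoder;
import java.nio.charset.Charset;
import java.util.LinkedHashMap;
import java.util.Map;

public class FormEncodingSketch {

    /** Joins name/value pairs into an application/x-www-form-urlencoded body. */
    static String encodeForm(Map<String, String> fields, Charset charset) {
        StringBuilder body = new StringBuilder();
        for (Map.Entry<String, String> field : fields.entrySet()) {
            if (body.length() > 0) {
                body.append('&');
            }
            // Both the name and the value are percent-encoded in the given charset
            body.append(URLEncoder.encode(field.getKey(), charset))
                .append('=')
                .append(URLEncoder.encode(field.getValue(), charset));
        }
        return body.toString();
    }

    public static void main(String[] args) {
        // Two of the fields from the form above; LinkedHashMap keeps insertion order
        Map<String, String> fields = new LinkedHashMap<String, String>();
        fields.put("ctl00$ContentPlaceHolder1$txtlogintype", "0");
        fields.put("ctl00$ContentPlaceHolder1$btnLogin_Lib", "登录");
        System.out.println(encodeForm(fields, Charset.forName("GBK")));
    }
}
```

Note that the `$` in the ASP.NET field names is escaped as `%24`, and that the Chinese button label is encoded from its GBK bytes rather than its UTF-8 bytes, which is why the charset has to match what the server expects.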
The next step is to log in and obtain authorization, that is, to obtain the cookie.
The code is as follows:
/**
 * Logs in to the library.
 *
 * @return the HTML of the page shown after login
 * @throws ClientProtocolException
 * @throws IOException
 */
public static String login() throws ClientProtocolException, IOException {
    List<NameValuePair> parmasList = initLoginParmas("3110006527", "your-password");
    HttpPost post = new HttpPost(LoginUrl);
    // Disable automatic redirects so that the Cookie and Location
    // of the first response can be read
    post.getParams().setParameter(ClientPNames.HANDLE_REDIRECTS, false);
    // Set the encoding to GBK
    post.setHeader("Content-Type",
            "application/x-www-form-urlencoded;charset=gbk");
    post.setEntity(new UrlEncodedFormEntity(parmasList, "GBK"));
    HttpResponse response = new DefaultHttpClient().execute(post);
    // Read the cookie and keep it
    cookie = response.getFirstHeader("Set-Cookie").getValue();
    // System.out.println("cookie= " + cookie);
    // Redirect address, used to reach the home page
    location = response.getFirstHeader("Location").getValue();
    // Build the home page URL
    mainUrl = Host + location;
    String html = getHtml(mainUrl);
    return html;
}
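When logging in to obtain the cookie, the server responds with status code 302. At that point the POST is automatically redirected to the Location address, so the response headers you see are no longer the ones returned by the login request but the ones returned when the redirect target was fetched. The cookie is carried in the response headers of the login request itself, so take particular care to add this statement: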
post.getParams().setParameter(ClientPNames.HANDLE_REDIRECTS,false);
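This sets a parameter on the POST request that prevents the redirect, so that the Cookie and the Location (needed to reach the home page) can both be read: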
cookie = response.getFirstHeader("Set-Cookie").getValue();
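To see the effect of switching redirects off without touching the real library server, the following self-contained sketch starts a throwaway local server that answers a login URL with 302, a Location header, and a Set-Cookie header, then requests it with redirects disabled. It swaps in the JDK's own HttpURLConnection and com.sun.net.httpserver in place of Apache HttpClient, uses GET instead of POST for brevity, and the cookie name, cookie value, and class names (RedirectSketch, FirstResponse) are all invented for the illustration.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URL;

public class RedirectSketch {

    /** Status and headers of the first response, before any redirect is followed. */
    static final class FirstResponse {
        final int status;
        final String location;
        final String setCookie;

        FirstResponse(int status, String location, String setCookie) {
            this.status = status;
            this.location = location;
            this.setCookie = setCookie;
        }
    }

    /** Requests the URL with automatic redirects switched off. */
    static FirstResponse fetchWithoutRedirect(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setInstanceFollowRedirects(false);
        try {
            return new FirstResponse(conn.getResponseCode(),
                    conn.getHeaderField("Location"),
                    conn.getHeaderField("Set-Cookie"));
        } finally {
            conn.disconnect();
        }
    }

    /** Starts a local stand-in server whose /login.aspx answers 302 plus a cookie. */
    static HttpServer startFakeLoginServer() throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/login.aspx", exchange -> {
            exchange.getResponseHeaders().add("Set-Cookie", "SESSION=abc123; path=/");
            exchange.getResponseHeaders().add("Location", "/user/userinfo.aspx");
            exchange.sendResponseHeaders(302, -1); // -1 means no response body
            exchange.close();
        });
        server.start();
        return server;
    }

    /** Runs the whole demonstration and returns what the client saw. */
    static FirstResponse runDemo() {
        try {
            HttpServer server = startFakeLoginServer();
            try {
                int port = server.getAddress().getPort();
                return fetchWithoutRedirect("http://127.0.0.1:" + port + "/login.aspx");
            } finally {
                server.stop(0);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        FirstResponse r = runDemo();
        System.out.println(r.status + " " + r.location + " " + r.setCookie);
    }
}
```

Because the redirect is not followed, the client sees the 302 itself together with its Location and Set-Cookie headers, which is the situation the HANDLE_REDIRECTS setting creates in the HttpClient code.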
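The next step is to build the home page URL from the Location value, parse the home page with Jsoup, and find the address of the page that lists my borrowed books.
From here on, every request to another page needs the cookie, so whenever a POST or GET is issued, call addHeader() or setHeader() to attach the Cookie: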
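The login code forwards the raw Set-Cookie value as the Cookie header, attributes such as path included, which evidently works against this server. A Cookie request header normally carries only the name=value pairs, and a server may send more than one Set-Cookie header. The sketch below shows one way to reduce several Set-Cookie values to a single Cookie header value; the class name, method name, and the sample cookie names and values are hypothetical and do not come from the library site.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CookieHeaderSketch {

    /**
     * Builds a Cookie request header value from Set-Cookie response header values
     * by keeping only the leading name=value pair of each one.
     */
    static String toCookieHeader(List<String> setCookieValues) {
        List<String> pairs = new ArrayList<String>();
        for (String value : setCookieValues) {
            // Everything after the first ';' is an attribute (path, expires, HttpOnly, ...)
            int semicolon = value.indexOf(';');
            String pair = (semicolon >= 0 ? value.substring(0, semicolon) : value).trim();
            if (!pair.isEmpty()) {
                pairs.add(pair);
            }
        }
        return String.join("; ", pairs);
    }

    public static void main(String[] args) {
        // Sample values invented for the example
        List<String> received = Arrays.asList(
                "ASP.NET_SessionId=abc123; path=/; HttpOnly",
                ".ASPXAUTH=xyz789; path=/");
        // prints: ASP.NET_SessionId=abc123; .ASPXAUTH=xyz789
        System.out.println(toCookieHeader(received));
    }
}
```

With HttpClient, the input list could be filled from response.getHeaders("Set-Cookie") instead of getFirstHeader, so that no cookie is dropped when the server sets several.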
/**
 * Fetches the HTML source of a page.
 *
 * @param url the page URL
 * @return the HTML source
 * @throws ParseException
 * @throws IOException
 */
private static String getHtml(String url) throws ParseException,
        IOException {
    HttpGet get = new HttpGet(url);
    // Compare string contents, not references, before attaching the cookie
    if (!cookie.isEmpty()) {
        get.addHeader("Cookie", cookie);
    }
    HttpResponse httpResponse = new DefaultHttpClient().execute(get);
    HttpEntity entity = httpResponse.getEntity();
    return EntityUtils.toString(entity);
}
Inspecting the page source in Chrome reveals this tag:
<a href="bookborrowed.aspx" >当前借阅情况和续借</a>
The bookborrowed.aspx part is what we need.
The code that retrieves it is as follows:
public static void getMyBorrowedBooks() {
    try {
        Document document = Jsoup.parse(login());
        // Find the <a> tag by its text
        Elements elements1 = document
                .getElementsContainingOwnText("当前借阅情况和续借");
        String url = elements1.first().attr("href");
        // Combine the href with mainUrl to form the borrowing-records URL
        borrowedBooksUrl = mainUrl.substring(0,
                mainUrl.lastIndexOf("/") + 1) + url;
        getBookBorrowedData(getHtml(borrowedBooksUrl));
    } catch (IOException e) {
        e.printStackTrace();
    }
}
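The substring-and-concatenate step above works when the href is a plain file name in the same directory. As an alternative sketch, java.net.URI can resolve any relative href (including ones that start with ../ or /) against the page URL. The base URL below assumes that the Location header was /user/userinfo.aspx, as the ReturnUrl parameter of the login form suggests; the class and method names are made up for the example.

```java
import java.net.URI;

public class UrlResolveSketch {

    /** Resolves an href found on a page against that page's own URL. */
    static String resolve(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        // Assumed home page URL (Host + Location)
        String mainUrl = "http://222.200.98.171:81/user/userinfo.aspx";

        // prints: http://222.200.98.171:81/user/bookborrowed.aspx
        System.out.println(resolve(mainUrl, "bookborrowed.aspx"));

        // prints: http://222.200.98.171:81/bookinfo.aspx?ctrlno=571892
        System.out.println(resolve(mainUrl, "../bookinfo.aspx?ctrlno=571892"));
    }
}
```

The second call shows the case the manual approach would not handle: a link such as ../bookinfo.aspx, which appears in the borrowed-books table, is normalized to a path one level up.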
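Once we have the address of the borrowing-records page, we request it and fetch its source.
The data we need is in this part (only an excerpt is shown):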
<tr>
  <td width="5%"> 续满 </td>
  <td width="10%">2013-04-10</td>
  <td width="35%"><a href="../bookinfo.aspx?ctrlno=571892" target="_blank">编写高质量代码 [专著]:改善Java程序的151个建议=Writing solw Java cove:151 suggestons to improve your Java program/秦小波著</a></td>
  <td width="5%"> </td>
  <td width="8%">中文图书</td>
  <td width="7%">A2973844</td>
  <td width="10%">2012-12-05</td>
</tr>
<tr>
The following code filters it with Jsoup:
/**
 * Extracts the borrowing records (List<BookEntity>).
 *
 * @param src the HTML source of the borrowing-records page
 * @return List<BookEntity>
 */
private static List<BookEntity> getBookBorrowedData(String src) {
    List<BookEntity> data = new ArrayList<BookEntity>();
    Document document = Jsoup.parse(src);
    // The records are in the first table inside the element with id "borrowedcontent"
    Element element = document.select("[id=borrowedcontent]").first()
            .getElementsByTag("table").first();
    Elements elements2 = element.getElementsByTag("tr");
    for (Element temp2 : elements2) {
        Elements elements3 = temp2.getElementsByTag("td");
        // Column 0: renewal status, 1: due date, 2: title, 6: borrow date
        BookEntity entity = new test().new BookEntity()
                .setIsFullData(elements3.get(0).text())
                .setData2Return(elements3.get(1).text())
                .setName(elements3.get(2).text())
                .setData2Borrowed(elements3.get(6).text());
        data.add(entity);
    }
    // The first row is the table header, so drop it
    data.remove(0);
    System.out.println("借书情况\n"); // "Borrowing records"
    for (BookEntity temp : data) {
        System.out.println(temp.getName() + "\n"
                + temp.getData2Borrowed() + "\n" + temp.getData2Return()
                + "\n" + temp.getIsFullData());
    }
    return data;
}
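The final printed result is shown below. Each entry lists the title, the borrow date, the due date, and the renewal status (续满 means the renewal limit has been reached):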
借书情况

编写高质量代码 [专著]:改善Java程序的151个建议=Writing solw Java cove:151 suggestons to improve your Java program/秦小波著
2012-12-05
2013-04-10
续满
疯狂Java [专著]:突破程序员基本功的16课/李刚编著
2012-12-05
2013-04-10
续满
程序员修炼之道 [专著]:从小工到专家=The pragmatic programmer:From journeyman to master:评注版/(美)Andrew Hunt,(美)David Thomas著;周爱民,蔡学镛评注
2012-11-22
2013-04-10
续满
重构:改善既有代码的设计=Refactoring:improving the design of existing code/(美)Martin Fowler著;熊节译
2012-11-22
2013-04-10
续满
Android高薪之路 [专著]:Android程序员面试宝典/李宁编著
2012-11-29
2013-04-10
续满
Android技术内幕 [专著]·系统卷=Android internals·System/杨丰盛著
2012-12-04
2013-04-10
续满
我编程, 我快乐 [专著]:程序员职业规划之道=The passionate programmer:creating a remarkable career in software development/(美) Chad Fowler著;于梦瑄译
2013-01-17
2013-04-17
续满
Complete code:

package moniLogin;

import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Set;

import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.NameValuePair;
import org.apache.http.ParseException;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.client.params.ClientPNames;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.message.BasicNameValuePair;
import org.apache.http.util.EntityUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class test {
    private static String LoginUrl = "http://222.200.98.171:81/login.aspx";
    private static String Host = "http://222.200.98.171:81";
    private static String mainUrl = "";
    private static String borrowedBooksUrl = "";
    private static String cookie = "";
    private static String location = "";

    /**
     * @param args
     */
    public static void main(String[] args) {
        getMyBorrowedBooks();
    }

    public static void getMyBorrowedBooks() {
        try {
            Document document = Jsoup.parse(login());
            // Find the <a> tag by its text
            Elements elements1 = document
                    .getElementsContainingOwnText("当前借阅情况和续借");
            String url = elements1.first().attr("href");
            // Combine the href with mainUrl to form the borrowing-records URL
            borrowedBooksUrl = mainUrl.substring(0,
                    mainUrl.lastIndexOf("/") + 1) + url;
            getBookBorrowedData(getHtml(borrowedBooksUrl));
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    /**
     * Extracts the borrowing records (List<BookEntity>).
     *
     * @param src the HTML source of the borrowing-records page
     * @return List<BookEntity>
     */
    private static List<BookEntity> getBookBorrowedData(String src) {
        List<BookEntity> data = new ArrayList<BookEntity>();
        Document document = Jsoup.parse(src);
        Element element = document.select("[id=borrowedcontent]").first()
                .getElementsByTag("table").first();
        Elements elements2 = element.getElementsByTag("tr");
        for (Element temp2 : elements2) {
            Elements elements3 = temp2.getElementsByTag("td");
            // Column 0: renewal status, 1: due date, 2: title, 6: borrow date
            BookEntity entity = new test().new BookEntity()
                    .setIsFullData(elements3.get(0).text())
                    .setData2Return(elements3.get(1).text())
                    .setName(elements3.get(2).text())
                    .setData2Borrowed(elements3.get(6).text());
            data.add(entity);
        }
        // The first row is the table header, so drop it
        data.remove(0);
        System.out.println("借书情况\n"); // "Borrowing records"
        for (BookEntity temp : data) {
            System.out.println(temp.getName() + "\n"
                    + temp.getData2Borrowed() + "\n" + temp.getData2Return()
                    + "\n" + temp.getIsFullData());
        }
        return data;
    }

    /**
     * Logs in to the library.
     *
     * @return the HTML of the page shown after login
     * @throws ClientProtocolException
     * @throws IOException
     */
    public static String login() throws ClientProtocolException, IOException {
        List<NameValuePair> parmasList = initLoginParmas("3110006527", "your-password");
        HttpPost post = new HttpPost(LoginUrl);
        // Disable automatic redirects so that the Cookie and Location
        // of the first response can be read
        post.getParams().setParameter(ClientPNames.HANDLE_REDIRECTS, false);
        // Set the encoding to GBK
        post.setHeader("Content-Type",
                "application/x-www-form-urlencoded;charset=gbk");
        post.setEntity(new UrlEncodedFormEntity(parmasList, "GBK"));
        HttpResponse response = new DefaultHttpClient().execute(post);
        // Read the cookie and keep it
        cookie = response.getFirstHeader("Set-Cookie").getValue();
        // System.out.println("cookie= " + cookie);
        // Redirect address, used to reach the home page
        location = response.getFirstHeader("Location").getValue();
        // Build the home page URL
        mainUrl = Host + location;
        String html = getHtml(mainUrl);
        return html;
    }

    /**
     * Fetches the HTML source of a page.
     *
     * @param url the page URL
     * @return the HTML source
     * @throws ParseException
     * @throws IOException
     */
    private static String getHtml(String url) throws ParseException,
            IOException {
        HttpGet get = new HttpGet(url);
        // Compare string contents, not references, before attaching the cookie
        if (!cookie.isEmpty()) {
            get.addHeader("Cookie", cookie);
        }
        HttpResponse httpResponse = new DefaultHttpClient().execute(get);
        HttpEntity entity = httpResponse.getEntity();
        return EntityUtils.toString(entity);
    }

    /**
     * Initializes the login parameters.
     *
     * @param userName the library card number
     * @param passWord the password
     * @return the form fields as name/value pairs
     * @throws ParseException
     * @throws IOException
     */
    public static List<NameValuePair> initLoginParmas(String userName,
            String passWord) throws ParseException, IOException {
        List<NameValuePair> parmasList = new ArrayList<NameValuePair>();
        HashMap<String, String> parmasMap = getLoginFormData(LoginUrl);
        Set<String> keySet = parmasMap.keySet();
        for (String temp : keySet) {
            if (temp.contains("Username")) {
                parmasMap.put(temp, userName);
            } else if (temp.contains("txtPas")) {
                parmasMap.put(temp, passWord);
            }
        }
        Set<String> keySet2 = parmasMap.keySet();
        System.out.println("表单内容:"); // "Form contents:"
        for (String temp : keySet2) {
            System.out.println(temp + " = " + parmasMap.get(temp));
        }
        for (String temp : keySet2) {
            parmasList.add(new BasicNameValuePair(temp, parmasMap.get(temp)));
        }
        // System.out.println("initParams \n" + parmasMap);
        return parmasList;
    }

    /**
     * Collects the input fields of the login form.
     *
     * @param url the login page URL
     * @return a map from each input's name to its value
     * @throws IOException
     * @throws ParseException
     */
    public static HashMap<String, String> getLoginFormData(String url)
            throws ParseException, IOException {
        Document document = Jsoup.parse(getHtml(url));
        // Find all form elements
        Elements element1 = document.getElementsByTag("form");
        // Keep the form whose submit method is post
        Element element = element1.select("[method=post]").first();
        // Take the input tags in that form that have a name attribute
        Elements elements = element.select("input[name]");
        HashMap<String, String> parmas = new HashMap<String, String>();
        for (Element temp : elements) {
            // Store each input's name and value in the map
            parmas.put(temp.attr("name"), temp.attr("value"));
        }
        return parmas;
    }

    class BookEntity {
        /** Title */
        private String name;
        /** Number of copies available for loan */
        private String leandableNum;
        /** Call number */
        private String callNumber;
        /** Author */
        private String writer;
        /** Publisher */
        private String publisher;
        /** Due date */
        private String data2Return;
        /** Borrow date */
        private String data2Borrowed;
        /** Whether the renewal limit has been reached */
        private String isFullData;

        public BookEntity() {
        }

        public String getName() {
            return name;
        }

        public String getLeandableNum() {
            return leandableNum;
        }

        public String getCallNumber() {
            return callNumber;
        }

        public String getWriter() {
            return writer;
        }

        public String getPublisher() {
            return publisher;
        }

        public BookEntity setName(String name) {
            this.name = name;
            return this;
        }

        public BookEntity setLeandableNum(String leandableNum) {
            this.leandableNum = leandableNum;
            return this;
        }

        public BookEntity setCallNumber(String callNumber) {
            this.callNumber = callNumber;
            return this;
        }

        public BookEntity setWriter(String writer) {
            this.writer = writer;
            return this;
        }

        public BookEntity setPublisher(String publisher) {
            this.publisher = publisher;
            return this;
        }

        public String getData2Return() {
            return data2Return;
        }

        public String getData2Borrowed() {
            return data2Borrowed;
        }

        public String getIsFullData() {
            return isFullData;
        }

        public BookEntity setData2Return(String data2Return) {
            this.data2Return = data2Return;
            return this;
        }

        public BookEntity setData2Borrowed(String data2Borrowed) {
            this.data2Borrowed = data2Borrowed;
            return this;
        }

        public BookEntity setIsFullData(String isFullData) {
            this.isFullData = isFullData;
            return this;
        }
    }
}
How to use Jsoup is not covered in detail here.
For details, see this site: http://www.open-open.com/jsoup/