
A small app for crawling remote blog posts into a local database (strangled in the cradle)

Author	Post
C_J  Posts: 331  Points: 330  From: Beijing	Posted: 2010-07-25  Last edited: 2010-10-21

Preface: I was bored out of my mind today and decided to build my own blog. The front end needs a good-looking page, and the back end needs to crawl XXX. XXX offers RSS, so I assumed that fetching the page would be the easy part. I did not expect what came next.

Page: I put a page together first, borrowing from the CSS experts.

Rot: I assumed that fetching the XML page with URLConnection would be enough, but XXX rejected the request outright:

    <body>
    <div style="padding:50px 0 0 300px">
    <h1>Your access has been denied</h1>
    <p>You may be using a web crawler!</p>
    XXXXXXXXX
    </div>
    </body>

- -! So the natural next step was to build the HTTP packet by hand and send it straight to port 80 of XXX. After a few hours of tinkering, XXX no longer rejected me outright, but because I don't understand the HTTP protocol deeply enough, the request still did not succeed:

    www.XXXXX.com/XXX.XXX.XXX.XXX 80
    HTTP/1.1 400 Bad Request
    Connection: close
    Content-Type: text/html
    Content-Length: 349
    Date: Sat, 24 Jul 2010 16:52:47 GMT
    Server: lighttpd/1.4.20

    <?xml version="1.0" encoding="iso-8859-1"?>
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
             "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
     <head>
      <title>400 - Bad Request</title>
     </head>
     <body>
      <h1>400 - Bad Request</h1>
     </body>
    </html>

Fed up, I dropped the hand-built HTTP packets and went back to URLConnection with a spoofed User-Agent. To my surprise, the feed actually came back. Phew!!

    <?xml version="1.0" encoding="UTF-8" ?>
    <rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
    xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
    xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
    </rss>
    </xml>

XML (to be continued)

With the blog's InputStream in hand, the next step is to parse the XML stream and store it in the back-end database.

    package org.blog.xml;

    import java.io.IOException;
    import java.io.InputStream;
    import java.util.Map;

    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.parsers.ParserConfigurationException;

    import org.w3c.dom.Document;
    import org.w3c.dom.Element;
    import org.w3c.dom.Node;
    import org.w3c.dom.NodeList;
    import org.xml.sax.SAXException;

    /**
     * @author cjcj
     */
    public class XMLParser {

        // Parse the feed into a DOM tree and print the title of every <item>.
        public Document parser(InputStream is)
                throws ParserConfigurationException, SAXException, IOException {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            DocumentBuilder builder = f.newDocumentBuilder();
            Document doc = builder.parse(is);
            getItems(doc.getDocumentElement());
            return doc;
        }

        private Map<String, String> getItems(Element n) {
            if (n == null) throw new NullPointerException();
            // get the item..
            NodeList nl = n.getElementsByTagName("item");
            for (int i = 0; nl != null && i < nl.getLength(); ++i) {
                Element et = (Element) nl.item(i);
                System.out.println(getTextValue(et, "title")); // get the title....
            }
            return null; // the map is not populated yet
        }

        // Text of the first <tagNm> descendant of e, or null if it is absent or empty.
        private String getTextValue(Element e, String tagNm) {
            NodeList nl = e.getElementsByTagName(tagNm);
            if (nl == null || nl.getLength() == 0) return null;
            Node first = nl.item(0).getFirstChild();
            return first != null ? first.getNodeValue() : null;
        }
    }

A second listing of the same class, which takes a Node and only prints its name and value:

    package org.blog.xml;

    import java.io.IOException;
    import java.io.InputStream;
    import java.util.Map;

    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.parsers.ParserConfigurationException;

    import org.w3c.dom.Document;
    import org.w3c.dom.Node;
    import org.xml.sax.SAXException;

    /*
     * @author cjcj
     */
    public class XMLParser {

        public Document parser(InputStream is)
                throws ParserConfigurationException, SAXException, IOException {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            DocumentBuilder builder = f.newDocumentBuilder();
            Document doc = builder.parse(is);
            getItems(doc);
            return doc;
        }

        public Map<String, String> getItems(Node n) {
            if (n == null) throw new NullPointerException();
            //Map<String,String> items=new HashMap<String,String>();
            //NodeList lists=doc.getChildNodes();
            System.out.println(n.getNodeName());
            System.out.println(n.getNodeValue());
            //NamedNodeMap map=n.getAttributes();
            //Node lists=map.getNamedItem("item");
            return null;
        }
    }

Filter

Compression

DB

Smart update detection and timer

Option 1: decide whether the feed has been updated by comparing the <pubDate></pubDate> tags.
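Option 1 is not spelled out in the post, so the following is only a sketch of one way the pubDate comparison might look. It assumes the feed uses the RFC 822 style dates that RSS 2.0 prescribes, and it uses java.time (Java 8 or later, so newer than this thread). The class name PubDateUpdateCheck, the methods parsePubDate and newTitles, and the sample feed are all made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.time.Instant;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of the original post.
class PubDateUpdateCheck {

    /** Parses a pubDate such as "Sat, 24 Jul 2010 16:52:47 GMT" into an Instant. */
    static Instant parsePubDate(String text) {
        return ZonedDateTime.parse(text.trim(), DateTimeFormatter.RFC_1123_DATE_TIME).toInstant();
    }

    /** Returns the titles of all items published strictly after lastSeen. */
    static List<String> newTitles(String rssXml, Instant lastSeen) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(rssXml.getBytes(StandardCharsets.UTF_8)));
        NodeList items = doc.getElementsByTagName("item");
        List<String> titles = new ArrayList<>();
        for (int i = 0; i < items.getLength(); i++) {
            Element item = (Element) items.item(i);
            String pubDate = text(item, "pubDate");
            // Items without a pubDate are skipped rather than guessed at.
            if (pubDate != null && parsePubDate(pubDate).isAfter(lastSeen)) {
                titles.add(text(item, "title"));
            }
        }
        return titles;
    }

    /** Text content of the first descendant named tag, or null if there is none. */
    private static String text(Element parent, String tag) {
        NodeList nl = parent.getElementsByTagName(tag);
        return nl.getLength() > 0 ? nl.item(0).getTextContent() : null;
    }

    public static void main(String[] args) throws Exception {
        // Made-up two-item feed; the second item is newer than lastSeen.
        String rss = "<rss version=\"2.0\"><channel>"
                + "<item><title>old post</title>"
                + "<pubDate>Sat, 24 Jul 2010 16:52:47 GMT</pubDate></item>"
                + "<item><title>new post</title>"
                + "<pubDate>Sun, 25 Jul 2010 08:00:00 +0800</pubDate></item>"
                + "</channel></rss>";
        Instant lastSeen = parsePubDate("Sat, 24 Jul 2010 16:52:47 GMT");
        System.out.println(newTitles(rss, lastSeen)); // prints [new post]
    }
}
```

Comparing Instants rather than raw strings keeps the check correct when the feed mixes time zones. A timer would then only need to remember the newest pubDate it has stored and pass it in as lastSeen on the next run.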

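heqishan  Level: Junior member  Posts: 43  Points: 0  From: Guangzhou	Posted: 2010-07-25
Looking forward to the continuation. By the way, why not just use WordPress for your own blog? Also, isn't this post a bit too thin?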

onlylau  Level: Junior member  Posts: 147  Points: 0  From: Nanjing	Posted: 2010-07-25
A crawler I wrote a while ago ran into a site that used cookies to block crawling.

danielli007  Level: Junior member  Posts: 29  Points: 0  From: ...	Posted: 2010-07-25
You really are very obviously bored!

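C_J  Posts: 331  Points: 330  From: Beijing	Posted: 2010-07-25  Last edited: 2010-07-25
heqishan wrote: Looking forward to the continuation. By the way, why not just use WordPress for your own blog? Also, isn't this post a bit too thin?
This seems to touch on security issues, so it is probably better not to go into detail. I wrote it myself simply because I was bored :) Mainly I wanted to remind the XXX site to think a bit more about security. Could the poster above be more specific about the cookie approach?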
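On the cookie question: onlylau gives no details, so the following is only a guess at the general shape. One common pattern is that the site sets a cookie on the first response and refuses later requests that do not send it back. Assuming that is the case, installing a CookieManager makes HttpURLConnection store and resend cookies automatically. The class name CookieSetup and the method install are made up for illustration.

```java
import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.CookiePolicy;

// Hypothetical helper, not from the thread.
class CookieSetup {

    /**
     * Installs a JVM-wide cookie handler. After this call, every
     * HttpURLConnection records Set-Cookie response headers and sends the
     * matching cookies back on later requests to the same server.
     */
    static CookieManager install() {
        // null store = default in-memory store; only accept cookies
        // whose domain matches the server that set them.
        CookieManager manager = new CookieManager(null, CookiePolicy.ACCEPT_ORIGINAL_SERVER);
        CookieHandler.setDefault(manager);
        return manager;
    }
}
```

Calling CookieSetup.install() once at startup, before the first connection is opened, would be enough. If the site computes its cookie in JavaScript instead, this alone will not help.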

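qiren83  Level: Junior member  Posts: 191  Points: 30  From: Shanghai	Posted: 2010-07-25
Can't you be more specific, for example about how you used the User-Agent? Or just post some source code so everyone can study it. Thanks in advance.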

taoyu3781212  Level: Junior member  Posts: 20  Points: 0  From: Beijing	Posted: 2010-07-26  Last edited: 2010-07-26
urlConnection.setRequestProperty("User-Agent", "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)");
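For context, here is a minimal sketch of where that one-liner might sit in a fetch. The thread does not show C_J's actual code, so the class name FeedFetcher, its methods, and the 10-second timeouts are assumptions. The User-Agent string is the one quoted above.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;

// Hypothetical helper, not the original poster's code.
class FeedFetcher {

    // Browser-like User-Agent string taken from the reply above.
    static final String BROWSER_UA =
            "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)";

    /** Creates a connection with the User-Agent set; nothing is sent yet. */
    static URLConnection prepare(URL feedUrl) throws IOException {
        URLConnection conn = feedUrl.openConnection();
        // Request properties must be set before the connection is opened.
        conn.setRequestProperty("User-Agent", BROWSER_UA);
        conn.setConnectTimeout(10_000); // milliseconds, arbitrary choice
        conn.setReadTimeout(10_000);
        return conn;
    }

    /** Opens the feed and returns its body as a stream; the caller closes it. */
    static InputStream open(URL feedUrl) throws IOException {
        return prepare(feedUrl).getInputStream();
    }
}
```

The stream returned by open could then be handed to a parser such as the XMLParser.parser(InputStream) method from the first post.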

southgate  Level: Junior member  Posts: 77  Points: 10	Posted: 2010-07-26
Wouldn't HttpClient be less trouble?

pochonlee  Level: Junior member  Posts: 47  Points: 0  From: Shanghai	Posted: 2010-07-26
Too little substance...

luoyahu  Level: Junior member  Posts: 238  Points: 40  From: Mars	Posted: 2010-07-26
Fooled by the original poster. All questions and no answers.
