xtuhcy

Scraping All JD Product Information with the Java Crawler gecco (Part 1)


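The gecco crawler

If you are not yet familiar with gecco, take a look at the gecco home page on GitHub. gecco is a very simple, easy-to-use crawler: scraping all of JD's product information takes only nine classes.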

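Analyzing the JD site

To scrape all product information from JD, we first need to analyze the site. The JD site can be divided roughly into three levels: the home page links through categories to product list pages, and each product on a list page has its own detail page. So once we have found every category, we can scrape the products one category at a time.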
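The three-level traversal described above can be sketched in plain Java without gecco. This is only an illustration of the crawl order, assuming a toy in-memory "site"; the class name, the method name, and every path below are made up for the example and involve no network access.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model of the three-level crawl: category index -> list pages -> detail pages.
// The "site" is an in-memory map of page -> outgoing links (hypothetical paths).
final class ThreeLevelCrawlSketch {

    // Breadth-first walk from the entry page. Pages without outgoing links
    // are treated as detail pages and collected in the order they are reached.
    static List<String> collectDetailPages(Map<String, List<String>> site, String entry) {
        Deque<String> queue = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        List<String> details = new ArrayList<>();
        queue.add(entry);
        seen.add(entry);
        while (!queue.isEmpty()) {
            String page = queue.poll();
            List<String> links = site.getOrDefault(page, List.of());
            if (links.isEmpty()) {
                details.add(page);
                continue;
            }
            for (String link : links) {
                // Skip pages that are already queued or visited.
                if (seen.add(link)) {
                    queue.add(link);
                }
            }
        }
        return details;
    }

    public static void main(String[] args) {
        Map<String, List<String>> site = Map.of(
            "/allSort", List.of("/list/phones", "/list/appliances"),
            "/list/phones", List.of("/item/1", "/item/2"),
            "/list/appliances", List.of("/item/3"));
        System.out.println(collectDetailPages(site, "/allSort"));
    }
}
```

In a real crawl the queue and the page fetching are handled by the crawler framework; the sketch only shows why finding every category is enough to reach every product.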

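Entry URL

http://www.jd.com/allSort.aspx is the category index for all JD products. We use this page as the starting point for scraping all of JD's product information.

Creating AllSort, the HtmlBean class for the start page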

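import java.util.List;

// gecco imports (package names as in the gecco project; adjust to your version)
import com.geccocrawler.gecco.annotation.Gecco;
import com.geccocrawler.gecco.annotation.HtmlField;
import com.geccocrawler.gecco.annotation.Request;
import com.geccocrawler.gecco.request.HttpRequest;
import com.geccocrawler.gecco.spider.HtmlBean;

// Bean for the category index page; results go to the two named pipelines.
@Gecco(matchUrl="http://www.jd.com/allSort.aspx", pipelines={"consolePipeline", "allSortPipeline"})
public class AllSort implements HtmlBean {

    private static final long serialVersionUID = 665662335318691818L;

    // The request that produced this page, injected by gecco
    @Request
    private HttpRequest request;

    // Mobile phones
    @HtmlField(cssPath=".category-items > div:nth-child(1) > div:nth-child(2) > div.mc > div.items > dl")
    private List<Category> mobile;

    // Household appliances
    @HtmlField(cssPath=".category-items > div:nth-child(1) > div:nth-child(3) > div.mc > div.items > dl")
    private List<Category> domestic;

    public List<Category> getMobile() {
        return mobile;
    }

    public void setMobile(List<Category> mobile) {
        this.mobile = mobile;
    }

    public List<Category> getDomestic() {
        return domestic;
    }

    public void setDomestic(List<Category> domestic) {
        this.domestic = domestic;
    }

    public HttpRequest getRequest() {
        return request;
    }

    public void setRequest(HttpRequest request) {
        this.request = request;
    }
}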

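As an example, this class scrapes only two top-level categories: mobile phones and household appliances. Each top-level category contains several sub-category groups, represented as List<Category>. gecco supports nested beans, which map well onto the structure of an HTML page. Category holds the contents of one sub-category group, and HrefBean is a shared, reusable link bean.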

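import java.util.List;

// gecco imports (package names as in the gecco project; adjust to your version)
import com.geccocrawler.gecco.annotation.HtmlField;
import com.geccocrawler.gecco.annotation.Text;
import com.geccocrawler.gecco.spider.HrefBean;
import com.geccocrawler.gecco.spider.HtmlBean;

// One <dl> block on the category index: a parent name plus its sub-category links.
public class Category implements HtmlBean {

    private static final long serialVersionUID = 3018760488621382659L;

    // Text of the parent category link
    @Text
    @HtmlField(cssPath="dt a")
    private String parentName;

    // Links to the sub-categories under this parent
    @HtmlField(cssPath="dd a")
    private List<HrefBean> categorys;

    public String getParentName() {
        return parentName;
    }

    public void setParentName(String parentName) {
        this.parentName = parentName;
    }

    public List<HrefBean> getCategorys() {
        return categorys;
    }

    public void setCategorys(List<HrefBean> categorys) {
        this.categorys = categorys;
    }

}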
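To make the nested shape concrete, here is a small plain-Java model of the data these two beans end up holding. It is a sketch only: Link and Group are hypothetical stand-ins for HrefBean and Category, not gecco types, and the titles and URLs are invented.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java stand-ins for the nested beans (not gecco classes).
final class NestedCategoryModel {

    // Stand-in for a link bean: a title plus a URL.
    static final class Link {
        final String title;
        final String url;

        Link(String title, String url) {
            this.title = title;
            this.url = url;
        }
    }

    // Stand-in for one <dl>: the parent name ("dt a") and its links ("dd a").
    static final class Group {
        final String parentName;
        final List<Link> links;

        Group(String parentName, List<Link> links) {
            this.parentName = parentName;
            this.links = links;
        }
    }

    // Flattens several top-level sections (e.g. mobile and domestic)
    // into "parent / child -> url" lines, preserving page order.
    static List<String> flatten(List<List<Group>> sections) {
        List<String> lines = new ArrayList<>();
        for (List<Group> section : sections) {
            for (Group group : section) {
                for (Link link : group.links) {
                    lines.add(group.parentName + " / " + link.title + " -> " + link.url);
                }
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        List<Group> mobile = List.of(
            new Group("Phones", List.of(new Link("Smartphones", "http://example.com/list?cat=1"))));
        List<Group> domestic = List.of(
            new Group("Large appliances", List.of(new Link("Televisions", "http://example.com/list?cat=2"))));
        flatten(List.of(mobile, domestic)).forEach(System.out::println);
    }
}
```

The three nested loops mirror the page structure: top-level section, then group, then link.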

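A tip for obtaining an element's cssPath

The hard part of the two classes above is working out the cssPath values, so here is a tip. Open the page you want to scrape in Chrome and press F12 to open the developer tools. Select the element you want to extract.

With that element selected in the panel on the right side of the browser, right-click and choose Copy > Copy selector to get the element's cssPath: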

body > div:nth-child(5) > div.main-classify > div.list > div.category-items.clearfix > div:nth-child(1) > div:nth-child(2) > div.mc > div.items

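If you are familiar with jQuery selectors, you will see that the leading part of this path is not needed. We also want only the dl elements, so the selector can be simplified to: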

.category-items > div:nth-child(1) > div:nth-child(2) > div.mc > div.items > dl
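The manual simplification above can also be expressed as a small helper. This is a hedged sketch rather than part of gecco: the class and method names are invented, and it assumes the copied selector uses only the child combinator " > ", as in the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that shortens a selector copied from the browser.
final class SelectorSimplifier {

    // Drops every step before the first one carrying anchorClass, reduces that
    // step to ".anchorClass", keeps the rest, and appends childTag as the last step.
    static String simplify(String copied, String anchorClass, String childTag) {
        String[] steps = copied.trim().split("\\s*>\\s*");
        List<String> kept = new ArrayList<>();
        for (String step : steps) {
            if (kept.isEmpty()) {
                if (hasClass(step, anchorClass)) {
                    kept.add("." + anchorClass);
                }
            } else {
                kept.add(step);
            }
        }
        if (kept.isEmpty()) {
            throw new IllegalArgumentException("class not found in selector: " + anchorClass);
        }
        kept.add(childTag);
        return String.join(" > ", kept);
    }

    // True if a compound step such as "div.a.b" contains the class name.
    private static boolean hasClass(String step, String cls) {
        String[] parts = step.split("\\.");
        for (int i = 1; i < parts.length; i++) {
            String name = parts[i];
            int colon = name.indexOf(':'); // ignore a trailing pseudo-class
            if (colon >= 0) {
                name = name.substring(0, colon);
            }
            if (name.equals(cls)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String copied = "body > div:nth-child(5) > div.main-classify > div.list"
            + " > div.category-items.clearfix > div:nth-child(1) > div:nth-child(2)"
            + " > div.mc > div.items";
        System.out.println(simplify(copied, "category-items", "dl"));
    }
}
```

Anchoring on a class name instead of the full path from body tends to be less sensitive to layout changes higher up in the page, which is the main reason for shortening the copied selector.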

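Writing the business-logic class for AllSort

Once the AllSort bean has been populated, we need to process it. Here we do not persist the category information; we only extract the category links so that the product list pages can be scraped next. Here is the code: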

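import java.util.List;

// gecco imports (package names as in the gecco project; adjust to your version)
import com.geccocrawler.gecco.annotation.PipelineName;
import com.geccocrawler.gecco.pipeline.Pipeline;
import com.geccocrawler.gecco.request.HttpRequest;
import com.geccocrawler.gecco.scheduler.SchedulerContext;
import com.geccocrawler.gecco.spider.HrefBean;

@PipelineName("allSortPipeline")
public class AllSortPipeline implements Pipeline<AllSort> {

    @Override
    public void process(AllSort allSort) {
        // Only the mobile phone categories are followed in this example
        List<Category> categorys = allSort.getMobile();
        for (Category category : categorys) {
            List<HrefBean> hrefs = category.getCategorys();
            for (HrefBean href : hrefs) {
                // Restrict the list page to JD-operated, in-stock products
                String url = href.getUrl() + "&delivery=1&page=1&JL=4_10_0&go=0";
                HttpRequest currRequest = allSort.getRequest();
                // Queue the list page as a sub-request of the current request
                SchedulerContext.into(currRequest.subRequest(url));
            }
        }
    }

}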

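@PipelineName defines the name of this pipeline, which is referenced from the @Gecco annotation on AllSort. After gecco has fetched the page and populated the bean, it calls each pipeline listed in @Gecco in turn. The purpose of appending "&delivery=1&page=1&JL=4_10_0&go=0" to each sub-category link is to scrape only products that are sold by JD itself and are in stock. The SchedulerContext.into() method puts the link into the queue, where it waits to be crawled.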
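Note that the pipeline joins the filter parameters with "&", which assumes every category link already carries a query string. The following plain-Java sketch shows one way to make that step independent of the link format and models the pending-link queue with a FIFO. It does not use gecco; the class name, method names, and the example.com URLs are assumptions for illustration, and URLs containing a "#" fragment are not handled.

```java
import java.util.ArrayDeque;
import java.util.List;
import java.util.Queue;

// Sketch of the "append filters, then enqueue" step, without gecco.
final class CategoryUrlQueueSketch {

    // Same filter parameters as in the pipeline, without the leading '&'.
    static final String FILTERS = "delivery=1&page=1&JL=4_10_0&go=0";

    // Appends the filters using '?' or '&' depending on whether
    // the URL already has a query string.
    static String withFilters(String url, String filters) {
        if (url.endsWith("?") || url.endsWith("&")) {
            return url + filters;
        }
        return url + (url.contains("?") ? "&" : "?") + filters;
    }

    // Stand-in for the crawler's queue: links are crawled in the order they were added.
    static Queue<String> enqueueAll(List<String> categoryUrls) {
        Queue<String> pending = new ArrayDeque<>();
        for (String url : categoryUrls) {
            pending.add(withFilters(url, FILTERS));
        }
        return pending;
    }

    public static void main(String[] args) {
        Queue<String> pending = enqueueAll(List.of(
            "http://list.example.com/list.html?cat=1,2,3",
            "http://list.example.com/other.html"));
        while (!pending.isEmpty()) {
            System.out.println(pending.poll());
        }
    }
}
```

In the gecco pipeline, the equivalent change would be to build the URL with a helper like withFilters before passing it to subRequest.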


2016-02-24 16:44