Disabling robots.txt checks in Heritrix

neilone.cn

Blog categories: j2ee, network protocols, search engines

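robots.txt is a file used to restrict web crawlers. When building a website, you can place a robots.txt file on the site and declare in it the parts you do not want search engines to visit. The downside is that Heritrix spends a lot of crawl time checking whether this robots.txt file exists. Fortunately, the protocol is only an add-on convention, so a crawler is not technically forced to follow it.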
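
To make the idea of "declaring the parts you do not want visited" concrete, here is a minimal sketch of how a crawler might read `Disallow` rules and test a path against them. It is not Heritrix's parser: the class `RobotsRulesSketch` and its methods are hypothetical names chosen for illustration, and it assumes only the `User-agent: *` group matters and that rules are plain path prefixes (no wildcards, no `Allow` lines, no trailing comments).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class RobotsRulesSketch {

    /** Collects the Disallow prefixes listed under "User-agent: *". */
    static List<String> disallowedPrefixes(String robotsTxt) {
        List<String> prefixes = new ArrayList<>();
        boolean inWildcardGroup = false;
        for (String raw : robotsTxt.split("\n")) {
            String line = raw.trim();
            int colon = line.indexOf(':');
            if (line.startsWith("#") || colon < 0) {
                continue; // skip comments, blank lines, and malformed lines
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                inWildcardGroup = value.equals("*");
            } else if (field.equals("disallow") && inWildcardGroup && !value.isEmpty()) {
                prefixes.add(value);
            }
        }
        return prefixes;
    }

    /** A path is allowed unless it starts with one of the disallowed prefixes. */
    static boolean isAllowed(String path, List<String> prefixes) {
        for (String prefix : prefixes) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String robotsTxt = "User-agent: *\nDisallow: /private/\nDisallow: /tmp/\n";
        List<String> rules = disallowedPrefixes(robotsTxt);
        System.out.println(isAllowed("/index.html", rules));
        System.out.println(isAllowed("/private/data.html", rules));
    }
}
```

A crawler that honors the protocol has to download and parse this file for every host before fetching anything else from it, which is the overhead discussed here.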

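The class org.archive.crawler.prefetch.PreconditionEnforcer in Heritrix defines the method that handles fetching robots.txt. My choice is to behave as if robots.txt does not exist, whether or not it actually does, by having the method return false immediately. The modified method is as follows: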

    private boolean considerRobotsPreconditions(CrawlURI curi) {
        // To improve crawl efficiency, ignore the robots.txt protocol entirely
        return false;
        // treat /robots.txt fetches specially
        /*UURI uuri = curi.getUURI();
        try {
            if (uuri != null && uuri.getPath() != null &&
                    curi.getUURI().getPath().equals("/robots.txt")) {
                // allow processing to continue
                curi.setPrerequisite(true);
                return false;
            }
        }
        catch (URIException e) {
            logger.severe("Failed get of path for " + curi);
        }
        // require /robots.txt if not present
        if (isRobotsExpired(curi)) {
        	// Need to get robots
            if (logger.isLoggable(Level.FINE)) {
                logger.fine( "No valid robots for " +
                    getController().getServerCache().getServerFor(curi) +
                    "; deferring " + curi);
            }

            // Robots expired - should be refetched even though its already
            // crawled.
            try {
                String prereq = curi.getUURI().resolve("/robots.txt").toString();
                curi.markPrerequisite(prereq,
                    getController().getPostprocessorChain());
            }
            catch (URIException e1) {
                logger.severe("Failed resolve using " + curi);
                throw new RuntimeException(e1); // shouldn't ever happen
            }
            return true;
        }
        // test against robots.txt if available
        CrawlServer cs = getController().getServerCache().getServerFor(curi);
        if(cs.isValidRobots()){
            String ua = getController().getOrder().getUserAgent(curi);
            if(cs.getRobots().disallows(curi, ua)) {
                if(((Boolean)getUncheckedAttribute(curi,ATTR_CALCULATE_ROBOTS_ONLY)).booleanValue() == true) {
                    // annotate URI as excluded, but continue to process normally
                    curi.addAnnotation("robotExcluded");
                    return false; 
                }
                // mark as precluded; in FetchHTTP, this will
                // prevent fetching and cause a skip to the end
                // of processing (unless an intervening processor
                // overrules)
                curi.setFetchStatus(S_ROBOTS_PRECLUDED);
                curi.putString("error","robots.txt exclusion");
                logger.fine("robots.txt precluded " + curi);
                return true;
            }
            return false;
        }
        // No valid robots found => Attempt to get robots.txt failed
        curi.skipToProcessorChain(getController().getPostprocessorChain());
        curi.setFetchStatus(S_ROBOTS_PREREQUISITE_FAILURE);
        curi.putString("error","robots.txt prerequisite failed");
        if (logger.isLoggable(Level.FINE)) {
            logger.fine("robots.txt prerequisite failed " + curi);
        }*/
        //return true;
    }
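
The effect of the early `return false` can be illustrated with a toy model of the calling side. This is only a sketch under assumptions: `PreconditionSketch`, `RobotsCheck`, and `fetchable` are hypothetical names, not Heritrix classes, and the model reduces the contract to "true means the URI is held back (deferred or precluded), false means processing continues to the fetch step".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PreconditionSketch {

    /** Hypothetical stand-in for a robots precondition check; not Heritrix's API. */
    interface RobotsCheck {
        /** Returns true when the URI must be held back instead of fetched now. */
        boolean mustWait(String uri);
    }

    /** Returns the URIs that would go straight to fetching under the given check. */
    static List<String> fetchable(List<String> uris, RobotsCheck check) {
        List<String> result = new ArrayList<>();
        for (String uri : uris) {
            if (check.mustWait(uri)) {
                continue; // held back: deferred or excluded
            }
            result.add(uri);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> uris = Arrays.asList(
                "http://example.com/a",
                "http://example.com/private/b");

        // A check that honors an assumed "Disallow: /private/" rule.
        RobotsCheck honoring = uri -> uri.contains("/private/");
        // The modification above: never hold anything back.
        RobotsCheck ignoring = uri -> false;

        System.out.println(fetchable(uris, honoring));
        System.out.println(fetchable(uris, ignoring));
    }
}
```

With the always-false check every URI proceeds to fetching, which removes the robots.txt overhead but also means excluded paths are crawled.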


2010-09-12 21:50