diddyrock

Notes

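Fix for garbled text in cached page snapshots: edit cached.jsp and change content = new String(bean.getContent(details)) to content = new String(bean.getContent(details),"utf-8"). Without the charset argument the bytes are decoded with the platform default encoding.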
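To see why the explicit charset matters, here is a minimal standalone sketch. It is not Nutch code: the class name SnapshotDecoding and the sample string are made up for illustration, and it uses StandardCharsets.UTF_8 (Java 7+) where the JSP fix passes the name "utf-8".

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Nutch.
class SnapshotDecoding {
    // Decode raw page bytes with an explicit charset instead of the platform default.
    static String decodeUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "caf\u00e9" contains one non-ASCII character, which takes two bytes in UTF-8.
        byte[] raw = "caf\u00e9".getBytes(StandardCharsets.UTF_8);

        // Decoding the same bytes with the wrong charset yields one character per byte.
        String wrong = new String(raw, StandardCharsets.ISO_8859_1);
        String right = decodeUtf8(raw);

        System.out.println("wrong length: " + wrong.length());
        System.out.println("right length: " + right.length());
    }
}
```

The wrongly decoded string comes out one character longer, which is the same kind of damage that shows up as garbage in the snapshot page.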
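Also, on Tomcat 6 and later, a JSP attribute value quoted with " must not contain another unescaped " inside it; use ' for the inner quotes instead.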

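protocol:
Every lookup returns the protocol implementation that lives in a plugin,
FetcherThread -> protocolFactory -> extension -> instance.
parserFactory.getParsers() can be changed to pick up custom plugins. Apparently there is also some carrot2 plugin; worth a look when I have time!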
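The factory -> extension -> instance chain can be pictured roughly as below. This is only a sketch of the idea, assuming a registry keyed by URL scheme; ProtocolLookupSketch, its Protocol interface and the register method are invented for illustration and are not Nutch's ProtocolFactory API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical stand-in for a plugin-backed protocol factory.
class ProtocolLookupSketch {
    // Minimal stand-in for a protocol plugin's implementation.
    interface Protocol {
        String name();
    }

    // scheme -> "extension" that knows how to create a protocol instance
    private final Map<String, Supplier<Protocol>> extensions = new HashMap<>();

    void register(String scheme, Supplier<Protocol> extension) {
        extensions.put(scheme, extension);
    }

    // Resolve the URL scheme, find the matching extension, ask it for an instance.
    Protocol getProtocol(String urlString) {
        int idx = urlString.indexOf(':');
        if (idx <= 0) {
            throw new IllegalArgumentException("No scheme in URL: " + urlString);
        }
        String scheme = urlString.substring(0, idx).toLowerCase();
        Supplier<Protocol> extension = extensions.get(scheme);
        if (extension == null) {
            throw new IllegalArgumentException("No protocol plugin for scheme: " + scheme);
        }
        return extension.get();
    }
}
```

A caller would register one supplier per scheme up front and then call getProtocol(url) for each URL, which mirrors how the fetcher thread only ever sees the instance handed back by the factory.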


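crawl does not use a thread pool. I wasted ages hunting for a threadpool that isn't there! Hand-writing your own thread pool is no longer recommended in Java anyway, and Nutch doesn't have one: during fetch it simply creates a bunch of threads itself, and each thread reads one key/value pair from the fetcher's input and uses it as the address to crawl.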
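The "plain threads pulling from a shared input" pattern looks roughly like this. It is a simplified sketch, not the Fetcher source: FetchThreadsSketch and fetchAll are hypothetical names, the shared input is an in-memory queue rather than a Hadoop record reader, and the actual fetch is replaced by a placeholder.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical sketch of fetcher-style worker threads without a thread pool.
class FetchThreadsSketch {
    // Starts threadCount plain threads; each repeatedly takes one URL from the shared input.
    static Set<String> fetchAll(List<String> urls, int threadCount) {
        Queue<String> input = new ConcurrentLinkedQueue<>(urls);
        Set<String> fetched = ConcurrentHashMap.newKeySet();
        List<Thread> threads = new ArrayList<>();

        for (int i = 0; i < threadCount; i++) {
            Thread t = new Thread(() -> {
                String url;
                // poll() returns null once the shared input is exhausted
                while ((url = input.poll()) != null) {
                    fetched.add(url); // placeholder for the real protocol fetch
                }
            }, "fetcher-" + i);
            threads.add(t);
            t.start();
        }

        // Wait for every worker to drain the input before returning.
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while waiting for fetch threads", e);
            }
        }
        return fetched;
    }
}
```

Calling FetchThreadsSketch.fetchAll(urls, 10) would process every URL exactly once across ten threads, since each poll() hands a given entry to only one thread.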

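Calling the injector is fairly simple: it first creates the datum, then writes it out to a directory. The generator then runs Selector.class as the map step. The selector could be modified to hook in plugins, URL lists and the like, but that would be far too invasive.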

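If the link does not exist it is simply dropped; otherwise the selector calls output.collect(sortValue, entry); where entry holds both the crawlDatum and the url. Surprisingly, these two are stored as separate fields.
In the generator, partitioning is done by hostPartitioner.getPartition(((SelectorEntry)value).url,  key,numReduceTasks); this matters because the partition is computed from the url inside the entry, not from the sort key.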
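The effect of partitioning by host is that all URLs of one host land in the same reduce task. A minimal sketch of that idea, assuming a plain hash of the host name (HostPartitionSketch and partitionFor are hypothetical names, and Nutch's own partitioner may compute the value differently):

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical sketch of host-based partitioning, not the Nutch class.
class HostPartitionSketch {
    // Maps a URL to a reduce task index in [0, numReduceTasks) based on its host.
    static int partitionFor(String urlString, int numReduceTasks) {
        String host;
        try {
            host = new URL(urlString).getHost();
        } catch (MalformedURLException e) {
            host = urlString; // fall back to the raw string for unparsable URLs
        }
        // Mask the sign bit so the modulo result is never negative.
        return (host.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int a = partitionFor("http://example.com/page1", 4);
        int b = partitionFor("http://example.com/other?x=1", 4);
        System.out.println("same partition: " + (a == b));
    }
}
```

Because the path and query are ignored, two pages on the same host always agree on the partition, which is what makes a per-host limit in the reducer possible.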

The selector has two main selection modes: by IP, or not by IP.
          if (byIP) {
            try {
              InetAddress ia = InetAddress.getByName(host);
              host = ia.getHostAddress();
            } catch (UnknownHostException uhe) {
              if (LOG.isDebugEnabled()) {
                LOG.debug("DNS lookup failed: " + host + ", skipping.");
              }
              dnsFailure++;
              if ((dnsFailure % 1000 == 0) && (LOG.isWarnEnabled())) {
                LOG.warn("DNS failures: " + dnsFailure);
              }
              continue;
            }
          }

The key part follows:
          u = new URL(u.getProtocol(), host, u.getPort(), u.getFile());
          String urlString = u.toString();
          try {
            urlString = normalizers.normalize(urlString, URLNormalizers.SCOPE_GENERATE_HOST_COUNT);
            host = new URL(urlString).getHost();
          } catch (Exception e) {
            LOG.warn("Malformed URL: '" + urlString + "', skipping (" +
                StringUtils.stringifyException(e) + ")");
            continue;
          }
          IntWritable hostCount = (IntWritable)hostCounts.get(host);
          if (hostCount == null) {
            hostCount = new IntWritable();
            hostCounts.put(host, hostCount);
          }

          // increment hostCount
          hostCount.set(hostCount.get() + 1);

          // skip URL if above the limit per host.
          if (hostCount.get() > maxPerHost) {
            if (hostCount.get() == maxPerHost + 1) {
              if (LOG.isInfoEnabled()) {
                LOG.info("Host " + host + " has more than " + maxPerHost +
                         " URLs." + " Skipping additional.");
              }
            }
            continue;
          }
        }
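Stripped of the Hadoop types, logging and normalizers, the per-host cap in the fragment above comes down to the following. This is a simplified sketch: HostLimitSketch and limitPerHost are hypothetical names, a plain HashMap of Integer replaces the IntWritable counters, and malformed URLs are skipped silently.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the per-host URL cap, not the Generator source.
class HostLimitSketch {
    // Keeps at most maxPerHost URLs for each host, in input order.
    static List<String> limitPerHost(List<String> urls, int maxPerHost) {
        Map<String, Integer> hostCounts = new HashMap<>();
        List<String> kept = new ArrayList<>();
        for (String urlString : urls) {
            String host;
            try {
                host = new URL(urlString).getHost();
            } catch (MalformedURLException e) {
                continue; // skip malformed URLs, as the original fragment does
            }
            // Increment the counter for this host (starting at 1 if unseen).
            int count = hostCounts.merge(host, 1, Integer::sum);
            if (count > maxPerHost) {
                continue; // over the limit for this host
            }
            kept.add(urlString);
        }
        return kept;
    }
}
```

With three URLs on one host, one on another, and maxPerHost set to 2, the third URL of the first host is the only one dropped.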

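Next, the inverse mapper is called, which writes the url and the datum into a segment.
Then crawlDbUpdater is called: output.collect(key, orig(datum));
Both of these later jobs take the output of the first job as their input. There is no algorithm involved here; they just run one after another, nothing technically interesting.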