Notes

Fix for the garbled cached-page (snapshot) view: in the page cached.jsp, change content = new String(bean.getContent(details)) to content = new String(bean.getContent(details), "utf-8").
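Why the explicit charset matters, as a minimal standalone sketch (plain Java, nothing Nutch-specific; the byte array stands in for what bean.getContent(details) returns):

import java.io.UnsupportedEncodingException;

public class CharsetDemo {
  public static void main(String[] args) throws UnsupportedEncodingException {
    // stand-in for bean.getContent(details): the raw UTF-8 bytes of a cached page
    byte[] raw = "网页快照".getBytes("utf-8");

    // new String(raw) decodes with the JVM's default charset; on a
    // non-UTF-8 platform (e.g. GBK) this is exactly what garbles the snapshot
    String maybeGarbled = new String(raw);

    // decoding explicitly as UTF-8 reproduces the original text
    String correct = new String(raw, "utf-8");

    System.out.println(maybeGarbled);
    System.out.println(correct);
  }
}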
Another gotcha: from Tomcat 6 onward, a JSP attribute value can no longer contain a nested double quote ("); use a single quote (') inside the attribute instead.

protocol:
Each fetch returns the protocol implementation that lives in a plugin; the lookup chain is FetcherThread -> ProtocolFactory -> extension -> instance.
parserFactory.getParsers() is the hook for plugging in custom parsers; supposedly there's a carrot2 plugin somewhere, worth a look when there's time!
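A rough sketch of that lookup chain against the Nutch 1.x API (ProtocolFactory, Protocol and ProtocolOutput are real Nutch classes, but treat the exact signatures as approximate since they have shifted between versions):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.protocol.Protocol;
import org.apache.nutch.protocol.ProtocolFactory;
import org.apache.nutch.protocol.ProtocolOutput;
import org.apache.nutch.util.NutchConfiguration;

public class ProtocolLookupDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();
    // ProtocolFactory consults the plugin repository and picks the
    // extension registered for the URL's scheme (http, ftp, file, ...)
    ProtocolFactory factory = new ProtocolFactory(conf);
    String url = "http://example.com/";
    Protocol protocol = factory.getProtocol(url);  // extension -> instance
    ProtocolOutput out = protocol.getProtocolOutput(new Text(url), new CrawlDatum());
    System.out.println(out.getStatus());
  }
}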


Crawl does not use a thread pool; I spent ages hunting for a ThreadPool class that simply isn't there. Java no longer encourages hand-rolling thread pools anyway, and Nutch doesn't bother either: Fetcher just creates a batch of threads itself, each of which reads one key/value pair from the fetcher's input as the URL to crawl.
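The pattern, sketched in plain Java (an illustration of the hand-rolled-threads approach, not Nutch's actual Fetcher code; Nutch takes the thread count from fetcher.threads.fetch):

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class FetcherThreadsSketch {
  public static void main(String[] args) {
    // stand-in for the fetcher's input: a queue of URLs to crawl
    BlockingQueue<String> input = new LinkedBlockingQueue<>();
    input.add("http://example.com/a");
    input.add("http://example.com/b");
    input.add("http://example.com/c");

    int threads = 10; // Nutch reads this from fetcher.threads.fetch
    for (int i = 0; i < threads; i++) {
      new Thread(() -> {
        String url;
        // each thread pulls one URL at a time until the input is drained
        while ((url = input.poll()) != null) {
          System.out.println(Thread.currentThread().getName() + " -> " + url);
        }
      }, "FetcherThread-" + i).start();
    }
  }
}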

Invoking the Injector is straightforward: it first produces a CrawlDatum per seed URL and writes the result out to a directory; the Generator then runs Selector.class as its map phase. Selector could be modified to support plugins or URL lists and the like, but that would be painfully invasive.
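Roughly what the inject map step boils down to (a hedged sketch using Nutch 1.x / old Hadoop API names; the constructor signature has varied across versions, and the real InjectMapper also runs URL normalizers and filters before emitting):

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.nutch.crawl.CrawlDatum;

public class InjectSketch {
  private int interval = 30 * 24 * 3600;  // default fetch interval in seconds (assumed)
  private float scoreInjected = 1.0f;     // initial score for newly injected URLs (assumed)

  public void map(WritableComparable key, Text line,
                  OutputCollector<Text, CrawlDatum> output, Reporter reporter)
      throws IOException {
    String url = line.toString().trim();  // seed file: one URL per line
    CrawlDatum datum = new CrawlDatum(CrawlDatum.STATUS_INJECTED, interval);
    datum.setScore(scoreInjected);
    // the (url, datum) pairs are what end up in the crawldb directory
    output.collect(new Text(url), datum);
  }
}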

Links that don't exist are simply chopped; otherwise it does output.collect(sortValue, entry), where entry holds both the crawlDatum and the url (oddly, the two are stored separately).
In the Generator, the partitioner is hostPartitioner.getPartition(((SelectorEntry)value).url, key, numReduceTasks); this part is important.
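The point of that partitioner is that every URL of a given host lands in the same reduce partition, so per-host limits can be enforced inside a single fetch list. A self-contained sketch of the idea (this mirrors what Nutch's host partitioner does, not its exact code):

import java.net.MalformedURLException;
import java.net.URL;

public class HostPartitionSketch {
  // hash the host, not the full URL, so all URLs of one host
  // go to the same reducer / segment part
  public static int getPartition(String url, int numReduceTasks) {
    String host;
    try {
      host = new URL(url).getHost();
    } catch (MalformedURLException e) {
      host = url; // fall back to the raw string for unparseable URLs
    }
    return (host.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }

  public static void main(String[] args) {
    System.out.println(getPartition("http://example.com/a", 4));
    System.out.println(getPartition("http://example.com/b", 4)); // same partition
  }
}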

Selector has two grouping modes: by IP or not (byIP):
if (byIP) {
  try {
    // resolve the host name to an IP so counting/limiting is done per IP
    InetAddress ia = InetAddress.getByName(host);
    host = ia.getHostAddress();
  } catch (UnknownHostException uhe) {
    if (LOG.isDebugEnabled()) {
      LOG.debug("DNS lookup failed: " + host + ", skipping.");
    }
    dnsFailure++;
    // only warn every 1000 failures to avoid flooding the log
    if ((dnsFailure % 1000 == 0) && (LOG.isWarnEnabled())) {
      LOG.warn("DNS failures: " + dnsFailure);
    }
    continue; // skip this URL entirely
  }
}

Here is the key part:
u = new URL(u.getProtocol(), host, u.getPort(), u.getFile());
String urlString = u.toString();
try {
  // normalize so that equivalent URLs count against the same host
  urlString = normalizers.normalize(urlString,
      URLNormalizers.SCOPE_GENERATE_HOST_COUNT);
  host = new URL(urlString).getHost();
} catch (Exception e) {
  LOG.warn("Malformed URL: '" + urlString + "', skipping (" +
      StringUtils.stringifyException(e) + ")");
  continue;
}

// look up (or create) the running URL count for this host
IntWritable hostCount = (IntWritable) hostCounts.get(host);
if (hostCount == null) {
  hostCount = new IntWritable();
  hostCounts.put(host, hostCount);
}

// increment hostCount
hostCount.set(hostCount.get() + 1);

// skip URL if above the limit per host.
if (hostCount.get() > maxPerHost) {
  // log only once per host, when the limit is first exceeded
  if (hostCount.get() == maxPerHost + 1) {
    if (LOG.isInfoEnabled()) {
      LOG.info("Host " + host + " has more than " + maxPerHost +
          " URLs." + " Skipping additional.");
    }
  }
  continue;
}
} // end of the per-host limiting block

Next, the inverse mapper is invoked to write the url and the datum out into a segment.
Then CrawlDbUpdater is called, with output.collect(key, orig(datum)).
The last two jobs both take the first job's output as input; no algorithms are involved there, it's plain sequential execution with nothing technically demanding.
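Roughly what that inverse step looks like (a sketch, with SelectorEntry as a local stand-in for Generator's inner class of the same name):

import java.io.IOException;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.nutch.crawl.CrawlDatum;

public class InverseSketch {
  // stand-in for Generator.SelectorEntry, which bundles the url and datum
  static class SelectorEntry {
    Text url;
    CrawlDatum datum;
  }

  // the Selector emitted (score, entry) so the shuffle sorted by score;
  // this flips it back to (url, datum) so the segment is keyed by URL again
  public void map(FloatWritable score, SelectorEntry entry,
                  OutputCollector<Text, CrawlDatum> output, Reporter reporter)
      throws IOException {
    output.collect(entry.url, entry.datum);
  }
}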