As the previous article on Crawl showed, the crawl proceeds stage by stage, one phase after another. So let's start with the Injector (org.apache.nutch.crawl.Injector), which Crawl invokes like this:
// initialize crawlDb
injector.inject(crawlDb, rootUrlDir);
From this code it is obvious that Nutch is built on top of Hadoop, though it still uses the old MapReduce API.
The Injector has two main functions:
1. Normalize and filter the URLs in the seed file, writing the results to a temporary folder.
2. Merge those results with the old crawldb/current to produce a new CrawlDb that replaces the existing one.
public void inject(Path crawlDb, Path urlDir) throws IOException {
  SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
  long start = System.currentTimeMillis();
  if (LOG.isInfoEnabled()) {
    LOG.info("Injector: starting at " + sdf.format(start));
    LOG.info("Injector: crawlDb: " + crawlDb);
    LOG.info("Injector: urlDir: " + urlDir);
  }

  // create a temporary directory for the map-reduce output
  Path tempDir = new Path(getConf().get("mapred.temp.dir", ".") +
      "/inject-temp-" + Integer.toString(new Random().nextInt(Integer.MAX_VALUE)));

  // map text input file to a <url,CrawlDatum> file
  if (LOG.isInfoEnabled()) {
    LOG.info("Injector: Converting injected urls to crawl db entries.");
  }
  JobConf sortJob = new NutchJob(getConf());
  sortJob.setJobName("inject " + urlDir);
  FileInputFormat.addInputPath(sortJob, urlDir);
  sortJob.setMapperClass(InjectMapper.class);
  FileOutputFormat.setOutputPath(sortJob, tempDir);
  sortJob.setOutputFormat(SequenceFileOutputFormat.class);
  sortJob.setOutputKeyClass(Text.class);
  // the output value type is CrawlDatum
  sortJob.setOutputValueClass(CrawlDatum.class);
  sortJob.setLong("injector.current.time", System.currentTimeMillis());
  // submit the job
  RunningJob mapJob = JobClient.runJob(sortJob);

  long urlsInjected = mapJob.getCounters().findCounter("injector", "urls_injected").getValue();
  long urlsFiltered = mapJob.getCounters().findCounter("injector", "urls_filtered").getValue();
  LOG.info("Injector: total number of urls rejected by filters: " + urlsFiltered);
  LOG.info("Injector: total number of urls injected after normalization and filtering: " + urlsInjected);

  // merge with the existing crawl db
  if (LOG.isInfoEnabled()) {
    LOG.info("Injector: Merging injected urls into crawl db.");
  }
  JobConf mergeJob = CrawlDb.createJob(getConf(), crawlDb);
  FileInputFormat.addInputPath(mergeJob, tempDir);
  mergeJob.setReducerClass(InjectReducer.class);
  JobClient.runJob(mergeJob);
  CrawlDb.install(mergeJob, crawlDb);

  // clean up: delete the temporary folder
  FileSystem fs = FileSystem.get(getConf());
  fs.delete(tempDir, true);
  long end = System.currentTimeMillis();
  LOG.info("Injector: finished at " + sdf.format(end) + ", elapsed: " + TimingUtil.elapsedTime(start, end));
}
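As a usage note, inject can also be driven directly from code. The sketch below is a hypothetical driver, roughly equivalent to running bin/nutch inject crawl/crawldb urls; the class name and the crawl/crawldb and urls paths are illustrative, not from the Nutch source, and it assumes the Injector(Configuration) convenience constructor.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.nutch.crawl.Injector;
import org.apache.nutch.util.NutchConfiguration;

// Hypothetical driver sketch; paths are illustrative.
public class InjectDemo {
  public static void main(String[] args) throws Exception {
    // loads nutch-default.xml and nutch-site.xml
    Configuration conf = NutchConfiguration.create();
    Injector injector = new Injector(conf);
    // first arg: the CrawlDb directory; second: the folder holding seed url files
    injector.inject(new Path("crawl/crawldb"), new Path("urls"));
  }
}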
Next, let's look at InjectMapper, which does the per-URL work on the seed file:
public void configure(JobConf job) {
  this.jobConf = job;
  // initialize the URLNormalizers (URL normalizers) for the INJECT scope
  urlNormalizers = new URLNormalizers(job, URLNormalizers.SCOPE_INJECT);
  interval = jobConf.getInt("db.fetch.interval.default", 2592000);
  // initialize the URL filters
  filters = new URLFilters(jobConf);
  // initialize the scoring filters
  scfilters = new ScoringFilters(jobConf);
  // the initial score assigned to newly injected urls
  scoreInjected = jobConf.getFloat("db.score.injected", 1.0f);
  curTime = job.getLong("injector.current.time", System.currentTimeMillis());
}
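The fields initialized above are consumed in map(). Below is a simplified sketch of that method, kept to the normalize/filter/score path to show how one text line becomes a <url, CrawlDatum> pair; the full version in the Nutch source handles a few additional details (e.g. counters and optional per-URL metadata) omitted here.

// Simplified sketch of InjectMapper.map (details omitted)
public void map(WritableComparable key, Text value,
    OutputCollector<Text, CrawlDatum> output, Reporter reporter) throws IOException {
  String url = value.toString();          // one URL per input line
  try {
    url = urlNormalizers.normalize(url, URLNormalizers.SCOPE_INJECT); // normalize
    url = filters.filter(url);            // returns null if any filter rejects the url
  } catch (Exception e) {
    url = null;
  }
  if (url != null) {                      // url passed normalization and all filters
    value.set(url);
    CrawlDatum datum = new CrawlDatum(CrawlDatum.STATUS_INJECTED, interval);
    datum.setFetchTime(curTime);
    datum.setScore(scoreInjected);        // initial score from db.score.injected
    try {
      scfilters.injectedScore(value, datum); // let scoring plugins adjust the score
    } catch (ScoringFilterException e) {
      datum.setScore(scoreInjected);      // fall back to the configured default
    }
    output.collect(value, datum);
  }
}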
The findExtensions method loads the URL normalizer plugins for a given scope; each scope can be configured with its own set and order of normalizers:
/**
 * searches a list of suitable url normalizer plugins for the given scope.
 *
 * @param scope
 *          Scope for which we seek a url normalizer plugin.
 * @return List - List of extensions to be used for this scope. If none,
 *         returns null.
 * @throws PluginRuntimeException
 */
private List<Extension> findExtensions(String scope) {
  String[] orders = null;
  String orderlist = conf.get("urlnormalizer.order." + scope);
  if (orderlist == null)
    orderlist = conf.get("urlnormalizer.order");
  if (orderlist != null && !orderlist.trim().equals("")) {
    orders = orderlist.trim().split("\\s+");
  }
  String scopelist = conf.get("urlnormalizer.scope." + scope);
  Set<String> impls = null;
  if (scopelist != null && !scopelist.trim().equals("")) {
    String[] names = scopelist.split("\\s+");
    impls = new HashSet<String>(Arrays.asList(names));
  }
  Extension[] extensions = this.extensionPoint.getExtensions();
  HashMap<String, Extension> normalizerExtensions = new HashMap<String, Extension>();
  for (int i = 0; i < extensions.length; i++) {
    Extension extension = extensions[i];
    if (impls != null && !impls.contains(extension.getClazz()))
      continue;
    normalizerExtensions.put(extension.getClazz(), extension);
  }
  List<Extension> res = new ArrayList<Extension>();
  if (orders == null) {
    res.addAll(normalizerExtensions.values());
  } else {
    // first add those explicitly named in correct order
    for (int i = 0; i < orders.length; i++) {
      Extension e = normalizerExtensions.get(orders[i]);
      if (e != null) {
        res.add(e);
        normalizerExtensions.remove(orders[i]);
      }
    }
    // then add all others in random order
    res.addAll(normalizerExtensions.values());
  }
  return res;
}
The urlnormalizer-related configuration:
<!-- URL normalizer properties -->
<property>
  <name>urlnormalizer.order</name>
  <value>org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer
         org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer</value>
  <description>Order in which normalizers will run. If any of these isn't
  activated it will be silently skipped. If other normalizers not on the
  list are activated, they will run in random order after the ones
  specified here are run.
  </description>
</property>
<property>
  <name>urlnormalizer.regex.file</name>
  <value>regex-normalize.xml</value>
  <description>Name of the config file used by the RegexUrlNormalizer class.
  </description>
</property>
<property>
  <name>urlnormalizer.loop.count</name>
  <value>1</value>
  <description>Optionally loop through normalizers several times, to make
  sure that all transformations have been performed.
  </description>
</property>
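To make the ordering concrete, here is a small usage sketch: URLNormalizers.normalize(String, String) runs every configured normalizer in the order resolved above. The class name, example URL, and expected output are illustrative assumptions, not from the Nutch source.

import java.net.MalformedURLException;
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.net.URLNormalizers;
import org.apache.nutch.util.NutchConfiguration;

// Hypothetical usage sketch; the URL is made up.
public class NormalizeDemo {
  public static void main(String[] args) throws MalformedURLException {
    Configuration conf = NutchConfiguration.create();
    URLNormalizers normalizers = new URLNormalizers(conf, URLNormalizers.SCOPE_INJECT);
    // BasicURLNormalizer runs first, then RegexURLNormalizer, per urlnormalizer.order
    String normalized = normalizers.normalize("HTTP://Example.COM/a/../b.html",
        URLNormalizers.SCOPE_INJECT);
    System.out.println(normalized); // e.g. http://example.com/b.html
  }
}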
URLFilters is initialized the same way: the filter extensions are looked up from the plugin repository via its extension point:
ExtensionPoint point = PluginRepository.get(conf).getExtensionPoint(URLFilter.X_POINT_ID);
if (point == null)
  throw new RuntimeException(URLFilter.X_POINT_ID + " not found.");
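Once the extensions are resolved, filtering is a simple chain: each plugin gets the URL in turn, and a null return from any of them rejects it. A sketch of that loop, essentially what URLFilters.filter does:

// Sketch of the filter chain: null from any plugin means the url is rejected
public String filter(String urlString) throws URLFilterException {
  for (int i = 0; i < this.filters.length; i++) {
    if (urlString == null)
      return null;
    urlString = this.filters[i].filter(urlString);
  }
  return urlString;
}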
/**
 * @return a cached instance of the plugin repository
 */
public static synchronized PluginRepository get(Configuration conf) {
  String uuid = NutchConfiguration.getUUID(conf);
  if (uuid == null) {
    uuid = "nonNutchConf@" + conf.hashCode(); // fallback
  }
  PluginRepository result = CACHE.get(uuid);
  // not cached yet: create and cache a new instance
  if (result == null) {
    result = new PluginRepository(conf);
    CACHE.put(uuid, result);
  }
  return result;
}
public PluginRepository(Configuration conf) throws RuntimeException {
  // initialize the map of activated plugins
  fActivatedPlugins = new HashMap<String, Plugin>();
  // initialize the map of extension points
  fExtensionPoints = new HashMap<String, ExtensionPoint>();
  this.conf = conf;
  // read config: whether plugin dependencies are auto-activated
  this.auto = conf.getBoolean("plugin.auto-activation", true);
  // read config: the folders where plugins are stored
  String[] pluginFolders = conf.getStrings("plugin.folders");
  // helper that walks the plugin folders looking for plugin.xml (one per plugin)
  // and builds a PluginDescriptor for each one found
  PluginManifestParser manifestParser = new PluginManifestParser(conf, this);
  Map<String, PluginDescriptor> allPlugins = manifestParser.parsePluginFolder(pluginFolders);
  // regular expression of plugins to exclude
  Pattern excludes = Pattern.compile(conf.get("plugin.excludes", ""));
  // regular expression of plugins to include
  Pattern includes = Pattern.compile(conf.get("plugin.includes", ""));
  // filter out the plugins that will not be used
  Map<String, PluginDescriptor> filteredPlugins = filter(excludes, includes, allPlugins);
  // check plugin dependencies
  fRegisteredPlugins = getDependencyCheckedPlugins(filteredPlugins,
      this.auto ? allPlugins : filteredPlugins);
  // install the extension points
  installExtensionPoints(fRegisteredPlugins);
  try {
    installExtensions(fRegisteredPlugins);
  } catch (PluginRuntimeException e) {
    LOG.error(e.toString());
    throw new RuntimeException(e.getMessage());
  }
  displayStatus();
}
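Because the include/exclude patterns come from configuration, which plugins get registered can be controlled without code changes. A hedged sketch (the class name is hypothetical and the regex value is a short example, not the full default list from nutch-default.xml):

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.plugin.PluginRepository;
import org.apache.nutch.util.NutchConfiguration;

// Hypothetical sketch: restricting which plugins are loaded
public class PluginSelectionDemo {
  public static void main(String[] args) {
    Configuration conf = NutchConfiguration.create();
    // only plugins whose id matches plugin.includes (and not plugin.excludes) are registered
    conf.set("plugin.includes", "protocol-http|urlfilter-regex|parse-html");
    conf.set("plugin.excludes", "");
    PluginRepository repo = PluginRepository.get(conf);
  }
}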
/**
 * Returns a list of all found plugin descriptors.
 *
 * @param pluginFolders
 *          folders to search plugins from
 * @return A {@link Map} of all found {@link PluginDescriptor}s.
 */
public Map<String, PluginDescriptor> parsePluginFolder(String[] pluginFolders) {
  Map<String, PluginDescriptor> map = new HashMap<String, PluginDescriptor>();
  if (pluginFolders == null) {
    throw new IllegalArgumentException("plugin.folders is not defined");
  }
  for (String name : pluginFolders) {
    File directory = getPluginFolder(name);
    if (directory == null) {
      continue;
    }
    LOG.info("Plugins: looking in: " + directory.getAbsolutePath());
    for (File oneSubFolder : directory.listFiles()) {
      if (oneSubFolder.isDirectory()) {
        String manifestPath = oneSubFolder.getAbsolutePath() + File.separator + "plugin.xml";
        try {
          LOG.debug("parsing: " + manifestPath);
          PluginDescriptor p = parseManifestFile(manifestPath);
          map.put(p.getPluginId(), p);
        } catch (MalformedURLException e) {
          LOG.warn(e.toString());
        } catch (SAXException e) {
          LOG.warn(e.toString());
        } catch (IOException e) {
          LOG.warn(e.toString());
        } catch (ParserConfigurationException e) {
          LOG.warn(e.toString());
        }
      }
    }
  }
  return map;
}