As the previous article on Crawl showed, the crawl proceeds stage by stage, one phase after another. So let's start with the Injector (org.apache.nutch.crawl.Injector), which Crawl invokes like this:
// initialize crawlDb
injector.inject(crawlDb, rootUrlDir);
From this code it is obvious that Nutch is built on top of Hadoop, though it still uses the old MapReduce API.
The Injector has two main functions:
1. Normalize and filter the URLs in the seed file, writing the results to a temporary folder.
2. Merge those results with the old crawldb/current to produce a new CrawlDb that replaces the existing one.
public void inject(Path crawlDb, Path urlDir) throws IOException {
  SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
  long start = System.currentTimeMillis();
  if (LOG.isInfoEnabled()) {
    LOG.info("Injector: starting at " + sdf.format(start));
    LOG.info("Injector: crawlDb: " + crawlDb);
    LOG.info("Injector: urlDir: " + urlDir);
  }

  // create a temporary directory for the map-reduce output
  Path tempDir = new Path(getConf().get("mapred.temp.dir", ".") +
      "/inject-temp-" + Integer.toString(new Random().nextInt(Integer.MAX_VALUE)));

  // map text input file to a <url,CrawlDatum> file
  if (LOG.isInfoEnabled()) {
    LOG.info("Injector: Converting injected urls to crawl db entries.");
  }
  JobConf sortJob = new NutchJob(getConf());
  sortJob.setJobName("inject " + urlDir);
  FileInputFormat.addInputPath(sortJob, urlDir);
  sortJob.setMapperClass(InjectMapper.class);
  FileOutputFormat.setOutputPath(sortJob, tempDir);
  sortJob.setOutputFormat(SequenceFileOutputFormat.class);
  sortJob.setOutputKeyClass(Text.class);
  // the output value type is CrawlDatum
  sortJob.setOutputValueClass(CrawlDatum.class);
  sortJob.setLong("injector.current.time", System.currentTimeMillis());
  // submit the job
  RunningJob mapJob = JobClient.runJob(sortJob);

  long urlsInjected = mapJob.getCounters().findCounter("injector", "urls_injected").getValue();
  long urlsFiltered = mapJob.getCounters().findCounter("injector", "urls_filtered").getValue();
  LOG.info("Injector: total number of urls rejected by filters: " + urlsFiltered);
  LOG.info("Injector: total number of urls injected after normalization and filtering: " + urlsInjected);

  // merge with the existing crawl db
  if (LOG.isInfoEnabled()) {
    LOG.info("Injector: Merging injected urls into crawl db.");
  }
  JobConf mergeJob = CrawlDb.createJob(getConf(), crawlDb);
  FileInputFormat.addInputPath(mergeJob, tempDir);
  mergeJob.setReducerClass(InjectReducer.class);
  JobClient.runJob(mergeJob);
  CrawlDb.install(mergeJob, crawlDb);

  // clean up: delete the temporary folder
  FileSystem fs = FileSystem.get(getConf());
  fs.delete(tempDir, true);
  long end = System.currentTimeMillis();
  LOG.info("Injector: finished at " + sdf.format(end) + ", elapsed: " + TimingUtil.elapsedTime(start, end));
}
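As a usage note, inject can also be driven directly from code. The sketch below is a hypothetical driver, roughly equivalent to running bin/nutch inject crawl/crawldb urls; the class name and the crawl/crawldb and urls paths are illustrative, not from the Nutch source, and it assumes the Injector(Configuration) convenience constructor.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.nutch.crawl.Injector;
import org.apache.nutch.util.NutchConfiguration;

// Hypothetical driver sketch; paths are illustrative.
public class InjectDemo {
  public static void main(String[] args) throws Exception {
    // loads nutch-default.xml and nutch-site.xml
    Configuration conf = NutchConfiguration.create();
    Injector injector = new Injector(conf);
    // first arg: the CrawlDb directory; second: the folder holding seed url files
    injector.inject(new Path("crawl/crawldb"), new Path("urls"));
  }
}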
Next, let's look at InjectMapper, which does the per-URL work on the seed file:
public void configure(JobConf job) {
  this.jobConf = job;
  // initialize the URLNormalizers (URL normalizers) for the INJECT scope
  urlNormalizers = new URLNormalizers(job, URLNormalizers.SCOPE_INJECT);
  interval = jobConf.getInt("db.fetch.interval.default", 2592000);
  // initialize the URL filters
  filters = new URLFilters(jobConf);
  // initialize the scoring filters
  scfilters = new ScoringFilters(jobConf);
  // the initial score assigned to newly injected urls
  scoreInjected = jobConf.getFloat("db.score.injected", 1.0f);
  curTime = job.getLong("injector.current.time", System.currentTimeMillis());
}
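The fields initialized above are consumed in map(). Below is a simplified sketch of that method, kept to the normalize/filter/score path to show how one text line becomes a <url, CrawlDatum> pair; the full version in the Nutch source handles a few additional details (e.g. counters and optional per-URL metadata) omitted here.

// Simplified sketch of InjectMapper.map (details omitted)
public void map(WritableComparable key, Text value,
    OutputCollector<Text, CrawlDatum> output, Reporter reporter) throws IOException {
  String url = value.toString();          // one URL per input line
  try {
    url = urlNormalizers.normalize(url, URLNormalizers.SCOPE_INJECT); // normalize
    url = filters.filter(url);            // returns null if any filter rejects the url
  } catch (Exception e) {
    url = null;
  }
  if (url != null) {                      // url passed normalization and all filters
    value.set(url);
    CrawlDatum datum = new CrawlDatum(CrawlDatum.STATUS_INJECTED, interval);
    datum.setFetchTime(curTime);
    datum.setScore(scoreInjected);        // initial score from db.score.injected
    try {
      scfilters.injectedScore(value, datum); // let scoring plugins adjust the score
    } catch (ScoringFilterException e) {
      datum.setScore(scoreInjected);      // fall back to the configured default
    }
    output.collect(value, datum);
  }
}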
The findExtensions method loads the URL normalizer plugins for a given scope; each scope can be configured with its own set and order of normalizers:
/**
 * searches a list of suitable url normalizer plugins for the given scope.
 *
 * @param scope
 *          Scope for which we seek a url normalizer plugin.
 * @return List - List of extensions to be used for this scope. If none,
 *         returns null.
 * @throws PluginRuntimeException
 */
private List<Extension> findExtensions(String scope) {
  String[] orders = null;
  String orderlist = conf.get("urlnormalizer.order." + scope);
  if (orderlist == null)
    orderlist = conf.get("urlnormalizer.order");
  if (orderlist != null && !orderlist.trim().equals("")) {
    orders = orderlist.trim().split("\\s+");
  }
  String scopelist = conf.get("urlnormalizer.scope." + scope);
  Set<String> impls = null;
  if (scopelist != null && !scopelist.trim().equals("")) {
    String[] names = scopelist.split("\\s+");
    impls = new HashSet<String>(Arrays.asList(names));
  }
  Extension[] extensions = this.extensionPoint.getExtensions();
  HashMap<String, Extension> normalizerExtensions = new HashMap<String, Extension>();
  for (int i = 0; i < extensions.length; i++) {
    Extension extension = extensions[i];
    if (impls != null && !impls.contains(extension.getClazz()))
      continue;
    normalizerExtensions.put(extension.getClazz(), extension);
  }
  List<Extension> res = new ArrayList<Extension>();
  if (orders == null) {
    res.addAll(normalizerExtensions.values());
  } else {
    // first add those explicitly named in correct order
    for (int i = 0; i < orders.length; i++) {
      Extension e = normalizerExtensions.get(orders[i]);
      if (e != null) {
        res.add(e);
        normalizerExtensions.remove(orders[i]);
      }
    }
    // then add all others in random order
    res.addAll(normalizerExtensions.values());
  }
  return res;
}
The urlnormalizer-related configuration:
<!-- URL normalizer properties -->
<property>
  <name>urlnormalizer.order</name>
  <value>org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer
         org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer</value>
  <description>Order in which normalizers will run. If any of these isn't
  activated it will be silently skipped. If other normalizers not on the
  list are activated, they will run in random order after the ones
  specified here are run.
  </description>
</property>
<property>
  <name>urlnormalizer.regex.file</name>
  <value>regex-normalize.xml</value>
  <description>Name of the config file used by the RegexUrlNormalizer class.
  </description>
</property>
<property>
  <name>urlnormalizer.loop.count</name>
  <value>1</value>
  <description>Optionally loop through normalizers several times, to make
  sure that all transformations have been performed.
  </description>
</property>
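To make the ordering concrete, here is a small usage sketch: URLNormalizers.normalize(String, String) runs every configured normalizer in the order resolved above. The class name, example URL, and expected output are illustrative assumptions, not from the Nutch source.

import java.net.MalformedURLException;
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.net.URLNormalizers;
import org.apache.nutch.util.NutchConfiguration;

// Hypothetical usage sketch; the URL is made up.
public class NormalizeDemo {
  public static void main(String[] args) throws MalformedURLException {
    Configuration conf = NutchConfiguration.create();
    URLNormalizers normalizers = new URLNormalizers(conf, URLNormalizers.SCOPE_INJECT);
    // BasicURLNormalizer runs first, then RegexURLNormalizer, per urlnormalizer.order
    String normalized = normalizers.normalize("HTTP://Example.COM/a/../b.html",
        URLNormalizers.SCOPE_INJECT);
    System.out.println(normalized); // e.g. http://example.com/b.html
  }
}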
URLFilters is initialized the same way: the filter extensions are looked up from the plugin repository via its extension point:
ExtensionPoint point = PluginRepository.get(conf).getExtensionPoint(URLFilter.X_POINT_ID);
if (point == null)
  throw new RuntimeException(URLFilter.X_POINT_ID + " not found.");
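Once the extensions are resolved, filtering is a simple chain: each plugin gets the URL in turn, and a null return from any of them rejects it. A sketch of that loop, essentially what URLFilters.filter does:

// Sketch of the filter chain: null from any plugin means the url is rejected
public String filter(String urlString) throws URLFilterException {
  for (int i = 0; i < this.filters.length; i++) {
    if (urlString == null)
      return null;
    urlString = this.filters[i].filter(urlString);
  }
  return urlString;
}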
/**
 * @return a cached instance of the plugin repository
 */
public static synchronized PluginRepository get(Configuration conf) {
  String uuid = NutchConfiguration.getUUID(conf);
  if (uuid == null) {
    uuid = "nonNutchConf@" + conf.hashCode(); // fallback
  }
  PluginRepository result = CACHE.get(uuid);
  // not cached yet: create and cache a new instance
  if (result == null) {
    result = new PluginRepository(conf);
    CACHE.put(uuid, result);
  }
  return result;
}
public PluginRepository(Configuration conf) throws RuntimeException {
  // initialize the map of activated plugins
  fActivatedPlugins = new HashMap<String, Plugin>();
  // initialize the map of extension points
  fExtensionPoints = new HashMap<String, ExtensionPoint>();
  this.conf = conf;
  // read config: whether plugin dependencies are auto-activated
  this.auto = conf.getBoolean("plugin.auto-activation", true);
  // read config: the folders where plugins are stored
  String[] pluginFolders = conf.getStrings("plugin.folders");
  // helper that walks the plugin folders looking for plugin.xml (one per plugin)
  // and builds a PluginDescriptor for each one found
  PluginManifestParser manifestParser = new PluginManifestParser(conf, this);
  Map<String, PluginDescriptor> allPlugins = manifestParser.parsePluginFolder(pluginFolders);
  // regular expression of plugins to exclude
  Pattern excludes = Pattern.compile(conf.get("plugin.excludes", ""));
  // regular expression of plugins to include
  Pattern includes = Pattern.compile(conf.get("plugin.includes", ""));
  // filter out the plugins that will not be used
  Map<String, PluginDescriptor> filteredPlugins = filter(excludes, includes, allPlugins);
  // check plugin dependencies
  fRegisteredPlugins = getDependencyCheckedPlugins(filteredPlugins,
      this.auto ? allPlugins : filteredPlugins);
  // install the extension points
  installExtensionPoints(fRegisteredPlugins);
  try {
    installExtensions(fRegisteredPlugins);
  } catch (PluginRuntimeException e) {
    LOG.error(e.toString());
    throw new RuntimeException(e.getMessage());
  }
  displayStatus();
}
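Because the include/exclude patterns come from configuration, which plugins get registered can be controlled without code changes. A hedged sketch (the class name is hypothetical and the regex value is a short example, not the full default list from nutch-default.xml):

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.plugin.PluginRepository;
import org.apache.nutch.util.NutchConfiguration;

// Hypothetical sketch: restricting which plugins are loaded
public class PluginSelectionDemo {
  public static void main(String[] args) {
    Configuration conf = NutchConfiguration.create();
    // only plugins whose id matches plugin.includes (and not plugin.excludes) are registered
    conf.set("plugin.includes", "protocol-http|urlfilter-regex|parse-html");
    conf.set("plugin.excludes", "");
    PluginRepository repo = PluginRepository.get(conf);
  }
}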
/**
 * Returns a list of all found plugin descriptors.
 *
 * @param pluginFolders
 *          folders to search plugins from
 * @return A {@link Map} of all found {@link PluginDescriptor}s.
 */
public Map<String, PluginDescriptor> parsePluginFolder(String[] pluginFolders) {
  Map<String, PluginDescriptor> map = new HashMap<String, PluginDescriptor>();
  if (pluginFolders == null) {
    throw new IllegalArgumentException("plugin.folders is not defined");
  }
  for (String name : pluginFolders) {
    File directory = getPluginFolder(name);
    if (directory == null) {
      continue;
    }
    LOG.info("Plugins: looking in: " + directory.getAbsolutePath());
    for (File oneSubFolder : directory.listFiles()) {
      if (oneSubFolder.isDirectory()) {
        String manifestPath = oneSubFolder.getAbsolutePath() + File.separator + "plugin.xml";
        try {
          LOG.debug("parsing: " + manifestPath);
          PluginDescriptor p = parseManifestFile(manifestPath);
          map.put(p.getPluginId(), p);
        } catch (MalformedURLException e) {
          LOG.warn(e.toString());
        } catch (SAXException e) {
          LOG.warn(e.toString());
        } catch (IOException e) {
          LOG.warn(e.toString());
        } catch (ParserConfigurationException e) {
          LOG.warn(e.toString());
        }
      }
    }
  }
  return map;
}