Original English source: DissectingTheNutchCrawler
When reposting this article, please credit the source: http://blog.csdn.net/pwlazy
Factory classes: '''URLFilterFactory'''
> Class net.nutch.net.URLFilterFactory
> used by:
> - net.nutch.db.WebDBInjector
> - net.nutch.tools.UpdateDatabaseTool
URLFilterFactory is not strictly part of the crawler, but it is a good extension point within Nutch. Here's how it works:
- When the class is loaded, URLFILTER_CLASS is set to the value returned by NutchConf for the key "urlfilter.class".
- When getFilter() is called, it checks whether the filter class has already been loaded. If not, it is loaded using Class.forName(URLFILTER_CLASS); in either case the filter is then returned.
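The two steps above can be sketched in Java roughly as follows. This is a minimal illustration, not the Nutch source: `UrlFilter`, `AcceptAllFilter`, `UrlFilterFactorySketch`, and the `CONF` properties object (standing in for NutchConf) are all hypothetical names, and the sketch assumes the loaded class is instantiated once and the instance is cached.

```java
import java.util.Properties;

// Hypothetical stand-in for the filter extension point (not the Nutch interface).
interface UrlFilter {
    // Returns the URL if it is accepted, or null if it is rejected.
    String filter(String url);
}

// Hypothetical default implementation, used only so the sketch can run.
class AcceptAllFilter implements UrlFilter {
    public String filter(String url) {
        return url;
    }
}

final class UrlFilterFactorySketch {
    // Assumption: a Properties object plays the role of NutchConf here.
    static final Properties CONF = new Properties();
    static {
        CONF.setProperty("urlfilter.class", AcceptAllFilter.class.getName());
    }

    // Step 1: read the configured class name once, when this class is loaded.
    private static final String URLFILTER_CLASS = CONF.getProperty("urlfilter.class");

    // Cached filter; null until getFilter() is first called.
    private static UrlFilter filter;

    private UrlFilterFactorySketch() {}

    // Step 2: load the configured class on first use, then reuse the result.
    static synchronized UrlFilter getFilter() {
        if (filter == null) {
            try {
                Class<?> filterClass = Class.forName(URLFILTER_CLASS);
                filter = (UrlFilter) filterClass.getDeclaredConstructor().newInstance();
            } catch (ReflectiveOperationException e) {
                throw new RuntimeException("Couldn't create " + URLFILTER_CLASS, e);
            }
        }
        return filter;
    }
}
```

Because the class name comes from configuration rather than from a compile-time reference, swapping in a different filter only requires changing the configured value, which is what makes this a convenient extension point.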
It loads one class, which is configurable via "urlfilter.class". By default, nutch-default.xml specifies this as follows:
<!-- urlfilter properties -->
<property>
<name>urlfilter.class</name>
<value>net.nutch.net.RegexURLFilter</value>
<description>Name of the class used to filter URLs.</description>
</property>
<property>
<name>urlfilter.regex.file</name>
<value>regex-urlfilter.txt</value>
<description>Name of file on CLASSPATH containing default regular
expressions used by RegexURLFilter.</description>
</property>
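To give a feel for what a regex-driven filter configured this way might do, here is a small Java sketch. It is not the RegexURLFilter source. The class name `RegexRuleFilterSketch` is hypothetical, and the rule format is an assumption made for illustration: each rule line starts with `+` (accept) or `-` (reject) followed by a regular expression, lines starting with `#` are comments, the first matching rule decides, and a URL that matches no rule is rejected. The rules are passed in as a list of strings rather than read from a file on the classpath.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class RegexRuleFilterSketch {
    // One parsed rule: a sign (accept or reject) and a compiled pattern.
    private static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(boolean accept, Pattern pattern) {
            this.accept = accept;
            this.pattern = pattern;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    // Parses rule lines in the assumed "+regex" / "-regex" format.
    RegexRuleFilterSketch(List<String> lines) {
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // skip blank lines and comments
            }
            char sign = line.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("Rule must start with + or -: " + line);
            }
            rules.add(new Rule(sign == '+', Pattern.compile(line.substring(1))));
        }
    }

    // Returns the URL if the first matching rule accepts it, otherwise null.
    String filter(String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept ? url : null;
            }
        }
        return null; // no rule matched
    }
}
```

With rules such as `-\.(gif|jpg|png)$` followed by `+^http://([a-z0-9]*\.)*example\.com/`, image URLs are rejected first and remaining URLs on the example.com domain are accepted; rule order matters because the first match wins.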
Now let's look at the crawler factories, which are a bit more complex.