English original:
DissectingTheNutchCrawler. When reposting this article, please credit the source: http://blog.csdn.net/pwlazy
Summary: Nutch crawler extension points
The main ways to configure the Nutch crawler are as follows:
- Configuration files. Default values are in nutch-default.xml, and you should override them in nutch-site.xml.
- URLFilter interface. By default, the class net.nutch.net.RegexURLFilter is used, which reads regular expression patterns from regex-urlfilter.txt. So, you can:
  - Edit that file to tune its behavior.
  - Or write a new class that implements net.nutch.net.URLFilter, and change nutch-site.xml to use it.
- Protocol interface. To add support for a new protocol, write a plugin or add an existing one to the "plugins" directory. To change protocol behavior, modify the appropriate plugin.
- Parser interface. As with Protocol, you should create or add a plugin for any new content type. To modify how an existing content type is parsed, replace the appropriate plugin.
- If you need to make other changes, refer to our discussion of Fetcher and FetchListTool. Consider subclassing these classes, overriding the appropriate method, then calling your class from the "nutch" script using its fully qualified class name.
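The second URLFilter option above (writing your own class) might look roughly like the sketch below. It is not taken from the Nutch source: the `URLFilter` interface is declared locally as a stand-in for net.nutch.net.URLFilter, and its shape (a single `filter` method that returns the URL to accept it or `null` to reject it) is an assumption. `HostSuffixURLFilter` and the domain `example.org` are hypothetical names used only for illustration.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

// Local stand-in for net.nutch.net.URLFilter. ASSUMPTION: the real interface
// has one method that returns the URL to accept it, or null to reject it.
interface URLFilter {
    String filter(String url);
}

// Hypothetical filter: accept only URLs whose host is the given domain
// or one of its subdomains.
class HostSuffixURLFilter implements URLFilter {
    private final String domain;

    HostSuffixURLFilter(String domain) {
        this.domain = domain.toLowerCase(Locale.ROOT);
    }

    @Override
    public String filter(String url) {
        try {
            String host = new URI(url).getHost();
            if (host == null) {
                return null; // no host component: reject
            }
            host = host.toLowerCase(Locale.ROOT);
            boolean inDomain = host.equals(domain) || host.endsWith("." + domain);
            return inDomain ? url : null;
        } catch (URISyntaxException e) {
            return null; // unparsable URL: reject
        }
    }
}

class HostSuffixURLFilterDemo {
    public static void main(String[] args) {
        URLFilter filter = new HostSuffixURLFilter("example.org");
        System.out.println(filter.filter("http://www.example.org/index.html"));
        System.out.println(filter.filter("http://other.test/"));
    }
}
```

In a real installation the class would implement the actual Nutch interface and be selected through nutch-site.xml; the property name to set is not given in this article, so check nutch-default.xml for it.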
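Summary (translation): Nutch crawler extension points
The main ways to configure the Nutch crawler are as follows:
- Configuration files. nutch-default.xml sets the default values; you should override them in nutch-site.xml.
- URLFilter interface. By default, the system uses the class net.nutch.net.RegexURLFilter, which reads regular expressions from regex-urlfilter.txt, so you can:
  - Edit regex-urlfilter.txt to adjust the behavior of RegexURLFilter.
- Protocol interface. To add support for a new protocol, write a plugin or place an existing one in the plugins directory; to change protocol behavior, modify the appropriate plugin.
- Parser interface. As for parsers (translator's note: the original says "Protocol" here, which is probably a slip), you should add a plugin for each new content type. If instead you want to change the behavior of an existing plugin, you need to replace that plugin.
- If you want to make other changes, refer to our discussion of Fetcher and FetchListTool. You can subclass these classes, override the appropriate method, and then call your class from the nutch script using its fully qualified class name.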
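To make the "edit regex-urlfilter.txt" option more concrete, here is a minimal sketch of how a rule list of that kind can be evaluated. It is not the RegexURLFilter source. It assumes the commonly described format in which each rule is a `+` (accept) or `-` (reject) sign followed by a regular expression, the first rule that matches decides, and a URL that matches no rule is rejected. `RegexRuleSketch` and the sample patterns are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Sketch of first-match-wins rule evaluation. ASSUMPTION: rules look like
// "+regex" or "-regex", and lines starting with '#' are comments.
class RegexRuleSketch {
    private static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(boolean accept, Pattern pattern) {
            this.accept = accept;
            this.pattern = pattern;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    RegexRuleSketch(List<String> lines) {
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // skip blank lines and comments
            }
            char sign = line.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("Rule must start with + or -: " + line);
            }
            rules.add(new Rule(sign == '+', Pattern.compile(line.substring(1))));
        }
    }

    // Returns the URL if the first matching rule accepts it, otherwise null.
    String filter(String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept ? url : null;
            }
        }
        return null; // no rule matched
    }

    public static void main(String[] args) {
        RegexRuleSketch filter = new RegexRuleSketch(Arrays.asList(
            "# skip common image files",
            "-\\.(gif|jpg|png)$",
            "# accept anything under example.org",
            "+^http://([a-z0-9]*\\.)*example\\.org/"));
        System.out.println(filter.filter("http://www.example.org/a.html"));
        System.out.println(filter.filter("http://www.example.org/logo.gif"));
    }
}
```

Because the first matching rule wins in this sketch, rule order matters: the image exclusion has to come before the broader accept rule, otherwise it would never be reached.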