Original English source:
DissectingTheNutchCrawler. When reposting this article, please credit the source: http://blog.csdn.net/pwlazy
Factory classes: '''ParserFactory''', '''ProtocolFactory'''
> Class net.nutch.parser.ParserFactory
> used by:
> - net.nutch.db.WebDBInjector
> - net.nutch.fetcher.Fetcher
> - net.nutch.parser.ParserChecker
>
> Class net.nutch.protocol.ProtocolFactory
> used by:
> - net.nutch.fetcher.Fetcher
> - net.nutch.parser.ParserChecker
>
> Class net.nutch.plugin.PluginRepository: used by all of the above
ParserFactory and ProtocolFactory are called directly from net.nutch.fetcher.Fetcher, to get the appropriate Parser and Protocol objects for a given content_type and url. They both use an instance of net.nutch.plugin.PluginRepository to find and load Java classes.
By default, nutch-default.xml tells PluginRepository to look for classes in a directory called "plugins" somewhere on the Java classpath. Normally you'll just use the one in your Nutch install directory.
<!-- plugin properties -->
<property>
<name>plugin.folders</name>
<value>plugins</value>
<description>Directories where nutch plugins are located. Each
element may be a relative or absolute path. If absolute, it is used
as is. If relative, it is searched for on the classpath.</description>
</property>
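The lookup rule in that description (absolute paths used as is, relative paths searched for on the classpath) can be sketched in Java. This is a minimal illustration, not Nutch's actual code: the class name PluginFolderResolver and its resolve method are hypothetical, and the sketch assumes a relative entry resolves to a plain directory on disk.

```java
import java.io.File;
import java.net.URISyntaxException;
import java.net.URL;

// Hypothetical helper, not part of Nutch: resolves one plugin.folders entry.
public class PluginFolderResolver {

    /**
     * Absolute paths are returned unchanged. Relative paths are looked up
     * as classpath resources. Returns null when a relative entry cannot be
     * found on the classpath or is not a plain file-system location.
     */
    public static File resolve(String folder) {
        File dir = new File(folder);
        if (dir.isAbsolute()) {
            return dir;
        }
        URL url = PluginFolderResolver.class.getClassLoader().getResource(folder);
        if (url == null || !"file".equals(url.getProtocol())) {
            return null;
        }
        try {
            return new File(url.toURI());
        } catch (URISyntaxException e) {
            return null;
        }
    }
}
```

With the default value "plugins", such a lookup would only succeed if a directory of that name is visible from some classpath root, which is why the one in the Nutch install directory is normally used.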
Inside the plugin directory you will find a handful of sub-directories, each containing a file called "plugin.xml" and one or more Java archive (.jar) files. Directories include:
- parse-html
- parse-text
- parse-msword
- parse-pdf
- protocol-file
- protocol-ftp
- protocol-http
One directory, plus the "plugin.xml" and .jar file contents, constitutes one "plugin".
The XML file is a descriptor that is read by PluginRepository to determine two main things:
- What "extension point" (Java interface) the plugin implements, and
- How to load its contents.
Here is the plugin.xml file for "protocol-file":
<?xml version="1.0" encoding="UTF-8"?>
<plugin
id="protocol-file"
name="File Protocol Plug-in"
version="1.0.0"
provider-name="nutch.org">
<extension-point
id="net.nutch.protocol.Protocol"
name="Nutch Protocol"/>
<runtime>
<library name="protocol-file.jar">
<export name="*"/>
</library>
</runtime>
<extension id="net.nutch.protocol.file"
name="FileProtocol"
point="net.nutch.protocol.Protocol">
<implementation id="net.nutch.protocol.file.File"
class="net.nutch.protocol.file.File"
protocolName="file"/>
</extension>
</plugin>
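Since the plugin is named "protocol-file", you probably guessed already that this is a protocol handler for loading files on disk. But this descriptor tells us -- and PluginRepository -- precisely what it does:
- The extension-point (Java interface) name is "net.nutch.protocol.Protocol"
- The protocol name is "file"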
Thus, when Nutch sees a URL that starts with "file://", it will know to call this plugin to fetch that page.
Look at the descriptors for "protocol-http" and "protocol-ftp". You should see that the extension-point is exactly the same as for protocol-file, but the protocolName is different: "http" and "ftp", respectively.
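To make the protocolName lookup concrete, here is a minimal Java sketch that reads a plugin.xml descriptor with the JDK's DOM parser and maps a URL's scheme to an implementation class. It is an illustration under stated assumptions, not how PluginRepository or ProtocolFactory is actually written: the class ProtocolLookup and its methods are hypothetical names, and only the "class" and "protocolName" attributes shown in the descriptor above are used.

```java
import java.io.ByteArrayInputStream;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical sketch, not Nutch code: protocolName -> implementation class.
public class ProtocolLookup {
    private final Map<String, String> byProtocol = new HashMap<>();

    /** Registers every <implementation> that carries a protocolName attribute. */
    public void register(String pluginXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(
                    pluginXml.getBytes(StandardCharsets.UTF_8)));
            NodeList impls = doc.getElementsByTagName("implementation");
            for (int i = 0; i < impls.getLength(); i++) {
                Element impl = (Element) impls.item(i);
                // getAttribute returns "" when the attribute is absent
                String name = impl.getAttribute("protocolName");
                if (!name.isEmpty()) {
                    byProtocol.put(name, impl.getAttribute("class"));
                }
            }
        } catch (Exception e) {
            throw new IllegalArgumentException("Unreadable plugin descriptor", e);
        }
    }

    /** Returns the implementation class for the URL's scheme, or null if none. */
    public String implementationFor(String url) {
        String scheme = URI.create(url).getScheme();
        return scheme == null ? null : byProtocol.get(scheme);
    }
}
```

After registering the protocol-file descriptor, a call such as implementationFor("file:///tmp/a.txt") would return the class named in that descriptor, while a URL whose scheme has no registered plugin yields null.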
Now let's examine the descriptor for parse-text:
<?xml version="1.0" encoding="UTF-8"?>
<plugin
id="parse-text"
name="Text Parse Plug-in"
version="1.0.0"
provider-name="nutch.org">
<extension-point
id="net.nutch.parse.Parser"
name="Nutch Content Parser"/>
<runtime>
<library name="parse-text.jar">
<export name="*"/>
</library>
</runtime>
<extension id="net.nutch.parse.text"
name="TextParse"
point="net.nutch.parse.Parser">
<implementation id="net.nutch.parse.text.TextParser"
class="net.nutch.parse.text.TextParser"
contentType="text/plain"
pathSuffix="txt"/>
</extension>
</plugin>
Note that the extension-point is now net.nutch.parse.Parser. And this time, <extension><implementation> doesn't specify a protocolName. Instead, we see "contentType" and "pathSuffix".
So now we see how PluginRepository chooses which plugin to use for a given task:
- It finds the set of plugins that implement a certain extension-point
- Then, from that set, it finds one that works for the content at hand (protocolName, contentType, or pathSuffix).
Look at the descriptor for parse-html. You'll see that it follows these rules. It implements the same extension-point as parse-text (net.nutch.parse.Parser), but it has different values for contentType and pathSuffix:
contentType="text/html"
pathSuffix=""
This entry looks a bit strange with the empty pathSuffix value. But that just means that this plugin doesn't match any pathSuffix value. So, parse-html is only used when we fetch remote URLs, not anything residing on the local filesystem.
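The two-step selection described above, including the effect of an empty pathSuffix, can be sketched in Java as follows. This is a hedged illustration rather than Nutch's implementation: ParserSelector and Impl are hypothetical names, and trying contentType before pathSuffix is an assumption made for the sketch, since the text does not state the precedence.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not Nutch code: picks a parser implementation.
public class ParserSelector {

    /** One <implementation> entry, reduced to the attributes discussed above. */
    public static final class Impl {
        final String point;
        final String className;
        final String contentType;
        final String pathSuffix;

        public Impl(String point, String className,
                    String contentType, String pathSuffix) {
            this.point = point;
            this.className = className;
            this.contentType = contentType;
            this.pathSuffix = pathSuffix;
        }
    }

    private final List<Impl> impls = new ArrayList<>();

    public void add(Impl impl) {
        impls.add(impl);
    }

    /** Returns the matching implementation class name, or null if none fits. */
    public String select(String point, String contentType, String path) {
        // Step 1: keep only plugins that implement the requested extension point.
        List<Impl> candidates = new ArrayList<>();
        for (Impl impl : impls) {
            if (impl.point.equals(point)) {
                candidates.add(impl);
            }
        }
        // Step 2a (assumed precedence): match on content type first.
        for (Impl impl : candidates) {
            if (impl.contentType.equals(contentType)) {
                return impl.className;
            }
        }
        // Step 2b: fall back to the path suffix; an empty suffix never matches.
        for (Impl impl : candidates) {
            if (path != null && !impl.pathSuffix.isEmpty()
                    && path.endsWith("." + impl.pathSuffix)) {
                return impl.className;
            }
        }
        return null;
    }
}
```

In this sketch an entry with pathSuffix="" can only be chosen through its contentType, which mirrors the observation that parse-html is not selected by file extension.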