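PluginRepository is the entry point for plugins and holds all loaded plugins. The loading flow is as follows:
1. Parse the plugin.xml file of every plugin under the plugin.folders directories:
The main parsing functions are:
(1) parseExtension(rootElement, pluginDescriptor);
Parses the extension element: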
<extension id="org.apache.nutch.net.urlfilter.urllength"
name="Nutch URL Length Filter"
point="org.apache.nutch.net.URLFilter">
<implementation id="UrlLengthFilter"
class="org.apache.nutch.net.urlfilter.urllength.UrlLengthFilter">
</implementation>
</extension>
After parsing, the result is added to the PluginDescriptor:
pPluginDescriptor.addExtension(extension);
(2) parseLibraries(rootElement, pluginDescriptor);
Parses the following library elements:
<runtime>
<library name="lib-http.jar">
<export name="*"/>
</library>
</runtime>
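After parsing, each library is added to the PluginDescriptor as exported or not exported, depending on whether the library element has an export child:
pDescriptor.addExportedLibRelative(libName);
pDescriptor.addNotExportedLibRelative(libName);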
(3) parseRequires(rootElement, pluginDescriptor);
Parses the requires element:
<requires>
<import plugin="nutch-extensionpoints"/>
</requires>
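After parsing, the result is added to the PluginDescriptor:
pDescriptor.addDependency(plugin); // records the dependency
(4) parseExtensionPoints(rootElement, pluginDescriptor);
Parses the extension-point elements, which are mainly found in the plugin.xml of nutch-extensionpoints: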
<extension-point
id="org.apache.nutch.indexer.field.FieldFilter"
name="Nutch Field Filter"/>
After parsing, the result is added to the PluginDescriptor:
pPluginDescriptor.addExtensionPoint(extensionPoint);
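The four parsing steps above can be sketched with the JDK's built-in DOM parser. This is only a minimal illustration, not Nutch's own parser: the class PluginXmlSketch and its field names are hypothetical stand-ins for PluginDescriptor, and only the attributes shown in the XML fragments above are read.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical, simplified stand-in for a plugin descriptor (not Nutch's class). */
class PluginXmlSketch {
    final List<String> extensionTargets = new ArrayList<>();  // extension/@point
    final List<String> exportedLibs = new ArrayList<>();      // library with an export child
    final List<String> notExportedLibs = new ArrayList<>();   // library without an export child
    final List<String> dependencies = new ArrayList<>();      // import/@plugin
    final List<String> extensionPointIds = new ArrayList<>(); // extension-point/@id

    /** Parses the text of one plugin.xml into the lists above. */
    static PluginXmlSketch parse(String xml) {
        Element root;
        try {
            root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("cannot parse plugin.xml", e);
        }
        PluginXmlSketch d = new PluginXmlSketch();
        collect(root, "extension", "point", d.extensionTargets);
        collect(root, "import", "plugin", d.dependencies);
        collect(root, "extension-point", "id", d.extensionPointIds);
        NodeList libs = root.getElementsByTagName("library");
        for (int i = 0; i < libs.getLength(); i++) {
            Element lib = (Element) libs.item(i);
            // A library counts as exported when it has at least one export child.
            boolean exported = lib.getElementsByTagName("export").getLength() > 0;
            (exported ? d.exportedLibs : d.notExportedLibs).add(lib.getAttribute("name"));
        }
        return d;
    }

    /** Adds the given attribute of every element with the given tag name to out. */
    private static void collect(Element root, String tag, String attr, List<String> out) {
        NodeList nodes = root.getElementsByTagName(tag);
        for (int i = 0; i < nodes.getLength(); i++) {
            out.add(((Element) nodes.item(i)).getAttribute(attr));
        }
    }
}
```

Calling PluginXmlSketch.parse on the text of a plugin.xml fills the five lists, which correspond roughly to what addExtension, the two library methods, addDependency and addExtensionPoint record.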
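2. Filter the plugins:
Plugins are filtered according to plugin.includes and plugin.excludes, and the dependencies of each remaining plugin are checked for a "missing dependency" or a "circular dependency".
3. installExtensionPoints: collects all ExtensionPoints.
4. installExtensions: verifies that every extension has a corresponding ExtensionPoint.
5. The value of an extension's point attribute must be defined as an extension-point. In other words, after writing the plugin.xml of a plugin, the extension point it targets must be registered in the plugin.xml of nutch-extensionpoints.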
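Steps 2 to 5 can be sketched as two small checks. This is an illustration under assumptions, not Nutch's code: PluginCheckSketch, checkDependencies and unresolvedPoints are hypothetical names, plugins are represented only by their ids, and errors are returned as strings instead of being thrown.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class PluginCheckSketch {
    /**
     * Walks the dependencies of plugin id depth-first.
     * deps maps every known plugin id to the ids it imports.
     * Returns an error message, or null when everything resolves.
     */
    static String checkDependencies(String id, Map<String, List<String>> deps) {
        return visit(id, deps, new ArrayDeque<>());
    }

    private static String visit(String id, Map<String, List<String>> deps, Deque<String> path) {
        if (!deps.containsKey(id)) {
            return "missing dependency: " + id;
        }
        if (path.contains(id)) {
            return "circular dependency: " + id; // id is already on the current path
        }
        path.push(id);
        for (String dep : deps.get(id)) {
            String error = visit(dep, deps, path);
            if (error != null) {
                return error;
            }
        }
        path.pop();
        return null;
    }

    /** Returns the extension targets that match no declared extension-point id. */
    static List<String> unresolvedPoints(Collection<String> extensionPointIds,
                                         Collection<String> extensionTargets) {
        Set<String> known = new HashSet<>(extensionPointIds);
        List<String> missing = new ArrayList<>();
        for (String target : extensionTargets) {
            if (!known.contains(target)) { // exact, case-sensitive comparison
                missing.add(target);
            }
        }
        return missing;
    }
}
```

Because the comparison in unresolvedPoints is exact, a point value that differs from the extension-point id only in letter case is still reported as unresolved.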
6. nutch-site.xml must define the value of "plugin.folders", which specifies where the plugins are located.
<property>
<name>plugin.folders</name>
<value>plugins</value>
<description>Directories where nutch plugins are located. Each element may be a relative or absolute path. If absolute, it is used as is. If relative, it is searched for on the classpath.
</description>
</property>
In addition, "plugin.includes" must be defined to determine which plugins are loaded.
<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-(text|html)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
<description>Regular expression naming plugin directory names to include. Any plugin not matching this expression is excluded.
In any case you need at least include the nutch-extensionpoints plugin. By default Nutch includes crawling just HTML and plain text via HTTP, and basic indexing and search plugins. In order to use HTTPS please enable protocol-httpclient, but be aware of possible intermittent problems with the underlying commons-httpclient library.
</description>
</property>
7. plugin.folders only indicates the directories in which plugins can be found;
a plugin must also be matched by plugin.includes before it can be obtained from PluginRepository.
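The effect of plugin.includes and plugin.excludes can be sketched as a regular-expression filter over plugin ids. This is a minimal sketch that assumes the whole plugin id must match the expression; PluginIncludeSketch and its filter method are hypothetical names, not part of Nutch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class PluginIncludeSketch {
    /** Keeps the ids that fully match includesRegex and do not fully match excludesRegex. */
    static List<String> filter(String includesRegex, String excludesRegex, List<String> pluginIds) {
        Pattern includes = Pattern.compile(includesRegex);
        Pattern excludes = Pattern.compile(excludesRegex);
        List<String> kept = new ArrayList<>();
        for (String id : pluginIds) {
            if (!includes.matcher(id).matches()) {
                continue; // not included
            }
            if (excludes.matcher(id).matches()) {
                continue; // explicitly excluded
            }
            kept.add(id);
        }
        return kept;
    }
}
```

With the includes expression "protocol-http|urlfilter-(regex|urllength)" and an empty excludes expression, the ids protocol-http and urlfilter-urllength are kept, while protocol-httpclient is dropped because only part of its id matches.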
---------------------------------------------------------------Custom plugin--------------------------------------------------------------------------------
1. Define the URLFilter interface; X_POINT_ID must be specified. (Nutch already defines the URLFilter extension point, so this step can be skipped.)
public interface URLFilter extends Pluggable, Configurable {
/** The name of the extension point. */
public final static String X_POINT_ID = URLFilter.class.getName();
}
2. Define UrlLengthFilter:
public class UrlLengthFilter implements URLFilter{
// TODO: actual implementation
}
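As a possible starting point for the TODO above, here is a self-contained sketch of the length check. It assumes the usual filter contract of returning the URL when it is accepted and null when it is rejected. The interface SimpleUrlFilter is a hypothetical stand-in so that the example compiles without Nutch, and the maxLength parameter is also an assumption; a real plugin would implement Nutch's URLFilter and would likely read the limit from its Configuration.

```java
/** Hypothetical stand-in for the filter contract, so the sketch compiles without Nutch. */
interface SimpleUrlFilter {
    /** Returns the URL if it is accepted, or null if it is rejected. */
    String filter(String urlString);
}

/** Rejects URLs longer than a fixed number of characters. */
class UrlLengthFilterSketch implements SimpleUrlFilter {
    private final int maxLength; // assumed parameter, not a Nutch property

    UrlLengthFilterSketch(int maxLength) {
        this.maxLength = maxLength;
    }

    @Override
    public String filter(String urlString) {
        if (urlString == null || urlString.length() > maxLength) {
            return null; // reject missing or overly long URLs
        }
        return urlString; // accept the URL unchanged
    }
}
```

For example, new UrlLengthFilterSketch(20).filter("http://a.org/") returns the URL unchanged, while a URL longer than 20 characters yields null.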
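3. Add a plugin named urlfilter-urllength under the "plugin.folders" directory, with the plugin.xml shown below. The id attribute of the extension element is the package name of the UrlLengthFilter class,
and the class attribute of the implementation element gives the fully qualified name of the implementing class, which is used to locate that class.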
<plugin
id="urlfilter-urllength"
name="URL length Filter"
version="1.0.0"
provider-name="nutch.org">
<requires>
<import plugin="nutch-extensionpoints"/>
</requires>
<extension id="org.apache.nutch.net.urlfilter.urllength"
name="Nutch URL Length Filter"
point="org.apache.nutch.net.URLFilter">
<implementation id="UrlLengthFilter"
class="org.apache.nutch.net.urlfilter.urllength.UrlLengthFilter">
</implementation>
</extension>
</plugin>
4. Make sure the plugin.xml of nutch-extensionpoints contains the following extension-point definition. The id here must match the point attribute of the extension in the previous step.
<extension-point
id="org.apache.nutch.net.URLFilter"
name="Nutch URL Filter"
/>
5. Add the id of this plugin (urlfilter-urllength) to the "plugin.includes" definition in nutch-site.xml.