Integrating Nutch 1.4 with Eclipse

andy_ghg

Categories: Java, Nutch, Eclipse

Tags: java, solr, eclipse, mac os

Environment:
Operating system: Mac OS X Lion
Nutch version: 1.4
Eclipse version: Eclipse Java EE IDE for Web Developers, Indigo

Step 1: Create a new plain Java project.

Step 2: Copy the Nutch source code (everything under "src/java/" in the Nutch root directory) into the src directory of the Java project.

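Step 3: Add the jars Nutch depends on at runtime to the class path. They are in the runtime/local/lib folder under the Nutch root directory. Do not simply select all of them: leave out nutch-1.4.jar. Otherwise, at run time Nutch looks for its configuration files inside nutch-1.4.jar first, and the run may fail with an http.agent.name exception.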
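
For reference, the http.agent.name error is what Nutch reports when that property is empty in the configuration it actually loads. Below is a minimal sketch of a conf/nutch-site.xml that sets it, assuming the conf folder from the next step is the one that ends up on the class path. The agent name MyNutchSpider is a placeholder I made up, not a value from this setup.

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml: overrides values from nutch-default.xml -->
<configuration>
  <property>
    <name>http.agent.name</name>
    <!-- Placeholder value (assumption); choose a name for your own crawler -->
    <value>MyNutchSpider</value>
  </property>
</configuration>
```

If nutch-1.4.jar is still on the class path, the copy of the configuration bundled in the jar may be read instead, in which case this override would have no effect.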

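Step 4: Copy the conf and plugins folders from runtime/local/ into the Java project. The project should now contain src, conf and plugins at its top level (the original post showed a screenshot of the layout here).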

Step 5: Right-click the project -> Properties -> Build Path -> select Libraries -> click Add Class Folder -> select the conf folder -> click OK.
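
After steps 3 and 5, the project's .classpath file (which Eclipse generates in the project root) should look roughly like the sketch below. This is an illustration, not a copy of my file: the JRE container entry and the output folder are Eclipse defaults I am assuming, and the jar entries are left as a comment because they depend on what is in runtime/local/lib.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<classpath>
    <!-- Nutch sources copied in step 2 -->
    <classpathentry kind="src" path="src"/>
    <!-- Default JRE container (assumed Eclipse default) -->
    <classpathentry kind="con" path="org.eclipse.jdt.launching.JRE_CONTAINER"/>
    <!-- Class folder added in step 5, so the files in conf are found at run time -->
    <classpathentry kind="lib" path="conf"/>
    <!-- One kind="lib" entry per jar added in step 3; nutch-1.4.jar must not be among them -->
    <classpathentry kind="output" path="bin"/>
</classpath>
```

Eclipse writes this file itself when you use the Build Path dialog, so there is normally no need to edit it by hand. It is mainly a quick way to check that conf was registered as a class folder.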

Step 6: Right-click Crawl.java -> Run As -> Run Configurations -> switch to the Arguments tab -> enter the program arguments. I used the arguments from the official example, namely:

urls -solr http://localhost:8080/solr/ -depth 3 -topN 5

I had already set up Solr beforehand, so my arguments include the Solr URL. Adjust the arguments to whatever your own setup needs.
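
If you save the run configuration as a shared file (Common tab in the Run Configurations dialog), Eclipse stores it as a .launch file. The sketch below shows roughly what that file would contain for the arguments above. The project name nutch is a placeholder I am assuming, and Eclipse normally adds further attributes of its own.

```xml
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<launchConfiguration type="org.eclipse.jdt.launching.localJavaApplication">
    <!-- Class whose main method is run: the Crawl.java from step 6 -->
    <stringAttribute key="org.eclipse.jdt.launching.MAIN_TYPE" value="org.apache.nutch.crawl.Crawl"/>
    <!-- Program arguments exactly as entered on the Arguments tab -->
    <stringAttribute key="org.eclipse.jdt.launching.PROGRAM_ARGUMENTS" value="urls -solr http://localhost:8080/solr/ -depth 3 -topN 5"/>
    <!-- Placeholder project name (assumption) -->
    <stringAttribute key="org.eclipse.jdt.launching.PROJECT_ATTR" value="nutch"/>
</launchConfiguration>
```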

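PS: During the run, Solr threw an "unknown field content" exception. I had already added the relevant settings to schema.xml, but it still failed. If anyone passing by knows how to solve this, please help. Thanks, everyone.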

Update: the unknown field exception has been resolved. The schema.xml I ended up with is below.

<?xml version="1.0" encoding="UTF-8" ?>
<schema name="nutch" version="1.4">
    <types>
        <fieldType name="string" class="solr.StrField" sortMissingLast="true"
            omitNorms="true"/> 
        <fieldType name="long" class="solr.TrieLongField" precisionStep="0"
            omitNorms="true" positionIncrementGap="0"/>
        <fieldType name="float" class="solr.TrieFloatField" precisionStep="0"
            omitNorms="true" positionIncrementGap="0"/>
        <fieldType name="date" class="solr.TrieDateField" precisionStep="0"
            omitNorms="true" positionIncrementGap="0"/>

        <fieldType name="text" class="solr.TextField"
            positionIncrementGap="100">
            <analyzer>
                <tokenizer class="solr.WhitespaceTokenizerFactory"/>
                <filter class="solr.StopFilterFactory"
                    ignoreCase="true" words="stopwords.txt"/>
                <filter class="solr.WordDelimiterFilterFactory"
                    generateWordParts="1" generateNumberParts="1"
                    catenateWords="1" catenateNumbers="1" catenateAll="0"
                    splitOnCaseChange="1"/>
                <filter class="solr.LowerCaseFilterFactory"/>
                <filter class="solr.EnglishPorterFilterFactory"
                    protected="protwords.txt"/>
                <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
            </analyzer>
        </fieldType>
        <fieldType name="url" class="solr.TextField"
            positionIncrementGap="100">
            <analyzer>
                <tokenizer class="solr.StandardTokenizerFactory"/>
                <filter class="solr.LowerCaseFilterFactory"/>
                <filter class="solr.WordDelimiterFilterFactory"
                    generateWordParts="1" generateNumberParts="1"/>
            </analyzer>
        </fieldType>
        
        <fieldType name="ignored" stored="false" indexed="false" multiValued="true" class="solr.StrField" />
    </types>
    <fields>
        <dynamicField name="*" type="ignored" multiValued="true" />
        <field name="id" type="string" stored="true" indexed="true"/>
        <field name="tstamp" type="date" stored="true" indexed="false" />
        <!-- core fields -->
        <field name="segment" type="string" stored="true" indexed="false"/>
        <field name="digest" type="string" stored="true" indexed="false"/>
        <field name="boost" type="float" stored="true" indexed="false"/>

        <!-- fields for index-basic plugin -->
        <field name="host" type="url" stored="false" indexed="true"/>
        <field name="site" type="string" stored="false" indexed="true"/>
        <field name="url" type="url" stored="true" indexed="true"
            required="true"/>
        <field name="content" type="text" stored="false" indexed="true"/>
        <field name="title" type="text" stored="true" indexed="true"/>
        <field name="cache" type="string" stored="true" indexed="false"/>
        <field name="stamp" type="date" stored="true" indexed="false"/>

        <!-- fields for index-anchor plugin -->
        <field name="anchor" type="string" stored="true" indexed="true"
            multiValued="true"/>

        <!-- fields for index-more plugin -->
        <field name="type" type="string" stored="true" indexed="true"
            multiValued="true"/>
        <field name="contentLength" type="long" stored="true"
            indexed="false"/>
        <field name="lastModified" type="date" stored="true"
            indexed="false"/>
        <field name="date" type="date" stored="true" indexed="true"/>

        <!-- fields for language identifier plugin -->
        <field name="lang" type="string" stored="true" indexed="true"/>

        <!-- fields for sub collection plugin -->
        <field name="subcollection" type="string" stored="true"
            indexed="true" multiValued="true"/>

        <!-- fields for feed plugin (tag is also used by microformats-reltag)-->
        <field name="author" type="string" stored="true" indexed="true"/>
        <field name="tag" type="string" stored="true" indexed="true" multiValued="true"/>
        <field name="feed" type="string" stored="true" indexed="true"/>
        <field name="publishedDate" type="date" stored="true"
            indexed="true"/>
        <field name="updatedDate" type="date" stored="true"
            indexed="true"/>

        <!-- fields for creative commons plugin -->
        <field name="cc" type="string" stored="true" indexed="true"
            multiValued="true"/>
    </fields>
    <uniqueKey>id</uniqueKey>
    <defaultSearchField>content</defaultSearchField>
    <solrQueryParser defaultOperator="OR"/>
</schema>
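
I did not track down exactly which change made the error go away, but the most likely candidates are the two declarations below, which follow the catch-all pattern from the Solr example schema: any field Nutch sends that is not declared explicitly matches the "*" dynamic field and is dropped without being indexed or stored, instead of being rejected as unknown. Here they are on their own, as a sketch for anyone who wants to add them to an existing schema.xml:

```xml
<!-- Goes inside <types>: a type that neither indexes nor stores its value -->
<fieldType name="ignored" stored="false" indexed="false" multiValued="true" class="solr.StrField" />

<!-- Goes inside <fields>: any field name without an explicit declaration falls through to "ignored" -->
<dynamicField name="*" type="ignored" multiValued="true" />
```

The trade-off is that a misspelled or missing field declaration no longer produces an error. The data is discarded quietly, so check that the fields you care about are declared explicitly.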


2012-06-03 00:48