Distributed search with nutch-1.2 and Hadoop

p_x1984

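1. Several blogs already describe in detail how to configure nutch distributed search. If anything there is unclear, here is a link as well: <<nutch分布式搜索配置>> ("nutch distributed search configuration").
2. Here I mainly want to record some problems I ran into at work: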
------0-------
------1-------
------2-------
------3-------
java.io.IOException: Filesystem closed
    at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:226)
    at org.apache.hadoop.hdfs.DFSClient.access$600(DFSClient.java:67)
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1756)
    at java.io.DataInputStream.read(Unknown Source)
    at org.apache.nutch.indexer.FsDirectory$DfsIndexInput.readInternal(FsDirectory.java:178)
    at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:160)
    at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:39)
    at org.apache.lucene.store.IndexInput.readVInt(IndexInput.java:81)
    at org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:222)
    at org.apache.lucene.index.SegmentReader.document(SegmentReader.java:879)
    at org.apache.lucene.index.DirectoryReader.document(DirectoryReader.java:574)
    at org.apache.lucene.index.IndexReader.document(IndexReader.java:658)
    at org.apache.lucene.search.IndexSearcher.doc(IndexSearcher.java:163)
    at org.apache.nutch.searcher.IndexSearcher.getDetails(IndexSearcher.java:110)
    at org.apache.nutch.searcher.LuceneSearchBean.getDetails(LuceneSearchBean.java:107)
    at org.apache.nutch.searcher.NutchBean.getDetails(NutchBean.java:359)
    at com.yichen.node.ThreadPoolTaskSearch.query(ThreadPoolTaskSearch.java:89)
    at com.yichen.node.ThreadPoolTaskSearch.query(ThreadPoolTaskSearch.java:59)
    at com.yichen.node.ThreadPoolTaskSearch.search(ThreadPoolTaskSearch.java:38)
    at com.yichen.node.ThreadPoolTaskSearch.run(ThreadPoolTaskSearch.java:130)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
nutchBean closed。。。。
nutchBean closed。。。。
{indexNo=0, uniqueKey=35, su=null, post=IT工程师, company=卡斯柯信号有限公司北京分公司, salary=(0-0), type=job, updatetime=20110621}
no found result。。。。
{indexNo=0, uniqueKey=19, su=null, post=【知名合资IT企业】高级营销经理（安全）–CEN810, company=大连博科人才有限公司, salary=(0-0), type=job, updatetime=20110621}
{indexNo=0, uniqueKey=18, su=null, post=【知名合资IT企业】高级拓展经理（安全）–CEN811, company=大连博科人才有限公司, salary=(0-0), type=job, updatetime=20110621}
{indexNo=0, uniqueKey=20, su=null, post=【知名合资IT企业】高级规划经理（安全）–CEN809, company=大连博科人才有限公司, salary=(0-0), type=job, updatetime=20110621}
{indexNo=0, uniqueKey=21, su=null, post=理财产品销售专员（综合金融）, company=平安金融服务公司, salary=(4000-50000), type=job, updatetime=20110621}
{indexNo=0, uniqueKey=25, su=null, post=理财金融营销专员, company=平安金融服务公司, salary=(4000-50000), type=job, updatetime=20110620}
{indexNo=0, uniqueKey=28, su=null, post=金融产品理财专员, company=平安金融服务公司, salary=(5000-20000), type=job, updatetime=20110620}
{indexNo=0, uniqueKey=22, su=null, post=理财客户金融经理, company=平安金融服务公司, salary=(6001-8000), type=job, updatetime=20110620}
{indexNo=0, uniqueKey=24, su=null, post=理财金融专员, company=平安金融服务公司, salary=(3000-20000), type=job, updatetime=20110621}
{indexNo=0, uniqueKey=31, su=null, post=金融理财经理（综合金融）, company=平安金融服务公司, salary=(8001-10000), type=job, updatetime=20110620}

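Cause analysis: searching the distributed index with a single thread works without problems; the error above only appears when searching with multiple threads. Because every search opens its own connection, too many connections are opened and they are not closed cleanly, which produces the error above.
Solution:
1. In the servlet's initialization, add: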
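One plausible reading of the trace, offered as an assumption rather than something confirmed above: by default Hadoop's FileSystem.get() returns a cached instance shared by all callers with the same URI and user, so a thread that closes its NutchBean (the log prints "nutchBean closed") may also close the file system that other threads are still reading from. The sketch below reproduces only that failure mode with a hypothetical stand-in class, FakeFileSystem; it does not use the Hadoop or Nutch APIs.

```java
import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for a cached, shared file system client (not the Hadoop API). */
class FakeFileSystem {
    private static final Map<String, FakeFileSystem> CACHE = new ConcurrentHashMap<>();
    private volatile boolean open = true;

    /** Returns the same instance for the same URI, mimicking a per-URI cache. */
    static FakeFileSystem get(String uri) {
        return CACHE.computeIfAbsent(uri, u -> new FakeFileSystem());
    }

    String read() throws IOException {
        if (!open) {
            throw new IOException("Filesystem closed");
        }
        return "data";
    }

    void close() {
        open = false;
    }
}

class SharedCloseDemo {
    /** Returns what the second user sees after the first user closes its handle. */
    static String readAfterPeerClose() {
        // Both "threads" ask for the same URI and silently get the same object.
        FakeFileSystem a = FakeFileSystem.get("hdfs://example:9000");
        FakeFileSystem b = FakeFileSystem.get("hdfs://example:9000");
        a.close(); // A finishes its query and closes what it believes is its own handle
        try {
            return b.read(); // B is still in the middle of its query
        } catch (IOException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(readAfterPeerClose()); // prints "Filesystem closed"
    }
}
```

If this is indeed what happens, opening more handles does not help, since they all point at the same underlying object; the resource has to be opened once and closed by a single owner.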
    // Runs once when the container loads the servlet, so the index is opened a single time.
    public void init(ServletConfig config) throws ServletException {
        try {
            this.conf = NutchConfiguration.get(config.getServletContext());
            // Look up the NutchBean that is shared through the servlet context.
            bean = NutchBean.get(config.getServletContext(), this.conf);
        } catch (IOException e) {
            throw new ServletException(e);
        }
        MAX_HITS_PER_PAGE = conf.getInt("searcher.max.hits.per.page", -1);
    }
2. Modify web.xml and add:
<listener>
<listener-class>org.apache.nutch.searcher.NutchBean$NutchBeanConstructor</listener-class>
</listener>

<servlet>
<servlet-name>Cached</servlet-name>
<servlet-class>org.apache.nutch.servlet.Cached</servlet-class>
</servlet>
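3. In your own servlet, pass the NutchBean instance and the NutchConfiguration instance along to the code that runs the searches. This guarantees that the index is opened only once, at initialization.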
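The three steps above amount to "open once, share between threads, close once". A minimal sketch of that pattern follows, assuming a hypothetical StubSearcher class in place of NutchBean and a plain thread pool in place of the servlet container; the class and method names are illustrative and not part of Nutch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for an expensive searcher such as NutchBean. */
class StubSearcher {
    static final AtomicInteger OPEN_COUNT = new AtomicInteger();
    private volatile boolean open = true;

    StubSearcher() {
        OPEN_COUNT.incrementAndGet(); // pretend the index is opened here
    }

    String search(String query) {
        if (!open) {
            throw new IllegalStateException("searcher closed");
        }
        return "hits for " + query;
    }

    void close() {
        open = false;
    }
}

class SharedSearcherDemo {
    /** Runs the given number of queries on a pool against one shared searcher. */
    static List<String> runQueries(int queries) {
        StubSearcher shared = new StubSearcher(); // opened once, as in init()
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<String> results = new ArrayList<>();
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (int i = 0; i < queries; i++) {
                final String q = "q" + i;
                // Workers only use the shared instance; they never close it.
                futures.add(pool.submit(() -> shared.search(q)));
            }
            for (Future<String> f : futures) {
                results.add(f.get()); // wait for each query to complete
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
            shared.close(); // closed by the owner only, never by a worker
        }
        return results;
    }

    public static void main(String[] args) {
        System.out.println(runQueries(10).size() + " results, opened "
                + StubSearcher.OPEN_COUNT.get() + " time(s)");
    }
}
```

In a web application the servlet container plays the role of the owner: init() opens the shared instance and the context listener or destroy() is the only place that closes it.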

Attachment: linux下如何配置分布式检索.pdf ("How to configure distributed search on Linux", 40 KB)

2011-07-13 10:50