nutch-1.2结合hadoop分布式搜索。
1、网上关于nutch分布式搜索的配置有些BLOG写的很详细了。有那些地方有疑问的,我这里也给一个连接<<nutch分布式搜索配置>>
2、在这里主要想写下工作过程当中遇到的一些问题:
------0-------
------1-------
------2-------
------3-------
java.io.IOException: Filesystem closed
at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:226)
at org.apache.hadoop.hdfs.DFSClient.access$600(DFSClient.java:67)
at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1756)
at java.io.DataInputStream.read(Unknown Source)
at org.apache.nutch.indexer.FsDirectory$DfsIndexInput.readInternal(FsDirectory.java:178)
at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:160)
at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:39)
at org.apache.lucene.store.IndexInput.readVInt(IndexInput.java:81)
at org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:222)
at org.apache.lucene.index.SegmentReader.document(SegmentReader.java:879)
at org.apache.lucene.index.DirectoryReader.document(DirectoryReader.java:574)
at org.apache.lucene.index.IndexReader.document(IndexReader.java:658)
at org.apache.lucene.search.IndexSearcher.doc(IndexSearcher.java:163)
at org.apache.nutch.searcher.IndexSearcher.getDetails(IndexSearcher.java:110)
at org.apache.nutch.searcher.LuceneSearchBean.getDetails(LuceneSearchBean.java:107)
at org.apache.nutch.searcher.NutchBean.getDetails(NutchBean.java:359)
at com.yichen.node.ThreadPoolTaskSearch.query(ThreadPoolTaskSearch.java:89)
at com.yichen.node.ThreadPoolTaskSearch.query(ThreadPoolTaskSearch.java:59)
at com.yichen.node.ThreadPoolTaskSearch.search(ThreadPoolTaskSearch.java:38)
at com.yichen.node.ThreadPoolTaskSearch.run(ThreadPoolTaskSearch.java:130)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
nutchBean closed。。。。
nutchBean closed。。。。
{indexNo=0, uniqueKey=35, su=null, post=IT工程师, company=卡斯柯信号有限公司北京分公司, salary=(0-0), type=job, updatetime=20110621}
no found result。。。。
{indexNo=0, uniqueKey=19, su=null, post=【知名合资IT企业】高级营销经理(安全)–CEN810, company=大连博科人才有限公司, salary=(0-0), type=job, updatetime=20110621}
{indexNo=0, uniqueKey=18, su=null, post=【知名合资IT企业】高级拓展经理(安全)–CEN811, company=大连博科人才有限公司, salary=(0-0), type=job, updatetime=20110621}
{indexNo=0, uniqueKey=20, su=null, post=【知名合资IT企业】高级规划经理(安全)–CEN809, company=大连博科人才有限公司, salary=(0-0), type=job, updatetime=20110621}
{indexNo=0, uniqueKey=21, su=null, post=理财产品销售专员(综合金融), company=平安金融服务公司, salary=(4000-50000), type=job, updatetime=20110621}
{indexNo=0, uniqueKey=25, su=null, post=理财金融营销专员, company=平安金融服务公司, salary=(4000-50000), type=job, updatetime=20110620}
{indexNo=0, uniqueKey=28, su=null, post=金融产品理财专员, company=平安金融服务公司, salary=(5000-20000), type=job, updatetime=20110620}
{indexNo=0, uniqueKey=22, su=null, post=理财客户金融经理, company=平安金融服务公司, salary=(6001-8000), type=job, updatetime=20110620}
{indexNo=0, uniqueKey=24, su=null, post=理财金融专员, company=平安金融服务公司, salary=(3000-20000), type=job, updatetime=20110621}
{indexNo=0, uniqueKey=31, su=null, post=金融理财经理(综合金融), company=平安金融服务公司, salary=(8001-10000), type=job, updatetime=20110620}
分析原因:单个线程在分布式中搜索没有出现问题,以上出现错误原因是多线程搜索时出现的。由于每次打开的连接次数太多,导致连接没有关闭。出现上面的错误。
解决办法:
1、在servlet初始化中,加入:
public void init(ServletConfig config) throws ServletException {
try {
this.conf = NutchConfiguration.get(config.getServletContext());
bean = NutchBean.get(config.getServletContext(), this.conf);
} catch (IOException e) {
throw new ServletException(e);
}
MAX_HITS_PER_PAGE = conf.getInt("searcher.max.hits.per.page", -1);
}
2、修改web.xml,加入:
<listener>
<listener-class>org.apache.nutch.searcher.NutchBean$NutchBeanConstructor</listener-class>
</listener>
<servlet>
<servlet-name>Cached</servlet-name>
<servlet-class>org.apache.nutch.servlet.Cached</servlet-class>
</servlet>
3、在自己的servlet中把NutchBean的实例和NutchConfiguration的实例传递过去。保证初始化时只打开一次index。