How to read data crawled by Nutch



Author	Post
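p_x1984 (Posts: 207, Points: 850, From: Beijing)	Posted: 2009-12-16

1. The Nutch configuration is already covered on my blog. If you are not familiar with it yet, I suggest reading that first and then coming back to this post.

2. Use a SequenceFile.Reader to read the sorted input (fs, content and conf are created in step 3):

    SequenceFile.Reader m_reader = new SequenceFile.Reader(fs, content, conf);

3. Use NutchConfiguration.create() to instantiate a Configuration object, conf:

    Configuration conf = NutchConfiguration.create();
    // Build the Path; "path" is the location we read from the configuration file (conf.properties)
    Path content = new Path(path + "/data");
    // From this Path we can obtain the file system that holds the file
    FileSystem fs = content.getFileSystem(conf);

4. Here is the complete code:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.nutch.protocol.Content;
    import org.apache.nutch.util.NutchConfiguration;

    public class ContentReader {
        private SequenceFile.Reader m_reader = null;

        // path is the directory that contains the "data" file
        public ContentReader(String path) throws Exception {
            Configuration conf = NutchConfiguration.create();
            Path content = new Path(path + "/data");
            FileSystem fs = content.getFileSystem(conf);
            m_reader = new SequenceFile.Reader(fs, content, conf);
        }

        // Reads the next record into content; closes the reader once the input is exhausted
        public boolean next(Content content) throws Exception {
            Text key = new Text();
            boolean ret = m_reader.next(key, content);
            if (!ret) {
                m_reader.close();
            }
            return ret;
        }
    }

5. Getting the HDFS paths from the configuration file

(1) The path produced by Nutch in our case is /home/user/xipei/nutch1.0/crawl/segments/20091215145839/content/data. The records in it carry attributes such as version, url and content; if you are interested, have a look at the source code. Note that with the code below, the file actually opened is <segment>/content/part-00000/data, because getHdfsPath appends content/part-00000 and ContentReader appends /data.

(2) 20091215145839 is the 14-digit timestamp that Nutch generates when it crawls. We can simply read ./segments/ to get every 14-digit timestamp folder beneath it. Here is the program:

    // Requires java.util.ArrayList, java.util.List, org.apache.hadoop.fs.FileStatus
    // and org.apache.hadoop.mapred.JobConf, plus the imports used by ContentReader.

    /**
     * Builds the content path of every segment under the given directory.
     * @param prefixPath the segments directory
     * @return one ".../content/part-00000" path per segment
     */
    public static List<String> getHdfsPath(String prefixPath) {
        List<String> hdfsPaths = new ArrayList<String>();
        Path path = new Path(prefixPath);
        Configuration conf = NutchConfiguration.create();
        JobConf job = new JobConf(conf);
        try {
            FileSystem fs = FileSystem.get(job);
            FileStatus[] fileStatus = fs.listStatus(path);
            // HDFS paths always use "/", so File.separator is not used here
            String suffixPath = "content/part-00000";
            if (fileStatus == null) {
                return hdfsPaths; // nothing found: return the empty list
            }
            for (int i = 0; i < fileStatus.length; i++) {
                hdfsPaths.add(prefixPath + "/" + fileStatus[i].getPath().getName() + "/" + suffixPath);
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        return hdfsPaths;
    }

Note: prefixPath is the segments directory from the path above (/home/user/xipei/nutch1.0/crawl/segments). As long as it is passed correctly, the loop collects every segment.

(3) A quick look at how to read the URLs is enough; the other attributes work the same way.

    /**
     * Reads every URL stored under the given content path.
     * @param hdfsPath a path returned by getHdfsPath
     * @return the list of URLs
     */
    public static List<String> getUrl(String hdfsPath) {
        List<String> urls = new ArrayList<String>();
        try {
            ContentReader reader = new ContentReader(hdfsPath);
            Content content = new Content();
            while (reader.next(content)) {
                String url = content.getUrl() != null ? content.getUrl() : "";
                urls.add(url);
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        return urls;
    }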
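The segment enumeration in step 5 (2) can be tried without a Hadoop cluster. The following is only a minimal sketch under stated assumptions: it swaps the Hadoop FileSystem for java.nio.file on a local directory, the class name SegmentPaths and the method contentPartPaths are hypothetical, and the 14-digit name filter is an addition that the original getHdfsPath does not have (it is there to skip stray entries in the segments directory).

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper: a local-filesystem stand-in for getHdfsPath.
public class SegmentPaths {
    // Segment folders are named with a 14-digit timestamp, e.g. 20091215145839.
    private static final Pattern SEGMENT_NAME = Pattern.compile("\\d{14}");

    // Returns "<prefix>/<segment>/content/part-00000" for every segment folder, sorted by name.
    public static List<String> contentPartPaths(String prefixPath) throws IOException {
        List<String> result = new ArrayList<>();
        Path segmentsDir = Paths.get(prefixPath);
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(segmentsDir)) {
            for (Path entry : entries) {
                String name = entry.getFileName().toString();
                // Assumption: anything that is not a 14-digit directory is not a segment.
                if (Files.isDirectory(entry) && SEGMENT_NAME.matcher(name).matches()) {
                    result.add(prefixPath + "/" + name + "/content/part-00000");
                }
            }
        }
        Collections.sort(result);
        return result;
    }

    public static void main(String[] args) throws IOException {
        // Build a throwaway directory tree that mimics crawl/segments.
        Path segments = Files.createTempDirectory("segments");
        Files.createDirectories(segments.resolve("20091215145839"));
        Files.createDirectories(segments.resolve("20091216101500"));
        Files.createDirectories(segments.resolve("_temporary"));
        for (String p : contentPartPaths(segments.toString())) {
            System.out.println(p);
        }
    }
}
```

Each returned string plays the role of the hdfsPath argument to getUrl. On a real cluster the listing would still go through FileSystem.listStatus as in the post.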

comsci (Posts: 989, Points: 2110, From: Chengdu)	Posted: 2009-12-26
I have been following Nutch all along, and I am glad to see the original poster sharing so much experience.

langxiashahai (Level: Junior Member, Posts: 6, Points: 20, From: Guangzhou)	Posted: 2010-07-09
OP, this does not seem to work. Could you post a corrected version of the code?
