How to read the data crawled by Nutch

p_x1984

Blog category: nutch


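1. Configuring Nutch is covered in an earlier post on this blog. If you have not set Nutch up yet, read that post first and then come back to this one.
2. Use a SequenceFile.Reader to read the sorted input:
SequenceFile.Reader m_reader = new SequenceFile.Reader(fs, content, conf);
3. The conf, content and fs arguments used in step 2 are created as follows. NutchConfiguration.create() returns a Configuration object, conf:
Configuration conf = NutchConfiguration.create();
// Build the Path of the data file. "path" is the location we read from a configuration file (conf.properties).
Path content = new Path(path + "/data");
// Get the FileSystem that this path belongs to.
FileSystem fs = content.getFileSystem(conf);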
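The post says that "path" comes from conf.properties but does not show how it is read. Below is a minimal sketch of that step using only java.util.Properties. The class name SegmentPathConfig and the key nutch.segments.path are assumptions made for illustration; the original post does not say which key the file uses.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

class SegmentPathConfig {
    // Hypothetical key name: the original post does not name the key in conf.properties.
    static final String SEGMENTS_KEY = "nutch.segments.path";

    /** Reads the segments directory from properties text and strips trailing slashes. */
    static String loadSegmentsPath(Reader source) {
        Properties props = new Properties();
        try {
            props.load(source);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        String value = props.getProperty(SEGMENTS_KEY);
        if (value == null || value.trim().isEmpty()) {
            throw new IllegalArgumentException("Missing property: " + SEGMENTS_KEY);
        }
        // Drop trailing slashes so that appending "/..." later does not produce "//".
        return value.trim().replaceAll("/+$", "");
    }

    public static void main(String[] args) {
        // In a real program the Reader would wrap conf.properties; a string is used here.
        String text = "nutch.segments.path=/home/user/xipei/nutch1.0/crawl/segments/\n";
        System.out.println(loadSegmentsPath(new StringReader(text)));
    }
}
```

To read the real file, pass a java.io.FileReader opened on conf.properties instead of the StringReader.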
4. Here is the complete code:
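import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.util.NutchConfiguration;

public class ContentReader
{
    private SequenceFile.Reader m_reader = null;

    // Opens the "data" file under the given path.
    public ContentReader(String path) throws Exception
    {
        Configuration conf = NutchConfiguration.create();
        Path content = new Path(path + "/data");
        FileSystem fs = content.getFileSystem(conf);
        m_reader = new SequenceFile.Reader(fs, content, conf);
    }

    // Reads the next record into content. Returns false at the end of the file.
    public boolean next(Content content) throws Exception
    {
        Text key = new Text();
        boolean ret = m_reader.next(key, content);

        // The reader is closed only once the file has been read to the end.
        if (!ret)
        {
            m_reader.close();
        }
        return ret;
    }
}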
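5. Getting the HDFS paths from the configured location
(1) The path that Nutch produced in our setup is /home/user/xipei/nutch1.0/crawl/segments/20091215145839/content/data.
      The records in it carry several properties, such as version, url and content. If you are interested, have a look at the source code.
(2) 20091215145839 is the 14-digit timestamp that Nutch generates when it crawls. We do not need to hard-code it: listing ./segments/ gives every 14-digit timestamp directory under it. Here is the code: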
    /**
     * Lists the segment directories under prefixPath and returns, for each one,
     * the path of its content/part-00000 directory.
     * @param prefixPath the segments directory
     * @return one path per segment; empty if nothing could be listed
     */
    public static List<String> getHdfsPath(String prefixPath) {
        List<String> hdfsPaths = new ArrayList<String>();
        Path path = new Path(prefixPath);
        Configuration conf = NutchConfiguration.create();
        try {
            FileSystem fs = path.getFileSystem(conf);
            FileStatus[] fileStatus = fs.listStatus(path);
            // Hadoop paths always use "/", so File.separator must not be used here.
            String suffixPath = "content/part-00000";
            // Guard against a null result and return an empty list instead of null.
            if (fileStatus == null) return hdfsPaths;
            for (int i = 0; i < fileStatus.length; i++) {
                hdfsPaths.add(prefixPath + "/" + fileStatus[i].getPath().getName() + "/" + suffixPath);
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        return hdfsPaths;
    }
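     Note: prefixPath is the segments directory from the path shown above. As long as it is passed in correctly, the loop collects one path per segment directory. ContentReader then appends /data to whichever path it is given.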
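The method above adds every entry it finds under the segments directory, whether or not it is a segment. As a rough sketch of how the 14-digit names could be checked first, the code below uses only the JDK. It assumes the 14 digits follow the yyyyMMddHHmmss layout that the example name 20091215145839 suggests. The class SegmentNames and both method names are made up for this illustration, and the list of child names stands in for the result of fs.listStatus.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SegmentNames {
    // Assumption: segment directory names are timestamps in yyyyMMddHHmmss form.
    private static final DateTimeFormatter SEGMENT_FORMAT =
            DateTimeFormatter.ofPattern("yyyyMMddHHmmss");

    /** Returns the crawl time encoded in a segment name, or null if it is not one. */
    static LocalDateTime parseSegmentName(String name) {
        if (name == null || !name.matches("\\d{14}")) {
            return null;
        }
        try {
            return LocalDateTime.parse(name, SEGMENT_FORMAT);
        } catch (DateTimeParseException e) {
            return null; // 14 digits, but not a valid date and time
        }
    }

    /** Builds content/part-00000 paths for the child names that look like segments. */
    static List<String> contentPartPaths(String segmentsDir, List<String> childNames) {
        List<String> paths = new ArrayList<>();
        for (String name : childNames) {
            if (parseSegmentName(name) != null) {
                paths.add(segmentsDir + "/" + name + "/content/part-00000");
            }
        }
        return paths;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("20091215145839", "_temporary");
        System.out.println(contentPartPaths("/crawl/segments", names));
    }
}
```

The same check could be applied inside the loop of getHdfsPath to fileStatus[i].getPath().getName() before the path is added.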
(3) Here is a quick look at how to read the URLs. The other fields are read in the same way.
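    /**
     * Reads every record under one content path and collects the URLs.
     * @param hdfsPath a path returned by getHdfsPath
     * @return the URLs that were read before the end of the file or an error
     */
    public static List<String> getUrl(String hdfsPath) {
        List<String> urls = new ArrayList<String>();
        try {
            // The original snippet used "reader" without declaring it.
            ContentReader reader = new ContentReader(hdfsPath);
            Content content = new Content();
            // The same Content object is refilled on every call to next().
            while (reader.next(content)) {
                String url = content.getUrl() != null ? content.getUrl() : "";
                urls.add(url);
            }
        } catch (Exception e) {
            e.printStackTrace();
        }

        return urls;
    }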


2009-12-16 17:43
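#2 langxiashahai 2010-07-09

OP, this doesn't seem to work. Could you post a working version of the code?

#1 comsci 2009-12-26

I have been following Nutch all along, and I'm glad the OP is sharing so much experience.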
