(Repost) The PathFilter Class in HDFS

zhouchaofei2010

Blog category: hadoop

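Processing a batch of files in a single operation is a common requirement. For example, a MapReduce job that processes logs may need to analyze a month of log files spread across a large number of directories. Using a wildcard expression to match multiple files is convenient, because the input can be specified without listing every file and directory. Hadoop provides two FileSystem methods for globbing: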

public FileStatus[] globStatus(Path pathPattern) throws IOException
public FileStatus[] globStatus(Path pathPattern, PathFilter filter) throws IOException

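　　The globStatus() method returns an array of FileStatus objects for all paths that match the pattern, sorted by path. The wildcards Hadoop supports are the same as those of Unix bash.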

　　The second method takes a PathFilter object as an additional argument, which can restrict the matches further. PathFilter is an interface with a single method, accept(Path path).

The following example demonstrates what PathFilter does:

　　RegexExcludePathFilter.java: this class implements the PathFilter interface and overrides the accept method.

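import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

// Rejects every path whose full string form matches the given regex.
class RegexExcludePathFilter implements PathFilter {
    private final String regex;

    public RegexExcludePathFilter(String regex) {
        this.regex = regex;
    }

    @Override
    public boolean accept(Path path) {
        // String.matches tests the entire path string, not a substring.
        return !path.toString().matches(regex);
    }
}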
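Because accept relies on String.matches, the regex has to match the whole path string, which is why the pattern used later is ".*txt" rather than just "txt". The following is a minimal JDK-only sketch of that behavior, not Hadoop code: plain strings stand in for Path objects, and the class name MatchesSemanticsDemo and its helper methods are hypothetical names chosen for illustration.

```java
import java.util.regex.Pattern;

// JDK-only sketch: plain strings stand in for Hadoop Path objects.
class MatchesSemanticsDemo {
    // Same check the filter performs: returns true when the path is kept.
    static boolean accept(String path, String regex) {
        return !path.matches(regex);
    }

    // Substring-style search, shown only for contrast with String.matches.
    static boolean containsMatch(String path, String regex) {
        return Pattern.compile(regex).matcher(path).find();
    }

    public static void main(String[] args) {
        String path = "hdfs://master:9000/user/hadoop/test/a.txt";
        // ".*txt" covers the whole string, so the path is rejected.
        System.out.println(accept(path, ".*txt"));
        // "txt" alone does not cover the whole string, so the path is kept.
        System.out.println(accept(path, "txt"));
        // A substring search does find "txt" inside the path.
        System.out.println(containsMatch(path, "txt"));
    }
}
```

Note that ".*txt" excludes any path whose string form ends in "txt", with or without a dot before it. A stricter pattern such as ".*\\.txt" would limit the exclusion to names ending in ".txt".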

　　The following method prints the paths that match the glob pattern:

 1 // Using a glob pattern
 2     public static void list() throws IOException{
 3         Configuration conf = new Configuration();
 4         FileSystem fs = FileSystem.get(conf);
 5         // The PathFilter removes paths that match the given regex; here, paths ending in txt are filtered out
 6         FileStatus[] status = fs.globStatus(new Path("hdfs://master:9000/user/hadoop/test/*"),new RegexExcludePathFilter(".*txt"));
 7         //FileStatus[] status = fs.globStatus(new Path("hdfs://master:9000/user/hadoop/test/*"));
 8         Path[] listedPaths = FileUtil.stat2Paths(status);
 9         for (Path p : listedPaths) {
10             System.out.println(p);
11         }
12     }

If line 6 is commented out and line 7 is uncommented (globStatus called with the glob pattern alone), the output is:
hdfs://master:9000/user/hadoop/test/a.txt
hdfs://master:9000/user/hadoop/test/b.txt
hdfs://master:9000/user/hadoop/test/c.aaa
hdfs://master:9000/user/hadoop/test/c.txt
hdfs://master:9000/user/hadoop/test/cc.aaa

If line 7 is commented out and line 6 is uncommented (glob pattern plus the filter), the output is:

hdfs://master:9000/user/hadoop/test/c.aaa
hdfs://master:9000/user/hadoop/test/cc.aaa

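This shows that the PathFilter applies a further restriction after the glob pattern has matched: paths that match the filter's regex are removed from the result. This follows directly from the statement return !path.toString().matches(regex); in the accept method, which rejects every path that matches. If it were changed to return path.toString().matches(regex); the filter would instead keep the paths that match regex and remove those that do not.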
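The exclude and include variants described above can be compared side by side. This is a minimal JDK-only sketch under stated assumptions: java.util.function.Predicate stands in for PathFilter, the five path strings are copied from the output listed earlier, and the class name IncludeExcludeDemo and its filter helper are hypothetical. It is not the Hadoop API itself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// JDK-only sketch: Predicate<String> stands in for Hadoop's PathFilter.
class IncludeExcludeDemo {
    // Keeps only the paths the predicate accepts, preserving input order.
    static List<String> filter(List<String> paths, Predicate<String> accept) {
        List<String> kept = new ArrayList<>();
        for (String p : paths) {
            if (accept.test(p)) {
                kept.add(p);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String dir = "hdfs://master:9000/user/hadoop/test/";
        List<String> paths = Arrays.asList(
                dir + "a.txt", dir + "b.txt", dir + "c.aaa", dir + "c.txt", dir + "cc.aaa");

        // Include variant: accept only paths that match the regex.
        Predicate<String> include = p -> p.matches(".*txt");
        // Exclude variant: the negation, as in RegexExcludePathFilter.
        Predicate<String> exclude = include.negate();

        System.out.println(filter(paths, exclude)); // the two .aaa paths
        System.out.println(filter(paths, include)); // the three .txt paths
    }
}
```

Writing the include test once and deriving the exclude test by negation keeps the two variants consistent with each other.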

Reposted from: http://www.cnblogs.com/liuling/p/2013-6-18-02.html

2014-07-18 10:18