When writing production MapReduce jobs in Hadoop, the following operations come up again and again, yet they are not always available out of the box in the Hadoop 0.20 API.
1) Get the size of an HDFS file or directory
By checking how much input data a job is about to read, you can dynamically adjust the number of reducers the job uses.
2) Recursively remove all zero-byte files from an HDFS directory
Using the MultipleOutputs class in a reducer (and, to a lesser extent, in a mapper) produces many such files: a reducer often writes no records at all to a given MultipleOutputs stream, and the resulting empty files are best deleted once the job completes (a sketch of a way to avoid creating them in the first place follows this list).
3) Recursively get all subdirectories of a directory
4) Recursively get all files and subdirectories of a directory
By default, a Hadoop job processes only the files sitting directly under the input directory; files inside subdirectories of the input path are ignored. To process everything under the subdirectories as well, it is best to build a comma-separated list of all the file paths under the input and submit that list to the job (see the driver sketch after the class listing below).
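As an aside on item 2: rather than deleting empty part files after the fact, newer Hadoop releases ship LazyOutputFormat, which defers creating an output file until the first record is actually written, so reducers that emit nothing leave no zero-byte files behind. Below is a minimal sketch, assuming the new org.apache.hadoop.mapreduce API and LazyOutputFormat are available in your distribution; the class name, job name, argument paths, and omitted mapper/reducer wiring are placeholders, not part of the original article.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class LazyOutputJobSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "lazy-output-sketch");   // placeholder job name
        job.setJarByClass(LazyOutputJobSketch.class);
        // Set your Mapper, Reducer, and output key/value classes here as usual.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Wrap the real output format: a part file is created only when the first
        // record is written, so reducers that emit nothing leave no zero-byte files.
        LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The ExtendedFileUtil class that implements the four operations above follows.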
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;

import java.io.IOException;
import java.net.URI;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;
import java.util.Stack;
import java.util.regex.Pattern;

public class ExtendedFileUtil extends FileUtil {

    /**
     * Core worker: walks one or more comma-delimited HDFS paths and collects
     * files and/or directories, optionally descending into subdirectories.
     */
    private String[] getFilesAndDirectories(String fileOrDirList, boolean recursively,
                                            boolean getDirectories, boolean getFiles) throws IOException {
        Configuration configuration = new Configuration();
        String root = configuration.get("fs.default.name");
        ArrayList<String> arraylist = new ArrayList<String>();
        Stack<Path> stack = new Stack<Path>();
        String uri = null;
        String[] fileOrDir = fileOrDirList.split(",", -1);
        for (String aFileOrDir : fileOrDir) {
            // Prefix relative paths with the default file system URI (e.g. hdfs://host:port).
            if (aFileOrDir.indexOf(root) == -1) {
                uri = root + aFileOrDir;
            } else {
                uri = aFileOrDir;
            }
            FileSystem fs = FileSystem.get(URI.create(uri), configuration);
            Path[] paths = new Path[1];
            paths[0] = new Path(uri);
            // First level: list the path itself.
            FileStatus[] statuses = fs.listStatus(paths);
            for (FileStatus status : statuses) {
                if (status.isDir()) {
                    stack.push(status.getPath());
                    if (getDirectories) {
                        arraylist.add(status.getPath().toString());
                    }
                } else {
                    if (getFiles) {
                        arraylist.add(status.getPath().toString());
                    }
                }
            }
            // Depth-first walk of any directories found above.
            if (recursively) {
                while (!stack.empty()) {
                    Path p1 = stack.pop();
                    FileSystem fs1 = FileSystem.get(URI.create(p1.toString()), configuration);
                    paths[0] = new Path(p1.toString());
                    FileStatus[] childStatuses = fs1.listStatus(paths);
                    for (FileStatus child : childStatuses) {
                        if (child.isDir()) {
                            stack.push(child.getPath());
                            if (getDirectories) {
                                arraylist.add(child.getPath().toString());
                            }
                        } else {
                            if (getFiles) {
                                arraylist.add(child.getPath().toString());
                            }
                        }
                    }
                }
            }
            // Note: FileSystem.get returns a cached, shared instance, so closing it here
            // also closes it for any other code holding the same reference.
            fs.close();
        }
        arraylist.trimToSize();
        String[] returnArray = new String[arraylist.size()];
        return arraylist.toArray(returnArray);
    }

    /**
     * @param fileOrDir   Comma-delimited list of input files or directories in HDFS. Input can be given as a full
     *                    HDFS URL, e.g. "hdfs://hd4.ev1.yellowpages.com:9000/user/directory" and "/user/directory"
     *                    mean the same thing.
     * @param recursively When set to "true", recursively opens all subdirectories and returns the files found.
     */
    public String[] getFilesOnly(String fileOrDir, boolean recursively) throws IOException {
        return this.getFilesAndDirectories(fileOrDir, recursively, false, true);
    }

    /**
     * Same as getFilesOnly(String fileOrDir, boolean recursively) except that it only returns paths
     * that match the given regex.
     */
    public String[] getFilesOnly(String fileOrDir, boolean recursively, String regex) throws IOException {
        ArrayList<String> arraylist = new ArrayList<String>();
        String[] tempArr = this.getFilesOnly(fileOrDir, recursively);
        Pattern p = Pattern.compile(".*" + regex + ".*");
        // Keep only the paths whose names match the regex.
        for (String path : tempArr) {
            if (p.matcher(path).matches()) {
                arraylist.add(path);
            }
        }
        arraylist.trimToSize();
        String[] returnArray = new String[arraylist.size()];
        return arraylist.toArray(returnArray);
    }

    /**
     * @param fileOrDir   Comma-delimited list of input files or directories in HDFS. Input can be given as a full
     *                    HDFS URL, e.g. "hdfs://hd4.ev1.yellowpages.com:9000/user/directory" and "/user/directory"
     *                    mean the same thing.
     * @param recursively When set to "true", recursively opens all subdirectories and returns the subdirectories found.
     */
    public String[] getDirectoriesOnly(String fileOrDir, boolean recursively) throws IOException {
        return this.getFilesAndDirectories(fileOrDir, recursively, true, false);
    }

    /**
     * @param fileOrDir   Comma-delimited list of input files or directories in HDFS. Input can be given as a full
     *                    HDFS URL, e.g. "hdfs://hd4.ev1.yellowpages.com:9000/user/directory" and "/user/directory"
     *                    mean the same thing.
     * @param recursively When set to "true", recursively opens all subdirectories and returns both files and subdirectories.
     */
    public String[] getFilesAndDirectories(String fileOrDir, boolean recursively) throws IOException {
        return this.getFilesAndDirectories(fileOrDir, recursively, true, true);
    }

    /**
     * This method uses recursion to retrieve a list of files/directories.
     *
     * @param p             Path to the directory or file you want to start at.
     * @param configuration Configuration
     * @param files         a Map of Path to FileStatus objects, filled in by this method.
     * @throws IOException
     */
    public void getFiles(Path p, Configuration configuration, Map<Path, FileStatus> files) throws IOException {
        FileSystem fs = FileSystem.get(p.toUri(), configuration);
        if (files == null) {
            files = new HashMap<Path, FileStatus>();
        }
        if (fs.isFile(p)) {
            files.put(p, fs.getFileStatus(p));
        } else {
            FileStatus[] statuses = fs.listStatus(p);
            for (FileStatus s : statuses) {
                if (s.isDir()) {
                    // Descend into subdirectories.
                    getFiles(s.getPath(), configuration, files);
                } else {
                    files.put(s.getPath(), s);
                }
            }
        }
        fs.close();
    }

    /**
     * This method deletes all zero-byte files within a directory and all of its subdirectories.
     *
     * @param fileOrDir If a file, delete it if it is zero bytes; if a directory, delete
     *                  all zero-byte files from the directory and its subdirectories.
     */
    public void removeAllZeroByteFiles(String fileOrDir) {
        try {
            Configuration configuration = new Configuration();
            Map<Path, FileStatus> files = new HashMap<Path, FileStatus>();
            this.getFiles(new Path(fileOrDir), configuration, files);
            for (Path p : files.keySet()) {
                FileStatus s = files.get(p);
                if (s.getLen() == 0) {
                    FileSystem fs = FileSystem.get(p.toUri(), configuration);
                    fs.delete(p, false);
                    fs.close();
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    /**
     * This method returns the size of a file or a directory in HDFS.
     *
     * @param fileOrDir file or directory, or a comma-delimited list of files or directories in HDFS; for a
     *                  directory, the sizes of all files within it and its subdirectories are summed
     * @return size of the file or directory (sum of all files in the directory and subdirectories)
     */
    public long size(String fileOrDir) throws IOException {
        long totalSize = 0;
        Configuration configuration = new Configuration();
        String[] allFiles = fileOrDir.split(",", -1);
        for (String allFile : allFiles) {
            Path p = new Path(allFile);
            FileSystem fs = FileSystem.get(p.toUri(), configuration);
            totalSize = totalSize + fs.getContentSummary(p).getLength();
            fs.close();
        }
        return totalSize;
    }

    /**
     * Moves a single file or directory, or multiple files or directories, to the trash if they exist.
     * It also accepts a comma-delimited list of HDFS files or directories.
     *
     * @param fileOrDir HDFS file or directory name, or a comma-delimited list of HDFS file or directory names
     * @throws IOException
     */
    public void removeHdfsPath(String fileOrDir) throws IOException {
        Configuration configuration = new Configuration();
        FileSystem fs = FileSystem.newInstance(URI.create(fileOrDir), configuration);
        String[] fileList = fileOrDir.split(",", -1);
        Trash trash = new Trash(configuration);
        trash.expunge();
        for (String aFile : fileList) {
            Path p = new Path(aFile);
            if (fs.exists(p)) {
                trash.moveToTrash(p);
            }
        }
        fs.close();
    }
}
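To show how the class above might be wired into a job, here is a hypothetical driver sketch against the old org.apache.hadoop.mapred API. The class name ExtendedFileUtilDriver, the argument paths, and the one-reducer-per-GB heuristic are invented for illustration: size() scales the reducer count with the input volume, getFilesOnly() expands the input path into a comma-separated list so files in subdirectories are processed, and removeAllZeroByteFiles() cleans up empty part files once the job finishes.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class ExtendedFileUtilDriver {
    public static void main(String[] args) throws Exception {
        String input = args[0];   // e.g. "/user/logs/2012-03-01"  (placeholder path)
        String output = args[1];  // e.g. "/user/logs/out"         (placeholder path)

        ExtendedFileUtil util = new ExtendedFileUtil();

        // 1) Scale the reducer count with the input size: here, one reducer per GB
        //    of input, with a minimum of one (an arbitrary illustrative heuristic).
        long inputBytes = util.size(input);
        int reducers = (int) Math.max(1L, inputBytes / (1024L * 1024L * 1024L));

        // 3)/4) Expand the input recursively so files in subdirectories are processed too.
        String[] allInputFiles = util.getFilesOnly(input, true);
        StringBuilder inputList = new StringBuilder();
        for (int i = 0; i < allInputFiles.length; i++) {
            if (i > 0) {
                inputList.append(",");
            }
            inputList.append(allInputFiles[i]);
        }

        JobConf job = new JobConf(new Configuration(), ExtendedFileUtilDriver.class);
        job.setJobName("extended-file-util-driver");
        // Set your Mapper, Reducer, and output key/value classes here as usual.
        job.setNumReduceTasks(reducers);
        FileInputFormat.setInputPaths(job, inputList.toString());
        FileOutputFormat.setOutputPath(job, new Path(output));

        JobClient.runJob(job);

        // 2) Drop any zero-byte part files left behind, e.g. by MultipleOutputs.
        util.removeAllZeroByteFiles(output);
    }
}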