From http://developer.yahoo.com/hadoop/tutorial/module2.html
Rebalancing Blocks
How to add a new node to the cluster:
New nodes can be added to a cluster in a straightforward manner. On the new node, the same Hadoop version and configuration (conf/hadoop-site.xml) as on the rest of the cluster should be installed. Starting the DataNode daemon on the machine will cause it to contact the NameNode and join the cluster. (The new node should be added to the slaves file on the master server as well, to inform the master how to invoke script-based commands on the new node.)
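As a rough sketch of that sequence (the hostname below is illustrative, not a real machine), on a 0.20-era cluster this might look like:

# On the master: add the new node to conf/slaves so cluster-wide scripts can reach it
echo "new-datanode-host" >> conf/slaves
# On the new node, with the same Hadoop version and conf/hadoop-site.xml already in place,
# start the DataNode daemon; it will contact the NameNode and join the cluster
bin/hadoop-daemon.sh start datanode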
How to balance data onto the new node:
But the new DataNode will have no data on board initially; it is therefore not alleviating space concerns on the existing nodes. New files will be stored on the new DataNode in addition to the existing ones, but for optimum usage, storage should be evenly balanced across all nodes.
This can be achieved with the automatic balancer tool included with Hadoop. The Balancer class will intelligently balance blocks across the nodes to achieve an even distribution of blocks within a given threshold, expressed as a percentage. (The default is 10%.) Smaller percentages make nodes more evenly balanced, but may require more time to achieve this state. Perfect balancing (0%) is unlikely to actually be achieved.
The balancer script can be run by starting bin/start-balancer.sh in the Hadoop directory. The script can be provided a balancing threshold percentage with the -threshold parameter;
e.g., bin/start-balancer.sh -threshold 5.
The balancer will terminate automatically when it achieves its goal, when an error occurs, or when it can find no more candidate blocks to move toward a better balance. The balancer can always be terminated safely by the administrator by running bin/stop-balancer.sh.
The balancing script can be run when nobody else is using the cluster (e.g., overnight), but it can also be run in an "online" fashion while many other jobs are ongoing. To prevent the rebalancing process from consuming large amounts of bandwidth and significantly degrading the performance of other processes on the cluster, the dfs.balance.bandwidthPerSec configuration parameter can be used to limit the number of bytes/sec each node may devote to rebalancing its data store.
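For example, an entry like the following inside the <configuration> element of conf/hadoop-site.xml would cap each node at roughly 1 MB/s of rebalancing traffic (the value is only an illustration; the parameter takes bytes per second):

<property>
  <name>dfs.balance.bandwidthPerSec</name>
  <value>1048576</value>
  <description>Maximum bytes per second each DataNode may devote to rebalancing (about 1 MB/s here).</description>
</property>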
Copying Large Sets of Files
When migrating a large number of files from one location to another (either from one HDFS cluster to another, from S3 into HDFS or vice versa, etc), the task should be divided between multiple nodes to allow them all to share in the bandwidth required for the process. Hadoop includes a tool called distcp for this purpose.
By invoking bin/hadoop distcp src dest, Hadoop will start a MapReduce job to distribute the burden of copying a large number of files from src to dest. These two parameters may specify a full URL for the path to copy. e.g., "hdfs://SomeNameNode:9000/foo/bar/" and "hdfs://OtherNameNode:2000/baz/quux/" will copy the children of /foo/bar on one cluster to the directory tree rooted at /baz/quux on the other. The paths are assumed to be directories, and are copied recursively. S3 URLs can be specified with s3://bucket-name/key.
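Using the placeholder NameNodes from the paragraph above (they are not real addresses), the two cases look like:

bin/hadoop distcp hdfs://SomeNameNode:9000/foo/bar/ hdfs://OtherNameNode:2000/baz/quux/
# Copying from S3 into HDFS follows the same pattern; bucket-name and key are placeholders
bin/hadoop distcp s3://bucket-name/key hdfs://SomeNameNode:9000/baz/quux/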
Decommissioning Nodes
How to remove nodes from the cluster:
In addition to allowing nodes to be added to the cluster on the fly, nodes can also be removed from a cluster while it is running, without data loss. But if nodes are simply shut down "hard," data loss may occur, as they may hold the sole copy of one or more file blocks.
Nodes must be retired on a schedule that allows HDFS to ensure that no blocks are entirely replicated within the to-be-retired set of DataNodes.
HDFS provides a decommissioning feature which ensures that this process is performed safely. To use it, follow the steps below:
Step 1: Cluster configuration. If you anticipate that nodes may be retired from your cluster, then before it is started, an excludes file must be configured. Add a key named dfs.hosts.exclude to your conf/hadoop-site.xml file. The value associated with this key is the full path to a file on the NameNode's local file system which contains a list of machines that are not permitted to connect to HDFS.
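A sketch of such an entry, placed inside the <configuration> element of conf/hadoop-site.xml (the path is illustrative; any file on the NameNode's local disk will do):

<property>
  <name>dfs.hosts.exclude</name>
  <value>/path/to/hadoop/conf/excludes</value>
  <description>Full path to a local file listing hosts that may not connect to the NameNode.</description>
</property>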
Step 2: Determine hosts to decommission. Each machine to be decommissioned should be added to the file identified by dfs.hosts.exclude, one per line. This will prevent them from connecting to the NameNode.
Step 3: Force configuration reload. Run the command bin/hadoop dfsadmin -refreshNodes. This will force the NameNode to reread its configuration, including the newly-updated excludes file. It will decommission the nodes over a period of time, allowing time for each node's blocks to be replicated onto machines which are scheduled to remain active.
Step 4: Shut down nodes. After the decommission process has completed, the decommissioned hardware can be safely shut down for maintenance, etc. The bin/hadoop dfsadmin -report command will describe which nodes are connected to the cluster.
Step 5: Edit excludes file again. Once the machines have been decommissioned, they can be removed from the excludes file. Running bin/hadoop dfsadmin -refreshNodes again will read the excludes file back into the NameNode, allowing the DataNodes to rejoin the cluster after maintenance has been completed, or additional capacity is needed in the cluster again, etc.
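Put together, a hedged walkthrough of steps 2 through 5 might look like the following (the hostname and excludes path are illustrative):

# Step 2: list each host to retire, one per line, in the excludes file
echo "datanode-to-retire.example.com" >> /path/to/hadoop/conf/excludes
# Step 3: have the NameNode reread its configuration and begin decommissioning
bin/hadoop dfsadmin -refreshNodes
# Step 4: check progress; the report shows each node's decommission status
bin/hadoop dfsadmin -report
# Step 5: after decommissioning completes, remove the host from the excludes file
# and refresh again so it may rejoin the cluster once maintenance is done
bin/hadoop dfsadmin -refreshNodes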
Verifying File System Health
After decommissioning nodes, restarting a cluster, or periodically during its lifetime, you may want to ensure that the file system is healthy: that files are not corrupted or under-replicated, and that blocks are not missing.
Hadoop provides an fsck command to do exactly this. It can be launched at the command line like so:
bin/hadoop fsck [path] [options]
If run with no arguments, it will print usage information and exit. If run with the argument /, it will check the health of the entire file system and print a report. If provided with a path to a particular directory or file, it will only check files under that path. If an option argument is given but no path, it will start from the file system root (/). Two different types of options may be given:
Action options specify what action should be taken when corrupted files are found. This can be -move, which moves corrupt files to /lost+found, or -delete, which deletes corrupted files.
Information options specify how verbose the tool should be in its report. The -files option will list all files it checks as it encounters them. This information can be further expanded by adding the -blocks option, which prints the list of blocks for each file. Adding -locations to these two options will then print the addresses of the DataNodes holding these blocks. Still more information can be retrieved by adding -racks to the end of this list, which then prints the rack topology information for each location. (See the next subsection for more information on configuring network rack awareness.) Note that the latter options do not imply the former; you must use them in conjunction with one another. Also, note that the Hadoop program uses -files in a "common argument parser" shared by the different commands such as dfsadmin, fsck, dfs, etc. This means that if you omit a path argument to fsck, it will not receive the -files option that you intend. You can separate common options from fsck-specific options by using -- as an argument, like so:
bin/hadoop fsck -- -files -blocks
The -- is not required if you provide a path to start the check from, or if you specify another argument first such as -move.
By default, fsck will not operate on files still open for write by another client. A list of such files can be produced with the -openforwrite option.
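As a usage sketch (the directory under /user is illustrative), a full-detail check of the whole file system and a scan for files still open for write under one directory could be run as:

bin/hadoop fsck / -files -blocks -locations -racks
bin/hadoop fsck /user/someuser -openforwrite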