Note: this document is copied from a Hadoop JIRA issue, kept here as a backup.
I've done some estimates on how much space our data structures take on the name-node per block, file and directory.
Brief overview of the data structures:
Directory tree (FSDirectory) is built of inodes. Each INode points either to an array of blocks
if it corresponds to a file or to a TreeMap<String, INode> of children INodes if it is a directory.
[Note: these estimates were made before Dhruba replaced the children TreeMap with an ArrayList.]
Each block participates also in at least 2 more data structures.
BlocksMap contains a HashMap<Block, BlockInfo> of all blocks mapping a Block into a BlockInfo.
DatanodeDescriptor contains a TreeMap<Block, Block> of all blocks belonging to this data-node.
A block may also be contained in other data structures, such as:
UnderReplicatedBlocks
PendingReplicationBlocks
recentInvalidateSets
excessReplicateMap
Presence of a block in any of these structures is temporary and therefore I do not count them in my estimates.
The estimates can be viewed as lower bounds.
These are the classes in question:

class INode {
  String name;
  INode parent;
  TreeMap<String, INode> children;
  Block blocks[];
  short blockReplication;
  long modificationTime;
}

class Block {
  long blkid;
  long len;
}

class BlockInfo {
  FSDirectory.INode inode;
  DatanodeDescriptor[] nodes;
  Block block;
}
The calculations are made for a 64-bit Java VM based on the following assumptions:
Reference size = 8 bytes
Object header size = 16 bytes
Array header size = 24 bytes
Commonly used objects:
TreeMap.Entry = 64 bytes. It has 5 reference fields
HashMap.Entry = 48 bytes. It has 3 reference fields
String header = 64 bytes.
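These per-object figures follow from the stated assumptions plus 8-byte alignment. The sketch below is a rough model, assuming the field counts of the JDK classes of that era (5 references plus a boolean color for TreeMap.Entry; 3 references plus an int hash for HashMap.Entry; one reference plus three ints plus an empty char[] for String) — it reproduces the numbers above:

```java
// Rough 64-bit object-size model: 16-byte object header + 8 bytes per
// reference field + primitive bytes, rounded up to an 8-byte boundary.
public class JvmSizeModel {
    static long objectSize(int refFields, int primitiveBytes) {
        long raw = 16 + 8L * refFields + primitiveBytes;
        return (raw + 7) / 8 * 8; // round up to 8-byte alignment
    }

    public static void main(String[] args) {
        // TreeMap.Entry: key, value, left, right, parent refs + boolean color
        System.out.println(objectSize(5, 1));       // 64
        // HashMap.Entry: key, value, next refs + int hash
        System.out.println(objectSize(3, 4));       // 48
        // String: char[] ref + 3 ints, plus a 24-byte empty char[] (array header)
        System.out.println(objectSize(1, 12) + 24); // 64
    }
}
```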
The size of a file includes:
Size of an empty file INode: INode.children = null, INode.blocks is a 0-length array, and file name is empty. (152 bytes)
A directory entry of the parent INode that points to this file, which is a TreeMap.Entry. (64 bytes)
File name length times 2, because String stores each character in 2 bytes.
Reference to the outer FSDirectory class (8 bytes)
The total: 224 + 2 * fileName.length
The size of a directory includes:
Size of an empty directory INode: INode.children is an empty TreeMap, INode.blocks = null, and file name is empty. (192 bytes)
A directory entry of the parent INode that points to this directory, which is a TreeMap.Entry. (64 bytes)
File name length times 2.
Reference to the outer FSDirectory class (8 bytes)
The total: 264 + 2 * fileName.length
The size of a block includes:
Size of Block. (32 bytes)
Size of BlockInfo. (64 + 8*replication bytes)
Reference to the block from INode.blocks (8 bytes)
HashMap.Entry referencing the block from BlocksMap. (48 bytes)
References to the block from all DatanodeDescriptors it belongs to.
This is a TreeMap.Entry size times block replication. (64 * replication)
The total: 152 + 72 * replication
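The three totals above can be captured as simple formulas (a minimal sketch; the class and method names are mine, not Hadoop's):

```java
// Per-object name-node memory estimates, taken directly from the totals
// derived above: file, directory, and block footprints in bytes.
public class NameNodeEstimates {
    static long fileSize(int fileNameLength) { return 224 + 2L * fileNameLength; }
    static long dirSize(int fileNameLength)  { return 264 + 2L * fileNameLength; }
    static long blockSize(int replication)   { return 152 + 72L * replication; }

    public static void main(String[] args) {
        // A 13-character name and replication 3 give the "typical" sizes.
        System.out.println(fileSize(13)); // 250
        System.out.println(dirSize(13));  // 290
        System.out.println(blockSize(3)); // 368
    }
}
```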
Typical object sizes:
Taking into account that a typical file name is 10-15 bytes and our default replication is 3, we can say that typical sizes are:
File size: 250
Directory size: 290
Block size: 368
Object      Size estimate (bytes)        Typical size (bytes)
File        224 + 2 * fileName.length    250
Directory   264 + 2 * fileName.length    290
Block       152 + 72 * replication       368
One of our clusters has
Files: 10 600 000
Dirs: 310 000
Blocks: 13 300 000
Total size (estimate): 7.63 GB
Memory used on the name-node (actual reported by jconsole after gc): 9 GB
This means that the other data structures (NetworkTopology, heartbeats, datanodeMap, Host2NodesMap,
leases, sortedLeases, and the various block replication maps) occupy ~1.4 GB, which seems rather high
and needs to be investigated as well.
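Multiplying the typical per-object sizes by the object counts reproduces both the 7.63 GB estimate and the ~1.4 GB gap (a quick arithmetic check using the figures above):

```java
// Cluster-wide estimate: object counts times typical per-object sizes.
public class ClusterEstimate {
    public static void main(String[] args) {
        long files = 10_600_000L, dirs = 310_000L, blocks = 13_300_000L;
        long total = files * 250 + dirs * 290 + blocks * 368;
        System.out.println(total);            // 7634300000 bytes ~ 7.63 GB

        long measured = 9_000_000_000L;       // ~9 GB reported by jconsole
        System.out.println(measured - total); // ~1.37e9 bytes ~ 1.4 GB of other structures
    }
}
```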
Based on the above estimates, blocks should be the main focus of the name-node memory reduction effort.
A block takes about 50% more space than a file, and there are more blocks than files or directories.
Some ideas on how we can reduce the name-node size without substantially changing the data structures:
INode.children should be an ArrayList instead of a TreeMap. Already done in HADOOP-1565. (-48 bytes)
Factor the INode class out of FSDirectory into a standalone class, removing the reference to the outer class. (-8 bytes)
Create a base INode class and derive file and directory inode classes from it.
Directory inodes do not need the blocks and replication fields. (-16 bytes)
File inodes do not need the children field. (-8 bytes)
String name should be replaced by a mere byte[]. (-(40 + fileName.length) ~ -50 bytes)
Eliminate the Block object.
We should move the Block fields into BlockInfo and get rid of the Block object entirely. (-16 bytes)
The Block object is referenced at least 5 times in our structures for each physical block.
The number of references should be reduced to just 2. (-24 bytes)
Remove the name field from INode. The file or directory name is stored in the corresponding directory
entry and does not need to be duplicated in the INode. (-8 bytes)
Eliminate INode.parent field. INodes are accessed through the directory tree, and the parent can
be remembered in a local variable while browsing the tree. There is no need to persistently store
the parent reference for each object. (-8 bytes)
Optimize the data-node-to-block map. Currently each DatanodeDescriptor holds a TreeMap of the
blocks contained on that node, so we pay the overhead of one TreeMap.Entry per block replica.
I expect we can reorganize datanodeMap so that it stores only 1 or 2 references per replica
instead of an entire TreeMap.Entry. (-48 * replication)
Note: in general, TreeMaps turned out to be very expensive; we should avoid them where possible,
or implement a custom map structure that avoids allocating an object per map entry.
This is what we will have after all of the optimizations:

Object      Size estimate (bytes)     Typical size (bytes)   Current typical size (bytes)
File        112 + fileName.length     125                    250
Directory   144 + fileName.length     155                    290
Block       112 + 24 * replication    184                    368
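The class split sketched in the ideas above could look roughly like this (a hypothetical sketch with made-up class names; the actual Hadoop refactoring may differ):

```java
import java.util.ArrayList;

// Hypothetical sketch of the proposed refactoring, not the actual Hadoop code:
// a base INode with a byte[] name and no parent pointer, subclassed per kind.
abstract class INode {
    byte[] name; // raw bytes instead of a String (saves the String overhead)

    INode(byte[] name) { this.name = name; }
    // No parent field: the parent is tracked while walking the tree.
}

class INodeFile extends INode {
    BlockInfo[] blocks; // Block's fields are folded into BlockInfo
    short replication;

    INodeFile(byte[] name, short replication) {
        super(name);
        this.replication = replication;
    }
}

class INodeDirectory extends INode {
    // ArrayList instead of TreeMap (HADOOP-1565): no per-child Entry object.
    ArrayList<INode> children = new ArrayList<INode>();

    INodeDirectory(byte[] name) { super(name); }
}

class BlockInfo { // absorbs the former Block's blkid and len fields
    long blockId;
    long numBytes;
    DatanodeDescriptor[] nodes;
}

class DatanodeDescriptor { }
```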