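I hadn't read the code carefully, so I asked the question below on the HBase mailing list. In fact, a closer reading of the Bigtable paper makes it clear that the files must be kept open. My analysis so far is that HBase random read performance is determined by several factors:
1) HDFS seeks: on average, how many seeks does each random get cause?
2) Memory copying: this matters most when data locality is poor, for example when the DataNode and the RegionServer are not on the same node.
3) The block cache?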
Hi, I know that in general a RegionServer manages HRegions, and that in the HDFS layer the data in an HRegion is stored in HFile format. I want to know whether all HFiles are kept open and whether things like the block index are loaded up front to improve lookup performance. If so, what happens if this exceeds the memory limit?
Thanks.
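To make the question concrete, here is a minimal sketch of what "block index loaded in memory" could look like. It assumes the index is just a sorted map from the first row key of each data block to that block's byte offset. The class `BlockIndexSketch` and its methods are hypothetical names for illustration, not HBase's actual API.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model of an in-memory block index; not HBase's real classes.
final class BlockIndexSketch {
    // First row key of each data block -> byte offset of that block in the file.
    private final TreeMap<String, Long> firstKeyToOffset = new TreeMap<>();

    void addBlock(String firstKey, long offset) {
        firstKeyToOffset.put(firstKey, offset);
    }

    // Returns the offset of the only block that could contain rowKey,
    // or -1 if rowKey sorts before the first block of the file.
    long findBlockOffset(String rowKey) {
        Map.Entry<String, Long> entry = firstKeyToOffset.floorEntry(rowKey);
        return entry == null ? -1L : entry.getValue();
    }

    public static void main(String[] args) {
        BlockIndexSketch index = new BlockIndexSketch();
        index.addBlock("apple", 0L);
        index.addBlock("mango", 65536L);
        index.addBlock("peach", 131072L);
        // "orange" sorts between "mango" and "peach", so only the
        // block starting at "mango" needs to be read from disk.
        System.out.println(index.findBlockOffset("orange"));
    }
}
```

With such an index held in memory, a lookup costs one in-memory search plus one block read, which is why keeping the index loaded helps, and also why its size counts against the heap.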
Stack replied (January 13):
Yes, all files are opened on startup and kept open. Opening an HBase
storefile/hfile includes loading the file index and metadata.
In our experience, this overhead has been small. It's currently not
accounted for in our general memory-counting. We should for sure add
it.
St.Ack
My follow-up (January 13):
Thanks for your response, Stack. I have a further question about how HBase works.
As I understand it, a get is executed as follows.
hbase client <=> RS <=> DN
1) The HBase client finds the RS managing the key; 2) the RS knows the HFile and fetches the data from the DataNode, which may be a pread plus a scan within the HBase data block; 3) the resulting record is returned to the client.
Is this correct? If so, is step 2 the most expensive operation? Are there any other time-consuming places?
Ryan Rawson replied (January 13):
Retrieving data from disk is the most dominant element, until you are
fully cached, in which case other factors inside the regionserver
become dominant. At that point copying memory, GC, algorithmic
complexity, etc. become important.
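The effect of being "fully cached" can be illustrated with a small least-recently-used block cache: every miss stands for a disk (HDFS) read, every hit avoids one. This is only a sketch under assumptions. The class `BlockCacheSketch`, the string block ids and the 1 KB dummy blocks are hypothetical, and HBase's real block cache is more elaborate.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical LRU cache of data blocks; a model, not HBase's block cache.
final class BlockCacheSketch {
    private final Map<String, byte[]> blocks;
    private int diskReads = 0;

    BlockCacheSketch(final int maxBlocks) {
        // accessOrder=true makes iteration order least-recently-used first.
        this.blocks = new LinkedHashMap<String, byte[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
                return size() > maxBlocks; // evict the LRU block when over capacity
            }
        };
    }

    // Returns the block, counting a simulated disk read on a cache miss.
    byte[] readBlock(String blockId) {
        byte[] block = blocks.get(blockId);
        if (block == null) {
            diskReads++;
            block = new byte[1024]; // stand-in for data fetched from HDFS
            blocks.put(blockId, block);
        }
        return block;
    }

    int diskReads() {
        return diskReads;
    }

    public static void main(String[] args) {
        BlockCacheSketch cache = new BlockCacheSketch(2);
        for (String id : new String[] {"a", "b", "a", "c", "b"}) {
            cache.readBlock(id);
        }
        // Misses: a, b, c (which evicts b), then b again.
        System.out.println(cache.diskReads()); // prints 4
    }
}
```

Once the working set fits in the cache, `diskReads` stops growing and the remaining cost is what the reply lists: memory copies, GC and algorithmic work inside the regionserver.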
My follow-up (January 14):
Is the HDFS seek the dominant cost in retrieving data? If records are small (~1 KB) and most requests are random Gets, how many seeks happen on average during a Get? By the way, what do you mean by memory copying, and when does it cause large overhead? Thanks.
(Sent in reply to Ryan Rawson's message of 2011/1/13.)
J-D replied (at 3:08):
There should be as many seeks as there are store files in the region
that's serving the data. There's also the family dimension, e.g. if you
read from only 1 family then only those store files are read.
So on average, I'd say you'll do 3 seeks, since you do a minor
compaction once you reach 4 store files in a family.
What he meant by memory copying is just that the data has to be copied
from the socket when you read from HDFS and then into the outbound
socket for the client after the region server does whatever processing
it needs to do. I guess the more data you read, the longer it takes to
copy in RAM?
J-D
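J-D's rule of thumb can be written down as a tiny calculation: the number of seeks for a Get is roughly the sum of the store file counts of the column families being read. The sketch below assumes exactly that and nothing more. The class `SeekEstimateSketch`, the family names and the file counts are made up for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical back-of-the-envelope model; not a measurement of HBase.
final class SeekEstimateSketch {
    // One seek per store file in each family that the Get touches.
    static int seeksForGet(Map<String, Integer> storeFilesPerFamily,
                           List<String> familiesRead) {
        int seeks = 0;
        for (String family : familiesRead) {
            seeks += storeFilesPerFamily.getOrDefault(family, 0);
        }
        return seeks;
    }

    public static void main(String[] args) {
        Map<String, Integer> storeFiles = new HashMap<>();
        storeFiles.put("info", 3); // made-up counts for two families
        storeFiles.put("meta", 2);
        // Reading a single family only touches that family's store files.
        System.out.println(seeksForGet(storeFiles, Arrays.asList("info")));
        // Reading both families touches all five files.
        System.out.println(seeksForGet(storeFiles, Arrays.asList("info", "meta")));
    }
}
```

Under this model, restricting a Get to one family and keeping the store file count low through compaction are the two levers that reduce seeks. A block cache hit would remove a seek entirely, which the model ignores.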