Will all HFiles managed by a regionserver be kept open?

standalone


Blog categories:

hbase
hadoop
cloud

HBase

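I had not read the code carefully, so I asked this question on the HBase mailing list. In fact, a closer reading of the Bigtable paper makes it clear that the files must be kept open. My current analysis is that HBase random-read performance depends on several factors:

1) HDFS seek operations: how many seeks does each random get cause on average?

2) Memory copying; this matters especially when data locality is poor, for example when the DataNode and the regionserver are not on the same node;

3) The block cache?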

Hi, I know that a regionserver generally manages HRegions and that, at the HDFS layer, the data in an HRegion is stored in HFile format. I want to know whether all HFiles are kept open and whether things like the block index are loaded up front to improve lookup performance. If so, what happens if this exceeds the memory limit?

Thanks.

Stack

to user, January 13

Yes, all files are opened on startup and kept open. Opening an hbase
storefile/hfile includes loading the file index and metadata.
In our experience, this overhead has been small. It's currently not
accounted for in our general memory-counting. We should for sure add
it.

St.Ack
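
To illustrate why keeping the file index in memory helps lookups, here is a minimal Python sketch, not HBase's actual code. It assumes (hypothetically) that an index is a sorted list of each data block's first key together with its file offset; `BlockIndex` and `find_block` are illustrative names only. With the index resident in memory, locating the one block that may hold a key is an in-memory binary search, so only that block has to be read from HDFS.

```python
import bisect


class BlockIndex:
    """Toy in-memory block index: sorted first keys plus block offsets."""

    def __init__(self, entries):
        # entries: iterable of (first_key_in_block, file_offset) pairs
        entries = sorted(entries)
        self.first_keys = [k for k, _ in entries]
        self.offsets = [off for _, off in entries]

    def find_block(self, key):
        """Return the offset of the only block that could contain key, or None."""
        # Rightmost block whose first key is <= key.
        i = bisect.bisect_right(self.first_keys, key) - 1
        if i < 0:
            return None  # key sorts before the first block in this file
        return self.offsets[i]


# Assumed example data: three blocks of 64 KB each.
index = BlockIndex([("apple", 0), ("kiwi", 65536), ("pear", 131072)])
print(index.find_block("mango"))  # offset of the block that starts at "kiwi"
```

The memory cost of such an index grows with the number of blocks across all open files, which is the overhead Stack says is not yet counted.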


Tao Xie

to user, January 13

Thanks for your response, Stack. I have a further question about how HBase works.

As I understand it, a get is executed as follows.

hbase client <=> RS <=> DN

1) hbase client finds the RS managing the key; 2) RS knows the hfile and fetches data from DataNode, this may be a pread + scanning in the hbase data block; 3) record result is returned to client.

Is this correct? So the most expensive operation is step 2? Any other time-consuming places?

2011/1/13 Stack <stack@duboce.net>


Ryan Rawson

to user, January 13

Retrieving data from disk is the dominant cost until you are
fully cached, in which case other factors inside the regionserver
become dominant. At that point memory copying, GC, algorithmic
complexity, etc. become important.
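
Ryan's point can be put as a back-of-the-envelope model. The sketch below is a rough illustration only: the function name and all latency figures are assumptions chosen for the example, not measurements of HBase.

```python
def expected_get_latency_ms(cache_hit_ratio, disk_read_ms, in_memory_ms):
    """Average latency of a Get when a fraction of reads is served from cache."""
    if not 0.0 <= cache_hit_ratio <= 1.0:
        raise ValueError("cache_hit_ratio must be between 0 and 1")
    miss_ratio = 1.0 - cache_hit_ratio
    # Every Get pays the in-memory cost; only cache misses also pay for disk.
    return in_memory_ms + miss_ratio * disk_read_ms


# Assumed figures: 10 ms per disk read, 0.2 ms of regionserver-side work.
for hit_ratio in (0.0, 0.5, 0.9, 1.0):
    print(hit_ratio, expected_get_latency_ms(hit_ratio, 10.0, 0.2))
```

Under these assumed numbers the disk term dominates until the hit ratio is close to 1, after which only the in-memory work (copying, GC, and so on) remains.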


Tao Xie

to user, January 14

Is the HDFS seek the dominant cost in retrieving data? If records are small (~1k) and most requests are random Gets, how many seeks will happen on average during a Get? By the way, what do you mean by memory copying? When will it cause a large overhead? Thanks.

2011/1/13 Ryan Rawson <ryanobjc@gmail.com>


Jean-Daniel Cryans

to user, 3:08

There should be as many seeks as there are store files in the region
that's serving the data. There's also the family dimension, e.g. if you
read from only 1 family then only those store files are read.

So on average, I'd say you'll do 3 seeks since you do a minor
compaction once you reach 4 store files in a family.

What he meant by memory copying is just that the data has to be copied
from the socket when you read from HDFS and then into the outbound
socket for the client after the region server does whatever processing
it needs to do. I guess the more data you read, the longer it takes to
copy in RAM?

J-D
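
J-D's rule of thumb can be written down as a small calculation. This is a simplified sketch with hypothetical names (`estimate_seeks`, `store_files_per_family`); it assumes one seek per store file in each family that the Get touches and ignores the block cache.

```python
def estimate_seeks(store_files_per_family, families=None):
    """Estimate HDFS seeks for one Get: one per store file in each family read.

    store_files_per_family: dict mapping family name -> number of store files.
    families: families the Get reads; None means all families in the region.
    """
    if families is None:
        families = store_files_per_family.keys()
    return sum(store_files_per_family[f] for f in families)


region = {"info": 3, "metrics": 2}  # assumed store-file counts per family
print(estimate_seeks(region))            # reads both families
print(estimate_seeks(region, ["info"]))  # reads only the "info" family
```

Restricting a Get to the families it actually needs lowers the count, which matches the "family dimension" J-D mentions.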


2011-01-19 10:29