Hadoop HBase: how to pre-split a table into regions at creation time

艾伦蓝

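If you know how the row keys of an HBase table are distributed, you can pre-split the table into regions when you create it. A table created without split points starts as a single region, so a large bulk insert initially lands on one region server. Pre-splitting avoids that write hotspot and improves insert throughput.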

1. Plan the HBase pre-split
---------------------------
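First work out how the row keys are distributed. Then decide how many regions to create and what the start key and end key of each region should be, and write the planned split keys to a file, one per line. For example, if every key begins with a numeric prefix from 0001 to 0010, the table can be divided into 10 regions, which takes 9 split keys:

0001|
0002|
0003|
0004|
0005|
0006|
0007|
0008|
0009|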
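A split file of this kind can also be generated rather than typed by hand. The following is a minimal sketch, assuming four-digit numeric prefixes and a "|" suffix. The function names `build_split_keys` and `write_split_file` are made up for illustration, and the file name is the one used later in this post.

```python
from typing import List


def build_split_keys(first: int, last: int, width: int = 4, suffix: str = "|") -> List[str]:
    """Return the boundary keys that separate the prefixes first..last.

    N prefixes need only N - 1 boundaries, so the last prefix gets no entry.
    """
    return [f"{prefix:0{width}d}{suffix}" for prefix in range(first, last)]


def write_split_file(path: str, keys: List[str]) -> None:
    """Write one split key per line (hypothetical helper, plain ASCII text)."""
    with open(path, "w", encoding="ascii") as handle:
        handle.write("\n".join(keys) + "\n")


if __name__ == "__main__":
    keys = build_split_keys(1, 10)          # prefixes 0001..0010 -> 9 boundaries
    write_split_file("region_split_info.txt", keys)
    print(keys)
```

Ten prefixes yield nine lines, because each line is a boundary between two neighbouring regions.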

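Why does each split key end with "|"? In ASCII, "|" has the value 124, which is greater than every digit and letter, so any key that starts with a given prefix sorts before that prefix followed by "|". "~" (ASCII 126) works as well. The first line of the split file is the stop key of the first region, and each following line works the same way. The last line is both the stop key of the second-to-last region and the start key of the last region. In other words, the split file contains only the boundary points that divide the key range (see the figure in the original post).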
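To see how the boundaries assign keys to regions, here is a small sketch that mimics the lookup with a binary search over byte strings. It assumes that a region covers the range from its start key (inclusive) to its stop key (exclusive) and that keys compare byte by byte. The name `region_index` and the sample row keys are invented for the example.

```python
from bisect import bisect_right
from typing import List

# The nine boundaries from the plan above: 0001| ... 0009|
SPLIT_KEYS = [f"{n:04d}|".encode("ascii") for n in range(1, 10)]


def region_index(row_key: str, splits: List[bytes] = SPLIT_KEYS) -> int:
    """Return the 0-based index of the region that would hold row_key.

    Region i spans [splits[i - 1], splits[i]); the first region has no
    start key and the last region has no stop key.
    """
    return bisect_right(splits, row_key.encode("ascii"))


if __name__ == "__main__":
    for key in ("0001_user42", "0005abc", "0010zzz"):
        print(key, "-> region", region_index(key))
```

Because "|" sorts after digits and letters, every key with the prefix 0005 falls below the boundary 0005| and above 0004|, so it ends up in the fifth region (index 4).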

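2. Create the pre-split table in the hbase shell, specifying the split file
---------------------------------------------------------------------------
Typing create on its own in the hbase shell prints the following usage hints: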


Create a table with namespace=ns1 and table qualifier=t1  
  hbase> create 'ns1:t1', {NAME => 'f1', VERSIONS => 5}  
  
Create a table with namespace=default and table qualifier=t1  
  hbase> create 't1', {NAME => 'f1'}, {NAME => 'f2'}, {NAME => 'f3'}  
  hbase> # The above in shorthand would be the following:  
  hbase> create 't1', 'f1', 'f2', 'f3'  
  hbase> create 't1', {NAME => 'f1', VERSIONS => 1, TTL => 2592000, BLOCKCACHE => true}  
  hbase> create 't1', {NAME => 'f1', CONFIGURATION => {'hbase.hstore.blockingStoreFiles' => '10'}}  
    
Table configuration options can be put at the end.  
Examples:  
  
  hbase> create 'ns1:t1', 'f1', SPLITS => ['10', '20', '30', '40']  
  hbase> create 't1', 'f1', SPLITS => ['10', '20', '30', '40']  
  hbase> create 't1', 'f1', SPLITS_FILE => 'splits.txt', OWNER => 'johndoe'  
  hbase> create 't1', {NAME => 'f1', VERSIONS => 5}, METADATA => { 'mykey' => 'myvalue' }  
  hbase> # Optionally pre-split the table into NUMREGIONS, using  
  hbase> # SPLITALGO ("HexStringSplit", "UniformSplit" or classname)  
  hbase> create 't1', 'f1', {NUMREGIONS => 15, SPLITALGO => 'HexStringSplit'}  
  hbase> create 't1', 'f1', {NUMREGIONS => 15, SPLITALGO => 'HexStringSplit', CONFIGURATION => {'hbase.hregion.scan.loadColumnFamiliesOnDemand' => 'true'}}  
  hbase> create 't1', {NAME => 'f1'}, {NAME => 'if1', LOCAL_INDEX=>'COMBINE_INDEX|INDEXED=f1:q1:8|rowKey:rowKey:10,UPDATE=true'}
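The NUMREGIONS / SPLITALGO form in the help text asks HBase to compute the split points instead of reading them from a file. The sketch below only illustrates the general idea of dividing a hexadecimal key space evenly. It assumes an 8-digit key space from 00000000 to FFFFFFFF and is not taken from the HBase source, so the exact boundaries HBase produces may differ. The function name `hex_split_points` is made up.

```python
from typing import List


def hex_split_points(num_regions: int, digits: int = 8) -> List[str]:
    """Evenly divide a hex key space into num_regions ranges.

    Returns the num_regions - 1 boundary keys as zero-padded hex strings.
    Illustrative approximation only, not HBase's implementation.
    """
    if num_regions < 2:
        raise ValueError("need at least two regions to have a split point")
    key_space = 16 ** digits            # number of distinct keys, e.g. 2**32
    step = key_space // num_regions     # size of each region's key range
    return [f"{step * i:0{digits}x}" for i in range(1, num_regions)]


if __name__ == "__main__":
    print(hex_split_points(4))
    print(len(hex_split_points(15)))
```

This kind of even split only balances the load when the row keys themselves are spread uniformly over the hex range, for example hashed keys.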

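You can point to a split file by setting SPLITS_FILE. If there are only a few split points, you can also list them directly with SPLITS. The following command creates a pre-split table using the split file produced in step 1: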

create 'split_table_test', 'cf', {SPLITS_FILE => 'region_split_info.txt'}
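For a short list of boundaries, the inline SPLITS form from the help text is an alternative to the file. As a rough sketch, the statement can be rendered from a list of keys like this. The helper name `render_create_with_splits` is hypothetical, and it assumes the keys contain no single quotes.

```python
from typing import Iterable


def render_create_with_splits(table: str, family: str, split_keys: Iterable[str]) -> str:
    """Build an hbase shell create statement with inline SPLITS.

    Follows the form shown in the shell help:
    create 't1', 'f1', SPLITS => ['10', '20', '30', '40']
    Assumes no key contains a single quote (no escaping is done).
    """
    quoted = ", ".join(f"'{key}'" for key in split_keys)
    return f"create '{table}', '{family}', SPLITS => [{quoted}]"


if __name__ == "__main__":
    # Paste the printed line into the hbase shell.
    print(render_create_with_splits("split_table_test", "cf", ["0001|", "0002|", "0003|"]))
```

The output is only text to paste into the hbase shell. Nothing here talks to a cluster.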

SNAPPY compression
--------------------------------

create 'split_table_test',{NAME =>'cf', COMPRESSION => 'SNAPPY'}, {SPLITS_FILE => '/tmp/region_split_info.txt'}

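Note that the split parameters must go in their own pair of braces, separate from the column family definition, because pre-splitting applies to the whole table rather than to a single column family.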

Reposted from: http://blog.csdn.net/chaolovejia/article/details/46375849#

2017-05-15 11:18