How-to: Use HBase Bulk Loading, and Why

zhangxiong0301


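HBase provides random, real-time read/write access to big data, but first the data has to be loaded into HBase efficiently. The usual approach is a MapReduce job that uses TableOutputFormat to write through the HBase API. That route goes through the full HBase write path: every edit is written to the WAL and the MemStore, MemStores are flushed to disk, and the flushes in turn trigger compactions and splits. Bulk loading, described below, is often a better way.

Consider bulk loading when you run into any of the following symptoms while using HBase: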

·  You needed to tweak your MemStores to use most of the memory.
·  You needed to either use bigger WALs or bypass them entirely.
·  Your compaction and flush queues are in the hundreds.
·  Your GC is out of control because your inserts range in the MBs.
·  Your latency goes out of your SLA when you import data.

Bulk loading is the process of preparing and loading HFiles (HBase’s own file format) directly into the RegionServers, thus bypassing the write path and obviating those issues entirely.

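Bulk load steps

1. Extract the data from a source, typically text files or another database. HBase is not involved in this step. You extract the data yourself with a tool such as mysqldump and upload it to HDFS so that it is ready for the following steps.

2. Transform the data into HFiles. This is the core of bulk loading: typically a MapReduce job generates one HFile per region. In many cases you have to implement the mapper yourself. If the default mapper is sufficient, no custom mapper is needed; this mainly applies when every field of each line in the TSV file corresponds directly to a column in HBase (including the row key). The output key of the MapReduce job must be the row key, and the value must be a KeyValue, a Put, or a Delete. The reducer is handled entirely by HBase and is set up through HFileOutputFormat.configureIncrementalLoad(), which does the following: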

·  Inspects the table to configure a total order partitioner
·  Uploads the partitions file to the cluster and adds it to the DistributedCache
·  Sets the number of reduce tasks to match the current number of regions
·  Sets the output key/value class to match HFileOutputFormat’s requirements
·  Sets the reducer up to perform the appropriate sorting (either KeyValueSortReducer or PutSortReducer)

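3. Load the files into HBase by telling the RegionServers where to find them. This is the easiest step: run LoadIncrementalHFiles (commonly known as the completebulkload tool) and point it at the directory that contains the generated HFiles, and each file is loaded into its matching region. If the target table's regions split after the HFiles were generated but before they were loaded, the tool automatically splits the HFiles along the new region boundaries while loading them, although this is not very efficient. So if other processes are writing to the target table during the bulk load, load the HFiles into HBase as soon as possible after they are generated.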
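As a rough illustration of what the total order partitioner and the sort reducer achieve together in step 2, the sketch below simulates them in plain Python instead of calling HBase. The function names region_index and partition_and_sort and the sample cells are made up for this example; the split keys are the ones used for the wordcount table later in this post. It assumes the usual HBase convention that a region's start key is inclusive and its end key is exclusive.

```python
from bisect import bisect_right


def region_index(row_key, split_keys):
    """Return the index of the region whose key range contains row_key.

    split_keys must be sorted. Region 0 covers everything below the first
    split key; region i covers [split_keys[i - 1], split_keys[i]).
    """
    return bisect_right(split_keys, row_key)


def partition_and_sort(cells, split_keys):
    """Group (row_key, value) pairs by region, then sort each group."""
    # N split keys define N + 1 regions, hence N + 1 "reducers".
    buckets = [[] for _ in range(len(split_keys) + 1)]
    for row_key, value in cells:
        buckets[region_index(row_key, split_keys)].append((row_key, value))
    # Each reducer writes its cells in sorted row key order.
    return [sorted(bucket) for bucket in buckets]


# Hypothetical sample data; bytes compare byte by byte, like HBase row keys.
split_keys = [b"g", b"m", b"r", b"w"]
cells = [
    (b"zebra", b"1"),
    (b"apple", b"7"),
    (b"mango", b"3"),
    (b"grape", b"2"),
    (b"banana", b"5"),
]
files = partition_and_sort(cells, split_keys)
for index, content in enumerate(files):
    print(index, content)
```

Each inner list stands in for one HFile: because every file holds a single region's key range in sorted order, the load in step 3 can hand the files to the regions without rewriting them.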

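Use cases

1. Original dataset load. This case is mainly about migrating data from another storage system into HBase. Create the table first and pre-split it; the split keys should take into account the distribution of the row keys and the number of regions you want.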

2. Incremental load. This case applies when a table in HBase is already serving traffic and you need to import an additional batch of data into it.
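For the pre-splitting mentioned under the original dataset load, one simple way to derive split keys is to take evenly spaced keys from a sorted sample of the row keys. The following is a minimal sketch under that assumption; choose_split_keys and the sample keys are hypothetical and not part of HBase.

```python
def choose_split_keys(sample_keys, num_regions):
    """Pick num_regions - 1 split keys that divide the sample evenly."""
    if num_regions < 2:
        return []  # a single region needs no split keys
    # Row keys are unique within a table, so drop duplicates from the sample.
    keys = sorted(set(sample_keys))
    if len(keys) < num_regions:
        raise ValueError("sample is too small for the requested region count")
    step = len(keys) / num_regions
    # Take the key at each region boundary of the sorted sample.
    return [keys[int(i * step)] for i in range(1, num_regions)]


# Hypothetical sample of row keys drawn from the data to be imported.
sample = ["apple", "banana", "cherry", "grape", "kiwi",
          "lemon", "mango", "peach", "plum", "zebra"]
splits = choose_split_keys(sample, 5)
print(splits)
```

The resulting keys can then be passed in the SPLITS option when the table is created. A sample that reflects the real key distribution matters more than the exact selection rule.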

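Examples

1. Direct import. Suppose a word count file (CSV or TSV) needs to be imported into HBase, and each line has the format [word, count]. The HBase table is designed with the word as the row key and the count as the only column. The steps are as follows.

Upload the CSV or TSV file: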

                    hdfs dfs -put word_count.csv /user/hadoop/word_count.csv

Create the table with pre-split regions:

     create 'wordcount', {NAME => 'f'}, {SPLITS => ['g','m','r','w']}

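Generate the HFiles. Note that if importtsv.bulk.output is not specified, the data is written directly into HBase instead. HBASE_ROW_KEY is the reserved name that marks the row key column. To see the usage of this command, run it without arguments.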

./bin/hbase org.apache.hadoop.hbase.mapreduce.Driver importtsv -Dimporttsv.separator=, -Dimporttsv.bulk.output=/user/hadoop/wordcount/ -Dimporttsv.columns=HBASE_ROW_KEY,f:count wordcount /user/hadoop/word_count.csv

After the command finishes, the target directory contains one HFile for each region:

Found 5 items
-rw-r--r--   3 hadoop supergroup      10201 2015-06-26 10:36 /user/hadoop/wordcount/f/558cfca392a945e9acf7abb5851d50c9
-rw-r--r--   3 hadoop supergroup       7468 2015-06-26 10:36 /user/hadoop/wordcount/f/61e199926f2347a9a444d2a7ad1ffeb3
-rw-r--r--   3 hadoop supergroup       6311 2015-06-26 10:36 /user/hadoop/wordcount/f/b559aa29e7074a5fb79c2ffa746f1717
-rw-r--r--   3 hadoop supergroup       5529 2015-06-26 10:36 /user/hadoop/wordcount/f/d34d5903715e423f99eef210cc3c5123
-rw-r--r--   3 hadoop supergroup       2383 2015-06-26 10:36 /user/hadoop/wordcount/f/d570e72200114d39896632808095b6c9
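The column mapping given through importtsv.columns in the command above can be pictured with a small Python sketch. It is not the ImportTsv implementation, only a simplified model of how each field is assigned either to the row key or to a family:qualifier column; parse_line is a hypothetical helper, the sample line is made up, and unlike the real tool this sketch raises an error on a bad line.

```python
def parse_line(line, columns, separator=","):
    """Map one delimited line to (row_key, cells) using a column spec."""
    fields = line.rstrip("\n").split(separator)
    if len(fields) != len(columns):
        raise ValueError("field count does not match the column spec")
    row_key = None
    cells = {}
    for name, value in zip(columns, fields):
        if name == "HBASE_ROW_KEY":
            # The reserved name marks the field used as the row key.
            row_key = value
        else:
            # Every other entry has the form family:qualifier.
            family, _, qualifier = name.partition(":")
            cells[(family, qualifier)] = value
    if row_key is None:
        raise ValueError("column spec has no HBASE_ROW_KEY entry")
    return row_key, cells


# Same spec as in the importtsv command: row key first, then f:count.
columns = "HBASE_ROW_KEY,f:count".split(",")
row_key, cells = parse_line("hello,42", columns)
print(row_key, cells)
```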

Load the data into HBase:

./bin/hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /user/hadoop/wordcount/ wordcount

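2. Import with a custom MapReduce job. This is also straightforward: you only implement a mapper and a driver, and the reducer is provided by HBase through the call to HFileOutputFormat.configureIncrementalLoad(job, hTable) described above. In effect, HFileOutputFormat.configureIncrementalLoad sets up everything except the mapper, that is, the shuffle and the reducer. Running the MapReduce job with hadoop jar therefore generates the HFiles. Taking a TSV file of Facebook messages about the 2010 NBA Finals as an example, the process is as follows:

1. Upload the TSV file and create the table, as in the first two steps of the direct import.

2. Run the MapReduce job (the HFile generation step of the direct import also runs a MapReduce job). The only arguments are the path of the CSV data file, the output path, and the name of the HBase table.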

       hadoop jar my_map_reduce.jar com.my.mapreduce.Driver data.csv output_dir NBAFinal2010_tableName

3. Load the data:

  hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles output_dir NBAFinal2010_tableName
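A custom mapper is needed when the input fields do not map one-to-one onto columns, for example when the row key has to be built from several fields. The sketch below shows that idea in plain Python rather than with the Hadoop Mapper API. The record layout (timestamp, user, message), the family name f, the function map_record, and the sample lines are all assumptions made for illustration and do not describe the actual dataset or the attached code.

```python
def map_record(line, family="f"):
    """Turn one input line into (row_key, cells), or None if malformed."""
    # Split on the first two separators only, so the message text may
    # itself contain commas.
    fields = line.rstrip("\n").split(",", 2)
    if len(fields) != 3:
        return None  # skip lines that do not match the assumed layout
    timestamp, user, message = fields
    # Composite row key: all messages of one user sort next to each other.
    row_key = f"{user}_{timestamp}"
    cells = [(family, "message", message)]
    return row_key, cells


# Hypothetical input lines, one of them malformed.
lines = [
    "1276473600,alice,Great game tonight, what a finish",
    "malformed line",
    "1276473660,bob,Go team",
]
emitted = [record for record in map(map_record, lines) if record is not None]
for row_key, cells in emitted:
    print(row_key, cells)
```

In the real job the mapper would emit the row key together with a KeyValue or Put, and the partitioning and sorting would be left to what configureIncrementalLoad sets up.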

The custom MapReduce code is attached:

hbase-bulk-import-example-master.zip (9.4 KB)
2015-06-26 16:46