There are 3 common ways to import data into HBase:
1> Use ImportTsv
ImportTsv is a utility that will load data in TSV format into HBase.
It can be used in two ways:
a. Loading TSV-format data from HDFS into HBase via Puts.
This kind of load is a non-bulk load.
E.g.:
bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=HBASE_ROW_KEY,information:c1 table_name /user/hadoop/data
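For this command the input files under /user/hadoop/data must be tab-separated (ImportTsv's default separator), one record per line. A hypothetical two-line sample, where <TAB> stands for a tab character:
row_001<TAB>some_value
row_002<TAB>another_value
The first field becomes the row key (HBASE_ROW_KEY) and the second is stored in information:c1.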
b. Another method is to use bulk loading.
It is divided into 2 steps:
1> Generate StoreFiles for bulk loading:
HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-0.94.2.jar importtsv -Dimporttsv.columns=HBASE_ROW_KEY,information:c1 -Dimporttsv.bulk.output=/user/hadoop/rdc_search_hfile -Dimporttsv.separator=, rdc_search_information /user/hadoop/sourcedata
2>Move the generated StoreFiles into an HBase table using completebulkload utility
bin/hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /user/hadoop/rdc_search_hfile rdc_search_information
In the above demo, the table name is rdc_search_information and it has one column family, information.
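The second step can also be done programmatically from Java. A minimal sketch, assuming HBase 0.94-era client classes and the same HFile path and table name as in the demo above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

public class BulkLoadHFiles {
    public static void main(String[] args) throws Exception {
        // Reads hbase-site.xml from the classpath
        Configuration conf = HBaseConfiguration.create();
        LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
        // Moves the generated StoreFiles into the regions of the target table
        loader.doBulkLoad(new Path("/user/hadoop/rdc_search_hfile"),
                new HTable(conf, "rdc_search_information"));
    }
}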
Running ImportTsv with no arguments prints brief usage information:
Usage: importtsv -Dimporttsv.columns=a,b,c <tablename> <inputdir>
Imports the given input directory of TSV data into the specified table.
The column names of the TSV data must be specified using the -Dimporttsv.columns
option. This option takes the form of comma-separated column names, where each
column name is either a simple column family, or a columnfamily:qualifier. The special
column name HBASE_ROW_KEY is used to designate that this column should be used
as the row key for each imported record. You must specify exactly one column
to be the row key, and you must specify a column name for every column that exists in the
input data.
By default importtsv will load data directly into HBase. To instead generate
HFiles of data to prepare for a bulk data load, pass the option:
-Dimporttsv.bulk.output=/path/for/output
Note: the target table will be created with default column family descriptors if it does not already exist.
Other options that may be specified with -D include:
-Dimporttsv.skip.bad.lines=false - fail if encountering an invalid line
'-Dimporttsv.separator=|' - eg separate on pipes instead of tabs
-Dimporttsv.timestamp=currentTimeAsLong - use the specified timestamp for the import
-Dimporttsv.mapper.class=my.Mapper - A user-defined Mapper to use instead of org.apache.hadoop.hbase.mapreduce.TsvImporterMapper
2> We can write a Java project that reads a file from the local file system and uses the Put class to put the data into HBase, as in the sketch below.
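A minimal sketch of this approach, assuming the table and column family from the ImportTsv demo above, a hypothetical comma-separated local input file, and the HBase 0.94-era client API:

import java.io.BufferedReader;
import java.io.FileReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class PutDataIntoHBase {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();            // reads hbase-site.xml
        HTable table = new HTable(conf, "rdc_search_information");   // table from the demo above
        // Hypothetical local file: one "rowkey,value" pair per line
        BufferedReader reader = new BufferedReader(new FileReader("/home/hadoop/sourcedata.csv"));
        String line;
        while ((line = reader.readLine()) != null) {
            String[] fields = line.split(",");
            Put put = new Put(Bytes.toBytes(fields[0]));              // first field is the row key
            put.add(Bytes.toBytes("information"), Bytes.toBytes("c1"),
                    Bytes.toBytes(fields[1]));                        // second field goes to information:c1
            table.put(put);
        }
        reader.close();
        table.close();
    }
}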
3>Use the Pig to read the data from HDFS and write to HBase.
Ref: http://hbase.apache.org/book/ops_mgt.html#importtsv