serial no | solution | level | precondition / run on | flow | advantages / shortcomings / use cases
--- | --- | --- | --- | --- | ---
1 | direct client API | log | - | transfer data via both clusters | 
2 | export/import | log | run on src, then target | MR generates HDFS sequence files -> transfer the files -> import with MR | supports a time-range filter
3 | CopyTable | stream | run on src | if cluster-to-cluster connectivity is enabled: copy the data (memstore + HFiles) directly to the other cluster; if NOT enabled: same as export/import, but the last step puts the files with hdfs put | 
4 | replication | WAL | sync the WAL with the new cluster | | 
5 | bulkload | | | | 
6 | snapshot | file | flush before snapshotting if online; run on src, then target | create a snapshot -> clone it to a new table -> restore from the new table [cluster internal] | 
7 | distcp | file | run on src; stop HBase before distcp | flush the memstore -> distcp the files between the two clusters | cannot copy data within a specified date range, but can be used as the final step to transfer the files generated by other solutions
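For solution 2, the time-range filter is what makes export/import useful for incremental backups. A sketch of the export side, assuming the MR usage `Export <tablename> <outputdir> [<versions> [<starttime> [<endtime>]]]`; the table name, output path, and millisecond timestamps below are placeholders:

```shell
# Sketch of solution 2's export side; tableX, /backup/tableX and the
# epoch-millisecond timestamps are placeholders.
CMD="hbase org.apache.hadoop.hbase.mapreduce.Export tableX /backup/tableX 1 1401552000000 1404057600000"
echo "$CMD"    # review it, then run it on the source cluster
```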
Now, I want to retrieve last month's data from a table and back it up to another cluster, but the two clusters cannot connect to each other (so no cross-cluster MR). Here are the steps I came up with:
1. Subset the table data (last month: 2014-06-01 -> 2014-06-30):
hbase org.apache.hadoop.hbase.mapreduce.CopyTable -Dhbase.client.scanner.caching=1000 -Dmapred.map.tasks.speculative.execution=false --starttime=1401552000000 --endtime=1404057600000 --new.name=new-tableX tableX
Then you MUST flush this table, because some data still sits in the memstores and the next step operates directly at the file level:
echo "flush 'new-tableX' "|hbase shell
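The --starttime/--endtime values are epoch milliseconds; the ones above correspond to midnight on 2014-06-01 and 2014-06-30 in UTC+8. A quick way to derive them (assumes GNU date for the -d syntax):

```shell
# Epoch-millisecond bounds for CopyTable's --starttime/--endtime; the values
# in step 1 correspond to midnight in UTC+8 (requires GNU date).
START_MS=$(( $(date -d '2014-06-01 00:00:00 +0800' +%s) * 1000 ))
END_MS=$(( $(date -d '2014-06-30 00:00:00 +0800' +%s) * 1000 ))
echo "--starttime=$START_MS --endtime=$END_MS"
# prints: --starttime=1401552000000 --endtime=1404057600000
```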
2. Download the table's HFiles from HDFS:
hadoop fs -get /hbase/new-tableX new-tableX
(Of course, you can parallelize this command across multiple nodes by splitting the directories into subtasks.)
3. Transfer these files to the other cluster in parallel:
a. scp part of the files to local nodes A, B, C, ...
b. from each node, scp its part of the files to a peer node of the other cluster
(this balances the load, since network bandwidth is limited per node on both sides)
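Steps 3a/3b can be scripted. A minimal sketch that assigns part files round-robin to the relay nodes and prints the scp commands to run; nodeA..nodeC, the part names, and /data/parts are all assumed placeholders:

```shell
# Round-robin the part files over the relay nodes so that no single NIC
# becomes the bottleneck. Node names and paths are hypothetical; the script
# only prints the plan -- pipe its output to sh to actually run it.
NODES="nodeA nodeB nodeC"
PARTS="part-0 part-1 part-2 part-3 part-4"   # stand-ins for new-tableX/* dirs
set -- $NODES
N=$#
i=0
for part in $PARTS; do
    # pick node number (i mod N) + 1 from the positional parameters
    node=$(eval echo "\${$(( i % N + 1 ))}")
    echo "scp -r new-tableX/$part $node:/data/parts/"
    i=$(( i + 1 ))
done
```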
4. Import the data into the target cluster's HDFS:
hadoop fs -put part-files /hbase
(just mkdir the directory first if it does not exist)
5. Register these HFiles in meta and assign the regions:
hbase hbck -fixMeta
then
hbase hbck -fixAssignments
(run the -fixAssignments step once more to judge whether the table is readable)
6. Rename the new table back to the original name [optional]:
hbase shell> disable 'tableName'
hbase shell> snapshot 'tableName', 'tableSnapshot'
hbase shell> clone_snapshot 'tableSnapshot', 'newTableName'
hbase shell> delete_snapshot 'tableSnapshot'
hbase shell> drop 'tableName'
The snapshot utility is supported from version 0.94.6 on; you can also backport it as a patch if you run an older release.
Some optimizations for step 1:
-limit the MapReduce retry attempts:
-Dmapred.map.max.attempts=2
-tolerate a small failure ratio:
-Dmapred.max.map.failures.percent=0.05
-disable HLog (WAL) writing (may require refactoring Import.Importer)
-decrease the block replication:
-Ddfs.replication=2 or -Ddfs.replication=1
-increase the client write buffer:
-Dhbase.client.write.buffer=10485760
-pre-split the new table when it is created in step 1:
{NUMREGIONS => [1], SPLITALGO => 'HexStringSplit'}
[1] HBase: how many regions fit a table when pre-splitting or during normal operation
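The pre-split creation can look like the following; the region count of 16 and the column family name 'cf' are assumed placeholders (tune NUMREGIONS per [1]), and CopyTable's --new.name then targets this table:

```shell
# Hypothetical pre-split DDL for new-tableX; 16 regions and family 'cf' are
# assumed. Review the statement, then pipe it to hbase shell.
DDL="create 'new-tableX', 'cf', {NUMREGIONS => 16, SPLITALGO => 'HexStringSplit'}"
echo "$DDL"    # run with:  echo "$DDL" | hbase shell
```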
ref:
CDH: introduction-to-apache-hbase-snapshots
JIRA: snapshot of table (principle docs attached)
"Copying part of an HBase table for testing" (some tools use Java classes from the shell)