How do you migrate HBase data between clusters? Here is one approach: http://blog.mozilla.com/data/2011/02/04/migrating-hbase-in-the-trenches/ . I have not verified it yet, because I ran into a thornier problem: my two clusters sit on two separate LANs and cannot reach each other. (A possible workaround is a single machine with dual NICs connected to both networks.)
First, it helps to understand the basic distcp invocation: /app/cloud/hadoop/bin/hadoop distcp src dest
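As a minimal sketch of how distcp is typically invoked between two clusters (the host names and ports below are placeholders, not values from this post): when the clusters run different Hadoop versions, the common approach is to read from the source over the read-only HFTP interface and run the job on the destination cluster.
# Same Hadoop version on both sides: plain HDFS-to-HDFS copy.
hadoop distcp hdfs://source_host:8020/hbase hdfs://target_host:8020/hbase
# Different versions (e.g. CDH2 source, CDH3 target): read via HFTP, run on the destination cluster.
hadoop distcp hftp://source_host:50070/hbase hdfs://target_host:8020/hbase
# -update roughly means: only copy files that are missing from, or differ on, the target.
hadoop distcp -update hftp://source_host:50070/hbase hdfs://target_host:8020/hbase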
The original post (quoted below):
We recently had a situation where we needed to copy a lot of HBase data while migrating from our old datacenter to our new one. The old cluster was running Cloudera's CDH2 with HBase 0.20.6 and the new one is running CDH3b3.
Usually I would use Hadoop's distcp utility for such a job. As it turned out, we were unable to use distcp while HBase was still running on the source cluster. Part of the reason is that HFTP will throw XML errors due to HBase modifying files (particularly if HBase removes a directory). And transferring our entire dataset at the time was going to take well over a day, which was a serious problem because we couldn't accept that kind of downtime. The source cluster was also about 75% full, so doing an HBase export was out as well. Thus I created a utility called Backup.
Backup is designed to do essentially the same work as distcp, with a few differences. The first is that Backup is designed to move past failures. Since we're still running HBase on the source cluster, we can in fact expect quite a few failures, so Backup's MapReduce job deliberately catches generic exceptions. This is probably a bit over-zealous, but I really needed it not to fail no matter what, especially a few hours in.
Another difference is that Backup always uses relative paths. It does this by deriving the common path between the source and destination via a regular expression. Distcp, on the other hand, will do some really interesting things depending on which options you've enabled. If you use the -f flag to provide a file list, it takes all the files and writes them directly into the target directory rather than putting them into their respective sub-directories based on the source path. If you run with the -update flag, it seems to put the source directory inside the destination rather than realizing that I want these two directories to look the same.
The last major difference is that Backup is designed to always run in update mode. This mattered because our network connection could only push about 200 MB/s between datacenters. (We later found that a firewall was the bottleneck, but we didn't want to drop our pants to the world either.) Distcp would take hours just to stat and compare the files; for context, we had something on the order of 300K-400K files to transfer. Distcp currently does this in a single thread before it runs its MapReduce job, which actually makes sense given that distcp is a single MapReduce job and wants to distribute the copy evenly. Since we needed to minimize downtime, the first thing I did was distribute the file stat comparisons. In exchange, we currently take a hit on not being able to distribute the copy work evenly.
Backup uses a hack to attempt to get better distribution, but it's nowhere near ideal. It looks at the top-level directories just under the main source directory, then splits that list of directories into mapred.map.tasks number of files. Since this data is small (remember, these are paths, not the actual data), you're pretty much guaranteed MapReduce will take your suggestion for once. This splits up the copy pretty well, especially for the first run. On subsequent runs, however, you'll get bottlenecked by a few nodes doing all the work. You can always raise mapred.map.tasks even higher, but really I need to split it out into two MapReduce jobs. I also added a -f flag so that we could specify file lists; I'll explain later why this was really useful for us.
So back to our situation. I ran the first Backup job while HBase was running. This copied the bulk of our 28 TB dataset, obviously with a bunch of failures because HBase had deleted some directories. Once we had most of the data, we could do subsequent Backup runs within a smaller time window. We ingest about 300 GB/day, so our skinny pipe between datacenters could make subsequent transfers in hours rather than days. During scheduled downtime we would shut down the source HBase, then copy the data to a secondary cluster in the new datacenter. As soon as the transfer finished we would verify that the source and destination matched; if so, we were good to start up the source cluster again and resume normal production operation. Meanwhile we would copy the data from the secondary cluster to the new production cluster. We did this because HBase 0.89+ would change the region directories, and we also needed to let Socorro web developers do their testing. Having the two separate clusters was a real blessing: it allowed us to keep a pristine backup on the secondary at all times while testing against the new production cluster.
We did this a number of times the week before launch, always trying to keep everything as up to date as we could before we threw the switch to cut over. It was during this last week that I added the -f flag, which allowed giving Backup a source file list. We would run "hadoop fs -lsr /hbase" on both the source and the destination cluster. I wrote a simple Python utility (lsr_diff) to compare these two listings and figure out what needed to be copied and what needed to be deleted. The files to copy could be given to the Backup job, while the deletes could be handled with a short shell script (Backup doesn't have delete functionality). The process looked something like this:
RUN ON SOURCE CLUSTER:
hadoop fs -lsr /hbase > source_hbase.txt
RUN ON TARGET CLUSTER:
hadoop fs -lsr /hbase > target_hbase.txt
scp source_host:./source_hbase.txt .
python lsr_diff.py source_hbase.txt target_hbase.txt
sort copy-paths.txt -o copy-paths.sorted
sudo -u hdfs hadoop fs -put copy-paths.sorted copy-paths.sorted
nohup sudo -u hdfs hadoop jar akela-job.jar com.mozilla.hadoop.Backup -Dmapred.map.tasks=112 -f hdfs://target_host:8020/user/hdfs/copy-paths.sorted hftp://source_host:50070/hbase hdfs://target_host:8020/hbase
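The lsr_diff.py utility itself is not reproduced in the post. Purely as a rough illustration, and not the actual tool, the comparison step can be sketched with standard shell tools, assuming the usual eight-column hadoop fs -lsr output where column 5 is the file size and column 8 is the path (delete-paths.txt is a name used here for illustration):
# Reduce each listing to "path size" for regular files (directory entries start with 'd').
grep -v '^d' source_hbase.txt | awk '{print $8, $5}' | sort > source_files.txt
grep -v '^d' target_hbase.txt | awk '{print $8, $5}' | sort > target_files.txt
# Files present in the source but missing, or different in size, on the target: copy these.
comm -23 source_files.txt target_files.txt | awk '{print $1}' > copy-paths.txt
# Paths present on the target but absent from the source: candidates for deletion.
awk '{print $1}' source_files.txt | sort > source_paths.txt
awk '{print $1}' target_files.txt | sort > target_paths.txt
comm -13 source_paths.txt target_paths.txt > delete-paths.txt
Because the comparison is on "path size" pairs, a file whose size changed lands in copy-paths.txt, which is what an update-style run wants; delete-paths.txt feeds the delete step sketched further below.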
The number of map tasks I refined over time, but I started the initial run with (# of hosts * # of map task slots); on subsequent runs I ended up doubling that number. After each Backup job completed we would run "hadoop fs -lsr" and diff again to make sure that everything had copied over. I saw a lot of cases where that wasn't true when the source was HFTP from one datacenter to another; however, when copying from an HDFS source within our new datacenter I never saw a copy issue.
Due to other issues (there always are, right?), we had a pretty tight timeline and this system was pretty hacked together, but it worked for us. In the future I would love to see some modifications made to distcp. Here's my wishlist based on our experience:
1.) Distribute the file stat comparisons and then run a second MapReduce job to do the actual copying.
2.) Do proper relative path copies.
3.) Distribute deletes too.
To be honest, though, I found the existing distcp code a bit overly complex, otherwise I might have made the modifications myself. Perhaps the best thing would be for someone to take a crack at a fresh rewrite of distcp altogether. I would love to hear people's feedback.
Note from the blogger: if anyone has a better idea, please let me know; the solution described above does not fit my situation.
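The "short shell script" used for deletes is not shown in the original post either. A minimal sketch, assuming a delete-paths.txt list like the one produced by the comparison above (the hdfs user and file name are illustrative):
#!/bin/bash
# Remove every path that exists on the target cluster but is no longer present on the source.
# hadoop fs -rmr is the recursive delete available on Hadoop 0.20-era clusters.
while read -r path; do
  sudo -u hdfs hadoop fs -rmr "$path"
done < delete-paths.txt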