Reposted from http://blog.cloudera.com/blog/2012/07/hbase-log-splitting/
Cloudera's blog is quite good.
In the recent blog post about the Apache HBase Write Path, we talked about the write-ahead-log (WAL), which plays an important role in preventing data loss should an HBase region server failure occur. This blog post describes how HBase prevents data loss after a region server crashes, using an especially critical process for recovering lost updates called log splitting.
Log splitting
As we mentioned in the write path blog post, HBase data updates are stored in a place in memory called the memstore for fast writes. In the event of a region server failure, the contents of the memstore are lost because they have not been saved to disk yet. To prevent data loss in such a scenario, the updates are persisted in a WAL file before they are stored in the memstore. If the region server fails, the lost contents of the memstore can be regenerated by replaying the updates (also called edits) from the WAL file.
A region server serves many regions. All of the regions in a region server share the same active WAL file. Each edit in the WAL file has information about which region it belongs to. When a region is opened, we need to replay those edits in the WAL file that belong to that region. Therefore, edits in the WAL file must be grouped by region so that particular sets can be replayed to regenerate the data in a particular region. The process of grouping the WAL edits by region is called log splitting. It is a critical process for recovering data if a region server fails.
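To make the grouping step concrete, here is a minimal Python sketch. The `(region, seq_id, payload)` tuple layout and the `split_log` name are illustrative stand-ins, not HBase's actual WAL entry types:

```python
from collections import defaultdict

def split_log(wal_entries):
    """Group WAL edits by the region they belong to, preserving the order
    in which they appear in the log. The tuple layout is illustrative,
    not HBase's actual WAL entry type."""
    per_region = defaultdict(list)
    for region, seq_id, payload in wal_entries:
        per_region[region].append((seq_id, payload))
    return dict(per_region)

# A shared log interleaves edits from many regions...
wal = [("r1", 1, "put a"), ("r2", 2, "put b"), ("r1", 3, "delete a")]
# ...and splitting regroups them per region, in order.
grouped = split_log(wal)
assert grouped == {"r1": [(1, "put a"), (3, "delete a")],
                   "r2": [(2, "put b")]}
```

Each per-region group can then be replayed independently when that region is reopened.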
Log splitting is done by the HMaster as the cluster starts, or by the ServerShutdownHandler as a region server shuts down. Since we need to guarantee consistency, we must recover and replay all WAL edits before letting the affected regions become available again. As a result, regions affected by log splitting are unavailable until the process completes and any required edits are applied.
When log splitting starts, the log directory is renamed by appending a -splitting suffix:

/hbase/.logs/<host>,<port>,<startcode>-splitting

For example:

/hbase/.logs/host8.sample.com,57020,1340474893275-splitting
It is important that HBase renames the folder. A region server may still be up when the master thinks it is down: the server may be unresponsive and consequently fail to heartbeat its ZooKeeper session, which HMaster interprets as a region server failure. Because the folder is renamed, any existing, valid WAL files still being used by an active but busy region server cannot accidentally be written to.
Each log file is split one at a time. The log splitter reads the log file one edit entry at a time and puts each edit entry into the buffer corresponding to the edit’s region. At the same time, the splitter starts several writer threads. Writer threads pick up a corresponding buffer and write the edit entries in the buffer to a temporary recovered edit file.
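The buffer-and-writer pipeline described above can be sketched as a toy Python model. In-memory lists stand in for the temporary recovered-edit files, and all names here are illustrative, not HBase's actual classes:

```python
import queue
import threading
from collections import defaultdict

def split_with_writers(wal_entries, num_writers=2):
    """Route WAL edits into per-region buffers, then drain each buffer
    with a pool of writer threads (lists stand in for the temp files)."""
    buffers = defaultdict(queue.Queue)   # region -> buffered edits
    outputs = defaultdict(list)          # region -> "recovered edits" file
    lock = threading.Lock()

    # Reader side: put each edit into the buffer for its region.
    for region, edit in wal_entries:
        buffers[region].put(edit)

    # Hand out whole regions to writers so per-region order is preserved.
    regions = queue.Queue()
    for region in buffers:
        regions.put(region)

    def writer():
        while True:
            try:
                region = regions.get_nowait()
            except queue.Empty:
                return                   # no regions left to drain
            buf = buffers[region]
            while not buf.empty():
                edit = buf.get()
                with lock:               # the outputs dict is shared
                    outputs[region].append(edit)

    threads = [threading.Thread(target=writer) for _ in range(num_writers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return dict(outputs)
```

Handing each region to exactly one writer keeps the edits in log order within a region, which is what replay requires.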
The file location and name are of the following form:

/hbase/<table_name>/<region_id>/recovered.edits/<sequenceid>.temp

The <sequenceid> shown above is the sequence id of the first log entry written to the file. The temporary recovered edit file is used for all the edits in the WAL file for this region. Once log splitting is completed, the temporary file is renamed to:

/hbase/<table_name>/<region_id>/recovered.edits/<sequenceid>
Here, the <sequenceid> is the highest (most recent) edit sequence id of the entries in the recovered edit file. As a result, when replaying the recovered edits, it is possible to determine whether all edits have already been written. If the sequence id of the last edit that was written to the HFile is greater than or equal to the sequence id included in the file name, it is clear that all writes from the edit file have been completed.
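The sequence-id comparison can be expressed as a tiny helper. The function name and signature are hypothetical; HBase performs this check internally when opening a region:

```python
def replay_needed(max_flushed_seq_id, recovered_edits_filename):
    """Return True if the recovered-edits file still holds edits newer
    than what the region's HFiles already contain. The file name is the
    highest sequence id among the edits it stores."""
    highest_in_file = int(recovered_edits_filename)
    return max_flushed_seq_id < highest_in_file

assert replay_needed(1300, "1400") is True    # newer edits must be replayed
assert replay_needed(1500, "1400") is False   # everything already on disk
```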
When the log splitting is completed, each affected region is assigned to a region server. When the region is opened, the recovered.edits folder is checked for recovered edits files. If any such files are present, they are replayed by reading the edits and saving them to the memstore. After all edit files are replayed, the contents of the memstore are written to disk (HFile) and the edit files are deleted.
Times to complete single-threaded log splitting vary, and the process may take several hours if multiple region servers have crashed. Distributed log splitting was added in HBase version 0.92 (HBASE-1364) by Prakash Khemani from Facebook. It reduces the time to complete the process dramatically, and hence improves the availability of regions and tables. For example, on one crashed cluster we observed, recovery took around 9 hours with single-threaded log splitting but only around 6 minutes with distributed log splitting.
Distributed log splitting
In HBase 0.90, log splitting is done entirely by the HMaster. For one log splitting invocation, all the log files are processed sequentially. After a cluster restarts from a crash, unfortunately, all region servers sit idle waiting for the master to finish log splitting. Instead of having all the region servers remain idle, why not make them useful and have them help in the log splitting process? This is the insight behind distributed log splitting.
With distributed log splitting, the master is the boss. It has a split log manager to manage all the log files that need to be scanned and split. The split log manager puts all the files under the splitlog ZooKeeper node (/hbase/splitlog) as tasks. For example, in zkcli, “ls /hbase/splitlog” returns:
[hdfs%3A%2F%2Fhost2.sample.com%3A56020%2Fhbase%2F.logs%2Fhost8.sample.com%2C57020%2C1340474893275-splitting%2Fhost8.sample.com%253A57020.1340474893900, hdfs%3A%2F%2Fhost2.sample.com%3A56020%2Fhbase%2F.logs%2Fhost3.sample.com%2C57020%2C1340474893299-splitting%2Fhost3.sample.com%253A57020.1340474893931, hdfs%3A%2F%2Fhost2.sample.com%3A56020%2Fhbase%2F.logs%2Fhost4.sample.com%2C57020%2C1340474893287-splitting%2Fhost4.sample.com%253A57020.1340474893946]
After percent-decoding the encoded characters, it is:
[hdfs://host2.sample.com:56020/hbase/.logs/host8.sample.com,57020,1340474893275-splitting/host8.sample.com%3A57020.1340474893900, hdfs://host2.sample.com:56020/hbase/.logs/host3.sample.com,57020,1340474893299-splitting/host3.sample.com%3A57020.1340474893931, hdfs://host2.sample.com:56020/hbase/.logs/host4.sample.com,57020,1340474893287-splitting/host4.sample.com%3A57020.1340474893946]
It is a list of the WAL file paths to be scanned and split, that is, the list of log splitting tasks.
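The task names are ordinary percent-encodings of the WAL file paths, applied so that a full HDFS path (which contains characters such as '/' that cannot appear inside a single znode name) fits in one node name. This can be reproduced with Python's standard library; the path below is taken from the listing above:

```python
from urllib.parse import quote, unquote

# One WAL path from the decoded listing above.
wal_path = ("hdfs://host2.sample.com:56020/hbase/.logs/"
            "host8.sample.com,57020,1340474893275-splitting/"
            "host8.sample.com%3A57020.1340474893900")

# safe="" escapes everything non-alphanumeric, including '/' and ':'.
task_name = quote(wal_path, safe="")

assert task_name.startswith("hdfs%3A%2F%2Fhost2.sample.com%3A56020")
assert "%253A57020" in task_name  # the '%' in the file name is itself encoded
assert unquote(task_name) == wal_path
```

This also explains the double encoding visible in the raw listing: the WAL file name already contains %3A, so encoding it again yields %253A.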
Once the split log manager publishes all the tasks to the splitlog znode, it monitors these task nodes and waits for them to be processed.
In each region server there is a daemon thread called the split log worker, which does the actual work of splitting the logs. The worker watches the splitlog znode all the time. If there are new tasks, the split log worker retrieves the task paths, then loops through them to grab any one not yet claimed by another worker. After grabbing one, it tries to claim ownership of the task, works on the task if the claim succeeds, and updates the task's state based on the splitting outcome. After the split worker completes the current task, it tries to grab another remaining task to work on.
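The claim succeeds for only one worker because it is a conditional update on the task node. Here is a toy Python stand-in for that compare-and-set; in the real implementation, ZooKeeper's versioned writes to the task znode play this role, and the class below is purely illustrative:

```python
import threading

class TaskNode:
    """Toy stand-in for a split-log task znode. A worker claims the task
    with a compare-and-set on its state, so only one claim can succeed."""

    def __init__(self):
        self.state = "TASK_UNASSIGNED"
        self._lock = threading.Lock()

    def try_claim(self, worker):
        with self._lock:
            if self.state != "TASK_UNASSIGNED":
                return False               # another worker already owns it
            self.state = "TASK_OWNED " + worker
            return True

task = TaskNode()
assert task.try_claim("host3.sample.com,57020") is True   # first claim wins
assert task.try_claim("host4.sample.com,57020") is False  # already owned
```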
This feature is controlled by the hbase.master.distributed.log.splitting configuration property, which is enabled by default. (Note that distributed log splitting is backported to CDH3u3, which is based on 0.90; however, it is disabled by default in CDH3u3. To enable it, set hbase.master.distributed.log.splitting to true.) When HMaster starts up, a split log manager instance is created if this parameter is not explicitly set to false. The split log manager creates a monitor thread, which periodically does the following:
- Checks if there are any dead split log workers queued up. If so, it resubmits the tasks owned by the dead workers. If a resubmit fails due to a ZooKeeper exception, the dead worker is queued up again for retry.
- Checks if there are any unassigned tasks. If so, it creates an ephemeral rescan node so that each split log worker is notified to re-scan unassigned tasks via the nodeChildrenChanged ZooKeeper event.
- Checks whether any assigned tasks have expired. If so, the task is moved back to TASK_UNASSIGNED state so that it can be retried. Such tasks may be assigned to slow workers, or may already be finished. This is fine, since log splitting is idempotent: the same log splitting task can be processed many times without causing any problem.
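For reference, the controlling property mentioned above is set in hbase-site.xml; a minimal snippet to enable the feature explicitly (for example, on CDH3u3 where it is off by default) would be:

```xml
<property>
  <name>hbase.master.distributed.log.splitting</name>
  <value>true</value>
</property>
```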
Split log manager watches the HBase split log znodes all the time. If any split log task node data is changed, it retrieves the node data. The node data has the current state of the task. For example, while in zkcli, “get /hbase/splitlog/hdfs%3A%2F%2Fhost2.sample.com%3A56020%2Fhbase%2F.logs%2Fhost6.sample.com%2C57020%2C1340474893287-splitting%2Fhost6.sample.com%253A57020.1340474893945” returns:
unassigned host2.sample.com:57000
cZxid = 0x7115
ctime = Sat Jun 23 11:13:40 PDT 2012
mZxid = 0x7115
mtime = Sat Jun 23 11:13:40 PDT 2012
pZxid = 0x7115
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 33
numChildren = 0
It shows this task is still unassigned.
Based on the state of the task whose data is changed, the split log manager does one of the following:
- Resubmit the task if it is unassigned
- Heartbeat the task if it is assigned
- Resubmit or fail* the task if it is resigned
- Resubmit or fail* the task if it is completed with errors
- Resubmit or fail* the task if it could not complete due to errors
- Delete the task if it is successfully completed or failed
Note: a task is failed if:
- The task is deleted
- The node doesn’t exist anymore
- The task’s state cannot be moved to TASK_UNASSIGNED
- The number of resubmits is over the resubmit threshold
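The manager's reaction to a task-state change, as listed above, can be sketched as a simple dispatch. The abbreviated state names and the threshold value are illustrative (the real states are TASK_UNASSIGNED, TASK_OWNED, and so on, and the resubmit limit is configurable):

```python
RESUBMIT_THRESHOLD = 3  # illustrative value; configurable in HBase

def manager_action(state, resubmits):
    """Return the manager's reaction to a task whose node data changed,
    mirroring the list above (simplified: DONE covers successful
    completion, and failed tasks are also eventually deleted)."""
    if state == "UNASSIGNED":
        return "resubmit"
    if state == "OWNED":
        return "heartbeat"
    if state in ("RESIGNED", "ERR"):     # resigned, or completed with errors
        return "resubmit" if resubmits < RESUBMIT_THRESHOLD else "fail"
    if state == "DONE":
        return "delete"

assert manager_action("OWNED", 0) == "heartbeat"
assert manager_action("ERR", 1) == "resubmit"
assert manager_action("ERR", 3) == "fail"
```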
The split log worker is created and started by the region server, so there is one split log worker in each region server. When the split log worker starts, it registers itself to watch HBase znodes.
If any children of the splitlog znode change, the worker thread is notified to wake up and grab more tasks if it is sleeping. If the current task's node data changes, the worker checks whether the task has been taken by another worker; if so, the worker thread is interrupted and the current task is stopped.
For each task, it does the following:
- Gets the task state, and does nothing if the task is not in TASK_UNASSIGNED state.
- If the task is in TASK_UNASSIGNED state, tries to set the state to TASK_OWNED by the worker. If it fails to set the state, that is fine: another worker will try to grab it, and the split log manager will also ask all workers to rescan later if the task remains unassigned.
- If the worker succeeds in owning the task, it retrieves the task state again asynchronously to confirm that it really owns it. In the meantime, it starts a split task executor to do the actual work:
- Gets the HBase root folder, creates a temp folder under the root, and splits the log file into the temp folder.
- If everything goes well, the task executor sets the task to state TASK_DONE.
- If it catches an unexpected IOException, the task is set to state TASK_ERR.
- If the worker is shutting down, the task is set to state TASK_RESIGNED.
- If the task is taken by another worker, that is fine; just log it.
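The executor's outcome handling from the steps above can be sketched as follows (an illustrative function, not HBase's API):

```python
def run_split_task(split_fn, shutting_down=False):
    """Run one split task and return the terminal task state, mirroring
    the split task executor outcomes listed above."""
    if shutting_down:
        return "TASK_RESIGNED"   # worker is going away; hand the task back
    try:
        split_fn()               # split the log file into the temp folder
        return "TASK_DONE"
    except IOError:
        return "TASK_ERR"        # unexpected I/O failure; manager may retry

assert run_split_task(lambda: None) == "TASK_DONE"
assert run_split_task(lambda: None, shutting_down=True) == "TASK_RESIGNED"
```

Because a failed or resigned task simply returns to TASK_UNASSIGNED, and log splitting is idempotent, any worker can safely pick it up again.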
The split log manager returns when all tasks are completed successfully. If any task completed with a failure, the manager throws an exception so that the log splitting can be retried. Due to the asynchronous implementation, in very rare cases the split log manager loses track of some completed tasks, so it periodically checks for remaining uncompleted tasks in its task map or in ZooKeeper. If there are none, it throws an exception so that the log splitting can be retried right away, instead of hanging there waiting for something that will never happen.
Conclusion
In this blog post, we have presented a critical process, log splitting, to recover lost updates from region server failures. Log splitting used to be done by the HMaster sequentially. In 0.92, an improvement called distributed log splitting was introduced, and the actual work is done by region servers in parallel. Since there are many region servers in the cluster, distributed log splitting dramatically reduces the log splitting time, and improves regions’ availability.