Currently I'm working on something like counting how many hits a page gets. The problem is that the raw data can be huge, so an RDBMS may not scale.
The raw input is as follows.
Date Page User
-----------
date1 page1 user1
date1 page1 user2
date1 page2 user1
date1 page2 user3
... ...
So I need to answer questions like "for page1 on date1, how many distinct users visited it?" or "on date1, how many distinct users visited the web site?" That is, I need to support roll-up and drill-down on some columns.
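As a minimal sketch of the two questions above, assuming the raw records are small enough to hold in memory as (date, page, user) tuples. The names `records`, `distinct_users_per_page` and `distinct_users_per_day` are made up for illustration and do not come from any library.

```python
from collections import defaultdict

# Hypothetical sample of the raw input: (date, page, user)
records = [
    ("date1", "page1", "user1"),
    ("date1", "page1", "user2"),
    ("date1", "page2", "user1"),
    ("date1", "page2", "user3"),
]

def distinct_users_per_page(rows):
    """Drill-down: number of distinct users for each (date, page)."""
    users = defaultdict(set)
    for date, page, user in rows:
        users[(date, page)].add(user)
    return {key: len(members) for key, members in users.items()}

def distinct_users_per_day(rows):
    """Roll-up: number of distinct users for each date, across all pages."""
    users = defaultdict(set)
    for date, _page, user in rows:
        users[date].add(user)
    return {key: len(members) for key, members in users.items()}

per_page = distinct_users_per_page(records)
per_day = distinct_users_per_day(records)
```

Note that the roll-up cannot be obtained by adding up the per-page numbers: page1 and page2 each have 2 distinct users, but the site as a whole has 3 on date1, because user1 visited both pages.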
Before coding, I read some articles related to my problem. Here are the references.
1. http://sujee.net/tech/articles/hadoop/hbase-map-reduce-freq-counter/ (English)
2. http://www.cnblogs.com/panfeng412/archive/2011/11/19/hbase-application-in-data-statistics.html (Chinese)
For the raw data we can simply write it into HBase; the remaining challenge is how to calculate the aggregated result. One solution, mentioned in [2], is to keep a table holding the aggregated result: whenever a raw data record is put into HBase, you also update the corresponding aggregated record. E.g.,
Table: Day_Page_Access
key value
20130304_page1 3400
20130304_page2 7800
When a raw data record (20130304, page1, Tom) is processed, you read the total access count under row key 20130304_page1, which is 3400, then increase it by 1 and write it back.
I think the problem is that when a large volume of writes is mixed with updates, the overall performance will drop severely. The benefit of this approach is that the aggregated result is available at any time, so you can support querying in almost real time.
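The update-on-write approach can be sketched as follows. This is an in-memory stand-in, not HBase client code: `raw_table`, `day_page_access` and `put_record` are hypothetical names playing the role of the raw table, the Day_Page_Access table and the write path.

```python
# In-memory stand-ins for the two HBase tables (hypothetical names).
raw_table = []                      # raw (date, page, user) records
day_page_access = {                 # row key -> total access count
    "20130304_page1": 3400,
    "20130304_page2": 7800,
}

def put_record(date, page, user):
    """Store a raw record and update its aggregated counter in the same step."""
    raw_table.append((date, page, user))
    row_key = f"{date}_{page}"
    # Read-modify-write: read the current total, add 1, write it back.
    current = day_page_access.get(row_key, 0)
    day_page_access[row_key] = current + 1
    return day_page_access[row_key]

new_total = put_record("20130304", "page1", "Tom")
```

Every raw write here costs one extra read and one extra write on the aggregate table, which is where the throughput concern comes from. A separate read followed by a write is also unsafe with concurrent writers. As far as I know, the HBase Java client offers an atomic `incrementColumnValue` operation for this kind of counter.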
The other solution, in [1], does quite good work: it leverages Hadoop MapReduce to calculate the aggregated result. After all the data is loaded into HBase, it runs a MapReduce job to sum the access counts for the same page on the same day. This solution suits scenarios that allow offline batch processing, and it can achieve great write throughput.
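To make the batch approach concrete, here is a rough sketch of the map, shuffle and reduce steps in plain Python. It only imitates the shape of a MapReduce job and is not the code from [1]; the function names and the sample records (including the user "Ann") are assumptions for illustration.

```python
from collections import defaultdict

def map_record(record):
    """Map step: emit (row_key, 1) for one raw (date, page, user) record."""
    date, page, _user = record
    return (f"{date}_{page}", 1)

def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_counts(key, values):
    """Reduce step: sum the access counts for one day_page key."""
    return (key, sum(values))

def run_job(records):
    """Run map, shuffle and reduce over all raw records in one batch."""
    pairs = [map_record(r) for r in records]
    return dict(reduce_counts(k, v) for k, v in shuffle(pairs).items())

# Hypothetical raw records, already loaded before the job starts.
raw = [
    ("20130304", "page1", "Tom"),
    ("20130304", "page1", "Ann"),
    ("20130304", "page2", "Tom"),
]
totals = run_job(raw)
```

The raw writes never touch the aggregate table, so loading stays cheap; the price is that `totals` only exists after the job has run.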
Btw, I'm working on solution 1.