standalone

Use HBase to Solve Page Access Problem

Currently I'm working on something like calculating how many hits a page gets. The problem is that the raw data can be huge, so an RDBMS-based solution may not scale.

The raw input is as follows.

Date   Page   User
------------------
date1  page1  user1
date1  page1  user2
date1  page2  user1
date1  page2  user3
...    ...    ...

So I need to answer questions like "for page1 on date1, how many distinct users visited it?" or "on date1, how many distinct users visited the web site?" That is, the solution needs to support rolling up or drilling down along some columns.
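
To make the two questions concrete, here is a minimal in-memory sketch in Python. It only illustrates what "roll up" and "drill down" mean for this data; it is not how the data would be stored or queried in HBase, and the `RAW` list and the `distinct_users` helper are hypothetical names introduced for illustration.

```python
from collections import defaultdict

# Hypothetical sample rows in the (date, page, user) layout of the raw input.
RAW = [
    ("date1", "page1", "user1"),
    ("date1", "page1", "user2"),
    ("date1", "page2", "user1"),
    ("date1", "page2", "user3"),
]

def distinct_users(rows, group_by):
    """Count distinct users per group.

    group_by maps a (date, page) pair to the grouping key, so the same
    function can drill down to (date, page) or roll up to date only.
    """
    seen = defaultdict(set)
    for date, page, user in rows:
        seen[group_by(date, page)].add(user)
    return {key: len(users) for key, users in seen.items()}

# Drill down: distinct users per page per day.
per_day_page = distinct_users(RAW, lambda d, p: (d, p))

# Roll up: distinct users per day across the whole site.
per_day = distinct_users(RAW, lambda d, p: d)
```

Note that the rolled-up number cannot be derived by adding the per-page numbers, because user1 visited both pages on date1.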

Before coding, I read some articles related to my problem. Here are the references.

1. http://sujee.net/tech/articles/hadoop/hbase-map-reduce-freq-counter/  (English)

2. http://www.cnblogs.com/panfeng412/archive/2011/11/19/hbase-application-in-data-statistics.html (Chinese)

For the raw data, we can simply write it into HBase; the remaining challenge is how to calculate the aggregated result. One solution mentioned in [2] is to keep a table holding the aggregated result: whenever a raw data record is put into HBase, you also update the corresponding aggregated record. E.g.,

Table: Day_Page_Access

key             value
20130304_page1   3400
20130304_page2   7800

When a raw data record (20130304, page1, Tom) is processed, you read the total access count for row key 20130304_page1, which is 3400, then increase it by 1 and write it back.
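
The read, increment, and write-back step can be sketched as follows. This is an illustration only: a plain Python dict stands in for the Day_Page_Access table, and `record_hit` is a hypothetical helper, not part of any HBase client.

```python
# Assumption: a dict stands in for the Day_Page_Access table; no real HBase
# client is involved in this sketch.
day_page_access = {"20130304_page1": 3400, "20130304_page2": 7800}

def record_hit(table, date, page):
    """Read the current count for the date_page row, add 1, write it back."""
    row_key = f"{date}_{page}"
    count = table.get(row_key, 0)   # read (0 if the row does not exist yet)
    table[row_key] = count + 1      # write the incremented value back
    return table[row_key]

# Processing the raw record (20130304, page1, Tom):
new_count = record_hit(day_page_access, "20130304", "page1")
```

Against a real cluster, a separate read followed by a write is not atomic when several writers touch the same row. The HBase Java client offers `incrementColumnValue` for atomic counters, which may be a better fit there.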

I think the problem is that when large volumes of writes are mixed with updates, overall performance will drop severely. The benefit of this approach is that the aggregated result is available at any time, so you can support querying in almost real time.

The other solution, in [1], does quite a good job of leveraging Hadoop MapReduce to calculate the aggregated result. After all the data is loaded into HBase, it runs a MapReduce job to sum the access counts for the same page on the same day. This solution suits scenarios that allow offline batch processing, and it can achieve great write throughput.
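
The shape of such a batch job can be sketched in plain Python. This is not the code from [1] and does not use Hadoop. It only mimics the map, shuffle, and reduce steps in memory. The sample rows, the user names other than Tom, and the function names are hypothetical.

```python
from itertools import groupby

# Hypothetical raw records as (date, page, user) tuples.
RAW = [
    ("20130304", "page1", "Tom"),
    ("20130304", "page1", "Ann"),
    ("20130304", "page2", "Tom"),
    ("20130305", "page1", "Tom"),
]

def map_phase(rows):
    """Emit (date_page, 1) for every raw access record."""
    for date, page, _user in rows:
        yield f"{date}_{page}", 1

def reduce_phase(pairs):
    """Group the emitted pairs by key (the shuffle) and sum each group."""
    totals = {}
    for key, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        totals[key] = sum(value for _key, value in group)
    return totals

day_page_totals = reduce_phase(map_phase(RAW))
```

Because the counts are computed in one pass after loading, the write path stays append-only; the trade-off is that the totals are only as fresh as the last job run.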

Btw, I'm working on solution 1.