Use HBase to Solve Page Access Problem

standalone

Blog categories: hbase, hbase mapreduce

Currently I'm working on something like calculating how many hits a page gets. The problem is that the raw data can be huge, so it may not scale if you use an RDBMS.

The raw input is as follows.

Date Page User
-----------
date1 page1 user1
date1 page1 user2
date1 page2 user1
date1 page2 user3

...   ...

So I need to answer questions like "for page1 on date1, how many distinct users have visited it?" or "on date1, how many distinct users have visited the web site?" That is, you need to support roll-up and drill-down along some columns.
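To make the two questions concrete, here is a minimal in-memory sketch in Python. It does not touch HBase at all; the `records` list and the `distinct_users` helper are hypothetical names made up for illustration, using the sample rows from the table above.

```python
from collections import defaultdict

# Hypothetical sample records in (date, page, user) form, mirroring the raw input table.
records = [
    ("date1", "page1", "user1"),
    ("date1", "page1", "user2"),
    ("date1", "page2", "user1"),
    ("date1", "page2", "user3"),
]


def distinct_users(records, key_fn):
    """Group records by key_fn(date, page) and count distinct users per group."""
    groups = defaultdict(set)
    for date, page, user in records:
        groups[key_fn(date, page)].add(user)
    return {key: len(users) for key, users in groups.items()}


# Drill down: distinct users per (date, page).
per_page = distinct_users(records, lambda date, page: (date, page))

# Roll up: distinct users per date, across all pages.
per_day = distinct_users(records, lambda date, page: date)
```

Note that the roll-up cannot be obtained by adding the per-page numbers: page1 and page2 each have 2 distinct users, but the site has 3 for the day, because user1 visited both pages.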

Before coding, I read some articles related to my problem. Here are the references.

1. http://sujee.net/tech/articles/hadoop/hbase-map-reduce-freq-counter/ (English)

2. http://www.cnblogs.com/panfeng412/archive/2011/11/19/hbase-application-in-data-statistics.html (Chinese)

The raw data we can simply write into HBase; the remaining challenge is how to calculate the aggregated result. One solution, mentioned in [2], is to keep a table holding the aggregated result: whenever a raw data record is put into HBase, you also update the corresponding aggregated record. E.g.,

Table: Day_Page_Access

key             value
20130304_page1   3400
20130304_page2   7800

When a raw data record (20130304, page1, Tom) is processed, you read the total access count under row key 20130304_page1, which is 3400, increase it by 1, and write it back.
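The update step can be sketched as follows. This is only an illustration, assuming a plain Python dict as a stand-in for the Day_Page_Access table; `row_key` and `record_access` are hypothetical helpers, not HBase calls.

```python
# In-memory stand-in for the Day_Page_Access table (row key -> access count).
day_page_access = {"20130304_page1": 3400, "20130304_page2": 7800}


def row_key(date, page):
    """Build the aggregate row key in the <date>_<page> form used above."""
    return f"{date}_{page}"


def record_access(table, date, page, user):
    """Read the current count, add one, write it back, and return the new total.

    The user is not needed for a plain hit count; it is kept in the signature
    only because it is part of the raw record.
    """
    key = row_key(date, page)
    current = table.get(key, 0)  # a missing row starts at 0
    table[key] = current + 1
    return table[key]


new_total = record_access(day_page_access, "20130304", "page1", "Tom")
```

A read-then-write like this is not safe when several writers update the same row at once. Against a real cluster you would probably want the atomic increment that the HBase client API provides (`incrementColumnValue`) instead of a separate get and put.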

I think the problem is that when large volumes of writes are mixed with updates, overall performance drops severely. The benefit of this approach is that the aggregated result is available at any time, so you can support queries in near real time.

The other solution, described in [1], does quite a good job of using Hadoop MapReduce to calculate the aggregated result. After all the data is loaded into HBase, it runs a MapReduce job to sum the access counts for the same page on the same day. This solution suits scenarios that allow offline batch processing, and it can achieve high write throughput.
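The shape of such a job can be sketched in plain Python. This is not the code from [1] and it does not use Hadoop; `map_record`, `reduce_counts` and `run_job` are hypothetical names that only imitate the map, shuffle and reduce phases in memory.

```python
from collections import defaultdict


def map_record(date, page, user):
    """Map phase: emit one (row key, 1) pair per raw access record."""
    yield (f"{date}_{page}", 1)


def reduce_counts(key, values):
    """Reduce phase: sum all the ones emitted for a single row key."""
    return key, sum(values)


def run_job(records):
    """Imitate the shuffle by grouping mapper output by key, then reduce each group."""
    shuffled = defaultdict(list)
    for record in records:
        for key, value in map_record(*record):
            shuffled[key].append(value)
    return dict(reduce_counts(key, values) for key, values in shuffled.items())


# Hypothetical raw records, already loaded, in (date, page, user) form.
raw = [
    ("date1", "page1", "user1"),
    ("date1", "page1", "user2"),
    ("date1", "page2", "user1"),
    ("date1", "page2", "user3"),
]
totals = run_job(raw)
```

In the real job the reducer output would be written to the aggregate table in one pass, so the raw load itself never has to read or update existing rows.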

Btw, I'm working on solution 1.


2013-05-17 14:48