Using HBase to Solve a Page Access Counting Problem

Currently I'm working on something like counting how many hits a page gets. The problem is that the raw data can be huge, so the job may not scale if you use an RDBMS.

The raw input is as follows.

Date   Page   User
-------------------
date1  page1  user1
date1  page1  user2
date1  page2  user1
date1  page2  user3

...    ...    ...

So I need to answer questions like "for page1 on date1, how many distinct users have visited it?" or "on date1, how many distinct users have visited (the web site)?" That is, you need to support rolling up or drilling down along some columns.
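To make the two questions concrete, here is a minimal in-memory sketch in Python. It uses no HBase at all, and the helper name `distinct_users` is made up for illustration. It only shows what "drill down" (group by date and page) and "roll up" (group by date only) mean on the sample rows.

```python
from collections import defaultdict

# Sample rows in the (date, page, user) layout shown above.
records = [
    ("date1", "page1", "user1"),
    ("date1", "page1", "user2"),
    ("date1", "page2", "user1"),
    ("date1", "page2", "user3"),
]

def distinct_users(rows, key_fields):
    """Count distinct users per group; key_fields names the columns to group by."""
    groups = defaultdict(set)
    for date, page, user in rows:
        row = {"date": date, "page": page}
        key = tuple(row[field] for field in key_fields)
        groups[key].add(user)  # a set keeps each user only once per group
    return {key: len(users) for key, users in groups.items()}

per_page = distinct_users(records, ("date", "page"))  # drill down to date + page
per_day = distinct_users(records, ("date",))          # roll up to date only
```

On the sample rows, page1 and page2 each have two distinct users on date1, while the whole site has three, because user1 visited both pages and is counted once.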

Before coding, I read some articles related to my problem. Here are the references.

1. http://sujee.net/tech/articles/hadoop/hbase-map-reduce-freq-counter/  (English)

2. http://www.cnblogs.com/panfeng412/archive/2011/11/19/hbase-application-in-data-statistics.html (Chinese)

The raw data can simply be written into HBase; the remaining challenge is how to calculate the aggregated result. One solution, mentioned in [2], is to keep a separate table that holds the aggregated result: whenever a raw data record is put into HBase, you also update the corresponding aggregated record. E.g.,

Table: Day_Page_Access

key             value
20130304_page1   3400
20130304_page2   7800

When a raw data record (20130304, page1, Tom) is processed, you read the total access count stored under row key 20130304_page1, which is 3400, then increase it by 1 and write it back.

I think the problem is that when a large volume of writes is mixed with updates, overall performance will drop severely. The benefit of this approach is that the aggregated result is available at any time, so you can support querying in almost real time.
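The read, increment, write-back step can be sketched as follows. This is only an illustration: a plain Python dict stands in for the Day_Page_Access table, and `record_access` is a hypothetical helper, not an HBase call. With a real cluster, the HBase client also offers an atomic counter increment (`incrementColumnValue`), which avoids the separate read and write.

```python
# A plain dict stands in for the Day_Page_Access table (assumption: no real HBase client).
day_page_access = {
    "20130304_page1": 3400,
    "20130304_page2": 7800,
}

def record_access(table, date, page):
    """Update the aggregated row for one raw record and return the new total."""
    row_key = f"{date}_{page}"       # same key layout as the table above
    current = table.get(row_key, 0)  # read the current total (0 if the row is new)
    table[row_key] = current + 1     # increase it by 1 and write it back
    return table[row_key]

# Processing the raw record (20130304, page1, Tom); the user is not needed for a total count.
new_total = record_access(day_page_access, "20130304", "page1")
```

Every raw record costs one extra read and one extra write on the aggregate table, which is where the performance concern comes from.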

The other solution, in [1], does quite a good job of leveraging Hadoop MapReduce to calculate the aggregated result. After all the data is loaded into HBase, it runs a MapReduce job to sum the access counts for the same page on the same day. This solution is suited to scenarios that allow off-line batch processing, and it can achieve great write throughput.
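The shape of such a batch job can be sketched in plain Python. This is not the code from [1] and it does not use Hadoop; `map_phase` and `reduce_phase` are made-up names that mimic the map, shuffle and reduce steps on an in-memory list.

```python
from collections import defaultdict

def map_phase(rows):
    """Emit one (date_page, 1) pair per raw access record."""
    for date, page, _user in rows:
        yield f"{date}_{page}", 1

def reduce_phase(pairs):
    """Group the pairs by key (the shuffle step) and sum the counts per key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: sum(values) for key, values in grouped.items()}

# Raw rows as they would sit in HBase after loading (sample values only).
raw_rows = [
    ("20130304", "page1", "user1"),
    ("20130304", "page1", "user2"),
    ("20130304", "page2", "user1"),
    ("20130305", "page1", "user3"),
]

totals = reduce_phase(map_phase(raw_rows))
```

Since the aggregation runs once over all loaded rows, loading only has to append raw records, at the price of results that are as fresh as the last batch run.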

Btw, I'm working on solution 1.

Reposted from standalone.iteye.com/blog/1870948