Currently I'm working on something like calculating how many hits a page gets. The problem is that the raw data can be huge, so it may not scale if you use an RDBMS.
The raw input is as follows.
Date Page User
-----------
date1 page1 user1
date1 page1 user2
date1 page2 user1
date1 page2 user3
... ...
So I need to answer questions like "for page1 on day1, how many distinct users have visited it?" or "on day1, how many distinct users have visited the web site?" That is, I need to support roll-up and drill-down on some columns.
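To make the roll-up/drill-down idea concrete, here is a minimal sketch (plain Python, with made-up sample rows mirroring the table above) of computing distinct-user counts at both granularities:

```python
from collections import defaultdict

# Hypothetical sample records mirroring the (Date, Page, User) rows above.
records = [
    ("date1", "page1", "user1"),
    ("date1", "page1", "user2"),
    ("date1", "page2", "user1"),
    ("date1", "page2", "user3"),
]

# Distinct users per (date, page) -- the drill-down view.
per_page = defaultdict(set)
# Distinct users per date -- the roll-up view.
per_day = defaultdict(set)

for date, page, user in records:
    per_page[(date, page)].add(user)
    per_day[date].add(user)

print(len(per_page[("date1", "page1")]))  # 2 distinct users for page1 on date1
print(len(per_day["date1"]))              # 3 distinct users site-wide on date1
```

Of course this in-memory version is exactly what doesn't scale; the rest of the post is about getting the same answers out of HBase.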
Before coding, I read some articles related to my problem. Here are the references.
1.
http://sujee.net/tech/articles/hadoop/hbase-map-reduce-freq-counter/ (English)
2.
http://www.cnblogs.com/panfeng412/archive/2011/11/19/hbase-application-in-data-statistics.html (Chinese)
For the raw data we can simply write into HBase; the challenge left is how to calculate the aggregated result. One solution mentioned in [2] is to have a table holding the aggregated result: whenever a raw data record is put into HBase, you also update the corresponding aggregated record. E.g.,
Table: Day_Page_Access
key              value
--------------   -----
20130304_page1   3400
20130304_page2   7800
When a raw data record (20130304, page1, Tom) is processed, you read the total access count under row key 20130304_page1 (3400), increment it by 1, and write it back.
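The read-increment-write cycle can be sketched like this; a plain dict stands in for the Day_Page_Access table, and the table name and keys are just the illustrative ones from above (in a real HBase deployment you would use the client's atomic increment support instead of a manual read-modify-write, to avoid lost updates from concurrent writers):

```python
# A plain dict stands in for the Day_Page_Access HBase table.
day_page_access = {
    "20130304_page1": 3400,
    "20130304_page2": 7800,
}

def record_access(table, date, page):
    """Update the aggregated count for one raw (date, page, user) record."""
    key = f"{date}_{page}"
    # Read the current count (0 if the row doesn't exist yet),
    # add 1, and write it back.
    table[key] = table.get(key, 0) + 1

record_access(day_page_access, "20130304", "page1")
print(day_page_access["20130304_page1"])  # 3401
```

Note that a single counter like this gives total accesses, not distinct users; answering the distinct-user question on write would additionally require deduplicating by user somewhere.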
I think the problem is that when doing large writes mixed with updates, the overall performance will drop severely. But the benefit of this approach is that the aggregated result is available at any time, so you can support querying in almost real time.
The other solution, in [1], does quite a good job of leveraging Hadoop MapReduce to calculate the aggregated result. After all the data is loaded into HBase, it runs a MapReduce job to sum the access counts for the same page on the same day. This solution suits scenarios that allow off-line batch processing, and it can achieve great write throughput.
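The shape of that batch job can be sketched in plain Python (the sample rows are hypothetical; the map/shuffle/reduce phases here stand in for what the Hadoop framework does across the cluster):

```python
from itertools import groupby

# Raw rows as already stored in HBase (hypothetical sample data).
rows = [
    ("date1", "page1", "user1"),
    ("date1", "page1", "user2"),
    ("date1", "page2", "user1"),
    ("date1", "page2", "user3"),
]

# Map phase: emit one (day_page, 1) pair per raw record.
mapped = [(f"{date}_{page}", 1) for date, page, user in rows]

# Shuffle phase: bring equal keys together, as the framework would.
mapped.sort(key=lambda kv: kv[0])

# Reduce phase: sum the counts for each day_page key.
totals = {key: sum(v for _, v in group)
          for key, group in groupby(mapped, key=lambda kv: kv[0])}

print(totals)  # {'date1_page1': 2, 'date1_page2': 2}
```

The reducer output would then be written back to an aggregate table like Day_Page_Access, so queries only ever read precomputed rows.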
Btw, I'm working on solution 1.
Use HBase to Solve Page Access Problem
Reposted from standalone.iteye.com/blog/1870948