Currently I'm working on something like calculating how many hits a page gets. The problem is that the raw data can be huge, so it may not scale if you use an RDBMS.
The raw input is as follows.
Date Page User
-----------
date1 page1 user1
date1 page1 user2
date1 page2 user1
date1 page2 user3
... ...
So I need to answer questions like "for page1 on day1, how many distinct users have visited it?" or "on day1, how many distinct users have visited the web site?" That is, I need to support roll-up and drill-down on some columns.
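To make the roll-up/drill-down idea concrete, here is a minimal sketch (plain Python, with made-up sample rows mirroring the table above) of computing distinct-user counts at both granularities:

```python
from collections import defaultdict

# Hypothetical sample records mirroring the (Date, Page, User) rows above.
records = [
    ("date1", "page1", "user1"),
    ("date1", "page1", "user2"),
    ("date1", "page2", "user1"),
    ("date1", "page2", "user3"),
]

# Distinct users per (date, page) -- the drill-down view.
per_page = defaultdict(set)
# Distinct users per date -- the roll-up view.
per_day = defaultdict(set)

for date, page, user in records:
    per_page[(date, page)].add(user)
    per_day[date].add(user)

print(len(per_page[("date1", "page1")]))  # 2 distinct users for page1 on date1
print(len(per_day["date1"]))              # 3 distinct users site-wide on date1
```

Of course this in-memory version is exactly what doesn't scale; the rest of the post is about getting the same answers out of HBase.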
Before coding, I read some articles related to my problem. Here are the references.
1.
http://sujee.net/tech/articles/hadoop/hbase-map-reduce-freq-counter/ (English)
2.
http://www.cnblogs.com/panfeng412/archive/2011/11/19/hbase-application-in-data-statistics.html (Chinese)
For the raw data we can simply write into HBase; the challenge left is how to calculate the aggregated result. One solution mentioned in [2] is to have a table holding the aggregated result: whenever a raw data record is put into HBase, you also update the corresponding aggregated record. E.g.,
Table: Day_Page_Access
key              value
--------------   -----
20130304_page1   3400
20130304_page2   7800
When a raw data record (20130304, page1, Tom) is processed, you read the total access count under row key 20130304_page1 (3400), increment it by 1, and write it back.
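The read-increment-write cycle can be sketched like this; a plain dict stands in for the Day_Page_Access table, and the table name and keys are just the illustrative ones from above (in a real HBase deployment you would use the client's atomic increment support instead of a manual read-modify-write, to avoid lost updates from concurrent writers):

```python
# A plain dict stands in for the Day_Page_Access HBase table.
day_page_access = {
    "20130304_page1": 3400,
    "20130304_page2": 7800,
}

def record_access(table, date, page):
    """Update the aggregated count for one raw (date, page, user) record."""
    key = f"{date}_{page}"
    # Read the current count (0 if the row doesn't exist yet),
    # add 1, and write it back.
    table[key] = table.get(key, 0) + 1

record_access(day_page_access, "20130304", "page1")
print(day_page_access["20130304_page1"])  # 3401
```

Note that a single counter like this gives total accesses, not distinct users; answering the distinct-user question on write would additionally require deduplicating by user somewhere.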
I think the problem is that when doing large writes mixed with updates, the overall performance will drop severely. But the benefit of this approach is that the aggregated result is available at any time, so you can support querying in almost real time.
The other solution, in [1], does quite a good job of leveraging Hadoop MapReduce to calculate the aggregated result. After all the data is loaded into HBase, it runs a MapReduce job to sum the access counts for the same page on the same day. This solution suits scenarios that allow off-line batch processing, and it can achieve great write throughput.
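The shape of that batch job can be sketched in plain Python (the sample rows are hypothetical; the map/shuffle/reduce phases here stand in for what the Hadoop framework does across the cluster):

```python
from itertools import groupby

# Raw rows as already stored in HBase (hypothetical sample data).
rows = [
    ("date1", "page1", "user1"),
    ("date1", "page1", "user2"),
    ("date1", "page2", "user1"),
    ("date1", "page2", "user3"),
]

# Map phase: emit one (day_page, 1) pair per raw record.
mapped = [(f"{date}_{page}", 1) for date, page, user in rows]

# Shuffle phase: bring equal keys together, as the framework would.
mapped.sort(key=lambda kv: kv[0])

# Reduce phase: sum the counts for each day_page key.
totals = {key: sum(v for _, v in group)
          for key, group in groupby(mapped, key=lambda kv: kv[0])}

print(totals)  # {'date1_page1': 2, 'date1_page2': 2}
```

The reducer output would then be written back to an aggregate table like Day_Page_Access, so queries only ever read precomputed rows.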
Btw, I'm working on solution 1.
Use HBase to Solve Page Access Problem
Reposted from standalone.iteye.com/blog/1870948