Data Development Methodology

Because the company's data platform has encapsulated the ETL development tools well, development efficiency has improved greatly. The main work of data developers is now writing SQL, and they are worn out by a constant stream of large business requirements. Many colleagues feel that data development has no technical depth and offers no technical growth, and over time this leads to boredom.

Personally, I think the main reasons for this view are:

  • Requirements are solved one point at a time with ETL, without looking at the problem from a global perspective. Data gets built without any accumulation of experience or reflection.
  • Data development has no clear positioning. In fact, building data is very challenging work.

The problems we are facing

A data warehouse is an architecture, and the first questions it puts in front of us are:

  • How to understand the business efficiently and express (model) a complex business with simple models. The end result is tables in a data warehouse or data mart, plus a series of upper-level metrics (how to layer the data, how to improve the data reuse rate, and so on).
  • How to organize the data so that it responds to business changes efficiently and supports business requirements more flexibly and with high quality (are you building a warehouse or a data graveyard?).

All of this needs a good methodology behind it, and how to build one is worth thinking about.

Proper attitude

A better attitude is to approach data development with problem solving as the starting point.

  • The purpose of both system development and data development is to solve problems, so they are essentially the same, and both face problems that are hard to solve. Building a good data warehouse is not easy (duplicated data construction, inconsistent data, incomplete data, data that is hard to understand, and so on); the difficulty is no lower than that of building a system.
  • Engineering methods are also very much needed when building a data warehouse, in order to improve efficiency. Warehouse construction and system development therefore do not conflict. We should actively look for the point where they combine to improve efficiency, instead of sinking into the quagmire of being worn out by incoming requirements.
  • The capability model for back-end or platform development should differ from the one for data development; data development tests a person's all-round ability.
  • Whether an engineer is qualified should ultimately be measured by "work ability".

Competency model

In my opinion:

  • Data processing and modeling: sufficient technical breadth and depth in both areas.
  • Business understanding and abstract thinking: the ability to summarize and to get from a problem to its essence.
  • Other soft skills, such as communication.

Doing data development well requires a fairly well-rounded set of qualities.

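Some methodologies

How to quickly understand the business

Abstract thinking

Usually the business side hands over a large number of metrics or a series of requirements, and they look complicated. This is where abstract thinking is needed.

  • Complex -> simple: do not dwell too much on details; grasp the essence and the key points of the requirement, and understand it from the level of the whole.
  • Simple -> complex: be systematic and comprehensive, and decompose the business.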

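Introducing models

Any complex business can be represented with a few simple models, and introducing a model helps us decompose requirements quickly. Take the traffic funnel model: traffic analysis comes down to metrics such as conversion rate, bounce rate, and dwell time. The funnel itself, that is, user behavior, is relatively stable, while the context (dimensions) around the funnel may change, and that is what we need to watch most closely. Another example is introducing a customer loyalty model when analyzing the quality of new versus returning customers. In short, introducing models helps us understand requirements and decompose the business.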
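The funnel idea can be sketched in a few lines of Python. This is only an illustration under assumptions: the step names, the sample events, and the `channel` dimension are all hypothetical and not taken from any real system. It shows the point made above: the ordered funnel steps stay fixed, while the dimension used for grouping can be swapped.

```python
from collections import defaultdict

# Hypothetical funnel steps, ordered from the top of the funnel to the bottom.
FUNNEL_STEPS = ["visit", "view_item", "add_to_cart", "pay"]

# Hypothetical event log: (user_id, step, channel). "channel" stands in for
# the context (dimension) that may change around the stable funnel.
EVENTS = [
    ("u1", "visit", "app"), ("u1", "view_item", "app"),
    ("u1", "add_to_cart", "app"), ("u1", "pay", "app"),
    ("u2", "visit", "app"), ("u2", "view_item", "app"),
    ("u3", "visit", "web"), ("u3", "view_item", "web"),
    ("u3", "add_to_cart", "web"),
    ("u4", "visit", "web"),
]


def funnel_by_dimension(events, steps=FUNNEL_STEPS):
    """Count distinct users at each funnel step, grouped by dimension value."""
    users = defaultdict(lambda: defaultdict(set))
    for user_id, step, dim in events:
        users[dim][step].add(user_id)
    return {
        dim: [len(by_step[step]) for step in steps]
        for dim, by_step in users.items()
    }


def step_conversion(counts):
    """Conversion rate from each step to the next; 0.0 if a step has no users."""
    return [
        (nxt / cur if cur else 0.0)
        for cur, nxt in zip(counts, counts[1:])
    ]
```

With the sample data, `funnel_by_dimension(EVENTS)["app"]` gives `[2, 2, 1, 1]`, and `step_conversion` turns that into `[1.0, 0.5, 1.0]`. Grouping by a different dimension only changes the third field of each event, not the funnel logic.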

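About event tracking

Whatever can be implemented on the back end should be tracked in the business back-end services, relying on the client as little as possible. Client-side tracking is hard to maintain and inflexible, and the resulting statistics depend on client releases.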

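About model selection

First, be clear that a data warehouse is analysis-oriented, so in general we lean toward building data that is intuitive and easy to understand and use. The commonly used relational model and dimensional model differ in how they divide up the world: one in terms of entities and relationships, the other in terms of dimensions. Both, however, can satisfy the normal forms.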

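Relational model

The relational model divides the objective world into relationships and entities. The data warehouse consists of a series of relationships and entities and strictly follows 3NF, so data consistency is strong and redundancy is low.

Consistency is guaranteed through referential integrity. A data warehouse built with the relational model often ends up with a spider-web structure that is hard to read and hard to get started with. For people who are not warehouse builders, the cost of using the data in it is very high.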

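Dimensional modeling

Dimensional modeling decomposes the objective world into facts and dimensions. A large part of the modeling work lies in extracting the dimensions and the facts. It is comparatively more intuitive and easier to understand, and where the business changes frequently it can cover business requirements more effectively.

The data warehouse consists of a series of fact tables and dimension tables. Fact tables connect to dimension tables in a star pattern, and fact tables are in turn linked to one another through specific dimensions, forming a snowflake structure.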
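A minimal sketch of a star-shaped layout, using Python's built-in `sqlite3` module with an in-memory database. The table names (`fact_order`, `dim_date`, `dim_channel`), the columns, and the rows are all hypothetical; the only point is that the fact table holds measures plus keys, and each analysis is a join from the fact table out to a dimension table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: descriptive attributes used for grouping and filtering.
CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, day TEXT, month TEXT);
CREATE TABLE dim_channel (channel_id INTEGER PRIMARY KEY, channel_name TEXT);

-- Fact table: one row per order, holding a measure plus keys to dimensions.
CREATE TABLE fact_order (
    order_id   INTEGER PRIMARY KEY,
    date_id    INTEGER REFERENCES dim_date(date_id),
    channel_id INTEGER REFERENCES dim_channel(channel_id),
    amount     REAL
);

INSERT INTO dim_date VALUES
    (1, '2024-01-01', '2024-01'), (2, '2024-01-02', '2024-01');
INSERT INTO dim_channel VALUES (1, 'app'), (2, 'web');
INSERT INTO fact_order VALUES
    (100, 1, 1, 30.0), (101, 1, 2, 20.0), (102, 2, 1, 50.0);
""")


def amount_by_channel(connection):
    """Aggregate the fact measure by one dimension attribute (a star join)."""
    rows = connection.execute("""
        SELECT c.channel_name, SUM(f.amount)
        FROM fact_order AS f
        JOIN dim_channel AS c ON c.channel_id = f.channel_id
        GROUP BY c.channel_name
        ORDER BY c.channel_name
    """)
    return rows.fetchall()
```

Analyzing by month instead of by channel would mean joining `dim_date` in the same way, leaving the fact table untouched.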

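3NF (third normal form)

3NF builds on first normal form (atomicity) and second normal form (no partial dependencies) by also eliminating transitive dependencies. Its purpose is to reduce data redundancy and improve data consistency. The relational model is sometimes also called normal-form modeling, but that wording is not quite rigorous, because the tables produced by both the relational model and dimensional modeling can conform to 3NF.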
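To make "eliminating transitive dependencies" concrete, here is a small Python sketch over hypothetical rows. The assumption is that `order_id` determines `customer_id`, and `customer_id` in turn determines `customer_city`, so the city depends on the key only transitively. Splitting the flat rows stores each city once per customer, and rejoining shows no information is lost.

```python
# Hypothetical denormalized rows: order_id -> customer_id -> customer_city.
# customer_city depends on the key only through customer_id (transitive).
ORDERS_FLAT = [
    {"order_id": 1, "customer_id": "c1", "customer_city": "Hangzhou"},
    {"order_id": 2, "customer_id": "c1", "customer_city": "Hangzhou"},
    {"order_id": 3, "customer_id": "c2", "customer_city": "Beijing"},
]


def split_transitive(rows):
    """Split flat rows into orders and customers so each city is stored once."""
    orders = [
        {"order_id": r["order_id"], "customer_id": r["customer_id"]}
        for r in rows
    ]
    customers = {}
    for r in rows:
        city = customers.setdefault(r["customer_id"], r["customer_city"])
        if city != r["customer_city"]:
            # The flat layout allowed contradictory copies; surface them.
            raise ValueError(f"inconsistent city for {r['customer_id']}")
    return orders, customers


def rejoin(orders, customers):
    """Rebuild the flat rows, showing the split is lossless."""
    return [
        {**o, "customer_city": customers[o["customer_id"]]}
        for o in orders
    ]
```

In the flat layout the city of `c1` is repeated on every order and the copies could drift apart; after the split there is a single place to update it, which is the redundancy and consistency benefit described above.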
