Generally speaking, a big data platform consists of three parts:
• Data-related tools, products and technologies:
- Bulk data acquisition and transfer: Sqoop, Spark
- Offline data processing: Hadoop, Hive, Spark
- Real-time stream processing: Storm, Spark Streaming, Flink
• Data assets:
- Data produced and accumulated by the business itself
- Data generated by running the company (such as financial and administrative data)
- Third-party data: purchased externally, obtained through exchange, or collected by crawlers
• Data management: once the tools and the data are in place, they need to be managed so that the value of the data is maximized and the risk is minimized
- Data management techniques and concepts: data warehousing, data modeling, data quality, data standards, data security, and metadata management
Star schema
Dimension tables: dictionary-style tables that hold attributes, such as product information
Fact table: records of events, such as user behavior
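As a rough illustration of a fact table joined to a dimension table (the layout commonly called a star schema), here is a minimal sketch using Python's built-in sqlite3 module. The table names, columns, and sample rows (dim_product, fact_user_behavior) are invented for the example and do not come from any real system.

```python
import sqlite3

# Hypothetical schema: one dimension table (products) and one fact table (user behavior).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (
    product_id   INTEGER PRIMARY KEY,
    product_name TEXT,
    category     TEXT
);
CREATE TABLE fact_user_behavior (
    user_id    INTEGER,
    product_id INTEGER REFERENCES dim_product(product_id),
    action     TEXT,
    event_time TEXT
);
INSERT INTO dim_product VALUES (1, 'keyboard', 'hardware'), (2, 'ebook', 'media');
INSERT INTO fact_user_behavior VALUES
    (100, 1, 'click', '2024-01-01 10:00:00'),
    (100, 2, 'buy',   '2024-01-01 10:05:00'),
    (101, 1, 'click', '2024-01-02 09:30:00');
""")

# Join the fact table to its dimension to count events per product category.
events_per_category = conn.execute("""
    SELECT d.category, COUNT(*) AS events
    FROM fact_user_behavior AS f
    JOIN dim_product AS d ON d.product_id = f.product_id
    GROUP BY d.category
    ORDER BY d.category
""").fetchall()
print(events_per_category)
```

The fact table stores only keys and event details; descriptive attributes live once in the dimension table and are attached at query time.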
Snowflake schema
For example, a user's attributes such as age and gender are referenced by id, and the id resolves to a further table: id -> (id, name, age)
Uniform standards: for example, one business unit marks deleted rows with 1/0 while another marks them with Y/N; these need to be unified
Caliber (the definition of a metric) usually refers to the filter conditions in the WHERE clause
The above describes data modeling for a single business line
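The standards problem can be sketched as a small normalization step. This is only an illustration: the function name normalize_deleted_flag, the field names, and the assumption that 1 and Y both mean "deleted" are hypothetical choices, not something the notes specify.

```python
# Assumption: "1" and "Y" mean deleted, "0" and "N" mean not deleted.
_TRUE_VALUES = {"1", "y"}
_FALSE_VALUES = {"0", "n"}

def normalize_deleted_flag(raw):
    """Map a 1/0 or Y/N flag (int or str, any case) to True/False."""
    text = str(raw).strip().lower()
    if text in _TRUE_VALUES:
        return True
    if text in _FALSE_VALUES:
        return False
    # Fail loudly on values outside the agreed standard instead of guessing.
    raise ValueError(f"unrecognized deleted flag: {raw!r}")

# Hypothetical rows from two business units that use different conventions.
rows_unit_a = [{"id": 1, "deleted": 1}, {"id": 2, "deleted": 0}]
rows_unit_b = [{"id": 3, "deleted": "Y"}, {"id": 4, "deleted": "N"}]

unified = [
    {"id": row["id"], "is_deleted": normalize_deleted_flag(row["deleted"])}
    for row in rows_unit_a + rows_unit_b
]
print(unified)
```

Raising on unknown values is a deliberate choice here: a silent default would hide exactly the kind of inconsistency the standard is meant to remove.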
Across the whole big data department: data warehouse -> data mart {
Pull the relevant fields to build a wide table -> on top of the wide table, extract the fields each business needs into its own business table (machine learning, data analysis) -> statistical analysis (via joins or staging tables)
}
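The wide-table flow in the braces above can be sketched in plain Python. The source tables (users, orders), the helper project, and the two derived tables are all hypothetical names chosen for illustration.

```python
# Hypothetical source tables; users is keyed by user_id.
users = {
    100: {"age": 30, "gender": "F"},
    101: {"age": 41, "gender": "M"},
}
orders = [
    {"order_id": 1, "user_id": 100, "amount": 20.5},
    {"order_id": 2, "user_id": 101, "amount": 5.0},
]

# Step 1: pull the relevant fields together into one wide table (a join on user_id).
wide_table = [{**order, **users[order["user_id"]]} for order in orders]

def project(rows, fields):
    """Keep only the named fields of each row."""
    return [{name: row[name] for name in fields} for row in rows]

# Step 2: each business extracts only the fields it needs from the wide table.
ml_features = project(wide_table, ["age", "gender", "amount"])
analysis_table = project(wide_table, ["order_id", "user_id", "amount"])

# Step 3: statistical analysis on a business table.
total_amount = sum(row["amount"] for row in analysis_table)
print(wide_table, ml_features, total_amount)
```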
All types of data (event-tracking data, employee data, business product data) are stored in the data warehouse -> tables are then built according to how each sub-department uses the data
Modeling -> benefits of layering: decoupling, less impact of upstream changes on downstream tables, and easier tracing of business issues through table dependencies
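ODS (Operational Data Store): the raw data layer. Source tables are usually stored here as an unmodified copy.
DW layer (DWD and DWS):
DWD (Data Warehouse Detail): detail layer
DWS (Data Warehouse Service): summary layer
The DWD detail layer and the DWS summary layer are the core of the data platform. They are generated from the ODS layer through ETL (cleaning, transformation, loading),
and are built on dimensional modeling theory, using conformed dimensions and the data bus to keep dimensions consistent across subject areas. (Even if a table is deleted, it can be rebuilt by rerunning the job from ODS.)
ADS (data mart layer, also called the application layer): the application layer consists mainly of the data marts (DM) that individual business teams or departments build on top of DWD and DWS; a data mart is defined relative to the data warehouse. Application-layer data generally comes from the DW layer and, in principle, must not access the ODS layer. Compared with the DW layer, the application layer contains only the detail and summary data that the department or business team itself cares about. (Typically the required tables are joined into a wide table that downstream analysts can query with select *.)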
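Staging area: keeps a backup copy of the raw data in HDFS
dw: data warehouse; data development and modeling
dm: data mart applications; the results of multi-table joins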
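A minimal sketch of the ODS -> DWD -> DWS -> ADS layering, using Python's built-in sqlite3 module in place of a real warehouse engine. The table names (ods_orders, dwd_orders, dws_user_daily, ads_daily_revenue), the columns, and the cleaning rule are all assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- ODS: raw copy of the source table, kept unchanged (one row has no amount).
CREATE TABLE ods_orders (order_id INTEGER, user_id INTEGER, amount TEXT, order_date TEXT);
INSERT INTO ods_orders VALUES
    (1, 100, '20.5', '2024-01-01'),
    (2, 100, '9.5',  '2024-01-01'),
    (3, 101, NULL,   '2024-01-01'),
    (4, 101, '5.0',  '2024-01-02');

-- DWD: cleaned detail rows (drop rows without an amount, cast text to a number).
CREATE TABLE dwd_orders AS
SELECT order_id, user_id, CAST(amount AS REAL) AS amount, order_date
FROM ods_orders
WHERE amount IS NOT NULL;

-- DWS: summary per user per day.
CREATE TABLE dws_user_daily AS
SELECT user_id, order_date, COUNT(*) AS order_count, SUM(amount) AS total_amount
FROM dwd_orders
GROUP BY user_id, order_date;

-- ADS: what one department cares about, built only from DW tables, never from ODS.
CREATE TABLE ads_daily_revenue AS
SELECT order_date, SUM(total_amount) AS revenue
FROM dws_user_daily
GROUP BY order_date;
""")

daily_revenue = conn.execute(
    "SELECT order_date, revenue FROM ads_daily_revenue ORDER BY order_date"
).fetchall()
print(daily_revenue)
```

Because every layer above ODS is derived, dropping dwd_orders and rerunning its CREATE statement would restore it, which is the recovery property noted for the DW layer.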
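Differences between OLTP and OLAP:
OLTP (Online Transaction Processing): focuses on querying individual records, mainly on relational databases
OLAP (Online Analytical Processing): dedicated analytical databases that focus on bulk data requests and are better suited to big data query processing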
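Benefits of columnar storage:
OLAP queries touch only the relevant columns, so there is no need to read every field of the table
OLTP performs inserts, deletes, updates, and lookups, which mostly operate on entire rows, so row-oriented storage suits it better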
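The difference between the two layouts can be shown with a toy in-memory model. This is only a sketch of the idea: real columnar engines work on disk blocks and compression, and the records and function names below are invented for illustration.

```python
# Row-oriented layout: each record is stored together (convenient for whole-row operations).
row_store = [
    {"order_id": 1, "user_id": 100, "amount": 20.5, "city": "Beijing"},
    {"order_id": 2, "user_id": 100, "amount": 9.5,  "city": "Beijing"},
    {"order_id": 3, "user_id": 101, "amount": 5.0,  "city": "Shanghai"},
]

def to_column_store(rows):
    """Column-oriented layout: the values of each column are stored together."""
    return {name: [row[name] for row in rows] for name in rows[0]}

column_store = to_column_store(row_store)

def total_amount_rows(rows):
    # Walks every whole record even though only one field is needed.
    return sum(row["amount"] for row in rows)

def total_amount_columns(columns):
    # Reads only the "amount" column; the other columns are never touched.
    return sum(columns["amount"])

def update_city_rows(rows, order_id, city):
    # OLTP-style change: locate one record and modify it in place.
    for row in rows:
        if row["order_id"] == order_id:
            row["city"] = city
            return True
    return False

print(total_amount_rows(row_store), total_amount_columns(column_store))
```

Both totals agree; the point is which data each function has to visit, which is what makes the column layout attractive for analytical scans and the row layout attractive for single-record changes.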