The Past and Present of Data

Disclaimer: this is an original article by the blogger; please indicate the source when reposting: https://blog.csdn.net/u010597819/article/details/89441662

Metadata

  1. Concept: data that describes other data is called metadata, for example a database's data dictionary, its table of tables, index tables, and lock tables.
  2. Applications:
  3. Problem location: when a task fails, the source of the problem can be analyzed directly on the metadata platform, rather than paying the time cost of tracing back up through the layers.
  4. Impact analysis: the scope a task affects. When a task fails, the metadata shows which tasks depend on it, and recovering it correctly requires those dependent subtasks as well, in order to guarantee a full data recovery.
  5. Service-quality monitoring: for example, database metadata supports analysis of task execution efficiency and of locks and transactions, and hence SQL optimization; deadlocks and timed-out SQL can be interrupted promptly to keep the service reliable.
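As a concrete illustration of "data that describes data", SQLite keeps its catalog in a built-in table called `sqlite_master`: a table whose rows describe the database's other tables and indexes. A minimal sketch using Python's standard `sqlite3` module (the `orders` table and its index are purely illustrative):

```python
import sqlite3

# An in-memory database with one business table and one index.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
conn.execute("CREATE INDEX idx_orders_amount ON orders (amount)")

# sqlite_master is SQLite's metadata catalog: it describes the tables
# and indexes themselves, not the business data stored in them.
rows = conn.execute(
    "SELECT type, name, tbl_name FROM sqlite_master ORDER BY name"
).fetchall()
for kind, name, table in rows:
    print(kind, name, table)
```

Other databases expose the same idea under different names, e.g. MySQL's `information_schema` or Oracle's data dictionary views.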

Source data

  1. Concept: as the name suggests, any available source of data, such as Excel files, flat files, databases, logs, and so on.

Data warehouse

  1. Concept: a data warehouse (Data Warehouse) is a subject-oriented (Subject Oriented), integrated (Integrated), non-volatile (Non-Volatile), time-variant (Time Variant) collection of data that supports management decision making (Decision Making Support). BI (business intelligence), which is to say OLAP systems, is usually backed by the data an underlying warehouse provides. A well-known open-source project in this space that originated in China is Kylin, which has been contributed to Apache: https://github.com/apache/kylin
  2. OLTP: on-line transaction processing. Transaction-oriented.
  3. OLAP: on-line analytical processing. Analysis-oriented.

Why data warehouses?

  1. What if I want to analyze the orders of every region in the order center over the last year and rank them by profit? Query the business database directly? What happens when the query drags the business database down? What happens when orders deadlock while the query runs (even with read/write splitting, the load can increase replication lag)? And what about five years of data?
  2. What if I want to analyze each rider's average monthly income and order volume over the last year in the rider center (which sits in a different database on a different server from the order center)?
  3. What if I want to analyze the male/female distribution of employees across branches nationwide, but find that each branch's system encodes gender differently? For example: Shanghai uses F|M, Beijing uses 0|1, Guangdong uses f|m, Hangzhou uses male|female, and so on.
  4. What if I find that actual orders differ from what is in the company's database? For example: test orders, test data, or even dirty data from system failures. How do I exclude such data?

The role of the data warehouse

Every one of the problems above has a corresponding solution; the data warehouse is the product born of those problems.

  1. Data isolation. The previous day's stabilized business data is extracted into the warehouse daily; the warehouse serves statistical analysis of historical data and has no impact on the business databases.
  2. Data standardization. Synonymous fields from the various business databases are normalized, step by step, into a unified standard inside the data warehouse.
  3. Data cleansing. Dirty business data is filtered out, and only correct business data is retained for later analysis.
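Data standardization in practice often reduces to explicit mapping tables maintained in the warehouse. A minimal Python sketch of the gender-code example from earlier; the branch names and the mappings (including which of Beijing's 0|1 means which gender) are illustrative assumptions, not real specifications:

```python
# Hypothetical per-branch gender encodings (illustrative only).
BRANCH_GENDER_MAPS = {
    "shanghai": {"F": "female", "M": "male"},
    "beijing": {"1": "female", "0": "male"},   # assumed meaning of 0|1
    "guangdong": {"f": "female", "m": "male"},
    "hangzhou": {"female": "female", "male": "male"},
}

def standardize_gender(branch, raw):
    """Map a branch-specific gender code to the warehouse standard.

    Unknown branches or codes are flagged as "unknown" rather than
    guessed, so dirty data can be filtered out downstream instead of
    silently polluting the warehouse.
    """
    return BRANCH_GENDER_MAPS.get(branch, {}).get(raw, "unknown")

print(standardize_gender("shanghai", "F"))  # female
print(standardize_gender("beijing", "0"))   # male (by assumption above)
print(standardize_gender("beijing", "x"))   # unknown
```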

Implementing a data warehouse

Data warehouse modeling

  1. Data warehouse modeling commonly falls into two categories: the star model and the snowflake model. The snowflake model is the most commonly used.
  2. Star model: redundant data is allowed. Large wide tables are typical; for example, a core table at Baidu Maps is said to have more than two thousand fields. This makes drill operations convenient during analysis: roll-up aggregates to coarser dimensions, drill-down moves to finer-grained dimension statistics.
  3. Snowflake model: modeled strictly according to the three normal forms. There is no redundant data, the transaction cost of data changes is low, and performance is higher, but the model is more complex.

ETL

  1. Extract: data extraction. There is generally a stage layer holding roughly seven days of business data, with exactly the same structure as the business schema; the stage layer is the boundary between the warehouse and the business systems.
  2. Transform: cleansing and transformation. There is generally an ODS (Operational Data Store) layer that retains all historical data, lightly aggregated, divided into business subjects, with data standardized and dirty data filtered out. After the ODS layer comes a data-mart layer, which goes by many names (for example an mdm or app prefix); its data is highly aggregated and serves operational data analysis, computing statistical indicators for a particular business.
  3. Load: data loading, i.e., loading the data into the warehouse.
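The three steps can be sketched end to end. A minimal Python example that uses the standard `sqlite3` module to stand in for both the business database and the warehouse; all table and column names, the city-code mapping, and the "negative amount means dirty" rule are illustrative assumptions:

```python
import sqlite3

# Hypothetical "business" source database and "warehouse" target.
business = sqlite3.connect(":memory:")
warehouse = sqlite3.connect(":memory:")

business.execute("CREATE TABLE orders (id INTEGER, city TEXT, amount REAL)")
business.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "SH", 100.0), (2, "BJ", -5.0), (3, "SH", 40.0)],  # -5.0 is dirty
)
warehouse.execute("CREATE TABLE ods_orders (id INTEGER, city TEXT, amount REAL)")

# Extract: pull the rows from the business database (the stage layer).
rows = business.execute("SELECT id, city, amount FROM orders").fetchall()

# Transform: standardize codes and filter out dirty data.
city_names = {"SH": "Shanghai", "BJ": "Beijing"}
clean = [(i, city_names.get(c, c), a) for i, c, a in rows if a >= 0]

# Load: write the cleaned rows into the warehouse's ODS layer.
warehouse.executemany("INSERT INTO ods_orders VALUES (?, ?, ?)", clean)
warehouse.commit()

print(warehouse.execute("SELECT COUNT(*) FROM ods_orders").fetchone()[0])  # 2
```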

Data loading strategies

  1. Snapshot tables: daily snapshots and monthly snapshots; data is extracted and loaded once per day or per month.
  2. Zipper tables: partial zipper or full-table zipper. Every time the data changes, a new record is produced immediately, and every record carries a validity period, i.e., a start time and an end time. This captures the history of certain state changes in detail, and the data volume is smaller than a snapshot table's.
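The zipper-table update rule can be sketched in a few lines. The row layout (key, value, start date, end date) and the far-future sentinel for "still valid" are illustrative choices:

```python
from datetime import date

OPEN_END = date(9999, 12, 31)  # sentinel end date for the current version

def zipper_update(history, key, new_value, today):
    """Close the currently open row for `key` and append a new open row.

    Each row is (key, value, start_date, end_date); an end_date equal to
    OPEN_END marks the currently valid version of the record.
    """
    for i, (k, v, start, end) in enumerate(history):
        if k == key and end == OPEN_END:
            if v == new_value:                 # no change: record nothing
                return history
            history[i] = (k, v, start, today)  # close the old version
    history.append((key, new_value, today, OPEN_END))
    return history

h = []
zipper_update(h, "order-1", "created", date(2019, 1, 1))
zipper_update(h, "order-1", "shipped", date(2019, 1, 3))
for row in h:
    print(row)
```

After the two updates, the history holds one closed row (valid 2019-01-01 to 2019-01-03) and one open row, so the order's full state history is preserved with one row per change rather than one row per day.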

Data analysis and data mining

  1. Without a data warehouse, the difficulty of data analysis is easy to imagine: data is scattered across all kinds of databases and data sources, and problems such as dirty data and synonyms add plenty of trouble to the analysis.
  2. The warehouse's subject areas let each business be analyzed and summarized by subject, and the resulting metrics inform decisions about the company's business expansion and operations.
  3. The most widely cited example: statistical analysis found that when men go to a store to buy diapers, they often like to buy some beer as well. If a store cannot satisfy both needs at once, the customer may choose another store; and placing the two products together saves shopping time and improves the shopping experience.

Unsupervised learning

K-means

Algorithm: k-means. A partitioning algorithm in which the center of each cluster is represented by the mean of all the objects in the cluster.
Input:

  • k: the number of clusters;
  • D: a data set containing n objects

Output: a set of k clusters
Method:

  1. Arbitrarily choose k objects from D as the initial cluster centers;
  2. repeat
  3. assign each object to the most similar cluster, according to the mean of the objects in each cluster;
  4. update the cluster means, i.e., recompute the mean of the objects in each cluster;
  5. until no change occurs.
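The steps above can be written out directly. A minimal sketch for 2-D points with squared Euclidean distance; for reproducibility it takes the first k points as the "arbitrarily chosen" initial centers:

```python
def kmeans(points, k, iters=100):
    """Plain k-means on 2-D points, following the steps above."""
    centers = list(points[:k])          # step 1: initial cluster centers
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # step 3: assign each object to the cluster with the nearest mean
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda i: (p[0] - centers[i][0]) ** 2
                                  + (p[1] - centers[i][1]) ** 2)
            clusters[j].append(p)
        # step 4: recompute each cluster mean
        new_centers = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
            if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:      # step 5: until no change
            break
        centers = new_centers
    return centers, clusters

pts = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
       (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
centers, clusters = kmeans(pts, 2)
print(sorted(centers))  # one center near (0.1, 0.1), one near (5.0, 5.03)
```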

Supervised learning

Bayes' theorem

Bayesian classification algorithms are based on Bayes' theorem. A simple form known as the naive Bayes classifier is comparable in accuracy to decision trees and selected neural-network classifiers.
Bayes' theorem: P(H|X) = P(X|H)·P(H) / P(X)
P(H|X) is the posterior probability of H conditioned on X. For example, suppose the data tuples are confined to customers described by the attributes age and income, and X is a 35-year-old customer with an income of $40,000. Let H be the hypothesis that the customer will buy a computer. Then P(H|X) is the probability that customer X will buy a computer given that we know the customer's age and income.
P(H) is the prior probability of H. In our example, it is the probability that any given customer will buy a computer, regardless of age, income, or any other information.
P(X|H) is the posterior probability of X conditioned on H. That is, it is the probability that a customer is 35 years old and earns $40,000, given that we know the customer will buy a computer.
P(X) is the prior probability of X: the probability that a person from the customer set is 35 years old and earns $40,000.
Bayes' theorem provides a way to compute the posterior probability P(H|X) from P(X), P(H), and P(X|H).

Naive Bayes

RID age income student credit_rating Class:buys_computer
1 youth high no fair no
2 youth high no excellent no
3 middle high no fair yes
4 senior medium no fair yes
5 senior low yes fair yes
6 senior low yes excellent no
7 middle low yes excellent yes
8 youth medium no fair no
9 youth low yes fair yes
10 senior medium yes fair yes
11 youth medium yes excellent yes
12 middle medium no excellent yes
13 middle high yes fair yes
14 senior medium no excellent no

The table above is the training data.

How the naive Bayes classifier works

  1. Each row of data is an n-dimensional attribute vector X = {x1, x2, …, xn}.
  2. Class attribute: buys_computer. Suppose there are m classes C = {c1, c2, …, cm}; in this example there are only two, yes|no. The classifier predicts that a new X belongs to the class with the highest posterior probability: X belongs to class Ci if and only if P(Ci|X) > P(Cj|X) for 1 ≤ j ≤ m, j ≠ i. Prediction thus amounts to maximizing P(Ci|X). By Bayes' theorem: P(Ci|X) = P(X|Ci)·P(Ci) / P(X)
  3. Since P(X) is constant across classes, only P(X|Ci)·P(Ci) needs to be maximized. If the class priors P(Ci) are unknown, the classes are commonly assumed equiprobable, i.e., P(C1) = P(C2) = … = P(Cm).
  4. With many attributes in the data set, computing P(X|Ci) directly can be very expensive. To reduce the cost, make the naive assumption of class-conditional independence: P(X|Ci) = P(x1|Ci)·P(x2|Ci)…P(xn|Ci). For each attribute, check whether it is categorical or continuous-valued.
  5. If Ak is categorical, P(xk|Ci) is the number of tuples of class Ci in the training set D that have value xk for attribute Ak, divided by |Ci,D|, the number of tuples of class Ci in D.
  6. If Ak is continuous-valued, it is typically assumed to follow a Gaussian distribution with mean μ and standard deviation σ, g(x, μ, σ) = (1 / (√(2π)·σ)) · e^(−(x−μ)² / (2σ²)), so that P(xk|Ci) = g(xk, μCi, σCi).
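The Gaussian density for continuous attributes can be made concrete with a small function. The numbers here (mean 38, standard deviation 12, x = 35) are illustrative, not taken from the table:

```python
import math

def gaussian(x, mu, sigma):
    """Gaussian density g(x, mu, sigma) used for continuous attributes."""
    return (1.0 / (math.sqrt(2 * math.pi) * sigma)) * \
        math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

# If the ages of customers in class Ci had mean 38 and standard
# deviation 12, then P(age=35 | Ci) would be estimated as:
print(round(gaussian(35, 38, 12), 4))  # 0.0322
```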

Using the naive Bayes classifier to predict a class label

Classes: C1 corresponds to buys_computer = yes and C2 to buys_computer = no. We want to classify X = (age = youth, income = medium, student = yes, credit_rating = fair).

  1. P(Ci), the prior probability of each class:
P(buys_computer=yes)=9/14=0.643
P(buys_computer=no)=5/14=0.357
  2. P(X|Ci), i = 1, 2: compute the conditional probability of each attribute value given Ci:
P(age=youth|buys_computer=yes) = 2/9 = 0.222
P(age=youth|buys_computer=no) = 3/5 = 0.600
P(income=medium|buys_computer=yes) = 4/9 = 0.444
P(income=medium|buys_computer=no) = 2/5 = 0.400
P(student=yes|buys_computer=yes) = 6/9 = 0.667
P(student=yes|buys_computer=no) = 1/5 = 0.200
P(credit_rating=fair|buys_computer=yes) = 6/9 = 0.667
P(credit_rating=fair|buys_computer=no) = 2/5 = 0.400
P(X|buys_computer=yes) = P(age=youth|buys_computer=yes)*P(income=medium|buys_computer=yes)*P(student=yes|buys_computer=yes)*P(credit_rating=fair|buys_computer=yes) = 0.044
P(X|buys_computer=no) = 0.019

Then P(X|Ci)·P(Ci):
P(X|buys_computer=yes)*P(buys_computer=yes)=0.028
P(X|buys_computer=no)*P(buys_computer=no)=0.007
  3. Therefore, for data X, naive Bayes predicts the class buys_computer = yes.
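The whole worked example can be reproduced from the training table. This sketch recomputes P(X|Ci)·P(Ci) for both classes exactly as in the steps above:

```python
# Training data from the table above:
# (age, income, student, credit_rating, buys_computer)
data = [
    ("youth", "high", "no", "fair", "no"),
    ("youth", "high", "no", "excellent", "no"),
    ("middle", "high", "no", "fair", "yes"),
    ("senior", "medium", "no", "fair", "yes"),
    ("senior", "low", "yes", "fair", "yes"),
    ("senior", "low", "yes", "excellent", "no"),
    ("middle", "low", "yes", "excellent", "yes"),
    ("youth", "medium", "no", "fair", "no"),
    ("youth", "low", "yes", "fair", "yes"),
    ("senior", "medium", "yes", "fair", "yes"),
    ("youth", "medium", "yes", "excellent", "yes"),
    ("middle", "medium", "no", "excellent", "yes"),
    ("middle", "high", "yes", "fair", "yes"),
    ("senior", "medium", "no", "excellent", "no"),
]

def naive_bayes_score(x, label):
    """P(X|Ci) * P(Ci) for categorical attributes."""
    rows = [r for r in data if r[-1] == label]
    prior = len(rows) / len(data)           # P(Ci)
    likelihood = 1.0
    for k, value in enumerate(x):           # product of P(xk|Ci)
        likelihood *= sum(1 for r in rows if r[k] == value) / len(rows)
    return prior * likelihood

x = ("youth", "medium", "yes", "fair")
scores = {c: naive_bayes_score(x, c) for c in ("yes", "no")}
print({c: round(s, 3) for c, s in scores.items()})  # yes: 0.028, no: 0.007
print(max(scores, key=scores.get))                  # yes
```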

Summary: the past and present of data

(Figure: the past and present of data)
