Hebei University of Technology Data Mining Experiment 2: Data Cube and Online Analytical Processing (OLAP) Construction

1. Experimental purpose

(1) Be familiar with basic data cube construction and online analytical processing (OLAP) algorithms.
(2) Establish a consistent and high-quality relational database.
(3) Establish a basic data cube based on the established database.
(5) Write an experiment report.

2. Experimental principles

1. Relational database

A relational database is a database built on the relational model, which uses mathematical concepts and methods such as set algebra to process the data it holds. The relational model consists of three parts: the relational data structure, the set of relational operations, and relational integrity constraints.
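
To make the set-algebra view concrete, here is a minimal sketch in plain Python that treats a relation as a set of tuples and applies selection, projection, and a natural join. The relations `sales` and `items` and their contents are hypothetical and are not taken from the experiment data.

```python
# A relation modelled as a set of tuples with a fixed attribute order (hypothetical data)
sales = {("1019", 1001001, 2), ("1020", 1001001, 3), ("1020", 1003002, 1)}  # (shop, item, num)
items = {(1001001, "oil"), (1003002, "rice")}  # (item, name)

# Selection: keep the tuples that satisfy a predicate
sel = {t for t in sales if t[0] == "1020"}

# Projection: keep only some attributes (duplicates collapse because the result is a set)
proj = {(t[0],) for t in sales}

# Natural join on the shared item attribute
joined = {(s, i, n, name) for (s, i, n) in sales for (j, name) in items if i == j}
```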

2. Data cube

A data cube is a multidimensional data model that allows data to be modeled and viewed in multiple dimensions. It is defined by dimensions and facts. A dimension is a perspective or entity with respect to which an organization wants to keep records. Each dimension can have a table associated with it, called a dimension table, which further describes the dimension. For example, the dimension table of the item dimension contains attributes such as name, time, and type. Facts: a multidimensional data model is organized around a topic such as sales; the topic is represented by facts, which are numerical measures.
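
As a rough illustration of these terms, the sketch below models a tiny fact table and an item dimension table in plain Python and aggregates the measure into cube cells. The table contents, the field names, and the `build_cube` helper are invented for illustration only; they do not come from the experiment data.

```python
from collections import defaultdict

# Hypothetical dimension table for the "item" dimension: key -> descriptive attributes
item_dim = {
    1001001: {"name": "soybean oil", "type": "10010"},
    1003002: {"name": "rice", "type": "10030"},
}

# Hypothetical fact table: each row holds dimension keys plus numeric measures
facts = [
    {"shop": "1019", "date": "13", "item": 1001001, "num": 2, "price": 10.0},
    {"shop": "1019", "date": "13", "item": 1003002, "num": 1, "price": 5.5},
    {"shop": "1020", "date": "14", "item": 1001001, "num": 3, "price": 10.0},
]


def build_cube(rows):
    """Aggregate the sales measure (num * price) per (shop, date, item type) cell."""
    cube = defaultdict(float)
    for row in rows:
        item_type = item_dim[row["item"]]["type"]  # look up the dimension attribute
        cube[(row["shop"], row["date"], item_type)] += row["num"] * row["price"]
    return dict(cube)


cube = build_cube(facts)
```

Each key of `cube` is one combination of dimension values, and each value is the aggregated fact for that cell.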

3. OLAP operations

Roll-up: Aggregate the data cube either by climbing up a concept hierarchy along a dimension or by dimension reduction.
Drill-down: The reverse of roll-up, achieved either by stepping down a concept hierarchy along a dimension or by introducing additional dimensions.
Slice: Make a selection on one dimension of the given data cube, resulting in a subcube. It is one layer of the data cube.
Dice: Make a selection on two or more dimensions to define a subcube. It is one block within a layer of the data cube.
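
The operations above can be sketched on a cube stored as a plain dictionary. This is a minimal illustration under assumed data: the cell values and the helpers `roll_up`, `slice_cube`, and `dice` are hypothetical names, not part of the experiment code.

```python
from collections import defaultdict

# Hypothetical base cuboid: (shop, date, category) -> total sales
cube = {
    ("1019", "13", "10010"): 20.0,
    ("1019", "14", "10010"): 10.0,
    ("1020", "13", "10010"): 30.0,
    ("1020", "13", "10030"): 5.0,
    ("1020", "14", "10030"): 7.0,
}
DIMS = ("shop", "date", "category")


def roll_up(cube, drop):
    """Roll up by dimension reduction: sum out the dimension named `drop`."""
    keep = [i for i, d in enumerate(DIMS) if d != drop]
    out = defaultdict(float)
    for key, value in cube.items():
        out[tuple(key[i] for i in keep)] += value
    return dict(out)


def slice_cube(cube, dim, value):
    """Slice: fix a single value on one dimension."""
    i = DIMS.index(dim)
    return {k: v for k, v in cube.items() if k[i] == value}


def dice(cube, **allowed):
    """Dice: restrict two or more dimensions to sets of allowed values."""
    return {
        k: v for k, v in cube.items()
        if all(k[DIMS.index(d)] in vals for d, vals in allowed.items())
    }


by_shop_category = roll_up(cube, "date")
store_1020 = slice_cube(cube, "shop", "1020")
sub = dice(cube, shop={"1020"}, date={"13"})
```

Drill-down needs no extra helper here: it corresponds to going back from `by_shop_category` to the more detailed `cube`.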

4. Design of data warehouse

1) Select the business process to model, such as ordering, invoicing, shipping, inventory, bookkeeping, sales, or the general ledger.
2) Select the granularity of the business process: the granularity is the basic, atomic level of data in the fact table, such as a single transaction or a daily snapshot.
3) Select the dimensions to use for each fact table record: typical dimensions are time, item, customer, supplier, warehouse, transaction type, and status.
4) Select the measures to place in each fact table record: typical measures are additive numeric quantities, such as dollars_sold and units_sold.
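
One way to picture the outcome of these design choices is a single fact record type. The sketch below is only an assumption about how such a record could look: the class name `SalesFact`, the key fields, and the sample rows are hypothetical, while `dollars_sold` and `units_sold` are the measure names mentioned above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SalesFact:
    # Dimension keys (granularity assumed here: one row per transaction line)
    time_key: str
    item_key: int
    store_key: str
    # Additive numeric measures
    dollars_sold: float
    units_sold: int


rows = [
    SalesFact("20030413", 1001001, "1019", 20.0, 2),
    SalesFact("20030413", 1003002, "1019", 5.5, 1),
]

# Additive measures can be summed across any combination of dimensions
total_dollars = sum(r.dollars_sold for r in rows)
total_units = sum(r.units_sold for r in rows)
```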

3. Experimental content and procedures

1. Experimental content

1) Use VC++ programming tools to write a program that establishes the relational data storage structures and creates the data cube, and describe the main processes and methods in the experiment report. The data cube has three dimensions: product category, store number, and time. Specific requirements:

  1. Create three storage tables (txt files) that store the data of stores 1019, 1020, and 1021 respectively;
  2. The columns of each txt file are the product categories (the first five digits of the product ID): 10010 oil, 10020 flour products, 10030 rice and flour, 10088 grain and oil gifts;
  3. The rows of each txt file are the dates 13 to 19. The values stored in the table are the total sales for this week.

2) Perform simple OLAP queries. Specific requirements:

  • Find the total sales of category 10010 (oil) in store 1020 on the 13th;
  • Compute the total sales of category 10030 (rice and flour) in store 1020;
  • Query the sales of a specified product category in a specified store (additional question).

2. Experimental steps

1) Carefully study and review the data to identify the attributes or dimensions that the analysis should include, and remove unnecessary data. After preprocessing, missing values have been filled in and the format has been unified. Read the product ID and date from the preprocessed data and calculate the sales.
2) Select an appropriate storage structure to realize data storage access and implement corresponding functions.
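
The aggregation in step 1 can be sketched with a pandas `groupby`. This is an alternative, hedged illustration rather than the method used in the listing further down: the sample rows are invented, the column names `Date`, `GoodID`, `Num`, and `Price` are the ones used by the experiment code, and deriving the category with integer division assumes seven-digit product IDs.

```python
import pandas as pd

# Hypothetical sample rows using the column names from the experiment code
data = pd.DataFrame({
    "Date": [20030413, 20030413, 20030414, 20030414],
    "GoodID": [1001001, 1001050, 1003002, 1008801],
    "Num": [2, 1, 3, 1],
    "Price": [10.0, 4.0, 2.5, 0.0],
})

# Assumption: the category is the first five digits of a seven-digit GoodID
data["Category"] = data["GoodID"] // 100
data["Day"] = data["Date"] % 100
data["Sales"] = data["Num"] * data["Price"]

# Sum sales per (day, category), then spread the categories across the columns
cuboid = (
    data.groupby(["Day", "Category"])["Sales"].sum()
    .unstack(fill_value=0.0)
)
```

The result is a date by category table of total sales, which is the same layout as the per-store txt files described in the requirements.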

3. Program block diagram

[Figure: program block diagram]

4. Experimental samples

data.csv, the output of the Experiment 1 preprocessing

5. Experimental code

#!/usr/bin/env python  
# -*- coding: utf-8 -*-
#
# Copyright (C) 2021 #
# @Time    : 2022/5/30 21:27
# @Author  : Yang Haoyuan
# @File    : Exp2.py
# @Software: PyCharm
import pandas as pd
import argparse

parser = argparse.ArgumentParser(description='Exp2')
parser.add_argument('--Shop', type=str, default="1019", choices=["1019", "1020", "1021"])
parser.add_argument('--Good', type=str, default="10010油", choices=["10010油", "10020面制品", "10030米和粉", "10088粮油类赠品"])

parser.set_defaults(augment=True)
args = parser.parse_args()
print(args)


# Read the data of the three stores 1019, 1020, and 1021
# (the fixed row ranges below assume data.csv is ordered by store)
def getData():
    data = pd.read_csv("data.csv")
    data_19 = data[:7693]
    data_20 = data[7693:17589]
    data_21 = data[17589:]
    return data_19, data_20, data_21


# Category label -> inclusive GoodID range (the category is the first five digits of the ID)
CATEGORIES = {
    "10010油": (1001000, 1001099),
    "10020面制品": (1002000, 1002099),
    "10030米和粉": (1003000, 1003099),
    "10088粮油类赠品": (1008800, 1008899),
}
DAYS = range(13, 20)  # the 13th to the 19th


# Build the cuboid (date x category) for one store, stored as a DataFrame
def make_cuboid(data):
    dataFrame = pd.DataFrame(0.0, index=[str(d) for d in DAYS], columns=list(CATEGORIES))
    for day in DAYS:
        # Select the records for this date
        df = data[data["Date"] == 20030400 + day]
        for name, (lo, hi) in CATEGORIES.items():
            # Total sales = sum of Num * Price over the products in this category
            _df = df[(df["GoodID"] >= lo) & (df["GoodID"] <= hi)]
            dataFrame.loc[str(day), name] = (_df["Num"] * _df["Price"]).sum()
    return dataFrame


if __name__ == "__main__":
    data_1019, data_1020, data_1021 = getData()

    # Save each store's cuboid to a txt file with 4 decimal places
    df_1019 = make_cuboid(data_1019)
    df_1019.to_csv("1019.txt", index=False, float_format="%.4f")

    df_1020 = make_cuboid(data_1020)
    df_1020.to_csv("1020.txt", index=False, float_format="%.4f")

    df_1021 = make_cuboid(data_1021)
    df_1021.to_csv("1021.txt", index=False, float_format="%.4f")

    # Save the three-dimensional data cube to a csv file
    data = pd.concat([df_1019, df_1020, df_1021], keys=["1019", "1020", "1021"], names=["Shop", "Date"])
    data.to_csv("data_cubiod.csv")

    # Total sales of category 10010 (oil) in store 1020 on the 13th
    print("Total sales of 10010 oil in store 1020 on the 13th:", format(data.loc[("1020", "13"), "10010油"], '.2f'))

    # Total sales of category 10030 (rice and flour) in store 1020
    df = data.loc["1020"]
    print("Total sales of 10030 rice and flour in store 1020:", format(df["10030米和粉"].sum(), '.2f'))

    # Total sales of the specified category in the specified store
    df = data.loc[args.Shop]
    print("Sales of " + args.Good + " in store " + args.Shop + ":", format(df[args.Good].sum(), '.2f'))

4. Experimental results

[Figure: 1019 data cube txt file]
[Figure: 1020 data cube txt file]
[Figure: 1021 data cube txt file]
[Figure: three-dimensional data cube csv file; the dimensions are Shop, Date, and the four product categories]
[Figure: total sales of category 10010 (oil) in store 1020 on the 13th, and total sales of category 10030 (rice and flour) in store 1020]
[Figure: query of the sales of category 10088 (grain and oil gifts) in store 1021]

5. Experimental analysis

This experiment mainly constructs a data cube for the preprocessed data and performs corresponding OLAP operations.

The DataFrame data structure in the pandas package is used to store each two-dimensional cuboid. DataFrame provides many operations and methods that are well suited to constructing data cubes and implementing OLAP operations.

In a DataFrame, each cell is addressed by two indexes: the row index (index) and the column index (columns). A 1-D cuboid can be read using either index alone, and a single cell of the base cuboid can be read using both. The pandas package also had a Panel structure that supported three-dimensional data, but it has been deprecated, so to build the three-dimensional cube I instead concatenated the three 2-D cuboids into one DataFrame whose rows carry a two-level (Shop, Date) index.

OLAP operations are implemented using the DataFrame.loc method and Series.sum method.
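
A small sketch of how such queries look on a DataFrame with a two-level row index is given below. The per-store tables and their values are hypothetical and use shortened column labels; only `pd.concat` with `keys` and `names`, `DataFrame.loc`, and `Series.sum` mirror what the experiment code does, while `xs` and `groupby(level=...)` are shown as optional alternatives.

```python
import pandas as pd

# Hypothetical per-store cuboids (date x category) with made-up values
cols = ["10010", "10030"]
days = ["13", "14"]
df_1019 = pd.DataFrame([[20.0, 5.0], [10.0, 0.0]], index=days, columns=cols)
df_1020 = pd.DataFrame([[30.0, 5.0], [0.0, 7.0]], index=days, columns=cols)

# Stack the 2-D cuboids under a two-level (Shop, Date) row index
cube = pd.concat([df_1019, df_1020], keys=["1019", "1020"], names=["Shop", "Date"])

# One cell of the base cuboid: store 1020, the 13th, category 10010
cell = cube.loc[("1020", "13"), "10010"]

# Slice on Shop, then roll up over Date for one category
store_total = cube.xs("1020", level="Shop")["10030"].sum()

# Slice on Date across all stores
day_13 = cube.xs("13", level="Date")

# Roll up the Date dimension for every store and category
by_shop = cube.groupby(level="Shop").sum()
```

Keeping Shop and Date as index levels means that every query is either a label lookup or a sum over one level, so no separate three-dimensional container is needed.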

Origin blog.csdn.net/d33332/article/details/127245436