DAMA-DMBOK2 key knowledge compilation CDGA/CDGP - Chapter 5 Data Modeling and Design

Table of contents

1. Score distribution

2. Summary of key knowledge

1 Introduction

1.1 Business drivers

1.2 Objectives and principles

1.3 Basic concepts

2. Activities

2.1 Planning data modeling

2.2 Establish data model

2.3 Review data model

2.4 Maintain data model

3. Tools

4. Method

5. Data modeling and design governance

5.1 Data modeling and design quality management

5.2 Metrics


1. Score distribution

        CDGA: 10 points (10 single choices)

        CDGP: 10 points (design questions)

                Test points:

                        Model design concepts and goals;

                        Data modeling methods;

                        Data model governance;

                        Data model evaluation metrics;

2. Summary of key knowledge

1 Introduction

Context diagram: (figure omitted; see the Chapter 5 "Data Modeling and Design" context diagram in DMBOK2)

Data Modeling: The process of discovering, analyzing, and scoping data requirements, and then representing and communicating those requirements in a precise form called a data model. The process is iterative and may include conceptual, logical, and physical models.

There are six common data modeling schemes:

  1. Relational.
  2. Dimensional.
  3. Object-oriented.
  4. Fact-based.
  5. Time-based.
  6. NoSQL.

Depending on the level of descriptive detail, each scheme's models are divided into: conceptual model, logical model, and physical model.

1.1 Business drivers

Business drivers:

  1. Provide a common vocabulary about data.
  2. Capture and document explicit knowledge about the organization's data and systems.
  3. Serve as a primary communication tool during projects.
  4. Provide a starting point for application customization, integration, or even replacement.

1.2 Objectives and principles

The goals of data modeling and design: to confirm and document an understanding of data requirements from different perspectives, to ensure that applications align better with current and future business needs, and to lay a foundation for further data applications and data management initiatives, such as master data management and data governance projects.

Understanding data from different perspectives can help:

  1. Formalization. A data model concisely defines and standardizes data structures and relationships, and helps prevent data exceptions.
  2. Scope definition. A data model helps explain the boundaries of the data context.
  3. Knowledge retention / documentation. A data model provides a durable record of how things were, helps future staff understand the organization, and supports impact analysis of changes. It can be reused to understand the data structures in an environment, and modelers use it to help others understand the information blueprint.

1.3 Basic concepts

Data modeling is most commonly used in the context of systems development and maintenance, known as the system development life cycle (SDLC). Its most direct outcome, however, is a shared understanding of the organization's data.

Model: A representation of something that exists or a pattern for something to be made. A model can contain one or more diagrams, and standardized diagram symbols let people quickly grasp the content. Maps, organizational charts, and building blueprints are everyday examples of models.

Data model: Describes the data that an organization already understands or will need in the future. A data model consists of a set of symbols with text labels that attempts to visually represent and communicate data requirements, from the data modeler, for a specific set of data.

Modeled data types: (the first four types below are static data)

  1. Category information: data that classifies things or assigns them a type, such as color or model.
  2. Resource information: basic data needed to carry out operational processes, such as products and customers. Resource entities are sometimes called reference data.
  3. Business event information: data created in the course of operations, such as customer orders. Event entities are sometimes called transactional business data.
  4. Detailed transaction information: generated through sales systems and sensors and used to analyze trends, e.g., big data.

Parts of "dynamic data" can also be modeled, such as system scenarios.

Data model components: entities, relationships, attributes, domains

  • 1. Entity
    • Definition : An entity is a thing that is distinct from other things. In data modeling, entities are the carriers of the information an organization collects; they are sometimes described as the nouns of an organization. An entity can be thought of as the answer to a basic question: who, what, when, where, why, how, or a combination of these. Entity definitions are core metadata, and a high-quality definition has three characteristics: clear, accurate, and complete.
    • Alias :
      • According to model type:
        • The term "entity" is used in relational models
        • The terms "dimension" and "fact table" are used in dimensional models
        • The terms "class" or "object" are used in object-oriented models
        • The terms "hub", "satellite", and "link" are used in time-based (Data Vault) models
        • The terms "file" or "node" are used in non-relational (NoSQL) database models
      • According to the level of model abstraction:
        • Entities in a conceptual model are usually called concepts or terms
        • Entities in a logical model are called entities (other names vary by model type)
        • In a physical model, the entity name depends on the database technology; the most common is table
    • Graphical representation : generally represented by a rectangle, with the entity name in the middle of the rectangle
  • 2.Relationship
    • Definition: Relationship is the association between entities. Relationships capture high-level interactions between conceptual entities, detailed interactions between logical entities, and constraints between physical entities.
    • Alias:
      • According to model type:
        • The term "relationship" is used in relational models
        • The term "navigation path" is used in dimensional models
        • Terms such as "edge" or "link" are used in NoSQL (non-relational) database models
      • According to the level of model abstraction:
        • Relationships at the conceptual and logical levels are called "relationships"
        • Relationships at the physical level may take other names, such as "constraints" or "references", depending on the database technology
    • Graphical representation: relationships are drawn as lines on a data modeling diagram
    • Cardinality of a relationship: indicates how many instances of one entity can participate in the relationship with instances of the other entity. Only three cardinality symbols are needed: zero, one, and many.
      • Arity of relationships, i.e., the number of entities involved (in a relationship network, an entity can have multiple parent entities):
        • Unary relationship: a recursive, self-referencing relationship.
          • One-to-many unary relationships describe hierarchies.
          • Many-to-many unary relationships describe networks or graphs.
        • Binary relationship: a relationship involving two entities.
        • Ternary relationship: a relationship involving three entities.
    • Foreign Key : the means of representing relationships in physical data models.
  • 3. Attributes
    • Definition: An attribute is a property that identifies, describes, or measures an entity instance.
    • Alias:
      • In the physical model, attributes appear as columns, fields, tags, or nodes within tables, views, documents, graphs, or files.
    • Graphical representation: The attributes are described in the figure as a list within the entity rectangle.
    • Identifier:
      • A key is one or more attributes that uniquely identify an entity instance. By structure, keys are classified as single, composite, compound, or surrogate keys; by function, as candidate, primary, or alternate keys. (See the key-and-domain sketch after this list.)
        • Key structure types :
          • Single key: one attribute that uniquely identifies an entity instance.
          • Surrogate key: also a single key; a system-generated unique identifier for a table, typically an integer counter, whose value carries no meaning. It is technical and should not be visible to end users.
          • Composite key: a set of two or more attributes that together uniquely identify an entity instance.
          • Compound key: contains one composite key plus at least one other single key, composite key, or non-key attribute.
        • Key function types :
          • Superkey: any set of attributes that uniquely identifies an entity instance.
          • Candidate key: a minimal set of one or more attributes that identifies an entity instance. "Minimal" means that no subset of the candidate key uniquely identifies the entity instance. An entity can have multiple candidate keys, and a candidate key can be a business (natural) key.
          • Business key: one or more attributes that a business professional uses to retrieve a single entity instance. A business key is the opposite of a surrogate key: it carries business meaning, while a surrogate key does not.
          • Primary key: the candidate key chosen as the unique identifier of an entity.
          • Alternate key: a candidate key that is unique but was not chosen as the primary key; it can still be used to find specific entity instances.
        • Identifying and non-identifying relationships:
          • An independent entity is one whose primary key contains only attributes that belong to that entity.
          • A dependent entity is one whose primary key contains at least one attribute from another entity.
  • 4. Domain
    • Definition: the complete set of possible values that can be assigned to an attribute. A domain provides a way to standardize the characteristics of attributes. Every value in the domain is a valid value; values outside the domain are invalid values. An attribute should never contain values outside its assigned domain. Domains can be restricted by additional rules, called constraints.
    • Domains can be defined in several different ways:
      • 1. Data type: e.g., integer, character(30), date.
      • 2. Data format: templates or masks, such as ZIP codes or phone numbers.
      • 3. List: a fixed set of values, like a drop-down list.
      • 4. Range: all values of a data type between a stated minimum and maximum.
      • 5. Rule-based: values that satisfy a stated rule, e.g., a price must exceed its cost.
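
To make the key and domain concepts above concrete, here is a minimal sketch using Python's standard-library sqlite3 module; the Customer/Account tables, column names, and constraint values are hypothetical illustrations, not taken from DMBOK:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Customer (
    Customer_Id   INTEGER PRIMARY KEY,           -- surrogate key: system-generated, no business meaning
    Customer_Code TEXT NOT NULL UNIQUE,          -- business (natural) key, kept as an alternate key
    Customer_Name TEXT NOT NULL,
    Status_Code   TEXT NOT NULL
        CHECK (Status_Code IN ('ACTIVE', 'CLOSED')),   -- list-defined domain enforced as a constraint
    Credit_Limit  NUMERIC
        CHECK (Credit_Limit BETWEEN 0 AND 1000000)     -- range-defined domain
);

CREATE TABLE Account (
    Customer_Id  INTEGER NOT NULL REFERENCES Customer (Customer_Id),  -- foreign key = relationship
    Account_Nbr  TEXT NOT NULL,
    Open_Date    TEXT NOT NULL,                  -- data-type/format domain (ISO-8601 date text)
    PRIMARY KEY (Customer_Id, Account_Nbr)       -- composite key
);
""")
```

Because Account's primary key borrows Customer_Id from Customer, the Customer-Account relationship is an identifying relationship and Account is a dependent entity.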

Data modeling method:

  • Specific representation used by modeling schemes :
    • In the relational modeling scheme, only the physical model is restricted to relational databases; the conceptual and logical models can also be applied to other database types.
    • Fact-based modeling is similar to relational modeling in this respect.
    • In the dimensional modeling scheme, all three model levels apply only to relational databases and multidimensional databases.
    • The object-oriented modeling scheme applies only to relational databases and object databases.
    • Time-based modeling is a physical data modeling technique, used mainly for data warehouses in relational database environments.
    • NoSQL approaches depend heavily on the underlying database structure (document, column, graph, or key-value) and are therefore also physical data modeling techniques.
  • 6 common data modeling methods:
    • 1) Relational modeling
      • Design purpose: to express business data precisely and eliminate redundancy.
      • Common notation:
        • 1) Information Engineering (IE): uses three-pronged "crow's foot" lines to represent cardinality
        • 2) Integration Definition for Information Modeling (IDEF1X)
        • 3) Barker notation
        • 4) Chen notation
    • 2) Dimensional modeling
      • Design purpose: dimensional modeling optimizes the querying and analysis of very large data sets; models are drawn using axis notation.
      • Key concepts:
        • Dimensional models divide data into fact tables and dimension tables (see the star-schema sketch after this methods list):
          • Fact table: rows correspond to specific numeric measurements, such as amounts. Fact tables take up most of the data volume and contain large numbers of rows.
          • Dimension table: represents the important objects of the business and mainly holds textual descriptions. Dimensions act as the entry points or links into fact tables and as the primary source of query and report constraints. They are highly denormalized and typically account for about 10% of the total data. Each dimension row has a unique identifier, usually either a surrogate key or a natural key, plus descriptive attributes.
            • Slowly changing dimensions (SCDs) manage change according to the rate and type of change; the main techniques are overwrite (Type 1), new row (Type 2), and new column (Type 3).
          • Snowflaking: normalizing the flat, single-table dimension structures of a star schema into the respective component hierarchies or network structures.
          • Grain: the meaning or description of a single row in the fact table; it is the most detail each row carries. Setting the grain is one of the key steps in dimensional design.
          • Conformed dimensions: dimensions built with the whole organization in mind, so they can be reused across data marts.
          • Conformed facts: facts that use standardized definitions of terms across multiple data marts.
    • 3) Object-oriented modeling
      • Common notation:
        • 1) Unified Modeling Language UML
          • Features:
            • 1) A UML class diagram is similar to an ER diagram, except that ER diagrams have no operations (methods).
            • 2) In an ER diagram, the concept closest to operations is the stored procedure.
            • 3) Attribute types (such as Date or Minutes) are expressed as programming-language data types, not physical database data types.
            • 4) Default values can optionally be shown in the notation.
            • 5) Data is accessed through the class's public interface; encapsulation (data hiding) keeps effects local. Classes and their instances are maintained through exposed operations (methods).
          • Class operations can be:
            • 1) Public. Visible to all.
            • 2) Internally visible. Visible to subclasses.
            • 3) Private. Hidden.
    • 4) Fact-based modeling
      • Common notation:
        • 1) Object role modeling ORM2
        • 2) Fully communication-oriented information modeling FCO-IM
    • 5) Time-based modeling 
      • Common notation:
        • 1) Data Vault model (Data Vault)
          • Definition: a detail-oriented, time-based set of normalized and uniquely linked tables that supports one or more functional areas of the business.
          • Method: the Data Vault is a hybrid approach combining the strengths of third normal form (3NF) and the star schema.
          • Design concept: the Data Vault model is designed specifically to meet the needs of enterprise data warehouses.
          • Structure:
            • Hubs: represent business primary keys.
            • Links: define the transactional integration between hubs.
            • Satellites: hold the descriptive context for a hub's primary key.
        • 2) Anchor Modeling
          • Applicable scenarios: suited to information whose structure and content change over time. Provides a graphical language for conceptual modeling that can be extended to handle temporal data.
          • Structure:
            • Anchors: model entities and events.
            • Attributes: model the characteristics of anchors.
            • Ties: model the relationships between anchors.
            • Knots: model shared properties.
    • 6) Non-relational modeling
      • Common notation:
        • 1) Document
        • 2) Column
        • 3) Graph
        • 4) Key-Value
      • Four common non-relational database types:
        • 1) Document database: stores data as whole documents.
        • 2) Key-value database: stores data in just two columns, a key and a value.
        • 3) Column database: the closest to the relational model; supports more complex data types such as unformatted text and images, and stores each column in its own structure.
        • 4) Graph database: stores data as nodes and edges.
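
Below is the star-schema sketch referenced under dimensional modeling above, again in Python/sqlite3. The grain (one row per product per day), table names, and Type 2 SCD columns are assumptions for demonstration, not DMBOK's example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension: denormalized, surrogate-keyed, with Type 2 SCD columns (new row per change)
CREATE TABLE Dim_Product (
    Product_Key   INTEGER PRIMARY KEY,   -- surrogate key
    Product_Code  TEXT NOT NULL,         -- natural/business key
    Product_Name  TEXT NOT NULL,
    Category_Name TEXT NOT NULL,         -- category collapsed into the dimension (star, not snowflake)
    Valid_From    TEXT NOT NULL,
    Valid_To      TEXT                   -- NULL for the current row
);

-- Fact: grain = one row per product per day; rows carry the numeric measures
CREATE TABLE Fact_Sales (
    Date_Key     INTEGER NOT NULL,       -- would reference a Dim_Date date dimension (omitted here)
    Product_Key  INTEGER NOT NULL REFERENCES Dim_Product (Product_Key),
    Sales_Amount NUMERIC NOT NULL,
    Units_Sold   INTEGER NOT NULL,
    PRIMARY KEY (Date_Key, Product_Key)
);
""")
```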

Data model level :

  • The three-schema architecture of database management:
    • 1) Conceptual schema. A "real-world" view of the business, representing the business's current best model or way of doing business.
    • 2) External schema. The various user views of the data.
    • 3) Internal schema. A "machine view" of the data.
  • Model levels:
    • 1) Conceptual data model (CDM). Describes high-level data requirements as a collection of related subject areas. It includes only the basic, critical business entities of a given domain and function, together with descriptions of the entities and the relationships between them.
    • 2) Logical data model (LDM). A detailed description of data requirements, usually in the context of a specific usage (such as application requirements), and independent of any technology or implementation constraints. A relational logical data model extends the conceptual model by adding attributes, which are assigned to entities through normalization techniques so that each attribute depends strongly on the primary key of its entity. A dimensional logical data model is, in many cases, the fully attributed perspective of a dimensional conceptual data model. Relational logical models capture the rules of a business process; dimensional logical models capture the business questions used to judge the health and performance of a business process.
    • 3) Physical data model (PDM). Describes a detailed technical solution, usually based on a logical model, matched to a particular class of hardware, software, and network tools, and tied to a specific technology.
      • Specific techniques for physical data models:
        • 1) Canonical model. A variant of the physical model used to describe the movement of data between systems. It describes the structure of data passed between systems as packets or messages, kept generic to enable reuse and to simplify interface requirements.
        • 2) Views. Virtual tables that provide a way to look at data from one or more tables containing or referencing the actual attributes.
        • 3) Partitioning. The process of splitting a table, done to facilitate archiving and improve retrieval performance. Partitions can be vertical (splitting groups of columns) or horizontal (splitting groups of rows).
        • 4) Denormalization.
          • Definition: the deliberate transformation of a normalized logical data model into physical tables containing redundant data. Denormalization introduces a risk of data errors due to redundancy; in general it is used only to improve database query performance or to support user security.
          • Reasons:
            • ① Pre-combine data from multiple tables to avoid costly runtime joins.
            • ② Create smaller, pre-filtered copies of data to reduce expensive runtime calculations and/or scans of large tables.
            • ③ Pre-compute and store the results of expensive calculations to avoid runtime resource contention. In dimensional modeling, denormalization is referred to as collapsing or combining.

If every dimension is collapsed into a single structure, the resulting data model is called a star schema; if the dimensions are not collapsed (i.e., they remain normalized), the result is a snowflake schema.
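
A hedged sketch of the same Product dimension in both shapes (all names invented), showing what "collapsing" a dimension means physically:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Snowflake: the dimension stays normalized; Category lives in its own table
CREATE TABLE Dim_Category (
    Category_Key  INTEGER PRIMARY KEY,
    Category_Name TEXT NOT NULL
);
CREATE TABLE Dim_Product_Snowflake (
    Product_Key  INTEGER PRIMARY KEY,
    Product_Name TEXT NOT NULL,
    Category_Key INTEGER NOT NULL REFERENCES Dim_Category (Category_Key)
);

-- Star: the category hierarchy is collapsed (denormalized) into the dimension,
-- trading redundant Category_Name values for one fewer join at query time
CREATE TABLE Dim_Product_Star (
    Product_Key   INTEGER PRIMARY KEY,
    Product_Name  TEXT NOT NULL,
    Category_Name TEXT NOT NULL
);
""")
```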

Normalization:

  • Definition: the process of applying rules to transform complex business data into standardized data structures. The goal is for each attribute to appear in only one place, eliminating redundancy and the inconsistencies redundancy can cause. Normalization rules sort attributes according to primary and foreign keys, and are grouped into levels; each level is a distinct normal form, and each successive level includes the previous ones.
  • Levels of normal forms: (3NF is usually the target; BCNF, 4NF, and 5NF are rarely applied. A worked 3NF decomposition follows this list.)
    • First normal form (1NF): every entity has a valid primary key, and every attribute depends on the primary key (no repeating groups; attributes are atomic).
    • Second normal form (2NF): every entity has a minimal primary key, and every attribute depends on the complete primary key (no partial dependencies).
    • Third normal form (3NF): every entity has no hidden primary keys, and no attribute depends on any attribute outside the key (each attribute depends only on the complete primary key).
    • Boyce-Codd normal form (BCNF): resolves overlapping composite candidate keys. A candidate key is either the primary key or an alternate key.
    • Fourth normal form (4NF): resolves all ternary relationships into binary relationships until no further decomposition is possible.
    • Fifth normal form (5NF): resolves dependencies inside an entity into binary relationships; all join dependencies use parts of the primary key.
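
A small worked example of the idea, with an invented order table: the flat design violates 3NF through a transitive dependency, and the decomposition removes it:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Un-normalized: Customer_Name depends on Customer_Id, not on the Order_Id key,
-- so it repeats on every order row and can drift out of sync
CREATE TABLE Order_Flat (
    Order_Id      INTEGER PRIMARY KEY,
    Customer_Id   INTEGER NOT NULL,
    Customer_Name TEXT NOT NULL,       -- transitive dependency: Order -> Customer -> Name
    Order_Date    TEXT NOT NULL
);

-- 3NF: each attribute depends on the key, the whole key, and nothing but the key
CREATE TABLE Customer (
    Customer_Id   INTEGER PRIMARY KEY,
    Customer_Name TEXT NOT NULL        -- stored exactly once
);
CREATE TABLE Order_Header (
    Order_Id    INTEGER PRIMARY KEY,
    Customer_Id INTEGER NOT NULL REFERENCES Customer (Customer_Id),
    Order_Date  TEXT NOT NULL
);
""")
```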

Abstraction: removing details to broaden the applicability of a concept or topic to a wider class of situations while retaining its essential properties.

  • Generalization: grouping the common attributes and relationships of entities into a superclass (supertype) entity.
  • Specialization: separating the distinguishing attributes within an entity into subclass (subtype) entities, usually based on attribute values in entity instances. Subclasses can also be created using roles or classifications to separate entity instances into functional groups. A subclass inherits all the attributes of its superclass, which reduces redundancy.
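
A minimal sketch of generalization/specialization as physical tables; the Party/Person/Organization split is a common textbook illustration, not DMBOK's own example, and it corresponds to the "supertype partitioning" option described in section 2.2:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Superclass: attributes common to every party are generalized here
CREATE TABLE Party (
    Party_Id   INTEGER PRIMARY KEY,
    Party_Name TEXT NOT NULL
);

-- Subclasses: distinguishing attributes are specialized into their own tables,
-- each keyed by (and referencing) the superclass key
CREATE TABLE Person (
    Party_Id   INTEGER PRIMARY KEY REFERENCES Party (Party_Id),
    Birth_Date TEXT
);
CREATE TABLE Organization (
    Party_Id         INTEGER PRIMARY KEY REFERENCES Party (Party_Id),
    Tax_Registration TEXT
);
""")
```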

2. Activities

2.1 Planning data modeling

        Before starting the data model design work, you must first formulate a reasonable work plan.

Task:

  1. Assess organizational needs,
  2. Determine modeling standards,
  3. Clarify data model storage management

Deliverables:

  • 1. Diagrams . A data model contains one or more diagrams.
  • 2. Definitions . Definitions of entities, attributes, and relationships are critical to maintaining the precision of a data model.
  • 3. Issues and outstanding questions .
  • 4. Lineage : For physical models, understanding data lineage, i.e., where the data comes from, is very important. It is usually captured as a source-to-target mapping (see the sketch after this list).
    • Why lineage matters:
      • First, it helps the modeler understand the data requirements deeply and locate the source of each attribute precisely;
      • Second, it establishes each attribute's status in the source system, making it an effective tool for validating the accuracy of the model and of the mapping.
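
A hedged sketch of a source-to-target mapping captured as plain Python data; the systems, columns, and transformation rules here are invented for illustration:

```python
# Each entry documents where a target attribute comes from and how it is derived.
source_to_target = [
    {
        "target_table": "Dim_Customer",
        "target_column": "Customer_Name",
        "source_system": "CRM",
        "source_table": "CUST_MASTER",
        "source_column": "CUST_NM",
        "transformation": "TRIM and title-case",
    },
    {
        "target_table": "Fact_Sales",
        "target_column": "Sales_Amount",
        "source_system": "POS",
        "source_table": "SALES_TXN",
        "source_column": "TXN_AMT",
        "transformation": "SUM per product per day",
    },
]

for row in source_to_target:
    print(f'{row["source_system"]}.{row["source_table"]}.{row["source_column"]}'
          f' -> {row["target_table"]}.{row["target_column"]} ({row["transformation"]})')
```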

2.2 Establish data model

        Data modeling is an iterative process.

  • Forward Engineering : the process of building a new application starting from requirements, proceeding conceptual → logical → physical. First build a conceptual model to understand the scope of the requirements and the core terminology; then build a logical model to describe the business processes in detail; finally, implement the physical model through specific table-creation (DDL) statements.
    • Specific steps:
      • 1) Conceptual data modeling :
        • 1 Select the model type (scheme).
        • 2 Select the notation, such as Information Engineering (IE) or Object Role Modeling (ORM).
        • 3 Complete the initial conceptual model. The main purpose is to capture the user's viewpoint.
        • 4 Collect the organization's highest-level concepts (nouns).
        • 5 Collect the activities (verbs) related to these concepts. Relationships can be bidirectional or involve more than two concepts.
        • 6 Consolidate enterprise terminology.
        • 7 Obtain sign-off, ensuring the model has been reviewed for best practices and fulfilment of requirements.
      • 2) Logical data modeling :
        • 1 Analyze information requirements. Requirements analysis includes eliciting, organizing, documenting, reviewing, refining, approving, and change-controlling business requirements. Logical data modeling is an important means of expressing business data requirements.
        • 2 Analyze existing documentation. It helps even when the documentation is out of date. Existing data models should be taken into account.
        • 3 Add associative entities, which describe many-to-many relationships. An associative entity can have more than two parent entities. In dimensional modeling, associative entities usually become fact tables.
        • 4 Add attributes. Attributes in a logical model are atomic: each holds one and only one piece of data (fact) and cannot be subdivided.
        • 5 Assign domains, which ensure the consistency of formats and value sets among the model's attributes.
        • 6 Assign keys. Attributes assigned to entities are either key or non-key attributes; primary and alternate keys must also be identified.
      • 3) Physical data modeling (a DDL sketch follows the reverse-engineering notes below):
        • 1. Resolve logical abstractions.
          • Subtype absorption: the subtype entities' attributes are included as nullable columns in the table representing the supertype entity.
          • Supertype partitioning: the supertype entity's attributes are carried into separate tables created for each subtype.
        • 2. Add attribute details.
          • Add details to the physical model, such as the technical names of tables and columns, or files and fields.
          • Define the physical domain, physical data type, and length of each column or field.
          • Add constraints, especially NOT NULL constraints.
        • 3. Add reference data objects. Options include:
          • Create a separate, matching code table for each list of values.
          • Create a single, shared master code table.
          • Embed the rules or valid codes in the definition of the corresponding object.
        • 4. Assign surrogate keys: unique key values with no business meaning and no relationship to the data they match, invisible to end users. This is an optional step. If a surrogate key becomes the table's primary key, make sure an alternate key is defined on the original (natural) primary key.
        • 5. Denormalize for performance. Sometimes denormalizing or adding redundancy improves performance so much that it outweighs the cost of duplicated storage and processing. Dimensional models rely mainly on denormalization.
        • 6. Create indexes to improve query performance, using the most frequently referenced columns.
        • 7. Partition. Ideally, partition on the date key; if that is not possible, decide based on analysis of usage and workload.
        • 8. Create views. Views can control access to certain data elements, or embed common join conditions or filters to standardize objects or queries. Views themselves should be requirements-driven.
  • Reverse engineering: the process of documenting an existing database. Most data modeling tools support reverse engineering from a variety of databases.
    • Physical data modeling is usually the first step, to understand the technical design of the existing system;
    • Logical data modeling is the second step, to document how the existing system meets the business need;
    • Conceptual data modeling is the third step, to document the scope and key terminology of the existing system.
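
A hedged end-to-end sketch of several of the physical steps above (reference code table, NOT NULL constraints, surrogate plus alternate key, index, and view); all names are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Step 3: a separate reference code table
CREATE TABLE Order_Status_Code (
    Status_Code TEXT PRIMARY KEY,
    Status_Desc TEXT NOT NULL
);

-- Steps 2 and 4: technical names, NOT NULL constraints, surrogate primary key,
-- with an alternate key preserved on the original business key
CREATE TABLE Order_Header (
    Order_Key   INTEGER PRIMARY KEY,                -- surrogate key
    Order_Nbr   TEXT NOT NULL UNIQUE,               -- alternate (business) key
    Status_Code TEXT NOT NULL REFERENCES Order_Status_Code (Status_Code),
    Order_Date  TEXT NOT NULL
);

-- Step 6: index the most frequently referenced column
CREATE INDEX IX_Order_Header_Date ON Order_Header (Order_Date);

-- Step 8: a view embedding a common join and filter
CREATE VIEW V_Open_Orders AS
SELECT o.Order_Nbr, o.Order_Date, s.Status_Desc
FROM Order_Header o
JOIN Order_Status_Code s ON s.Status_Code = o.Status_Code
WHERE o.Status_Code = 'OPEN';
""")
```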

2.3 Review data model

        Techniques such as time-to-value, support costs, and data model quality validators (e.g., the Data Model Scorecard) are used to evaluate the model's correctness, completeness, and consistency.

2.4 Maintain data model

        Models need to stay up to date. A good practice is to reverse-engineer the latest physical data model and verify that it remains consistent with the logical data model.

3. Tools

The list of tools is as follows:

  1. Data modeling tool : software that automatically implements data modeling functions.
  2. Data lineage tools : tools that capture and maintain the source structures for each attribute in a data model, enabling change-impact analysis.
  3. Data profiling tools : help explore data content and validate it against existing metadata, and identify data quality issues as well as deficiencies in existing artifacts such as logical and physical models, DDL, and model descriptions.
  4. Metadata repository : a software tool used to store descriptive information about data models, including diagrams and accompanying text (such as definitions), as well as metadata imported from other tools and processes (software development tools, BPM tools, system catalogs, etc.). The repository itself should enable metadata integration and exchange; sharing metadata is more important than merely storing it.
  5. Data model patterns : reusable model structures, available as elementary, assembly, and integration patterns. Elementary patterns are the "nuts and bolts" of data modeling, including ways to resolve many-to-many relationships and to construct self-referencing hierarchies. Assembly patterns are building blocks shared by business people and data modelers; business people can understand them (assets, documents, people and organizations, and so on), and they are published as subject-area model suites that give modelers reliable, robust, extensible, and implementable designs. Integration patterns provide a framework for combining assembly patterns in common ways.
  6. Industry data model : Pre-built data model for the entire industry.

4. Method

  • 1. Best practices for naming conventions : publish data model and database naming standards for each type of modeling object and database object. Naming standards are especially important for entities, tables, attributes, keys, views, and indexes.
    • Names should be unique and as descriptive as possible.
    • Logical names should be meaningful to business users, using full words wherever possible and avoiding obscure abbreviations.
    • Physical names must conform to the length allowed by the DBMS, using abbreviations where necessary.
    • Logical names usually do not use delimiters; physical names can use underscores as word separators.
    • Naming standards should minimize name changes across environments.
    • Names should not reflect a specific environment, such as test, QA, or production.
    • Class words, i.e., the last term in an attribute name (such as Quantity, Name, or Code), can be used to distinguish attributes from entities and column names from table names. (A naming sketch follows the PRISM list below.)
  • 2. Best practices in database design: the PRISM design principles :
    • 1 Performance and ease of use.
    • 2 Reusability. Database structures should be reusable by multiple applications and usable for multiple purposes.
    • 3 Integrity. Data should always have valid business meaning and value, and always reflect a valid state of the business.
    • 4 Security. True and accurate data should always be available to authorized users, and only to authorized users.
    • 5 Maintainability. Maintenance costs should not exceed the data's value to the organization; respond as quickly as possible to changes in business processes and new business requirements.
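
A hedged sketch of applying such a naming standard in Python, mapping logical names to physical names through an abbreviation list; the abbreviations, class words, and length limit shown are invented examples, not a published standard:

```python
# Hypothetical abbreviation list from a naming standard document
ABBREVIATIONS = {"customer": "CUST", "number": "NBR", "name": "NM", "amount": "AMT"}

def physical_name(logical_name: str, max_len: int = 30) -> str:
    """Turn a logical name ('Customer Last Name') into a physical one ('CUST_LAST_NM')."""
    words = logical_name.lower().split()
    parts = [ABBREVIATIONS.get(w, w.upper()) for w in words]
    return "_".join(parts)[:max_len]   # respect the DBMS identifier length limit

print(physical_name("Customer Last Name"))   # -> CUST_LAST_NM
print(physical_name("Order Total Amount"))   # -> ORDER_TOTAL_AMT
```

Note how the class word (Name → NM, Amount → AMT) ends each attribute name, matching the class-word guideline above.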

5. Data modeling and design governance

5.1 Data modeling and design quality management

Data models and database designs should strike a reasonable balance between the short-term and long-term needs of the organization. The work includes:

  • 1. Develop data modeling and design standards.
    • 1) List and description of standard data modeling and database design deliverables.
    • 2) A list of standard names, acceptable abbreviations, and abbreviation rules for uncommon words that apply to all data model objects.
    • 3) List of standard naming formats for all data model objects, including attributes and classifiers.
    • 4) A list and description of standard methods for creating and maintaining these deliverables.
    • 5) List and description of data modeling and database design roles and responsibilities.
    • 6) A list and description of all metadata attributes captured in data modeling and database design, including business metadata and technical metadata. For example, guidelines can set expectations that the data model captures data lineage for each attribute.
    • 7) Metadata quality expectations and requirements.
    • 8) Guidelines on how to use data modeling tools.
    • 9) Guidelines for preparing and leading design reviews.
    • 10) Data model version control guide.
    • 11) List of things that are prohibited or to be avoided
  • 2. Review the quality of data model and database design.
  • 3. Manage data model versions and integration.
    • Changes need to be recorded in a timeline; for each change, record: why, what, how, when, who, and where (see the example record below).
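
A hedged sketch of one such model change record as plain Python data; all field values are invented:

```python
# One version-control entry answering why / what / how / when / who / where
model_change = {
    "why":   "New regulatory reporting requirement",
    "what":  "Added column Tax_Registration to table Organization",
    "how":   "ALTER TABLE issued via the standard release pipeline",
    "when":  "2024-05-14",
    "who":   "data.modeler@example.com",
    "where": "Enterprise logical model v3.2 and warehouse physical model",
}

for question, answer in model_change.items():
    print(f"{question:5s}: {answer}")
```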

5.2 Metrics

Using a Data Model Scorecard provides an overall assessment of model quality and clearly identifies where the model can be improved.
