National Computer Rank Examination Level 3 Database Technology (13)

Chapter 13_Large-Scale Database Architecture

Exam analysis

◆Exam questions on this chapter are generally multiple-choice.
◆Commonly tested knowledge points include:
1. Master the goals of distributed databases and data distribution strategies
2. Master the reference schema structure and distribution transparency levels of distributed databases
3. Master parallel database system structures and parallel algorithms
4. Be familiar with cloud database system structure and XML databases

13.1 Distributed Database

1. Overview of Distributed Database System
1. Distributed database system: A database system that is physically dispersed and logically centralized. The data in the system is distributed across computers in different physical locations (sites), and these sites are connected by a communication network. Each site can process data independently and can also work together with other sites.
2. Distributed database: The logical collection of the databases at all sites in a distributed database system.

2. Distributed database goals and data distribution strategy
1. Distributed database goals
In 1987, C. J. Date proposed 12 goals that a distributed database should achieve, as shown in the figure below:
Distributed database goals
2. Data distribution strategy
Data Distribution Strategy
(1) Data fragmentation
Description: Fragmenting a relation means dividing it into multiple fragments that together contain enough information to reconstruct the relation. There are 4 basic methods:
01. Horizontal fragmentation: The relation is divided by rows (tuples) into different fragments according to certain conditions. Each row in the relation must belong to at least one fragment, so that the relation can be reconstructed when needed.
02. Vertical fragmentation: The relation is divided by columns (attributes) into different fragments according to certain conditions. Each fragment should contain the primary key attributes of the relation, so that the relation can be restored by a join.
03. Derived fragmentation: Also called derived horizontal fragmentation. The fragmentation is based not on conditions over the relation's own attributes, but on conditions over the attributes of another relation.
04. Hybrid fragmentation: A mixture of the above three methods.
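The fragmentation methods above can be illustrated with a minimal sketch. The `employees` and `orders` relations, their column names, and all helper functions are hypothetical and exist only for illustration. The sketch splits a relation by rows and by columns, rebuilds it each time, and then fragments a second relation by a condition on the first.

```python
# Toy relation: each row is a dict; "id" plays the role of the primary key.
employees = [
    {"id": 1, "name": "Ann", "city": "Beijing", "salary": 5000},
    {"id": 2, "name": "Bob", "city": "Shanghai", "salary": 6000},
    {"id": 3, "name": "Eve", "city": "Beijing", "salary": 7000},
]
orders = [{"order_id": 10, "emp_id": 1}, {"order_id": 11, "emp_id": 2}]


def horizontal_fragment(rows, predicate):
    """Keep whole rows that satisfy the predicate (a selection)."""
    return [row for row in rows if predicate(row)]


def vertical_fragment(rows, key, columns):
    """Keep the primary key plus the chosen columns (a projection)."""
    return [{c: row[c] for c in (key, *columns)} for row in rows]


def rebuild_from_horizontal(fragments, key):
    """Union of the fragments, deduplicated by primary key."""
    merged = {row[key]: row for frag in fragments for row in frag}
    return [merged[k] for k in sorted(merged)]


def rebuild_from_vertical(left, right, key):
    """Join two column-wise fragments on the primary key."""
    right_by_key = {row[key]: row for row in right}
    return [{**row, **right_by_key[row[key]]} for row in left]


# Row-wise split by city, column-wise split into public and payroll data.
beijing = horizontal_fragment(employees, lambda r: r["city"] == "Beijing")
others = horizontal_fragment(employees, lambda r: r["city"] != "Beijing")
public = vertical_fragment(employees, "id", ["name", "city"])
payroll = vertical_fragment(employees, "id", ["salary"])

# Fragment `orders` by a condition on `employees`, not on its own columns.
beijing_ids = {r["id"] for r in beijing}
beijing_orders = [o for o in orders if o["emp_id"] in beijing_ids]
```

Keeping the primary key in every column-wise fragment is what makes the join in `rebuild_from_vertical` possible.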
(2) Data distribution
Description: Data distribution (allocation) is a defining characteristic of a distributed database. There are several allocation strategies:
01. Centralized: All data fragments are placed at one site. This strategy is easy to control, but the data is too concentrated: the load on that site is heavy, it easily becomes a bottleneck, and reliability is poor.
02. Partitioned: There is one and only one copy of all global data. It is divided into several fragments, and each fragment is assigned to a specific site. This strategy gives flexible control over local data, but access to global data is inefficient.
03. Full replication: There are multiple copies of the global data, and each site holds a complete copy. This strategy offers high reliability and fast response, but data redundancy is large and keeping the copies synchronized is complicated.
04. Hybrid: The global data is divided into several subsets, and each subset is placed at one or more sites, but a site does not necessarily store all of the data. This strategy lies between partitioned and full replication. It is more flexible and can be tuned to each situation to play to the strengths of both and achieve higher efficiency.
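As a rough sketch of how the four strategies differ, the hypothetical `allocate` function below returns which fragments end up at which site. The strategy names, the round-robin placement, and the `copies` parameter are assumptions made for illustration, not part of any particular system.

```python
def allocate(fragments, sites, strategy, copies=2):
    """Return {site: [fragments stored there]} for one allocation strategy."""
    placement = {site: [] for site in sites}
    if strategy == "centralized":
        # Everything lives at the first site.
        placement[sites[0]] = list(fragments)
    elif strategy == "partitioned":
        # Exactly one copy of each fragment, spread round-robin.
        for i, frag in enumerate(fragments):
            placement[sites[i % len(sites)]].append(frag)
    elif strategy == "full_replication":
        # Every site holds a complete copy.
        for site in sites:
            placement[site] = list(fragments)
    elif strategy == "hybrid":
        # Each fragment goes to `copies` consecutive sites
        # (assumes copies < len(sites), so no site needs to hold everything).
        for i, frag in enumerate(fragments):
            for j in range(copies):
                placement[sites[(i + j) % len(sites)]].append(frag)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return placement


sites = ["S1", "S2", "S3"]
fragments = ["F1", "F2", "F3"]
hybrid = allocate(fragments, sites, "hybrid")
```

With three sites and two copies per fragment, each site in `hybrid` stores two of the three fragments, which sits between the partitioned and fully replicated layouts.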

3. Architecture of Distributed Database System
1. The reference schema structure of a distributed database is shown below:
Schematic diagram of the reference schema structure of a distributed database
The schemas of a distributed database system are as follows:
01. Global external schema: the user view of a global application, that is, the tables, views, etc. seen by end users, which appear logically undistributed.
02. Global conceptual schema: describes the logical structure and characteristics of all the data.
03. Fragmentation schema: describes each data fragment and the mapping from global relations to fragments; it is the logical partitioning view of the global data in a distributed database system.
04. Allocation schema: describes the mapping from each fragment to its physical storage site.
05. Local conceptual schema: describes the logical structure and characteristics of the physical fragments of global relations stored at a site.
06. Local internal schema: describes how the data covered by the local conceptual schema is physically stored at the local site.

2. Distribution transparency
There are several levels of distribution transparency:
01. Location transparency: the sites at which data fragments are allocated are transparent to users. When writing programs, users only need to consider how the data is fragmented, not how the fragments are allocated to sites.
02. Fragmentation transparency: the highest level of transparency, located between the global conceptual schema and the fragmentation schema. Data fragmentation is completely transparent: users do not need to consider it, and when writing programs they only need to operate on global relations.
03. Local data model transparency: located between the allocation schema and the local conceptual schema. When writing programs, the user needs to know how the global data is fragmented, how each fragment is replicated, and where the fragments and their replicas are allocated.
3. Distributed database management system
The structure of the distributed database management system is as follows:
Structure of Distributed Database Management System
4. Related technologies of distributed database
1. Distributed query
In a centralized database system, query cost is measured mainly by CPU cost and I/O cost. In a distributed database system, the data is distributed across many sites, so the communication cost of transferring data between sites must also be considered in query processing. Distributed query optimization mainly considers the following strategies:
Distributed query optimization strategy
2. Distributed transaction management
Distributed transaction management mainly includes recovery control and concurrency control. Completing a global transaction in a distributed database system requires the participation of multiple sites. To maintain the atomicity of the transaction, all sites participating in its execution must either all commit or all roll back.
01. Recovery control
A. The most typical recovery control strategy for distributed database systems is based on the two-phase commit protocol, which divides the sites' transaction managers into a coordinator and participants.
B. In the first phase, the coordinator asks all participants whether the transaction can be committed, and the participants respond; in the second phase, the coordinator decides whether to commit the transaction according to the participants' answers.
C. Both the coordinator and the participants maintain a log in stable storage. When a failure occurs, the log can be used for recovery.
D. Disadvantage: the two-phase commit protocol may block when the coordinator fails; the three-phase commit protocol can avoid the blocking problem.
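The two phases can be sketched in a few lines. This is a single-process simulation under simplifying assumptions: there is no network, no timeout, and no failure handling, and the `Participant` class, the vote strings, and the in-memory lists that stand in for the logs are all hypothetical.

```python
class Participant:
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.log = []            # stands in for the log kept in stable storage
        self.state = "active"

    def prepare(self):
        """Phase 1: vote on whether the local sub-transaction can commit."""
        vote = "ready" if self.can_commit else "abort"
        self.log.append(vote)
        return vote

    def finish(self, decision):
        """Phase 2: apply the coordinator's global decision."""
        self.log.append(decision)
        self.state = decision


def two_phase_commit(participants, coordinator_log):
    # Phase 1: the coordinator asks every participant for its vote.
    coordinator_log.append("prepare")
    votes = [p.prepare() for p in participants]
    # Phase 2: commit only if all voted ready; otherwise abort everywhere.
    decision = "commit" if all(v == "ready" for v in votes) else "abort"
    coordinator_log.append(decision)
    for p in participants:
        p.finish(decision)
    return decision


ok_sites = [Participant("A"), Participant("B")]
ok_log = []
ok_result = two_phase_commit(ok_sites, ok_log)

bad_sites = [Participant("A"), Participant("B", can_commit=False)]
bad_result = two_phase_commit(bad_sites, [])
```

A single "abort" vote is enough to roll back the transaction at every site, which is how the all-or-nothing property is preserved. The blocking problem arises in the gap between the two phases: a participant that has voted "ready" cannot decide on its own if the coordinator fails before announcing the decision.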
02. Concurrency control
Concurrency control in most distributed systems is mainly based on locking protocols. All the locking protocols of centralized database systems can be used in distributed systems; what needs to change is the way the lock manager handles replicated data.

13.2 Parallel Databases

1. Overview of parallel databases
1. The key issues in the development of database systems are improving system throughput and shortening transaction response time; the growth of database applications places ever higher demands on database performance and availability.
2. Driven by software and hardware requirements, parallel databases were introduced, using multiprocessor parallel processing to increase speed. Inexpensive parallel computers built from multiple processors can offer better performance than traditional large-scale computers.
2. Parallel database system structure
1. Parallel database has a variety of architectures, which can be mainly divided into the following four types:
(1) Shared memory structure
(2) Shared disk structure
(3) Shared-nothing structure
(4) Hierarchical structure
2. Shared Memory structure
Description: All processors share a common main memory through an interconnection network, as shown in the figure (P means processor, M means memory, D means disk):
shared memory structure

This parallel architecture differs from a single-machine system only in that multiple processors replace the single processor. The processors execute transactions in parallel, exchange messages and data through the shared memory, and access one or more disks.
Advantages: Simple to implement and the most economical solution.
Disadvantages: If there are too many processors, memory access conflicts arise easily. The number of processors is therefore limited, typically to no more than 32 or 64, which restricts how far parallelism can be scaled.
3. Shared disk structure
Description: Each processor has its own independent main memory, and all processors share the disks through an interconnection network, as shown in the figure (P means processor, M means memory, and D means disk).
shared disk
Advantages: Compared with the shared memory structure, each processor has independent memory, so memory access no longer causes conflicts. To a certain extent this also overcomes the problem of the whole system crashing when memory fails, which improves system availability.
Disadvantages: Processors exchange information and data through the interconnection network, which incurs communication costs.
4. Shared-nothing structure
Description: Each processor has independent main memory and disk, and does not share any resources.
shared-nothing structure
Advantages: This structure is considered the best parallel structure for parallel database systems. It reduces the probability of resource contention by minimizing shared resources and is extremely scalable. It is well suited to OLTP applications such as bank teller systems and airline ticket sales.
Disadvantages: Communication costs and the cost of accessing non-local disks are high.
5. Hierarchical structure
Description: This structure combines the characteristics of the shared memory, shared disk, and shared-nothing structures. Globally it is divided into two layers: the top layer is a shared-nothing structure composed of several nodes, and the bottom layer is a shared memory or shared disk structure. As shown in the figure below:
Hierarchy
Advantages: This structure is very flexible, and can be configured into systems with different structures according to different needs of users.
Disadvantages: The advantages and disadvantages of the above three structures are integrated.
3. Data partitioning and parallel computing
1. Reasonable data partitioning can minimize query processing time and maximize parallel processing performance.
Data partitioning methods
(1) One-dimensional data partitioning
One-dimensional data partitioning_1
One-dimensional data partitioning_2
(2) Multi-dimensional data partitioning
Multi-dimensional data partitioning methods:
A. MAGIC multi-dimensional partitioning method
B. CMD multi-dimensional partitioning method
C. BERD multi-dimensional partitioning method
BERD partitioning method: the attributes of relation R are divided into a primary partitioning attribute and auxiliary partitioning attributes. There is only one primary partitioning attribute, denoted A. R is first range-partitioned on A; then, for each auxiliary partitioning attribute B, an auxiliary relation RB with three attributes is constructed. The tuples in RB correspond to the tuples in R; among its attributes, TID is the tuple identifier of the record and ProdD is the node where the tuple is actually stored. Each RB is then partitioned across the nodes according to its attribute B.
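The figures for one-dimensional partitioning are not reproduced in this copy. As an illustration only, the sketch below shows three one-dimensional methods that are commonly taught (round-robin, hash, and range); that these are the methods in the missing figures is an assumption, and the function names are hypothetical.

```python
from bisect import bisect_right


def round_robin_partition(rows, n):
    """Send the i-th row to partition i mod n."""
    parts = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        parts[i % n].append(row)
    return parts


def hash_partition(rows, n, key):
    """Send each row to the partition chosen by hashing its partitioning attribute."""
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[hash(key(row)) % n].append(row)
    return parts


def range_partition(rows, boundaries, key):
    """Boundaries [b0, b1, ...] define partitions (< b0), [b0, b1), ..., (>= last)."""
    parts = [[] for _ in range(len(boundaries) + 1)]
    for row in rows:
        parts[bisect_right(boundaries, key(row))].append(row)
    return parts


values = [5, 12, 7, 30, 18, 1]
by_turn = round_robin_partition(values, 3)
by_hash = hash_partition(values, 3, key=lambda v: v)
by_range = range_partition(values, [10, 20], key=lambda v: v)
```

Round-robin balances partition sizes but gives no help to lookups, hash partitioning supports exact-match queries on the partitioning attribute, and range partitioning additionally supports range queries and sorting.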
2. Parallel algorithm
(1) Sorting and joining are expensive operations in the database system. The following is an introduction to the parallel algorithms of these two operations:
Introduction to Parallel Algorithms

(2) Parallel sorting
Description: If the relation is distributed across the disks by range partitioning and the sort attribute is the partitioning attribute, then sorting each partition and concatenating the partitions in range order yields the sorted relation. If the relation is partitioned in some other way, the following methods can be used:
Parallel sorting methods:
01. Range-partition sort: repartition the relation by range on the sort attribute, sort each partition separately, and then concatenate the results directly.
02. Parallel external sort-merge: each processor first sorts its local data, and the system then merges the sorted runs from all processors to obtain the final sorted relation.
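Both sorting methods can be sketched with the standard library. The thread pool below only imitates "one task per processor" to show the structure; in CPython it does not give a real speedup for this workload, and the function names are hypothetical.

```python
from bisect import bisect_right
from concurrent.futures import ThreadPoolExecutor
from heapq import merge


def range_partition_sort(values, boundaries):
    """Method 01: repartition by range, sort each partition, concatenate."""
    parts = [[] for _ in range(len(boundaries) + 1)]
    for v in values:
        parts[bisect_right(boundaries, v)].append(v)
    with ThreadPoolExecutor() as pool:       # one sorting task per partition
        sorted_parts = list(pool.map(sorted, parts))
    # Partitions cover disjoint, increasing ranges, so concatenation is enough.
    return [v for part in sorted_parts for v in part]


def parallel_external_sort_merge(local_data):
    """Method 02: each processor sorts its own data, then the runs are merged."""
    with ThreadPoolExecutor() as pool:
        runs = list(pool.map(sorted, local_data))
    return list(merge(*runs))                # k-way merge of the sorted runs


sorted_by_range = range_partition_sort([5, 12, 7, 30, 18, 1], [10, 20])
sorted_by_merge = parallel_external_sort_merge([[5, 30], [12, 18], [7, 1]])
```

The first method pays for redistributing the data but needs no final merge; the second leaves the data where it is and pays for the merge instead.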
(3) Parallel join
Description: Parallel joins are divided into partitioned joins and fragment-and-replicate joins:
Partitioned join: For equi-joins and natural joins, the two input relations can be partitioned across multiple processors on the join attributes, and a local join is then performed on each processor.
Fragment-and-replicate join: Partitioned join does not apply to general θ-joins. Fragment-and-replicate join handles the case in which tuples can match even though their join attribute values differ.
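The two join strategies above can be sketched as follows. The processors are simulated by a loop over partitions, the relations are lists of dicts, and the function names are hypothetical. The second function shows only the simple asymmetric variant, in which one relation is fragmented and the other is copied to every processor.

```python
def partitioned_equi_join(r, s, r_key, s_key, n):
    """Hash both relations on the join attribute, then join partition by partition."""
    r_parts = [[] for _ in range(n)]
    s_parts = [[] for _ in range(n)]
    for t in r:
        r_parts[hash(t[r_key]) % n].append(t)
    for t in s:
        s_parts[hash(t[s_key]) % n].append(t)
    result = []
    # Equal keys hash to the same partition, so each pair can be joined locally.
    for rp, sp in zip(r_parts, s_parts):
        result += [(a, b) for a in rp for b in sp if a[r_key] == b[s_key]]
    return result


def fragment_and_replicate_join(r, s, n, condition):
    """Fragment r across n processors and replicate all of s to each of them."""
    r_parts = [r[i::n] for i in range(n)]
    result = []
    for rp in r_parts:                       # each iteration is one processor
        result += [(a, b) for a in rp for b in s if condition(a, b)]
    return result


r = [{"k": 1, "a": "x"}, {"k": 2, "a": "y"}, {"k": 3, "a": "z"}]
s = [{"k": 2, "b": "p"}, {"k": 3, "b": "q"}, {"k": 3, "b": "r"}]

equi = partitioned_equi_join(r, s, "k", "k", n=2)
theta = fragment_and_replicate_join(r, s, n=2, condition=lambda a, b: a["k"] < b["k"])
```

Hash partitioning works for the equi-join because matching tuples always land in the same partition. For a condition such as `a.k < b.k` that guarantee disappears, which is why every fragment of `r` has to see all of `s`.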
(4) Other relational operations
01. Selection: the selection can be executed in parallel on all processors.
02. Duplicate elimination: it can be embedded in the sorting process, so the parallel sorting algorithms can be used to eliminate duplicates in parallel.
03. Projection: projection without duplicate elimination can be performed in parallel as tuples are read from disk; if duplicate rows must be eliminated, the duplicate elimination method above can be used, performing the projection at the same time.
04. Aggregation: aggregate functions can be computed in parallel by "dividing first and then combining". For the aggregate functions SUM, MIN, and MAX, each node first computes a partial result in parallel, and the same aggregate function is then applied once more to the partial results.
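The "divide first, then combine" idea can be sketched as below, with each inner list standing in for the data held by one node. The `parallel_aggregate` helper is hypothetical. The COUNT and AVG lines go slightly beyond the text above: COUNT is combined by summing the partial counts rather than by reapplying COUNT, and AVG is derived from the global SUM and COUNT.

```python
def parallel_aggregate(partitions, local_fn, combine_fn=None):
    """Apply local_fn on each node, then combine the partial results."""
    partials = [local_fn(part) for part in partitions if part]
    # By default the same aggregate function is applied to the partial results.
    return (combine_fn or local_fn)(partials)


partitions = [[5, 30], [12, 18], [7, 1]]

total = parallel_aggregate(partitions, sum)
smallest = parallel_aggregate(partitions, min)
largest = parallel_aggregate(partitions, max)

# COUNT: partial counts are combined with SUM, not with COUNT again.
count = parallel_aggregate(partitions, len, combine_fn=sum)
# AVG: computed from the global SUM and COUNT, not by averaging partial averages.
average = total / count
```

Only the small partial results travel between nodes, so the communication cost is independent of the size of each partition.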

13.3 Cloud Computing Database Architecture

1. Overview of cloud computing
1. Description: Cloud computing is a development of distributed computing, parallel computing, and grid computing, or the commercial realization of these computer science concepts.
2. Concept: Cloud computing is a commercial computing model. By pooling all computing resources and using hardware virtualization, it provides users with powerful computing, storage, and bandwidth resources, and it distributes computing tasks over a resource pool made up of a large number of computers, so that application systems can obtain computing power, storage space, and information services on demand, with computing power equal to or greater than that of traditional large servers. Cloud computing includes the following hardware and software services:
(1) Software as a Service (SaaS): Application services delivered over the Internet are known as Software as a Service (SaaS). It is a software distribution model.
(2) Platform as a Service (PaaS): The provision of operating systems and related services over the network, without downloading or installing them.
(3) Infrastructure as a Service (IaaS): The provision, as an external service, of the equipment used to support operations, including storage, hardware, servers, and network components.
3. The software and hardware facilities of the cloud computing provider's data center are the so-called cloud, as shown in the figure below:
(1) Cloud computing can provide seemingly unlimited computing resources to application systems, so end users no longer need to plan or budget for computing capacity in advance.
(2) SaaS service providers can add hardware resources gradually as needed, with no upfront commitment.
(3) Cloud computing lets users use resources on a short-term basis and easily release them when they are no longer needed.
cloud




2. Introduction to cloud computing architecture
1. In the cloud environment, the main object of computing is still data. The combination of "cloud + database" produces a cloud database. Google's cloud database architecture is shown in the figure:
Google's Cloud Services Framework

  1. BigTable data model
    (1) Description: BigTable is Google's cloud database, a distributed structured data storage system for processing, storing, and querying massive amounts of data.
    (2) A BigTable table is indexed by row key (Row Key), column key (Column Key), and timestamp (Timestamp).
    (3) Each cell (Cell) is located jointly by its row key, column key, and timestamp.
    (4) In BigTable, not only can the number of rows be increased or decreased at will, but the number of columns can also be expanded under certain constraints. A timestamp is stored with each cell, so a cell can hold different versions of data from different times.
    (5) Features of the data model
    01. The row key of a table can be any string.
    BigTable organizes data in lexicographic order of row keys. The rows of a table can be dynamically partitioned; each partition is called a "tablet", and the tablet is the smallest unit of data distribution and load balancing.
    02. A column family is a set of column keys and is the basic unit of access control.
    Data stored under the same column family is usually of the same type. A table should not have too many column families. Access control and disk and memory usage accounting are performed at the column family level.
    03. Timestamps distinguish the different versions of the data held in a data item.
    Different versions of data are indexed by timestamp. BigTable can assign the timestamp itself to represent the actual time of the data, accurate to milliseconds; the user program can also assign the timestamp to represent version information.
    Within a data item, the system sorts the versions in descending timestamp order, so that the latest data comes first.
    (6) Architecture of BigTable
    A Bigtable cluster stores many tables, each table contains a collection of Tablets, and each Tablet contains all relevant data of rows in a certain range. In the initial state, a table has only one tablet. As the data in the table grows, it is automatically split into multiple Tablets. The BigTable structure diagram is shown in the figure:
    Bigtable structure diagram
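The data model described above, in which a cell is located by row key, column key, and timestamp, can be imitated with nested dictionaries. This is an in-memory toy, not Google's implementation or API: the `TinyTable` class, its method names, and the sample row and column keys are all hypothetical.

```python
from collections import defaultdict


class TinyTable:
    """Toy map: (row key, "family:qualifier", timestamp) -> value."""

    def __init__(self):
        # row key -> column key -> {timestamp: value}
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, value, timestamp):
        self.rows[row][column][timestamp] = value

    def get(self, row, column, timestamp=None):
        """Newest version, or the newest version not later than `timestamp`."""
        cell = self.rows.get(row, {}).get(column, {})
        candidates = [t for t in cell if timestamp is None or t <= timestamp]
        return cell[max(candidates)] if candidates else None

    def versions(self, row, column):
        """All versions of one cell, newest first."""
        cell = self.rows.get(row, {}).get(column, {})
        return sorted(cell.items(), reverse=True)

    def scan(self, start, end):
        """Row keys in lexicographic order within [start, end), like one tablet's range."""
        return [k for k in sorted(self.rows) if start <= k < end]


table = TinyTable()
table.put("com.cnn.www", "contents:", "v1", timestamp=1)
table.put("com.cnn.www", "contents:", "v2", timestamp=2)
table.put("com.example", "anchor:a", "x", timestamp=1)

latest = table.get("com.cnn.www", "contents:")
older = table.get("com.cnn.www", "contents:", timestamp=1)
```

Because rows are kept in lexicographic key order, a contiguous key range such as the one passed to `scan` corresponds to the rows a single tablet would hold.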

3. Comparison between cloud database and traditional database

1. Cloud databases differ considerably from existing RDBMSs. Although both use a relational data model, a cloud database is usually a series of two-dimensional tables, and its operations are based on a simplified SQL-like language or on object access.
2. Using a cloud database saves us from purchasing and hosting servers and from installing and maintaining the database ourselves. We do not need to care about server location or similar details; we simply access the information we need. However, cloud databases also have the following disadvantages:
● Data security issues
● Reliance on the Internet
● Cloud management issues

13.4 XML databases

1. Overview of XML language
XML (Extensible Markup Language): Generally used to mark up electronic documents to give them structure. It can be used to mark up data and define data types, and it is a source language that allows users to define their own markup languages. Its syntax is similar to HTML, with tags used to describe data. Its characteristics are as follows:
01. XML is a subset of Standard Generalized Markup Language (SGML) and is well suited to Web transmission.
02. XML provides a unified way to describe and exchange structured data independently of applications or vendors.
03. An XML database is a database management system that supports operations such as storing and querying documents in XML format.
2. Overview of XML databases
1. An XML database is a data management system that supports storing and querying documents in XML format. It is a collection of XML documents and their components, and a system maintains and manages this collection of documents and the information it represents.
2. An XML database is not only a repository for structured and semi-structured data; persistent XML data also has many of the characteristics of managed data, as follows:
independence, integration, access rights, views, completeness, redundancy, consistency, data recovery
3. Three types of XML database
Three Types of XML Database

4. Compared with traditional databases, XML databases have the following advantages:
01. XML databases can effectively access and manage semi-structured data.
02. Provide operations on tags and paths.
03. When the data itself is hierarchical, the XML data format can clearly express the hierarchy of the data, and an XML database makes it convenient to operate on hierarchical data.

3. SQL Server 2008 and XML
1. XML statement in SQL Server
1. In a relational database, the query result returned by a SELECT statement is a standard rowset (a tabular data set composed of fields and records). To return the query result as XML, add the FOR XML clause to the SELECT statement. In addition, SQL Server 2008 provides the OPENXML function for processing XML data streams.
2. XML statement syntax format in SQL Server

3. Examples of XML statements in SQL Server

4. XML data type in SQL Server

5.XML index type

6. Operate XML
SQL Server 2008 supports using the XQuery language to query the XML data type. The commonly used methods include:
01. The query() method, used to query XML nodes in an XML instance.
02. The value() method, used to obtain the values of nodes and elements in an XML instance.
03. The exist() method, used to determine whether a query returns a non-empty result.
04. The modify() method, used to insert, modify, and delete nodes in an XML instance.
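To get a feel for what the four kinds of operation above do, the sketch below performs analogous steps on a small XML document with Python's standard `xml.etree.ElementTree` module. It is only an analogy: it does not call SQL Server, it uses ElementTree's limited XPath subset rather than XQuery, and the `orders` document is made up for illustration.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<orders>"
    "<order id='1'><item>pen</item><qty>3</qty></order>"
    "<order id='2'><item>ink</item><qty>1</qty></order>"
    "</orders>"
)

# Analogous to querying nodes: return the nodes that match a path expression.
items = [ET.tostring(e, encoding="unicode") for e in doc.findall("./order/item")]

# Analogous to reading a value: extract one scalar value from a node.
first_qty = int(doc.findtext("./order[@id='1']/qty"))

# Analogous to an existence test: does the path expression match anything?
has_order_3 = doc.find("./order[@id='3']") is not None

# Analogous to modifying: insert, update, and delete nodes in place.
new_order = ET.SubElement(doc, "order", {"id": "3"})
ET.SubElement(new_order, "item").text = "cap"
doc.find("./order[@id='2']/qty").text = "5"
doc.remove(doc.find("./order[@id='1']"))

remaining_ids = [order.get("id") for order in doc.findall("order")]
```

The first three steps only read the document, while the last one changes it, which mirrors the split between the read-oriented methods and the modifying method in the list above.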


Origin blog.csdn.net/weixin_47288291/article/details/123586166