Oracle RAC Principles in Detail

weixin_34174322 

Real Application Clusters (RAC)

1. What is a cluster

A cluster consists of two or more independent servers connected to one another over a network.

A cluster serves two main purposes: to improve availability, by failing over automatically to a standby node when the currently active node fails; and to provide distributed access and scalable workloads.

From the outside, a cluster looks like a single server, and the group of servers is managed as if it were one.

In short, a cluster is a group of independent servers that work together as a single system.

2. What is Oracle Real Application Clusters (RAC)

RAC is an Oracle software configuration in which the database files are stored on disks that are physically or logically connected to every node. The RAC software coordinates data access and data changes between instances, so that the instance on every active node can read and write the files and every instance sees the same, consistent image of the data. A RAC configuration provides redundancy: even when one instance crashes or becomes unreachable, applications can still access the database through the other instances.

3, Why RAC

RAC can run on clusters built from standardized, modular servers, which reduces cost.

RAC provides automatic workload management for services. Services are groups or classifications of applications that together make up the business components carrying out application tasks. Services in RAC allow continuous, uninterrupted database operation and support multiple services on multiple instances. A service can be assigned to run on one or more instances, with other instances designated as backups. If the primary instance fails, Oracle moves the service from the failed instance to an available alternate instance. Oracle also balances the load across connections automatically.

RAC combines multiple inexpensive computers to provide database services as if they were one large computer, serving workloads that otherwise only a large-scale SMP machine could handle.

RAC is based on a shared-disk architecture, so it can grow or shrink on demand without manually partitioning the data across the cluster. Servers can also be added to or removed from a RAC cluster easily.

4, Clusters and scalability

An application that scales transparently on a symmetric multiprocessing (SMP) machine can be expected to scale just as transparently on RAC. RAC also removes the database instance, and the node itself, as a single point of failure, which protects the integrity of the database when a node fails.

a, Examples of scalability

  • Allow more concurrent batch jobs
  • Allow a higher degree of parallel execution
  • Allow a surge in the number of connected users in OLTP systems

b, Levels of scalability

  • Hardware scalability: the interconnect is the key; scalability typically relies on high bandwidth and low latency.
  • OS scalability: the synchronization methods in the operating system can determine the scalability of the system. In some cases the potential scalability of the hardware is lost because the OS cannot handle multiple concurrent resource requests.
  • Database management system scalability: a key factor in parallel architectures is whether parallelism is driven internally or by external processes. The answer affects the synchronization mechanism.
  • Application scalability: applications must be explicitly designed to be scalable. A bottleneck can arise when, in most cases, every session updates the same data.

To be clear, if scalability is missing at any one level, parallel processing in the cluster is likely to fail no matter how scalable the other levels are. A typical cause is non-scalable access to a shared resource, which serializes concurrent operations at that bottleneck. This limitation is not specific to RAC; it applies to every architecture.

f3a64adb71cdc98fdafb6511ef0765f1a13.jpg

5, RAC structure and background processes

7335f2da882389f1b050ec4311b3b19a1e6.jpg

A RAC instance runs a few more background processes than an ordinary single instance. These processes mainly manage global resources and keep the database consistent across instances.

  • LMON: Global Enqueue Service Monitor
  • LMD0: Global Enqueue Service Daemon
  • LMSx: Global Cache Service processes, where x ranges from 0 to j
  • LCK0: Lock process
  • DIAG: Diagnosability process

At the cluster level, the software layer is Cluster Ready Services (CRS), which provides a standard cluster interface and high-availability operations on all platforms. Its main processes run on every cluster node:

  • CRSD and RACGIMON: the engines for high-availability operations.
  • OCSSD: provides access to node membership and service groups.
  • EVMD: the event detection process, which runs under the oracle user.
  • OPROCD: the process monitor for the cluster.

The global resources in the cluster (ASM instances, RAC databases, services, and CRS node applications) are managed mainly with Server Control (SRVCTL), DBCA, and Enterprise Manager.

6, RAC software storage principle

An Oracle 10g RAC installation has two stages: first install CRS, then install the database software with the RAC components and create the cluster database. The Oracle home used by the CRS software must be different from the home used by the RAC software. The voting file and the OCR file cannot be stored in ASM, because they must be accessible before any Oracle instance starts. Both must be stored on a shared storage device.

5d4b4473cb79e8b71d5c8b77bd516e04245.jpg

Voting file: essentially used by the Cluster Synchronization Services daemon to monitor node information. Its size is about 20 MB.

Oracle Cluster Registry (OCR) file: also a key component of CRS. It maintains information about the high-availability components in the cluster, for example the cluster node list, the mapping of cluster database instances to nodes, and the CRS application resources (such as services and virtual IP addresses). The file is maintained automatically by management tools such as SRVCTL. Its size is about 100 MB.

7, OCR structure

Cluster configuration information is maintained in the OCR. The OCR relies on a distributed shared-cache architecture to optimize queries against the cluster repository. Each node in the cluster keeps a copy of the OCR in memory, maintained by its local OCR process. Only one OCR process in the cluster actually reads from and writes to the OCR on shared storage. That process is responsible for refreshing its own local cache and the OCR caches on the other nodes. To read the cluster repository, OCR clients talk directly to the local OCR process. When a client needs to update the OCR, it goes through its local OCR process, which communicates with the OCR process that performs the reads and writes of the OCR file.
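
The read and update paths described above can be sketched in a few lines of Python. This is only an illustrative model: a dictionary stands in for the OCR file, and the names OcrMaster, LocalOcrProcess, read, update and write are hypothetical, not Oracle APIs.

```python
class OcrMaster:
    """The single OCR process allowed to touch the OCR on shared storage."""

    def __init__(self):
        self.shared_file = {}   # stands in for the OCR file on shared storage
        self.local_procs = []   # every per-node OCR process, each with a cache

    def write(self, key, value):
        self.shared_file[key] = value
        # Refresh the in-memory copy on every node after the write.
        for proc in self.local_procs:
            proc.cache[key] = value


class LocalOcrProcess:
    """Per-node OCR process: serves reads from memory, forwards updates."""

    def __init__(self, node, master):
        self.node = node
        self.master = master
        self.cache = dict(master.shared_file)
        master.local_procs.append(self)

    def read(self, key):
        # Clients read from the local cache only.
        return self.cache.get(key)

    def update(self, key, value):
        # Updates are handed to the process that owns the shared file.
        self.master.write(key, value)


master = OcrMaster()
node1 = LocalOcrProcess("node1", master)
node2 = LocalOcrProcess("node2", master)
node2.update("cluster.nodes", ["node1", "node2"])
print(node1.read("cluster.nodes"))
```

A client on node1 sees the change made through node2 without ever reading shared storage itself, which is the point of the cache design.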

 

 

95fe6a78c7d7f4e6ad9425042e3f3728746.jpg

OCR client applications include Oracle Universal Installer (OUI), SRVCTL, Enterprise Manager (EM), DBCA, DBUA, NetCA, and the Virtual IP Configuration Assistant (VIPCA). In addition, the OCR maintains dependency and status information for the application resources defined in CRS, in particular databases, instances, services, and node applications.

The configuration file is named ocr.loc, and the configuration variable in it is ocrconfig_loc. The location of the cluster repository is not restricted to raw devices: the OCR can also be placed on shared storage managed by a cluster file system.

note: the OCR is also used as a configuration file for single-instance ASM, in which case there is one OCR per node.

8, RAC Database storage principle

0334a839eb34e4c488035ed507b17a3e332.jpg

The main difference from single-instance Oracle storage is that RAC must store all data files on shared devices (raw devices or a cluster file system) so that every instance accessing the same database can share them. At least two redo log groups must be created for each instance, and all redo log groups must also be stored on shared devices for crash recovery purposes. The set of online redo log groups belonging to an instance is called that instance's online redo thread.

In addition, you must create one shared undo tablespace per instance in order to use the recommended automatic undo management feature. Every undo tablespace must be shared by all instances, mainly for recovery purposes.

Archive logs cannot be stored on raw devices, because their names are generated automatically and differ from one log to the next. They therefore have to be stored in a file system. If you use a cluster file system (CFS), the archived logs can be accessed from any node at any time. If you do not use a CFS, you must make the archive logs available to the other cluster members at recovery time, for example through the Network File System (NFS). If you use the recommended flash recovery area feature, it must be stored in a shared directory so that all instances can access it. (The shared directory can be an ASM disk group or a CFS.)

9, RAC and shared storage technology

Storage is a key component of grid technology. Traditionally, storage was attached directly to each individual server (direct attached storage, DAS). In the past few years more flexible storage has emerged and come into use, accessed mainly over storage area networks or regular Ethernet. These newer storage options allow multiple servers to access the same set of disks, which makes access simple in a distributed environment.

The storage area network (SAN) represents the current stage in the evolution of data storage technology. Traditionally, in client/server systems, data was stored on devices inside or directly attached to the server. Next came network attached storage (NAS), which separated storage devices from the server and attached them directly to the network. A SAN takes this principle further by placing storage devices on their own network, where they communicate directly with each other over high-speed media. Users access the data on the storage devices through server systems that are connected both to the local area network (LAN) and to the SAN.

The choice of file system is critical for RAC. Traditional file systems do not support being mounted by multiple systems at the same time. The files therefore have to be stored either on raw volumes without any file system, or on a file system that supports concurrent access from multiple systems. The three main approaches to RAC shared storage are therefore:

  • Raw volumes: directly attached raw devices, which must be accessed in block mode.
  • Cluster file system: also accessed in block mode. One or more cluster file systems can hold all the RAC files.
  • Automatic Storage Management (ASM): a lightweight, dedicated cluster file system optimized for Oracle database files.

10, Oracle Cluster File System

Oracle Cluster File System (OCFS) is a shared file system designed specifically for Oracle RAC. OCFS removes the requirement that Oracle database files be linked to logical drives, and lets all nodes share a single Oracle home instead of each node keeping a local copy. An OCFS volume can span one or more shared disks for better performance and redundancy. OCFS is free of charge for developers and users and can be downloaded from the official website.

The following kinds of files can be placed on OCFS:

  • Oracle software installation files: in 10g, this is supported only on Windows 2000. Support on Linux was said to be coming in later versions, but I have not seen the details.
  • Oracle files (control files, data files, redo log files, bfiles, and so on)
  • Shared configuration files (spfile)
  • Files created by Oracle at run time
  • Voting and OCR files

11, Automatic Storage Management (ASM)

ASM is a new feature in 10g. It provides a vertically integrated file system and volume manager built specifically for Oracle database files. ASM can manage storage for a single SMP machine or across the multiple nodes of an Oracle RAC cluster.

ASM removes the need to tune I/O manually: it automatically distributes the I/O load across all available resources to optimize performance. It also helps DBAs manage a dynamic database environment by allowing the database to grow through storage allocation changes without shutting the database down.

19918a14b19cb76c5023e561dd485ccdefd.jpg

ASM can maintain redundant copies of data to improve fault tolerance. It can also be built on top of a storage mechanism that is already reliable.

12, Choosing between raw devices and CFS

Advantages of CFS: installation and management of RAC are very simple; Oracle Managed Files (OMF) can be used with RAC; a single Oracle software installation is enough; Oracle data files can autoextend; and archive logs remain uniformly accessible when a physical node fails.

Using raw devices: this generally applies when a CFS is unavailable or not supported by Oracle; raw devices offer the best performance, with no intermediate layer between Oracle and the disk; autoextend fails on a raw device when its space is exhausted; ASM, a logical storage manager, or a logical volume manager simplifies working with raw devices, since these also allow space to be added to a raw device online and let you give raw devices names that make them easier to manage.

13, Typical cluster stack for RAC

Every node in the cluster needs a supported interconnect software protocol for inter-instance communication, and TCP/IP for CRS polling. All UNIX platforms use the User Datagram Protocol (UDP) over Gigabit Ethernet as the primary protocol and interconnect for RAC inter-instance IPC. Other supported, vendor-specific interconnect protocols include Remote Shared Memory for SCI and SunFire Link interconnects, and Hyper Messaging Protocol for Hyperfabric interconnects. In every case, the interconnect must be certified by Oracle for the platform.

2a5b7aba03162cc666200f9e196df3f88ef.jpg

Using Oracle Clusterware reduces installation and support complications. However, vendor clusterware may still be needed if you use a non-Ethernet interconnect or have deployed clusterware-dependent applications on the same cluster as RAC.

As with the interconnect, the shared storage solution must be certified by Oracle for the current platform. If a CFS is available on the target platform, both the database area and the flash recovery area can be created on either CFS or ASM. If a CFS is not available on the target platform, the database area can be created on ASM or on raw devices (which requires a volume manager), and the flash recovery area must be created on ASM.

14, RAC certification matrix

The RAC certification matrix is designed to answer any certification-related question. The steps for using it are as follows:

  • Connect and log in to http://metalink.oracle.com
  • Click the "Certify and Availability" button on the menu bar
  • Click the "View Certifications by Product" link
  • Select RAC
  • Select the correct platform

15, The need for global resources

In a single-instance environment, locks coordinate access to shared resources such as a row in a table. Locks prevent two processes from modifying the same resource at the same time.

In a RAC environment, inter-node synchronization is critical because it keeps the processes on different nodes consistent and prevents them from modifying the same resource at the same time. Inter-node synchronization guarantees that each instance sees the most recent version of a block in its buffer cache. The figure shows what happens when there is no locking.

99f2b87b5ee5e7ef7b58e4245c3d7703279.jpg

a, Coordinating global resources

Cluster operation requires that access to shared resources be synchronized across all instances. RAC uses the Global Resource Directory (GRD) to record how resources are used in the cluster database. The Global Cache Service (GCS) and the Global Enqueue Service (GES) manage the information in the GRD.

8e6708ff2002fa0f9f23509552817202a3c.jpg

Each instance maintains a part of the GRD in its local SGA. GCS and GES nominate one instance to manage all the information about a particular resource; that instance is called the resource master. Every instance knows which instance masters which resource.
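
One way to picture "every instance knows the master" is a deterministic mapping that each instance can evaluate locally. The Python sketch below assumes a simple hash rule; the function resource_master and the resource name are hypothetical, and the real mastering algorithm is internal to Oracle and is not described in this article.

```python
import zlib


def resource_master(resource_name, instances):
    """Pick the instance that masters a resource (hypothetical hash rule)."""
    ordered = sorted(instances)
    # crc32 is stable across processes, unlike Python's built-in hash() for str.
    return ordered[zlib.crc32(resource_name.encode()) % len(ordered)]


instances = ["inst1", "inst2", "inst3"]
# Each instance evaluates the same rule on its own and gets the same answer,
# so no lookup message is needed to find the master.
lookups = {inst: resource_master("file7:block1234", instances) for inst in instances}
print(lookups)
```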

Maintaining cache coherency is a very important part of RAC activity. Cache coherency means keeping the multiple versions of a block held by different Oracle instances consistent. GCS implements cache coherency through the Cache Fusion algorithm.

GES manages all non-Cache Fusion inter-instance resource operations and tracks the status of all Oracle enqueuing mechanisms. The main resources that GES controls are dictionary cache locks and library cache locks. It also performs deadlock detection on all deadlock-sensitive enqueues and resources.

b, Global cache coordination example

9f8d067c70b590d62d9405698c62428c9e5.jpg

Suppose a data block has been modified by the first instance and is therefore dirty. Cluster-wide there is only one copy of the block, and its contents are identified by its SCN. The steps are as follows:

  • The second instance wants to modify the block and sends a request to the GCS.
  • The GCS forwards the request to the holder of the block. Here, the first instance is the holder.
  • The first instance receives the message and sends the block to the second instance. The first instance keeps the dirty buffer for recovery purposes. This dirty image of the block is called a past image of the block. A past image block cannot be modified further.
  • On receiving the block, the second instance informs the GCS that it now holds the block.
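
The four steps can be traced with a small Python model. It is a sketch under simplifying assumptions: a block is reduced to its SCN, messages are plain method calls, and the class and method names (Gcs, Instance, request_for_modify, ship) are hypothetical.

```python
class Instance:
    def __init__(self, name):
        self.name = name
        self.cache = {}  # block id -> (scn, state)

    def ship(self, block, requester):
        """Step 3: send the block and keep the dirty buffer as a past image."""
        scn, _ = self.cache[block]
        self.cache[block] = (scn, "past_image")
        requester.cache[block] = (scn, "current")

    def modify(self, block, new_scn):
        _, state = self.cache[block]
        if state != "current":
            raise ValueError("a past image cannot be modified")
        self.cache[block] = (new_scn, "current")


class Gcs:
    """Tracks the current holder and the past-image holders of each block."""

    def __init__(self):
        self.current = {}      # block id -> instance holding the current version
        self.past_images = {}  # block id -> instances keeping a past image

    def request_for_modify(self, block, requester):
        holder = self.current[block]          # step 2: forward to the holder
        holder.ship(block, requester)         # step 3: direct block transfer
        self.past_images.setdefault(block, []).append(holder)
        self.current[block] = requester       # step 4: requester is now the holder


gcs = Gcs()
inst1, inst2 = Instance("inst1"), Instance("inst2")
inst1.cache["B"] = (1009, "current")   # inst1 has dirtied block B
gcs.current["B"] = inst1

gcs.request_for_modify("B", inst2)     # step 1: inst2 asks the GCS
inst2.modify("B", 1013)
print(inst1.cache["B"], inst2.cache["B"])
```

Note that the block never touches disk during the transfer; the past image on inst1 is what makes that safe for recovery.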

c, Write-to-disk coordination example

e0dff51c6d6a311f5dc35151eac06e50816.jpg

The caches of the instances in a cluster may hold different modified versions of the same block. The write protocol managed by the GCS ensures that only the most recent version is written to disk. It also has to ensure that all earlier versions are purged from the other caches. A disk write request can originate from any instance, whether it holds the current version of the block or a past image. Suppose the first instance holds a past image of the block and asks Oracle to write the buffer to disk, as shown above. The steps are as follows:

  • The first instance sends a write request to the GCS.
  • The GCS forwards the request to the second instance, the holder of the current version of the block.
  • The second instance receives the write request and writes the block to disk.
  • The second instance informs the GCS that the write is complete.
  • After receiving the notification, the GCS orders all past image holders to discard their past images, because they are no longer needed for recovery.
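
The write protocol can be modelled in Python as follows. This is a hedged sketch: instances are plain dictionaries, a block is reduced to its SCN, the instance with the highest SCN is assumed to hold the current version, and coordinate_write is a hypothetical name.

```python
def coordinate_write(block, requester, instances, disk):
    """Run the write protocol for one block and return the message log."""
    holders = [inst for inst in instances if block in inst["cache"]]
    # Assumption of this sketch: the highest SCN marks the current version.
    current = max(holders, key=lambda inst: inst["cache"][block])
    log = [
        f"{requester['name']} -> GCS: write request",
        f"GCS -> {current['name']}: forward write request",
    ]
    disk[block] = current["cache"][block]   # only the newest version is written
    log.append(f"{current['name']} -> GCS: write complete")
    for inst in holders:
        if inst is not current:
            del inst["cache"][block]        # past image no longer needed
            log.append(f"GCS -> {inst['name']}: discard past image")
    return log


inst1 = {"name": "inst1", "cache": {"B": 1009}}   # past image
inst2 = {"name": "inst2", "cache": {"B": 1013}}   # current version
disk = {}
for line in coordinate_write("B", inst1, [inst1, inst2], disk):
    print(line)
```

Even though inst1 asked for the write, the version that reaches disk comes from inst2, and inst1 drops its past image afterwards.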

16, RAC and instance/crash recovery

a, When an instance fails and another instance detects the failure, the second instance performs the following recovery steps:

  • In the first phase of recovery, GES remasters the enqueues.
  • GCS then remasters its resources. The GCS processes remaster only those resources that have lost their masters. During this time, all GCS resource requests and write requests are temporarily suspended. However, transactions can continue to modify data blocks as long as they have already acquired the necessary resources.
  • Once the enqueues have been reconfigured, one of the surviving instances can acquire the instance recovery enqueue. Therefore, while GCS resources are being remastered, SMON determines the set of blocks that need recovery. This set is called the recovery set. With Cache Fusion, an instance ships the contents of blocks to the requesting instance without first writing them to disk, so the on-disk version of a block may not contain the changes made by processes of other instances. This means that SMON has to merge the redo logs of all failed instances to determine the recovery set, because one failed thread might leave a hole in the redo that needs to be applied to a particular block. The redo threads of the failed instances therefore cannot be applied serially. The redo threads of the surviving instances do not need to be recovered, because SMON can use the past or current images in their buffer caches.
  • Buffer space is allocated for recovery, and the resources identified during the earlier reading of the redo logs are claimed as recovery resources. This prevents other instances from accessing those resources.
  • All resources required for the subsequent recovery operations have now been acquired, and the GRD is unfrozen. Any data block that does not need recovery can now be accessed, so the system is partially available. From here, if past or current images of the blocks to be recovered exist in other caches of the cluster database, the most recent image is the starting point of recovery for each such block. If neither a past image nor the current buffer of a block to be recovered is in the buffer cache of any surviving instance, SMON performs a log merge of the failed instances. SMON recovers and writes each block identified in the third step, and releases the recovery resources immediately after a block is recovered, so that more blocks become available as recovery proceeds.

When all blocks have been recovered and the recovery resources have been released, the system is fully available again.

The cost of log merging during recovery is proportional to the number of failed instances and to the size of the redo logs of each instance.

b, Instance recovery and database availability

The following figure shows how much of the database is available at each step of instance recovery:

a77ee60bc1838d7edb3936faee8535bccd3.jpg

  • RAC is running on multiple nodes.
  • A node failure is detected.
  • The enqueue part of the GRD is reconfigured; resource management is redistributed to the surviving nodes. This operation completes relatively quickly.
  • The cache part of the GRD is reconfigured, and SMON reads the redo logs of the failed instance to identify the set of blocks that need recovery.
  • SMON issues requests to the GRD to acquire all database blocks in the recovery set. When the requests complete, all other blocks can be accessed.
  • Oracle performs roll-forward recovery. The redo logs of the failed threads are applied to the database, and each block becomes accessible as soon as it is fully recovered.
  • Oracle performs rollback recovery. Undo blocks are applied to the database for transactions that had not committed.
  • Instance recovery is complete and all data is accessible.
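
The log merge, roll-forward and rollback phases can be illustrated with a toy Python model. The sketch makes strong simplifying assumptions: a redo record is a dictionary with hypothetical fields scn, block, value and txn, a block holds a single value, and an uncommitted change is never followed by a committed change to the same block (row locks give a comparable guarantee for rows).

```python
import heapq


def recover(disk, failed_threads, committed_txns):
    """Return (recovery set, block contents after recovery)."""
    # Merge the redo threads of all failed instances into one SCN-ordered
    # stream; each thread on its own is already ordered by SCN.
    merged = list(heapq.merge(*failed_threads, key=lambda rec: rec["scn"]))
    recovery_set = {rec["block"] for rec in merged}

    blocks = dict(disk)
    undo = []  # (block, previous value) for changes of uncommitted transactions
    # Roll forward: apply every change, committed or not.
    for rec in merged:
        if rec["txn"] not in committed_txns:
            undo.append((rec["block"], blocks.get(rec["block"])))
        blocks[rec["block"]] = rec["value"]
    # Roll back: undo the uncommitted changes, newest first.
    for block, previous in reversed(undo):
        blocks[block] = previous
    return recovery_set, blocks


thread1 = [
    {"scn": 10, "block": "A", "value": "a1", "txn": "T1"},
    {"scn": 30, "block": "B", "value": "b1", "txn": "T2"},  # T2 never committed
]
thread2 = [{"scn": 20, "block": "A", "value": "a2", "txn": "T3"}]
disk = {"A": "a0", "B": "b0", "C": "c0"}

recovery_set, blocks = recover(disk, [thread1, thread2], committed_txns={"T1", "T3"})
print(sorted(recovery_set), blocks)
```

Applying thread1 in full and then thread2 would put the SCN 20 change after the SCN 30 change, which is why the threads are merged rather than applied one after another. Block C is not in the recovery set and stays accessible throughout.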

17, Efficient inter-node row-level locking

Oracle supports efficient row-level locking. Row-level locks are created mainly by DML operations such as UPDATE. They are held until the transaction commits or rolls back. Any other process that requests a lock on the same row waits.

79141c574035773b3bf9993d4817e6dbe6d.jpg

Cache Fusion block transfers operate independently of these user-visible row-level locks. Block transfers by the GCS are low-level operations that can start without waiting for row-level locks to be released. A block can be transferred from one instance to another while rows in it remain locked.

GCS provides access to data blocks in a way that allows multiple transactions to proceed concurrently.

18, Additional memory requirements for RAC

RAC-specific memory is mostly allocated from the shared pool when the SGA is created. Because blocks may be cached across instances, larger buffer caches are also required. Therefore, when migrating a single-instance database to RAC, if each instance is to handle the same workload as the single instance did, increase the buffer cache of each RAC instance by about 10% and the shared pool by about 15%. These figures are heuristics based on RAC sizing experience and are meant as initial values to try. In most cases they are upper bounds.
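
As a worked example of the 10% and 15% rule of thumb, the following Python helper computes starting values in megabytes. The function name and the sample sizes are hypothetical, and the result is only a first guess to be refined by monitoring.

```python
def rac_initial_sizes(buffer_cache_mb, shared_pool_mb):
    """Apply the +10% buffer cache and +15% shared pool heuristic (whole MB)."""
    return {
        "buffer_cache_mb": buffer_cache_mb * 110 // 100,
        "shared_pool_mb": shared_pool_mb * 115 // 100,
    }


# A single instance with a 1000 MB buffer cache and a 400 MB shared pool.
sizes = rac_initial_sizes(1000, 400)
print(sizes)  # {'buffer_cache_mb': 1100, 'shared_pool_mb': 460}
```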

If you use the recommended automatic memory management feature, you can reflect these values in the SGA_TARGET initialization parameter. Keep in mind, however, that the memory required per instance decreases when the same user population is distributed over multiple nodes.

The actual resource usage can be monitored by querying the CURRENT_UTILIZATION and MAX_UTILIZATION columns of the V$RESOURCE_LIMIT view for the GCS and GES entities on each instance:

SELECT resource_name, current_utilization, max_utilization FROM v$resource_limit WHERE resource_name LIKE 'g%s_%';

19, RAC and parallel execution

Oracle's optimizer is cost-based; it takes the cost of parallel execution into account as one component when arriving at an optimal execution plan.

757ba7eba9a5bc843a86d5139e2fdbcb044.jpg

In a RAC environment, the optimizer's parallel execution choices are of two kinds: intra-node and inter-node. For example, if a query needs six query processes to complete its work and six parallel execution slaves are idle on the local node, the query is executed using only local resources. This is efficient intra-node parallelism, and it avoids the overhead of coordinating a parallel query across nodes. If only two parallel execution slaves are available on the local node, those two and four slaves on other nodes execute the query together. In that case both intra-node and inter-node parallelism are used to speed up the query.
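
The local-first choice in this example can be written down as a short Python function. It is a simplified sketch, not the optimizer's actual algorithm: allocate_slaves and the node names are hypothetical, and remote nodes are simply visited in the order given.

```python
def allocate_slaves(needed, local_node, idle_slaves):
    """Prefer idle slaves on the local node, then borrow from other nodes."""
    allocation = {}
    take = min(needed, idle_slaves.get(local_node, 0))
    if take:
        allocation[local_node] = take
    remaining = needed - take
    for node, idle in idle_slaves.items():
        if remaining == 0:
            break
        if node == local_node or idle == 0:
            continue
        take = min(remaining, idle)
        allocation[node] = take
        remaining -= take
    return allocation


# Six idle slaves locally: the query stays on the local node.
print(allocate_slaves(6, "node1", {"node1": 6, "node2": 6}))
# Only two idle locally: the other four come from remote nodes.
print(allocate_slaves(6, "node1", {"node1": 2, "node2": 3, "node3": 3}))
```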

In real decision-support applications, queries are not always partitioned evenly across the query servers. As a result, some parallel execution servers finish their work and become idle before the others. Oracle's parallel execution technology dynamically detects idle processes and assigns work from the queues of overloaded processes to the idle ones. In this way Oracle efficiently redistributes the query workload across all processes. RAC extends this efficiency to the entire cluster.

20, Global dynamic performance views

Global dynamic performance views show information about all instances that are started and accessing a RAC database. Standard dynamic performance views show information about the local instance only.

For every V$ view there is a corresponding GV$ view, with a few exceptions. In addition to the columns of the V$ view, a GV$ view contains an extra column named INST_ID, which shows the RAC instance number. GV$ views can be queried from any started instance.

To query GV$ views, the PARALLEL_MAX_SERVERS initialization parameter must be set to at least 1 on every instance, because GV$ queries use a special form of parallel execution. The parallel execution coordinator runs on the instance the client is connected to, and one slave is allocated on each instance to query the underlying V$ view of that instance. If PARALLEL_MAX_SERVERS is set to 0 on an instance, no information is returned for that node; similarly, if all parallel servers are busy, no result is returned. In neither case is a warning or error message raised.
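
The relationship between V$ and GV$ views, including the silent omission of an instance, can be mimicked in Python. The sketch is purely conceptual: query_gv is a hypothetical function, rows are dictionaries, and the STATUS column is only sample data.

```python
def query_gv(local_views, parallel_max_servers):
    """Union the rows of each instance's V$ view, tagging them with INST_ID."""
    rows = []
    for inst_id, v_rows in sorted(local_views.items()):
        if parallel_max_servers.get(inst_id, 0) < 1:
            # No slave can run on this instance: its rows are silently
            # missing, and no error is raised.
            continue
        rows.extend({"INST_ID": inst_id, **row} for row in v_rows)
    return rows


local_views = {1: [{"STATUS": "OPEN"}], 2: [{"STATUS": "OPEN"}]}
print(query_gv(local_views, {1: 4, 2: 4}))
print(query_gv(local_views, {1: 4, 2: 0}))  # instance 2 disappears
```

The second call shows why a short GV$ result should prompt a check of PARALLEL_MAX_SERVERS on every instance before concluding that an instance is down.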

21, RAC and Service

f33365a0f436691877f1eb2906d9a859a31.jpg

22, Virtual IP addresses and RAC

Virtual IP addresses (VIPs) exist to keep applications available when an entire node fails. When a node fails, its VIP automatically fails over to another node in the cluster. When this happens:

  • CRS binds the IP address to the MAC address of a network card on another node, which is transparent to users. Clients that were directly connected see errors.
  • Subsequent packets sent to the VIP go to the new node, which sends RST error packets back to the client. The client therefore gets the error quickly and retries the connection against the other nodes.

If VIPs are not used, connections sent to a failed node have to wait for the TCP timeout, about 10 minutes.
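
The difference in client wait time can be estimated with a small Python simulation. All names here are hypothetical, the 600-second figure is the 10-minute TCP timeout mentioned above, and the 1-millisecond RST delay is an assumed placeholder for "almost immediately".

```python
def connect(addresses, reachable, relocated_vips, tcp_timeout_s=600.0, rst_s=0.001):
    """Try addresses in order; return (address connected to, seconds waited)."""
    waited = 0.0
    for addr in addresses:
        if reachable.get(addr):
            return addr, waited
        # A relocated VIP is answered by its new node with a RST (fast);
        # the plain address of a dead host just times out.
        waited += rst_s if addr in relocated_vips else tcp_timeout_s
    return None, waited


# node1 has failed; its VIP has been relocated to a surviving node.
with_vip = connect(
    ["node1-vip", "node2-vip"],
    reachable={"node1-vip": False, "node2-vip": True},
    relocated_vips={"node1-vip"},
)
without_vip = connect(
    ["node1", "node2"],
    reachable={"node1": False, "node2": True},
    relocated_vips=set(),
)
print(with_vip, without_vip)
```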

27bd0f1d89813b108f08159f22f7e79b315.jpg

Reference: http://tech.chinaunix.net/a2010/0415/874/000000874099.shtml

Default heartbeat settings

misscount: the timeout for the network heartbeat between nodes; the default is 60 seconds.

disktimeout: the timeout for the connection between the CSS process and the voting disk; the default is 200 seconds.
reboottime: after a split brain occurs and a node is evicted, the node restarts within reboottime; the default is 3 seconds.
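
How the two timeouts interact can be sketched as a simple check in Python. This is an assumption-laden illustration: should_evict is a hypothetical function, times are plain seconds, and the real CSS eviction logic involves more state than two timestamps.

```python
MISSCOUNT_S = 60      # network heartbeat timeout (default quoted above)
DISKTIMEOUT_S = 200   # voting disk heartbeat timeout (default quoted above)


def should_evict(now, last_network_beat, last_disk_beat):
    """A node is a candidate for eviction once either heartbeat is overdue."""
    network_overdue = now - last_network_beat > MISSCOUNT_S
    disk_overdue = now - last_disk_beat > DISKTIMEOUT_S
    return network_overdue or disk_overdue


print(should_evict(now=100, last_network_beat=50, last_disk_beat=90))   # both recent
print(should_evict(now=100, last_network_beat=30, last_disk_beat=90))   # network silent for 70 s
print(should_evict(now=300, last_network_beat=290, last_disk_beat=90))  # disk silent for 210 s
```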

 

 

Reproduced in: https://my.oschina.net/peakfang/blog/2873857


Origin blog.csdn.net/cxu123321/article/details/105029126