High-availability operation and maintenance of financial IT systems: people, technology, processes

The financial industry is monopolistic, acts as an indicator for the wider economy, and carries high risk. It is the hub through which huge sums of the country's money flow, and it touches every sector of the national economy, so any instability can trigger a "domino effect." At the same time, the traditional financial sector faces new challenges and opportunities. The rise of Internet finance has added fuel: Internet companies are no longer content to serve only as marketing channels for financial products, and instead combine Internet thinking with customer needs to build seamless products, pushing the financial industry into a new phase.

Financial IT systems also bear the marks of their era. Enterprises of different sizes are at different levels of IT development, but most pass through the following stages. Stage one: the small-workshop mode. IT technology was limited at the time, and business operations did not depend heavily on IT. Stage two: the building-block stacking mode. Companies had recognized the importance of IT and expanded their systems in large numbers, but many systems were stacked together without sound architectural planning, leaving a redundant and complex landscape. Stage three: large-scale centralization. Because financial IT systems were numerous and geographically dispersed, they were consolidated into major data centers for ease of management, disaster recovery centers were built at the same time, and large-scale operation and maintenance took shape. Stage four: high-availability operation and maintenance. The focus of the data center shifts from construction to operation and maintenance. The work is business-oriented, aims at lean and stable systems, and rests on a complete operation and maintenance framework, so that stable IT systems serve as the cornerstone of the financial sector.

High-availability IT operation and maintenance in the financial industry

Faced with a growing and increasingly complex financial IT environment, IT departments have made systems management a central concern. According to statistics, more than 70% of the IT budget is spent on operating and maintaining existing systems. Reducing business downtime can save an enterprise millions in economic losses. Take the well-known "iceberg theory": the bulk of the iceberg is the 80% of failures that are unplanned, and the vast majority of those unplanned failures can be predicted in advance and avoided.

 

In high-availability data center operation and maintenance, the widely proven approach is ITSM (IT service management), and its three areas of people, technology, and processes are the core of high-availability operation and maintenance.

People

Financial IT operations staff are a special group. They stand guard 24 hours a day, 7 days a week, have fought through countless late nights, and carry a level of work pressure that others can hardly imagine, yet they do so without regret. They can truly be called "China's good employees." Whether a financial IT system stays stable depends closely on the people who maintain it, so putting people first is the foundation of system stability. This also places higher demands on IT practitioners. First, improve professional skills through steady accumulation and continuous learning, so that when a failure occurs the problem can be solved and production restored quickly. When a problem arises, do not simply hand it off to the vendor. Doing so gradually marginalizes the in-house team, and because vendor engineers vary in skill, the quality of the fix is not guaranteed. Only by genuinely strengthening their own skills can staff keep the overall situation under control. Second, build soft skills, including the ability to manage stress and to balance work and life. Because the work is intense, staying in good physical and mental condition is very important: learn to relax in spare time and face everything with a positive attitude. Third, develop good communication skills, so that information is passed on promptly and accurately, saving time and cost and keeping the system running efficiently. Finally, cultivate a spirit of exploration and innovation: have the courage to embrace new technology and look at data center operation and maintenance with an eye to the future, so that the data center can develop sustainably.

Technology
Standardized configuration management

A data center runs dozens to hundreds of software types and versions, but a closer look at how many systems use each one shows that some software runs on only a few systems. Usually the reason is that the application developers at the time wrote programs that only supported that software, which leaves system maintainers in a bind: they must spend separate time and effort maintaining this "uncommon" software. So from the start of system construction, IT should push for software "standardization" across the data center. For each piece of software, the version, the configuration parameters, and the configuration steps should all be standardized. Standardized versions make software lifecycle management easier, and standardized configuration steps and parameters help avoid human error.

In standardized management, setting the standards is especially important. Standards should combine the data center's own operation and maintenance experience with the values recommended by the vendor to arrive at the best standard. They should then be revised continually as improvements accumulate in daily operation and maintenance.
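
As a rough illustration of checking a system against a configuration standard, the sketch below compares one server's settings with a baseline. The setting names and values in `STANDARD` are hypothetical placeholders, not vendor recommendations, and a real data center would likely load both sides from a configuration management database.

```python
# Hypothetical standard: keys and values are illustrative only.
STANDARD = {
    "version": "2.4",
    "max_connections": 500,
    "log_level": "INFO",
}

def find_deviations(actual, standard=STANDARD):
    """Return {setting: (expected, found)} for each setting that differs."""
    deviations = {}
    for key, expected in standard.items():
        found = actual.get(key)  # None when the setting is missing
        if found != expected:
            deviations[key] = (expected, found)
    return deviations

# Example server that drifted from the standard (made-up values).
server_config = {"version": "2.4", "max_connections": 200}

for key, (expected, found) in sorted(find_deviations(server_config).items()):
    print(f"{key}: expected {expected!r}, found {found!r}")
```

Reporting deviations rather than silently correcting them keeps a person in the loop, which fits the review-heavy culture the article describes.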

Asset lifecycle management

For financial IT systems, software and hardware are the two pillars that keep systems running normally. With the rapid expansion of IT infrastructure in China's financial industry in recent years, large amounts of new equipment and software go into use every year, which raises the question of how to manage the lifecycle of these resources.

For software lifecycle maintenance, first take a clear inventory of existing software names and versions, including the business system name, the person responsible for the system, the IP address, the software version, and so on. This information should be updated regularly, ideally by a dedicated software version management role. Second, find out in advance how long each software version's lifecycle is and proactively check its EOS (End of Service) date on a regular basis, so that testing of the new version can be arranged before support expires. These tests include functional, performance, and stability testing, to ensure the new version causes no problems after it goes live. Finally, schedule software upgrades regularly. Because a data center hosts a large number of systems, it is advisable to treat version upgrades as a long-term project and to plan which systems will be upgraded in each quarter.
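
The inventory and EOS check described above might be sketched as follows. The record fields mirror the items listed in the text (business system, owner, IP address, software version), but every name, address, and date is invented, and the 180-day lead time is an assumption rather than a recommendation from the source.

```python
from datetime import date, timedelta

# Hypothetical inventory; every name, address, and date here is made up.
INVENTORY = [
    {"system": "core-banking", "owner": "Zhang", "ip": "10.0.0.11",
     "software": "db-server", "version": "11.2", "eos": date(2025, 3, 31)},
    {"system": "payments", "owner": "Li", "ip": "10.0.0.12",
     "software": "app-server", "version": "8.5", "eos": date(2024, 10, 1)},
    {"system": "reporting", "owner": "Wang", "ip": "10.0.0.13",
     "software": "db-server", "version": "12.1", "eos": date(2027, 12, 31)},
]

def expiring_soon(inventory, today, lead_days=180):
    """Return records whose EOS date has passed or falls within lead_days."""
    cutoff = today + timedelta(days=lead_days)
    due = [item for item in inventory if item["eos"] <= cutoff]
    return sorted(due, key=lambda item: item["eos"])  # most urgent first

for item in expiring_soon(INVENTORY, today=date(2025, 1, 1)):
    print(item["system"], item["software"], item["version"],
          item["eos"].isoformat())
```

Sorting by EOS date puts already-expired software at the top, which is a natural starting order for a quarterly upgrade plan.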

Deciding when to retire old IT hardware is a balancing act. On one hand, you want to get as much out of the hardware as possible to lower data center operating costs. On the other, you must make sure aging hardware does not cause repeated downtime that affects normal operation. Therefore, based on experience and warranty periods, define a recommended service life for each type of hardware in advance, and migrate business systems to new hardware before that life ends. Retired hardware can still be put to good use: these devices can serve development and test environments, making full use of hardware resources.

Building contingency plans

The most important goal for financial IT systems is stability, and when a system fails the most important thing is to resume production quickly. The ability to restore production systems quickly depends heavily on the completeness and usability of the contingency plan. Building contingency plans is a preventive measure, and the plans should follow the principle of being "comprehensive and usable." "Comprehensive" means that the plan must cover every aspect involved in the emergency response, such as the system architecture diagram, the scope of impact of a system failure, interactions with other systems, data backups, emergency steps, and contact details. "Usable" means that following the steps in the emergency operations manual really does restore production quickly when the system fails. After a plan is written, regular emergency drills in realistic scenarios are needed to test and refine it, so that the contingency plan is genuinely effective and usable.

Disaster recovery planning and construction

In recent years the China Banking Regulatory Commission, the China Insurance Regulatory Commission, and the Securities Association of China have each issued standards and policies on disaster recovery construction for the banking, insurance, and securities industries respectively, which shows how important and necessary disaster recovery is. These standards and policies also set different RTO (Recovery Time Objective) and RPO (Recovery Point Objective) requirements for business systems at different levels.

In terms of disaster recovery schemes, large and medium-sized financial institutions have largely adopted the "two sites, three centers" scheme. Smaller institutions, constrained by the cost of building and equipping a disaster recovery center, currently use "same-city data replication" to protect data across data centers within one city. From a technical point of view, disaster recovery falls into two categories: data-level and application-level. Data-level disaster recovery focuses on backing up and restoring data and is the prerequisite for application-level disaster recovery. Application-level disaster recovery is built on top of it and adds the ability to take over the business. Looking ahead, active-active data centers are the direction of future development: they improve the utilization of hardware resources in both centers and allow seamless switchover in a disaster. However, because building a disaster recovery center requires a large investment of people and money, each company should build a disaster recovery system suited to its own situation, complete the relevant research and analysis before construction begins, and set different disaster recovery levels for different business systems.

In addition, whether the disaster recovery system can actually switch over when a disaster strikes is a very real question. A look at several major failures at financial firms in recent years reveals two patterns: first, decision makers hesitate over whether to switch; second, the disaster recovery site does not necessarily take over the business normally after the switchover. Both stem mainly from the fact that many companies are not confident in their own disaster recovery environment and are unsure whether it really works. It is therefore advisable to fully demonstrate the feasibility of switchover early in the build, to consider all disaster scenarios, and to switch production to the disaster recovery center periodically and run there for a while. This verifies that the center can take over and prevents it from becoming an expensive but useless showpiece.
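
One way to make switchover drills measurable is to compare each drill's results with the RTO and RPO targets for the system's tier. The sketch below assumes two hypothetical tiers with made-up targets in minutes. These numbers are not taken from any regulator's standard.

```python
# Hypothetical targets in minutes; real values come from the applicable
# regulations and the company's own business impact analysis.
TARGETS = {
    "tier1": {"rto": 30, "rpo": 0},
    "tier2": {"rto": 240, "rpo": 60},
}

def drill_passed(tier, recovery_minutes, data_loss_minutes, targets=TARGETS):
    """True when a drill met both the RTO and the RPO for the given tier."""
    target = targets[tier]
    return (recovery_minutes <= target["rto"]
            and data_loss_minutes <= target["rpo"])

# Made-up drill results: (tier, minutes to recover, minutes of data lost).
drills = [("tier1", 25, 0), ("tier1", 45, 0), ("tier2", 200, 90)]
for tier, recovered, lost in drills:
    status = "pass" if drill_passed(tier, recovered, lost) else "FAIL"
    print(f"{tier}: recovered in {recovered} min, lost {lost} min -> {status}")
```

Recording pass or fail for every periodic switchover gives decision makers evidence to rely on when a real disaster forces the choice.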

 

Proactive analysis and maintenance

Stable IT systems are the cornerstone of normal operation for financial enterprises. For data center operations staff working under high-availability requirements, passive response alone is not enough. They need to take the initiative, identify problems early, and avoid the corresponding risks. First, establish a performance prediction system for business systems: build a mathematical model from existing performance data to derive the relationship between business volume and performance. This makes it possible to analyze in advance whether system resources can handle the traffic surges expected over the next few months or on special dates (for example year-end closing, interest settlement, or the Double Eleven shopping day on November 11), and so avoid performance bottlenecks. Second, deploy automation tools to run regular health checks on existing systems, a routine "medical examination" that looks for hidden problems. The tools check each indicator, and if one is out of bounds they trigger automated adjustment or capacity expansion. Proactive maintenance through automation makes operation and maintenance "smarter": it lightens the workload of operations staff and reduces operational risk.
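
The article does not say which mathematical model to use for relating business volume to performance. As a minimal sketch, the code below assumes the relationship is roughly linear and fits it with ordinary least squares. The sample figures are invented, and a real system may well need a different model.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical history: daily transactions (thousands) and peak CPU (%).
volumes = [100, 200, 300, 400]
cpu_peaks = [20.0, 35.0, 50.0, 65.0]
slope, intercept = fit_line(volumes, cpu_peaks)

def predict_cpu(volume):
    """Predicted peak CPU utilization for a given business volume."""
    return slope * volume + intercept

def max_volume(cpu_limit):
    """Largest volume the model expects to stay at or under cpu_limit."""
    return (cpu_limit - intercept) / slope

print(f"Predicted peak CPU at 600k transactions: {predict_cpu(600):.1f}%")
print(f"Volume that reaches 80% CPU: {max_volume(80):.0f}k transactions")
```

If the predicted utilization for an upcoming peak date exceeds the chosen limit, that is the signal to expand capacity before the date rather than after an outage.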

Processes
Change review and change management

Changes are a daily task in any data center, and changes to production systems have a peculiarity of their own: they are like dancing on a knife's edge. The purpose of a change is to make the system run more stably, but if the change plan is flawed or the operator makes a mistake, the change itself can cause a man-made failure. A comprehensive change management process is therefore needed to make changes as safe as possible. First, before the change, spell out the change content and procedure in detail. Ideally every command and every parameter is written out, the time each step takes is estimated, and the operator and reviewer for each step are listed. Second, hold a change review before the change: have the relevant experts and colleagues review the steps, with checks at multiple levels, to reduce the risk of the change to a minimum. Finally, keep a change record so that the content of every change is well documented, and operations staff can later check what was changed when troubleshooting a problem.
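
A change plan with per-step commands, time estimates, operators, and reviewers can be captured in a small data structure. The sketch below is one possible shape. The class names, the rule that operator and reviewer must differ, and the two-approval threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class ChangeStep:
    command: str            # exact command or parameter change to apply
    estimated_minutes: int  # time estimate for this step
    operator: str           # who executes the step
    reviewer: str           # who double-checks the step

@dataclass
class ChangeRequest:
    title: str
    steps: list = field(default_factory=list)
    approvals: list = field(default_factory=list)  # names from the review

    def total_minutes(self):
        return sum(step.estimated_minutes for step in self.steps)

    def problems(self, required_approvals=2):
        """List reasons the change is not ready (empty list means ready)."""
        issues = []
        for number, step in enumerate(self.steps, start=1):
            if step.operator == step.reviewer:
                issues.append(f"step {number}: operator and reviewer must differ")
        if len(set(self.approvals)) < required_approvals:
            issues.append(f"needs {required_approvals} distinct approvals")
        return issues

# Made-up example change.
change = ChangeRequest(
    title="Raise connection pool size",
    steps=[
        ChangeStep("set max_connections=500 in pool.conf", 5, "alice", "bob"),
        ChangeStep("restart application service", 10, "alice", "bob"),
    ],
    approvals=["carol", "dave"],
)
print(change.total_minutes(), change.problems())
```

Storing each completed `ChangeRequest` also serves as the change record the text calls for, since it keeps the exact commands and the people involved.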

Coordinated fault handling

Because a data center maintains many systems, many departments and vendor personnel are involved. When a fault occurs, the relevant people must be gathered to deal with it quickly, and coordination among them is critical to restoring production fast. Everyone must be clear on one principle: restoring production is the highest priority, and everything else comes second. This requires establishing a cross-department collaboration platform in advance, clarifying the responsibilities of each department and person, and defining the interfaces between them clearly, so that when a problem occurs nobody passes the buck and delays the fix. In daily operation and maintenance, make sure the responsibilities of each person are clearly defined. It is also advisable to appoint a single overall owner (a fault manager) who coordinates the response, brings the parties together, and presses each to complete its stage of the work within the specified time.

Problem tracking and summary

In data center operation and maintenance, all kinds of faults come up every day. Some are resolved quickly, while others take time to analyze before the cause can be located. To make sure no fault is overlooked and progress can be tracked in real time, it is important to establish a fault tracking mechanism. A fault tracking system can record when each alarm event occurred, the follow-up handling plan, the person responsible, and similar information. Problems that remain open for a long time should be reviewed periodically and analyzed to work out countermeasures.

Once a good fault tracking system is in place, the accumulated fault data can also be analyzed with big data methods to find out which kinds of business systems are prone to failure, which categories of causes account for most failures, and whether failures follow any cycle. These statistics allow high risks to be avoided in advance and business continuity to be protected, which is the real point of establishing a fault tracking mechanism.
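
The kind of statistics described here can start out much simpler than a big data platform. The sketch below uses plain counting over a handful of invented fault records to find long-open problems and to rank systems and causes. The field names and the 30-day threshold are assumptions.

```python
from collections import Counter
from datetime import date

# Hypothetical fault records; "closed" is None while a fault is still open.
FAULTS = [
    {"system": "payments", "cause": "disk",
     "opened": date(2024, 1, 5), "closed": date(2024, 1, 5)},
    {"system": "payments", "cause": "config",
     "opened": date(2024, 2, 9), "closed": date(2024, 2, 10)},
    {"system": "reporting", "cause": "config",
     "opened": date(2024, 3, 1), "closed": None},
    {"system": "payments", "cause": "config",
     "opened": date(2024, 4, 20), "closed": None},
]

def long_open(faults, today, max_days=30):
    """Faults still open more than max_days after they were raised."""
    return [f for f in faults
            if f["closed"] is None and (today - f["opened"]).days > max_days]

def counts_by(faults, key):
    """(value, count) pairs for the given field, most frequent first."""
    return Counter(f[key] for f in faults).most_common()

print("Overdue:", [f["system"] for f in long_open(FAULTS, date(2024, 5, 1))])
print("By system:", counts_by(FAULTS, "system"))
print("By cause:", counts_by(FAULTS, "cause"))
```

Grouping the same records by month or weekday would be a natural next step for spotting the cyclical failures the text mentions.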

 

 

Trends in financial IT systems

Today's financial industry keeps growing, and companies frequently launch new lines of business to keep up with the market. These new services bring new vitality to the enterprise, but they also bring new challenges to its IT department. Because new business is highly time-sensitive, IT departments must keep pace with fast-moving market demand, build and deliver the supporting IT systems quickly, and promptly reclaim the resources of business that is taken offline. Under this business-oriented operating model, financial IT operations staff can no longer keep their heads down pulling the cart without looking up at the road. Change is needed from managers to operations staff, from the overall architecture to the technical details, and from the management level down to hands-on execution, in order to adapt to business-driven operation and maintenance. Facing cloud computing and big data technologies, financial IT professionals should always remember one saying: for an enterprise that wants to last, the only thing that never changes is change itself.

Origin blog.csdn.net/syjhct/article/details/100633606