Industry Solutions | Introduction to Intelligent Operation and Maintenance Solutions for the Medical Industry

Business background

Development requirements

In recent years, guided by a series of national policies, key tasks such as hospital information management systems, electronic medical record systems, and regional medical information interconnection have steadily advanced. The National Health Commission has begun to refine the categories of hospital evaluation and rating and has clarified the acceptance rules and deadlines, so assessments have become increasingly strict. For example, the National Health Commission requires that by 2022 the average application level of electronic medical records in secondary and tertiary public hospitals nationwide reach level 3 and level 4 respectively, that smart services strive to reach level 2 and level 3, and that smart management strive to reach level 1 and level 2, so that hospitals can support a new model of integrated online and offline medical services and achieve "promoting construction through evaluation and promoting reform through evaluation".

Because interconnection, smart hospitals, and refined management together mean that the integration and complexity of hospital systems will keep increasing, stricter reviews put heavy pressure on the IT operation and maintenance work of the information center.

Security requirements

On December 1, 2019, the standards for Classified Protection of Cybersecurity 2.0 (Classified Protection 2.0) officially took effect, extending coverage to new technologies, new application security protection objects, and new security protection fields. Because hospitals are now moving to the cloud, virtualization security protection is particularly important, so Classified Protection 2.0 places special emphasis on network security protection and strengthens the "one center, triple protection" security system.

The higher Classified Protection requirements have brought new challenges to the security of hospital information systems. The continuous development of information technology, and especially the emergence and application of new technologies such as cloud computing and the Internet of Things, has brought speed and convenience to the medical industry but has also blurred the protection boundary. Data loss and business interruption caused by hacker attacks, ransomware, worms, and system vulnerabilities have become an urgent problem for hospitals at all levels. How to achieve all-round active defense, dynamic defense, overall prevention and control, and precise protection is a challenge for hospitals. During the "14th Five-Year Plan" period, new network infrastructure such as 5G also faces higher security requirements.

Therefore, the hospital information center not only has to bear growing system operation and maintenance pressure, but also needs to improve its security protection capabilities in line with the Classified Protection requirements.

Operation and maintenance status

At present, in response to national policies on simplifying the process of seeing a doctor, opening up medical big data, and strengthening business applications and hospital management, a hospital's complete business process involves an appointment registration system, an electronic medical record system, an examination and laboratory testing system, an imaging platform system, a financial information system, and a series of related systems, which together interconnect the main patient diagnosis data in the hospital. However, because these systems must serve patients, doctors, and the hospital's departments at the same time, they act as the nerve center of the hospital's operation, and the stability requirements for each system are extremely high. Once one system fails, the overall business process is affected.

Given the volume of business the hospital systems cover and how tightly they are coupled, the information center carries a large amount of routine operation and maintenance work. Yet even the Grade 3A hospitals with the best informatization have an operation and maintenance team of only about 10 people, 80% of tertiary hospitals have only 3-5 operation and maintenance staff, secondary hospitals have only 2-3, and other hospitals have even fewer. Faced with such complex system operation and maintenance work, the results are unsatisfactory.

Description of Requirement

Tertiary hospital

  • Demand for centralized management: The construction of smart hospitals is advancing. Most Grade 3A hospitals have completed the construction of core business systems such as HIS, LIS, PACS, EMR, physical examination systems, and unified payment, and have deployed networks, servers, power and environment monitoring systems, and other systems at a certain scale. However, each vendor's products are monitored independently, data is fragmented, and unified supervision tools are lacking.

  • Problem discovery requirements: Whether in go-live testing of new business systems or in troubleshooting existing systems, the information center discovers problems only after the fact and relies on manual inspection, which means long inspection times, low problem-handling efficiency, and ineffective operation and maintenance management.

  • Root cause location requirements: A small number of tertiary hospitals are no longer satisfied with merely finding problems. They need to trace the root cause and improve the accuracy of problem handling, so they have higher requirements for anomaly detection and log auditing capabilities.

Secondary hospital

  • Demand for centralized management: As business expansion and informatization deepen in secondary hospitals, and especially as the epidemic has pushed them to move business online faster, there are many demands for new applications and for optimizing existing systems, and the IT systems have many stability and reliability problems. In addition, with only 2-3 operation and maintenance staff, neither the technical capability nor the headcount can meet current operation and maintenance needs.

  • Review requirements: To be upgraded to a tertiary hospital, a hospital must meet the relevant security review standards, and operation and maintenance monitoring is part of the compliance requirements.

Hospital operation and maintenance security

At present, because of the higher requirements of Classified Protection 2.0, and in accordance with the "National Hospital Informatization Construction Standards and Specifications (Trial)", the "Guidelines for the Graded Protection of Information Security in the Health Industry", and other regulations, hospitals are required to build a bastion host, log auditing, and a network management and control system as part of their network security system. Most hospital medical networks currently lack these bastion host, log auditing, and network management and control security devices.

Key objectives

  • Operation and maintenance security management

    • It can provide operation and maintenance security auditing services integrating account management, identity authentication, single sign-on, resource authorization, access control and operation auditing;
    • It can effectively audit the operation and maintenance operation process of assets such as servers, network equipment, security equipment, and databases, so that the operation and maintenance audit can be upgraded from event audit to operation content audit;
    • Through prevention beforehand, control during operations, and auditing afterward on the internal control management platform, operation and maintenance security problems are addressed comprehensively.
  • Log audit

    • It can collect and aggregate, in real time and without interruption, the log information of security equipment, network equipment, hosts, operating systems, and production business systems of different types and from different manufacturers in the hospital network, so that security incidents and audit violations can be discovered;
    • It can provide many powerful functions based on log analysis, such as centralized collection of security logs, analysis and mining, compliance audit, real-time monitoring and security alarms, etc., to provide strong support for the analysis and traceability of security events;
    • It can simultaneously meet the actual operation and maintenance analysis needs and audit compliance needs of the hospital, and is an important support platform for the hospital's daily information security work.
  • Network management and control

    • It is a new-generation network management and control system for the hospital campus network: a network automation and intelligence platform that integrates management, control, and analysis functions;
    • It can provide full-lifecycle automation of the campus network and intelligent closed-loop capabilities based on big data and AI, helping hospitals reduce operation and maintenance costs, accelerate digital transformation, and make network management more automated and network operation and maintenance more intelligent.

Pain point analysis

  • Difficulty locating problems: Some hospitals have Huawei network monitoring systems (to monitor the hospital network and switches) but lack effective, unified management of physical server hardware information, operating systems, service middleware, mobile applications, and databases. This makes it difficult to find problems and faults proactively and fails to meet the operation and maintenance monitoring requirements of existing IT equipment.

  • Lagging problem discovery: Users in outpatient and inpatient buildings access the various business systems through private networks. There is no effective means of perceiving network link conditions, system availability, response time, and so on in real time, so investigations often start only after complaints are received, which is time-consuming and labor-intensive and yields results that users do not recognize. Tools are needed to improve the accuracy of anomaly detection.

  • High operation and maintenance load: Only 3-5 people are available. Daily operation and maintenance consists of manual inspection of the computer room, where faults are judged by observing equipment indicator lights, and there is no inspection of the infrastructure layer (servers, network, and so on) or the application and business layer. Staff are exhausted and still cannot meet the operation and maintenance monitoring requirements of existing IT equipment.

  • Launch of new systems: The hospital is in the period of launching a new system, with many uncertain factors. APM is needed to capture data from the servers, using abnormal data as indicators with which the hospital verifies the performance of the new system. The hospital hopes to locate business problems directly to assist in tuning the new system, which requires deployment directly in the production environment.

  • Difficulty in resource management: Hospital data centers lack effective methods for scientifically managing and planning computing resources such as server CPU and memory and storage resources such as disk space and disk I/O, so control over these resources is insufficient.

  • Alarm storm: Some hospitals have power and environment monitoring and infrastructure monitoring, but because hospital business systems are complex, too much alarm information is generated, and operation and maintenance staff are drowned in alarm storms and cannot determine the cause.

  • Independent vendor monitoring: Although some vendors provide monitoring tools, each tool is relatively independent and can only monitor the status of its own products. End-to-end monitoring of core business systems (such as HIS and PACS) is lacking, so when a business system behaves abnormally, the problem can only be analyzed at the device layer and system layer, and it is difficult to locate the root cause of a business or application problem.

  • Security review requirements are not met: Since Classified Protection 2.0 raised hospitals' information security requirements, some hospitals need the support of an operation and maintenance monitoring platform to meet the corresponding security review requirements.

  • Unstable network performance: Major hospitals now depend on the network, especially those that have moved to the cloud, so a network performance problem inevitably causes large-scale business paralysis. Network performance monitoring and security risk assessment are therefore top priorities. However, hospital operation and maintenance budgets are limited and NPM products are relatively expensive, so they are hard to get adopted and sales are low.

Product List

  • DOIM: Privately deployed. It mainly provides unified monitoring of the equipment layer behind the customer's HIS, LIS, and PACS systems, including operating systems, server hardware, storage disk arrays, databases, and virtualization platforms.

  • APM: Privately deployed. It mainly monitors and tests core back-end applications such as HIS, EMR, LIS, PACS, the physical examination system, and the unified payment system, and is deployed in both test and production environments.

  • DOLA: Relying mainly on Cloud Wisdom's capabilities in intelligent algorithms and log analysis, it helps hospitals work preventively in business system operation and maintenance monitoring, reduce the time needed to discover and troubleshoot problems, and improve the accuracy of anomaly detection. Because logs are scattered across server hosts, containers, and network devices, they need to be collected through CDC, and the collection targets are all devices at the IaaS layer.

  • DOEM: Alarm notification via email, third-party push, etc.

Overall solution

Cloud Wisdom provides integrated intelligent operation and maintenance practices for "smart healthcare". Application scenarios include active monitoring, quick troubleshooting, centralized alarms, value presentation, centralized management, log analysis, active inspection, and service management. In addition, Cloud Wisdom's integrated intelligent operation and maintenance monitoring solution offers core advantages such as full-stack monitoring, independent controllability, mature solutions, extensive practice, customer-centricity, a national service network, leading algorithm capabilities, and official ITIL v4 certification.

Overall Architecture Design

The following figure shows the overall architecture design of Cloud Wisdom's integrated intelligent operation and maintenance solution.

Main application scenarios

Full stack monitoring

Through Agent, SNMP (V1, V2, V3), WMI, SSH, Telnet, IPMI, iLO, northbound interfaces, serial ports, ODBC/JDBC, custom SQL, URL, Java connections, and other methods, configuration data and metric data are collected uniformly from the servers, network equipment, operating systems, storage, virtualization, middleware, databases, Web services, and other resources of hundreds of manufacturers. On this basis, resource management and topology management are realized.

  • Resource management: includes network device management, host management, database management, middleware management, storage management, hardware management, standard service management, and log management (syslog, SNMP trap).

  • Topology management: provides automatic network topology discovery, using advanced topology discovery algorithms and data acquisition protocols. This includes generating a network topology map based on routing-layer links, generating a physical network topology map based on network segment connections, and generating a logical topology map for each subnet.

In addition, full stack monitoring also includes achieving the following monitoring goals:

  • Integrated monitoring: 120+ built-in, out-of-the-box resource models and 10,000+ monitoring indicators make it possible to connect legacy equipment, IT resources, power and environment facilities, and IoT devices from hundreds of manufacturers comprehensively and quickly for centralized collection, monitoring, and alarm management, while also supporting integration with data from other systems.

  • Heterogeneous cloud environment management: A cloud-oriented architecture, with collection processors and agents on cloud nodes, provides unified monitoring and centralized management of IT resources from mainstream cloud vendors across platforms, networks, security policies, and domains in multi-cloud heterogeneous environments, and the monitoring scope can be expanded.

  • Localization adaptation: Supports modeling, metric collection, and monitoring for mainstream domestic equipment, operating systems, databases, and middleware, including but not limited to Dameng, Kingdee, Baolande, Kingbase (Renmin University Jincang), TongTech (Dongfangtong), Shentong, Kylin, and Phytium (Feiteng). The self-developed database is not affected by the international environment.

  • Out of the box: With hundreds of out-of-the-box metric collection and CI data collection models, a collection server can be set up in as little as a few minutes and is simple and convenient to use.

The overall operation of all resources and application systems is displayed in real time, 24 hours a day, 365 days a year. Through intelligent operation and maintenance, previously complicated operation and maintenance management becomes simple, achieving the goals of clearly defined responsibilities, safety, efficiency, stability, reliability, and intelligent management and control.

Centralized management

  • IP address management: The IP address management function helps data centers plan the daily and long-term use of network addresses sensibly and improves network security.

  • Periodic scanning: A tool scans regularly to determine the status of the IP addresses in each network segment, including in use, unused, management IP, and reserved IP. Addresses are classified by status in real time and presented as views, with different statuses distinguished by color and counted in real time, to ensure the rational use of network addresses.
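
The status classification described above can be sketched in a few lines of Python. This is only a minimal illustration, not the product's implementation: the function name `classify_subnet`, the status labels, and the sample addresses are assumptions, and the set of in-use addresses is assumed to come from a separate scan.

```python
import ipaddress
from collections import Counter


def classify_subnet(cidr, in_use, management=(), reserved=()):
    """Return {ip_string: status} for every host address in the subnet."""
    in_use, management, reserved = set(in_use), set(management), set(reserved)
    statuses = {}
    for host in ipaddress.ip_network(cidr).hosts():
        ip = str(host)
        # Management and reserved labels take priority over scan results.
        if ip in management:
            statuses[ip] = "management"
        elif ip in reserved:
            statuses[ip] = "reserved"
        elif ip in in_use:
            statuses[ip] = "in use"
        else:
            statuses[ip] = "unused"
    return statuses


# Hypothetical scan result for a small example segment.
statuses = classify_subnet(
    "192.168.10.0/29",
    in_use={"192.168.10.2", "192.168.10.3"},
    management={"192.168.10.1"},
    reserved={"192.168.10.6"},
)
summary = Counter(statuses.values())  # per-status counts for a statistics view
```

A view layer could then map each status to a color and display the per-status counts in `summary`.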

Quick troubleshooting

Automatic discovery of application topology: Automatically discover all application technology stacks and their associated relationships, helping users to grasp the overall status of an application and its associated applications, as well as the changing trends of the number of requests, response time, and errors, and quickly locate problems at all levels.

For a single request, identify potential problems through basic information and business topology, track slow elements and stack details, analyze error and exception information and stacks, and analyze the execution of SQL statements, API calls, and request parameters.
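
As a rough illustration of how slow elements in a single request might be ranked, the sketch below computes each span's self time (its own duration minus the time spent in its direct children). The span fields (`id`, `parent`, `name`, `duration_ms`) and the sample trace are hypothetical and do not reflect the product's data model; child calls are assumed to run sequentially.

```python
from collections import defaultdict


def slowest_spans(spans, top_n=3):
    """Rank spans by self time: own duration minus direct children's durations."""
    child_time = defaultdict(float)
    for span in spans:
        if span["parent"] is not None:
            child_time[span["parent"]] += span["duration_ms"]
    ranked = sorted(
        ((span["duration_ms"] - child_time[span["id"]], span["name"]) for span in spans),
        reverse=True,
    )
    return [(name, self_ms) for self_ms, name in ranked[:top_n]]


# Hypothetical trace of one registration request.
trace = [
    {"id": 1, "parent": None, "name": "GET /registration", "duration_ms": 950.0},
    {"id": 2, "parent": 1, "name": "call payment API", "duration_ms": 120.0},
    {"id": 3, "parent": 1, "name": "SQL: SELECT patient", "duration_ms": 780.0},
]
top = slowest_spans(trace, top_n=2)
```

In this sample the SQL statement dominates the request, which is the kind of element that would be examined first.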

Log analysis

Log auditing relies mainly on Cloud Wisdom's capabilities in intelligent algorithms and log analysis to help hospitals collect, integrate, and analyze logs scattered across server hosts, containers, and network devices, work preventively in business system operation and maintenance monitoring, reduce the time needed to discover and troubleshoot problems, and improve the accuracy of anomaly detection.

Centralized alarm

Alarm management aggregates related alarms that occur at the same time, using basic alarm correlation rules (such as merging by cluster or by IP) together with self-learning algorithms. In addition, through intelligent analysis of alarms, users can avoid invalid alarms and alarm storms, quickly troubleshoot and locate faults, and comprehensively improve alarm management capabilities.

  • Alarm convergence and identification of valid alarms: Compress and deduplicate repeated and invalid alarms that occur in large numbers in a short period of time to identify valid alarms.

  • Alarm aggregation to help locate problems: including merging according to clusters, merging according to IP, merging according to network segments, merging according to abnormal types, and merging according to the relationship between the host and the virtual machine.
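
The rule-based part of convergence and aggregation can be sketched as follows. This is a simplified illustration under stated assumptions, not the product's algorithm: the alarm fields (`ts`, `ip`, `cluster`, `type`), the 60-second window, and the function names are hypothetical, and the self-learning algorithms are not modeled.

```python
from collections import defaultdict


def converge(alarms, window_s=60):
    """Keep an alarm only if the same (ip, type) was not kept within window_s seconds."""
    last_kept = {}
    kept = []
    for alarm in sorted(alarms, key=lambda a: a["ts"]):
        key = (alarm["ip"], alarm["type"])
        if key not in last_kept or alarm["ts"] - last_kept[key] > window_s:
            kept.append(alarm)
            last_kept[key] = alarm["ts"]
    return kept


def aggregate(alarms, field):
    """Group alarm types by one field, for example 'cluster' or 'ip'."""
    groups = defaultdict(list)
    for alarm in alarms:
        groups[alarm[field]].append(alarm["type"])
    return dict(groups)


# Hypothetical alarms; ts is in seconds.
alarms = [
    {"ts": 0, "ip": "10.0.0.5", "cluster": "his-db", "type": "tablespace"},
    {"ts": 10, "ip": "10.0.0.5", "cluster": "his-db", "type": "tablespace"},
    {"ts": 20, "ip": "10.0.0.5", "cluster": "his-db", "type": "deadlock"},
    {"ts": 30, "ip": "10.0.0.9", "cluster": "pacs-app", "type": "cpu"},
    {"ts": 90, "ip": "10.0.0.5", "cluster": "his-db", "type": "tablespace"},
]
effective = converge(alarms)
by_cluster = aggregate(effective, "cluster")
```

Merging by network segment or by host and virtual machine relationship would follow the same grouping pattern with a different key.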

The following figure shows a one-stop intelligent alarm troubleshooting and fault location scenario. The sample notification shows that an Oracle database has failed. Cloud Wisdom performed convergence and identification on a massive number of alarms and found 5 alarms concerning Oracle tablespaces, processes, instances, and deadlocks. At the same time, related alarms occurring in the same period were aggregated using basic alarm correlation rules (such as merging by cluster or by IP) and self-learning algorithms. Finally, by correlating the alarms with metrics and checking the metric trend for the current alarm, it was found that the problem may be caused by a deadlock.

Active inspection

In order to prevent accidents, operation and maintenance personnel need to inspect a large number of devices one by one every day. In the traditional operation and maintenance mode, the operation and maintenance personnel must log in to the equipment in turn to complete the inspection, which not only consumes a lot of time, but also the manual operation method is prone to errors. Automated inspections can improve efficiency by quickly focusing on problems.

  • Hospitals can add corresponding scenarios to automated operation scenarios according to actual needs, and associate operation and scheduling tasks with operation and maintenance scenarios.

  • A variety of common operating systems, databases, middleware and other inspection templates are built in, which can meet the needs of daily inspection.

  • Support flexible configuration of inspection indicators and thresholds, highlight abnormal indicators, and see abnormal conditions at a glance.

  • Supports timed execution policies, inspection notifications, custom email templates, and email attachment types to ensure planned execution of inspection tasks.
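
A configurable threshold check of the kind described above might look like the following minimal sketch. The metric names, threshold values, host names, and the `inspect_host` function are all assumptions made for illustration; they are not the product's inspection templates.

```python
def inspect_host(metrics, thresholds):
    """Return (metric, value, limit) for every metric that exceeds its threshold."""
    abnormal = []
    for name, limit in thresholds.items():
        value = metrics.get(name)
        # Metrics that were not collected for this host are skipped.
        if value is not None and value > limit:
            abnormal.append((name, value, limit))
    return abnormal


# Hypothetical thresholds and collected metrics.
thresholds = {"cpu_percent": 85, "disk_percent": 90, "mem_percent": 90}
hosts = {
    "his-db-01": {"cpu_percent": 42, "disk_percent": 93, "mem_percent": 71},
    "pacs-app-01": {"cpu_percent": 91, "disk_percent": 55},
}
report = {host: inspect_host(m, thresholds) for host, m in hosts.items()}
```

Only the abnormal entries in `report` would need to be highlighted in an inspection notification.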

Value presentation

  • Basic resource monitoring: Display the availability of application ports and port health status when each monitoring point accesses various applications in the hospital through a large screen.

  • Network quality monitoring: Monitor key performance indicators such as network delay and packet loss rate and display them on a large screen.

  • Application performance monitoring: Display the availability of application ports and port health status when each monitoring point accesses various applications in the hospital through a large screen.

  • Database health monitoring: monitor database availability, data capacity, and database key performance indicators and display them on a large screen.
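
For the network quality indicators mentioned above, packet loss rate and average delay can be derived from probe results as in this minimal sketch. The input format (one round-trip time in milliseconds per probe, `None` for a lost probe) and the function name `network_quality` are assumptions for illustration only.

```python
def network_quality(rtts_ms):
    """Summarize probes: rtts_ms holds a round-trip time in ms, or None if the probe was lost."""
    sent = len(rtts_ms)
    received = [r for r in rtts_ms if r is not None]
    loss_rate = (sent - len(received)) / sent if sent else 0.0
    avg_latency = sum(received) / len(received) if received else None
    return {"loss_rate": loss_rate, "avg_latency_ms": avg_latency}


# Hypothetical probe results from one monitoring point: four probes, one lost.
quality = network_quality([10.0, 20.0, None, 30.0])
```
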

Service management

Cloud Wisdom is China's first officially authorized IT service management consulting partner (ACP) certified by AXELOS (ITIL copyright owner). This means that in the industry, Cloud Wisdom will be able to provide more authoritative IT service management consulting and services for enterprises that are willing to introduce ITIL, and further strengthen the localization practice of this theory. Therefore, the existing product framework of Cloud Wisdom is also built following the new generation ITIL concept.

  • Intelligent customer service: Use natural language recognition technology to help users solve common problems, respond quickly to users, and greatly reduce the workload of operation and maintenance engineers;

  • Agent monitoring: You can view the number of receptions, the number of conversations, the average response time, the average conversation duration, the total number of messages, etc., and the detailed data of an agent in real time;

  • Mobile ticket submission: flexibly integrates with enterprise IM tools such as WeCom (Enterprise WeChat) and DingTalk, as well as customers' own apps, so that users can reach intelligent customer service and the online service desk from mobile devices, submit work orders themselves, and check the processing progress of work orders;

  • Service Catalog: Provide unified definition and management functions of service catalog, provide a unified, consistent and accurate information source for enterprise services, and provide support for other service management activities;

  • Process form: Provides a visual work order process definition panel, rich visual controls and a powerful form designer, which can meet the work order customization requirements in various scenarios;

  • Knowledge base linkage: The platform has a large number of built-in knowledge items and centrally manages accumulated historical experience and knowledge of common scenarios, making knowledge easy to look up and apply, improving problem-solving efficiency, and reducing dependence on specialists;

  • Work order Kanban: Visually monitors all kinds of work order data involved in IT service management and displays service risk points, service quality, service efficiency, and service level in real time, helping management perceive, evaluate, and control the quality of IT services from a global perspective.

Case Studies

A hospital integrated monitoring project

Background of the project

Before adopting the automated monitoring system, a hospital relied mainly on manual inspections. Usually the IT department was notified, and sent staff to fix the problem, only after a problem had already affected business use. Because business could not be interrupted, operation and maintenance staff often had to go to the equipment room at night to fix equipment problems, and many of these problems were repetitive.

Service Content

  • Power and environment monitoring: smoke, temperature, water leakage, UPS, air conditioning, etc. in the computer room;

  • Basic monitoring: IT software and hardware equipment such as servers, operating systems, network equipment, databases, and middleware of each system;

  • Automatic disposal: automatic inspection replaces manual inspection, the integration of supervision and control makes operation and maintenance easier, and emergency alarms automatically trigger preset programs and scripts to realize automatic processing;

  • Real-time viewing: The app supports Android and iOS, and the status of managed business and equipment can be viewed on mobile devices.

A hospital smart operation and maintenance project

Background of the project

The computer room of a hospital information center is based on a SAN network architecture to ensure that the hospital's services are provided without interruption. To keep up with the continuous growth of massive image data and ensure the stable and orderly operation of each system, the hospital's leaders decided to establish an intelligent operation and maintenance system for real-time monitoring and centralized management of IT facilities across all campuses, thereby improving the efficiency of fault handling and reducing downtime and system interruptions.

Service Content

  • Integrated monitoring: Real-time monitoring of all equipment and applications, such as the hospital's existing PC servers, UNIX servers, switches, routers, storage, Oracle databases, SQL Server databases, and middleware. The unique MegaSpeed massive second-level monitoring brings fault response down to the level of seconds.

  • Real-time alarm: 24*7 real-time monitoring of IT equipment and applications. Once a failure occurs or a performance metric reaches its alarm threshold, alarm information is pushed automatically through SMS, email, and audible and visual alerts.

  • Panoramic large screen: meets the hospital data center's need for rich visualization of the monitoring status of business systems, network equipment, and more, and presents a dynamic, full-dimensional view of the overall situation.

FlyFish open source benefits

Cloud Wisdom has open-sourced FlyFish, a data visualization orchestration platform. By configuring data models, it provides users with hundreds of visual graphic components, so that an attractive large-screen visualization that meets their own business needs can be built with zero coding. FlyFish also provides flexible extension capabilities, supporting component development, custom functions, and global event configuration, to ensure efficient development and delivery in complex scenarios.

If you like our project, please visit the code repository addresses below and click Star on the GitHub / Gitee repository; we need your encouragement and support. In addition, contribute to the FlyFish project now to become a FlyFish Contributor, and there will be 10,000 yuan in cash waiting for you.

GitHub address: https://github.com/CloudWise-OpenSource/FlyFish

Gitee address: https://gitee.com/CloudWise/fly-fish

Scan the QR code below with WeChat and add the note [Flying Fish] to join the AIOps community FlyFish developer exchange group and communicate directly with the FlyFish project PMC.


Origin my.oschina.net/yunzhihui/blog/5580555