Amazon Cloud Technology Conference Revealed: How to Easily Diagnose and Solve Amazon RDS Performance Issues

Keywords: [Amazon Web Services re:Invent 2023, Database Performance, Performance Insights, Performance Bottlenecks, Monitoring Tools, Wait Events]

Number of words: 3300, reading time: 16 minutes

Introduction

Tuning the performance of relational databases is both an art and a science. In this session, you'll learn how to take full advantage of Amazon RDS performance monitoring features, such as Performance Insights and Enhanced Monitoring, and leave the session knowing how to efficiently diagnose and fix database performance issues.

Session highlights

The following is a summary of the session's key points, compiled by the editor. It runs to about 3,000 words and takes about 15 minutes to read. For more detail, watch the full video or read the original transcript linked below.

Ben Lee welcomed attendees to this session on monitoring and troubleshooting Amazon RDS and Aurora. He began by asking how many people in the audience had run into vexing database performance problems. Nearly everyone raised a hand, which showed how relevant and urgent the topic was for the audience.

Ben emphasized that the main purpose of the session was to familiarize participants with the monitoring tools and services that AWS offers, so that they would have the technical skills to tackle the complex database performance issues that companies across many industries commonly face.

Ben then introduced his co-presenter, Maxim Kachika, a principal database engineer at AWS. Maxim would demonstrate some of the latest capabilities that AWS has developed for database performance diagnostics and troubleshooting.

Before turning to the main agenda, Ben posed a question to the audience: what do a flashlight, a radar, and a magnifying glass have in common? The audience quickly guessed that all three improve visibility and detection.

Ben agreed and extended the analogy: just as these tools illuminate or locate hard-to-find objects in the dark, the right monitoring tools expose hard-to-find database performance bottlenecks, making the problem clear enough to diagnose and resolve correctly.

Ben reiterated that you cannot solve a problem you cannot see: without the visibility that the right tools provide, a complex database problem cannot be solved.

Moving on to the agenda, Ben said he would first give a brief background on Amazon RDS so that everyone shared a common understanding before diving deeper. Next, the speakers would explore the monitoring services and tools that help overcome performance challenges.

Maxim would then demonstrate some of the latest Amazon RDS capabilities from AWS, including advanced database performance diagnostics and troubleshooting. Finally, in a practical troubleshooting case, they would show how to use the RDS monitoring tools to resolve slow performance in an online web store. Attendees would see how these tools quickly locate and fix real customer bottlenecks, and how the monitoring concepts apply in a real-life scenario.

Ben provided an overview of Amazon Relational Database Service (RDS), a fully managed database service that handles time-consuming management tasks such as backups, upgrades, patching, replication, failure detection, disaster recovery, and capacity provisioning. This lets developers focus on building innovative applications rather than on the complex operational tasks of running a highly available, reliable, and secure database. With RDS, companies no longer need to worry about database management and can instead devote resources to application innovation.

Ben emphasized that RDS supports popular database engines, including Oracle, SQL Server, PostgreSQL, MySQL, MariaDB, and Amazon Aurora. This broad support makes it easy for customers to migrate existing databases to the cloud or build new cloud-native applications on their preferred relational database engine. Amazon Aurora in particular combines enterprise-grade security, availability, and reliability with the cost-effectiveness and simplicity of an open-source database. Aurora achieves this through its cloud-native architecture, which includes a distributed, fault-tolerant storage layer optimized for performance.

Ben shared that, thanks to its enterprise-grade features and economics, Aurora usage is growing faster than that of any other service in the history of AWS. Today, thousands of diverse customers use Aurora to power their most important applications. Major customers include Verizon, GE, Capital One, and Amazon.com itself.

Turning to the monitoring capabilities of Amazon RDS, Ben explained that monitoring is available at multiple levels: the instance level, the operating system level, and the database engine level.

For instance-level monitoring, industry experts recommend Amazon CloudWatch as a starting point. According to Ben, CloudWatch can monitor all AWS resources and the entire application stack. For RDS specifically, it tracks instance-level metrics such as CPU utilization, storage, input/output operations per second (IOPS), and network traffic.

CloudWatch provides alarms that users can customize by metric, threshold, evaluation period, and aggregate statistic. A static threshold triggers an alarm when a specific value is exceeded, for example 100,000 IOPS on an Oracle instance.

However, Ben noted that CloudWatch also offers more advanced dynamic thresholds backed by machine learning. These algorithms analyze historical metric data to establish expected baseline ranges based on the typical hourly, daily, and weekly patterns of your workload. A dynamic threshold flags an anomaly when the current metric deviates from its expected baseline range, which makes it more sensitive and accurate than a static threshold when resource utilization changes. Capital One uses CloudWatch's dynamic thresholds to get notified about anomalies that indicate potential problems.
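
The difference between the two alarm styles can be illustrated with a small sketch. It is not CloudWatch's algorithm: the band below is a plain mean plus or minus a multiple of the standard deviation, and the function names, the `width` parameter, and the sample data are all assumptions made for illustration.

```python
from statistics import mean, stdev

def static_alarm(value, threshold):
    """Static threshold: alarm only when the value exceeds a fixed limit."""
    return value > threshold

def baseline_band(history, width=2.0):
    """Expected range derived from history: mean +/- width * std deviation.
    A simplified stand-in for a learned baseline, not the CloudWatch model."""
    center = mean(history)
    spread = stdev(history)
    return center - width * spread, center + width * spread

def anomaly_alarm(value, history, width=2.0):
    """Dynamic threshold: alarm when the value leaves the expected band."""
    low, high = baseline_band(history, width)
    return value < low or value > high

# Hypothetical CPU utilization (%) observed at the same hour on earlier days.
same_hour_history = [22.0, 25.0, 24.0, 23.0, 26.0, 24.0, 24.0]

print(static_alarm(60.0, threshold=80.0))      # False: still under the fixed limit
print(anomaly_alarm(60.0, same_hour_history))  # True: far outside the usual range
```

In this made-up data, a reading of 60% never trips the static 80% alarm, yet it is well outside what this hour normally looks like, which is the kind of deviation a dynamic threshold is meant to surface.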

While CloudWatch is a good choice for instance-level monitoring, Ben said there are times when companies need deeper operating system visibility than CloudWatch provides. This is where RDS Enhanced Monitoring comes in. It collects fine-grained operating system telemetry directly from the RDS database instance itself, not just from the hypervisor layer.

Ben emphasized that Enhanced Monitoring is a fully integrated feature of Amazon RDS, so no manual installation is required to enable it. AWS deploys an agent on the instance to collect operating system metrics. By default, Enhanced Monitoring retains data for 30 days at 1-minute resolution. However, Ben said this can be customized to provide longer retention, or resolution as fine as 1 second for more detailed operating system data.

GE uses RDS Enhanced Monitoring to tune its mission-critical workloads, drawing on operating system metrics such as CPU time and page faults to help optimize database configurations.

At the database engine level, Ben said, in-depth analysis is required, with a focus on SQL query performance statistics, wait events, and other database-specific metrics. Amazon RDS Performance Insights is particularly well suited to this need and provides deep visibility into database workloads and behavior.

Ben explained that Performance Insights measures database load as the average number of active sessions. An active session is a connection that has submitted work, such as a SQL query, to the database engine and is waiting for a response. The metric therefore shows how many queries and other database interactions are running concurrently.
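
A minimal sketch of how such a metric can be derived from periodic samples is shown below. The sample layout (one `(second, session_id)` tuple per active session per sampling tick) and the function name are assumptions for illustration, not the Performance Insights data format.

```python
def average_active_sessions(samples, window_seconds):
    """Average active sessions over a window.

    samples: one (second, session_id) tuple for every session that was
    found active at each once-per-second sampling tick.
    """
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    return len(samples) / window_seconds

# Hypothetical samples: 7 active-session observations across 4 ticks.
samples = [
    (0, "s1"), (0, "s2"),
    (1, "s1"),
    (2, "s1"), (2, "s2"), (2, "s3"),
    (3, "s1"),
]

print(average_active_sessions(samples, window_seconds=4))  # 1.75
```

Dividing the number of active observations by the number of ticks gives the average concurrency over the window, here 1.75 sessions.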

In addition to active sessions and aggregate load, Performance Insights also collects details about the SQL queries themselves. Dimensions associated with each query are also collected, such as database user, client program, application name, and wait events.

Ben emphasized that wait events are particularly important because they help reveal the root cause of performance bottlenecks. A wait event indicates what a query is waiting for while it executes. Common sources of waits include I/O delays (reading data from disk), row or table lock contention, replication delays, and buffer cache misses. Analyzing the wait events of high-load queries points to optimization opportunities.

Verizon uses wait event analysis in Performance Insights to tune configurations and quickly resolve transient performance issues in large-scale Aurora clusters.

Ben acknowledged a possible concern that the 1-second sampling rate of Performance Insights could miss data. He assured the audience that this granularity still gives a very accurate picture of overall database workload and behavior, just as video played at 24 frames per second looks smooth.

The goal is to keep monitoring overhead low while still capturing the "big fish" queries that have the greatest impact on the workload. Performance Insights is not intended to be an exhaustive audit tool that captures every SQL statement. Capturing only the most impactful queries avoids excessive data collection overhead.
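
Why sampling favors the "big fish" can be seen in a toy simulation. The query names, timings, and the `sample_counts` helper below are hypothetical; the sketch only assumes that a session is observed when it happens to be running at a whole-second tick.

```python
def sample_counts(executions, window_seconds):
    """Count how often each statement is seen at whole-second ticks.

    executions: list of (sql, start, end) tuples, times in seconds.
    """
    counts = {}
    for tick in range(window_seconds):
        for sql, start, end in executions:
            if start <= tick < end:
                counts[sql] = counts.get(sql, 0) + 1
    return counts

# One long-running query plus ten very short lookups (50 ms each).
executions = [("big report query", 0.5, 8.5)]
executions += [("fast lookup", t + 0.2, t + 0.25) for t in range(10)]

print(sample_counts(executions, window_seconds=10))
# {'big report query': 8}
```

The long query is seen at eight ticks, while the short lookups happen to fall between ticks and are not recorded at all. That is acceptable for load analysis, because their contribution to total database time is tiny, but it is also why sampled data is not an audit log.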

Visually, Performance Insights aggregates the concurrent workload of many users into an intuitive time-series database load graph. This makes it easy to see when the database is busy and when it is idle. Because each wait event type is assigned a different color, the chart also shows the overall impact of each kind of wait over time.
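
The stacked load graph is essentially the same sampled data grouped by wait event. The sketch below shows that grouping for a single window; the wait event labels and sample values are placeholders, not real engine event names.

```python
from collections import Counter

def load_by_wait_event(samples, window_seconds):
    """Split average active sessions into one slice per wait event.

    samples: (tick, wait_event) tuples, one per active session per tick.
    """
    counts = Counter(wait_event for _tick, wait_event in samples)
    return {event: n / window_seconds for event, n in counts.items()}

# Hypothetical samples over a 4-second window.
samples = [
    (0, "CPU"), (0, "IO read"),
    (1, "IO read"), (1, "IO read"),
    (2, "Lock"),
    (3, "IO read"),
]

load = load_by_wait_event(samples, window_seconds=4)
print(load)                      # {'CPU': 0.25, 'IO read': 1.0, 'Lock': 0.25}
print(max(load, key=load.get))   # IO read
```

The slices add up to the total load for the window (1.5 average active sessions here), and the largest slice indicates where sessions spend most of their time.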

The load graph helps users quickly understand the status and health of the database. For example, if the graph shows no active sessions, the database is idle and under no load. If end users are nevertheless reporting a performance problem, the database can be ruled out as the cause, and the issue most likely lies in another component.

On the other hand, if active sessions exceed the number of vCPUs configured for the RDS instance, then this may indicate the need to tune or increase resources to resolve the performance bottleneck. However, if active sessions remain below the number of vCPUs, then the database should have additional capacity available.

The elasticity and pay-as-you-go model of cloud computing mean that Performance Insights can drive data-driven right-sizing decisions. Load graphs show which instances may be too large or too small, so that cost can be optimized. For example, if the load is consistently well below the number of vCPUs, the instance may be over-provisioned: larger and more expensive than the workload requires. Conversely, if the load exceeds the number of vCPUs, the instance may be undersized and need more resources or query optimization to prevent performance degradation.
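
These rules of thumb can be written down as a small classifier. The function name, the labels, and especially the `low_fraction` cutoff for "well below the number of vCPUs" are assumptions for illustration; the session does not give a specific number.

```python
def assess_load(avg_active_sessions, vcpus, low_fraction=0.25):
    """Classify database load relative to the instance's vCPU count.

    low_fraction is a hypothetical cutoff for "consistently well below".
    """
    if avg_active_sessions == 0:
        return "idle"
    if avg_active_sessions > vcpus:
        return "above vCPU count"
    if avg_active_sessions < low_fraction * vcpus:
        return "possibly over-provisioned"
    return "within capacity"

print(assess_load(0, vcpus=4))    # idle
print(assess_load(6, vcpus=4))    # above vCPU count
print(assess_load(0.5, vcpus=4))  # possibly over-provisioned
print(assess_load(2, vcpus=4))    # within capacity
```

A right-sizing decision would normally rest on sustained load over days or weeks rather than a single reading, so a check like this is best applied to averages over a long window.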

Ben highlighted another powerful feature of Performance Insights: filtering and slicing database activity by the available dimensions, which simplifies root cause diagnosis. For example, you can isolate all queries run by one user that are waiting on blocking locks to identify the likely culprits. You can also filter by client application to pinpoint applications that misuse database connections.
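
Filtering by dimension amounts to selecting the samples that match and then ranking what remains. The sketch below assumes each sample is a dictionary with `user`, `application`, `wait_event`, and `sql` keys; those key names, the `top_sql` helper, and the data are all illustrative.

```python
from collections import Counter

def top_sql(samples, limit=3, **filters):
    """Rank SQL statements by sample count after filtering on dimensions."""
    matching = [
        s for s in samples
        if all(s.get(dim) == value for dim, value in filters.items())
    ]
    return Counter(s["sql"] for s in matching).most_common(limit)

# Hypothetical active-session samples tagged with their dimensions.
samples = [
    {"user": "web", "application": "store", "wait_event": "CPU", "sql": "SELECT orders"},
    {"user": "cc", "application": "call-center", "wait_event": "IO read", "sql": "SELECT history"},
    {"user": "cc", "application": "call-center", "wait_event": "IO read", "sql": "SELECT history"},
    {"user": "cc", "application": "call-center", "wait_event": "Lock", "sql": "UPDATE tickets"},
    {"user": "web", "application": "store", "wait_event": "IO read", "sql": "SELECT orders"},
]

print(top_sql(samples, application="call-center"))
# [('SELECT history', 2), ('UPDATE tickets', 1)]
print(top_sql(samples, user="cc", wait_event="Lock"))
# [('UPDATE tickets', 1)]
```

Combining filters, as in the second call, narrows the view to one user's lock waits, which mirrors the blocking-lock example above.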

Amazon used the application name dimension in Performance Insights to quickly identify the specific microservices whose poorly optimized queries were driving database load. This allowed for more targeted tuning and scaling.

Ben explained that the supported dimensions vary by RDS database engine and evolve based on customer needs and suggestions. He encouraged attendees to propose useful dimensions they would like added. The RDS team at AWS attaches great importance to customer requests for expanded monitoring capabilities.

After outlining the fundamentals of RDS monitoring, Ben handed the stage over to Maxim Kachika to discuss the latest innovations in simplifying and automating database performance insights.

Maxim first explained that the improvements focus on three areas: unifying the interface, highlighting important telemetry, and introducing automated advisors. He began by detailing the effort to build a unified database telemetry interface, a "single pane of glass". In the past, CloudWatch, Enhanced Monitoring, and Performance Insights each had a separate console. With over 100 unique RDS metrics spread across these consoles, it was often unclear which metrics were most important and relevant.

To solve this problem, AWS now offers pre-built, curated database health dashboards that contain the most useful RDS metrics. These dashboards cover key areas such as compute, memory, storage, I/O, and SQL performance. The goal is to provide a single starting point with the most important metrics.

Maxim noted that these predefined dashboards are available for all RDS engines. Users can still build fully customized dashboards from the underlying telemetry data to suit their specific needs.

Among recent enhancements, Maxim highlighted that Performance Insights metrics can now be published directly to Amazon CloudWatch. This makes CloudWatch's rich graph customization available, such as configuring metrics with thresholds, changing chart types, and adding annotations.

Annotations improve collaboration by communicating insights, actions taken, and progress toward resolving performance issues across teams over time. A real customer example is DevOps teams marking incident postmortems on dashboard charts to inform other teams of the root cause and prevent future recurrences.

Another benefit of CloudWatch is the ability to build derived metrics with mathematical expressions. For example, the buffer cache hit ratio is not emitted directly by RDS, but it can be easily calculated from the base metrics using simple arithmetic. The completed dashboard can be shared with others in the organization, such as the development team or leadership.
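
The arithmetic behind such a derived metric is short. The sketch below assumes two base counters, blocks served from the buffer cache and blocks read from disk (PostgreSQL exposes comparable `blks_hit` and `blks_read` counters); the function and parameter names are illustrative, and other engines expose different counters.

```python
def buffer_cache_hit_ratio(blocks_hit, blocks_read):
    """Percentage of block requests served from the buffer cache.

    blocks_hit:  blocks found in the cache during the period
    blocks_read: blocks that had to be read from disk during the period
    Returns None when there was no activity, to avoid dividing by zero.
    """
    total = blocks_hit + blocks_read
    if total == 0:
        return None
    return 100.0 * blocks_hit / total

print(buffer_cache_hit_ratio(blocks_hit=990, blocks_read=10))  # 99.0
```

The same expression, hits divided by hits plus reads, is the kind of formula that can be entered as a math expression over the two base metrics on a dashboard.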

CloudWatch has long been known for its rich alarm capabilities, but historically these were limited to native CloudWatch metrics. That changed in September 2022: alarms can now be set on all RDS telemetry, including Performance Insights and Enhanced Monitoring metrics. The easiest way to use this feature is to select "Edit in CloudWatch" on a Performance Insights chart, which brings the metrics into CloudWatch. After that, you can configure the alarm just as you would for any other CloudWatch metric. This feature enables Capital One to set alarms on its key database metrics and proactively identify potential issues.

Maxim also covered on-demand performance analysis. This automated capability analyzes database performance for a user-selected time range, removing the need for manual investigation. It summarizes the cause of a problem based on wait events, points out the queries that contribute most, and highlights related resource anomalies, so you can quickly determine whether the database was behaving normally during that period. It also links to relevant documentation and suggests next troubleshooting steps for commonly detected issues.

One key customer uses on-demand performance analysis to automatically diagnose intermittent performance fluctuations, which previously required tedious manual analysis. The report now provides root cause details for these issues, allowing for more efficient resolution.

In addition to on-demand performance analysis, Amazon DevOps Guru is another service worth noting. It continuously monitors metrics and alerts on abnormal patterns across the entire application environment. For RDS, it can detect database performance deviations and provide detailed analysis reports, similar to on-demand performance analysis.

Finally, let's look at a live troubleshooting demonstration. In this demo, Maxim sets up an online e-commerce site that shows customers their recent orders. This page issues key queries to the backend RDS database to populate order history. Because this workflow is so important, the company has configured extensive monitoring and alerting to be notified when issues arise.

In the demo, Maxim took on the role of an on-call engineer who receives alerts about slow database responses and some request timeouts, which are visibly degrading website performance. The demonstration shows how to use a variety of tools to find the root cause of the problem, identify what is responsible, and determine how to fix it.

First, we need to pinpoint the source of the performance degradation. The Performance Insights database load graph showed a clear and significant spike in activity, indicating that the database was not behaving normally. To understand the situation better, Maxim clicked the new "Analyze Performance" button to get an automated report for that time period.

The analysis confirmed that the spike was abnormal and was dominated by active sessions waiting on I/O, in particular the io_datafile_read wait event, which indicates an I/O bottleneck when reading data files. The report also flagged an unexpected spike in database connections at the same time.

The root cause is therefore an I/O resource issue: a surge in database connections pushed I/O demand past the IOPS limit. Compared with a tedious manual review of metrics, the automated performance analysis report delivers insight much faster.

Next, we need to identify the source of the additional load. The top SQL view in Performance Insights showed that the query in question was not the expected order history query but another query that was unfamiliar to the application team. The application name dimension tied these mystery queries to a new internal call center application that was causing the load spike. Using this dimension makes it straightforward to attribute a load increase to its source.

Finally, we explore some possible solutions. Tuning the call center queries did not produce immediate results. Increasing IOPS capacity can relieve the contention. A larger RDS instance, with more memory, may keep both workloads cached and avoid pushing them to disk. Alternatively, the call center queries can be offloaded to read replicas, or caching can be used to reduce database load.

Overall, Performance Insights enables us to identify sources of performance degradation, understand root causes, and provide evidence-based solutions to underlying issues. Automated capabilities allow us to achieve fast resolution times without the need for tedious manual analysis.

Ben emphasized that comprehensive visibility tools, such as CloudWatch anomaly detection, Performance Insights integrated with CloudWatch, and automated performance diagnostics, allow companies to identify and prevent potential performance issues before they have a noticeable impact on customers and the business. He encouraged attendees to pick the right specialized tool for each job.

Emphasized the importance of efficiently identifying database performance bottlenecks using the visualization capabilities of purpose-built monitoring tools.

Gave an overview of the fully managed benefits of Amazon RDS and Aurora, such as automated provisioning, backups, and upgrades.
Introduced the levels of monitoring that RDS provides: instance monitoring through CloudWatch, operating system monitoring through Enhanced Monitoring, and database engine monitoring through Performance Insights.
Explained how Performance Insights gives customers insight into the root causes of problems by measuring database load, collecting detailed query data, and analyzing wait events and other dimensions.
Shared real customer cases showing how RDS monitoring tools (such as CloudWatch dynamic thresholds, Enhanced Monitoring OS metrics, and Performance Insights wait events and dimensions) solve real problems.
Introduced the latest AWS innovations, such as on-demand performance analysis, designed to simplify and automate the process of assessing overall database health and performance.
Provided a practical end-to-end demonstration of how to use Performance Insights to quickly identify the source, cause, and resolution of a performance degradation.
Emphasized the importance of proactive monitoring and automated diagnosis to prevent customer or business impact.

Here are some highlights from the speech:

Screenshot: a slide discussing Amazon RDS and Aurora database performance monitoring and diagnostic tools.

Performance Insights focuses on tracking the most impactful database queries to identify potential performance bottlenecks, rather than auditing every query.

Performance Insights clearly shows changes in database load over time by assigning colors to different types of wait events.

The screenshot from Performance Insights shows that green represents CPU utilization and red represents blocking, which is helpful for understanding database performance issues.

The speaker explained that by comparing average active sessions and the number of vCPUs in an instance, you can find where the CPU bottleneck is.

Performance Insights helps customers adjust instance size to achieve the best cost-effectiveness by analyzing database load and instance size.

The leadership team collects feedback from attendees through questionnaires to improve the quality of AWS services.

Summary

This session covered monitoring and troubleshooting database performance issues in Amazon RDS. The speakers emphasized that visibility is crucial: if you can't see the problem, you can't solve it. They outlined approaches to monitoring at the instance, operating system, and database engine levels using a range of AWS tools. Key takeaways: monitor instance metrics with CloudWatch, enable Enhanced Monitoring for OS-level visibility, and use Performance Insights for deep SQL query analysis.

Performance Insights provides valuable dimensions, such as wait events, for identifying bottlenecks. The speakers demonstrated how these tools can be used together to troubleshoot a hypothetical performance issue caused by a workload spike. They explained several possible solutions, such as increasing IOPS, resizing the instance, and using caching.

In summary, this session empowers you to actively monitor your database environment and quickly diagnose and resolve performance bottlenecks using the rich visibility provided by Amazon Web Services.

Original speech

Take a load off: Diagnose & resolve performance issues with Amazon RDS - CSDN Blog

Who is Amazon Web Services?

Amazon Web Services (AWS) is a pioneer and leader in global cloud computing. Since 2006, it has been known in the industry for continuous innovation, technology leadership, a rich set of services, and broad adoption. AWS can support almost any workload in the cloud. It currently provides more than 200 fully featured services covering compute, storage, networking, databases, data analytics, robotics, machine learning and artificial intelligence, Internet of Things, mobile, security, hybrid cloud, virtual and augmented reality, media, and application development, deployment, and management. Its infrastructure covers 99 Availability Zones in 31 geographic regions, with plans for 4 new regions and 12 more Availability Zones. Millions of customers around the world, from startups and small and medium-sized businesses to large enterprises and government agencies, trust AWS. They use its services to strengthen their infrastructure, improve agility, reduce costs, accelerate innovation, enhance competitiveness, and achieve business growth and success.