Unified Observation | Best practices for monitoring Memcached using Prometheus

Author: Weiwei

Introduction to Memcached

What is Memcached?

Memcached is a free, open-source, high-performance, distributed in-memory object caching system that stores data of any type as key-value pairs. Memcached is essentially general-purpose, but it was originally designed to cache frequently accessed static data, reducing database load and accelerating dynamic web applications.

Memcached Features

  • Memory storage

    All data in Memcached is stored in memory. Compared with persistent databases such as PostgreSQL and MySQL, reads and writes never need to round-trip to disk, so Memcached offers memory-level response times (<1 ms) and very high throughput.

  • Distributed

    A distributed, multi-threaded architecture makes Memcached easy to scale. Data can be spread across multiple nodes, so capacity grows simply by adding nodes to the cluster. In addition, Memcached takes advantage of a node's multiple cores, using multi-threading to improve processing speed.

  • Key-value storage

    Memcached stores data of any type, supports caching any frequently accessed data, and suits a wide range of application scenarios. Complex operations such as joins are deliberately unsupported, which keeps queries efficient.

  • Simplicity and usability

    Memcached provides client libraries for many languages, such as Java, C/C++, and Go.

Typical usage scenarios

Suitable scenarios

Cache

Memcached's initial use case was caching static web data (HTML, CSS, images, sessions, etc.), which is read frequently and modified rarely. When users visit a page, the static data can be served from memory first, improving the site's response time. Like any cache, Memcached does not guarantee that data will still be present on every access; it offers the possibility of short-term acceleration.

Database front-end

Memcached can be used as a high-performance in-memory cache sitting between the client and the persistent database, reducing the number of accesses to the slower database and the load on the back-end system. In this role, Memcached effectively holds a copy of hot database data. On reads, check Memcached first; if the data is present, return it directly, otherwise fall through to the database. On writes, one common pattern is to write to Memcached and return immediately, flushing to the database later during idle time.

For example: social apps often have features such as "latest comments" and "hottest comments" that require heavy computation over large amounts of data and are called frequently. Using only a relational database would mean frequent disk reads and writes. Much of the data in such operations is reused, so holding it briefly in memory can greatly speed up the whole process. Suppose the last "hottest comments" computation read 1,000 comments and 800 of them stayed resident in Memcached; the next computation only needs to fetch the remaining 200 from the database, saving 80% of the database reads.
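
The read-first pattern described above can be sketched in a few lines. A plain dict stands in for a Memcached client here (a real client such as pymemcache exposes equivalent get/set calls), and `db` is a hypothetical stand-in for the persistent database:

```python
# Minimal cache-aside sketch: `self.cache` stands in for Memcached,
# `self.db` for the persistent database (MySQL/PostgreSQL/...).

class CacheAside:
    def __init__(self, db):
        self.cache = {}      # stand-in for Memcached
        self.db = db         # stand-in for the persistent database
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self.cache:        # 1) check Memcached first
            self.hits += 1
            return self.cache[key]
        self.misses += 1
        value = self.db[key]         # 2) on a miss, read the database
        self.cache[key] = value      # 3) populate the cache for next time
        return value

    def set(self, key, value):
        self.cache[key] = value      # write the cache and return quickly;
        self.db[key] = value         # a write-behind system would defer this
```

The second read of the same key is served entirely from memory, which is exactly where the hit-rate metrics discussed later come from.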

Large data volumes with high-frequency reads and writes

Memcached is a pure in-memory store, so its access speed far exceeds that of persistent storage. As a distributed database, Memcached also supports horizontal scaling, spreading high-load requests across multiple nodes to provide a highly concurrent access model.

Unsuitable scenarios

Cached objects are too large

Due to its storage design, the official recommendation is that cached objects not exceed 1 MB. Memcached itself is not designed to store and process large media files or to stream huge binary blobs.

Workloads that traverse data

Memcached deliberately supports only a small set of commands (get, set, add, incr, decr, delete, cas, etc.) and does not support traversing the stored data. Memcached is designed to complete every read and write in constant time, whereas traversal time grows with the amount of data and would slow down the execution of other commands, which conflicts with its design philosophy.

High availability and fault tolerance

Memcached does not guarantee that cached data will not be lost. On the contrary, the very notion of a hit rate implies that some expected data loss is acceptable. When you need disaster-recovery backup, automatic partitioning, failover, and similar guarantees, it is best to store the data in a persistent database such as MySQL.

Memcached core concepts

Memory management

The figure below illustrates Memcached's memory management:

  • Slab

    To prevent memory fragmentation, Memcached manages memory with slabs. Memory is divided into multiple regions called slab classes; when storing data, Memcached selects a slab class based on the data size, and each class is responsible only for items within a certain size range. For example, the slab class in the figure stores only values of 1001–2000 bytes. By default, each slab class's maximum item size is 1.25 times that of the previous class; this growth factor can be changed with the -f parameter.

  • Page

    A slab class is composed of pages. A page has a fixed size of 1 MB, which can be changed at startup with the -I parameter. When more memory is needed, Memcached carves out a new page and assigns it to the slab class that requires it. Once a page is assigned, it is not reclaimed or reassigned until the process restarts.

  • Chunk

    When storing data, Memcached allocates a fixed-size chunk inside a page and places the value in it. Note that regardless of the size of the inserted data, the chunk size stays fixed: every item occupies a chunk exclusively, and two smaller items are never packed into the same chunk. If an item does not fit in a class's chunk, it must be stored in a chunk of the next larger slab class, wasting the unused space.
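
The slab-class sizing described above can be illustrated with a small calculation. The defaults below (80-byte minimum chunk, 1.25 growth factor, 1 MB page) mirror memcached's -n/-f/-I options, though memcached's internal rounding (8-byte alignment) differs slightly, so this is a sketch rather than the exact implementation:

```python
def slab_chunk_sizes(base=80, factor=1.25, max_size=1024 * 1024):
    """Chunk sizes of successive slab classes: each class's chunk is
    `factor` times the previous one, capped by the 1 MB page size."""
    sizes, size = [], base
    while size <= max_size:
        sizes.append(size)
        size = int(size * factor)
    return sizes


def pick_slab(sizes, item_size):
    """An item goes into the first slab class whose chunk can hold it;
    the chunk is occupied exclusively, so the difference is wasted."""
    for chunk in sizes:
        if item_size <= chunk:
            return chunk
    return None  # larger than the maximum storable item
```

For instance, with the defaults a 90-byte item lands in the 100-byte class and wastes 10 bytes of its chunk, which is the trade-off that slab tuning (discussed below) tries to minimize.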

LRU

As shown in the figure, Memcached uses an improved LRU mechanism to manage items in memory.

  • HOT

    Because new items may exhibit strong temporal locality or have very short TTLs (time to live), items never move around within HOT. Once an item reaches the tail of the queue, it is moved to WARM if it is active (3), or to COLD if it is inactive (5).

  • WARM

    WARM acts as a buffer against scanning workloads, such as web crawlers reading old posts. Items that have never been hit twice cannot enter WARM. A WARM item has a better chance of surviving its TTL, and keeping it here also reduces lock contention. If the tail item is active, it is moved back to the head (4); otherwise, the inactive item is moved to COLD (7).

  • COLD

    COLD contains the least active items. Inactive items flow into COLD from HOT (5) and WARM (7). Once memory is full, items are evicted from the tail of COLD. If a COLD item is active, it is queued to be moved asynchronously to WARM (6). Under a burst of hits on COLD, the queue may overflow and items then simply remain inactive. Under overload, moves out of COLD become probabilistic so that worker threads are never blocked.

  • TEMP

    TEMP acts as a queue for new items with very short TTLs (2) (typically a few seconds). Items in TEMP are never bumped or moved to other LRUs, saving CPU and lock contention. This feature is not enabled by default.

The LRU crawler traverses each LRU concurrently from tail to head, checks whether each item it passes has expired, and reclaims it if so.

Main version introduction

  • 1.6.0: external flash storage (extstore), more protocols, network optimizations
  • 1.5.0: improved LRU implementation, storage optimizations
  • 1.4.0: binary protocol support, client implementations in C, Java, and other languages
  • Using the latest stable version is recommended

Monitor key metrics

Next, we introduce the key metrics for monitoring open-source Memcached with Prometheus.

System metrics

Operating status

Up status is the most basic metric for monitoring Memcached, indicating whether the instance is running normally or has restarted. When Memcached goes down, overall functionality may be barely affected, because data access falls through to the underlying persistent database, but losing the cache can reduce the system's operating efficiency by an order of magnitude or more.

Checking Memcached's uptime helps verify whether it restarted. Because Memcached keeps all data in memory, a restart loses all cached data. The hit rate then drops sharply, producing a cache avalanche that puts heavy pressure on the underlying database and degrades system efficiency.
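
With metrics from the prometheus/memcached_exporter, the up/restart checks above can be written as PromQL expressions; the 5-minute window is an illustrative threshold, not an official default:

```promql
# Instance is down (the exporter cannot reach memcached)
memcached_up == 0

# Instance restarted recently: uptime below 5 minutes
memcached_uptime_seconds < 300
```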

Memory usage

As a high-performance cache, Memcached needs to make full use of the node's hardware resources to provide fast storage and query services. If the node's resource usage exceeds expectations or reaches its limit, performance may degrade or the system may crash, disrupting normal business operation. Memcached keeps its data in memory, so memory usage deserves particular attention.

When Memcached's memory usage is too high, it may affect the node's other workloads. Consider whether the key design is sound, whether to add hardware resources, and other optimizations.
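
Assuming the exporter's metric names, memory usage can be expressed as the ratio of bytes used for items to the configured limit; 0.9 here is an illustrative threshold:

```promql
# Fraction of the configured memory limit currently used for items
memcached_current_bytes / memcached_limit_bytes

# Sustained usage above 90% deserves attention
memcached_current_bytes / memcached_limit_bytes > 0.9
```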

Read/write metrics

Read and write speed

Read and write rates are important measures of Memcached cluster performance. High read/write latency can cause long system response times, high node load, and system bottlenecks. The read/write metrics give an overview of Memcached's operating efficiency; when latency is found to be high, operators should consult other monitoring data to locate the problem. Slow reads and writes have many possible causes, such as a low hit rate or tight node resources, and different problems call for different troubleshooting and optimization measures.

Command rate

Memcached supports a variety of commands, such as set, get, delete, cas, and incr. Monitoring each command's rate helps identify Memcached's bottlenecks. When a particular command's rate is low and drags down overall throughput, consider adjusting the item storage strategy or the access pattern to improve performance.
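
With the exporter's memcached_commands_total counter (labeled by command), these rates are one rate() away; the 5-minute range is an arbitrary choice:

```promql
# Overall request rate (QPS) across all commands
sum(rate(memcached_commands_total[5m]))

# Per-command rate, to spot a bottleneck command
sum by (command) (rate(memcached_commands_total[5m]))
```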

Hit rate

Caching improves query performance and efficiency and reduces the number of disk reads, thereby improving the system's response speed and throughput. The hit rate is Memcached's most important metric. A hit means that when the upper-layer application fetches some data, it obtains it from Memcached without touching the underlying database. A higher hit rate indicates that most data accesses are served from memory.

In systems that use a cache, cache avalanche, cache penetration, cache breakdown, and similar problems can cause a very low hit rate for a period of time. System performance then suffers badly, because most data accesses fall through to disk and the underlying database comes under heavy pressure. Watch hit-rate trends constantly to ensure Memcached plays its intended role. For example: a website suddenly becomes very slow during a certain period, and the hit rate during that window is found to be very low; the logs then reveal that a business change turned a rarely accessed page into a frequently accessed one — cache penetration has been detected. Diagnosing a low hit rate requires combining other metrics, such as the slab metrics below.
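
The hit rate itself can be derived from the commands counter, whose status label distinguishes hits from misses (a sketch assuming the prometheus/memcached_exporter metric names):

```promql
# get hit rate over the last 5 minutes
sum(rate(memcached_commands_total{command="get", status="hit"}[5m]))
/
sum(rate(memcached_commands_total{command="get"}[5m]))
```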

Slab metrics

As a key-value database, Memcached calls each key-value pair in memory an item. Understanding how items are stored lets you optimize memory efficiency and improve the hit rate.

Item storage

What is generally monitored is: the total number of items stored by Memcached, the total number reclaimed, and the total number evicted. The trends in these totals reveal the storage pressure on Memcached and the storage patterns of the applications using it.

The difference between eviction and reclaiming: when an item must be evicted, Memcached first looks at a few items near the LRU tail for expired items that can be reclaimed, reusing their memory instead of evicting the actual tail item.
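
These totals translate directly into rates; comparing the eviction rate against the current item count gives a feel for storage pressure (metric names from the prometheus/memcached_exporter):

```promql
# Items currently stored
memcached_current_items

# Eviction and reclaim rates over the last 5 minutes
rate(memcached_items_evicted_total[5m])
rate(memcached_items_reclaimed_total[5m])
```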

Slab usage

According to Memcached's design, items stored in the same slab class are similarly sized and each item occupies an exclusive chunk, which improves memory utilization. In practice, however, this design still leaves room for optimization.

Memcached calcification is a common problem. When memory reaches Memcached's limit, the server process runs its memory-reclaiming routines. But whatever the reclaiming strategy, it has one major precondition: only memory in the same slab class as the data about to be written can be reclaimed.

For example: we previously stored a large number of 64 KB items and now need to store a large number of 128 KB items. If the memory quota is exhausted and the 128 KB items are constantly updated, Memcached can only evict other 128 KB items to make room for new data. The original 64 KB items are never evicted and keep occupying memory until they expire, wasting space and ultimately lowering the hit rate.

To address calcification, you need the per-slab item distribution metrics. We provide this monitoring to compare item counts across slab classes. If a few slab classes store far more items than the rest, consider adjusting the slab sizing. The default growth factor is 1.25, i.e. each slab class's chunk size is 1.25 times the previous one. Lowering the growth factor and tuning the initial slab size so that items are distributed evenly across classes can effectively mitigate calcification and improve memory efficiency.
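
The per-slab distribution can be checked with the exporter's per-slab metrics, and the growth factor is set at startup; the flag values below are illustrative, not recommendations:

```promql
# Slab classes holding the most items: a heavily skewed distribution hints at calcification
topk(5, memcached_slab_current_items)

# Fraction of each slab class's chunks that sit unused
memcached_slab_chunks_free / (memcached_slab_chunks_free + memcached_slab_chunks_used)
```

If the distribution is skewed, restart memcached with a smaller growth factor and a tuned minimum chunk size, e.g. `memcached -n 48 -f 1.08`.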

LRU metrics

Item counts per region

In Memcached's LRU, each region serves a different purpose: items in HOT are the most recently stored, items in WARM are popular items, and items in COLD are the least active, next in line for eviction. Knowing the item counts per region gives deeper insight into Memcached's state and informs tuning. Some examples:

Few items in HOT and more items in the other regions indicates few newly created items: the data in Memcached is rarely updated and the workload is mainly reads. Conversely, many items in HOT indicates many newly created items and a write-heavy workload.

More items in WARM than in COLD means items are being hit often, which is the ideal situation. The opposite indicates a low hit rate, and optimization should be considered.

Item movement counts

Whether an item is hit determines which LRU region it lives in. Following the LRU design, monitoring item movement between regions reveals the data access patterns. Some examples:

A rising count of items moving from COLD to WARM indicates that items about to be evicted are being hit. When the value is too large, previously unpopular data has suddenly become hot; watch such situations to prevent the hit rate from dropping.

A large count of items moving from HOT to COLD indicates many items that are never accessed again after insertion. Such data may not need caching; consider writing it directly to the underlying database to relieve memory pressure.

A large item count in WARM indicates that many items are accessed after insertion. This is the ideal situation: the data stored in Memcached is accessed frequently, and the hit rate is high.
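
The movements above correspond to exporter counters such as:

```promql
# Items demoted to COLD (inactive HOT/WARM tail items)
rate(memcached_lru_crawler_moves_to_cold_total[5m])

# Items promoted to WARM (COLD items that were hit)
rate(memcached_lru_crawler_moves_to_warm_total[5m])
```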

Connection metrics

Connection status

Because Memcached uses an event-based architecture, a large number of clients does not usually slow it down; Memcached works well even with hundreds of thousands of connected clients. Still, monitoring the current number of client connections gives an overview of Memcached's workload.

Connection errors

Memcached limits the number of requests a single client connection can issue per event, set with the -R parameter at startup. When a client exceeds this value, the server yields to other clients before continuing with the original client's requests. Monitor whether this occurs to ensure clients use Memcached normally. Memcached's default maximum number of connections is 1024; operators should watch the connection count so that the limit is never exceeded and functionality is not affected.
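
Connection headroom and -R yields can be watched with expressions like:

```promql
# Fraction of the connection limit in use
memcached_current_connections / memcached_max_connections

# Connections rejected because the -c limit was hit
increase(memcached_connections_rejected_total[5m])

# Connections yielded for exceeding the -R per-event request limit
increase(memcached_connections_yielded_total[5m])
```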

Detailed metric definitions

System metrics

Metric name Description
memcached_process_system_cpu_seconds_total System CPU time consumed by the process
memcached_process_user_cpu_seconds_total User CPU time consumed by the process
memcached_limit_bytes The maximum number of bytes available for storage
memcached_time_seconds The server's current UNIX time
memcached_up Whether the Memcached instance is up
memcached_uptime_seconds The number of seconds since the server started
memcached_version The Memcached version

Read/write metrics

Metric name Description
memcached_read_bytes_total The total number of bytes read by the server from the network
memcached_written_bytes_total The total number of bytes written by the server to the network
memcached_commands_total The total number of requests handled by the server, classified by command
memcached_slab_commands_total The total number of requests, classified by command and slab class

Storage metrics

Slab storage metrics
Metric name Description
memcached_items_total The total number of items stored
memcached_items_evicted_total The number of evicted items
memcached_slab_items_reclaimed_total The number of expired items reclaimed, classified by slab
memcached_items_reclaimed_total The total number of expired items reclaimed
memcached_current_items The number of items currently stored by the instance
memcached_malloced_bytes The number of bytes allocated to slab pages
memcached_slab_chunk_size_bytes The chunk size of the slab class
memcached_slab_chunks_free The number of free chunks
memcached_slab_chunks_used The number of used chunks
memcached_slab_chunks_free_end The number of free chunks in the most recently allocated page
memcached_slab_chunks_per_page The number of chunks per page
memcached_slab_current_chunks The number of chunks in the slab class
memcached_slab_current_items The number of items in the slab class
memcached_slab_current_pages The number of pages in the slab class
memcached_slab_items_age_seconds The number of seconds since the oldest item in the slab class was last accessed
memcached_current_bytes The number of bytes currently used to store items
memcached_slab_items_outofmemory_total The number of items that triggered an out-of-memory error
memcached_slab_mem_requested_bytes The number of bytes requested to store items in the slab class
LRU storage metrics
Metric name Description
memcached_lru_crawler_enabled Whether the LRU crawler is enabled
memcached_lru_crawler_hot_max_factor The idle-age factor for moving items from HOT to WARM LRU
memcached_lru_crawler_warm_max_factor The idle-age factor for moving items from WARM to COLD LRU
memcached_lru_crawler_hot_percent The percentage of slab memory reserved for HOT LRU
memcached_lru_crawler_warm_percent The percentage of slab memory reserved for WARM LRU
memcached_lru_crawler_items_checked_total The number of items checked by the LRU crawler
memcached_lru_crawler_maintainer_thread Whether split-LRU mode and its background threads are enabled
memcached_lru_crawler_moves_to_cold_total The number of items moved into COLD LRU
memcached_lru_crawler_moves_to_warm_total The number of items moved into WARM LRU
memcached_slab_items_moves_to_cold The number of items moved into COLD LRU, classified by slab
memcached_slab_items_moves_to_warm The number of items moved into WARM LRU, classified by slab
memcached_lru_crawler_moves_within_lru_total The number of active items bumped within HOT or WARM LRU
memcached_slab_items_moves_within_lru The number of items bumped within HOT or WARM LRU on a hit, classified by slab
memcached_lru_crawler_reclaimed_total The number of items reclaimed by the LRU crawler
memcached_slab_items_crawler_reclaimed_total The number of items reclaimed by the LRU crawler, classified by slab
memcached_lru_crawler_sleep The LRU crawler's sleep interval
memcached_lru_crawler_starts_total The number of times the LRU crawler has started
memcached_lru_crawler_to_crawl The maximum number of items crawled per slab class
memcached_slab_cold_items The number of items in COLD LRU
memcached_slab_hot_items The number of items in HOT LRU
memcached_slab_hot_age_seconds The age of the oldest item in HOT LRU
memcached_slab_cold_age_seconds The age of the oldest item in COLD LRU
memcached_slab_warm_age_seconds The age of the oldest item in WARM LRU
memcached_slab_items_evicted_nonzero_total The total number of times an item with an explicitly set expiration time had to be evicted before expiring
memcached_slab_items_evicted_total The total number of items evicted from the slab's LRU
memcached_slab_items_evicted_unfetched_total The total number of items evicted that were never fetched
memcached_slab_items_tailrepairs_total The total number of times the tail item of a slab had to be forcibly freed (tail repairs)
memcached_slab_lru_hits_total The total number of LRU hits
memcached_slab_items_evicted_time_seconds The number of seconds since the most recently evicted item in this slab was last accessed
memcached_slab_warm_items The number of items in WARM LRU, classified by slab

Connection metrics

Metric name Description
memcached_connections_total The total number of connections accepted
memcached_connections_yielded_total The total number of connections yielded for hitting the memcached -R limit
memcached_current_connections The number of open connections
memcached_connections_listener_disabled_total The total number of times the connection limit was hit and the listener was disabled
memcached_connections_rejected_total The number of rejected connections (the memcached -c limit was hit)
memcached_max_connections The maximum number of client connections

Monitoring dashboard

We provide a Memcached Overview dashboard by default.

Overview

This panel shows the metrics that deserve the most attention while Memcached is running. When checking Memcached's status, first look for abnormal states in the overview, then drill into the specific metrics.

  • Up status: green means running normally, i.e. Memcached runtime metrics can be queried; red means Memcached is abnormal
  • Memory usage: color-coded red/yellow/green; below 80% is green, 80%–90% is yellow, above 90% is red
  • Hit rate: color-coded red/yellow/green; below 10% is red, 10%–30% is yellow, above 30% is green

Performance

The following panels show Memcached's operating speed, in three categories.

  • QPS: total number of commands handled per second
  • Command rate: number of commands handled per second, broken down by Memcached command (set, get, etc.)
  • Read/write rate: amount of data read and written per second

Hit rate

The hit rate is a metric that deserves close attention; the following three panels detail Memcached's hit behavior.

  • Overall hit rate: the overall hit-rate trend
  • Per-command hit rate: the hit-rate trend of each command; usually the get hit rate matters most
  • Slab regions hit: which slab region the hits fall in

Items

Items represent Memcached's storage state; use the following panels to check Memcached's memory usage.

  • Total items: the trend of the number of items Memcached stores
  • Items per slab: changes in item counts across slab classes
  • Slab usage and slab size: each slab class's utilization and size, useful for spotting calcification
  • Item movement rate between LRU regions: inspect item movement to diagnose hot-spot and hit-rate issues
  • Reclaimed/expired/evicted items: when memory is insufficient, some items must be removed from Memcached

Memory

Memory is the hardware resource Memcached depends on most; this panel shows Memcached's memory usage:

  • Maximum memory: the overall memory status
  • Memory usage rate/amount: trends in memory consumption
  • OOM count: check whether out-of-memory errors have occurred

Network

Although Memcached handles high concurrency with little overhead, network resources are not unlimited and need constant attention:

  • Maximum connections and connection usage: watch the trend and the gauge to check whether the connection count matches expectations
  • Rejected connections: check whether connections have been rejected, to keep clients running normally
  • Over-limit request connections: watch this trend to catch misbehaving clients

Handles

Handles are a system-level metric involving network connections and other resources.

Key alert rules

When configuring alert rules for Memcached, we recommend basing them on the metrics collected above and covering three areas: operating status, resource usage, and connection usage. In general, we generate by default the higher-priority alert rules that affect Memcached's normal operation; business-related alerts such as read/write rates are left to user customization. Below are some recommended alert rules.

Operating status

Memcached down

"Memcached down" is a 0/1-threshold alert. Generally, Memcached services deployed in environments such as Alibaba Cloud ACK are highly available: when one Memcached instance stops, the others keep working. But a program error may prevent an instance from being redeployed, which is a very serious situation. By default we alert when Memcached fails to recover within 5 minutes.

Memcached restarted

For other services, an instance restart is not a problem. Memcached, however, is a typical stateful service whose data lives in memory. When an instance restarts, all cached data is lost; the hit rate drops and system performance suffers. We therefore recommend alerting on Memcached restarts, and investigating each restart to prevent a recurrence.

Resource usage

Memory usage rate

When memory usage is too high, Memcached cannot run normally. Our memory-usage thresholds are: warning at 80%, alert at 90%. At 80%, the node runs under heavy load, but normal use is generally unaffected. When usage stays at 90% for an extended period, an alert fires, telling operators that resources are scarce and should be dealt with as soon as possible.

Item-triggered OOM

When memory demand exceeds the node's total memory, an out-of-memory (OOM) condition can flush the cache entirely, disrupting your application and business. The closer the node's memory utilization gets to 100%, the more likely an OOM becomes. We configure a 0/1-threshold alert that fires as soon as an OOM occurs.

Connection usage

Rejected connections

Normally a growing connection count does not affect Memcached, unless the maximum-connections limit is exceeded, at which point clients can no longer obtain service. We configure a 0/1-threshold alert that fires as soon as a connection is rejected, to keep clients running normally.

Too many connection requests

We configure a 0/1-threshold alert that fires as soon as a connection issues too many requests, to keep clients running normally. In this situation, Memcached temporarily stops accepting commands from that connection.
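
The recommendations above can be sketched as a Prometheus alerting-rules file; the thresholds and durations follow the values discussed in this section and should be adapted to your environment:

```yaml
groups:
  - name: memcached-alerts
    rules:
      - alert: MemcachedDown
        expr: memcached_up == 0
        for: 5m                     # no recovery within 5 minutes
      - alert: MemcachedRestarted
        expr: memcached_uptime_seconds < 300
      - alert: MemcachedHighMemoryUsage
        expr: memcached_current_bytes / memcached_limit_bytes > 0.9
        for: 10m                    # sustained usage above 90%
      - alert: MemcachedOutOfMemoryErrors
        expr: increase(memcached_slab_items_outofmemory_total[5m]) > 0
      - alert: MemcachedConnectionsRejected
        expr: increase(memcached_connections_rejected_total[5m]) > 0
      - alert: MemcachedConnectionsYielded
        expr: increase(memcached_connections_yielded_total[5m]) > 0
```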

Practical examples

Low hit rate

A low hit rate has many possible causes; several metrics must be examined together.

Check memory usage

  • Cause: when memory is insufficient, Memcached cannot hold enough items, and items that should be hot are evicted for lack of memory.
  • Diagnosis: check the memory-usage panel on the dashboard to see whether usage has been persistently high, and review the alert history for memory-shortage alerts.
  • Fix: add memory to the affected node.

Check item status

  • Cause: a low hit rate caused by insufficient memory also shows up in the item metrics. The item metrics also reveal whether the cache-item design is reasonable.
  • Diagnosis: check the total item count, evicted item count, and reclaimed item count; normally, evictions and reclaims should be a small fraction of the total.
  • Fix: add memory to the affected node; design the caching strategy to favor hot data.

Check items in each LRU region

  • Cause: the stored items may not be needed later, or some cold data has suddenly become hot.

  • Diagnosis: check the LRU-region item metrics. A large count of items moving from HOT to COLD indicates that many items are never accessed after insertion. A growing count of items moving from COLD to WARM indicates that items about to be evicted are being hit; when that value is too large, some cold data has suddenly become hot.

  • Fix: design the caching strategy to favor hot data, and optimize so that data does not flip between cold and hot states.

High memory usage

Check memory usage trends

  • Cause: the node's memory is normally sufficient, but traffic bursts in certain periods push memory usage up suddenly.
  • Diagnosis: check the dashboard's memory-usage trend for sudden increases, and correlate those time windows with business traffic.
  • Fix: configure horizontal scaling for the memcached instances so that node resources are elastic.

Check slab usage

  • Cause: storage calcification has occurred, with items in certain slab classes occupying memory indefinitely.
  • Diagnosis: check per-slab usage to see whether a few slab classes store a great deal while others store very little.
  • Fix: adjust the slab sizing strategy (initial slab size and growth factor) so that items are distributed evenly across slab classes.

Building the monitoring stack

Pain points of self-hosted Prometheus monitoring for Memcached:

Memcached is usually deployed on ECS, so when monitoring it with self-hosted Prometheus we typically face these problems:

  1. For security and organizational reasons, business workloads are usually deployed in multiple mutually isolated VPCs, each of which needs its own independent Prometheus deployment, driving up deployment and operations costs.

  2. Each complete self-hosted monitoring stack requires installing and configuring Prometheus, Grafana, AlertManager, and so on; the process is complex and implementation takes a long time.

  3. There is no service-discovery mechanism seamlessly integrated with Alibaba Cloud ECS, so scrape targets cannot be defined flexibly based on ECS tags. Implementing it yourself means writing Go code (calling the Alibaba Cloud ECS POP API), integrating it into the open-source Prometheus codebase, then compiling, packaging, and deploying: a high bar, a complex process, and difficult version upgrades.

  4. Common open-source Grafana dashboards for Memcached are not professional enough, lacking deep optimization based on Memcached internals and best practices.

  5. There is no Memcached alert-rule template, so users must research and configure alerts themselves, which is a lot of work.

Monitoring self-hosted Memcached with Alibaba Cloud Prometheus:

  • Log in to the ARMS console [ 1]
  • In the left navigation pane, choose Prometheus Monitoring > Prometheus Instances to open the instance list of Managed Service for Prometheus.
  • Click the name of the target Prometheus instance to open the Integration Center page.
  • Click Install on the Memcached card.

  • Configure the relevant parameters and click OK to finish integrating the component.

Installed components appear in the Installed area of the Integration Center page. Click a component's card to view its Targets, metrics, dashboards, alerts, service-discovery configuration, exporter information, and more in the panel that appears.

As shown below, you can see the key alert metrics currently provided by Managed Service for Prometheus:

On the Dashboards tab, click a dashboard thumbnail to view the corresponding Grafana dashboard.

On the Alerts tab of the panel, you can view Memcached's Prometheus alerts, and you can add alert rules to match business needs. For details about creating Prometheus alert rules, see Prometheus alert rules [ 2].

Comparison of self-hosted Prometheus versus Alibaba Cloud Managed Service for Prometheus for monitoring Memcached:


Related links:

[1] ARMS console

https://account.aliyun.com/login/login.htm?oauth_callback=https%3A%2F%2Farms.console.aliyun.com%2F#/home

[2] Prometheus alert rules

https://help.aliyun.com/zh/arms/prometheus-monitoring/create-alert-rules-for-prometheus-instances#task-2121615

Origin my.oschina.net/u/3874284/blog/10321575