Scraping Docker Swarm service replicas with Prometheus

In this blog post, I will show how to collect metrics from all replicas of a Docker swarm service by introducing an intermediate Prometheus instance inside the swarm and combining a few existing Prometheus features (mainly dns_sd_configs and cross-service federation).

In a Docker swarm cluster, applications run as services. To external hosts (that is, everything outside the swarm cluster), a service looks like a single instance reachable through a published port. But inside the swarm, there are usually multiple instances (replicas) running the service. When a request arrives at the published service port, the Docker network routes it to one of the running replicas. As the caller, you have no way of knowing which specific instance of the service you were routed to.

If you want a Prometheus server running outside the Docker swarm to scrape metrics for a service, the easiest way is to let it scrape the published service port. But if the service runs in replicated mode with multiple instances, you will not reliably reach a specific instance: the scrape request ends up at the Docker network load balancer, which forwards it to one of the running replicas. So the data you get are the metrics of one of the service instances (you don't know which one). Because Prometheus scrapes the service periodically, and each scrape request is routed independently of the previous one, the next scrape request may be routed to and answered by a different service instance returning that instance's metrics, and so on. In the worst case, Prometheus gets a different set of metrics on every scrape and never gives you coherent data.

If Prometheus were aware of the individual service instances and could scrape them individually, it would add an instance label to the metrics and thereby store a separate time series per metric and instance. But Docker swarm does a good job of hiding these details from Prometheus, at least from outside the swarm. However, if you run Prometheus itself as a service inside the Docker swarm, you can use its dns_sd_configs feature together with Docker swarm DNS service discovery to scrape all instances individually. Combined with Prometheus's cross-service federation feature, you can then pull these service instance metrics into a Prometheus server running outside the swarm.
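To illustrate what this buys you, here is a hypothetical excerpt of what individually scraped replicas look like in Prometheus (metric name taken from the sample output later in this post; the values and IP addresses are made up). The same metric becomes three separate time series, distinguished by the `instance` label:

```
jvm_classes_loaded_classes{job="swarm-service",instance="10.0.1.3:8080"} 7469.0
jvm_classes_loaded_classes{job="swarm-service",instance="10.0.1.4:8080"} 7512.0
jvm_classes_loaded_classes{job="swarm-service",instance="10.0.1.5:8080"} 7480.0
```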

In this blog post, I'll set up a local Docker swarm cluster running a sample service to demonstrate what this looks like.

Setting up a Docker swarm with a sample service

First, I initialize swarm mode for the local Docker instance (it can be deactivated again using docker swarm leave --force):

docker swarm init

I'm running Docker Desktop for Mac, so I don't need any other options here. For details on how to set up local swarm in other environments, see the docker swarm tutorial.

An important detail (which unfortunately doesn't seem to be covered by the Docker swarm documentation) is that Docker swarm DNS service discovery does not work over the default ingress overlay network. It took me quite a while to figure this out, until I found the answer in the Docker forums. So I start by creating a custom overlay network.

docker network create \
    --driver overlay \
    --attachable \
    custom-overlay-network

As a sample service, I used a Docker image containing a very basic Spring Boot application with the Actuator and Micrometer Prometheus plugins enabled.

docker service create \
    --name sample-service \
    --replicas 3 \
    --network custom-overlay-network \
    -p 8080:8080 \
    sample-app:latest

Listing all the Docker services running in my swarm, I can see that sample-service has three running instances.

docker service ls

    ID                  NAME                MODE                REPLICAS            IMAGE               PORTS
    kgjaw3vx1tnh        sample-service      replicated          3/3                 sample-app:latest   *:8080->8080/tcp

Port 8080 of my Spring Boot application is published, so I can also access the actuator's Prometheus metrics endpoint:

curl localhost:8080/actuator/prometheus

    # HELP jvm_gc_live_data_size_bytes Size of old generation memory pool after a full GC
    # TYPE jvm_gc_live_data_size_bytes gauge
    jvm_gc_live_data_size_bytes 0.0
    # HELP jvm_classes_loaded_classes The number of classes that are currently loaded in the Java virtual machine
    # TYPE jvm_classes_loaded_classes gauge
    jvm_classes_loaded_classes 7469.0
    ...

Since my Docker swarm only contains one manager node (my local machine), I can see the Docker containers of all three running replicas:

docker ps

    CONTAINER ID        IMAGE               COMMAND                  CREATED             STATUS              PORTS               NAMES
    bc26b66080f7        sample-app:latest   "java -Djava.securit…"   6 minutes ago       Up 6 minutes                            sample-service.3.hp0xkndw8mx9yoph24rhh60pl
    b4cb0a313b82        sample-app:latest   "java -Djava.securit…"   6 minutes ago       Up 6 minutes                            sample-service.2.iwbagkwjpx4m6exm4w7bsj5pd
    7621dd38374a        sample-app:latest   "java -Djava.securit…"   6 minutes ago       Up 6 minutes                            sample-service.1.1a208aiqnu5lttkg93j4dptbe

To see DNS service discovery in action, I connect to one of the containers running in the Docker swarm. I have to install the dnsutils package to be able to use nslookup.

docker exec -ti bc26b66080f7 /bin/sh

apt-get update && apt-get install dnsutils -y

Looking up the service name itself, I get a single virtual IP address:

nslookup sample-service

    Server:   127.0.0.11
    Address:  127.0.0.11#53

    Non-authoritative answer:
    Name:     sample-service
    Address:  10.0.1.2

To resolve the virtual IP addresses of all service replicas running in my Docker swarm, I have to look up the domain name tasks.<service name> (see the Docker overlay networking documentation):

nslookup tasks.sample-service

    Server:   127.0.0.11
    Address:  127.0.0.11#53

    Non-authoritative answer:
    Name:     tasks.sample-service
    Address:  10.0.1.4
    Name:     tasks.sample-service
    Address:  10.0.1.3
    Name:     tasks.sample-service
    Address:  10.0.1.5

This DNS service discovery feature is what a Prometheus instance running inside the Docker swarm can use to scrape all of these service instances (I will refer to this instance as swarm-prometheus in the rest of the text).

Scraping the service instances inside the swarm

To set up the swarm-prometheus service, I built a Docker image based on the latest official Prometheus image and added my own configuration file.

FROM prom/prometheus:latest
ADD prometheus.yml /etc/prometheus/

The interesting part of the config file is the swarm-service scrape job I added. I use a dns_sd_config (see the documentation for details) to find the scrape targets by performing a DNS query. I need a type A DNS query, and since that query only returns the IP addresses of the service instances, I have to tell Prometheus which port the instances listen on and the path of the metrics endpoint.

scrape_configs:
  ...
  - job_name: 'swarm-service'
    dns_sd_configs:
      - names:
          - 'tasks.sample-service'
        type: 'A'
        port: 8080
    metrics_path: '/actuator/prometheus'

After building the image, I created the swarm-prometheus service:

docker build -t swarm-prometheus .

docker service create \
    --replicas 1 \
    --name swarm-prometheus \
    --network custom-overlay-network \
    -p 9090:9090 \
    swarm-prometheus:latest

When I open the Prometheus web UI and navigate to "Status -> Targets", I can see that my configuration works as expected.

Figure 1 – Status of the configured scrape job in the swarm-prometheus web UI

Executing a basic query for one of the metrics written by the sample application, I get three resulting time series, one for each of my instances. The instance label added by the Prometheus scrape job contains the IP address and port of the corresponding service instance.
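As an example, a PromQL query like the following (using one of the Micrometer metrics from the sample output earlier; the exact metric name depends on your application) returns one series per replica:

```
jvm_classes_loaded_classes{job="swarm-service"}
```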

Figure 2 – Basic Prometheus query with three resulting time series

At this point, the metrics of all my service instances are collected in swarm-prometheus. As a next step, I want to get them into a Prometheus server running outside the swarm (I'll refer to it as host-prometheus here).

Using federation to scrape metrics from another Prometheus

Prometheus provides a /federate endpoint that can be used to scrape a selected set of time series from another Prometheus instance (see the documentation for details). The endpoint requires one or more instant vector selectors to specify the requested time series.

I call the /federate endpoint of swarm-prometheus and query all time series collected by the swarm-service scrape job (I use curl with the -G and --data-urlencode options to be able to pass unencoded parameter values):

curl -G "http://localhost:9090/federate" --data-urlencode 'match[]={job="swarm-service"}'

    # TYPE jvm_buffer_count_buffers untyped
    jvm_buffer_count_buffers{id="direct",instance="10.0.1.3:8080",job="swarm-service"} 10 1586866971856
    jvm_buffer_count_buffers{id="direct",instance="10.0.1.4:8080",job="swarm-service"} 10 1586866975100
    jvm_buffer_count_buffers{id="direct",instance="10.0.1.5:8080",job="swarm-service"} 10 1586866976176
    jvm_buffer_count_buffers{id="mapped",instance="10.0.1.3:8080",job="swarm-service"} 0 1586866971856
    jvm_buffer_count_buffers{id="mapped",instance="10.0.1.5:8080",job="swarm-service"} 0 1586866976176
    jvm_buffer_count_buffers{id="mapped",instance="10.0.1.4:8080",job="swarm-service"} 0 1586866975100
    ...

The only thing I had to do in host-prometheus was to add an appropriate scrape job requesting the /federate endpoint.

scrape_configs:
  ...
  - job_name: 'swarm-prometheus'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="swarm-service"}'
    static_configs:
      - targets:
        - 'swarm-prometheus:9090'

Since I will be running host-prometheus in Docker, connected to the same network as my swarm, I can use the swarm-prometheus service name as the hostname. In a real-world environment, I might have to find another way to reach the swarm-prometheus service, for example using a Docker swarm node's IP address and the published port.
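That alternative might look like the following config fragment (the IP address is a placeholder for one of your swarm nodes, not a value from this setup):

```yaml
    static_configs:
      - targets:
        # swarm node IP and the published swarm-prometheus port
        - '192.0.2.10:9090'
```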

Activating the honor_labels flag ensures that Prometheus keeps the job and instance labels already included in the scraped metrics and does not overwrite them with its own values (see the scrape_config documentation for details).

Once host-prometheus is built and running, I can check the target status page again to see whether the scrape job runs successfully.

docker build -t host-prometheus .

docker run -d \
    --network custom-overlay-network \
    -p 9999:9090 \
    host-prometheus:latest

Figure 3 – Status of the configured scrape job in the host-prometheus web UI

Now I can execute the same Prometheus query as before in the host-prometheus web UI and get the three resulting time series.

So, that's already it. By setting up an intermediate Prometheus instance in the Docker swarm and combining a few existing features, it's easy to get the metrics of all swarm service instances into a Prometheus server, even one that has to run outside the swarm.

Optimizations

After implementing the above setup in my current project, I came up with some improvements that I think are worth sharing.

If you are running multiple different Spring Boot services in a Docker swarm, all listening on the default port 8080, then setting up a dedicated scrape job in swarm-prometheus for each service is quite redundant. The only thing that changes per service is the queried domain name (tasks.<service name>). And, as you may have noticed, dns_sd_configs accepts multiple names. So we can configure one scrape job that covers all existing services:

scrape_configs:
  ...
  - job_name: 'swarm-services'
    metrics_path: '/actuator/prometheus'
    dns_sd_configs:
      - names:
          - 'tasks.sample-service-1'
          - 'tasks.sample-service-2'
          - ...
        type: 'A'
        port: 8080

However, doing so we may run into another problem. With the old configuration, there was one scrape job per service, so we could name the scrape jobs accordingly and use the job label to identify and filter the metrics of the different services. Now, with one generic scrape job, we have to find another solution for this.

Fortunately, Micrometer, the library we use to provide the Prometheus metrics endpoint in our Spring Boot applications, can easily be configured to add custom tags to all written metrics. By adding the following line to the application.properties file of each of our Spring Boot services, a label named service with a static value containing the service name (here sample-service-1) is added to all metrics written by that service.

management.metrics.tags.service=sample-service-1

Finally, if you are using Grafana on top of the Prometheus instance, label values containing the IP address and port of a service instance (for example 10.0.1.3:8080) can prove problematic. If you want to use them as dashboard variables (for example, to repeat a panel for all instances or to filter data for a specific instance), this will not work because of the dots and colons in the values (they break the data request to the underlying Prometheus, since Grafana does not URL-encode them). We have to convert them into a less problematic format to use them this way. We can do this by adding a metric_relabel_configs section to the swarm-prometheus scrape job:

scrape_configs:
  ...
  - job_name: 'swarm-services'
    metrics_path: '/actuator/prometheus'
    dns_sd_configs:
      - names:
          - 'tasks.sample-service-1'
          - 'tasks.sample-service-2'
          - ...
        type: 'A'
        port: 8080
    metric_relabel_configs:
      - source_labels: [ instance ]
        regex: '^([0-9]+)\.([0-9]+)\.([0-9]+)\.([0-9]+)\:([0-9]+)$'
        replacement: '${1}_${2}_${3}_${4}'
        target_label: instance

This configuration takes the value of the source_labels (here instance), applies the given regex to it, and writes the replacement expression (built from the capture groups ${1} to ${4}) back to the target_label (here instance, overwriting the original value). Old values like 10.0.1.3:8080 are thus converted to values like 10_0_1_3, which are less problematic for Grafana.
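To make the transformation concrete, here is a small Python sketch that applies the same regex and replacement as the Prometheus config above (Python is used here only for illustration; Prometheus performs this relabeling itself):

```python
import re

# Same pattern as the Prometheus metric_relabel_configs regex above:
# four dot-separated IP octets, a colon, and a port.
PATTERN = r'^([0-9]+)\.([0-9]+)\.([0-9]+)\.([0-9]+)\:([0-9]+)$'
# Same replacement: keep the four octets joined by underscores, drop the port.
REPLACEMENT = r'\1_\2_\3_\4'

def relabel_instance(value: str) -> str:
    """Convert an 'IP:port' instance label value to a Grafana-friendly form."""
    return re.sub(PATTERN, REPLACEMENT, value)

print(relabel_instance('10.0.1.3:8080'))  # -> 10_0_1_3
```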


Update: Starting with Prometheus 2.20, there is also a Docker Swarm service discovery available that can be used in place of the DNS service discovery described in this article. Thanks to Julien Pivoto for introducing me to the new feature.

Origin blog.csdn.net/quuqu/article/details/124151624