VictoriaMetrics fails to discover a scrape target

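Problem description
I recently deployed a service in a new environment. It exposes its metrics at :10299/metrics, and its scrape configuration is as follows (the name fields have been changed):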

apiVersion: v1
items:
- apiVersion: operator.victoriametrics.com/v1beta1
  kind: VMServiceScrape
  metadata:
    labels:
      app_id: audit
    name: audit
    namespace: default
  spec:
    endpoints:
    - path: /metrics
      targetPort: 10299
    namespaceSelector:
      matchNames:
      - default
    selector:
      matchLabels:
        app_id: audit

However, the vmagent status page showed that vmagent had not discovered this target.


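General troubleshooting steps
1. Make sure the service itself is healthy, i.e. the metrics are reachable at ${podIp}:10299/metrics.
2. Make sure the vmservicescrape -> service -> endpoints chain is intact, i.e. the configured selector fields match the intended resources.
3. Make sure the vmservicescrape is well-formed. Note: a malformed vmservicescrape resource may prevent vmagent from loading its configuration; step 5 will reveal this.
4. Make sure vmagent is allowed to discover targets in that namespace.
5. Trigger a reload from the vmagent UI and check the vmagent logs for related errors.

None of these steps resolved the problem. Stranger still, the target was absent from vmagent's api/v1/targets, which means vmagent never discovered the service at all, i.e. the vmservicescrape configuration had no effect. Below is the configuration that vmagent generated from the vmservicescrape above (it is concatenated with the static configuration). It uses kubernetes_sd_configs to discover targets: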

- job_name: serviceScrape/default/audit/0
  metrics_path: /metrics
  relabel_configs:
  - source_labels: [__meta_kubernetes_service_label_app_id]
    regex: audit
    action: keep
  - source_labels: [__meta_kubernetes_pod_container_port_number]
    regex: "10299"
    action: keep
  - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
    separator: ;
    target_label: node
    regex: Node;(.*)
    replacement: ${1}
  - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
    separator: ;
    target_label: pod
    regex: Pod;(.*)
    replacement: ${1}
  - source_labels: [__meta_kubernetes_pod_name]
    target_label: pod
  - source_labels: [__meta_kubernetes_pod_container_name]
    target_label: container
  - source_labels: [__meta_kubernetes_namespace]
    target_label: namespace
  - source_labels: [__meta_kubernetes_service_name]
    target_label: service
  - source_labels: [__meta_kubernetes_service_name]
    target_label: job
    replacement: ${1}
  - target_label: endpoint
    replacement: "8080"
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      own_namespace: false
      names:
      - default
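
The two keep rules in the generated configuration decide whether a discovered target survives relabeling. The following is a minimal sketch of how such a rule behaves, assuming the usual relabeling semantics (source label values joined with ";", a fully anchored regex, missing labels treated as empty strings). The keepRule type and its helpers are hypothetical names made up for illustration; this is not the VictoriaMetrics implementation.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// keepRule is a hypothetical, simplified model of a relabel rule with
// "action: keep". It is not taken from the VictoriaMetrics source.
type keepRule struct {
	sourceLabels []string
	separator    string
	regex        *regexp.Regexp
}

// newKeepRule anchors the pattern on both ends, so "10299" matches only
// the exact value "10299" and not, say, "102990".
func newKeepRule(pattern string, sourceLabels ...string) keepRule {
	return keepRule{
		sourceLabels: sourceLabels,
		separator:    ";",
		regex:        regexp.MustCompile("^(?:" + pattern + ")$"),
	}
}

// keeps reports whether a target with the given labels survives the rule.
// A label that is absent contributes an empty string.
func (r keepRule) keeps(labels map[string]string) bool {
	values := make([]string, 0, len(r.sourceLabels))
	for _, name := range r.sourceLabels {
		values = append(values, labels[name])
	}
	return r.regex.MatchString(strings.Join(values, r.separator))
}

func main() {
	rule := newKeepRule("10299", "__meta_kubernetes_pod_container_port_number")

	declared := map[string]string{"__meta_kubernetes_pod_container_port_number": "10299"}
	other := map[string]string{"__meta_kubernetes_pod_container_port_number": "8080"}
	missing := map[string]string{}

	// Prints: true false false
	fmt.Println(rule.keeps(declared), rule.keeps(other), rule.keeps(missing))
}
```

Under these assumptions, a target is dropped whenever its container port label is absent or differs from 10299, regardless of whether the service selector matched.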
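Code analysis
Since the configuration looked correct, the only remaining option was to examine how kubernetes_sd_configs works in VictoriaMetrics and find out where things go wrong. In the VictoriaMetrics source, the target URL is assembled as follows: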

scrapeURL := fmt.Sprintf("%s://%s%s%s%s", schemeRelabeled, addressRelabeled, metricsPathRelabeled, optionalQuestion, paramsStr)

where:

- schemeRelabeled: defaults to http
- metricsPathRelabeled: the metrics_path field of the generated configuration
- optionalQuestion and paramsStr: not configured here, so they can be ignored

The most important field is addressRelabeled, which comes from a label named "__address__":

func mergeLabels(swc *scrapeWorkConfig, target string, extraLabels, metaLabels map[string]string) []prompbmarshal.Label {
	// ...
	m["job"] = swc.jobName
	m["__address__"] = target
	m["__scheme__"] = swc.scheme
	m["__metrics_path__"] = swc.metricsPath
	m["__scrape_interval__"] = swc.scrapeInterval.String()
	m["__scrape_timeout__"] = swc.scrapeTimeout.String()
	// ...
}
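
To make the URL assembly concrete, here is a small self-contained sketch built around the same fmt.Sprintf format string. The buildScrapeURL helper, the rule for when the "?" is emitted, and the pod IP 10.0.0.12 are assumptions made for illustration and are not copied from the VictoriaMetrics source.

```go
package main

import "fmt"

// buildScrapeURL is a hypothetical helper that reuses the format string
// quoted from the source. The "?" handling is a simplifying assumption:
// it is emitted only when query parameters are present.
func buildScrapeURL(scheme, address, metricsPath, params string) string {
	optionalQuestion := ""
	if params != "" {
		optionalQuestion = "?"
	}
	return fmt.Sprintf("%s://%s%s%s%s", scheme, address, metricsPath, optionalQuestion, params)
}

func main() {
	// 10.0.0.12 is a made-up pod IP used only for illustration.
	// Prints: http://10.0.0.12:10299/metrics
	fmt.Println(buildScrapeURL("http", "10.0.0.12:10299", "/metrics", ""))
}
```

With the scheme and path fixed by the configuration, the address is the only part that depends on what service discovery returns, which is why the rest of the analysis follows that label.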
Tracing the code further, the __address__ label is obtained through sc.KubernetesSDConfigs[i].MustStart. As the name KubernetesSDConfigs suggests, this is the part that handles the kubernetes_sd_configs mechanism:

func (sc *ScrapeConfig) mustStart(baseDir string) {
	swosFunc := func(metaLabels map[string]string) interface{} {
		target := metaLabels["__address__"]
		sw, err := sc.swc.getScrapeWork(target, nil, metaLabels)
		if err != nil {
			logger.Errorf("cannot create kubernetes_sd_config target %q for job_name %q: %s", target, sc.swc.jobName, err)
			return nil
		}
		return sw
	}
	for i := range sc.KubernetesSDConfigs {
		sc.KubernetesSDConfigs[i].MustStart(baseDir, swosFunc)
	}
}
Digging further to find out what this __address__ field actually holds, the call chain is:

MustStart -> cfg.aw.mustStart -> aw.gw.startWatchersForRole -> uw.reloadScrapeWorksForAPIWatchersLocked -> o.getTargetLabels

The last function, getTargetLabels, is an interface method:

type object interface {
	key() string

	// getTargetLabels must be called under gw.mu lock.
	getTargetLabels(gw *groupWatcher) []map[string]string
}
getTargetLabels has one implementation for each role of kubernetes_sd_configs. The service above uses the endpoints role, whose implementation is as follows:

func (eps *Endpoints) getTargetLabels(gw *groupWatcher) []map[string]string {
	var svc *Service
	if o := gw.getObjectByRoleLocked("service", eps.Metadata.Namespace, eps.Metadata.Name); o != nil {
		svc = o.(*Service)
	}
	podPortsSeen := make(map[*Pod][]int)
	var ms []map[string]string
	for _, ess := range eps.Subsets {
		for _, epp := range ess.Ports {
			ms = appendEndpointLabelsForAddresses(ms, gw, podPortsSeen, eps, ess.Addresses, epp, svc, "true")
			ms = appendEndpointLabelsForAddresses(ms, gw, podPortsSeen, eps, ess.NotReadyAddresses, epp, svc, "false")
		}
	}
	// See https://kubernetes.io/docs/reference/labels-annotations-taints/#endpoints-kubernetes-io-over-capacity
	// and https://github.com/kubernetes/kubernetes/pull/99975
	switch eps.Metadata.Annotations.GetByName("endpoints.kubernetes.io/over-capacity") {
	case "truncated":
		logger.Warnf(`the number of targets for "role: endpoints" %q exceeds 1000 and has been truncated; please use "role: endpointslice" instead`, eps.Metadata.key())
	case "warning":
		logger.Warnf(`the number of targets for "role: endpoints" %q exceeds 1000 and will be truncated in the next k8s releases; please use "role: endpointslice" instead`, eps.Metadata.key())
	}

// Append labels for skipped ports on seen pods.
portSeen := func(port int, ports []int) bool {
	for _, p := range ports {
		if p == port {
			return true
		}
	}
	return false
}
for p, ports := range podPortsSeen {
	for _, c := range p.Spec.Containers {
		for _, cp := range c.Ports {
			if portSeen(cp.ContainerPort, ports) {
				continue
			}
			addr := discoveryutils.JoinHostPort(p.Status.PodIP, cp.ContainerPort)
			m := map[string]string{
				"__address__": addr,
			}
			p.appendCommonLabels(m)
			p.appendContainerLabels(m, c, &cp)
			if svc != nil {
				svc.appendCommonLabels(m)
			}
			ms = append(ms, m)
		}
	}
}
return ms

}
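As the code shows, "__address__" is simply p.Status.PodIP joined with cp.ContainerPort, where p is the Kubernetes pod data structure. Two conditions must therefore hold:

- the pod is in the Running state and has been assigned a PodIP
- p.Spec.Containers[].ports[].ContainerPort declares the port that exposes the metrics target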
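Solution
Based on this analysis, I checked the deployment in the environment and found that it declared only port 8080 and not the metrics port 10299, as shown below. Declaring port 10299 resolved the problem.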

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app_id: audit
  name: audit
  namespace: default
spec:
  # ...
  template:
    metadata:
      # ...
    spec:
      containers:
      - env:
        - name: APP_ID
          value: audit
        ports:
        - containerPort: 8080
          protocol: TCP


Reposted from blog.csdn.net/weixin_43214644/article/details/124647101