[Thanos source code analysis series] A brief analysis of the source code of the Thanos Sidecar component

1 Overview:

1.1 Source code environment

The version information is as follows:
a. Thanos component version: v0.16.0

1.2 The role of Thanos Sidecar

The Thanos Sidecar component is bound to a Prometheus instance and has three functions:
1) As an access proxy, it exposes a gRPC interface to clients. Behind that interface it calls the HTTP interface of the bound Prometheus instance to fetch metric and rule data, and returns the result to the client.
2) If object storage is enabled, the block directories under the Prometheus TSDB directory are uploaded to the configured object storage system.
3) It watches the Prometheus configuration file. When the file changes, it calls the HTTP interface of the Prometheus instance so that Prometheus reloads its configuration.
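The third function can be illustrated with a small, self-contained sketch. It is not the Thanos reloader: it compares file checksums instead of using file system notifications, the helper names reloadIfChanged and simulateReloads are hypothetical, and a local test server stands in for Prometheus. The only thing taken from Prometheus itself is the /-/reload endpoint, which accepts a POST when the lifecycle API is enabled.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"net/http"
	"net/http/httptest"
	"os"
	"path/filepath"
	"sync/atomic"
)

// reloadIfChanged (hypothetical helper) compares the checksum of the config
// file with the last known one and, if it differs, sends a POST to reloadURL.
// It returns the current checksum and whether a reload was triggered.
func reloadIfChanged(path, reloadURL string, last [sha256.Size]byte) ([sha256.Size]byte, bool, error) {
	content, err := os.ReadFile(path)
	if err != nil {
		return last, false, err
	}
	sum := sha256.Sum256(content)
	if sum == last {
		return last, false, nil // nothing changed, no request sent
	}
	resp, err := http.Post(reloadURL, "", nil)
	if err != nil {
		return last, false, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return last, false, fmt.Errorf("reload request returned status %d", resp.StatusCode)
	}
	return sum, true, nil
}

// simulateReloads runs three checks against a fake Prometheus and returns
// how many reload requests that fake server received.
func simulateReloads() (int32, error) {
	var reloads int32
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Method == http.MethodPost && r.URL.Path == "/-/reload" {
			atomic.AddInt32(&reloads, 1)
		}
	}))
	defer srv.Close()

	dir, err := os.MkdirTemp("", "sidecar-reload")
	if err != nil {
		return 0, err
	}
	defer os.RemoveAll(dir)
	cfg := filepath.Join(dir, "prometheus.yml")

	var last [sha256.Size]byte
	// First write, then an unchanged file, then a modified file.
	for _, content := range []string{"v1", "v1", "v2"} {
		if err := os.WriteFile(cfg, []byte(content), 0o644); err != nil {
			return 0, err
		}
		if last, _, err = reloadIfChanged(cfg, srv.URL+"/-/reload", last); err != nil {
			return 0, err
		}
	}
	return atomic.LoadInt32(&reloads), nil
}

func main() {
	n, err := simulateReloads()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("reload requests:", n)
}
```

The unchanged second write produces no request, so only real content changes reach the Prometheus side.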

2 Brief analysis of source code:

The sidecar uses the github.com/oklog/run package to start a set of goroutines. These goroutines mainly run the HTTP server and the gRPC server, perform the Prometheus heartbeat, watch the Prometheus configuration file, and upload blocks to object storage.
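To make the goroutine group idea concrete, here is a minimal standard-library sketch of the pattern. miniGroup is a hypothetical, simplified stand-in written for this article, not the oklog/run implementation: each actor is a pair of an execute function and an interrupt function, and when the first actor returns, every actor is interrupted and the group waits for all of them.

```go
package main

import (
	"errors"
	"fmt"
)

// actor pairs a blocking execute function with an interrupt function
// that makes execute return.
type actor struct {
	execute   func() error
	interrupt func(error)
}

// miniGroup is a simplified stand-in for a run.Group-style actor group.
type miniGroup struct {
	actors []actor
}

// Add registers one actor.
func (g *miniGroup) Add(execute func() error, interrupt func(error)) {
	g.actors = append(g.actors, actor{execute: execute, interrupt: interrupt})
}

// Run starts every actor in its own goroutine, waits for the first one to
// return, interrupts all actors, waits for the rest, and returns the first error.
func (g *miniGroup) Run() error {
	if len(g.actors) == 0 {
		return nil
	}
	errs := make(chan error, len(g.actors))
	for _, a := range g.actors {
		go func(a actor) { errs <- a.execute() }(a)
	}
	first := <-errs
	for _, a := range g.actors {
		a.interrupt(first)
	}
	for i := 1; i < len(g.actors); i++ {
		<-errs
	}
	return first
}

// demo wires a long-running watcher and a server that fails immediately.
func demo() (bool, error) {
	var g miniGroup
	cancel := make(chan struct{})
	stopped := false
	g.Add(func() error {
		<-cancel // blocks until interrupted
		stopped = true
		return nil
	}, func(error) {
		close(cancel)
	})
	g.Add(func() error {
		return errors.New("server stopped")
	}, func(error) {})
	err := g.Run()
	return stopped, err
}

func main() {
	stopped, err := demo()
	fmt.Println("watcher stopped:", stopped, "first error:", err)
}
```

The same shape appears throughout the sidecar code below: every g.Add call passes a function that blocks and a function that cancels it.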

2.1 main method

Every Thanos component is started from the same executable binary, so each startup command begins with thanos. The first argument selects the component to start. In this example it is sidecar, so the command starts the sidecar component.

thanos sidecar \
--prometheus.url=http://localhost:9090/ \
--tsdb.path=/prometheus \
--grpc-address=[$(POD_IP)]:10901 \
--http-address=[$(POD_IP)]:10902


Let's look at the main method in detail. It creates an app object that holds the setup functions of all Thanos components, but only one of those functions is taken out and run at startup. Which one is taken out depends on the startup command.
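The register-then-select idea can be sketched with the standard library alone. The app type, its register and parse methods, and setupFunc below are hypothetical simplifications for illustration. The real code goes through extkingpin and kingpin and passes many more arguments to each setup function.

```go
package main

import (
	"errors"
	"fmt"
)

// setupFunc is the startup logic of one component (simplified: no arguments).
type setupFunc func() error

// app keeps the setup function of every registered component by name.
type app struct {
	setups map[string]setupFunc
}

func newApp() *app {
	return &app{setups: map[string]setupFunc{}}
}

// register stores the setup function of a component under its command name.
func (a *app) register(name string, s setupFunc) {
	a.setups[name] = s
}

// parse picks the setup function named by the first argument.
func (a *app) parse(args []string) (string, setupFunc, error) {
	if len(args) == 0 {
		return "", nil, errors.New("missing component name")
	}
	s, ok := a.setups[args[0]]
	if !ok {
		return "", nil, fmt.Errorf("unknown component %q", args[0])
	}
	return args[0], s, nil
}

func main() {
	a := newApp()
	// All components are registered, but only one is started.
	a.register("sidecar", func() error { fmt.Println("starting sidecar"); return nil })
	a.register("query", func() error { fmt.Println("starting query"); return nil })

	name, setup, err := a.parse([]string{"sidecar", "--tsdb.path=/prometheus"})
	if err != nil {
		fmt.Println(err)
		return
	}
	if err := setup(); err != nil {
		fmt.Println(name, "failed:", err)
	}
}
```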

func main() {


	/*
		Other code (omitted)
	*/

	app := extkingpin.NewApp(kingpin.New(filepath.Base(os.Args[0]), "A block storage based long-term storage for Prometheus").Version(version.Print("thanos")))
	/*
		Other code (omitted)
	*/


	// Put the startup logic of every component into the setups list of the app object
	registerSidecar(app)
	registerStore(app)
	registerQuery(app)
	registerRule(app)
	registerCompact(app)
	registerTools(app)
	registerReceive(app)
	registerQueryFrontend(app)

	// Based on the command line, take one component's startup logic out of the setups list of the app object
	cmd, setup := app.Parse()
	logger := logging.NewLogger(*logLevel, *logFormat, *debugName)

	/*
		Other code (omitted)
	*/

	var g run.Group
	var tracer opentracing.Tracer
	
	/*
		Tracing-related code (omitted)
	*/
	
	
	reloadCh := make(chan struct{}, 1)

	// Start one specific component (sidecar, query, store, and so on); under the hood this still calls g.Add(...)
	if err := setup(&g, logger, metrics, tracer, reloadCh, *logLevel == "debug"); err != nil {		
		os.Exit(1)
	}

	// Listen for termination signals from the operating system.
	{
		cancel := make(chan struct{})
		g.Add(func() error {
			return interrupt(logger, cancel)
		}, func(error) {
			close(cancel)
		})
	}

	// Listen for configuration reload signals
	{
		cancel := make(chan struct{})
		g.Add(func() error {
			return reload(logger, cancel, reloadCh)
		}, func(error) {
			close(cancel)
		})
	}

	// Block until the goroutines exit
	// When one goroutine returns, the others return as well
	if err := g.Run(); err != nil {
		level.Error(logger).Log("err", fmt.Sprintf("%+v", errors.Wrapf(err, "%s command failed", cmd)))
		os.Exit(1)
	}
	
	// Reaching this point means the whole program has finished.
	level.Info(logger).Log("msg", "exiting")
}

2.2 registerSidecar method


func registerSidecar(app *extkingpin.App) {
	cmd := app.Command(component.Sidecar.String(), "Sidecar for Prometheus server")
	conf := &sidecarConfig{}
	// Register the command-line flags
	conf.registerFlag(cmd)
	
	// The function passed to Setup() is stored in the setups list of the app object
	// The core of it is the runSidecar() method
	cmd.Setup(func(g *run.Group, logger log.Logger, reg *prometheus.Registry, tracer opentracing.Tracer, _ <-chan struct{}, _ bool) error {
		rl := reloader.New(log.With(logger, "component", "reloader"),
			extprom.WrapRegistererWithPrefix("thanos_sidecar_", reg),
			&reloader.Options{
				ReloadURL:     reloader.ReloadURLFromBase(conf.prometheus.url),
				CfgFile:       conf.reloader.confFile,
				CfgOutputFile: conf.reloader.envVarConfFile,
				WatchedDirs:   conf.reloader.ruleDirectories,
				WatchInterval: conf.reloader.watchInterval,
				RetryInterval: conf.reloader.retryInterval,
			})

		return runSidecar(g, logger, reg, tracer, rl, component.Sidecar, *conf)
	})
}

2.3 runSidecar method

runSidecar uses the run.Group object to start the HTTP server, the gRPC server, the goroutine that uploads block directories to object storage, the goroutine that watches the Prometheus configuration file, and the goroutine that periodically checks that the Prometheus instance is alive.

Detailed notes:
1) The heartbeat check against the Prometheus instance goes through the /api/v1/status/config interface.
2) The package used to watch the Prometheus configuration file for changes is github.com/fsnotify/fsnotify.
3) When block upload is enabled, all block directories under the Prometheus TSDB directory are traversed every 30s (blocks that are already uploaded and empty blocks are skipped, and compacted blocks are also skipped by default), and the corresponding files are uploaded to object storage.
4) If the external labels of the Prometheus instance cannot be fetched, or no external labels are configured in Prometheus, the sidecar fails to start.
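Notes 1) and 4) can be sketched as a two-phase loop: retry until the first successful label fetch, fail if no labels came back, then keep repeating the fetch as a heartbeat. The sketch below uses only the standard library. The retry and repeat helpers are simplified stand-ins for the runutil functions used in the real code, the heartbeat type and flakyFetch are hypothetical, and the fetch function replaces the HTTP call to Prometheus.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// fetchFunc stands in for the call that reads external labels from Prometheus.
type fetchFunc func() (map[string]string, error)

// heartbeat keeps the last known state (hypothetical type for this sketch).
type heartbeat struct {
	up     bool
	lastOK time.Time
	labels map[string]string
}

// update performs one fetch and records the outcome.
func (h *heartbeat) update(fetch fetchFunc) error {
	l, err := fetch()
	if err != nil {
		h.up = false
		return err
	}
	h.up, h.labels, h.lastOK = true, l, time.Now()
	return nil
}

// retry calls f until it succeeds or stop is closed.
func retry(interval time.Duration, stop <-chan struct{}, f func() error) error {
	tick := time.NewTicker(interval)
	defer tick.Stop()
	for {
		err := f()
		if err == nil {
			return nil
		}
		select {
		case <-stop:
			return err
		case <-tick.C:
		}
	}
}

// repeat calls f every interval until f fails or stop is closed.
func repeat(interval time.Duration, stop <-chan struct{}, f func() error) error {
	tick := time.NewTicker(interval)
	defer tick.Stop()
	for {
		if err := f(); err != nil {
			return err
		}
		select {
		case <-stop:
			return nil
		case <-tick.C:
		}
	}
}

// run is the two-phase loop: initial fetch with retry, label check, heartbeat.
func (h *heartbeat) run(interval time.Duration, stop <-chan struct{}, fetch fetchFunc) error {
	if err := retry(interval, stop, func() error { return h.update(fetch) }); err != nil {
		return fmt.Errorf("initial external labels query: %w", err)
	}
	if len(h.labels) == 0 {
		return errors.New("no external labels configured on Prometheus server")
	}
	return repeat(interval, stop, func() error {
		_ = h.update(fetch) // a failed heartbeat only flips h.up, the loop goes on
		return nil
	})
}

// flakyFetch fails the first `failures` calls and closes stop on call stopAt.
func flakyFetch(failures, stopAt int, labels map[string]string, stop chan struct{}) fetchFunc {
	calls := 0
	return func() (map[string]string, error) {
		calls++
		if calls == stopAt {
			close(stop)
		}
		if calls <= failures {
			return nil, errors.New("prometheus not reachable")
		}
		return labels, nil
	}
}

func demoHeartbeat() (*heartbeat, error) {
	stop := make(chan struct{})
	h := &heartbeat{}
	err := h.run(time.Millisecond, stop, flakyFetch(2, 5, map[string]string{"cluster": "eu1"}, stop))
	return h, err
}

func demoNoLabels() error {
	stop := make(chan struct{})
	return (&heartbeat{}).run(time.Millisecond, stop, flakyFetch(0, 0, map[string]string{}, stop))
}

func main() {
	h, err := demoHeartbeat()
	fmt.Println("up:", h.up, "labels:", h.labels, "err:", err)
	fmt.Println("without labels:", demoNoLabels())
}
```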

func runSidecar(
	g *run.Group,
	logger log.Logger,
	reg *prometheus.Registry,
	tracer opentracing.Tracer,
	reloader *reloader.Reloader,
	comp component.Component,
	conf sidecarConfig,
) error {

	// A struct that holds the URL of the Prometheus instance, its external labels, the Prometheus client, and related information.
	var m = &promMetadata{
		promURL: conf.prometheus.url,

		mint: conf.limitMinTime.PrometheusTimestamp(),
		maxt: math.MaxInt64,

		limitMinTime: conf.limitMinTime,
		client:       promclient.NewWithTracingClient(logger, "thanos-sidecar"),
	}
	
	// Read the object storage configuration; if there is one, uploading blocks to object storage is enabled.
	confContentYaml, err := conf.objStore.Content()
	if err != nil {
		return errors.Wrap(err, "getting object store config")
	}
	var uploads = true
	if len(confContentYaml) == 0 {
		level.Info(logger).Log("msg", "no supported bucket was configured, uploads will be disabled")
		uploads = false
	}
	

	grpcProbe := prober.NewGRPC()
	httpProbe := prober.NewHTTP()
	statusProber := prober.Combine(
		httpProbe,
		grpcProbe,
		prober.NewInstrumentation(comp, logger, extprom.WrapRegistererWithPrefix("thanos_", reg)),
	)
	
	// Create and start the HTTP server (it only serves endpoints such as /metrics, /-/healthy, and /-/ready)
	srv := httpserver.New(logger, reg, comp, httpProbe,
		httpserver.WithListen(conf.http.bindAddress),
		httpserver.WithGracePeriod(time.Duration(conf.http.gracePeriod)),
	)
	g.Add(func() error {
		statusProber.Healthy()
		return srv.ListenAndServe()
	}, func(err error) {
		statusProber.NotReady(err)
		defer statusProber.NotHealthy(err)
		srv.Shutdown(err)
	})


	// Fetch the external labels of the Prometheus instance and run the heartbeat
	{
		// promUp records whether Prometheus is reachable: 0 means unreachable, 1 means reachable
		promUp := promauto.With(reg).NewGauge(prometheus.GaugeOpts{
			Name: "thanos_sidecar_prometheus_up",
			Help: "Boolean indicator whether the sidecar can reach its Prometheus peer.",
		})
		// lastHeartbeat records the time of the last successful heartbeat
		lastHeartbeat := promauto.With(reg).NewGauge(prometheus.GaugeOpts{
			Name: "thanos_sidecar_last_heartbeat_success_time_seconds",
			Help: "Timestamp of the last successful heartbeat in seconds.",
		})

		ctx, cancel := context.WithCancel(context.Background())
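		// Fetch the external labels of the Prometheus instance (/api/v1/status/config) and repeat this every 30s as the heartbeat
		g.Add(func() error {
			/*
				Sanity-check code (omitted)
			*/
			
			// Fetch the external labels of the Prometheus instance
			err := runutil.Retry(2*time.Second, ctx.Done(), func() error {
				// m.UpdateLabels(ctx) calls the /api/v1/status/config interface of the Prometheus instance and stores the returned data in its own labels field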
				if err := m.UpdateLabels(ctx); err != nil {						
					promUp.Set(0)
					statusProber.NotReady(err)
					return err
				}			
				promUp.Set(1)
				statusProber.Ready()
				// Record the heartbeat time
				lastHeartbeat.SetToCurrentTime()
				return nil
			})
			
			// Exit if the external labels of the Prometheus instance cannot be fetched or Prometheus has no external labels configured
			if err != nil {
				return errors.Wrap(err, "initial external labels query")
			}			
			if len(m.Labels()) == 0 {
				return errors.New("no external labels configured on Prometheus server, uniquely identifying external labels must be configured; see https://thanos.io/tip/thanos/storage.md#external-labels for details.")
			}

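			// Every 30s, fetch the external labels from the Prometheus instance and use that to record the heartbeat time
			return runutil.Repeat(30*time.Second, ctx.Done(), func() error {
				/*
					Other code (omitted), including the creation of iterCtx
				*/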
				
				if err := m.UpdateLabels(iterCtx); err != nil {
					level.Warn(logger).Log("msg", "heartbeat failed", "err", err)
					promUp.Set(0)
				} else {
					promUp.Set(1)
					// Record the heartbeat time
					lastHeartbeat.SetToCurrentTime()
				}
				return nil
			})
		}, func(error) {
			cancel()
		})
	}
	
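	// Use the github.com/fsnotify/fsnotify package to watch the configuration file of the Prometheus instance for changes
	// If the file changes, send a POST request to the Prometheus instance so that it reloads its configuration file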
	{
		ctx, cancel := context.WithCancel(context.Background())
		g.Add(func() error {
			return reloader.Watch(ctx)
		}, func(error) {
			cancel()
		})
	}

	{
		t := exthttp.NewTransport()
		t.MaxIdleConnsPerHost = conf.connection.maxIdleConnsPerHost
		t.MaxIdleConns = conf.connection.maxIdleConns
		c := promclient.NewClient(&http.Client{Transport: tracing.HTTPTripperware(logger, t)}, logger, thanoshttp.ThanosUserAgent)

		promStore, err := store.NewPrometheusStore(logger, reg, c, conf.prometheus.url, component.Sidecar, m.Labels, m.Timestamps)
		if err != nil {
			return errors.Wrap(err, "create Prometheus store")
		}

		tlsCfg, err := tls.NewServerConfig(log.With(logger, "protocol", "gRPC"),
			conf.grpc.tlsSrvCert, conf.grpc.tlsSrvKey, conf.grpc.tlsSrvClientCA)
		if err != nil {
			return errors.Wrap(err, "setup gRPC server")
		}

		// Create and start the gRPC server
		s := grpcserver.New(logger, reg, tracer, comp, grpcProbe,
			// Register the gRPC handler (fetches metric data from the Prometheus instance through the HTTP client)
			grpcserver.WithServer(store.RegisterStoreServer(promStore)),
			// Register the gRPC handler (fetches rule data from the Prometheus instance through the HTTP client)
			grpcserver.WithServer(rules.RegisterRulesServer(rules.NewPrometheus(conf.prometheus.url, c, m.Labels))), 
			grpcserver.WithListen(conf.grpc.bindAddress),
			grpcserver.WithGracePeriod(time.Duration(conf.grpc.gracePeriod)),
			grpcserver.WithTLSConfig(tlsCfg),
		)
		g.Add(func() error {
			statusProber.Ready()
			return s.ListenAndServe()
		}, func(err error) {
			statusProber.NotReady(err)
			s.Shutdown(err)
		})
	}

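	// If block upload is enabled, periodically traverse all block directories under the Prometheus TSDB directory and upload their files to object storage.
	if uploads {
		
		// Get an object storage bucket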
		bkt, err := client.NewBucket(logger, confContentYaml, reg, component.Sidecar.String())
		if err != nil {
			return err
		}
		
		/*
			Other code (omitted)
		*/

		ctx, cancel := context.WithCancel(context.Background())
		g.Add(func() error {
			/*
				Other code (omitted)
			*/

			/*
				Exit if the external labels of the Prometheus instance cannot be fetched or Prometheus has no external labels configured
			*/
			
			s := shipper.New(logger, reg, conf.tsdb.path, bkt, m.Labels, metadata.SidecarSource,
				conf.shipper.uploadCompacted, conf.shipper.allowOutOfOrderUpload)
			
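			// Run s.Sync(ctx) every 30s
			// s.Sync(ctx) traverses all block directories under the Prometheus TSDB directory (already uploaded blocks and empty blocks are skipped, and compacted blocks are also skipped by default) and uploads the corresponding files
			return runutil.Repeat(30*time.Second, ctx.Done(), func() error {
				if uploaded, err := s.Sync(ctx); err != nil {
					// At least one block failed to upload, so log a warning
					level.Warn(logger).Log("err", err, "uploaded", uploaded)
				}
				/*
					Other code (omitted)
				*/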
				return nil
			})
		}, func(error) {
			cancel()
		})
	}
	
	level.Info(logger).Log("msg", "starting sidecar")
	return nil
}
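The block selection described for s.Sync(ctx) can be sketched as a plain filter. This is a simplified illustration, not the shipper code. The blockMeta type and the blocksToUpload and sampleBlocks functions are hypothetical, and the sketch assumes that a compaction level above 1 is what marks a block as compacted.

```go
package main

import "fmt"

// blockMeta is a hypothetical, reduced view of a block's metadata.
type blockMeta struct {
	ID              string
	NumSamples      uint64
	CompactionLevel int
}

// blocksToUpload returns the IDs of the blocks that one sync pass would ship.
// Already uploaded blocks and empty blocks are always skipped. Compacted
// blocks (assumed here: level > 1) are skipped unless uploadCompacted is set.
func blocksToUpload(blocks []blockMeta, uploaded map[string]bool, uploadCompacted bool) []string {
	var ids []string
	for _, b := range blocks {
		switch {
		case uploaded[b.ID]:
			continue // shipped in an earlier pass
		case b.NumSamples == 0:
			continue // empty block
		case b.CompactionLevel > 1 && !uploadCompacted:
			continue // compacted block, ignored by default
		}
		ids = append(ids, b.ID)
	}
	return ids
}

// sampleBlocks returns one block for each case handled above.
func sampleBlocks() []blockMeta {
	return []blockMeta{
		{ID: "A", NumSamples: 100, CompactionLevel: 1}, // already uploaded
		{ID: "B", NumSamples: 0, CompactionLevel: 1},   // empty
		{ID: "C", NumSamples: 100, CompactionLevel: 2}, // compacted
		{ID: "D", NumSamples: 100, CompactionLevel: 1}, // fresh block
	}
}

func main() {
	uploaded := map[string]bool{"A": true}
	fmt.Println("default:", blocksToUpload(sampleBlocks(), uploaded, false))
	fmt.Println("with compacted:", blocksToUpload(sampleBlocks(), uploaded, true))
}
```

In the sidecar code above, the switch for compacted blocks corresponds to conf.shipper.uploadCompacted, which is passed to shipper.New.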

3 Summary:

The code logic of the Thanos Sidecar component is simple and easy to understand. It accesses the Prometheus instance bound to it over HTTP and exposes the data obtained from that instance through a gRPC interface. Beyond that, it provides two small functions: traversing all block directories to upload their files, and watching the Prometheus configuration file for changes.

Origin blog.csdn.net/nangonghen/article/details/110731518