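In summary, the complete sequence of steps by which ceilometer processes the data coming from compute nodes is:
Step 1: Polling tasks and the pipeline
On the compute node, the start method of the AgentManager class in ceilometer/agent/manager.py is called,
start calls configure_polling_tasks,
configure_polling_tasks calls setup_polling_tasks, which reads the pipeline file and sets up a polling task for each meter,
setup_polling_tasks calls create_polling_task to create a PollingTask object; each polling task then calls
the poll_and_notify method of the PollingTask class, shown below: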
def poll_and_notify(self):
    """Polling sample and notify."""
    cache = {}
    discovery_cache = {}
    poll_history = {}
    for source_name in self.pollster_matches:
        for pollster in self.pollster_matches[source_name]:
            key = Resources.key(source_name, pollster)
            candidate_res = list(
                self.resources[key].get(discovery_cache))
            if not candidate_res and pollster.obj.default_discovery:
                candidate_res = self.manager.discover(
                    [pollster.obj.default_discovery], discovery_cache)
            # Remove duplicated resources and black resources. Using
            # set() requires well defined __hash__ for each resource.
            # Since __eq__ is defined, 'not in' is safe here.
            polling_resources = []
            black_res = self.resources[key].blacklist
            history = poll_history.get(pollster.name, [])
            for x in candidate_res:
                if x not in history:
                    history.append(x)
                    if x not in black_res:
                        polling_resources.append(x)
            poll_history[pollster.name] = history
            # If no resources, skip for this pollster
            if not polling_resources:
                p_context = 'new ' if history else ''
                LOG.info(_("Skip pollster %(name)s, no %(p_context)s"
                           "resources found this cycle"),
                         {'name': pollster.name, 'p_context': p_context})
                continue
            LOG.info(_("Polling pollster %(poll)s in the context of "
                       "%(src)s"),
                     dict(poll=pollster.name, src=source_name))
            try:
                samples = pollster.obj.get_samples(
                    manager=self.manager,
                    cache=cache,
                    resources=polling_resources
                )
                sample_batch = []
                # filter None in samples
                samples = [s for s in samples if s is not None]
                for sample in samples:
                    sample_dict = (
                        publisher_utils.meter_message_from_counter(
                            sample, self._telemetry_secret
                        ))
                    if self._batch:
                        sample_batch.append(sample_dict)
                    else:
                        self._send_notification([sample_dict])
                if sample_batch:
                    self._send_notification(sample_batch)
            except plugin_base.PollsterPermanentError as err:
                LOG.error(_(
                    'Prevent pollster %(name)s for '
                    'polling source %(source)s anymore!')
                    % ({'name': pollster.name, 'source': source_name}))
                self.resources[key].blacklist.extend(err.fail_res_list)
            except Exception as err:
                LOG.warning(_(
                    'Continue after error from %(name)s: %(error)s')
                    % ({'name': pollster.name, 'error': err}),
                    exc_info=True)
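Analysis:
The key step in the method above is the call
    samples = pollster.obj.get_samples(
        manager=self.manager,
        cache=cache,
        resources=polling_resources
    )
which asks each pollster plugin for the current samples of its meter; because the polling task runs on a timer, the value of each meter is collected periodically.
Finally the method calls
    self._send_notification(sample_batch)
on the PollingTask
to send out the collected meter data.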
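The resource filtering at the top of poll_and_notify can be looked at in isolation. The following is a minimal sketch, not ceilometer code: the function name select_polling_resources and the string resource ids are made up for illustration. It only mirrors the "not in history, not in blacklist" logic of the excerpt.

```python
def select_polling_resources(candidates, history, blacklist):
    """Return the candidates that are new this cycle and not blacklisted.

    `history` is updated in place, like the per-pollster poll_history list
    in the excerpt. Plain `in` checks are used, so resources only need
    __eq__, not __hash__.
    """
    selected = []
    for res in candidates:
        if res not in history:
            history.append(res)
            if res not in blacklist:
                selected.append(res)
    return selected


# Hypothetical resource ids, for illustration only.
history = []
first = select_polling_resources(["vm-1", "vm-2", "vm-1"], history, ["vm-2"])
second = select_polling_resources(["vm-1", "vm-3"], history, ["vm-2"])
print(first)   # ['vm-1']
print(second)  # ['vm-3']
```

A blacklisted resource is still recorded in the history, so it is not reconsidered when the same pollster shows up under another source in the same cycle.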
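Step 2: How ceilometer sends the meter data from the compute node
Step 1 calls
    self._send_notification(sample_batch)
on the PollingTask
to send out the collected meter data. The method is as follows:
def _send_notification(self, samples):
    self.manager.notifier.sample(
        self.manager.context.to_dict(),
        'telemetry.polling',
        {'samples': samples}
    )
Analysis:
This method clearly goes through the message queue: it calls the sample method of the oslo.messaging
notifier object with the event type 'telemetry.polling', and the payload that is sent is a dictionary:
{'samples': samples}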
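How the _batch flag changes the number of messages can be sketched without a message bus. In the sketch below, RecordingNotifier and send_samples are hypothetical names; the notifier is a stand-in that only records calls and is not the oslo.messaging API.

```python
class RecordingNotifier(object):
    """Hypothetical stand-in for the notifier: records calls, sends nothing."""

    def __init__(self):
        self.calls = []

    def sample(self, ctxt, event_type, payload):
        self.calls.append((event_type, payload))


def send_samples(notifier, samples, batch):
    """Send one message with all samples if batch is true, else one each."""
    groups = [samples] if batch else [[s] for s in samples]
    for group in groups:
        if group:  # nothing is sent for an empty batch
            notifier.sample({}, 'telemetry.polling', {'samples': group})


batched = RecordingNotifier()
send_samples(batched, [{'name': 'cpu'}, {'name': 'memory'}], batch=True)
unbatched = RecordingNotifier()
send_samples(unbatched, [{'name': 'cpu'}, {'name': 'memory'}], batch=False)
print(len(batched.calls), len(unbatched.calls))  # 1 2
```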
Step 3: How ceilometer on the controller node handles the data in the message queue
On the controller node, the sample method of the NotificationBase class in ceilometer/agent/plugin_base.py
receives the data that the ceilometer-compute service put on the message queue in step 2.
The method is as follows:
def sample(self, ctxt, publisher_id, event_type, payload, metadata):
    """RPC endpoint for notification messages at sample level

    When another service sends a notification over the message
    bus at sample priority, this method receives it.

    :param ctxt: oslo.messaging context
    :param publisher_id: publisher of the notification
    :param event_type: type of notification
    :param payload: notification payload
    :param metadata: metadata about the notification
    """
    notification = messaging.convert_to_old_notification_format(
        'sample', ctxt, publisher_id, event_type, payload, metadata)
    self.to_samples_and_publish(context.get_admin_context(), notification)
Analysis:
Because step 2 sends the meter data through the sample method of the oslo.messaging notifier, that is, at the "sample" priority,
the data in the message queue is handled on this side by an endpoint method that is also named sample.
One open question: how does the message queue tie the sender and the receiver together? As far as I recall this is done through the topic or the exchange,
and ceilometer presumably reads it from ceilometer.conf so that both sides use the same topic or exchange. Which concrete subclass is involved?
Step 4: Publishing the meter data
Step 3 calls the to_samples_and_publish method, which is as follows:
def to_samples_and_publish(self, context, notification):
    """Return samples produced by *process_notification*.

    Samples produced for the given notification.
    :param context: Execution context from the service or RPC call
    :param notification: The notification to process.
    """
    with self.manager.publisher(context) as p:
        p(list(self.process_notification(notification)))
Analysis:
1 This method calls
    p(list(self.process_notification(notification)))
to publish the meter data.
2 p is the function returned by the __enter__ method of a PublishContext object, a class defined in ceilometer/pipeline.py:
class PublishContext(object):
    def __init__(self, context, pipelines=None):
        pipelines = pipelines or []
        self.pipelines = set(pipelines)
        self.context = context

    def add_pipelines(self, pipelines):
        self.pipelines.update(pipelines)

    def __enter__(self):
        def p(data):
            for p in self.pipelines:
                p.publish_data(self.context, data)
        return p

    def __exit__(self, exc_type, exc_value, traceback):
        for p in self.pipelines:
            p.flush(self.context)
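The context-manager protocol used by PublishContext can be tried out on its own. The sketch below repeats the class in simplified form and pairs it with a hypothetical FakePipeline that records what it is asked to do; neither the fake nor the sample values come from ceilometer.

```python
class FakePipeline(object):
    """Hypothetical pipeline that records publish and flush calls."""

    def __init__(self, name):
        self.name = name
        self.events = []

    def publish_data(self, context, data):
        self.events.append(('publish', list(data)))

    def flush(self, context):
        self.events.append(('flush',))


class PublishContext(object):
    """Simplified copy of the class shown above."""

    def __init__(self, context, pipelines=None):
        self.pipelines = set(pipelines or [])
        self.context = context

    def __enter__(self):
        # The object bound by "with ... as p" is this inner function.
        def publish(data):
            for pipeline in self.pipelines:
                pipeline.publish_data(self.context, data)
        return publish

    def __exit__(self, exc_type, exc_value, traceback):
        # Runs when the with block ends, even if it raised.
        for pipeline in self.pipelines:
            pipeline.flush(self.context)


pipe = FakePipeline('cpu_pipeline')
with PublishContext(context=None, pipelines=[pipe]) as p:
    p(['sample-1', 'sample-2'])
print(pipe.events)
# [('publish', ['sample-1', 'sample-2']), ('flush',)]
```

The flush on exit is what gives transformers that buffer samples a chance to emit them once the notification has been processed.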
3 The with statement calls __enter__, and the function it returns calls the publish_data method of each pipeline object,
here objects of the SamplePipeline class in ceilometer/pipeline.py. That method is as follows:
class SamplePipeline(Pipeline):
    def publish_data(self, ctxt, samples):
        if not isinstance(samples, list):
            samples = [samples]
        supported = [s for s in samples if self.source.support_meter(s.name)
                     and self._validate_volume(s)]
        self.sink.publish_samples(ctxt, supported)
4 The method above calls
    self.sink.publish_samples(ctxt, supported)
to do the actual publishing.
This is the publish_samples method of class SampleSink(Sink) in ceilometer/pipeline.py, which is as follows:
def publish_samples(self, ctxt, samples):
    self._publish_samples(0, ctxt, samples)
5 The _publish_samples(0, ctxt, samples) method called above is as follows:
def _publish_samples(self, start, ctxt, samples):
    """Push samples into pipeline for publishing.

    :param start: The first transformer that the sample will be injected.
                  This is mainly for flush() invocation that transformer
                  may emit samples.
    :param ctxt: Execution context from the manager or service.
    :param samples: Sample list.
    """
    transformed_samples = []
    if not self.transformers:
        transformed_samples = samples
    else:
        for sample in samples:
            LOG.debug(
                "Pipeline %(pipeline)s: Transform sample "
                "%(smp)s from %(trans)s transformer", {'pipeline': self,
                                                       'smp': sample,
                                                       'trans': start})
            sample = self._transform_sample(start, ctxt, sample)
            if sample:
                transformed_samples.append(sample)
    if transformed_samples:
        for p in self.publishers:
            try:
                p.publish_samples(ctxt, transformed_samples)
            except Exception:
                LOG.exception(_(
                    "Pipeline %(pipeline)s: Continue after error "
                    "from publisher %(pub)s") % ({'pipeline': self,
                                                 'pub': p}))
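Analysis:
The most important part of this method is that it applies the pipeline's transformer plugins to the meter values, for example
to convert a meter into a rate or to change its unit. This is done by the call
    sample = self._transform_sample(start, ctxt, sample)
6 Transforming the meter values
The _transform_sample(start, ctxt, sample) method above is as follows: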
class SampleSink(Sink):
    NAMESPACE = 'ceilometer.publisher'

    def _transform_sample(self, start, ctxt, sample):
        try:
            for transformer in self.transformers[start:]:
                sample = transformer.handle_sample(ctxt, sample)
                if not sample:
                    LOG.debug(
                        "Pipeline %(pipeline)s: Sample dropped by "
                        "transformer %(trans)s", {'pipeline': self,
                                                  'trans': transformer})
                    return
            return sample
        # (the except clause of this try block is not included in this excerpt)
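Analysis:
This method iterates over the transformers configured for the meter, starting at index start, and applies each one in turn to the sample;
if a transformer returns nothing the sample is dropped, otherwise the caller appends the transformed sample to a list that is finally published.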
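The chaining behaviour can be sketched with two toy transformers. DropNegative, Scale and the dict-based samples below are hypothetical and exist only for illustration; only the loop in transform_sample follows the excerpt above.

```python
class DropNegative(object):
    """Hypothetical transformer: drops samples with a negative volume."""

    def handle_sample(self, ctxt, sample):
        return sample if sample['volume'] >= 0 else None


class Scale(object):
    """Hypothetical transformer: multiplies the volume by a fixed factor."""

    def __init__(self, factor):
        self.factor = factor

    def handle_sample(self, ctxt, sample):
        out = dict(sample)
        out['volume'] = sample['volume'] * self.factor
        return out


def transform_sample(transformers, start, ctxt, sample):
    """Apply transformers[start:] in order; stop as soon as one drops it."""
    for transformer in transformers[start:]:
        sample = transformer.handle_sample(ctxt, sample)
        if not sample:
            return None
    return sample


chain = [DropNegative(), Scale(2)]
kept = transform_sample(chain, 0, None, {'name': 'cpu', 'volume': 3})
dropped = transform_sample(chain, 0, None, {'name': 'cpu', 'volume': -1})
# Starting at index 1 skips DropNegative, so the negative sample survives.
skipped_first = transform_sample(chain, 1, None, {'name': 'cpu', 'volume': -1})
print(kept, dropped, skipped_first)
```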
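7 How the rate-of-change transformer works
The code above calls
    sample = transformer.handle_sample(ctxt, sample)
to transform the sampled value of a meter.
The rate-of-change transformer is used here as an example of the processing.
It corresponds to
the RateOfChangeTransformer class in ceilometer/transformer/conversions.py,
whose handle_sample method is as follows: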
def handle_sample(self, context, s):
    """Handle a sample, converting if necessary."""
    LOG.debug('handling sample %s', s)
    '''
    (Pdb) p s.name
    u'cpu'
    (Pdb) p s.resource_id
    u'5de08a0d-ec72-4b2c-b7cf-3ce03db9f43d'
    (Pdb) p key
    u'cpu5de08a0d-ec72-4b2c-b7cf-3ce03db9f43d'
    '''
    key = s.name + s.resource_id
    '''
    (Pdb) p self.cache
    {}
    (Pdb) p prev
    None
    (Pdb) p timestamp
    datetime.datetime(2018, 9, 29, 8, 1, 28, 151206, tzinfo=<iso8601.iso8601.Utc object at 0x2595c90>)
    (Pdb) p s.volume
    302000000000
    '''
    prev = self.cache.get(key)
    timestamp = timeutils.parse_isotime(s.timestamp)
    self.cache[key] = (s.volume, timestamp)
    if prev:
        prev_volume = prev[0]
        prev_timestamp = prev[1]
        time_delta = timeutils.delta_seconds(prev_timestamp, timestamp)
        # disallow violations of the arrow of time
        if time_delta < 0:
            LOG.warn(_('dropping out of time order sample: %s'), (s,))
            # Reset the cache to the newer sample.
            self.cache[key] = prev
            return None
        # we only allow negative volume deltas for noncumulative
        # samples, whereas for cumulative we assume that a reset has
        # occurred in the interim so that the current volume gives a
        # lower bound on growth
        volume_delta = (s.volume - prev_volume
                        if (prev_volume <= s.volume or
                            s.type != sample.TYPE_CUMULATIVE)
                        else s.volume)
        rate_of_change = ((1.0 * volume_delta / time_delta)
                          if time_delta else 0.0)
        s = self._convert(s, rate_of_change)
        LOG.debug('converted to: %s', s)
    else:
        '''
        The very first sample has no predecessor, so no rate
        (utilization) can be computed for it.
        '''
        LOG.warn(_('dropping sample with no predecessor: %s'),
                 (s,))
        s = None
    return s
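Analysis:
1 The rate is computed with the help of a cache whose keys are
the meter name concatenated with the resource id, for example:
cpu5de08a0d-ec72-4b2c-b7cf-3ce03db9f43d
On every call the value for this key is set to the tuple (sample volume, timestamp).
If this is the first sample for the key: there is no previous entry, so the sample is dropped and None is returned.
Otherwise: the previous volume and the previous timestamp are taken from the cache,
and the time difference between the current and the previous timestamp is computed (if it is negative, the sample is out of order: it is dropped and the previous cache entry is restored);
if the current volume >= the previous volume, or the sample is not cumulative, the volume delta is the current volume minus the previous volume;
otherwise (a cumulative counter that went backwards, which is treated as a reset) the volume delta is the current volume;
the volume delta divided by the time difference gives the rate (0.0 if the time difference is zero).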
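The calculation can be reproduced outside ceilometer. The following is a minimal sketch under stated assumptions: rate_of_change is a hypothetical free function, timestamps are plain datetime objects, and the two volumes are the ones from the pdb output in this article with the timestamps rounded to whole seconds (120 seconds apart).

```python
from datetime import datetime

TYPE_CUMULATIVE = 'cumulative'


def rate_of_change(cache, key, volume, timestamp, sample_type):
    """Return the per-second rate for this sample, or None if it is dropped."""
    prev = cache.get(key)
    cache[key] = (volume, timestamp)
    if prev is None:
        return None  # first sample for this key: no predecessor
    prev_volume, prev_timestamp = prev
    time_delta = (timestamp - prev_timestamp).total_seconds()
    if time_delta < 0:
        cache[key] = prev  # out-of-order sample: keep the previous entry
        return None
    if prev_volume <= volume or sample_type != TYPE_CUMULATIVE:
        volume_delta = volume - prev_volume
    else:
        volume_delta = volume  # cumulative counter was reset
    return 1.0 * volume_delta / time_delta if time_delta else 0.0


cache = {}
key = 'cpu' + '5de08a0d-ec72-4b2c-b7cf-3ce03db9f43d'
r1 = rate_of_change(cache, key, 302000000000,
                    datetime(2018, 9, 29, 8, 1, 28), TYPE_CUMULATIVE)
r2 = rate_of_change(cache, key, 307280000000,
                    datetime(2018, 9, 29, 8, 3, 28), TYPE_CUMULATIVE)
print(r1)  # None
print(r2)  # 44000000.0 (nanoseconds of CPU time per second)
```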
2 How the transformation expression in the pipeline is evaluated
In pipeline.yaml:
    - name: cpu_sink
      transformers:
          - name: "rate_of_change"
            parameters:
                target:
                    name: "cpu_util"
                    unit: "%"
                    type: "gauge"
                    scale: "100.0 / (10**9 * (resource_metadata.cpu_number or 1))"
      publishers:
          - notifier://
3 The handle_sample method of the RateOfChangeTransformer class in ceilometer/transformer/conversions.py
calls the following method:
    s = self._convert(s, rate_of_change)
which is as follows:
def _convert(self, s, growth=1):
    """Transform the appropriate sample fields."""
    return sample.Sample(
        name=self._map(s, 'name'),
        unit=self._map(s, 'unit'),
        type=self.target.get('type', s.type),
        volume=self._scale(s) * growth,
        user_id=s.user_id,
        project_id=s.project_id,
        resource_id=s.resource_id,
        timestamp=s.timestamp,
        resource_metadata=s.resource_metadata
    )
Analysis:
3.1 This calls the self._map method, which is as follows:
def _map(self, s, attr):
    """Apply the name or unit mapping if configured."""
    '''
    (Pdb) p self.source
    {}
    '''
    mapped = None
    from_ = self.source.get('map_from')
    to_ = self.target.get('map_to')
    if from_ and to_:
        if from_.get(attr) and to_.get(attr):
            try:
                mapped = re.sub(from_[attr], to_[attr], getattr(s, attr))
            except Exception:
                pass
    return mapped or self.target.get(attr, getattr(s, attr))
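When no map_from/map_to regular-expression mapping is configured (self.source is empty here), this method falls back to the sink definition of the meter read from pipeline.yaml, for example
(Pdb) p self.target
{'scale': '100.0 / (10**9 * (resource_metadata.cpu_number or 1))',
'type': 'gauge',
'name': 'cpu_util',
'unit': '%'}
and takes the name, unit and so on of the transformed meter from target,
for example the meter name cpu_util.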
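The two paths through _map, regular-expression mapping and fallback to the target value, can be sketched as follows. map_attr is a hypothetical helper, and the disk pattern is only an illustrative example of a map_from/map_to pair; it is not taken from the configuration shown in this article.

```python
import re


def map_attr(value, map_from=None, map_to=None, target_value=None):
    """Rewrite value with a regex mapping if configured, else use the target.

    map_from is a regular expression, map_to a replacement string that may
    contain backreferences such as \\1.
    """
    mapped = None
    if map_from and map_to:
        try:
            mapped = re.sub(map_from, map_to, value)
        except Exception:
            pass  # an invalid pattern falls through to the target value
    if mapped:
        return mapped
    return target_value if target_value is not None else value


# Illustrative mapping (assumption): append ".rate" to matching meter names.
renamed = map_attr('disk.read.bytes',
                   map_from=r'disk\.(read|write)\.(bytes|requests)',
                   map_to=r'disk.\1.\2.rate')
# No mapping configured: the value from target wins, as for cpu -> cpu_util.
fallback = map_attr('cpu', target_value='cpu_util')
print(renamed)   # disk.read.bytes.rate
print(fallback)  # cpu_util
```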
3.2 The most important part here is the expression
    volume=self._scale(s) * growth,
which computes the actual rate value of the meter.
The self._scale method is as follows:
def _scale(self, s):
    """Apply the scaling factor.

    Either a straight multiplicative factor or else a string to be eval'd.
    """
    ns = transformer.Namespace(s.as_dict())
    scale = self.scale
    return ((eval(scale, {}, ns) if isinstance(scale, six.string_types)
             else s.volume * scale) if scale else s.volume)
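Analysis:
The code above uses Python's eval(expression, globals=None, locals=None)
to compute the value of the string expression, with ns passed as the local namespace.
For example: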
(Pdb) p ns.resource_metadata.cpu_number
1
'100.0 / (10**9 * (resource_metadata.cpu_number or 1))'
Result = 100.0 / (10**9 * 1) = 10**(-7)
Here ns is a Namespace object:
(Pdb) p ns.__dict__
defaultdict(<function <lambda> at 0x3fe4d70>, {'user_id': u'da4dbe35880943419199205b9de63787', 'name': u'cpu', 'resource_id': u'5de08a0d-ec72-4b2c-b7cf-3ce03db9f43d', 'timestamp': u'2018-09-29T08:03:28.180313', 'id': u'25e58ca6-c3be-11e8-a5ec-d6d221cfca45', 'volume': 307280000000, 'source': u'openstack', 'project_id': u'd01b1e60111d484d9cb7d248dc5863f7', 'type': u'cumulative', 'resource_metadata': <ceilometer.transformer.Namespace object at 0x3ab7710>, 'unit': u'ns'})
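In essence, the scale factor of the cpu_util meter = 100.0 / (10**9 * number of CPUs of the instance).
The multiplication by 100.0 is there because the unit is %.
The division by (10**9 * number of CPUs of the instance) is there because the cpu meter is the total CPU time used by all CPUs of the instance, in nanoseconds (unit 'ns'): dividing by 10**9 converts nanoseconds to seconds, and dividing by the number of CPUs normalizes the value to a single CPU.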
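The evaluation of the scale expression can be reproduced with a small stand-in for the namespace. The Namespace class below is a simplified assumption, not ceilometer's transformer.Namespace, and scale_factor is a hypothetical helper; the rate 44000000.0 ns/s corresponds to the two cpu volumes from the pdb output, taken 120 seconds apart.

```python
class Namespace(object):
    """Simplified stand-in: dict keys become attributes, missing ones None."""

    def __init__(self, seed):
        for key, value in seed.items():
            setattr(self, key,
                    Namespace(value) if isinstance(value, dict) else value)

    def __getattr__(self, name):
        # Only reached when normal attribute lookup fails.
        if name.startswith('__'):
            raise AttributeError(name)
        return None


def scale_factor(expression, sample_dict):
    """Evaluate the scale expression with the sample fields as local names."""
    local_names = {
        key: Namespace(value) if isinstance(value, dict) else value
        for key, value in sample_dict.items()
    }
    # eval runs arbitrary code: acceptable only for trusted configuration.
    return eval(expression, {}, local_names)


SCALE = "100.0 / (10**9 * (resource_metadata.cpu_number or 1))"
one_cpu = scale_factor(SCALE, {'resource_metadata': {'cpu_number': 1}})
two_cpus = scale_factor(SCALE, {'resource_metadata': {'cpu_number': 2}})
no_count = scale_factor(SCALE, {'resource_metadata': {}})

cpu_util = one_cpu * 44000000.0
print('%.2f' % cpu_util)  # 4.40
```

With two CPUs the same CPU time yields half the utilization, and a missing cpu_number falls back to 1 through the "or 1" in the expression.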
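Summary:
1 The ceilometer compute service sets up polling tasks for a number of meters on the compute node, periodically calls each meter plugin's method for obtaining the meter value,
and then sends the collected meter data to the message queue;
2 The ceilometer notification service on the controller node consumes the meter data that the ceilometer compute
service sent to the message queue and runs it through the pipeline [for example, the CPU utilization meter goes through the pipeline's transformer, which uses Python's eval(
expression, globals=None, locals=None) to evaluate the scale expression in the target of the pipeline's sink section, for example:
scale: "100.0 / (10**9 * (resource_metadata.cpu_number or 1))"
], and then sends the data to the ceilometer collector service.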