Some collected notes on Prometheus

Characteristics of microservices in a cloud environment:
monitored objects change dynamically and cannot be configured in advance;
the monitoring scope is broad and complex, making integration difficult;
calls between microservices are complex, making failures hard to troubleshoot;

The Cloud Native Computing Foundation (CNCF), under the Linux Foundation, gives three characteristics of cloud-native applications:
Containerized packaging: running on top of containers raises the overall level of development, enables code reuse and composition, and simplifies the maintenance of cloud-native applications. The application and its processes run inside containers and are deployed as independent units, achieving a high degree of resource isolation.
Dynamic management: applications are dynamically scheduled and managed by a centralized orchestration system.
Microservice-oriented: dependencies between services are explicit, and services are decoupled from each other.

- Kubernetes: open-source system for managing containerized applications across multiple hosts in a cluster;
- Prometheus: open-source monitoring solution focused on time-series data, providing broad integration support for clients and third-party data consumers;
- OpenTracing: vendor-neutral open standard for distributed tracing;
- Fluentd: open-source data collector that creates a unified logging layer.

Advantages:
flexible data model: a monitoring sample consists of a value, a timestamp, and labels; source metadata is recorded in labels, and labels can be modified at collection time, which gives the model strong extensibility;
powerful query capabilities: a large number of built-in functions are provided, and in most cases the aggregated data you need can be obtained with a PromQL query;
healthy ecosystem: common operating systems / middleware / databases / programming languages are supported; SDKs for Java / Go / Ruby and others are provided, so custom monitoring can be implemented quickly;
good performance: with sufficient hardware resources, a single Prometheus instance can collect on the order of 100,000 samples per second while still performing well on data processing and queries;
sound architecture: a pull model in which the server decides what to pull; based on service discovery the server can automatically find monitored targets, and data can be spread across multiple servers through cluster sharding mechanisms (a minimal scrape configuration is sketched below);
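
As a minimal sketch of the pull model with a statically configured target (the job name and the target address are placeholders, not taken from these notes):

global:
  scrape_interval: 15s          # how often targets are pulled
scrape_configs:
  - job_name: node              # example job scraping a Node Exporter
    static_configs:
      - targets:
          - 192.168.0.10:9100   # placeholder address of a monitored host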

Shortcomings:
no log monitoring, no distributed tracing, and data can be lost (it is not a durable long-term store)

From the storage perspective all monitoring metrics are the same, but there are subtle differences between metrics in different scenarios.
For example, the samples returned by node_load1 from the Node Exporter reflect the current system load, and the returned sample values keep changing over time.
The samples collected for node_cpu are different: the value increases continuously, because it reflects the cumulative CPU usage time; in theory, as long as the system does not shut down, this value grows without bound.

To help users understand and distinguish between these different monitoring metrics, Prometheus defines four metric types: Counter, Gauge, Histogram, and Summary.
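
As a rough illustration of how these types are consumed differently in PromQL, here is a sketch of recording rules; the rule names and the http_request_duration_seconds metric are illustrative assumptions, while node_cpu and node_load1 follow the metric names used elsewhere in these notes:

groups:
  - name: metric-type-examples
    rules:
      # Counter: monotonically increasing, so wrap it in rate()/irate() to get a per-second rate
      - record: instance:node_cpu:irate2m
        expr: avg(irate(node_cpu{mode!="idle"}[2m])) without (cpu, mode)
      # Gauge: reflects current state and can be used or aggregated directly
      - record: cluster:node_load1:avg
        expr: avg(node_load1)
      # Histogram: quantiles are estimated from the cumulative _bucket series
      - record: job:http_request_duration_seconds:p95
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, job))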

Some further ideas:

  1. Service discovery
  2. Support for both black-box and white-box monitoring:
    White-box monitoring gives insight into the actual internal running state of a service; by observing monitoring metrics we can anticipate problems that may arise and address potential risks in advance.
    Black-box monitoring, such as the common HTTP and TCP probes, can quickly notify the relevant people or systems when a failure occurs so that it can be handled.
    • Long-term trend analysis: by continuously collecting monitoring samples over time, the long-term trend of a metric can be analyzed. For example, from the growth rate of disk usage we can predict in advance at what point in time resources will need to be expanded.
    • Comparative analysis:
  3. Refinement (the four golden signals):
    • Latency: the time required to serve a request.
    • Traffic: the load on the system, used to measure the capacity demands on the service.
    • Errors: the errors occurring across all current requests, used to measure the system's current error rate.
    • Saturation: a measure of how saturated the service currently is.

Prometheus is responsible for collecting the required performance metrics (for example, the current number of concurrent connections or the current CPU usage) and generating alert events according to the defined alerting rules; the alert events are then passed to Alertmanager, which triggers a webhook that ultimately performs the pod scaling.
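
A minimal sketch of an alerting rule that could feed such a pipeline; the 80% threshold, the rule name, and the labels are illustrative assumptions rather than values from these notes:

groups:
  - name: pod-scaling
    rules:
      - alert: HighCpuUsage
        # fires when average non-idle CPU usage stays above 80% for 5 minutes
        expr: avg(irate(node_cpu{mode!="idle"}[2m])) without (cpu, mode) > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "CPU usage above 80% on {{ $labels.instance }}"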

The Prometheus server periodically pulls data from statically configured targets or from targets found through service discovery.
When the newly pulled data exceeds the configured memory buffer, Prometheus persists the data to disk.
Prometheus evaluates the configured rules against the data it polls; when a rule's condition is triggered, the alert is pushed to the configured Alertmanager.
When Alertmanager receives alerts, it groups, deduplicates, and reduces noise according to its configuration, and finally sends out the notifications.
The data can be queried and aggregated through the API, the Prometheus console, or Grafana.
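
On the Alertmanager side, the grouping, deduplication, and webhook routing can be sketched roughly as follows; the webhook URL is a hypothetical placeholder for the service that performs the pod scaling:

route:
  group_by: ['alertname', 'instance']   # aggregate related alerts into one notification
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: scaling-webhook
receivers:
  - name: scaling-webhook
    webhook_configs:
      - url: http://pod-scaler.example.com/scale   # hypothetical scaling service endpoint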

What is the current CPU usage of the system?
avg(irate(node_cpu{mode!="idle"}[2m])) without (cpu, mode)
Which are the top 5 hosts by CPU usage?
topk(5, avg(irate(node_cpu{mode!="idle"}[2m])) without (cpu, mode))
What will the disk space usage roughly be four hours from now?
predict_linear(node_filesystem_free{job="node"}[2h], 4 * 3600)

Here avg() and topk() are built-in PromQL aggregation operators, while irate() and predict_linear() are built-in PromQL functions. irate() calculates the per-second instantaneous rate of change from the samples in a time range; predict_linear() predicts the trend of the data using simple linear regression.

Black-box monitoring with the Blackbox Exporter

In the previous section we introduced the use of the Node Exporter. Exporters like this are mainly used to monitor the internals of services or infrastructure, i.e. white-box monitoring: by observing monitoring metrics we can anticipate problems that may arise and address potential risks in advance.
From the point of view of complete monitoring, in addition to extensive white-box monitoring of applications, appropriate black-box monitoring should also be added.
Black-box monitoring tests the accessibility of a service from an external, user-like point of view; common black-box probes include HTTP and TCP probes, used to detect the availability and access latency of sites and services.
Compared with white-box monitoring, the biggest difference is that black-box monitoring is failure-oriented: when a failure occurs, black-box monitoring can detect it quickly, while white-box monitoring focuses on proactively discovering or predicting potential problems.
A sound monitoring goal is to be able to discover potential problems from the white-box perspective and to quickly detect problems that have already occurred from the black-box perspective.
By analogy with the famous testing pyramid in agile development, complete monitoring needs extensive white-box monitoring of the internal running state of services, so that failure analysis can be supported effectively.
It also needs some black-box monitoring to detect whether the main services have failed.
The Blackbox Exporter is the official black-box monitoring solution provided by the Prometheus community; it lets users probe the network over HTTP, HTTPS, DNS, TCP, and ICMP.
Users can fetch the Blackbox Exporter source directly with the go get command and build a local executable with make.
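
A minimal sketch of how the Blackbox Exporter is usually wired up: an HTTP probe module in blackbox.yml plus a Prometheus scrape job that relabels the probed targets; example.com and the exporter address are placeholders:

# blackbox.yml: define an HTTP probe module
modules:
  http_2xx:
    prober: http
    timeout: 5s

# prometheus.yml: scrape job that routes targets through the exporter's /probe endpoint
scrape_configs:
  - job_name: blackbox-http
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://example.com            # site to probe (placeholder)
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target       # pass the original target as the ?target= parameter
      - source_labels: [__param_target]
        target_label: instance             # keep the probed URL as the instance label
      - target_label: __address__
        replacement: 127.0.0.1:9115        # address of the Blackbox Exporter itself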

All sample data is stored in blocks organized by time window, which lets Prometheus greatly improve query efficiency: when querying all samples within a time range, only the blocks that fall inside that range need to be read.
Deleting historical data also becomes very simple: just delete the directory of the corresponding block.
For a single Prometheus node, this local, file-based storage can support millions of monitoring series and process hundreds of thousands of data points per second. To keep itself simple to manage and deploy, Prometheus gives up the complexity of managing HA.
So for this kind of storage we first need to be clear about a few points:
Prometheus itself is not suited to long-term persistence of historical data; by default Prometheus keeps only 15 days of data.
Local storage also means that Prometheus itself cannot scale elastically in an effective way.
When the monitoring scale becomes very large, the main challenges for a single Prometheus instance include the following:
service availability: how to avoid Prometheus becoming a single point of failure;
a larger monitoring scale means more Prometheus jobs, and collection (write) operations become very resource-intensive;
it also means that large amounts of data need to be stored.
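
One common way to offload long-term storage from a single Prometheus instance is the remote_write interface, which forwards samples to an external store; a minimal sketch, with a purely hypothetical endpoint URL:

remote_write:
  - url: http://remote-storage.example.com/api/v1/write   # hypothetical long-term storage endpoint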

For container environments such as Kubernetes or the cloud, an important problem Prometheus must solve is how to dynamically discover all the targets that need to be monitored in a Kubernetes environment.

For Kubernetes, we can divide its resources into several categories:
infrastructure layer (Node): the cluster nodes, which provide runtime resources for the whole cluster and its applications;
container infrastructure (Container): provides the runtime environment for applications;
user applications (Pod): a Pod contains a group of containers that work together and provide one feature (or a group of features) to the outside;
internal service load balancing (Service): inside the cluster, a Service exposes the functionality of in-cluster applications and provides load balancing for access between applications within the cluster;
external access entry (Ingress): provides an entry point for access from outside the cluster, so that external clients can reach services deployed inside the Kubernetes cluster.

Therefore, leaving Kubernetes' own components aside, a complete monitoring system should cover the following five aspects (a rough service discovery configuration is sketched after the list):
cluster node status monitoring: obtain the running state of the node's basic services from the kubelet of each cluster node;
cluster node resource usage monitoring: deploy the Node Exporter as a DaemonSet so that it runs on every node in the cluster and collects each node's resource usage;
monitoring of containers running on the nodes: obtain the running state and resource usage of all containers on each node from the cAdvisor built into the kubelet;
from the black-box monitoring angle, deploy the Blackbox Exporter in the cluster to probe the availability of Services and Ingresses;
if an application deployed in the cluster has built-in Prometheus monitoring support, we should also discover the corresponding Pod instances and collect their internal monitoring metrics from them.
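
A rough sketch of the corresponding service discovery configuration using kubernetes_sd_configs; the service-account paths and the prometheus.io/scrape annotation convention are common practice and assumptions here, not something prescribed by these notes:

scrape_configs:
  # cluster nodes: discover every kubelet through the Kubernetes API
  - job_name: kubernetes-nodes
    kubernetes_sd_configs:
      - role: node
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

  # application Pods: scrape only Pods that opt in via the prometheus.io/scrape annotation
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"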

Appendix: what is a time-series database?
As mentioned above, Prometheus is a monitoring system based on a time-series database, often abbreviated TSDB (Time Series Database). Many popular monitoring systems use a time-series database to store data, because the characteristics of time-series databases match those of monitoring systems:
writes: frequent writes are needed, and writes arrive in chronological order;
deletes: no random deletion; normally all data in a block of time is deleted at once;
updates: data that has been written does not need to be updated;
reads: highly concurrent reads are needed, reads scan data in ascending or descending time order, the volume of data read is very large, and caching is of little use.

With Heapster no longer being developed and maintained, and the InfluxDB clustering solution no longer open source, the Heapster + InfluxDB monitoring scheme is only suitable for relatively small k8s clusters. The Prometheus community, on the other hand, is very active, and besides the official ones it offers a range of high-quality exporters, such as node_exporter. The Telegraf (centralized metrics collection) + Prometheus scheme is also a very good way to reduce the workload of deploying and managing a variety of exporters.
Today we mainly talk about some practical experience with storage in our use of Prometheus.

What are the characteristics of this kind of monitoring?
Monitored objects change dynamically and cannot be configured in advance; the monitoring scope is broad and complex, making monitoring difficult; calls between microservices are complex, making troubleshooting difficult.
Configuration?
Graphical configuration versus configuration files.
Database?
Relational database versus time-series database.
Zabbix already exists, so why introduce Prometheus?
Under high concurrency, the read and write performance of the Prometheus time-series database is far higher than that of a traditional relational database, and it provides many built-in time-based processing functions that reduce the difficulty of data aggregation.

Choosing a monitoring system
Prometheus monitoring scenarios: service monitoring, performance monitoring, container monitoring, microservice monitoring, part of application monitoring (whatever the application itself can expose for monitoring)
Zabbix monitoring scenarios: hardware monitoring, system monitoring, network monitoring, part of application monitoring (e.g. Oracle), other monitoring (URL monitoring, port monitoring)

Language extensions: Go / Java / Python / Ruby / Bash ...
Plugin extensions: databases (MySQL / MongoDB), hardware (Node / IBM), messaging systems (Kafka / RabbitMQ), HTTP (Apache / Nginx) ...
Custom metrics: PushGateway (e.g. http_request_duration_seconds_bucket{le="0.05"} 24054)

In the first half of the talk there was not enough interaction with colleagues, so the session was relatively quiet; in the second half the Q&A and discussion were more active; hopefully later sessions can improve on this.

Questions about read/write speed and disk performance:

Questions about the time window being smaller than scrape_interval:

Questions about timestamp values:

Questions about a graphical configuration interface:


Source: blog.csdn.net/hcj1101292065/article/details/86470161