InfluxDB: FLUX query optimization, and using InfluxDB to build an alarm system

FLUX query optimization

Queries using predicate pushdown

Predicate pushdown is common in SQL queries. In SQL, a predicate usually refers to the WHERE condition.

Let's look at the simplest SQL statement. It queries data from a table named A and filters it with the condition n > 10.

select * from A where n > 10

You can imagine how this SQL statement is executed by the computer. Generally speaking, there are two ways:

  • One is to read all the data from disk into memory and then filter it in memory. We usually say this approach has no predicate pushdown.

  • The other is to fetch only the data you need from disk into memory during the query and then proceed to the next step. We usually say that predicate pushdown is implemented this way.

Although FLUX is ostensibly a scripting language, queries are not executed line by line; an optimizer participates in the execution. When a FLUX script runs, it tries its best to apply predicate pushdown. For which kinds of queries can achieve predicate pushdown, refer to the Optimize queries section of the official documentation: https://docs.influxdata.com/influxdb/v2.4/query-data/optimize-queries/
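
For example, the first pipeline below can be pushed down to the storage layer in full, while in the second the trailing filter() runs in memory because it comes after map(). This is a minimal sketch; the bucket, measurement and field names are made up for illustration.

// Pushed down: from() |> range() |> filter() is executed in the storage layer.
from(bucket: "example-bucket")
    |> range(start: -1h)
    |> filter(fn: (r) => r._measurement == "cpu" and r._field == "usage_user")

// Not pushed down: everything after map() is evaluated in memory,
// so the trailing filter() no longer reduces what is read from disk.
from(bucket: "example-bucket")
    |> range(start: -1h)
    |> map(fn: (r) => ({r with _value: r._value * 100.0}))
    |> filter(fn: (r) => r._value > 50.0)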

In addition, we will tell you how to view the execution plan of a query later.

Avoid setting window width too small

Windows (grouping data by time interval) are often used to aggregate and downsample data. Making the window wider improves performance: a window that is too narrow requires more computation to decide which window each record belongs to. A reasonable window width should be chosen based on the total time range of the query.
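
As a rough sketch (bucket and measurement names are placeholders): over a 30-day range, every: 1m produces about 43,200 windows per series, while every: 1h produces only 720, so prefer the wider window unless you really need the finer detail.

from(bucket: "example-bucket")
    |> range(start: -30d)
    |> filter(fn: (r) => r._measurement == "cpu")
    |> aggregateWindow(every: 1h, fn: mean)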

Avoid using "heavy" functions

The following functions are relatively heavy in FLUX; they use more memory and CPU. When using them, consider whether they are really necessary.

  • map()

  • reduce()

  • join()

  • union()

  • pivot()

However, InfluxData says it is continually optimizing FLUX's performance, so the current list will not necessarily hold in the future.

Use set() instead of map() whenever possible

If you want to attach a static constant to your data, set() has a great performance advantage over map(); map() is one of the heavy operations mentioned above. In the examples below we will compare the gap between the two operations.
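
As a preview, here are the two ways of attaching a constant column, written as pipeline fragments (the column name and value are made up):

// map() evaluates an anonymous function once per row:
 |> map(fn: (r) => ({r with source: "sensor-01"}))

// set() stamps the same key/value onto whole tables at once:
 |> set(key: "source", value: "sensor-01")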

Balancing data timeframe and data accuracy

To ensure good query performance, you should balance the query time range against data precision. If a measurement stores one point per second and you request 6 months of data at a time, a single series contains about 15.5 million points; with more series, the total can easily reach billions of points, all of which Flux has to pull into memory before returning to the user. So, on the one hand, make good use of predicate pushdown to minimize memory usage; on the other hand, if you must query a long time range, create a scheduled task that downsamples the data, and then point the query at the downsampled data instead of the raw data.
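
Such a downsampling task could look like the following sketch; the bucket names raw and downsampled and the intervals are assumptions to adapt to your data.

option task = {name: "downsample_co", every: 1h}

from(bucket: "raw")
    |> range(start: -task.every)
    |> filter(fn: (r) => r._measurement == "co")
    |> aggregateWindow(every: 1m, fn: mean)
    |> to(bucket: "downsampled")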

Use the FLUX performance analysis tool to view query performance

When executing a FLUX query, you can import a package called profiler and add an option to view the performance profile and execution plan of the current FLUX statement. For example:

import "profiler"

option profiler.enabledProfilers = ["query", "operator"]

The query and operator here are the two available profilers: query reports on the execution of the entire script, while operator reports on each individual operation of a FLUX query.

(1) query

query provides statistics about the entire Flux script execution. When enabled, an additional table is returned with the following information:

  • TotalDuration: Query total duration in nanoseconds

  • CompileDuration: The time it took to compile the query script (in nanoseconds)

  • QueueDuration: Time spent queuing (in nanoseconds)

  • RequeueDuration: Time spent requeuing (in nanoseconds)

  • PlanDuration: The time it took to plan the query (in nanoseconds)

  • ExecuteDuration: The time it took to execute the query (in nanoseconds)

  • Concurrency: The number of goroutines allocated to process the query

  • MaxAllocated: The maximum number of bytes (memory) allocated during the query

  • TotalAllocated: The total number of bytes allocated during the query (including memory that was freed and then allocated again)

  • RuntimeErrors: Error messages returned during query execution

  • flux/query-plan: flux query plan

  • influxdb/scanned-values: Number of data items scanned by the database on the disk

  • influxdb/scanned-bytes: The number of bytes scanned from disk by the database

(2) operator

operator provides statistics about each operation in the query script. Operations performed in the storage layer are returned as a single operation. When this profiler is enabled, the results include one more table with the following columns:

  • Type: operation type

  • Label: label

  • Count: The total number of times this operation is performed

  • MinDuration: The duration (in nanoseconds) of the fastest of the operation's executions

  • MaxDuration: The duration (in nanoseconds) of the slowest of the operation's executions

  • DurationSum: Total duration in nanoseconds for the current operation to complete.

  • MeanDuration: The average duration (in nanoseconds) that the operation was executed multiple times.

Example: Using profile to optimize queries

(1) Write a query

First, open Data Explorer and write the following code:

from(bucket: "test_init")
 |> range(start: -1h)
 |> filter(fn: (r) => r["_measurement"] == "go_goroutines")
 |> map(fn: (r) => ({r with hello:"world"}))

This code queries a measurement named go_goroutines from the test_init bucket. There are no tags under this measurement, so there is only one series.

The map function adds a constant column to the filtered data: the column name is hello and the value is the string world.

(2) Execute query

Now click SUBMIT to run the code above, then click View Raw Data. In the result you can see the added constant column.

(3) Modify the code to view performance indicators and execution plans

Now, let's make some modifications to the code so that we can observe the execution plan.

import "profiler"
option profiler.enabledProfilers = ["query","operator"]
from(bucket: "test_init")
 |> range(start: -1h)
 |> filter(fn: (r) => r["_measurement"] == "go_goroutines")
 |> map(fn: (r) => ({r with hello:"world"}))

Code explanation:

  • import "profiler" imports the profiler package; the enabledProfilers option is defined in this package and cannot be used without importing it.

  • option profiler.enabledProfilers is effectively a switch. When "query" appears in the list, the performance profile and execution plan of the whole query are reported; when "operator" appears, per-operator performance metrics are reported as well.

Now click SUBMIT again and look at the raw data. The queried data is displayed first; you need to scroll to the end of the results to see the performance metrics and execution plan.

We now see two extra tables. One, with _measurement profiler/query, holds the performance metrics and execution plan of the whole query. The other, with _measurement profiler/operator, holds the performance metrics of each operator, including how long each operator ran.

(4) How to judge predicate pushdown

According to the official documentation, when predicate pushdown is achieved, multiple operators are merged into one. In the operator list we can now see an operation called merged_ReadRange4_filter2, followed by our map operation. This shows that from -> range -> filter has been merged into a single storage-level operation.

(5) Check query performance

There are many query performance indicators, but we focus on just two. The first is MaxAllocated: the maximum number of bytes the query had allocated at any point, a good measure of how much memory the query needed.

The current value of this indicator is 52736, meaning the query needed about 51 KB of memory.

The other is TotalDuration, the total time taken by the query: currently 17365972 nanoseconds, about 17 ms.

(6) Add AggregateWindow after map

Now, we add an AggregateWindow function after the map. The overall code is as follows:

import "profiler"
option profiler.enabledProfilers = ["query","operator"]
from(bucket: "test_init")
 |> range(start: -1h)
 |> filter(fn: (r) => r["_measurement"] == "go_goroutines")
 |> map(fn: (r) => ({r with hello:"world"}))
 |> aggregateWindow(column: "_value", every: 1h, fn: mean)

(7) Check query performance

First look at the operator table: the number of operations has changed from 2 to 3. After map there is now an additional aggregate window operation.

The query's MaxAllocated is still 52736, unchanged.

Previously hundreds of rows had to be returned; with window aggregation only two rows are returned, so the query duration is shortened.

(8) Move AggregateWindow to the front of map

Now we move the aggregateWindow before the map, right after the filter. The modified code is as follows:

import "profiler"
option profiler.enabledProfilers = ["query","operator"]
from(bucket: "test_init")
 |> range(start: -1h)
 |> filter(fn: (r) => r["_measurement"] == "go_goroutines")
 |> aggregateWindow(column: "_value", every: 1h, fn: mean)
 |> map(fn: (r) => ({r with hello:"world"}))

(9) Check query performance

First look at the operator table. Previously it contained three operations; now there are two. Before map there is a ReadWindowAggregateByTime operation. In other words, our aggregateWindow operation has been pushed down to the storage layer.

When the disk read completes, only the two aggregated rows exist in memory. Now let's focus on the query performance metrics.

You can see that MaxAllocated has dropped to 864, from its previous value of 52736: from about 51 KB down to less than 1 KB.

The duration of queries has also been further reduced.

(10) Change map to set

Finally, it is worth mentioning how map operates on data: it processes the data set row by row. Here we used it to add a constant column. Another function, set, can accomplish the same task, and its logic is to operate on the whole data set at once.

The larger the amount of data, the more obvious the performance gap between the two operators becomes.

Here we remove the aggregateWindow operator and change map to set, so the comparison is against the first version of the code, before we added the aggregation.

The modified code is as follows:

import "profiler"
option profiler.enabledProfilers = ["query","operator"]

from(bucket: "test_init")
 |> range(start: -1h)
 |> filter(fn: (r) => r["_measurement"] == "go_goroutines")
 |> set(key: "hello", value: "world")

(11) Check query performance

After running, check the query performance. The set version's MaxAllocated is 51712, almost the same as the map version's 52736.

However, look at the TotalDuration indicator: 4612562 nanoseconds, versus 17365972 for the earlier map version.

This shows that set is considerably faster than map.

Use InfluxDB to build an alarm system

What is monitoring

Monitoring essentially means evaluating the data at regular intervals. For example, with a carbon monoxide concentration sensor, every minute we compute the average indoor carbon monoxide concentration over that minute, compare the result with a hard-coded threshold, and raise an alarm if it is exceeded. This is the basic logic of monitoring.

Therefore, monitoring in InfluxDB is really a scheduled task written as a FLUX script. However, both in the HTTP API and in the Web UI, InfluxDB treats it separately from ordinary scheduled tasks.

Understanding checks, notification endpoints and notification rules

In the left toolbar of the Web UI, click the Alerts button to open the alerting configuration page. The upper tab bar shows CHECKS, NOTIFICATION ENDPOINTS and NOTIFICATION RULES, corresponding to the three components InfluxDB uses for alerting.

The functions of the three components are as follows:

  • CHECKS: actually a scheduled task, which we can call a check task. A check task reads part of the data from the target bucket, performs a threshold check on it, and finally emits one of 4 status levels: CRIT (critical), WARN (warning), INFO (information) and OK (good).

  • NOTIFICATION ENDPOINTS: the components that send alarm signals to a specified address.

  • NOTIFICATION RULES: rules specifying, for example, which checks' problems trigger a WeChat alarm and which trigger an email notification. They are the routing between checks and notification endpoints.

Example: Simulate alarm for carbon monoxide concentration (※)

(1) Requirements

Suppose we have a sensor that collects carbon monoxide concentration. Every so often, it inserts a record over the IoT network into the InfluxDB deployed on our server, in the following format:

co,code=01 value=0.001 1664851126000

Now, we hope to use InfluxDB to complete the following alarm function.

  • When the CO concentration is greater than 0.04, a CRIT (critical) level notification signal is issued.

  • A WARN level notification signal is issued when the CO concentration is between 0.01 and 0.04.

  • When the CO concentration is lower than 0.01, an OK level notification signal is issued.

Ultimately, when the CO concentration exceeds the standard, we want the relevant staff to receive a phone call so they can respond to the incident quickly.
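
Written as Flux predicate functions, the three bands look like the sketch below. The tutorial configures them through the Web UI instead; this is only to make the requirement concrete.

crit = (r) => r.value > 0.04
warn = (r) => r.value > 0.01 and r.value <= 0.04
ok = (r) => r.value <= 0.01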

(2) Auxiliary tools

To make it easy to verify the behavior of the notification endpoint, we wrote a very simple HTTP service that only supports POST requests.

Use the following command to start an HTTP service that listens for POST requests on local port 8080:

./simpleHttpPostServer-linux-x64

After this command is executed, the terminal blocks; whenever the service receives a POST request, it prints the request body to the terminal. If 0.0.0.0 is not the host you want to bind, or port 8080 is already occupied, you can change them with the following two parameters.

For example:

./simpleHttpPostServer-linux-x64 -h localhost -p 8080

For details, please refer to the project address: https://github.com/realdengziqi/simpleHttpPostServer

(3) Create a new bucket

To avoid mixing this example's data with the previous examples, we first create a new bucket called example_alert.

(4) Prepare data template

In this example we will insert data into InfluxDB manually, one record at a time. Open a text editor (this tutorial uses VS Code) and write a line-protocol data template; later we can copy the template, tweak the value, and insert it.

The data template is as follows:

co,code=01 value=0.001

(5) Insert one or two pieces of data in advance

This step exists because creating a check requires the query builder to have something selectable. To create the check smoothly, this step cannot be omitted.

Here, we import one record at a time through the Web UI window for importing line-protocol data:

The data is as follows. First time:

co,code=01 value=0.0015

Second time:

co,code=01 value=0.0025

(6) Create a check (CHECK)

Click the Alerts button in the left toolbar. By default, you will land on the CHECKS page.

Hover over the CREATE button in the upper right corner and a drop-down menu with two options pops up:

  • THRESHOLD CHECK: this type of check determines whether the data exceeds certain threshold limits.

  • DEADMAN CHECK: this type of check determines how long it has been since a series last received new data. You can set a limit, for example emitting a warning signal once a series has written no data for more than 30 seconds (see the sketch after this list).
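
    For reference, a rough sketch of a deadman condition in Flux, reusing this example's bucket and measurement (the UI generates something more elaborate):

    import "experimental"
    import "influxdata/influxdb/monitor"

    from(bucket: "example_alert")
        |> range(start: -5m)
        |> filter(fn: (r) => r._measurement == "co")
        |> monitor["deadman"](t: experimental["subDuration"](d: 30s, from: now()))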

Here, we select Threshold Check to create a threshold check.

A dialog window then pops up. Its layout is very similar to Data Explorer, but the functionality differs slightly.

  • At the top is Name this Check; click it to name the check being created.

  • In the upper left corner is a tab bar; DEFINE QUERY is selected by default.

  • At the bottom of the page is a query builder. Note that we cannot switch to the script editor here; the query can only be built with the query builder.

  • On the far right is a checklist. To create a threshold check, you must choose:

    • a field

    • An aggregate function (that is, the aggregate function after windowing)

    • One or more ranges.

Now we construct the query, as follows:

  1. For the bucket, select example_alert.

  2. For _measurement, select co.

  3. Note that although we currently have only one series under this measurement, _field=value must still be added to the filter conditions, otherwise the One Field check item in the upper right will not pass.

  4. Finally, change the aggregation function from the default mean to max.

  5. Click SUBMIT to preview the queried data.

  6. Click the CONFIGURE CHECK tab in the upper left corner. This takes us to a new page where we can configure the thresholds.

The first thing to notice is that only the bottom half of the page has changed.

  • The leftmost card further configures the query and scheduling. Here we set Schedule Every to 15s so that the check runs every 15 seconds.

  • The STATUS MESSAGE TEMPLATE in the middle is the status message template. Shell-style ${} interpolation is supported here; the meaning of r will be explained in detail later. Keep the default template for now.

  • The THRESHOLDS card on the far right sets the value ranges. There are 4 kinds of threshold fields, corresponding to the 4 statuses a check can emit. They are, respectively:

    • CRIT (the first 4 letters of critical): serious, urgent.

    • WARN (warning): a warning.

    • INFO (information): general information, a reminder.

    • OK: in good condition.

Now click the CRIT button in the lower right corner and a small settings window pops up.

It means: when the value is above a certain number, the check status is set to CRIT. The "is above" to the right of "When value" means greater than; it is a drop-down menu with more options, including is below (less than) and is inside range (within a given range).

The 0.00125 is auto-filled by the Web UI based on our current query results. Per our requirements, change it so that the status becomes CRIT only when the co concentration value is greater than 0.04.

In the same way, set WARN and OK; we will not set INFO.

Finally, click the check mark in the upper right corner to save Check.

Now, we are back to the original CHECKS page, and we can see that there is a Check we just configured in the list below.

(7) Test Check

Now go back to the page where you uploaded data and insert a record to test how the check behaves.

The inserted data is as follows:

co,code=01 value=0.025

At this point the carbon monoxide concentration is 0.025, between 0.01 and 0.04, so the CHECK we just created should emit a WARN level signal.

Now, we can click Alert History in the toolbar on the left.

As you can see, there is a record with level warn in our status history. The MESSAGE on the right shows Check: CO_Alert is: warn, which was generated by our message template.

(8) Modify message template

Currently the message produced by our template is not informative enough. We would like the alarm to also report the current carbon monoxide concentration.

Take a look at what the official documentation says about templates: it points out that we can access a field's value in the data via r.<field name>.

In that case, we can modify the message template accordingly. Note the r.code and r.value in the new template: through them we extract the device code and the current carbon monoxide concentration directly from the data.
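
A template along these lines would do (a sketch; the wording is up to you, and if the interpolation complains about a non-string value, wrap it as ${string(v: r.value)}):

Check: ${r._check_name} is: ${r._level}, device: ${r.code}, co concentration: ${r.value}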

(9) Verify the message template

Next, we insert a piece of data again.

co,code=01 value=0.0146

0.0146 is between 0.01 and 0.04, so the CHECK we created earlier should emit another WARN level signal.

Now click Alert History in the left toolbar again to view the check's status records. The MESSAGE in the new status record has changed: this time it includes the device code and the carbon monoxide concentration at that moment.

(10) Create alarm terminal (NOTIFICATION ENDPOINT)

Status records alone are not enough; we also need to send the information to external systems, for example emailing a developer or making a phone call. The component responsible for sending messages outward is the notification endpoint.

  • First click the Alerts button in the left toolbar. On the page, select the NOTIFICATION ENDPOINTS tab in the upper bar.

  • Click the CREATE button in the upper right corner; a dialog window pops up.

In the upper left corner is a Destination drop-down menu: this is the endpoint type. Three endpoints are offered: HTTP, Slack and Pagerduty. Slack and Pagerduty are communication tools commonly used by overseas teams. Here we choose HTTP.

  • After selecting HTTP, the configuration items in the window change. A so-called HTTP endpoint simply sends a POST request to a target address.

  • We will not connect to Ruixiang Cloud for now. Instead, we observe the structure of the data the HTTP endpoint sends.

    Run ./simpleHttpPostServer-linux-x64. Once running, the program listens on 0.0.0.0:8080 and prints the body of every POST request it receives to the terminal.

  • Now set the HTTP endpoint address in InfluxDB to http://host1:8080.

  • Finally, click CREATE NOTIFICATION ENDPOINT in the lower right corner to create the endpoint.

(11) Create alarm rules (NOTIFICATION RULES)

Notification rules route alarm information to endpoints: a rule specifies which checks, and which status levels, are sent to which endpoint.

Note! A notification rule can only be created after at least one notification endpoint exists; otherwise the create button in the Web UI is grayed out.

First click the Alerts button in the left toolbar, then the NOTIFICATION RULES tab at the top, then the CREATE button.

A pop-up window for configuring the notification rule appears.

At the top you can set the scheduling interval; a rule looks a lot like a scheduled task. Conditions in the middle sets the conditions: for example, the current default is that when a CHECK's status is CRIT, http_endpoint is used to send the alarm. Note that the Conditions area also has a Tag Filter button, which filters by tag.

To see the alarm effect sooner, we set the scheduling interval to 15 seconds. The name can be anything you like.

In the Conditions area, click the Tag Filter button and add the tag filter condition _check_name == CO_Alert, where CO_Alert is the name of the check we created earlier. How checks and notification rules work together is covered later; for now just set it like this.

In the Message area at the bottom of the window, since we currently have only one endpoint named http_endpoint, the UI selects it automatically; keep it as is.

Finally, click the CREATE NOTIFICATION RULE button at the bottom to create the rule.

(12) Test the alarm signal sending effect

Now we test the alarm pipeline established in InfluxDB: insert a simulated carbon monoxide record whose value is greater than 0.04.

The inserted data is as follows:

co,code=01 value=0.05

Next, click Alert History in the left toolbar to open the alarm history page and wait about 15 seconds.

Under normal circumstances, a status message with a level of crit should appear in the check status history.

Click the NOTIFICATIONS tab at the top and a notification record should appear: this list records the notifications InfluxDB has sent out. The green check mark on the far right of the record indicates the message was successfully delivered through http_endpoint.

Go back to the terminal where simpleHttpPostServer is running and look at its output.

We successfully received a POST request, and the request body was printed to the console as JSON. It includes the time of the data, the alarm message, the alarm level, the carbon monoxide concentration at the time of the incident, and so on. If you can see this JSON in the terminal, the InfluxDB alarm configuration is complete and working.

(13) How checks and notification rules work

When we set up the notification rule, we had to set a scheduling interval, which feels a bit odd: why does a rule need to run periodically? Let's start with how checks work.

After InfluxDB is installed, a bucket named _monitoring is created automatically. We can use Data Explorer to query its contents.
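
A minimal query of that bucket could look like this; checks write their status records into the statuses measurement:

from(bucket: "_monitoring")
    |> range(start: -1h)
    |> filter(fn: (r) => r._measurement == "statuses")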

The query results include the status records generated by the check task. For example, there is a field named _check_name whose value is CO_Alert: the check name we set earlier.

That is to say, our periodically executed Check actually queries data from example_alert on schedule, performs the threshold check on it, and finally writes the resulting status records into the _monitoring bucket.

In fact, the notification rule is also a scheduled task: it queries the most recent status records from _monitoring, filters them according to the conditions you set, and, if any rows match, uses our http_endpoint to encode them as JSON and send them out.
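
Conceptually, the rule behaves like the following sketch; the real generated task is more elaborate and ends by calling monitor.notify() with the endpoint:

from(bucket: "_monitoring")
    |> range(start: -15s)
    |> filter(fn: (r) => r._measurement == "statuses")
    |> filter(fn: (r) => r._check_name == "CO_Alert" and r._level == "crit")
    // rows that survive the filters are encoded as JSON and POSTed by the endpoint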

This is how Check, Notification rule and Notification endpoint work together.

Example: Integrating Ruixiang Cloud (a SaaS alarm solution)

(1) What is Ruixiang Cloud?

Ruixiang Cloud is an alarm platform that provides a variety of notification methods. You recharge, choose a notification method such as phone calls, configure it as instructed, and you get an API endpoint. Afterwards, whenever your system needs to raise an alarm, it just sends an HTTP request to that API; Ruixiang Cloud will call the phone number you configured and audibly remind the programmer that it is time to work overtime.

(2) Register for Ruixiang Cloud

Official website address: https://www.aiops.com/

(3) Create an alarm API on Ruixiang Cloud

  1. First, enter the homepage of Ruixiang Cloud and click the Intelligent Alarm Platform button on the left to enter the work page of the Intelligent Alarm Platform.

  2. Click the integration button in the upper tab bar to enter the integration configuration page.

  3. A list of monitoring tools appears on the left. Ruixiang Cloud can integrate with many monitoring tools, but InfluxDB is not in this list. However, there is also a universal integration option: the REST API.

    The REST API exposes a URL; as long as your monitoring tool can send POST requests to Ruixiang Cloud in the data format the API requires, it can be integrated with Ruixiang Cloud.

  4. At this point, we will enter a configuration page. First you need to set an application name. Then click the blue button below to save and obtain the application key.

  5. At this time, a line of red text appears on the page: this is the AppKey. Be careful not to leak this key.

At this point, our alarm API has been configured.

(4) Create a dispatch strategy

Now external systems can send alarm information to Ruixiang Cloud through this interface, but how does the platform deliver the alarm to specific individuals?

Forwarding alarm information to specific people is a process called dispatching.

Go back to the home page of the alarm platform, click the Configuration button at the top, and then click the New Dispatch button on the right below.

You will now enter a new configuration page; follow the instructions on it. Note that if nothing is displayed when setting the assignee, your account has not yet been bound to an email address; bind one before proceeding with the subsequent operations.

After configuring, click Save.

Now, once our TICK_TEST API receives the alarm information, it will notify the specific person.

(5) InfluxDB tries to connect with Ruixiang Cloud

Now we can try to integrate with Ruixiang Cloud. It seems we only need to make the notification data sent by InfluxDB comply with the requirements of the Ruixiang Cloud API. Look back at the documentation for the API we just created (Integration -> Application List on the right -> find the REST API application you created -> click Edit -> visible at the bottom of the page).

The documentation explains exactly what format of data we should send.

(6) Shortcomings of the notification endpoint

Now return to the Alerts page of the Web UI and open the edit page of http_endpoint. You will find there is nowhere to modify the format of the outgoing data: InfluxDB's notification endpoints cannot customize the payload they send. So this path is a dead end. But there is another way: we can work directly at the underlying layer of checks and alerts.

Example: Notebooks and the alarm internals (※)

Earlier we mentioned that alarm tasks can also be created with Notebooks, but we have not touched Notebooks since. Here we use a Notebook to get at the underlying alarm machinery.

(1) Use Notebook to create alarm tasks

First, click the Notebooks button on the left to go to the Notebooks configuration page. Then, click Set an Alert template to create a new notebook.

1) After entering the notebook, you will find that the first cell is a query builder. Here we set the bucket to example_alert, _measurement to co, and _field to value.

2) Click the RUN button above to view the result.

Two cells appear below: one displays the queried data as a table, the other draws it as a line chart.

3) Finally, there is one more cell below, named New Alert (shown in its upper left corner). This cell is used to configure the alarm.

There are two blocks at the top: one sets the alarm condition, the other sets the scheduling interval. Here we set the alarm threshold to 0.04, as required. Careful readers may notice that only one alarm threshold can be set here, missing the crit, warn, info and ok levels. We will come back to this issue later.

4) At the bottom of this cell is the configuration of the notification endpoint. We again choose the HTTP endpoint and set the target URL to http://host1:8080.

5) After the above operations are completed, click the EXPORT ALERT TASK button in the lower right.

To our pleasant surprise, the notebook generates a long FLUX script for us. It is recommended to copy the script and paste it into Data Explorer; next, we will study this script ourselves.

(2) Script interpretation

The script is as follows:

import "strings"
import "regexp"
import "influxdata/influxdb/monitor"
import "influxdata/influxdb/schema"
import "influxdata/influxdb/secrets"
import "experimental"
import "http"
import "json"

option task = {name: "Notebook Task for local_8dc08939-f5df-447e-8e53-41532537902f", every: 10m, offset: 0s}
option v = {timeRangeStart: -24h, timeRangeStop: now()}
check = {_check_id: "local_8dc08939-f5df-447e-8e53-41532537902f", _check_name: "Notebook Generated Check", _type: "custom", tags: {}}
notification = {_notification_rule_id: "local_8dc08939-f5df-447e-8e53-41532537902f", _notification_rule_name: "Notebook Generated Rule", _notification_endpoint_id: "local_8dc08939-f5df-447e-8e53-41532537902f", _notification_endpoint_name: "Notebook Generated Endpoint"}
task_data = from(bucket: "example_alert") |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
 |> filter(fn: (r) => r["_measurement"] == "co")
 |> filter(fn: (r) => r["_field"] == "value")
trigger = (r) => r["value"] > 0.04
messageFn = (r) => "${strings.title(v: r._type)} for ${r._source_measurement} triggered at ${time(v: r._source_timestamp)}!"
task_data
 	|> schema["fieldsAsCols"]()
 	|> set(key: "_notebook_link", value: "http://host1:8086/orgs/d2377c7832daa87c/notebooks/0a0bc4b03a6ba000")
 	|> monitor["check"](data: check, messageFn: messageFn, crit: trigger)
 	|> monitor["notify"](
		data: notification,
 		endpoint: http["endpoint"](url: "http://host1:8080")(
 			mapFn: (r) => {
 				body = {r with _version: 1}
            	return {headers: {
   
   "Content-Type": "application/json"}, data: json["encode"](v: body)}
 			},
		 ),
 )

Next, we will explain this code to you from front to back.

  • Import package: We will skip the import code at the top and will not explain it further.

  • option task: this configures the scheduled task; among other things, it specifies the task name. The alarm script generated by the notebook is essentially an InfluxDB scheduled task.

  • option v: this line declares a record variable with two key-value pairs representing the start and stop of the query time range. It shows -24h here because, when we operated in the notebook, the time range in the upper right corner was set to -24h. We will change it to -15s later.

  • The check and notification variables: these two lines each declare a record that is later passed as a parameter to the monitor functions, because fields such as _check_id, _check_name and _type are required in the _monitoring bucket. The notebook fills them in automatically when generating the script.

  • Query data: the code below queries the example_alert bucket and assigns the resulting table stream to a variable named task_data.

    task_data = from(bucket: "example_alert")
        |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
        |> filter(fn: (r) => r["_measurement"] == "co")
        |> filter(fn: (r) => r["_field"] == "value")
    
  • Declare the threshold function: here a function named trigger is declared; its body is a single predicate expression. A function is needed because the monitor function called later takes a predicate function as an argument. Its logic determines whether the carbon monoxide concentration exceeds 0.04.

    trigger = (r) => r["value"] > 0.04
    
  • Message template: this is also a function; it returns a string whose content is the message template.

    messageFn = (r) => "${strings.title(v: r._type)} for ${r._source_measurement} triggered at ${time(v: r._source_timestamp)}!"
    
  • Alarm logic: the next large section is the alarm logic.

    task_data
        |> schema["fieldsAsCols"]()
        |> set(key: "_notebook_link", value: "http://host1:8086/orgs/d2377c7832daa87c/notebooks/0a0bc4b03a6ba000")
        |> monitor["check"](data: check, messageFn: messageFn, crit: trigger)
        |> monitor["notify"](
            data: notification,
            endpoint: http["endpoint"](url: "http://host1:8080")(
                mapFn: (r) => {
                    body = {r with _version: 1}

                    return {headers: {"Content-Type": "application/json"}, data: json["encode"](v: body)}
                },
            ),
        )
    

    1) First, the schema["fieldsAsCols"] function converts the data structure, pivoting fields into columns.

    2) The set function adds a constant field to the table stream

    3) The monitor["check"] function plays the role of checking the status

    It should be noted that the check record passed via the data parameter supplies _check_id and _check_name. messageFn is the message template. crit was an alarm level in our earlier CHECK; here it becomes a function parameter, and the value passed in is trigger, the predicate function we mentioned before.

    Also note that although the script generated by the notebook only passes the crit parameter, monitor["check"] actually accepts other parameters as well.

    There are also info, ok and warn parameters, so we can manually modify the script to fill in those value ranges, as in the sketch below.
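
    For example, extending the generated call with the bands from our requirements could look like this sketch:

        |> monitor["check"](
            data: check,
            messageFn: messageFn,
            crit: (r) => r["value"] > 0.04,
            warn: (r) => r["value"] > 0.01 and r["value"] <= 0.04,
            ok: (r) => r["value"] <= 0.01,
        )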

    4) The monitor["notify"] function is used to send data to the outside. You can see that an http terminal is declared in it. Finally, it's important to note that there is a local variable called body. This is actually the request body when we send a POST request. r is the data in our table stream, so the crux of the failure to connect to Ruixiang Cloud is here.

    We only need to modify body into the format required by the Ruixiang Cloud API.

    5) Why not simply use if/else logic to do the threshold comparison and send the request directly, instead of two specialized monitor functions? Mainly because the monitor functions leave traces in our Alert History: monitor["check"] and monitor["notify"] write check and notification records into the _monitoring bucket, which is very important.

(3) Modify the script to integrate Ruixiang Cloud

Finally, we modify the local variable body so that it conforms to the format required by Ruixiang Cloud API.
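
A sketch of the idea follows. The payload keys below are placeholders, not the real ones: take the actual field names from the REST API documentation on your integration page.

mapFn: (r) => {
    // Hypothetical payload layout; replace the keys with the ones the
    // Ruixiang Cloud REST API actually requires.
    body = {
        alarmName: r._message,
        alarmContent: "code=${r.code}, value=${string(v: r.value)}",
        alarmLevel: r._level,
    }

    return {headers: {"Content-Type": "application/json"}, data: json["encode"](v: body)}
}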

(4) Create scheduled tasks

Click the Tasks button on the left, and click the CREATE TASK button on the upper right.

Paste the modified script into the editing area, copy the information from the option task line into the settings form on the left, and delete the original option task line.

Finally, click the Save button in the upper right corner to create this scheduled task.

(5) Test the alarm effect

Now let's upload a record with a value greater than 0.04 to test the integration.

The data is as follows:

co,code=01 value=0.08

After waiting a while, we receive a phone call telling us that the carbon monoxide concentration value exceeded 0.04.

Example: Improving alarm systems

(1) Current alarm architecture

You can think of Ruixiang Cloud as a highly available alarm service: assume it is reachable 24 hours a day without failure. Combined with Ruixiang Cloud, our InfluxDB runs a scheduled task that checks the data at intervals, pulling the latest data and evaluating it. If the data is abnormal, an alarm signal is sent to Ruixiang Cloud, which then notifies the specific engineer.

(2) A more trustworthy architecture

There is a problem with the architecture in the previous section: suppose InfluxDB crashes abnormally during the night. A down InfluxDB naturally sends no alarm to Ruixiang Cloud, so the night passes and you sleep well, but was it really a peaceful night?

Therefore, it would be great if Ruixiang Cloud could know whether InfluxDB is still alive, ideally by probing our InfluxDB at intervals to confirm it is still up and usable. This is called a business availability check.

In this architecture, the new link is Ruixiang Cloud's periodic probe of InfluxDB.

(3) The architecture in the following example

Using Ruixiang Cloud for alerting is really purchasing software services from Ruixiang Cloud: the software runs on Ruixiang Cloud's servers, not your own company's. This model is called SaaS, software as a service. In that case, if we want Ruixiang Cloud to reach our InfluxDB service in the other direction, we must expose InfluxDB to the public network: either InfluxDB sits on the public network itself, or we use intranet penetration. Because our demo environment is on an intranet, we need to set up intranet penetration.

In this way, an alarm is triggered whether the intranet tunnel goes down or InfluxDB itself goes down.

(4) Build intranet penetration

Here we use the intranet penetration tool provided by Peanut Shell. A new Peanut Shell account comes with a free penetration quota and 1 Mbps of free bandwidth.

1) Install the Peanut Shell intranet penetration client

Visit the official website download page: https://hsk.oray.com/download

Pay attention to choosing the installation package that matches your system. Our demo machine runs CentOS, so here we choose CentOS Linux (x86_64).

Use the following command to install the rpm package:

sudo rpm -ivh ./phddns_5.2.0_amd64.rpm

After the installation is complete, a service called Phtunnel starts automatically, and a command-line tool called phddns is available to control it. All relevant information is shown in the prompt printed after installation.

2) Activate SN

Under normal circumstances, phddns runs automatically after installation. You can use the phddns status command to check its state:

phddns status

If ONLINE is displayed, it is operating normally.

Note the SN code here: it is our device identification code.

In addition, the output shows a remote management address: http://b.oray.com.

Visit this address in your browser and you will reach a login page.

Now switch to SN login. We need to enter the device SN code; the installation prompt told us the initial password is admin. Enter the SN code and password and click login.

Here you also need to register an Oray account and activate the device.

3) Configure intranet penetration

After successful activation, on the management page, click the intranet penetration button in the left toolbar to open the penetration management panel, then click Add Mapping.

Fill in the mapping form in order. Note that the intranet host is the host on which you just installed the penetration client. The trial version's bandwidth is capped at 1 Mbps, which converts to about 128 KB/s of upstream/downstream speed.

After clicking OK, you will return to the intranet penetration management page.

If you can see the mapping card on that page, the intranet penetration has been configured successfully.

From now on, visiting https://1674b87n99.oicp.vip/ is equivalent to accessing the local 127.0.0.1:8086.

(5) Configure business availability detection

1) Create monitoring tasks

First, return to Ruixiang Cloud's homepage and click the business availability monitoring platform button on the left.

On the monitoring task homepage, click the green Create Monitoring button.

First complete the monitoring settings. Set the monitored address to the intranet-penetration address we just configured, with the path /health; a GET request is issued to this address. Under normal circumstances it returns JSON describing whether InfluxDB is currently healthy, and the status code is 200 when the request succeeds. Finally, a GET request to this endpoint requires no token.

The response settings are as follows: if the interface responds within 2 seconds, the speed is satisfactory; between 2 and 5 seconds is relatively slow; more than 5 seconds is very slow.

Click result verification in the upper right corner and set the expected response code to 200, meaning that a 200 response is the normal state we expect.

Set the monitoring frequency to 15 minutes; the free tier probes at most once every 15 minutes, and paid plans allow higher frequencies.

The carrier and monitoring region specify from which province and through which carrier's network your interface is probed, because an interface may be reachable through China Mobile's network but not through China Unicom's. In the end we choose just one probe host.

After completing the above operations, click the Save button in the upper right corner.

2) Configure alarm rules

After returning to the monitoring list, you can see that there is already a monitoring item on the page.

Now click the alarm button on the left, and let's configure the alarm channel.

As you can see, we face three concepts: alarm rules, alarm strategies and alarm behaviors. As in InfluxDB, the alarm rules here correspond to check tasks, translating the data into three signals: warning, serious, and normal. Alarm behaviors are equivalent to notification endpoints: you can choose to send an email or make a phone call. Alarm strategies connect rules to behaviors, making them equivalent to InfluxDB's notification rules.

First, configure the alarm rules: click the alarm rule button on the left, then click the + sign.

Give the rule a name and select API Monitoring as the rule type.

Click the alarm object tab at the top. The API-monitoring tasks we created are listed on the left. Select InfluxDB health status, click the > button in the middle to add it to the selected list, then click Next.

Here we need to set a serious condition, equivalent to the CRIT threshold in InfluxDB. We set it so that if the availability over the past 15 minutes is not 100%, a serious error is considered to have occurred.

Click Next again in the lower right corner.

This step sets a warning condition, equivalent to warn in InfluxDB. Here we simply click the blue button to copy the serious condition, then click Save in the lower right corner.

Finally, if everything is normal, the alarm rule list gains one more entry.

3) Configure alarm behavior

Click the alarm behavior button on the left, then click the + sign to create a new alarm behavior.

In the pop-up window, first select the behavior type; here we choose Webhook. A URL is required, which again amounts to sending an outbound request, so the URL we fill in here should lead back into Ruixiang Cloud's intelligent alarm platform for processing.

Return to the integration page of the intelligent alarm platform and find Ruixiang Cloud in the integration tools. Click to create.

Name it first, then click Save below to obtain the application key.

You can see that the URL in the configuration description has been completed with a random string: this is the key. Copy it.

Go back to the window for creating the alarm behavior, fill in the URL, and click Test. If the result shows connect success, the configuration is correct; click Save in the lower right corner.

4) Modify the dispatch strategy

We have now created two applications on the alarm platform. But the dispatch strategy we created earlier only forwards the alarm notifications received by the REST API to a given user; the webhook we just created is not yet covered by that strategy.

So here, find the dispatch policy we set before and click the edit button on the right.

After entering the editing page, click the Add button.

Two applications are now listed here, meaning the alarm notifications received by either interface will be forwarded to the users we specify.

Finally, click Save, and the dispatch policy is modified successfully.

5) Create an alarm strategy

  1. Return to the monitoring platform's alarm management page. Click the alarm policy button on the left, then click the + sign to create an alarm policy.

  2. Configure the triggering policy as needed.

  3. Click the Trigger Behavior Settings tab above, then click the Add button on the right.

  4. A selectable card appears: this is the alarm behavior we created earlier. Click to select it, then click the select button in the lower right corner.

  5. Once the alarm behavior is added, click the save button in the lower right corner. Our alarm strategy is now in place.
