DevOps development in practice: building an error-log monitoring system with Sentry

Error log monitoring, which can also be called business logic monitoring, collects and summarizes the error logs that business systems produce while running, and raises alarms on them. Does that sound familiar? Yes, it resembles what was covered under "APM Application Performance Monitoring". But the two are different: an APM system focuses mainly on behavior analysis at the application layer, and most of the data it collects serves operations. What Sentry does is collect crash information from the underlying application code, which makes it easier to troubleshoot exceptions at the code level. Put simply, it is a technology-oriented, code-level troubleshooting tool.

1. Scenario

As operations automation has advanced, operations tools and systems of all kinds have sprung up. The number of operations systems we have developed in-house is now close to double digits. These systems are deployed across multiple machines and usually come with a number of scripts running in the background. When an exception occurs on the web side, developers receive feedback promptly and can fix it. Scripts, however, involve no user interaction, so a problem may only be located after it has already turned into a major fault.

2. Existing approach

  1. Backend programs and scripts record their intermediate state with Python's built-in logging module, and the output of both is also redirected to a specified file to capture uncaught exceptions.

  2. On servers that host multiple systems, the logs are stored centrally in the same directory.

  3. rsync pulls the log files from the servers on a schedule.

  4. The log files are matched against keywords, and the filtered results are e-mailed to the operations developers.

The resulting consolidated e-mail notification is shown in the figure.
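Step 4 can be pictured with a minimal sketch. The keyword list, the sample log lines, and the function name `filter_log_lines` are assumptions made for illustration; the original scripts are not shown in this article.

```python
import re

# Hypothetical keywords; the article does not list the ones actually used.
KEYWORDS = ("ERROR", "CRITICAL", "Traceback")
PATTERN = re.compile("|".join(re.escape(k) for k in KEYWORDS))


def filter_log_lines(lines):
    """Return only the lines that contain one of the keywords."""
    return [line.rstrip("\n") for line in lines if PATTERN.search(line)]


# Made-up sample lines standing in for a pulled log file.
sample = [
    "10:00:01 INFO sync started\n",
    "10:00:02 ERROR failed to update record\n",
    "Traceback (most recent call last):\n",
    "10:00:03 INFO sync finished\n",
]
matches = filter_log_lines(sample)
print("\n".join(matches))
```

Only the matching lines would go into the e-mail body, which is also why such mails carry so little context about each error.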

3. Problems

The approach above partly eliminated the blind spot in monitoring the running state of scripts, but problems remain:

  1. Errors are not noticed right away

    Logs are not pulled in real time, so script logs often lag behind feedback from web users. The cycle from a problem appearing to its being solved is too long, which easily puts us on the back foot.

  2. Getting error information is inefficient

    User feedback and e-mail alerts contain very little error information, so in the end we still have to dig through large volumes of related logs. Sometimes we also have to instrument the code in a test environment to capture intermediate variables, which makes locating a problem very troublesome.

  3. Log handling is not flexible enough

    Generally speaking, besides runtime errors we also care about other abnormal situations, such as polluted data, illegal requests, and abnormal third-party API calls. If these are logged at the same level as errors, they are likely to flood us with alarms; if they are not handled at all, they may cause serious problems over time. We want the same log content to be handled flexibly according to the scenario.

  4. Monitoring coverage is limited

    Monitoring should cover all three parts: scripts, backend, and frontend. This matters especially because our new operations systems separate the frontend from the backend, and many frontend problems cannot be recorded in a unified way.

In view of this, we looked into several log collection and monitoring solutions and chose Sentry.

4. About Sentry

4.1 Overview

Sentry is a modern error logging and aggregation platform. It supports almost all major development languages and platforms, and offers a modern UI, as shown in the figure.

Unlike ELK and Splunk, Sentry focuses on aggregating and monitoring the error logs that applications generate, and provides official SDKs for many languages.

Up to 30 integrations are available.

Developers learn about an error as soon as it happens, and Sentry fits easily into their own workflow and that of their team.

4.2 Before-and-after comparison

To show Sentry's capabilities concretely, here are some simulated common scenarios. Any resemblance to real events is purely coincidental.

4.2.1 Scenario One

Before integrating Sentry

  • User A: The release doesn't work.

  • Developer A: Which page? Send me a screenshot.

  • User A: (sends screenshot)

  • Developer A reproduces the bug, logs in to the server to check the error log, and confirms the program logic is fine. Checking the database reveals dirty data. Developer A contacts Developer B, who is responsible for updating that data, to check the Python script C.py.

  • Developer B logs in to the server, checks the error log, and finds that a logic error made the script stop working an hour ago, affecting thousands of records.

After integration

  • Developers A and B both receive an e-mail alert: script C.py exited unexpectedly one minute ago.

  • Developer B opens the Sentry backend, reads the error information, locates and fixes the problem, then cleans up the few dozen affected records.

  • No users are affected, and Developer A does not need to get involved.

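4.2.2 Scenario Two

Before integrating Sentry

  • User: Nothing happens when I click the submit button.

  • Developer: Which page? Send me a screenshot.

  • User: (sends screenshot)

  • Developer: I can't reproduce it here. Open the developer tools, switch to the Console panel, and send me a screenshot.

  • User: How do I do that??

    … (100 words omitted)

  • The developer gets the relevant data and confirms it is a code problem. But the js files have been minified, so the faulty code cannot be located. The only option is to start a local development server and debug.

After integration

  • The developer receives an e-mail alert showing that the frontend has logged an error.

  • The developer opens the Sentry backend to view the error information, such as the user's browser version, the URL of the page where the error occurred, the call stack, and the code that finally raised the error, and confirms where the problem lies.

  • Developer: The validation of the remarks field on the ticket you submitted two minutes ago has a problem. Leave that field empty and submit again for now; I will release a hotfix shortly and let you know.

  • User: OK, I was just about to ask.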

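5. Configuring Sentry

Sentry provides detailed official deployment documentation, and Chinese installation tutorials can also be found online, so the installation process is not repeated here. If you just want to try it out, you can use the official SaaS directly; the free tier supports 5,000 events per day. The address is https://sentry.io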

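5.1 Concepts

To use Sentry, a few concepts need to be clear:

  • event

    The basic unit of actionable data. Each log output produces one event. An event is not necessarily an error: if the logging level is set very low, the backend will receive a great many events, so setting the log level correctly matters.

  • issue

    An issue (a "ticket" or "problem") is an aggregation of events of the same kind. An error may be recorded many times because the code runs repeatedly; Sentry automatically groups those records together so they are easy to handle. The issue is usually the object we work with.

  • DSN

    The DSN is the client key, used for communication between client and server. A DSN is a URL that contains a public key, a secret key, a project identifier, and the server address, for example https://1703147af2094458bevb1bfadcfa1c2:[email protected]/1545. This kind of DSN is private. There is also a non-private kind, shown in the Sentry backend as DSN(public), which is meant for frontend projects.

  • Raven

    The whole error log monitoring system consists of a client and a server. Sentry is the name of the server and Raven is the name of the client; the two must work together.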
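To make the DSN layout described above concrete, here is a minimal sketch that splits a DSN into its parts with the standard library. The function name `parse_dsn` and both sample DSNs are made up for illustration; this is not how Raven itself parses the value.

```python
from urllib.parse import urlsplit


def parse_dsn(dsn):
    """Split a DSN of the form scheme://public:secret@host/project."""
    parts = urlsplit(dsn)
    return {
        "public_key": parts.username,
        "secret_key": parts.password,  # None when the DSN has no secret part
        "server": f"{parts.scheme}://{parts.hostname}",
        "project": parts.path.strip("/"),
    }


# Made-up values, not real credentials.
private = parse_dsn("https://publickey:secretkey@sentry.example.com/1545")
public = parse_dsn("https://publickey@sentry.example.com/1545")
print(private)
print(public["secret_key"])
```

The second call shows the difference between the two kinds: a public DSN simply carries no secret key, which is why it is the one to hand to frontend code.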

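5.2 Configuration

Server-side configuration in Sentry mainly covers names, alert rules, and so on; whether the monitored project is a frontend or a backend makes little difference.

5.2.1 Creating a project

  1. Open the Sentry backend and click "New Project" in the upper right corner.

  2. Name it [project name][frontend|backend], for example "Blue Ocean frontend" (蓝海前端).

  3. On the framework configuration page, you can click through to the integration documentation for each language or framework (this step can be skipped).

  4. Click the project name in the upper left corner to open the project home page, which shows "Waiting for events…".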

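5.2.2 Getting and testing the DSN

  1. On the "Project Settings" page, click "Client Keys" in the list on the left.

  2. Copy the DSN: use DSN for a backend and DSN(public) for a frontend.

  3. Taking Python as an example, run pip install raven to install the client, then run raven test DSN. If all goes well, a new test message appears on the project home page in the Sentry backend.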

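5.2.3 Configuring alerts

  1. On the "Project Settings" page, click "Alerts" in the list on the left to open the alert configuration page.

  2. Click the Rules tab. One rule already exists: alert when an event is first seen.

  3. Modify the rules as needed.

Alert rules are quite flexible, and multiple conditions can be combined with AND or OR.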
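The AND/OR combination can be thought of as follows. This is only a conceptual sketch, not Sentry's implementation: the event fields, the two condition functions, and `rule_matches` are all hypothetical names.

```python
def first_seen(event):
    """Condition: this is the first time the event has been seen."""
    return event["times_seen"] == 1


def is_error_or_worse(event):
    """Condition: the event level is error or fatal."""
    return event["level"] in ("error", "fatal")


def rule_matches(event, conditions, match="all"):
    """Combine conditions with AND (match='all') or OR (match='any')."""
    results = (condition(event) for condition in conditions)
    return all(results) if match == "all" else any(results)


event = {"level": "warning", "times_seen": 1}
conditions = [first_seen, is_error_or_worse]
and_result = rule_matches(event, conditions, match="all")
or_result = rule_matches(event, conditions, match="any")
print(and_result, or_result)
```

A first-seen warning fails the AND rule but passes the OR rule, which is the kind of difference to keep in mind when tuning how noisy the alerts are.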

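5.2.4 Integrating notifications

  1. On the "Project Settings" page, click "All Integrations" in the list on the left.

  2. Tick the types you want to enable, such as Mail.

For mail server configuration, see the official documentation. If a self-hosted Sentry server offers only a few integration types, you can install official or third-party plugins to extend it.

Once the server side is configured, you can start configuring the client.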

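6. Backend integration

Our systems are mainly developed in Python, so Python is used as the example here.

Integrating Sentry into Python is very simple. The official documentation gives examples for more than a dozen Python environments (frameworks), for example Celery: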

import logging

from raven import Client
from raven.contrib.celery import register_signal, register_logger_signal

# DSN is the client key copied from the Sentry backend
client = Client(DSN)

# register a custom filter to filter out duplicate logs
register_logger_signal(client)

# The register_logger_signal function can also take an optional argument
# `loglevel` which is the level used for the handler created.
# Defaults to `logging.ERROR`
register_logger_signal(client, loglevel=logging.INFO)

# hook into the Celery error handler
register_signal(client)

# The register_signal function can also take an optional argument
# `ignore_expected` which causes exception classes specified in Task.throws
# to be ignored
register_signal(client, ignore_expected=True)

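Personally, I recommend following the logging example, because developers usually customize their own log configuration on top of the logging module rather than using a framework's logging directly. If your application uses only the logging module, integrating Sentry is transparent to existing code and needs no further changes.

Integrating Sentry through the logging module takes only two steps:

6.1 Install the client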

pip install raven

6.2 Initialize the configuration

In the application's entry file (app.py in tornado, for example), or in your custom logging module, insert the following code:

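import logging

from raven.handlers.logging import SentryHandler
from raven.conf import setup_logging

# DSN is the client key copied from the Sentry backend
handler = SentryHandler(DSN)
# Only records at ERROR level or above are sent to Sentry
handler.setLevel(logging.ERROR)
setup_logging(handler)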

After these two steps, you can use the logging module just as before:

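import logging

logger = logging.getLogger(__name__)
# The handler in 6.2 is set to logging.ERROR, so log at error level or above
# for the record to reach Sentry
logger.error('This is a test message')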

When the code above runs, in addition to the usual logging, raven also sends the log content to the Sentry server and writes the following to standard output:

Sending message of length xxx to https://xxxx
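The role of the handler level can be checked without a Sentry server. The sketch below uses only the standard library; `CollectingHandler` is a hypothetical stand-in for SentryHandler that stores records instead of sending them, and the logger name is arbitrary.

```python
import logging


class CollectingHandler(logging.Handler):
    """Stand-in for SentryHandler: stores records instead of sending them."""

    def __init__(self, level=logging.NOTSET):
        super().__init__(level)
        self.sent = []

    def emit(self, record):
        self.sent.append((record.levelname, record.getMessage()))


handler = CollectingHandler()
handler.setLevel(logging.ERROR)  # same threshold as in 6.2

logger = logging.getLogger("sentry_sketch")
logger.setLevel(logging.DEBUG)   # let every record reach the handlers
logger.propagate = False         # keep the demo output quiet
logger.addHandler(handler)

logger.info("This is a test message")   # below ERROR: not forwarded
logger.error("This is a test message")  # ERROR: forwarded

print(handler.sent)
```

With the threshold at ERROR, only the second call is forwarded. Lowering the handler level sends more records, and each of them becomes an event on the server.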

If you want to send more context to Sentry, pass the extra parameter:

logger.error('This is a test message', extra={'stack': True})

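The log information finally shown in the backend looks like the figure.

It includes the log level, Python environment information, SDK information, the call stack, the n log outputs before and after, other related events, and so on. For an uncaught exception, or when the extra parameter is passed, the values of intermediate variables are shown as well, which makes it easy to locate the faulty line and data without instrumenting the code again.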
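How the extra parameter and exception details travel with a log call can be seen with the standard library alone: keys passed in `extra` become attributes on the LogRecord, and `logger.exception` attaches the active exception as `exc_info`. A handler receives both. The sketch below assumes nothing about Raven's internals; `LastRecordHandler` and the logger name are hypothetical.

```python
import logging


class LastRecordHandler(logging.Handler):
    """Keeps the most recent LogRecord so its fields can be inspected."""

    last = None

    def emit(self, record):
        self.last = record


handler = LastRecordHandler()
logger = logging.getLogger("extra_sketch")
logger.propagate = False
logger.addHandler(handler)

# Keys in `extra` are copied onto the record as attributes.
logger.error("This is a test message", extra={"stack": True})
has_stack_flag = getattr(handler.last, "stack", False)

try:
    1 / 0
except ZeroDivisionError:
    # logger.exception logs at ERROR level and attaches exc_info.
    logger.exception("division failed")
exc_type = handler.last.exc_info[0]

print(has_stack_flag, exc_type.__name__)
```

This is why existing logging calls need no changes: everything a handler needs already travels on the record.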

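7. Frontend integration

Frontend integration is somewhat more complicated.

First, the Sentry server's domain name must be resolvable. For internal systems, backend monitoring can add a hosts entry on the production machines that points to the Sentry server's IP. For the frontend, error logs are sent from the user's browser, so users must be able to resolve the Sentry server's domain name automatically.

Second, if the frontend project uses a bundler, the bundler usually minifies or even obfuscates the code, so the logs Sentry collects cannot pinpoint the faulty code. Fortunately, Sentry supports importing sourcemaps to parse and restore the code automatically, so developers see call stacks in the backend as detailed as in a development environment. (If you do not use a bundler, skip this step.)

reactjs is used as the example for frontend integration here.

7.1 Install dependencies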

npm i raven-js --save

7.2 Embed raven

In the index.js file (the entry file):

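import Raven from 'raven-js';

// Add this in a suitable place so that it runs as early as possible.
// PUBLIC_DSN is the DSN(public) string copied from the Sentry backend.
Raven.config(PUBLIC_DSN).install();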

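For other frontend frameworks, see the official documentation https://docs.sentry.io/clients/java

7.3 Import sourcemaps

Generate the sourcemap files in advance. In our tests, the source-map level works perfectly; cheap-source-map can locate the code, but the display is not friendly. The most recommended option is cheap-module-source-map.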

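# Install
npm i -g sentry-cli-binary

# Log in to Sentry
sentry-cli --url SENTRY_URL login
# SENTRY_URL is the address of your self-hosted server or the official SaaS.
# The command opens the API token creation page; generate a token and paste it in.
# On success the token is saved in a configuration directory of the current
# system user, and later requests reuse it.

# Create a release
sentry-cli releases -o sentry -p 7d04f2c51f32 new test01 --finalize
# Here sentry and 7d04f2c51f32 are the organization name and the project name.
# Both are the *short names*, which differ from what the Sentry pages show by
# default; look them up on the settings page.

# Upload the files in the dist directory
sentry-cli releases -o sentry -p 7d04f2c51f32 files test01 upload-sourcemaps dist

# Delete all files under the old release
sentry-cli releases -o sentry -p 7d04f2c51f32 files test01 delete --all
# Run this only when you no longer want the files on that release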

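Note that the relative path of the generated map files must match the path they are uploaded under. For example, if dist is the output directory of the build and the map files are at sourcemap/[file].map, then the directory sentry-cli uploads should be dist, so that the map files appear under ~/sourcemap/ in the Sentry backend.

A webpack configuration like this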

devtool = 'source-map';
output.path = 'dist';
output.sourceMapFilename = 'sourcemap/[file].map';

corresponds to a command like this

sentry-cli releases -o sentry -p 7d04f2c51f32 files test02 upload-sourcemaps dist

In addition, sentry-cli provides the --url-prefix option, which adds a prefix to the uploaded map files; the default is ~/. Give it a try if you are interested.
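The naming rule described above (prefix plus the path relative to the uploaded directory) can be sketched as follows. This only mirrors the behaviour described in this article; it is not sentry-cli's code, and `artifact_name`, `app.js.map`, and the `~/static/` prefix are made-up examples.

```python
from pathlib import PurePosixPath


def artifact_name(upload_dir, file_path, url_prefix="~/"):
    """Prefix + path of the file relative to the uploaded directory."""
    relative = PurePosixPath(file_path).relative_to(upload_dir)
    return url_prefix.rstrip("/") + "/" + relative.as_posix()


# Uploading dist with the default prefix
name = artifact_name("dist", "dist/sourcemap/app.js.map")
# The same file with a hypothetical custom prefix
prefixed = artifact_name("dist", "dist/sourcemap/app.js.map", url_prefix="~/static/")
print(name)
print(prefixed)
```

If the directory that is uploaded were dist/sourcemap instead of dist, the sourcemap/ segment would disappear from the name, which is the mismatch the note above warns about.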

One more point: Sentry resolves the path of the map file from the sourceMappingURL in the js file, so the sourcemap level cannot be hidden-source-map or anything similar.
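A quick way to check whether a built js file still carries that comment is sketched below. The regular expression and the function name `find_source_mapping_url` are assumptions for illustration, and the two js snippets are made up.

```python
import re

# Matches a "//# sourceMappingURL=..." comment (or the older "//@" form).
SOURCE_MAPPING_RE = re.compile(
    r"^//[#@]\s*sourceMappingURL=(\S+)\s*$", re.MULTILINE
)


def find_source_mapping_url(js_text):
    """Return the last sourceMappingURL in the file, or None if absent."""
    matches = SOURCE_MAPPING_RE.findall(js_text)
    return matches[-1] if matches else None


with_comment = "console.log(1);\n//# sourceMappingURL=sourcemap/app.js.map\n"
without_comment = "console.log(1);\n"  # a build that omits the comment

found = find_source_mapping_url(with_comment)
missing = find_source_mapping_url(without_comment)
print(found, missing)
```

If the function returns None for the bundled file, there is no sourceMappingURL for Sentry to follow, and stack traces will stay minified.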

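After the upload finishes, the files of the release can be seen on the Releases -> Artifacts page, as shown in the figure.

The final error log looks like the figure.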

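8. Using the Sentry admin backend

For reasons of space, only a few points about using the Sentry backend are covered here.

First, custom filters

Sentry provides rich filtering options. The default filter is "Unresolved Issues"; users can also combine their own filter conditions and save them as a personal or team default.

Second, real-time page updates

The button in the middle of the figure above turns real-time refresh of the issues page on or off.

Third, statistics and overview

The figure above is the system administrator page, which shows system calls, pending task queues, and so on. Project statistics are also visible on the home page of a personal account.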

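9. Points to note

  1. Error log monitoring with Sentry cannot replace your existing log storage; it only extends log collection and monitoring. Use Sentry for its immediacy and convenience so that you can respond quickly. Sentry purges older log content, which is another difference from log processing systems such as ELK.

  2. How well Sentry works also depends on the skill of the developers writing the logs. If the original log records lack key information or contain too much noise, even the most powerful log analysis system is helpless. So when introducing Sentry for log monitoring, also strengthen the team's logging discipline and standardize log levels, format, and content.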

Origin www.cnblogs.com/duanxz/p/11797929.html