nginx/tomcat log format specification

    Recently I have been planning to design and build a log collection platform, on top of which we will do real-time log analysis, business monitoring and alerting. Before that, we need to define a log format specification (along with a few other binding conventions) so that log collection, data organization and data analysis can be carried out properly.

 

    Goals of the log format specification, and how we formulate it:

    1) All projects use a unified log format, which greatly simplifies log collection and analysis.

    2) The nginx and tomcat log formats should be designed sensibly so that viewing logs and troubleshooting are more convenient: drop fields that are never used and add more useful ones.

    3) The log format will inevitably change over time, but the log data is consumed by operations, development, the big data platform, BI, security and other teams. To limit the impact of such changes, we divide the fields in a log line into "domains"; each domain contains several fields, and different teams care about different domains. When the field list inside one domain changes, the other teams' use of the data is not affected. When parsing log data, we first split the line into domains by the domain separator, and then read each field by its relative position inside its domain, instead of by its position in the whole line (see the parsing sketch after this list). We use the "^_^" symbol as the domain separator.

    4) To make data organization and log collection easier, we agree that all log file names follow a uniform rule, which is very helpful for Flume. For nginx logs, tomcat access logs, business logs and so on, the file name follows: <project-name>.<tag>.log.<yyyy-MM-dd>.<index>, for example order-center.error.log.2017-10-11.0, where <index> is the index number generated when the file is rolled. A unified name makes it easy to tell the source and key characteristics of a log from the file name alone; in addition, Flume can read important information such as the project name and log type from the file name and store the logs by category.

    5) Strictly control the size of log files and roll them in time. We agree that no log file may exceed 256M; when a file passes this size it must be rolled. The reason is simple: huge log files are hard to collect, transmit and view, and they also reduce file IO efficiency. On top of this, we require that log output be planned reasonably and kept as concise as possible. Bloated log messages are of little value, consume storage, and increase the IO load on the host machine; after all, the IOPS of our ordinary application machines is usually not high.

    6) To make log content easier to organize, read and locate, we print tag information such as the IP of the current machine and the timestamp at which the entry was generated in all log content (nginx, tomcat, and so on).
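
    To make the domain idea concrete, here is a minimal Java sketch (class and method names are hypothetical, not part of any existing component) of how a parser could split a log line into domains and then read a field by its position inside a domain:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Minimal sketch (illustrative names): split a log line into domains on the
// "^_^" separator, then split each domain into fields on "|". Consumers read
// a value by its position inside a domain, not by its position in the line.
public class DomainLogParser {

    private static final String DOMAIN_SEPARATOR = Pattern.quote("|^_^|");
    private static final String FIELD_SEPARATOR  = Pattern.quote("|");

    public static List<List<String>> parse(String line) {
        List<List<String>> domains = new ArrayList<>();
        for (String domain : line.split(DOMAIN_SEPARATOR, -1)) {
            domains.add(Arrays.asList(domain.split(FIELD_SEPARATOR, -1)));
        }
        return domains;
    }

    public static void main(String[] args) {
        // Made-up sample line with two domains, for illustration only.
        String line = "11/Oct/2017:10:00:00|web-01|1.2.3.4|^_^|https|GET";
        List<List<String>> domains = parse(line);
        System.out.println(domains.get(1).get(1)); // "GET": 2nd field of the 2nd domain
    }
}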

 

1. nginx log format:

log_format  main  '$time_local|$hostname|$remote_addr|$upstream_addr|$request_time|$upstream_response_time|$upstream_connect_time|'
                  '$status|$upstream_status|-|$bytes_sent|-|-|$remote_user|$request|$http_user_agent|$http_referer|^_^|'
                  '$scheme|$request_method|$request_trace_id|$request_trace_seq|^_^|'
                  '$http_x_forwarded_for|$http_Authorization|$cookie_uid';

 

    There are several "-" placeholders in the format. I am running nginx 1.10, but a few very important fields only became available in version 1.11, so for now I fill those positions with "-".

    For the nginx log format I referred to AWS ELB; I think the ELB log format is fairly standardized and well worth borrowing from. The format above keeps the fields in roughly the same order as the ELB log, and fields that nginx does not provide are filled with "-" for now.

    We divide the nginx log into four domains. The first domain contains the most commonly used and important fields, usually related to performance evaluation and data organization; the second domain carries status information about the request, which you look at when troubleshooting; the third domain is about request tracking: we assign a request_id and so on to every request ($request_trace_id and $request_trace_seq are custom variables), so that during business monitoring the full tracking chain of an abnormal request can be reconstructed; the fourth domain is for developers and is usually used to print HTTP parameters and the like.

    $request_trace_id and $request_trace_seq are custom variables that hold the "request trace ID" and the "request trace sequence number" respectively:

    A new request gets a unique trace_id, usually generated by the top-level proxy. Once generated, the trace_id is added to the request headers and passed to the upstream layer (tomcat, etc.). If the upstream application fans out to other services, the trace_id keeps being passed along, which ultimately gives us request link tracking. We will collect and organize this tracking information and later use it for interface performance evaluation, service monitoring, traffic anomaly detection, capacity planning and so on.

    If the proxy layer finds that the request headers already contain a trace_id, we consider the request to be "a hop inside an existing chain" rather than a new request: the trace_id is kept as is, added to the headers and forwarded. The default value of request_trace_seq is 0; the upstream applications are responsible for maintaining the seq value, for example incrementing it every time the request is passed on (a sketch of this is given after the nginx configuration below). nginx does not increment this value because the application is the one that decides the order of the seq.

## trace.setting
# Take the trace id and sequence from the incoming headers, if any.
set $request_trace_id  $http_x_request_id;
set $request_trace_seq $http_x_request_seq;

# No incoming trace id: this is a new request, so generate one.
if ( $request_trace_id = '' ) {
    set $request_trace_id $pid-$connection-$bytes_sent-$msec;
}
# No incoming sequence: this is the first hop, default to 0.
if ( $request_trace_seq = '' ) {
    set $request_trace_seq 0;
}

 

....
server {
    listen       80;
    server_name  demo.com;
    include trace.setting;
    access_log  /var/log/nginx/demo.log  main;
    proxy_send_timeout      1800s;
    proxy_read_timeout      1800s;    

    location / {
            proxy_pass         http://10.0.0.1:8080;
            proxy_set_header   Host             $host;
            proxy_set_header   X-Request-ID $request_trace_id;
            proxy_set_header   X-Request-Seq $request_trace_seq;
    }
}
....
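
    As described above, an upstream application keeps the trace id unchanged and increments the sequence number every time it fans out to another service. The Java sketch below illustrates that convention; the class, method and URL are hypothetical, only the header names (X-Request-ID, X-Request-Seq) come from the configuration above:

import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Illustrative sketch only: when the application calls a downstream service,
// it forwards the trace id it received and bumps the trace sequence by one.
public class TraceForwardingExample {

    public static HttpURLConnection openTracedConnection(String url,
                                                         String incomingTraceId,
                                                         String incomingSeq) throws IOException {
        int seq = 0;
        try {
            seq = Integer.parseInt(incomingSeq);
        } catch (NumberFormatException ignored) {
            // missing or malformed seq: treat this hop as the first one
        }
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestProperty("X-Request-ID", incomingTraceId);          // unchanged along the chain
        conn.setRequestProperty("X-Request-Seq", String.valueOf(seq + 1)); // incremented per hop
        return conn;
    }
}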

 

    The meaning of each field:

time_local: time of the log entry
hostname: hostname of the current machine (not its IP)
remote_addr: client address
upstream_addr: address of the backend (upstream) server
request_time: total time nginx spent on the request, from reading the first byte of the client request until the response has been sent, in seconds with millisecond resolution
upstream_response_time: time from establishing the connection with upstream until the response data has been received
upstream_connect_time: time spent establishing the connection with upstream
status: nginx response status code
upstream_status: status code returned to nginx by upstream (tomcat or a downstream nginx)
bytes_received: size of the request data nginx received from the client; only supported since version 1.11, so the "-" placeholder is used here
bytes_sent: size of the data nginx returned to the client
upstream_bytes_sent: number of bytes nginx sent to upstream; only supported since version 1.11, so the "-" placeholder is used here
upstream_bytes_received: number of bytes of the upstream response received by nginx; only supported since version 1.11, so the "-" placeholder is used here
remote_user: user information from basic authentication
request: the HTTP request line (first line of the request)
http_user_agent: value of the "User-Agent" header
http_referer: value of the "Referer" header
scheme: scheme of the request, HTTP or HTTPS
request_method: HTTP(S) request method, e.g. GET, POST
request_trace_id: value of the "X-Request-ID" header; if this header is missing, a new trace id is created
request_trace_seq: value of the "X-Request-Seq" header; if present, it indicates that this request was issued as part of a trace chain, and the value is used to track the level or order of the request within the chain
http_{key}: value of the given key in the request headers
cookie_{key}: value of the given key in the cookie

 

 2. Tomcat Access log format specification

    For JAVA WEB projects, tomcat provides a built-in access log mechanism, somewhat similar to nginx's access log. When it is enabled, tomcat prints the received request information to a log file, which helps a lot with data analysis, troubleshooting, performance testing and so on. We only need to adjust server.xml:

<Valve className="org.apache.catalina.valves.AccessLogValve" directory="logs"
       prefix="access.log" suffix="" renameOnRotate="true"
       pattern="%{yyyy-MM-dd HH:mm:ss}t|%A|%a|%p|%m|%s|%D|%b|%{begin:msec}t|%{end:msec}t|^_^|%{X-Request-ID}i|%{X-Request-Seq}i|^_^|%S|%r|%{Referer}i|%{User-Agent}i" />

 

    The basic principle is the same as for nginx: the log is divided into domains. The first domain is again about data organization and performance testing; the second domain is about request tracking and will later be connected to the tracing system; the third domain is for developers and usually prints HTTP parameters and the like. Unfortunately, the field information provided by the tomcat access log is relatively limited, not as rich as nginx's.

    In addition, to make the access log file name follow the naming specification, we adjusted the prefix and suffix configuration.

    To support request tracking, we also print the request_trace_id and seq information in the tomcat access log.

 

3. Business log format

    Our java programs also print logs of their own. These logs are very important for troubleshooting, business monitoring and data statistics, but their consumers are usually the big data team, the BI department, the architecture team and so on, so standardizing their format is critical; a uniform format also makes cross-team collaboration much easier for development engineers.

    The principle stays the same: make data organization easy and divide the line into domains. To make this easier to enforce, we also agree that the logging components are logback + slf4j, because we rely on logback's MDC mechanism to print additional runtime information.

<encoder class="ch.qos.logback.classic.encoder.PatternLayoutEncoder">
    <pattern>%d{yyyy-MM-dd/HH:mm:ss.SSS}|%X{localIp}|%X{requestId}|%X{requestSeq}|^_^|uid:%X{uid}|^_^|[%t]|%-5level|%logger{50} %line - %m%n</pattern>
</encoder>

 

    Throughout the whole chain, we pass the request_id from nginx to tomcat and then into the java application, mainly to pave the way for the request tracking platform later on; it also makes troubleshooting easier. Note that both %X{requestId} and %X{requestSeq} are populated by an MDC filter.
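
    For reference, here is a minimal sketch of such an MDC filter, assuming the Servlet API and slf4j; the header names match the nginx configuration above (X-Request-ID, X-Request-Seq), and the MDC keys match the logback pattern (localIp, requestId, requestSeq). The class name is illustrative:

import java.io.IOException;
import java.net.InetAddress;
import java.net.UnknownHostException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;
import org.slf4j.MDC;

// Illustrative sketch of the MDC filter mentioned above: it copies the trace
// headers set by nginx into the MDC so the logback pattern can print
// %X{localIp}, %X{requestId} and %X{requestSeq}.
public class TraceMdcFilter implements Filter {

    private static final String LOCAL_IP = resolveLocalIp();

    @Override
    public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
            throws IOException, ServletException {
        HttpServletRequest req = (HttpServletRequest) request;
        try {
            MDC.put("localIp", LOCAL_IP);
            MDC.put("requestId", valueOrDefault(req.getHeader("X-Request-ID"), "-"));
            MDC.put("requestSeq", valueOrDefault(req.getHeader("X-Request-Seq"), "0"));
            chain.doFilter(request, response);
        } finally {
            MDC.clear(); // do not leak trace data into the next request handled by this thread
        }
    }

    @Override
    public void init(FilterConfig filterConfig) { }

    @Override
    public void destroy() { }

    private static String valueOrDefault(String value, String defaultValue) {
        return (value == null || value.isEmpty()) ? defaultValue : value;
    }

    private static String resolveLocalIp() {
        try {
            return InetAddress.getLocalHost().getHostAddress();
        } catch (UnknownHostException e) {
            return "unknown";
        }
    }
}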

    In the second domain, each project may add its own custom log fields in KV form, for example "uid:10010|orderId:10000": key and value are separated by ":" and pairs by "|". Our log analysis component can then parse this domain into a map and store it, for example in ES. The reason for this design is that every business system needs to print different custom fields, which would otherwise be a great challenge for the downstream data analysis and statistics components; the KV convention keeps them loosely coupled.
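
    A small sketch of how the analysis component might parse that custom KV domain into a map (the class name is hypothetical):

import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch: parse a custom KV domain such as "uid:10010|orderId:10000"
// into a map, so downstream components (e.g. an ES indexer) do not need to know
// each project's field list in advance.
public class KvDomainParser {

    public static Map<String, String> parse(String kvDomain) {
        Map<String, String> fields = new LinkedHashMap<>();
        if (kvDomain == null || kvDomain.isEmpty()) {
            return fields;
        }
        for (String pair : kvDomain.split("\\|")) {
            int idx = pair.indexOf(':');
            if (idx > 0) {
                fields.put(pair.substring(0, idx), pair.substring(idx + 1));
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(parse("uid:10010|orderId:10000")); // {uid=10010, orderId=10000}
    }
}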

 

    As mentioned above, the size of business log files must be controlled with a rolling strategy. We agree that every project uses logback 1.1.7 or later, with a unified rolling policy:

<appender name="FILE" class="ch.qos.logback.core.rolling.RollingFileAppender">
    <file>${LOG_HOME}/order-center.log</file>
    <rollingPolicy class="ch.qos.logback.core.rolling.SizeAndTimeBasedRollingPolicy">
        <fileNamePattern>${LOG_HOME}/order-center.log.%d{yyyy-MM-dd}.%i</fileNamePattern>
        <maxFileSize>256MB</maxFileSize>
        <maxHistory>15</maxHistory>
        <totalSizeCap>32GB</totalSizeCap>
    </rollingPolicy>
….
</appender>

 

4. Basic data platform planning

    1) Based on Flume, collect the log files of all projects across the network, both historical files (daily rotated files and rolled files) and real-time log data (tail). Store the log files centrally on one or more bastion machines and categorize them by project and time.

    2) Business systems are usually deployed in a distributed way: one project runs on several machines and each machine produces its own log data, so troubleshooting an online problem means logging in to multiple machines. To solve this, we merge the real-time log output of each project into a single file on the bastion machine, for example /order-center/2017-10-11/access.log.tail, which contains the real-time logs of all tomcat instances of the order-center project for 2017-10-11, mixed together.

    3) Flume forwards the real-time logs to Kafka, and other data statistics and analysis systems consume the data from Kafka. Our Flume setup uses a "multi-layer" design: a Flume agent is deployed on every application host, the agents of the same project send data to a Flume collector (also a Flume agent), and the collector takes care of classification, local storage, filtering, forwarding to Kafka and so on.

    4) Analyze the real-time log stream with Storm + Kafka, matching the log content against rules. When a "business anomaly", "traffic anomaly" or similar condition is detected, an alarm is triggered, and the related entries are organized into a tracing chain and shown to the relevant people.
