Log Archiving and Data Mining (Log Center)


http://netkiller.github.io/journal/log.html

Mr. Neo Chen (陈景峰), netkiller, BG7NYT


Xishan Meidi, Minzhi Street, Longhua, Shenzhen, Guangdong Province, China 518131 +86 13113668890 +86 755 29812080



Copyright Notice

Please contact the author before reprinting. When reprinting, be sure to indicate the original source of the article, the author information, and this statement.

Documentation source:
http://netkiller.github.io
http://netkiller.sourceforge.net

 

2014-12-16

Summary

2013-03-19 First edition

2014-12-16 Second edition


1. What is log archiving

Log archiving is the process of organizing the logs produced by a system, keeping the valuable files, and sending them to a log server for preservation.

2. Why archive logs

  • Query historical logs at any time.
  • Mine the logs for valuable data.
  • Observe the working status of applications.

3. When to do log archiving

Log archiving should be a process ("archiving system") mandated by the enterprise, and it should be considered from the very beginning of system construction. If your business does not yet have such a process, I recommend implementing one immediately after reading this article.

4. Where to put archived logs

A simple single-node server plus a backup solution is enough to start with.

As the log volume grows, a distributed file system will have to be adopted eventually, possibly even with off-site disaster recovery.

5. Who does log archiving

My answer: automate the log archiving, and verify it with manual inspections or spot checks.

6. How to do log archiving

There are several ways to aggregate the logs of all servers in one place.

Common methods of log archiving:
  • FTP download: suitable for small files and low log volume; logs are periodically downloaded to a designated server. The drawbacks are repeated transfers and poor real-time behavior.
  • Syslog-style programs such as rsyslog: more general-purpose, but inconvenient to extend.
  • rsync: incremental synchronization, well suited to transferring log files and better than FTP, but still weak in real time (a cron-driven sketch follows below).
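
For the rsync approach, a cron-driven sketch might look like this (the host loghost, the destination path, and the hourly schedule are all assumptions):

# crontab on each node: sync local nginx logs to the log server once an hour
0 * * * * rsync -az /var/log/nginx/ loghost:/backup/$(hostname)/nginx/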

6.1. Log format conversion

First, let me introduce a simple solution.

I wrote a program in the D language that decomposes web logs with a regular expression and pipes the result to a database handler.

6.1.1. Putting logs into the database

Pipe the web server log through the handler, which writes it into the database.

Handler source code:

				
$ vim match.d
import std.regex;
import std.stdio;
import std.string;
import std.array;

void main()
{
    // nginx
	//auto r = regex(`^(\S+) (\S+) (\S+) \[(.+)\] "([^"]+)" ([0-9]{3}) ([0-9]+) "([^"]+)" "([^"]+)" "([^"]+)"`);

	// apache2
	auto r = regex(`^(\S+) (\S+) (\S+) \[(.+)\] "([^"]+)" ([0-9]{3}) ([0-9]+) "([^"]+)" "([^"]+)"`);

	foreach(line; stdin.byLine)
	{

		foreach(m; match(line, r)){
			//writeln(m.hit);
			auto c = m.captures;
			c.popFront ();
			//writeln(c);
			auto value = join(c, "\",\"");
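			// NOTE: the column list in the SQL below matches the 10-group
			// nginx regex; the 9-group apache2 regex omits
			// http_x_forwarded_for, so drop that column or the INSERT fails.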
			auto sql = format("insert into log(remote_addr,unknow,remote_user,time_local,request,status,body_bytes_sent,http_referer,http_user_agent,http_x_forwarded_for) value(\"%s\");", value );
			writeln(sql);
		}
	}
}
				
				

compile

$ dmd match.d
$ strip match

$ ls
match  match.d  match.o
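
Before piping the generated SQL into MySQL, the log table must exist. A minimal sketch of it (the column types are my assumption; everything is text because the handler quotes every captured value, and the column names, including "unknow", mirror the INSERT above):

CREATE TABLE log (
    remote_addr          VARCHAR(45),
    unknow               VARCHAR(16),
    remote_user          VARCHAR(64),
    time_local           VARCHAR(32),
    request              VARCHAR(1024),
    status               VARCHAR(3),
    body_bytes_sent      VARCHAR(16),
    http_referer         VARCHAR(1024),
    http_user_agent      VARCHAR(512),
    http_x_forwarded_for VARCHAR(64)
);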
				

Simple usage

$ cat access.log | ./match
				

Advanced usage

				
$ cat access.log | match | mysql -hlocalhost -ulog -p123456 logging
				
				

To process logs in real time, first create a named pipe (FIFO) and have the web server write its log into the pipe.

cat pipename | match | mysql -hlocalhost -ulog -p123456 logging
				

This enables real-time log insertion.
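
A sketch of that setup (the pipe path /tmp/pipename is an assumption):

$ mkfifo /tmp/pipename
$ cat /tmp/pipename | ./match | mysql -hlocalhost -ulog -p123456 logging &

Point the web server's access log at /tmp/pipename; every line written into the pipe is then parsed and inserted as it arrives.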

hint

With slight modifications, the program above could write to HBase or Hypertable instead.

6.1.2. Apache Pipe

Apache can filter its log through a pipe: CustomLog "| /srv/match >> /tmp/access.log" combined

				
<VirtualHost *:80>
        ServerAdmin webmaster@localhost

        #DocumentRoot /var/www
        DocumentRoot /www
        <Directory />
                Options FollowSymLinks
                AllowOverride None
        </Directory>
        #<Directory /var/www/>
        <Directory /www/>
                Options Indexes FollowSymLinks MultiViews
                AllowOverride None
                Order allow,deny
                allow from all
        </Directory>

        ScriptAlias /cgi-bin/ /usr/lib/cgi-bin/
        <Directory "/usr/lib/cgi-bin">
                AllowOverride None
                Options +ExecCGI -MultiViews +SymLinksIfOwnerMatch
                Order allow,deny
                Allow from all
        </Directory>

        ErrorLog ${APACHE_LOG_DIR}/error.log

        # Possible values include: debug, info, notice, warn, error, crit,
        # alert, emerg.
        LogLevel warn

        #CustomLog ${APACHE_LOG_DIR}/access.log combined
        CustomLog "| /srv/match >> /tmp/access.log" combined

    Alias /doc/ "/usr/share/doc/"
    <Directory "/usr/share/doc/">
        Options Indexes MultiViews FollowSymLinks
        AllowOverride None
        Order deny,allow
        Deny from all
        Allow from 127.0.0.0/255.0.0.0 ::1/128
    </Directory>

</VirtualHost>
				
				

The resulting log after transformation by the pipe:

				
$ tail /tmp/access.log
insert into log(remote_addr,unknow,remote_user,time_local,request,status,body_bytes_sent,http_referer,http_user_agent,http_x_forwarded_for) value("192.168.6.30","-","-","21/Mar/2013:16:11:00 +0800","GET / HTTP/1.1","304","208","-","Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.22 (KHTML, like Gecko) Chrome/25.0.1364.172 Safari/537.22");
insert into log(remote_addr,unknow,remote_user,time_local,request,status,body_bytes_sent,http_referer,http_user_agent,http_x_forwarded_for) value("192.168.6.30","-","-","21/Mar/2013:16:11:00 +0800","GET /favicon.ico HTTP/1.1","404","501","-","Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.22 (KHTML, like Gecko) Chrome/25.0.1364.172 Safari/537.22");
insert into log(remote_addr,unknow,remote_user,time_local,request,status,body_bytes_sent,http_referer,http_user_agent,http_x_forwarded_for) value("192.168.6.30","-","-","21/Mar/2013:16:11:00 +0800","GET / HTTP/1.1","304","208","-","Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.22 (KHTML, like Gecko) Chrome/25.0.1364.172 Safari/537.22");
				
				

6.1.3. Log format

By defining a custom LogFormat, you can even have the server output each log line directly in SQL form.

Apache

LogFormat "%v:%p %h %l %u %t \"%r\" %>s %O \"%{Referer}i\" \"%{User-Agent}i\"" vhost_combined
LogFormat "%h %l %u %t \"%r\" %>s %O \"%{Referer}i\" \"%{User-Agent}i\"" combined
LogFormat "%h %l %u %t \"%r\" %>s %O" common
LogFormat "%{Referer}i -> %U" referer
LogFormat "%{User-agent}i" agent
				

Nginx

    log_format  main  '$remote_addr - $remote_user [$time_local] "$request" '
                      '$status $body_bytes_sent "$http_referer" '
                      '"$http_user_agent" "$http_x_forwarded_for"';
				

But such a format makes it troublesome for system administrators to analyze the logs with grep, awk, sed, sort, and uniq, so I suggest keeping the standard format and decomposing it with regular expressions; the classic command-line analysis shown below still works.
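
For example, counting the top client IPs with the classic toolchain only works on the standard space-separated format:

$ awk '{print $1}' access.log | sort | uniq -c | sort -rn | head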

To generate a regularly structured, machine-parsable (CSV-style) log format in Apache:

LogFormat \
        "\"%h\",%{%Y%m%d%H%M%S}t,%>s,\"%b\",\"%{Content-Type}o\",  \
        \"%U\",\"%{Referer}i\",\"%{User-Agent}i\""
				

Import the access.log file into MySQL:

LOAD DATA INFILE '/local/access_log' INTO TABLE tbl_name
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' ESCAPED BY '\\'
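
The table tbl_name must have columns matching the fields emitted by the LogFormat above, in the same order. A hypothetical definition (the names and types are my assumption):

CREATE TABLE tbl_name (
    host         VARCHAR(45),
    time         VARCHAR(14),
    status       VARCHAR(3),
    bytes        VARCHAR(16),
    content_type VARCHAR(128),
    url          VARCHAR(1024),
    referer      VARCHAR(1024),
    user_agent   VARCHAR(512)
);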
				

6.1.4. Importing logs into MongoDB

# rpm -Uvh http://dl.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm
# yum install mongodb
				

D language log handler

				
import std.regex;
//import std.range;
import std.stdio;
import std.string;
import std.array;

void main()
{
	// nginx
	auto r = regex(`^(\S+) (\S+) (\S+) \[(.+)\] "([^"]+)" ([0-9]{3}) ([0-9]+) "([^"]+)" "([^"]+)" "([^"]+)"`);
	// apache2
	//auto r = regex(`^(\S+) (\S+) (\S+) \[(.+)\] "([^"]+)" ([0-9]{3}) ([0-9]+) "([^"]+)" "([^"]+)"`);
	foreach(line; stdin.byLine)
	{
		//writeln(line);
		//auto m = match(line, r);
		foreach(m; match(line, r)){
			//writeln(m.hit);
			auto c = m.captures;
			c.popFront ();
			//writeln(c);
			/*
			SQL
			auto value = join(c, "\",\"");
			auto sql = format("insert into log(remote_addr,unknow,remote_user,time_local,request,status,body_bytes_sent,http_referer,http_user_agent,http_x_forwarded_for) value(\"%s\");", value );
			writeln(sql);
			*/
			// MongoDB
			string bson = format("db.logging.access.save({
						'remote_addr': '%s',
						'remote_user': '%s',
						'time_local': '%s',
						'request': '%s',
						'status': '%s',
						'body_bytes_sent':'%s',
						'http_referer': '%s',
						'http_user_agent': '%s',
						'http_x_forwarded_for': '%s'
						})",
						c[0],c[2],c[3],c[4],c[5],c[6],c[7],c[8],c[9]
						);
			writeln(bson);

		}
	}

}
				
				

compile log handler

dmd mlog.d
				

usage

cat /var/log/nginx/access.log | mlog | mongo 192.169.0.5/logging -uxxx -pxxx
				

Handling missed logs

# zcat /var/log/nginx/*.access.log-*.gz | /srv/mlog | mongo 192.168.6.1/logging -uneo -pchen
				

Collect logs in real time

tail -f /var/log/nginx/access.log | mlog | mongo 192.169.0.5/logging -uxxx -pxxx
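
Once the data is in MongoDB, it can be mined straight from the mongo shell; for example, counting 404 responses (a sketch reusing the collection name from the handler above):

$ mongo 192.169.0.5/logging -uxxx -pxxx --eval "db.logging.access.find({status: '404'}).count()"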
				

6.2. Log Center Solution

Although the above solution is simple, it relies too heavily on the system administrator and requires configuring many servers; every application produces logs in a different format, so it all gets very complicated, and any failure along the way means lost log entries.

So I went back to the starting point: every server stores its own logs locally and synchronizes them to the log server on a schedule, which solves the archiving problem; at the same time, logs are collected remotely by pushing them to the log center over UDP, which covers real-time monitoring, log capture, and other latency-sensitive requirements.

For this I spent two or three days writing a piece of software; download it from: https://github.com/netkiller/logging

This solution is not the best, but it suits my scenario, and the development took only two or three days. Later I will extend it further and add message-queue delivery of logs. A sketch of the underlying push principle follows.
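
To illustrate the push principle only (this is a sketch, not the netkiller/logging implementation; the host, the port, and bash's /dev/udp pseudo-device are assumptions):

# node side: send each new log line as one UDP packet to the log center
tail -f /var/log/nginx/access.log | while read -r line; do
    echo "$line" > /dev/udp/192.168.6.1/1213
done

# log center side: a throwaway UDP listener (OpenBSD netcat)
nc -klu 1213 >> /tmp/nginx/access.log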

6.2.1. Software Installation

$ git clone https://github.com/netkiller/logging.git
$ cd logging
$ python3 setup.py sdist
$ python3 setup.py install
				

6.2.2. Node Pusher

Install startup script

CentOS

# cp logging/init.d/ulog /etc/init.d			
				

Ubuntu

$ sudo cp init.d/ulog /etc/init.d/	

$ service ulog
Usage: /etc/init.d/ulog {start|stop|status|restart}			
				

Configure the script: open the /etc/init.d/ulog file.

Configure the IP address of the log center

HOST=xxx.xxx.xxx.xxx
				

Then configure the ports and which logs to collect:

				
	done << EOF
1213 /var/log/nginx/access.log
1214 /tmp/test.log
1215 /tmp/$(date +"%Y-%m-%d.%H:%M:%S").log
EOF
				
				

The format is

Port | Logfile
------------------------------
1213 /var/log/nginx/access.log
1214 /tmp/test.log
1215 /tmp/$(date +"%Y-%m-%d.%H:%M:%S").log
				

1213 is the destination port number (the log center port), followed by the log you need to monitor. If the log rotates to a new file every day, write it in the form /tmp/$(date +"%Y-%m-%d.%H:%M:%S").log

hint

To follow a log file that is regenerated every day, you need to restart ulog regularly with /etc/init.d/ulog restart
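
A root crontab entry can take care of this; for example, restarting just after midnight (the time is an assumption, match it to your rotation schedule):

0 0 * * * /etc/init.d/ulog restart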

Start the pusher after the configuration is complete

# service ulog start
				

View status

$ service ulog status
13865 pts/16   S      0:00 /usr/bin/python3 /usr/local/bin/rlog -d -H 127.0.0.1 -p 1213 /var/log/nginx/access.log				
				

Stop the pusher

# service ulog stop
				

6.2.3. Log collector

# cp logging/init.d/ucollection /etc/init.d

# /etc/init.d/ucollection
Usage: /etc/init.d/ucollection {start|stop|status|restart}
				

Configure the receiving ports and destination files: open the /etc/init.d/ucollection file and find the following section.

				
done << EOF
1213 /tmp/nginx/access.log
1214 /tmp/test/test.log
1215 /tmp/app/$(date +"%Y-%m-%d.%H:%M:%S").log
1216 /tmp/db/$(date +"%Y-%m-%d")/mysql.log
1217 /tmp/cache/$(date +"%Y")/$(date +"%m")/$(date +"%d")/cache.log
EOF
				
				

The format is as follows; this line means: receive data from port 1213 and save it to the /tmp/nginx/access.log file.

Port | Logfile
1213 /tmp/nginx/access.log
				

If you need to split logs by date, configure as follows:

1217 /tmp/cache/$(date +"%Y")/$(date +"%m")/$(date +"%d")/cache.log
				

With the above configuration, the log file will be created in the following directory tree:

$ find /tmp/cache/
/tmp/cache/
/tmp/cache/2014
/tmp/cache/2014/12
/tmp/cache/2014/12/16
/tmp/cache/2014/12/16/cache.log
				

hint

Likewise, if the logs are split by date, the collector needs to be restarted regularly (the same cron approach as for ulog applies).

Start the collector

# service ucollection start
				

Stop the collector

# service ucollection stop
				

View status

$ init.d/ucollection status
12429 pts/16   S      0:00 /usr/bin/python3 /usr/local/bin/collection -d -p 1213 -l /tmp/nginx/access.log
12432 pts/16   S      0:00 /usr/bin/python3 /usr/local/bin/collection -d -p 1214 -l /tmp/test/test.log
12435 pts/16   S      0:00 /usr/bin/python3 /usr/local/bin/collection -d -p 1215 -l /tmp/app/2014-12-16.09:55:15.log
12438 pts/16   S      0:00 /usr/bin/python3 /usr/local/bin/collection -d -p 1216 -l /tmp/db/2014-12-16/mysql.log
12441 pts/16   S      0:00 /usr/bin/python3 /usr/local/bin/collection -d -p 1217 -l /tmp/cache/2014/12/16/cache.log
				

6.2.4. Log Monitoring

Monitor the data arriving on a given port (port 1213 in this example):

$ collection -p 1213

192.168.6.20 - - [16/Dec/2014:15:06:23 +0800] "GET /journal/log.html HTTP/1.1" 304 0 "-" "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36"
192.168.6.20 - - [16/Dec/2014:15:06:23 +0800] "GET /journal/docbook.css HTTP/1.1" 304 0 "http://192.168.6.2/journal/log.html" "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36"
192.168.6.20 - - [16/Dec/2014:15:06:23 +0800] "GET /journal/journal.css HTTP/1.1" 304 0 "http://192.168.6.2/journal/log.html" "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36"
192.168.6.20 - - [16/Dec/2014:15:06:23 +0800] "GET /images/by-nc-sa.png HTTP/1.1" 304 0 "http://192.168.6.2/journal/log.html" "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36"
192.168.6.20 - - [16/Dec/2014:15:06:23 +0800] "GET /js/q.js HTTP/1.1" 304 0 "http://192.168.6.2/journal/log.html" "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36"
				

After startup, the newest log lines are streamed in real time.
