Log Archiving and Data Mining
http://netkiller.github.io/journal/log.html
Copyright © 2013, 2014 Netkiller. All rights reserved.
Copyright Notice
Please contact the author before reprinting. When reprinting, be sure to indicate the original source of the article, the author information, and this notice.
2013-03-19 First edition
2014-12-16 Second edition
1. What is log archiving
Archiving is the process of organizing logs, keeping the valuable files, and sending them to a log server for long-term storage.
2. Why archive logs
- Query historical logs at any time.
- Mine the logs for valuable data.
- Monitor the working status of applications.
3. When to do log archiving
Log archiving should be governed by an enterprise policy (an "archiving policy"), and it should be considered from the very start of system construction. If your organization does not yet have such a policy, I recommend putting one in place as soon as you finish reading this article.
4. Where to put archived logs
A simple one-node server plus backup solution can be used.
As log volume grows, a distributed file system will eventually have to be adopted, possibly with off-site disaster recovery.
5. Who does log archiving
My answer: automate the archiving, then verify it with manual inspections or spot checks.
6. How to do log archiving
There are several ways to aggregate logs from all servers in one place:
- ftp download: suitable for small files and low log volume; logs must be downloaded to a designated server. The drawbacks are repeated transfers and poor real-time behavior.
- rsyslog and similar programs: more general-purpose, but inconvenient to extend.
- rsync: file synchronization, better suited than FTP, but real-time behavior is still poor.
6.1. Log format conversion
First let me introduce a simple solution.
I wrote a program in the D language that uses regular expressions to decompose web logs, then pipes the result to a database handler.
6.1.1. Putting logs into the database
Pipe the web server's log into a handler that writes it to the database.
Handler source code:
$ vim match.d

import std.regex;
import std.stdio;
import std.string;
import std.array;

void main()
{
	// nginx
	//auto r = regex(`^(\S+) (\S+) (\S+) \[(.+)\] "([^"]+)" ([0-9]{3}) ([0-9]+) "([^"]+)" "([^"]+)" "([^"]+)"`);
	// apache2
	auto r = regex(`^(\S+) (\S+) (\S+) \[(.+)\] "([^"]+)" ([0-9]{3}) ([0-9]+) "([^"]+)" "([^"]+)"`);

	foreach(line; stdin.byLine)
	{
		foreach(m; match(line, r)){
			//writeln(m.hit);
			auto c = m.captures;
			c.popFront();
			//writeln(c);
			auto value = join(c, "\",\"");
			auto sql = format("insert into log(remote_addr,unknow,remote_user,time_local,request,status,body_bytes_sent,http_referer,http_user_agent,http_x_forwarded_for) value(\"%s\");", value);
			writeln(sql);
		}
	}
}
compile

$ dmd match.d
$ strip match
$ ls
match  match.d  match.o
Simple usage
$ cat access.log | ./match
Advanced usage
$ cat access.log | match | mysql -hlocalhost -ulog -p123456 logging
To process logs in real time, first create a named pipe (with mkfifo) and have the web server write its log into the pipe. Then read from the pipe:

cat pipename | match | mysql -hlocalhost -ulog -p123456 logging

This enables real-time insertion of logs into the database.
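A minimal sketch of the named-pipe idea in Python (the pipe path and the log line are made up for illustration; in practice the web server writes into the FIFO and the match program reads from it):

```python
import os
import tempfile
import threading

# Create a named pipe (FIFO); the web server would be configured to
# write its access log to this path instead of a regular file.
fifo = os.path.join(tempfile.mkdtemp(), "access.pipe")
os.mkfifo(fifo)

# Simulate the web server writing one log line into the pipe.
def writer():
    with open(fifo, "w") as f:
        f.write('192.168.6.30 - - [21/Mar/2013:16:11:00 +0800] "GET / HTTP/1.1" 304 208\n')

threading.Thread(target=writer).start()

# The reader blocks until data arrives, so every new line can be parsed
# and inserted into the database the moment it is written.
lines = []
with open(fifo) as f:
    for line in f:
        lines.append(line.rstrip("\n"))

print(lines[0])
```

Because a FIFO never stores data on disk, the reader sees each line as soon as the writer emits it, which is exactly what makes the pipeline real-time.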
hint
The above program can be modified slightly to produce HBase or Hypertable versions.
6.1.2. Apache Pipe
Apache can filter its log through a pipe:

CustomLog "| /srv/match >> /tmp/access.log" combined
<VirtualHost *:80>
	ServerAdmin webmaster@localhost

	#DocumentRoot /var/www
	DocumentRoot /www
	<Directory />
		Options FollowSymLinks
		AllowOverride None
	</Directory>
	#<Directory /var/www/>
	<Directory /www/>
		Options Indexes FollowSymLinks MultiViews
		AllowOverride None
		Order allow,deny
		allow from all
	</Directory>

	ScriptAlias /cgi-bin/ /usr/lib/cgi-bin/
	<Directory "/usr/lib/cgi-bin">
		AllowOverride None
		Options +ExecCGI -MultiViews +SymLinksIfOwnerMatch
		Order allow,deny
		Allow from all
	</Directory>

	ErrorLog ${APACHE_LOG_DIR}/error.log

	# Possible values include: debug, info, notice, warn, error, crit,
	# alert, emerg.
	LogLevel warn

	#CustomLog ${APACHE_LOG_DIR}/access.log combined
	CustomLog "| /srv/match >> /tmp/access.log" combined

	Alias /doc/ "/usr/share/doc/"
	<Directory "/usr/share/doc/">
		Options Indexes MultiViews FollowSymLinks
		AllowOverride None
		Order deny,allow
		Deny from all
		Allow from 127.0.0.0/255.0.0.0 ::1/128
	</Directory>
</VirtualHost>
The log after pipeline transformation looks like this:
$ tail /tmp/access.log
insert into log(remote_addr,unknow,remote_user,time_local,request,status,body_bytes_sent,http_referer,http_user_agent,http_x_forwarded_for) value("192.168.6.30","-","-","21/Mar/2013:16:11:00 +0800","GET / HTTP/1.1","304","208","-","Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.22 (KHTML, like Gecko) Chrome/25.0.1364.172 Safari/537.22");
insert into log(remote_addr,unknow,remote_user,time_local,request,status,body_bytes_sent,http_referer,http_user_agent,http_x_forwarded_for) value("192.168.6.30","-","-","21/Mar/2013:16:11:00 +0800","GET /favicon.ico HTTP/1.1","404","501","-","Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.22 (KHTML, like Gecko) Chrome/25.0.1364.172 Safari/537.22");
insert into log(remote_addr,unknow,remote_user,time_local,request,status,body_bytes_sent,http_referer,http_user_agent,http_x_forwarded_for) value("192.168.6.30","-","-","21/Mar/2013:16:11:00 +0800","GET / HTTP/1.1","304","208","-","Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.22 (KHTML, like Gecko) Chrome/25.0.1364.172 Safari/537.22");
6.1.3. Log format
By defining a LogFormat, you can output logs directly in SQL form.
Apache
LogFormat "%v:%p %h %l %u %t \"%r\" %>s %O \"%{Referer}i\" \"%{User-Agent}i\"" vhost_combined
LogFormat "%h %l %u %t \"%r\" %>s %O \"%{Referer}i\" \"%{User-Agent}i\"" combined
LogFormat "%h %l %u %t \"%r\" %>s %O" common
LogFormat "%{Referer}i -> %U" referer
LogFormat "%{User-agent}i" agent
Nginx
log_format main '$remote_addr - $remote_user [$time_local] "$request" '
                '$status $body_bytes_sent "$http_referer" '
                '"$http_user_agent" "$http_x_forwarded_for"';
However, SQL-form logs make it troublesome for system administrators to analyze them with grep, awk, sed, sort, and uniq. So I suggest keeping the regular log format and decomposing it with regular expressions instead.
To generate a regular, comma-separated log format in Apache:
LogFormat "\"%h\",%{%Y%m%d%H%M%S}t,%>s,\"%b\",\"%{Content-Type}o\", \
\"%U\",\"%{Referer}i\",\"%{User-Agent}i\""
Import the access.log file into MySQL:

LOAD DATA INFILE '/local/access_log' INTO TABLE tbl_name
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' ESCAPED BY '\\'
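For comparison, the regex decomposition suggested above can be sketched in Python, mirroring the apache2 pattern used by the D handler (the sample log line is made up):

```python
import re

# Capture groups, in order: host, identity, user, time, request,
# status, bytes sent, referer, user agent (apache2 combined format)
pattern = re.compile(
    r'^(\S+) (\S+) (\S+) \[(.+)\] "([^"]+)" ([0-9]{3}) ([0-9]+) "([^"]+)" "([^"]+)"'
)

line = ('192.168.6.30 - - [21/Mar/2013:16:11:00 +0800] '
        '"GET / HTTP/1.1" 304 208 "-" "Mozilla/5.0"')

fields = pattern.match(line).groups()
print(fields)
```

Each tuple element maps to one column of the log table, so the same decomposition can feed either an INSERT statement or a CSV file for LOAD DATA.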
6.1.4. Importing logs into MongoDB
# rpm -Uvh http://dl.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm
# yum install mongodb
D language log handler
import std.regex;
//import std.range;
import std.stdio;
import std.string;
import std.array;

void main()
{
	// nginx
	auto r = regex(`^(\S+) (\S+) (\S+) \[(.+)\] "([^"]+)" ([0-9]{3}) ([0-9]+) "([^"]+)" "([^"]+)" "([^"]+)"`);
	// apache2
	//auto r = regex(`^(\S+) (\S+) (\S+) \[(.+)\] "([^"]+)" ([0-9]{3}) ([0-9]+) "([^"]+)" "([^"]+)"`);

	foreach(line; stdin.byLine)
	{
		//writeln(line);
		//auto m = match(line, r);
		foreach(m; match(line, r)){
			//writeln(m.hit);
			auto c = m.captures;
			c.popFront();
			//writeln(c);
			/* SQL
			auto value = join(c, "\",\"");
			auto sql = format("insert into log(remote_addr,unknow,remote_user,time_local,request,status,body_bytes_sent,http_referer,http_user_agent,http_x_forwarded_for) value(\"%s\");", value);
			writeln(sql);
			*/
			// MongoDB
			string bson = format("db.logging.access.save({ 'remote_addr': '%s', 'remote_user': '%s', 'time_local': '%s', 'request': '%s', 'status': '%s', 'body_bytes_sent':'%s', 'http_referer': '%s', 'http_user_agent': '%s', 'http_x_forwarded_for': '%s' })",
				c[0],c[2],c[3],c[4],c[5],c[6],c[7],c[8],c[9]);
			writeln(bson);
		}
	}
}
compile log handler
dmd mlog.d
usage
cat /var/log/nginx/access.log | mlog | mongo 192.169.0.5/logging -uxxx -pxxx
Handling missed logs
# zcat /var/log/nginx/*.access.log-*.gz | /srv/mlog | mongo 192.168.6.1/logging -uneo -pchen
Collect logs in real time
tail -f /var/log/nginx/access.log | mlog | mongo 192.169.0.5/logging -uxxx -pxxx
6.2. Log Center Solution
Although the above solution is simple, it relies too heavily on the system administrator: many servers must be configured, and each application produces logs in a different format, so the whole setup becomes very complicated. And if anything fails along the way, a stretch of logs is lost.
So I went back to the starting point: every server keeps its own logs locally and synchronizes them to the log server on a schedule, which solves log archiving. For needs with strict real-time requirements, such as live monitoring and capture, logs are also collected remotely and pushed to the log center over UDP.
For this I spent two or three days writing a piece of software; download address: https://github.com/netkiller/logging
This solution is not the best, but it fits my scenario, and the development only took two or three days. Later I will extend it further and add the ability to deliver logs through a message queue.
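The push-and-collect idea over UDP can be sketched in a few lines of Python. This is only an illustration of the transport, not the tool's actual protocol; the address, port, and sample line are made up:

```python
import socket

# Collector side: bind a UDP socket; each datagram received is one log
# line that would be appended to the archive file at the log center.
collector = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
collector.bind(("127.0.0.1", 0))   # the real tool listens on a fixed port, e.g. 1213
addr = collector.getsockname()

# Pusher side: tail the log file and send each new line as one datagram.
pusher = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
pusher.sendto(
    b'192.168.6.20 - - [16/Dec/2014:15:06:23 +0800] "GET / HTTP/1.1" 304 0',
    addr)

data, _ = collector.recvfrom(65535)
print(data.decode())
```

UDP keeps the pusher lightweight and non-blocking, which is what makes this suitable for real-time monitoring; the periodic rsync-style synchronization remains the authoritative archive copy.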
6.2.1. Software Installation
$ git clone https://github.com/netkiller/logging.git
$ cd logging
$ python3 setup.py sdist
$ python3 setup.py install
6.2.2. Node Pusher
Install startup script
CentOS
# cp logging/init.d/ulog /etc/init.d
Ubuntu
$ sudo cp init.d/ulog /etc/init.d/
$ service ulog
Usage: /etc/init.d/ulog {start|stop|status|restart}
Configure the script: open the /etc/init.d/ulog file.
Configure the IP address of the log center
HOST=xxx.xxx.xxx.xxx
Then configure the ports and the logs to collect:
done << EOF
1213 /var/log/nginx/access.log
1214 /tmp/test.log
1215 /tmp/$(date +"%Y-%m-%d.%H:%M:%S").log
EOF
The format is:

Port | Logfile
------------------------------
1213 /var/log/nginx/access.log
1214 /tmp/test.log
1215 /tmp/$(date +"%Y-%m-%d.%H:%M:%S").log
1213 is the destination port number (the log center's port), followed by the log file you want to monitor. If the log produces a new file every day, write it in the form /tmp/$(date +"%Y-%m-%d.%H:%M:%S").log.
hint
To generate a new log file every day, you need to restart ulog periodically. The method is /etc/init.d/ulog restart.

Start the pusher after the configuration is complete

# service ulog start
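The periodic restart mentioned in the hint can be automated with cron; a sketch of a root crontab entry (the time chosen here is arbitrary, the script path is from the installation step above):

```shell
# Restart the pusher shortly after midnight so that a log file with the
# new date is opened (add via `crontab -e` as root)
5 0 * * * /etc/init.d/ulog restart
```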
View status
$ service ulog status
13865 pts/16 S 0:00 /usr/bin/python3 /usr/local/bin/rlog -d -H 127.0.0.1 -p 1213 /var/log/nginx/access.log
stop the pusher

# service ulog stop
6.2.3. Log collector
# cp logging/init.d/ucollection /etc/init.d
# /etc/init.d/ucollection
Usage: /etc/init.d/ucollection {start|stop|status|restart}
Configure the receiving ports and destination files: open the /etc/init.d/ucollection file and look at the following section.
done << EOF
1213 /tmp/nginx/access.log
1214 /tmp/test/test.log
1215 /tmp/app/$(date +"%Y-%m-%d.%H:%M:%S").log
1216 /tmp/db/$(date +"%Y-%m-%d")/mysql.log
1217 /tmp/cache/$(date +"%Y")/$(date +"%m")/$(date +"%d")/cache.log
EOF
The format is as follows; this entry means data received on port 1213 is saved to the /tmp/nginx/access.log file.
Port | Logfile
1213 /tmp/nginx/access.log
If you need to split logs by date, configure as follows:
1217 /tmp/cache/$(date +"%Y")/$(date +"%m")/$(date +"%d")/cache.log
With the above configuration, log files will be generated in the following directories:
$ find /tmp/cache/
/tmp/cache/
/tmp/cache/2014
/tmp/cache/2014/12
/tmp/cache/2014/12/16
/tmp/cache/2014/12/16/cache.log
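The collector has to create this directory hierarchy before it can open the file; the idea can be sketched in Python (the base directory here is a temporary stand-in for /tmp/cache, and the log line is made up):

```python
import os
import tempfile
from datetime import datetime

base = tempfile.mkdtemp()   # stand-in for /tmp/cache

# Build the %Y/%m/%d hierarchy for today's date, then open the file.
now = datetime.now()
path = os.path.join(base, now.strftime("%Y"), now.strftime("%m"), now.strftime("%d"))
os.makedirs(path, exist_ok=True)

logfile = os.path.join(path, "cache.log")
with open(logfile, "a") as f:
    f.write("received log line\n")

print(logfile)
```

Because the $(date ...) expressions in the init script are evaluated only at startup, the program keeps writing to yesterday's path until it is restarted, which is why the hint below requires a periodic restart.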
hint
Likewise, if the logs are split by date, the collector program needs to be restarted periodically.

Start the collector

# service ucollection start
stop the program

# service ucollection stop
View status
$ init.d/ucollection status
12429 pts/16 S 0:00 /usr/bin/python3 /usr/local/bin/collection -d -p 1213 -l /tmp/nginx/access.log
12432 pts/16 S 0:00 /usr/bin/python3 /usr/local/bin/collection -d -p 1214 -l /tmp/test/test.log
12435 pts/16 S 0:00 /usr/bin/python3 /usr/local/bin/collection -d -p 1215 -l /tmp/app/2014-12-16.09:55:15.log
12438 pts/16 S 0:00 /usr/bin/python3 /usr/local/bin/collection -d -p 1216 -l /tmp/db/2014-12-16/mysql.log
12441 pts/16 S 0:00 /usr/bin/python3 /usr/local/bin/collection -d -p 1217 -l /tmp/cache/2014/12/16/cache.log
6.2.4. Log Monitoring
Monitor the data arriving on port 1213
$ collection -p 1213
192.168.6.20 - - [16/Dec/2014:15:06:23 +0800] "GET /journal/log.html HTTP/1.1" 304 0 "-" "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36"
192.168.6.20 - - [16/Dec/2014:15:06:23 +0800] "GET /journal/docbook.css HTTP/1.1" 304 0 "http://192.168.6.2/journal/log.html" "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36"
192.168.6.20 - - [16/Dec/2014:15:06:23 +0800] "GET /journal/journal.css HTTP/1.1" 304 0 "http://192.168.6.2/journal/log.html" "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36"
192.168.6.20 - - [16/Dec/2014:15:06:23 +0800] "GET /images/by-nc-sa.png HTTP/1.1" 304 0 "http://192.168.6.2/journal/log.html" "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36"
192.168.6.20 - - [16/Dec/2014:15:06:23 +0800] "GET /js/q.js HTTP/1.1" 304 0 "http://192.168.6.2/journal/log.html" "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36"
After startup, the latest logs are sent in real time.