Project Analysis: Log Analysis

Purpose:
Production systems generate large volumes of system logs, application logs, security logs, and so on. By analyzing these logs you can understand the load and health of the system, analyze the distribution and behavior of users, and even make predictions and decisions based on the analysis.
 
General process:
Log output --> collection (Logstash, Flume, Scribe) --> storage (persisted to disk, or analyzed directly without persisting) --> analysis --> storage (relational database or NoSQL, to persist results and make them queryable) --> visualization (e.g. rendered in a browser)
 
Open-source real-time log analysis: the ELK platform
Logstash collects logs and stores them in an Elasticsearch cluster, while Kibana queries data from the ES cluster, generates charts, and returns them to the browser.
 
The premise of analysis: the data is semi-structured.
Logs are semi-structured data: organized, formatted data that can be divided into rows and columns, understood and processed like a table, and therefore analyzed.
 
Structured data: tables in MySQL.
Semi-structured data: logs, text files.
Unstructured data: video, audio.
 
Text analysis:
A log is a text file, so analyzing it relies on file IO, string manipulation, regular expressions, and similar techniques.
Use these techniques to extract the required data from the log.
Example: analyze the following access-log line (assigned to the variable line used in the code below):
line = '183.60.212.153 - - [19/Feb/2013:10:23:29 +0800] "GET /o2o/media.html?menu=3 HTTP/1.1" 200 16691 "-" "Mozilla/5.0 (compatible; EasouSpider; +http://www.easou.com/search/spider.html)"'
1. Data extraction
1) String cutting: use slicing to traverse the line, honoring multiple separators:
def makekey(line: str):
    chars = set(' \t')  # plain field separators
    start = 0
    skip = False  # True while inside "..." or [...]
    for i, v in enumerate(line):
        if not skip and v in '"[':  # entering a quoted/bracketed field
            skip = True
            start = i + 1
        elif skip and v in '"]':  # leaving a quoted/bracketed field
            skip = False
            yield line[start:i]
            start = i + 1
            continue
        elif skip:  # inside a quoted/bracketed field, separators do not count
            continue
        if not skip:
            if v in chars:
                if i == start:  # consecutive separators: nothing to yield
                    start = i + 1
                    continue
                yield line[start:i]
                start = i + 1
    else:
        if start < len(line):  # emit the trailing field, if any
            yield line[start:]

print(list(makekey(line)))  # all fields split out of the sample line
2) Regular expression string cutting
Named groups extract each field. Note the raw strings (so \d and \[ are not treated as string escapes) and the space between the status and length groups:
import re

pattern = (r'(?P<remote>[\d.]{7,}) - - \[(?P<datetime>[\w /+:]+)\] '
           r'"(?P<method>\w+) (?P<url>\S+) (?P<protocol>[\w/\d.]+)" '
           r'(?P<status>\d+) (?P<length>\d+) .+ "(?P<useragent>.+)"')
 
2. Process the data --> convert each field to the required type:
①Time processing:
datetime
Time difference type: timedelta
Convert to seconds: timedelta.total_seconds()
1. Convert a string to a datetime:
import datetime

def turn(timestr):
    return datetime.datetime.strptime(timestr, '%d/%b/%Y:%H:%M:%S %z')

timestr = '19/Feb/2013:10:23:29 +0800'
print(turn(timestr), type(turn(timestr)))
2. Create a timezone-aware "now" [an aware datetime cannot do arithmetic with a naive one]:
tz = datetime.timezone(datetime.timedelta(hours=8))
current = datetime.datetime.now(tz)
3. Subtracting two datetimes yields a timedelta object; use its total_seconds() method to convert to seconds, as the window function below does:
(current - start).total_seconds() >= interval
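A minimal end-to-end check of the three points above: parse the sample log timestamp, take a timezone-aware now, and subtract (both datetimes are aware, so the arithmetic is legal):
import datetime

start = datetime.datetime.strptime('19/Feb/2013:10:23:29 +0800', '%d/%b/%Y:%H:%M:%S %z')
tz = datetime.timezone(datetime.timedelta(hours=8))
current = datetime.datetime.now(tz)
print((current - start).total_seconds())  # seconds elapsed since the log entry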
②map and zip combine the three tuples and apply the matching conversion to each field, forming a dictionary:
dict(map(lambda item: (item[0], item[2](item[1])), zip(name, makekey(line), ops)))
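A toy sketch of this zip+map pattern, with made-up field names, raw values, and converters, to show the mechanics in isolation:
name = ('status', 'length')
values = ('200', '16691')
ops = (int, int)
# zip pairs (name, raw value, converter); map applies the converter to the value
d = dict(map(lambda item: (item[0], item[2](item[1])), zip(name, values, ops)))
print(d)  # {'status': 200, 'length': 16691}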
③ Use named groups to extract the required data into a dict by calling groupdict() on the match result:
pattern = r'(?P<word>\w+)'  # any pattern with named groups
regex = re.compile(pattern)  # compile takes the pattern (and optional flags)
matcher = regex.match(line)
matcher.groupdict()  # {group name: matched string}
##### Code implementation: slicing with makekey, building the data dictionary #####
name = ('remote', '', '', 'datetime', 'request', 'status', 'length', '', 'useragent')
ops = (None, None, None,
       lambda timestr: datetime.datetime.strptime(timestr, '%d/%b/%Y:%H:%M:%S %z'),
       lambda request: dict(zip(('method', 'url', 'proto'), request.split())),
       int, int, None, None)

def extract(line: str):
    t = zip(name, makekey(line), ops)
    # apply the converter if there is one; None means keep the raw string
    d = dict(map(lambda item: (item[0], item[2](item[1]) if item[2] is not None else item[1]), t))
    return d

print(extract(line))
##### Regular expression named-group implementation #####
ops = {
    'datetime': lambda timestr: datetime.datetime.strptime(timestr, '%d/%b/%Y:%H:%M:%S %z'),
    'status': int,
    'length': int,
}

regex = re.compile(pattern)

def extract(line: str) -> dict:
    matcher = regex.match(line)
    if not matcher:
        return None  # signal a parse failure to the caller
    d = {}
    for k, v in matcher.groupdict().items():
        d[k] = ops.get(k, lambda x: x)(v)  # fall back to an identity function if k is not in ops
    return d

print(extract(line))
3. Data loading
The log file is one record per line; loading the data is plain file IO.
def load(path):
    with open(path) as f:
        for line in f:
            ret = extract(line)
            if ret:
                yield ret
            else:
                continue  # skip the line if parsing fails
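A quick usage sketch: load() is a generator, so records stream out one dict at a time ('access.log' is a placeholder path, assumed for illustration):
for record in load('access.log'):
    print(record['status'], record['length'])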
 
4. Time window analysis [also called a sliding window]
Much data, such as logs, is time-related and generated in chronological order.
When such data is analyzed, it should be evaluated along the time axis.

interval: the time between two consecutive evaluations.
width: the width of the time window, i.e. how far back the data for one evaluation reaches.

1. width > interval: the most common configuration in production.
For example, width=8, interval=5, and the operation is an average: every 5 seconds, average the data of the last 8 seconds. Consecutive evaluations overlap. A concrete enumeration of the windows follows this list.
2. width = interval, e.g. w=5, i=5: every 5 seconds, operate on the data of the last 5 seconds. Evaluations do not overlap.

3. width < interval: generally not used, because some collected data would never take part in any calculation.
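A minimal sketch enumerating the window boundaries for the three cases, to make the overlap (or the gaps) visible; this is pure arithmetic, no external data assumed:
def windows(width, interval, until=20):
    # evaluation happens at t = interval, 2*interval, ...;
    # each evaluation covers the span (t - width, t]
    t = interval
    while t <= until:
        print('evaluate at t={:2d}: window ({:2d}, {:2d}]'.format(t, max(t - width, 0), t))
        t += interval

windows(8, 5)   # width > interval: adjacent windows overlap by 3 seconds
windows(5, 5)   # width = interval: windows tile exactly, no overlap
windows(3, 5)   # width < interval: gaps, e.g. 5 < t <= 7 is never evaluated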
 
Time-series data
In an operations environment, the data produced by logs, monitoring, and so on is all time-related: it is generated and recorded over time, so it is generally analyzed along the time axis.

Basic program structure of the data analysis
 
④ The map function takes the value of the same key out of multiple dictionaries:
items = [{'name': 'hh', 'value': 10}, {'name': 'tt', 'value': 20}, {'name': 'vv', 'value': 30}]
list(map(lambda x: x['value'], items))  # extracts [10, 20, 30]
 
Generate random numbers indefinitely, producing time-related data: each item is a dictionary of a timestamp and a random number.
Take 3 items at a time and compute the average.
##### Simulation: compute the average after every 3 generated items #####
import random
import datetime

def source():
    while True:
        yield {'value': random.randint(1, 10), 'datetime': datetime.datetime.now()}

s = source()
items = [next(s) for _ in range(3)]

def handle(iterable):
    mapper = map(lambda x: x['value'], iterable)
    ret = sum(mapper) / len(iterable)
    return ret

print('{:.2f}'.format(handle(items)))
##################################################
Implementation of the window function: compute the average after each period of time.
################################################
import random
import datetime
import time

def source(second=1):
    while True:
        # create a timezone-aware timestamp (UTC+8)
        yield {'value': random.randint(1, 10),
               'datetime': datetime.datetime.now(datetime.timezone(datetime.timedelta(hours=8)))}
        time.sleep(second)

def handle(iterable):
    return sum(map(lambda x: x['value'], iterable)) / len(iterable)
 
def window(iterator, handle, width: int, interval: int):
    """
    Window function.
    :param iterator: data source; a generator that feeds data
    :param handle: data processing function
    :param width: time window width, in seconds
    :param interval: processing interval, in seconds
    :return: prints the average of each window
    """
    # The time of the first entry is unknown; start with times far earlier
    # than any real entry and replace them once data is read in.
    start = datetime.datetime.strptime('20170101 000000 +0800', '%Y%m%d %H%M%S %z')
    current = datetime.datetime.strptime('20170101 010000 +0800', '%Y%m%d %H%M%S %z')
    buffer = []  # data waiting to be evaluated in the window
    delta = datetime.timedelta(seconds=width - interval)
    while True:
        # fetch data from the source
        data = next(iterator)
        if data:
            buffer.append(data)  # buffer it until the next evaluation
            current = data['datetime']

        # evaluate the buffered data every interval seconds
        if (current - start).total_seconds() >= interval:
            ret = handle(buffer)
            print('{:.2f}'.format(ret))
            start = current
            # drop data older than the window width
            buffer = [x for x in buffer if x['datetime'] > current - delta]

window(source(), handle, 10, 5)
#####################################################################
5. Distribution
Producer-consumer model
A monitoring system has a great deal of data to process, logs included; the data must be collected and then analyzed.
The monitored objects are the producers of the data, and the data-processing programs are its consumers.

The traditional producer-consumer model
couples the code too tightly: if production scales up, the system is hard to extend, and the speeds of production and consumption are hard to match.
What is the producer-consumer problem?
Example:
Selling steamed buns: if you keep steaming buns even when the current batch has not sold out, buns pile up.
If you steam a batch first and steam more only when it is nearly sold out, nobody has to queue for buns.
If demand outstrips supply and the dough is not ready while the buns are already spoken for, queues form.
To sum up, the core problem is matching the speed of producers with the speed of consumers,
and in practice the speeds rarely match on their own.
The solution: a queue.
 
Queue
Functions: decoupling and buffering.
The log producers are often several deployed programs generating many logs, and the consumers are likewise several programs pulling logs for analysis and processing.

Data production is unstable and can "surge" in a short period, which requires buffering.
Consumers differ in how fast they can consume, and each can decide for itself how to consume the data in the buffer.

On a single machine, the built-in queue module provides an in-process queue that meets the production and consumption needs of multiple threads.
Large systems can use third-party message middleware: RabbitMQ, RocketMQ, Kafka.
 
The queue module: Queue
The queue module provides the first-in, first-out queue Queue. Queues are not iterable.

queue.Queue(maxsize=0)
Creates a FIFO queue and returns a Queue object. If maxsize is less than or equal to 0, the queue length is unlimited.

Queue.get(block=True, timeout=None)
Removes an element from the queue and returns it.
block controls blocking; timeout is the wait limit.
1. If block is True and timeout is None, it blocks indefinitely.
2. If block is True and timeout has a value, it blocks for at most that many seconds, then raises an Empty exception.
3. If block is False, it does not block and timeout is ignored: it either returns an element immediately or raises an Empty exception.

Queue.put(item, block=True, timeout=None)
Adds an element to the queue.
block=True, timeout=None: blocks until there is space for the element.
block=True, timeout=5: blocks for 5 seconds, then raises a Full exception.
block=False: timeout is ignored and the call returns immediately; if there is space the item is added, otherwise a Full exception is raised.

Queue.put_nowait(item)
Equivalent to put(item, False): the item is added if there is space, otherwise a Full exception is raised.
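A small sketch of the Full side of the API described above, using a bounded queue of size 1 (standard queue-module behavior):
from queue import Queue, Full

q = Queue(maxsize=1)
q.put('a')             # fits: the queue now holds its maximum of one element
try:
    q.put_nowait('b')  # no space left, so Full is raised immediately
except Full:
    print('queue is full')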
 
⑤ Queue usage:
from queue import Queue

q = Queue()
q.put(10)
q.put(5)

print(q.get())  # 10, first in first out
print(q.get())  # 5
print(q.get(timeout=3))  # blocks, then raises queue.Empty after the 3-second timeout
####################################################################
Implementation of the dispatcher:
The producer (data source) produces data and buffers it in a message queue.

Data processing flow:
data loading --> extraction --> analysis (sliding window function)

When processing large amounts of data, one data source needs several consumers, and how to distribute the data among them is the problem:
a dispatcher (scheduler) is needed to hand the data to the different consumers.

Registration: a consumer that wants data must first register with the dispatcher, declaring that it needs data distributed to it.
Distribution: the consumers could share one message queue, but then contention would have to be solved. In this example we assume that, say, handle1 and handle2 are two different analysis functions, so each gets a queue of its own.
Registration implementation: the dispatcher records the consumers, each with its own queue.
Distribution implementation: threads. Since one piece of data is processed by several different registered handles, multithreading is the natural fit.
⑥ Thread usage example:
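A minimal sketch of the threading.Thread API as it is used in the dispatcher below; the worker function and its argument are made up for illustration:
import threading

def worker(n):
    print('worker got', n)  # hypothetical target function

t = threading.Thread(target=worker, args=(42,))  # target + positional args, as in the dispatcher
t.start()  # run worker(42) in a new thread
t.join()   # wait for it to finish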
Dispatcher code implementation:
##################################################
import threading
from queue import Queue

def dispatcher(src):
    # Record the handlers in the dispatcher, along with their respective queues.
    handles = []  # list of consumer threads
    queues = []   # list of their queues

    def reg(handle, width: int, interval: int):
        """
        Register a window handler.
        Each registered function gets a dedicated queue
        and a dedicated thread to run in.
        """
        q = Queue()  # create this handler's queue
        queues.append(q)

        # Note: the window() defined above reads with next(); when fed from a
        # Queue it must read with q.get() instead (see the sketch below).
        h = threading.Thread(target=window, args=(q, handle, width, interval))  # one thread per handler
        handles.append(h)

    def run():
        for t in handles:
            t.start()

        for item in src:      # pull from the data source
            for q in queues:  # distribute each item to every registered queue
                q.put(item)

    return reg, run

reg, run = dispatcher(source())

reg(handle, 10, 5)
run()
##################################################################################
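As noted in the comment above, the window function earlier reads with next(iterator), while the dispatcher feeds it a Queue. A minimal sketch of the adaptation, reusing the imports and the windowing logic from above; only the read call changes:
def window(src, handle, width: int, interval: int):
    """Same windowing logic as above, but reading from a Queue."""
    start = datetime.datetime.strptime('20170101 000000 +0800', '%Y%m%d %H%M%S %z')
    current = datetime.datetime.strptime('20170101 010000 +0800', '%Y%m%d %H%M%S %z')
    buffer = []
    delta = datetime.timedelta(seconds=width - interval)
    while True:
        data = src.get()  # blocking read from this handler's queue
        if data:
            buffer.append(data)
            current = data['datetime']
        if (current - start).total_seconds() >= interval:
            print('{:.2f}'.format(handle(buffer)))
            start = current
            buffer = [x for x in buffer if x['datetime'] > current - delta]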
