ELK 6.2.2 (Elasticsearch + Logstash + Kibana) open-source log analysis platform, part zero: origins

  • 1 Current situation:

Our company currently deploys the programs of four major business lines in each IDC. Their logs are all written to the /logdata directory of the respective IDC, a GlusterFS volume in the hundreds of terabytes. Inside it, logs are organized in a machine name => program name directory structure. A program's directory holds its running log (process.log), error log (error.log), downstream log (submit.log), upstream log (recv.log), and other categories of logs for that program, each rotated by date.

[superuser@ft3q-app47 logs]$ df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup-LogVol01
                      1.1T  106G  920G  11% /
/dev/sda1             190M   41M  140M  23% /boot
tmpfs                  24G     0   24G   0% /dev/shm
192.168.193.201:/logdata
                      110T   45T   65T  41% /logdata
192.168.173.201:/logdata
                       37T   11T   27T  28% /logdata-1
  • 2 Problems encountered:

2.1 Operations colleagues still inspect logs the traditional way, querying files directly: cat error.log_yyyy_mm_dd | grep "keyword". In a distributed, microservice environment this is cumbersome, easy to miss matches with, and, above all, inefficient.

2.2 When tracing a single message there is nowhere to start: the server, logic, and gateway modules it passes through are scattered across too many machines.

  • 3 Pain points:

3.1 Quickly and accurately find the log lines containing a given keyword among massive, scattered files.

3.2 Completely find every log line in which a given keyword appears within a given time window, that is, reconstruct the message trace.

  • 4 Solution:

Build our own log analysis system.

  • 5 Candidate options

5.1 Splunk

    Splunk was the first NASDAQ-listed company in the big-data field, and it provides an engine for machine data. Splunk collects, indexes, and harnesses the fast-moving machine data generated by all applications, servers, and devices, whether physical, virtual, or in the cloud. It lets you search and analyze all real-time and historical data from one place, troubleshoot problems and investigate security incidents in minutes instead of hours or days, monitor end-to-end infrastructure to avoid service degradation or outages, meet compliance requirements at lower cost, correlate and analyze complex events spanning multiple systems, and gain new levels of operational visibility along with IT and business intelligence.

However, Splunk's free usage is limited to 500 MB of data per day; beyond that it is charged at $225 per month, and its scalability is poor. A single business line of ours can produce 300 GB of logs per day, so Splunk was ruled out. (Not because of the money.)

5.2 ELK stack

 

    ELK is a complete log collection and display solution from the company Elastic; the name is an acronym of its three products: Elasticsearch, Logstash, and Kibana.
    Elasticsearch, ES for short, is a real-time distributed search and analytics engine that can be used for full-text search, structured search, and analytics. It is written in Java and built on the full-text search library Apache Lucene.

  Logstash is a data collection engine with real-time pipelining capability, used to collect data (for example by reading text files), parse it, and send it on to ES.

    Kibana provides a web platform for analytics and visualization on top of Elasticsearch. It can search and interact with the data in Elasticsearch indexes and generate tables and charts across many dimensions.

    Its appeal for us: the code is open source, Elasticsearch is written in Java and exposes a complete REST API, so we can extend it with our own programs. So we chose ELK for log analysis.

  • 6 Plan 1: ELK reads the scattered files

Design:

Without changing the existing platform, add an ELK deployment that reads the scattered files: the Logstash component reads each file and indexes it into ES, and operations staff query conveniently through Kibana. The architecture is as follows:
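As a sketch, the Logstash pipeline for this plan could look roughly like the following (the glob pattern, host, and index name are illustrative, not our actual configuration):

```conf
input {
  file {
    # read every category of log under the gluster mount
    path => "/logdata/*/*/*.log_*"
    start_position => "beginning"
  }
}
output {
  elasticsearch {
    hosts => ["127.0.0.1:9200"]
    # one daily index for all file-sourced logs
    index => "logstash-file-%{+YYYY.MM.dd}"
  }
}
```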

Advantages: Existing programs are completely unchanged.

Disadvantages:

    1 The logs are still scattered, and the Gluster cluster has already shown poor performance on many small files.

    2 The raw logs take up space, and the ES indexes take up space again after import; that is, the data is redundant.

    3 Maintaining a Logstash agent on every machine is not ideal. Logstash runs on the JVM (it is written in JRuby), so each instance eats tens of megabytes of memory at startup, and we would rather not run a Logstash on every machine. In our experience Logstash is also unstable: it dies inexplicably and needs a daemon process to watch over it.

    4 Log normalization: at this stage each program's logs are still not standardized.

  • 7 Plan 2: add Kafka

    Logs are written directly to Kafka; Logstash pulls the data from Kafka into ES, and operations staff query conveniently through Kibana. The architecture is as follows:

    This architecture avoids the problem of a single Logstash getting stuck while reading many files. With Kafka, all instances of the same module, such as the CMPP servers, write to the same Kafka topic, so logs from every load-balanced instance land in one place. Logstash reads each topic and assigns it a fixed index, such as logstash-server-cmpp, so later queries can search all logs of one function in a single batch.
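A minimal Logstash consumer for one such topic might look like this sketch (broker address, topic, and group id are placeholders):

```conf
input {
  kafka {
    bootstrap_servers => "192.168.193.21:9092"
    topics => ["server-cmpp"]
    group_id => "logstash-server-cmpp"
    codec => "json"
  }
}
output {
  elasticsearch {
    hosts => ["127.0.0.1:9200"]
    # fixed per-function index, as described above
    index => "logstash-server-cmpp-%{+YYYY.MM.dd}"
  }
}
```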

Advantages:

    1 The log GlusterFS is removed.

    2 Once the logs are normalized, every field can be queried precisely, and full-text search with word segmentation is available.

    3 Results are complete and precise; the message trace can be queried.

    4 With the corresponding indexes built, retrieval is fast.
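To make the normalization in advantage 2 concrete: it could mean every program emits one JSON record per event with a fixed set of field names. The sketch below is illustrative only; the field names are assumptions, not our actual schema.

```python
import json
from datetime import datetime, timezone

def normalize(module, idc, level, message, **fields):
    """Build one normalized log record as a JSON line.

    Extra keyword arguments (e.g. msisdn, redis_queue) become
    individually searchable fields once indexed in ES.
    """
    record = {
        "@timestamp": datetime.now(timezone.utc).isoformat(),
        "idc": idc,
        "module": module,
        "level": level,
        "message": message,
    }
    record.update(fields)
    return json.dumps(record, ensure_ascii=False)

# One line per event, ready to be sent to the module's Kafka topic:
print(normalize("cmpp-server", "ft3q", "INFO",
                "submit accepted", msisdn="13800000000"))
```

Because every record shares the same shape, Logstash can ship it to ES unchanged with a JSON codec, and Kibana can filter on any field directly.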

Disadvantages:

    1 Existing programs must change: (1) build a standardized log-printing common package; (2) introduce the Kafka client package; (3) modify the log4j configuration file.

    2 Resources must be requested: each IDC needs at least one Kafka cluster and one ELK deployment.
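Modifying the log4j configuration (point 3 of the program changes) amounts to adding a Kafka appender; the Kafka project ships one in its kafka-log4j-appender artifact. A sketch, with placeholder broker list and topic:

```properties
# log4j.properties sketch; broker list and topic are placeholders
log4j.rootLogger=INFO, kafka
log4j.appender.kafka=org.apache.kafka.log4jappender.KafkaLog4jAppender
log4j.appender.kafka.brokerList=192.168.193.21:9092
log4j.appender.kafka.topic=server-cmpp
# async send so logging does not block the business thread
log4j.appender.kafka.syncSend=false
log4j.appender.kafka.layout=org.apache.log4j.PatternLayout
log4j.appender.kafka.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss} %p %c %m
```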

Impact on existing platforms:

    1. Program changes. The programs must be modified, but the change can be rolled out gradually, in batches (gray release).

    2. Fault tolerance. Kafka is deployed as a cluster and in general does not go down; even if Kafka does go down, the programs keep running.

    3. Performance. The underlying layer uses Apache's own kafka-clients, so performance should be assured. In our tests, a single producer wrote logs to a single topic at 700 entries/s.

    4. The query workflow for operations changes: instead of logging in to the jump server and grepping each module's files, they query the module's index on a web page, filtering on a field having a given value.

    5. The alerting mechanism changes. It used to be based on the error log files; now ES's own query API can monitor the index for a given topic and raise an alarm when a keyword appears.

    6. Supporting programs must be developed: periodically deleting Kafka data, periodically deleting ES indexes, monitoring programs, and so on.
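The keyword alarm in point 5 could, for example, poll Elasticsearch's search API with a query along these lines (the index pattern, field names, and one-minute window are assumptions for illustration):

```conf
POST /logstash-server-cmpp-*/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "message": "ERROR" } }
      ],
      "filter": [
        { "range": { "@timestamp": { "gte": "now-1m" } } }
      ]
    }
  }
}
```

A cron-style monitor would run this every minute and alert when the hit count is non-zero.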

  • 8 Plan 3: build multiple ELK deployments per IDC, split by business

    Once the log volume grows, each IDC is subdivided by business into multiple log systems. The architecture is as follows:

    For reference, LinkedIn runs more than 20 ELK clusters for different business modules on 1000+ servers, mostly Elasticsearch. Logs produced by the production systems are searchable there within 1 minute, and all logs are kept for 7 to 14 days. They now hold about 50 billion indexed documents, some 500 to 800 TB, and in earlier tests pushed that to 1500 to 2000 TB while still working.

  • 9 Current test status:

    In the ft3q IDC, tapp21 to tapp23 host a three-node Kafka cluster and tapp25 hosts ELK. A test program simulates the server, logic, and gateway modules sending to 3 topics; Logstash consumes the messages and assigns them to 3 indexes. For the submit details I added word segmentation and parsed out the mobile number, the content, and the Redis queue name.
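The parsing of the submit details mentioned here could be done with a Logstash grok filter along these lines; the submit-log layout and field names below are assumed for illustration, not the actual format:

```conf
filter {
  # assumed layout: "2018-03-01 10:00:00 13800000000 queue_cmpp_1 hello world"
  grok {
    match => {
      "message" => "%{TIMESTAMP_ISO8601:ts} %{NUMBER:msisdn} %{NOTSPACE:redis_queue} %{GREEDYDATA:content}"
    }
  }
}
```

Each captured field (msisdn, redis_queue, content) then becomes its own searchable field in the ES index.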
