Minute-level real-time data extraction with Logstash + Kafka + Python

1. Install the Kafka environment

Concept:

Kafka is an open-source stream-processing platform developed by the Apache Software Foundation and written in Scala and Java. It is a high-throughput, distributed publish-subscribe messaging system that can handle all of a website's consumer action stream data. Such actions (page views, searches, and other user activity) are a key ingredient of many social features on the modern web, and because of the throughput involved, this data is usually handled through log processing and log aggregation. For systems like Hadoop that log data and analyze it offline but also face real-time processing constraints, Kafka is a viable solution: its purpose is to unify online and offline message processing through Hadoop's parallel loading mechanism and to deliver real-time messages through a cluster.

Kafka's configuration involves a few terms that we will run into when configuring the server, so let's go over them first:

Broker: a Kafka cluster contains one or more servers, and each of these servers is called a broker;

Topic: every message published to a Kafka cluster has a category, called a Topic. Physically, messages of different Topics are stored separately; logically, a Topic's messages may be stored on one or more brokers, but users only need to specify the Topic of a message to produce or consume data, without caring where the data actually lives;

Partition: a Partition is a physical concept, much like the partitions we have seen in Hive. Each Topic contains one or more Partitions;

Producer: the service responsible for publishing messages to a Kafka broker, also called the producer;

Consumer: the message consumer service, i.e. a client that reads data from a Kafka broker.

Now let's set up the Kafka server.

First, prepare the Java environment and the Kafka installation package:

I am using kafka_2.12-3.2.0.tgz. Kafka also needs a ZooKeeper environment, but the Kafka distribution ships with its own ZooKeeper service, so no separate installation is needed here.
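For example, you can confirm that the JDK is available with:

java -version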

1. Unzip kafka and switch to the kafka directory

tar -zxvf kafka_2.12-3.2.0.tgz

cd kafka_2.12-3.2.0/

2. Start the ZooKeeper service. Here it is started in the background, and the server's log output is redirected to the zookeeper.log file

./bin/zookeeper-server-start.sh config/zookeeper.properties >> zookeeper.log &

3. Modify the kafka configuration file

vim config/server.properties

Change the listener address in the file to the local IP,

and add the topic auto-creation and deletion switches after the partition settings. In our project team's production environment these are set to false.
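A rough sketch of the relevant server.properties entries, assuming the broker IP used in the later commands (exact values may differ in your environment):

# listener address changed to the local IP
listeners=PLAINTEXT://192.168.222.132:9092
advertised.listeners=PLAINTEXT://192.168.222.132:9092

# topic auto-creation / deletion switches (set to false in our production environment)
auto.create.topics.enable=true
delete.topic.enable=true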

4. Start the kafka service in the background

./bin/kafka-server-start.sh ./config/server.properties &

5. You can use the ps command to check whether the Kafka service has started
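For example:

ps -ef | grep kafka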

6. Create a Kafka topic named lucifer-topic, with a single replica and a single partition

sh bin/kafka-topics.sh --create --topic lucifer-topic --replication-factor 1 --partitions 1 --bootstrap-server 192.168.222.132:9092

7. You can verify that the topic was created successfully with the following describe command:

sh bin/kafka-topics.sh --bootstrap-server 192.168.222.132:9092 --describe --topic lucifer-topic

 

8. Next, test that the Kafka service and communication work correctly, using two new terminal windows

First start a producer window:

sh bin/kafka-console-producer.sh --broker-list 192.168.222.132:9092 --topic lucifer-topic

Then start a consumer window:

sh bin/kafka-console-consumer.sh --bootstrap-server 192.168.222.132:9092 --topic lucifer-topic --from-beginning

9. The information entered in the producer window will be displayed in the consumer window


10. Delete Kafka's consumption data

By default, Kafka's data logs are kept in the /tmp/kafka-logs folder; to clear the consumed data, simply delete everything in this folder.
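For example (stop the Kafka and ZooKeeper services first; the path assumes the default log.dirs setting):

rm -rf /tmp/kafka-logs/*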

2. Install the Logstash environment

Concept:

Logstash is a powerful tool that can be integrated with various platforms and tools. It provides a large number of plug-ins to help you work with data: data can be parsed and transformed through filters, and Logstash can dynamically collect, convert, and transmit data. It easily ingests data as a continuous stream from logs, metrics, web applications, data stores, and various services, regardless of format or complexity. Finally, it can output the data to whatever destinations are required, such as files, Kafka, Elasticsearch, etc.

1. Upload the Logstash archive and decompress it. Here the logstash-7.9.2.tar.gz version is used

tar -zxf logstash-7.9.2.tar.gz

2. Go to the Logstash folder and run a quick interactive test with the logstash tool

./logstash-7.9.2/bin/logstash -e ""

After Logstash has started, you can type data and see the content that is returned.

To exit this mode, use Ctrl + C.

3. Alternatively, edit a conf file, such as test.conf, and write the following content into it:

input { stdin { } }

output { stdout { codec => rubydebug } }

Then run the command with the -f option:

./logstash-7.9.2/bin/logstash -f test.conf

4. Next, try using Logstash to produce data and Kafka to consume it

Prepare a file, /usr/u.txt, with the following contents:

1001,aa

1002,bb

1003,cc

Edit a file, say test.conf:
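A minimal sketch of what this test.conf might contain, assuming it reads /usr/u.txt and writes to the lucifer-topic topic on the broker set up earlier:

input {
        file {
                path => "/usr/u.txt"
                start_position => "beginning"
                sincedb_path => "/dev/null"
        }
}
output {
        kafka {
                bootstrap_servers => "192.168.222.132:9092"
                topic_id => "lucifer-topic"
        }
        stdout { codec => rubydebug }
}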

5. Then run the test.conf file with the logstash command:

./logstash-7.9.2/bin/logstash -f test.conf

Once it is running, you can see the output data in the current command-line window and in the Kafka consumer window at the same time.

At this point, if you append new data to the file, Kafka will pick up the newly added records in real time.

Here we need to know that a Logstash pipeline is divided into three parts: input, filter, and output. What we wrote above only covers reading and writing the data.

Some of the main input data sources include:

  • jdbc: relational databases such as MySQL, Oracle, etc.
  • file: read from a file on the file system
  • syslog: listen for syslog messages on the well-known port 514
  • redis: Redis messages
  • beats: handle events sent by Beats
  • kafka: Kafka real-time data streams

filter is the intermediate processing stage of the Logstash pipeline; we can think of it as the ETL step of data processing. Some useful filters include (a small example follows this list):

  • grok: parse and structure arbitrary text. Grok is currently the best way in Logstash to parse unstructured log data into structured, queryable content
  • mutate: perform general transformations on event fields. You can rename, delete, replace, and modify fields in events
  • drop: discard an event entirely, for example debug events
  • clone: make a copy of an event, possibly adding or removing fields
  • geoip: add information about the geographic location of an IP address
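For illustration, a hypothetical filter block combining grok and mutate (the log pattern and field names here are made up for the example) might look like:

filter {
        grok {
                match => { "message" => "%{IP:client_ip} %{WORD:method} %{URIPATHPARAM:request}" }
        }
        mutate {
                rename => { "client_ip" => "clientip" }
                remove_field => ["@version"]
        }
}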

The output stage is the final stage of the Logstash pipeline. Frequently used outputs include:

  • elasticsearch: send event data to Elasticsearch database
  • file: write event data to a file on disk
  • kafka: write events to Kafka

3. Use Logstash to read a database table and synchronize it to Kafka for consumption

1. Now create a database and a table in MySQL, and prepare a few rows of data
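A sketch of what the database and table might look like, with names matching the Logstash config below (the sample rows are only illustrative):

CREATE DATABASE IF NOT EXISTS test_log;
USE test_log;
CREATE TABLE log_01 (
    id   INT PRIMARY KEY,
    name VARCHAR(50),
    age  INT
);
INSERT INTO log_01 VALUES (1001, 'aa', 20), (1002, 'bb', 21), (1003, 'cc', 22);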

2. Copy the MySQL driver jar package into Logstash's lib folder

cp /home/mysql-connector-java-5.1.42.jar /home/logstash-7.9.2/lib/

3. Edit a mysql.conf file and try reading the contents of the database table for Kafka to consume

input {
        stdin{ }
        jdbc {
                jdbc_connection_string => "jdbc:mysql://192.168.222.132:3306/test_log?characterEncoding=UTF-8&autoReconnect=true"
                jdbc_user => "root"
                jdbc_password => "123456"
                jdbc_driver_library => "/home/mysql-connector-java-5.1.42.jar"
                jdbc_driver_class => "com.mysql.jdbc.Driver"
                statement => "select id,name,age from log_01"
                # schedule => "* * * * *"
        }
}
filter{
        mutate{
                remove_field => ["@version"]
                remove_field => ["@timestamp"]
        }
}
output {
        kafka {
                bootstrap_servers => "192.168.222.132:9092"
                topic_id => "lucifer-topic"
                codec => json_lines
        }
        stdout{ }
}

Here I have added two outputs, kafka and stdout, at the same time so that I can compare the results when data arrives. It is also possible to use only the kafka { } output.

If you want to extract data on a schedule, you can also add schedule => "* * * * *" inside jdbc { } to set up a timed extraction task; the scheduling syntax is the same as crontab.

4. Use Python to consume the Kafka data

Now we use Python to consume the Kafka data and, at the same time, write the acquired data to HDFS so that it can be stored and displayed through the Hive database.

1. First create a table in the Hive database
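A sketch of what the Hive table might look like, assuming it sits behind the HDFS warehouse path used by the Python script below:

CREATE DATABASE IF NOT EXISTS bigdata;
CREATE TABLE bigdata.my_log_01 (
    id   INT,
    name STRING,
    age  INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';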

2. Use Python to consume the Kafka data, format it, and write it into HDFS. The source database rows obtained from Kafka arrive as JSON, so we need to parse the JSON, convert it into a DataFrame, and write it out as a CSV file.

The Python libraries needed here are pykafka, pyhdfs, and pandas:

#coding=utf-8

from pykafka import KafkaClient
from pyhdfs import HdfsClient
import json
from pandas import DataFrame

# Connect to Kafka first
client = KafkaClient(hosts="192.168.222.132:9092")
topic=client.topics['lucifer-topic']
# Create a Kafka consumer (times out after 5 seconds with no new messages)
consumer = topic.get_simple_consumer(consumer_timeout_ms=5*1000)
# Create a list to hold the table rows
datas=[]
# Read the Kafka data record by record
for record in consumer:
    # Decode the message bytes as UTF-8
    a=record.value.decode('utf8')
    # Convert the JSON data into a dictionary
    row=json.loads(a)
    print(row)
    # Append the row to the list
    datas.append(row)
# Convert the data into a DataFrame and write it to a CSV file
df=DataFrame(datas)
# Fix the column order of the table
df=df[['id','name','age']]
print(df)
df.to_csv("E:/log01.csv",header=False,index=False)
# Connect to HDFS
client=HdfsClient(hosts='192.168.222.132:50070',user_name='root')
# Write the CSV file into the folder backing the Hive table; if the file already exists, overwrite it
try:
    client.delete("/user/hive/warehouse/bigdata.db/my_log_01/log01.csv")
finally:
    client.copy_from_local("E:/log01.csv",'/user/hive/warehouse/bigdata.db/my_log_01/log01.csv')

3. Check the Hive database, and you will find that the data has been written into the table

4. So far we have only implemented full extraction, that is, reading the entire MySQL table every time. For incremental extraction, we need to record the value of a tracking field each time the table is read; currently the value that can be tracked is either a numeric field (numeric type) or a time field (timestamp type).

We can detect new data based on the relational database's primary key field or a time field, and add the corresponding settings to the Logstash configuration file (the full configuration is shown in the next section).

A range condition on the id value must also be added to the SQL statement.

After running Logstash, you will see a record holding the last extracted value in the log01 file (the file set by last_run_metadata_path):
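The jdbc input plugin stores this value as YAML, so the file's contents would look roughly like this (assuming the last extracted id was 1003):

--- 1003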

5. Run the logstash command again to read the log_01 table from MySQL, and you will see only the newly added data

5. Implement minute-level real-time data extraction with Logstash + Kafka + Python

Enable scheduled execution in the Logstash config file by using the schedule keyword to set the timing rule:

input {
        stdin{ }
        jdbc {
                jdbc_connection_string => "jdbc:mysql://192.168.222.132:3306/test_log?characterEncoding=UTF-8&autoReconnect=true"
                jdbc_user => "root"
                jdbc_password => "123456"
                jdbc_driver_library => "/home/mysql-connector-java-5.1.42.jar"
                jdbc_driver_class => "com.mysql.jdbc.Driver"
                schedule => "* * * * *"
                statement => "select id,name,age from log_01 where id > :sql_last_value"
                use_column_value => true
                tracking_column => "id"
                tracking_column_type => "numeric"
                record_last_run => true
                last_run_metadata_path => "/home/logstash-7.9.2/sync_data/log01"
        }
}
filter{
        mutate{
                remove_field => ["@version"]
                remove_field => ["@timestamp"]
        }
}
output {
        kafka {
                bootstrap_servers => "192.168.222.132:9092"
                topic_id => "lucifer-topic"
                codec => json_lines
        }
        stdout{ }
}

Modify the Python code that consumes Kafka as follows, using a while loop plus sleep to wait and pull the data for each time window:

#coding=utf-8

from pykafka import KafkaClient
from pyhdfs import HdfsClient
import json
from pandas import DataFrame
import time
import datetime

# Connect to Kafka first
client = KafkaClient(hosts="192.168.222.132:9092")
topic=client.topics['lucifer-topic']
# Create a Kafka consumer (times out after 5 seconds with no new messages)
consumer = topic.get_simple_consumer(consumer_timeout_ms=5*1000)
# Counter used to avoid writing two data files with the same name
f = 0

# Use a while loop to pull data once every 60 seconds
while True:
    # Read the Kafka data record by record
    for record in consumer:
        # Create a fresh list on every iteration to hold the current row
        datas = []
        # Decode the message bytes as UTF-8
        a=record.value.decode('utf8')
        # Convert the JSON data into a dictionary
        row=json.loads(a)
        # Append the row to the list
        datas.append(row)
        # Convert the data into a DataFrame and write it to a CSV file
        df=DataFrame(datas)
        # Fix the column order of the table
        df=df[['id','name','age']]
        print(df)
        # Name the file using the current timestamp
        n=datetime.datetime.now().strftime("%Y%m%d%H%M%S")
        filename="log{}.csv".format(n)
        df.to_csv("E:/"+filename,header=False,index=False)
        # Connect to HDFS
        client=HdfsClient(hosts='192.168.222.132:50070',user_name='root')
        # Write each batch straight to HDFS as a file named after the current time
        try:
            client.copy_from_local("E:/"+filename,'/user/hive/warehouse/bigdata.db/my_log_01/'+filename)
        except Exception as e:
            # If a file with this name already exists, append the counter to make the name unique
            filename=filename+'_'+str(f)
            f+=1
            df.to_csv("E:/" + filename, header=False, index=False)
            client.copy_from_local("E:/"+filename, '/user/hive/warehouse/bigdata.db/my_log_01/'+filename)

    # Wait 60 seconds
    time.sleep(60)

It can be seen that once the schedule is enabled, Logstash extracts data once per minute.

The names of the data files that Python imports into HDFS also show that the import now happens once per minute.

6. Optimization of Logstash

Logstash performance is tuned and analyzed through its configuration file, logstash.yml, located in the logstash-7.9.2/config folder.

Logstash provides the following options to optimize pipeline performance: pipeline.workers, pipeline.batch.size, and pipeline.batch.delay.

pipeline.workers

This setting determines how many worker threads run the filter and output stages. If you find that events are backing up or that the CPU is not saturated, consider increasing this value to make better use of the available processing power.

pipeline.batch.size

This setting defines the maximum number of events an individual worker thread collects before attempting to execute filters and outputs. Larger batches are generally more efficient, but they also add considerable memory overhead, which can easily cause the JVM process to crash.

pipeline.batch.delay

This rarely needs adjusting. It controls the latency of the Logstash pipeline: the pipeline batch delay is the maximum time, in milliseconds, that Logstash waits for a new message after receiving an event in the current pipeline worker thread; once this time has elapsed, Logstash starts executing filters and outputs. You can think of it as the waiting time configured for the program.
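For reference, the corresponding entries in logstash-7.9.2/config/logstash.yml might look like this (the values are only illustrative, not tuned recommendations):

pipeline.workers: 4
pipeline.batch.size: 250
pipeline.batch.delay: 50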


Origin blog.csdn.net/adamconan/article/details/130373869