Case: Ingesting Data from Apache Kafka into TimescaleDB

Original link: https://streamsets.com/blog/ingesting-data-apache-kafka-timescaledb/

Time series database

A time-series database is optimized for data indexed by time, so queries over a given time range can be processed efficiently. There are several time-series databases on the market; in fact, Data Collector has long been able to write to InfluxDB, but I was interested in TimescaleDB because it is built on PostgreSQL. Full disclosure: I spent five and a half years as a developer evangelist at Salesforce, and PostgreSQL was, and still is, at the core of the Heroku platform; I also like PostgreSQL as a powerful alternative to MySQL.

Getting started with TimescaleDB

While listening to Diana's presentation, I ran the TimescaleDB Docker image, mapping port 54321 on my laptop to port 5432 in the Docker container so that it wouldn't conflict with my existing PostgreSQL deployment. Once Diana left the stage, I walked through the "Creating Hypertables" section of the TimescaleDB quick start, created a PostgreSQL database, enabled TimescaleDB on it, and wrote a row of data to it:

tutorial=# INSERT INTO conditions(time, location, temperature, humidity)
tutorial-#   VALUES (NOW(), 'office', 70.0, 50.0);
INSERT 0 1
tutorial=# SELECT * FROM conditions ORDER BY time DESC LIMIT 10;
             time              | location | temperature | humidity 
-------------------------------+----------+-------------+----------
 2019-05-25 00:37:11.288536+00 | office   |          70 |       50
(1 row)
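For reference, the "Creating Hypertables" steps boil down to a handful of statements; this is a sketch along the lines of the quick start, with the column list matching the query above:

CREATE DATABASE tutorial;
\c tutorial
CREATE EXTENSION IF NOT EXISTS timescaledb;

-- An ordinary PostgreSQL table first...
CREATE TABLE conditions (
  time        TIMESTAMPTZ       NOT NULL,
  location    TEXT              NOT NULL,
  temperature DOUBLE PRECISION  NULL,
  humidity    DOUBLE PRECISION  NULL
);

-- ...then turn it into a hypertable, partitioned on the time column
SELECT create_hypertable('conditions', 'time');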

The first TimescaleDB Pipeline

Since TimescaleDB is built on PostgreSQL, the standard PostgreSQL JDBC driver works with it directly. Because I already had the driver installed in Data Collector, it took me all of about two minutes to build a simple test pipeline writing rows to my shiny new TimescaleDB server:

[Image: TimescaleDB test pipeline]
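Because TimescaleDB speaks the PostgreSQL protocol, the JDBC Producer in that pipeline just points at the mapped port; given the setup above, the connection string looks something like jdbc:postgresql://localhost:54321/tutorial.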

Querying the generated test data:

tutorial=# SELECT * FROM conditions ORDER BY time DESC LIMIT 10;
            time            |         location          |    temperature     |      humidity      
----------------------------+---------------------------+--------------------+--------------------
 2020-12-25 23:35:43.889+00 | Grocery                   |  0.806543707847595 | 0.0844637751579285
 2020-10-27 02:20:47.905+00 | Shoes                     | 0.0802439451217651 |  0.398806214332581
 2020-10-24 01:15:15.903+00 | Games & Industrial        |  0.577536821365356 |  0.405274510383606
 2020-10-22 02:32:21.916+00 | Baby                      | 0.0524919033050537 |  0.499088883399963
 2020-09-12 10:30:53.905+00 | Electronics & Garden      |  0.679168224334717 |  0.427601158618927
 2020-08-25 19:39:50.895+00 | Baby & Electronics        |  0.265614211559296 |  0.274695813655853
 2020-08-15 15:53:02.906+00 | Home                      | 0.0492082238197327 |  0.046688437461853
 2020-08-10 08:56:03.889+00 | Electronics, Home & Tools |  0.336894452571869 |  0.848010659217834
 2020-08-02 09:48:58.918+00 | Books & Jewelry           |  0.217794299125671 |  0.734709620475769
 2020-08-02 08:52:31.915+00 | Home                      |  0.931948065757751 |  0.499135136604309
(10 rows)
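This is also where the time-series orientation shows up in queries: rolling those rows up into time buckets is a one-liner with TimescaleDB's time_bucket function. A sketch against the same conditions table (not a query from the original post):

-- Average temperature and humidity per month, newest first
SELECT time_bucket('1 month', time) AS bucket,
       avg(temperature) AS avg_temperature,
       avg(humidity)    AS avg_humidity
FROM conditions
GROUP BY bucket
ORDER BY bucket DESC;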

Data from Kafka to TimescaleDB

One major use case for a time-series database is storing data from IoT devices. I took a few minutes to write a simple Python Kafka client that simulates a set of sensors, producing temperature and humidity data a little more realistic than my test pipeline's:

from kafka import KafkaProducer
from kafka.errors import KafkaError
import json
import random

# Create a producer that JSON-encodes the message
producer = KafkaProducer(value_serializer=lambda m: json.dumps(m).encode('ascii'))

# Send a quarter million data points (asynchronous)
for _ in range(250000):
	location = random.randint(1, 4)
	temperature = 95.0 + random.uniform(0, 10) + location
	humidity = 45.0 + random.uniform(0, 10) - location
	producer.send('timescale', 
		{'location': location, 'temperature': temperature, 'humidity': humidity})

# Block until all the messages have been sent
producer.flush()
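To run this against a local broker, install the kafka-python package (pip install kafka-python); with no bootstrap_servers argument, KafkaProducer connects to localhost:9092, so pass your broker's address there if Kafka is running elsewhere.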

Note that the simulator emits an integer value for the location and does not add a timestamp to the data. As you can see, just for fun, I had it generate 250,000 data points. That's enough to give TimescaleDB a bit of a workout without taking long to generate.

I replaced the Dev Data Generator origin in my pipeline with a Kafka Consumer and added a couple of processors to the pipeline:

[Image: Kafka to TimescaleDB pipeline]

The Expression Evaluator simply adds a timestamp to each record, using a little expression language to produce the correct format:

${time:extractStringFromDate(time:now(), 'yyyy-MM-dd HH:mm:ss.SSSZZ')}
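That pattern renders the current time as a wall-clock timestamp with a numeric zone offset, along the lines of 2019-05-25 00:37:11.288+0000, which PostgreSQL parses straight into the hypertable's timestamptz column.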

The Static Lookup processor replaces the integer location field with a string, matching the schema of the TimescaleDB table:

[Image: Static Lookup configuration]
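In other words, each integer the simulator emits (1 through 4) is swapped for a human-readable location string before the record reaches the JDBC Producer, so the table's text column gets values like 'office' rather than raw sensor IDs.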

Conclusion

TimescaleDB made a great impression on me. The out-of-the-box experience was quick and painless. Although I only kicked the tires in the simplest way, everything worked the first time. Because TimescaleDB is built on PostgreSQL, I could easily write data with existing tools, and I could use familiar SQL commands on the data in the hypertable.

If you are using TimescaleDB, please download StreamSets Data Collector and try it for your data integration needs. Like TimescaleDB's core, it is provided as open source under the Apache 2.0 license and is free for testing, development, and production.
