Introduction to StreamSets and getting-started examples

Table of Contents

1. Introduction to StreamSets

2. Installation steps

2.1 Java environment

2.2 Number of open files

3. Getting-started examples

3.1 Parse local files to HDFS

1. The overall design of the data flow

2. Specific design steps of the pipeline

3.2 MySQL query component


1. Introduction to StreamSets

StreamSets is a real-time big data collection and ETL tool that lets you build data collection and data flows without writing a single line of code. Through its drag-and-drop visual interface you design data pipelines (Pipelines) and schedule them as timed tasks. Its biggest features are:

  1. Visual interface operation: data collection and flow are completed without writing any code
  2. Built-in monitoring: the basic information and data quality of the streaming data can be viewed in real time
  3. Powerful integration: full support for common existing components, including 50 data sources, 44 data operations, and 46 destinations

For StreamSets, the most important concepts are data sources (Origins), processors (Processors), and destinations (Destinations). Configuring a pipeline basically comes down to these three parts.

  1. Origins include Kafka, DataLake, ES, JDBC, HDFS, etc.;
  2. Processors can filter, change, encode, aggregate, and otherwise operate on individual fields;
  3. Destinations are similar to Origins; data can be written to Kafka, Flume, JDBC, HDFS, Redis, etc.

After configuring the pipeline, click Start and the Data Collector begins to work. Data Collector processes data as it arrives at the origin and waits quietly when there is nothing to do. You can view real-time statistics about the data, inspect records as they pass through the pipeline, or take a closer look at a data snapshot.


2. Installation steps

  • Installation methods: manually decompress the tarball package, install via RPM package, install via Cloudera Manager, or install via Docker (a Docker example is sketched below)
  • Download file: StreamSets Data Collector version 3.15.0, https://archives.streamsets.com/index.html
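
As a minimal sketch of the Docker route: streamsets/datacollector is the official image, but the exact tag for 3.15.0 is an assumption here, so check Docker Hub for the tag that matches your version.

```bash
# Hedged sketch: run Data Collector in a container; the 3.15.0 tag is assumed
# and 18630 is the default web UI port.
docker run -d --name sdc -p 18630:18630 streamsets/datacollector:3.15.0
```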

2.1 Java environment

Because StreamSets is developed in Java, a Java runtime environment must be installed and configured first.
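
A minimal sketch for checking the Java environment, assuming a JDK 8 installation; the JAVA_HOME path below is a placeholder for your actual JDK location.

```bash
# Verify the Java runtime and expose it to the streamsets startup script.
java -version
export JAVA_HOME=/usr/java/jdk1.8.0_202   # assumed path -- adjust to your JDK
export PATH=$JAVA_HOME/bin:$PATH
```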

2.2 Number of open files

Use the following command to view the operating system's open files limit: ulimit -n

StreamSets requires the open files limit to be at least 32768.
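
A hedged sketch for raising the limit persistently: it appends to /etc/security/limits.conf, so run it as root and log in again before re-checking.

```bash
# Raise the open files limit to the minimum StreamSets expects (32768).
cat >> /etc/security/limits.conf <<'EOF'
*  soft  nofile  32768
*  hard  nofile  32768
EOF
# After logging in again, confirm the new limit:
ulimit -n
```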

Put the downloaded file in a new streamSets folder, create a soft link to it under /var/www/html, and use that as the download address for the offline package.
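
A hedged sketch of the unpacking and soft-link steps; the tarball file name and the /hadoop/software target directory are assumptions (the target matches the sdc.properties path used later in this post).

```bash
# Unpack the tarball and expose the folder through /var/www/html.
mkdir -p /hadoop/software/streamSets
mv streamsets-datacollector-core-3.15.0.tgz /hadoop/software/streamSets/
tar -xzf /hadoop/software/streamSets/streamsets-datacollector-core-3.15.0.tgz -C /hadoop/software/
ln -s /hadoop/software/streamSets /var/www/html/streamSets   # offline download address
```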

Use the following command to run Data Collector in the background:

nohup bin/streamsets dc >/dev/null 2>&1 &

Log in to the web console for the first time (default username and password: admin/admin).
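
A quick, hedged sanity check that the service is up: 18630 is the default web UI port, and log/sdc.log is the usual log location for a tarball install.

```bash
# Confirm the web UI answers and watch the startup log from the install directory.
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:18630
tail -f log/sdc.log
```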


3. Getting-started examples

3.1 Parse local files to HDFS

1. The overall design of the data flow

2. Specific design steps of the pipeline

(1) Create a new data stream

(2) Select a file input component as the origin

(3) From the component area, select a data processing plug-in; here the JavaScript plug-in is chosen and a small js script is written to parse the fields in each record

(4) Choose another data processing plug-in to filter out records that do not contain the data field, keeping only the records whose name field is host

(5) Records that meet condition 1 are processed further, and records that do not meet condition 1 are discarded; configure the record filter plug-in to keep three fields in each record

(6) Select a record field-expansion plug-in to flatten the data under the /data field of each record

(7) Choose another expression plug-in to fill in missing values

You can experiment with the way /data/cpu.wait is entered in the Output Field above. It looks wrong, but I never changed it in the end; after my machine was reformatted I did not redo this demo.

(8) Data flows into HDFS

 

(9) Kerberos authentication

vim /hadoop/software/streamsets-datacollector-3.15.0/etc/sdc.properties

After the Kerberos configuration is complete, restart Data Collector.
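
As a hedged sketch, the Kerberos client entries in sdc.properties typically look like the lines below; the principal, realm, and keytab path are placeholders for your environment.

```bash
# Kerberos-related keys to set in sdc.properties (values are placeholders):
#   kerberos.client.enabled=true
#   kerberos.client.principal=sdc/_HOST@EXAMPLE.COM
#   kerberos.client.keytab=/path/to/sdc.keytab
klist -kt /path/to/sdc.keytab               # confirm the keytab holds the principal
# Stop the running instance, then start it again from the install directory:
nohup bin/streamsets dc >/dev/null 2>&1 &
```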

(10) Pipeline configuration, validation, and preview

Pipeline validation and preview: click the Validate button to validate the pipeline, then click the eye (Preview) icon to preview the data.

The point of StreamSets is real-time streaming: you can see the data flowing into and out of each component, and for a pipeline with multiple components you can follow the inflow and outflow at every stage.

This part still reported an error; the data used here comes from the blogger referenced above. In later adjustments you can troubleshoot by analyzing the data flowing in and out of each component, which is why some of the design steps above were trimmed. In the end, though, the data is synchronized from the local files to HDFS.

(11) Start the pipeline

The following is the visual monitoring interface after the entire pipeline has been started. You can clearly see how much data flows in and out of each stage, as well as the records in each batch.
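
As a hedged check on the HDFS side, the commands below list and peek at the output; the target directory is a placeholder, so use the output directory configured in the Hadoop FS destination.

```bash
# List the output directory and peek at the first records written by the pipeline.
hdfs dfs -ls /user/sdc/output/
hdfs dfs -cat /user/sdc/output/* | head -n 5
```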

 

3.2 MySQL query component

  • In the SQL syntax, do not leave spaces on either side of the equals sign
  • With a full-table query the data flows through; after adding a WHERE filter, no data could be queried in the stream (see the sanity-check sketch below)
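
When the pipeline shows no data, one hedged way to narrow it down is to run the same query directly with the mysql client before pasting it into the JDBC origin; the host, credentials, database, and table below are placeholders.

```bash
# Sanity-check the query outside StreamSets; the filter is written without
# spaces around the operator, matching the note above.
mysql -h 127.0.0.1 -u root -p -e "SELECT * FROM test.student WHERE id>1"
```
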
Original article: blog.csdn.net/qq_35995514/article/details/107258957