Archiving SLS Data in Real Time with Spark SQL

In the previous article, I introduced real-time monitoring and alerting on HDFS operations based on Spark SQL. Today, I will illustrate how to use Spark SQL to develop a streaming application. This article is divided into three parts:

  • Stream computing and SQL
  • A brief introduction to Spark SQL streaming development syntax
  • Hands-on: archiving SLS data to HDFS in real time

1. Stream Computing and SQL

The value of data decreases over time, so data should be processed as promptly as possible to maximize its value, and stream computing systems are therefore being used more and more widely. Commonly used stream computing frameworks include Storm, Spark Streaming and Flink, as well as stream processing libraries built on Kafka such as Kafka Streams. Every streaming framework has its own API, and developers inevitably have to learn how to use it. Providing simple and effective development tools lets developers put more of their energy into business logic. So streaming systems are gradually adding SQL as a supported development language, allowing users to process a Stream the same way they process a Table. For example, KSQL supports processing Kafka streaming data with SQL. Spark has also introduced Structured Streaming as its latest-generation streaming engine, whose underlying processing engine is Spark SQL. However, at the SQL API level, Structured Streaming still lacks some necessary features, such as window and watermark. EMR has extended the open source version of Spark so that complete streaming queries can be developed with the SQL API on Spark.

2. Getting Started with Spark SQL Streaming Development

This section briefly describes the concepts and syntax of streaming development with Spark SQL.

2.1 Creating a Table

When we need to read from or write to a streaming data source, we must first create a table to represent it. The syntax for defining a table is as follows:

CREATE TABLE tbName[(columnName dataType [,columnName dataType]*)]
USING providerName
OPTIONS(propertyName=propertyValue[,propertyName=propertyValue]*);

In the syntax above, the column definitions are optional for certain data sources. When no column definitions are specified, the schema information is identified automatically from the data source. For example:

CREATE TABLE driver_behavior 
USING kafka 
OPTIONS (
kafka.bootstrap.servers = "${BOOTSTRAP_SERVERS}",
subscribe = "${TOPIC_NAME}",
output.mode = "${OUTPUT_MODE}",
kafka.schema.registry.url = "${SCHEMA_REGISTRY_URL}",
kafka.schema.record.name = "${SCHEMA_RECORD_NAME}",
kafka.schema.record.namespace = "${SCHEMA_RECORD_NAMESPACE}");

When the data source is Kafka, the schema information is looked up from the Kafka Schema Registry according to the Kafka topic name. Of course, we can also specify the column definitions explicitly, for example:

CREATE TABLE driverbehavior(deviceId string, velocity double)
USING kafka 
OPTIONS (
kafka.bootstrap.servers = "${BOOTSTRAP_SERVERS}",
subscribe = "${TOPIC_NAME}",
output.mode = "${OUTPUT_MODE}",
kafka.schema.registry.url = "${SCHEMA_REGISTRY_URL}",
kafka.schema.record.name = "${SCHEMA_RECORD_NAME}",
kafka.schema.record.namespace = "${SCHEMA_RECORD_NAMESPACE}");

When column definitions are specified, they must be consistent with the field definitions of the source. When the CREATE TABLE operation is executed, the table definition is saved to the Hive Metastore.
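Since the definition is persisted in the Hive Metastore, the table can be inspected afterwards like any other table. A minimal sketch, assuming the driver_behavior table created above:

-- List the tables registered in the current database
SHOW TABLES;
-- Show the columns and data source properties of the Kafka-backed table
DESC FORMATTED driver_behavior;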

2.2 CTAS

We can also merge creating a table and writing query results into that table into a single statement, which is the CREATE TABLE ... AS SELECT ... syntax:

CREATE TABLE tbName[(columnName dataType [,columnName dataType]*)]
USING providerName
OPTIONS(propertyName=propertyValue[,propertyName=propertyValue]*)
AS
queryStatement;

As an example (cited from here: q103):

CREATE TABLE kafka_temp_table
USING kafka
OPTIONS (
kafka.bootstrap.servers = "${BOOTSTRAP_SERVERS}",
subscribe = "${TOPIC_NAME}",
output.mode = "${OUTPUT_MODE}",
kafka.schema.registry.url = "${SCHEMA_REGISTRY_URL}",
kafka.schema.record.name = "${SCHEMA_RECORD_NAME}",
kafka.schema.record.namespace = "${SCHEMA_RECORD_NAMESPACE}") AS
SELECT
  i_brand_id brand_id,
  i_brand brand,
  sum(ss_ext_sales_price) ext_price
FROM date_dim, kafka_store_sales, item
WHERE d_date_sk = ss_sold_date_sk
  AND ss_item_sk = i_item_sk
  AND i_manager_id = 28
  AND d_moy = 11
  AND d_year = 1999
  AND delay(ss_data_time) < '2 minutes'
GROUP BY TUMBLING(ss_data_time, interval 1 minute), i_brand, i_brand_id

When this statement is executed, the table is created and a StreamQuery instance is actually generated, and the query results are written into the result table.

2.3 DML

The query syntax of streaming SQL is mostly the same as that of standard offline SQL; here we only introduce the INSERT operation. A streaming query does not allow a standalone SELECT; the SELECT results must be written into a table. So, we need to put an INSERT operation in front of the SELECT.

INSERT INTO tbName[(columnName[,columnName]*)]
queryStatement;

To reiterate the streaming query semantics: this statement will actually generate a StreamQuery instance, and the query results are written into the result table. For example:

INSERT INTO kafka_temp_table
SELECT
  i_brand_id brand_id,
  i_brand brand,
  sum(ss_ext_sales_price) ext_price
FROM date_dim, kafka_store_sales, item
WHERE d_date_sk = ss_sold_date_sk
  AND ss_item_sk = i_item_sk
  AND i_manager_id = 28
  AND d_moy = 11
  AND d_year = 1999
  AND delay(ss_data_time) < '2 minutes'
GROUP BY TUMBLING(ss_data_time, interval 1 minute), i_brand, i_brand_id

2.4 Window and Watermark

Due to space limitations, this article does not describe how to use window and watermark in Spark SQL for now. Interested readers can refer to the documentation; I will cover them in a dedicated follow-up article.

2.5 Streaming Job Configuration

When developing streaming jobs with SQL, some necessary configuration cannot be expressed in the query itself and must be set separately. Here we use the SET command to set the parameters required by a streaming job. Currently, two parameters need to be set:

Name                                 Config
Streaming query instance name        streaming.query.name
Streaming job checkpoint location    spark.sql.streaming.checkpointLocation.${streaming.query.name}

These must be configured before each streaming query instance; that is, the two SET statements must precede each CTAS or INSERT operation. A single SQL file can contain multiple streaming queries, for example:

-- test.sql

SET streaming.query.name=query1;
SET spark.sql.streaming.checkpointLocation.query1=/tmp/spark/query1;
INSERT INTO tbName1 [(columnName[,columnName]*)]
queryStatement1;

SET streaming.query.name=query2;
SET spark.sql.streaming.checkpointLocation.query2=/tmp/spark/query2;
INSERT INTO tbName2 [(columnName[,columnName]*)]
queryStatement2;
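One possible way to submit such a file is with the SQL CLI's -f option. This is only a sketch; the actual command (CLI name, master, and the extra --conf/--jars options your data sources need, as shown in section 3) may differ in your EMR environment:

spark-sql --master yarn-client -f test.sql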

3. Hands-on: Archiving SLS Data to HDFS in Real Time

Assume a scenario where service logs from our servers are collected into SLS and need to be archived to HDFS for subsequent offline analysis. This involves two data sources: SLS and HDFS. HDFS is an officially supported Spark data source with both streaming and batch read/write support. SLS is Alibaba Cloud's Log Service, for which EMR already supports streaming reads and writes.

  • Environment preparation
    An E-MapReduce cluster of version 3.21.0 or later is required. This version is currently being prepared for release and will be available soon, so stay tuned.
  • Command Line
spark-sql --master yarn-client --conf spark.sql.streaming.datasource.provider=loghub --jars emr-logservice_shaded_2.11-1.7.0-SNAPSHOT.jar

Note: emr-logservice_shaded_2.11-1.7.0-SNAPSHOT.jar will be released with version 1.7.0 of the EMR SDK.

  • Create two tables: sls_service_log and hdfs_service_log
CREATE DATABASE IF NOT EXISTS default;
USE default;

DROP TABLE IF EXISTS hdfs_service_log;
CREATE TABLE hdfs_service_log (instance_name string, ip string, content string)
USING PARQUET
LOCATION '/tmp/hdfs_service_log';

DROP TABLE IF EXISTS sls_service_log;
CREATE TABLE sls_service_log
USING loghub
OPTIONS (
sls.project = "${logProjectName}",
sls.store = "${logStoreName}",
access.key.id = "${accessKeyId}",
access.key.secret = "${accessKeySecret}",
endpoint = "${endpoint}");
  • Launch a streaming query through Spark SQL to synchronize SLS data to HDFS in real time
set streaming.query.name=sync_sls_to_hdfs;
set spark.sql.streaming.checkpointLocation.sync_sls_to_hdfs=hdfs:///tmp/spark/sync_sls_to_hdfs;

INSERT INTO hdfs_service_log
select
__tag__hostname__ as instance_name,
ip,
content
from sls_service_log;
  • Check the archived data on HDFS

[Screenshot: archived data files on HDFS]
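As a quick sanity check, the archived records can also be previewed with a batch query on the Parquet table; a minimal sketch using the hdfs_service_log table created above:

-- Preview a few archived records
SELECT * FROM hdfs_service_log LIMIT 10;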

  • Use Spark SQL to run offline analysis on the archived data: for example, count how many distinct IPs there are in total
select distinct(ip) from hdfs_service_log;

[Screenshot: query result]
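If only the total number of distinct IPs is needed, the count can also be computed directly; a minimal sketch on the same table:

-- Count the number of distinct IPs in the archived data
SELECT count(DISTINCT ip) FROM hdfs_service_log;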

4. Conclusion

Above, we walked through a very simple example of using Spark SQL in streaming scenarios. In fact, Spark SQL can also be used for more complex stream processing tasks. In follow-up articles, I will introduce concepts such as window operations and watermark, as well as how to perform simple machine learning on streaming data.

Origin yq.aliyun.com/articles/705625