StreamSets Study Notes (1): Pipeline Concepts and Design

1. Merging and Branching Data Streams

2. Dropping Unwanted Records

(1) Required fields can be defined for processor, executor, and most destination stages. If a record does not include all of the required fields, the record is diverted to error handling.

(2) Preconditions are conditions that a record must satisfy before it can enter the stage for processing. Preconditions can be defined for processor, executor, and most destination stages, and can use functions, variables, and runtime properties.

For example, the following precondition allows only records whose COUNTRY field is 'US' into the stage:

${record:value('/COUNTRY') == 'US'}
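
Preconditions can also use runtime properties. As a minimal sketch (assuming a runtime property named homeCountry has been defined for the Data Collector; the property name is hypothetical), the following precondition admits only records whose COUNTRY field matches that property:

${record:value('/COUNTRY') == runtime:conf('homeCountry')}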

3. Error Record Handling

Error record handling can be defined at both the stage level and the pipeline level. A stage can also be configured not to handle errors itself and instead pass its error records to the pipeline's error handling.

Stage-level error handling takes precedence over pipeline-level error handling.

Note: records that are missing required fields do not enter the stage at all; they go directly to the pipeline's error handling.

Common error handling options:

Pipeline error handling

1. Discard: drop the error records.
2. Write to Another Pipeline: requires you to create an SDC RPC origin pipeline to receive the error records.
3. Write to File: write the error records to the file system.
4. Write to Kafka: write the error records to Kafka.

Stage error handling

1. Discard: drop the error records.
2. Send to Error: pass the error records to the pipeline error handling.
3. Stop Pipeline: stop the pipeline and log information about the error; the stop is shown as an error in the pipeline history. Note: cluster mode does not currently support Stop Pipeline.

4. Processing Changed Data

StreamSets can capture changes in data, such as insert, update, and delete operations.

1. Commonly used CDC-enabled stages:

  • JDBC Query Consumer for Microsoft SQL Server
  • MySQL Binary Log
  • Oracle CDC Client
  • PostgreSQL CDC Client
  • SQL Parser
  • SQL Server CDC Client
  • SQL Server Change Tracking

2. CRUD-enabled stages:

  • JDBC Tee processor
  • JDBC Producer destination

Processing the data: change logs store record data in different formats, and the JDBC Tee processor and JDBC Producer destination can decode most change log formats, generating record data from the original change log. When you use other CRUD-enabled destinations, you need to add additional stages to convert the data format.

For change data capture from MySQL, see: https://streamsets.com/documentation/controlhub/latest/help/pdesigner/datacollector/UserGuide/Pipeline_Design/CDC-Overview.html#concept_apw_l2c_ty

In contrast, the MySQL Server binary logs read by the MySQL Binary Log origin provide new or updated data in a New Data map field and changed or deleted data in a Changed Data map field. You might want to use the Field Flattener processor to flatten the map field with the data that you need, and a Field Remover to remove any unnecessary fields.
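
Before flattening, individual values can also be pulled out of a map field with an expression. A minimal sketch (the map field path /Data and the column name ACCOUNT_ID are assumptions for illustration; check the origin documentation for the exact field names your version generates):

${record:value('/Data/ACCOUNT_ID')}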

For details on the format of generated records, see the documentation for the CDC-enabled origin.

Common use cases:

You can use CDC-enabled origins and CRUD-enabled destinations in pipelines together or individually. Here are some typical use cases:

CDC-enabled origin with CRUD-enabled destinations

You can use a CDC-enabled origin and a CRUD-enabled destination to easily process changed records and write them to a destination system.

For example, say you want to write CDC data from Microsoft SQL Server to Kudu. To do this, you use the CDC-enabled JDBC Query Consumer origin to read data from a Microsoft SQL Server change capture table. The origin places the CRUD operation type in the sdc.operation.type header attribute, in this case: 1 for INSERT, 2 for DELETE, 3 for UPDATE.

You configure the pipeline to write to the CRUD-enabled Kudu destination. In the Kudu destination, you can specify a default operation for any record with no value set in the sdc.operation.type attribute, and you can configure error handling for invalid values. You set the default to INSERT and you configure the destination to use this default for invalid values. In the sdc.operation.type attribute, the Kudu destination supports 1 for INSERT, 2 for DELETE, 3 for UPDATE, and 4 for UPSERT.

When you run the pipeline, the JDBC Query Consumer origin determines the CRUD operation type for each record and writes it to the sdc.operation.type record header attribute. And the Kudu destination uses the operation in the sdc.operation.type attribute to inform the Kudu destination system how to process each record. Any record with an undeclared value in the sdc.operation.type attribute, such as a record created by the pipeline, is treated like an INSERT record. And any record with an invalid value uses the same default behavior.
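
Elsewhere in the pipeline you can branch on the same attribute, for example in a Stream Selector condition. A minimal sketch using the standard record:attribute function and the operation mapping above (2 = DELETE); note that the attribute value is read as a string:

${record:attribute('sdc.operation.type') == '2'}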

CDC-enabled origin to non-CRUD destinations

If you need to write changed data to a destination system without a CRUD-enabled destination, you can use an Expression Evaluator or scripting processor to move the CRUD operation information from the sdc.operation.type header attribute to a field, so the information is retained in the record.

For example, say you want to read from Oracle LogMiner redo logs and write the records to Hive tables with all of the CDC information in record fields. To do this, you'd use the Oracle CDC Client origin to read the redo logs, then add an Expression Evaluator to pull the CRUD information from the sdc.operation.type header attribute into the record. Oracle CDC Client writes additional CDC information such as the table name and scn into oracle.cdc header attributes, so you can use expressions to pull that information into the record as well. Then you can use the Hadoop FS destination to write the enhanced records to Hive.
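
A minimal sketch of that Expression Evaluator configuration (the output field names /operation and /table are arbitrary choices for illustration, and the exact oracle.cdc attribute names should be confirmed against the Oracle CDC Client documentation):

/operation = ${record:attribute('sdc.operation.type')}
/table = ${record:attribute('oracle.cdc.table')}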

Non-CDC origin to CRUD destinations

When reading data from a non-CDC origin, you can use the Expression Evaluator or scripting processors to define the sdc.operation.type header attribute.

For example, say you want to read from a transactional database table and keep a dimension table in sync with the changes. You'd use the JDBC Query Consumer to read the source table and a JDBC Lookup processor to check the dimension table for the primary key value of each record. Based on the output of the lookup processor, you know whether there was a matching record in the table. Using an Expression Evaluator, you set the sdc.operation.type record header attribute: 3 to update the records that had a matching record, and 1 to insert new records.

When you pass the records to the JDBC Producer destination, the destination uses the operation in the sdc.operation.type header attribute to determine how to write the records to the dimension table.
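
A minimal sketch of the header attribute expression for this pattern (assuming the JDBC Lookup writes the matched primary key into a field named /existing_id, which stays null when there is no match; the field name is hypothetical):

sdc.operation.type = ${record:value('/existing_id') == null ? 1 : 3}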

5. Control Character Removal

Several stages let you remove control characters, such as escape or end-of-transmission characters, from data. Remove control characters to avoid creating invalid records. When Data Collector removes control characters, it removes ASCII character codes 0-31 and 127, with the following exceptions (the expression sketch after this list makes the rule concrete):

  • 9 - Tab
  • 10 - Line feed
  • 13 - Carriage return
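
Ignore Control Characters is a stage-level property rather than an expression, but the same rule can be written out explicitly. A minimal sketch using the str:replaceAll string function of the expression language (the field name /text is an assumption): the character class matches ASCII 0-8, 11, 12, 14-31, and 127, which is exactly codes 0-31 and 127 minus tab, line feed, and carriage return:

${str:replaceAll(record:value('/text'), '[\\x00-\\x08\\x0B\\x0C\\x0E-\\x1F\\x7F]', '')}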

To see which origins support the Ignore Control Characters property, see: https://streamsets.com/documentation/controlhub/latest/help/pdesigner/datacollector/UserGuide/Pipeline_Design/ControlCharacters.html

6. Development Stages

You can use several development stages to help develop and test pipelines.

Reference: https://streamsets.com/documentation/controlhub/latest/help/pdesigner/datacollector/UserGuide/Pipeline_Design/DevStages.html#concept_czx_ktn_ht

7. Configuring a Test Origin

Key point: a test origin is configured in the same way as a regular origin. For the detailed configuration steps, see the following:

Configuring a Test Origin

Configure a test origin in the pipeline properties. When using Control Hub, you can also configure a test origin in the pipeline fragment properties.

  1. On the General tab of the pipeline or fragment properties, select the origin type that you want to use.

    You can select any available origin type.

  2. On the Test Origin tab, configure the origin properties.

    Origin properties for the test origin are the same as for real origins, with all properties displaying on a single tab.

    For details about origin configuration, see "Configuring an <origin type> Origin" in the Origins chapter. For example, for help configuring a Directory test origin, see Configuring a Directory Origin.

Using a Test Origin in Data Preview

To use a configured test origin in data preview, configure the preview configuration options.

  1. Click the Data Preview icon to start data preview.
  2. In the Preview Configuration dialog box, set the Preview Source property to Test Origin, then configure the rest of the data preview properties as needed.

    For more information about using data preview, see Data Preview Overview.

8. Understanding Pipeline States

Pipeline states generally include states such as STARTING, RUNNING, STOPPED, and EDITED.

For details, see: https://streamsets.com/documentation/controlhub/latest/help/pdesigner/datacollector/UserGuide/Pipeline_Maintenance/PipelineStates-Understanding.html



Reposted from blog.csdn.net/fengfengchen95/article/details/84621584