Kangaroo Cloud Product Feature Update Report, Issue 08 | Nearly a hundred new features and optimizations, everything you want is here!


Welcome to Issue 08 of the Kangaroo Cloud product feature update report! In an ever-changing market environment, we are well aware of our customers' needs and expectations, so we release the latest Kangaroo Cloud product updates and optimizations in a timely manner, covering the data governance center, Hive SQL performance optimization, new plug-ins, and more, to help enterprises move forward in the digital world.

The following is the content of Issue 08 of the Kangaroo Cloud product feature update report. Read on for the details.

Offline development platform

New feature updates

1. Support for applying for and approving Inceptor table permissions

Background: The customer uses the platform's web-layer permission control solution and expects Inceptor tables to support web-layer permission control as well.

Description of new features:

As shown in the figure, once a table permission application is approved, the user holds the approved Inceptor table permissions in the offline platform. Permissions fall into the following three categories (a brief SQL sketch of each category follows below):

• DQL: mainly SELECT statements, read-only permission

• DML: mainly INSERT and UPDATE statements, write permission

• DDL: mainly ALTER statements, permission to change the table structure

file
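
For reference, the three categories roughly correspond to the following statement types. This is a minimal illustrative sketch; the database, table, and column names are hypothetical, and the exact Inceptor syntax may vary by version.

-- DQL (read-only): query statements
SELECT id, name FROM demo_db.user_info WHERE id > 0;

-- DML (write): insert/update statements
INSERT INTO demo_db.user_info VALUES (1, 'test');
UPDATE demo_db.user_info SET name = 'test2' WHERE id = 1;

-- DDL (structure change): alter statements
ALTER TABLE demo_db.user_info ADD COLUMNS (age INT);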

2. Batch operations support filtering tasks based on baselines

Background: Customers hope to build on the baseline feature: in addition to the baseline-break alarm, they also hope to batch-configure resources for the tasks on a baseline, so that recovery is faster when an error occurs on one of the baselines.

New feature description: A baseline filter has been added to batch operations.

file

3. Task priority

Background: If a task has no exceptions (errors or delays), cluster resources can generally support its normal operation, and large-scale task blocking is rare. However, when the task dependency tree is complex and several important upstream tasks fail and take a long time to repair, downstream tasks will run together after recovery and task congestion may occur. In that situation, task priority becomes particularly important.

New feature description: Baseline management supports setting a priority from 1 to 5 for tasks. The larger the value, the higher the task's running priority; higher-priority tasks receive scheduling resources first when scheduling resources are tight.

After a priority is set for a baseline, all tasks on the baseline and their upstream tasks within the baseline's effective range are automatically assigned this priority. The priority takes effect in the periodic instances generated on T+1.

file

4. Task publishing integrated with the approval center

Background: Some customers have high security requirements when publishing tasks to production projects and want the release to be completed only after approval.

New feature description: After the release approval process is enabled, once a publish action is performed in the offline platform, an approver must approve it in the approval center before the release process can continue.

file file

5. Projects support binding database accounts

Background: Some customers have the following scenario: different projects are handled by different teams with different data permissions, so they want to bind database accounts at the project level.

New feature description: RDB database accounts can be set at the project level. In the console, database accounts can also be set at the cluster and individual levels. The priority order is Personal > Project > Cluster.

file

Function optimization

1. Hive SQL performance optimization

Background: Customers reported that Hive SQL tasks executed slowly when run from the client.

Experience optimization description: After performance optimization, the speed of simple queries has improved significantly. The test queries and time comparisons are as follows:

• SELECT * FROM putong0629.dl_user WHERE id > 0; (the table has 18 fields and 100,000 rows)

file

• SELECT * FROM putong0629.dl_user WHERE id is not null LIMIT 1; (the table has 18 fields and 100,000 rows)

file

2. SQL editor formatting optimization and undo support

• Ctrl+Z / Command+Z undoes the formatted content

• After formatting, the output format has been optimized with reference to the formatting conventions of comparable products and other open-source code editors.

file

3. Log real-time printing optimization

Background: Task logs are polled every 2.5 seconds. If polling does not continue after the task finishes, key information at the end of the log is lost.

Experience optimization description: Real-time log printing has been optimized. After a task fails, the log is polled and printed one more time.

4. The right-side menu drawer of the offline development IDE supports horizontal resizing

Background: The previous interaction is shown in the figure. The right drawer had a fixed width; when filling in items with many fields, such as parameters, it was inconvenient and required scrolling back and forth to view the information.

file

Experience optimization description: The width of the right drawer can be freely adjusted to a comfortable size before filling in the fields.

file

5. SQL query result null value optimization

Background: In the offline query results, a NULL value, an empty string, and the string "null" were all displayed as blank, so users could not tell them apart.

Experience optimization description: The query results now distinguish three cases: the value is the string "null", the value is an empty string (""), and the value is NULL.

file
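
As a minimal illustration of the three cases now rendered differently (the query itself is hypothetical):

SELECT 'null' AS string_null,   -- a string whose content is "null"
       ''     AS empty_string,  -- an empty string ""
       NULL   AS real_null;     -- a true NULL value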

6. When a task is taken offline, its downstream dependent tasks are prompted

Background: Taking a task offline affects all of its downstream tasks, and users usually have no easy way to determine which downstream tasks are affected.

Experience optimization description: When a task is taken offline, a pop-up window displays the range of affected tasks.

7. GitLab code synchronization function optimization

• Adapted to GitLab version 15.7.8

• Project pull changed to an asynchronous operation to prevent pull timeouts

• Task push changed from "save, then push" to "save after the push completes"

• Support pulling by task directory

• Selection by file type changed to optional

• Batch operations support batch push and pull

file

8. SQL query result optimization

• Offline metadata synchronization supports view synchronization: The metadata synchronization function of the offline data source page supports metadata synchronization and view synchronization.

file

• Supports importing local data into a data source

file

• Query returns the number of rows

file

• Query results support sorting

file

• The query result table header identifies the field type

file

9. When the scheduling cycle is monthly, the last day can be selected.

When the scheduling period is "monthly", the time supports selecting "last day of each month".

file

10. Inceptor reading supports range partitioning

Background: In data synchronization, the offline Inceptor reader supported only single-value partitioning, not range partitioning.

Experience optimization description: Range partitioning is now supported when Inceptor is selected as the data source for offline data synchronization.

Real-time development platform

New feature updates

1. TBDS account

Users with a TBDS account can submit tasks to the cluster with their personal account; all other tasks are submitted with the default account.

2. Global/task alarms add a new trigger condition: "start/stop policy execution failure"

Background: The platform previously could not detect whether a start/stop policy executed successfully, for example whether a running task was stopped as scheduled or whether a stopped task was restarted as scheduled.

New feature description: After configuring rules, you can see the specific failure reasons in the alarm content.

file file

3. Support user-defined roles

Background: The roles and corresponding permission points in the platform were previously built-in and fixed. When the permission points or role types a user needs differ from what the platform provides, they could not be modified.

New feature description: Custom roles can be added and their corresponding permissions edited in "Role Management", and the permissions for operating members within a project have been optimized.

file

4. Flink 1.16 tasks support running on k8s

Kubernetes with NFS can be configured in Console - Cluster Configuration; the configuration steps can be found in "Overall Description - Scheduling Support".

5. Add Hudi as the source table/result table of FlinkSQL

Supports introducing HMS data sources; Hudi tables can then be selected as source/result tables in FlinkSQL wizard mode.

file
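
For reference, a Hudi table used as a FlinkSQL source/result table corresponds to DDL roughly like the sketch below. The connector options follow the open-source Hudi Flink connector and are assumptions for illustration; in wizard mode the platform generates the actual statement.

CREATE TABLE hudi_orders (
  order_id BIGINT,
  amount   DECIMAL(10, 2),
  ts       TIMESTAMP(3),
  PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
  'connector'  = 'hudi',                               -- open-source Hudi Flink connector
  'path'       = 'hdfs:///warehouse/demo/hudi_orders', -- hypothetical storage path
  'table.type' = 'MERGE_ON_READ'
);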

6. Added HBase/ElasticSearch HuaweiCloud as the dimension table/result table of FlinkSQL

Supports selecting and using HBase/ES HuaweiCloud data sources adapted to FusionInsight/MRS clusters in the result table/dimension table.

file file

7. Real-time SQL queries, debugging, and pre-sales demo tasks are submitted through session mode

Background: The real-time platform previously submitted tasks in per-job mode by default. For real-time SQL queries, debugging, and demo tasks, however, results are needed quickly and the job does not run continuously for a long time, so the advantages of per-job mode are not used; moreover, per-job submission takes a long time, which makes it unsuitable for these scenarios.

Description of new functions: The following three configuration items are added to the session configuration to support real-time task scenarios:

file

8. Added Upsert Kafka plug-in to the source table

Added the Upsert Kafka plug-in as the source table and result table of FlinkSQL.

file
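
A rough sketch of what an Upsert Kafka source/result table looks like in Flink SQL, based on the open-source upsert-kafka connector (topic, brokers, and fields are hypothetical):

CREATE TABLE user_balance (
  user_id BIGINT,
  balance DECIMAL(18, 2),
  PRIMARY KEY (user_id) NOT ENFORCED      -- upsert-kafka requires a primary key
) WITH (
  'connector' = 'upsert-kafka',
  'topic' = 'user_balance',
  'properties.bootstrap.servers' = 'kafka01:9092',
  'key.format' = 'json',
  'value.format' = 'json'
);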

9. Added [Real-time Lake Warehouse] module

The [ Real-time Lake Warehouse ] module has been added to support the management and calculation of lake tables.

Function optimization

1. Enhance the accuracy of FlinkSQL syntax parsing in the IDE

Background: The previous syntax parsing still flagged many correct SQL statements as errors.

Experience optimization description: The accuracy of SQL syntax parsing has been improved.

2. StarRocks result table: wizard mode supports update mode

Background: The StarRocks plug-in supports upsert with a defined primary key, but the platform's wizard mode did not; the update mode needed to be adapted in wizard mode.

Experience optimization description: Wizard mode now adapts to the StarRocks data source and supports upsert with a custom primary key.

file
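
In update mode, the generated result-table DDL declares a primary key so that writes are applied as upserts. The sketch below uses the option names of the open-source StarRocks Flink connector as an assumption; field and connection values are hypothetical.

CREATE TABLE sr_user_stats (
  user_id BIGINT,
  pv      BIGINT,
  PRIMARY KEY (user_id) NOT ENFORCED      -- the primary key enables upsert semantics
) WITH (
  'connector'     = 'starrocks',
  'jdbc-url'      = 'jdbc:mysql://starrocks-fe:9030',
  'load-url'      = 'starrocks-fe:8030',
  'database-name' = 'demo',
  'table-name'    = 'user_stats',
  'username'      = 'root',
  'password'      = ''
);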

3. Added OushuDB result table

The result table supports the OushuDB data source.

file

4. Business data issues in log printing

Background: Business data was printed in the running logs of real-time tasks, which poses a data security risk and needs to be masked.

Experience optimization description: The running log, task manager log, and history log are checked for printed business data; if any is found, it is hidden.

file

5. Added [Task Offline] function and added [Task Stop Time] column

The interaction for task operation and maintenance has been optimized: a [Task Offline] function has been added, and a [Task Stop Time] column has been added to the task list.

file

6. Data sources in wizard mode uniformly support custom parameter configuration

Background: The "Add Custom Parameters" and "Update Strategy" configuration items were missing for some data sources in the result table.

Experience optimization instructions:

• Result table and dimension table: open custom parameter configuration has been added for SQL Server, MySQL, Oracle, PostgreSQL, KingbaseES8, GreatDB, Doris 0.14.x (http), Doris 0.14.x (jdbc), StarRocks, Impala, ClickHouse, Inceptor, ES6.x, ES7.x, TBDS_HBASE, ArgoDB, and Vastbase data sources.

file

• Result table: new update strategies have been added for the SQL Server, PostgreSQL, and KingbaseES8 data sources.

file

7. [Task Operation and Maintenance] Health sub-model optimization

Task operation and maintenance has been functionally optimized: descriptions of score deduction items and troubleshooting guidance for common problems have been added. Users can view the specific deduction items via the health score and improve them, making it easier to troubleshoot problems.

file

8. [Real-time development] Task import and export function optimization

Background: When replacing task resource group information, the real-time task import/export function used database IDs instead of names, which caused errors when importing across environments (the IDs of this information are very likely to differ between environments).

Experience optimization description: When importing and exporting tasks, information that needs to be replaced, such as resource groups and data sources, is now replaced by name. As long as the names maintained in the two environments are consistent, cross-environment import and export works.

Data asset platform

New feature updates

1. Trino supports metadata synchronization

Trino meta data sources generated when projects are created in other product modules (offline, indicators, labels, etc.) can be automatically introduced into assets, and Trino meta data sources support quality project authorization.

2. Support cross-source comparison of TDSQL and Inceptor tables through Trino

Background: hyperbase, hyperbase drive, and search tables were not previously covered by Inceptor table comparison.

New feature description: Data quality can perform cross-source comparison between TDSQL and Inceptor (hyperbase, hyperbase drive, search) tables through Trino.
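
Conceptually, the comparison runs cross-catalog SQL through Trino, roughly as in the sketch below (catalog, schema, and table names are hypothetical; the platform generates the actual comparison statements):

-- Row-count comparison across the two catalogs
SELECT
  (SELECT count(*) FROM tdsql_catalog.demo_db.orders)    AS tdsql_rows,
  (SELECT count(*) FROM inceptor_catalog.demo_db.orders) AS inceptor_rows;

-- Keys present in TDSQL but missing from Inceptor
SELECT order_id FROM tdsql_catalog.demo_db.orders
EXCEPT
SELECT order_id FROM inceptor_catalog.demo_db.orders;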

3. Partitioned tables support displaying partition information in the table structure

If a data table is a partitioned table, its partition information is now shown in Table Details - Table Structure.

4. Support online and offline approval operations of data standards

Data standards created by ordinary users of the data standard module must be reviewed in the approval center before they can go online or offline. Only after a data standard is online can standard mapping and standard binding be performed.

file

5. Metadata synchronization supports configuring automatic synchronization filtering rules

Background: Metadata synchronization tasks support filter conditions for the logic that monitors offline DDL statements and synchronizes tables into assets in real time. For example, if you do not want tmp tables collected into the data map, you can filter them out in the metadata synchronization task. However, there was no place to add filter conditions to the real-time DDL monitoring logic, so when offline tasks ran, the tmp tables they created were still collected into assets.

New feature description: An [Automatic Synchronization] function has been added to the metadata synchronization module for configuring filter rules for automatic synchronization.

file

6. Greenplum data source supports view synchronization

The Greenplum data source supports view synchronization. GP views and GP data tables share a metamodel, to which the technical attributes source table name (view-specific) and view description (view-specific) have been added. When selecting data under a GP-type data source, a specific view can be chosen for operations such as metadata synchronization and data desensitization.

file

7. Assets support the automatic introduction of MySQL type data sources

For the meta data source generated when a project is created in the offline platform, assets support automatic introduction of MySQL-type data sources. After automatic introduction, periodic tasks are created automatically.

8. [Data Governance] Governance workbench, governance configuration function

Background: The significance of data governance is to encourage users to develop data according to normative standards and to govern data from the five dimensions of computing, storage, quality, standards, and value. The goals are to optimize storage costs, save computing resources, promote standards, and let users see both the problems and the results through data governance.

New feature description: This iteration supports data governance from the computing and storage dimensions, supports automatic synchronization of project information created by the offline development module, allows periodic governance of projects by configuring governance tasks, and lets handlers be assigned to outstanding issues so that problems are processed in a closed loop.

file file

Function optimization

1. The alarm email content now includes the planned time of the instance

"Planned Time" has been added to the alarm email, and the original "Scheduling Time" has been changed to "Start Time", so that users can see directly from notifications such as email on which specific day the quality task verification failed.

2. Data source display optimization

• Connected data sources are sorted in descending order by the priority of number of data sources, number of databases, number of tables, and storage size.

• In the data directory distribution, data resource content is displayed based on the sub-product modules connected to the current tenant.

3. When data security is enabled, the web-layer table permission application entry and the desensitization entry are removed

When the permission control policy is enabled in the data security sub-module, the permission policy configured there prevails, and the table permission application entry in the asset module is hidden.

If a desensitization policy for hive/sparkthrift/trino is enabled in the data security sub-module, desensitization applications in the desensitization portal can no longer select data tables under these types of data sources.

4. Table life cycle IDE script synchronization

The offline development module supports configuring the life cycle through IDE scripts. When the life cycle changes, the change is synchronized to assets, and the life cycle information is displayed when viewing table details in the metadata module.

5. Optimization of data desensitization management

After the data masking rules are configured, the masking application configuration page supports editing operations.

file

6. Normative rule verification optimization

The logic of normative rules has been optimized. For example, with minimum length = 20, a string whose length is greater than or equal to 20 is considered compliant (the maximum length rule follows the same logic).
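
Expressed in SQL terms, the minimum-length rule counts a value as compliant when its length is at least the configured value. A hedged sketch with a hypothetical table and column:

-- Minimum length = 20: values with length >= 20 are compliant
SELECT
  count(*)                                                  AS total_rows,
  sum(CASE WHEN length(user_name) >= 20 THEN 1 ELSE 0 END)  AS compliant_rows
FROM demo_db.user_info;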

7. [Data Map] Data table display optimization

In the data table list, the displayed content has been adjusted from "Data Source·Database" to "Data Source | Database", and hovering the mouse shows "Data Source | Database".

If there are multiple data sources, the full name of the first data source is displayed and the rest are represented by "...", for example "mysql_test1... | dbtest1"; for Trino data sources, the displayed content is "data source | catalog | database".

On the table details page, a new "Data Source" field has been added below the "Table Name" field in the technical attributes column; it shows the data source information of the table, with multiple data sources separated by semicolons. A new "data source" technical attribute has also been added to the technical attribute page of the metadata model.

8. Table structure field list editing interaction optimization

Background: Editing the editable content in the field list one item at a time was cumbersome. After optimization, the whole table can be edited in place and then saved at once.

Experience optimization instructions:

• Optimization of interactive logic for label addition

• Support batch editing of field descriptions and field labels

Data service platform

New feature updates

1. Composition and time-consuming analysis of each stage of API call

A call analysis tab has been added to the test API page for generated APIs. A waterfall chart shows the total time taken and what was executed, as well as the specific causes of function errors and other problems.

file

Service orchestration is similar to API generation, with call analysis added to view specific time consuming and failure reasons.

file

Each call record now saves the input parameter content (generated API, registered API, service orchestration, service analysis) and the call analysis (generated API, registered API, service orchestration); viewing the call analysis works the same way as viewing the input parameters of generated API calls.

file

2. Service orchestration supports JAVA

Service orchestration changes the python node into a function node. A function node can choose its function type: python2.7, python3.9, or JAVA. The parameter input method is the same as before.

file

Java8 has been added as a function type. When JAVA8 is selected, the page jumps to the JAR package upload interface; python functions work as before. For a Java function, first upload a JAR package or zip file smaller than 50MB, then fill in the class name and class method. When entering parameters, click Parameter Parsing to automatically parse field types, parameter names, and so on.

file file

3. Service orchestration supports displaying sample return results

In the advanced configuration of service orchestration, options to display a sample return result and to save test results as a JSON sample have been added.

file

4. Support API path prefix customization

This is mainly implemented through configuration item changes and compatible code logic. The configuration item changes are as follows (when the same configuration item is configured in multiple services, the values must be exactly the same):

api-web changes:
(deprecated) gateway.url
(new) gateway.url.host = http://gateway-default-api530-api.base53.devops.dtstack.cn
(new) gateway.url.custom.prefix = /custom/data
(new) gateway.url.custom.open = true

gateway changes:
(new) gateway.url.custom.open = true
(new) gateway.url.custom.prefix = /custom/data

nginx changes (/conf/conf.d/apigw.conf):
(The configuration after location needs to be extracted as a variable by basic operations so that it can be changed through EM; the value must stay consistent with gateway.url.custom.prefix in the API configuration file)
#location /api/gateway {
-> change to:
#location /custom/data {
      proxy_max_temp_file_size         0k;
      fastcgi_buffers 32 8k;
      proxy_http_version 1.1;
      proxy_set_header X-Real-IP       $remote_addr;
      proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
      proxy_set_header Host            $host;
      proxy_pass http://real-rdos-api-gw;

      if ($request_method = 'OPTIONS') {
            return 204;
      }
  }

Configuration item description:

• gateway.url.custom.open: whether to use a custom URL prefix; default false

• gateway.url.host: request URL, consisting of http(s)://hostname:port

• gateway.url.custom.prefix: custom prefix, starting with a slash, supports multiple levels; default /api/gateway

5. API supports batch submission, publishing, and withdrawal

The API supports batch submission, batch publishing, batch withdrawal and other operations to improve the operating efficiency of the API.

file

Function optimization

1. API input parameters support requiring that at least a certain number of the optional parameters be filled in

Among non-required fields, several can be selected as "at least N required". For example, with mobile phone number, name, and ID card and the number set to 2, at least two of these fields must be filled in, such as mobile phone number and name, or ID card and mobile phone number; otherwise the call fails.

file

2. Supports selecting exported content when exporting API documents

When exporting API documents, you can select the content to be exported, and you can also select some APIs in the directory for document export.

file file

3. Whether a registered API's return result uses the platform's default structure is now configurable

Background: APIs registered in the data service were wrapped with an extra layer of content, so the return results after registration were inconsistent with the native API.

Experience optimization description: A backend configuration item has been added to control whether the platform's own content is added to the returned results; it is added by default.

4. Support API policy creation for circuit breaker and downgrade

Supports creating circuit breaker and downgrade policies:

file

After creation, the policy can be selected and applied to an individual API:

file

Customer Data Insight Platform

New feature updates

1. The label directory supports batch upload and download.

Background: Customers usually develop tags in two environments, production and test. They expect the tag directory created in the test environment to be synchronized directly to the production environment, eliminating repeated operations.

Description of new features:

• Enter the tag directory list of project A and click "Directory Download" to download the directory file

file

• Enter the tag directory list of project B and click "Directory Upload" to upload the directory file

file

• When a directory file in CSV format is uploaded, the system performs an incremental update based on directory name, directory level, and parent directory name. The parent directory must already exist in the file or in the online directory. The file directory is updated asynchronously, and the directory cannot be modified during the update.

file

2. Label copy across projects/entities

Background: When the test environment and the production environment are deployed together, customers want a simple way to set up a label in the production environment once the label's processing logic has been verified in the test environment.

Description of new features:

• When creating a new label, you can use "Cross-Project Copy" to copy labels from other entities or other projects to the current entity to quickly create labels.

file

• After a specific tag is selected, the tag creation page opens and the configuration of the copied tag is quickly filled into the new tag's configuration. If tables/tags that have not been configured for the current entity are involved, they must be selected again manually.

3. Data synchronization supports synchronization to Inceptor and generates tables in hyperbase format.

Background: The underlying data stack supports TDH, and the corresponding customer data in the upper layer is stored in Inceptor, so data synchronization results need to be synchronized to Inceptor.

New function description: In the API access data source settings, the Inceptor data source can be set.

Function optimization

1. Tag SQL optimization: partition-related fields split out to improve processing efficiency

Background: The historically generated SQL queried the entire table; with large data volumes this could cause memory overflow and errors.

Experience optimization description: The SQL is adjusted to first determine the required partition and then query only that specific partition, avoiding the error.
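
A before/after sketch of the idea, assuming pt is the partition column (table and column names are hypothetical):

-- Before: the whole table is scanned, which can overflow memory on large tables
SELECT user_id, tag_value FROM dw.tag_detail;

-- After: the required partition is determined first, then only that partition is queried
SELECT user_id, tag_value FROM dw.tag_detail WHERE pt = '20230701';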

2. Add source table description information to the entity

Background: Table information involved in an entity was displayed by table name only; the English names are not intuitive, so the table description is now displayed as well.

Experience optimization instructions:

• Display table description information in entity details

file

• Create/edit entities and display table description information

file

3. Tag configuration, tag market, and tag group pages display tag name + description information

file

file

4. List optimization

On pages such as the tag group instance list, the group details list, and the group intersection/difference pages, when the page is scrolled down to the list area, the list expands to full width to show more content.

file

Indicator management platform

New feature updates

1. The historical data of the indicator result table supports row-level updates.

Background: In performance appraisal scenarios, performance allocation rules are made by business personnel, and the rules usually lag behind: for example, a rule launched on April 1, 2023 takes effect retroactively from January 1, 2023, so the data since January 1, 2023 must be updated. Updating the whole table is slow and resource-intensive; updating only the affected rows shortens the update cycle and has relatively little impact on normal business use.

Description of new features:

The overall operation process of row update is as follows:

• When creating a data model, mark the Hudi source tables that require row updates. After the model is created, the system provides an interface for those tables through which change data conditions can be passed in.

file

• Create the required indicators. Because the model uses row-update tables, subsequent indicators are calculated through Spark and stored as Hudi tables. Since Spark does not support concurrent writes to Hudi tables, tasks involved in cross-cycle dependencies in scheduling must be set to be self-dependent.

• Call the table's row update interface and pass in the change records. The interface information can be viewed in the table details in "Data Source Management". Based on the received change records and the preset update frequency, the system automatically identifies the affected rows in all indicator tables, calculates the new results, and batch-updates the historical data. If a row update is urgent, "Row Update" can be clicked to execute it immediately.

file

• In the [Data Source Management] module, query the row update progress of subsequent indicators after the relevant records are changed.

file

2. The indicator directory supports permission control

Background: Based on indicator security levels, different indicators need to be authorized to different people. Indicator directories are usually divided by business, so to keep operations simple, the indicator authorization function is placed on the indicator directory: view/edit permissions for all indicators in a directory are controlled through the directory.

Description of new features:

Click the "Authorize" button on the right side of the directory to open the directory authorization window.

file

On the authorization page, a newly created directory is set to be editable by all members by default. This can be changed to viewable by all members and editable by selected users, or the all-members setting can be turned off entirely so that only selected users can view and edit.

file

Users who have been granted permission can see all indicators in this directory, and can also select a directory with permission when creating/editing indicators.

3. The indicator supports custom addition of UDF functions.

Background: The functions currently supported by the system are the system functions provided by Trino. On top of these, some scenarios require user-defined functions, for example getting the date of last Monday, which must be implemented through a custom function.

Description of new features:

For Trino 385, Trino custom functions can be created in the "Function Management" module, and successfully created custom functions can be referenced in custom indicators.

file

Step 1: Before creating a custom function on the platform, you need to write the custom function plug-in and package the file into a zip package.

Step 2: Click "New Custom Function" to enter the function setting window, configure the function information and upload the packaged file.

file

Step 3: Enter the custom indicator new/edit page, write SQL and call the custom function.

file
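
A hedged sketch of Step 3: last_monday is a hypothetical custom function created in "Function Management", and the table below is illustrative.

-- Reference the custom function inside a custom indicator's SQL
SELECT
  last_monday(current_date) AS stat_week_start,   -- hypothetical UDF
  sum(order_amount)         AS weekly_order_amount
FROM dws.order_summary
WHERE dt >= last_monday(current_date);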

4. Add time parameters to the statistical cycle: beginning of last quarter, beginning of last month, end of last month, beginning of last year

Background: In performance appraisal scenarios, the statistical cycle involves summarizing data of the previous month, previous quarter, and previous year, so time parameters for the beginning of last quarter, beginning of last month, end of last month, and beginning of last year are needed in both yyyyMMdd and yyyy-MM-dd formats.

Description of new features (a usage sketch follows the parameter list below):

• Beginning of last quarter: bdp.system.preqrtrstart, bdp.system.preqrtrstart2

• Beginning of last month: bdp.system.premonthstart, bdp.system.premonthstart2

• End of last month: bdp.system.premonthend, bdp.system.premonthend2

• Beginning of last year: bdp.system.preyrstart, bdp.system.preyrstart2
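
A hedged usage sketch, assuming the scheduler substitutes parameters written as ${...} at run time; the table and columns are hypothetical. The plain variants expand to yyyyMMdd and the "2" variants to yyyy-MM-dd.

-- Summarise last month's data with the new time parameters
SELECT sum(order_amount) AS last_month_amount
FROM dws.order_detail
WHERE dt BETWEEN '${bdp.system.premonthstart}' AND '${bdp.system.premonthend}';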

5. Support row updates for new models generated based on the indicator result table

Background: In performance appraisal scenarios, indicator 1 may be created from model 1 and the results of indicator 1 used as the data source table of model 2. When the rows of model 1's table are updated, both indicator 1 and model 2 need to be row-updated as well.

New feature description: The indicator platform provides a row update status interface. The business side checks the status through the interface and then triggers the update of the next model.

• When the table selected in the data model settings comes from the Hive Catalog, row updates do not need to be set and the update method can be modified; when the Hudi Catalog is selected, row updates need to be set

• Only Hudi data sources are displayed in data source management

• For tables that require row updates, two deletion methods are available (see the sketch after this list):

1) Physical deletion: the table's data is deleted directly. In this case, CDC must be enabled for the table or the file storage method must be op_key_only/data_before_after; otherwise the system cannot track the data differences before and after the change.

2) Logical deletion: deleted rows are marked by a value change on a "deleted" field. In this case, the deleted field and the corresponding value must be specified.

• The row update progress of each indicator can be queried through an interface:

1) Input parameters: table information, request id, and the model identifier/indicator identifier/API name involved in the row update

2) Output parameters: the update status of the model/indicator/API for the given table and request batch, the table data update start time, and the table data update end time

• The table creation statements of the Hudi tables for row-update-related indicators have been adjusted accordingly.
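
A minimal sketch of logical deletion, assuming a hypothetical is_deleted flag field where the value 1 means "deleted"; the actual field and value are whatever is specified in the row-update settings.

-- Mark a row as logically deleted instead of physically removing it
UPDATE demo_db.fact_sales SET is_deleted = 1 WHERE order_id = 1001;

-- Downstream indicator calculations then exclude logically deleted rows
SELECT sum(amount) AS total_amount
FROM demo_db.fact_sales
WHERE is_deleted = 0;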

Function optimization

1. Indicator sharing increases detailed information display

Background: The indicator sharing module has been revised, which made it inconvenient to view the sharing rules of shared indicators/models.

Experience optimization description: Click a shared indicator/model name to view its details, including the sharing information and sharing rules.

file

2. Adjustment of view creation rules generated by indicator sharing

Background: In the scenario where Spark reads and writes data generated by the row update function, Spark does not currently support querying Trino views, so views need to be created by Spark instead of Trino.

Experience optimization instructions:

• The views involved in the indicator/model sharing process are changed to be created through Spark

• The names of views generated by shared indicators and models have changed: 1) model view name: table name_project id_model code_index_view; 2) indicator view name: indicator result table_project id_index_view

3. Downstream linkage update of model/indicator updates

Background: During indicator processing, upstream configuration items are sometimes changed; the corresponding downstream SQL then needs to be updated synchronously to keep the global configuration efficient and consistent. Representative scenarios are:

• During customer use, information such as a model table's partition field or dimension object attribute configuration may change. Previously, after such technical information was edited, only added or removed dimensions were propagated downstream; changes to the remaining technical information also need to be propagated.

• When customers calculate indicators, the processing caliber may differ across statistical intervals of the same indicator. In that case, data under both calibers coexist in the table, split by the caliber's effective time; for example, the 2022 data is the result of caliber 1 and the 2023 data is the result of caliber 2.

Experience optimization instructions:

• After a model's table association is modified, the SQL of atomic indicators and derived indicators is updated in linkage, and the model part of the SQL is updated to the new version. At the same time, if the table selected by the model changes, the model result table is updated by dropping and re-creating the table.

• After the associated dimension object and associated dimension attributes of a dimension selected in the model are modified, the dimension object and dimension attribute information referenced by atomic indicators is updated simultaneously.

• When the upstream task dependencies in the model are modified, the upstream task configuration of derived indicators is adjusted simultaneously.

• When model/indicator dimensions are reduced so that a dimension used by downstream indicators disappears, the indicator table is updated by dropping and re-creating the table.

• When the field type of a source table used by the model changes, indicator tables that reference that field are updated by dropping and re-creating the table.

4. Row update performance optimization

The first version of row update processed data partition by partition, which was slow overall. This optimization targets the specific rows within a partition, improving overall row update efficiency.

"Dutstack Product White Paper" download address: https://www.dtstack.com/resources/1004?src=szsm

"Data Governance Industry Practice White Paper" download address: https://www.dtstack.com/resources/1001?src=szsm

To learn more or get advice about big data products, industry solutions, and customer cases, visit the Kangaroo Cloud official website: https://www.dtstack.com/?src=szkyzg
