Kangaroo Cloud Product Feature Update Report, Issue 04 | The first product upgrade of 2023 goes "full throttle"

In the new year we have picked up the pace of iteration: we added the data lake platform EasyLake and the big data base platform EasyMR, and upgraded or optimized more than 40 features. We will keep up this rhythm of product upgrades to meet the needs of users across more industries and deliver the best possible product experience.

The following is the fourth issue of the Kangaroo Cloud product feature update report; read on for the details.

Data Lake Platform

1. [Metadata Management] Create Catalog

Create a Catalog on the [Metadata Management] page, and fill in the Catalog name, Hive MetaStore, and Spark Thrift.

A Catalog can be bound to only one Hive MetaStore; Spark Thrift is used for Iceberg table creation and for converting tables into the lake. Catalogs can be used to isolate the data of different business departments.


2. [Metadata Management] Database Creation

Create a Database on the [Metadata Management] page and bind it to a Catalog.


3. [Metadata Management] Iceberg table creation

• Create a Table on the [Metadata Management] page: select the Catalog and Database the table belongs to; currently only Iceberg lake tables can be created;

• Define the table's regular columns; a regular column can be set as the primary key, which serves as the unique identifier of the lake table;

• Select regular columns as partition fields, with support for multiple partition transforms; timestamp columns can be partitioned at year, month, day, or hour granularity;

• Row-group-level indexes are supported: select regular columns as index fields and configure a Bloom filter index;

• Customize advanced parameters (a DDL sketch covering these options follows below).
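For illustration only, here is a minimal Spark SQL sketch of what an equivalent Iceberg table definition might look like; the catalog, database, table, and column names are hypothetical, and the platform's wizard generates the actual DDL for you:

```sql
-- Hypothetical Iceberg lake table: day-partitioned, with a row-group-level Bloom filter on user_id.
CREATE TABLE lake_catalog.ods.user_events (
    user_id    BIGINT,
    event_type STRING,
    ts         TIMESTAMP
) USING iceberg
PARTITIONED BY (days(ts))            -- partition transform on a timestamp column
TBLPROPERTIES (
    'format-version' = '2',                                       -- example advanced parameter
    'write.parquet.bloom-filter-enabled.column.user_id' = 'true'  -- Bloom filter index on user_id
);

-- Mark user_id as the unique identifier (primary key) of the lake table.
ALTER TABLE lake_catalog.ods.user_events SET IDENTIFIER FIELDS user_id;
```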


4. [Metadata Management] Iceberg Table Snapshot Management

Supports snapshot history management, comparison of changes between snapshot versions, lake table time travel, and one-click rollback to a specified data version.
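In open-source Iceberg terms, time travel and rollback map to a snapshot-qualified query and the rollback_to_snapshot procedure; a sketch in Spark SQL, with a hypothetical table and snapshot ID:

```sql
-- Time travel: query the lake table as of a specific snapshot.
SELECT * FROM lake_catalog.ods.user_events VERSION AS OF 4358109269269043969;

-- One-click rollback to that data version.
CALL lake_catalog.system.rollback_to_snapshot('ods.user_events', 4358109269269043969);
```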


5. [Data Ingestion] Hive-to-Iceberg table conversion brings Hive tables into the lake

Create an ingestion task on the [Data Ingestion] page, select a Hive table in Parquet, ORC, or Avro format to convert it into a lake table, and generate the lake table metadata with one click.
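Converting a Hive table to Iceberg corresponds to Iceberg's standard table-migration procedures; a Spark SQL sketch with hypothetical table names (whether the platform uses these exact procedures under the hood is an assumption):

```sql
-- In-place conversion: the Hive table (Parquet/ORC/Avro) becomes an Iceberg table.
CALL lake_catalog.system.migrate('ods.legacy_hive_table');

-- Alternative: create a new Iceberg table from a snapshot of the Hive table, leaving the original untouched.
CALL lake_catalog.system.snapshot('ods.legacy_hive_table', 'ods.legacy_hive_table_iceberg');
```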


6. Support for small file merging, orphan file cleanup, and expired snapshot cleanup

Create a task template on the [Data File Governance] - [Task Template] page. It supports data file governance tasks such as small file merging, expired snapshot cleanup, and orphan file cleanup, and supports several governance modes: immediate, scheduled, and periodic governance.
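These governance tasks correspond to Iceberg's standard maintenance procedures; a Spark SQL sketch with a hypothetical table (the task template configures the equivalent behavior):

```sql
-- Small file merging (data file compaction).
CALL lake_catalog.system.rewrite_data_files(table => 'ods.user_events');

-- Expired snapshot cleanup: remove snapshots older than a given timestamp.
CALL lake_catalog.system.expire_snapshots(table => 'ods.user_events', older_than => TIMESTAMP '2023-01-01 00:00:00');

-- Orphan file cleanup: delete files no longer referenced by any snapshot.
CALL lake_catalog.system.remove_orphan_files(table => 'ods.user_events');
```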


Big Data Basic Platform

1. [Global] Use the host name as the unique identifier of the machine

• The EasyMR (EM) platform now uses the hostname as the unique identifier for managing hosts;

• Communication between hosts defaults to IP; this can be switched on the [Platform Management] - [Communication Configuration] page.


2. Function optimization

• Alerting: fixed inconsistency between the dtalert and Grafana alert channels when a newly added alert channel was abnormal;

• Alerting: fixed the dtalert mount directory being inconsistent with the directory of the uploaded JAR package;

• Alerting: fixed the uploaded JAR package not being displayed after adding and saving a custom alert channel;

• Hadoop security: fixed EM reporting success immediately when Hadoop security was enabled without restarting the services;

• Backup: optimized queries in EM backup management;

• Redis role detection: fixed Redis role information being reported incorrectly while Redis was running normally, which prevented the Redis role status from being obtained when deploying other services.

Offline Development Platform

1. The number of query result rows can be limited in the data development IDE

User pain point: ad hoc runs on the data development page did not limit the number of query result rows; in extreme cases this risked filling up the system disk.

New feature description: for all SQL-type tasks, a query row count input box has been added to the right of the Run button. The default is 1,000 rows and the upper limit is 1,000,000 (the upper limit is a configuration item that can be set in the background).


2. Global control of data preview

A global data preview control switch has been added in the Data Source Center:

• Data preview can be controlled globally across sub-products and projects;

• Data preview can be controlled for an individual data source.


3. FTP as a target data source supports four write modes

• append: overwrite and write by file name;

• overwrite: clear the files in the directory first, then write;

• nonconflict: look up by file name; if a file with the same name exists, an error is reported, otherwise the data is written normally;

• insert: append to the file; if a file with the same name exists, a suffix is added to the new file's name.


4. Run timeout interruption

Tasks support setting a timeout; when the running time exceeds it, the task is automatically killed in the background.


5. The data synchronization channel control page supports configuration of advanced parameters


6. Other new features

• Inceptor tables added to the data map: existing Inceptor tables in the data map support metadata queries, data masking, lineage display, and other functions;

• Added the Flink Batch task type;

• Data synchronization supports reading from HBase via the REST API;

• Data synchronization supports reading from Sybase.

7. Data backfill (supplementary data) optimization

• Data backfill supports three modes: backfill for a single task, batch backfill of tasks filtered by conditions in the task management list, and backfill of multiple tasks selected according to their upstream/downstream relationships;

• For tasks that are in the same dependency tree but have gaps between them or do not depend on each other directly, the generated backfill instances are still executed in the original dependency order;

• Retries can optionally be turned off for backfill;

• Data backfill supports selecting a future time.

8. Alarm rule task selection optimization

Tasks can be selected for an alarm rule by project (all tasks in a project) or by task management directory (all tasks in a directory).


9. Optimization of the whole database synchronization function

• Whole database synchronization supports Oracle, MySQL, DB2, Hive, TiDB, PostgreSQL, ADB, Doris, and HANA as synchronization targets;

• Advanced settings retain the historical configuration; for the same data source and schema, the rules configured in advanced settings are recorded.


10. Greenplum task adjustment

• When complex or multi-statement SQL is run ad hoc, the execution logic of Greenplum SQL and Inceptor SQL is changed from synchronous to asynchronous;

• Greenplum metadata can be viewed in table queries;

• Syntax hints are supported.

11. Support specifying file name when synchronizing data to HDFS

User pain point: previously, when writing to HDFS, the "file name" that could be specified was actually the name of the leaf directory; the actual file name could not be specified.

Experience optimization description: a strictMode parameter has been added to the advanced configuration. When it is "true", strict mode is enabled; when it is "false", loose mode is enabled. In strict mode, the file name under the leaf path is specified, only one file name is allowed, and multi-parallelism and resumable upload do not take effect.


12. Project identifiers must start with an English letter

Because some engines (such as Trino) can only create or read schemas whose names start with an English letter, the project identifier is now restricted to start with an English letter when a project is created.

13. Publish button click logic optimization

Before: the publish button could be clicked only for tasks in the submitted state.

After: the publish button can be clicked for tasks in any state.

14. Event task wording adjustment

For ad hoc runs, the parameter value 000000000000 must be passed.


15. New prompt for project-level Kerberos


16. Data synchronization optional table range optimization

User pain point: the data sources corresponding to meta schemas all connect with the console account. Without restricting a project's data sources to the schema the project is bound to, every project could bypass data permission control through data synchronization and directly sync the schema tables of all other projects in the cluster into the current project, which is a serious permission loophole.

Explanation of experience optimization:

• Dirty data tables are filtered out;

• For the data sources corresponding to meta schemas, the range of selectable schemas is fixed to the schemas bound to the current project;

• If other schemas are needed in the current project's synchronization tasks, the meta schemas of other projects can be imported into the current project through authorization by the tenant administrator.


17. Optimized the display of running indicators for data synchronization instances

The running log of data synchronization task instances now presents synchronization performance in an improved way.


18. Other experience optimization items

• The security audit operation object "script" is changed to "temporary query";

• Reduced network overhead from calls made inside for loops.

Real-Time Development Platform

1. Custom Connectors

User pain point: as the number of real-time product customers grows, there is constant demand for new data source plug-ins. Customers with development capability should be able to develop plug-ins themselves and use them in the product without waiting for product iterations, making the product's capabilities more open and flexible.

New feature description: for data sources not yet supported by ChunJun, user-developed or third-party plug-in packages can be uploaded (they must meet the Flink Connector development requirements; the platform does not verify plug-in availability) and then used when developing tasks in script mode, as sketched below.
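A minimal script-mode FlinkSQL sketch referencing a hypothetical uploaded connector; the connector identifier and its options are defined by the plug-in itself and are assumptions here:

```sql
-- Source table backed by a user-uploaded custom connector plug-in.
CREATE TABLE custom_source (
    id   BIGINT,
    name STRING
) WITH (
    'connector' = 'my-custom-connector',          -- identifier implemented by the uploaded plug-in
    'endpoint'  = 'http://example.internal:8080', -- plug-in-specific option (hypothetical)
    'format'    = 'json'
);

-- Simple sink for verification, using Flink's built-in print connector.
CREATE TABLE print_sink (
    id   BIGINT,
    name STRING
) WITH ('connector' = 'print');

INSERT INTO print_sink SELECT id, name FROM custom_source;
```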


2. Session mode

User pain point: previously, real-time task debugging ran in the same per-job mode as normal tasks. Although per-job mode guarantees stable task operation, the whole submit, apply-for-resources, run pipeline takes a long time on the back end, which does not fit the debugging scenario (debugging does not need long-term stability, it needs quick results).

New feature description: debugging tasks now run in session mode to improve debugging efficiency. Users first need to allocate slot resources for real-time debugging on the console.


3. Table management

User pain point: before each real-time task could be developed, Flink tables had to be mapped temporarily, which was inefficient; the previously provided Hive Catalog table management required users to maintain a Hive Metastore, which is somewhat intrusive to the existing Hive setup.

New feature description: DataStack MySQL is provided as the storage medium for Flink metadata; Catalogs, databases, and tables can be maintained in either wizard or script mode; Flink databases and tables can be created and referenced directly on the IDE development page (references must use the Catalog.DB.Table form).
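With tables maintained in a platform Catalog, a task can reference them by their fully qualified name instead of re-declaring CREATE TABLE statements; a sketch with hypothetical catalog, database, and table names:

```sql
-- Read from and write to tables already registered in the platform Catalog,
-- using the Catalog.DB.Table reference form.
INSERT INTO my_catalog.dw.orders_sink
SELECT order_id, amount, order_time
FROM my_catalog.ods.orders_source
WHERE amount > 0;
```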


4. Data source addition/optimization

• Added GreatDB as a FlinkSQL dimension table and result table;

• Added HBase 2.x as a FlinkSQL result table (a sketch follows below);

• Added Phoenix 5.x as a FlinkSQL result table;

• Optimized the Oracle data source: added support for sequences and for the CLOB/BLOB long text data types.
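A hedged FlinkSQL sketch of an HBase 2.x result table; the table, column family, and connection values are hypothetical, and the open-source Flink connector identifier 'hbase-2.2' is used here only for illustration (the platform wizard generates the equivalent configuration):

```sql
-- HBase 2.x result (sink) table: rowkey plus one column family declared as a ROW type.
CREATE TABLE user_profile_sink (
    rowkey STRING,
    cf ROW<age INT, city STRING>,
    PRIMARY KEY (rowkey) NOT ENFORCED
) WITH (
    'connector'        = 'hbase-2.2',
    'table-name'       = 'dim:user_profile',
    'zookeeper.quorum' = 'zk-host:2181'
);
```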

5. Dirty data management

User pain point: dirty data management previously supported only FlinkSQL tasks.

New feature description: real-time collection tasks now also support dirty data management.


6. Function optimization

• Task operations: added a list filter to support filtering by status, task type, owner, and more;

• Data development: optimized the layout of task operation buttons; the IDE editor supports auto-completion; real-time collection script mode supports comments.

Data Asset Platform

1. Data source

• New data source support: Greenplum, DB2, PostgreSQL (V5.3.0); Hive 3.x (Apache), Hive 3.x (CDP), TDSQL, StarRocks (V5.3.1)

• Automatic authorization support for meta data sources: Hive 3.x (Apache), Hive 3.x (CDP) (V5.3.0); TiDB (V5.3.1)

2. Data map

• Added indicators: indicators are now included in the data map as a type of asset on the asset platform;

• Kafka metadata optimization: the table structure is hidden for Kafka, and a partition query tab has been added;

• Tag filtering optimization: previously tags were not distinguished by entity and tag names could collide; tags now carry an "owning entity" attribute, and entity filtering has been added to the quick filter bar;

• Table tag optimization: when entering from the table dimension, "table tag" is displayed, while other dimensions display "tag"; tags of each dimension are isolated from one another, so entering from a given dimension no longer shows all tags.


3. API lineage

Implemented lineage links from table to API and from API to API.


4. Indicator/label lineage

In this release, lineage within indicators and tags is shown on the asset platform for the first time; the next release will add lineage from table to indicator and from table to tag.


5. Lineage optimization

• The TRUNCATE keyword is now recognized in lineage analysis: when a table's data is cleared with TRUNCATE, the lineage between tables and between tables and tasks is deleted;

• Self-referencing lineage and duplicate lineage are excluded;

• Solved lineage lines and table nodes overlapping each other: right-angled lineage lines are changed to curved gray lines; dragging is supported; the inbound and outbound lineage of the currently covered or clicked table is highlighted.


6. Data file governance

The data file governance functionality on the offline side has been migrated to the data governance module on the asset side, with optimizations and compatibility preserved. Governance rules include periodic governance and one-time governance.


7. Data file governance optimization and adjustment

• For periodic governance, "Select Project" has been changed to "Select Data Source"; the governance scope is the selectable meta data sources, and the drop-down list is sorted in reverse chronological order;

• For one-time governance, "Select Project" has been changed to "Select Data Source"; the governance scope is the Hive tables under the selectable meta data sources;

• Small file governance that runs longer than 3 hours is treated as failed; the timeout is now configurable via the configuration file and defaults to 3 hours;

• Storage usage statistics are now computed per file rather than per partition/table.


8. Initialization step removed from metadata synchronization

User pain point: before the V5.2 merge and refactor, metadata synchronization and data source management were separate functions. The original logic initialized database and table information after a data source was introduced and then looked it up during data synchronization, which consumed extra resources and storage, produced useless data, and led to problems such as slow loading of the asset inventory.

Experience optimization description: the initialization step after a data source is imported has been removed; database and table information in the data source is now queried in real time during metadata synchronization.

9. Optimization of metadata center coupling relationship

• Incremental SQL optimization: the basic metadata center supports independent deployment, but the incremental SQL previously could not;

• Product permission optimization: previously, when a customer had asset product permission, the indicator product could call the metadata center's data model without issue, but without asset permission the call was rejected with a "no permission" prompt.

10. Data source plug-in optimization

• Whole-database table synchronization parameters: if the actual database tables change and no parameters are passed, the data source plug-in now queries the table names in real time;

• Restarting binlog after it has been closed: previously the collection script stopped and was not woken up again; it is now woken up automatically on restart.

11. Function optimization

• Dirty data: the default retention period for dirty data governance is 90 days, the global prompt has been updated accordingly, and the scope of dirty data governance is the current project;

• Improved word-root matching accuracy: word roots and standards added through the interface are now added to the tokenizer, solving cases where a field's Chinese name was segmented in a way that prevented matching.

Customer Data Insight Platform

1. Built-in demos of securities, banking, and insurance tag systems

On entering the tag platform, the demo can be experienced through a pop-up window, or via the "View Demo" button at the top of the platform home page.


2. [Tag Management] Supports configuration of custom attributes

User pain point: currently, the information captured when creating a tag is fixed. Beyond a set of common attributes, customers in different industries need different tag metadata. For example, banking customers need to define a financial security level for tags, but this attribute does not apply to fund or retail customers, so it must be supported through custom tag attributes.

Description of new features:

• Set custom attributes on the "Tag Metadata" page, and view the metadata of both common and custom attributes on the list page;

• Added tag owner, business definition, and technical definition fields to the common attributes;

• Custom attributes can then be filled in when tags are created.


3. [Project Management] A handover person must be designated when removing the tag owner and other responsible roles

[Project Management] A handover person must be designated when removing a tag owner, task owner, alarm recipient, or group subscriber.


4. [Project Management] Hive tables and HBase tables support custom life cycles

• The wide tag table supports life cycle settings: all expired data can be deleted, or the data from a specific time in each cycle can be retained;


• Saved tag groups support life cycle settings: all expired data can be deleted, or the data from a specific time in each cycle can be retained;


• Physical tables in the management section support life cycle settings: all expired data can be deleted, or the data from a specific time in each cycle can be retained.


5. Data synchronization function optimization

• RowKey pre-partitioning optimization: HBase tables are now pre-partitioned by default with 30 partitions, removing the impact of job concurrency on the partition calculation;

• Job concurrency optimization: the allowed range for the number of concurrent jobs is now 1-100, to meet the need for higher data synchronization throughput;

• Dirty data threshold: when the amount of dirty data generated exceeds the configured threshold, the job stops synchronizing and is marked as failed; a value of 0 or empty means no dirty data is allowed.


6. [Tag API] Support querying tag results without specifying business date

User pain point: when querying data through the tag API, the API may fail to return data for the latest specified business date because the data synchronization task has not finished, which blocks the business. To keep the business running normally, HBase needs to keep a degraded backup copy of the data.

Experience optimization description: HBase keeps a backup copy of the most recently synchronized data for the latest business date that was successfully synchronized.

When calling the API, the business date is now an optional parameter:

(1) If a business date is specified, the system returns the data for that business date;

(2) If no business date is specified, the system returns the backup data.

7. Function optimization

• SQL optimization: fixed issues reading schemas whose names start with a digit;

• Tag directory: tags can be placed under both parent directories and subdirectories;

• API calls: added the pageNo field.

Indicator Management and Analysis Platform

1. [Indicator Management] Support life cycle settings


The indicator Hive table supports life cycle settings;


The indicator API supports life cycle settings.

2. [Indicator Management] Support batch publishing

Unpublished and offline non-custom-SQL indicators can be published in batches. After successful publication, these indicators can be queried in the indicator marketplace.

To learn more about Kangaroo Cloud's big data products, industry solutions, and customer cases, visit the Kangaroo Cloud official website: https://www.dtstack.com/?src=szkyzg

Developers interested in big data open source projects are also welcome to join the "Kangaroo Cloud Open Source Framework DingTalk Technology Group" to exchange the latest open source technology information. Group number: 30537511; project address: https://github.com/DTStack
