Kangaroo Cloud Data Middle Platform Column V2.0 | Data integration for the data middle platform


About the Kangaroo Cloud Data Middle Platform Column V2.0

How is a data middle platform defined? What is the relationship between enterprise digitalization and the data middle platform? How does the data middle platform support the strategic transformation of an enterprise? Over the past two years, Kangaroo Cloud has provided data middle platform consulting and implementation services to dozens of large enterprises. Along the way we have accumulated a wealth of practical experience and, while serving customers, have kept refining our own theoretical system and practical methodology for the data middle platform. We hope that by sharing the articles that follow we can exchange ideas with readers and jointly accelerate enterprise digitalization. This column is updated with 1-2 articles per week, so stay tuned.

Data integration for the data middle platform

1

In modern enterprises, differences in usage scenarios, business forms, technology choices, and development frameworks often lead to multiple heterogeneous information systems running on different software and hardware platforms. The data sources of these systems are independent of and closed to one another, which makes it difficult to exchange, share, and integrate data between systems and creates "information islands". As information technology is applied more deeply within the enterprise and the demand for information exchange with the outside grows, there is an urgent need to integrate existing information, connect the "information islands", and share information.

When an enterprise builds a data middle platform to solve data interoperability and data sharing, "data integration" is the bridge and pipeline that connects the information systems to the data middle platform. It is an important foundation for connecting data across the whole domain.

image
Connecting data across all systems

Data integration, as discussed in this article, mainly refers to the stage in which data held in different storage media is synchronized into the data middle platform. In some scenarios it may also be called "data collection", "data synchronization", or "moving data to the cloud".

2 Preparations

Before starting data integration development, we generally investigate and prepare the following:

  • Data source categories: for the data sources upstream of the data middle platform, identify the type of each data source and the timeliness requirements of its data, in order to decide which collection technology components to use
  • Network and environment: determine the network environment of the data sources and, depending on the integration approach, make any necessary optimization and changes to the existing network environment

image

  • Data content: survey the full data volume, the incremental volume, and the data distribution
  • Data quality: survey the incremental markers, indexes, primary key information, and so on
  • Data scope: survey the range of data to be integrated and select the relevant data to bring into the data middle platform, generally data that supports business processes or carries business attributes
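The checklist above can be captured as one record per data source. The following is a minimal sketch, not part of any Kangaroo Cloud product: the `DataSourceSurvey` fields and the `suggest_sync_mode` rules are hypothetical names and assumptions chosen only to mirror the five survey items and the synchronization categories discussed later in this article.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DataSourceSurvey:
    """One survey record per source system (all field names are hypothetical)."""
    name: str
    source_type: str                   # data source category, e.g. "mysql", "api"
    realtime_required: bool            # timeliness requirement
    network_reachable: bool            # network and environment
    full_rows: int                     # data content: full volume
    daily_increment_rows: int          # data content: incremental volume
    primary_key: Optional[str]         # data quality: primary key
    incremental_column: Optional[str]  # data quality: incremental marker
    in_scope: bool                     # data scope


def suggest_sync_mode(survey: DataSourceSurvey) -> str:
    """Map a survey record to a synchronization category (illustrative rules)."""
    if not survey.in_scope:
        return "skip"
    if not survey.network_reachable:
        return "blocked: fix network first"
    if survey.source_type == "api":
        return "api"
    if survey.realtime_required:
        return "realtime"
    if survey.incremental_column:
        return "offline-incremental"
    return "offline-full"


orders = DataSourceSurvey(
    name="orders", source_type="mysql", realtime_required=False,
    network_reachable=True, full_rows=50_000_000, daily_increment_rows=200_000,
    primary_key="id", incremental_column="updated_at", in_scope=True,
)
print(suggest_sync_mode(orders))
```

Keeping the survey in a structured form like this makes it easier to review which sources still lack a primary key or an incremental marker before any synchronization task is configured.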

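3 Business architecture

Based on the business content being collected and the common categories of synchronization, we have organized the business architecture of data integration as follows: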

image
Business architecture of data integration

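4 Integration process

The following typical data synchronization scenarios illustrate the data synchronization process.

4.1 Offline synchronization of relational databases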

image

4.2 API data synchronization

image

4.3 Real-time data synchronization

image

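5 Kangaroo Cloud DTinsight: data synchronization module

The data synchronization module is the pipeline that exchanges data between storage units.

To mine and compute over large-scale datasets in DTinsightIDE, the usual practice is to transfer the data into DTinsightIDE before a task runs and, once the task has finished, to transfer the results to an external storage unit (for example an application database such as MySQL).

The role of data integration is shown in the figure below: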

image

Kangaroo Cloud DTinsight data synchronization module

The Kangaroo Cloud DTinsight data synchronization module has the following features:

  • Rich data source support
    The data synchronization module can read from and write to data sources including MySQL, Oracle, SQLServer, PostgreSQL, HDFS, Hive, HBase, FTP, ElasticSearch, ODPS, Redis, and MongoDB. To use it, you only need to configure the connection information for the data source (for example the JDBC URL, username, and password of an Oracle database) and then configure the corresponding data synchronization task.
  • Distributed system architecture
    The data synchronization module is built on a distributed system architecture (FlinkX[1]) in which multiple nodes read and write data concurrently, which greatly increases synchronization throughput. Compared with open-source data synchronization tools such as Sqoop and Kettle, it offers higher data throughput and supporting features.
  • Visual configuration
    Users can quickly create and configure synchronization tasks through a visual interface, including selecting the source database and source table, selecting the target database and target table, configuring field mappings, and configuring synchronization speed.
  • Full / incremental synchronization
    When reading data from a business system, the impact on that system must be kept to a minimum, so data usually needs to be synchronized incrementally. Where the source table contains a data change-time field, incremental synchronization of relational databases is supported; the user only needs to enter the appropriate data filter statement.
  • Synchronization speed control
    The synchronization speed can be controlled by setting an upper limit on the rate. The value needs to be tuned to the hardware configuration and the data volume, and users set it according to business needs.
  • Dirty data management
    Dirty data can be recorded: you can specify the storage table and life cycle of dirty data, and configure a task to fail when the amount of dirty data exceeds a given count or percentage. This prompts users to investigate dirty data problems promptly, and analysis reports are generated.
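Three of the features above (incremental synchronization on a change-time field, an upper limit on synchronization speed, and a dirty data threshold that fails the task) can be sketched together in a few lines. This is an illustration under stated assumptions, not the DTinsight or FlinkX implementation: Python's built-in `sqlite3` stands in for both the business database and the target store, and the `orders` / `orders_dirty` tables, the `is_dirty` rule, and every function and parameter name are hypothetical.

```python
import sqlite3
import time


class DirtyDataLimitExceeded(Exception):
    """Raised when the share of dirty rows exceeds the configured threshold."""


def is_dirty(row):
    # Hypothetical rule: a row without an amount counts as dirty.
    return row[1] is None


def sync_incremental(source, target, since,
                     max_rows_per_sec=1000.0, max_dirty_ratio=0.1):
    """Copy rows changed after `since`; return (clean_count, dirty_count)."""
    # Incremental filter on the change-time field of the source table.
    rows = source.execute(
        "SELECT id, amount, updated_at FROM orders "
        "WHERE updated_at > ? ORDER BY updated_at",
        (since,),
    ).fetchall()
    clean = [r for r in rows if not is_dirty(r)]
    dirty = [r for r in rows if is_dirty(r)]

    # Record dirty rows first so they survive even if the task fails.
    target.executemany("INSERT INTO orders_dirty VALUES (?, ?, ?)", dirty)
    target.commit()
    if rows and len(dirty) / len(rows) > max_dirty_ratio:
        raise DirtyDataLimitExceeded(f"{len(dirty)} of {len(rows)} rows are dirty")

    # Speed control: never write faster than max_rows_per_sec.
    min_interval = 1.0 / max_rows_per_sec
    for row in clean:
        started = time.monotonic()
        target.execute("INSERT OR REPLACE INTO orders VALUES (?, ?, ?)", row)
        remaining = min_interval - (time.monotonic() - started)
        if remaining > 0:
            time.sleep(remaining)
    target.commit()
    return len(clean), len(dirty)


def demo(max_dirty_ratio=0.5):
    """Build two in-memory databases and run one incremental sync."""
    source = sqlite3.connect(":memory:")
    target = sqlite3.connect(":memory:")
    ddl = "(id INTEGER PRIMARY KEY, amount REAL, updated_at TEXT)"
    source.execute("CREATE TABLE orders " + ddl)
    target.execute("CREATE TABLE orders " + ddl)
    target.execute(
        "CREATE TABLE orders_dirty (id INTEGER, amount REAL, updated_at TEXT)")
    source.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
        (1, 10.0, "2019-05-01 08:00:00"),   # older than `since`, skipped
        (2, 20.0, "2019-05-02 09:30:00"),   # changed and clean
        (3, None, "2019-05-02 10:00:00"),   # changed but dirty
    ])
    source.commit()
    return sync_incremental(source, target, since="2019-05-02 00:00:00",
                            max_dirty_ratio=max_dirty_ratio)


print(demo())
```

In this sketch the dirty rows are committed before the threshold check, so a failed task still leaves them available for investigation. A real deployment would distribute the reads and writes across nodes rather than throttle a single loop.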

More from this column

About the Kangaroo Cloud Data Middle Platform Column V2.0 series

Enterprise data cognition: data is productivity!
Three realms of the enterprise: business interface, application interface, data interface
The three paradigms of enterprise digitalization
Overall architecture of enterprise digitalization (data interface)
Revisiting the data middle platform: looking at it from three dimensions
Data sources of the data middle platform

About Kangaroo Cloud

Kangaroo Cloud is a provider of total solutions for enterprise digitalization and an advocate and leader of data middle platform architecture. It connects the data supply chain and builds the data-driven engine of the enterprise, accelerating enterprise digitalization so that data becomes an enterprise's core competitiveness. DTSTACK.COM
Data intelligence: bringing the future into the present


Origin yq.aliyun.com/articles/704530