20. Data integration, data consolidation, data fusion

This is my doctoral research topic. I have a rough idea of the research content and direction, but I can't find a suitable term for it yet.

What data integration means

Data integration has been studied for a long time, although there are still many problems worth researching. But a Baidu search mostly turns up application steps, application methods, and even advertisements from Alibaba and Microsoft.

Definition: Data integration brings interrelated, distributed, heterogeneous data sources together so that users can access them in a transparent manner.

"Integration" means maintaining overall consistency across the data sources and improving the efficiency of information sharing and utilization;

"Transparent" means that users do not need to care about how access to the heterogeneous data sources is implemented; they only care about which data they access and in what way.
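To make the "transparent access" idea concrete, here is a minimal Python sketch (my own illustration, not from the original post) of the wrapper/mediator pattern: two hypothetical customer sources, one a CSV file and one a SQLite table, are hidden behind a single query interface, so the caller never needs to know where or how the records are stored. All file names, table names, and columns are invented for the example.

```python
# Minimal wrapper/mediator sketch: two hypothetical "customer" sources,
# one a CSV file and one a SQLite table, exposed behind a single interface
# so the caller never sees where or how the records are stored.
import csv
import sqlite3


class CsvCustomerSource:
    """Wrapper around a CSV file with columns: id, name, city."""

    def __init__(self, path):
        self.path = path

    def customers(self):
        with open(self.path, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                # Map the local column names onto the mediator's global schema.
                yield {"id": row["id"], "name": row["name"], "city": row["city"]}


class SqliteCustomerSource:
    """Wrapper around a SQLite table: customer(cust_id, cust_name, cust_city)."""

    def __init__(self, db_path):
        self.db_path = db_path

    def customers(self):
        con = sqlite3.connect(self.db_path)
        try:
            rows = con.execute("SELECT cust_id, cust_name, cust_city FROM customer")
            for cust_id, cust_name, cust_city in rows:
                yield {"id": str(cust_id), "name": cust_name, "city": cust_city}
        finally:
            con.close()


class Mediator:
    """Single access point: users query this, never the individual sources."""

    def __init__(self, sources):
        self.sources = sources

    def find_by_city(self, city):
        return [c for s in self.sources for c in s.customers() if c["city"] == city]


# Hypothetical usage:
# mediator = Mediator([CsvCustomerSource("crm.csv"), SqliteCustomerSource("erp.db")])
# print(mediator.find_by_city("Beijing"))
```

A real integration system would add a global schema, query rewriting, and many more source types, but the division into per-source wrappers plus a mediator is the same idea.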

Data integration difficulties:

(1) Heterogeneity: The data sources to be integrated are usually developed independently, and their data models are heterogeneous, which makes integration difficult. The heterogeneity mainly shows up in data semantics, the form in which data with the same semantics is expressed, and the environments in which the data sources are used.

(2) Distribution: The data sources are located in different places and rely on the network to transmit data, so there are problems such as network transmission performance and security.

(3) Autonomy: Each data source has strong autonomy; it can change its own structure and data without notifying the integration system, which challenges the robustness of the integration system.
--------------------- 
Author: raymond_lan 
Source: CSDN 
Original: https://blog.csdn.net/raymond_lan/article/details/80302870 
Copyright statement: This is an original article by the blogger; please include a link to the post when reprinting.

In data preprocessing, it is often necessary to integrate the data from multiple data sets into a data warehouse; that is, the databases need to be integrated. At the same time, in order to mine the data in the warehouse more effectively, the data inevitably has to be transformed. This article mainly discusses the two issues of data integration and data transformation.
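As a small illustration of the data transformation mentioned above (not part of the original text), a typical step before mining is rescaling numeric attributes, for example min-max normalization. A minimal sketch with made-up numbers:

```python
# Min-max normalization: rescale a numeric attribute to [0, 1] before mining.
def min_max_normalize(values):
    lo, hi = min(values), max(values)
    if hi == lo:                      # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


incomes = [3200, 5800, 12000, 7400]   # hypothetical raw attribute values
print(min_max_normalize(incomes))     # [0.0, 0.295..., 1.0, 0.477...]
```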

Data integration needs to focus on solving three problems when merging multiple databases into one: schema matching, data redundancy, and data value conflicts. Because of naming differences, data from multiple data sets may use different names for equivalent entities, which poses a challenge for integration. Matching entities that come from different sources is the first problem data integration faces; this is the entity identification problem, and it mainly relies on metadata to tell entities apart.
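As a rough, hypothetical illustration of this matching step, a first pass often compares attribute names (one piece of metadata) for similarity, before data types, value ranges, or a person confirm each correspondence. The schemas and threshold below are invented for the example:

```python
# Toy schema-matching pass: pair attributes from two sources whose names look
# similar; metadata such as data types and value ranges (or a person) would
# then confirm or reject each candidate correspondence.
from difflib import SequenceMatcher

schema_a = ["cust_id", "cust_name", "birth_date", "city"]        # invented
schema_b = ["customer_id", "customer_name", "dob", "addr_city"]  # invented


def match_attributes(cols_a, cols_b, threshold=0.5):
    matches = []
    for a in cols_a:
        best = max(cols_b, key=lambda b: SequenceMatcher(None, a, b).ratio())
        score = SequenceMatcher(None, a, best).ratio()
        if score >= threshold:
            matches.append((a, best, round(score, 2)))
    return matches


for a, b, score in match_attributes(schema_a, schema_b):
    print(f"{a:12s} <-> {b:15s} similarity={score}")
# "birth_date" finds no name-based match here, which is exactly where richer
# metadata (types, value ranges) has to take over.
```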

Data redundancy may come from inconsistencies in attribute naming. To detect redundancy among numeric attributes, the Pearson product-moment correlation coefficient r(A,B) can be used; it takes values in [-1, 1]. If it is greater than zero the attributes are positively correlated, otherwise negatively correlated, and the larger the absolute value, the stronger the correlation between the two. For discrete data, the chi-square test can be used to detect the association between two attributes.
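A minimal sketch of both checks, assuming NumPy and SciPy are available; the attribute values and contingency counts are made up for illustration:

```python
# Redundancy checks between attributes: Pearson's r for numeric attributes,
# the chi-square test for discrete ones.
import numpy as np
from scipy import stats

# Two numeric attributes that describe the same thing in different units
# (hypothetical values): height in centimetres and in inches.
height_cm = np.array([160.0, 172.0, 168.0, 181.0, 175.0])
height_in = np.array([63.0, 67.7, 66.1, 71.3, 68.9])
r, p_value = stats.pearsonr(height_cm, height_in)
print(f"Pearson r = {r:.3f}")   # close to 1: strongly correlated, likely redundant

# Two discrete attributes: an observed contingency table of co-occurrence counts.
#                       buys_computer=yes   buys_computer=no
# is_student=yes              150                  50
# is_student=no               100                 200
table = np.array([[150, 50], [100, 200]])
chi2, p, dof, expected = stats.chi2_contingency(table)
print(f"chi2 = {chi2:.1f}, p = {p:.4f}")   # small p: the attributes are associated
```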

The last important problem in data integration is data value conflict: the same entity from different sources may have different attribute values, for example because the sources use different representations, scales, or units.
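As a hypothetical sketch of such a conflict (not from the original post), suppose one source records weight in kilograms and another in pounds; the values have to be converted to a common unit before merging, and records that still disagree have to be flagged for a resolution rule:

```python
# Resolve a simple value conflict: "weight" is stored in kilograms in source A
# and in pounds in source B; convert to a common unit before merging, and flag
# records that still disagree beyond a tolerance. All values are invented.
LB_TO_KG = 0.45359237

source_a = {"p001": 70.0, "p002": 82.5}      # kilograms
source_b = {"p001": 154.3, "p002": 190.0}    # pounds

merged, conflicts = {}, []
for key in source_a.keys() & source_b.keys():
    kg_a = source_a[key]
    kg_b = source_b[key] * LB_TO_KG
    if abs(kg_a - kg_b) <= 0.5:              # tolerate small rounding differences
        merged[key] = round((kg_a + kg_b) / 2, 2)
    else:
        conflicts.append((key, kg_a, round(kg_b, 2)))   # needs a resolution rule

print("merged:", merged)        # p001 merges to about 70 kg
print("conflicts:", conflicts)  # p002: 82.5 kg vs about 86.2 kg -> conflict
```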

What data consolidation means

Data consolidation is the process of sharing or merging data from two or more applications to create a more functional enterprise application. Traditional business applications are strongly object-oriented; that is, they rely on persistent data structures to model business entities and processes. When this is the case, the logical way to integrate is through data sharing or merging; in other cases, data from one application may be restructured to match the data structure of another application and then written directly into the other application's database. [2]

1. Decentralized data and information systems

After years of development of informatization in China [4], many computer information systems and database systems have been built, and a large amount of basic data has been accumulated. However, because these rich data resources were built at different times, by different departments, on different equipment, and at different stages of technology and levels of skill, data storage and management are extremely scattered. This leads to excessive data redundancy and data inconsistency, makes data resources hard to query and access, and leaves management without effective data support for decision-making. Managers often have to access many different systems to understand the situation of the departments under their jurisdiction, and the data cannot be directly compared or analyzed.

2. The utilization of information resources is low

Some information systems are poorly integrated and interconnected, information management is scattered, and there are large gaps in data integrity, accuracy, and timeliness [1]. Some organizations have built intranets and Internet access, but the information systems developed or introduced piecemeal over the years cannot provide a unified data interface for their large volumes of data, do not follow common standards and specifications, and cannot draw on a shared common data source, so the different application systems inevitably become isolated islands of information. There is a lack of a shared, networked, highly available information resource system.

3. Low ability to support management decision-making

At the same time, as the number of computerized services grows, managers' operations become more and more complex [4], and many increasingly complex intermediate business processing steps still rely more or less on manual work. Information processing and analysis methods are weak: data cannot be collected directly from the various business information systems at every level and used comprehensively; external information cannot be gathered and fed back in a timely and accurate way; and the large amounts of data generated by business systems cannot be refined into useful information and delivered to management decision-makers in time. The existing business information system platforms and development tools are incompatible with one another and cannot be applied broadly.

The degree of data sharing cannot meet the organization's requirements for developing and using its information resources as a whole. There are many simple applications with a great deal of overlap and duplication, few applications that can support management and decision-making, and even fewer that can use the network to carry out business activities. The data holds huge information resources, but they have not been fully exploited with effective tools, and the value-adding role of information resources has not been brought into play in management decision-making.

What data fusion means

Why is data fusion needed?

One of the most important reasons is that user data is fragmented, so no single party can paint the full picture of a user. For example, your shopping data is on JD.com and Tmall, your call data is with China Mobile and China Telecom, your transaction data is with banks and financial institutions, your social data is in Tencent's WeChat, your search data is in Baidu, and so on.

Fragmented data leads to a one-sided understanding of users, and wrong decisions may follow. For example, the "Jingtiao" project between JD.com and Toutiao is a data-cooperation case: items you search for on JD.com are shown to you from time to time in Toutiao, which raises the purchase rate. The flaw is that if you have already bought the item on Taobao, the ad still appears, and the user experience suffers.

Another value of data fusion is discovering new rules and new value. For example, user credit used to be judged mainly by whether there was a history of loan default, but many people have no loan records at all, so how can they be evaluated? Sesame Credit innovatively fuses online data, identity characteristics, behavior preferences, social relationships and other lifestyle attributes to profile a user's creditworthiness. This is the value of data fusion.

Fusing data from different industries makes it complementary and complete, and effectively increases the intrinsic value of the data.
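As a toy illustration (my own, not from the quoted article), fusing fragmented per-source records keyed by a shared user identifier into one fuller profile can be as simple as merging attribute dictionaries; the sources, identifiers, and attributes below are invented:

```python
# Fuse fragmented views of the same user from several hypothetical sources
# into one profile, keyed by a shared identifier.
from collections import defaultdict

shopping = {"u42": {"orders_last_year": 31, "favorite_category": "books"}}
telecom = {"u42": {"monthly_call_minutes": 420}}
social = {"u42": {"contacts": 280, "groups": 12}}


def fuse_profiles(*sources):
    profiles = defaultdict(dict)
    for source in sources:
        for user_id, attributes in source.items():
            profiles[user_id].update(attributes)   # each source adds its attributes
    return dict(profiles)


print(fuse_profiles(shopping, telecom, social))
# {'u42': {'orders_last_year': 31, 'favorite_category': 'books',
#          'monthly_call_minutes': 420, 'contacts': 280, 'groups': 12}}
```

In practice the hard part is precisely the shared identifier: the same person rarely carries the same key across organizations, which is why entity recognition matters so much for fusion.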

Quote: https://baijiahao.baidu.com/s?id=1569437547573684&wfr=spider&for=pc

Data fusion technology covers the collection, transmission, synthesis, filtering, correlation and combination of useful information from various sources, so as to assist people in situation and environment assessment, planning, detection, verification, and diagnosis. This is extremely important for acquiring useful battlefield information in a timely and accurate way, evaluating the battlefield situation, threats, and their severity completely and in time, and supporting tactical and strategic decision-making and the command and control of combat forces. The future battlefield changes rapidly, and the factors affecting decisions are ever more numerous and complex; commanders are required to make the most accurate judgment of the battlefield situation in the shortest possible time and to exercise the most effective command and control over their forces. Achieving this series of "mosts" requires the most advanced data processing technology as a basic guarantee. Otherwise, no matter how capable military leaders and commanders are, they will be overwhelmed by the vast amount of data, which may lead to misjudgment, or delay decisions and lose opportunities for combat, with disastrous consequences.

Data-level fusion

Data-level fusion is performed directly on the collected raw data: synthesis and analysis are carried out before the raw observations from the various sensors have been preprocessed. Data-level fusion generally uses a centralized fusion architecture. This is low-level fusion. For example, in an imaging sensor, confirming a target's attributes by processing a blurred image containing a few pixels is data-level fusion.
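A minimal sketch of the idea, under simplifying assumptions (all readings measure the same quantity and each sensor's noise variance is known): combine the raw readings with an inverse-variance weighted average. The numbers are made up for illustration.

```python
# Data-level fusion sketch: combine raw readings of the same quantity from
# several sensors, weighting each sensor by the inverse of its noise variance.
import numpy as np

readings = np.array([20.1, 19.6, 20.4])    # raw temperature readings (invented)
variances = np.array([0.04, 0.25, 0.09])   # each sensor's noise variance (invented)

weights = 1.0 / variances
fused = np.sum(weights * readings) / np.sum(weights)
fused_variance = 1.0 / np.sum(weights)

print(f"fused estimate = {fused:.2f}, fused variance = {fused_variance:.4f}")
```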

Feature-level fusion

Feature-level fusion is fusion at an intermediate level. It first extracts features from each sensor's raw information (features can be the target's edges, direction, speed, and so on), and then comprehensively analyzes and processes the feature information. The advantage of feature-level fusion is that it achieves considerable information compression, which helps real-time processing, and because the extracted features relate directly to decision analysis, the fusion result can provide, to the greatest possible extent, the feature information that decision analysis needs. Feature-level fusion generally adopts a distributed or centralized fusion architecture. It can be divided into two categories: target state fusion and target feature fusion.
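A toy sketch of this idea (my own illustration): each sensor's raw signal is reduced locally to a few features, and only the concatenated feature vector is passed on for analysis. The signals and features are invented.

```python
# Feature-level fusion sketch: each sensor's raw signal is reduced to a few
# features locally, and only the concatenated features go on to analysis.
import numpy as np


def extract_features(signal):
    # Toy features: mean level, spread, and peak value of the raw signal.
    return np.array([signal.mean(), signal.std(), signal.max()])


radar_signal = np.array([0.2, 0.4, 0.9, 0.5, 0.3])      # invented raw samples
infrared_signal = np.array([1.1, 1.3, 1.2, 1.8, 1.6])   # invented raw samples

fused_features = np.concatenate(
    [extract_features(radar_signal), extract_features(infrared_signal)]
)
print(fused_features)   # a 6-dimensional fused feature vector for a classifier
```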

Decision-level fusion

In decision-level fusion, different types of sensors observe the same target, and each sensor completes basic processing locally, including preprocessing, feature extraction, and recognition or judgment, to reach a preliminary conclusion about the observed target. The decisions are then fused through association processing, and finally a joint inference result is obtained.
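A minimal sketch, under the assumption that each sensor reports a local label together with a confidence: fuse the decisions by confidence-weighted voting. The sensor names, labels, and confidences are made up.

```python
# Decision-level fusion sketch: each sensor reports a local classification of
# the same target plus a confidence; fuse the decisions by weighted voting.
from collections import defaultdict

local_decisions = [          # (sensor, local label, confidence) - all invented
    ("radar", "aircraft", 0.8),
    ("infrared", "aircraft", 0.6),
    ("acoustic", "decoy", 0.7),
]

scores = defaultdict(float)
for sensor, label, confidence in local_decisions:
    scores[label] += confidence

fused_label = max(scores, key=scores.get)
print(fused_label)   # "aircraft" wins with the highest total confidence
```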

The concept of data fusion comes from the military field, and sensor data fusion is currently its most widespread application. But that is not what I want to do.

 

Moreover, the content published on Baidu is all application-oriented; for scientific research, Baidu cannot help you. In any case, doctoral study is different from master's study, and in the end you have to rely on yourself.

Having pieced the concepts together, let me finally say what I want to study. Roughly, it should be deep data fusion built on top of data integration, with the goal of increasing the value of data, taking static data such as social network data or management data as the object of processing, and mainly addressing problems such as data heterogeneity, data mapping, entity recognition, feature recognition, and credibility judgment.
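For the credibility-judgment problem, one family of methods is truth discovery: iteratively, a claimed value is more believable if reliable sources support it, and a source is more reliable if its claims tend to win. Below is a deliberately simplified sketch of that idea (my own illustration, not a method from the post), with invented sources and facts:

```python
# Deliberately simplified truth-discovery sketch: sources claim values for
# facts; a value is more believable if reliable sources claim it, and a source
# is more reliable if the values it claims tend to win. All data is invented.
claims = {                                    # fact -> {source: claimed value}
    "population_of_X": {"s1": "1.2M", "s2": "1.2M", "s3": "3.0M"},
    "founded_year_Y": {"s1": "1987", "s2": "1987", "s3": "1987"},
    "ceo_of_Z": {"s1": "Alice", "s2": "Bob", "s3": "Alice"},
}

reliability = {s: 0.5 for s in ("s1", "s2", "s3")}    # start all sources equal

for _ in range(10):                                   # a few rounds suffice here
    # Step 1: pick, for each fact, the value with the most reliability behind it.
    truths = {}
    for fact, by_source in claims.items():
        votes = {}
        for src, value in by_source.items():
            votes[value] = votes.get(value, 0.0) + reliability[src]
        truths[fact] = max(votes, key=votes.get)
    # Step 2: a source's reliability is the fraction of facts where it agreed
    # with the currently accepted value.
    for src in reliability:
        wins = sum(claims[f].get(src) == truths[f] for f in claims)
        reliability[src] = wins / len(claims)

print(truths)        # the best-supported value for each fact
print(reliability)   # s1 ends up fully reliable; s2 and s3 each disagreed once
```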

Let me piece these together first; I have to hurry up and prepare for the doctoral entrance exam.


Origin: https://blog.csdn.net/u010752777/article/details/84330619