Literature reading notes (five)

2019 Journal of Web Semantics: "Linking and disambiguating entities across heterogeneous RDF graphs", organized notes

First, the paper's line of thought

1.1 Related research papers

CBD (Concise Bounded Description)
Reduces the difficulty of manually choosing data linking parameters (e.g., which property to select as an identifier)
A classification of data heterogeneities, grounded in a large number of semantic and practical examples of instances
A CBD-based instance profiling framework for representing and comparing resources in the matching stage
A new strategy for automatically identifying and removing "problematic" properties between the two datasets (properties that are unsuitable as identifiers)
Extensive tests on a large number of open benchmarks
An open-source system with a simple interactive interface
Classification of heterogeneous data, proposed here for the first time: building on previous studies, the paper focuses on the different forms of heterogeneity found between the descriptions in two datasets, at the level of values (property or class) and of structure. It pays particular attention to synthetic benchmarks and to highly heterogeneous real-world classical music data, and uses many of the datasets produced for IM@OAEI.

1.2 Problem addressed by the paper

1.3 Problem-solving process of the paper

1) Value dimension heterogeneity: terminological heterogeneity (caused by synonymy, ambiguity and varying word forms, plus a small number of spelling mistakes), linguistic heterogeneity (descriptions written in or translated into different languages), and datatype vs. object property heterogeneity (the same information may be given as a text literal in one dataset and as a URL in another)

2) Ontological (entity-level) dimension heterogeneity: structural heterogeneity (caused by different levels of granularity), property depth heterogeneity (the same information may sit at different distances from the resource in different graphs), descriptive heterogeneity (an instance may be described with more information in one dataset than in the other), and key heterogeneity

3) Logical dimension heterogeneity: class heterogeneity and property heterogeneity

4) Data quality dimension heterogeneity: data type heterogeneity and dataset consistency

Data linking comprises preprocessing (setting parameters, processing the data), matching, and post-processing (removing erroneous links, inserting new links). The paper concentrates on the steps that come before the actual instance comparison, i.e. on simplifying and automating the preprocessing stage.
Property selection and classification: many of the keys produced by current automatic key discovery cannot be used by the system as identifiers. Measuring the effectiveness of a key is therefore very important, so that the key most useful for linking the two datasets is selected as the identifier.
Link specification: defines how elements of the two datasets are compared, including the similarity measures, how they are combined into a complex similarity criterion, and the similarity thresholds.
Data linking with Legato: the system takes two RDF graphs as input, preprocesses them automatically, then runs instance matching and instance disambiguation, and outputs the selected set of links as the final result.
Related definitions:

1) In these notes, "resource" or "instance" denotes the identifier of an entity (usually the subject s of a triple)

2) RDF data key: consider two resources s1, s2 and their predicates (properties). A property set P is a key if any two resources that agree on every property in P are the same resource. The set of all keys is K = {P : P ⊆ pred(G), ∀ s1, s2 ∈ subj(G), (∀p ∈ P, p(s1) = p(s2)) ⇒ s1 = s2}
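
The key definition can be illustrated with a minimal sketch. It assumes the graph is given as a plain list of (subject, predicate, object) tuples; the function name `is_key`, the toy data, and the choice to skip subjects that lack a value for some property are all illustrative assumptions, not details taken from the paper.

```python
from collections import defaultdict

def is_key(triples, props):
    """Return True if no two distinct subjects agree on every property in props."""
    values = defaultdict(lambda: defaultdict(set))
    for s, p, o in triples:
        values[s][p].add(o)

    seen = {}
    for s, by_prop in values.items():
        # Signature of s: its value set for each property, in a fixed order.
        signature = tuple(frozenset(by_prop.get(p, set())) for p in sorted(props))
        # Assumption of this sketch: subjects missing a property are ignored.
        if any(not vals for vals in signature):
            continue
        if signature in seen:
            return False  # two different subjects share the same values
        seen[signature] = s
    return True

# Hypothetical toy graph: two works with the same title but different composers.
triples = [
    ("ex:w1", "title", "Sonata"), ("ex:w1", "composer", "Mozart"),
    ("ex:w2", "title", "Sonata"), ("ex:w2", "composer", "Haydn"),
]
print(is_key(triples, {"title"}))
print(is_key(triples, {"title", "composer"}))
```

In this toy graph the title alone does not separate the two works, while title and composer together do, which is the situation that makes key selection matter for linking.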

CBD: a subgraph of the RDF graph. For a resource r, CBD(r) is the subgraph containing all triples whose subject s is r and, whenever the object o of such a triple is a blank node, the triples whose subject s is that blank node.
Data linking: find all equivalent resources between two RDF graphs.
Predecessors of CBD(r): triples in which r is the object o.
Successors of CBD(r): triples reached by following r as the subject s.
↑CBD(r): CBD(r) together with all of its predecessors.
↓CBD(r): CBD(r) together with all of its successors.
↕CBD(r): CBD(r) together with all of its predecessors and successors.
CBD*(r): stands for any of the variants above, with the triples it contains.
Instance profiling: the set of textual components (literals) of an RDF graph G is L(G); the profile f(r) of an instance r is the set of all elements of L(G) that belong to CBD*(r).
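
The CBD variants and the instance profile can be sketched on a toy graph. This is only an illustration under stated assumptions: triples are plain tuples, blank nodes start with "_:", resources start with "ex:", and everything else is a literal. The notes leave the exact reading of "successor" open; here predecessors are read as the triples pointing at r, and successors as the CBDs of the resources r points to. All function names are hypothetical.

```python
def is_blank(node):
    return node.startswith("_:")

def is_literal(node):
    # Toy convention: anything that is neither a blank node nor an "ex:" name.
    return not (node.startswith("_:") or node.startswith("ex:"))

def cbd(triples, r):
    """CBD(r): triples with subject r, followed recursively through blank-node objects."""
    result, stack, visited = set(), [r], set()
    while stack:
        node = stack.pop()
        if node in visited:
            continue
        visited.add(node)
        for s, p, o in triples:
            if s == node:
                result.add((s, p, o))
                if is_blank(o):
                    stack.append(o)
    return result

def up_cbd(triples, r):
    """CBD(r) plus the triples that have r as object (assumed reading of 'predecessors')."""
    return cbd(triples, r) | {t for t in triples if t[2] == r}

def down_cbd(triples, r):
    """CBD(r) plus the CBDs of the resources r points to (assumed reading of 'successors')."""
    base = cbd(triples, r)
    out = set(base)
    for _, _, o in base:
        if not is_literal(o) and not is_blank(o):
            out |= cbd(triples, o)
    return out

def profile(triples, r, variant=cbd):
    """f(r): the literals found in the chosen CBD variant of r."""
    return {o for _, _, o in variant(triples, r) if is_literal(o)}

G = [
    ("ex:work1", "title", "Moonlight Sonata"),
    ("ex:work1", "composer", "ex:beethoven"),
    ("ex:work1", "key", "_:b1"),
    ("_:b1", "label", "C-sharp minor"),
    ("ex:beethoven", "name", "Ludwig van Beethoven"),
    ("ex:perf1", "performs", "ex:work1"),
]
print(profile(G, "ex:work1"))
print(profile(G, "ex:work1", down_cbd))
```

With the plain CBD the composer's name is out of reach because it sits one hop away; the extended variant pulls it into the profile, which is the kind of property depth difference the extended CBDs are meant to absorb.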
Legato component modules:

1) Property filtering: filters out properties that are unsuitable as identifiers, keeping only properties that can identify resources in both sources

2) Main matching module: CBD-based instance profiling, mapping of the instance profiles to a vector space (the vectors are restricted and weighted), and vector-based instance matching

Instance disambiguation module: takes the vector space as input, groups instances by a similarity criterion (highly similar instances are clustered together), and from this generates a candidate link set.
Link merging: every link l = (rs, rt) between a source resource rs and a target resource rt is placed in the candidate link set; the confirmed link set is then searched for a link l' = (rs, r't). If one is found, l is deleted from the candidate set.
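
The link merging rule can be sketched as follows. This is a minimal reading of the rule, assuming links are (source, target) pairs and that a candidate is dropped when the confirmed set already links the same source resource to a different target; the function name `merge_links` and the sample identifiers are hypothetical.

```python
def merge_links(candidates, confirmed):
    """Keep only candidate links whose source is not already confirmed with another target."""
    confirmed_targets = {}
    for rs, rt in confirmed:
        confirmed_targets.setdefault(rs, set()).add(rt)

    kept = set()
    for rs, rt in candidates:
        # Targets already confirmed for rs, other than rt itself.
        conflicting = confirmed_targets.get(rs, set()) - {rt}
        if not conflicting:
            kept.add((rs, rt))
    return kept

candidates = {("s1", "t1"), ("s2", "t2")}
confirmed = {("s1", "t9")}
print(merge_links(candidates, confirmed))
```

Indexing the confirmed links by source first keeps the check for each candidate to a single dictionary lookup instead of a scan over the whole confirmed set.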
Handling the heterogeneity of the datasets:

1) Value heterogeneity: each instance is treated as a bag of words and mapped to a vector, and similarity is computed between the vectors

2) Logical heterogeneity: consider the CBD up to depth n from the source node
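
The bag-of-words comparison in item 1) can be sketched as below. The notes do not record which term weighting or similarity measure the paper uses, so raw term counts and cosine similarity are assumptions of this sketch, and the helper names are hypothetical.

```python
import math
import re
from collections import Counter

def bag_of_words(literals):
    """Turn the literals of an instance profile into a term-count vector."""
    words = []
    for text in literals:
        words.extend(re.findall(r"\w+", text.lower()))
    return Counter(words)

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

a = bag_of_words({"Moonlight Sonata", "Beethoven"})
b = bag_of_words({"Sonata Moonlight", "L. van Beethoven"})
c = bag_of_words({"Symphony", "Haydn"})
print(cosine(a, b), cosine(a, c))
```

Because word order and the property that carried each word are discarded, two descriptions that phrase the same information differently still end up close, which is why this representation suits value-level heterogeneity.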

1.4 Experimental methods used in the paper

Datasets used: DOREMUS (including 9-HT, 4-HT (heterogeneities) and FP-trap (false positives trap)) and synthetic datasets (SPIMBENCH 2015, SPIMBENCH 2016, SPIMBENCH 2017)
Experimental settings:

1) Measuring the effect of automatically identifying problematic properties on link generation (evaluates the efficiency of the property filtering module)

2) Choice of instance profiling

3) Effect of key-based instance disambiguation

4) Overall comparison of Legato with other systems

5) Comparison of Legato with other methods that generate links automatically

Metrics used: F-measure (Fm), precision (P), recall (R)
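
For reference, the three metrics can be computed from a set of generated links and a reference alignment as in the sketch below. The notes do not say which F-measure variant is reported; the balanced F1 is assumed here, and the function name and sample links are hypothetical.

```python
def evaluate(found, reference):
    """Return (precision, recall, F-measure) of found links against reference links."""
    found, reference = set(found), set(reference)
    true_positives = len(found & reference)
    precision = true_positives / len(found) if found else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Hypothetical run: 4 links produced, 3 of them among the 6 reference links.
found = {("s1", "t1"), ("s2", "t2"), ("s3", "t3"), ("s4", "t9")}
reference = {("s1", "t1"), ("s2", "t2"), ("s3", "t3"),
             ("s4", "t4"), ("s5", "t5"), ("s6", "t6")}
p, r, fm = evaluate(found, reference)
print(p, r, fm)
```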
Property filtering efficiency: using all properties is compared with removing the problematic properties, evaluated on the DOREMUS datasets. Automatic property filtering was found to perform better on the 4-HT and 9-HT datasets.
Instance profiling efficiency: Legato is run with different instance profiling variants on the OAEI 2017 datasets. Profiling with ↕CBD was found to give a higher Fm score.
Efficiency of the post-processing: mainly concerns the instance disambiguation and link merging modules, on the DOREMUS 2017 dataset. It looks at the proportions of the candidate link set and the confirmed link set, and at the links deleted or added. Post-processing was found to be extremely important on datasets with highly similar instances.
Overall efficiency: the automatic version of Legato is compared with the tools that competed in IM@OAEI 2015, 2016 and 2017. Legato was found to perform well when the datasets contain heterogeneities of the ontological (entity-level) dimension.
Efficiency of automatic link generation: Legato was compared with EAGLE and WOMBAT, and Legato performed better.

1.5 Final evaluation of the experimental results

1.6 Future work of the paper

Future work will focus on information complementarity between datasets, i.e. the case where an entity is described by complementary properties spread over different RDF datasets, which leaves too little comparable information.

Second, innovations of the paper

A new strategy for automatically identifying and removing "problematic" properties between the two datasets; the Legato framework can automatically discover links between RDF graphs

Third, the techniques and methods used in the paper

IM@OAEI method

EAGLE and other automatic RDF linking tools

Legato framework

Fourth, recommended references

[48], [51], http://islab.di.unimi.it/content/im_oaei/2016, [5]