
Guidelines or suggestions for data reconciliation (updated from time to time; collecting advice from everyone)

Junjun Cao edited this page Sep 18, 2024 · 7 revisions

0. Regarding the latent Schema behind the data (to be finished)

0.1 Detect the relation set in the CSV files (to be finished)

0.2 Detect the redundancy among the CSV files (to be finished)

1. Q and P in WikiData

1.0 Basic points to note

  • Clarification of namespaces:

wd: is http://www.wikidata.org/entity/, not https://www.wikidata.org/wiki/
wdt: is http://www.wikidata.org/prop/direct/, not https://www.wikidata.org/wiki/Property:

  • Don't mix the two forms.
  • Be cautious about ambiguous terms:

For example, the "recording" entity of TheSession is not "recorded music" (Q49017950) but in fact an "album" (Q482994); in MusicBrainz, however, "recorded music" and "album" coexist and are distinct.

1.1 Q

1.1.1 Find the class/type for the entities/instances

For example: <entity> rdf:type <entity> . or <entity> wdt:P31 <entity> .

In the future, we may add semantics such as rdf:type owl:equivalentProperty wdt:P31 .

1.1.2 Reconcile each instance with those of Wikidata (see below)

1.2 P

Reconciling properties can only be done manually.

1.2.0 About rdfs:label

Some properties, such as name (wdt:P2561) and title (wdt:P1476), are essentially similar to rdfs:label. Using rdfs:label is preferred, for the convenience of LLM2SPARQL.

1.2.1 Note the type of property values

Since the introduction of RDFS, properties can essentially be divided into two types:

(1) Object property: the value is another item that has a URI; for example, see day of week (P2894).
(2) Data property: the value is not a URI but an rdfs:Literal...

Keeping each property consistently of one type is recommended, not least because a clear distinction between object properties and data properties improves the accuracy of LLM2SPARQL.

If you ask ChatGPT a question, it usually treats any given property as either an object property or a data property. When a property's values are mixed, you will probably have to use isIRI(?x) to disambiguate. For example:

Take the specific question "Find in TheSession performers who are Canadians. And find the recordings they performed in TheSession".
The expected SPARQL could be:

PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wd: <http://www.wikidata.org/entity/>

SELECT distinct ?recording ?performer 
WHERE {
  GRAPH <http://sample/thesession/reconciled> {
    ?recording a wd:Q482994 ;
               wdt:P175 ?performer .
    # Without this FILTER, Virtuoso reports "Virtuoso S1TAT Error Query did not complete due to ANYTIME timeout."
    FILTER isIRI(?performer)
  }
  SERVICE <https://query.wikidata.org/sparql> { 
    ?performer wdt:P27 wd:Q16
  }
}
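
The FILTER above is needed when a property carries both IRIs and literals as values. A minimal sketch for spotting such properties before querying, assuming the triples have been exported as (property, value) pairs of strings; the helper names, the sample data, and the placeholder QID Q0 are hypothetical, and `is_iri` is only a rough stand-in for SPARQL's isIRI():

```python
from collections import defaultdict


def is_iri(value: str) -> bool:
    """Rough stand-in for SPARQL's isIRI(): treat http(s) strings as IRIs."""
    return value.startswith(("http://", "https://"))


def mixed_properties(pairs):
    """Return the properties whose values are a mix of IRIs and literals."""
    kinds = defaultdict(set)
    for prop, value in pairs:
        kinds[prop].add("iri" if is_iri(value) else "literal")
    return sorted(prop for prop, seen in kinds.items() if len(seen) > 1)


# Hypothetical sample: wdt:P175 has one reconciled and one unreconciled value.
sample = [
    ("wdt:P175", "http://www.wikidata.org/entity/Q0"),  # placeholder QID
    ("wdt:P175", "Some Unreconciled Performer"),
    ("wdt:P27", "http://www.wikidata.org/entity/Q16"),
]
print(mixed_properties(sample))
```

Properties reported by such a check are candidates either for further reconciliation or for an explicit isIRI() filter in generated queries.
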
1.2.2 Wikidata:WikiProject_Music

The WikiProject Music page (https://www.wikidata.org/wiki/Wikidata:WikiProject_Music) also offers many recommended properties for LinkedMusic.

2. Ask ChatGPT to recommend properties or types for entities

3. Check the context where the property is used in Wikidata

Especially the context of subject->property->object

4. In addition to Wikidata, we can reconcile with other metadata schemas/ontologies such as schema.org

The recommended list of schemas/ontologies, in descending order of priority, is: ...

5. Reconciliation for entities:

All entity URIs should follow the formats denoted by the namespace prefixes wd and wdt:

@prefix wd: <http://www.wikidata.org/entity/> .
@prefix wdt: <http://www.wikidata.org/prop/direct/> .

Be very careful that the scheme is "http", not "https", and that for wd the path is /entity/, not /wiki/...
Otherwise the reconciled URI won't be recognized by the Wikidata SPARQL endpoint.
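
As an illustration of this rule, here is a minimal sketch that rewrites Wikidata page URLs into the wd:/wdt: forms. The function name and the regular expression are hypothetical, and the sketch assumes that every P identifier should go to the wdt: namespace, as in this guideline:

```python
import re

WD = "http://www.wikidata.org/entity/"
WDT = "http://www.wikidata.org/prop/direct/"

# Matches a Q or P identifier at the end of a Wikidata page, entity or direct-property URL.
_ID = re.compile(
    r"^https?://www\.wikidata\.org/"
    r"(?:wiki/(?:Property:)?|entity/|prop/direct/)([QP]\d+)$"
)


def canonical_uri(url: str) -> str:
    """Map a Wikidata URL to the wd:/wdt: form; raise ValueError otherwise."""
    match = _ID.match(url.strip())
    if match is None:
        raise ValueError(f"not a recognised Wikidata URL: {url!r}")
    identifier = match.group(1)
    # Assumption: items (Q...) use wd:, properties (P...) use wdt:.
    return (WD if identifier.startswith("Q") else WDT) + identifier


print(canonical_uri("https://www.wikidata.org/wiki/Q482994"))
print(canonical_uri("https://www.wikidata.org/wiki/Property:P31"))
```

Raising an error for anything unrecognised, rather than passing it through, makes stray URLs visible during reconciliation.
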

5.0 Keep records for reconciliation

Especially for entities that you have to reconcile manually in OpenRefine, it is best to keep a spreadsheet recording the mapped Wikidata entities.
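
One possible shape for such a record, sketched with Python's csv module. The column names are hypothetical and not an agreed project format; the sample row reuses the TheSession "recording" mapping mentioned in section 1.0:

```python
import csv
import io

# Hypothetical column layout for the reconciliation log.
FIELDS = ["source_database", "local_value", "wikidata_uri", "note"]


def write_records(records, handle):
    """Write reconciliation records (dicts keyed by FIELDS) as CSV."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(records)


records = [
    {
        "source_database": "TheSession",
        "local_value": "recording",
        "wikidata_uri": "http://www.wikidata.org/entity/Q482994",
        "note": "album, not recorded music (Q49017950)",
    }
]
buffer = io.StringIO()  # stands in for an open file
write_records(records, buffer)
print(buffer.getvalue())
```

A CSV file like this opens directly in a spreadsheet and can be re-read later to repeat or audit a manual mapping.
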

6. For terms that are not easy to map to an exact property or type, we have two fallback methods:

1. Add to Wikidata

2. Use the URL of the webpage of the corresponding database

2.1 If necessary, we use a hash # (the document fragment delimiter), such as https://musicbrainz.org/doc/Event#Cancelled in MusicBrainz.

2.2 If necessary, we can use a fake URL.
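
The fragment approach in 2.1 can be sketched as follows. The function name is hypothetical, and percent-encoding the term is an assumption made here so that terms containing spaces still yield a valid URI:

```python
from urllib.parse import quote, urldefrag


def fallback_uri(page_url: str, term: str) -> str:
    """Build a URI for `term` by appending it to `page_url` as a fragment."""
    # Drop any fragment already present so only one '#' ends up in the result.
    base, _existing_fragment = urldefrag(page_url)
    return f"{base}#{quote(term, safe='')}"


print(fallback_uri("https://musicbrainz.org/doc/Event", "Cancelled"))
# prints "https://musicbrainz.org/doc/Event#Cancelled"
```
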

7. Special situation:

7.1 Sometimes we can model a special situation in its own way. For an example (shown as an image on the original page), please refer to https://github.com/DDMAL/linkedmusic-datalake/issues/107

7.2 About wdt:P2888 (exact match)

  • We had better also use owl:sameAs, because with reasoning activated (in Virtuoso) it can supplement latent data. See:
PREFIX owl: <http://www.w3.org/2002/07/owl#>

INSERT DATA {
  GRAPH <urn:reason.example> {
    <http://InstanceA_local> <http://property> <http://InstanceB_local> .
    <http://InstanceA_local> owl:sameAs <http://InstanceA_wiki> .
    # The following reasoning condition doesn't take effect; to be investigated in the future.
    <http://InstanceB_local> owl:sameAs <http://InstanceB_wiki> .
  }
}

After inserting the data above, to check which properties http://InstanceA_wiki has, you can run a query with the reasoning function activated:

DEFINE input:same-as "yes"
SELECT distinct ?p ?o 
FROM <urn:reason.example> 
WHERE {
    <http://InstanceA_wiki> ?p ?o .  
}

The result can be:

p o
http://www.w3.org/2002/07/owl#sameAs http://InstanceA_wiki
http://property http://InstanceB_local
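
To make the effect of the same-as expansion concrete, here is a toy in-memory illustration in Python. It is not how Virtuoso implements reasoning; it only assumes that subjects are expanded through owl:sameAs while objects are left as they are, which is consistent with the result listed above. The function names are hypothetical:

```python
SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

triples = [
    ("http://InstanceA_local", "http://property", "http://InstanceB_local"),
    ("http://InstanceA_local", SAME_AS, "http://InstanceA_wiki"),
    ("http://InstanceB_local", SAME_AS, "http://InstanceB_wiki"),
]


def aliases(node, triples):
    """Return every node linked to `node` by owl:sameAs, in either direction."""
    found, frontier = {node}, [node]
    while frontier:
        current = frontier.pop()
        for s, p, o in triples:
            if p != SAME_AS:
                continue
            for a, b in ((s, o), (o, s)):
                if a == current and b not in found:
                    found.add(b)
                    frontier.append(b)
    return found


def describe(node, triples):
    """List (property, object) pairs for `node` and all of its aliases."""
    names = aliases(node, triples)
    return sorted({(p, o) for s, p, o in triples if s in names})


for p, o in describe("http://InstanceA_wiki", triples):
    print(p, o)
```

Because http://InstanceA_wiki is treated as an alias of http://InstanceA_local, the statements made about the local instance also appear when the Wikidata instance is asked about.
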

