
Ermilov's wiki.publicdata.eu CSV2RDF Application

Tim L edited this page May 28, 2013 · 43 revisions

What is first

What we will cover

Ermilov et al. presented a wiki-based approach to crowd-sourcing the enhancements of ~9k datasets listed at http://publicdata.eu (WebSci 2012 paper).

A year after its publication, how far has the crowd-sourcing come?

This page provides a summary and review of Ermilov's wiki.publicdata.eu CSV2RDF Application.

Let's get to it

How many people contributed to the "crowd-source" enhancement?

Four accounts contributed, and the two non-author accounts provided fewer than ten contributions.

`find manual/pages -name "*.ttl" | xargs -L1 grep "wasAttributedTo" | sort -u` shows only a handful of contributors:

      prov:wasAttributedTo <http://wiki.publicdata.eu/wiki/User:178.25.43.32>;
      prov:wasAttributedTo <http://wiki.publicdata.eu/wiki/User:2001:638:902:2010:0:168:35:101>;
      prov:wasAttributedTo <http://wiki.publicdata.eu/wiki/User:Iermilov>;
      prov:wasAttributedTo <http://wiki.publicdata.eu/wiki/User:IvanErmilov>;
      prov:wasAttributedTo <http://wiki.publicdata.eu/wiki/User:Soeren>;
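
To see how the contributions are spread across accounts, rather than only which accounts appear, the same search can be tallied per account. This is a minimal sketch, assuming the crawled pages sit under `manual/pages` as in the command above and that each `prov:wasAttributedTo` statement stands for one contribution (an assumption about how the crawl recorded edits). The function name `count_attributions` is hypothetical.

```shell
# Tally prov:wasAttributedTo statements per account across all Turtle files
# below a directory. Hypothetical helper; not part of csv2rdf4lod.
count_attributions() {
  local dir="${1:-manual/pages}"
  # Pull out the IRI that follows wasAttributedTo on each matching line,
  # then count how often each IRI occurs, most frequent first.
  find "$dir" -name "*.ttl" \
      -exec sed -n 's/.*wasAttributedTo[[:space:]]*\(<[^>]*>\).*/\1/p' {} + \
    | sort \
    | uniq -c \
    | sort -rn
}
```

Running `count_attributions manual/pages` should print one line per account, with the number of attribution statements in the first column.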

How many datasets are covered?

2,035 of the 19,000 datasets are mentioned in their mapping wiki.

`find manual/pages -name "*.ttl" | xargs grep -h -B1 "datafaqs:CKANDataset" | grep -v "^--" | grep -v datafaqs:CKANDataset | sort -u | wc -l`, verified by SPARQL query:

PREFIX datafaqs: <http://purl.org/twc/vocab/datafaqs#>
PREFIX dcat:     <http://www.w3.org/ns/dcat#>

SELECT ?dataset
WHERE {
  ?dataset a datafaqs:CKANDataset, dcat:Dataset .
}
<http://publicdata.eu/dataset/-municipal-waste-generation-in-england-from-2000-01-to-2009-10>
<http://publicdata.eu/dataset/01-bve-adressen-instellingen--ministerie-van-ocw>
<http://publicdata.eu/dataset/01-bve-crebo--beroepsopleidingen-ministerie-van-ocw-2010-2011-ministerie-van-ocw>
<http://publicdata.eu/dataset/01-bve-deelnemers-per-instelling-en-type-mbo-ministerie-van-ocw>
<http://publicdata.eu/dataset/01-ho-adressen-hbo-instellingen-en-universiteiten--ministerie-van-ocw>
<http://publicdata.eu/dataset/01-po-hoofdvestigingen-basisonderwijs-ministerie-van-ocw>
<http://publicdata.eu/dataset/01-po-leerlingen-basisonderwijs-naar-gewicht-ministerie-van-ocw>
<http://publicdata.eu/dataset/01-vo-adressen-hoofdvestigingen--ministerie-van-ocw>
...
<http://publicdata.eu/dataset/years-of-life-lost-due-to-suicide>
<http://publicdata.eu/dataset/young-first-time-offenders-borough>
<http://publicdata.eu/dataset/ypla_financial_transactions_december>
<http://publicdata.eu/dataset/ypla_financial_transactions_november>
<http://publicdata.eu/dataset/ypla_financial_transactions_october>
<http://publicdata.eu/dataset/zuzuge>

How many existing vocabulary terms did the crowd-sourced enhancement produce?

Fifteen terms were reused from nine vocabularies for more than 9,000 datasets. We skip the three non-CURIEs listed below because it is not clear that they are RDF terms.

`find manual/pages -name "*.xml.ttl" | xargs -L1 grep "conversion:label" | sed 's/conversion:label//' | grep : | sed 's/^ *"/"/' | grep -v " " | sort -u`:

"cgov:fullTimeEquivalentSalary";
"cgov:lowerBound";
"cgov:upperBound";
"dce:date";
"foaf:mbox";
"foaf:name";
"foaf:phone";
"http://dbpedia.org/resource/Category:Ministerial_departments_of_the_United_Kingdom_Government";
"http://statistics.data.gov.uk/id/local-authority/32UC";
"http://www.google.co.uk";
"org:OrganizationalUnit";
"org:organization";
"org:unitOf";
"pc:supplier";
"rdf:type";
"rdfs:comment";
"skos:Amount";
"whois:Job";

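The term and vocabulary tally can be reproduced from the listing above by keeping only the CURIE-shaped labels and counting distinct prefixes. This is a rough sketch, assuming the labels arrive one per line in the quoted form shown. `curie_summary` is a hypothetical helper name, and treating "colon followed by a slash" as a full URI is a heuristic rather than a real CURIE check.

```shell
# Read conversion:label values (one per line, e.g. "foaf:name";) on stdin,
# keep only CURIE-shaped ones, and report distinct terms and prefixes.
# Hypothetical helper; labels like "http://..." are skipped as full URIs.
curie_summary() {
  sed 's/^[[:space:]]*"//; s/";*[[:space:]]*$//' \
    | grep -E '^[A-Za-z][A-Za-z0-9_-]*:[^/]' \
    | sort -u \
    | awk -F: '{ terms++; if (!($1 in seen)) { seen[$1] = 1; prefixes++ } }
               END { printf "%d terms from %d vocabularies\n", terms, prefixes }'
}
```

Piping the eighteen labels listed above into `curie_summary` prints `15 terms from 9 vocabularies`, since the three `http://` labels are dropped.
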
Benefits

  • http://publicdata.eu aggregates from many other European-based CKAN instances.
  • Enables community-editable mappings using an existing mechanism (MediaWiki).
  • The main CKAN dataset listing site links to the mapping wiki.
  • User-invokable reconversion.

Shortcomings

Usability Shortcomings:

Linked Data Best Practices Shortcomings:

  • `curl -H "Accept: application/rdf+xml" -L http://publicdata.eu/dataset/directgov-referring-sites` returns a gzipped HTML file (appending .rdf works, though: http://publicdata.eu/dataset/directgov-referring-sites.rdf).
  • The mappings are NOT expressed in RDF; they are expressed only as MediaWiki template arguments (and as Sparqlify mappings behind the scenes, which are not available for public inspection). Although the intent is to make the mappings easy for a novice to read and write, that is no reason not to lift them into RDF behind the scenes and make them available for other systems to use.
  • The mappings are NOT described with RDF, since each is just a wiki page (Special:Export can be used, but it is not discoverable from the page itself using Linked Data principles). The mapping description does NOT refer back to the dataset that it enhances [using RDF], and it does NOT refer to the resulting RDF conversion [using RDF].
  • The namespace used (http://wiki.publicdata.eu/ontology/) for the RDF properties 404s.
  • The site for the converter tool (http://sparqlify.org/wiki/Main_Page) 404s.
  • The RDF conversion dump files use the N-Triples serialization but have the extension .rdf, which is generally reserved for the application/rdf+xml serialization (e.g. http://csv2rdf.aksw.org/sparqlified/f449751c-68d3-4f84-8fe3-5c3a4cb86c84_default-tranformation-configuration.rdf). This confuses even the best-of-breed RDF serialization tools.
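
Because the `.rdf` extension cannot be trusted for these dumps, a consumer may want to sniff the serialization before handing a file to a parser. The following is a crude heuristic sketch, not a parser: it inspects only the first non-blank, non-comment line, and `guess_rdf_syntax` is a hypothetical name.

```shell
# Guess whether a local file is RDF/XML or N-Triples from its first
# meaningful line. Hypothetical helper; a heuristic, not a validator.
guess_rdf_syntax() {
  local first
  # First line that is neither blank nor a '#' comment.
  first=$(grep -v -E '^[[:space:]]*(#.*)?$' "$1" | head -n 1)
  case "$first" in
    # An XML declaration or an rdf:RDF root element suggests RDF/XML.
    '<?xml'*|'<rdf:RDF'*) echo "rdfxml" ;;
    # "<s> <p> o ." or "_:b <p> o ." suggests N-Triples.
    '<'*'> <'*'> '*' .'|'_:'*' <'*'> '*' .') echo "ntriples" ;;
    *) echo "unknown" ;;
  esac
}
```

A download script could call `guess_rdf_syntax dump.rdf` and choose the parser's input format from the answer instead of from the file extension. Compressed files and lines with trailing whitespace are not handled by this sketch.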

Mapping Capabilities Shortcomings:

  • It can't specify a datatype for a cell's value like conversion:range does (e.g. "85" is an xsd:integer).
  • It can't "promote" a cell value to a URI like conversion:range does (e.g. "http://www.google.co.nz" becomes <http://www.google.co.nz>).
  • It can't type a URI to a given class like conversion:range_template/conversion:subclass_of do (e.g. <http://www.google.co.nz> is a sioc:Space).
  • Its property creation strategy (put everything into http://wiki.publicdata.eu/ontology/) is not conservative enough and fosters collisions. csv2rdf4lod uses a hierarchical naming based on the publishing organization, the dataset, and the version of the dataset (the so-called "[SDV naming](Conversion process phase: name)") to avoid terminology collisions while facilitating natural and incremental dataset integration.
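
To illustrate why hierarchical naming avoids collisions, here is a small sketch that builds a property URI scoped by source, dataset, and version. The `/source/…/dataset/…/version/…` layout is modeled on the dataset URI quoted at the end of this page. The trailing `/vocab/` segment, the helper name `sdv_property_uri`, and the example identifiers are assumptions made for illustration, not csv2rdf4lod's documented layout.

```shell
# Build a property URI scoped by source (S), dataset (D), and version (V).
# Hypothetical helper; the "/vocab/" segment is an illustrative assumption.
sdv_property_uri() {
  local base="$1" src="$2" dataset="$3" version="$4" name="$5"
  printf '%s/source/%s/dataset/%s/version/%s/vocab/%s\n' \
    "$base" "$src" "$dataset" "$version" "$name"
}

# Two datasets that both have a "name" column get distinct properties:
#   sdv_property_uri http://example.org org-a salaries  2013-May-13 name
#   sdv_property_uri http://example.org org-b suppliers 2013-May-13 name
```

In a single flat namespace, both columns would instead map to the same property, whether or not they mean the same thing.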

Provenance and Metadata Shortcomings:

  • (to be enumerated)
  • Can we trust the aggregation that http://publicdata.eu does from the many other European-based CKAN instances? Or, do we have to redo it to get better results?

organize:

* Claims "formal tabular model" and provides classifications of 100 tables in the wild (which?).
* Uses SPARQL-inspired mapping language.
* Creates a wiki page ([e.g.](http://wiki.publicdata.eu/wiki/Csv2rdf:0001d90c-eeb8-4857-bf33-ea25850a24cc)) for each mapping. The page permits "retransform" and "download result RDF" functions, and points back to the original dataset download file.
* Enables dataset navigation according to shared property use (but not actually useful when you click around as a user).
* Their exemplar: http://wiki.publicdata.eu/wiki/Csv2rdf:00e0737c-6920-479a-9916-ff83b9de692c
* Can get a wiki page's source text by submitting the page name to [Special:Export](http://wiki.publicdata.eu/wiki/Special:Export).
* Results are not available as best-practice Linked Data. ([this](http://wiki.publicdata.eu/wiki/Special:URIResolver/Csv2rdf-3A0001d90c-2Deeb8-2D4857-2Dbf33-2Dea25850a24cc) is sitting around or could be found from the [RDF export module](http://wiki.publicdata.eu/wiki/Special:ExportRDF), but it doesn't reuse appropriate properties (e.g. `rdf_number_of_triples`), is not available via conneg, and does not provide the mapping itself.)
* The [most used properties](http://wiki.publicdata.eu/wiki/Most_used_properties) are all invented - nothing is reused from an external vocab.
* "mapping creation can be easily crowdsourced" - but I doubt many in a crowd could figure out how to use their system.
* Only provides one namespace for properties (plus reuse of external properties). csv2rdf4lod provides for hierarchical vocabulary namespaces according to the source organization and dataset (and, dataset version if desired).
* Converts at 4k triples/second, which is the same order of magnitude as csv2rdf4lod.
* Isn't reported as part of http://datahub.io/group/lodcloud
* Doesn't provide a compelling example
* Doesn't quantify quality of enhancements
* Does not report on real use or acceptance of tool/results
* Claims "main challenge" is to find headers. That's our biggest problem?
* Cannot handle multiple tables in an archive file.
* Very "deliberately minimal" "mapping language", which is a subset of csv2rdf4lod's structural parameters: [where is the header](conversion:HeaderRow), what is [delimiter](conversion:delimits_cell), what property do I use for a column ([conversion:label](conversion:label), [conversion:equivalent_property](conversion:equivalent_property), [conversion:subproperty_of](conversion:subproperty_of)), what row/col do I [omit](conversion:Omitted) (especially [if it is empty](conversion:Only_if_column))? 
* Can only handle "one dimensional" tabular organizations (i.e. row-based, as opposed to [cell-based](Converting with cell based subjects))

Using wiki enhancements in csv2rdf4lod

We crawled the enhancements on their wiki to produce csv2rdf4lod enhancement parameters. These are not yet published, but have been assigned the URI http://logd.tw.rpi.edu/source/publicdata-eu/dataset/wiki-csv2rdf-mappings/version/2013-May-13 (we need to find an appropriate home for them before publishing).
