Ermilov's wiki.publicdata.eu CSV2RDF Application

What is first

Notes within the alternate converters list

What we will cover

Ermilov et al. presented a wiki-based approach to crowd-sourcing the enhancements of ~9k datasets listed at http://publicdata.eu (WebSci 2012 paper).

A year after its publication, how far has the crowd-sourcing come?

This page provides a summary and review of Ermilov's wiki.publicdata.eu CSV2RDF Application.

Let's get to it

How many people contributed to the "crowd-source" enhancement?

Four accounts contributed, and the two non-author accounts provided fewer than ten contributions.

`find manual/pages -name "*.ttl" | xargs -L1 grep "wasAttributedTo" | sort -u` shows only a handful of contributors:

      prov:wasAttributedTo <http://wiki.publicdata.eu/wiki/User:178.25.43.32>;
      prov:wasAttributedTo <http://wiki.publicdata.eu/wiki/User:2001:638:902:2010:0:168:35:101>;
      prov:wasAttributedTo <http://wiki.publicdata.eu/wiki/User:Iermilov>;
      prov:wasAttributedTo <http://wiki.publicdata.eu/wiki/User:IvanErmilov>;
      prov:wasAttributedTo <http://wiki.publicdata.eu/wiki/User:Soeren>;
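
The same tally can be reproduced without the shell pipeline. The following is a minimal sketch, assuming the Turtle files have already been read into a single string; the function name `distinct_contributors` is hypothetical, and the regular expression only recognizes the `prov:wasAttributedTo <...>` form shown above (it is a plain-text match, not a Turtle parser).

```python
import re

# Matches the object IRI of a prov:wasAttributedTo statement as it appears
# in the listing above. Plain-text matching only; prefixed or multi-line
# objects would need a real Turtle parser.
ATTRIBUTION = re.compile(r"prov:wasAttributedTo\s+<([^>]+)>")


def distinct_contributors(turtle_text):
    """Return the sorted, de-duplicated IRIs that the text attributes work to."""
    return sorted(set(ATTRIBUTION.findall(turtle_text)))


# The five attribution lines quoted above, used here as sample input.
sample = """
      prov:wasAttributedTo <http://wiki.publicdata.eu/wiki/User:178.25.43.32>;
      prov:wasAttributedTo <http://wiki.publicdata.eu/wiki/User:2001:638:902:2010:0:168:35:101>;
      prov:wasAttributedTo <http://wiki.publicdata.eu/wiki/User:Iermilov>;
      prov:wasAttributedTo <http://wiki.publicdata.eu/wiki/User:IvanErmilov>;
      prov:wasAttributedTo <http://wiki.publicdata.eu/wiki/User:Soeren>;
"""

for iri in distinct_contributors(sample):
    print(iri)
```

Like `sort -u`, the set removes repeated attributions, so each account is listed once no matter how many pages it edited.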

How many datasets are covered?

2,035 of the 19,000 datasets are mentioned in their mapping wiki.

PREFIX datafaqs: <http://purl.org/twc/vocab/datafaqs#>
PREFIX dcat:     <http://www.w3.org/ns/dcat#>

SELECT ?dataset
WHERE {
  ?dataset a datafaqs:CKANDataset, dcat:Dataset .
}

<http://publicdata.eu/dataset/-municipal-waste-generation-in-england-from-2000-01-to-2009-10>
<http://publicdata.eu/dataset/01-bve-adressen-instellingen--ministerie-van-ocw>
<http://publicdata.eu/dataset/01-bve-crebo--beroepsopleidingen-ministerie-van-ocw-2010-2011-ministerie-van-ocw>
<http://publicdata.eu/dataset/01-bve-deelnemers-per-instelling-en-type-mbo-ministerie-van-ocw>
<http://publicdata.eu/dataset/01-ho-adressen-hbo-instellingen-en-universiteiten--ministerie-van-ocw>
<http://publicdata.eu/dataset/01-po-hoofdvestigingen-basisonderwijs-ministerie-van-ocw>
<http://publicdata.eu/dataset/01-po-leerlingen-basisonderwijs-naar-gewicht-ministerie-van-ocw>
<http://publicdata.eu/dataset/01-vo-adressen-hoofdvestigingen--ministerie-van-ocw>
...
<http://publicdata.eu/dataset/years-of-life-lost-due-to-suicide>
<http://publicdata.eu/dataset/young-first-time-offenders-borough>
<http://publicdata.eu/dataset/ypla_financial_transactions_december>
<http://publicdata.eu/dataset/ypla_financial_transactions_november>
<http://publicdata.eu/dataset/ypla_financial_transactions_october>
<http://publicdata.eu/dataset/zuzuge>

How many existing vocabulary terms did the crowd-sourced enhancement produce?

Fifteen terms were reused from nine vocabularies for more than 9,000 datasets. We skip the three non-CURIEs listed below because it is not clear that they are RDF terms.
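
These counts can be checked mechanically against the terms quoted in this section. The following is a minimal sketch, assuming that a CURIE is any `prefix:local` string whose colon is not followed by `//`; the `summarize` helper and the regular expression are illustrative and are not part of either tool.

```python
import re

# prefix:local, where the colon is NOT followed by "//" (that would be an
# absolute IRI such as http://...). This is a simplification of CURIE syntax.
CURIE = re.compile(r"^([A-Za-z][\w.-]*):(?!//)(\S+)$")

# The eighteen strings quoted in this section.
TERMS = [
    "cgov:fullTimeEquivalentSalary",
    "cgov:lowerBound",
    "cgov:upperBound",
    "dce:date",
    "foaf:mbox",
    "foaf:name",
    "foaf:phone",
    "http://dbpedia.org/resource/Category:Ministerial_departments_of_the_United_Kingdom_Government",
    "http://statistics.data.gov.uk/id/local-authority/32UC",
    "http://www.google.co.uk",
    "org:OrganizationalUnit",
    "org:organization",
    "org:unitOf",
    "pc:supplier",
    "rdf:type",
    "rdfs:comment",
    "skos:Amount",
    "whois:Job",
]


def summarize(terms):
    """Split terms into CURIEs and others, and collect the CURIE prefixes."""
    curies, skipped, prefixes = [], [], set()
    for term in terms:
        match = CURIE.match(term)
        if match:
            curies.append(term)
            prefixes.add(match.group(1))
        else:
            skipped.append(term)
    return curies, sorted(prefixes), skipped


curies, prefixes, skipped = summarize(TERMS)
print(len(curies), "CURIE terms from", len(prefixes), "prefixes;", len(skipped), "skipped")
# prints: 15 CURIE terms from 9 prefixes; 3 skipped
```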

"cgov:fullTimeEquivalentSalary";
"cgov:lowerBound";
"cgov:upperBound";
"dce:date";
"foaf:mbox";
"foaf:name";
"foaf:phone";
"http://dbpedia.org/resource/Category:Ministerial_departments_of_the_United_Kingdom_Government";
"http://statistics.data.gov.uk/id/local-authority/32UC";
"http://www.google.co.uk";
"org:OrganizationalUnit";
"org:organization";
"org:unitOf";
"pc:supplier";
"rdf:type";
"rdfs:comment";
"skos:Amount";
"whois:Job";

Benefits

* http://publicdata.eu aggregates from many other European-based CKAN instances.
* Enables community-editable mappings using an existing mechanism (MediaWiki).
* The main CKAN dataset listing site links to the mapping wiki.
* User-invokable reconversion.

Shortcomings

Usability Shortcomings:

* The wiki page is hard to use because it is disconnected from both the original and resulting data.
* The community hasn't used the tool, even though it has been available for a year.
* The mapping wiki pages have meaningless names (e.g. http://wiki.publicdata.eu/wiki/Csv2rdf:F449751c-68d3-4f84-8fe3-5c3a4cb86c84).

Linked Data Best Practices Shortcomings:

* `curl -H "Accept: application/rdf+xml" -L http://publicdata.eu/dataset/directgov-referring-sites` returns a gzipped HTML file (appending .rdf works, though: http://publicdata.eu/dataset/directgov-referring-sites.rdf).
* The mappings are NOT expressed in RDF; they exist only as MediaWiki template arguments (and as Sparqlify mappings behind the scenes, which aren't available for public inspection). Although the intent is to make them easy for a novice to read and write, they could still be lifted behind the scenes and made available as RDF for other systems to use.
* The mappings are NOT described with RDF, since each is just a wiki page (Special:Export can be used, but it's not findable from the page itself using linked data principles). The mapping description does NOT refer back to the dataset that it enhances [using RDF], and it does NOT refer to the resulting RDF conversion [using RDF].
* The namespace used for the RDF properties (http://wiki.publicdata.eu/ontology/) 404s.
* The site for the converter tool (http://sparqlify.org/wiki/Main_Page) 404s.
* The RDF conversion dump files use the N-Triples serialization but have the extension .rdf, which is generally reserved for the application/rdf+xml serialization (e.g. http://csv2rdf.aksw.org/sparqlified/f449751c-68d3-4f84-8fe3-5c3a4cb86c84_default-tranformation-configuration.rdf). This confuses even the best-of-breed RDF serialization tools.
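
The content negotiation and file extension checks above can be scripted. The following is a minimal sketch using only Python's standard library; the helper names and the extension table are assumptions made for illustration, and the network call is left commented out because the site may no longer respond.

```python
import urllib.request

# Media types treated as RDF for this check (not an exhaustive list).
RDF_MEDIA_TYPES = {
    "application/rdf+xml",
    "text/turtle",
    "application/n-triples",
}

# Conventional file extension for each serialization.
EXPECTED_EXTENSION = {
    "application/rdf+xml": ".rdf",
    "text/turtle": ".ttl",
    "application/n-triples": ".nt",
}


def build_rdf_request(url, accept="application/rdf+xml"):
    """Build (but do not send) a GET request that asks for an RDF serialization."""
    return urllib.request.Request(url, headers={"Accept": accept})


def is_rdf_content_type(content_type_header):
    """True if a Content-Type header value names one of the RDF media types."""
    media_type = content_type_header.split(";")[0].strip().lower()
    return media_type in RDF_MEDIA_TYPES


def extension_matches(url, media_type):
    """True if the URL ends with the conventional extension for media_type."""
    return url.endswith(EXPECTED_EXTENSION[media_type])


if __name__ == "__main__":
    request = build_rdf_request("http://publicdata.eu/dataset/directgov-referring-sites")
    # urlopen follows redirects by default, much like curl -L:
    # with urllib.request.urlopen(request) as response:
    #     print(is_rdf_content_type(response.headers.get("Content-Type", "")))
    dump = ("http://csv2rdf.aksw.org/sparqlified/"
            "f449751c-68d3-4f84-8fe3-5c3a4cb86c84_default-tranformation-configuration.rdf")
    print(extension_matches(dump, "application/n-triples"))
```

A well-behaved Linked Data server would answer the first request with an RDF `Content-Type`, and an N-Triples dump would normally end in `.nt`.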

Mapping Capabilities Shortcomings:

* It can't specify a datatype for a cell's value like conversion:range does (e.g. "85" is an xsd:integer).
* It can't "promote" a cell value to a URI like conversion:range does (e.g. "http://www.google.co.nz" becomes <http://www.google.co.nz>).
* It can't type a URI to a given class like conversion:range_template/conversion:subclass_of do (e.g. <http://www.google.co.nz> is a sioc:Space).
* Its property creation strategy (put everything into http://wiki.publicdata.eu/ontology/) is not conservative enough and fosters collisions. csv2rdf4lod uses hierarchical naming based on the publishing organization, the dataset, and the version of the dataset (the so-called "[SDV naming](Conversion process phase: name)") to avoid terminology collisions while facilitating natural and incremental dataset integration.
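
To make the first three capabilities concrete, the following sketch shows what the resulting RDF terms look like in N-Triples syntax. It is not csv2rdf4lod's implementation and does not use its enhancement parameters; the helper names are hypothetical and only illustrate the intended output for the examples above.

```python
XSD = "http://www.w3.org/2001/XMLSchema#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
SIOC_SPACE = "http://rdfs.org/sioc/ns#Space"


def typed_literal(value, datatype_iri):
    """Serialize a cell value as an N-Triples literal with an explicit datatype."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return '"%s"^^<%s>' % (escaped, datatype_iri)


def promote_to_iri(value):
    """Serialize a cell value that holds an absolute HTTP(S) IRI as an IRI term."""
    if not value.startswith(("http://", "https://")):
        raise ValueError("not an absolute HTTP IRI: %r" % value)
    return "<%s>" % value


def type_triple(iri_term, class_iri):
    """Build an rdf:type triple that asserts the IRI term is of the given class."""
    return "%s <%s> <%s> ." % (iri_term, RDF_TYPE, class_iri)


# 1. Datatype: the cell "85" becomes an xsd:integer literal.
print(typed_literal("85", XSD + "integer"))

# 2. Promotion: the cell text becomes an IRI instead of a string literal.
site = promote_to_iri("http://www.google.co.nz")
print(site)

# 3. Typing: the promoted IRI is declared to be a sioc:Space.
print(type_triple(site, SIOC_SPACE))
```

Without these steps, all three cells would remain plain string literals, which cannot be linked to or compared numerically.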

Provenance and Metadata Shortcomings:

(to be enumerated)
Can we trust the aggregation that http://publicdata.eu does from the many other European-based CKAN instances? Or, do we have to redo it to get better results?

organize:

* Claims "formal tabular model" and provides classifications of 100 tables in the wild (which?).
* Uses SPARQL-inspired mapping language.
* Creates a wiki page ([e.g.](http://wiki.publicdata.eu/wiki/Csv2rdf:0001d90c-eeb8-4857-bf33-ea25850a24cc)) for each mapping. Each page permits "retransform" and "download result RDF" functions, and points back to the original dataset download file.
* Enables dataset navigation according to shared property use (but not actually useful when you click around as a user).
* Their exemplar: http://wiki.publicdata.eu/wiki/Csv2rdf:00e0737c-6920-479a-9916-ff83b9de692c
* Can get a wiki page's source text by submitting the page name to [Special:Export](http://wiki.publicdata.eu/wiki/Special:Export).
* Results are not available as best-practice Linked Data. ([this](http://wiki.publicdata.eu/wiki/Special:URIResolver/Csv2rdf-3A0001d90c-2Deeb8-2D4857-2Dbf33-2Dea25850a24cc) is sitting around or could be found from the [RDF export module](http://wiki.publicdata.eu/wiki/Special:ExportRDF), but it doesn't reuse appropriate properties (e.g. `rdf_number_of_triples`), is not available via conneg, and does not provide the mapping itself.)
* The [most used properties](http://wiki.publicdata.eu/wiki/Most_used_properties) are all invented - nothing is reused from an external vocab.
* "mapping creation can be easily crowdsourced" - but I doubt many in a crowd could figure out how to use their system.
* Only provides one namespace for properties (plus reuse of external properties). csv2rdf4lod provides for hierarchical vocabulary namespaces according to the source organization and dataset (and, dataset version if desired).
* Converts at 4k triples/second, which is the same order of magnitude as csv2rdf4lod.
* Isn't reported as part of http://datahub.io/group/lodcloud
* Doesn't provide a compelling example
* Doesn't quantify quality of enhancements
* Does not report on real use or acceptance of tool/results
* Claims "main challenge" is to find headers. That's our biggest problem?
* Cannot handle multiple tables in an archive file.
* Very "deliberately minimal" "mapping language", which is a subset of csv2rdf4lod's structural parameters: [where is the header](conversion:HeaderRow), what is [delimiter](conversion:delimits_cell), what property do I use for a column ([conversion:label](conversion:label), [conversion:equivalent_property](conversion:equivalent_property), [conversion:subproperty_of](conversion:subproperty_of)), what row/col do I [omit](conversion:Omitted) (especially [if it is empty](conversion:Only_if_column))? 
* Can only handle "one dimensional" tabular organizations (i.e. row-based, as opposed to [cell-based](Converting with cell based subjects))

Using wiki enhancements in csv2rdf4lod

We crawled the enhancements on their wiki to produce csv2rdf4lod enhancement parameters. They are currently unpublished with the URI http://logd.tw.rpi.edu/source/publicdata-eu/dataset/wiki-csv2rdf-mappings/version/2013-May-13 (we need to find an appropriate home for these before publishing them).
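
The URI above follows the source/dataset/version pattern. The following sketch shows how such hierarchical names keep locally minted properties apart where a single flat namespace would merge them. The helper names are hypothetical, the dataset identifiers `dataset-a` and `dataset-b` are made up, and the `vocab` path segment is an assumption chosen for illustration rather than a statement of csv2rdf4lod's exact layout.

```python
from urllib.parse import quote

FLAT_NAMESPACE = "http://wiki.publicdata.eu/ontology/"


def sdv_uri(base, source, dataset, version):
    """Join source, dataset, and version identifiers into one hierarchical URI."""
    parts = ["source", source, "dataset", dataset, "version", version]
    return base.rstrip("/") + "/" + "/".join(quote(p, safe="") for p in parts)


def scoped_property(base, source, dataset, local_name):
    """Mint a property URI under its source and dataset.

    ASSUMPTION: the "vocab" segment is illustrative only.
    """
    parts = ["source", source, "dataset", dataset, "vocab", local_name]
    return base.rstrip("/") + "/" + "/".join(quote(p, safe="") for p in parts)


# Reproduces the URI quoted above.
print(sdv_uri("http://logd.tw.rpi.edu", "publicdata-eu",
              "wiki-csv2rdf-mappings", "2013-May-13"))

# Two unrelated datasets that both have a column called "amount":
flat_a = FLAT_NAMESPACE + "amount"
flat_b = FLAT_NAMESPACE + "amount"
scoped_a = scoped_property("http://logd.tw.rpi.edu", "publicdata-eu", "dataset-a", "amount")
scoped_b = scoped_property("http://logd.tw.rpi.edu", "publicdata-eu", "dataset-b", "amount")

print(flat_a == flat_b)      # the flat namespace merges the two properties
print(scoped_a == scoped_b)  # the scoped names stay distinct
```

Scoped properties can later be related to shared vocabulary terms deliberately, instead of colliding by accident.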
