Script: pcurl.sh
pcurl.sh is one of the most essential scripts in csv2rdf4lod.
It is our ticket to accountability, repeatability, and attribution because it records, in the Proof Markup Language (PML), the association between a file on your disk and the URL from which you obtained it. pcurl.sh helps us fulfill our Design Objective: Capturing and Exposing Provenance.
Conversion process phase: retrieve also shows an example of how pcurl.sh is used.
bash-3.2$ pcurl.sh
usage: pcurl.sh [-I] url [-n name] [-e extension] [url [-n name] [-e extension]]*
-I : do not download file; just obtain HTTP header information (c.f. curl -I)
url : the URL to retrieve
-n : use 'name' as the local file name.
-e : use 'extension' as the extension to the local file name.
pcurl.sh http://www.whitehouse.gov/files/disclosures/visitors/WhiteHouse-WAVES-Released-0111.zip
creates WhiteHouse-WAVES-Released-0111.zip and WhiteHouse-WAVES-Released-0111.zip.pml.ttl in the current working directory.
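The usage text and the example above suggest how the local file name is chosen. The following is a minimal sketch of that naming logic, not code taken from pcurl.sh: the function name `local_name` is hypothetical, and it assumes that the default name is the last path segment of the URL and that `-n` and `-e` combine as `name.extension`.

```bash
#!/bin/bash
# Hypothetical helper (not part of pcurl.sh): guess the local file name
# from a URL plus optional -n name and -e extension values.
local_name() {
  local url="$1" name="$2" extension="$3"
  # Without an explicit name, fall back to the last path segment of the URL,
  # ignoring any query string.
  if [ -z "$name" ]; then
    name="${url%%\?*}"
    name="${name##*/}"
  fi
  # Append the extension only when one was given.
  if [ -n "$extension" ]; then
    name="$name.$extension"
  fi
  printf '%s\n' "$name"
}

local_name http://www.whitehouse.gov/files/disclosures/visitors/WhiteHouse-WAVES-Released-0111.zip
local_name http://www.data.gov/download/1554/csv 1554 csv
```

Under these assumptions, the provenance file would sit next to the download under the same name with `.pml.ttl` appended.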
The following blocks of PML, encoded in Turtle, form one continuous provenance record of the URL retrieval shown above; the comment before each block describes the representation that follows it.
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix pmlp: <http://inference-web.org/2.0/pml-provenance.owl#> .
@prefix pmlj: <http://inference-web.org/2.0/pml-justification.owl#> .
@prefix irw: <http://www.ontologydesignpatterns.org/ont/web/irw.owl#> .
@prefix nfo: <http://www.semanticdesktop.org/ontologies/nfo/#> .
@prefix conv: <http://purl.org/twc/vocab/conversion/> .
@prefix httphead: <http://inference-web.org/registry/MPR/HTTP_1_1_HEAD.owl#> .
@prefix httpget: <http://inference-web.org/registry/MPR/HTTP_1_1_GET.owl#> .
The URL from which we retrieved our file, with a modification date reported by the HTTP server:
<http://www.whitehouse.gov/files/disclosures/visitors/WhiteHouse-WAVES-Released-0111.zip>
a pmlp:Source;
.
<http://www.whitehouse.gov/files/disclosures/visitors/WhiteHouse-WAVES-Released-0111.zip>
a pmlp:Source;
pmlp:hasModificationDateTime "2011-01-28T23:19:12"^^xsd:dateTime;
.
The file on our local disk, along with its MD5 hash:
<WhiteHouse-WAVES-Released-0111.zip>
a pmlp:Information;
pmlp:hasReferenceSourceUsage <sourceUsage_4921a599-96e2-46a3-bb6e-e6769d4b3fdf_content>;
nfo:hasHash <md5_b76602e45b2e9a76869200b877d01f1c>;
.
<md5_b76602e45b2e9a76869200b877d01f1c>
a nfo:FileHash;
nfo:hashAlgorithm "md5";
nfo:hasHash "b76602e45b2e9a76869200b877d01f1c";
.
Justifying the existence of our file on disk as the result of curl requesting the file with an HTTP GET:
<nodeSet_4921a599-96e2-46a3-bb6e-e6769d4b3fdf_content>
a pmlj:NodeSet;
pmlj:hasConclusion <WhiteHouse-WAVES-Released-0111.zip>;
pmlj:isConsequentOf [
a pmlj:InferenceStep;
pmlj:hasIndex 0;
pmlj:hasAntecedentList ();
pmlj:hasSourceUsage <sourceUsage_4921a599-96e2-46a3-bb6e-e6769d4b3fdf_content>;
pmlj:hasInferenceEngine conv:curl_md5_5670dffdc5533a4c57243fc97b19a654;
pmlj:hasInferenceRule httpget:HTTP_1_1_GET;
];
.
The time that we retrieved the file:
<sourceUsage_4921a599-96e2-46a3-bb6e-e6769d4b3fdf_content>
a pmlp:SourceUsage;
pmlp:hasSource <http://www.whitehouse.gov/files/disclosures/visitors/WhiteHouse-WAVES-Released-0111.zip>;
pmlp:hasUsageDateTime "2011-02-22T22:30:42-05:00"^^xsd:dateTime;
.
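The usage time is an `xsd:dateTime` in local time with a numeric UTC offset. As a rough sketch of how such a value can be produced in a shell script (the function name `xsd_datetime_now` is hypothetical, and this is not necessarily how pcurl.sh does it):

```bash
#!/bin/bash
# Print the current local time as an xsd:dateTime with a numeric UTC offset.
xsd_datetime_now() {
  # %z yields an offset such as -0500; sed inserts the colon that
  # xsd:dateTime requires, giving -05:00.
  date +%Y-%m-%dT%H:%M:%S%z | sed 's/\([0-9][0-9]\)$/:\1/'
}

echo "pmlp:hasUsageDateTime \"$(xsd_datetime_now)\"^^xsd:dateTime;"
```

Keeping the offset makes the value unambiguous: the recorded 22:30:42-05:00 falls one second after the server's `Date` header of 03:30:41 GMT on the following day.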
The header information of the HTTP retrieval:
<info_4921a599-96e2-46a3-bb6e-e6769d4b3fdf_url_header>
a pmlp:Information, conv:HTTPHeader;
pmlp:hasRawString """HTTP/1.1 200 OK
ETag: "b76602e45b2e9a76869200b877d01f1c:1296256752"
Last-Modified: Fri, 28 Jan 2011 23:19:12 GMT
Accept-Ranges: bytes
Content-Length: 1247427
Content-Type: application/zip
Date: Wed, 23 Feb 2011 03:30:41 GMT
Connection: keep-alive
Server: White House
P3P: CP="NON DSP COR ADM DEV IVA OTPi OUR LEG"
""";
pmlp:hasReferenceSourceUsage <sourceUsage_4921a599-96e2-46a3-bb6e-e6769d4b3fdf_url_header>;
.
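The captured header is also where the modification date attached to the source URL comes from. The sketch below shows one way to pull a single field such as `Last-Modified` out of raw header text; the function name `header_value` is hypothetical, and the header excerpt is copied from the capture above.

```bash
#!/bin/bash
# Hypothetical helper: print the value of one HTTP header field read from stdin.
header_value() {
  awk -v field="$1" '
    BEGIN { field = tolower(field) ":" }
    # Compare the lowercased start of each line with "field:".
    tolower(substr($0, 1, length(field))) == field {
      sub(/\r$/, "")            # drop the carriage return that ends HTTP lines
      sub(/^[^:]*:[ \t]*/, "")  # drop the field name and the separator
      print
      exit
    }'
}

header_value Last-Modified <<'EOF'
HTTP/1.1 200 OK
ETag: "b76602e45b2e9a76869200b877d01f1c:1296256752"
Last-Modified: Fri, 28 Jan 2011 23:19:12 GMT
Content-Length: 1247427
EOF
```

The match is case-insensitive because HTTP header field names are.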
Justifying the HTTP header as the result of curl requesting an HTTP HEAD:
<nodeSet_4921a599-96e2-46a3-bb6e-e6769d4b3fdf_url_header>
a pmlj:NodeSet;
pmlj:hasConclusion <info_4921a599-96e2-46a3-bb6e-e6769d4b3fdf_url_header>;
pmlj:isConsequentOf [
a pmlj:InferenceStep;
pmlj:hasIndex 0;
pmlj:hasAntecedentList ();
pmlj:hasSourceUsage <sourceUsage_4921a599-96e2-46a3-bb6e-e6769d4b3fdf_url_header>;
pmlj:hasInferenceEngine conv:curl_md5_5670dffdc5533a4c57243fc97b19a654;
pmlj:hasInferenceRule httphead:HTTP_1_1_HEAD;
];
.
The time that we retrieved the HTTP HEAD:
<sourceUsage_4921a599-96e2-46a3-bb6e-e6769d4b3fdf_url_header>
a pmlp:SourceUsage;
pmlp:hasSource <http://www.whitehouse.gov/files/disclosures/visitors/WhiteHouse-WAVES-Released-0111.zip>;
pmlp:hasUsageDateTime "2011-02-22T22:30:42-05:00"^^xsd:dateTime;
.
Identifying the curl implementation that performed the retrievals for us:
conv:curl_md5_5670dffdc5533a4c57243fc97b19a654
a pmlp:InferenceEngine, conv:Curl;
dcterms:identifier "md5_5670dffdc5533a4c57243fc97b19a654";
dcterms:description """curl 7.19.7 (universal-apple-darwin10.0) libcurl/7.19.7 OpenSSL/0.9.8l zlib/1.2.3
Protocols: tftp ftp telnet dict ldap http file https ftps
Features: GSS-Negotiate IPv6 Largefile NTLM SSL libz """;
.
conv:Curl rdfs:subClassOf pmlp:InferenceEngine .
Invoking
pcurl.sh http://www.data.gov/download/1554/csv -n 1554 -e csv
results in this PML, which is illustrated in this PDF diagram (note: the illustration predates the attribution properties).
Note the provenir:, hartigprov:, and oprov: properties associating the InferenceStep with a user account, which in turn is related to a Person.
(Note: the following example uses VSR, which is not yet published.)
pcurl.sh `vsr-query-redirect-beginning-sources.sh ../../2010-10-31/source/sparql-service-description.ttl.pml.ttl` -e ttl