Script: pcurl.sh

Tim L edited this page Feb 4, 2014 · 45 revisions
csv2rdf4lod-automation is licensed under the [Apache License, Version 2.0](https://github.com/timrdf/csv2rdf4lod-automation/wiki/License)

What is first

What we will cover

Let's get to it

pcurl.sh is one of the most essential scripts for providing transparency in csv2rdf4lod.

While the rest of csv2rdf4lod-automation is dedicated to converting and publishing well-structured, highly connected RDF from tabular data, pcurl.sh captures the otherwise implicit connection between our local processing results and the original data provided by a more authoritative source organization. Associating our local results with the original data source enables accountability, repeatability, and attribution, all of which are essential for establishing trust in our third-party enhancements. pcurl.sh helps us fulfill our Design Objective: Capturing and Exposing Provenance.

The page Conversion process phase: retrieve also shows an example of how pcurl.sh is used.

Script location: $CSV2RDF4LOD_HOME/bin/pcurl.sh

Usage

bash-3.2$ pcurl.sh 
usage: pcurl.sh [-I] url [-n name] [-e extension] [url [-n name] [-e extension]]*
  -I  : do not download file; just obtain HTTP header information (c.f. curl -I)
  url : the URL to retrieve
  -n  : use 'name' as the local file name.
  -e  : use 'extension' as the extension to the local file name.

Example usage

pcurl.sh http://www.whitehouse.gov/files/disclosures/visitors/WhiteHouse-WAVES-Released-0111.zip

creates WhiteHouse-WAVES-Released-0111.zip and WhiteHouse-WAVES-Released-0111.zip.pml.ttl in the current working directory.
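The naming behaviour can be sketched as below. This is a guess, not pcurl.sh's own code: the helper name local_name is hypothetical. It assumes that the default local name is the last path segment of the URL, that -n replaces it, and that -e is appended after a dot, as the -n NY0261343 -e csv example later on this page suggests.

```shell
# Hypothetical helper, not part of pcurl.sh: guess the local file name
# for a URL, an optional -n name, and an optional -e extension.
local_name() {
   url=$1; name=$2; extension=$3
   # Default to the last path segment of the URL, minus any query string.
   base=${url##*/}
   base=${base%%\?*}
   if [ -n "$name" ]; then
      base=$name
   fi
   # Assumption: the extension is appended after a dot.
   if [ -n "$extension" ]; then
      base=$base.$extension
   fi
   printf '%s\n' "$base"
}

file=$(local_name "http://www.whitehouse.gov/files/disclosures/visitors/WhiteHouse-WAVES-Released-0111.zip" "" "")
echo "data:       $file"
echo "provenance: $file.pml.ttl"
```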

Example provenance captured - Direct URL from whitehouse.gov

The Turtle blocks in the rest of this section together form one continuous PML provenance capture of the URL retrieval shown above. Each block is preceded by a short comment describing what it represents.

@prefix rdfs:     <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd:      <http://www.w3.org/2001/XMLSchema#> .
@prefix dcterms:  <http://purl.org/dc/terms/> .
@prefix pmlp:     <http://inference-web.org/2.0/pml-provenance.owl#> .
@prefix pmlj:     <http://inference-web.org/2.0/pml-justification.owl#> .
@prefix irw:      <http://www.ontologydesignpatterns.org/ont/web/irw.owl#> .
@prefix nfo:      <http://www.semanticdesktop.org/ontologies/nfo/#> .
@prefix conv:     <http://purl.org/twc/vocab/conversion/> .
@prefix httphead: <http://inference-web.org/registry/MPR/HTTP_1_1_HEAD.owl#> .
@prefix httpget:  <http://inference-web.org/registry/MPR/HTTP_1_1_GET.owl#> .

The URL from which we retrieved our file, with a modification date reported by the HTTP server:

<http://www.whitehouse.gov/files/disclosures/visitors/WhiteHouse-WAVES-Released-0111.zip>
   a pmlp:Source;
.

<http://www.whitehouse.gov/files/disclosures/visitors/WhiteHouse-WAVES-Released-0111.zip>
   a pmlp:Source;
   pmlp:hasModificationDateTime "2011-01-28T23:19:12"^^xsd:dateTime;
.

The file on our local disk, which we hashed with MD5:

<WhiteHouse-WAVES-Released-0111.zip>
   a pmlp:Information;
   pmlp:hasReferenceSourceUsage <sourceUsage_4921a599-96e2-46a3-bb6e-e6769d4b3fdf_content>;
   nfo:hasHash <md5_b76602e45b2e9a76869200b877d01f1c>;
.

<md5_b76602e45b2e9a76869200b877d01f1c>
   a nfo:FileHash; 
   nfo:hashAlgorithm "md5";
   nfo:hasHash "b76602e45b2e9a76869200b877d01f1c";
.
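As a rough illustration of how the two blocks above could be produced, the sketch below hashes a local file and prints Turtle in the same shape. The helper names md5_of and hash_turtle are hypothetical, and this is not taken from pcurl.sh. It assumes one of md5sum, md5, or openssl is installed.

```shell
# Hypothetical helpers, not taken from pcurl.sh.
# Print the md5 hex digest of a file using whichever tool is installed.
md5_of() {
   if command -v md5sum >/dev/null 2>&1; then
      md5sum "$1" | awk '{print $1}'
   elif command -v md5 >/dev/null 2>&1; then
      md5 -q "$1"            # BSD / macOS
   else
      openssl dgst -md5 "$1" | awk '{print $NF}'
   fi
}

# Print Turtle shaped like the two blocks above for a local file.
# $1 is the file; $2 is the relative URI of its SourceUsage node.
hash_turtle() {
   file=$1; source_usage=$2
   hash=$(md5_of "$file")
   cat <<EOF
<$file>
   a pmlp:Information;
   pmlp:hasReferenceSourceUsage <$source_usage>;
   nfo:hasHash <md5_$hash>;
.

<md5_$hash>
   a nfo:FileHash;
   nfo:hashAlgorithm "md5";
   nfo:hasHash "$hash";
.
EOF
}

# Usage (file and node name are up to the caller):
#   hash_turtle WhiteHouse-WAVES-Released-0111.zip sourceUsage_X_content
```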

Justifying the existence of our file on disk as the result of curl issuing an HTTP GET request for the file:

<nodeSet_4921a599-96e2-46a3-bb6e-e6769d4b3fdf_content>
   a pmlj:NodeSet;
   pmlj:hasConclusion <WhiteHouse-WAVES-Released-0111.zip>;
   pmlj:isConsequentOf [
      a pmlj:InferenceStep;
      pmlj:hasIndex 0;
      pmlj:hasAntecedentList ();
      pmlj:hasSourceUsage     <sourceUsage_4921a599-96e2-46a3-bb6e-e6769d4b3fdf_content>;
      pmlj:hasInferenceEngine conv:curl_md5_5670dffdc5533a4c57243fc97b19a654;
      pmlj:hasInferenceRule   httpget:HTTP_1_1_GET;
   ];
.

The time that we retrieved the file:

<sourceUsage_4921a599-96e2-46a3-bb6e-e6769d4b3fdf_content>
   a pmlp:SourceUsage;
   pmlp:hasSource        <http://www.whitehouse.gov/files/disclosures/visitors/WhiteHouse-WAVES-Released-0111.zip>;
   pmlp:hasUsageDateTime "2011-02-22T22:30:42-05:00"^^xsd:dateTime;
.
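A timestamp in the form used by pmlp:hasUsageDateTime above can be produced with date and sed. The sketch below is illustrative only: the helper names are hypothetical, and the caller supplies the node name (the examples on this page embed a UUID in it).

```shell
# Hypothetical helper: current local time as an xsd:dateTime with a
# colon in the UTC offset, in the form 2011-02-22T22:30:42-05:00.
# date's %z prints an offset like -0500, so sed inserts the colon.
usage_date_time() {
   date +%Y-%m-%dT%H:%M:%S%z | sed 's/\([0-9][0-9]\)$/:\1/'
}

# Print a SourceUsage node. $1 is the node's relative URI, $2 the
# source URL that was retrieved.
source_usage_turtle() {
   cat <<EOF
<$1>
   a pmlp:SourceUsage;
   pmlp:hasSource        <$2>;
   pmlp:hasUsageDateTime "$(usage_date_time)"^^xsd:dateTime;
.
EOF
}
```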

The header information of the HTTP retrieval:

<info_4921a599-96e2-46a3-bb6e-e6769d4b3fdf_url_header>
   a pmlp:Information, conv:HTTPHeader;
   pmlp:hasRawString """HTTP/1.1 200 OK
ETag: "b76602e45b2e9a76869200b877d01f1c:1296256752"
Last-Modified: Fri, 28 Jan 2011 23:19:12 GMT
Accept-Ranges: bytes
Content-Length: 1247427
Content-Type: application/zip
Date: Wed, 23 Feb 2011 03:30:41 GMT
Connection: keep-alive
Server: White House
P3P: CP="NON DSP COR ADM DEV IVA OTPi OUR LEG"

""";
   pmlp:hasReferenceSourceUsage <sourceUsage_4921a599-96e2-46a3-bb6e-e6769d4b3fdf_url_header>;
.
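The Last-Modified header above carries the same instant as the pmlp:hasModificationDateTime recorded for the source. One way to derive that value from raw headers (for example, the output of curl -I) is sketched below. The helper name is hypothetical and the approach is an assumption, not pcurl.sh's own code. It expects the RFC 1123 date form shown in the header above.

```shell
# Hypothetical helper: read raw HTTP headers on stdin and print the
# Last-Modified value as an xsd:dateTime lexical form (GMT, no offset).
last_modified_xsd() {
   tr -d '\r' | awk '
      tolower($1) == "last-modified:" {
         # Fields: Last-Modified: Fri, 28 Jan 2011 23:19:12 GMT
         month = (index("JanFebMarAprMayJunJulAugSepOctNovDec", $4) + 2) / 3
         printf "%04d-%02d-%02dT%s\n", $5, month, $3, $6
         exit
      }'
}

printf 'HTTP/1.1 200 OK\r\nLast-Modified: Fri, 28 Jan 2011 23:19:12 GMT\r\nContent-Type: application/zip\r\n\r\n' |
   last_modified_xsd
# prints 2011-01-28T23:19:12
```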

Justifying the HTTP header as the result of curl requesting an HTTP HEAD:

<nodeSet_4921a599-96e2-46a3-bb6e-e6769d4b3fdf_url_header>
   a pmlj:NodeSet;
   pmlj:hasConclusion <info_4921a599-96e2-46a3-bb6e-e6769d4b3fdf_url_header>;
   pmlj:isConsequentOf [
      a pmlj:InferenceStep;
      pmlj:hasIndex 0;
      pmlj:hasAntecedentList ();
      pmlj:hasSourceUsage     <sourceUsage_4921a599-96e2-46a3-bb6e-e6769d4b3fdf_url_header>;
      pmlj:hasInferenceEngine conv:curl_md5_5670dffdc5533a4c57243fc97b19a654;
      pmlj:hasInferenceRule   httphead:HTTP_1_1_HEAD;
   ];
.

The time that we retrieved the HTTP HEAD:

<sourceUsage_4921a599-96e2-46a3-bb6e-e6769d4b3fdf_url_header>
   a pmlp:SourceUsage;
   pmlp:hasSource        <http://www.whitehouse.gov/files/disclosures/visitors/WhiteHouse-WAVES-Released-0111.zip>;
   pmlp:hasUsageDateTime "2011-02-22T22:30:42-05:00"^^xsd:dateTime;
.

Identifying the curl implementation that performed the retrievals for us:

conv:curl_md5_5670dffdc5533a4c57243fc97b19a654
   a pmlp:InferenceEngine, conv:Curl;
   dcterms:identifier "md5_5670dffdc5533a4c57243fc97b19a654";
   dcterms:description """curl 7.19.7 (universal-apple-darwin10.0) libcurl/7.19.7 OpenSSL/0.9.8l zlib/1.2.3
Protocols: tftp ftp telnet dict ldap http file https ftps 
Features: GSS-Negotiate IPv6 Largefile NTLM SSL libz """;
.

conv:Curl rdfs:subClassOf pmlp:InferenceEngine .

Example provenance captured - Following HTTP redirects to the final URL

Invoking

pcurl.sh http://www.data.gov/download/1554/csv -n 1554 -e csv

results in this PML, which is illustrated in this PDF diagram (note: the illustration predates the attribution properties).

Note the provenir:, hartigprov: and oprov: attributes associating the InferenceStep to a user account, which in turn is related to a Person.
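To see which hops a redirected retrieval passes through, the Location headers of each response can be listed. The sketch below is only an illustration of that idea, not how pcurl.sh records redirects. The helper name and the sample headers are made up, and example.org is a placeholder. In practice the input could come from curl -sIL followed by the URL.

```shell
# Hypothetical helper: read the headers of every hop (as printed by
# `curl -sIL URL`) on stdin and print each redirect target in order.
# Relative Location values are printed as-is, not resolved.
redirect_targets() {
   tr -d '\r' | awk 'tolower($1) == "location:" { print $2 }'
}

# Made-up sample headers for two hops; example.org is a placeholder.
redirect_targets <<'EOF'
HTTP/1.1 302 Found
Location: http://example.org/files/1554.csv

HTTP/1.1 200 OK
Content-Type: text/csv

EOF
# prints http://example.org/files/1554.csv
```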

Using previous provenance to re-retrieve URLs

(Note: using VSR, which isn't published yet)

pcurl.sh `vsr-query-redirect-beginning-sources.sh ../../2010-10-31/source/sparql-service-description.ttl.pml.ttl` -e ttl
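Until VSR is available, a rough stand-in can pull source URLs out of an earlier .pml.ttl file. The sketch below is NOT the VSR script. The helper name pml_sources is hypothetical, it relies on the line layout of the Turtle shown earlier on this page rather than on real RDF parsing, and it does not distinguish the beginning of a redirect chain from later hops.

```shell
# Rough stand-in, NOT the VSR script: list the distinct URLs that a
# .pml.ttl file names as pmlp:hasSource, relying on the line layout
# shown in the examples on this page rather than on Turtle parsing.
pml_sources() {
   sed -n 's/^ *pmlp:hasSource  *<\([^>]*\)>;.*$/\1/p' "$1" | sort -u
}

# Usage (previous.pml.ttl is a placeholder file name):
#   pcurl.sh `pml_sources previous.pml.ttl` -e ttl
```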

Provenance of HTTP POSTs

The curl command:

curl http://www.epa-echo.gov/cgi-bin/effluentdata.cgi \
         -F "permit=NY0261343" -F "hits=1" > NY0261343.csv

can be performed using pcurl.sh:

pcurl.sh http://www.epa-echo.gov/cgi-bin/effluentdata.cgi \
         -F "permit=NY0261343" -F "hits=1" -n NY0261343 -e csv

and captures each POST field and its value as a pmlj:hasVariableMapping on the InferenceStep:

<inferenceStep_d159b3fc-7c38-44e1-b68d-b1d591385d68_content>
   a pmlj:InferenceStep;
   pmlj:hasIndex 0;
   pmlj:hasAntecedentList ();
   pmlj:hasSourceUsage     <sourceUsage_d159b3fc-7c38-44e1-b68d-b1d591385d68_content>;
   pmlj:hasInferenceEngine conv:curl_md5_5670dffdc5533a4c57243fc97b19a654;
   pmlj:hasInferenceRule   httppost:HTTP_1_1_POST;
   oboro:has_agent          <http://tw.rpi.edu/web/inside/machine/lebot_macbook#lebot>;
   hartigprov:involvedActor <http://tw.rpi.edu/web/inside/machine/lebot_macbook#lebot>;
   pmlj:hasVariableMapping [ pmlj:mapFrom "permit"; pmlj:mapTo "NY0261343"; ];
   pmlj:hasVariableMapping [ pmlj:mapFrom "hits"; pmlj:mapTo "1"; ];
.
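The mapping from -F arguments to the pmlj:hasVariableMapping lines above can be sketched as follows. The helper name is hypothetical and this is not pcurl.sh's own code. It assumes plain field=value arguments and does not escape quotes or backslashes in values.

```shell
# Hypothetical helper: turn field=value arguments (the values given to
# curl -F) into pmlj:hasVariableMapping lines like those above.
variable_mappings() {
   for pair in "$@"; do
      field=${pair%%=*}     # text before the first '='
      value=${pair#*=}      # text after the first '='
      printf '   pmlj:hasVariableMapping [ pmlj:mapFrom "%s"; pmlj:mapTo "%s"; ];\n' "$field" "$value"
   done
}

variable_mappings "permit=NY0261343" "hits=1"
```

Splitting on the first '=' keeps values that themselves contain '=' intact.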

What is next
