
Script: cr-test-conversion.sh

timrdf edited this page Aug 7, 2011 · 108 revisions

Motivation

Since csv2rdf4lod is continually developed, it is good to use the latest version (by using git pull). But what if new converter behavior changes how your data is produced? That's a problem, and you need to know about it ASAP. Even better, I need to know about it ASAP. Ideally, I would find and fix the problem before I even release the next version of the converter, so you wouldn't have to worry about it. cr-test-conversion.sh helps you identify these problems so that you can handle them quickly. At the same time, it lets you share your explicit expectations for the converter so that I can verify that it works for you before I release another version.

Implementation

The script $CSV2RDF4LOD_HOME/bin/util/cr-test-conversion.sh is a start at tackling this challenge. Like virtually all other cr- scripts, it is invoked from any conversion cockpit. When invoked, it applies a variety of SPARQL queries to verify the converted data.

Dependencies

The testing infrastructure currently uses Jena's TDB because it lets us set up a triple store in a local directory of our choosing. See TWC's page for help installing Jena TDB. If you can successfully run tdbloader and tdbquery, then you're good to go. (If you have a burning desire to test using other triple stores, go vote for #150.)
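A quick way to confirm that the two TDB commands are reachable is to look them up on your PATH. The following is a minimal sketch; the function name missing_tools is made up for this example and is not part of csv2rdf4lod.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of csv2rdf4lod): print each command that is
# not on PATH, one per line, and return 1 if any are missing.
missing_tools() {
  local tool status=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
      status=1
    fi
  done
  return "$status"
}

# Check for the two Jena TDB commands named above.
if missing="$(missing_tools tdbloader tdbquery)"; then
  echo "Jena TDB commands found; ready to test."
else
  # $missing is left unquoted on purpose so the names are joined by spaces.
  echo "Missing from PATH:" $missing >&2
fi
```

This only checks that the commands can be found. It does not verify that they can actually load or query a store.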

Using version-controlled csv2rdf4lod skeletons to report bugs

Version control strategies discusses how csv2rdf4lod-automation can be used within a version control system. When using one, it becomes incredibly easy to report a bug: all one needs to do is commit the .rq and point others to the URL of the test on the SVN web server. For example, someone could say:

Hey, this [1] doesn't work and I need it Real Soon!, it's for my demo.

[1] https://scm.escience.rpi.edu/svn/public/logd-csv2rdf4lod/data/source/data-gov-au/catalog/version/2011-Jun-27/rq/test/ask/present/thing_2.rq

With just this URL, I can run to my terminal:

$ mkdir hurry-and-fix; cd hurry-and-fix
$ svn checkout https://scm.escience.rpi.edu/svn/public/logd-csv2rdf4lod/data/source/data-gov-au/catalog \
               source/data-gov-au/catalog
$ cd source/data-gov-au/catalog/version/2011-Jun-27
$ export CSV2RDF4LOD_PUBLISH=true; export CSV2RDF4LOD_PUBLISH_TDB=true
$ ./convert-catalog.sh

bash-3.2$ cr-test-conversion.sh --verbose
................................................................................
rq/test/ask/absent/subject-uri-follows-sdv-naming.rq (Ask => No)

      <http://logd.tw.rpi.edu/source/data-gov-au/dataset/catalog/data.gov.au/version/2011-Jun-27/thing_2> ?p ?o .

-\-!-*-!-!-!-*-!-*-!-!-!-*-!-*-!-!-!-*-!-*-!-!-!-*-!-*-!-!-!-*-!-*-!-!-!-*-!-!-/           - - - FAIL - - 
rq/test/ask/present/thing_2-keywords-parsed.rq (Ask => No)

      :thing_2 dcterms:subject "Bicycles", 
                               "Bike paths",
                               "Cycling",
                               "Transport" .

-\-!-*-!-!-!-*-!-*-!-!-!-*-!-*-!-!-!-*-!-*-!-!-!-*-!-*-!-!-!-*-!-*-!-!-!-*-!-!-/           - - - FAIL - -
rq/test/ask/present/thing_2-keywords-unparsed.rq (Ask => No)

                   #http://logd.tw.rpi.edu/source/data-gov-au/dataset/catalog/version/2011-Jun-27/
      :thing_2 dgtwc:keywords   "Bicycles ,  Bike paths ,  Cycling ,  Transport" ;
               e1:keywords_tags "Bicycles ,  Bike paths ,  Cycling ,  Transport" .

-\-!-*-!-!-!-*-!-*-!-!-!-*-!-*-!-!-!-*-!-*-!-!-!-*-!-*-!-!-!-*-!-*-!-!-!-*-!-!-/           - - - FAIL - -
rq/test/ask/present/thing_2.rq (Ask => No)

      :thing_2
         e1:data_gov_au_category "Community ,  Health ,  Transport" ;
         dgtwc:categories        "Community ,  Health ,  Transport" ;
         # The following two should be parsed into the three triples below:
         dgtwc:category  "Community", 
                         "Health",
                         "Transport";
         # The following two should be parsed into the three triples below:
         e1:keywords_tags        "Bicycles ,  Bike paths ,  Cycling ,  Transport" ;
         dgtwc:keywords          "Bicycles ,  Bike paths ,  Cycling ,  Transport" ;
         dcterms:subject "Bicycles", 
                         "Bike paths",
                         "Cycling",
                         "Transport" .

--------------------------------------------------------------------------------
1 of 4 passed

And I can see your new concerns!

Exposing RDF conversion unit tests as RDF

By extending Vocabulary of Interlinked Datasets (VoID) and reusing Description of a Project (DOAP), we can model an abstract dataset that is under version control and has unit tests:

<http://logd.tw.rpi.edu/source/worldbank-org/dataset/world-development-indicators>
  a conversion:AbstractDataset, void:Dataset;
  a conversion:VersionControlledDataset;
  doap:repository [
    a doap:SVNRepository;
    doap:location <https://scm.escience.rpi.edu/svn/public/logd-csv2rdf4lod/data/source/worldbank-org/world-development-indicators/>;
  ];
  a conversion:UnitTestedDataset;
  conversion:testable_by [ 
     a doap:Project;
     doap:developer <http://tw.rpi.edu/instances/MaryamFazel-Zarandi>;
     doap:developer <http://tw.rpi.edu/instances/TimLebo>;
     doap:repository [ 
       a doap:SVNRepository;
       doap:location <https://scm.escience.rpi.edu/svn/public/logd-csv2rdf4lod/data/source/worldbank-org/world-development-indicators/rq/>
     ];
  ];
.

Sometimes tests can only apply to specific versions, since they have to assume specific values for a specific data element. Although they aren't as broadly applicable, they are still useful. The following RDF encoding states that a versioned dataset is under version control and has unit tests:

<http://logd.tw.rpi.edu/source/data-gov-au/dataset/catalog/version/2011-Jun-27>
  a conversion:VersionedDataset, void:Dataset;
  a conversion:VersionControlledDataset;
  doap:repository [
    a doap:SVNRepository;
    doap:location <https://scm.escience.rpi.edu/svn/public/logd-csv2rdf4lod/data/source/data-gov-au/catalog/>;
  ];
  a conversion:UnitTestedDataset;
  conversion:testable_by [ 
     a doap:Project;
     doap:developer <http://tw.rpi.edu/instances/YongmeiShi>;
     doap:developer <http://tw.rpi.edu/instances/TimLebo>;
     doap:repository [ 
       a doap:SVNRepository;
       doap:location 
  <https://scm.escience.rpi.edu/svn/public/logd-csv2rdf4lod/data/source/data-gov-au/catalog/version/2011-Jun-27/rq/>
     ];
  ];
.

cr-test-conversion.sh --catalog -w will write a listing that types each SPARQL-based unit test as an earl:TestCase. For example, source/worldbank-org/world-development-indicators/rq/test/list.ttl:

@prefix earl: <http://www.w3.org/ns/earl#> .

<ask/absent/impossible_series.rq>      a earl:TestCase .
<ask/absent/impossible.rq>             a earl:TestCase .
<ask/present/has-a-triple.rq>          a earl:TestCase .
<ask/present/has-impossible_series.rq> a earl:TestCase .
<ask/present/has-a-indicator.rq>       a earl:TestCase .
<ask/present/has-a-entry.rq>           a earl:TestCase .
<ask/present/has-a-country.rq>         a earl:TestCase .
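To get a feel for what goes into such a listing, the sketch below prints similar Turtle for every .rq file under a test directory. It is an illustration only, not the script's actual implementation. The function name list_tests_as_turtle is hypothetical, and the sorted order is a choice made for this example.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not how cr-test-conversion.sh --catalog is implemented):
# print a Turtle listing that types every .rq file under the given test
# directory as an earl:TestCase. Paths are relative to that directory.
list_tests_as_turtle() {
  local test_dir="${1:-rq/test}"
  echo '@prefix earl: <http://www.w3.org/ns/earl#> .'
  echo
  (cd "$test_dir" && find . -name '*.rq' -type f | sed 's|^\./||' | sort) |
    while IFS= read -r rq; do
      echo "<$rq> a earl:TestCase ."
    done
}

# Usage (from a directory that contains rq/test):
#   list_tests_as_turtle rq/test
```

Use cr-test-conversion.sh --catalog for the real dry run. The sketch is only meant to show the shape of the output.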

cr-test-conversion.sh usage

cr-test-conversion.sh --help:

usage: cr-test-conversion.sh
 --rq                   : Create initial rq/test/ask/{present,absent}/*.rq directory structure.
 --setup                : Run tests, populate the tdb/ beforehand.
 --setup {--verbose, -v}: Run tests, populate the tdb/ beforehand, and show query contents.
                        : Run tests. Needs rq/test or ../../rq/test and publish/tdb/.
 {--verbose, -v}        : Run tests. Needs same as above. Shows the query contents while testing.
 --catalog -w           : Find all rq/test and create rq/test/list.ttl rdf:typing them to earl:TestCase.
 --catalog              : Show dryrun of finding all rq/test; print hypothetical contents of rq/test/list.ttl.
 --show-catalog         : Show all rq/test/list.ttl

Setup

bash-3.2$ cd /source/medicare-gov/catalog

bash-3.2$ ls
version/

bash-3.2$ cr-test-conversion.sh --rq 
Creating rq/test for dataset medicare-gov catalog
rq/test/ask/present
rq/test/ask/present/a-dataset-exists.rq
rq/test/ask/absent
rq/test/ask/absent/impossible.rq

bash-3.2$ ls
version/
rq/

The two sample queries (a-dataset-exists.rq and impossible.rq) take the following form. If you follow this capitalization and structure, the --verbose flag will be a little cleaner when executing the tests.

...
ASK
WHERE {
   GRAPH ?g {
      ...
   }
}
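As an illustration of that form, the sketch below writes one more "present" test that needs no prefixes: it asks whether any named graph contains at least one triple. The file name has-some-triple.rq and the helper write_sample_test are made up for this example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write a sample "present" test under the given test root
# (default rq/test). The query follows the ASK / WHERE / GRAPH ?g layout above.
write_sample_test() {
  local test_root="${1:-rq/test}"
  mkdir -p "$test_root/ask/present"
  cat > "$test_root/ask/present/has-some-triple.rq" <<'EOF'
ASK
WHERE {
   GRAPH ?g {
      ?s ?p ?o .
   }
}
EOF
}

# Usage (from the directory that holds rq/):
#   write_sample_test rq/test
```

A real test would replace the ?s ?p ?o pattern with the specific triples you expect the converter to produce, along with the PREFIX declarations they need.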

(or on another machine, according to Version control strategies: only the essential minimum is needed)

Next, we can hop into a conversion cockpit and prepare to test:

bash-3.2$ cd version/2011-Jul-18/

bash-3.2$ ls
source/
doc/
manual/
convert-catalog.sh
automatic/
publish/
bash-3.2$ export CSV2RDF4LOD_PUBLISH_TDB=true

bash-3.2$ publish/bin/publish.sh
...
 WARN [main] (FactoryGraphTDB.java:241) - No BGP optimizer
Load: publish/medicare-gov-catalog-2011-Jul-18.nt
34,552 triples: loaded in 2.3 seconds [15,254.7 triples/s]

Test!

Source the my-csv2rdf4lod-source-me.sh for the project that you are testing against (see my-csv2rdf4lod-source-me.sh).

  • Then reset your CSV2RDF4LOD_HOME, CSV2RDF4LOD_CONVERT_MACHINE_URI, and CSV2RDF4LOD_CONVERT_PERSON_URI to point to your copy of the converter.
bash-3.2$ cr-test-conversion.sh 
../../rq/test/ask/absent/impossible.rq Ask => No
../../rq/test/ask/present/a-dataset-exists.rq Ask => Yes
--------------------------------------------------------------------------------
2 of 2 passed

If you'd like to see a bit more, use -v or --verbose:

bash-3.2$ cr-test-conversion.sh --verbose
................................................................................
../../rq/test/ask/absent/impossible.rq (Ask => No)

      twi:TimLebo owl:sameAs twi:notTimLebo .

................................................................................
../../rq/test/ask/present/a-dataset-exists.rq (Ask => Yes)

      ?dataset a conversion:Dataset, void:Dataset .

--------------------------------------------------------------------------------
2 of 2 passed
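The transcripts suggest the rule behind a pass: a query under ask/present/ is expected to answer Yes, and a query under ask/absent/ is expected to answer No. The sketch below applies that rule to a single query. It is not the script's implementation. It assumes tdbquery prints "Ask => Yes" or "Ask => No" as shown above and that the store lives in publish/tdb. The function names and the TDBQUERY override variable are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: the answer a test should give, judged by its directory.
expected_answer() {
  case "$1" in
    */ask/present/*) echo "Yes" ;;
    */ask/absent/*)  echo "No" ;;
    *)               return 1 ;;
  esac
}

# Hypothetical helper: run one ASK query and print PASS or FAIL.
# Assumptions: TDBQUERY names the query command (default tdbquery), and the
# second argument is the TDB directory (default publish/tdb).
run_ask_test() {
  local rq="$1" tdb="${2:-publish/tdb}" expected actual
  expected="$(expected_answer "$rq")" || { echo "SKIP $rq"; return 2; }
  if "${TDBQUERY:-tdbquery}" --loc="$tdb" --query="$rq" | grep -q 'Ask => Yes'; then
    actual="Yes"
  else
    actual="No"
  fi
  if [ "$actual" = "$expected" ]; then
    echo "PASS $rq (Ask => $actual)"
  else
    echo "FAIL $rq (Ask => $actual)"
    return 1
  fi
}

# Usage (from a conversion cockpit with publish/tdb populated):
#   run_ask_test ../../rq/test/ask/absent/impossible.rq
```

Making the query command overridable keeps the pass/fail rule easy to exercise without a populated store.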

Example: Testing GovTrack

From a conversion cockpit:

bash-3.2$ find rq
rq
rq/test
rq/test/ask
rq/test/ask/absent
rq/test/ask/absent/9-to-7.rq
rq/test/ask/present
rq/test/ask/present/0-to-2.rq
rq/test/ask/present/2-to-3.rq
rq/test/ask/present/3-to-5.rq
rq/test/ask/present/3-to-7.rq
rq/test/ask/present/5-to-1.rq
rq/test/ask/present/7-to-5.rq

Use export CSV2RDF4LOD_PUBLISH_TDB=true to have the conversion loaded into a TDB directory for querying.

[Figure: diagram of enhancements to the geonames zip code dump, http://download.geonames.org/export/zip/US.zip]

bash-3.2$ cr-test-conversion.sh -v
-\-!-*-!-!-!-*-!-*-!-!-!-*-!-*-!-!-!-*-!-*-!-!-!-*-!-*-!-!-!-*-!-*-!-!-!-*-!-!-/
rq/test/ask/absent/9-to-7.rq (Ask => Yes)           - - - FAIL - - -

      typed_subdivision_order_3:r40040c9reference_199_VA_US geonames:parentFeature <http://logd.tw.rpi.edu/source/geonames-org/dataset/zip-us/us/typed/subdivision_order_2/199_VA_US> .

................................................................................
rq/test/ask/present/0-to-2.rq (Ask => Yes) 

      zip-us-us:point_40040 
         a                       wgs:Point;
         geonames:parentFeature <http://logd.tw.rpi.edu/id/usps-com/zip/23690>;
         wgs:lat                ?lat;
         wgs:long               ?long .

................................................................................
rq/test/ask/present/2-to-3.rq (Ask => Yes) 

      <http://logd.tw.rpi.edu/id/usps-com/zip/23690> geonames:parentFeature typed_place:Yorktown_VA_US .

................................................................................
rq/test/ask/present/3-to-5.rq (Ask => Yes) 

      typed_place:Yorktown_VA_US geonames:parentFeature typed_subdivision_order_1:VA_US .

................................................................................
rq/test/ask/present/3-to-7.rq (Ask => Yes) 

      typed_place:Yorktown_VA_US geonames:parentFeature <http://logd.tw.rpi.edu/source/geonames-org/dataset/zip-us/us/typed/subdivision_order_2/199_VA_US> .

................................................................................
rq/test/ask/present/5-to-1.rq (Ask => Yes) 

      typed_subdivision_order_1:VA_US geonames:parentFeature typed_country:US .

................................................................................
rq/test/ask/present/7-to-5.rq (Ask => Yes) 

      <http://logd.tw.rpi.edu/source/geonames-org/dataset/zip-us/us/typed/subdivision_order_2/199_VA_US> geonames:parentFeature typed_subdivision_order_1:VA_US .

--------------------------------------------------------------------------------
6 of 7 passed
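Since the point of these tests is to notice regressions after a git pull, it can help to turn the summary line into an exit status. This page does not say whether cr-test-conversion.sh itself exits non-zero on failure, so the sketch below reads the "N of M passed" line instead. The function name all_tests_passed is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read test output on stdin and succeed only if the last
# "N of M passed" line reports N equal to M. Returns 2 if no summary is found.
all_tests_passed() {
  local summary passed total
  summary="$(grep -E '^[0-9]+ of [0-9]+ passed' | tail -n 1)"
  [ -n "$summary" ] || return 2
  passed="${summary%% of *}"     # text before " of "
  total="${summary#* of }"       # text after " of "
  total="${total%% passed*}"     # drop the trailing " passed"
  [ "$passed" -eq "$total" ]
}

# Usage (from a conversion cockpit):
#   cr-test-conversion.sh | all_tests_passed || echo "Conversion tests regressed" >&2
```

With the GovTrack output above (6 of 7 passed), the function would return a non-zero status.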

Test results vocabularies
