Automated creation of a new Versioned Dataset
- (for distinction among Abstract Dataset, Versioned Dataset, and Layer Dataset, see Springer LOGD book chapter (in press)).
- Directory structure as described in Conversion process phase: retrieve.
This page describes how to set up datasets so that others can recreate them from their original sources.
Going into a dataset's version/ directory:
```sh
$ cd /source/twc-rpi-edu/instance-hub-us-states-and-territories/version/
$ ls
```
we see a directory for each version of abstract dataset http://logd.tw.rpi.edu/source/twc-rpi-edu/dataset/instance-hub-us-states-and-territories:
```
2011-Apr-01/
2011-Apr-09/
2011-Mar-31/
```
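The directory names above are date stamps. A minimal sketch of that naming convention, assuming the `YYYY-Mon-DD` form shown (note that the identifier `cr:auto` actually chooses is up to the tool, not this snippet):

```sh
# Sketch: derive a version identifier in the date-stamp style of the
# directories listed above (e.g. 2011-Apr-01). Illustration only; the
# identifier csv2rdf4lod's cr:auto picks is the tool's own choice.
version=$(date +%Y-%b-%d)
mkdir -p "version/$version"
ls version/
```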
We can set up future versions by creating a script retrieve.sh with contents:
```sh
#!/bin/bash
#
#3> <>
#3>    rdfs:comment
#3>       "Script to retrieve and convert a new version of the dataset.";
#3>
#3>    rdfs:seeAlso
#3>       <https://github.com/timrdf/csv2rdf4lod-automation/wiki/Automated-creation-of-a-new-Versioned-Dataset>,
#3>       <https://github.com/timrdf/csv2rdf4lod-automation/wiki/tic-turtle-in-comments>;
#3> .

export CSV2RDF4LOD_CONVERT_OMIT_RAW_LAYER="true"

$CSV2RDF4LOD_HOME/bin/util/google2source.sh -w t9QH44S-_D6-4FQPOCM81BA auto
```
Running google2source.sh without arguments describes its usage. The -w flag tells it to actually create the version directory (instead of doing a dry run), t9QH44S-_D6-4FQPOCM81BA is the Google spreadsheet key (copied from the URL when viewing the spreadsheet), and auto says to use a default name for the local file created when retrieving the spreadsheet.
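The spreadsheet key above was copied from the browser's URL by hand. A hedged sketch of extracting it automatically, assuming the older `key=` query-parameter URL form (the exact URL layout varies by Google Docs era):

```sh
# Sketch: pull a spreadsheet key out of a Google Docs URL so it can be
# handed to google2source.sh. The 'key=' query-parameter form is an
# assumption (older Google Docs URLs); adjust the pattern for other forms.
url='https://spreadsheets.google.com/ccc?key=t9QH44S-_D6-4FQPOCM81BA&hl=en'
key=$(printf '%s\n' "$url" | sed -n 's/.*[?&]key=\([^&]*\).*/\1/p')
echo "$key"   # t9QH44S-_D6-4FQPOCM81BA
```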
Remember to chmod +x retrieve.sh the first time, then run:
```sh
./retrieve.sh
```
whenever you want to create a new versioned dataset by retrieving another copy of the Google spreadsheet. When doing so, the initial raw conversion will be run automatically, and any enhancement conversions will be run if the [[global enhancement parameters are in place|Reusing enhancement parameters for multiple versions or datasets]] (e.g., in /source/twc-rpi-edu/instance-hub-us-states-and-territories/version/).
If global enhancement parameters are established and the raw layer is useless to you, set CSV2RDF4LOD_CONVERT_OMIT_RAW_LAYER="true" in retrieve.sh (as shown above).
If Google returns an empty row before the header, use conversion:HeaderRow.
If we have a data download URL and have determined the source identifier and dataset identifier with which we want to organize its RDF conversion, we can describe it in DCAT and let cr-retrieve.sh act upon the access metadata to set up the directory structure and convert.
```sh
> cr-pwd.sh
source/
> mkdir -p cms-gov/hha-documentation
> cd cms-gov/hha-documentation
```
Make a file dcat.ttl containing something similar to the following (change the Distribution name and the download URL):
```turtle
@prefix rdfs:       <http://www.w3.org/2000/01/rdf-schema#> .
@prefix conversion: <http://purl.org/twc/vocab/conversion/> .
@prefix dcat:       <http://www.w3.org/ns/dcat#> .
@prefix void:       <http://rdfs.org/ns/void#> .
@prefix prov:       <http://www.w3.org/ns/prov#> .
@prefix datafaqs:   <http://purl.org/twc/vocab/datafaqs#> .
@prefix :           <http://purl.org/twc/health/id/> .

<http://purl.org/twc/health/source/cms-gov/dataset/hha-documentation>
   a void:Dataset, dcat:Dataset;
   conversion:source_identifier  "cms-gov";
   conversion:dataset_identifier "hha-documentation";
   prov:wasDerivedFrom :as_a_csv_2012-09-29cms-gov-hha-documentation;
.

:as_a_csv_2012-09-29cms-gov-hha-documentation
   a dcat:Distribution;
   dcat:downloadURL <http://www.cms.gov/Research-Statistics-Data-and-Systems/Files-for-Order/CostReports/DOCS/HHA-DOCUMENTATION.zip>;
.

#3> <> prov:wasAssociatedWith <http://tw.rpi.edu/instances/TimLebo> .
```
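The same setup can be scripted. A sketch that scaffolds the directory and writes the dcat.ttl, mirroring the manual steps above (cr-retrieve.sh itself is not invoked here; it would then act on this access metadata to retrieve and convert):

```sh
# Sketch: scaffold source/cms-gov/hha-documentation/ and its dcat.ttl.
# cr-retrieve.sh (not run here) would then act on the access metadata.
mkdir -p source/cms-gov/hha-documentation
cat > source/cms-gov/hha-documentation/dcat.ttl <<'EOF'
@prefix conversion: <http://purl.org/twc/vocab/conversion/> .
@prefix dcat:       <http://www.w3.org/ns/dcat#> .
@prefix void:       <http://rdfs.org/ns/void#> .
@prefix prov:       <http://www.w3.org/ns/prov#> .
@prefix :           <http://purl.org/twc/health/id/> .

<http://purl.org/twc/health/source/cms-gov/dataset/hha-documentation>
   a void:Dataset, dcat:Dataset;
   conversion:source_identifier  "cms-gov";
   conversion:dataset_identifier "hha-documentation";
   prov:wasDerivedFrom :as_a_csv_2012-09-29cms-gov-hha-documentation;
.
:as_a_csv_2012-09-29cms-gov-hha-documentation
   a dcat:Distribution;
   dcat:downloadURL <http://www.cms.gov/Research-Statistics-Data-and-Systems/Files-for-Order/CostReports/DOCS/HHA-DOCUMENTATION.zip>;
.
EOF
```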
In the twc-lobd svn, going to a dataset's version/ directory:
```sh
$ cd /source/ncbi-nlm-nih-gov/gene2ensembl/version/
$ ls
```
we see a directory for each version of abstract dataset http://health.tw.rpi.edu/source/ncbi-nlm-nih-gov/dataset/gene2ensembl:
```
2011-Apr-16/
```
We can set up future versions by creating a script retrieve.sh with contents (now version controlled):
```sh
#!/bin/bash
#
#3> @prefix doap:    <http://usefulinc.com/ns/doap#> .
#3> @prefix dcterms: <http://purl.org/dc/terms/> .
#3> @prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
#3>
#3> <#>
#3>    a doap:Project;
#3>    dcterms:description
#3>       "Script to retrieve and convert a new version of the dataset.";
#3>    rdfs:seeAlso
#3>       <https://github.com/timrdf/csv2rdf4lod-automation/wiki/Automated-creation-of-a-new-Versioned-Dataset>;
#3> .

export CSV2RDF4LOD_CONVERT_OMIT_RAW_LAYER="true"

$CSV2RDF4LOD_HOME/bin/cr-create-versioned-dataset-dir.sh cr:auto \
   'ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2ensembl.gz' \
   --comment-character '#' \
   --header-line 0 \
   --delimiter '\t'
```
Running cr-create-versioned-dataset-dir.sh without arguments describes its usage:
```
$ cr-create-versioned-dataset-dir.sh
usage: cr-create-versioned-dataset-dir.sh version-identifier URL [--comment-character char]
                                                                 [--header-line row]
                                                                 [--delimiter char]

  version-identifier: conversion:version_identifier for the VersionedDataset to create (use cr:auto for default)
  URL               : URL to retrieve the data file.
```
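The flags map directly onto the shape of the raw file. For a gene2ensembl-like file (tab-delimited, '#'-prefixed comment lines, no real header row), a sketch of what that interpretation means, using plain grep/cut rather than the converter itself; the sample row is illustrative, not real data:

```sh
# Sketch: what --comment-character '#' and --delimiter '\t' mean for a
# gene2ensembl-like file. grep/cut only mimic the interpretation; the
# actual parsing is done by csv2rdf4lod's converter.
printf '#tax_id\tGeneID\tEnsembl_gene_identifier\n9606\t1\tENSG00000121410\n' > sample.tsv
grep -v '^#' sample.tsv | cut -f3   # data rows only; third tab-delimited column
```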
Remember to chmod +x retrieve.sh the first time, then run:
```sh
./retrieve.sh
```
whenever you want to create a new versioned dataset by retrieving the data file URL. When doing so, the initial raw conversion will be run automatically, and any enhancement conversions will be run if the [[global enhancement parameters are in place|Reusing enhancement parameters for multiple versions or datasets]] (e.g., /source/ncbi-nlm-nih-gov/gene2ensembl/version/gene2ensembl.e1.params.ttl).
Use the retrieve.sh template.
```sh
cd source/contactingthecongress/directory-for-the-112th-congress/version
cp $CSV2RDF4LOD_HOME/bin/cr-create-versioned-dataset-dir.sh retrieve.sh
```
A global 2manual.sh can codify intermediate tweaks from source/ to manual/.
```sh
#!/bin/bash
#
#3> <> a conversion:PreparationTrigger;
#3>    foaf:name "2manual.sh";
#3>    rdfs:seeAlso
#3>       <https://github.com/timrdf/csv2rdf4lod-automation/wiki/Automated-creation-of-a-new-Versioned-Dataset>,
#3>       <https://github.com/timrdf/csv2rdf4lod-automation/wiki/Conversion-trigger>,
#3>       <https://github.com/timrdf/csv2rdf4lod-automation/wiki/Conversion-cockpit>;
#3> .
#
# This script is responsible for processing files in source/ and storing their modified forms
# as files in the manual/ directory. These modified files should be ready for conversion.
#
# This script is also responsible for constructing the conversion trigger
# (e.g., with cr-create-conversion-trigger.sh -w manual/*.csv)
#
# When this script resides in a cr:directory-of-versions directory
# (e.g. source/datahub-io/corpwatch/version),
# it is invoked by retrieve.sh (or cr-retrieve.sh).
# (see https://github.com/timrdf/csv2rdf4lod-automation/wiki/Directory-Conventions)
#
# When this script is invoked, the conversion cockpit is the current working directory.
#
```
With such a 2manual.sh in place, ../../src/html2csv.xsl can be applied to convert a source/*.html.tidy into a manual/*.csv.
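A minimal sketch of such a 2manual.sh, assuming a simple tweak (dropping a leading blank line from each source/*.csv); a real script might instead run tidy and the html2csv.xsl transform:

```sh
#!/bin/bash
# Sketch of a minimal 2manual.sh: copy each source/*.csv into manual/
# with a small repair applied. The "drop a leading blank line" tweak is
# a hypothetical example; a real 2manual.sh might run tidy and
# ../../src/html2csv.xsl instead.
mkdir -p manual
for csv in source/*.csv; do
   [ -e "$csv" ] || continue
   sed '1{/^$/d;}' "$csv" > "manual/$(basename "$csv")"
done
```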
- Reusing enhancement parameters for multiple versions or datasets
- Script: google2source.sh
- Short descriptions for csv2rdf4lod's scripts
- CSV2RDF4LOD_CONVERT_EXAMPLE_SUBSET_ONLY to reduce conversion processing time and output file size.
- Script: cr-test-conversion.sh to setup and invoke automated testing to verify the conversion output.