How to publish a txt corpus with NIF as Linked Data

Sebastian Hellmann edited this page Aug 26, 2013 · 3 revisions

We assume that you have a large number of txt files that you want to annotate and publish as Linked Data, for example: http://www.grammararchive.org/txt/

Expected Result

Identifiers

Each text has its own URI, e.g. http://www.grammararchive.org/resource/abbadie_kam1872. The URIs of annotated substrings start with this URI and add a #char=begin,end fragment.
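
The naming scheme can be sketched in a few lines of Python. This is only an illustration: the helper name `uris_for` is hypothetical, and the `/txt/` and `/rdf/` locations are the redirect targets used in the examples on this page.

```python
from pathlib import PurePosixPath

BASE = "http://www.grammararchive.org"

def uris_for(txt_filename: str) -> dict:
    """Derive the resource URI and its two representations from a txt file name."""
    # "abbadie_kam1872.txt" -> "abbadie_kam1872"
    stem = PurePosixPath(txt_filename).stem
    return {
        "resource": f"{BASE}/resource/{stem}",  # identifier, answers with a 303 redirect
        "text": f"{BASE}/txt/{stem}.txt",       # plain text representation
        "rdf": f"{BASE}/rdf/{stem}.ttl",        # NIF annotations in Turtle
    }

for kind, uri in uris_for("abbadie_kam1872.txt").items():
    print(kind, uri)
```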

Retrieve the plain text:

curl -I -H "Accept: text/plain" http://www.grammararchive.org/resource/abbadie_kam1872
HTTP/1.1 303 See Other
Location: http://www.grammararchive.org/txt/abbadie_kam1872.txt

Retrieve the annotations in NIF RDF:

curl -I -H "Accept: text/turtle" http://www.grammararchive.org/resource/abbadie_kam1872
HTTP/1.1 303 See Other
Location: http://www.grammararchive.org/rdf/abbadie_kam1872.ttl

curl http://www.grammararchive.org/rdf/abbadie_kam1872.ttl

should return something like:

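@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix nif: <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#> .
# placeholder namespace, replace it with the namespace of your own vocabulary
@prefix myvocab: <http://example.org/myvocab#> .
<http://www.grammararchive.org/resource/abbadie_kam1872#char=0,9115>
       rdf:type nif:RFC5147String , nif:Context ;
       nif:beginIndex "0" ;
       nif:endIndex "9115" ;
# add the complete text as the object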
       nif:isString """- 66

pommelés de la Celtibérie changeaient de robe quand ils étaient transportés .................<- 9115 chars total""" ;
# optionally link to the source file:
      nif:sourceUrl "http://www.grammararchive.org/txt/abbadie_kam1872.txt" .

# just the page number
<http://www.grammararchive.org/resource/abbadie_kam1872#char=2,4>
     a nif:RFC5147String ;
     nif:beginIndex "2" ;
     nif:endIndex "4" ;
     nif:referenceContext <http://www.grammararchive.org/resource/abbadie_kam1872#char=0,9115> ;
# add your own annotations here; feel free to use any vocabulary, e.g.
     a myvocab:PageNumber ;
     myvocab:pn "true" ;
     myvocab:number "66"^^xsd:integer .
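
How the `#char=begin,end` fragments relate to the text can be sketched as follows. This is a minimal illustration, not part of the NIF tooling: the helper names `nif_uri` and `context_and_substring` are hypothetical, and the sketch assumes that offsets count Unicode code points, which is what Python's `len()` and slicing do. Please check the NIF specification for the exact counting rule.

```python
def nif_uri(resource: str, begin: int, end: int) -> str:
    """Build a string URI with an RFC 5147 style character-range fragment."""
    return f"{resource}#char={begin},{end}"

def context_and_substring(resource: str, text: str, begin: int, end: int):
    """Return the context URI, the substring URI, and the substring itself."""
    # The context spans the whole text, from offset 0 to its length.
    context = nif_uri(resource, 0, len(text))
    # The substring URI reuses the resource URI with its own offsets.
    part = nif_uri(resource, begin, end)
    return context, part, text[begin:end]

resource = "http://www.grammararchive.org/resource/abbadie_kam1872"
text = "- 66\n\npommelés"  # shortened stand-in for the full text of the file
context, part, anchor = context_and_substring(resource, text, 2, 4)
print(context)
print(part)
print(anchor)
```

For the full file, `len(text)` would yield the end index of the context, which is 9115 in the example above.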

Tools to help you

NIF Converter Web Service

Code: https://github.com/NLP2RDF/software/blob/master/php/nif-ws.php
Parameter documentation: http://persistence.uni-leipzig.org/nlp2rdf/specification/api.html
Deployment (off the shelf): http://nlp2rdf.lod2.eu/nif-ws.php

Convert your data via our web service. Please consider deploying the code locally to save traffic:

curl -H "Accept: text/turtle" --data-urlencode input@abbadie_kam1872.txt \
        "http://nlp2rdf.lod2.eu/nif-ws.php?informat=text" > abbadie_kam1872.ttl

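For a larger batch of files, the same request can be built from a script. The following is a rough Python equivalent of the curl call, using only the standard library; the function names `build_request` and `convert_file` are hypothetical, and the sketch assumes the service accepts a form-encoded `input` field exactly as in the curl example.

```python
from pathlib import Path
from urllib.parse import urlencode
from urllib.request import Request, urlopen

ENDPOINT = "http://nlp2rdf.lod2.eu/nif-ws.php"

def build_request(text: str, endpoint: str = ENDPOINT) -> Request:
    """Build a POST request that sends the text as the form field 'input'."""
    body = urlencode({"input": text}).encode("ascii")  # percent-encoded UTF-8
    return Request(
        f"{endpoint}?informat=text",
        data=body,
        headers={"Accept": "text/turtle"},
        method="POST",
    )

def convert_file(txt_path: str, ttl_path: str) -> None:
    """Send one txt file to the service and store the returned Turtle."""
    text = Path(txt_path).read_text(encoding="utf-8")
    with urlopen(build_request(text)) as response:
        Path(ttl_path).write_bytes(response.read())

# Example call (performs a network request, so it is left commented out):
# convert_file("abbadie_kam1872.txt", "abbadie_kam1872.ttl")
```

If you deploy the code locally, pass the URL of your own installation as `endpoint`.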
NIF Validator

https://github.com/NLP2RDF/software#nif-validator

.htaccess (untested)

Activate mod_rewrite

sudo a2enmod rewrite  
sudo service apache2 restart

Sample .htaccess (untested)

Options -MultiViews

AddType application/rdf+xml .rdf .owl
AddType text/turtle .ttl
RewriteEngine On
AddCharset utf-8 .txt .log .ttl



##################
# Rewrite rule to serve  text/plain content if requested
##################
RewriteCond %{HTTP_ACCEPT} text/plain
RewriteRule ^resource/(.*)$ /txt/$1.txt [R=303,L]

# the "+" must be escaped because the pattern is a regular expression
RewriteCond %{HTTP_ACCEPT} application/rdf\+xml
RewriteRule ^resource/(.*)$ /rdf/$1.rdf [R=303,L]

#################
# Default
#################
RewriteRule ^resource/(.*)$ /rdf/$1.ttl [R=303,L]
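
Since the .htaccess is untested, it can help to check the intended redirect logic outside of Apache first. The following Python sketch mimics the rules above; it is a simplified model, not a mod_rewrite emulator. The function name `redirect_target` is hypothetical, and the sketch assumes that the `+` in `application/rdf+xml` is meant literally.

```python
import re

# (Accept-header regex, redirect template), checked in order like the
# RewriteCond/RewriteRule pairs in the .htaccess
RULES = [
    (r"text/plain", "/txt/{}.txt"),
    (r"application/rdf\+xml", "/rdf/{}.rdf"),
]
DEFAULT_TEMPLATE = "/rdf/{}.ttl"

def redirect_target(path, accept):
    """Return the 303 target for a per-directory path such as 'resource/abbadie_kam1872'."""
    match = re.match(r"^resource/(.*)$", path)
    if match is None:
        return None  # none of the rules applies to this path
    name = match.group(1)
    for accept_pattern, template in RULES:
        # Unanchored search, like a RewriteCond pattern without ^ or $
        if re.search(accept_pattern, accept):
            return template.format(name)
    return DEFAULT_TEMPLATE.format(name)

print(redirect_target("resource/abbadie_kam1872", "text/plain"))
print(redirect_target("resource/abbadie_kam1872", "text/turtle"))
```

Note that this style of matching ignores q-values: an Accept header that merely mentions text/plain with a low preference is still redirected to the txt file.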