Gawati Content Extraction and Relation with Gawati AKN metadata #6

kohsah · 2017-11-22T12:03:33Z

Content Extraction and Mapping with Akoma Ntoso Documents

Gawati separates the metadata of a document from the document itself.
By "the document itself" we mean the PDF, Word, or HTML legal document that the user uploads into the system.

Relation between AKN and PDF documents

Currently, the relation between the AKN metadata document and the binary document (e.g. the PDF) is made via an XML reference.

The AKN metadata document provides a standard way to identify the document itself, via the <an:identification> block. If you open any AKN document you will find this block at akomaNtoso->(doc type)->meta->identification:

<an:identification source="#gawati">
        <an:FRBRWork>
            <an:FRBRthis value="/akn/za/act/1961-05-18/gn_no_47-1961/!main"/>
            <an:FRBRuri value="/akn/za/act/1961-05-18/gn_no_47-1961"/>
            <an:FRBRdate name="Work Date" date="1961-05-18"/>
            <an:FRBRauthor href="#author"/>
            <an:FRBRcountry value="za" showAs="South Africa"/>
            <an:FRBRnumber value="gn_no_47-1961" showAs="GN No. 47/1961"/>
            <an:FRBRprescriptive value="false"/>
            <an:FRBRauthoritative value="false"/>
        </an:FRBRWork>
        <an:FRBRExpression>
            <an:FRBRthis value="/akn/za/act/1961-05-18/gn_no_47-1961/eng@/!main"/>
            <an:FRBRuri value="/akn/za/act/1961-05-18/gn_no_47-1961/eng@"/>
            <an:FRBRdate name="Expression Date" date="1961-05-18"/>
            <an:FRBRauthor href="#author"/>
            <an:FRBRlanguage language="eng"/>
        </an:FRBRExpression>
        <an:FRBRManifestation>
            <an:FRBRthis value="/akn/za/act/1961-05-18/gn_no_47-1961/eng@/!main.xml"/>
            <an:FRBRuri value="/akn/za/act/1961-05-18/gn_no_47-1961/eng@/.akn"/>
            <an:FRBRdate name="Manifestation Date" date="2016-03-30"/>
            <an:FRBRauthor href="#author"/>
            <an:FRBRformat value="xml"/>
        </an:FRBRManifestation>
    </an:identification>

A quick explanation of Work, Expression and Manifestation. The Work refers to the legislation in general, in this case the "Republic of South Africa (Temporary Provisions) Act 1961". The Act can be amended many times over the years and can be published in different languages and formats; the Work encompasses all of these. The Expression is a specific published version of the Act at a specific point in time, indicated by the Expression date (FRBRExpression->FRBRdate). The Manifestation is more specific still: it identifies one particular format of an Expression.

The XML document above is typically referenced via the Expression IRI (Internationalized Resource Identifier):

  <an:FRBRthis value="/akn/za/act/1961-05-18/gn_no_47-1961/eng@/!main"/>
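
As a minimal sketch of how the Work, Expression and Manifestation IRIs could be read out of the identification block, the Python below uses the standard library's ElementTree. The namespace URI bound to the "an" prefix is an assumption (the issue does not show it), the sample is a cut-down copy of the block above, and the helper name frbr_this is hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumption: the "an" prefix is bound to the AKN 3.0 namespace.
# Replace this with whatever URI your AKN documents actually declare.
AKN_NS = {"an": "http://docs.oasis-open.org/legaldocml/ns/akn/3.0"}

# Cut-down copy of the identification block shown above.
SAMPLE = """<an:akomaNtoso xmlns:an="http://docs.oasis-open.org/legaldocml/ns/akn/3.0">
  <an:doc name="act">
    <an:meta>
      <an:identification source="#gawati">
        <an:FRBRWork>
          <an:FRBRthis value="/akn/za/act/1961-05-18/gn_no_47-1961/!main"/>
        </an:FRBRWork>
        <an:FRBRExpression>
          <an:FRBRthis value="/akn/za/act/1961-05-18/gn_no_47-1961/eng@/!main"/>
        </an:FRBRExpression>
        <an:FRBRManifestation>
          <an:FRBRthis value="/akn/za/act/1961-05-18/gn_no_47-1961/eng@/!main.xml"/>
        </an:FRBRManifestation>
      </an:identification>
    </an:meta>
  </an:doc>
</an:akomaNtoso>"""


def frbr_this(root, level):
    """Return the FRBRthis @value for 'Work', 'Expression' or 'Manifestation'.

    Returns None when the requested level is not present.
    """
    path = f".//an:identification/an:FRBR{level}/an:FRBRthis"
    node = root.find(path, AKN_NS)
    return node.get("value") if node is not None else None


root = ET.fromstring(SAMPLE)
expression_iri = frbr_this(root, "Expression")
print(expression_iri)
```

The same helper returns the Work or Manifestation IRI when called with "Work" or "Manifestation".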

If you look further down the AKN document you will find a reference to a PDF file:

        <an:body>
            <an:book refersTo="#mainDocument">
                <an:componentRef src="/akn/za/act/1961-05-18/gn_no_47-1961/eng@/!main.pdf"
                    alt="akn_za_act_1961-05-18_gn_no_47-1961_eng_main.pdf" GUID="#embedded-doc-1"
                    showAs="The Republic of South Africa (Temporary Provisions) Act, 1961"/>
            </an:book>
        </an:body>

Here the <an:componentRef> provides a platform-independent way to resolve the PDF document. Let's examine it closely:

        <an:componentRef
            src="/akn/za/act/1961-05-18/gn_no_47-1961/eng@/!main.pdf"
            alt="akn_za_act_1961-05-18_gn_no_47-1961_eng_main.pdf"
            GUID="#embedded-doc-1"
            showAs="The Republic of South Africa (Temporary Provisions) Act, 1961"
        />

The first attribute, @src, is the FRBRManifestation IRI of the PDF. In Gawati the binary document (the PDF) is stored on the file system, in a folder derived from this IRI.

So with an IRI like:

/akn/za/act/1961-05-18/gn_no_47-1961/eng@/!main.pdf, the folder part would be: /akn/za/act/1961-05-18/gn_no_47-1961/eng@/

The @alt attribute identifies the actual file name:

akn_za_act_1961-05-18_gn_no_47-1961_eng_main.pdf; so the actual path of the file within the file system repository of PDFs would be:

/akn/za/act/1961-05-18/gn_no_47-1961/eng@/akn_za_act_1961-05-18_gn_no_47-1961_eng_main.pdf
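
The resolution rule just described (folder from @src, file name from @alt) can be sketched in a few lines of Python. This is only an illustration of the rule, not Gawati's implementation; the function name resolve_binary_path and the repo_root parameter are hypothetical.

```python
import posixpath


def resolve_binary_path(src, alt, repo_root="/"):
    """Combine the folder part of @src with the file name in @alt.

    repo_root is a hypothetical parameter: the directory under which the
    binary documents are stored. With the default "/" the result is the
    repository-relative path shown in the text.
    """
    folder = posixpath.dirname(src)                  # drop the "!main.pdf" segment
    relative = posixpath.join(folder, alt).lstrip("/")
    return posixpath.join(repo_root, relative)


path = resolve_binary_path(
    "/akn/za/act/1961-05-18/gn_no_47-1961/eng@/!main.pdf",
    "akn_za_act_1961-05-18_gn_no_47-1961_eng_main.pdf",
)
print(path)
```

posixpath is used rather than os.path so that the IRI is always split on forward slashes, whatever platform the code runs on.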

Content Extraction

By Content Extraction we mean extracting the content from the PDF (and eventually from other formats) and associating it with the AKN metadata of the same document. We want to get this extracted content into an XML document, so that we can search the AKN metadata and easily correlate it with the extracted content by keeping the two in the same context. The extracted content is linked to the AKN metadata document, once again, by the AKN IRI of the document.

A specific document structure will be used to hold the extracted content:

<document xmlns="http://gawati.org/ns/text/1.0">
   <source>
      <work iri="/akn/za/act/1961-05-18/gn_no_47-1961/!main" />
      <expression iri="/akn/za/act/1961-05-18/gn_no_47-1961/eng@/!main" />
      <manifestation iri="/akn/za/act/1961-05-18/gn_no_47-1961/eng@/!main.pdf" />
   </source>
   <text>
      <page no="1">
         page 1 text in terms of words, semantic lines, sentences, etc...
      </page>
      <page no="2">
         page 2 text...
      </page>
      <page no="3">
        page 3 text ....
      </page>
      <page no="4">
        page 4 text...
      </page>
    .....
   </text>
</document>
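
To make the structure above concrete, here is a minimal sketch that builds such a document with Python's ElementTree. It assumes the per-page text has already been extracted from the PDF by some other tool and is passed in as a list of strings; the function name build_text_document and the sample page strings are hypothetical. The IRIs are the ones from the identification block and the componentRef shown earlier.

```python
import xml.etree.ElementTree as ET

TEXT_NS = "http://gawati.org/ns/text/1.0"


def build_text_document(work_iri, expression_iri, manifestation_iri, pages):
    """Build the extracted-text document; pages is a list of page strings."""
    # Serialize the gawati text namespace as the default (unprefixed) namespace.
    ET.register_namespace("", TEXT_NS)

    doc = ET.Element(f"{{{TEXT_NS}}}document")
    source = ET.SubElement(doc, f"{{{TEXT_NS}}}source")
    for name, iri in (
        ("work", work_iri),
        ("expression", expression_iri),
        ("manifestation", manifestation_iri),
    ):
        ET.SubElement(source, f"{{{TEXT_NS}}}{name}", {"iri": iri})

    text = ET.SubElement(doc, f"{{{TEXT_NS}}}text")
    # Page numbers start at 1, matching the @no attribute in the structure.
    for number, content in enumerate(pages, start=1):
        page = ET.SubElement(text, f"{{{TEXT_NS}}}page", {"no": str(number)})
        page.text = content
    return doc


doc = build_text_document(
    "/akn/za/act/1961-05-18/gn_no_47-1961/!main",
    "/akn/za/act/1961-05-18/gn_no_47-1961/eng@/!main",
    "/akn/za/act/1961-05-18/gn_no_47-1961/eng@/!main.pdf",
    ["page 1 text", "page 2 text"],  # placeholder page content
)
print(ET.tostring(doc, encoding="unicode"))
```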

Here the <source> element provides a reference point to the AKN metadata document. Each of the @iri attributes on <work ... />, <expression ... /> and <manifestation ... /> refers to the <FRBRthis value=... /> within <FRBRWork>, <FRBRExpression> and <FRBRManifestation> respectively.

   <source>
      <work iri="/akn/za/act/1961-05-18/gn_no_47-1961/!main" />
      <expression iri="/akn/za/act/1961-05-18/gn_no_47-1961/eng@/!main" />
      <manifestation iri="/akn/za/act/1961-05-18/gn_no_47-1961/eng@/!main.pdf" />
   </source>

The actual extracted text is within the <text> element, and the content is split by page number. The page number is an important structural aspect when the source document is in a binary format like PDF or Word, where pagination is significant. (If the source document were in HTML or XML, the page number would be irrelevant.) Having the page number allows us to link from search results on the content directly to the specific page in the PDF.
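
A rough sketch of how per-page text could support that kind of link: the Python below finds the pages containing a search term and builds a link to the manifestation IRI with a "#page=N" fragment. The function names and the sample page text are hypothetical, the substring match stands in for whatever search engine is actually used, and whether a given PDF viewer honours the "#page=N" fragment is an assumption.

```python
import xml.etree.ElementTree as ET

NS = {"t": "http://gawati.org/ns/text/1.0"}

# Hypothetical extracted-text document with two short pages.
SAMPLE = """<document xmlns="http://gawati.org/ns/text/1.0">
   <source>
      <manifestation iri="/akn/za/act/1961-05-18/gn_no_47-1961/eng@/!main.pdf" />
   </source>
   <text>
      <page no="1">Republic of South Africa</page>
      <page no="2">Temporary Provisions</page>
   </text>
</document>"""


def find_pages(root, term):
    """Return the page numbers whose text contains term (case-insensitive)."""
    needle = term.lower()
    hits = []
    for page in root.findall("t:text/t:page", NS):
        if needle in (page.text or "").lower():
            hits.append(int(page.get("no")))
    return hits


def page_link(root, page_no):
    """Build a link to one page of the PDF manifestation."""
    iri = root.find("t:source/t:manifestation", NS).get("iri")
    # Assumption: the viewer understands the "#page=N" fragment.
    return f"{iri}#page={page_no}"


root = ET.fromstring(SAMPLE)
for number in find_pages(root, "temporary"):
    print(page_link(root, number))
```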


kohsah added the enhancement label Nov 22, 2017
