Gawati Content Extraction and Relation with Gawati AKN metadata #6

Open
kohsah opened this issue Nov 22, 2017 · 0 comments

Content Extraction and Mapping with Akoma Ntoso Documents

Gawati keeps the metadata of a document separate from the document itself.
Here "the document itself" means the legal document (PDF, Word, HTML) that a user uploads into the system.

Relation between AKN and PDF documents

Currently the relation between the AKN metadata document and the binary document (a PDF, for example) is made via an XML reference.

The AKN metadata document provides a standard way to identify the document, via the <an:identification> block. If you open any AKN document you will find this block at akomaNtoso->(doc type)->meta->identification:

<an:identification source="#gawati">
        <an:FRBRWork>
            <an:FRBRthis value="/akn/za/act/1961-05-18/gn_no_47-1961/!main"/>
            <an:FRBRuri value="/akn/za/act/1961-05-18/gn_no_47-1961"/>
            <an:FRBRdate name="Work Date" date="1961-05-18"/>
            <an:FRBRauthor href="#author"/>
            <an:FRBRcountry value="za" showAs="South Africa"/>
            <an:FRBRnumber value="gn_no_47-1961" showAs="GN No. 47/1961"/>
            <an:FRBRprescriptive value="false"/>
            <an:FRBRauthoritative value="false"/>
        </an:FRBRWork>
        <an:FRBRExpression>
            <an:FRBRthis value="/akn/za/act/1961-05-18/gn_no_47-1961/eng@/!main"/>
            <an:FRBRuri value="/akn/za/act/1961-05-18/gn_no_47-1961/eng@"/>
            <an:FRBRdate name="Expression Date" date="1961-05-18"/>
            <an:FRBRauthor href="#author"/>
            <an:FRBRlanguage language="eng"/>
        </an:FRBRExpression>
        <an:FRBRManifestation>
            <an:FRBRthis value="/akn/za/act/1961-05-18/gn_no_47-1961/eng@/!main.xml"/>
            <an:FRBRuri value="/akn/za/act/1961-05-18/gn_no_47-1961/eng@/.akn"/>
            <an:FRBRdate name="Manifestation Date" date="2016-03-30"/>
            <an:FRBRauthor href="#author"/>
            <an:FRBRformat value="xml"/>
        </an:FRBRManifestation>
    </an:identification>

A quick explanation of Work, Expression and Manifestation. The Work refers to the legislation in general, in this case the "Republic of South Africa (Temporary Provisions) Act 1961". The Act can be amended many times over the years and can be published in different languages and formats; the Work encompasses all of these. The Expression is a specific published version of the Act at a specific point in time, indicated by the Expression date (FRBRExpression->FRBRdate). The Manifestation is more specific still: it identifies one particular format of an Expression (FRBRManifestation->FRBRformat).

The XML document above is typically referenced via the Expression IRI (Internationalized Resource Identifier):

  <an:FRBRthis value="/akn/za/act/1961-05-18/gn_no_47-1961/eng@/!main"/>
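
As a rough illustration of how these identifiers could be read programmatically, the sketch below pulls the FRBRthis IRI for each FRBR level out of an identification block. It assumes the "an" prefix is bound to the OASIS Akoma Ntoso 3.0 namespace (the snippets in this issue do not show the namespace declaration), and the function name frbr_this_iris is hypothetical, not part of Gawati.

```python
import xml.etree.ElementTree as ET

# Assumption: the "an" prefix maps to the OASIS Akoma Ntoso 3.0 namespace.
AKN_NS = "http://docs.oasis-open.org/legaldocml/ns/akn/3.0"

# Trimmed copy of the identification block shown above.
SAMPLE = f"""
<an:identification xmlns:an="{AKN_NS}" source="#gawati">
  <an:FRBRWork>
    <an:FRBRthis value="/akn/za/act/1961-05-18/gn_no_47-1961/!main"/>
  </an:FRBRWork>
  <an:FRBRExpression>
    <an:FRBRthis value="/akn/za/act/1961-05-18/gn_no_47-1961/eng@/!main"/>
  </an:FRBRExpression>
  <an:FRBRManifestation>
    <an:FRBRthis value="/akn/za/act/1961-05-18/gn_no_47-1961/eng@/!main.xml"/>
  </an:FRBRManifestation>
</an:identification>
"""


def frbr_this_iris(identification):
    """Return a dict mapping each FRBR level to its FRBRthis IRI (hypothetical helper)."""
    ns = {"an": AKN_NS}
    levels = {
        "work": "FRBRWork",
        "expression": "FRBRExpression",
        "manifestation": "FRBRManifestation",
    }
    iris = {}
    for key, tag in levels.items():
        node = identification.find(f"an:{tag}/an:FRBRthis", ns)
        if node is not None:
            iris[key] = node.get("value")
    return iris


iris = frbr_this_iris(ET.fromstring(SAMPLE.strip()))
print(iris["expression"])
```

The Expression IRI returned here is the value that would be used to reference the metadata document.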

If you look further down the AKN document you will find a reference to a PDF file:

        <an:body>
            <an:book refersTo="#mainDocument">
                <an:componentRef src="/akn/za/act/1961-05-18/gn_no_47-1961/eng@/!main.pdf"
                    alt="akn_za_act_1961-05-18_gn_no_47-1961_eng_main.pdf" GUID="#embedded-doc-1"
                    showAs="The Republic of South Africa (Temporary Provisions) Act, 1961"/>
            </an:book>
        </an:body>

Here the <an:componentRef> provides a platform-independent way to resolve the PDF document. Let's examine it closely:

        <an:componentRef
            src="/akn/za/act/1961-05-18/gn_no_47-1961/eng@/!main.pdf"
            alt="akn_za_act_1961-05-18_gn_no_47-1961_eng_main.pdf"
            GUID="#embedded-doc-1"
            showAs="The Republic of South Africa (Temporary Provisions) Act, 1961"
        />

The first attribute, @src, is the FRBRManifestation IRI of the PDF. In Gawati the binary document (the PDF) is stored on the file system, in a folder derived from that IRI. So with an IRI like:

/akn/za/act/1961-05-18/gn_no_47-1961/eng@/!main.pdf

the folder part would be:

/akn/za/act/1961-05-18/gn_no_47-1961/eng@/

The @alt attribute identifies the actual file name:

akn_za_act_1961-05-18_gn_no_47-1961_eng_main.pdf

So the actual path of the file within the file system repository of PDFs would be:

/akn/za/act/1961-05-18/gn_no_47-1961/eng@/akn_za_act_1961-05-18_gn_no_47-1961_eng_main.pdf
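
The resolution rule just described (folder from @src, file name from @alt) can be sketched in a few lines. This is only an illustration of the rule as stated in this issue; the function name resolve_binary_path and the optional repo_root parameter are hypothetical and do not come from the Gawati code base.

```python
import posixpath


def resolve_binary_path(src, alt, repo_root=""):
    """Combine the folder part of @src with the file name in @alt.

    repo_root is a hypothetical prefix for wherever the PDF repository
    lives on disk; leave it empty to get the repository-relative path.
    """
    folder = posixpath.dirname(src)  # drops the trailing "!main.pdf" segment
    return repo_root.rstrip("/") + posixpath.join(folder, alt)


path = resolve_binary_path(
    src="/akn/za/act/1961-05-18/gn_no_47-1961/eng@/!main.pdf",
    alt="akn_za_act_1961-05-18_gn_no_47-1961_eng_main.pdf",
)
print(path)
```

posixpath is used rather than os.path so that the IRI is always split on forward slashes, whatever platform the code runs on.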

Content Extraction

By Content Extraction we mean extracting the content from the PDF (and eventually other formats) and associating it with the AKN metadata of the same document. We want the extracted content in an XML document, so that we can search the AKN metadata and easily correlate it with the extracted content by keeping both in the same context. The link between the extracted content and the AKN metadata document is, again, the AKN IRI of the document.

A specific document structure will be used to hold the extracted content:

<document xmlns="http://gawati.org/ns/text/1.0">
   <source>
      <work iri="/akn/za/act/1961-05-18/gn_no_47-1961/!main" />
      <expression iri="/akn/za/act/1961-05-18/gn_no_47-1961/eng@/!main" />
      <manifestation iri="/akn/za/act/1961-05-18/gn_no_47-1961/eng@/!main.pdf" />
   </source>
   <text>
      <page no="1">
         page 1 text in terms of words, semantic lines, sentences etc...
      </page>
      <page no="2">
         page 2 text...
      </page>
      <page no="3">
        page 3 text ....
      </page>
      <page no="4">
        page 4 text...
      </page>
    .....
   </text>
</document>

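Here the <source> element provides the reference point back to the AKN metadata document. The @iri attributes of <work ... />, <expression ... /> and <manifestation ... /> correspond to the <an:FRBRthis value=... /> entries within <an:FRBRWork>, <an:FRBRExpression> and <an:FRBRManifestation> respectively. In this example the manifestation IRI is that of the extracted PDF, i.e. the @src of <an:componentRef>.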

   <source>
      <work iri="/akn/za/act/1961-05-18/gn_no_47-1961/!main" />
      <expression iri="/akn/za/act/1961-05-18/gn_no_47-1961/eng@/!main" />
      <manifestation iri="/akn/za/act/1961-05-18/gn_no_47-1961/eng@/!main.pdf" />
   </source>
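
To make the proposed structure concrete, here is a minimal sketch of how such a document could be assembled once the text of each page has been extracted. The extraction step itself is out of scope; pages is simply assumed to be a list of strings, one per page, and build_text_document is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the <document> example above.
TEXT_NS = "http://gawati.org/ns/text/1.0"


def q(tag):
    """Qualify a tag name with the Gawati text namespace."""
    return f"{{{TEXT_NS}}}{tag}"


def build_text_document(work_iri, expression_iri, manifestation_iri, pages):
    """Build the <document> element; pages is a list of per-page strings (assumed input)."""
    doc = ET.Element(q("document"))
    source = ET.SubElement(doc, q("source"))
    for tag, iri in (
        ("work", work_iri),
        ("expression", expression_iri),
        ("manifestation", manifestation_iri),
    ):
        ET.SubElement(source, q(tag), {"iri": iri})
    text = ET.SubElement(doc, q("text"))
    # Page numbers start at 1, matching the @no attribute in the example.
    for number, content in enumerate(pages, start=1):
        page = ET.SubElement(text, q("page"), {"no": str(number)})
        page.text = content
    return doc


# Serialize with a default namespace instead of an auto-generated prefix.
ET.register_namespace("", TEXT_NS)
doc = build_text_document(
    "/akn/za/act/1961-05-18/gn_no_47-1961/!main",
    "/akn/za/act/1961-05-18/gn_no_47-1961/eng@/!main",
    "/akn/za/act/1961-05-18/gn_no_47-1961/eng@/!main.pdf",
    ["page 1 text", "page 2 text"],
)
print(ET.tostring(doc, encoding="unicode"))
```

The three IRIs passed in are the ones from the AKN metadata document, so the generated <source> block can always be traced back to its metadata.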

The actual extracted text is within the <text> element, and the content is split by page number. The page number is an important structural aspect when the source document is in a paginated binary format like PDF or Word (if the source document were HTML or XML, the page number would be irrelevant). Having the page number allows us to link from search results on the content directly to the specific page in the PDF.
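
A rough sketch of why the per-page split is useful: a search over the extracted text can return the manifestation IRI together with the page number of each hit, which is enough to point a viewer at the right page. The sample page text and the find_pages helper below are made up for illustration; a real deployment would presumably use the XML database's own full-text search rather than a substring match.

```python
import xml.etree.ElementTree as ET

TEXT_NS = "http://gawati.org/ns/text/1.0"

# Placeholder content: the page text here is invented for the example.
SAMPLE = """<document xmlns="http://gawati.org/ns/text/1.0">
  <source>
    <manifestation iri="/akn/za/act/1961-05-18/gn_no_47-1961/eng@/!main.pdf"/>
  </source>
  <text>
    <page no="1">placeholder text for page one</page>
    <page no="2">placeholder text mentioning provisions</page>
  </text>
</document>"""


def find_pages(doc, term):
    """Return (manifestation IRI, page number) for every page containing term."""
    ns = {"t": TEXT_NS}
    manifestation = doc.find("t:source/t:manifestation", ns).get("iri")
    needle = term.lower()
    hits = []
    for page in doc.findall("t:text/t:page", ns):
        # Case-insensitive substring match; page.text may be None for empty pages.
        if needle in (page.text or "").lower():
            hits.append((manifestation, int(page.get("no"))))
    return hits


hits = find_pages(ET.fromstring(SAMPLE), "provisions")
print(hits)
```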
