Content Extraction and Mapping with Akoma Ntoso Documents
Gawati separates the metadata of a document from the document itself.
By "the document itself" we mean the legal document (PDF, Word or HTML) that the user uploads into the system.
Relation between AKN and PDF documents
Currently the relation between the AKN metadata document and the uploaded document (e.g. the PDF) is made via an XML reference.
The AKN metadata document provides a standard way to reference the document itself, via the <an:identification> block. If you open any AKN document you will find this block at akomaNtoso -> (doc type) -> meta -> identification:
A quick explanation of Work, Expression and Manifestation. The Work refers to the legislation in general, in this case the "Republic of South Africa (Temporary Provisions) Act 1961". The Act can be amended many times over the years and can be published in different languages and formats; the Work encompasses all of these. The Expression is a specific published version of the Act at a specific point in time, indicated by the Expression date (FRBRExpression->FRBRdate). The Manifestation is more specific still: it identifies one particular format of an Expression, for example the PDF.
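The three levels can also be read off an IRI. The following is a minimal sketch, not Gawati's implementation: it takes the manifestation IRI used in the `<an:componentRef>` example in this document and cuts it into the path prefixes that roughly correspond to each level. The function name `frbr_levels` is hypothetical, and the rule that the language segment is the one containing `@` is an assumption drawn from that single example. The actual `FRBRthis` values in a document may carry extra suffixes, so treat the output as illustrative only.

```python
def frbr_levels(manifestation_iri: str) -> dict:
    """Cut a manifestation IRI into Work / Expression / Manifestation prefixes.

    Assumption: the expression-level segment is the first path segment
    that contains '@' (e.g. 'eng@').
    """
    parts = manifestation_iri.strip("/").split("/")
    try:
        lang_idx = next(i for i, part in enumerate(parts) if "@" in part)
    except StopIteration:
        raise ValueError(f"no language segment in {manifestation_iri!r}") from None
    return {
        # Everything before the language segment: the legislation in general.
        "work": "/" + "/".join(parts[:lang_idx]),
        # Up to and including the language segment: one published version.
        "expression": "/" + "/".join(parts[: lang_idx + 1]),
        # The full IRI: one specific format of that version.
        "manifestation": manifestation_iri,
    }


levels = frbr_levels("/akn/za/act/1961-05-18/gn_no_47-1961/eng@/!main.pdf")
for level, iri in levels.items():
    print(f"{level:14} {iri}")
```

Each level is a prefix of the next, which mirrors the way the Work encompasses its Expressions and each Expression its Manifestations.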
The XML document above is typically referenced via the Expression IRI (Internationalized Resource Identifier):
If you look further down the AKN document you will find a reference to a PDF file:
<an:body>
  <an:book refersTo="#mainDocument">
    <an:componentRef src="/akn/za/act/1961-05-18/gn_no_47-1961/eng@/!main.pdf" alt="akn_za_act_1961-05-18_gn_no_47-1961_eng_main.pdf" GUID="#embedded-doc-1" showAs="The Republic of South Africa (Temporary Provisions) Act, 1961"/>
  </an:book>
</an:body>
Here the <an:componentRef> provides a platform-independent way to resolve the PDF document. Let's examine it closely:
<an:componentRef
  src="/akn/za/act/1961-05-18/gn_no_47-1961/eng@/!main.pdf"
  alt="akn_za_act_1961-05-18_gn_no_47-1961_eng_main.pdf"
  GUID="#embedded-doc-1"
  showAs="The Republic of South Africa (Temporary Provisions) Act, 1961"
/>
The first attribute, @src, is the FRBRManifestation IRI of the PDF. In Gawati the binary document (the PDF) is stored on the file system, at a path derived from this IRI.
So with an IRI like /akn/za/act/1961-05-18/gn_no_47-1961/eng@/!main.pdf, the folder part would be /akn/za/act/1961-05-18/gn_no_47-1961/eng@/
The @alt attribute identifies the actual file name, akn_za_act_1961-05-18_gn_no_47-1961_eng_main.pdf, so the actual path of the file within the file system repository of PDFs would be:
/akn/za/act/1961-05-18/gn_no_47-1961/eng@/akn_za_act_1961-05-18_gn_no_47-1961_eng_main.pdf
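The resolution of @src and @alt into a repository path can be sketched in a few lines of Python. This is only an illustration using the standard library, not Gawati's own code. The function name `resolve_pdf_path` is hypothetical, and the namespace URI bound to the `an:` prefix is assumed to be the Akoma Ntoso 3.0 namespace, since the fragment in this document does not show its declaration.

```python
import posixpath
import xml.etree.ElementTree as ET

# Assumption: the an: prefix is bound to the Akoma Ntoso 3.0 namespace.
AKN_NS = "http://docs.oasis-open.org/legaldocml/ns/akn/3.0"

FRAGMENT = f"""
<an:book xmlns:an="{AKN_NS}" refersTo="#mainDocument">
  <an:componentRef
    src="/akn/za/act/1961-05-18/gn_no_47-1961/eng@/!main.pdf"
    alt="akn_za_act_1961-05-18_gn_no_47-1961_eng_main.pdf"
    GUID="#embedded-doc-1"
    showAs="The Republic of South Africa (Temporary Provisions) Act, 1961"/>
</an:book>
"""


def resolve_pdf_path(book_xml: str) -> str:
    """Return the repository-relative path of the referenced PDF."""
    root = ET.fromstring(book_xml.strip())
    ref = root.find(f"{{{AKN_NS}}}componentRef")
    if ref is None:
        raise ValueError("no componentRef element found")
    src = ref.get("src")  # manifestation IRI
    alt = ref.get("alt")  # actual file name on disk
    if not src or not alt:
        raise ValueError("componentRef needs both @src and @alt")
    folder = posixpath.dirname(src)  # folder part of the IRI
    return posixpath.join(folder, alt)


print(resolve_pdf_path(FRAGMENT))
```

`posixpath` is used rather than `os.path` so that the IRI is always split on forward slashes, whatever platform the code runs on. The returned path would still need to be joined onto the root folder of the PDF repository, which this document does not name.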
Content Extraction
By Content Extraction we mean extracting the content from the PDF (and eventually from other formats) and associating it with the AKN metadata of the same document. We want to put this extracted content into an XML document, so that we can search the AKN metadata and easily correlate it with the extracted content by keeping both in the same context. The link between the extracted content and the AKN metadata document is, again, the AKN IRI of the document.
A specific document structure will be used to hold the extracted content:
Here the <source> element provides a reference point to the AKN metadata document. The @iri attribute on each of <work ... />, <expression ... /> and <manifestation ... /> refers to the <FRBRthis value=... /> within <FRBRWork ... />, <FRBRExpression ... /> and <FRBRManifestation ... /> respectively.
The actual extracted text is within the <text> element, and the content is split by page number. The page number is an important structural aspect when the source document is in a paginated format like PDF or Word (if the source document were HTML or XML, the page number would be irrelevant). Having the page number allows us to link from search results directly to the specific page in the PDF.
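As a rough illustration of how such a document could be built and used, here is a Python sketch using the standard library. Only `<source>`, `<work>`, `<expression>`, `<manifestation>`, `@iri` and `<text>` come from the description above. The root element name `extractedContent`, the `<page number="...">` element, the function names, the work and expression IRI values and the page texts are all hypothetical placeholders. The `#page=N` fragment is the usual PDF open parameter understood by common viewers.

```python
import xml.etree.ElementTree as ET


def build_extraction_doc(work_iri, expression_iri, manifestation_iri, pages):
    """Build an extracted-content document; `pages` is a list of page texts."""
    root = ET.Element("extractedContent")  # hypothetical root element name
    source = ET.SubElement(root, "source")
    ET.SubElement(source, "work", iri=work_iri)
    ET.SubElement(source, "expression", iri=expression_iri)
    ET.SubElement(source, "manifestation", iri=manifestation_iri)
    text = ET.SubElement(root, "text")
    for number, content in enumerate(pages, start=1):
        # Hypothetical per-page element; pages are numbered from 1.
        page = ET.SubElement(text, "page", number=str(number))
        page.text = content
    return root


def pages_matching(root, term):
    """Return the page numbers whose text contains `term` (case-insensitive)."""
    term = term.lower()
    return [
        int(page.get("number"))
        for page in root.iterfind("./text/page")
        if term in (page.text or "").lower()
    ]


def pdf_page_link(pdf_url, page_number):
    """Link to a specific page using the PDF '#page=N' open parameter."""
    return f"{pdf_url}#page={page_number}"


# Placeholder IRIs and page texts, for illustration only.
doc = build_extraction_doc(
    "/akn/za/act/1961-05-18/gn_no_47-1961",
    "/akn/za/act/1961-05-18/gn_no_47-1961/eng@",
    "/akn/za/act/1961-05-18/gn_no_47-1961/eng@/!main.pdf",
    ["placeholder text of the first page", "placeholder text mentioning provisions"],
)
hits = pages_matching(doc, "provisions")
print(ET.tostring(doc, encoding="unicode"))
print([pdf_page_link("main.pdf", n) for n in hits])
```

Keeping the page number as an attribute means a search hit can be mapped to a page link without re-reading the PDF.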