docu

imixs · Nov 6, 2020 · d339c8c
1 parent c545fc8

Showing 3 changed files with 29 additions and 9 deletions.
diff --git a/imixs-archive-documents/README.md b/imixs-archive-documents/README.md
@@ -45,9 +45,28 @@ Both, the *OCRDocumentPlugin* as also the *OCRDocumentAdapter* can be configured
 	<!-- Tika Options -->
 	<tika name="options">X-Tika-PDFocrStrategy=OCR_AND_TEXT_EXTRACTION</tika>
 	<tika name="options">X-Tika-PDFOcrImageType=RGB</tika>
-	<tika name="options">X-Tika-PDFOcrDPI=400</tika>
+	<tika name="options">X-Tika-PDFOcrDPI=72</tika>
+	<tika name="options">X-Tika-OCRLanguage=eng+deu</tika>
 
-In this example configuration the OCR processing will be started with 3 additional tika options. For more details about the OCR configuration see the [Imixs-Archive-OCR project](https://github.com/imixs/imixs-archive/tree/master/imixs-archive-ocr).
+In this example configuration the OCR processing will be started with 4 additional tika options:
+
+ - X-Tika-PDFocrStrategy=OCR_AND_TEXT_EXTRACTION - run OCR in addition to text extraction
+ - X-Tika-PDFOcrImageType=RGB - set the color mode
+ - X-Tika-PDFOcrDPI=72 - set the DPI to 72
+ - X-Tika-OCRLanguage=eng+deu - set the OCR languages to English and German
+
+
+#### Overriding the configured language as part of your request
+
+Different requests may need processing using different language models. These can be specified for specific requests using the X-Tika-OCRLanguage custom header. An example of this is shown below:
+
+	X-Tika-OCRLanguage=deu
+
+Or for multiple languages:
+
+	X-Tika-OCRLanguage=eng+fra
+
+
+For more details about the OCR configuration see the [Imixs-Archive-OCR project](https://github.com/imixs/imixs-archive/tree/master/imixs-archive-ocr).
 
 
 ## Searching Documents
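
Review note: the README describes each option as a `name=value` string that ends up as a custom header such as `X-Tika-OCRLanguage`. The following is a minimal sketch of how such option strings could be split into header name and value pairs. The class `TikaOptionHeaders` and the method `toHeaders` are hypothetical names used for illustration only; they are not part of the Imixs-Archive API, and the actual OCRService may process the options differently.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Imixs-Archive: turns "name=value" option
// strings into a map of header names and values.
public class TikaOptionHeaders {

    // Splits each option at the first '=' and collects the pairs.
    // A later option with the same name replaces the value of an earlier one.
    public static Map<String, String> toHeaders(List<String> options) {
        Map<String, String> headers = new LinkedHashMap<>();
        for (String option : options) {
            int pos = option.indexOf('=');
            if (pos <= 0) {
                // no '=' found, or the name part is empty
                throw new IllegalArgumentException("Expected name=value but got: " + option);
            }
            String name = option.substring(0, pos).trim();
            String value = option.substring(pos + 1).trim();
            headers.put(name, value);
        }
        return headers;
    }
}
```

With the options from the example configuration, `toHeaders` would map `X-Tika-PDFOcrDPI` to `72` and `X-Tika-OCRLanguage` to `eng+deu`. Splitting at the first `=` only keeps values intact that themselves contain an equals sign.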

diff --git a/imixs-archive-ocr/README.md b/imixs-archive-ocr/README.md
@@ -1,7 +1,6 @@
 # Imixs-Archive-OCR
 
-*Imixs-Archive-OCR* is a sub-project of Imixs-Archive. The project provides methods to extract textual information from documents
-attached to a Workitem. The text content of attachments is either extracted by the PDFBox API or by optical character recognition (OCR). This text content is stored in the $file attribute 'text' and can be use for further processing or to search for document content.
+*Imixs-Archive-OCR* is a sub-project of Imixs-Archive. The project is decoupled from the Imixs-Workflow Engine and provides a service component to extract textual information from documents attached to a Workitem. The text content of attachments is either extracted by the PDFBox API or by optical character recognition (OCR). This text content is stored in the $file attribute 'text' and can be used for further processing or to search for document content.
 
 
 ## OCR 
@@ -43,17 +42,19 @@ For example to set the DPI mode call:
 	// define options
 	List<String> options=new ArrayList<String>();
 	options.add("X-Tika-PDFocrStrategy=OCR_AND_TEXT_EXTRACTION");
-	options.add("X-Tika-PDFOcrImageType=RGB");
-	options.add("X-Tika-PDFOcrDPI=400");
-
+	options.add("X-Tika-PDFOcrImageType=RGB"); 	//  support colors 
+	options.add("X-Tika-PDFOcrDPI=72");    			// set DPI
+	options.add("X-Tika-OCRLanguage=eng"); 			// set English language
 	// start ocr 
 	tikaDocumentService.extractText(workitem, "TEXT_AND_OCR", options)
 
 **Note:** Options set by this method call overwrite the options defined in a tika config file. 
 
 You have various options to configure the Tika server. Find details about how to configure imixs-tika [here](https://github.com/imixs/imixs-docker/tree/master/tika).	
 
-
+ - https://cwiki.apache.org/confluence/display/TIKA/TikaServer
+ - https://cwiki.apache.org/confluence/display/TIKA/TikaOCR
+ - https://cwiki.apache.org/confluence/display/tika/PDFParser%20(Apache%20PDFBox)
 
 
 ## How to Install

diff --git a/imixs-archive-ocr/src/main/java/org/imixs/archive/ocr/OCRService.java b/imixs-archive-ocr/src/main/java/org/imixs/archive/ocr/OCRService.java
@@ -122,7 +122,7 @@ public void extractText(ItemCollection workitem, ItemCollection snapshot, String
         // validate OCR MODE....
         if ("TEXT_ONLY, OCR_ONLY, TEXT_AND_OCR".indexOf(pdfMode) == -1) {
             throw new PluginException(OCRService.class.getSimpleName(), PLUGIN_ERROR,
-                    "Invalid TIKA_OCR_MODE - exprected one of the following options: TEXT_ONLY | OCR_ONLY | TEXT_AND_OCR");
+                    "Invalid TIKA_OCR_MODE - expected one of the following options: TEXT_ONLY | OCR_ONLY | TEXT_AND_OCR");
         }
 
         long l = System.currentTimeMillis();
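
Review note: the mode check in this hunk uses `String.indexOf` on a comma-separated list, so any substring of that list (for example `OCR` or an empty string) passes the validation. The sketch below contrasts that check with an exact-match lookup in a `Set`. The class `OcrModeCheck` and its method names are hypothetical and only illustrate the difference; this is a possible alternative, not the project's implementation.

```java
import java.util.Set;

// Hypothetical comparison of two ways to validate the OCR mode.
public class OcrModeCheck {

    private static final Set<String> VALID_MODES = Set.of("TEXT_ONLY", "OCR_ONLY", "TEXT_AND_OCR");

    // Same logic as the check in the diff: accepts any substring of the list.
    public static boolean isValidByIndexOf(String pdfMode) {
        return "TEXT_ONLY, OCR_ONLY, TEXT_AND_OCR".indexOf(pdfMode) != -1;
    }

    // Exact-match alternative: accepts only the three documented modes.
    public static boolean isValidBySet(String pdfMode) {
        return pdfMode != null && VALID_MODES.contains(pdfMode);
    }
}
```

Both methods accept `TEXT_AND_OCR`, but only the substring-based check also accepts inputs such as `OCR` or `ONLY`.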