# TikaDocument

> TikaDocument Transform Plugin

## Description

The `TikaDocument` transform plugin uses Apache Tika to extract text content and metadata from documents in many formats, including PDF, Microsoft Office (Word, Excel, PowerPoint), plain text, HTML, and XML. It converts binary document data into structured text and metadata fields.

The plugin provides configurable error handling and content processing options, and accepts document content as either binary data or a Base64-encoded string.

## Options

| Name                                    | Type   | Required | Default Value  | Description                                                              |
|-----------------------------------------|--------|----------|----------------|--------------------------------------------------------------------------|
| source_field                            | string | yes      | -              | The name of the source field containing document data (binary or Base64) |
| output_fields                           | map    | no       | auto-generated | Mapping of extracted content to output field names                       |
| parse_options.extract_text              | bool   | no       | true           | Whether to extract text content from documents                           |
| parse_options.extract_metadata          | bool   | no       | true           | Whether to extract document metadata                                     |
| parse_options.max_string_length         | int    | no       | 10000          | Maximum length of extracted text content                                 |
| content_processing.remove_empty_lines   | bool   | no       | false          | Whether to remove empty lines from extracted text                        |
| content_processing.trim_whitespace      | bool   | no       | false          | Whether to trim leading and trailing whitespace from extracted text      |
| content_processing.normalize_whitespace | bool   | no       | false          | Whether to collapse consecutive whitespace characters into single spaces |
| content_processing.min_content_length   | int    | no       | 0              | Minimum content length threshold (shorter content will be skipped)       |
| error_handling.on_parse_error           | enum   | no       | skip           | How to handle document parsing errors: `fail`, `skip`, `null`            |
| error_handling.on_unsupported_format    | enum   | no       | skip           | How to handle unsupported document formats: `fail`, `skip`, `null`       |
| error_handling.log_errors               | bool   | no       | false          | Whether to log error messages                                            |
| timeout_ms                              | long   | no       | 30000          | Timeout for document processing in milliseconds                          |

### common options [string]

For the common transform plugin parameters, see [Transform Plugin](common-options.md).

### source_field [string]

The name of the input field that contains the document data. This field must contain one of the following:
- Binary document data (byte array)
- Base64-encoded document data (string)

### output_fields [map]

A mapping that specifies which extracted fields are output and the field names they are written to. If not specified, the plugin generates the output fields automatically based on the parsing options.

**Default output fields:**
```hocon
output_fields {
  content = "extracted_text"    # Extracted text content
  content_type = "mime_type"    # MIME type of the document
  title = "doc_title"           # Document title (if available)
}
```

**Custom output fields:**
```hocon
output_fields {
  content = "document_content"
  content_type = "file_type"
  title = "document_title"
  author = "document_author"
  subject = "document_subject"
  keywords = "document_keywords"
  language = "document_language"
  created_date = "creation_date"
  modified_date = "modification_date"
  metadata = "all_metadata"
}
```

### parse_options

#### extract_text [bool]

Whether to extract text content from documents. When enabled, the plugin extracts the readable text from the document.

#### extract_metadata [bool]

Whether to extract document metadata such as title, author, and creation date.

#### max_string_length [int]

Maximum length of extracted text content. Text longer than this limit is truncated.

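As an illustration of how the parse options combine, the following sketch disables text extraction and keeps only metadata. The field names `document_data`, `doc_title`, `doc_author`, and `mime_type` are placeholders chosen for this example, and the sketch assumes that text-dependent keys such as `content` can simply be left out of `output_fields`.

```hocon
transform {
  TikaDocument {
    source_field = "document_data"   # placeholder input field name
    output_fields = {
      title = "doc_title"            # placeholder output field names
      author = "doc_author"
      content_type = "mime_type"
    }
    parse_options = {
      extract_text = false           # skip text extraction
      extract_metadata = true        # keep title, author, and other metadata
    }
  }
}
```
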
### content_processing

#### remove_empty_lines [bool]

Whether to remove empty lines from the extracted text content.

#### trim_whitespace [bool]

Whether to trim leading and trailing whitespace from the extracted text.

#### normalize_whitespace [bool]

Whether to normalize multiple consecutive whitespace characters to single spaces.

#### min_content_length [int]

Minimum length threshold for extracted content. Content shorter than this length is considered invalid and handled according to the error handling strategy.

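A minimal sketch of a light cleanup configuration is shown below: it trims the text and rejects very short results, while leaving line breaks and internal spacing untouched. The threshold of 50 is an arbitrary value for illustration, and this page does not state whether the length check runs before or after trimming, so treat that ordering as unknown.

```hocon
content_processing = {
  remove_empty_lines = false     # keep the original paragraph breaks
  trim_whitespace = true         # drop leading and trailing whitespace
  normalize_whitespace = false   # keep internal spacing (for example, in tables)
  min_content_length = 50        # illustrative threshold, not a recommendation
}
```
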
### error_handling

#### on_parse_error [enum]

Specifies how to handle document parsing errors:
- `fail`: Throw an exception and stop processing
- `skip`: Skip the current row and continue processing
- `null`: Fill output fields with null values

#### on_unsupported_format [enum]

Specifies how to handle unsupported document formats:
- `fail`: Throw an exception and stop processing
- `skip`: Skip the current row and continue processing
- `null`: Fill output fields with null values

#### log_errors [bool]

Whether to log detailed error messages when processing failures occur.

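The two strategies can be set independently. The sketch below, which is only one possible combination, stops the job on a parsing error but tolerates unsupported formats by emitting null output fields. The value `"null"` is quoted, as in the examples on this page, because an unquoted `null` in HOCON is the null literal rather than a string.

```hocon
error_handling = {
  on_parse_error = "fail"          # a corrupt document stops the job
  on_unsupported_format = "null"   # keep the row, leave the output fields null
  log_errors = true                # record why a document was rejected
}
```
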
### timeout_ms [long]

Timeout for document processing in milliseconds. If processing a document takes longer than this timeout, it is terminated and handled according to the error handling strategy.

## Supported Document Formats

Through Apache Tika, the TikaDocument transform supports a wide range of document formats:

- **Text formats**: TXT, RTF, CSV
- **PDF documents**: PDF
- **Microsoft Office**: DOC, DOCX, XLS, XLSX, PPT, PPTX
- **OpenOffice/LibreOffice**: ODT, ODS, ODP
- **Web formats**: HTML, XML, XHTML
- **Archive formats**: ZIP, TAR, GZIP
- **Image formats** (with OCR if available): JPEG, PNG, TIFF, GIF
- **Email formats**: MSG, EML, MBOX
- **eBook formats**: EPUB, MOBI
- **Other formats** supported by Apache Tika

## Examples

### Basic Document Processing

```hocon
transform {
  TikaDocument {
    source_field = "document_data"
    output_fields = {
      content = "extracted_text"
      content_type = "mime_type"
    }
  }
}
```

### Advanced Configuration with Content Processing

```hocon
transform {
  TikaDocument {
    source_field = "file_content"
    output_fields = {
      content = "document_text"
      content_type = "file_type"
      title = "doc_title"
      author = "doc_author"
      metadata = "all_metadata"
    }
    parse_options = {
      extract_text = true
      extract_metadata = true
      max_string_length = 50000
    }
    content_processing = {
      remove_empty_lines = true
      trim_whitespace = true
      normalize_whitespace = true
      min_content_length = 10
    }
    error_handling = {
      on_parse_error = "skip"
      on_unsupported_format = "null"
      log_errors = true
    }
    timeout_ms = 60000
  }
}
```

### Multi-table Processing

```hocon
transform {
  TikaDocument {
    source_field = "document_data"
    output_fields = {
      content = "extracted_content"
      content_type = "document_type"
    }
    multi_tables = true
  }
}
```

## Data Type Mapping

| Input Type | Output Type | Description                                   |
|------------|-------------|-----------------------------------------------|
| BYTES      | STRING      | Binary document data → Extracted text content |
| STRING     | STRING      | Base64 document data → Extracted text content |

Output field data types:
- `content`: STRING (extracted text)
- `content_type`: STRING (MIME type)
- `title`: STRING (document title)
- `author`: STRING (document author)
- `subject`: STRING (document subject)
- `keywords`: STRING (document keywords)
- `language`: STRING (document language)
- `created_date`: STRING (creation date in ISO format)
- `modified_date`: STRING (modification date in ISO format)
- `metadata`: MAP<STRING, STRING> (all metadata as key-value pairs)

## Performance Considerations

- **Memory Usage**: Large documents may consume significant memory during processing
- **Processing Time**: Complex documents (especially PDFs with images) may take longer to process
- **Timeout Settings**: Adjust `timeout_ms` based on your document sizes and processing requirements
- **Batch Size**: For high-volume processing, consider adjusting batch sizes to balance memory usage and throughput

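The considerations above might translate into a configuration like the following sketch for a job that handles large, slow-to-parse files. The numbers are illustrative assumptions rather than tuned recommendations, and `document_data` is a placeholder field name.

```hocon
transform {
  TikaDocument {
    source_field = "document_data"   # placeholder input field name
    parse_options = {
      max_string_length = 5000       # cap the extracted text kept per document
    }
    error_handling = {
      on_parse_error = "skip"        # a timed-out or failed document does not stop the job
    }
    timeout_ms = 120000              # allow up to two minutes per document
  }
}
```
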
## Error Handling Best Practices

1. **Use appropriate error handling strategies** based on your use case:
   - `fail`: For critical pipelines where document processing must succeed
   - `skip`: For batch processing where some failures are acceptable
   - `null`: When you want to preserve row structure but mark failed extractions

2. **Enable logging** during development and testing to understand processing issues

3. **Set reasonable timeouts** to prevent hanging on corrupted or very large documents

4. **Monitor extraction success rates** in production environments

## Troubleshooting

### Common Issues

1. **OutOfMemoryError**: Reduce `max_string_length` or increase the JVM heap size
2. **Timeout issues**: Increase `timeout_ms` for large documents
3. **Unsupported formats**: Check whether the document format is supported, or set `on_unsupported_format` to a suitable strategy
4. **Encoding issues**: Ensure proper character encoding for text documents

### Debug Tips

- Set `log_errors = true` to see detailed error messages
- Use `on_parse_error = "null"` to identify problematic documents
- Test with small document samples first
- Verify document integrity before processing

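
The first two tips can be combined into a temporary debugging configuration, sketched below. Rows whose documents fail to parse are kept with null output fields, so they can be located downstream while the log records the cause. Applying `"null"` to unsupported formats as well is an assumption made for this example.

```hocon
error_handling = {
  on_parse_error = "null"          # keep failing rows so they can be inspected
  on_unsupported_format = "null"   # assumption: treat unsupported formats the same way
  log_errors = true                # write detailed error messages to the log
}
```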