Commit 5c95db0

1 parent d4ec6e7 commit 5c95db0

2 files changed: +514 -0 lines changed
Lines changed: 257 additions & 0 deletions
@@ -0,0 +1,257 @@
# TikaDocument

> TikaDocument Transform Plugin

## Description

The `TikaDocument` transform plugin uses Apache Tika to extract text content and metadata from many document formats, including PDF, Microsoft Office documents (Word, Excel, PowerPoint), plain text, HTML, and XML. The transform converts binary document data into structured text content and metadata fields.

The plugin provides configurable error handling and content post-processing, and accepts both binary and Base64-encoded document content.

## Options

| Name                                    | Type   | Required | Default Value  | Description |
|-----------------------------------------|--------|----------|----------------|-------------|
| source_field                            | string | yes      | -              | The name of the source field containing document data (binary or Base64) |
| output_fields                           | map    | no       | auto-generated | Mapping of extracted content to output field names |
| parse_options.extract_text              | bool   | no       | true           | Whether to extract text content from documents |
| parse_options.extract_metadata          | bool   | no       | true           | Whether to extract document metadata |
| parse_options.max_string_length         | int    | no       | 10000          | Maximum length of extracted text content |
| content_processing.remove_empty_lines   | bool   | no       | false          | Whether to remove empty lines from extracted text |
| content_processing.trim_whitespace      | bool   | no       | false          | Whether to trim whitespace from extracted text |
| content_processing.normalize_whitespace | bool   | no       | false          | Whether to collapse consecutive whitespace characters into a single space |
| content_processing.min_content_length   | int    | no       | 0              | Minimum content length threshold (shorter content will be skipped) |
| error_handling.on_parse_error           | enum   | no       | skip           | How to handle document parsing errors: `fail`, `skip`, `null` |
| error_handling.on_unsupported_format    | enum   | no       | skip           | How to handle unsupported document formats: `fail`, `skip`, `null` |
| error_handling.log_errors               | bool   | no       | false          | Whether to log error messages |
| timeout_ms                              | long   | no       | 30000          | Timeout for document processing in milliseconds |

### common options [string]

Common parameters of transform plugins. See [Transform Plugin](common-options.md) for details.

### source_field [string]

The name of the input field that contains the document data. This field should contain either:
- Binary document data (byte array)
- Base64-encoded document data (string)

### output_fields [map]

A mapping that specifies which extracted fields should be output and their corresponding field names. If not specified, the plugin will automatically generate output fields based on the parsing options.

**Default output fields:**
```hocon
output_fields {
  content = "extracted_text"      # Extracted text content
  content_type = "mime_type"      # MIME type of the document
  title = "doc_title"             # Document title (if available)
}
```

**Custom output fields:**
```hocon
output_fields {
  content = "document_content"
  content_type = "file_type"
  title = "document_title"
  author = "document_author"
  subject = "document_subject"
  keywords = "document_keywords"
  language = "document_language"
  created_date = "creation_date"
  modified_date = "modification_date"
  metadata = "all_metadata"
}
```

68+
### parse_options
69+
70+
#### extract_text [bool]
71+
72+
Whether to extract text content from documents. When enabled, the plugin will extract readable text from the document.
73+
74+
#### extract_metadata [bool]
75+
76+
Whether to extract document metadata such as title, author, creation date, etc.
77+
78+
#### max_string_length [int]
79+
80+
Maximum length of extracted text content. Text longer than this limit will be truncated.
81+
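As an illustration of the parse options, the sketch below disables text extraction and keeps only metadata. It uses only option names documented here; the field names `raw_file`, `doc_title`, `doc_author`, and `created_at` are hypothetical, and it assumes that `content` may be left out of `output_fields` when `extract_text` is `false`.

```hocon
transform {
  TikaDocument {
    source_field = "raw_file"        # hypothetical input field holding the document bytes
    output_fields = {
      title = "doc_title"            # hypothetical output field names
      author = "doc_author"
      created_date = "created_at"
    }
    parse_options = {
      extract_text = false           # skip text extraction
      extract_metadata = true        # keep title, author, dates, etc.
    }
  }
}
```
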
### content_processing

#### remove_empty_lines [bool]

Whether to remove empty lines from the extracted text content.

#### trim_whitespace [bool]

Whether to trim leading and trailing whitespace from the extracted text.

#### normalize_whitespace [bool]

Whether to collapse consecutive whitespace characters into a single space.

#### min_content_length [int]

Minimum length threshold for extracted content. Content shorter than this length is considered invalid and is handled according to the error handling strategy.

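A minimal sketch of a clean-up configuration, assuming the goal is compact text for downstream indexing. The field names `file_content` and `clean_text` and the threshold of 20 are illustrative only, and the order in which the plugin applies these steps is not specified here.

```hocon
transform {
  TikaDocument {
    source_field = "file_content"    # hypothetical input field
    output_fields = {
      content = "clean_text"         # hypothetical output field
    }
    content_processing = {
      remove_empty_lines = true      # drop blank lines left over from page layout
      trim_whitespace = true         # strip leading and trailing whitespace
      normalize_whitespace = true    # collapse runs of whitespace into one space
      min_content_length = 20        # illustrative threshold for near-empty extractions
    }
  }
}
```

All four options default to `false` or `0`, so extracted text is passed through unchanged unless they are set explicitly.
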
### error_handling

#### on_parse_error [enum]

Specifies how to handle document parsing errors:
- `fail`: Throw an exception and stop processing
- `skip`: Skip the current row and continue processing
- `null`: Fill output fields with null values

#### on_unsupported_format [enum]

Specifies how to handle unsupported document formats:
- `fail`: Throw an exception and stop processing
- `skip`: Skip the current row and continue processing
- `null`: Fill output fields with null values

#### log_errors [bool]

Whether to log detailed error messages when processing failures occur.

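For a pipeline in which every document must be processed successfully, a strict configuration might look like the sketch below. The field names are hypothetical; the option values come from the lists above.

```hocon
transform {
  TikaDocument {
    source_field = "contract_pdf"    # hypothetical input field
    output_fields = {
      content = "contract_text"      # hypothetical output field
    }
    error_handling = {
      on_parse_error = "fail"        # stop the job on a corrupt document
      on_unsupported_format = "fail" # stop the job on an unknown format
      log_errors = true              # record the failure details
    }
  }
}
```

Because both strategies default to `skip`, failures are dropped silently unless `log_errors` is enabled or a stricter strategy is chosen.
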
### timeout_ms [long]

Timeout for processing a document, in milliseconds. If processing takes longer than this, it is terminated and handled according to the error handling strategy.

## Supported Document Formats

The TikaDocument transform supports a wide variety of document formats through Apache Tika:

- **Text formats**: TXT, RTF, CSV
- **PDF documents**: PDF
- **Microsoft Office**: DOC, DOCX, XLS, XLSX, PPT, PPTX
- **OpenOffice/LibreOffice**: ODT, ODS, ODP
- **Web formats**: HTML, XML, XHTML
- **Archive formats**: ZIP, TAR, GZIP
- **Image formats** (with OCR if available): JPEG, PNG, TIFF, GIF
- **Email formats**: MSG, EML, MBOX
- **eBook formats**: EPUB, MOBI
- **And many more**

## Examples

### Basic Document Processing

```hocon
transform {
  TikaDocument {
    source_field = "document_data"
    output_fields = {
      content = "extracted_text"
      content_type = "mime_type"
    }
  }
}
```

### Advanced Configuration with Content Processing

```hocon
transform {
  TikaDocument {
    source_field = "file_content"
    output_fields = {
      content = "document_text"
      content_type = "file_type"
      title = "doc_title"
      author = "doc_author"
      metadata = "all_metadata"
    }
    parse_options = {
      extract_text = true
      extract_metadata = true
      max_string_length = 50000
    }
    content_processing = {
      remove_empty_lines = true
      trim_whitespace = true
      normalize_whitespace = true
      min_content_length = 10
    }
    error_handling = {
      on_parse_error = "skip"
      on_unsupported_format = "null"
      log_errors = true
    }
    timeout_ms = 60000
  }
}
```

### Multi-table Processing

```hocon
transform {
  TikaDocument {
    source_field = "document_data"
    output_fields = {
      content = "extracted_content"
      content_type = "document_type"
    }
    multi_tables = true
  }
}
```

## Data Type Mapping

| Input Type | Output Type | Description |
|------------|-------------|-------------|
| BYTES | STRING | Binary document data → Extracted text content |
| STRING | STRING | Base64 document data → Extracted text content |

Output field data types:
- `content`: STRING (extracted text)
- `content_type`: STRING (MIME type)
- `title`: STRING (document title)
- `author`: STRING (document author)
- `subject`: STRING (document subject)
- `keywords`: STRING (document keywords)
- `language`: STRING (document language)
- `created_date`: STRING (creation date in ISO format)
- `modified_date`: STRING (modification date in ISO format)
- `metadata`: MAP<STRING, STRING> (all metadata as key-value pairs)

## Performance Considerations

- **Memory Usage**: Large documents may consume significant memory during processing
- **Processing Time**: Complex documents (especially PDFs with images) may take longer to process
- **Timeout Settings**: Adjust `timeout_ms` based on your document sizes and processing requirements
- **Batch Size**: For high-volume processing, consider adjusting batch sizes to balance memory usage and throughput

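The memory and timeout points above could translate into a configuration like the following sketch for large, slow-to-parse files. The numbers are illustrative rather than recommendations, and the field names are hypothetical.

```hocon
transform {
  TikaDocument {
    source_field = "scanned_report"  # hypothetical input field with large PDFs
    output_fields = {
      content = "report_text"        # hypothetical output field
    }
    parse_options = {
      max_string_length = 5000       # below the 10000 default, to cap the size of extracted text
    }
    timeout_ms = 120000              # above the 30000 default, to allow for slow documents
  }
}
```
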
## Error Handling Best Practices

1. **Use appropriate error handling strategies** based on your use case:
   - `fail`: For critical pipelines where document processing must succeed
   - `skip`: For batch processing where some failures are acceptable
   - `null`: When you want to preserve row structure but mark failed extractions

2. **Enable logging** during development and testing to understand processing issues

3. **Set reasonable timeouts** to prevent hanging on corrupted or very large documents

4. **Monitor extraction success rates** in production environments

## Troubleshooting

### Common Issues

1. **OutOfMemoryError**: Reduce `max_string_length` or increase JVM heap size
2. **Timeout issues**: Increase `timeout_ms` for large documents
3. **Unsupported formats**: Check document format support or use appropriate error handling
4. **Encoding issues**: Ensure proper character encoding for text documents

### Debug Tips

- Enable `log_errors = true` to see detailed error messages
- Use `on_parse_error = "null"` to identify problematic documents
- Test with small document samples first
- Verify document integrity before processing
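
The first two tips can be combined in a temporary debugging configuration, sketched below with hypothetical field names. With `null` handling, failed rows stay in the output with null values, so they can be located downstream and matched against the logged errors.

```hocon
transform {
  TikaDocument {
    source_field = "document_data"
    output_fields = {
      content = "extracted_text"
      content_type = "mime_type"
    }
    error_handling = {
      on_parse_error = "null"        # keep the row, null out the extracted fields
      on_unsupported_format = "null" # same treatment for unknown formats
      log_errors = true              # write detailed error messages to the log
    }
  }
}
```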
