solr/solr-ref-guide/modules/indexing-guide/pages/indexing-with-tika.adoc
46 additions & 6 deletions
@@ -51,7 +51,41 @@ The next step after any update handler is the xref:configuration-guide:update-re
 == Tika Server

-TODO: Add documentation about Tika Server backend.
+The `tikaserver` backend lets Solr delegate content extraction to an external Apache Tika Server process instead of running Tika parsers inside the Solr JVM. This isolates Solr from parser crashes and heavy parsing load, simplifies dependency management, and lets you scale Tika independently of Solr.
+The quickest way to run Tika Server for development is with Docker. The examples below expose Tika on port 9998 on localhost, matching the default value when `tikaserver.url` is not explicitly set.
+
+[,bash]
+----
+docker run --rm -p 9998:9998 apache/tika:3.2.3.0-full
+----
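
Before pointing Solr at the container, it can help to confirm that Tika Server is answering requests. The following is a minimal sketch that queries Tika Server's `/version` endpoint; it assumes `curl` is installed, and the `tika_is_up` helper and `TIKA_URL` variable are illustrative names rather than anything Solr or Tika defines.

```shell
# Assumed address; matches the port published by the docker command above.
TIKA_URL="${TIKA_URL:-http://localhost:9998}"

# Succeeds (exit status 0) when Tika Server answers on its /version endpoint.
tika_is_up() {
  curl -fsS --max-time 5 "$1/version" >/dev/null 2>&1
}

if tika_is_up "$TIKA_URL"; then
  echo "Tika Server is reachable at $TIKA_URL"
else
  echo "Tika Server is not reachable at $TIKA_URL"
fi
```
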
+
+NOTE: If Solr also runs in Docker, make sure both containers share a network, and use the Tika container name as the host in `tikaserver.url`.
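
The shared-network setup described in the note could look roughly like the sketch below. The network name `solr-tika`, the container names, and the `solr:9` image tag are assumptions chosen for illustration; use a Solr release that actually includes the `tikaserver` backend, and enable the extraction module as described elsewhere on this page.

```shell
# Assumed names for illustration; change them to suit your environment.
NETWORK="solr-tika"
TIKA_CONTAINER="tika"
TIKA_PORT=9998
SOLR_IMAGE="solr:9"   # placeholder tag, not a statement about which release has tikaserver

# Value for tikaserver.url: on a shared network the container name is the host.
tika_url() {
  printf 'http://%s:%s\n' "$TIKA_CONTAINER" "$TIKA_PORT"
}

# Create a user-defined network and start both containers on it.
start_containers() {
  docker network create "$NETWORK" &&
  docker run -d --rm --name "$TIKA_CONTAINER" --network "$NETWORK" apache/tika:3.2.3.0-full &&
  docker run -d --rm --name solr --network "$NETWORK" -p 8983:8983 "$SOLR_IMAGE"
}

echo "tikaserver.url should be $(tika_url)"
```

Calling `start_containers` starts both services. Tika needs no published port in this arrangement, because Solr reaches it over the shared network.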
+
+=== Limitations
+The `tikaserver` backend does not yet support the following features. Requests that use them return HTTP 400:
+
+- `capture` and `captureAttr`: Selecting specific XHTML elements/attributes during indexing requires Solr’s SAX ContentHandler and is not supported by the `tikaserver` backend.
+- `xpath`: Server-side XPath filtering of the XHTML is not supported.
+- `passwordsFile` and `resource.password` for the indexing path: These options trigger the legacy SAX path in Solr and are not currently supported.
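
A client can check for these parameters before sending a request, rather than waiting for the HTTP 400. The sketch below is one hypothetical way to do that in a shell script; `unsupported_params` is an illustrative helper, and it assumes the query string is not URL-encoded.

```shell
# Print every parameter name in a query string that the list above marks as
# unsupported by the tikaserver backend. Exit status is 0 when at least one
# is found and 1 when the query is clean (grep's convention).
unsupported_params() {
  printf '%s\n' "$1" | tr '&' '\n' | cut -d= -f1 |
    grep -Fx -e capture -e captureAttr -e xpath -e passwordsFile -e resource.password
}

query='literal.id=doc1&xpath=//p&fmap.content=text'
if bad=$(unsupported_params "$query"); then
  echo "Remove before using tikaserver: $bad"
fi
```
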
+
+Metadata produced by Tika Server can differ slightly from the metadata produced by the `local` backend, particularly in key names and in which fields are present. Adjust your `fmap.*` mappings accordingly.
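
One way to find the keys that need remapping is to ask Tika Server for a file's metadata directly, then pass matching `fmap.*` parameters on the extract request. The sketch below assumes `curl`, a Tika Server at `http://localhost:9998`, and the "techproducts" collection on Solr's default port. The helper names and the example field names are hypothetical, and field names containing special characters would need URL-encoding.

```shell
TIKA_URL="http://localhost:9998"                     # assumed Tika Server address
SOLR_URL="http://localhost:8983/solr/techproducts"   # assumed Solr collection URL

# Show the metadata Tika Server reports for file $1, as JSON, so key names can be inspected.
tika_metadata() {
  curl -s -T "$1" -H "Accept: application/json" "$TIKA_URL/meta"
}

# Build one request parameter of the form fmap.<source_field>=<target_field>.
fmap_param() {
  printf 'fmap.%s=%s' "$1" "$2"
}

# Index file $1 with id $2 through the tikaserver backend, mapping field $3 to field $4.
index_with_mapping() {
  curl -s "$SOLR_URL/update/extract?extraction.backend=tikaserver&literal.id=$2&commit=true&$(fmap_param "$3" "$4")" \
    -F "myfile=@$1"
}

# Example (not run here): index_with_mapping report.pdf doc1 dc_title title
```
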

 == Module

@@ -61,7 +95,7 @@ The "techproducts" example included with Solr is pre-configured to have Solr Cel
 If you are not using the example, you will want to pay attention to the section <<solrconfig.xml Configuration>> below.
Rich document formats are frequently not well documented, and even in cases where there is documentation for the format, not everyone who creates documents will follow the specifications faithfully.
@@ -76,7 +110,8 @@ the request handler is running in the same JVM that Solr uses for other operatio
 Indexing can also consume all available Solr resources, particularly with large PDFs, presentations, or other files
 that have a lot of rich media embedded in them.

-For these reasons, Solr Cell is not recommended for use in a production system.
+For these reasons, Solr Cell with the `local` backend is not recommended for use in a production system. Prefer the
+`tikaserver` backend, which is more robust and isolates failures from Solr itself.

 It is a best practice to use Solr Cell as a proof-of-concept tool during development and then run Tika as an external
 process that sends the extracted documents to Solr (via xref:deployment-guide:solrj.adoc[]) for indexing.
@@ -181,7 +216,7 @@ These parameters can be set for each indexing request (as request parameters), o
 |===
 +
 Choose the backend to use for extraction. The options are `local` or `tikaserver`.
-The `local` backend uses Tika libraries included with Solr to do the extraction, and is the default in Solr 9.
+The `local` backend uses Tika libraries included with Solr to do the extraction, and is the default in Solr 9.x.
 The `tikaserver` backend uses an external Tika server process to do the extraction.
 **The `local` backend is deprecated and will be removed in a future release.**
 +
@@ -195,9 +230,9 @@ Example: In `solrconfig.xml`: `<str name="extraction.backend">tikaserver</str>`.
 |===
 +
 Specifies the URL of the Tika server to use when the `extraction.backend` parameter is set to `tikaserver`.
-This parameter is required when using the `tikaserver` backend.
+Set this parameter when using the `tikaserver` backend. If it is not specified, it defaults to `http://localhost:9998`.
 +
-Example: In `solrconfig.xml`: `<str name="tikaserver.url">http://my.tika.server</str>`.
+Example: In `solrconfig.xml`: `<str name="tikaserver.url">http://localhost:9998</str>`.

 `capture`::
 +
@@ -500,6 +535,8 @@ So you can use the other URPs without worrying about unexpected field additions.
 === Parser-Specific Properties

+NOTE: These properties apply to the `local` backend only.
+
 Parsers used by Tika may have specific properties to govern how data is extracted.
 These can be passed through Solr for special parsing situations.
@@ -521,6 +558,8 @@ Consult the Tika Java API documentation for configuration parameters that can be
 === Indexing Encrypted Documents

+NOTE: The `tikaserver` backend does not currently support indexing encrypted documents.
+
 The ExtractingRequestHandler will decrypt encrypted files and index their content if you supply a password in either `resource.password` in the request, or in a `passwordsFile` file.

 In the case of `passwordsFile`, the file supplied must be formatted so there is one line per rule.
@@ -658,6 +697,7 @@ public class SolrCellRequestDemo {