Commit b769c06

More documentation
1 parent 1cfcce9 commit b769c06

1 file changed

+46
-6
lines changed

solr/solr-ref-guide/modules/indexing-guide/pages/indexing-with-tika.adoc

Lines changed: 46 additions & 6 deletions
@@ -51,7 +51,41 @@ The next step after any update handler is the xref:configuration-guide:update-re
5151

5252
== Tika Server
5353

54-
TODO: Add documentation about Tika Server backend.
54+
The `tikaserver` backend lets Solr delegate content extraction to an external Apache Tika Server process instead of running Tika parsers inside the Solr JVM. This can improve operational isolation (crashes or heavy parsing won’t impact Solr), simplify dependency management, and allow you to scale Tika independently of Solr.
55+
56+
Example handler configuration:
57+
58+
[source,xml]
59+
----
60+
<requestHandler name="/update/extract" class="solr.extraction.ExtractingRequestHandler">
61+
<lst name="defaults">
62+
<!-- Select the tikaserver backend by default for this handler -->
63+
<str name="extraction.backend">tikaserver</str>
64+
</lst>
65+
<!-- Point Solr to your Tika Server -->
66+
<str name="tikaserver.url">http://localhost:9998</str>
67+
</requestHandler>
68+
----
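The backend can also be chosen per request, since `extraction.backend` is accepted as a request parameter. The following is a minimal sketch, not a tested recipe: it assumes Solr listens on `localhost:8983` with a collection named `techproducts`, and the helper names `extract_url` and `index_file` are hypothetical.

```shell
# Assumed base URL (not taken from the handler config above); override as needed.
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr/techproducts}"

# Print the extract URL for a document id, selecting the tikaserver
# backend as a request parameter.
extract_url() {
  printf '%s/update/extract?extraction.backend=tikaserver&literal.id=%s&commit=true' \
    "$SOLR_URL" "$1"
}

# Post a local file to the handler as multipart form data.
# $1 = document id, $2 = path to the file
index_file() {
  curl -sS "$(extract_url "$1")" -F "myfile=@$2"
}
```

For example, `index_file doc1 ./example.pdf` would send a local file (the path is a placeholder) through the handler. The Tika Server location still comes from `tikaserver.url` in the handler configuration.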
69+
70+
=== Starting Tika Server with Docker
71+
72+
The quickest way to run Tika Server for development is with Docker. The example below exposes Tika on port 9998 on localhost, which matches the default value used when `tikaserver.url` is not explicitly set.
73+
74+
[,bash]
75+
----
76+
docker run --rm -p 9998:9998 apache/tika:3.2.3.0-full
77+
----
78+
79+
NOTE: If Solr runs in Docker too, ensure both containers share a network and use the Tika container name as the host in `tikaserver.url`.
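One way to set up the shared network described in the note might look like the sketch below. The network name `solr-tika`, the container name `tika`, and the helper function names are all assumptions chosen for illustration; only the image tag comes from the example above.

```shell
NETWORK="solr-tika"   # hypothetical network name
TIKA_NAME="tika"      # hypothetical container name

# Value to use for tikaserver.url from inside the shared network:
# the container name acts as the host name.
tika_url() {
  printf 'http://%s:9998' "$TIKA_NAME"
}

# Create the network (ignoring "already exists") and start Tika Server on it.
start_tika() {
  docker network create "$NETWORK" 2>/dev/null || true
  docker run --rm -d --name "$TIKA_NAME" --network "$NETWORK" apache/tika:3.2.3.0-full
}

# Attach an already running Solr container to the same network.
# $1 = name or id of the Solr container
attach_solr() {
  docker network connect "$NETWORK" "$1"
}
```

After `start_tika` and `attach_solr <your-solr-container>`, the output of `tika_url` (here `http://tika:9998`) is what you would put in `tikaserver.url`. Publishing port 9998 to the host is not needed for container-to-container traffic on the same network.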
80+
81+
=== Limitations
82+
The `tikaserver` backend currently lacks some features. Requests that use any of the following are rejected with HTTP 400:
83+
84+
- `capture` and `captureAttr`: Selecting specific XHTML elements/attributes during indexing requires Solr’s SAX ContentHandler and is not supported by the `tikaserver` backend.
85+
- `xpath`: Server-side XPath filtering of the XHTML is not supported.
86+
- `passwordsFile` and `resource.password` for the indexing path: these options trigger the legacy SAX path in Solr and are not currently supported.
87+
88+
Metadata produced by Tika Server can differ slightly from local Tika, particularly in key names and the presence/absence of certain fields. Adjust your `fmap.*` mappings accordingly.
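To see which metadata keys differ before changing any mappings, one option is to run the same file through both backends with `extractOnly=true` and compare the output. This is only a sketch: the Solr URL and collection are assumed, the helper names are hypothetical, and the field names in the usage note are placeholders rather than keys Tika is known to emit.

```shell
# Assumed base URL; override as needed.
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr/techproducts}"

# Extract without indexing and print the response.
# $1 = backend (local or tikaserver), $2 = path to the file
inspect_metadata() {
  curl -sS "$SOLR_URL/update/extract?extraction.backend=$1&extractOnly=true&wt=json" \
    -F "myfile=@$2"
}

# Build an fmap request parameter that maps a source field to a target field.
# $1 = field name produced by extraction, $2 = field name in your schema
fmap_param() {
  printf 'fmap.%s=%s' "$1" "$2"
}
```

For example, `inspect_metadata local ./example.pdf > local.json` and `inspect_metadata tikaserver ./example.pdf > tika.json` can then be compared with `diff`. Once a renamed key is identified, `fmap_param meta_author author` (placeholder names) yields a parameter you could append to the indexing request.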
5589

5690
== Module
5791

@@ -61,7 +95,7 @@ The "techproducts" example included with Solr is pre-configured to have Solr Cel
6195
If you are not using the example, you will want to pay attention to the section <<solrconfig.xml Configuration>> below.
6296

6397

64-
=== Solr Cell Performance Implications
98+
=== Solr Cell Performance Implications (local mode)
6599

66100
Rich document formats are frequently not well documented, and even in cases where there is documentation for the format, not everyone who creates documents will follow the specifications faithfully.
67101

@@ -76,7 +110,8 @@ the request handler is running in the same JVM that Solr uses for other operatio
76110
Indexing can also consume all available Solr resources, particularly with large PDFs, presentations, or other files
77111
that have a lot of rich media embedded in them.
78112

79-
For these reasons, Solr Cell is not recommended for use in a production system.
113+
For these reasons, Solr Cell with the `local` backend is not recommended for use in a production system. Prefer the
114+
`tikaserver` backend, which is more robust and isolates failures from Solr itself.
80115

81116
It is a best practice to use Solr Cell as a proof-of-concept tool during development and then run Tika as an external
82117
process that sends the extracted documents to Solr (via xref:deployment-guide:solrj.adoc[]) for indexing.
@@ -181,7 +216,7 @@ These parameters can be set for each indexing request (as request parameters), o
181216
|===
182217
+
183218
Choose the backend to use for extraction. The options are `local` or `tikaserver`.
184-
The `local` backend uses Tika libraries included with Solr to do the extraction, and is the default in Solr 9.
219+
The `local` backend uses Tika libraries included with Solr to do the extraction, and is the default in Solr 9.x.
185220
The `tikaserver` backend uses an external Tika server process to do the extraction.
186221
**The `local` backend is deprecated and will be removed in a future release.**
187222
+
@@ -195,9 +230,9 @@ Example: In `solrconfig.xml`: `<str name="extraction.backend">tikaserver</str>`.
195230
|===
196231
+
197232
Specifies the URL of the Tika server to use when the `extraction.backend` parameter is set to `tikaserver`.
198-
This parameter is required when using the `tikaserver` backend.
233+
This parameter applies only when using the `tikaserver` backend. If it is not specified, it defaults to `http://localhost:9998`.
199234
+
200-
Example: In `solrconfig.xml`: `<str name="tikaserver.url">http://my.tika.server</str>`.
235+
Example: In `solrconfig.xml`: `<str name="tikaserver.url">http://localhost:9998</str>`.
201236

202237
`capture`::
203238
+
@@ -500,6 +535,8 @@ So you can use the other URPs without worrying about unexpected field additions.
500535

501536
=== Parser-Specific Properties
502537

538+
NOTE: This setting applies to the `local` backend only.
539+
503540
Parsers used by Tika may have specific properties to govern how data is extracted.
504541
These can be passed through Solr for special parsing situations.
505542

@@ -521,6 +558,8 @@ Consult the Tika Java API documentation for configuration parameters that can be
521558

522559
=== Indexing Encrypted Documents
523560

561+
NOTE: The `tikaserver` backend does not currently support indexing encrypted documents.
562+
524563
The ExtractingRequestHandler will decrypt encrypted files and index their content if you supply a password in either `resource.password` in the request, or in a `passwordsFile` file.
525564

526565
In the case of `passwordsFile`, the file supplied must be formatted so there is one line per rule.
@@ -658,6 +697,7 @@ public class SolrCellRequestDemo {
658697
req.setParam(ExtractingParams.EXTRACT_ONLY, "true");
659698
NamedList<Object> result = client.request(req);
660699
System.out.println("Result: " + result);
700+
}
661701
}
662702
----
663703
