
Conversation

@janhoy (Contributor) commented Sep 19, 2025

https://issues.apache.org/jira/browse/SOLR-7632

This work builds on #3361, but instead of creating a new module, we add the capability to the existing extraction handler, enabled by specifying extraction.backend=tikaserver.

This first required refactoring the extraction handler to detach it from the Tika v1 API. There is a new interface, ExtractionBackend, that takes a generic ExtractionRequest object in and returns an ExtractionResult bean, and a new LocalTikaExtractionBackend implementation that encapsulates all Tika v1 API handling. This implementation can be deprecated, and in Solr 10 the tikaserver one can be made the default.
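
For illustration, a rough Java sketch of the shape this abstraction could take; the single method, the record fields and their names below are assumptions based on the description above, not the PR's exact code:

  import java.io.InputStream;
  import java.util.List;
  import java.util.Map;

  public interface ExtractionBackend {
    // Run extraction for one incoming document and return content plus metadata.
    ExtractionResult extract(ExtractionRequest request) throws Exception;

    // Input bean: the raw document stream plus whatever hints the handler already has (assumed fields).
    record ExtractionRequest(InputStream stream, String contentType, String resourceName) {}

    // Output bean: extracted content plus a metadata multi-map (assumed fields).
    record ExtractionResult(String content, Map<String, List<String>> metadata) {}
  }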

See each commit for the progression: it starts with refactoring the existing code and ends with adding the tikaserver implementation.

All existing tests pass. New tests are added using Testcontainers to spin up Tika.

Note: Most of the coding was done by JetBrains Junie, so reviewers may want to ensure nothing fancy has slipped into the code.

@janhoy marked this pull request as draft September 19, 2025 15:14
@janhoy requested a review from epugh September 19, 2025 15:14
The github-actions bot added the documentation label Sep 19, 2025
@epugh (Contributor) commented Sep 19, 2025

Exciting!

@janhoy (Contributor, Author) commented Sep 20, 2025

Status:

  • Parses docs using TikaServer
  • Can switch between xml (html) and text format of the content field
  • Randomized the choice of backend for the main test class
  • ExtractOnly not fully implemented for tikaserver, some tests fail

TBD:

  • The whole xpath / SAX parsing of XML response is missing
  • We use the JDK HTTP client; could perhaps use the Jetty client instead. See the other POC for an example, including making timeouts configurable (a minimal call sketch follows after this list)
  • Must make sure that tikaserver.url is only configurable in the requestHandler config in solrconfig.xml, not as a request parameter (security)
  • RefGuide docs, especially how to start TikaServer etc
  • Remove the DummyExtractionBackend
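
As referenced in the list above, here is a minimal sketch of what the backend's HTTP call could look like with the JDK client, assuming a Tika Server on localhost:9998 and illustrative timeout values (not the PR's actual code):

  import java.net.URI;
  import java.net.http.HttpClient;
  import java.net.http.HttpRequest;
  import java.net.http.HttpResponse;
  import java.nio.file.Path;
  import java.time.Duration;

  public class TikaServerCallSketch {
    public static void main(String[] args) throws Exception {
      HttpClient client = HttpClient.newBuilder()
          .connectTimeout(Duration.ofSeconds(10)) // would need to be made configurable
          .build();

      HttpRequest request = HttpRequest.newBuilder(URI.create("http://localhost:9998/tika"))
          .timeout(Duration.ofSeconds(60))
          .header("Accept", "text/html") // XHTML content; use "text/plain" for the text format
          .PUT(HttpRequest.BodyPublishers.ofFile(Path.of(args[0])))
          .build();

      // Stream the response so it can be fed to a parser without buffering it all in memory.
      HttpResponse<java.io.InputStream> response =
          client.send(request, HttpResponse.BodyHandlers.ofInputStream());
      System.out.println("Tika Server responded with HTTP " + response.statusCode());
    }
  }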

Anyone, please feel free to hack away on this if it looks exciting, committing directly to the PR branch.

Question: Would it bring value to isolate the refactoring in one PR and then another one to add the tikaserver impl?

Cleanup TestContainer
Refactor ExtractionMetadata
Add returnType to ExtractionRequest
Remove static initializers
@janhoy force-pushed the refactor-extraction-handler branch from cc3d43f to a3794ce on September 20, 2025 01:24
@epugh (Contributor) commented Sep 21, 2025

Any luck with the security manager? I had many difficulties.

@epugh (Contributor) commented Sep 22, 2025

Testcontainers and Docker don't love the SecurityManager. I had Claude try to run the tests and add additional permissions to solr-tests.policy, and after an hour or so I had a lot more permissions, but no love:

// Needed for testcontainers
  permission java.io.FilePermission "/Users/epugh/.testcontainers.properties", "read";
  permission java.io.FilePermission "/Users/epugh/.docker-java.properties", "read";
  permission java.io.FilePermission "/Users/epugh/.docker/-", "read";
  permission java.io.FilePermission "/usr/local/opt/[email protected]/bin/docker-machine", "read";
  permission java.io.FilePermission "/usr/local/opt/[email protected]/bin/docker-machine", "read";
  permission java.io.FilePermission "/Users/epugh/.asdf/installs/nodejs/20.18.3/bin/docker-machine", "read";
  permission java.io.FilePermission "/Users/epugh/.asdf/shims/docker-machine", "read";
  permission java.io.FilePermission "/Users/epugh/.nvm/versions/node/v14.21.2/bin/docker-machine", "read";
  permission java.io.FilePermission "/Users/epugh/.rbenv/shims/docker-machine", "read";
  permission java.io.FilePermission "/usr/local/bin/docker-machine", "read";
  permission java.io.FilePermission "/usr/local/sbin/docker-machine", "read";
  permission java.io.FilePermission "/System/Cryptexes/App/usr/bin/docker-machine", "read";
  permission java.io.FilePermission "/usr/bin/docker-machine", "read";
  permission java.io.FilePermission "/bin/docker-machine", "read";
  permission java.io.FilePermission "/usr/sbin/docker-machine", "read";
  permission java.io.FilePermission "/sbin/docker-machine", "read";
  permission java.io.FilePermission "/usr/local/MacGPG2/bin/docker-machine", "read";
  permission java.io.FilePermission "/Library/Apple/usr/bin/docker-machine", "read";
  permission java.io.FilePermission "/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/local/bin/docker-machine", "read";
  permission java.io.FilePermission "/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/bin/docker-machine", "read";
  permission java.io.FilePermission "/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/appleinternal/bin/docker-machine", "read";


@janhoy (Contributor, Author) commented Sep 22, 2025

Yea, that’s annoying. Perhaps we could disable JSM for this test or for tests in the entire module?

@iamsanjay (Contributor) commented:

I had a similar experience when I was upgrading Kafka. And then I stopped.

Java Security Manager and Testcontainers do not play nicely together.  We prefer Testcontainers, so disable JSM
@epugh (Contributor) commented Sep 22, 2025

When I first saw DummyExtractionBackend, my first thought was that it should be in the test class hierarchy. However, would there be value in keeping it? If you wanted to test your setup in Solr (and not worry about the Tika side), could it be useful for that? "I send a doc and I get something back"...
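
For what it's worth, such a backend could stay tiny. A hypothetical sketch, reusing the assumed interface shape from the earlier sketch (this is not the PR's actual DummyExtractionBackend):

  import java.util.List;
  import java.util.Map;

  public class DummyBackendSketch implements ExtractionBackend {
    @Override
    public ExtractionResult extract(ExtractionRequest request) throws Exception {
      // Ignore the document entirely; just prove the Solr-side plumbing works end to end.
      return new ExtractionResult(
          "dummy extracted content",
          Map.of("stream_name", List.of(String.valueOf(request.resourceName()))));
    }
  }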

Add common metadata
Adjust some tests with dc:title instead of title
Support passwords in TikaServer backend
@epugh (Contributor) commented Sep 24, 2025

I vote for moving in the direction of Tika 3 and how it works, and maybe updating the tests? If this is Solr 10, can't we list the changes as "breaking changes"? Also, a thought... could we have one set of tests that validates how Tika 1 worked and is specific to Tika 1, and another that handles what Tika 3 does? Then we don't have to make everything from 1 work in 3... And maybe there are things in 3 that we would want to test? After all, in Solr 10, don't we eliminate the Tika 1 approach anyway?

@epugh (Contributor) commented Sep 24, 2025

I had a thought late last night @janhoy... While I am super excited about the pluggable idea, I wonder if we have lost the core goal? The core goal is to offer a way of reading rich documents for indexing in Solr without the maintenance burden on Solr, and to be more in line with the future of Tika. If that is the core goal, I wonder if we should just target TikaServer in Solr 10 and not worry about any back-compat beyond documenting it, etc. We should just embrace the new way Tika works. If that makes sense, maybe the fact that some capabilities in Solr don't work at this point, like passwords or xpath, is okay if it's a Solr 10 only thing?

Does saying this is a Solr 10 only thing make it easier to have the tests pass, by tweaking them and our implementation to leverage how Tika 3 and TikaServer work?

@epugh (Contributor) commented Sep 24, 2025

I have 90% of a working .bats test that downloads and fires up TikaServer. Should we add that so our script tests will validate the code? Thoughts?

@janhoy (Contributor, Author) commented Sep 24, 2025

My thought was to land tikaserver in Solr 9.x as opt-in while deprecating local. The server variant need not respond with exactly the same metadata, and some of the tests that specifically test 1.x functionality can be moved to that test class. But for the simple use cases that 90% of users need, like extracting text and normal metadata from PDF, Word etc., we get feature parity. Then we remove the local Tika parser in 10.0 and make server the default, i.e. users will have a transition path even in 9.x.

I started with the JSON output from Tika Server, but since it does not support streaming, only a full copy in memory, I'm moving to the /tika endpoint with an XML response, where TikaServer streams XHTML as parsing happens, without buffering it all in memory first. Same on the SolrCell side: I'm successfully parsing the XHTML with SAX, picking up all the <meta> tags. Next is to feed the SAX stream into SolrContentHandler, which will handle the capturing stuff. This should both give a small memory footprint and unlock more of the SolrCell features.
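
To illustrate the delegation idea (a sketch only, not the actual handler in the PR; the element and attribute names assume Tika's XHTML output): a SAX filter can capture <meta> name/content pairs from the streamed XHTML while forwarding every event to a downstream ContentHandler such as SolrContentHandler.

  import java.util.HashMap;
  import java.util.Map;
  import org.xml.sax.Attributes;
  import org.xml.sax.SAXException;
  import org.xml.sax.helpers.XMLFilterImpl;

  // Wire it up with filter.setParent(xmlReader) and filter.setContentHandler(downstream),
  // then call filter.parse(new InputSource(tikaResponseStream)).
  public class MetaCapturingFilterSketch extends XMLFilterImpl {
    private final Map<String, String> metadata = new HashMap<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
        throws SAXException {
      if ("meta".equals(localName) || "meta".equals(qName)) {
        String name = atts.getValue("name");
        String content = atts.getValue("content");
        if (name != null && content != null) {
          metadata.put(name, content); // capture Tika's metadata, e.g. dc:title
        }
      }
      super.startElement(uri, localName, qName, atts); // forward to the downstream handler
    }

    public Map<String, String> getMetadata() {
      return metadata;
    }
  }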

While it is true that Tika 1.x and Tika 3.x have many breaking changes, that is mainly in the Java API. The XML parse result, which is a content string and a metadata map, stays the same, so no conceptual difference there. The metadata keys are a bit different/normalized, but we don't need to bridge that. We can simply document that when using tikaserver they should look for dc:title instead of title, and SolrCell already allows you to map those to whatever schema field you like.

The big question is of course whether we manage to get a stable tika server impl that is production ready before 10.0, and whether the refactoring leaves the old local impl as stable as it has been; the memory footprint may have increased, etc.

@epugh (Contributor) commented Sep 24, 2025

Sounds like a plan! I didn't know about the need for the XML streaming parsing... Having both in 9x is a much nicer migration than just a hard swap in 10. Plus, if someone wanted to keep the old 9x local processing version in 10, they could of course create their own backend and reference it!

* Add back-compat option for metadata
* Fix true SAX streaming parser for Tika XML response
* Simplify ExtractionBackend interface
@janhoy (Contributor, Author) commented Sep 25, 2025

So, pushed a commit with some nice changes:

  • Refactor some logic back to ExtractingDocumentLoader, simplify ExtractionBackend interface to two methods
  • Add backCompatibility=true config option to enable duplicating some metadata like Tika 1.x did, e.g. both dc:title and title (see the alias sketch after this list)
  • Fix true SAX streaming parser for the Tika-Server XML response. We now have our own TikaXmlResponseSaxContentHandler which takes care of pulling metadata from the response, while delegating other SAX parsing to whatever ContentHandler is passed to the parse method. This lets us re-use the existing code for extracting plain text or XML and for capturing xpath-style tags
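
For the back-compat option mentioned above, a sketch of the kind of alias duplication it could do; these particular key pairs are illustrative assumptions, not the actual (admittedly incomplete) map in the PR:

  import java.util.List;
  import java.util.Map;

  public final class BackCompatMetadataSketch {
    // Tika 3 key -> Tika 1.x-style alias (assumed examples).
    private static final Map<String, String> ALIASES =
        Map.of("dc:title", "title", "dc:creator", "author", "dcterms:created", "created");

    static void addAliases(Map<String, List<String>> metadata) {
      ALIASES.forEach((modern, legacy) -> {
        if (metadata.containsKey(modern) && !metadata.containsKey(legacy)) {
          metadata.put(legacy, metadata.get(modern)); // keep both keys, as Tika 1.x did
        }
      });
    }
  }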

Not all tests pass, but two more are green: testExtraction and testPasswordProtected

(Screenshot of test results, 2025-09-25)

The testPdfWithImages test fails since we do not currently use the recursive parser with Tika, i.e. the /rmeta endpoint. Thus only the PDF itself is considered, not the images. Likewise, a ZIP file would likely not work.

The testXpath test fails since the body should have been "News" but is in fact "linkNews". The test doc has two <a> tags and the output is a concatenation of both. The xpath is /xhtml:html/xhtml:body/xhtml:a/descendant::node(), so I assume it should only select those directly below the body tag. I suspect that the XML returned from Tika 1 is a bit different from what we get from the server?

@epugh (Contributor) commented Sep 25, 2025

Are you seeing any issues in how TikaServer works that maybe are better fixed there? Some great progress!

@janhoy (Contributor, Author) commented Sep 25, 2025

> Are you seeing any issues in how TikaServer works that maybe are better fixed there? Some great progress!

I think not really. The only "quirk" I saw was that if you ingest a plain txt document, you get back an XML with a title like <title>&#0;</title>. It is parsed by TextAndCSVParser. When injecting that XML into the SAX parser, it bails out on the invalid character, so I inserted an XML sanitizer.
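
A minimal sketch of such a sanitizer, assuming it operates on the response as a string; it handles the &#0; case above plus raw control characters, while a complete version would cover the full range of references that are illegal in XML 1.0:

  import java.util.regex.Pattern;

  public final class XmlSanitizerSketch {
    // Numeric references to NUL (&#0;, &#x0;, &#00;, ...) and raw control characters
    // other than tab, newline and carriage return are not allowed in XML 1.0.
    private static final Pattern ILLEGAL =
        Pattern.compile("&#x?0+;|[\\x00-\\x08\\x0B\\x0C\\x0E-\\x1F]");

    public static String sanitize(String xml) {
      return ILLEGAL.matcher(xml).replaceAll("");
    }
  }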

Other than that, I think Tika Server has what we need. It accepts the password as an HTTP header. And it accepts some PDF parser config through headers as well. But more advanced parser config should be done on the TikaServer side, and the good thing is that the user will have 100% control over their TikaServer and can configure it as they wish, much more than you could with SolrCell.

We should probably start using the /rmeta (recursive metadata) endpoint, since our current SolrCell parser is recursive. But that may mean some more advanced parsing of the response? I have not really checked.

I'm fairly optimistic on getting the remaining tests passing.

Earlier today we had 8 tests passing; after the last commit, there are now 11 passing and 4 failing.
(Screenshot of test results, 2025-09-26)

@janhoy (Contributor, Author) commented Sep 25, 2025

Now testLiteralsOverride also passes. That leaves

  • testXPath - known error in pulling xpath - difference in XML format between local Tika and server?
  • testCapture - needs debugging why capturing tags won't work
  • testPdfWithImages - nested docs need /rmeta endpoint, not yet tested

@epugh (Contributor) commented Sep 26, 2025

I love seeing the updates as you make progress. Commits are fun to read too! I am really impressed that we are actually able to use the existing tests to measure progress; it's a reminder of the value of the tests in helping us understand "which features of Tika do we use? The ones in the tests!"

@janhoy (Contributor, Author) commented Sep 26, 2025

The testXpath test tries to capture <a> tags directly under <body>, but it also captures the <div><a> tag. I checked the XML I get from local Tika, and it is different from the XML we get from Tika Server 3. From TikaServer, all the <div> tags are stripped, so the <a> element appears to be directly below <body>. I believe it is because the default HTML parser is now JSoup, which has some other rules. See https://issues.apache.org/jira/browse/TIKA-2562

Thus, this test document can be rewritten to use something other than div, and the test will work.

I believe the same is the issue with the testCapture test, as it relies on capturing <div>.

That gives us a solution for the remaining three failing tests 🥳

@janhoy (Contributor, Author) commented Sep 26, 2025

Yey! Rewrote tests to capture/xpath <h1> tags instead of <div> tags:
(Screenshot of test results, 2025-09-26)

@epugh (Contributor) commented Sep 26, 2025

Love the way you fixed it. Does this mean in practice that folks might see different results depending on which backend they use and the specific document? On the other hand, that also seems totally okay in the sense that they are different backends...

…" config)

Move pdf-with-image test to local test
Add recursive test to TikaServer test case
@janhoy (Contributor, Author) commented Sep 26, 2025

The last commit adds recursive parsing as an option, &recursive=true, for the tikaserver backend. I moved the failing image test to the Local test class, since there is a difference in how PDFs with images are parsed in 1.x and 3.x: in 1.x, embedded images would leave traces in the extracted text by default; not anymore. But I added a test to the tikaserver backend test class that extracts the same PDF with recursive enabled and a special HTTP header to TikaServer, X-Tika-PDFextractInlineImages=true, which indeed extracts the images and adds their names to the text. This recursive business is highly experimental: it uses a different endpoint, /rmeta, returning JSON that needs to be buffered entirely in memory both on the TikaServer side and on the Solr side. Also, all the embedded documents are returned concatenated together in the content, and I believe only the metadata from the main object is retained?
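
For reference, a sketch of what that recursive request could look like with the JDK client; the URL and timeout are illustrative assumptions, and the header is the one mentioned above:

  import java.net.URI;
  import java.net.http.HttpClient;
  import java.net.http.HttpRequest;
  import java.net.http.HttpResponse;
  import java.nio.file.Path;
  import java.time.Duration;

  public class RecursiveExtractSketch {
    public static void main(String[] args) throws Exception {
      HttpRequest request = HttpRequest.newBuilder(URI.create("http://localhost:9998/rmeta"))
          .timeout(Duration.ofMinutes(2))
          .header("Accept", "application/json")
          .header("X-Tika-PDFextractInlineImages", "true") // also extract embedded images
          .PUT(HttpRequest.BodyPublishers.ofFile(Path.of(args[0])))
          .build();

      // Unlike /tika, the JSON (one metadata object per embedded document) must be
      // buffered fully in memory before it can be used.
      HttpResponse<String> response =
          HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
      System.out.println(response.body());
    }
  }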

All tests are now green. However, there is still a thread leak in the tikaserver test. I think there is some HttpClient stuff that is not released.

Other TODO:

  • Lots of code from GenAI, which needs review and rewrite / simplification.
  • There may be debug prints and TODOs left here and there
  • The back-compat metadata map is AI generated and by no means complete or even correct 🤣
  • Lack of JavaDoc everywhere
  • Perhaps some code is Java 21, making a backport a challenge
  • Not much attention has been given to exception handling, retrying, timeout values etc.
  • Should probably use the Jetty HTTP client.
  • Throughput with many update requests? Right now the request thread will be blocking on the Tika response and parsing...
  • If TikaServer supports HTTPS, we'd probably need to handle self-signed SSL either through a truststore or custom config like we did for JWT-auth.
  • If TikaServer supports auth, that must be thought through
  • Complete the RefGuide docs
  • Split the huge PR into stages, i.e. first only support for the pluggable backend, then add the tikaserver backend

That concludes the "POC", proving that a drop-in replacement for users is doable.

@janhoy (Contributor, Author) commented Sep 26, 2025

We now have a separate GitHub workflow testing the extraction code with Testcontainers. It is only for the sake of this PR, not intended for merge :)

The thread leaks definitely look related to ordinary Solr objects.

> Task :solr:modules:extraction:test
ExtractingRequestHandlerTikaServerTest > classMethod FAILED
    java.lang.AssertionError: ObjectTracker found 11 object(s) that were not released!!! [MockDirectoryWrapper, Http2SolrClient, Http2SolrClient, Http2SolrClient, MDCAwareThreadPoolExecutor, SolrIndexSearcher, MockDirectoryWrapper, SolrCore, Http2SolrClient, LBHttp2SolrClient, MockDirectoryWrapper]
    org.apache.lucene.tests.store.MockDirectoryWrapper:org.apache.solr.common.util.ObjectReleaseTracker$ObjectTrackerException: org.apache.lucene.tests.store.MockDirectoryWrapper
    	at org.apache.solr.common.util.ObjectReleaseTracker.track(ObjectReleaseTracker.java:54)
...

@janhoy (Contributor, Author) commented Sep 26, 2025

@epugh and others - I'll be on holiday for a week from today. Feel free to commit anything you like directly to this branch without asking, if you want to play around or move things closer to perfection. Normal review comments are of course welcome too, but commits eat comments for breakfast :)

Any phased merge can be done later, as the interface boundaries are fairly clean, hopefully.

* @deprecated Will be replaced with something similar that calls out to a separate Tika Server
* process running in its own JVM.
*/
@Deprecated(since = "9.10.0")
@janhoy (Contributor, Author) commented:

@epugh I undeprecated this and the Loader, and instead deprecated the Local backend. This part needs to be backported before the 9.10 release. Also perhaps the wording in major-changes...

@epugh (Contributor) commented:

totally!

Labels: dependencies, documentation, module:extraction, tests, tool:build