
Conversation

@janhoy (Contributor) commented Sep 19, 2025

https://issues.apache.org/jira/browse/SOLR-7632

This work builds on #3361, but instead of creating a new module, we add the capability to the existing extraction handler, enabled by specifying extraction.backend=tikaserver.

This first required refactoring the extraction handler to detach it from the Tika v1 API. There is a new interface, ExtractionBackend, that takes a generic ExtractionRequest object in and returns an ExtractionResult bean, and a new LocalTikaExtractionBackend implementation that encapsulates all Tika v1 API handling. This implementation can be deprecated, and in Solr 10 the tikaserver one can be made the default.
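
For illustration, a rough Java sketch of the shape this abstraction could take; the single method, the record fields and their names below are assumptions based on the description above, not the PR's exact code:

  import java.io.InputStream;
  import java.util.List;
  import java.util.Map;

  public interface ExtractionBackend {
    // Run extraction for one incoming document and return content plus metadata.
    ExtractionResult extract(ExtractionRequest request) throws Exception;

    // Input bean: the raw document stream plus whatever hints the handler already has (assumed fields).
    record ExtractionRequest(InputStream stream, String contentType, String resourceName) {}

    // Output bean: extracted content plus a metadata multi-map (assumed fields).
    record ExtractionResult(String content, Map<String, List<String>> metadata) {}
  }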

See each commit for the progression: it starts with refactoring the existing code and ends with adding the tikaserver implementation.

All existing tests pass. New tests are added using Testcontainers to spin up Tika.

Note: Most of the coding was done by JetBrains Junie, so reviewers may want to ensure nothing fancy has slipped into the code.

@janhoy marked this pull request as draft September 19, 2025 15:14
@janhoy requested a review from epugh September 19, 2025 15:14
The github-actions bot added the documentation label Sep 19, 2025
@epugh (Contributor) commented Sep 19, 2025

Exciting!

@janhoy (Contributor, Author) commented Sep 20, 2025

Status:

  • Parses docs using TikaServer
  • Can switch between xml (html) and text format of the content field
  • Randomized the choice of backend for the main test class
  • ExtractOnly not fully implemented for tikaserver, some tests fail

TBD:

  • The whole xpath / SAX parsing of XML response is missing
  • We use the JDK HTTP client; could perhaps use the Jetty client instead. See the other POC for an example, including making timeouts configurable (a minimal call sketch follows after this list)
  • Must make sure that tikaserver.url is only configurable in the requestHandler config in solrconfig.xml, not as a request parameter (security)
  • RefGuide docs, especially how to start TikaServer etc
  • Remove the DummyExtractionBackend
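
As referenced in the list above, here is a minimal sketch of what the backend's HTTP call could look like with the JDK client, assuming a Tika Server on localhost:9998 and illustrative timeout values (not the PR's actual code):

  import java.net.URI;
  import java.net.http.HttpClient;
  import java.net.http.HttpRequest;
  import java.net.http.HttpResponse;
  import java.nio.file.Path;
  import java.time.Duration;

  public class TikaServerCallSketch {
    public static void main(String[] args) throws Exception {
      HttpClient client = HttpClient.newBuilder()
          .connectTimeout(Duration.ofSeconds(10)) // would need to be made configurable
          .build();

      HttpRequest request = HttpRequest.newBuilder(URI.create("http://localhost:9998/tika"))
          .timeout(Duration.ofSeconds(60))
          .header("Accept", "text/html") // XHTML content; use "text/plain" for the text format
          .PUT(HttpRequest.BodyPublishers.ofFile(Path.of(args[0])))
          .build();

      // Stream the response so it can be fed to a parser without buffering it all in memory.
      HttpResponse<java.io.InputStream> response =
          client.send(request, HttpResponse.BodyHandlers.ofInputStream());
      System.out.println("Tika Server responded with HTTP " + response.statusCode());
    }
  }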

Anyone, please feel free to hack away on this if it looks exciting, committing directly to the PR branch.

Question: Would it bring value to isolate the refactoring in one PR and then another one to add the tikaserver impl?

Cleanup TestContainer
Refactor ExtractionMetadata
Add returnType to ExtractionRequest
Remove static initializers
@janhoy force-pushed the refactor-extraction-handler branch from cc3d43f to a3794ce on September 20, 2025 01:24
@epugh (Contributor) commented Sep 21, 2025

Any luck with the security manager? I had many difficulties.

@epugh (Contributor) commented Sep 22, 2025

Testcontainers and Docker don't love the SecurityManager. I had Claude try to run the tests and add additional permissions to solr-tests.policy, and after an hour or so I had a lot more permissions, but no love:

// Needed for testcontainers
  permission java.io.FilePermission "/Users/epugh/.testcontainers.properties", "read";
  permission java.io.FilePermission "/Users/epugh/.docker-java.properties", "read";
  permission java.io.FilePermission "/Users/epugh/.docker/-", "read";
  permission java.io.FilePermission "/usr/local/opt/[email protected]/bin/docker-machine", "read";
  permission java.io.FilePermission "/usr/local/opt/[email protected]/bin/docker-machine", "read";
  permission java.io.FilePermission "/Users/epugh/.asdf/installs/nodejs/20.18.3/bin/docker-machine", "read";
  permission java.io.FilePermission "/Users/epugh/.asdf/shims/docker-machine", "read";
  permission java.io.FilePermission "/Users/epugh/.nvm/versions/node/v14.21.2/bin/docker-machine", "read";
  permission java.io.FilePermission "/Users/epugh/.rbenv/shims/docker-machine", "read";
  permission java.io.FilePermission "/usr/local/bin/docker-machine", "read";
  permission java.io.FilePermission "/usr/local/sbin/docker-machine", "read";
  permission java.io.FilePermission "/System/Cryptexes/App/usr/bin/docker-machine", "read";
  permission java.io.FilePermission "/usr/bin/docker-machine", "read";
  permission java.io.FilePermission "/bin/docker-machine", "read";
  permission java.io.FilePermission "/usr/sbin/docker-machine", "read";
  permission java.io.FilePermission "/sbin/docker-machine", "read";
  permission java.io.FilePermission "/usr/local/MacGPG2/bin/docker-machine", "read";
  permission java.io.FilePermission "/Library/Apple/usr/bin/docker-machine", "read";
  permission java.io.FilePermission "/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/local/bin/docker-machine", "read";
  permission java.io.FilePermission "/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/bin/docker-machine", "read";
  permission java.io.FilePermission "/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/appleinternal/bin/docker-machine", "read";


@janhoy (Contributor, Author) commented Sep 22, 2025

Yea, that’s annoying. Perhaps we could disable JSM for this test or for tests in the entire module?

@iamsanjay (Contributor) commented:

I had a similar experience when I was upgrading Kafka. And then I stopped.

Java Security Manager and Testcontainers do not play nicely together.  We prefer Testcontainers, so disable JSM
@epugh (Contributor) commented Sep 22, 2025

When I first saw DummyExtractionBackend, my first thought was that it should be in the test class hierarchy. However, would there be value in keeping it? If you wanted to test your setup in Solr (and not worry about the Tika side), could it be useful for that? "I send a doc and I get something back"...
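
For what it's worth, such a backend could stay tiny. A hypothetical sketch, reusing the assumed interface shape from the earlier sketch (this is not the PR's actual DummyExtractionBackend):

  import java.util.List;
  import java.util.Map;

  public class DummyBackendSketch implements ExtractionBackend {
    @Override
    public ExtractionResult extract(ExtractionRequest request) throws Exception {
      // Ignore the document entirely; just prove the Solr-side plumbing works end to end.
      return new ExtractionResult(
          "dummy extracted content",
          Map.of("stream_name", List.of(String.valueOf(request.resourceName()))));
    }
  }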

Add common metadata
Adjust some tests with dc:title instead of title
Support passwords in TikaServer backend
@epugh (Contributor) commented Sep 24, 2025

I vote for moving in the direction of Tika 3 and how it works, and maybe updating the tests? If this is Solr 10, can't we list the changes as "breaking changes"? Also, a thought... could we have one set of tests that validates how Tika 1 worked and is specific to Tika 1, and another that handles what Tika 3 does? Then we don't have to make everything from 1 work in 3... And maybe there are things in 3 that we would want to test? After all, in Solr 10, don't we eliminate the Tika 1 approach anyway?

@epugh (Contributor) commented Sep 24, 2025

I had a thought late last night @janhoy... While I am super excited about the pluggable idea, I wonder if we have lost the core goal? The core goal is to offer a way of reading rich documents for indexing in Solr without the maintenance burden on Solr, and to be more in line with the future of Tika. If that is the core goal, I wonder if we should just target TikaServer in Solr 10 and not worry about any back-compat beyond documenting it, etc. We should just embrace the new way Tika works. If that makes sense, maybe the fact that some capabilities in Solr don't work at this point, like passwords or xpath, is okay if it's a Solr 10 only thing?

Does saying this is a Solr 10 only thing make it easier to have the tests pass, by tweaking them and our implementation to leverage how Tika 3 and TikaServer work?

@epugh (Contributor) commented Sep 24, 2025

I have 90% of a working .bats test that downloads and fires up TikaServer. Should we add that so our script tests will validate the code? Thoughts?

@janhoy (Contributor, Author) commented Sep 24, 2025

My thought was to land tikaserver in Solr 9.x as opt-in while deprecating local. The server variant need not respond with exactly the same metadata, and some of the tests that specifically test 1.x functionality can be moved to that test class. But for the simple use cases that 90% of users need, like extracting text and normal metadata from PDF, Word etc., we get feature parity. Then we remove the local Tika parser in 10.0 and make server the default, i.e. users will have a transition path even in 9.x.

I started with the JSON output from Tika Server, but since it does not support streaming, only a full copy in memory, I'm moving to the /tika endpoint with an XML response, where TikaServer streams XHTML as parsing happens, without buffering it all in memory first. Same on the SolrCell side: I'm successfully parsing the XHTML with SAX, picking up all the <meta> tags. Next is to feed the SAX stream into SolrContentHandler, which will handle the capturing stuff. This should both give a small memory footprint and unlock more of the SolrCell features.
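
To illustrate the delegation idea (a sketch only, not the actual handler in the PR; the element and attribute names assume Tika's XHTML output): a SAX filter can capture <meta> name/content pairs from the streamed XHTML while forwarding every event to a downstream ContentHandler such as SolrContentHandler.

  import java.util.HashMap;
  import java.util.Map;
  import org.xml.sax.Attributes;
  import org.xml.sax.SAXException;
  import org.xml.sax.helpers.XMLFilterImpl;

  // Wire it up with filter.setParent(xmlReader) and filter.setContentHandler(downstream),
  // then call filter.parse(new InputSource(tikaResponseStream)).
  public class MetaCapturingFilterSketch extends XMLFilterImpl {
    private final Map<String, String> metadata = new HashMap<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
        throws SAXException {
      if ("meta".equals(localName) || "meta".equals(qName)) {
        String name = atts.getValue("name");
        String content = atts.getValue("content");
        if (name != null && content != null) {
          metadata.put(name, content); // capture Tika's metadata, e.g. dc:title
        }
      }
      super.startElement(uri, localName, qName, atts); // forward to the downstream handler
    }

    public Map<String, String> getMetadata() {
      return metadata;
    }
  }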

While it is true that Tika 1.x and Tika 3.x have many breaking changes, that is mainly in the Java API. The XML parse result, which is a content string and a metadata map, stays the same, so no conceptual difference there. The metadata keys are a bit different/normalized, but we don't need to bridge that. We can simply document that when using tikaserver they should look for dc:title instead of title, and SolrCell already allows you to map those to whatever schema field you like.

The big question is of course whether we manage to get a stable tika server impl that is production ready before 10.0, and whether the refactoring leaves the old local impl as stable as it has been; the memory footprint may have increased, etc.

@epugh (Contributor) commented Sep 24, 2025

Sounds like a plan! I didn't know about the need for the XML streaming parsing... Having both in 9x is a much nicer migration than just a hard swap in 10. Plus, if someone wanted to keep the old 9x local processing version in 10, they could of course create their own backend and reference it!

* Add back-compat option for metadata
* Fix true SAX streaming parser for Tika XML response
* Simplify ExtractionBackend interface
@janhoy (Contributor, Author) commented Sep 25, 2025

So, pushed a commit with some nice changes:

  • Refactor some logic back to ExtractingDocumentLoader, simplify ExtractionBackend interface to two methods
  • Add backCompatibility=true config option to enable duplicating some metadata like Tika 1.x did, e.g. both dc:title and title (see the alias sketch after this list)
  • Fix true SAX streaming parser for the Tika-Server XML response. We now have our own TikaXmlResponseSaxContentHandler which takes care of pulling metadata from the response, while delegating other SAX parsing to whatever ContentHandler is passed to the parse method. This lets us re-use the existing code for extracting plain text or XML and for capturing xpath-style tags
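
For the back-compat option mentioned above, a sketch of the kind of alias duplication it could do; these particular key pairs are illustrative assumptions, not the actual (admittedly incomplete) map in the PR:

  import java.util.List;
  import java.util.Map;

  public final class BackCompatMetadataSketch {
    // Tika 3 key -> Tika 1.x-style alias (assumed examples).
    private static final Map<String, String> ALIASES =
        Map.of("dc:title", "title", "dc:creator", "author", "dcterms:created", "created");

    static void addAliases(Map<String, List<String>> metadata) {
      ALIASES.forEach((modern, legacy) -> {
        if (metadata.containsKey(modern) && !metadata.containsKey(legacy)) {
          metadata.put(legacy, metadata.get(modern)); // keep both keys, as Tika 1.x did
        }
      });
    }
  }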

Not all tests pass, but two more are green: testExtraction and testPasswordProtected

(Screenshot of test results, 2025-09-25)

The testPdfWithImages test fails since we do not currently use the recursive parser with Tika, i.e. the /rmeta endpoint. Thus only the PDF itself is considered, not the images. Likewise, a ZIP file would likely not work.

The testXpath test fails since the body should have been "News" but is in fact "linkNews". The test doc has two <a> tags and the output is a concatenation of both. The xpath is /xhtml:html/xhtml:body/xhtml:a/descendant::node(), so I assume it should only select those directly below the body tag. I suspect that the XML returned from Tika 1 is a bit different from what we get from the server?

@epugh (Contributor) commented Sep 25, 2025

Are you seeing any issues in how TikaServer works that maybe are better fixed there? Some great progress!

@janhoy (Contributor, Author) commented Sep 25, 2025

> Are you seeing any issues in how TikaServer works that maybe are better fixed there? Some great progress!

I think not really. The only "quirk" I saw was that if you ingest a plain txt document, you get back an XML with a title like <title>&#0;</title>. It is parsed by TextAndCSVParser. When injecting that XML into the SAX parser, it bails out on the invalid character, so I inserted an XML sanitizer.
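
A minimal sketch of such a sanitizer, assuming it operates on the response as a string; it handles the &#0; case above plus raw control characters, while a complete version would cover the full range of references that are illegal in XML 1.0:

  import java.util.regex.Pattern;

  public final class XmlSanitizerSketch {
    // Numeric references to NUL (&#0;, &#x0;, &#00;, ...) and raw control characters
    // other than tab, newline and carriage return are not allowed in XML 1.0.
    private static final Pattern ILLEGAL =
        Pattern.compile("&#x?0+;|[\\x00-\\x08\\x0B\\x0C\\x0E-\\x1F]");

    public static String sanitize(String xml) {
      return ILLEGAL.matcher(xml).replaceAll("");
    }
  }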

Other than that, I think Tika Server has what we need. It accepts the password as an HTTP header. And it accepts some PDF parser config through headers as well. But more advanced parser config should be done on the TikaServer side, and the good thing is that the user will have 100% control over their TikaServer and can configure it as they wish, much more than you could with SolrCell.

We should probably start using the /rmeta (recursive metadata) endpoint, since our current SolrCell parser is recursive. But that may mean some more advanced parsing of the response? I have not really checked.

I'm fairly optimistic on getting the remaining tests passing.

Earlier today we had 8 tests passing; after the last commit, there are now 11 passing and 4 failing.
(Screenshot of test results, 2025-09-26)

@janhoy (Contributor, Author) commented Sep 25, 2025

Now testLiteralsOverride also passes. That leaves

  • testXPath - known error in pulling xpath - difference in XML format between local Tika and server?
  • testCapture - needs debugging why capturing tags won't work
  • testPdfWithImages - nested docs need /rmeta endpoint, not yet tested

@epugh (Contributor) commented Sep 26, 2025

I love seeing the updates as you make progress. Commits are fun to read too! I am really impressed that we are actually able to use the existing tests to measure progress; it's a reminder of the value of the tests in helping us understand "which features of Tika do we use? The ones in the tests!"

@janhoy (Contributor, Author) commented Sep 26, 2025

The testXpath test tries to capture <a> tags directly under <body>, but it also captures the <div><a> tag. I checked the XML I get from local Tika, and it is different from the XML we get from Tika Server 3. From TikaServer, all the <div> tags are stripped, so the <a> element appears to be directly below <body>. I believe it is because the default HTML parser is now JSoup, which has some other rules. See https://issues.apache.org/jira/browse/TIKA-2562

Thus, this test document can be rewritten to use something other than div, and the test will work.

I believe the same is the issue with the testCapture test, as it relies on capturing <div>.

That gives us a solution for the remaining three failing tests 🥳

@janhoy (Contributor, Author) commented Sep 26, 2025

Yey! Rewrote tests to capture/xpath <h1> tags instead of <div> tags:
(Screenshot of test results, 2025-09-26)

@epugh (Contributor) commented Sep 26, 2025

Love the way you fixed it. Does this mean in practice that folks might see different results depending on which backend they use and the specific document? On the other hand, that also seems totally okay in the sense that they are different backends...

…" config)

Move pdf-with-image test to local test
Add recursive test to TikaServer test case
@janhoy (Contributor, Author) commented Sep 26, 2025

The last commit adds recursive parsing as an option, &recursive=true, for the tikaserver backend. I moved the failing image test to the Local test class, since there is a difference in how PDFs with images are parsed in 1.x and 3.x: in 1.x, embedded images would leave traces in the extracted text by default; not anymore. But I added a test to the tikaserver backend test class that extracts the same PDF with recursive enabled and a special HTTP header to TikaServer, X-Tika-PDFextractInlineImages=true, which indeed extracts the images and adds their names to the text. This recursive business is highly experimental: it uses a different endpoint, /rmeta, returning JSON that needs to be buffered entirely in memory both on the TikaServer side and on the Solr side. Also, all the embedded documents are returned concatenated together in the content, and I believe only the metadata from the main object is retained?
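
For reference, a sketch of what that recursive request could look like with the JDK client; the URL and timeout are illustrative assumptions, and the header is the one mentioned above:

  import java.net.URI;
  import java.net.http.HttpClient;
  import java.net.http.HttpRequest;
  import java.net.http.HttpResponse;
  import java.nio.file.Path;
  import java.time.Duration;

  public class RecursiveExtractSketch {
    public static void main(String[] args) throws Exception {
      HttpRequest request = HttpRequest.newBuilder(URI.create("http://localhost:9998/rmeta"))
          .timeout(Duration.ofMinutes(2))
          .header("Accept", "application/json")
          .header("X-Tika-PDFextractInlineImages", "true") // also extract embedded images
          .PUT(HttpRequest.BodyPublishers.ofFile(Path.of(args[0])))
          .build();

      // Unlike /tika, the JSON (one metadata object per embedded document) must be
      // buffered fully in memory before it can be used.
      HttpResponse<String> response =
          HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
      System.out.println(response.body());
    }
  }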

All tests are now green. However, there is still a thread leak in the tikaserver test. I think there is some HttpClient stuff that is not released.

Other TODO:

  • Lots of code from GenAI, which needs review and rewrite / simplification.
  • There may be debug prints and TODOs left here and there
  • The back-compat metadata map is AI generated and by no means complete or even correct 🤣
  • Lack of JavaDoc everywhere
  • Perhaps some code is Java 21, making a backport a challenge
  • Not much attention has been given to exception handling, retrying, timeout values etc.
  • Should probably use the Jetty HTTP client.
  • Throughput with many update requests? Right now the request thread will be blocking on the Tika response and parsing...
  • If TikaServer supports HTTPS, we'd probably need to handle self-signed SSL either through a truststore or custom config like we did for JWT-auth.
  • If TikaServer supports auth, that must be thought through
  • Complete the RefGuide docs
  • Split the huge PR into stages, i.e. first only support for the pluggable backend, then add the tikaserver backend

That concludes the "POC", proving that a drop-in replacement for users is doable.

@janhoy (Contributor, Author) commented Sep 26, 2025

We now have a separate GitHub workflow testing the extraction code with Testcontainers. It is only for the sake of this PR, not intended for merge :)

The thread leaks definitely look related to ordinary Solr objects.

> Task :solr:modules:extraction:test
ExtractingRequestHandlerTikaServerTest > classMethod FAILED
    java.lang.AssertionError: ObjectTracker found 11 object(s) that were not released!!! [MockDirectoryWrapper, Http2SolrClient, Http2SolrClient, Http2SolrClient, MDCAwareThreadPoolExecutor, SolrIndexSearcher, MockDirectoryWrapper, SolrCore, Http2SolrClient, LBHttp2SolrClient, MockDirectoryWrapper]
    org.apache.lucene.tests.store.MockDirectoryWrapper:org.apache.solr.common.util.ObjectReleaseTracker$ObjectTrackerException: org.apache.lucene.tests.store.MockDirectoryWrapper
    	at org.apache.solr.common.util.ObjectReleaseTracker.track(ObjectReleaseTracker.java:54)
...

@janhoy (Contributor, Author) commented Sep 26, 2025

@epugh and others - I'll be on holiday for a week from today. Feel free to commit anything you like directly to this branch without asking, if you want to play around or move things closer to perfection. Normal review comments are of course welcome too, but commits eat comments for breakfast :)

Any phased merge can be done later, as the interface boundaries are fairly clean, hopefully.

* @deprecated Will be replaced with something similar that calls out to a separate Tika Server
* process running in its own JVM.
*/
@Deprecated(since = "9.10.0")
@janhoy (Contributor, Author) commented:

@epugh I undeprecated this and the Loader, and instead deprecated the Local backend. This part needs to be backported before the 9.10 release. Also perhaps the wording in major-changes...

@epugh (Contributor) commented:

totally!

Labels: dependencies, documentation, module:extraction, tests, tool:build