-
Notifications
You must be signed in to change notification settings - Fork 767
SOLR-7632 TikaServer as pluggable backend to existing extraction handler #3670
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
…ika API Refactor some tests to LocalTikaExtractionBackendTest
Exciting! |
Status:
TBD:
Anyone, please feel free to hack away on this if it looks exciting, committing directly to the PR branch. Question: Would it bring value to isolate the refactoring in one PR and then another one to add the tikaserver impl? |
Cleanup TestContainer Refactor ExtractionMetadata Add returnType to ExtractionRequest Remove static initializers
cc3d43f
to
a3794ce
Compare
Any luck with security manager?? I had many difficulties |
Testcontainers and docker don't love the SecurityManager. I had claude try to run the tests and add additional permissions to
|
Yea, that’s annoying. Perhaps we could disable JSM for this test or for tests in the entire module? |
I had the similar experience as I was upgrading kafka. And then I stopped. |
Java Security Manager and Testcontainers do not play nicely together. We prefer Testcontainers, so disable JSM
When I first saw |
Add common metadata Adjust some tests with dc:title instead of title Support passwords in TikaServer backend
solr/modules/extraction/src/test-files/extraction/solr/collection1/conf/solrconfig.xml
Show resolved
Hide resolved
I vote for moving in the direction of Tika 3 and how it works and maybe update the tests? If this is Solr 10, can't we add the changes to being "breaking changes"? Also, a thought... could we have a set of tests that validate how Tika 1 worked, and are specific to Tika 1, and antoher that handle what Tika 3 does. Then we don't have to make everything from 1 work in 3... ANd maybe there are things in 3 that we would want to test? AFter all, in Solr 10, don't we eliminate tika 1 approach anyway? |
I had a thought late last night @janhoy... So while I am super excited about the pluggable idea, I wonder if we have lost the core goal? The core goal is to offer a way of reading rich documents for indexing in Solr without the maintenance burden on Solr and to be more in line with future of Tika. If that is the core goal, I wonder if we should just target TikaServer in Solr 10, and not worry about any backcompat, beyond documenting it etc. We should just embrace the new ways of Tika working. If that makes sense, maybe the fact that some capabilities in Solr don't work at this point, like passwords or xpath, is okay if it's a Solr 10 only thing? Does saying this is a Solr 10 only thing make it easier to have the tests pass by tweaking them and our implementation to leverage how Tika 3 and TikaServer works? |
I have 90% of a working |
My thought was to land tikaserver in Solr 9.x as opt-in while deprecating local. The server variant need not respond with exactly same metadata, and some of the tests which test specifically 1.x functionaly can be moved to that test class. But for simple use cases that 90% of users need, like extracting text and normal metadata from PDF, Word etc we get feature parity. Then we remove the local Tika parser in 10.0 and make server the default. I.e. users will have a transition path even in 9.x. I started with the JSON output from Tika Server, but since it does not support streaming but a full copy in memory, I'm moving to the While it is true that Tika 1.x and Tika 3.x has many breaking changes, that is mainly for the Java API. The XML parse result which is a The big question is of course whether we manage to get a stable tika server impl which is production ready before 10.0, and whether the refactoring leaves the old |
Sounds like a plan! I didnt' know about the need for the xml streaming parsing... Having both in 9x is a much nicer migration then just a hard swap in 10. Plus, if someone wanted to keep the old 9x local processing version in 10, they could of course create their own backend and reference it! |
* Add back-compat option for metadata * Fix true SAX streaming parser for Tika XML response * Simplify ExtractionBackend interface
So, pushed a commit with some nice changes:
Not all tests pass, but two more are green: ![]() The test The |
Are you seeing any issues in how TikaServer works that maybe are better fixed there? Some great progress! |
Now
|
I love seeing the updates as you make progress. Commits are fun to read too! I am really impressed that we are actually able to use the existing tests to measure progress, it's a reminder on the value of the tests in helping us understand "what features of Tika do we use? The ones in the tests!". |
The Thus, this test document can be rewritten to use something else than div, and the test will work. I believe the same is the issue with That gives us a solution for the remaining three failing tests 🥳 |
Love the way you fixed it. Does this mean in practice that folks might see different resutls depending on which backend they use and the specfici document? On the other hand, that also seems totally okay in the sense that they are different backends... |
…" config) Move pdf-with-image test to local test Add recursive test to TikaServer test case
Last commit adds recursive parsing as an option All tests are now green. However, there is still a thread leak in the tikaserver test. I think there are some HttpClient stuff not released. Other TODO:
That concludes the "POC", proving that it is doable to do a drop-in replacement for users. |
We now have a separate github workflow testing extraction code, with TestContainers. It is only for the sake of this PR, not intendend for merge :) The thread leaks definitely looks related to ordinary Solr objects.
|
@epugh and others - I'll be on holiday for a week from today. Feel free to commit anything you like directly to this branch without asking, if you want to play around or move things closer to perfection. Normal review comments are of course welcome too, but commits eats comments for breakfast :) Any phased merge can be done later, as the interface boundaries are fairly clean, hopefully. |
* @deprecated Will be replaced with something similar that calls out to a separate Tika Server | ||
* process running in its own JVM. | ||
*/ | ||
@Deprecated(since = "9.10.0") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@epugh I undeprecated this and the Loader, and instead deprecated the Local backend. This part needs to be backported before 9.10 release. Also perhaps wording in major-changes...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
totally!
https://issues.apache.org/jira/browse/SOLR-7632
This work builds on the one in #3361 but instead of making a new module, we add it as a capability to the existing extraction handler through specifying
extraction.backend=tikaserver
.This first required refactoring extraction handler to detach it from the Tika-v1 API. There is a new interface
ExtractionBackend
that takes genericExtractionRequest
object in and returns anExtractionResult
bean, and a newLocalTikaExtractionBackend
implementation that encapsulates all Tikav1 api handling. This implementation can be deprecated, and in Solr 10, thetikaserver
one can be made default.See each commit for the progression, it starts with refactoring existing code and ends with adding tikaserver impl.
All existing tests pass. New tests are added using TestContainers to spin up Tika.
Note: Most of the coding was done by JetBrains Junie, so reviewers may want to ensure nothing fancy has slipped into the code.