
File name parsing for Metron search problems #21

Open
beville opened this issue Feb 21, 2024 · 8 comments


@beville

beville commented Feb 21, 2024

Looks like Metron search based on a parsed file name (auto-tagging) is failing in some cases.

A series title with colons (:) and slashes (/) will often have those replaced with space-minus-space ( - ) in a filename. A great example is "Batman / Superman: World's Finest", which might have a filename with "Batman - Superman - World's Finest" (or even one without the ' in it). The minuses cause the Metron search to fail.

Probably CT just needs to remove minus/dash characters (-) from the search string before submitting it.
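A minimal sketch of the dash removal suggested above, assuming the search string is a plain Python `str`. The function name `strip_dashes` is hypothetical and is not taken from CT or metron_talker.

```python
import re


def strip_dashes(query: str) -> str:
    """Replace dash characters with spaces and collapse repeated whitespace."""
    # Covers the ASCII hyphen-minus plus the en dash and em dash code points.
    without_dashes = re.sub(r"[-\u2013\u2014]", " ", query)
    # split()/join() trims the ends and squeezes runs of spaces into one.
    return " ".join(without_dashes.split())


print(strip_dashes("Batman - Superman - World's Finest"))
```

One trade-off of this approach: hyphenated titles such as "Spider-Man" would be searched as "Spider Man", so it may be worth checking how the server treats those before adopting it.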


Another issue I think is probably server-side, and I should probably make an issue with @bpepple over on his repo. Metron doesn't like a missing apostrophe in some cases.

So when the search string is:

"Cory Doctorow's Futuristic Tales of the Here and Now" it works
"Cory Doctorow s Futuristic Tales of the Here and Now" also works
"Cory Doctorow Futuristic Tales of the Here and Now" also works

but for some reason

"Cory Doctorows Futuristic Tales of the Here and Now" fails.

Unfortunately for auto-tagging, it's pretty common to see the dropped apostrophe. I can't think of a good client-side solution for that one, but maybe you all have an idea.

Thanks!

@bpepple

bpepple commented Feb 21, 2024

> Probably CT just needs to remove minus/dash characters (-) from the search string before submitting it.

Yeah, I think I mentioned in #17 that sanitizing the query string would greatly help with matching.

> Another issue I think is probably server-side, and I should probably make an issue with @bpepple over on his repo. Metron doesn't like a missing apostrophe in some cases.

Possibly using SearchVector/SearchRank or a Trigram Similarity in Django/PostgreSQL may improve results, but there is a significant performance penalty to be paid.

Might be worth testing to see if it gives improved results.

@mizaki
Collaborator

mizaki commented Feb 21, 2024

I was all set to tell you it does sanitise, and then I checked and it does not. I wonder what the reason was... I think it may have been because Metron was returning some tests without it, but obviously there is still some sanitising needed.

@bpepple

bpepple commented Feb 21, 2024

> I was all set to tell you it does sanitise, and then I checked and it does not. I wonder what the reason was... I think it may have been because Metron was returning some tests without it, but obviously there is still some sanitising needed.

Yeah, it's much easier on my end to see what metron_talker is submitting for a series name. 😉

@mizaki
Collaborator

mizaki commented Mar 21, 2024

I tried using the same sanitising as CT, but that causes the ' problem, so in #24 I've done everything the same as CT except that the ' is removed without inserting a space. Give it a try on any titles you had trouble with.
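To illustrate the rule described here (apostrophes dropped with no space, other punctuation turned into a space), a rough sketch follows. It is not the code from #24 and not CT's actual sanitiser; the name `sanitise_series_name` and the exact character classes are assumptions made for the example.

```python
import re


def sanitise_series_name(name: str) -> str:
    """Sketch: drop apostrophes outright, turn other punctuation into spaces."""
    # Straight and curly apostrophes are removed without leaving a gap,
    # so "World's" becomes "Worlds" rather than "World s".
    name = re.sub(r"['\u2019]", "", name)
    # Any remaining character that is not a letter, digit or whitespace
    # (underscore included) becomes a space.
    name = re.sub(r"[^\w\s]|_", " ", name)
    # Collapse the runs of spaces left behind by the substitutions.
    return " ".join(name.split())


print(sanitise_series_name("Batman/Superman: World's Finest"))
print(sanitise_series_name("Batman - Superman- Worlds Finest"))
```

With this rule the canonical title and a typical filename form reduce to the same string, which is the point of treating the apostrophe differently from the other punctuation.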

@beville
Author

beville commented Mar 28, 2024

Seems to work better for some titles!

@bpepple

bpepple commented Apr 22, 2024

> Another issue I think is probably server-side, and I should probably make an issue with @bpepple over on his repo. Metron doesn't like a missing apostrophe in some cases.
>
> So when the search string is:
>
> "Cory Doctorow's Futuristic Tales of the Here and Now" it works
> "Cory Doctorow s Futuristic Tales of the Here and Now" also works
> "Cory Doctorow Futuristic Tales of the Here and Now" also works
>
> but for some reason
>
> "Cory Doctorows Futuristic Tales of the Here and Now" fails.
>
> Unfortunately for auto-tagging, it's pretty common to see the dropped apostrophe. I can't think of a good client-side solution for that one, but maybe you all have an idea.

Looked into this a bit more today, and played around with PostgreSQL's Trigram Similarity Matching, which helps deal with the apostrophe matching issues. Unfortunately, Django's support isn't implemented with Transform, so I'd lose support for Unaccent and other lookup options unless I wrote some hand-crafted artisanal SQL statements (which I don't really have the time to do), so this isn't a great solution.
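For readers unfamiliar with trigram matching, here is a small pure-Python approximation of why it tolerates the dropped apostrophe. It follows the behaviour documented for PostgreSQL's pg_trgm extension (lowercase, split on non-alphanumerics, pad each word with two leading spaces and one trailing space, compare the trigram sets), but it is only an illustration; the real extension may differ in details, and this is not what Metron runs.

```python
import re


def trigrams(text: str) -> set:
    """Approximate pg_trgm: per-word trigrams with padding, lowercased."""
    grams = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = f"  {word} "  # two leading spaces, one trailing space
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def similarity(a: str, b: str) -> float:
    """Shared trigrams divided by total distinct trigrams (0.0 to 1.0)."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


print(similarity("Cory Doctorows", "Cory Doctorow's"))
```

Because most of the trigrams of "Doctorows" also occur in "Doctorow" plus "s", the two strings still score as fairly similar, whereas an exact word match fails outright.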

@mizaki
Collaborator

mizaki commented Apr 23, 2024

> Seems to work better for some titles!

I've been a bit slow on the talker front, but if you have any examples of problem titles I'll see what I can do (and put in some tests).

@beville
Author

beville commented Apr 23, 2024

I think the classic problem title in this vein is:

"Batman/Superman: World's Finest"

which includes a slash, a colon, and an apostrophe.

Filenames might have the slash and colon replaced with a space or a "-". The apostrophe typically seems to be just removed. (I can't remember if that's a problem character on Windows filesystems?)

"Batman - Superman- Worlds Finest"

I think in general the apostrophe (in English anyways) is most problematic for filename-to-search, since it tends to be replaced with nothing rather than a space in some filenames.
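One possible client-side fallback for the dropped apostrophe, offered only as a sketch: if the first search returns nothing, retry with a possessive apostrophe restored in each word ending in "s". The function name `apostrophe_variants` and the length cutoff are hypothetical, and the heuristic only targets English possessives.

```python
def apostrophe_variants(query: str) -> list:
    """Return the query followed by variants with "'s" restored in one word."""
    words = query.split()
    variants = [query]
    for i, word in enumerate(words):
        # Skip short words and words that already contain an apostrophe.
        if len(word) > 3 and word.lower().endswith("s") and "'" not in word:
            restored = words[:i] + [word[:-1] + "'s"] + words[i + 1:]
            variants.append(" ".join(restored))
    return variants


for candidate in apostrophe_variants("Batman Superman Worlds Finest"):
    print(candidate)
```

Each variant costs an extra request, and plain plurals (for example "Tales") also produce a candidate, so this would likely only make sense as a retry after an empty result.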


3 participants