Skip to content

Conversation

@alexanderadam
Copy link

@alexanderadam alexanderadam commented Oct 11, 2025

This will work for "simple" things only because we're using a different regex engine.
But out of the 36 regular expressions, only the one for application/x-dbf fails.

However, due to uncertainties in their real-world handling and possible regex warnings, I introduced a well_known_regex_types variable where we have fixtures available in the repo and they're actively tested in the build.

For the moment, regex handling is activated for application/x-bzip2 and text/html only.
Therefore we can get rid of that text/html definition in lib/marcel/mime_type/definitions.rb now. 🎉

Furthermore I hoped that this would also make PR #130 obsolete, because

    <magic priority="60">
      <match value="(?i)&lt;(html|head|body|title|div)[ >]" type="regex" offset="0"/>
      <match value="(?i)&lt;h[123][ >]" type="regex" offset="0"/>
    </magic>

looked pretty decent at first but a JS string with html='<html><h1>'; will still trick it, right? 🤔

As the cherry on top, this PR removes the buggy and unhandled entries in the MAGIC constant because the regular expressions (as well as any other unsupported types) were unfortunately always handled as binary data.

We wrote a warning (warn "#{mime['type']}: unsupported #{type} match: #{match.to_s}") during generation, but we still returned the string, resulting in wrong rules and unnecessary entries in MIME. 😉

Hence we had rules like the following

['application/illustrator+ps', [[0..8192, b["[\\r\\n]%AI5_FileFormat [1-4][\\r\\n]"]]]],
['message/rfc822', [[0, b['Relay-Version:']], [0, b['#! rnews']], [0, b['N#! rnews']], [0, b['Forward to']], [0, b['Pipe to']], [0, b['Return-Path:']], [0, b['Message-ID:']], [0, b['X-Mailer:']], [0, b['X-Notes-Item:'], [[0..8192, b['Message-ID:']]]], [0, b['(X|DKIM|ARC)-'], [[0..8192, b["\nDate:"]], [0..8192, b["\nDelivered-To:"]], [0..8192, b["\nFrom:"]], [0..8192, b["\nMessage-ID:"]], [0..8192, b["\nMIME-Version:"]], [0..8192, b["\nReceived:"]], [0..8192, b["\nRelay-Version:"]], [0..8192, b["\nReturn-Path:"]], [0..8192, b["\nStatus:"]], [0..8192, b["\nUser-Agent:"]], [0..8192, b["\nX-Mailer:"]], [0..8192, b["\nX-Originating-IP:"]]]]]],
['application/x-dbf', [[0, b["(?s)^[\\\\002\\\\003\\\\060\\\\061\\\\062\\\\103\\\\143\\\\203\\\\213\\\\313\\\\365\\\\345\\\\373].[\\\\001-\\\\014][\\\\001-\\\\037].{4}(?:.[^\\\\000]|[\\\\101-\\\\377].)(?:[^\\\\000\\\\001].|.[^\\\\000]).{31}(?<=[\\\\000][^\\\\000]{0,10})[A-Z@+]"]]]],

Its content doesn't make any sense if it's not treated as a regular expression and they will hopefully never match anyway, yet they were just wasting a bit of computation and memory.

Now I am much happier with how the PR looks like. 🙂

EDIT: So this is actually more complicated. In-between I had some failing runs on TruffleRuby and JRuby.
EDIT2: Okay, it looks like everything is running now on all the platforms. What a ride! 😆 🎉

/CC @tomhughes

PS: I'm looking for a new adventure in case anybody is looking to hire or work with a Ruby/Rails/Crystal dev

@alexanderadam alexanderadam force-pushed the fix/bzip2_detection_with_regex_issue_128 branch from 61d8d7d to be31098 Compare October 11, 2025 00:30
@alexanderadam alexanderadam force-pushed the fix/bzip2_detection_with_regex_issue_128 branch 2 times, most recently from 51a18d3 to 48ae619 Compare October 11, 2025 13:27
@alexanderadam alexanderadam force-pushed the fix/bzip2_detection_with_regex_issue_128 branch from 48ae619 to e0fde31 Compare October 11, 2025 15:06
@alexanderadam alexanderadam changed the title add tika.xml regex support WIP: add tika.xml regex support Oct 11, 2025
@alexanderadam alexanderadam marked this pull request as draft October 11, 2025 16:03
@alexanderadam alexanderadam force-pushed the fix/bzip2_detection_with_regex_issue_128 branch 2 times, most recently from 31b0420 to 73165bd Compare October 11, 2025 16:25
This will work for simple things only because we're using a different regex engine.
But out of all the current regular expressions, only the one for `application/x-dbf` fails.
So I guess we're good.

And we can get rid of that html definition now.
@alexanderadam alexanderadam force-pushed the fix/bzip2_detection_with_regex_issue_128 branch from 73165bd to c70d3d3 Compare October 11, 2025 16:41
@alexanderadam alexanderadam changed the title WIP: add tika.xml regex support add tika.xml regex support Oct 11, 2025
@alexanderadam alexanderadam marked this pull request as ready for review October 11, 2025 16:48
Simplifying everything by just allowing what we are actively testing. 🎉

refs rails#128
@alexanderadam alexanderadam force-pushed the fix/bzip2_detection_with_regex_issue_128 branch from da332ed to 9650447 Compare October 14, 2025 10:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants