Fix bug with an escaped slash #23

ygdkn · 2017-08-05T17:27:48Z

This PR fixes incorrect checking of Disallow: /*%2F rule

dlecocq · 2017-08-07T17:57:00Z

Thanks for your interest! C++ REP parsing is kind of a niche thing, and it's always fun, surprising when someone finds this little corner of the world.

I'm not sure I understand this use case -- is the goal to treat %2F specially? Other escaped entities are treated interchangeably with their unescaped versions (in test-agent.cpp, we have EscapedRule, UnescapedRule, EscapedRuleWildcard, UnescapedRuleWildcard to explicitly test that). It's entirely possible that provision is explicitly made for special treatment of /, but off hand I'm unaware of that.

My other concern is the use of regex. Before this PR, running ./bench, gave me the approximately 3.8x the performance as with this PR on parsing the RFC-provided REP. That's currently the only benchmark that goes through the Agent::escape method:

# Without regex:
Benchmarking parse RFC with 100000 per run:
    Run 0: 801.7 ms
    Run 1: 793.409 ms
    Run 2: 811.514 ms
    Run 3: 802.514 ms
    Run 4: 782.877 ms
  Average: 798.403 ms
     Rate: 125.25 k-iter / s

# With regex:
Benchmarking parse RFC with 100000 per run:
    Run 0: 2876.68 ms
    Run 1: 3068.78 ms
    Run 2: 3132.61 ms
    Run 3: 3013.67 ms
    Run 4: 3093.32 ms
  Average: 3037.01 ms
     Rate: 32.9271 k-iter / s

ygdkn · 2017-08-08T16:31:28Z

Thank you for your response! Yes, performance decreases dramatically and I should find better solution. And thank you pointing out on ./bench, I totally missed it.

This case appears to be important:

MK1996 says, 'If a %xx encoded octet is encountered it is unencoded
prior to comparison, unless it is the "/" character, which has
special meaning in a path.'

http://www.robotstxt.org/norobots-rfc.txt

ygdkn · 2017-08-08T16:40:00Z

E.g. we have rule
Disallow: /*%2F

So, in current implementation http://example.com/about is allowed, but http://example.com/about/ is not.

dlecocq · 2017-08-08T16:47:52Z

Interesting! I suppose I missed that section of the RFC initially.

The code internal to Url that deals with the escaping is here: https://github.com/seomoz/url-cpp/blob/master/src/url.cpp#L658 .

Originally, I thought it would be possible that code could be adapted or the interface of Url#escape could be extended to support providing a custom set of safe characters for each portion of the URL, enabling us to still use Url#escape, but explicitly provide configuration earmarking / as unsafe for use. I'm not sure whether that would work, but if it did it would probably involve using strict mode.

Still, it's something to investigate, and that escape code is relatively performant and is part of a code path that is currently used anyway.

b4hand · 2017-08-15T23:49:09Z

This is a very interesting edge case. I'm not sure exactly the best way to handle this is, but the recommendation by @dlecocq seems like a reasonable approach to investigate as a first pass. Alternatively, you could try doing the substitution manually using direct character replacements rather than using the regex replace which is probably doing multiple string allocations.

b4hand · 2017-08-15T23:49:53Z

src/agent.cpp

@@ -82,6 +83,14 @@ namespace Rep

    std::string Agent::escape(const std::string& query)
    {
-        return Url::Url(query).defrag().escape().fullpath();
+      	std::regex escaped_slash ("%2[Ff]");
+      	std::regex escaped_newline ("\n");


Using a newline character for this seems like a gross hack rather than a long term solution. I'd really prefer a fix that was cleaner.

b4hand · 2017-09-06T18:12:37Z

After thinking about this issue a little more, the 1996 RFC for REP does not actually support wildcards in the path. Wildcard support in REP was added later by Yahoo, Google, and other search engines, and I find no mention of special treatment of the escaped slash character in Google's spec, so this particular edge case is odd in that it falls in between any of the public standards for REP.

Fix bug with an escaped slash

2bde156

dlecocq mentioned this pull request Aug 7, 2017

Vendor in google test #24

Merged

ygdkn closed this Aug 8, 2017

ygdkn reopened this Aug 8, 2017

b4hand suggested changes Aug 15, 2017

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix bug with an escaped slash #23

Fix bug with an escaped slash #23

ygdkn commented Aug 5, 2017

dlecocq commented Aug 7, 2017

ygdkn commented Aug 8, 2017 •

edited

Loading

ygdkn commented Aug 8, 2017

dlecocq commented Aug 8, 2017

b4hand commented Aug 15, 2017

b4hand Aug 15, 2017

b4hand commented Sep 6, 2017 •

edited

Loading

Fix bug with an escaped slash #23

Are you sure you want to change the base?

Fix bug with an escaped slash #23

Conversation

ygdkn commented Aug 5, 2017

dlecocq commented Aug 7, 2017

ygdkn commented Aug 8, 2017 • edited Loading

ygdkn commented Aug 8, 2017

dlecocq commented Aug 8, 2017

b4hand commented Aug 15, 2017

b4hand Aug 15, 2017

Choose a reason for hiding this comment

b4hand commented Sep 6, 2017 • edited Loading

ygdkn commented Aug 8, 2017 •

edited

Loading

b4hand commented Sep 6, 2017 •

edited

Loading