Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix bug with an escaped slash #23

Open
wants to merge 1 commit into
base: master
Choose a base branch
from
Open

Fix bug with an escaped slash #23

wants to merge 1 commit into from

Conversation

ygdkn
Copy link

@ygdkn ygdkn commented Aug 5, 2017

This PR fixes incorrect checking of Disallow: /*%2F rule

@dlecocq dlecocq mentioned this pull request Aug 7, 2017
@dlecocq
Copy link
Contributor

dlecocq commented Aug 7, 2017

Thanks for your interest! C++ REP parsing is kind of a niche thing, and it's always fun, surprising when someone finds this little corner of the world.

I'm not sure I understand this use case -- is the goal to treat %2F specially? Other escaped entities are treated interchangeably with their unescaped versions (in test-agent.cpp, we have EscapedRule, UnescapedRule, EscapedRuleWildcard, UnescapedRuleWildcard to explicitly test that). It's entirely possible that provision is explicitly made for special treatment of /, but off hand I'm unaware of that.

My other concern is the use of regex. Before this PR, running ./bench, gave me the approximately 3.8x the performance as with this PR on parsing the RFC-provided REP. That's currently the only benchmark that goes through the Agent::escape method:

# Without regex:
Benchmarking parse RFC with 100000 per run:
    Run 0: 801.7 ms
    Run 1: 793.409 ms
    Run 2: 811.514 ms
    Run 3: 802.514 ms
    Run 4: 782.877 ms
  Average: 798.403 ms
     Rate: 125.25 k-iter / s

# With regex:
Benchmarking parse RFC with 100000 per run:
    Run 0: 2876.68 ms
    Run 1: 3068.78 ms
    Run 2: 3132.61 ms
    Run 3: 3013.67 ms
    Run 4: 3093.32 ms
  Average: 3037.01 ms
     Rate: 32.9271 k-iter / s

@ygdkn
Copy link
Author

ygdkn commented Aug 8, 2017

Thank you for your response! Yes, performance decreases dramatically and I should find better solution. And thank you pointing out on ./bench, I totally missed it.

This case appears to be important:

MK1996 says, 'If a %xx encoded octet is encountered it is unencoded
prior to comparison, unless it is the "/" character, which has
special meaning in a path.'

http://www.robotstxt.org/norobots-rfc.txt

@ygdkn ygdkn closed this Aug 8, 2017
@ygdkn ygdkn reopened this Aug 8, 2017
@ygdkn
Copy link
Author

ygdkn commented Aug 8, 2017

E.g. we have rule
Disallow: /*%2F

So, in current implementation http://example.com/about is allowed, but http://example.com/about/ is not.

@dlecocq
Copy link
Contributor

dlecocq commented Aug 8, 2017

Interesting! I suppose I missed that section of the RFC initially.

The code internal to Url that deals with the escaping is here: https://github.com/seomoz/url-cpp/blob/master/src/url.cpp#L658 .

Originally, I thought it would be possible that code could be adapted or the interface of Url#escape could be extended to support providing a custom set of safe characters for each portion of the URL, enabling us to still use Url#escape, but explicitly provide configuration earmarking / as unsafe for use. I'm not sure whether that would work, but if it did it would probably involve using strict mode.

Still, it's something to investigate, and that escape code is relatively performant and is part of a code path that is currently used anyway.

@b4hand
Copy link
Contributor

b4hand commented Aug 15, 2017

This is a very interesting edge case. I'm not sure exactly the best way to handle this is, but the recommendation by @dlecocq seems like a reasonable approach to investigate as a first pass. Alternatively, you could try doing the substitution manually using direct character replacements rather than using the regex replace which is probably doing multiple string allocations.

@@ -82,6 +83,14 @@ namespace Rep

std::string Agent::escape(const std::string& query)
{
return Url::Url(query).defrag().escape().fullpath();
std::regex escaped_slash ("%2[Ff]");
std::regex escaped_newline ("\n");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using a newline character for this seems like a gross hack rather than a long term solution. I'd really prefer a fix that was cleaner.

@b4hand
Copy link
Contributor

b4hand commented Sep 6, 2017

After thinking about this issue a little more, the 1996 RFC for REP does not actually support wildcards in the path. Wildcard support in REP was added later by Yahoo, Google, and other search engines, and I find no mention of special treatment of the escaped slash character in Google's spec, so this particular edge case is odd in that it falls in between any of the public standards for REP.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants