
Unicode handling of --include and --exclude #145

Open
mr-bo-jangles opened this issue Jul 9, 2021 · 8 comments

@mr-bo-jangles (Contributor)

So my specific use case here is mirroring a site with many directories in various languages, while skipping the static files at a higher level.

Example Folder Structure

/Static/<collection of unwanted static files>
/Assets/<collection of unwanted static files>
/Books/
      ./ -> /Books/
      ../ -> /
      ===/<directory tree of unwanted static files>
      121/<directory tree of static files>
      Help/<directory tree of static files>
      مساعدة/<directory tree of static files>
      Помощь/<directory tree of static files>

I want to be sure that by running a command similar to suckit https://domain.tld -i "/Books/[a-Z0-9]+/" I will download the tree under /Books/ while excluding anything under ./, ../, and ===/.
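Two things are worth checking here before running suckit (which, as a Rust program, presumably compiles these patterns with a Rust regex library). First, [a-Z] is a reversed character range (the code point of 'a' is greater than that of 'Z'), which most engines reject outright. Second, an ASCII class like [a-zA-Z0-9] cannot match the Arabic and Cyrillic directory names, which is exactly the Unicode question in the title. A Python sketch of the same pattern semantics, for illustration only (the paths are the sample tree above, the file names inside them are invented):

```python
import re

paths = [
    "/Books/===/style.css",
    "/Books/121/index.html",
    "/Books/Help/index.html",
    "/Books/مساعدة/index.html",
    "/Books/Помощь/index.html",
]

# "[a-Z]" is a reversed range (ord('a') > ord('Z')); Python's re,
# like most engines, rejects it as a bad character range.
try:
    re.compile(r"/Books/[a-Z0-9]+/")
    print("pattern compiled")
except re.error as err:
    print("invalid pattern:", err)

# An ASCII class keeps only the Latin/digit directories...
ascii_class = re.compile(r"/Books/[a-zA-Z0-9]+/")
# ...while \w (Unicode-aware on Python str) also matches the
# Arabic and Cyrillic directory names.
word_class = re.compile(r"/Books/\w+/")

for p in paths:
    print(p, bool(ascii_class.search(p)), bool(word_class.search(p)))
```

With either working pattern the ./, ../, and ===/ entries are excluded, since '.' and '=' belong to neither character class.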

@Skallwar (Owner) commented Jul 9, 2021

This looks correct. The best way to know is to test it, and I would love to see the result of such a test. If you can build this directory tree, just serve it with a webserver and run suckit against localhost.

@Skallwar (Owner)

@mr-bo-jangles Did it work?

@raphCode (Contributor)

Maybe we can add an option to output URL filtering information to stdout or a file, e.g. whether the include or exclude regex matched?
I think this would give more transparency about what suckit is doing.
I also plan to implement functionality to rewrite local URLs, which could profit from this debug feature.
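A minimal sketch of what such a debug option could look like. Everything here is hypothetical (function name, flag, and output format are invented, not suckit's actual CLI), and it is Python rather than suckit's Rust, purely to illustrate the idea of logging which regex caused each keep/skip decision:

```python
import re

def should_visit(url, include=None, exclude=None, verbose=False):
    """Return True if url passes the include/exclude filters.

    With verbose=True, print which regex caused the decision
    (hypothetical sketch of the proposed debug output).
    """
    if exclude and re.search(exclude, url):
        if verbose:
            print(f"SKIP {url} (matched exclude {exclude!r})")
        return False
    if include and not re.search(include, url):
        if verbose:
            print(f"SKIP {url} (no match for include {include!r})")
        return False
    if verbose:
        print(f"KEEP {url}")
    return True

should_visit("https://domain.tld/Books/Help/",
             include=r"/Books/[a-zA-Z0-9]+/", verbose=True)
```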

@Skallwar (Owner)

> Maybe we can add an option to output URL filtering information to stdout or a file

Good idea

> I also plan to implement functionality to rewrite the local URLs that could profit from this debug feature.

What do you mean?

@raphCode (Contributor) commented Mar 22, 2022

> What do you mean?

To download a phpBB forum, I added a hack to rewrite some URLs, namely to remove the ?sid=<hash> parameter. Otherwise the same pages get downloaded over and over again under different sid hashes.
If you want to take a look:
https://github.com/raphCode/suckit/blob/fusornet_hack/src/scraper.rs#L191

I originally planned to flesh this out into a dedicated feature / command line option, but eventually didn't: I had already achieved my goal and could not figure out a way to do it properly.
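The hack boils down to URL canonicalization: strip the session-id query parameter before deciding whether a URL has already been downloaded. A rough stand-alone sketch of the idea (the linked suckit change is in Rust; this Python version only illustrates the transformation, and the example URL is invented — sid is phpBB's actual session parameter name):

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def strip_param(url, name):
    """Return url with every query parameter called `name` removed."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k != name]
    return urlunsplit(parts._replace(query=urlencode(kept)))

print(strip_param("https://forum.example/viewtopic.php?t=42&sid=deadbeef", "sid"))
# -> https://forum.example/viewtopic.php?t=42
```

All variants of a page then collapse to one canonical URL, so the deduplication cache sees them as the same page.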

@Skallwar (Owner) commented May 2, 2022

The problem with removing parameters such as ?sid is that they might change the content of the requested page. If you remove them, two links that are identical except for their parameters will share a single page downloaded by suckit when they should have two different pages.

@raphCode (Contributor) commented May 2, 2022

In general you are correct, but in the specific case of phpBB the content is always the same, no matter the value of the ?sid parameter.
One solution would be to simply ignore all links carrying this parameter, as suggested here, but that may create a swath of broken links. Instead I removed the parameter from the URL and collapsed all links into their "canonical" form without the session id parameter.

I actually just found a different solution: sending session cookies, which avoids ?sid parameters being appended to links in the first place.

@Skallwar (Owner) commented May 3, 2022

We could imagine a solution where you would have a list of tuples, each pairing a regex with a list of parameters to remove:

Vec<(regex, Vec<parameter>)>

But it might be really costly
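That Vec<(regex, Vec<parameter>)> idea could look roughly like this (a hypothetical sketch in Python rather than suckit's Rust; the rule contents are invented examples). On the cost question: if each regex is compiled once up front, the per-URL cost is one regex search per rule plus a query-string rewrite, which should be negligible next to the network requests:

```python
import re
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# The proposed Vec<(regex, Vec<parameter>)> as a Python list.
# Both the patterns and the parameter names are made-up examples.
RULES = [
    (re.compile(r"viewtopic\.php"), ["sid"]),
    (re.compile(r"/search"), ["sessionid", "tracking"]),
]

def canonicalize(url):
    """Drop the parameters listed by every rule whose regex matches url."""
    drop = {p for pattern, params in RULES if pattern.search(url)
            for p in params}
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k not in drop]
    return urlunsplit(parts._replace(query=urlencode(kept)))
```

URLs that match no rule pass through unchanged, so the feature would be strictly opt-in per site.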
