
Enhancement - End Engine's task once it's done scraping and reached all the target pages available. #20

Open
bogdan799 opened this issue Jul 11, 2023 · 2 comments



Hello,

First of all, I very much appreciate the work you've done on this project. I've just tried the library and, man, it's cool: it works like magic and is very configurable.

As far as I understand, the main use case is a long-running engine that collects lots of data from a website and stores it somewhere.
However, the library could also be very useful when there's a finite amount of data to retrieve in finite time, say navigating a few pages, parsing some data, and returning it. As far as I can tell, that's currently very hard to achieve.

I found two ways to get data and parse it into an object: either use Subscribe() and deserialize the JObject into an object, or implement my own IScraperSink and store the data there for further use. I'm fine with both solutions and have tested them; they both work perfectly. However, once started, the engine never stops, even when there's nothing left to parse, because nothing ever closes the Channel: it stays open forever, so the AsyncEnumerable never ends.
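
For illustration, here is roughly what the Subscribe() approach looks like on my side. This is only a sketch: the Product type, its fields, and the elided builder configuration are placeholders, and I'm relying on the callback receiving a Newtonsoft.Json JObject as described above.

using Newtonsoft.Json.Linq;

var results = new List<Product>();

var engine = await new ScraperEngineBuilder()
    // ... site, parsing and paging configuration ...
    .Subscribe(json => results.Add(json.ToObject<Product>()!))
    .BuildAsync();

// This call never returns today, which is exactly the problem:
// nothing signals that all target pages have been scraped.
await engine.RunAsync();

// Placeholder shape for the scraped data, not part of the library.
public record Product(string Name, decimal Price);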

Therefore, I propose a change: the engine would store the current parse status as a tree, and once every leaf page is of the TargetPage PageCategory, it would close the channel. That would let Parallel.ForEachAsync finish its execution, return control from the engine, and make it possible to actually await the engine's execution before retrieving results. This might not be perfect, and I'm sure it isn't; it's just the first approach that came to mind that could work. Maybe you have different ideas.
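
To make the mechanism concrete: the snippet below is not the library's code, just the plain System.Threading.Channels behavior the proposal relies on. Completing the writer is what lets ReadAllAsync(), and therefore the consuming loop, finish.

using System.Threading.Channels;

var channel = Channel.CreateUnbounded<string>();

// Producer side: publish items, then signal that no more are coming.
channel.Writer.TryWrite("page-1");
channel.Writer.TryWrite("page-2");
channel.Writer.Complete(); // without this call, the loop below never ends

// Consumer side: the async stream completes only once the writer is completed.
await foreach (var page in channel.Reader.ReadAllAsync())
{
    Console.WriteLine(page);
}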

Please let me know what you think about this and whether you have plans and time for this enhancement.

Thank you,
Bogdan


pavlovtech (Owner) commented Aug 11, 2023

Hi Bogdan,

I appreciate your feedback and suggestions! Things have been quite intense at work, so I've had little time for improvements.

I like your idea and plan to implement it one way or another. At the moment, the only way to stop the engine is to specify a page crawl limit beforehand:

var engine = await new ScraperEngineBuilder()
    ...
    .PageCrawlLimit(100)
    .BuildAsync();


Marcel0024 commented Jun 16, 2024

This should be documented in the README. I think it's a common use case to run this in the background somewhere, and you'd expect it to stop on its own when it's done.

Another option, if you know you don't have that many links to scrape, is to use a CancellationToken to timebox the run:

var cts = new CancellationTokenSource();
cts.CancelAfter(TimeSpan.FromMinutes(5)); // stop the engine after 5 minutes

try
{
    await engine.RunAsync(cts.Token);
}
catch (OperationCanceledException)
{
    // expected when the timebox elapses; nothing to do
}
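
Side note: the same timebox can be set in one step via the constructor overload:

var cts = new CancellationTokenSource(TimeSpan.FromMinutes(5));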
