Hello,
First of all - I very much appreciate the work you've done on this project. I've just tried the library and, man, it's cool: it works like magic and is very configurable.
As far as I understand, the main use case is a long-running engine that collects lots of data from a website and stores it somewhere.
However, it could also be very useful when there's a finite amount of data to retrieve and it has to be retrieved in finite time - say, navigate a few pages, parse some data, and return it. And as far as I can tell, that's very hard to achieve here.
The two ways I found to get data and parse it into an object are either calling Subscribe() and deserializing the JObject into an object, or implementing my own IScraperSink and storing the data there for further use. I'm fine with both solutions and I've tested them - they both work perfectly. However, once started, the engine never stops, even when there's nothing left to parse, because no one ever closes the Channel: it stays open forever, so the AsyncEnumerable never ends.
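For concreteness, here's roughly what my Subscribe() variant looks like. It's just a sketch: the builder configuration is elided, Product is a hypothetical POCO, and I'm assuming Subscribe hands each scraped record to the callback as a JObject:

using System.Collections.Concurrent;
using System.Threading;
using Newtonsoft.Json.Linq;

var results = new ConcurrentBag<Product>(); // Product is a hypothetical POCO

var engine = await new ScraperEngineBuilder()
    ...
    .Subscribe((JObject json) => results.Add(json.ToObject<Product>()))
    .BuildAsync();

// The problem: this call never completes, so we never get to read `results`.
await engine.RunAsync(CancellationToken.None);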
Therefore, I propose a change: the engine would store the current parse status in the form of a tree, and once all the leaf pages are of the TargetPage PageCategory, we close the channel. That lets Parallel.ForEachAsync finish its execution, returning from the engine and letting us actually await the engine run before retrieving results. It might not be perfect - I'm sure it isn't; it's just the first thing that came to my mind that could work - maybe you have different ideas.
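To illustrate the idea, here is a rough, self-contained sketch of the completion mechanism. It uses a pending-page counter instead of an explicit tree, but the effect is the same: complete the channel writer once every discovered page has been processed. CrawlAndGetLinksAsync is a hypothetical stand-in for the real page processing:

using System.Threading.Channels;

var channel = Channel.CreateUnbounded<string>();
var pending = 1; // pages queued but not yet finished; starts at 1 for the root

await channel.Writer.WriteAsync("https://example.com");

await Parallel.ForEachAsync(channel.Reader.ReadAllAsync(), async (url, ct) =>
{
    // Enqueue newly discovered links before marking this page as done,
    // so the counter can never drop to zero while work is still pending.
    foreach (var link in await CrawlAndGetLinksAsync(url, ct))
    {
        Interlocked.Increment(ref pending);
        await channel.Writer.WriteAsync(link, ct);
    }

    // Last in-flight page with nothing left queued: closing the channel ends
    // ReadAllAsync, which lets Parallel.ForEachAsync return.
    if (Interlocked.Decrement(ref pending) == 0)
        channel.Writer.Complete();
});

// Hypothetical stand-in for fetching a page and extracting its links.
static async Task<IReadOnlyList<string>> CrawlAndGetLinksAsync(string url, CancellationToken ct)
{
    await Task.Delay(10, ct); // pretend to fetch and parse
    return Array.Empty<string>();
}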
Please let me know what you think about this and whether you have plans and time for this enhancement.
Thank you,
Bogdan
Appreciate your feedback and suggestions! Things have been quite intense at work, so I've had little time for improvements.
I like your idea and plan to implement it one way or another. At the moment, the only way to stop the engine is to specify a page crawl limit beforehand:
var engine = await new ScraperEngineBuilder()
...
.PageCrawlLimit(100)
.BuildAsync();
This should be documented in the Readme. Even though running in the background somewhere is the common use case, you'd expect the engine to stop on its own once it's done.
Another option, if you know you don't have that many links to scrape, is to use a CancellationToken to timebox the run:
var cts = new CancellationTokenSource();
cts.CancelAfter(TimeSpan.FromMinutes(5));

try
{
    await engine.RunAsync(cts.Token);
}
catch (OperationCanceledException)
{
    // do nothing
}