Memory problems: SplashRequest references keep going up #76
@nehakansal are this references from requests that have already been made? I'm asking because there might be two possible explanations for what you are seeing:
Thanks. I will try using muppy to get more information. Follow-up questions though -
Yes, that's correct.
I think it's hard to tell whether an hour is enough or not; it depends on the site. The only reliable way to check is to see whether there are any alive requests that have already been crawled.
Thanks. In one of my tests I checked that the oldest URL in the SplashRequest references was already crawled, but I will check for that again in a few more tests. I will also use muppy soon, and will post my updates here once I do.
Hi, here is some data from one of my runs where it seemed like something was wrong. I didn't see this behavior in a couple of other runs, so it's probably not consistent, but I don't have enough sample runs to be completely sure. I thought I'd at least post what I have and see if you have any thoughts on it.
I crawled the site 'jair.org'. One of the URLs that was crawled successfully very early on stayed the oldest 'SplashRequest' object until I aborted the crawl, 3 minutes after the successful crawl of that URL. I couldn't get muppy to run for some reason, so I got the data below using scrapy's trackref tool and objgraph. Below are the meta and headers dicts for the URL that stayed the oldest, 'https://www.jair.org/index.php/jair/article/view/11210/26421'. Any idea why this SplashRequest object still had references even after being crawled and scraped successfully? Here's how I got the oldest SplashRequest object:
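(A sketch of how this can be done from scrapy's telnet console; the port is scrapy's default, and `get_oldest` is part of `scrapy.utils.trackref`. The exact session below is illustrative, not a verbatim log.)

```
$ telnet localhost 6023          # scrapy telnet console, default port
>>> prefs()                      # live-reference summary, as in the numbers above
>>> from scrapy.utils.trackref import get_oldest
>>> r = get_oldest('SplashRequest')
>>> r.url, r.meta, r.headers
>>> import objgraph
>>> objgraph.show_backrefs([r], max_depth=5, filename='backrefs.png')
```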
Meta
Headers
The meta and the headers were the same every time. I have attached the objgraph of the oldest request: one from the first time it was the oldest URL, and one from right before I aborted the crawl. You will notice that the URL in the graph is, for some reason, the referrer URL instead of the actual URL. I don't know why, but I did double-check that the object I used to print the metadata is the same object I passed to objgraph.
@nehakansal I'm not 100% sure. I can suggest a different strategy for debugging it: instead of tracking individual requests, I'd enable disk queues and start a longer crawl, to see whether memory keeps leaking or stays bounded. If it is leaking, enter the console after it has leaked enough and see why the objects are kept alive. That way you can be sure it is a real leak.
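For reference, disk queues can be enabled with scrapy's `JOBDIR` setting (which also persists scheduler state between runs, so each crawl should get a fresh directory). The path here is just an example:

```python
# settings.py -- enable disk-backed scheduler queues.
# 'crawls/run-1' is an example path; use a new directory per crawl,
# since JOBDIR also stores resumable crawl state.
JOBDIR = 'crawls/run-1'
```

The same can be done per run from the command line with `scrapy crawl myspider -s JOBDIR=crawls/run-1` (spider name hypothetical).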
Thanks, that info helps. As for the other suggestion: that's actually exactly what I started doing yesterday, when I enabled the disk queue. I need to run more tests to make sure the disk queue is working well for me; even if there are still some memory inconsistencies, the queue may help enough to get me going for now.
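As a stdlib-only sketch (not scrapy's actual queue code) of why disk queues help here: an in-memory queue holds a strong reference to every pending request, while a disk-style queue keeps only serialized bytes, so the request object itself can be garbage-collected:

```python
import gc
import pickle
import weakref
from collections import deque

class Request:
    """Stand-in for a scrapy Request (illustration only)."""
    def __init__(self, url):
        self.url = url

# In-memory scheduler queue: the queue itself keeps the request alive.
mem_queue = deque()
req = Request('https://example.com/a')
probe = weakref.ref(req)          # lets us observe whether the object is alive
mem_queue.append(req)
del req
gc.collect()
assert probe() is not None        # still alive: the memory queue holds it

# Disk-style queue: keep only the serialized bytes, drop the object.
disk_queue = [pickle.dumps(mem_queue.pop())]
gc.collect()
assert probe() is None            # released once only bytes remain "on disk"

# The request can be rebuilt later from its serialized form.
restored = pickle.loads(disk_queue[0])
print(restored.url)               # https://example.com/a
```

This is also why trackref counts should stay bounded with disk queues enabled: only the requests currently in flight stay in memory.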
I noticed that the memory usage by the scrapy process gradually keeps going up when I use Undercrawler. And then I used the prefs() function (mentioned in the scrapy docs) to monitor the live-references. I noticed that the SplashRequest references never go down and the oldest in there is from the time when the crawl started. Depending on the site, the references can climb up very fast. Here are some numbers:
(1) Crawling cnn, after 5 minutes (65 pages crawled), prefs() results:
```
BaseSpider        1  oldest: 313s ago
SplashRequest  2120  oldest: 300s ago
```
(2) Crawling reddit, after 5 minutes (156 pages crawled), prefs() results:
```
BaseSpider        1  oldest: 308s ago
SplashRequest 16100  oldest: 302s ago
```
So even though the SplashRequest objects seem to be a problem for both crawls, for Reddit there are for some reason far more of them in just 5 minutes.
I would expect each SplashRequest to be released after the request has been made, for all crawls. Can someone explain this behavior, and what can be done about it?
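For what it's worth, prefs() can report these numbers because tracked objects are registered in a WeakKeyDictionary. A stripped-down imitation of that mechanism (names borrowed from `scrapy.utils.trackref`; the code is illustrative, not scrapy's) shows that an entry only disappears once every strong reference to the request is gone, so a climbing count means something is still holding the requests:

```python
import gc
import weakref
from collections import defaultdict
from time import time

# Imitation of scrapy.utils.trackref: each tracked object is registered in a
# WeakKeyDictionary, so its entry vanishes only when no strong refs remain.
live_refs = defaultdict(weakref.WeakKeyDictionary)

class object_ref:
    def __new__(cls, *args, **kwargs):
        obj = object.__new__(cls)
        live_refs[cls][obj] = time()   # creation time -> the "oldest: Ns ago" column
        return obj

class SplashRequest(object_ref):
    def __init__(self, url):
        self.url = url

keep = SplashRequest('https://example.com/a')   # still referenced by `keep`
SplashRequest('https://example.com/b')          # no references: collected at once
gc.collect()

print(len(live_refs[SplashRequest]))            # -> 1 (only the request we still hold)
```

So the numbers above imply that something (a scheduler queue, a cache, a callback closure, etc.) keeps strong references to thousands of already-issued requests.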
Thanks.